Context
This dataset contains images for 231 different anime, with approximately 380 images per anime. Please note that you might need to clean the image directories a bit, since they may contain merchandise and live-action photos in addition to images from the anime itself.
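Before cleaning, it can help to check how many images each anime actually has. Below is a minimal sketch, assuming the images are laid out as one sub-folder per anime inside a root folder such as "anime_images". The `count_images` helper and the list of file extensions are illustrative assumptions, not part of the original scripts.

```python
from collections import Counter
from pathlib import Path

# Assumed set of image formats; adjust to whatever the folders really contain.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".webp"}


def count_images(root):
    """Return a Counter mapping each anime folder name to its image count."""
    counts = Counter()
    for folder in Path(root).iterdir():
        if not folder.is_dir():
            continue  # ignore stray files sitting next to the anime folders
        counts[folder.name] = sum(
            1
            for f in folder.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts
```

Calling `count_images("anime_images").most_common()` lists the folders from largest to smallest, which makes it easy to spot anime that ended up with far fewer than the roughly 380 images expected.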
Scripts
If you'd like to take a look at the scripts used to make this dataset, you can find them on this GitHub repo. Feel free to extend them, scrape your own images, etc.
Inspiration
As a big anime fan, I found a lot of anime-related datasets on Kaggle. However, I was disappointed to find no dataset containing images specific to popular anime. Some other great datasets that I've been inspired by include:
- Top 250 Anime 2023
- Anime Recommendations Database
- Anime Recommendation Database 2020
- Anime Face Dataset
- Safebooru - Anime Image Metadata
Process
- You need a list of anime before you can scrape images for them. You can either:
  - Make your own list. This is what I do in the directory called "scraped_anime_list".
  - Use someone else's list. This is what I do in the directories called "kaggle_anime_list" and "top_anime_list".
- To be honest, I wanted to make my own list. To make a list of anime, I used JikanPy, the Python wrapper for the unofficial MAL (MyAnimeList) API. That API gets its data by scraping MAL.
- Anime on MAL have a unique identifier called the anime id; think of it as a unique number for each anime. The ids are supposed to be sequential, but there are a lot of gaps from one valid anime id to the next, which I discovered from this post.
- These ids can range from 1 to 100,000 and maybe beyond. However, I decided to go through the anime ids one by one from 1 to 50,000 and retrieve the id, rank and anime_name. This is what you will find in the folder called "scraped_anime_list". Note that I prefer the English name of the anime if it exists; if it doesn't, I use the Japanese name. Please use this list to obtain the anime ids if you intend to scrape MAL yourself; it will save you a LOT of time.
- I thought that someone else might've gone through the same process and voila, I found the MyAnimeList Dataset on Kaggle. I didn't want to wait for my scraper to finish, so I decided to use the "anime_cleaned.csv" version of this list. The lists from this dataset are what you find in the "kaggle_anime_list" folder.
- Cleaning anime names is a task in and of itself. Within the GitHub repo, refer to the file called "notes_and_todo.md" for all the cleaning troubles. I tried my best to remove all:
  - Anime Movies: Otherwise you end up with, for instance, One Piece (the anime) alongside One Piece Movie 1, One Piece Movie 2, and so on.
  - Seasons: MAL is an anime ranker, so different seasons of an anime can show up on the list with different ranks. I retain only the most basic anime name, for instance just "Gintama" instead of "Gintama Season 4".
- Ultimately, I manually curated around 300 anime names, which reduced to 231 after removing duplicates, since after the curation, "Gintama" and "Gintama: Enchousen" would both be named "Gintama". This list with the duplicates is what you find in the file called "UsableAnimeList.xlsx" within the "top_anime_list" folder.
- The duplicates are then removed from this list, and the result is used to scrape the image URLs for each anime. These URLs are found in the folder called "anime_img_urls".
- These URLs are then used to scrape the anime images themselves, found in the folder called "anime_images".
- Also, the tags are only a guide; feel free to use this dataset for any Deep Learning task.
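The list-building steps above (walking the anime ids, preferring the English name, then collapsing seasons and movies into one base name) can be sketched roughly as follows. This is not the code from the GitHub repo: `fetch` is a hypothetical callable that returns a record for an anime id or `None` for a gap, the record keys `title_english`, `title` and `rank` are assumed to match what the API returns, and the regex rules are a crude stand-in for the manual curation described above.

```python
import re


def pick_name(record):
    """Prefer the English title; fall back to the default title if it is missing."""
    return record.get("title_english") or record.get("title")


def scan_ids(ids, fetch):
    """Collect (anime_id, rank, name) rows, skipping ids that have no anime."""
    rows = []
    for anime_id in ids:
        record = fetch(anime_id)  # hypothetical: returns None for a gap in the ids
        if record is None:
            continue
        rows.append((anime_id, record.get("rank"), pick_name(record)))
    return rows


# Assumed heuristic: a trailing "Season N" or "Movie N" is not part of the base name.
_SUFFIX = re.compile(r"\s+(Season|Movie)\s+\d+$", re.IGNORECASE)


def base_name(name):
    """Reduce a title to its base name, e.g. "Gintama Season 4" to "Gintama"."""
    # Assumed heuristic: text after ": " is a subtitle. This is crude and will
    # mangle some titles, which is why manual curation is still needed.
    name = name.split(": ", 1)[0]
    return _SUFFIX.sub("", name).strip()


def unique_base_names(names):
    """Return base names with duplicates removed, keeping first-seen order."""
    seen = {}
    for name in names:
        seen.setdefault(base_name(name), None)
    return list(seen)
```

With JikanPy, `fetch` could wrap a call such as `Jikan().anime(anime_id)` and return `None` when the request fails, but the exact response shape depends on the library version, so treat that wiring as an assumption. A real scan of tens of thousands of ids would also need to pause between requests to respect the API's rate limits.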
Sources
- JikanPy
- Useful MAL forum post
- Google Image Search
- Cover image and thumbnail obtained from Safebooru