麻酱

Anime Images Dataset

Tags: classification, deep learning, anime and manga, image style transfer, segmentation

Size: 868.32 MB

Data ID: D17171363569827351

Published: 2024/05/31

The following is the data validation report that the seller chose to provide:

Data Description

Context

This dataset contains anime images for 231 different anime, with approximately 380 images for each of those anime. Please note that you might need to clean the image directories a bit, since they might contain merchandise and live-action photos in addition to images from the actual anime itself.
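Before cleaning, it may help to check how many images each directory actually holds. The following is a minimal sketch, assuming the images are laid out as one sub-directory per anime under a root folder. The function names, the extension set, and the tolerance are hypothetical choices for illustration, not part of the dataset's scripts. It only flags directories whose counts look unusual; spotting merchandise or live-action photos still needs a manual or model-based pass.

```python
from pathlib import Path

# Assumed image extensions; adjust to whatever the download actually contains.
IMAGE_EXTENSIONS = {".jpg", ".jpeg", ".png", ".gif", ".bmp", ".webp"}


def count_images_per_anime(root):
    """Map each sub-directory name under `root` to its number of image files."""
    counts = {}
    for anime_dir in sorted(Path(root).iterdir()):
        if not anime_dir.is_dir():
            continue
        counts[anime_dir.name] = sum(
            1
            for f in anime_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_EXTENSIONS
        )
    return counts


def flag_unusual_directories(counts, expected=380, tolerance=0.5):
    """Return names whose count differs from `expected` by more than the
    given fraction (`tolerance`), in the order they appear in `counts`."""
    low = expected * (1 - tolerance)
    high = expected * (1 + tolerance)
    return [name for name, n in counts.items() if not low <= n <= high]
```

For example, `flag_unusual_directories(count_images_per_anime("anime_images"))` would list the directories that are far from the roughly 380 images mentioned above, which are reasonable places to start a manual review.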

Scripts

If you'd like to take a look at the scripts used to make this dataset, you can find them on this GitHub repo. Feel free to extend them, scrape your own images, and so on.

Inspiration

As a big anime fan, I found a lot of anime-related datasets on Kaggle. However, I was disappointed to find no dataset containing anime-specific images for popular anime. Some other great datasets that I've been inspired by include:

Process

  1. You need a list of anime before you can scrape images for them. You can either:
    • Make your own list. This is what I do in the directory called "scraped_anime_list".
    • Use someone else's list. This is what I do in the directories called "kaggle_anime_list" and "top_anime_list".
  2. To be honest, I wanted to make my own list. To make a list of anime, I used JikanPy, the Python wrapper for the unofficial MAL (MyAnimeList) API, to scrape MAL.
  3. Anime on MAL have a unique identifier called the anime id; think of it as a unique number for each anime. The ids are supposed to be sequential, but there are a lot of gaps between one valid anime id and the next, which I discovered based on this post.
  4. These ids can go from 1 to 100,000 and maybe beyond. However, I decided to go through the anime ids one by one from 1 to 50,000 and retrieve the id, rank, and anime_name. This is what you will find in the folder called "scraped_anime_list". Note that I prefer using the English name of the anime if it exists; if it doesn't, I use the Japanese name. Please use this list to obtain the anime ids if you intend to scrape MAL yourself; it will save you a LOT of time.
  5. I thought that someone else might've gone through the same process and voila, I found the MyAnimeList Dataset on Kaggle. I didn't want to wait for my scraper to finish scraping, so I decided to use the "anime_cleaned.csv" version of this list. The lists from this dataset are what you find in the "kaggle_anime_list" folder.
  6. Cleaning anime names is a task in and of itself. Within the GitHub repo, refer to the file called "notes_and_todo.md" to look at all the cleaning troubles. I tried my best to remove all:
    • Anime Movies: Since you have, for instance, One Piece (the anime) and One Piece Movie 1, One Piece Movie 2, and so on.
    • Seasons: MAL is an anime ranker. Different seasons of the same anime can show up on the list with different ranks. I retain the original anime name (the most basic one, for instance, just "Gintama" instead of "Gintama Season 4").
  7. Ultimately, I manually curated around 300 anime names, which reduced to 231 after removing duplicates, since after the curation, "Gintama" and "Gintama: Enchousen" would both be named "Gintama". This list with the duplicates is what you find in the file called "UsableAnimeList.xlsx" within the "top_anime_list" folder.
  8. The duplicates are then removed from this list, and it is used to scrape the image URLs for each anime, which are found in the folder called "anime_img_urls".
  9. These URLs are then used to scrape the anime images themselves, found in the folder called "anime_images".
  10. Also, the tags are only a guide; feel free to use this dataset for any deep learning task.
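The name-cleaning and de-duplication steps described above were done by hand, but the idea can be sketched in code. The following is a rough illustration only, not the author's script: the function names and the three rules (skip titles containing "Movie", drop a subtitle after a colon, strip a trailing season marker) are assumptions chosen to match the examples given in the list.

```python
import re

# Illustrative patterns only; the original list was curated manually.
_MOVIE = re.compile(r"\bmovie\b", re.IGNORECASE)
_SEASON = re.compile(
    r"\s+(?:season\s+\d+|\d+(?:st|nd|rd|th)\s+season)\s*$",
    re.IGNORECASE,
)


def base_name(title):
    """Return a simplified base name, or None if the title looks like a movie."""
    if _MOVIE.search(title):
        return None
    name = title.split(":", 1)[0]  # "Gintama: Enchousen" -> "Gintama"
    name = _SEASON.sub("", name)   # "Gintama Season 4" -> "Gintama"
    return name.strip()


def dedupe_titles(titles):
    """Collapse titles to base names, keeping first-seen order and dropping repeats."""
    seen = set()
    result = []
    for title in titles:
        name = base_name(title)
        if name is None:
            continue
        key = name.casefold()  # compare case-insensitively
        if key not in seen:
            seen.add(key)
            result.append(name)
    return result
```

With these rules, "Gintama", "Gintama: Enchousen", and "Gintama Season 4" all collapse to a single "Gintama" entry, and "One Piece Movie 1" is skipped. The colon rule is deliberately crude and would mangle a title such as "Re:Zero", which is one reason a manual pass over the names is still needed.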

Sources

Anime Images Dataset