
MyAnimeList Anime and Manga Datasets

Tags: arts and entertainment, movies and tv shows, recommender systems, comics and animation, anime and manga

Size: 25.98 MB

Data ID: D17222446156345814

Published: 2024/07/29

Data description

Complete, cleaned and weekly updated MyAnimeList Anime and Manga datasets, containing 24,165 Animes and 67,273 Mangas, last updated on 7th August 2023. For Characters and People, check the MyAnimeList Jikan Database.

This dataset contains data from MyAnimeList (abbreviated MAL), scraped with the official API and the Jikan API. The official API contains some extra data which is not available on the website (and therefore not on the Jikan API either), always contains the latest data and is really fast to scrape. However, some attributes are missing from it, so I also scraped the Jikan API. All the data is scraped through lists (500 entries / API call with the official API and 25 entries / API call with Jikan), which makes it really fast to scrape again, so I'll be able to update it weekly.
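To put the two list sizes in perspective, the number of API calls follows from the page size alone. The following is a minimal sketch using the entry counts quoted in this description; the helper name `calls_needed` is only illustrative and not part of either API.

```python
import math


def calls_needed(total_entries: int, page_size: int) -> int:
    """Number of list requests needed to page through `total_entries`."""
    if page_size <= 0:
        raise ValueError("page_size must be positive")
    return math.ceil(total_entries / page_size)


# Entry counts and page sizes as stated in the description.
ANIME_COUNT = 24_165
MANGA_COUNT = 67_273
MAL_PAGE = 500    # entries per call with the official API
JIKAN_PAGE = 25   # entries per call with Jikan

for name, total in [("anime", ANIME_COUNT), ("manga", MANGA_COUNT)]:
    print(name, calls_needed(total, MAL_PAGE), calls_needed(total, JIKAN_PAGE))
```

With these numbers the official API needs a few dozen calls for the whole Anime list, while Jikan needs close to a thousand, which is consistent with the difference in scraping times reported below.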

Anime

Scraping all the Animes, both approved and 'pending approval', from the official API takes just 2 minutes! The Jikan API Anime database scrape takes 20 minutes. The extra information in the MAL API is: the dates on which each MAL Anime entry was created and last updated by a moderator, the number of users who scored entries that have no score, the average episode duration with seconds precision (instead of minutes), the broadcast time for non-weekly anime, and correctly separated title synonyms (on the website the title synonyms are separated by ", ", so if a title itself contains ", " it's impossible to tell that comma apart from the separator).
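The title synonyms problem is easy to reproduce. In the sketch below the two synonyms are made up for illustration; the point is only that joining with ", " and splitting again does not round-trip when a title contains the separator.

```python
SEPARATOR = ", "

# Hypothetical synonyms for one entry; the first one contains the separator.
synonyms = ["Hello, World", "Second Title"]

# What a page that shows synonyms as one comma-separated string would contain.
flattened = SEPARATOR.join(synonyms)

# Splitting on the separator cannot tell a real boundary from a comma in a title.
recovered = flattened.split(SEPARATOR)

print(flattened)
print(recovered)
print(recovered == synonyms)
```

A source that returns the synonyms as a list, rather than as one joined string, avoids the ambiguity entirely.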

The columns missing from the MAL API and present in Jikan are producers, licensors, background, url and trailer url. This is the only data used from the Jikan scrape, as Jikan caches the data in its internal database and might return older results (up to a week old), so the common attributes are always better taken from the official API. It's also worth noting that the Jikan API contains Animes which have since been deleted from MAL, and when scraping only Jikan they can't be detected.
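One way to combine the two scrapes along these lines is a left join that starts from the official API rows and copies over only the Jikan-only columns. This is a sketch, not the actual cleaning code (which is documented on GitHub); the key names `id` and `mal_id`, the snake_case column names and the sample rows are assumptions made for the example.

```python
# Columns taken from Jikan only; names are assumed for illustration.
JIKAN_ONLY = ("producers", "licensors", "background", "url", "trailer_url")


def merge_entries(mal_rows, jikan_rows, jikan_only=JIKAN_ONLY):
    """Left-join Jikan-only columns onto official API rows.

    Entries present only in Jikan (for example, deleted from MAL) are dropped,
    and attributes common to both sources always come from the official API.
    """
    jikan_by_id = {row["mal_id"]: row for row in jikan_rows}
    merged = []
    for row in mal_rows:
        extra = jikan_by_id.get(row["id"], {})
        combined = dict(row)
        for col in jikan_only:
            combined[col] = extra.get(col)  # None when Jikan has no value
        merged.append(combined)
    return merged


# Made-up sample rows.
mal_rows = [{"id": 1, "title": "Example", "score": 8.1}]
jikan_rows = [
    {"mal_id": 1, "title": "Example", "score": 8.0, "producers": ["Studio X"]},
    {"mal_id": 2, "title": "Deleted entry", "producers": []},
]

print(merge_entries(mal_rows, jikan_rows))
```

Starting from the official API rows means a stale or deleted Jikan entry can never introduce a row of its own.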

The Jikan scraping process is really simple; the MAL API is a bit more complicated (you need to request a key before you can use it). The cleaning process is quite long. Both the scraping and the cleaning are explained on GitHub. The Anime attributes are already explained in the file, but I'll add a detailed explanation in the GitHub repository.

Manga

The Manga MAL scraping takes 10 minutes, while the Jikan API one takes 55 minutes. The MAL API extra information is similar to that for the Animes, but for some reason the creation date is missing, so I've partially reconstructed it using the IDs and the updated at column (the IDs are assigned incrementally, and I've used the dates of the Mangas which have never been updated). To get more information about the authors, check the IDs in MAL Jikan people.csv.
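The idea behind the reconstruction can be illustrated as follows. This is not the procedure actually used for the dataset (that one is described on GitHub), only a sketch under two assumptions: IDs grow with creation time, and the never-updated entries have already been identified, so that their updated at date can stand in for their creation date. The function name `creation_bounds` and the sample anchors are hypothetical.

```python
from bisect import bisect_left, bisect_right
from datetime import date


def creation_bounds(entry_id, anchors):
    """Bound the creation date of `entry_id` using never-updated entries.

    `anchors` is a list of (id, date) pairs sorted by id, where each date is
    the updated-at date of an entry assumed never to have been updated.
    Returns (earliest, latest); a bound is None when no anchor constrains it.
    """
    ids = [anchor_id for anchor_id, _ in anchors]
    i = bisect_right(ids, entry_id)  # anchors[:i] have id <= entry_id
    j = bisect_left(ids, entry_id)   # anchors[j:] have id >= entry_id
    earliest = anchors[i - 1][1] if i > 0 else None
    latest = anchors[j][1] if j < len(anchors) else None
    return earliest, latest


# Made-up anchors for illustration.
anchors = [
    (10, date(2010, 1, 1)),
    (20, date(2010, 3, 1)),
    (30, date(2011, 1, 1)),
]

print(creation_bounds(15, anchors))
print(creation_bounds(20, anchors))
print(creation_bounds(35, anchors))
```

An entry that falls between two anchors gets a date range rather than an exact date, which fits the reconstruction being described as partial.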

Future

The beauty of this dataset's scraping is that it offers a lot of cleaned data that can be scraped really fast (as said previously, it is scraped through lists). However, there's some data which cannot be obtained through list scraping, like Related Anime, Anime Characters or Anime Score stats.

I plan on scraping them in the near future and uploading them in another Dataset, but I won't be able to update that dataset as frequently: if every Anime is scraped individually (1 Anime / API call), the scraping would take 7 hours at best.
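As a rough sanity check on that estimate, the figures in this description imply about one request per second. The sketch below only redoes the arithmetic with the numbers quoted here; it says nothing about the actual rate limits of either API.

```python
ANIME_COUNT = 24_165   # entries, from the description
BEST_CASE_HOURS = 7    # estimated duration of an entry-by-entry scrape


def seconds_per_call(total_seconds: float, calls: int) -> float:
    """Average time budget per API call."""
    return total_seconds / calls


per_entry = seconds_per_call(BEST_CASE_HOURS * 3600, ANIME_COUNT)
print(round(per_entry, 2))
```

Seven hours for 24,165 individual calls works out to roughly one second per Anime, compared with a couple of minutes for the whole list when 500 entries come back per call.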

In the far future maybe I'll also scrape User Ratings with the official API, but it takes an incredible amount of time to gather a sample significant enough to train a Recommender. For now there are already datasets available for this, and they don't need to be as recent as the Anime database, but they won't recommend new Anime.

Special thanks to MyAnimeList and its users for the awesome Database, as well as both the MAL API and Jikan API for being so useful and fast.
