Data Description
CONTEXT
This dataset is heavily inspired by the arXiv Paper Abstracts dataset and can be considered a logical extension of it. It contains the titles and abstracts of 536,914 research papers and is ready for multilabel classification tasks. The data was scraped for my project.
Differences:
1. I used the official arXiv metadata to collect the papers instead of the arXiv API used originally;
2. The dataset is expanded from 38,979 papers in the original to 536,914 papers;
3. The data is cleaned of duplicates, as well as of tags that are not arXiv categories (such as ACM and MSC classes), leaving only the arXiv ones;
4. The dataset is divided into 2 files, each also split into train/test.
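The cleaning described above (dropping duplicate papers and non-arXiv tags) could be approximated as follows. This is only a minimal sketch, not the actual collection code: the regular expression is a heuristic assumption about what arXiv category identifiers look like, the helper names are hypothetical, and the `title` and `abstract` keys are assumed field names. A whitelist built from the arXiv taxonomy would likely be more robust than a pattern.

```python
import re

# Heuristic (an assumption, not the author's rule): arXiv category IDs such as
# "cs.LG", "hep-th" or "astro-ph.CO" start with a lowercase archive name,
# optionally followed by a dot and a subject class. MSC codes ("68T05") start
# with digits and ACM codes ("I.2.6") with an uppercase letter, so both fail.
ARXIV_TAG = re.compile(r"^[a-z]+(?:-[a-z]+)*(?:\.[A-Za-z]+(?:-[A-Za-z]+)*)?$")


def keep_arxiv_tags(tags):
    """Return the tags that look like arXiv categories, in order, without repeats."""
    kept = []
    for tag in tags:
        tag = tag.strip()
        if ARXIV_TAG.match(tag) and tag not in kept:
            kept.append(tag)
    return kept


def drop_duplicate_papers(rows):
    """Keep the first row for each (title, abstract) pair; rows are dicts.

    The "title" and "abstract" keys are assumed field names.
    """
    seen, unique = set(), []
    for row in rows:
        key = (row["title"].strip().lower(), row["abstract"].strip().lower())
        if key not in seen:
            seen.add(key)
            unique.append(row)
    return unique
```

For example, `keep_arxiv_tags(["cs.LG", "68T05", "I.2.6", "stat.ML"])` keeps only `cs.LG` and `stat.ML`.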
CONTENT
./arxiv_data.csv - contains 155 arXiv tags as target classes;
./arxiv_data_grouped.csv - identical to ./arxiv_data.csv, but the 155 tags are grouped into 8 classes according to the arXiv taxonomy (https://arxiv.org/category_taxonomy);
./train - contains the train splits of both datasets;
./test - contains the test splits of both datasets.
You can find the collection process here.
A simple EDA can be found in this notebook.
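To illustrate how the 155 tags relate to the 8 grouped classes, here is a minimal sketch that maps an arXiv tag to its top-level group on the arXiv taxonomy page and builds a multi-hot target vector for multilabel classification. The function names are hypothetical, and the group labels actually stored in ./arxiv_data_grouped.csv may be spelled differently, so treat the strings below as assumptions.

```python
# Archives that the arXiv taxonomy page lists under the "Physics" group.
PHYSICS_ARCHIVES = {
    "astro-ph", "cond-mat", "gr-qc", "hep-ex", "hep-lat", "hep-ph", "hep-th",
    "math-ph", "nlin", "nucl-ex", "nucl-th", "physics", "quant-ph",
}

# The remaining seven top-level groups, keyed by archive prefix.
# The label strings are assumptions; the CSV may use other spellings.
OTHER_GROUPS = {
    "cs": "Computer Science",
    "econ": "Economics",
    "eess": "Electrical Engineering and Systems Science",
    "math": "Mathematics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
}


def tag_to_group(tag):
    """Map a tag such as "cs.LG" to its top-level group, or None if unknown."""
    archive = tag.split(".", 1)[0]
    if archive in PHYSICS_ARCHIVES:
        return "Physics"
    return OTHER_GROUPS.get(archive)


def multi_hot(labels, classes):
    """Encode a paper's labels as a 0/1 vector over an ordered class list."""
    present = set(labels)
    return [1 if c in present else 0 for c in classes]


# Usage: a paper tagged cs.LG and stat.ML belongs to two of the eight groups.
ALL_GROUPS = sorted(set(OTHER_GROUPS.values()) | {"Physics"})
paper_groups = {tag_to_group(t) for t in ["cs.LG", "stat.ML"]}
target = multi_hot(paper_groups, ALL_GROUPS)
```

Because a paper can carry tags from several archives, the grouped target stays multilabel: the vector above has two entries set rather than one.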
Verification Report
The following is the data verification report that the seller chose to provide:
