CONTEXT

This dataset is heavily inspired by the arXiv Paper Abstracts dataset and can be considered a logical extension of it. It contains the titles and abstracts of 536,914 research papers and is ready for multilabel classification tasks. The data was scraped for my project.

Differences:
1. I used the official arXiv metadata to collect the papers instead of the arXiv API used originally;
2. The dataset is expanded from the original 38,979 papers to 536,914 papers;
3. The data is cleaned of duplicates, as well as of tags that are not arXiv categories (such as ACM and MSC classes), leaving only the arXiv ones;
4. The dataset is provided as 2 files, each also split into train/test.
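The cleaning step described above (dropping duplicates and non-arXiv tags) could look roughly like the sketch below. This is not the actual collection code. The record layout (`title`, `abstract`, `tags`), the duplicate key (a normalized title), the choice to drop papers left with no tags, and the tiny `ARXIV_CATEGORIES` allowlist are all assumptions made for illustration. The real allowlist would be the full category list from the arXiv taxonomy.

```python
# Illustrative subset only (assumption); the full list is on the arXiv
# category taxonomy page.
ARXIV_CATEGORIES = {"cs.LG", "cs.CV", "stat.ML", "math.CO", "hep-th"}


def clean_records(records):
    """Drop duplicate papers and keep only tags that are arXiv categories."""
    seen = set()
    cleaned = []
    for rec in records:
        # Hypothetical duplicate key: whitespace-normalized, case-folded title.
        key = " ".join(rec["title"].split()).casefold()
        if key in seen:
            continue
        seen.add(key)
        # Keep arXiv categories; ACM classes (e.g. "I.2.6") and MSC classes
        # (e.g. "68T05") are not in the allowlist and are discarded.
        tags = [t for t in rec["tags"] if t in ARXIV_CATEGORIES]
        if tags:  # assumption: a paper with no arXiv tag left is dropped
            cleaned.append({**rec, "tags": tags})
    return cleaned


raw = [
    {"title": "A Study of X", "abstract": "First copy.", "tags": ["cs.LG", "I.2.6", "68T05"]},
    {"title": "a study  of x", "abstract": "Second copy.", "tags": ["cs.LG"]},
    {"title": "Only ACM Tags", "abstract": "No arXiv tag.", "tags": ["I.2.6"]},
]
cleaned = clean_records(raw)
print(cleaned)
```

Of the three toy records, only the first one is kept, with its tags reduced to the single arXiv category.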

CONTENT

./arxiv_data.csv - contains 155 arXiv tags as target classes;
./arxiv_data_grouped.csv - identical to ./arxiv_data.csv, but the 155 tags are grouped into 8 classes according to the arXiv taxonomy (https://arxiv.org/category_taxonomy);
./train - contains the train splits of both datasets;
./test - contains the test splits of both datasets.
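As a rough sketch of how the files might be used for multilabel classification, the snippet below reads rows, maps fine-grained tags to the 8 top-level taxonomy groups, and builds multi-hot target vectors. The column names (`title`, `abstract`, `tags`) and the storage of tags as a stringified Python list are assumptions, so check the actual CSV header first. The prefix-based grouping is likewise only one plausible way to reproduce the grouped file, not necessarily the one used here. An in-memory sample stands in for the real file.

```python
import ast
import csv
import io

# Archive prefix -> top-level group in the arXiv taxonomy. Any other archive
# (astro-ph, cond-mat, hep-th, math-ph, quant-ph, ...) falls under Physics.
GROUPS = {
    "cs": "Computer Science",
    "econ": "Economics",
    "eess": "Electrical Engineering and Systems Science",
    "math": "Mathematics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
}


def tag_to_group(tag):
    """Map a tag such as 'cs.LG' to its top-level group via the archive prefix."""
    return GROUPS.get(tag.split(".")[0], "Physics")


def load_rows(handle):
    """Read CSV rows; the 'tags' column is assumed to hold a stringified list."""
    rows = []
    for row in csv.DictReader(handle):
        row["tags"] = ast.literal_eval(row["tags"])
        rows.append(row)
    return rows


def multi_hot(label_lists):
    """Return the sorted class list and one 0/1 vector per sample."""
    classes = sorted({label for labels in label_lists for label in labels})
    index = {c: i for i, c in enumerate(classes)}
    vectors = []
    for labels in label_lists:
        vec = [0] * len(classes)
        for label in labels:
            vec[index[label]] = 1
        vectors.append(vec)
    return classes, vectors


# Hypothetical two-row sample in the assumed layout.
sample = io.StringIO("""title,abstract,tags
Paper A,Abstract A,"['cs.LG', 'stat.ML']"
Paper B,Abstract B,"['hep-th', 'math-ph']"
""")

rows = load_rows(sample)
grouped = [sorted({tag_to_group(t) for t in r["tags"]}) for r in rows]
classes, targets = multi_hot(grouped)
print(classes)
print(targets)
```

For the real data, `sample` would be replaced by something like `open("arxiv_data.csv", newline="", encoding="utf-8")`, again assuming the header matches.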

You can find the collection process here.

A simple EDA can be found in this notebook.

Dataset: arXiv Titles, Abstracts & Tags (859.88 MB)