arXiv Titles, Abstracts & Tags

education

Sold: 0
859.88MB

Data ID: D17193997153961369

Published: 2024/06/26

Data description

CONTEXT

This dataset is heavily inspired by the arXiv Paper Abstracts dataset and can be considered a logical extension of it. It contains the titles and abstracts of 536,914 research papers and is ready for multilabel classification tasks. The data was scraped for my project.

Differences:
1. I used the official arXiv metadata to collect the papers instead of the arXiv API used originally.
2. The dataset is expanded from 38,979 papers originally to 536,914 papers.
3. The data is cleaned of duplicates and of tags that are not arXiv categories (such as ACM and MSC classes), leaving only the arXiv ones.
4. The dataset is divided into 2 files, each also split into train/test.
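The cleaning step described above (dropping duplicates and non-arXiv tags) could look roughly like the sketch below. This is not the author's collection code: the record keys `title`, `abstract`, and `tags`, the helper name `clean_records`, and the tiny `ARXIV_CATEGORIES` allow-list are all assumptions made for illustration. A real run would use the full category list from the arXiv taxonomy page.

```python
# Illustrative subset only (assumption); the full list is on the arXiv taxonomy page.
ARXIV_CATEGORIES = {"cs.LG", "cs.CV", "stat.ML", "math.CO", "hep-th"}


def clean_records(records):
    """Drop duplicate papers and keep only tags that are arXiv categories.

    `records` is assumed to be an iterable of dicts with the hypothetical
    keys "title", "abstract", and "tags" (a list of strings).
    """
    seen = set()
    cleaned = []
    for rec in records:
        # Treat papers with the same normalized title and abstract as duplicates.
        key = (rec["title"].strip().lower(), rec["abstract"].strip().lower())
        if key in seen:
            continue
        seen.add(key)
        # Discard ACM/MSC-style tags such as "I.2.6" or "68T05".
        tags = [t for t in rec["tags"] if t in ARXIV_CATEGORIES]
        if tags:  # skip papers left without any arXiv tag
            cleaned.append({**rec, "tags": tags})
    return cleaned


sample = [
    {"title": "A", "abstract": "x", "tags": ["cs.LG", "I.2.6"]},
    {"title": "a ", "abstract": "X", "tags": ["cs.LG"]},  # duplicate of the first
    {"title": "B", "abstract": "y", "tags": ["68T05"]},   # no arXiv tag left
]
print(clean_records(sample))
```

Matching against an explicit allow-list is stricter than pattern matching on the tag string, at the cost of having to keep the list in sync with the taxonomy.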

CONTENT

./arxiv_data.csv - contains 155 arXiv tags as target classes
./arxiv_data_grouped.csv - identical to ./arxiv_data.csv, but the 155 tags are grouped into 8 classes according to the arXiv taxonomy (https://arxiv.org/category_taxonomy)
./train - contains the train splits of both datasets
./test - contains the test splits of both datasets
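As a rough illustration of how fine-grained tags can be collapsed into the 8 top-level groups of the arXiv taxonomy and turned into multilabel targets, consider the sketch below. It is an assumption about the grouping, not the author's code: the helper names `tag_to_group` and `multi_hot` are hypothetical, and the mapping simply treats every archive that is not cs, econ, eess, math, q-bio, q-fin, or stat as Physics, following the taxonomy page.

```python
# Archive prefix -> top-level group, per the arXiv taxonomy page.
# Any archive not listed here (astro-ph, hep-th, math-ph, quant-ph, ...)
# falls under Physics.
NON_PHYSICS_GROUPS = {
    "cs": "Computer Science",
    "econ": "Economics",
    "eess": "Electrical Engineering and Systems Science",
    "math": "Mathematics",
    "q-bio": "Quantitative Biology",
    "q-fin": "Quantitative Finance",
    "stat": "Statistics",
}
GROUPS = sorted(set(NON_PHYSICS_GROUPS.values()) | {"Physics"})


def tag_to_group(tag):
    """Map an arXiv tag such as 'cs.LG' to its top-level group."""
    archive = tag.split(".", 1)[0]
    return NON_PHYSICS_GROUPS.get(archive, "Physics")


def multi_hot(tags):
    """Encode a paper's tags as a 0/1 vector over the 8 groups (order of GROUPS)."""
    groups = {tag_to_group(t) for t in tags}
    return [1 if g in groups else 0 for g in GROUPS]


print(GROUPS)
print(multi_hot(["cs.LG", "stat.ML"]))
```

A paper tagged with categories from several groups gets several ones in its vector, which is what makes the task multilabel rather than multiclass.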

You can find the collection process here.

A simple EDA can be found in this notebook.
