麻酱

20 newsgroup preprocessed

Tags: education, nlp, clustering, text, multiclass classification, online communities

￥16

50.27MB

Data ID: D17171227495873840

Published: 2024/05/31

Context

Our goal with this dataset is to provide ready-to-use data for researchers who want to try different machine learning techniques, such as text classification and text clustering, without having to worry about how the corpus is structured or about the initial preprocessing.

We know this is only a first step in normalizing the text documents, but we deliberately stopped here: cleaning the text any further would rule out some of the approaches you might want to try.

Content

The original dataset was collected by Ken Lang and is available here. This version presents the same data, restructured as a dataframe, with the cleaned text added as an extra column.

The script that converts each document into a row of the dataframe can be found here.

The notebook that does the pre-processing can be found here.

Dataset properties

18,828 documents
20 classes
3 features
- target: the newsgroup label; each of the 20 newsgroups corresponds to a different topic.
- text: the raw text extracted from the original document, with its formatting unchanged.
- text_cleaned: the resulting text after pre-processing.
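
The properties above can be checked with a short script. This is a minimal sketch, assuming the dataframe has been exported to CSV with the three column names listed above. The sample rows and the `summarize` helper are hypothetical and only stand in for the real file.

```python
import csv
import io
from collections import Counter

# Column names taken from the dataset description.
EXPECTED_COLUMNS = ["target", "text", "text_cleaned"]

# Hypothetical sample standing in for the real file; the rows are made up.
SAMPLE_CSV = (
    "target,text,text_cleaned\n"
    'sci.space,"Subject: Orbit\n\nThe launch is on Monday.",launch monday\n'
    'rec.autos,"Subject: Brakes\n\nMy brakes squeal.",brakes squeal\n'
    'sci.space,"Re: Orbit\n\nWhich orbit?",orbit\n'
)


def summarize(handle):
    """Return (number of documents, documents per class) for a CSV handle."""
    reader = csv.DictReader(handle)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    # Quoted fields may span several lines; csv handles that transparently.
    per_class = Counter(row["target"] for row in reader)
    return sum(per_class.values()), per_class


n_docs, per_class = summarize(io.StringIO(SAMPLE_CSV))
print(n_docs, dict(per_class))
```

To run it on the real file, pass a handle opened with `open(path, newline="", encoding="utf-8")`. The document and class counts should then match the totals listed above, provided the export uses the same column names.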

20 newsgroups topics

The 20 newsgroups are listed below. Some of them are closely related (e.g. rec.sport.baseball and rec.sport.hockey), while others are unrelated (e.g. alt.atheism and misc.forsale).

alt.atheism
comp.graphics
comp.os.ms-windows.misc
comp.sys.ibm.pc.hardware
comp.sys.mac.hardware
comp.windows.x
misc.forsale
rec.autos
rec.motorcycles
rec.sport.baseball
rec.sport.hockey
sci.crypt
sci.electronics
sci.med
sci.space
soc.religion.christian
talk.politics.guns
talk.politics.mideast
talk.politics.misc
talk.religion.misc
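
One rough way to see which newsgroups are related is to group them by the top-level prefix of their names. The sketch below does that for the list above. The `group_by_hierarchy` helper is hypothetical, and the prefix is only a proxy: thematically close groups can sit under different prefixes.

```python
from collections import defaultdict

NEWSGROUPS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def group_by_hierarchy(names):
    """Map each top-level prefix (text before the first dot) to its newsgroups."""
    groups = defaultdict(list)
    for name in names:
        prefix = name.split(".", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


groups = group_by_hierarchy(NEWSGROUPS)
for prefix, members in sorted(groups.items()):
    print(f"{prefix}: {len(members)} newsgroup(s)")
```

Grouping the labels this way can also serve as a coarser target when a 20-way classification is harder than the experiment requires.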

Citation

Filipe Filardi de Jesus, Glauber da Rocha Balthazar, and Kevin Danglau Mejia Maldonado, “20 newsgroup preprocessed.” Kaggle, 2020, doi: 10.34740/KAGGLE/DS/997253.

Inspiration

Can you propose a different pre-processing pipeline that improves model predictions?
Can you find any text structure (e.g. a header) that the proposed pre-processing did not remove and that might unfairly inflate prediction performance?
Which NLP techniques can be applied to this dataset for text classification and clustering?
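
As a starting point for the second question, the sketch below counts how many documents still contain header-like lines. The field names in `HEADER_FIELDS`, the `count_header_fields` helper and the sample documents are all assumptions made for illustration. The pattern also relies on the colon after the field name, which the cleaning may have removed.

```python
import re
from collections import Counter

# Assumed header field names; adjust after inspecting the actual documents.
HEADER_FIELDS = ("From", "Subject", "Organization", "Lines",
                 "Reply-To", "NNTP-Posting-Host")

# Match a field name at the start of any line, followed by a colon.
HEADER_RE = re.compile(
    r"^(%s)\s*:" % "|".join(re.escape(field) for field in HEADER_FIELDS),
    re.IGNORECASE | re.MULTILINE,
)


def count_header_fields(documents):
    """Count, per header field, how many documents contain it at least once."""
    counts = Counter()
    for doc in documents:
        found = {match.group(1).lower() for match in HEADER_RE.finditer(doc)}
        counts.update(found)
    return counts


# Made-up documents standing in for values of the text or text_cleaned column.
docs = [
    "From: someone@example.com\nSubject: Re: orbit\n\nWhich orbit do you mean?",
    "brakes squeal when cold",
    "Organization: Example University\n\nThe game went to overtime.",
]
leaks = count_header_fields(docs)
print(dict(leaks))
```

Running the same count per class can show whether a field appears much more often in some newsgroups than in others, which is the kind of leakage the question is about.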
