麻酱

20 newsgroup preprocessed

education, nlp, clustering, text, multiclass classification, online communities


Sold: 0
50.27MB

Data ID: D17171227495873840

Published: 2024/05/31

The following is the data verification report that the seller chose to provide:

Data description

Context

Our goal with this dataset is to provide ready-to-use data for researchers who want to try different machine learning techniques, such as text classification and text clustering, without having to worry about how the corpus is structured or about initial preprocessing.

We know this is only a first step in normalizing the text documents, but we did not want to clean the text so aggressively that you could no longer try different approaches.

Content

The original dataset was collected by Ken Lang and is available here. In this version, we present the same data restructured as a dataframe, with the cleaned text added alongside the original.

The script that converts each document into a row of the dataframe can be found here.

The notebook that does the pre-processing can be found here.

Dataset properties

  • 18,828 documents
  • 20 classes
  • 3 features
    • target: the newsgroup label, one of 20 newsgroups, each corresponding to a different topic.
    • text: the original text extracted from the source document, with its format unchanged.
    • text_cleaned: the resulting text after pre-processing.
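
As a quick sanity check of the properties listed above, the following minimal sketch verifies that a table has the three expected columns and counts its documents and classes. It uses only the Python standard library and a tiny in-memory sample. The sample rows and the function name `summarize` are illustrative assumptions, and the file name and storage format of the published data are not stated here, so adapt the loading step as needed.

```python
import csv
import io

EXPECTED_COLUMNS = {"target", "text", "text_cleaned"}


def summarize(csv_file):
    """Check the column names, then return (document count, class count)."""
    reader = csv.DictReader(csv_file)
    missing = EXPECTED_COLUMNS - set(reader.fieldnames or [])
    if missing:
        raise ValueError(f"missing columns: {sorted(missing)}")
    rows = list(reader)
    classes = {row["target"] for row in rows}
    return len(rows), len(classes)


# Tiny made-up sample standing in for the real file (hypothetical content).
sample = io.StringIO(
    "target,text,text_cleaned\n"
    "sci.space,The shuttle launch was delayed.,shuttle launch delayed\n"
    "rec.autos,My car will not start!,car start\n"
    "sci.space,Orbit insertion succeeded.,orbit insertion succeeded\n"
)
n_docs, n_classes = summarize(sample)
print(n_docs, n_classes)
```

Run against the full data, the two numbers should match the document and class counts listed above.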

20 newsgroups topics

Below you can see each newsgroup. Some of them are closely related (e.g. rec.sport.baseball and rec.sport.hockey), while others are unrelated (e.g. alt.atheism and misc.forsale).

  • alt.atheism
  • comp.graphics
  • comp.os.ms-windows.misc
  • comp.sys.ibm.pc.hardware
  • comp.sys.mac.hardware
  • comp.windows.x
  • misc.forsale
  • rec.autos
  • rec.motorcycles
  • rec.sport.baseball
  • rec.sport.hockey
  • sci.crypt
  • sci.electronics
  • sci.med
  • sci.space
  • soc.religion.christian
  • talk.politics.guns
  • talk.politics.mideast
  • talk.politics.misc
  • talk.religion.misc
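
One rough way to see which newsgroups are related is to group them by the first component of their names. The sketch below does this for the 20 names listed above; the function name `group_by_hierarchy` is an illustrative assumption. The prefix is only a proxy for topical similarity: alt.atheism, soc.religion.christian, and talk.religion.misc, for example, cover neighbouring subjects but fall under three different prefixes.

```python
from collections import defaultdict

NEWSGROUPS = [
    "alt.atheism", "comp.graphics", "comp.os.ms-windows.misc",
    "comp.sys.ibm.pc.hardware", "comp.sys.mac.hardware", "comp.windows.x",
    "misc.forsale", "rec.autos", "rec.motorcycles", "rec.sport.baseball",
    "rec.sport.hockey", "sci.crypt", "sci.electronics", "sci.med",
    "sci.space", "soc.religion.christian", "talk.politics.guns",
    "talk.politics.mideast", "talk.politics.misc", "talk.religion.misc",
]


def group_by_hierarchy(names):
    """Map each top-level prefix (e.g. 'rec') to the newsgroups under it."""
    groups = defaultdict(list)
    for name in names:
        prefix = name.split(".", 1)[0]
        groups[prefix].append(name)
    return dict(groups)


groups = group_by_hierarchy(NEWSGROUPS)
for prefix, members in sorted(groups.items()):
    print(f"{prefix}: {len(members)} group(s)")
```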

Citation

Filipe Filardi de Jesus, Glauber da Rocha Balthazar, and Kevin Danglau Mejia Maldonado, “20 newsgroup preprocessed.” Kaggle, 2020, doi: 10.34740/KAGGLE/DS/997253.

Inspiration

  • Can you propose a different pre-processing approach that improves model predictions?
  • Can you find any text structure (e.g. a header) that wasn't removed by the proposed pre-processing and that might cause an unfair increase in prediction capability?
  • What different NLP techniques can be applied to this dataset for text classification and clustering?
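
For the question about leftover headers, a simple starting point is to scan each document for lines that begin with a header-style field name. The sketch below is a heuristic under stated assumptions: the list of field names and the function name `find_header_lines` are illustrative choices, not taken from the dataset's pre-processing notebook, and the two example strings are made up.

```python
import re

# Header field names often seen in Usenet posts; this list is an assumption
# and can be extended after inspecting the data.
HEADER_PATTERN = re.compile(
    r"^(from|subject|organization|lines|reply-to|nntp-posting-host)\s*:",
    re.IGNORECASE | re.MULTILINE,
)


def find_header_lines(text):
    """Return the header field names that start a line in the given text."""
    return [match.group(1).lower() for match in HEADER_PATTERN.finditer(text)]


# Made-up examples of a raw document and its cleaned counterpart.
raw = (
    "From: someone@example.com\n"
    "Subject: Re: launch window\n"
    "\n"
    "The launch slipped again."
)
cleaned = "launch slipped again"

print(find_header_lines(raw))
print(find_header_lines(cleaned))
```

If the pre-processing strips punctuation or line breaks, header words may survive without the colon or the line start, so an empty result from this check does not prove that no header information remains.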