wei德辉

Telugu NLP

social science, news


Sold: 0
88.72MB

Data ID: D17222541544812890

Published: 2024/07/29


Data description

Context

Indic NLP - Natural Language Processing for Indian Languages.

This dataset is a step toward that goal for the Telugu language. Thanks to Anusha for collecting the data from websites. The idea is to gather more datasets related to Telugu NLP in a single place.

Similar dataset for other Indian languages

Content

The dataset has the following files:

Telugu Books

This folder contains a single file with text extracted from Telugu books. Anusha obtained the data from this link and combined it into one file.

Telugu News

This folder contains Telugu news extracts that can be used for multi-class classification problems. It has two files: train and test. The news categories are:

  • business
  • editorial
  • entertainment
  • nation
  • sport

Anusha obtained the data from this link. Post-processing was applied to extract the five topics listed above.
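As a rough illustration of the multi-class classification use case, here is a minimal sketch of a multinomial Naive Bayes baseline in plain Python. It assumes each row of the train and test files can be read as a (text, label) pair; the actual file names and column layout are not stated in this description, so the function names and the toy rows below are hypothetical stand-ins, not dataset content.

```python
import math
from collections import Counter, defaultdict

# The five categories named in the description.
CATEGORIES = ["business", "editorial", "entertainment", "nation", "sport"]


def tokenize(text):
    # Whitespace split: a simple baseline, since Telugu words are space-separated.
    return text.split()


def train_nb(rows):
    """Count labels and per-label tokens from an iterable of (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in rows:
        label_counts[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-probability (add-one smoothing)."""
    label_counts, word_counts, vocab = model
    total = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label, n in label_counts.items():
        score = math.log(n / total)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore tokens never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Toy placeholder rows (NOT taken from the dataset) to show the call pattern.
toy_train = [
    ("market shares profit", "business"),
    ("match team score", "sport"),
    ("film actor song", "entertainment"),
]
model = train_nb(toy_train)
prediction = predict(model, "team score")
```

With the real data, the toy rows would be replaced by pairs read from the train file, and accuracy would be measured on the test file.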

Acknowledgements

Sincere thanks to Anusha for collating the dataset from multiple places.

Photo by Prasanth Dasari on Unsplash

Inspiration

Some ideas:

  • The books data can be used for NLP tasks such as topic modeling, word embeddings, and transfer learning.
  • The news dataset can be used for supervised learning problems.
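One way to start on the word-embedding idea is to collect word co-occurrence counts, the raw statistic that count-based embedding methods build on. The sketch below assumes the books text can be iterated line by line with whitespace-separated words; the function name, the window size, and the toy lines are illustrative assumptions, not part of the dataset.

```python
from collections import Counter


def cooccurrence_counts(lines, window=2):
    """Count (center, context) word pairs that occur within `window` positions."""
    counts = Counter()
    for line in lines:
        tokens = line.split()
        for i, center in enumerate(tokens):
            lo = max(0, i - window)
            hi = min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    counts[(center, tokens[j])] += 1
    return counts


# Toy placeholder lines (NOT taken from the dataset).
toy_lines = ["a b c", "a b d"]
counts = cooccurrence_counts(toy_lines, window=1)
```

For the real books file, `toy_lines` would be replaced by an open file handle, which yields one line at a time and avoids loading the whole text into memory.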