The following is the data validation report that the seller chose to provide:
Data Description
Context
Indic NLP - Natural Language Processing for Indian Languages.
This dataset is a step towards the same for the Telugu language. Thanks to Anusha for collecting the data from websites. The idea is to gather more datasets related to Telugu NLP in a single place.
Similar dataset for other Indian languages
Content
The dataset has the following files:
Telugu Books
This folder contains a file with the text extracted from Telugu books. The data was obtained from this link by Anusha and put together as a single file.
Telugu News
This folder contains Telugu news extracts that can be used for multi-class classification problems. The folder has two files: train and test. The news categories are as follows:
- business
- editorial
- entertainment
- nation
- sport
The data was obtained from this link by Anusha. Post-processing was done to extract the above five topics.
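Before training a classifier on the news files, it can help to check how the five topics are distributed. The following is a minimal sketch only: the description does not state the file format or column names, so the CSV layout and the `topic` and `body` column names are assumptions, and the sample rows are made-up placeholders rather than real data.

```python
import csv
import io
from collections import Counter

# The five topics listed in the dataset description.
TOPICS = {"business", "editorial", "entertainment", "nation", "sport"}


def topic_counts(csv_file, label_column="topic"):
    """Count rows per topic. `label_column` is an assumed column name."""
    reader = csv.DictReader(csv_file)
    counts = Counter()
    for row in reader:
        label = row[label_column].strip().lower()
        if label not in TOPICS:
            raise ValueError(f"unexpected topic: {label!r}")
        counts[label] += 1
    return counts


# Tiny in-memory stand-in for the train file (placeholder rows, not real data).
sample = io.StringIO(
    "topic,body\n"
    "sport,placeholder text one\n"
    "business,placeholder text two\n"
    "sport,placeholder text three\n"
)
counts = topic_counts(sample)
print(counts.most_common())
```

With the real files, the same function could be given a file opened with `encoding="utf-8"` and `newline=""`, after adjusting `label_column` to whatever the train file actually uses.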
Acknowledgements
Sincere thanks to Anusha for collating the dataset from multiple places.
Photo by Prasanth Dasari on Unsplash
Inspiration
Some ideas:
- The books data can be used for NLP tasks such as topic modeling, word embeddings, and transfer learning
- The news dataset can be used for supervised learning problems
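As a possible first step towards the books ideas above, the sketch below builds a word-frequency vocabulary from Telugu text, which is a common starting point for topic modeling or word embeddings. It assumes the books file can be read as UTF-8 lines; the `build_vocab` helper and the sample lines are illustrative and not part of the dataset. Words are matched as runs of characters in the Telugu Unicode block (U+0C00 to U+0C7F), so digits and Latin text are skipped.

```python
import re
from collections import Counter

# Runs of characters from the Telugu Unicode block (U+0C00 to U+0C7F).
TELUGU_WORD = re.compile(r"[\u0C00-\u0C7F]+")


def build_vocab(lines, min_count=1):
    """Return {word: count} for Telugu words seen at least `min_count` times."""
    counts = Counter()
    for line in lines:
        counts.update(TELUGU_WORD.findall(line))
    return {word: n for word, n in counts.items() if n >= min_count}


# Illustrative lines only; the real input would be the lines of the books file.
sample_lines = ["తెలుగు భాష", "తెలుగు పుస్తకం 2020 edition"]
vocab = build_vocab(sample_lines)
print(sorted(vocab.items(), key=lambda item: -item[1]))
```

Raising `min_count` drops rare words, which is often useful before training embeddings on a large corpus.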
