Data Description
Context
The dataset contains Hindi and English subtitles from famous YouTube channels. It was created mainly for Hindi-language channels, since the main goal is to use it to build LLMs for the Hindi language.
Data from channels in the Information, Entertainment, Politics, Comedy, News, and other categories is included in this dataset.
Dataset Stats:
- 85 channels
- 168,039 total videos
Content
- Video subtitles in Hindi and English
- Video metadata such as duration, number of comments, likes, counts, and published date
Acknowledgements
The source of this dataset is YouTube. The following packages were used to generate this dataset:
Inspiration
- Build LLMs in the Hindi language
- Fine-tune models in the Hindi language for tasks such as classification, summarization, and translation
Verification Report
The following is the data verification report that the seller has chosen to provide:

Youtube Transcripts[Hindi+English]
855.69MB