🍤CJR🍥

Wiki STEM Corpus

Earth and Nature, Computer Science, Science and Technology, Beginner, Text

5

Sold: 0
850.28MB

Data ID: D17169497395801788

Published: 2024/05/29

Data Description

About Dataset

We created a STEM (Science, Technology, Engineering and Mathematics) corpus by filtering Wikipedia articles based on their category metadata. While extracting the page contents, we mitigated the rendering issues (numbers, equations, and symbols) that are common in existing wiki datasets.

For filtering, we first defined a set of seed Wikipedia categories related to STEM topics, such as Category:Concepts in physics, Category:Physical quantities, etc. For each seed category, we recursively collected the member pages and subcategories up to a certain depth. We then extracted the page contents of the collected wiki URLs using Wikipedia-API (400k+ pages).
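The recursive collection step can be sketched roughly as follows. This is a minimal illustration, not the authors' actual script: the function name `collect_pages`, the `get_members` callback, and the `max_depth` parameter are all hypothetical, and the depth used for the corpus is not stated above. The callback is assumed to return the titles of a category's members, with subcategories recognizable by the `Category:` prefix.

```python
from collections import deque
from typing import Callable, Iterable, Set


def collect_pages(
    seeds: Iterable[str],
    get_members: Callable[[str], Iterable[str]],
    max_depth: int,
) -> Set[str]:
    """Collect article titles reachable from seed categories.

    get_members (hypothetical callback) returns the member titles of a
    category; titles starting with "Category:" are treated as subcategories.
    Seeds are at depth 0, and subcategories are followed while their
    depth stays <= max_depth.
    """
    pages: Set[str] = set()
    seen_categories: Set[str] = set()
    # Breadth-first order, so each category is first visited at its
    # shallowest depth from the seeds.
    queue = deque((seed, 0) for seed in seeds)

    while queue:
        category, depth = queue.popleft()
        if category in seen_categories:
            continue  # category graphs can contain cycles
        seen_categories.add(category)

        for title in get_members(category):
            if title.startswith("Category:"):
                if depth < max_depth:
                    queue.append((title, depth + 1))
            else:
                pages.add(title)

    return pages


# Toy in-memory category graph standing in for live Wikipedia lookups.
toy_graph = {
    "Category:A": ["P1", "Category:B"],
    "Category:B": ["P2", "Category:C", "Category:A"],
    "Category:C": ["P3"],
}
toy_pages = collect_pages(["Category:A"], lambda c: toy_graph.get(c, []), max_depth=1)
```

With the Wikipedia-API package, `get_members` could plausibly be backed by the `categorymembers` mapping of a category page, but that wiring is left out here so the sketch runs without network access.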

Chunking: We first split the full text of each article into its sections. Longer sections were further broken down into smaller chunks of approximately 300 tokens each (as counted by the deberta-v3 tokenizer).
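As a rough illustration of the second chunking step, the sketch below breaks one section into fixed-size windows. It makes two loud assumptions: whitespace-separated words stand in for deberta-v3 tokens (the real token counts will differ), and chunks are cut at a hard token limit, whereas the description above does not say how chunk boundaries were chosen. The name `chunk_section` is hypothetical.

```python
from typing import List


def chunk_section(text: str, max_tokens: int = 300) -> List[str]:
    """Split a section into chunks of at most max_tokens tokens.

    Assumption: tokens are approximated by whitespace-separated words,
    not by the deberta-v3 tokenizer used for the actual corpus.
    """
    words = text.split()
    return [
        " ".join(words[start:start + max_tokens])
        for start in range(0, len(words), max_tokens)
    ]


# A synthetic 650-word section yields chunks of 300, 300, and 50 words.
section = " ".join(f"w{i}" for i in range(650))
chunks = chunk_section(section)
chunk_sizes = [len(chunk.split()) for chunk in chunks]
```

Swapping in a real tokenizer would only change how `words` is produced and how chunks are joined back into text; the windowing logic stays the same.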

This dataset can be embedded and used for retrieval-augmented generation (RAG) over STEM Wikipedia content.
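The retrieval half of such a RAG setup might look like the toy sketch below. It is only an assumption-laden stand-in: a bag-of-words count vector replaces a learned embedding model, and the names `embed`, `cosine`, and `retrieve` are hypothetical. In practice the chunks would be embedded once with a sentence-embedding model and stored in a vector index.

```python
import math
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy embedding: lowercase bag-of-words counts (stand-in for a real model)."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[term] * b[term] for term in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 3) -> List[str]:
    """Return the k chunks most similar to the query."""
    query_vec = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, embed(c)), reverse=True)
    return ranked[:k]


corpus_chunks = [
    "Entropy measures disorder in a thermodynamic system",
    "A prime number has exactly two divisors",
    "Momentum is mass times velocity",
]
top = retrieve("what is a prime number", corpus_chunks, k=1)
```

The retrieved chunks would then be passed to a language model as context for answering the query.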
