🍤CJR🍥

Wiki STEM Corpus

Earth and Nature, Computer Science, Science and Technology, Beginner, Text

5

Sold: 0
850.28MB

Data ID: D17169497395801788

Published: 2024/05/29

About Dataset

We created a STEM (Science, Technology, Engineering, and Mathematics) corpus by filtering Wikipedia articles based on their category metadata. While extracting page contents, we mitigated the rendering issues (numbers, equations, and symbols) that are common in existing Wikipedia datasets.

For filtering, we first defined a set of seed Wikipedia categories related to STEM topics, such as Category:Concepts in physics, Category:Physical quantities, etc. For each category, we recursively collected the member pages and subcategories up to a certain depth. We then extracted the page contents of the collected wiki URLs using Wikipedia-API (400k+ pages).
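The recursive category walk can be sketched as a depth-limited breadth-first traversal. This is only an illustration, not the authors' code: `collect_pages`, `get_members`, `max_depth`, and the toy category graph are hypothetical names and data. In practice `get_members` could be backed by the category-member lookup that Wikipedia-API provides; here it reads from an in-memory dictionary so the example runs offline.

```python
from collections import deque


def collect_pages(seed_categories, get_members, max_depth):
    """Walk categories breadth-first and return the set of article titles.

    get_members(category) is a hypothetical callable returning the titles of
    a category's members; titles starting with "Category:" are subcategories.
    Subcategories are followed only while their depth stays <= max_depth.
    """
    pages = set()
    seen = set(seed_categories)  # avoid revisiting categories (cycles exist)
    queue = deque((cat, 0) for cat in seed_categories)
    while queue:
        category, depth = queue.popleft()
        for title in get_members(category):
            if title.startswith("Category:"):
                if depth < max_depth and title not in seen:
                    seen.add(title)
                    queue.append((title, depth + 1))
            else:
                pages.add(title)
    return pages


# Toy stand-in for the real category graph (made-up membership, for
# illustration only).
TOY_GRAPH = {
    "Category:Concepts in physics": ["Energy", "Category:Physical quantities"],
    "Category:Physical quantities": ["Mass", "Category:Units of measurement"],
    "Category:Units of measurement": ["Metre"],
}


def toy_members(category):
    return TOY_GRAPH.get(category, [])


if __name__ == "__main__":
    print(sorted(collect_pages(["Category:Concepts in physics"], toy_members, 1)))
```

The depth limit matters because Wikipedia's category graph drifts quickly into unrelated topics; the `seen` set guards against categories that contain each other.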

Chunking: We first split the full text of each article by section. Longer sections were further broken down into smaller chunks of approximately 300 tokens each (deberta-v3 tokenizer).
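A minimal sketch of this two-level chunking follows, under stated assumptions: `chunk_section` and `chunk_article` are hypothetical helper names, and whitespace splitting stands in for the deberta-v3 tokenizer so the example has no external dependencies. With a real subword tokenizer, each token window would be decoded back to text rather than joined with spaces, and the exact way the original pipeline drew chunk boundaries is not specified in the description.

```python
def chunk_section(text, max_tokens=300, tokenize=str.split):
    """Split one section into consecutive windows of at most max_tokens tokens.

    tokenize defaults to whitespace splitting, a stand-in (assumption) for
    the deberta-v3 tokenizer named in the dataset description.
    """
    tokens = tokenize(text)
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


def chunk_article(sections, max_tokens=300, tokenize=str.split):
    """Chunk an article given as a list of (section_title, section_text) pairs.

    Short sections yield one chunk; longer ones are broken into several.
    Each chunk keeps its section title so it can be traced back later.
    """
    chunks = []
    for title, text in sections:
        for piece in chunk_section(text, max_tokens, tokenize):
            chunks.append({"section": title, "text": piece})
    return chunks


if __name__ == "__main__":
    article = [
        ("Intro", "a b c"),
        ("Body", " ".join(str(i) for i in range(7))),
    ]
    for chunk in chunk_article(article, max_tokens=3):
        print(chunk)
```

Keeping the section title alongside each chunk is a design choice for this sketch: it lets a retrieval system show where a passage came from.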

This dataset can be embedded and used for RAG over STEM wiki.


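Validation Report

No data quality validation program is currently available for this file. Validation support will be provided in a future release; we apologize for the inconvenience.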
