Wiki STEM Corpus

🍤CJR🍥

Earth and Nature, Computer Science, Science and Technology, Beginner, Text

￥5

Sold: 0

850.28MB

Data ID: D17169497395801788

Published: 2024/05/29

About Dataset

We created a STEM (Science, Technology, Engineering, and Mathematics) corpus by filtering Wikipedia articles on their category metadata. While extracting page contents, we mitigated the rendering issues (numbers, equations, and symbols) that are common in existing Wikipedia datasets.

For filtering, we first defined a set of seed Wikipedia categories related to STEM topics, such as Category:Concepts in physics and Category:Physical quantities. For each category, we recursively collected the member pages and subcategories up to a fixed maximum depth. We then extracted the page contents of the collected URLs using Wikipedia-API (400k+ pages).
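The category walk described above can be sketched as a depth-limited breadth-first traversal. This is a minimal illustration, not the authors' actual script: the names `collect_pages`, `make_wikipedia_lookup`, and `max_depth`, and the choice of breadth-first order, are assumptions. The lookup helper uses the Wikipedia-API package (`wikipediaapi`), which is imported lazily so the traversal itself can be tried on a small in-memory graph.

```python
from collections import deque


def collect_pages(seed_categories, get_members, max_depth=2):
    """Collect article titles reachable from the seed categories.

    get_members(category) must return an iterable of (title, is_category)
    pairs. Seeds sit at depth 0; subcategories are followed until
    max_depth is reached. (Hypothetical helper, not from the dataset card.)
    """
    pages = set()
    seen = set(seed_categories)  # guards against cycles in the category graph
    queue = deque((category, 0) for category in seed_categories)
    while queue:
        category, depth = queue.popleft()
        for title, is_category in get_members(category):
            if not is_category:
                pages.add(title)
            elif depth < max_depth and title not in seen:
                seen.add(title)
                queue.append((title, depth + 1))
    return pages


def make_wikipedia_lookup(user_agent):
    """Build a get_members function backed by Wikipedia-API (needs network)."""
    import wikipediaapi  # pip install Wikipedia-API

    wiki = wikipediaapi.Wikipedia(user_agent=user_agent, language="en")
    keep = (wikipediaapi.Namespace.MAIN, wikipediaapi.Namespace.CATEGORY)

    def get_members(category):
        members = wiki.page(category).categorymembers
        # Keep only articles and subcategories; skip files, templates, etc.
        return [
            (title, page.ns == wikipediaapi.Namespace.CATEGORY)
            for title, page in members.items()
            if page.ns in keep
        ]

    return get_members
```

With a real lookup, a call might look like `collect_pages(["Category:Concepts in physics"], make_wikipedia_lookup("my-crawler/0.1 (contact@example.com)"), max_depth=2)`; the user agent string and depth here are placeholders, since the listing does not state the values actually used.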

Chunking: We first split the full text of each article by section. Longer sections were further broken down into smaller chunks of approximately 300 tokens each (deberta-v3 tokenizer).
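One way to break a long section into chunks of roughly 300 tokens is to pack whole sentences greedily until the budget would be exceeded. The sketch below assumes that strategy; the listing does not say how chunk boundaries were chosen. The function name `chunk_section` and the whitespace-based default token counter are illustrative stand-ins for the deberta-v3 tokenizer.

```python
import re


def chunk_section(text, max_tokens=300, count_tokens=lambda s: len(s.split())):
    """Greedily pack sentences into chunks of at most max_tokens tokens.

    count_tokens defaults to a whitespace word count (an assumption made
    so the sketch runs without extra dependencies). A single sentence
    longer than max_tokens is kept whole as its own oversized chunk.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks, current, current_len = [], [], 0
    for sentence in sentences:
        if not sentence:
            continue
        n = count_tokens(sentence)
        # Close the current chunk if this sentence would overflow it.
        if current and current_len + n > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(sentence)
        current_len += n
    if current:
        chunks.append(" ".join(current))
    return chunks
```

To count with a real tokenizer instead, one could pass something like `lambda s: len(tok.encode(s, add_special_tokens=False))`, where `tok` comes from `transformers.AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")`. The specific checkpoint is an assumption; the listing only says "deberta-v3 tokenizer".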

This dataset can be embedded and used for retrieval-augmented generation (RAG) over STEM Wikipedia content.
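The retrieval step of such a pipeline can be illustrated with a toy example: embed each chunk, embed the query, and return the chunks with the highest cosine similarity. The bag-of-words `embed` function below is only a stand-in so the sketch runs without dependencies; a real setup would swap in a neural embedding model, and all names here are hypothetical.

```python
import math
from collections import Counter


def embed(text):
    """Toy bag-of-words vector; a placeholder for a real embedding model."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, chunks, k=3):
    """Return the k chunks most similar to the query."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]
```

The retrieved chunks would then be placed in the prompt of a language model as supporting context.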

