About Dataset
We created a STEM (Science, Technology, Engineering and Mathematics) corpus by filtering Wikipedia articles based on their category metadata. While extracting page contents, we mitigated the rendering issues (numbers, equations and symbols) that are common in existing Wikipedia datasets.
Filtering: We first defined a set of seed Wikipedia categories related to STEM topics, such as Category:Concepts in physics and Category:Physical quantities. For each seed category, we recursively collected its member pages and subcategories up to a fixed depth. We then extracted the page contents of the collected Wikipedia URLs using Wikipedia-API (400k+ pages).
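The depth-limited category walk described above can be sketched as follows. This is a minimal illustration, not the code used to build the dataset: `collect_pages`, `toy_members` and `TOY_GRAPH` are hypothetical names, and the lookup is done against a small in-memory graph so the example runs without network access. In practice the lookup function would query Wikipedia (for example through the `categorymembers` mapping that Wikipedia-API exposes on a category page).

```python
from collections import deque
from typing import Callable, Dict, Iterable, List, Set, Tuple

# A lookup returns (member pages, subcategories) for one category title.
Members = Tuple[List[str], List[str]]


def collect_pages(
    seeds: Iterable[str],
    get_members: Callable[[str], Members],
    max_depth: int,
) -> Set[str]:
    """Collect page titles reachable from the seed categories.

    Seeds are at depth 0; subcategories are followed while depth < max_depth.
    Breadth-first order guarantees each category is first visited at its
    shallowest depth, and the `seen` set guards against category cycles.
    """
    pages: Set[str] = set()
    seen: Set[str] = set()
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        category, depth = queue.popleft()
        if category in seen:
            continue
        seen.add(category)
        member_pages, subcategories = get_members(category)
        pages.update(member_pages)
        if depth < max_depth:
            queue.extend((sub, depth + 1) for sub in subcategories)
    return pages


# Hypothetical toy category graph (page titles are illustrative only).
TOY_GRAPH: Dict[str, Members] = {
    "Category:Concepts in physics": (
        ["Energy", "Force"],
        ["Category:Physical quantities"],
    ),
    "Category:Physical quantities": (
        ["Mass", "Energy"],
        ["Category:Units"],
    ),
    # Points back at the seed to show that cycles are handled.
    "Category:Units": (["Metre"], ["Category:Concepts in physics"]),
}


def toy_members(category: str) -> Members:
    return TOY_GRAPH.get(category, ([], []))


if __name__ == "__main__":
    print(sorted(collect_pages(["Category:Concepts in physics"], toy_members, 1)))
```

Pages that appear in several categories are stored once because the result is a set. The depth limit matters because Wikipedia category trees drift quickly into unrelated topics.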
Chunking: We first split the full text of each article by section. Longer sections were then broken into smaller chunks of approximately 300 tokens each, as counted by the deberta-v3 tokenizer.
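A rough sketch of the second chunking step is shown below. It assumes a greedy word-by-word packing strategy, which the description above does not specify, and `chunk_text` and `count_tokens` are hypothetical names. The default token counter treats each whitespace-separated word as one token; to approximate the dataset's chunk sizes, a counter backed by the deberta-v3 tokenizer would be passed in instead.

```python
from typing import Callable, List


def chunk_text(
    text: str,
    max_tokens: int = 300,
    count_tokens: Callable[[str], int] = lambda word: len(word.split()),
) -> List[str]:
    """Greedily pack whitespace-separated words into chunks.

    A chunk is closed as soon as adding the next word would exceed
    max_tokens. `count_tokens` returns the token cost of a single word;
    the default (one token per word) is a stand-in for a real tokenizer.
    """
    chunks: List[str] = []
    current: List[str] = []
    current_len = 0
    for word in text.split():
        cost = count_tokens(word)
        if current and current_len + cost > max_tokens:
            chunks.append(" ".join(current))
            current, current_len = [], 0
        current.append(word)
        current_len += cost
    if current:
        chunks.append(" ".join(current))
    return chunks


if __name__ == "__main__":
    section = " ".join(f"w{i}" for i in range(650))
    print([len(chunk.split()) for chunk in chunk_text(section)])
```

Summing per-word token counts only approximates the token count of the joined text under a subword tokenizer, which is consistent with the "approximately 300 tokens" wording but is still an assumption about how the limit was applied.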
The chunks in this dataset can be embedded and used for retrieval-augmented generation (RAG) over STEM Wikipedia content.
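To illustrate the retrieval side of that use case, here is a toy sketch of embedding chunks and ranking them against a query by cosine similarity. The `embed` function is a deliberately simple stand-in (normalised word counts) so the example runs without any model; a real setup would replace it with a sentence-embedding model. All names and example sentences here are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Tuple


def embed(text: str) -> Dict[str, float]:
    """Toy stand-in for an embedding model: L2-normalised word counts."""
    counts = Counter(text.lower().split())
    norm = math.sqrt(sum(c * c for c in counts.values()))
    if norm == 0:
        return {}
    return {term: c / norm for term, c in counts.items()}


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks with the highest cosine similarity to the query."""
    q = embed(query)
    scored = []
    for chunk in chunks:
        c = embed(chunk)
        # Both vectors are unit length, so the dot product is the cosine.
        score = sum(weight * c.get(term, 0.0) for term, weight in q.items())
        scored.append((score, chunk))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


if __name__ == "__main__":
    example_chunks = [
        "force equals mass times acceleration",
        "the cell is the basic unit of life",
        "energy is the capacity to do work",
    ]
    for score, chunk in retrieve("force and acceleration", example_chunks, k=1):
        print(round(score, 3), chunk)
```

In a RAG pipeline the retrieved chunks would then be placed in the prompt of a language model as supporting context.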