
Data Description
About Dataset
We created a STEM (Science, Technology, Engineering and Mathematics) corpus by filtering Wikipedia articles based on their category metadata. While extracting the page contents, we mitigated the rendering issues (numbers, equations and symbols) that are common in existing wiki datasets.
For filtering, we first defined a set of seed Wikipedia categories related to STEM topics, such as Category:Concepts in physics, Category:Physical quantities, etc. For each category, we recursively collected the member pages and subcategories up to a certain depth. We then extracted the page contents of the collected wiki URLs using Wikipedia-API (400k+ pages).
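The depth-limited category traversal described above can be sketched as follows. This is only an illustration under stated assumptions, not the authors' actual pipeline: `collect_pages`, `toy_lookup` and `TOY_GRAPH` are hypothetical names, the category graph is invented, and a real crawl would replace the lookup function with calls to Wikipedia.

```python
from collections import deque
from typing import Callable, Dict, List, Set, Tuple

# Hypothetical lookup type: category name -> (member pages, subcategories).
# A real crawl would query Wikipedia here; this sketch stays offline.
CategoryLookup = Callable[[str], Tuple[List[str], List[str]]]


def collect_pages(seeds: List[str], lookup: CategoryLookup, max_depth: int) -> Set[str]:
    """Collect member pages of the seed categories and of their
    subcategories, descending at most max_depth levels below each seed."""
    pages: Set[str] = set()
    seen: Set[str] = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue:
        category, depth = queue.popleft()
        members, subcategories = lookup(category)
        pages.update(members)
        if depth >= max_depth:
            continue  # depth limit reached: do not descend further
        for sub in subcategories:
            if sub not in seen:  # guard against cycles in the category graph
                seen.add(sub)
                queue.append((sub, depth + 1))
    return pages


# Toy category graph, invented for illustration only (it contains a cycle).
TOY_GRAPH: Dict[str, Tuple[List[str], List[str]]] = {
    "Category:Concepts in physics": (["Energy", "Force"], ["Category:Physical quantities"]),
    "Category:Physical quantities": (["Mass", "Energy"], ["Category:Units"]),
    "Category:Units": (["Kilogram"], ["Category:Concepts in physics"]),
}


def toy_lookup(category: str) -> Tuple[List[str], List[str]]:
    return TOY_GRAPH.get(category, ([], []))


pages_depth_1 = collect_pages(["Category:Concepts in physics"], toy_lookup, max_depth=1)
```

The `seen` set matters because category graphs can contain cycles, and collecting pages into a set removes articles that belong to several categories.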
Chunking: We first split the full text of each article by section. Longer sections were then broken down into smaller chunks of approximately 300 tokens each (deberta-v3 tokenizer).
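A minimal sketch of the two-stage chunking (by section, then by token budget) is shown below. Several things here are assumptions rather than details from the dataset: section headings are assumed to look like `== Heading ==`, whitespace splitting stands in for the deberta-v3 tokenizer, fixed 300-token windows stand in for the "approximately 300 tokens" rule, and all function names are hypothetical.

```python
import re
from typing import Callable, List, Tuple

# Assumed heading format: a line such as "== History ==" or "=== Units ===".
HEADING = re.compile(r"^==+[ \t]*(.+?)[ \t]*==+[ \t]*$", flags=re.MULTILINE)


def split_sections(text: str) -> List[Tuple[str, str]]:
    """Split article text into (heading, body) pairs. Text before the
    first heading is returned under the placeholder heading 'Summary'."""
    parts = HEADING.split(text)  # [lead, heading1, body1, heading2, body2, ...]
    sections = [("Summary", parts[0].strip())]
    for i in range(1, len(parts), 2):
        sections.append((parts[i], parts[i + 1].strip()))
    return [(heading, body) for heading, body in sections if body]


def chunk_tokens(text: str, max_tokens: int = 300,
                 tokenize: Callable[[str], List[str]] = str.split) -> List[str]:
    """Cut text into consecutive windows of at most max_tokens tokens.
    Whitespace tokens are a stand-in for the real subword tokenizer."""
    tokens = tokenize(text)
    return [" ".join(tokens[i:i + max_tokens])
            for i in range(0, len(tokens), max_tokens)]


def chunk_article(text: str, max_tokens: int = 300) -> List[Tuple[str, str]]:
    """Return (section heading, chunk text) pairs for one article."""
    chunks = []
    for heading, body in split_sections(text):
        for piece in chunk_tokens(body, max_tokens):
            chunks.append((heading, piece))
    return chunks


# Invented example article: a short lead, one long section, one short section.
article = "Intro text here.\n== History ==\n" + "word " * 650 + "\n== Units ==\nShort body."
chunks = chunk_article(article, max_tokens=300)
```

Keeping the section heading with each chunk is a design choice of this sketch: it lets a retrieved chunk be traced back to its place in the article.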
This dataset can be embedded and used for retrieval-augmented generation (RAG) over STEM Wikipedia articles.
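To illustrate the retrieval step of such a setup, here is a toy sketch that ranks chunks by cosine similarity to a query. It is not tied to any particular embedding model: the `embed` function below is a deliberately crude bag-of-words stand-in (an assumption made so that the example runs without external libraries), and the chunks are invented. In practice a learned sentence-embedding model would take its place.

```python
import math
from collections import Counter
from typing import List, Tuple


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase word counts."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(query: str, chunks: List[str], k: int = 2) -> List[Tuple[float, str]]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    scored = [(cosine(q, embed(chunk)), chunk) for chunk in chunks]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


# Invented chunks standing in for rows of the dataset.
toy_chunks = [
    "force equals mass times acceleration",
    "the kilogram is the base unit of mass",
    "entropy measures disorder in a system",
]
top = retrieve("unit of mass", toy_chunks, k=2)
```

The retrieved chunks would then be passed to a language model as context, which is the "generation" half of RAG and is outside the scope of this sketch.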
