Data Description
Context
This corpus of syllabi aims to support the Nimbus Assistant, an AI similar to Siri/Alexa that answers students’ questions.
In the context of syllabi, students may ask questions like:
- What textbook does MATH 143 need?
- Do I need to buy a new book after MATH 142?
- What’s the course website for Anton Kaul’s 143?
- What’s Dr. Kaul’s grading policy?
- What’s the bare minimum I need to do to pass Kaul’s 143 class?
- How do I ace Kaul’s math 143 class?
Content
Data was scraped using Thruuu, an awesome and easy-to-use SERP (search engine results page) scraper.
Thruuu
- thruuu.xlsx - the data exported from Thruuu.
- thruuu.pdf - the preliminary analysis exported from Thruuu.
Notebooks/Process
- step-1-get-documents-from-sheet-urls.ipynb - a notebook that inputs thruuu.xlsx and outputs downloads.tar.gz along with downloads.csv.
- step-2-extract-document-data-with-OCR.ipynb - a notebook that inputs downloads.tar.gz along with downloads.csv and outputs extracted.csv.
- step-3-get-simple-logistical-information.ipynb - a notebook that inputs extracted.csv and outputs logistical_info.csv.
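Because step 2 reads PDFs out of downloads.tar.gz, it can help to screen the archive first for members that do not look like PDFs. The sketch below is not taken from the notebooks; the function name find_suspect_pdfs is hypothetical, and checking for the leading %PDF- marker is only a rough heuristic (a file can pass it and still be damaged further in).

```python
import tarfile


def find_suspect_pdfs(archive_path):
    """Return names of archive members that do not start with the PDF marker.

    Hypothetical helper, not part of the original notebooks. It only inspects
    the first five bytes of each regular file in the archive.
    """
    suspect = []
    with tarfile.open(archive_path, "r:gz") as archive:
        for member in archive.getmembers():
            if not member.isfile():
                continue  # skip directories, links, and other non-file entries
            handle = archive.extractfile(member)
            header = handle.read(5) if handle is not None else b""
            if header != b"%PDF-":
                suspect.append(member.name)
    return suspect
```

Calling find_suspect_pdfs("downloads.tar.gz") returns a list of member names worth inspecting by hand before running OCR.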
Notebook Outputs
- downloads.tar.gz - 100 PDF files (some files are corrupted).
- downloads.csv - a table associating search result positions with individual PDF files for a syllabus.
- extracted.csv - a table associating each PDF file with the extracted OCR text (the plain text is also included, but the OCR text is preferred).
- logistical_info.csv - a table associating each PDF file with the logistical info (instructor/office/email/etc.) that is found through regular expressions.
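To give a feel for how logistical info can be pulled out of syllabus text with regular expressions, here is a minimal sketch. It is not the code from step 3: the patterns, the field labels, and the function name extract_logistics are assumptions chosen for illustration, and real syllabi will need looser patterns.

```python
import re

# Assumed patterns for illustration only; the notebook's actual expressions may differ.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
LABELED_FIELD_RE = re.compile(
    r"^[ \t]*(instructor|office|email|phone)[ \t]*:[ \t]*(.+?)[ \t]*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_logistics(text):
    """Collect 'Label: value' lines, keeping the first value seen per label."""
    info = {}
    for label, value in LABELED_FIELD_RE.findall(text):
        info.setdefault(label.lower(), value)
    # Fall back to the first email-looking string if no labeled email line exists.
    if "email" not in info:
        emails = EMAIL_RE.findall(text)
        if emails:
            info["email"] = emails[0]
    return info


# Made-up sample text, not drawn from the corpus.
sample = (
    "MATH 143\n"
    "Instructor: Dr. Example\n"
    "Office: Building 25, Room 100\n"
    "Contact me at jdoe@example.edu\n"
)
print(extract_logistics(sample))
```

Keeping only the first match per label is a deliberate simplification; a syllabus that lists several instructors or offices would need a list per field instead.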
Acknowledgements
Thank you Samuel Schmitt for making Thruuu!
Inspiration
- What kinds of factoids could you mine from the syllabus text?
- What are common phrases used by Cal Poly professors in their syllabi?
- What are the rarest phrases found in syllabi?
- Can you identify a professor’s writing style from their syllabus?
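One simple way to start on the common and rare phrase questions above is to count word n-grams across the extracted text. The sketch below assumes the syllabus texts have already been loaded into a list of strings (for example from the text column of extracted.csv, whose exact column name is not stated here); phrase_counts is a hypothetical helper name.

```python
import re
from collections import Counter


def phrase_counts(texts, n=3):
    """Count every run of n consecutive lowercase word tokens across all texts."""
    counts = Counter()
    for text in texts:
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        # range() is empty when a text has fewer than n tokens, so short texts are skipped.
        counts.update(" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return counts


# Made-up sentences standing in for syllabus text.
demo_texts = [
    "Late work will not be accepted.",
    "Late work will be penalized.",
]
demo_counts = phrase_counts(demo_texts, n=3)
common = demo_counts.most_common(1)
rare = [phrase for phrase, count in demo_counts.items() if count == 1]
```

Here common holds the most frequent three-word phrase with its count, and rare lists phrases seen exactly once. Raw counts favor long documents, so counting each phrase once per syllabus may be a fairer measure of how widely it is used.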

Cal Poly Syllabus Corpus
11.94MB