王七七

Cal Poly Syllabus Corpus

universities and colleges, education, nlp, text mining, text, spaCy

20

Sold: 0
11.94MB

Data ID: D17171555131175583

Published: 2024/05/31

The following is the data verification report the seller chose to provide:

Data description

Context

This corpus of syllabi aims to support the Nimbus Assistant, an AI similar to Siri/Alexa that answers students’ questions.

In the context of syllabi, students may ask questions like:

  • What textbook does MATH 143 need?
  • Do I need to buy a new book after MATH 142?
  • What’s the course website for Anton Kaul’s 143?
  • What’s Dr. Kaul’s grading policy?
  • What’s the bare minimum I need to do to pass Kaul’s 143 class?
  • How do I ace Kaul’s math 143 class?

Content

Data was scraped using Thruuu, an easy-to-use SERP (search engine results page) scraper.

Thruuu

  • thruuu.xlsx - the data exported from Thruuu.
  • thruuu.pdf - the preliminary analysis exported from Thruuu.

Notebooks/Process

  • step-1-get-documents-from-sheet-urls.ipynb - a notebook that inputs thruuu.xlsx and outputs downloads.tar.gz along with downloads.csv
  • step-2-extract-document-data-with-OCR.ipynb - a notebook that inputs downloads.tar.gz along with downloads.csv and outputs extracted.csv
  • step-3-get-simple-logistical-information.ipynb - a notebook that inputs extracted.csv and outputs logistical_info.csv

Notebook Outputs

  • downloads.tar.gz - 100 PDF files (some files are corrupted).
  • downloads.csv - a table associating search result positions with individual PDF files for a syllabus.
  • extracted.csv - a table associating each PDF file with its OCR-extracted text (the embedded plain text is also included, but the OCR text is preferred).
  • logistical_info.csv - a table associating each PDF file with the logistical info (instructor, office, email, etc.) found with regular expressions.
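
As a rough illustration of how logistical info might be pulled from syllabus text with regular expressions, here is a minimal sketch. The patterns, the function name extract_logistical_info, and the sample text are all assumptions made for this example; the notebook's actual expressions are not shown in this description and may differ.

```python
import re

# Hypothetical patterns for illustration only; not taken from the notebook.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
INSTRUCTOR_RE = re.compile(
    r"^\s*(?:Instructor|Professor)\s*:\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)
OFFICE_RE = re.compile(
    r"^\s*Office\s*:\s*(.+?)\s*$",
    re.IGNORECASE | re.MULTILINE,
)


def extract_logistical_info(text: str) -> dict:
    """Return the first instructor, office, and email found, or None for each miss."""

    def first(pattern, group=0):
        match = pattern.search(text)
        return match.group(group) if match else None

    return {
        "instructor": first(INSTRUCTOR_RE, 1),
        "office": first(OFFICE_RE, 1),
        "email": first(EMAIL_RE),
    }


# Made-up sample text standing in for one row of extracted text.
sample = """MATH 143 Calculus III
Instructor: Jane Doe
Office: Building 25, Room 310
Email: jdoe@example.edu
Office Hours: MWF 2-3pm
"""
info = extract_logistical_info(sample)
print(info)
```

Real syllabi label these fields inconsistently, so patterns like these would likely miss many cases and need tuning against the OCR text.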

Acknowledgements

Thank you Samuel Schmitt for making Thruuu!

Inspiration

  • What kinds of factoids could you mine from the syllabus text?
  • What are common phrases used by Cal Poly professors in their syllabi?
  • What are the rarest phrases found in syllabi?
  • Can you identify a professor’s writing style from their syllabus?
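
One possible starting point for the common-phrase and rare-phrase questions is to count word n-grams by the number of documents they appear in. The sketch below uses only the standard library; the function name count_ngrams, the simple tokenizer, and the three sample sentences are assumptions for illustration, not part of the corpus or its notebooks.

```python
import re
from collections import Counter


def count_ngrams(documents, n=3):
    """Count word n-grams, counting each n-gram at most once per document."""
    counts = Counter()
    for text in documents:
        # Crude tokenizer (assumption): lowercase runs of letters, digits, apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        ngrams = {" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}
        counts.update(ngrams)
    return counts


# Made-up sentences standing in for syllabus text.
docs = [
    "Late work will not be accepted without prior approval.",
    "Late work will not be accepted. Attendance is required.",
    "Attendance is required for all lab sessions.",
]
phrase_counts = count_ngrams(docs, n=3)
common = sorted(p for p, c in phrase_counts.items() if c >= 2)
rare = sorted(p for p, c in phrase_counts.items() if c == 1)
print(common)
```

Counting per document rather than per occurrence keeps one long, repetitive syllabus from dominating the list. A tokenizer such as spaCy's could replace the regular expression here.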