The following is the data validation report that the seller chose to provide:
Data description
The dataset merges data science web article data from 2020 and 2021. The original datasets were obtained by Vinicius Lambert (https://www.kaggle.com/viniciuslambert) by scraping Medium and other popular data science article platforms. Titles and subtitles have had stopwords removed and have been lemmatized and converted to lowercase; other textual features are left unchanged.
Several numerical features were derived and added to this dataset as a contribution:
- sum, max, min, mean and standard deviation of claps, responses and reading time received by the author before posting a new article
- sum, max, min, mean and standard deviation of claps, responses and reading time received by the author for previous articles with the same tag
- length of preprocessed title, subtitle and author in words
- number of numeric tokens in preprocessed title, subtitle and author
- number of jargon and technical terms (words that are not present in the NLTK English dictionary) in title and subtitle text
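
The features listed above can be illustrated with a minimal sketch. It is not the code used to build this dataset. The field names (`author`, `tag`, `date`, `claps`), the output column names, the use of population standard deviation, the zero fill for authors with no earlier articles, and the small `vocabulary` set standing in for the NLTK English dictionary are all assumptions made for illustration.

```python
import re
import statistics
from collections import defaultdict


def text_features(text, vocabulary):
    """Count words, numeric tokens and out-of-vocabulary tokens in preprocessed text."""
    tokens = text.split()
    # Tokens made only of digits (optionally with a decimal part) count as numeric.
    numeric = [t for t in tokens if re.fullmatch(r"\d+(\.\d+)?", t)]
    # Alphabetic tokens missing from the vocabulary are treated as jargon (assumption).
    jargon = [t for t in tokens if t.isalpha() and t not in vocabulary]
    return {"n_words": len(tokens), "n_numeric": len(numeric), "n_jargon": len(jargon)}


def prior_stats(articles, metric="claps", keys=("author",)):
    """Summarize `metric` over earlier articles sharing the same `keys`.

    Only articles that come before the current one in date order are used,
    so the current article's own value never enters its features.
    """
    history = defaultdict(list)
    rows = []
    for art in sorted(articles, key=lambda a: a["date"]):
        past = history[tuple(art[k] for k in keys)]
        if past:
            stats = {
                "sum": sum(past),
                "max": max(past),
                "min": min(past),
                "mean": statistics.mean(past),
                "std": statistics.pstdev(past),  # population std (assumption)
            }
        else:
            # No earlier articles: fill with zeros (assumption).
            stats = {"sum": 0, "max": 0, "min": 0, "mean": 0, "std": 0}
        rows.append({**art, **{f"{metric}_{name}": v for name, v in stats.items()}})
        # Record the current value only after its own features are computed.
        history[tuple(art[k] for k in keys)].append(art[metric])
    return rows


# Hypothetical sample records, not taken from the dataset.
sample = [
    {"author": "a", "tag": "ml", "date": "2020-01-01", "claps": 10},
    {"author": "b", "tag": "ml", "date": "2020-01-15", "claps": 7},
    {"author": "a", "tag": "ml", "date": "2020-02-01", "claps": 30},
    {"author": "a", "tag": "nlp", "date": "2020-03-01", "claps": 5},
]
by_author = prior_stats(sample)
by_author_tag = prior_stats(sample, keys=("author", "tag"))
```

Passing `keys=("author", "tag")` gives the same-tag variant of the statistics; the same function can be called with `metric` set to a responses or reading-time field.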
Original dataset for 2020: https://www.kaggle.com/viniciuslambert/medium-data-science-articles-dataset
Original dataset for 2021: https://www.kaggle.com/viniciuslambert/medium-2021-data-science-articles-dataset

Medium 2020/21 articles with numerical stats
109.42MB