姜饼果子

Medium 2020/21 articles with numerical stats

websites, computer science, text mining, tabular, text

7

Sold: 0
109.42MB

Data ID: D17220488718978406

Published: 2024/07/27

The following is the data verification report the seller chose to provide:

Data description

This dataset merges data science web articles from 2020 and 2021. The original datasets were obtained by Vinicius Lambert (https://www.kaggle.com/viniciuslambert) by scraping Medium and other popular data science article platforms. Titles and subtitles have had stopwords removed and have been lemmatized and lowercased; other textual features are left unchanged.

Several numerical features were extracted and added to this dataset as a contribution:

  • sum, max, min, mean, std deviation of claps, responses and reading time received by the author before posting a new article
  • sum, max, min, mean, std deviation of claps, responses and reading time received by the author for previous articles with the same tag
  • length of preprocessed title, subtitle and author in words
  • number of numerals in preprocessed title, subtitle and author
  • number of jargon and technical terms (words that are not present in NLTK English dictionary) in title and subtitle text
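
The features listed above could be computed roughly as in the following sketch. It is an illustration under stated assumptions, not the dataset author's actual code: the field names `author`, `tag` and `claps`, the helper names, the use of population standard deviation, the zero defaults for authors with no history, and the small `vocabulary` set (standing in for the NLTK English word list) are all hypothetical.

```python
from collections import defaultdict
from statistics import mean, pstdev

STAT_NAMES = ("sum", "max", "min", "mean", "std")


def prior_stats(articles, field="claps", key=lambda a: a["author"]):
    """Summarize `field` over earlier articles that share the same key.

    `articles` is assumed to be sorted by publication date. The current
    article is never included in its own statistics.
    """
    history = defaultdict(list)  # key -> values of earlier articles
    rows = []
    for art in articles:
        past = history[key(art)]
        if past:
            stats = {
                "sum": sum(past),
                "max": max(past),
                "min": min(past),
                "mean": mean(past),
                "std": pstdev(past),  # assumption: population std deviation
            }
        else:
            # Assumption: an author with no history gets zeros.
            stats = dict.fromkeys(STAT_NAMES, 0)
        rows.append(stats)
        history[key(art)].append(art[field])  # record only after computing
    return rows


def text_features(text, vocabulary):
    """Count words, numerals, and out-of-vocabulary words in preprocessed text."""
    tokens = text.split()
    return {
        "n_words": len(tokens),
        # Assumption: a numeral is a token made up only of digits.
        "n_numerals": sum(t.isdigit() for t in tokens),
        # Alphabetic tokens missing from the vocabulary count as jargon.
        "n_jargon": sum(t.isalpha() and t not in vocabulary for t in tokens),
    }


articles = [
    {"author": "a", "tag": "nlp", "claps": 10},
    {"author": "b", "tag": "nlp", "claps": 5},
    {"author": "a", "tag": "cv", "claps": 30},
    {"author": "a", "tag": "nlp", "claps": 20},
]
by_author = prior_stats(articles)
by_author_and_tag = prior_stats(articles, key=lambda a: (a["author"], a["tag"]))
title_stats = text_features("top 10 pytorch trick 2021", {"top", "trick"})
```

Passing a different `key` switches between the per-author statistics and the per-author, per-tag statistics without duplicating the aggregation logic. The same call with `field` set to a responses or reading-time column would cover the other two measures.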

Original dataset for 2020: https://www.kaggle.com/viniciuslambert/medium-data-science-articles-dataset

Original dataset for 2021: https://www.kaggle.com/viniciuslambert/medium-2021-data-science-articles-dataset
