The following is the data validation report the seller chose to provide:
Data description
Context
A dataset of questions from Stack Overflow, exported from Stack Exchange. The data is intended for text-processing work, specifically for building a program that automatically generates tags for the questions asked.
Content
This set of 13 CSV files includes the following variables:
- Id: Unique identifier of the post
- CreationDate: Creation date of the post
- Title: Post title
- Body: Complete question in HTML format
- Tags: The tags used by users for the question
- ViewCount: Number of views
- CommentCount: Number of comments
- AnswerCount: Number of answers
- Score: Upvote score of the post.
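As a rough illustration of how these variables might be loaded, the sketch below reads rows with Python's standard `csv` module and converts the numeric columns to integers. The sample row is made up, and the `CreationDate` layout (`YYYY-MM-DD HH:MM:SS`) is an assumption that should be checked against the actual files; `parse_row` and `read_posts` are hypothetical helper names.

```python
import csv
import io
from datetime import datetime

# Columns described above as counts, scores, or identifiers.
INT_COLUMNS = ("Id", "ViewCount", "CommentCount", "AnswerCount", "Score")

def parse_row(row):
    """Convert one CSV row (a dict of strings) into typed values."""
    typed = dict(row)
    for name in INT_COLUMNS:
        typed[name] = int(row[name])
    # Assumed timestamp layout; adjust the format string to match the files.
    typed["CreationDate"] = datetime.strptime(
        row["CreationDate"], "%Y-%m-%d %H:%M:%S"
    )
    return typed

def read_posts(handle):
    """Read every row from an open CSV file handle."""
    return [parse_row(row) for row in csv.DictReader(handle)]

# Made-up sample data, used only so the sketch runs without the real files.
SAMPLE_CSV = (
    "Id,CreationDate,Title,Body,Tags,ViewCount,CommentCount,AnswerCount,Score\n"
    '1,2011-03-05 10:15:00,"Example title","<p>Example body</p>",'
    '"<python><csv>",120,6,2,7\n'
)

posts = read_posts(io.StringIO(SAMPLE_CSV))
print(posts[0]["Title"], posts[0]["Score"])
```

With the real data, `read_posts` would be called on a file opened with `newline=""` and an explicit encoding, since the `Body` field can contain embedded line breaks.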
The data was extracted using the following SQL query:
DECLARE @start_date DATE
DECLARE @end_date DATE
SET @start_date = '2011-01-01'
SET @end_date = DATEADD(m, 12, @start_date)
SELECT p.Id, p.CreationDate, p.Title, p.Body, p.Tags,
       p.ViewCount, p.CommentCount, p.AnswerCount, p.Score
FROM Posts AS p
LEFT JOIN PostTypes AS t ON p.PostTypeId = t.id
WHERE p.CreationDate BETWEEN @start_date AND @end_date
  AND t.Name = 'Question'
  AND p.ViewCount > 20
  AND p.CommentCount > 5
  AND p.AnswerCount > 1
  AND p.Score > 5
  AND LEN(p.Tags) > 0
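The query's filter conditions can be mirrored in Python, for example to check that loaded rows satisfy the same thresholds. This is a minimal sketch, assuming each post is a dict with already-typed values; `matches_query` is a hypothetical name. The `t.Name = 'Question'` condition is not reproduced because the exported columns do not include the post type. Note that this condition in the `WHERE` clause also makes the `LEFT JOIN` behave like an inner join.

```python
from datetime import datetime

START = datetime(2011, 1, 1)
END = datetime(2012, 1, 1)  # DATEADD(m, 12, @start_date)

def matches_query(post):
    """Return True if a post passes the same thresholds as the SQL query."""
    return (
        START <= post["CreationDate"] <= END  # BETWEEN includes both endpoints
        and post["ViewCount"] > 20
        and post["CommentCount"] > 5
        and post["AnswerCount"] > 1
        and post["Score"] > 5
        and len(post["Tags"]) > 0
    )

# Made-up posts for illustration only.
kept = {
    "CreationDate": datetime(2011, 6, 1, 9, 30),
    "ViewCount": 500, "CommentCount": 6, "AnswerCount": 2,
    "Score": 10, "Tags": "<python>",
}
dropped = dict(kept, Score=5)  # Score must be strictly greater than 5

print(matches_query(kept), matches_query(dropped))
```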
Inspiration
Cleaning of textual data, automatic tag generation, NLP ...
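A first cleaning step for these uses might look like the sketch below, which strips the HTML markup from `Body` and splits `Tags` into a list using only the standard library. It assumes tags are stored in the `<tag1><tag2>` form; that format is not stated in this description and should be verified against the files. `html_to_text` and `split_tags` are hypothetical helper names.

```python
import re
from html.parser import HTMLParser

class _TextExtractor(HTMLParser):
    """Collect the text nodes of an HTML fragment, ignoring the tags."""

    def __init__(self):
        super().__init__()
        self.parts = []

    def handle_data(self, data):
        self.parts.append(data)

def html_to_text(body):
    """Return the visible text of an HTML body with whitespace collapsed."""
    parser = _TextExtractor()
    parser.feed(body)
    parser.close()
    # Joining with a space keeps adjacent blocks apart; it can also insert
    # a space inside a word that is split by inline markup.
    return " ".join(" ".join(parser.parts).split())

def split_tags(tags):
    """Split a tag string such as '<python><csv>' (assumed format) into a list."""
    return re.findall(r"<([^<>]+)>", tags)

print(html_to_text("<p>How do I parse <code>CSV</code> files?</p>"))
print(split_tags("<python><pandas><csv>"))
```

Code snippets inside `<code>` elements are kept as plain text here; a tag-generation pipeline may prefer to drop or handle them separately.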

StackOverflow questions filtered 2009 - 2020
69.41MB