笑笑

kaggle-survey-2017-2022-long-format

education, computer science, survey analysis, data analytics

48.68MB

Data ID: D17222393138032847

Published: 2024/07/29

The following is the data verification report the seller chose to provide:

Data description

Purpose

This notebook combines the annual Kaggle surveys from 2017 to 2022.

What is Kaggle survey data?

Kaggle is a worldwide platform for sharing knowledge about machine learning and data science. It hosts competitions in which data science practitioners collaborate or compete on specific tasks. Since 2017, Kaggle has sent a survey to its online users each year and published the anonymized results as analytics competitions, where Kaggle users can dig into the dataset, explore it and share insights.

Because it is interesting to see how survey answers change over time, this notebook consolidates the questions, choices and answers from each survey year into one dataset, to make the data easier to use for data consumers.

How is this dataset generated?

The code is published in this Kaggle notebook. Please feel free to use it, check it and comment on it.

Schema design

The whole project generates three data tables:

fact_question - meta information of the survey question

  • question_id, STRING, join key, unique identifier of each question, e.g. "2022Q1"
  • survey_year, INT, year of the survey the question belongs to, e.g. 2020
  • question_number, INT, the number of the question, e.g. 1 means the first question of the survey
  • question_content, STRING, the question content, e.g. What is your age?

fact_choice - meta information of the survey choice per question

  • question_id, STRING, foreign key, the question id which the choice belongs to
  • choice_id, STRING, unique identifier of the choice, in the format question_id + "C" + choice number, e.g. if the question id is 2022Q1 and it is the first choice, then the choice_id is 2022Q1C1
  • choice_number, INT, the number of the choice, e.g. 1 means the first choice of the question
  • choice_content, STRING, the choice content, e.g. 18~21
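
The identifier formats described above can be sketched in a few lines of Python. This is only an illustration inferred from the examples "2022Q1" and "2022Q1C1"; the function names are hypothetical and are not taken from the published notebook, and the sketch assumes that question and choice numbers are plain integers.

```python
from typing import Tuple


def make_question_id(survey_year: int, question_number: int) -> str:
    """Build a question_id such as "2022Q1" (format assumed from the example)."""
    return f"{survey_year}Q{question_number}"


def make_choice_id(question_id: str, choice_number: int) -> str:
    """Append "C" and the choice number, e.g. "2022Q1" and 1 give "2022Q1C1"."""
    return f"{question_id}C{choice_number}"


def parse_choice_id(choice_id: str) -> Tuple[int, int, int]:
    """Split a choice_id into (survey_year, question_number, choice_number)."""
    year_part, rest = choice_id.split("Q", 1)
    question_part, choice_part = rest.split("C", 1)
    return int(year_part), int(question_part), int(choice_part)
```

Because the question_id is a prefix of the choice_id, a choice can be traced back to its question without a lookup, although joining on the question_id column is the more robust route.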

dim_answer - the choice each participant selected for each question. Here I transform the dataset into long format. (What is long format and why?)

  • year, INT, the year of the survey
  • answer_id, STRING, unique identifier of each answer from the participant, e.g. "2022A000001"
  • answer_time_spent, INT, time spent completing the whole survey
  • question_number, INT, the number of the question, e.g. 1 means the first question of the survey
  • choice_content, STRING, the choice content, e.g. 18~21
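
To illustrate what the long format means for the columns above, here is a minimal sketch in plain Python. It assumes a hypothetical wide input with one row per participant, a `time_spent` column and one column per question named `Q1`, `Q2`, and so on; the real survey files are laid out differently per year, and the sample values are made up. The `answer_id` pattern is inferred from the example "2022A000001". In a notebook, `pandas.melt` would be the more usual tool for this reshaping.

```python
from typing import Dict, Iterable, List


def wide_to_long(rows: Iterable[Dict[str, str]], year: int) -> List[dict]:
    """Turn one-row-per-participant records into one-row-per-answer records."""
    long_rows = []
    for index, row in enumerate(rows, start=1):
        answer_id = f"{year}A{index:06d}"  # assumed pattern, e.g. "2022A000001"
        for column, value in row.items():
            # Skip non-question columns and questions left unanswered.
            if not column.startswith("Q") or value in ("", None):
                continue
            long_rows.append({
                "year": year,
                "answer_id": answer_id,
                "answer_time_spent": int(row["time_spent"]),
                "question_number": int(column[1:]),
                "choice_content": value,
            })
    return long_rows


# Made-up sample data in the hypothetical wide layout.
wide = [
    {"time_spent": "510", "Q1": "18~21", "Q2": "choice A"},
    {"time_spent": "423", "Q1": "30~34", "Q2": ""},
]
long_rows = wide_to_long(wide, year=2022)
```

Since dim_answer carries year and question_number rather than question_id, a join to fact_question would first rebuild the key, for example as `f"{year}Q{question_number}"`, assuming the "2022Q1" format holds for every year.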

Disclaimer

  • Free text response is excluded.
  • In the 2018 survey, some choices are recorded as numbers (e.g. from Q34). This data needs further cleaning at a later stage.
  • In the 2019 survey, question 14 is unique because it contains both a choice and a free text response. Further data cleaning is needed.
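
As a starting point for the cleaning mentioned above, the following sketch flags dim_answer rows whose choice_content is a bare number. It is an assumption about what "recorded as numbers" looks like in the data, not a description of the notebook's logic; the function names and the sample rows are hypothetical.

```python
import re

# Matches plain integers or decimals such as "3", "-1" or "12.5".
_NUMBER = re.compile(r"-?\d+(\.\d+)?")


def looks_numeric(choice_content) -> bool:
    """Return True when the value is a string holding only a number."""
    if not isinstance(choice_content, str):
        return False
    return _NUMBER.fullmatch(choice_content.strip()) is not None


def flag_numeric_choices(rows, year=2018):
    """Collect rows of the given survey year whose choice is a bare number."""
    return [
        row for row in rows
        if row["year"] == year and looks_numeric(row["choice_content"])
    ]


# Made-up rows that follow the dim_answer columns.
sample = [
    {"year": 2018, "question_number": 34, "choice_content": "25"},
    {"year": 2018, "question_number": 1, "choice_content": "18~21"},
    {"year": 2019, "question_number": 14, "choice_content": "3"},
]
flagged = flag_numeric_choices(sample)
```

Ranges such as "18~21" are not flagged, so legitimate bucketed choices stay untouched while purely numeric values are set aside for review.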

Next steps

  • Check the data quality. I'm happy to get more pairs of eyes on it :)
  • Adjust the schema. There was some back-and-forth when I was thinking about the dataset design. I might change it in the future.

Main change log

  • 2023-07-30: Added the dataset and description