Purpose

This notebook combines the annual Kaggle surveys from 2017 to 2022 into a single dataset.

What is Kaggle survey data?

Kaggle is a worldwide platform for sharing knowledge about machine learning and data science. It hosts competitions in which data science practitioners collaborate or compete on specific tasks. Since 2017, Kaggle has sent an annual survey to its online users and published the anonymized results as analytics competitions, in which Kaggle users can dig into the dataset, explore it and share insights.

Because it is interesting to see how survey answers change over time, this notebook consolidates each year's survey questions, choices and answers into one dataset, to make the data easier to consume.

How is this dataset generated?

The code is published in this Kaggle notebook. Please feel free to use it, check it and comment on it.

Schema design

The project generates three data tables:

fact_question - metadata for each survey question

question_id, STRING, join key, unique identifier of each question, e.g. "2022Q1"
survey_year, INT, year of the survey the question belongs to, e.g. 2020
question_number, INT, the number of the question, e.g. 1 means the first question of the survey
question_content, STRING, the question content, e.g. What is your age?
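
The question_id format can be sketched as follows. This is a minimal illustration, assuming the identifier is the survey year followed by "Q" and the question number, as the "2022Q1" example suggests. The function name make_question_id is hypothetical and not taken from the notebook, and the sample row is illustrative only.

```python
def make_question_id(survey_year: int, question_number: int) -> str:
    """Build an id such as "2022Q1" (assumed format: <year>Q<number>)."""
    return f"{survey_year}Q{question_number}"


# One illustrative fact_question row, represented as a plain dict.
row = {
    "question_id": make_question_id(2022, 1),
    "survey_year": 2022,
    "question_number": 1,
    "question_content": "What is your age?",
}
print(row["question_id"])
```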

fact_choice - metadata for each choice of a survey question

question_id, STRING, foreign key, id of the question the choice belongs to
choice_id, STRING, unique identifier of the choice, formatted as question_id + "C" + choice number, e.g. the first choice of question 2022Q1 has the choice_id 2022Q1C1
choice_number, INT, the number of the choice, e.g. 1 means the first choice of the question
choice_content, STRING, the choice content, e.g. 18~21
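
The choice_id convention can be sketched in both directions: building an id from its parts and splitting it back. This assumes every id follows the pattern seen in the "2022Q1C1" example (four-digit year, "Q", question number, "C", choice number). The helper names and the regular expression are hypothetical and are not part of the published notebook.

```python
import re

# Assumed pattern: <4-digit year>Q<question number>C<choice number>
CHOICE_ID_PATTERN = re.compile(
    r"^(?P<year>\d{4})Q(?P<question>\d+)C(?P<choice>\d+)$"
)


def make_choice_id(question_id: str, choice_number: int) -> str:
    """Append the choice number to the question id, e.g. "2022Q1" -> "2022Q1C1"."""
    return f"{question_id}C{choice_number}"


def parse_choice_id(choice_id: str) -> dict:
    """Split a choice id into its components; raise ValueError if it does not match."""
    match = CHOICE_ID_PATTERN.match(choice_id)
    if match is None:
        raise ValueError(f"unexpected choice_id format: {choice_id!r}")
    return {
        "question_id": f"{match['year']}Q{match['question']}",
        "survey_year": int(match["year"]),
        "question_number": int(match["question"]),
        "choice_number": int(match["choice"]),
    }


print(make_choice_id("2022Q1", 1))
print(parse_choice_id("2022Q1C1"))
```

Recovering question_id from choice_id this way is only a convenience; fact_choice already stores question_id as its foreign key.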

dim_answer - the choice each participant selected per question. The dataset is transformed into long format here. (What is long format and why?)

year, INT, the year of the survey
answer_id, STRING, unique identifier of each answer from the participant, e.g. "2022A000001"
answer_time_spent, INT, time spent completing the whole survey
question_number, INT, the number of the question, e.g. 1 means the first question of the survey
choice_content, STRING, the choice content, e.g. 18~21
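
As a rough illustration of the long format, the sketch below reshapes a tiny wide table (one record per participant, one entry per question) into one row per participant and question, using the dim_answer column names. All sample values, the second answer_id and the helper name wide_to_long are made up for the example; the notebook's actual transformation may differ.

```python
def wide_to_long(wide_rows, year):
    """Turn one-record-per-participant data into one-row-per-answer data."""
    long_rows = []
    for wide in wide_rows:
        for question_number, choice in wide["answers"].items():
            if choice is None:
                continue  # an unanswered question produces no row
            long_rows.append(
                {
                    "year": year,
                    "answer_id": wide["answer_id"],
                    "answer_time_spent": wide["answer_time_spent"],
                    "question_number": question_number,
                    "choice_content": choice,
                }
            )
    return long_rows


# Hypothetical wide input: the keys of "answers" are question numbers.
wide_rows = [
    {
        "answer_id": "2022A000001",
        "answer_time_spent": 120,
        "answers": {1: "18~21", 2: "choice A"},
    },
    {
        "answer_id": "2022A000002",
        "answer_time_spent": 95,
        "answers": {1: "choice B", 2: None},
    },
]

long_rows = wide_to_long(wide_rows, year=2022)
for long_row in long_rows:
    print(long_row)
```

In this layout every year shares the same five columns even though the surveys ask different questions. Note that dim_answer carries year and question_number rather than question_id, so a join to fact_question would presumably go through those two columns or a key rebuilt from them.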

Disclaimer

Free-text responses are excluded.
In the 2018 survey, some choices are assigned numbers (e.g. from Q34). This data needs further cleaning at a later stage.
In the 2019 survey, question 14 is unique because it combines a choice with a free-text response. Further data cleaning is needed.
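
One possible way to locate the 2018 rows that still need cleaning is sketched below. It assumes the issue shows up as choice_content holding a bare number instead of a label, which is only one reading of the note above. The helper looks_numeric and the sample rows are hypothetical.

```python
import re

# Matches an optional minus sign, digits, and an optional decimal part.
NUMERIC_PATTERN = re.compile(r"-?\d+(\.\d+)?")


def looks_numeric(choice_content: str) -> bool:
    """Return True when the whole text is a plain number such as "3" or "12.5"."""
    return NUMERIC_PATTERN.fullmatch(choice_content.strip()) is not None


# Made-up rows using the dim_answer column names.
rows = [
    {"year": 2018, "question_number": 34, "choice_content": "3"},
    {"year": 2018, "question_number": 1, "choice_content": "some label"},
    {"year": 2019, "question_number": 5, "choice_content": "7"},
]

# Keep only the 2018 rows whose choice text is a bare number.
flagged = [
    row for row in rows
    if row["year"] == 2018 and looks_numeric(row["choice_content"])
]
print(flagged)
```

A check like this only flags candidates; legitimately numeric answers would still need to be reviewed by hand.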

Next steps

Check the data quality. I'm happy to get more pairs of eyes on it :)
Adjust the schema. There was some back-and-forth when I was thinking about the dataset design. I might change it in the future.

Main change log

Added the dataset and description 2023-07-30
