The following is the data validation report that the seller has chosen to provide:
Data description
Context
This dataset was created as part of the COVID-19 global forecasting challenge. It contains SIR model parameters for locations worldwide. Its main value is the estimated transmission period per week per location, that is, the average time between successive infections caused by a single infected individual in a fully susceptible population.
The model is defined as an ODE system as follows:
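The equations themselves are not reproduced in this copy. Assuming the standard SIR formulation, which is consistent with the beta and gamma definitions given in the parameter list, the system would read as below. The symbol N (total population) and the frequency-dependent normalization S I / N are assumptions, not confirmed by the dataset description.

```latex
\begin{aligned}
\frac{dS}{dt} &= -\beta(t)\,\frac{S\,I}{N},\\
\frac{dI}{dt} &= \beta(t)\,\frac{S\,I}{N} - \gamma I,\\
\frac{dR}{dt} &= \gamma I,
\end{aligned}
\qquad S + I + R = N .
```

Here S, I and R are the susceptible, infected and removed populations, and beta depends on time because it is re-estimated every week.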
To reflect changes in the transmission rate caused by containment measures (social distancing, etc.), the beta parameter is modelled separately as a spline, with one spline node estimated per week. See paramsWeekly.csv, which holds the weekly beta values together with the estimated weekly R0 values (derived from the beta and gamma parameters).
The models are fitted to Johns Hopkins University time series data using the Nelder-Mead simplex optimization method with RMSE as the loss. Several runs are started from different initial points, and the best run is kept.
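The multi-start fitting procedure can be sketched roughly as follows. This is a minimal illustration, not the dataset author's code: it fits only a constant beta, gamma and initial infected count to a synthetic infected curve, uses a daily Euler step, and the names `simulate_infected`, `rmse_loss`, `fit_multistart` and `STARTS` are hypothetical. It assumes NumPy and SciPy are installed.

```python
import numpy as np
from scipy.optimize import minimize


def simulate_infected(beta, gamma, i0, population, days):
    """Daily-step Euler SIR; returns the infected count for each day."""
    s, i = population - i0, i0
    infected = []
    for _ in range(days):
        infected.append(i)
        # Clamp so the susceptible pool never goes negative for large beta.
        new_inf = min(beta * s * i / population, s)
        new_rem = gamma * i
        s, i = s - new_inf, i + new_inf - new_rem
    return np.array(infected, dtype=float)


def rmse_loss(params, observed, population):
    """RMSE between the modelled and observed infected counts."""
    beta, gamma, i0 = params
    # Nelder-Mead is unconstrained, so invalid regions get a large penalty.
    if beta <= 0 or not (0 < gamma < 1) or not (0 < i0 < population):
        return 1e12
    predicted = simulate_infected(beta, gamma, i0, population, len(observed))
    return float(np.sqrt(np.mean((predicted - observed) ** 2)))


# Hypothetical grid of starting points (beta, gamma, initial infected).
STARTS = [(b, g, i0) for b in (0.2, 0.5) for g in (0.05, 0.2) for i0 in (5.0, 50.0)]


def fit_multistart(observed, population, starts=STARTS):
    """Run Nelder-Mead from every start and keep the run with the lowest RMSE."""
    best = None
    for x0 in starts:
        res = minimize(rmse_loss, x0, args=(observed, population),
                       method="Nelder-Mead")
        if best is None or res.fun < best.fun:
            best = res
    return best


# Synthetic "observations" generated from known toy parameters.
population = 1_000_000
observed = simulate_infected(0.3, 0.1, 10.0, population, 120)
best = fit_multistart(observed, population)
print("best RMSE:", best.fun, "params (beta, gamma, i0):", best.x)
```

Restarting from several points matters because Nelder-Mead is a local method, and the RMSE surface of an epidemic model can contain flat regions and more than one local minimum.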
What parameters are fitted (estimated) per country/province:
- the day when the infection emerged in the country
- the initial infected count on the first day of the infection
- beta (a separate value for every week) - the average number of contacts per day, sufficient to spread the disease, that each infected individual has
- gamma - the fixed fraction of the infected group that recovers during any given day
- R0 - derived rather than fitted directly; equals beta/gamma
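To illustrate how these parameters interact, here is a small pure-Python sketch that integrates an SIR model with weekly beta nodes and derives R0 per week. Several things are assumptions rather than facts from the dataset: the beta values are toy numbers, linear interpolation between weekly nodes stands in for the unspecified spline type, the forward-Euler step is a simplification, and reading the transmission period as 1/beta is an interpretation of the description above.

```python
def beta_at(t, weekly_beta):
    """Linearly interpolate beta between weekly nodes (node k sits at day 7*k)."""
    k = int(t // 7)
    if k >= len(weekly_beta) - 1:
        return weekly_beta[-1]  # hold the last node constant afterwards
    frac = (t - 7 * k) / 7.0
    return weekly_beta[k] * (1 - frac) + weekly_beta[k + 1] * frac


def simulate_sir(population, initial_infected, weekly_beta, gamma, days, dt=0.1):
    """Forward-Euler integration of SIR; returns one (S, I, R) tuple per day."""
    s, i, r = population - initial_infected, float(initial_infected), 0.0
    steps_per_day = round(1 / dt)
    out = [(s, i, r)]
    for step in range(days * steps_per_day):
        t = step * dt
        new_inf = beta_at(t, weekly_beta) * s * i / population * dt
        new_rem = gamma * i * dt
        s, i, r = s - new_inf, i + new_inf - new_rem, r + new_rem
        if (step + 1) % steps_per_day == 0:
            out.append((s, i, r))
    return out


def weekly_r0(weekly_beta, gamma):
    """R0 per week, using R0 = beta / gamma."""
    return [b / gamma for b in weekly_beta]


def transmission_period_days(beta):
    """Assumed reading: mean days between infections caused by one individual."""
    return 1.0 / beta


weekly_beta = [0.40, 0.30, 0.15, 0.10]  # toy values, not taken from the dataset
gamma = 0.1                             # toy value
trajectory = simulate_sir(1_000_000, 10, weekly_beta, gamma, days=28)
r0 = weekly_r0(weekly_beta, gamma)
periods = [transmission_period_days(b) for b in weekly_beta]
```

In this toy run R0 falls from about 4 in the first week to about 1 in the fourth, which is the kind of weekly change the spline-based beta is meant to capture.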
How to read the figures.
- Points are observed data provided by Johns Hopkins University.
- Curves are model predictions.
- Blue is the susceptible population: people who are not yet infected but can catch the infection.
- Red is the infected population.
- Green is the removed population (recovered or dead): people who are no longer susceptible because they have already been through the infection.
Content
The dataset contains three parts:
- Fitted SIR model parameters for different locations worldwide:
  a. Params.csv - parameters (and derived values) that are constant over time
  b. ParamsWeekly.csv - parameters (and derived values) that are estimated separately for every week
- A Figures directory with plots showing how well the fitted parameters match the data points.
- A Predictions directory with CSV files containing a one-year-ahead prediction for each individual location.
Warning
Always visually check the model fit (Figures directory) for quality control before using the corresponding parameter values in your analysis, because the dataset is produced by an automatic fitting procedure without manual quality control.
Acknowledgements
Many thanks to Kaggle for organizing data sharing and challenges that make the world better.
Many thanks also to Johns Hopkins University for their hard work gathering COVID-19 statistics worldwide.
Inspiration
You can try to find correlations between model parameters (e.g. gamma, the patient recovery rate) and other properties of the modelled locations (e.g. weather, population density, level of medical care).
