Data Description
Context
Like so many other aggregated metrics of human activity, hourly passenger traffic within the NYC subway system was an exceptionally predictable signal, with each station showing its regular seasonal, weekly and daily fluctuations - until March of 2020. How can we as data scientists respond quickly to unforeseen events that completely change the nature of the behavior we are trying to model? Should we throw away our old models and data and start from scratch, or can we do better?
Content
The main dataset includes the number of subway station entries and exits, counted as the number of people passing through the turnstiles located at the station entrances, at 4-hour intervals, for 469 subway stations from Feb. 4th 2017 to Aug. 13th 2021.
In addition, a dataset of NYC census data by neighborhood (source: https://furmancenter.org/neighborhoods) is provided as an auxiliary dataset. Each of the 469 stations in the main dataset was referenced to one of 51 neighborhoods, each associated with 87 aggregate financial and demographic variables.
Preprocessing
See the accompanying notebook for the full data acquisition and preprocessing script. You may use it to generate a new updated dataset with up-to-date traffic data.
The data was downloaded from the MTA website (http://web.mta.info/developers/turnstile.html), where it is available as weekly data per turnstile but suffers from noisy samples (e.g. from bad turnstile counters), missing data and a confusing hierarchical structure of subway station elements (turnstile machine, control area, remote unit, station, etc.).
To make the data usable, the following preprocessing steps were taken:
- For each turnstile's data, resample the data to fixed 4-hour intervals (instead of having some samples referenced to 2pm and others to 3pm, etc.). The given timestamp in each sample corresponds to the center of this interval (e.g. 4pm corresponds to the 2-6pm interval).
- Convert the cumulative sum of entries/exits reported by each turnstile to an absolute number of passengers for that 4-hour interval
- Drop missing data, outliers and negative values, and drop stations with bad data or too many missing data points over time
- Aggregate over turnstiles belonging to the same station (assuming we don't care how many people pass through each individual turnstile)
- Join the time series data with station metadata such as latitude and longitude, daytime routes, station structure etc. Since some stations have multiple rows for multiple connecting lines, these single-character lines are concatenated for each station, for example "LNQR456" indicates 7 separate connecting lines. Also, some stations share the same name, e.g. "103 st.", which corresponds to three actual stations along this street, in the Upper West Side and East Harlem.
- Generate a new "Unique ID" that corresponds to a unique combination of Station and Line (with Remote Unit, Connecting Lines, Daytime Routes, and North/South Direction Label also being unique as a result). This identifier, not included in the original data, is the most suitable hierarchical level for modeling aggregated passenger traffic through the stations, as it corresponds to a specific line within a specific stop, but aggregated over individual turnstiles.
- Add a neighborhood ID for each station, based on the lat-long coordinates and a neighborhood shapefile (downloaded from https://geodata.lib.berkeley.edu). Neighborhood names in the NYC census dataset were manually edited to match these neighborhood names.
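The resampling, differencing, filtering and aggregation steps above can be sketched in plain Python as follows. This is only an illustration and not the accompanying notebook's code. It assumes interval boundaries at 2:00, 6:00, 10:00, and so on (inferred from the 2-6pm example), snaps each raw reading to the nearest boundary instead of interpolating, and uses a made-up outlier cutoff `MAX_PER_INTERVAL`. The function names and the sample readings are hypothetical.

```python
from collections import defaultdict
from datetime import datetime, timedelta

BIN_HOURS = 4              # interval width given in the dataset description
MAX_PER_INTERVAL = 10_000  # hypothetical outlier cutoff, not taken from the source


def nearest_boundary(ts):
    """Snap a reading to the nearest assumed interval boundary (02:00, 06:00, ...)."""
    floored = ts.replace(hour=(ts.hour // BIN_HOURS) * BIN_HOURS,
                         minute=0, second=0, microsecond=0)
    return floored + timedelta(hours=BIN_HOURS // 2)


def interval_counts(readings):
    """Turn one turnstile's (timestamp, cumulative_count) readings into
    {interval_center: passengers}, dropping gaps, negatives and spikes."""
    at_boundary = {}
    for ts, cumulative in sorted(readings):
        # If several readings snap to the same boundary, the latest one wins.
        at_boundary[nearest_boundary(ts)] = cumulative

    counts = {}
    boundaries = sorted(at_boundary)
    for start, end in zip(boundaries, boundaries[1:]):
        if end - start != timedelta(hours=BIN_HOURS):
            continue  # missing reading: skip rather than spread a multi-interval delta
        delta = at_boundary[end] - at_boundary[start]
        if 0 <= delta <= MAX_PER_INTERVAL:  # negative deltas come from counter resets
            center = start + timedelta(hours=BIN_HOURS // 2)
            counts[center] = delta
    return counts


def station_totals(turnstiles):
    """Sum interval counts over all turnstiles that belong to the same station.

    turnstiles maps (station_id, turnstile_id) to a list of readings.
    """
    totals = defaultdict(int)
    for (station_id, _turnstile_id), readings in turnstiles.items():
        for center, passengers in interval_counts(readings).items():
            totals[(station_id, center)] += passengers
    return dict(totals)


# Hypothetical readings for two turnstiles at one station.
sample = {
    ("S1", "A"): [
        (datetime(2021, 1, 1, 15, 0), 100),
        (datetime(2021, 1, 1, 19, 0), 160),
        (datetime(2021, 1, 1, 23, 0), 150),  # counter went backwards: dropped
    ],
    ("S1", "B"): [
        (datetime(2021, 1, 1, 14, 0), 1000),
        (datetime(2021, 1, 1, 18, 0), 1040),
        (datetime(2021, 1, 1, 22, 0), 1100),
    ],
}
totals = station_totals(sample)
for (station_id, center), passengers in sorted(totals.items()):
    print(station_id, center.isoformat(), passengers)
```

Skipping any pair of boundaries that are not exactly four hours apart keeps a long gap from being reported as a single inflated interval, at the cost of losing those samples. A real pipeline might interpolate across short gaps instead.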
Inspiration
This dataset is a great example of time series data that was drastically affected by the COVID-19 outbreak, with subway passenger traffic plummeting during March of 2020 and climbing back very slowly since. It can be used for any kind of task that requires time series / geospatial data, and in particular for analyses investigating concept drift or "New Normal" scenarios.
Acknowledgements
I spent a lot of effort generating this dataset (for an internal research project) because none of the existing resources suited my needs, but several versions of this data can be found on Kaggle and elsewhere. See for example:
- https://www.kaggle.com/new-york-state/nys-turnstile-usage-data
- https://www.kaggle.com/cyaris/mta-turnstile-traffic
- https://www.kaggle.com/monsieurwagner/nyctransit
- https://medium.com/qri-io/taming-the-mtas-unruly-turnstile-data-c945f5f96ba0
