醒醒

verify-tagHDB flat prices 1990-2021 March

housingreal estategeospatial analysishotels and accommodations

2

已售 0
67.62MB

数据标识:D17222361656744167

发布时间:2024/07/29

以下为卖家选择提供的数据验证报告:

数据描述

Background

I've been sitting on this for quite a while and it's a project that I'm glad that I attempted because it forced me to learn new things through trial and error in order to enrich the dataset. Like mapping variables based on set/dict of values for the CPI, lease and GNI adjustment columns. Like trying to scrape Google Maps source code and getting blocked from their maps for a while before finding out how to use other geocoders responsibly. Like running multiple simulations of different linreg algorithms and analysing their variance contributions.

Anyway, this dataset can be used to create rich data visualisations or you can try using it for machine learning because of the large sample size in a small geographic area.

Dataset contents

ALL Prices 1990-2021 Mar.csv

The largest file by far, ALL Prices 1990-2021 Mar.csv, contains over 800k rows of transactions of Singapore Housing Development Board (HDB) resale flats. As the name implies, BTO, SERS, HUDC and private housing are not included, though resale DBSS flat transactions are treated as ordinary HDB flat transactions. Many of the columns in the file are calculated columns or mapped columns (based on supplementary information) like the CPI index and lease percentage columns. For the full metadata/glossary of how I derived these terms, see the bottom of this description and/or the file and column descriptions.

Balas Table.csv

Balas Table.csv contains the ratios of leasehold land value to freehold land value for each year of remaining lease, from 1 to 99. This table is used by Singapore Land Authority (SLA) in determining land valuations which affect property value since most land in Singapore is leasehold. As there are some mistakes/anomalies in the dataset with 100 and 101 years, I used the maximum values of 96% ratio when mapping in the 2nd version so please don't use the old version.

MAS Core Inflation.csv

This file contains SIngapore's core CPI index value for each month from January 1990 to February 2021 as compiled by MAS. For March 2021 and subsequent future transactions, you have to make estimates and also update this table based on new releases by MAS. For this dataset, I used March 2021=100.4.

complete.csv

Contains all UNIQUE BLOCK addresses, along with their geocoder-supplied full address (inconsistent, many missing) and more importantly, their latitude and longitude coordinates. As there are 9000+ addresses, they were first geocoded using a mix of Photon and Google Maps source code scraping (more accurate but doesn't give full address for quick checking). Then, I manually looked through the addresses and coordinates to find and update all blatantly wrong (outside Singapore or wrong neighbourhood) and most slightly inaccurate (correct neighbourhood and/or street but tagged to wrong block number) for a total of around 1600 addresses, many in Whampoa/Boon Keng, Sengkang, Yishun and Woodlands.

gni per capita.csv

As the name implies, this file contains Singapore's GNI per capita in nominal S$ for the years 1990-2020. For 2021, you have to make an estimate based on the projected economic recovery from COVID-19 until the actual value is released. For this dataset, I used 2021=75000.

HDB machine learning.xlsx

This file is my own basic analysis of (numerical) variables that potentially help to determine the final resale price (measured by inflation-adjusted price per square metre). I used 5 linear regression algorithms and tested each variable individually as well as tried to maximise the R^2 using multiple linear regression with as many relevant variables as possible. I also included the correlation matrix between all the variables and that for relevant variables which helps in calculating the incremental contribution to variance for each variable.

Usefulness in training models?

As i've shown in the "HDB machine learning.xlsx" workbook, some variables are more influential than others but even the amount of variance contributed changes depending on the conditions applied. Various multiple linear regression models i've tried can only post up to 0.60+ in combined R^2, which means that up to 40% of variance in the inflation-adjusted price per square metre/foot flat prices is essentially just random noise or could have another hidden variable! Perhaps you can try to find another strongly related variable? Some ideas are proximity to MRTs/bus stops, ratio of HDB to private housing, average household size, other housing to population ratio indicators? It's important to consider whether correlated factors are causes or effects as well.

More importantly, do you think you can train a model to post much better numbers than multiple linear regression?

Metadata for ALL Prices file

Original columns: month, town, flat_type, block, street_name, storey_range, area_sqm, flat_model, lease_start, resale_price

month: Year and month in YYYY-MM format year: Year in YYYY format town: Town according to HDB planning areas (see under categorisation below for all of them). One town (Lim Chu Kang) only has transactions in the early 1990s while Punggol is a very new town without transactions in the early 1990s so you may encounter errors in some scenarios. town_dummy: dummy variable from 1-6 based on my categorisation of towns from most to least prime. These are subjective, not objective so if you disagree with my categorisation and/or want to use your own then go ahead! Categorisation: north=['SEMBAWANG','YISHUN','WOODLANDS'] --> 5 northwest=['CHOA CHU KANG','BUKIT BATOK','BUKIT PANJANG','LIM CHU KANG'] --> 6 northeast=['HOUGANG','SENGKANG','PUNGGOL'] --> 4 primecentral=['BISHAN','BUKIT TIMAH', 'CENTRAL AREA'] --> 1 geocentral=['ANG MO KIO','TOA PAYOH','SERANGOON'] --> 2 westcentral=['CLEMENTI','QUEENSTOWN','BUKIT MERAH'] --> 2 eastcentral=['KALLANG/WHAMPOA', 'MARINE PARADE','GEYLANG'] --> 3 fareast=['BEDOK','PASIR RIS','TAMPINES'] --> 4 farwest=['JURONG EAST', 'JURONG WEST'] --> 5 flat_type: HDB-specified flat type ['4 ROOM' '3 ROOM' '1 ROOM' '5 ROOM' 'EXECUTIVE' '2 ROOM' 'MULTI GENERATION'] block: Block number (NOT flat number) of the HDB block street_name: Street name of the HDB block address: Identifying address formed by concatenating block+street_name+"SINGAPORE". No postal code unfortunately but I may try to add it in the future. latitude: Latitude coordinate of the HDB block, mostly between 1.27-1.46 longitude: Longitude coordinate of the HDB block, mostly between 103.5-104.0 storey_range: 3-number ranges (e.g. 1 TO 3, 19 TO 21) that contain the actual storey number (HDB doesn't reveal the exact floor) storey: Middle number in the storey range, useful as a numerical approximation that can be used for data visualisation/machine learning area_sqm: Size of the flat in square metres flat_model: HDB-specified flat model (not the same as flat type) ['New Generation', 'Improved', 'Standard', 'Model A', 'Apartment', 'Maisonette', 'Model A-Maisonette', 'Simplified', 'Terrace', 'Improved-Maisonette', 'MULTI GENERATION', 'Premium Apartment', 'Multi Generation', 'Adjoined flat', 'Premium Maisonette', '2-room', 'Model A2', 'DBSS', 'Type S1', 'Type S2', 'Premium Apartment Loft'] lease_start: Year in which the 99-year lease commenced. lease_rem: The number of years of remaining lease, as calcualted from 2021-lease_start. For a few anomalies, the year of transaction is BEFORE the year of lease starting which is strange but could be due to house flipping before the Minimum Occupancy Period was introduced? resale_price: The agreed transaction price in nominal S$ of the house, excluding stamp duties and other agent fees price_psm: The price per square metre as calculated from resale_price/area_sqm price_psm_yearly: The "yearly" price per square metre as calculated from price_psm/lease_rem --> removed in Version 2 Core CPI: The core CPI index value at the month of the transaction, based on the data by MAS price cpi_adj: The real, inflation-adjusted resale price of the flat, calculated using (resale_price/Core CPI)*100 price_psm cpi_adj: The real, inflation-adjusted price per square metre of the flat, which is the best indicator of how expensive the flat is. Calculated using (price_psm/Core CPI)*100 bala lease pct: The HDB's land value percentage of freehold land value, referencing Bala's Table. Important to consider because when the 99-year lease expires, the HDBs technically have 0 value. Adjusting for the lease remaining in the same way as inflation-adjusting, you can consider the price that is implied if the flat were to have a full 99-year lease. price lease_adj implied: The resale price implied if the flat were to have a full 99-year lease, calculated from (resale_price/bala lease pct)*96 price_psm lease_adj implied: The price per square metre implied if the flat were to have a full 99-year lease, calculated from (price_psm/bala lease pct)*96 price cpi_lease_adj implied: The inflation-adjusted resale price implied if the flat were to have a full 99-year lease, calculated from (price cpi_adj/bala lease pct)*96 price_psm cpi_lease_adj implied: The inflation-adjusted price per square metre implied if the flat were to have a full 99-year lease, calculated from (price_psm cpi_adj/bala lease pct)*96 year_gni: Singapore's GNI per capita (nominal S$) in the year of the transaction

data icon
HDB flat prices 1990-2021 March
2
已售 0
67.62MB
申请报告