Data Description
The Unofficial Challenge
Since this is an "unofficial challenge", we need to set a few ground rules for collaboration, evaluation, etc.
Notebook Names
To make an unofficial leaderboard possible, I ask that you title your notebooks and scripts in the following format.
[GDUC - 0.0000] Whatever You Want
For example "[GDUC - 0.9417] Self-training Baseline". This will allow people to sort by notebook name and see the highest scoring notebooks for this "unofficial challenge"!
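As a small illustration of the naming convention above, the title can be built from the holdout score with a helper like the one below. This is only a sketch: the names `gduc_title` and `gduc_score` are hypothetical and not part of any Kaggle API, and the four decimal places are assumed from the `0.0000` template.

```python
import re

def gduc_title(auc: float, name: str) -> str:
    """Build a notebook title in the '[GDUC - 0.0000] Name' format."""
    return f"[GDUC - {auc:.4f}] {name}"

# Assumed pattern: the tag, a score with four decimals, then free text.
_TITLE_RE = re.compile(r"^\[GDUC - (\d\.\d{4})\] (.+)$")

def gduc_score(title: str):
    """Return the score embedded in a title, or None if it does not match."""
    match = _TITLE_RE.match(title)
    return float(match.group(1)) if match else None

titles = [
    gduc_title(0.9417, "Self-training Baseline"),
    gduc_title(0.8800, "Supervised Only"),
]
# The score is zero-padded to a fixed width, so a plain descending
# string sort puts the highest-scoring title first.
ranked = sorted(titles, reverse=True)
```

Keeping the score at a fixed width is what makes sorting by notebook name equivalent to sorting by score.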
The Data
Target Column:
GarageDoorEntranceIndicator
Training data can be found in the following file:
image_labels_train.csv
Images can be found in GarageImages folder.
Only a small fraction of the images are labeled for training and holdout, and there are many more unlabeled images, hence the need for semi-supervised learning!
Column Descriptions:
ID: random unique ID number for each image
GarageDoorEntranceIndicator: value is 1 if a garage door entrance is visible in the photo, otherwise 0 - [MAIN TARGET]
AllGarageDoorEntrancesVisible: value is 1 if it seems reasonable to assume that all garage doors for the property are accounted for in the photo, otherwise 0
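The columns described above could be read into an ID-to-label mapping along the following lines. This is a minimal sketch using only the standard library. It assumes the CSV has a header row with exactly these column names, and the sample rows are made up for illustration, not taken from the dataset. The function name `read_labels` is hypothetical.

```python
import csv
import io

# Made-up rows standing in for image_labels_train.csv. The real file is
# assumed to have a header row with these column names.
SAMPLE_CSV = """ID,GarageDoorEntranceIndicator,AllGarageDoorEntrancesVisible
101,1,1
102,1,0
103,0,0
"""

def read_labels(handle, target="GarageDoorEntranceIndicator"):
    """Map each image ID to its 0/1 label for the chosen target column."""
    return {row["ID"]: int(row[target]) for row in csv.DictReader(handle)}

labels = read_labels(io.StringIO(SAMPLE_CSV))

# For the real file (path assumed relative to the dataset root):
# with open("image_labels_train.csv", newline="") as f:
#     labels = read_labels(f)

# Share of labeled images with a visible garage door entrance.
positive_rate = sum(labels.values()) / len(labels)
```

Images whose ID does not appear in either label file would then form the unlabeled pool for semi-supervised training.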
Evaluation
Please use the following file to score your final model:
image_labels_holdout.csv
Performance will be based on AUC. Please use the following code to compute your holdout AUC score.
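from sklearn import metrics
# y: true 0/1 holdout labels; pred: predicted scores for the positive class
fpr, tpr, thresholds = metrics.roc_curve(y, pred, pos_label=1)
metrics.auc(fpr, tpr)  # example output: 0.75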
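To sanity-check a score, the same quantity can be computed without scikit-learn. The sketch below is not the official scoring code. It assumes binary 0/1 labels and relies on the fact that the area under the ROC curve equals the fraction of (positive, negative) pairs in which the positive image receives the higher score, with ties counted as half. The function name `pairwise_auc` and the toy inputs are illustrative only.

```python
def pairwise_auc(y, pred):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair,
    which matches the trapezoidal area under the ROC curve.
    """
    pos = [p for label, p in zip(y, pred) if label == 1]
    neg = [p for label, p in zip(y, pred) if label == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative label")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy labels and scores, not taken from the dataset.
y = [0, 0, 1, 1]
pred = [0.1, 0.4, 0.35, 0.8]
score = pairwise_auc(y, pred)  # 3 of the 4 pairs are ordered correctly
```

The double loop is quadratic in the number of labeled images, which is fine for a small holdout set. For reported scores, the scikit-learn snippet above remains the reference.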
Other Notes
Context
In my career I had a project to build a garage detection system, but unfortunately (as with many projects) the project never went forward. I was excited about the project, so I decided to try and replicate the project using images on Google. I thought this would also be a great opportunity to work on semi-supervised learning techniques and collaborate with the data science community on the topic!
Content
- The following code was used to scrape images from Google images: https://github.com/yeamusic21/GarageDetection/blob/master/ScrapeImages.py
- The data was scraped at random times over a one- to two-month period in May and June 2020
- Various search terms were used to scrape the images from Google Images, including "zillow homes ohio", "garage", etc.
Acknowledgements
The script used to extract images from Google is revised from code by Gene Kogan.
- https://gist.github.com/genekogan/ebd77196e4bf0705db51f86431099e57
- https://gist.github.com/genekogan
- https://genekogan.com/
Inspiration
Kaggle has a lot of competitions, but not many that really REQUIRE semi-supervised learning. This dataset and "unofficial challenge" are in the spirit of semi-supervised learning education and experimentation.
Here are some excellent resources on semi-supervised learning:
- http://www.cs.cmu.edu/~10701/slides/17_SSL.pdf
- https://ruder.io/semi-supervised/
- https://www.molgen.mpg.de/3659531/MITPress--SemiSupervised-Learning.pdf
- http://pages.cs.wisc.edu/~jerryzhu/pub/sslicml07.pdf
License Information
Since this data was scraped from Google Images, it is posted here purely for educational purposes. Posting it is intended to fall under fair use, so I advise you to educate yourself on fair use before using this data.