Data Description
Context
It's Datasets December, and this is the third dataset I've uploaded to Kaggle this month.
From Wikipedia:
"During her tenure as United States Secretary of State, Hillary Clinton drew controversy by using a private email server for official public communications rather than using official State Department email accounts maintained on secure federal servers. An FBI examination of Clinton's server found over 100 emails containing classified information, including 65 emails deemed "Secret" and 22 deemed "Top Secret". An additional 2,093 emails not marked classified were retroactively classified by the State Department."
Content
There's a good amount of email data here; it is not as dense as the Enron dataset, but it is much more numerous. Note that Clinton deleted a subset of these emails before turning them over to the State Department, so this isn't a perfect sample of someone's emails. A good portion is also redacted.
I have also included the raw PDF files from the State Department, which could be used for OCR training. There's also a CSV that maps the names in the database to real human names, since the people behind them are not always easy to identify.
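As a rough illustration of how a name-mapping CSV like this could be used, here is a minimal Python sketch. The column names (`alias`, `real_name`), the sample rows, and the helper functions are all hypothetical, invented for this example. The actual file in the dataset may be laid out differently, so check its header before adapting this.

```python
import csv
import io

# Invented sample rows for illustration only; the real CSV's columns
# and contents are not described here and may differ.
SAMPLE_MAPPING_CSV = """alias,real_name
H,Hillary Clinton
HRC,Hillary Clinton
Huma,Huma Abedin
"""


def load_alias_map(csv_file):
    """Build a dict from a normalized alias (stripped, lowercase) to a real name."""
    reader = csv.DictReader(csv_file)
    return {
        row["alias"].strip().lower(): row["real_name"].strip()
        for row in reader
    }


def resolve_name(raw_name, alias_map):
    """Return the mapped real name, or the input unchanged if no alias matches."""
    return alias_map.get(raw_name.strip().lower(), raw_name)


alias_map = load_alias_map(io.StringIO(SAMPLE_MAPPING_CSV))
```

To use it against a file on disk, pass an open file handle (for example `open(path, newline="")`) to `load_alias_map` instead of the in-memory sample. Normalizing case and whitespace before the lookup is a design choice made here on the assumption that names in the database are spelled inconsistently.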
Acknowledgements
Most of the hard work for this one was done by Martin Burch at the WSJ, who created a series of Python scripts to download the data from the US State Department and load it into a SQLite database.
Inspiration
This dataset is really cool and I think there's a lot of really interesting stuff that could be done with it. There are sections of the data that are redacted, and I don't know if the OCR done by the State Department properly marks what's redacted. Maybe someone could use deep learning to identify redacted data and to
