The following is the data validation report that the seller chose to provide:
Data description
Misinformation, fake news & propaganda data set
A dataset containing roughly 79k articles, split into 'true' articles and articles of misinformation, fake news and propaganda.
- 34975 'true' articles. --> MisinfoSuperset_TRUE.csv
- 43642 articles of misinfo, fake news or propaganda --> MisinfoSuperset_FAKE.csv
The 'true' articles come from a variety of sources, such as Reuters, the New York Times, the Washington Post and more.
The 'fake' articles are sourced from:
- American right-wing extremist websites (such as Redflag Newsdesk, Breitbart, Truth Broadcast Network)
- A previously published dataset described in the following article: Ahmed H, Traore I, Saad S. (2017) “Detection of Online Fake News Using N-Gram Analysis and Machine Learning Techniques.” In: Traore I., Woungang I., Awad A. (eds) Intelligent, Secure, and Dependable Systems in Distributed and Cloud Environments. ISDDC 2017. Lecture Notes in Computer Science, vol 10618. Springer, Cham (pp. 127-138).
- Disinformation and propaganda cases collected by the EUvsDisinfo project, which was started in 2015 and identifies and fact-checks disinformation cases that originate from pro-Kremlin media and are spread across the EU.
The articles have had all information except the actual text removed and are split into two sets: one with all the fake news / misinformation, and one with all the true articles.
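Since the data comes as two text-only files, a first step is usually to load both and attach a label to each article. The following is a minimal sketch using only the Python standard library. The column name `text`, the helper names `load_labeled` and `load_dataset`, and the 1/0 label encoding are assumptions made for illustration; check the CSV header and adjust `text_column` if it differs.

```python
import csv

# Articles can be long; raise the per-field size limit of the csv module.
csv.field_size_limit(10_000_000)


def load_labeled(path, label, text_column="text"):
    """Read one CSV file and return a list of (text, label) pairs.

    `text_column` is an assumed header name, not confirmed by the
    dataset description; adjust it to match the real file.
    """
    rows = []
    with open(path, newline="", encoding="utf-8") as handle:
        for record in csv.DictReader(handle):
            text = (record.get(text_column) or "").strip()
            if text:  # skip rows that carry no article text
                rows.append((text, label))
    return rows


def load_dataset(true_path, fake_path, text_column="text"):
    """Combine both files into one list: label 1 = 'true', label 0 = 'fake'."""
    return (load_labeled(true_path, 1, text_column)
            + load_labeled(fake_path, 0, text_column))
```

With the files in the working directory, `load_dataset("MisinfoSuperset_TRUE.csv", "MisinfoSuperset_FAKE.csv")` would return one combined list of labeled articles.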
// For those only interested in Russian propaganda (and not so much misinformation in general), I have added the Russian propaganda as a separate CSV called 'EXTRA_RussianPropagandaSubset.csv'.
--
Note: While this might immediately seem like a great classification task, I would suggest also considering clustering / topic modelling. Why clustering? Because with clustering we can build a model that matches a newly written article to a previously debunked lie or misinformation narrative. That lets us debunk a new article immediately (or at least link it to an actual fact-checked statement) without relying on an algorithm's verdict as the argument, and without the delay of waiting for confirmation from a fact-checking organisation.
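To make the matching idea concrete, here is a rough sketch that links a new article to the most similar previously debunked text. It uses TF-IDF vectors and cosine similarity (a nearest-neighbour lookup, not a full clustering or topic model), written in plain Python so it runs without extra libraries. All function names, the simple tokenizer, and the smoothed IDF formula are illustrative assumptions, not the method used in the project linked below.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def vectorize(tokens, idf):
    """Build a unit-length TF-IDF vector; terms missing from `idf` are dropped."""
    counts = Counter(t for t in tokens if t in idf)
    weights = {t: c * idf[t] for t, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()} if norm else {}


def build_index(documents):
    """Return (idf, vectors) for a list of debunked texts."""
    tokenized = [tokenize(doc) for doc in documents]
    doc_freq = Counter(t for tokens in tokenized for t in set(tokens))
    n_docs = len(documents)
    # Smoothed IDF so that no term gets a zero or undefined weight.
    idf = {t: math.log((1 + n_docs) / (1 + df)) + 1
           for t, df in doc_freq.items()}
    vectors = [vectorize(tokens, idf) for tokens in tokenized]
    return idf, vectors


def cosine(a, b):
    """Cosine similarity of two unit-length sparse vectors (dot product)."""
    if len(a) > len(b):
        a, b = b, a
    return sum(w * b.get(t, 0.0) for t, w in a.items())


def closest_narrative(article, idf, vectors):
    """Return (index, similarity) of the indexed text closest to `article`."""
    if not vectors:
        raise ValueError("the index is empty")
    query = vectorize(tokenize(article), idf)
    scores = [cosine(query, v) for v in vectors]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores[best]
```

In practice one would only report a match when the similarity exceeds some threshold chosen on held-out data; a score near zero means the new article shares almost no vocabulary with any indexed narrative.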
An example disinformation project using this dataset can be found at https://stevenpeutz.com/disinformation/
Enjoy! You have chosen an incredibly important topic for your project!
