
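Fake news classification dataset: 40,587 news text records for deep learning model training and natural language processing research (fake news detection systems; natural language processing, machine learning, deep learning; development of fake news detection algorithms and text classification models)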

File size: 99.83 MB

Data ID: D17695879185757167

Published: 2026/01/28

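# Fake News Classification Dataset

## Introduction and Background

In an era of digital information overload, news spreads faster and further than ever before, and false information spreads along with it. Fake news misleads the public, undermines social stability, and can even threaten democratic institutions. Building effective fake news detection systems has therefore become an important research topic in both academia and industry. This dataset is a text classification dataset built to address that challenge. It contains 40,587 news text records, each labeled to indicate whether the news item is real or fake. Every record has three core fields: the news title, the body text, and the authenticity label, giving researchers the text they need for model training and algorithm validation. The dataset is useful for research in natural language processing, machine learning, and deep learning. It supports the development of fake news detection algorithms, the training of text classification models, and the building of related applications, and so provides data for work toward a more trustworthy information environment.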

## Basic Data Information

### Field Descriptions

| Field name | Field type | Meaning | Example | Completeness |
|---------|---------|---------|---------|-------|
| title | Text | News title | Palestinians switch off Christmas lights in Bethlehem in anti-Trump protest | 100% |
| text | Text | News body text | RAMALLAH, West Bank (Reuters) - Palestinians switched off Christmas lights... | 100% |
| label | Integer | Authenticity label (0 = fake news, 1 = real news) | 1 | 100% |
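
The field table above can be turned into a small loading-and-validation routine. The following is a minimal sketch using only the Python standard library. It assumes the CSV has a header row with exactly the column names `title`, `text`, and `label`, which the listing implies but does not state outright. The in-memory sample rows are made up for illustration and are not taken from the dataset.

```python
import csv
import io

EXPECTED_FIELDS = ["title", "text", "label"]

def load_records(handle):
    """Read rows from a CSV handle and validate the three documented fields."""
    reader = csv.DictReader(handle)
    missing = [f for f in EXPECTED_FIELDS if f not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    records = []
    for row in reader:
        label = int(row["label"])
        if label not in (0, 1):
            raise ValueError(f"unexpected label: {label}")
        records.append({"title": row["title"], "text": row["text"], "label": label})
    return records

# Tiny in-memory stand-in for the real file (made-up rows, assumed header layout).
SAMPLE_CSV = (
    "title,text,label\n"
    '"Headline A","Body of article A, with a comma.",1\n'
    '"Headline B","Body of article B.",0\n'
)
records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["label"])
```

For a file on disk, the same function could be called on `open(path, newline="", encoding="utf-8")`, where both the path and the UTF-8 encoding are assumptions to check against the actual download.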

### Data Distribution

#### Split Distribution

| Split | Records | Share | Cumulative share |
|-----------|---------|------|---------|
| Training set | 24,353 | 60.00% | 60.00% |
| Test set | 8,117 | 20.00% | 80.00% |
| Evaluation set | 8,117 | 20.00% | 100.00% |
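
As a quick arithmetic check, the shares can be recomputed directly from the record counts in the table. This is a minimal sketch; the dictionary keys are illustrative names, and only the counts come from the listing.

```python
SPLIT_COUNTS = {"train": 24353, "test": 8117, "eval": 8117}

def split_shares(counts):
    """Return the total and, per split, (share %, cumulative share %)."""
    total = sum(counts.values())
    shares = {}
    running = 0
    for name, n in counts.items():
        running += n
        shares[name] = (round(100 * n / total, 2), round(100 * running / total, 2))
    return total, shares

total, shares = split_shares(SPLIT_COUNTS)
print(total)   # 40587
print(shares)  # train 60.0 / 60.0, test 20.0 / 80.0, eval 20.0 / 100.0
```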

#### Label Distribution (Full Dataset)

| Label | Meaning | Records | Share |
|---------|------|---------|------|
| 1 | Real news | 21,924 | 54.02% |
| 0 | Fake news | 18,663 | 45.98% |

#### Training Set Label Distribution

| Label | Meaning | Records | Share |
|---------|------|---------|------|
| 1 | Real news | 13,246 | 54.39% |
| 0 | Fake news | 11,107 | 45.61% |

#### Test Set Label Distribution

| Label | Meaning | Records | Share |
|---------|------|---------|------|
| 1 | Real news | 4,364 | 53.76% |
| 0 | Fake news | 3,753 | 46.24% |

#### Evaluation Set Label Distribution

| Label | Meaning | Records | Share |
|---------|------|---------|------|
| 1 | Real news | 4,314 | 53.15% |
| 0 | Fake news | 3,803 | 46.85% |

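#### Text Length Statistics (Training Set)

| Statistic | Value |
|---------|------|
| Mean body length | 2,502 characters |
| Minimum body length | 1 character |
| Maximum body length | 48,835 characters |
| Mean title length | 76 characters |
| Minimum title length | 2 characters |
| Maximum title length | 443 characters |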
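
Statistics of this kind can be reproduced with a few lines of standard-library Python. The sketch below assumes lengths are measured in characters with `len`, which matches the units in the table but is not confirmed by the listing. The example strings are made up.

```python
from statistics import mean

def length_stats(strings):
    """Mean (rounded), minimum, and maximum character length of the inputs."""
    lengths = [len(s) for s in strings]
    return {"mean": round(mean(lengths)), "min": min(lengths), "max": max(lengths)}

# Made-up titles, for illustration only.
titles = ["Short one", "A somewhat longer headline", "Hi"]
stats = length_stats(titles)
print(stats)
```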

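The dataset contains 40,587 news records in total, stored in CSV format, and covers several news domains including politics, economics, society, and international affairs. Every record includes the full news title and body text together with a manually annotated authenticity label. All fields are 100% complete, with no missing values. The data is divided into training, test, and evaluation sets in a 6:2:2 ratio, which supports model training, validation, and testing. The label distribution is fairly balanced, with a real-to-fake ratio of about 54:46, so class imbalance is a minor concern for model training.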
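
The listing does not say how the published 6:2:2 split was produced. For readers who want to re-split the full data themselves, the following is a minimal sketch of one common approach, a label-stratified shuffle split. The function name, the fixed seed, and the toy records are all illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(records, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split records into train/test/eval while keeping per-label proportions."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["label"]].append(rec)
    rng = random.Random(seed)
    train, test, evaluation = [], [], []
    for label in sorted(by_label):
        group = by_label[label]
        rng.shuffle(group)
        n_train = round(len(group) * ratios[0])
        n_test = round(len(group) * ratios[1])
        train += group[:n_train]
        test += group[n_train:n_train + n_test]
        # Whatever remains after train and test goes to the evaluation set.
        evaluation += group[n_train + n_test:]
    return train, test, evaluation

# Toy records: 50 with label 0 and 50 with label 1.
toy = [{"id": i, "label": i % 2} for i in range(100)]
train, test, evaluation = stratified_split(toy)
print(len(train), len(test), len(evaluation))
```

Stratifying by label keeps the roughly 54:46 class ratio similar across the three parts, which is the pattern the per-split label tables show.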

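## Dataset Advantages

| Advantage | Details | Value in practice |
|---------|---------|---------|
| Large scale | 40,587 news records in total, 24,353 of them in the training set | Enough data to train deep learning models and improve generalization |
| High annotation quality | Every news item is manually annotated with a clear label | Reliable training targets for supervised learning |
| High completeness | All fields are 100% complete, with no missing values | Less preprocessing work and faster analysis |
| Balanced classes | Real-to-fake ratio of about 54:46 | Limits class imbalance problems and helps the model recognize both classes |
| Rich text content | Full news titles and body text, with a mean body length of 2,502 characters | Rich semantic information for in-depth text analysis and feature extraction |
| Sensible split | Divided into training, test, and evaluation sets in a 6:2:2 ratio | Convenient for model training, hyperparameter tuning, and performance evaluation |
| Broad coverage | Covers politics, economics, society, international affairs, and other domains | Helps models generalize across different news scenarios |
| Uniform format | CSV format with a clear field structure | Easy to read and process, compatible with mainstream data analysis tools |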

## Data Samples

### Sample Records

#### Sample 1 (real news)

- Title: Palestinians switch off Christmas lights in Bethlehem in anti-Trump protest
- Label: 1
- Text: RAMALLAH, West Bank (Reuters) - Palestinians switched off Christmas lights at Jesus traditional birthplace in Bethlehem on Wednesday night in protest at U.S. President Donald Trump s decision to recognize Jerusalem as Israel s capital. A Christmas tree adorned with lights outside Bethlehem s Church of the Nativity, where Christians believe Jesus was born, and another in Ramallah, next to the burial site of former Palestinian leader Yasser Arafat, were plunged into darkness...

#### Sample 2 (real news)

- Title: China says Trump call with Taiwan president won't change island's status
- Label: 1
- Text: BEIJING (Reuters) - U.S. President-elect Donald Trump's call with Taiwan President Tsai Ing-wen was a "petty" move by Taiwan that does not change its status as part of China, China's Taiwan Affairs Office said on Saturday. China will "unswervingly" stick to its position of opposing Taiwan independence, it said, in a statement released on the official Xinhua news agency.

#### Sample 3 (fake news)

- Title: FAIL! The Trump Organization's Credit Score Will Make You Laugh
- Label: 0
- Text: While the controversy over Trump s personal tax returns continues, business credit rating company Nav decided to take a look at his business credit, and published the results on their website. Nav, which actually does have an A+ rating from the Better Business Bureau (as opposed to Trump U. s final rating), pulled together the factors affecting business credit scores and discovered something truly laughable.The highest possible business score is 100. The Trump Organization s score is 19...

#### Sample 4 (real news)

- Title: Suspected Boko Haram suicide bombers kill at least 13 in Nigeria: officials
- Label: 1
- Text: BAUCHI, Nigeria (Reuters) - Suspected Boko Haram suicide bombers have killed at least 13 other people in an attack on a market in the northeast Nigerian town of Biu in Borno state, officials said on Saturday. The blasts struck while aid workers were distributing food to people affected by the eight-year conflict with Boko Haram, said Aliyu Idrisa, a community leader.

#### Sample 5 (fake news)

- Title: THE MOST UNCOURAGEOUS PRESIDENT EVER Receives A Courage Award…Proceeds To Whine About Current President
- Label: 0
- Text: There has never been a more UNCOURAGEOUS person in the White House than Barack Obama. He never faced a decision on foreign affairs without backing down. Yes, he s the one who gave Iran the opportunity to have nuclear capability. Remember the red line with Syria?When his horrible policies didn t work out, he never took the blame but pointed at anyone and everyone for his idiotic decisions.

#### Sample 6 (real news)

- Title: A North Korea nuclear test over the Pacific? Logical, terrifying
- Label: 1
- Text: SEOUL/TOKYO (Reuters) - Detonating a nuclear-tipped missile over the Pacific Ocean would be a logical final step by North Korea to prove the success of its weapons program but would be extremely provocative and carry huge risks, arms control experts said on Friday. North Korean Foreign Minister Ri Yong Ho suggested leader Kim Jong Un was considering testing "an unprecedented scale hydrogen bomb" over the Pacific in response to U.S. President Donald Trump's threat at the United Nations to "totally destroy" the country.

#### Sample 7 (fake news)

- Title: WATCH: John Oliver Presents GOP Debates As 'Clowntown Fck-the-World Shtshow 2016'
- Label: 0
- Text: John Oliver isn t known for mincing words when it comes to his description of Republicans. Last night was no exception as he left no hold barred when he discussed the insanity of the Republican candidates during the debates.His primary focus was on the man that has an uncanny tendency for outdoing himself for his insane behavior: Donald Trump. This time, however, he had a partner in Marco Rubio.

#### Sample 8 (real news)

- Title: Senate Democrats ask Trump attorney general pick to recuse himself from Russia probes
- Label: 1
- Text: WASHINGTON (Reuters) - Nine Democratic senators asked President-elect Donald Trump's nominee to be U.S. attorney general, Senator Jeff Sessions, on Tuesday to recuse himself from any FBI or Justice Department investigation into Russia's efforts to interfere with the 2016 presidential election. The request was signed by every Democrat on the Senate Judiciary Committee, the panel responsible for confirming Sessions' appointment.

#### Sample 9 (fake news)

- Title: Trump HUMILIATES Republicans In Latest Hissy Fit After Siding With Democrats On Debt Ceiling
- Label: 0
- Text: Donald Trump sure knows how to add insult to injury.Republicans in Congress must be seriously regretting their decision to endorse Trump now after they were totally humiliated on Thursday at the White House.During a meeting inside the Oval Office to discuss debt ceiling proposals, Trump stunned Paul Ryan and Mitch McConnell by taking a deal offered by the Democrats which increases the debt ceiling for three months and provides Texas with Hurricane Harvey relief with no strings attached.

#### Sample 10 (real news)

- Title: Zimbabwe military chief's China trip was normal visit, Beijing says
- Label: 1
- Text: BEIJING (Reuters) - A trip to Beijing last week by Zimbabwe s military chief was a normal military exchange , China's Foreign Ministry said on Wednesday, after the military in the southern African nation seized power. Zimbabwe's military took control targeting criminals around President Robert Mugabe but gave assurances on national television that the 93-year-old leader and his family were safe and sound.

#### Sample 11 (fake news)

- Title: MACY'S GETS THE BOOT FROM LOYAL CUSTOMERS AFTER FIRING TRUMP
- Label: 0
- Text: I know Patty and I are boycotting Macy s for dumping Donald Trump. It looks like thousands of Americans are also really sick and tired of all the pc actions taken by companies like Macy s. Boycott Macy s!Macy s is paying the price for sacking Donald Trump, because we ve learned thousands of customers are cutting up their Macy s credit card in protest.

#### Sample 12 (real news)

- Title: Czech police ask parliament to allow prosecution of prospective PM Babis
- Label: 1
- Text: PRAGUE (Reuters) - Czech police have requested parliament lift the immunity of prospective prime minister Andrej Babis to allow prosecution in a case involving alleged fraud in tapping European Union subsidies, the lower house s press office said on Tuesday. Babis, whose ANO party was the runaway winner in a parliamentary election in October on pledges to run the state better and fight corruption among traditional parties, denies any wrongdoing and has called the charges politically motivated.

#### Sample 13 (fake news)

- Title: Twitter Erupts With Glee Over #CruzSexScandal Rumors (TWEETS)
- Label: 0
- Text: The last thing any politician running for the presidency needs is negative or scandalous hashtags about them trending on Twitter. However, that is just what America is waking up with regards to GOP presidential hopeful Ted Cruz. You see, overnight, rumors began circulating that Ted Cruz has been cheating on his wife with MULTIPLE women. Of course, those are just rumors at this time, apparently started by The National Enquirer, and there is certainly no proof.

#### Sample 14 (real news)

- Title: Kremlin: Syria peoples' congress being 'actively discussed'
- Label: 1
- Text: MOSCOW (Reuters) - A proposal to convene a congress of all Syria s ethnic groups is a joint initiative which is being promoted by Russia and others and is now being actively discussed, the Kremlin said on Friday. It is premature, however, to discuss the time and venue for the congress, which is seen as a mechanism to assist Syria s post-war development, Putin s spokesman Dmitry Peskov told a conference call with reporters.

#### Sample 15 (fake news)

- Title: MUST WATCH VIDEO: Obama Tries To Trash Trump But Turns Into A Babbling Mess [Video]
- Label: 0
- Text: This is too good to miss! Mr. Teleprompter didn t do so well when he went off script during an appearance in Indiana.

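## Application Scenarios

### Fake News Detection System Development

This dataset is a core resource for building fake news detection systems and can be used to train and evaluate text classification models based on machine learning and deep learning. Researchers can use the news title and body text as input features and the authenticity label as the training target to develop fake news recognition algorithms. Fine-tuning pretrained language models such as BERT, RoBERTa, and GXL on this dataset can produce high-accuracy fake news detection systems. Such systems can be deployed on social media platforms, news aggregation sites, and content moderation pipelines to automatically identify and flag suspicious content, helping platforms keep their information ecosystem healthy and reducing the spread of false information. In production, a system can analyze user-submitted news content in real time, judge its authenticity quickly, give users a credibility assessment, and give platform administrators decision support for content review. The dataset's rich text content and accurate labels provide a high-quality basis for model training, which supports the reliability and practicality of the resulting detection system.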
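
Evaluating such a classifier on the test or evaluation split usually means reporting more than accuracy. The following is a minimal sketch of the standard binary metrics in plain Python. Treating label `0` (fake news) as the positive class is a choice made here for illustration, and the label lists are made up.

```python
def binary_metrics(y_true, y_pred, positive=0):
    """Accuracy plus precision, recall, and F1 for the chosen positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    correct = sum(1 for t, p in pairs if t == p)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": correct / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up gold labels and predictions (0 = fake, 1 = real).
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_pred = [0, 0, 1, 1, 1, 1, 0, 1]
metrics = binary_metrics(y_true, y_pred)
print(metrics)
```

Precision and recall on the fake class matter for moderation use cases, because a flagged real article and a missed fake article carry different costs.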

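### Natural Language Processing Algorithm Research

The dataset provides experimental data for natural language processing research and supports algorithm development and performance evaluation across several NLP tasks, including text classification, sentiment analysis, topic modeling, named entity recognition, and semantic understanding. For text classification, the binary task (real news vs. fake news) serves as a benchmark for comparing algorithms, from traditional machine learning methods such as Naive Bayes, support vector machines, and random forests to deep learning methods such as convolutional neural networks, recurrent neural networks, and Transformers. For feature engineering, the news text can be used to extract linguistic features such as term frequency, TF-IDF, word embeddings, and syntactic features, and to study how different feature combinations affect classification performance. The dataset can also be used to study basic NLP techniques such as text preprocessing, word vector training, and language model fine-tuning. Systematic work on this dataset can help researchers understand the linguistic characteristics of news text and explore more effective text representations and classification algorithms.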
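
As an illustration of the traditional baselines mentioned above, here is a minimal sketch of a multinomial Naive Bayes classifier over bag-of-words counts, written with the standard library only. The tokenizer, class name, and training sentences are illustrative assumptions. The sentences are made up and are not dataset records. A real experiment would more likely use an established library implementation.

```python
import math
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", text.lower())

class NaiveBayes:
    """Multinomial Naive Bayes with add-one (Laplace) smoothing."""

    def fit(self, texts, labels):
        self.class_counts = Counter(labels)
        self.word_counts = {c: Counter() for c in self.class_counts}
        for text, label in zip(texts, labels):
            self.word_counts[label].update(tokenize(text))
        self.vocab = set()
        for counts in self.word_counts.values():
            self.vocab.update(counts)
        return self

    def predict(self, text):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, doc_count in self.class_counts.items():
            score = math.log(doc_count / total_docs)  # log prior
            denom = sum(self.word_counts[label].values()) + len(self.vocab)
            for token in tokenize(text):
                if token in self.vocab:  # ignore words never seen in training
                    score += math.log((self.word_counts[label][token] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label

# Made-up toy training data (1 = real, 0 = fake).
train_texts = [
    "officials said on tuesday the ministry confirmed the report",
    "the spokesman said the talks would continue on friday",
    "you won't believe this shocking video watch now",
    "shocking must watch video humiliates everyone",
]
train_labels = [1, 1, 0, 0]
model = NaiveBayes().fit(train_texts, train_labels)
print(model.predict("shocking video you must watch"))
print(model.predict("the ministry said on friday"))
```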

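### Deep Learning Model Training and Optimization

With 40,587 labeled text records, the dataset provides ample data for training deep learning models. Researchers can use it to train a range of deep neural networks, including Transformer-based pretrained language models (such as BERT, GPT, and RoBERTa), convolutional neural networks (CNN), recurrent neural networks (RNN, LSTM, GRU), and hybrid architectures. The relatively large size of the dataset supports thorough training of deep models, reduces the risk of overfitting, and improves generalization. During training, researchers can tune hyperparameters, design network architectures, and apply regularization techniques to find the best model configuration. The dataset also supports transfer learning research: pretrained language models can be fine-tuned on it to study how different fine-tuning strategies affect performance. It can further be used to study technical questions in deep learning, such as the role of attention mechanisms, the feature extraction capacity of multi-layer networks, and the effect of different loss functions. Deep learning research on this dataset can advance text classification and offer technical reference points for other NLP tasks.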
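
One practical issue when feeding these articles to Transformer models is input length: the listing reports body texts of up to 48,835 characters, while models such as BERT accept a bounded number of tokens per input. The sketch below shows one common workaround, overlapping sliding windows. It works on characters rather than tokens for simplicity, and the window and overlap sizes are arbitrary illustrative values, not recommendations from the source.

```python
def chunk_text(text, max_chars=2000, overlap=200):
    """Split text into overlapping character windows of at most max_chars."""
    if max_chars <= 0 or not 0 <= overlap < max_chars:
        raise ValueError("need max_chars > 0 and 0 <= overlap < max_chars")
    if len(text) <= max_chars:
        return [text]
    step = max_chars - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + max_chars])
        if start + max_chars >= len(text):
            break  # this window already reaches the end of the text
    return chunks

# A 5,000-character stand-in for a long article body.
chunks = chunk_text("x" * 5000)
print([len(c) for c in chunks])
```

Per-chunk predictions would then need to be combined, for example by averaging scores. That aggregation step is a design decision left open here.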

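### News Recommendation and Content Quality Assessment

The dataset can be used to develop news recommendation systems and content quality assessment tools. For recommendation, researchers can train models on the dataset to learn the linguistic differences between real and fake news and build recommendation algorithms that recognize high-quality news content. A recommender can then prioritize news that has passed an authenticity check, improving the quality of information users receive and reducing the exposure of fake news. For content quality assessment, the dataset can be used to train automatic scoring models that rate news content on dimensions such as credibility, professionalism, and objectivity. Such tools can support news editors and content reviewers in quickly spotting suspicious content. The dataset can also be used to study the relationship between user behavior and news authenticity, analyzing reading preferences, sharing behavior, and commenting tendencies for different types of news, which gives personalized recommenders richer user profile information. Building authenticity assessment into recommendation makes for a more responsible and trustworthy news platform that improves user experience while keeping the information ecosystem healthy.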
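
One simple way to fold an authenticity score into ranking is to filter out low-credibility items and weight the remaining relevance scores. The sketch below is purely illustrative: the field names `relevance` and `p_real`, the threshold, and the multiplicative scoring rule are all assumptions, and `p_real` stands for the probability of "real" produced by some classifier trained on this dataset.

```python
def rerank(articles, min_credibility=0.5):
    """Drop low-credibility items, then sort by relevance weighted by credibility."""
    kept = [a for a in articles if a["p_real"] >= min_credibility]
    return sorted(kept, key=lambda a: a["relevance"] * a["p_real"], reverse=True)

# Made-up candidate articles with hypothetical scores.
candidates = [
    {"id": "a", "relevance": 0.9, "p_real": 0.3},
    {"id": "b", "relevance": 0.6, "p_real": 0.95},
    {"id": "c", "relevance": 0.8, "p_real": 0.6},
]
ranked = [a["id"] for a in rerank(candidates)]
print(ranked)
```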

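### Media Literacy Education and Public Awareness

The dataset can be used to develop media literacy tools and public awareness projects that help people recognize fake news and judge information more carefully. Educational institutions can build interactive learning platforms on it, presenting paired examples of real and fake news to teach identification techniques. By analyzing the samples, learners can practice judging credibility along several dimensions: the headline, the source, the language style, and the factual statements. The dataset can also power online quizzes in which users practice spotting fake news and receive feedback on their answers. For public awareness, it supports research on how fake news spreads and what effects it has, which helps in designing effective public education strategies. Analyzing the language of the fake news items can surface common writing patterns, such as exaggerated headlines, emotional wording, and a lack of specific detail, which can then be turned into identification guides the public can easily follow. The dataset can also be used to develop browser extensions or mobile apps that show real-time credibility hints, building habits of critical thinking during everyday browsing.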
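
The "exaggerated headline" pattern mentioned above can be made concrete with simple surface features. The sketch below counts all-caps words and exclamation marks in a title. The feature definitions are illustrative assumptions, and such cues are weak signals that may reflect this dataset's sources more than fake news in general. The three titles are taken from the sample records in the listing.

```python
import re

def headline_style(title):
    """Share of all-caps words (2+ characters) and number of exclamation marks."""
    words = re.findall(r"[A-Za-z']+", title)
    caps = [w for w in words if len(w) > 1 and w.isupper()]
    return {
        "all_caps_ratio": len(caps) / len(words) if words else 0.0,
        "exclamations": title.count("!"),
    }

real_title = "Palestinians switch off Christmas lights in Bethlehem in anti-Trump protest"
fake_title_1 = "FAIL! The Trump Organization's Credit Score Will Make You Laugh"
fake_title_2 = "MACY'S GETS THE BOOT FROM LOYAL CUSTOMERS AFTER FIRING TRUMP"

for title in (real_title, fake_title_1, fake_title_2):
    print(headline_style(title))
```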

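### Academic Research and Publication

The dataset provides experimental data for academic research and publication. Researchers in computer science, information science, communication studies, sociology, and other disciplines can all use it. In computer science, it can serve as the basis for new text classification algorithms, deep learning architectures, or feature extraction methods, with results submitted to conferences and journals. In information science, it can be used to study how false information spreads, how it can be detected, and how to intervene, providing theoretical support for governing the information ecosystem. In communication studies and sociology, researchers can analyze the language, topic distribution, and sentiment of the news content to study the social impact and propagation patterns of fake news. Because the dataset is standardized, results obtained on it are comparable and reproducible, which helps exchange and collaboration within the research community. The dataset can also be used in teaching, as a course case study or lab dataset, to develop students' data analysis and research skills. In-depth work on this dataset can advance several disciplines and provide academic support for addressing the social problem of fake news.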

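## Conclusion

With 40,587 labeled records, full text content, a balanced class distribution, and a standardized format, this fake news classification dataset is a strong resource for fake news detection research. Every record includes the full news title and body text and has been manually annotated, which gives model training a solid foundation. Its core value lies in its scale, annotation accuracy, and broad domain coverage. It supports algorithm research ranging from traditional machine learning to deep learning and suits many application scenarios: fake news detection system development, NLP algorithm research, deep learning model training, news recommendation and content quality assessment, media literacy education, and academic research. Using this dataset, researchers and developers can build efficient and accurate fake news recognition systems and contribute to a healthier information ecosystem and better public judgment of information. The dataset has both research value and broad practical potential. Message the seller for more information.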
