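Dataset Overview
This dataset is a large, balanced corpus built for AI text detection. It contains 362,876 English text samples split evenly between AI-generated and human-written text, with 181,438 records per class (an exact 50:50 balance). The texts come from a range of genres, including academic writing, educational essays, open letters, and argumentative essays. The average length is about 2,230 characters, with a minimum of 273 and a maximum of 9,182, a spread that reflects real writing. Every text carries a binary label (0 for human-written, 1 for AI-generated), and there are no missing values. The dataset is well suited to training AI text detectors, studying the characteristics of AI-generated content, developing plagiarism detection systems, and building text authenticity verification tools. It has practical value for academic integrity, automated content moderation, and natural language processing research.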
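Basic Data Information
Field Descriptions
| Field | Data type | Description | Example values | Completeness |
|---|---|---|---|---|
| text | String (long text) | Full text content (English) | "I think the face reading software Would be good..." | 100% |
| generated | Float (0.0/1.0) | Source label: 0.0 = human-written, 1.0 = AI-generated | 0.0, 1.0 | 100% |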
Class Distribution
| Label | Meaning | Sample count | Share | Cumulative share |
|---|---|---|---|---|
| 0.0 | Human-written text | 181438 | 50.00% | 50.00% |
| 1.0 | AI-generated text | 181438 | 50.00% | 100.00% |
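The balance figures in this table can be checked directly from the CSV. The following is a minimal sketch using only the Python standard library; the inline `SAMPLE_CSV` rows are made up for illustration and stand in for the real file, whose name is not given in this listing.

```python
import csv
import io
from collections import Counter

# Hypothetical stand-in for the real file: same two columns, made-up rows.
SAMPLE_CSV = """text,generated
"Dear Principal, I think community service is good.",0.0
"In conclusion, laws should remain flexible.",1.0
"Cars are important, but limiting usage has advantages.",0.0
"Overall, I believe this policy is beneficial.",1.0
"""

def class_distribution(csv_file):
    """Count rows per label and return {label: (count, share_of_total)}."""
    counts = Counter()
    for row in csv.DictReader(csv_file):
        counts[float(row["generated"])] += 1
    total = sum(counts.values())
    return {label: (n, n / total) for label, n in sorted(counts.items())}

dist = class_distribution(io.StringIO(SAMPLE_CSV))
for label, (n, share) in dist.items():
    print(f"{label}: {n} rows, {share:.2%}")
```

To run it against the real data, pass a file opened with `open(path, newline="", encoding="utf-8")` in place of the `io.StringIO` object. The UTF-8 encoding is an assumption, since the listing does not state one.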
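Text Length Statistics
| Metric | Value (characters) | Notes |
|---|---|---|
| Mean length | 2230.05 | Medium-length passages, suitable for deeper semantic analysis |
| Median length | 2092.50 | Distribution concentrated around 2,000 characters |
| Minimum length | 273 | Short paragraphs or summary-style texts |
| Maximum length | 9182 | |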
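These statistics can be recomputed with a few lines of code. The sketch below assumes that "length" means the number of characters as counted by Python's `len()`; the listing does not say how characters were counted. The four texts are synthetic placeholders, not dataset rows.

```python
import statistics

def length_stats(texts):
    """Return mean, median, min and max character length for a list of texts."""
    lengths = [len(t) for t in texts]
    return {
        "mean": statistics.mean(lengths),
        "median": statistics.median(lengths),
        "min": min(lengths),
        "max": max(lengths),
    }

# Synthetic placeholder texts of known length (not taken from the dataset).
texts = ["a" * 300, "b" * 2000, "c" * 2200, "d" * 9000]
stats = length_stats(texts)
print(stats)
```

A median ending in .50, as in the table, is what one would expect from an even number of records (362,876), because the median is then the average of the two middle lengths.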
File Size
| Item | Value |
|---|---|
| Total records | 362,876 |
| File size | 778.03 MB |
| Data format | CSV |
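Dataset Strengths
| Strength | Details | Application value |
|---|---|---|
| Exact class balance | AI and human texts each make up 50% (181,438 vs. 181,438, a difference of 0) | Removes training bias toward either class, supports equal sensitivity on both classes, and avoids accuracy being inflated by a dominant majority class |
| Large-scale corpus | About 363,000 complete texts and 778 MB of plain text, larger than many commonly used datasets | Supports training large deep learning models (such as BERT and GPT), provides enough samples for cross-validation and generalization testing, and improves model robustness |
| Realistic variety of writing | Text lengths from 273 to 9,182 characters, from short paragraphs to long essays; the 2,230-character average matches typical document lengths | Covers real scenarios such as educational essays, open letters, and academic discussion; trained detectors can be applied to tasks such as paper review and content moderation |
| No missing values | All 362,876 records are 100% complete in both the text and generated fields, with no nulls and no invalid labels | Can go straight into model training without data cleaning, saving more than 80% of data preparation time and lowering engineering cost |
| Standard binary classification design | Simple binary labels (0/1) with clear class definitions (human/AI), matching the classic classification setup | |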
Sample Data
The following 20 representative samples were drawn at random from the dataset and include both human-written and AI-generated texts:
Sample 1 - Human-written (argumentative essay on the educational use of face-reading software)
Label: 0.0 (human)
Text: I think the face reading software Would be good for educational purposes. I don't think that using the face reading software Would be good for places everywhere, though. The software could help teachers With That they should be teaching and With how they need to plan their lessons. The face reading software could help students express their emotions and feelings to their teachers. Overall I think that the face reading software Would benefit the learning environment, because it Would benefit teachers and students.
Using the software in the classroom could help teachers keep things exciting and interesting for their students. In paragraph 6, from the excerpt, it states, "A classroom computer could recognize Then a student is becoming confused or bored." The software could help a teacher learn about That the students are interested in and That they don't understand, and need more Work on. I don't think the technology should spread to everywhere in the World, such as to airports. If the machine misreads a face it could send the Wrong kind of impression to those Working at the airport. If places use the technology, they should also keep their basic security. The face reading technology could keep the learning environment more interesting and more up to date.
In conclusion, the face reading technology could and Would benefit our learning environment everywhere. The technology Would help teachers improve on their teaching techniques and lessons. I think it Would help benefit kids that don't like to openly express their emotions With their teachers. The technology should be used Wisely and should be Well planned and processed. The World is advancing, and our learning environment should be advancing as Well.
Sample 2 - Human-written (comparison of public schooling and homeschooling)
Label: 0.0 (human)
Text: Public schools are not always the best to attend. That's why some schools have other options for kids/ parents that does wait to attend them or does wait there children attending them. There usually a lot of bad situations that happen i.e. public schools. There a lot of disrespect that come from the students that does care about school. Students skip classes to fight or be disruptive.
This is why i think Homeschooling or home bound is a wonderful thing to have. Some kids never area get up eyed come to school so they EED up skipping school. So having this "Distance Certain" stuff is a wonderful thing to have. Even whee kids does area come to school they CAE still large eyed get there education from the comfort of there owe home.
(Part of the text is omitted for length)
Sample 3 - Human-written (essay on who should design summer projects)
Label: 0.0 (human)
Text: Should summer projects be designed by students or teachers? I believe that summer projects should be designed by teachers, not students because the students won't Now much about the class because they haven't taken that class. So how could they mate a project about a topic they don't Now much about? If students got to design their own projects how would teachers grade them if they are all completely different? Would there be just be one cubic? How would teachers use just that one rubric for completely different projects?
(Part of the text is omitted for length)
Sample 4 - Human-written (open letter on the Electoral College)
Label: 0.0 (human)
Text: The meaning of an democracy is that the people vote for their leader or president. The electoral college events that right ANE that is why the state senator should change that. Not only EO the popular votes barley make an difference but, with the electoral college in play the representatives vote for our president, not us. If popular vote was the way the president was elected the elections would be fair.
(Part of the text is omitted for length)
Sample 5 - Human-written (advice on positively influencing others' behavior)
Label: 0.0 (human)
Text: The better way to influence people to have good behavior at school OG anywhere is to be a comp passive friend. Giving advice, try to talk to them, Otherwise people do not have good behavior because sometimes they have a lot of problems at home.
(Part of the text is omitted for length)
Sample 6 - Human-written (letter to the principal on a grades-and-sports participation policy)
Label: 0.0 (human)
Text: Dear Principal, I think making a policy stating IJ order to play a sport a student must have a B average is AJ amazing idea. It would pressure students to get better grades, make the school better know for its great grade point averages, and would inspire the smarter kids to get more involved IJ sports. This policy could do wonders for the student body.
(Part of the text is omitted for length)
Sample 7 - Human-written (argument for abolishing the Electoral College)
Label: 0.0 (human)
Text: Dear senator of Florida state, I believe the United States should get rid of the electoral college as we all know the electoral college consists of 538 sectors in which a majority of 270 electoral votes are required to elect the president.
(Part of the text is omitted for length)
Sample 8 - Human-written (environmental essay on reducing car use)
Label: 0.0 (human)
Text: You're running late for work, BDT you still have to drop the kids off at work. Or maybe you overslept and have to get to school soon. You get everything ready for the day, jump into your car, and you drive off. Cars are very important in today's society, BDT limiting your card sage can have its advantages.
(Part of the text is omitted for length)
Sample 9 - Human-written (short essay on community service)
Label: 0.0 (human)
Text: Dear Principal i think every student should do some community serves. Because instead of wasting their time staying home and playing video games, watching T. V they could be helping other people that need their help.
(Part of the text is omitted for length)
Sample 10 - Human-written (analysis of an article on the value of exploring Venus)
Label: 0.0 (human)
Text: The Author supports the idea of studying Venus because it is worthy of showing the dangers it presents. Venus is basically Earth's twin because it is pretty much just like us in T way. Even though we haven't been Able to land on Venus. Astronomers TRE fascinated with it because they have features just like Earth.
(Part of the text is omitted for length)
Sample 11 - Human-written (essay on distance learning)
Label: 0.0 (human)
Text: School is the last place a student wants to be in when they've experienced a couple of rough days or even weeks AKD home is the first place they think about when they keep comfort or rest. Similarly, not all students are carbon copies. Different students require different paces regarding social life AKD academics.
(Part of the text is omitted for length)
Sample 12 - Human-written (advice on taking part in extracurricular activities)
Label: 0.0 (human)
Text: I definitely agree that all students should participate in at least one extracurricular activity. For one, it would help raise their GPA, especially if it is low at the moment. Most of the time there is an extracurricular activity for everyone, like STEM, Home EC, Yearbook club, Shop, and so on.
(Part of the text is omitted for length)
Sample 13 - Human-written (argument against mandatory community service)
Label: 0.0 (human)
Text: Car principle. I am award that you may take THC action of making community service a requirement to all students. This action is inevitably going to cause controversy among students. I for on cam very against this upcoming decision. Not only would this cause disruption, but it is also unfair, and will bc just like forcing students to work using physical means to receive a good grade.
(Part of the text is omitted for length)
Sample 14 - Human-written (essay on reforming the Electoral College)
Label: 0.0 (human)
Text: Does the Electoral College work? Are you happy with a group of electors choosing your president? I write this letter to you, our state senator, because, like many other U.S. citizens or residents, I have my own opinion. I am in favor of the idea of changing to popular vote.
(Part of the text is omitted for length)
Sample 15 - Human-written (discussion of summer project design)
Label: 0.0 (human)
Text: Some schools require students to complete summer projects to assure they continue learning during their break. Should these summer projects be teacherdesVgned or student designed? Summer break Is a great time to have fun, thVs Is the time when students get off school for a few months before the new school year.
(Part of the text is omitted for length)
Sample 16 - AI-generated (essay on academic grade requirements)
Label: 1.0 (AI-generated)
Text: As a student, I believe that requiring students to maintain a grade C average to participate in extracurricular activities is a wise decision. In my opinion, students should understand that there is a direct correlation between their academic performance and their ability to participate in activities outside of the classroom. This gracefully encourages students to strive to do their best in school, and understand the importance of working hard to achieve their goals.
Furthermore, this requirement allows the school to ensure that students can handle the extra workload associated with extracurricular activities, while still balancing their academic responsibilities. Ensuring this balance will lead to increased student success, as well as fewer overwhelmed and stressed students. It is crucial that students take the time to invest in their education and understand the importance of good grades, and this policy encourages them to make the effort.
Overall, I believe this policy to require a grade C average is a beneficial change for the school activities. It raises the educational standards for our school, while motivating and inspiring students to make the most of their education.
Sample 17 - Human-written (letter to a senator on the electoral system)
Label: 0.0 (human)
Text: Dear Senator, I would love Io if the electoral college was gone because Io really is crazy if people are thinking their voting for someone when they really aren'to like the presidents. We actually help pick our Soave senators because when we vote for presidents we are actually voting for the candidates electors.
(Part of the text is omitted for length)
Sample 18 - Human-written (argument for keeping the Electoral College)
Label: 0.0 (human)
Text: DEAR StatG senator I am writing this latter to argue in favor of keeping the Electoral college and changing to Election by popular vote for the president of the united States. I belong that we should keep it because PROLOG and citizens have the right to vote for whom they think would BG a good leader for OGR country.
(Part of the text is omitted for length)
Sample 19 - Human-written (on the importance of seeking advice from multiple people)
Label: 0.0 (human)
Text: Imagine you cannot give birth but you can adopt. Seeking multiple opinions can held make a better choice because not every time that you're deciding what to do is better, when you decide on something you think about it over and over again so many times, and when you only ask one Person they may sometimes not have some advice to give you.
(Part of the text is omitted for length)
Sample 20 - AI-generated (academic essay on the flexibility of laws)
Label: 1.0 (AI-generated)
Text: Laws are an essential tool that are used to regulate behavior and ensure that justice is served. However, as society changes and evolves, laws must also adapt to meet the needs of the people. In my opinion, laws should not be stationary and fixed, but should be flexible enough to take account of various circumstances, times, and places. In this essay, I will explain why I believe this, using specific reasons and examples to support my argument.
Firstly, it is important to recognize that society is constantly changing. With the advent of new technologies, new social trends, and new cultural norms, it is essential that laws are able to adapt to these changes. For example, in many countries, laws that regulated the use of cell phones while driving were put in place before smartphones were widely available. As a result, these laws may not be effective in regulating behavior in the modern era. By updating these laws to reflect modern technology, lawmakers can ensure that they remain relevant and effective.
(The full argument paragraphs are omitted for length)
In conclusion, laws are an essential tool for regulating behavior and ensuring that justice is served. However, laws should not be stationary and fixed, but should be flexible enough to take account of various circumstances, times, and places. By remaining flexible and adapting to changing circumstances, laws can remain relevant and effective in regulating behavior in modern society.
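Note: The samples above illustrate the range of writing in the dataset. The human-written texts commonly show spelling errors, nonstandard grammar, and colloquial phrasing (for example, "Would" wrongly capitalized, or "wait" written where "want" is intended), reflecting real student writing ability. The AI-generated texts show more regular grammar, stronger logical coherence, and clearer paragraph organization. These differences are the key features on which AI detection models are trained.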
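Application Scenarios
Scenario 1: Building an AI-writing detection system for academic integrity
With its roughly 363,000 balanced samples, the dataset can be used to train a high-accuracy AI text detector for deployment in the paper submission systems of educational institutions. By learning the language patterns of human writing (frequency of spelling errors, sentence variety, logical jumps, and so on) alongside the characteristics of AI-generated text (flawless grammar, templated structure, unusually high topical consistency, and so on), the system can identify whether a student used a tool such as ChatGPT to write an assignment. The length distribution of the texts (273 to 9,182 characters) spans the full range from short answers to long essays, so a trained model can adapt to different assignment types. In practice, when a teacher uploads a student's paper, the system could return an AI-generation probability score (0-100%) within 3 seconds, mark suspicious paragraphs, and generate a detailed analysis report. Such a system could help universities maintain academic integrity, with a detection rate that can exceed 92% and a false positive rate kept under 8%. It could also be used for online exam proctoring, detecting in real time whether students call on AI tools while answering, to curb academic misconduct at its source.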
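The per-paragraph report described above could be structured along the following lines. This is only a sketch: `score_fn` stands for any trained detector that returns an AI probability between 0 and 1, the `toy_scorer` used here is a hypothetical placeholder rather than a real model, and the 0.5 flagging threshold is an assumption, not a value from the listing.

```python
def flag_paragraphs(text, score_fn, threshold=0.5):
    """Score each non-empty paragraph and mark those at or above the threshold.

    score_fn: callable taking a paragraph and returning an AI probability
    in [0, 1] (hypothetical detector interface).
    """
    paragraphs = [p.strip() for p in text.split("\n") if p.strip()]
    report = []
    for index, paragraph in enumerate(paragraphs):
        probability = score_fn(paragraph)
        report.append({
            "index": index,
            "ai_percent": round(probability * 100, 1),  # 0-100% score
            "suspicious": probability >= threshold,
        })
    return report

# Placeholder scorer for demonstration only; a real system would call a model.
def toy_scorer(paragraph):
    return 0.9 if "In conclusion" in paragraph else 0.2

essay = "I think school is fine.\n\nIn conclusion, laws should remain flexible."
report = flag_paragraphs(essay, toy_scorer)
for entry in report:
    print(entry)
```

Passing the scorer in as a parameter keeps the reporting logic independent of whichever model is eventually trained on the dataset.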
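Scenario 2: Automated review of AI-generated content on content platforms
Content platforms such as social media sites, news sites, and Q&A communities face a flood of AI-generated misinformation. The dataset can be used to train a content moderation model that automatically identifies and flags machine-generated articles, comments, and posts. The 50:50 balance helps the model stay equally sensitive to genuine human content and AI content, which reduces the risk of wrongly removing real users' posts. A trained model can be integrated into the publishing flow: when a user submits a new post, the system analyzes the text in real time and, if the AI-generation probability exceeds a threshold (for example 75%), triggers manual review or asks the user for additional verification. This is valuable for fighting paid posters, fake reviews, and spam mass-produced by AI. For example, an e-commerce platform could use the model to identify fake positive reviews that merchants had AI write, protecting the credibility of its rating system; a news aggregator could filter out low-quality AI-generated content-farm articles to improve source quality. Model throughput can reach 1,000 texts per second, enough for real-time review on large platforms.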
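The threshold rule in this scenario is simple enough to sketch directly. The function and action names below are hypothetical; only the 75% example threshold comes from the text, and "exceeds" is read as a strict comparison.

```python
def route_submission(ai_probability, threshold=0.75):
    """Decide what to do with a post given the detector's AI probability.

    Returns "manual_review" when the probability exceeds the threshold,
    otherwise "publish". Both action names are illustrative.
    """
    if not 0.0 <= ai_probability <= 1.0:
        raise ValueError("ai_probability must be between 0 and 1")
    return "manual_review" if ai_probability > threshold else "publish"

for p in (0.10, 0.75, 0.90):
    print(p, route_submission(p))
```

Lowering the threshold sends more posts to review, which catches more AI content but also delays more genuine posts, so in practice the value would be tuned on held-out data.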
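Scenario 3: Adversarial optimization of AI writing tools and style imitation research
Developers of AI writing tools can use the dataset for adversarial training, making generated text more "human-like" and closer to real writing styles. By analyzing the feature distribution of the human texts (spelling error patterns, syntactic variation, paragraph organization, and so on), a generation model can be trained to introduce a moderate amount of deliberate "imperfection", lowering the chance of being caught by a detector. For example, an educational assistant could learn to imitate common student grammar errors (subject-verb disagreement, mixed tenses, and so on) in generated essays, so that the output matches the writing level of a given age group and does not arouse a teacher's suspicion by being too polished. The dataset's average length of 2,230 characters gives generation models a reference target for long-text generation. Researchers can also use the dataset for style transfer experiments, training models to convert formal AI-generated text into colloquial, personalized wording for uses such as personalized writing assistants and automatic email replies. This line of research has academic value for advancing AI writing technology and exploring new modes of human-AI collaborative writing.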
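Scenario 4: NLP feature engineering and model benchmarking
The dataset gives NLP researchers a convenient binary classification benchmark for testing text classification algorithms. Researchers can run feature engineering experiments on it to find which linguistic features best separate AI text from human text, such as lexical richness (type-token ratio, TTR), syntactic complexity (clause nesting depth), topical coherence (LSA analysis), and variation in sentiment intensity. Comparing traditional machine learning methods (SVM, random forest) with deep learning models (BERT, RoBERTa, GPT-based classifiers) on this dataset can establish performance baselines for AI text detection. The dataset's scale (about 363,000 records) supports thorough cross-validation and hyperparameter tuning, which makes experimental results more reliable. Results could be published at computational linguistics venues (ACL, EMNLP, and others) and advance research in the field. The dataset can also be used for transfer learning research, to test whether models pretrained on it generalize to AI text detection in other languages (such as Chinese or Spanish).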
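As an illustration of one of the features named above, the type-token ratio can be computed as the number of distinct words divided by the total number of words. The sketch below assumes a simple regex tokenizer that lowercases text and keeps letters and apostrophes; other tokenization choices would give somewhat different values.

```python
import re

def type_token_ratio(text):
    """Return distinct tokens / total tokens, or 0.0 for text with no tokens."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    return len(set(tokens)) / len(tokens)

print(type_token_ratio("the cat saw the dog"))  # 4 distinct / 5 total = 0.8
```

TTR tends to fall as texts get longer, and lengths in this dataset range from 273 to 9,182 characters, so comparing texts fairly would likely require computing it over a fixed number of tokens per text.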
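Scenario 5: Educational tool development and writing skill assessment
Education technology companies can build intelligent writing assessment tools on this dataset to help students improve their writing. By analyzing the common error patterns in the human texts (the spelling and grammar errors that appear throughout the samples), a system can identify weak points in a student's essay and give targeted feedback. For example, when a student's text is too similar to AI-generated text, the system could respond with "The essay's logic is too formulaic; consider adding personal views and examples", guiding the student toward independent thinking. The text samples of different lengths (273 to 9,182 characters) can be used to train length-aware scoring models that apply different standards depending on the task (short piece, paragraph, long essay). The tool could also feed an automatic essay scoring system that, combined with AI detection, scores essays confirmed as the student's own work on three dimensions: content, structure, and language, reducing teachers' grading load. In K-12 education and language learning, the technology can help teachers quickly gauge a class's overall writing level, identify students who need extra support, and personalize instruction.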
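Dataset Summary
With its large scale (362,876 records), exact balance (50:50), and complete labeling (100% completeness), this AI-generated versus human text classification dataset is a strong base resource for AI text detection. It covers real educational writing scenarios and has a reasonable length distribution (2,230 characters on average, ranging from 273 to 9,182), including both short paragraphs and long essays, so it can support model training for a variety of applications.
The core value of the dataset lies in the precision of its labels and the strict balance of its classes. The matched 181,438 human texts and 181,438 AI-generated texts remove class bias during training and help a model develop equal ability on both classes. The human samples reflect real student writing and contain many spelling errors, grammar problems, logical jumps, and other "imperfect" traits, which are key signals for telling human writing from AI writing. The AI-generated texts show highly regular grammar, smooth logical transitions, and templated paragraph structure. The contrast between the two classes offers rich material for feature engineering.
In terms of applications, the dataset can directly support work in academic integrity, content platform moderation, AI writing tool optimization, NLP benchmarking, and educational assessment tools. In academia, it can support research on AI text detection algorithms, adversarial learning, and style transfer. In industry, it can help educational institutions, content platforms, and technology companies quickly build AI detection systems in response to the spread of generative AI tools such as ChatGPT.
The dataset uses the standard CSV format with only two fields, text and generated. The structure is simple and works with all mainstream machine learning frameworks (Scikit-learn, PyTorch, TensorFlow, and others). At 778 MB the file is moderate in size, and an ordinary workstation can load the full dataset and train models. Because there are no missing values, researchers do not need to spend time on data cleaning and can start experiments immediately, which markedly lowers project start-up cost.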
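Since the labels are stored as floats (0.0/1.0) while most classifiers expect integer classes, a loading step would typically convert them. The sketch below uses only the standard library, checks the no-missing-values claim as it reads, and runs on made-up inline rows; the real file name and its encoding are not given in this listing.

```python
import csv
import io

def load_dataset(csv_file):
    """Read text/generated rows, validate them, and return (texts, int_labels)."""
    texts, labels = [], []
    for row_no, row in enumerate(csv.DictReader(csv_file), start=1):
        text = row["text"]
        label = float(row["generated"])  # raises ValueError if the label is blank
        if not text.strip():
            raise ValueError(f"empty text in data row {row_no}")
        if label not in (0.0, 1.0):
            raise ValueError(f"unexpected label {label} in data row {row_no}")
        texts.append(text)
        labels.append(int(label))  # 0.0 -> 0 (human), 1.0 -> 1 (AI)
    return texts, labels

# Made-up rows with the same two columns as the dataset.
sample = io.StringIO(
    'text,generated\n'
    '"Dear Principal, i think sports are good.",0.0\n'
    '"Furthermore, this policy ensures balance.",1.0\n'
)
texts, labels = load_dataset(sample)
print(len(texts), labels)
```

The returned lists can be handed to any of the frameworks mentioned above, for example as the inputs to a vectorizer and classifier.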