# 150,000-Record Multi-Class Sentiment Analysis Dataset: A Patent Text Classification Dataset for NLP Model Training and Text Mining Research
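## Introduction and Background
Patent documents are an important carrier of technical innovation and hold substantial technical and commercial information. As artificial intelligence has advanced, text analysis and sentiment mining on patent documents has become an active research direction in natural language processing. This dataset contains 150,000 patent text records spanning multiple technical fields. Each record carries one sentiment label, giving researchers and developers labeled training material.
The dataset includes the complete raw text together with cleaned and standardized labels, so it can be used directly to train and evaluate machine learning models. By analyzing these patent records, researchers can study technology trends and identify technical hotspots, which in turn can inform corporate decisions on innovation. The dataset also offers a test benchmark for sentiment analysis algorithms under real-world conditions, supporting the application of natural language processing in specialized domains.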
## Basic Data Information
### Field Description
| Field name | Type | Description | Example | Completeness |
| :--- | :--- | :--- | :--- | :--- |
| text | String | Patent document text | "An image forming apparatus of the present invention includes..." | 100% |
| target | Integer | Sentiment label (0, 1, or 2) | 0 | 100% |
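
The field table above can be checked mechanically when the data is loaded. The following is a minimal sketch that assumes the records are available as CSV with `text` and `target` columns; the listing does not state the actual file name or format, so `SAMPLE_CSV` and `load_records` are hypothetical names used only for illustration.

```python
import csv
import io

# Hypothetical two-row sample in CSV form. The real file name and format
# are not stated in the listing, so this layout is an assumption.
SAMPLE_CSV = '''text,target
"An image forming apparatus of the present invention includes: a replaceable part.",0
"According to the invention, a steam turbine rotor blade can be provided.",1
'''

VALID_LABELS = {0, 1, 2}  # label values documented in the field table


def load_records(handle):
    """Read (text, target) rows and validate them against the field table."""
    records = []
    for row in csv.DictReader(handle):
        target = int(row["target"])  # target is documented as an Integer
        if target not in VALID_LABELS:
            raise ValueError(f"unexpected label: {target}")
        records.append({"text": row["text"], "target": target})
    return records


records = load_records(io.StringIO(SAMPLE_CSV))
print(len(records), records[0]["target"])
```

For a file on disk, the same function could be given an open file handle instead of the in-memory sample.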
### Data Distribution
#### Label Distribution
| Label | Records | Share |
| :--- | :--- | :--- |
| 0 | 50,000 | 33.33% |
| 1 | 50,000 | 33.33% |
| 2 | 50,000 | 33.33% |
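
The counts and shares in the table above can be recomputed from the `target` column. This is a small sketch using only the standard library; the `labels` list here is synthetic and simply mirrors the documented counts, and `label_distribution` is a hypothetical helper name.

```python
from collections import Counter


def label_distribution(labels):
    """Return {label: (count, percentage)} with percentages rounded to 2 places."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        label: (count, round(100 * count / total, 2))
        for label, count in sorted(counts.items())
    }


# Synthetic labels that mirror the documented distribution (an assumption
# standing in for the real `target` column).
labels = [0] * 50_000 + [1] * 50_000 + [2] * 50_000
dist = label_distribution(labels)
for label, (count, pct) in dist.items():
    print(label, count, f"{pct}%")
```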
#### Text Length Distribution
| Statistic | Value |
| :--- | :--- |
| Shortest text | 1 character |
| Longest text | 45,691 characters |
| Mean length | 1,869 characters |
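
Because lengths range from 1 to 45,691 characters, it may be worth inspecting the length statistics and screening out very short records before training. The sketch below shows one way to do this; the `min_chars=20` threshold and the toy `texts` list are assumptions for illustration, not values taken from the dataset.

```python
from statistics import mean


def length_stats(texts):
    """Compute min, max, and mean character length of a list of texts."""
    lengths = [len(t) for t in texts]
    return {"min": min(lengths), "max": max(lengths), "mean": mean(lengths)}


def drop_short(texts, min_chars=20):
    """Keep texts with at least min_chars characters (threshold is an assumption)."""
    return [t for t in texts if len(t) >= min_chars]


# Toy inputs standing in for the `text` column.
texts = ["A", "The present invention provides a hot runner nozzle.", "x" * 200]
stats = length_stats(texts)
kept = drop_short(texts)
print(stats["min"], stats["max"], len(kept))
```

Very long records may also need truncation or chunking, depending on the input limit of the chosen model.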
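### Distribution of Main Technical Fields
Based on a preliminary analysis of the text content, the dataset covers the following main technical fields:
| Technical field | Approximate share |
| :--- | :--- |
| Mechanical engineering | About 25% |
| Electronics and communications | About 25% |
| Materials science | About 20% |
| Chemistry and chemical engineering | About 15% |
| Other fields | About 15% |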
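## Data Advantages
| Advantage | Details | Value in use |
| :--- | :--- | :--- |
| Large scale | 150,000 labeled records | Provides ample training samples and helps model generalization |
| Balanced labels | Each of the three labels accounts for about 33.33% | Avoids class imbalance during training |
| High text quality | Patent language is formal and highly specialized | Suited to text analysis tasks in specialized domains |
| Broad coverage | Covers mechanical, electronics, materials, chemical, and other fields | Supports cross-domain text classification research |
| Complete labeling | Every record is labeled, with no missing values | Can be used for model training without additional annotation work |
| Moderate text length | About 1,870 characters on average (range: 1 to 45,691) | Workable for deep learning models, balancing information content and compute cost |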
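
Since the labels are balanced, a stratified split keeps the same label proportions in the training and test sets. The following is a minimal standard-library sketch; `stratified_split`, the 20% test ratio, and the seed are assumptions chosen for illustration, and the toy `records` stand in for the real data.

```python
import random
from collections import defaultdict


def stratified_split(records, test_ratio=0.2, seed=42):
    """Split records into train/test while preserving per-label proportions."""
    by_label = defaultdict(list)
    for rec in records:
        by_label[rec["target"]].append(rec)

    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in sorted(by_label):
        group = by_label[label][:]
        rng.shuffle(group)
        n_test = int(len(group) * test_ratio)
        test.extend(group[:n_test])
        train.extend(group[n_test:])
    return train, test


# Toy records: 10 per label.
records = [{"text": f"doc {i}", "target": i % 3} for i in range(30)]
train, test = stratified_split(records)
print(len(train), len(test))
```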
## Data Samples
The following 15 samples cover different labels and technical fields:

Sample 1 (label 0): "An image forming apparatus of the present invention includes: a replaceable part used for image formation; an output unit configured to output information indicating that a usage amount of the replaceable part has reached a threshold; an input unit configured to input information; and a control unit configured to set the threshold in accordance with information about image quality input by the input unit.", "An image forming system of the present invention includes an image forming apparatus and an input device connected to the image forming apparatus, in which the system includes: a replaceable part used for image formation; an output unit configured to output information indicating that a usage amount of the replaceable part has reached a threshold; and a control unit configured to set the threshold in accordance with information about image quality input by the input device."

Sample 2 (label 2): "However, in the conventional aerial vehicle described above, the balloon has a problem that it is difficult to control the direction of movement of the balloon to be greatly affected by the wind because the balloon is not provided with the power source for horizontal movement, and hence, the balloon is unsuitable for carrying a load and a person. Further, the airship has a problem that, since the airship has a balloon which is an elongated elliptical body, the airship easily receives wind resistance and hence cannot quickly move or change its moving direction.", "Further, it is also reported that the conventional drone, including the unmanned aerial vehicle described in Patent Literature 1, crashes at the rate of once every 20 flights due to troubles, such as battery exhaustion (the current average flight time is only about 20 minutes), drive source failure, and damage of the propeller. Therefore, the conventional drone has high risk of accident caused by falling and of damage of the vehicle body. Further, the conventional drone performs the vertical movement and the horizontal movement by a common propeller oriented only in the vertical direction, and hence has the problem that the horizontal mobility capability is low. Further, the conventional drone needs to have a propeller larger than the vehicle body, and hence has the problem of noise and the problem of high danger when in contact with a person, or the like.", "The present invention has been made in order to solve the above-described problems. An object of the present invention is to provide a buoyant aerial vehicle which can reduce the risk of crashing to thereby secure high safety, and which can suppress the influence of wind to thereby facilitate the control of movement and exhibit high mobility capability."

Sample 3 (label 1): "As described above, according to the cap, the cap mold, and the securing structure part using the cap of the present invention, the structures thereof are simple and have improved productivity and low manufacturing cost, and it is possible to prevent the cap from falling off from the fastener.", "According to the method for mounting a cap of the present invention, when the cap is mounted on the fastener using the filler, it is possible to easily prevent the filler, which protrudes from the cap, from being attached to the periphery of the cap."

Sample 4 (label 2): "The present invention, in view of this situation, has as its task making up for the defects of the low surface hardness or internal hardness resulting from just induction hardening or soft nitriding by combining induction hardening and soft nitriding and providing a steel part for machine structure use excellent in contact fatigue strength (i) provided with a high surface hardness, internal hardness, and temper softening resistance unable to be obtained by a conventional soft nitrided and induction hardened steel part and, furthermore, (ii) formed with a sufficient lubricating film at its operating surface and a steel for machine structure use for surface hardening use used for said steel part."

Sample 5 (label 0): "The objects of the present invention can be implemented by following methods. First, a plurality of (preferably, three or more) electromagnetic sensors is combined and forms a partial discharge sensor while a relative positional relation (distance and angle) is kept between the sensors. Next, spatial intensity distribution of electromagnetic signals at the time of occurrence of partial discharge is measured by the sensors at the same time. The partial discharge and noise are separated by comparing a relative relation of signal intensity measured by the respective electromagnetic sensors constituting the partial discharge sensor with preliminarily-measured spatial intensity distribution at the time of occurrence of the partial discharge. Further, a peak position is obtained by comparing preliminarily-measured signal intensity distribution with measured signal distribution, thereby locating a defect position. Finally, a risk can be assessed by analyzing the φ-q-n pattern, current signal waveforms, and FFT waveforms of the partial discharge signal at the located defect position."

Sample 6 (label 1): "According to the invention, a steam turbine rotor blade achieving both abrasion resistance and reliability, and a method for manufacturing a steam turbine rotor blade capable of obtaining such a steam turbine rotor blade can be provided."

Sample 7 (label 2): "In other words, in the conventional technology, the waveform of a discharge pulse current that includes a current portion having a high peak value and short pulse width in the leading portion cannot be formed into a current waveform that does not affect the formation of the film.", "The present invention is made in view of the above and has an object to obtain a discharge surface treatment apparatus capable of forming the waveform of a discharge pulse current into a current waveform that does not affect the formation of the film when a capacitor is connected in parallel with a discharge electrode and a workpiece and a current portion having a high peak value and short pulse width is formed in the leading portion of the discharge pulse current generated between the poles."

Sample 8 (label 1): "The invention in claim 1 allows meter-in control and meter-out control to be individually performed, while enabling the number of components to be reduced to contribute to cost reduction.", "The invention in claim 2 allows flow rate control to be accurately performed using the meter-in valve.", "The invention in claim 3 facilitates control of the recycling flow rate, allowing accurate recycling flow rate control to be achieved."

Sample 9 (label 2): "However, according to the researches by the inventor, it is found that the other electric characteristics of the solar battery do not always indicate satisfactory characteristics in a bulk-type solar battery in which a texture structure at high short-circuit current density is adopted over the entire surface of a cell.", "The present invention has been devised in view of the above and it is an object of the present invention to obtain a solar battery cell having well-balanced electric characteristics and excellent in photoelectric conversion efficiency and a method of manufacturing the solar battery cell."

Sample 10 (label 1): "According to the aspects of the present disclosure, the communication system can efficiently use resources and thus can improve communication reliability. Further, according to the aspect of the present disclosure, since the IoT device uses the reference signal of the existing LTE communication system, reliability of the channel estimation can be improved, and the resources can be efficiently used."

Sample 11 (label 2): "Introduction of new radio communication technologies has led to increases in the number of user equipments (UEs) to which a base station (BS) provides services in a prescribed resource region, and has also led to increases in the amount of data and control information that the BS transmits to the UEs. Due to typically limited resources available to the BS for communication with the UE(s), new techniques are needed by which the BS utilizes the limited radio resources to efficiently receive/transmit uplink/downlink data and/or uplink/downlink control information. In particular, overcoming delay or latency has become an important challenge in applications whose performance critically depends on delay/latency.", "The technical objects that can be achieved through the present invention are not limited to what has been particularly described hereinabove and other technical objects not described herein will be more clearly understood by persons skilled in the art from the following detailed description."

Sample 12 (label 0): "The first aspect of a method for recovering a metal from a target according to the present invention is a method for recovering a metal from a target that consists essentially of a CoCrPt-based metal or a CoCrPtRu-based metal, and one or more metal oxides selected from the group consisting of SiO2, Cr2O3 and CoO. The method includes: heating the target at a temperature of from 1400 to 1790° C. in an upper crucible of a two-level crucible that includes the upper crucible with a through hole formed in a bottom surface thereof, and a lower crucible disposed below the through hole; and causing the melted metal to flow into the lower crucible, so that the metal is separated from the metal oxide. Thus, the metal can be separated from the metal oxide with a small number of process steps and less contamination of impurities."

Sample 13 (label 1): "According to the present invention, a method for manufacturing a filling planarization film, which can form a filling planarization film having excellent filling property (embeddability) into the recessed part and excellent heat resistance is provided.", "According to the present invention, a method for manufacturing an electronic device using the method for manufacturing a filling planarization film is provided."

Sample 14 (label 2): "In the guided-mode resonance filters disclosed in NPLs 1 and 2 and PTL 1, due to the presence of a substrate, the reduction in weight and thickness cannot be achieved. Further, the guided-mode resonance filters have large refractive index difference between the layers, which thus cause a problem of increased Fresnel reflection.", "With that, an object of the present invention is to provide a polarizer having a diffraction grating, the polarizer reducing Fresnel reflection and exhibiting high reflection diffraction efficiency for TE polarized light as well as high transmission diffraction efficiency for TM polarized light, and an optical element having the polarizer."

Sample 15 (label 0): "First Aspect of Invention", "The present invention was developed in view of the aforementioned problem. In order to solve the aforementioned problem, the first aspect of the invention provides a hot runner nozzle which has a nozzle gate fitted to face a gate of a cavity of a metal mold for molding a molded product using a plurality of types of molten resins. The hot runner nozzle can include a first resin flow path and a plurality of second resin flow paths. The first resin flow path has a funnel section which continues with the nozzle gate at a centripetal position and which allows a first molten resin to be gathered at the centripetal position of the funnel section so as to feed the resin into the nozzle gate. The second resin flow paths have a plurality of corresponding discharge ports facing and communicating with the funnel section of the first resin flow path, each of the discharge ports being disposed around the centripetal position, and feeding a second molten resin different from the first molten resin into the funnel section."

## Application Scenarios
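### Patent Sentiment Analysis and Technology Trend Forecasting
The dataset can be used to train patent sentiment analysis models that help companies and research institutions understand the sentiment orientation of patent text and anticipate technology trends. By analyzing sentiment across a large body of patent data, a company can identify which technical fields are developing quickly and which directions may face obstacles. For example, records labeled 2 may represent criticism of existing technology or descriptions of problems, so analyzing them can surface technical pain points and opportunities for improvement. Records labeled 1 may represent technical advantages and innovations, which can help identify directions of technical breakthrough. This kind of analysis can inform R&D strategy and technology investment decisions.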
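One way to probe this tentative reading of the labels is to list the most frequent terms per label and see whether problem-oriented or benefit-oriented vocabulary dominates. The sketch below is only an exploratory aid, not a statement of how the labels were assigned; `STOPWORDS`, `top_terms_by_label`, and `toy_records` are hypothetical and deliberately small.

```python
import re
from collections import Counter, defaultdict

# Assumed, deliberately tiny stopword list; a real analysis would use a fuller one.
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "that", "has", "it", "can", "be"}


def top_terms_by_label(records, k=3):
    """Return the k most frequent non-stopword tokens for each label."""
    counts = defaultdict(Counter)
    for rec in records:
        tokens = re.findall(r"[a-z]+", rec["text"].lower())
        counts[rec["target"]].update(t for t in tokens if t not in STOPWORDS)
    return {
        label: [term for term, _ in counter.most_common(k)]
        for label, counter in sorted(counts.items())
    }


# Hypothetical toy records loosely modeled on the samples in this listing.
toy_records = [
    {"text": "The conventional drone has the problem of noise and the problem of high danger.", "target": 2},
    {"text": "It is possible to prevent the cap from falling off, and it is possible to reduce cost.", "target": 1},
]
top = top_terms_by_label(toy_records)
print(top)
```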
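### Intelligent Patent Retrieval and Recommendation
Text classification models trained on this dataset can be applied to intelligent patent retrieval. Traditional patent search relies mainly on keyword matching and often fails to capture the semantics and sentiment of patent content. A sentiment analysis model enables more targeted retrieval, for example returning only patent passages with a particular sentiment orientation. Such a model can also support a patent recommendation system that suggests relevant patent documents based on a user's research direction and interests, improving R&D efficiency.
### Competitive Technology Intelligence
Companies can use the dataset for competitive technology intelligence. Sentiment analysis of competitors' patents can reveal their technical strengths, weaknesses, and R&D priorities. For example, analyzing the technical problems and solutions described in a competitor's patents can expose technical gaps and potential market opportunities. Comparing the sentiment distribution of patents across companies can also help assess their capacity for technical innovation and their market competitiveness, providing data to support competitive strategy.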
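### Academic Research and Algorithm Validation
The dataset provides experimental data for academic research in natural language processing. Researchers can use it to validate new text classification algorithms, sentiment analysis models, and deep learning architectures. Its scale and diversity make it a suitable benchmark for evaluating model performance. Because the data comes from real patent documents, research results are also likely to transfer to practical and industrial use.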
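When the dataset is used as a benchmark, accuracy and macro-averaged F1 are common metrics for a balanced three-class task. The listing does not prescribe any metric, so the following is only a sketch of how the two could be computed from scratch; `y_true` and `y_pred` are made-up values for illustration.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred, labels=(0, 1, 2)):
    """Unweighted mean of per-label F1 scores."""
    scores = []
    for label in labels:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != label and p == label)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Made-up labels and predictions for illustration only.
y_true = [0, 0, 1, 1, 2, 2]
y_pred = [0, 1, 1, 1, 2, 0]
print(round(accuracy(y_true, y_pred), 4), round(macro_f1(y_true, y_pred), 4))
```

Macro averaging weights each label equally, which matches the balanced label distribution described in this listing.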
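## Conclusion
This dataset contains 150,000 patent text records covering mechanical, electronics, materials, chemical, and other technical fields, with a balanced label distribution. It provides ample samples for training natural language processing models and supports applications such as patent analysis and technology intelligence mining.
Its core strengths are the large volume of labeled data and the breadth of technical fields covered. Analyzing the data can help reveal technology trends and identify opportunities for innovation, giving companies and research institutions a useful reference for decisions.
For more details about the dataset or advice on using it, feel free to get in touch.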