Natural Questions Dataset

浩宇宝贝

search engines, earth and nature, artificial intelligence, intermediate, text, english

Size: 111.11MB

Data ID: D17222448688596230

Published: 2024/07/29

Data description

Context

The Natural Questions (NQ) dataset is a collection of real user queries submitted to Google Search, paired with answers sourced from Wikipedia by expert annotators. Created by Google AI Research, it supports the development and evaluation of automated question-answering systems. The version provided here contains 89,312 annotated entries, prepared for use in natural language processing (NLP) and machine learning (ML) research.

Data Collection

The dataset is composed of authentic search queries from Google Search, reflecting the wide range of information sought by users globally. This approach ensures a realistic and diverse set of questions for NLP applications.

Data Pre-processing

The NQ dataset underwent significant pre-processing to prepare it for NLP tasks:

Removal of web-specific elements like URLs, hashtags, user mentions, and special characters using Python's "BeautifulSoup" and "regex" libraries.
Grammatical error identification and correction using the "LanguageTool" library, an open-source grammar, style, and spell checker.

These steps clean and simplify the text while preserving the meaning of the questions and their answers. Each record is divided into three fields: 'questions', 'long answers', and 'short answers'.
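The cleaning step described above can be sketched roughly as follows. This is an illustration only, not the dataset author's script: it uses the standard-library `html.parser` and `re` modules in place of "BeautifulSoup" and "regex" so that it runs without extra installs, it leaves out the "LanguageTool" grammar pass entirely, and the function name `clean_text`, the list of block-level tags, and the character whitelist are all assumptions.

```python
import re
from html.parser import HTMLParser

# Tags treated as word boundaries so that table cells and list items
# do not run together once the markup is dropped (assumed list).
BLOCK_TAGS = {"p", "br", "div", "li", "ul", "ol", "table", "tr", "td", "th"}


class _TextExtractor(HTMLParser):
    """Collects the text nodes of an HTML fragment and discards the tags."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_endtag(self, tag):
        if tag in BLOCK_TAGS:
            self.parts.append(" ")

    def handle_data(self, data):
        self.parts.append(data)


URL_RE = re.compile(r"https?://\S+|www\.\S+")
TAG_RE = re.compile(r"[#@]\w+")                  # hashtags and user mentions
SPECIAL_RE = re.compile(r"[^\w\s.,;:!?()%'-]")   # assumed whitelist of kept characters
SPACE_RE = re.compile(r"\s+")


def clean_text(raw: str) -> str:
    """Strip HTML, URLs, hashtags, mentions, and special characters."""
    extractor = _TextExtractor()
    extractor.feed(raw)
    extractor.close()
    text = "".join(extractor.parts)
    text = URL_RE.sub(" ", text)
    text = TAG_RE.sub(" ", text)
    text = SPECIAL_RE.sub(" ", text)
    return SPACE_RE.sub(" ", text).strip()


print(clean_text("<p>Who wrote <b>Hamlet</b>? See https://example.com #books @bob</p>"))
```

The order matters in this sketch: URLs are removed before the special-character pass, otherwise the pass would break a URL into fragments that the URL pattern no longer matches.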

Data Storage

The unprocessed data, including answers with embedded HTML and empty or complex long and short answers, is stored in "Natural-Questions-Base.csv". This version retains the raw structure of the data, with HTML elements in answers and varied answer formats such as tables and lists, for those interested in the original dataset's complexity.

The processed data is compiled into a single CSV file named "Natural-Questions-Filtered.csv". Each record contains the processed question, a detailed answer, and concise answer snippets.
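A minimal sketch of reading such a file with Python's standard `csv` module is shown here. The column names `question`, `long_answers`, and `short_answers` are assumptions based on the three fields named in this description, so check the actual header row first. The sample rows are made up and held in memory so the snippet runs without the download.

```python
import csv
import io

# Assumed column names; verify them against the real header row.
COLUMNS = ("question", "long_answers", "short_answers")


def read_records(stream):
    """Yield one dict per CSV row, keeping only the expected columns."""
    reader = csv.DictReader(stream)
    missing = [c for c in COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    for row in reader:
        # Empty or absent cells become empty strings.
        yield {c: (row[c] or "").strip() for c in COLUMNS}


# Made-up sample standing in for the real file.
sample = io.StringIO(
    "question,long_answers,short_answers\n"
    'who wrote hamlet,"Hamlet is a tragedy written by William Shakespeare.",William Shakespeare\n'
    "what is the capital of france,Paris is the capital of France.,Paris\n"
)
records = list(read_records(sample))
print(len(records), records[0]["short_answers"])
```

To read the real file, pass `open("Natural-Questions-Filtered.csv", newline="", encoding="utf-8")` instead of the sample; the UTF-8 encoding is also an assumption. For the base file, long HTML answers may exceed the `csv` module's default field size limit, in which case `csv.field_size_limit` can be raised before reading.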

Filtered Results

A filtered version is also available, in which specific criteria such as question length or answer complexity were applied to refine the data further. This version supports more focused research and application development.
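The exact filtering criteria are not listed here, so the following is only a hypothetical illustration of what a length and complexity filter could look like. The thresholds, the field names, and the use of leftover angle brackets as a proxy for "answer complexity" are all assumptions.

```python
def passes_filter(record, max_question_words=25, max_answer_chars=1500):
    """Return True if a record meets the (assumed) filter criteria."""
    question = record.get("question", "")
    long_answer = record.get("long_answers", "")
    short_answer = record.get("short_answers", "")

    # Drop records with any empty field.
    if not question or not long_answer or not short_answer:
        return False
    # Question length, measured in whitespace-separated words.
    if len(question.split()) > max_question_words:
        return False
    # Answer length, measured in characters.
    if len(long_answer) > max_answer_chars:
        return False
    # Crude complexity proxy: leftover markup such as tables or lists.
    if "<" in long_answer and ">" in long_answer:
        return False
    return True


rows = [
    {"question": "who wrote hamlet",
     "long_answers": "Hamlet is a tragedy written by William Shakespeare.",
     "short_answers": "William Shakespeare"},
    {"question": "list of oceans",
     "long_answers": "<table><tr><td>Pacific</td></tr></table>",
     "short_answers": "Pacific"},
    {"question": "unanswered question",
     "long_answers": "",
     "short_answers": ""},
]
kept = [r for r in rows if passes_filter(r)]
print([r["question"] for r in kept])
```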

Flask CSV Reader App

The repository at 'https://github.com/fujoos/natural_questions' also includes a Flask-based CSV reader application designed to read and display contents from the "NaturalQuestions.csv" file. The app provides functionalities such as:

Viewing questions and answers directly in your browser.
Filtering results based on criteria like question keywords or answer length.

See the live demo, which uses the CSV files converted to an SQLite database, at 'https://fujoos.pythonanywhere.com/'
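The repository's own code is not reproduced here. As a rough sketch of the idea behind the demo, rows can be loaded into SQLite and filtered by question keyword and answer length using only the standard `sqlite3` module. The table name `qa`, its columns, and both function names are hypothetical, and the Flask routing layer is left out.

```python
import sqlite3


def build_db(rows, path=":memory:"):
    """Create a table and insert (question, long_answer, short_answer) tuples."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS qa "
        "(question TEXT, long_answer TEXT, short_answer TEXT)"
    )
    conn.executemany("INSERT INTO qa VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


def search(conn, keyword, max_answer_chars=None):
    """Return (question, short_answer) pairs whose question contains keyword."""
    # instr() on lowercased text gives a case-insensitive substring match
    # without having to escape LIKE wildcards in user input.
    sql = ("SELECT question, short_answer FROM qa "
           "WHERE instr(lower(question), lower(?)) > 0")
    params = [keyword]
    if max_answer_chars is not None:
        sql += " AND length(long_answer) <= ?"
        params.append(max_answer_chars)
    sql += " ORDER BY rowid"
    return conn.execute(sql, params).fetchall()


rows = [
    ("who wrote hamlet",
     "Hamlet is a tragedy written by William Shakespeare.",
     "William Shakespeare"),
    ("what is the capital of france",
     "Paris is the capital and most populous city of France.",
     "Paris"),
]
conn = build_db(rows)
print(search(conn, "Hamlet"))
```

Passing the keyword as a bound parameter rather than formatting it into the SQL string keeps browser-supplied search terms from being interpreted as SQL.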
