
First Quora Dataset Release: Question Pairs

earth and nature

58.48MB

Data ID: D17220412104318513

Published: 2024/07/27

Data description

Today, we are excited to announce the first in what we plan to be a series of public dataset releases. Our dataset releases will be oriented around various problems of relevance to Quora and will give researchers in diverse areas such as machine learning, natural language processing, network science, etc. the opportunity to try their hand at some of the challenges that arise in building a scalable online knowledge-sharing platform. Our first dataset is related to the problem of identifying duplicate questions.

An important product principle for Quora is that there should be a single question page for each logically distinct question. As a simple example, the queries “What is the most populous state in the USA?” and “Which state in the United States has the most people?” should not exist separately on Quora because the intent behind both is identical. Having a canonical page for each logically distinct query makes knowledge-sharing more efficient in many ways: for example, knowledge seekers can access all the answers to a question in a single location, and writers can reach a larger readership than if that audience were divided among several pages.

To mitigate the inefficiencies of having duplicate question pages at scale, we need an automated way of detecting if pairs of question text actually correspond to semantically equivalent queries. This is a challenging problem in natural language processing and machine learning, and it is a problem for which we are always searching for a better solution.
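To make the difficulty concrete, here is a minimal sketch of a purely lexical baseline: Jaccard similarity over word tokens. This is an illustrative assumption, not the method Quora uses; the helper names `tokens` and `jaccard` are hypothetical. Applied to the two example queries above, it shows how little vocabulary two semantically equivalent questions may share.

```python
import re


def tokens(text: str) -> set[str]:
    """Return the set of lowercase word tokens; punctuation is dropped."""
    return set(re.findall(r"\w+", text.lower()))


def jaccard(q1: str, q2: str) -> float:
    """Jaccard similarity of the two token sets, in [0, 1]."""
    a, b = tokens(q1), tokens(q2)
    if not a and not b:
        # Two empty questions are treated as identical by convention.
        return 1.0
    return len(a & b) / len(a | b)


q1 = "What is the most populous state in the USA?"
q2 = "Which state in the United States has the most people?"

# The questions share only "the", "most", "state", "in":
# 4 shared tokens out of 13 distinct ones, so the score is 4/13.
score = jaccard(q1, q2)
print(f"{score:.3f}")
```

A score this low for a pair with identical intent suggests why word overlap alone is unlikely to be sufficient, and why models of semantic equivalence are needed.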

The dataset that we are releasing today will give anyone the opportunity to train and test models of semantic equivalence, based on actual Quora data. We are eager to see how diverse approaches fare on this problem.

Our dataset consists of over 400,000 lines of potential question duplicate pairs. Each line contains IDs for each question in the pair, the full text for each question, and a binary value that indicates whether the line truly contains a duplicate pair. Here are a few sample lines of the dataset:

Here are a few important things to keep in mind about this dataset:

Our original sampling method returned an imbalanced dataset with many more true examples of duplicate pairs than non-duplicates. Therefore, we supplemented the dataset with negative examples. One source of negative examples was pairs of “related questions” which, although pertaining to similar topics, are not truly semantically equivalent.

The distribution of questions in the dataset should not be taken to be representative of the distribution of questions asked on Quora. This is, in part, because of the combination of sampling procedures and also due to some sanitization measures that have been applied to the final dataset (e.g., removal of questions with extremely long question details).
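Because the balance between duplicate and non-duplicate pairs is an artifact of the sampling procedure, it may be worth measuring it before training. A minimal sketch, assuming the binary labels have already been extracted into a sequence of 0/1 values; `label_balance` is a hypothetical helper and the labels below are invented for illustration, not taken from the dataset.

```python
from collections import Counter


def label_balance(labels):
    """Return the fraction of examples carrying each label value."""
    counts = Counter(int(label) for label in labels)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {label: n / total for label, n in sorted(counts.items())}


# Invented labels for illustration only (1 = duplicate, 0 = not duplicate).
example_labels = [0, 0, 1, 0, 1]
balance = label_balance(example_labels)
print(balance)
```

The fractions describe this released sample only; as noted above, they should not be read as the duplicate rate among questions asked on Quora.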

Download link: http://qim.ec.quoracdn.net/quora_duplicate_questions.tsv
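A minimal sketch of reading the file, assuming it is tab-separated with a header row naming the columns `id`, `qid1`, `qid2`, `question1`, `question2`, and `is_duplicate`. Those column names are an assumption and should be checked against the header of the downloaded file. The `QuestionPair` class and `read_pairs` function are hypothetical, and the two rows in `SAMPLE` are invented so the example runs without the download.

```python
import csv
import io
from dataclasses import dataclass


@dataclass
class QuestionPair:
    qid1: int
    qid2: int
    question1: str
    question2: str
    is_duplicate: bool


def read_pairs(handle):
    """Parse tab-separated rows from an open text handle into QuestionPair records."""
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        QuestionPair(
            qid1=int(row["qid1"]),
            qid2=int(row["qid2"]),
            question1=row["question1"],
            question2=row["question2"],
            is_duplicate=row["is_duplicate"] == "1",
        )
        for row in reader
    ]


# Invented rows in the assumed layout; not actual lines from the dataset.
SAMPLE = (
    "id\tqid1\tqid2\tquestion1\tquestion2\tis_duplicate\n"
    "0\t1\t2\tWhat is the most populous state in the USA?"
    "\tWhich state in the United States has the most people?\t1\n"
    "1\t3\t4\tHow do I learn to cook?\tHow do I learn to code?\t0\n"
)

pairs = read_pairs(io.StringIO(SAMPLE))
print(len(pairs), pairs[0].is_duplicate)
```

For the downloaded file, the same function could be called on `open("quora_duplicate_questions.tsv", newline="", encoding="utf-8")`; passing `newline=""` lets the `csv` module handle any line breaks embedded in quoted question text.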

Source: https://data.quora.com/First-Quora-Dataset-Release-Question-Pairs
