Eric🐟

Wiki Neutrality Corpus (WNC)

Tabular, NLP, Text

106.7MB

Data ID: D17169025274954562

Published: 2024/05/28

Data Description

About Dataset

The Wiki Neutrality Corpus (WNC) consists of over 180,000 aligned sentence pairs, each giving a sentence before and after neutralization by English Wikipedia editors. The pairs come from revisions made between 2004 and 2019 in which the editor gave an NPOV-related (neutral point of view) justification.

The dataset was introduced as part of the research paper: Automatically Neutralizing Subjective Bias in Text.

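Loading the data using pandas:
import pandas as pd

# The files have no header row, so the column names are passed explicitly.
df = pd.read_csv('biased.full', sep='\t', names=["id", "src_tok", "tgt_tok", "src_raw", "tgt_raw", "src_POS_tags", "tgt_parse_tags"])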
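
Wikipedia sentences often contain double-quote characters, and pandas treats a field that starts with `"` as a quoted field by default. The sketch below is one cautious way to load the files, assuming the TSVs are not quoted (an assumption, not something this description confirms). The helper name `load_wnc` is hypothetical, and the second sample row is made up for illustration and is not taken from the corpus.

```python
import csv
import io

import pandas as pd

WNC_COLUMNS = ["id", "src_tok", "tgt_tok", "src_raw", "tgt_raw",
               "src_POS_tags", "tgt_parse_tags"]


def load_wnc(path_or_buffer):
    """Read a WNC-style TSV, treating quote characters as plain text."""
    return pd.read_csv(
        path_or_buffer,
        sep="\t",
        names=WNC_COLUMNS,
        quoting=csv.QUOTE_NONE,  # assumption: fields are never quoted
        dtype={"id": str},       # keep revision ids as strings
    )


rows = [
    # Example row taken from the column table in this description.
    ["532355971",
     "she did not do as promised exposing her as an un ##pr ##in ##ci ##pled politician .",
     "she did not do , leading to accusations of her being an un ##pr ##in ##ci ##pled politician",
     "she did not do as promised exposing her as an unprincipled politician.",
     "she did not do , leading to accusations of her being an unprincipled politician.",
     "PRON VERB ADV VERB ADP VERB VERB PRON ADP DET ADJ ADJ ADJ ADJ ADJ NOUN PUNCT",
     "nsubj aux neg ROOT mark advcl xcomp dobj prep det amod amod amod amod amod pobj punct"],
    # Made-up row (NOT from the corpus): fields begin with a quote character.
    ["0", '" great " film', "film", '"great" film', "film",
     "PUNCT ADJ PUNCT NOUN", "punct amod punct ROOT"],
]

# In-memory stand-in for a file such as 'biased.full'.
sample = io.StringIO("\n".join("\t".join(r) for r in rows) + "\n")
df = load_wnc(sample)
print(df[["id", "src_raw"]])
```

To read a real file, pass its path instead of the in-memory buffer, for example `load_wnc('biased.full')`.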

All data files are TSVs with the following columns:

| Column | Description | Example |
| --- | --- | --- |
| id | A unique identifier which can be used to link to a Wikipedia Diff view. | 532355971 (links to https://en.wikipedia.org/w/index.php?diff=532355971) |
| src_tok | Tokenized source text | she did not do as promised exposing her as an un ##pr ##in ##ci ##pled politician . |
| tgt_tok | Tokenized target text | she did not do , leading to accusations of her being an un ##pr ##in ##ci ##pled politician |
| src_raw | Raw source text | she did not do as promised exposing her as an unprincipled politician. |
| tgt_raw | Raw target text | she did not do , leading to accusations of her being an unprincipled politician. |
| src_POS_tags | Part-of-speech tags for source text | PRON VERB ADV VERB ADP VERB VERB PRON ADP DET ADJ ADJ ADJ ADJ ADJ NOUN PUNCT |
| tgt_parse_tags | Syntactic parse tags for target text using the Stanford Parser | nsubj aux neg ROOT mark advcl xcomp dobj prep det amod amod amod amod amod pobj punct |
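
The columns above can be combined in a few simple ways. The following is a minimal sketch using only the example values from the table. It assumes that `##` marks a WordPiece-style continuation of the previous token, which the examples suggest but the description does not state. The helper names `diff_url` and `merge_wordpieces` are hypothetical.

```python
def diff_url(revision_id):
    """Build the Wikipedia diff URL for a WNC id."""
    return f"https://en.wikipedia.org/w/index.php?diff={revision_id}"


def merge_wordpieces(tok_text):
    """Join '##'-prefixed pieces onto the preceding token (assumed convention)."""
    words = []
    for piece in tok_text.split():
        if piece.startswith("##") and words:
            words[-1] += piece[2:]
        else:
            words.append(piece)
    return words


# Example values copied from the column table.
src_tok = ("she did not do as promised exposing her as an "
           "un ##pr ##in ##ci ##pled politician .")
src_pos = ("PRON VERB ADV VERB ADP VERB VERB PRON ADP DET "
           "ADJ ADJ ADJ ADJ ADJ NOUN PUNCT")

tokens = src_tok.split()
pos_tags = src_pos.split()

# In this example there is one tag per token, so the two can be zipped.
tagged = list(zip(tokens, pos_tags))

merged = merge_wordpieces(src_tok)

print(diff_url(532355971))
print(tagged[:4])
print(" ".join(merged))
```

Because every sub-word piece carries its own tag in this example, merging the pieces back into words also means deciding how to merge their tags, for instance by keeping the tag of the first piece.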

BibTeX Citation:
@misc{pryzant2019automatically,
title={Automatically Neutralizing Subjective Bias in Text},
author={Reid Pryzant and Richard Diehl Martinez and Nathan Dass and Sadao Kurohashi and Dan Jurafsky and Diyi Yang},
year={2019},
eprint={1911.09709},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
