The following is the data validation report that the seller has chosen to provide:
Data Description
About Dataset
A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (they deserve a separate dataset).
Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.
Data Collection Process
- The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
- The web pages were then downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the fields used in the Twitter and Facebook summary cards.)
Note: Only minimal text cleaning has been performed on the meta tags.
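The extraction step described above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the class MetaTagParser and the function extract_card_fields are hypothetical names, and the assumption that the first listed tag is preferred over the second is ours.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collect <meta> tags keyed by their `property` or `name` attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        key = attr.get("property") or attr.get("name")
        content = attr.get("content")
        # Keep the first occurrence of each tag; only strip whitespace
        # (mirrors the "minimal text cleaning" note).
        if key and content and key not in self.meta:
            self.meta[key] = content.strip()


def extract_card_fields(html):
    """Return title, desc and image from summary-card meta tags (hypothetical helper)."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta

    def first(*keys):
        # Assumption: try the tags in the order given; None if all are missing.
        for key in keys:
            if meta.get(key):
                return meta[key]
        return None

    return {
        "title": first("og:title", "twitter:title"),
        "desc": first("twitter:description", "og:description"),
        "image": first("twitter:image", "og:image"),
    }


sample = """<html><head>
<meta property="og:title" content="Example headline">
<meta name="twitter:description" content=" Short summary ">
<meta property="og:image" content="https://example.com/cover.jpg">
</head></html>"""
fields = extract_card_fields(sample)
print(fields)
```

Only the standard library is used here; a real crawler would also need to handle page encodings, redirects, and pages that lack these tags.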
Data Fields
- title: Article title from the og:title or twitter:title meta tag.
- desc: Article summary from the twitter:description or og:description meta tag.
- image: URL to the cover image from the twitter:image or og:image meta tag.
- url: URL of the article.
- source: The code of the news outlet.
- date: The publication date of the article on Twitter or in the RSS feed. Format: YYYYMMDD
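As a rough illustration of working with these fields, the sketch below reads records and converts the YYYYMMDD date into a date object. It assumes the data is a CSV file whose column names match the field names above; the listing does not state the file format, so treat that as an assumption. The helper parse_record and the sample row are hypothetical.

```python
import csv
import io
from datetime import date, datetime

FIELDS = ("title", "desc", "image", "url", "source", "date")


def parse_record(row):
    """Normalize one row and parse its date field (hypothetical helper)."""
    record = {name: (row.get(name) or "").strip() for name in FIELDS}
    # The date field is documented as YYYYMMDD.
    record["date"] = datetime.strptime(record["date"], "%Y%m%d").date()
    return record


# Made-up sample row standing in for the real file.
sample_csv = io.StringIO(
    "title,desc,image,url,source,date\n"
    "Example headline,Short summary,https://example.com/cover.jpg,"
    "https://example.com/article,example_outlet,20200131\n"
)
records = [parse_record(row) for row in csv.DictReader(sample_csv)]
print(records[0]["date"].isoformat())
```

To read a real file, replace the in-memory sample with open(path, newline="", encoding="utf-8"), adjusting the encoding if the file turns out to use a different one.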
This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
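One possible starting point for retrieving article text from the url field is sketched below. It is a naive approach under stated assumptions: it treats the text inside p elements as the article body, which will not hold for every outlet, and the names ParagraphParser, paragraph_text, and fetch_full_text are hypothetical. Check each site's terms of use and robots.txt before scraping.

```python
from html.parser import HTMLParser
from urllib.request import Request, urlopen


class ParagraphParser(HTMLParser):
    """Collect the text found inside <p> elements."""

    def __init__(self):
        super().__init__()
        self.paragraphs = []
        self._in_p = False
        self._buf = []

    def _flush(self):
        text = "".join(self._buf).strip()
        if text:
            self.paragraphs.append(text)
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self._flush()  # tolerate <p> tags that were never closed
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "p":
            self._flush()
            self._in_p = False

    def handle_data(self, data):
        if self._in_p:
            self._buf.append(data)


def paragraph_text(html):
    """Join all paragraph texts with newlines."""
    parser = ParagraphParser()
    parser.feed(html)
    parser.close()
    parser._flush()  # keep a trailing paragraph that was left unclosed
    return "\n".join(parser.paragraphs)


def fetch_full_text(url, timeout=10):
    """Download a page and return its paragraph text (not called here; needs network)."""
    request = Request(url, headers={"User-Agent": "Mozilla/5.0"})
    with urlopen(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        html = response.read().decode(charset, errors="replace")
    return paragraph_text(html)


sample_html = "<div><p>第一段。</p><script>x</script><p>Second <b>para</b>.</p></div>"
print(paragraph_text(sample_html))
```

Per-outlet extraction rules, keyed on the source field, are likely to give cleaner text than this generic paragraph heuristic.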

news_collection
70.32MB