
news_collection

NLPNews

￥15

70.32MB

Dataset ID: D17169521724940260

Published: 2024/05/29

About Dataset

A collection of news articles in Traditional and Simplified Chinese. It includes some Internet news outlets that are NOT Chinese state media (they deserve a separate dataset).

Complete coverage is not guaranteed, so this dataset is not suitable for analyzing event coverage. It is meant to be used as a corpus for NLP algorithms.

Data Collection Process

The links to the news articles were collected from the RSS feeds or the Twitter accounts of the news outlets.
The web pages were then downloaded and parsed, and the meta tags were used to extract the title, description/summary, and cover image of each article. (These are the fields used in Twitter and Facebook summary cards.)

Note: Only minimal text cleaning has been performed on the meta tags.
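
The extraction step described above can be sketched roughly as follows. This is a minimal illustration, not the code used to build the dataset: the names `MetaTagParser` and `extract_card_fields` are hypothetical, and the fallback order between the Open Graph and Twitter tags is an assumption taken from the order in which the tags are listed under Data Fields.

```python
from html.parser import HTMLParser


class MetaTagParser(HTMLParser):
    """Collects <meta> tags keyed by their `property` or `name` attribute."""

    def __init__(self):
        super().__init__()
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attrs = dict(attrs)
        # Open Graph tags usually use `property`, Twitter tags usually use `name`.
        key = attrs.get("property") or attrs.get("name")
        content = attrs.get("content")
        # Keep the first occurrence of each key; strip surrounding whitespace only.
        if key and content and key not in self.meta:
            self.meta[key] = content.strip()


def extract_card_fields(html):
    """Return title, desc, and image from a page's summary-card meta tags."""
    parser = MetaTagParser()
    parser.feed(html)
    meta = parser.meta

    def first(*keys):
        # Return the value of the first key that is present, else None.
        for key in keys:
            if key in meta:
                return meta[key]
        return None

    # Assumed preference order, mirroring the Data Fields descriptions.
    return {
        "title": first("og:title", "twitter:title"),
        "desc": first("twitter:description", "og:description"),
        "image": first("twitter:image", "og:image"),
    }
```

Only whitespace stripping is applied here, in keeping with the note that the meta tags received minimal cleaning.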

Data Fields

title: Article title from og:title or twitter:title meta tag.
desc: Article summary from twitter:description or og:description meta tag.
image: URL to the cover image from twitter:image or og:image meta tag.
url: URL of the article.
source: The code of the news outlet.
date: The publish date of the article on Twitter or in RSS feeds. Format: YYYYMMDD

This dataset does not provide the full text of the articles. You'll need to scrape it yourself using the links provided.
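
As a rough sketch of how one record might be represented in code, the fields above map onto a small data class. The class name `NewsRecord` and the `publish_date` helper are hypothetical, and the on-disk file format of the dataset is not stated here, so loading is left out; only the `YYYYMMDD` date format comes from the field description.

```python
import datetime as dt
from dataclasses import dataclass


@dataclass
class NewsRecord:
    """One article entry, using the field names listed under Data Fields."""

    title: str
    desc: str
    image: str
    url: str
    source: str
    date: str  # publish date as a YYYYMMDD string

    def publish_date(self) -> dt.date:
        # Parse the YYYYMMDD string into a date object; raises ValueError if malformed.
        return dt.datetime.strptime(self.date, "%Y%m%d").date()
```

Keeping `date` as the raw string and parsing on demand avoids failing on load if some entries turn out to be malformed.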
