晓彤

SCP 001 to 6999

culture and humanities, popular culture, arts and entertainment, text

4

Sold: 0
64.26MB

Data ID: D17220576939373771

Published: 2024/07/27

The following is the data validation report that the seller chose to provide:

Data Description

Context

Collated to build a large, high-quality text repository for natural language processing, with a leaning towards horror and urban legend. I was also interested in SCPs and in analysing them myself, so the dataset captures several other values that may be of interest.

Content

This dataset comprises the main series SCPs numbered 1 to 6999. A significant minority of these were deleted or otherwise missing, and in many cases they have been flagged as such in the 'state' column, which can contain the values 'blocked' and 'deleted'. In these cases the main page content was still added to the dataframe, so the main 'text' column contains several duplicates, for example where the site's deletion notice was captured as the text. The 'state' column also has the values 'active', where the article was presumed active and not blocked, and 'age restricted', where the article was marked as containing adult content.
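The state flags and duplicate texts described above suggest a simple cleaning pass before any modelling. The following is a minimal sketch using only the Python standard library. It assumes each row is available as a dict with lowercase 'state' and 'text' keys, which may not match the column names in the actual file. The helper name `usable_rows`, the codes, and the sample texts are made up for illustration and are not taken from the dataset.

```python
# States the description treats as readable articles.
ACTIVE_STATES = {"active", "age restricted"}


def usable_rows(rows):
    """Keep rows with a readable state and drop repeated texts.

    Assumes each row is a dict with 'state' and 'text' keys (an assumption
    about how the data has been loaded, not a documented schema).
    """
    seen_texts = set()
    kept = []
    for row in rows:
        if row.get("state") not in ACTIVE_STATES:
            continue  # skip 'blocked', 'deleted', or unknown states
        text = row.get("text") or ""
        if text in seen_texts:
            continue  # same text already kept, e.g. a repeated notice
        seen_texts.add(text)
        kept.append(row)
    return kept


# Made-up sample rows for illustration only.
sample = [
    {"code": "SCP-101", "state": "active", "text": "Example article one."},
    {"code": "SCP-102", "state": "deleted", "text": "This page was deleted."},
    {"code": "SCP-103", "state": "blocked", "text": "This page was deleted."},
    {"code": "SCP-104", "state": "age restricted", "text": "Example article two."},
]

print([row["code"] for row in usable_rows(sample)])
```

Whether 'age restricted' rows belong in a given training set is a judgment call. Removing that value from `ACTIVE_STATES` excludes them.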

Contains eight columns:

  • Code: the code name for the SCP, for example SCP-034, where SCP is followed by a dash and a number, zero-padded to three digits where the number has fewer than three. Note that only a single code name in this format has been added, although some SCPs had multiple or alternate code names.
  • Title: the text title of the SCP, for example "The Thing in the Room"
  • Text: the full text of the main web page, excluding image captions. It may include material other than the story itself, such as license notices. Paragraphs are joined together with \n newline characters.
  • Image Captions: all image captions from the page, joined together with \n newline characters.
  • Rating: a positive or negative integer rating that users have given the article on the site. All are a plus or a minus followed by an integer, although the dataframe may store them as floats.
  • State: one of several categories the article can fall under, for example 'active' or 'deleted' as described above.
  • Tags: the hidden and visible tags that the article was given. Hidden tags start with an underscore, whereas visible tags do not.
  • Link: the URL that links to the original work.
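The column conventions above (zero-padded codes, signed ratings that may be stored as floats, and underscore-prefixed hidden tags) can be normalised with a few small helpers. This is a minimal sketch under stated assumptions: the function names are hypothetical, and the tags are assumed to be available as a list of strings, since the description does not say how they are stored in the file.

```python
import math


def format_code(number):
    """Build a code such as 'SCP-034': zero-padded to at least three digits."""
    return f"SCP-{number:03d}"


def parse_rating(value):
    """Return the rating as an int, or None when it is missing or unparseable.

    Accepts strings such as '+12' or '-3' as well as floats such as 45.0,
    because the dataframe may store ratings as floats.
    """
    try:
        number = float(value)
    except (TypeError, ValueError):
        return None
    if not math.isfinite(number):
        return None  # NaN or infinity, e.g. a missing value
    return int(number)


def split_tags(tags):
    """Split tags into (visible, hidden); hidden tags start with an underscore."""
    visible = [tag for tag in tags if not tag.startswith("_")]
    hidden = [tag for tag in tags if tag.startswith("_")]
    return visible, hidden


print(format_code(34))
print(parse_rating("+12"), parse_rating(-3.0), parse_rating(None))
# Illustrative tag names, not taken from a specific row.
print(split_tags(["safe", "_cc", "humanoid"]))
```

If the tags turn out to be stored as a single delimited string, they would need to be split on that delimiter before being passed to `split_tags`.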

Acknowledgements

A big thank you to the many creators and contributors to the SCPs within this dataset. Each row links back to the original work, and the wiki which this was based on can be found at: https://scp-wiki.wikidot.com/. This data is available under a Creative Commons Attribution-ShareAlike License, as is the original work that it is based on.

Inspiration

Some ideas or challenges for things that may be possible with the dataset:

  • Predict the rating an SCP will receive, based on its tags or even its text
  • Generate realistic SCP text
  • Predict what an SCP is tagged with, by looking at the text
  • Use the image captions and text to generate images that match the article