The following is the data validation report that the seller chose to provide:
Data description
This dataset contains Japanese-language texts obtained from Aozora Bunko, written by authors who served as inspiration for the characters in the manga Bungo Stray Dogs.
The bulk of the dataset is a folder of text files, each containing a single document. Two accompanying dataframes contain information on 1) the text files, including their title and author; and 2) the authors, including each author's full name in both Japanese script and romaji and the organization affiliation of the corresponding fictional character.
Additionally, there is a balanced random sample of sentences for the purposes of author classification, covering a subset of 20 authors with 1000 sentences each. To create this sample, the original document dataset was first restricted to texts written in modern kanji and kana, as labeled on Aozora Bunko (新字新仮名). Then, only authors who had at least 1000 sentences of 9 or more words across all their documents were considered.
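The sampling procedure described above can be sketched roughly as follows. This is only an illustration, not the code used to build the dataset: the record keys (`author`, `script`, `sentences`) and the function name `balanced_sample` are hypothetical, and the sketch assumes sentences have already been split into words by some tokenizer, since the description does not say how words were counted.

```python
import random
from collections import defaultdict


def balanced_sample(docs, n_per_author=1000, min_words=9,
                    script="新字新仮名", seed=0):
    """Draw an equal-sized random sample of sentences per eligible author.

    docs: iterable of dicts with hypothetical keys
      'author'    -> author name
      'script'    -> orthography label as given on Aozora Bunko
      'sentences' -> list of sentences, each a list of word tokens
    """
    by_author = defaultdict(list)
    for doc in docs:
        # Step 1: keep only documents in modern kanji and kana.
        if doc["script"] != script:
            continue
        # Step 2: keep only sentences with at least `min_words` words.
        for sentence in doc["sentences"]:
            if len(sentence) >= min_words:
                by_author[doc["author"]].append(sentence)

    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    sample = {}
    for author in sorted(by_author):
        candidates = by_author[author]
        # Step 3: an author qualifies only with enough long sentences.
        if len(candidates) >= n_per_author:
            # Sampling without replacement gives every author the same count.
            sample[author] = rng.sample(candidates, n_per_author)
    return sample
```

The sketch returns every author who meets the threshold; how the final subset of 20 authors was settled on is not stated in the description.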
Also, three extra text files are included: 1) a file containing all characters that appear in the documents and are neither kanji nor kana nor punctuation; 2) a file containing Japanese punctuation; and 3) a file containing all kana.
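As a rough illustration of how characters might be sorted into these groups, the sketch below classifies a character by Unicode block and general category. The exact definitions behind the three files are not given, so the ranges here are assumptions: Hiragana (U+3040 to U+309F) and Katakana (U+30A0 to U+30FF) count as kana, CJK Unified Ideographs (U+4E00 to U+9FFF) count as kanji, and any character whose Unicode general category starts with "P" counts as punctuation. The function names are hypothetical.

```python
import unicodedata


def classify_char(ch):
    """Return 'punctuation', 'kana', 'kanji', or 'other' for one character."""
    code = ord(ch)
    # Assumption: Unicode general categories P* define punctuation.
    if unicodedata.category(ch).startswith("P"):
        return "punctuation"
    # Assumption: the Hiragana and Katakana blocks define kana.
    if 0x3040 <= code <= 0x30FF:
        return "kana"
    # Assumption: the basic CJK Unified Ideographs block defines kanji.
    if 0x4E00 <= code <= 0x9FFF:
        return "kanji"
    return "other"


def other_chars(texts):
    """Collect every character that is neither kanji, kana, nor punctuation."""
    return {ch for text in texts for ch in text
            if classify_char(ch) == "other"}
```

Checking punctuation first means that marks located inside the Katakana block, such as the middle dot, are treated as punctuation rather than kana; the actual files may draw that line differently.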
