老下头

Digital Peter

Sold: 0
914.68MB

Data ID: D17174791485624028

Published: 2024/06/04

The following is the data verification report that the seller chose to provide:

Data description

DigitalPeter

Data and description for the Digital Peter dataset.

Data

Data link. It consists of images and txt files of the cropped text lines.

Full documents link. It consists of images of full documents. You can use it to train models that cut the text lines out of the original page image.

Annotations link. These are the annotations for the full documents dataset (previous link). They include segmentation masks for the text lines.

The data behind the data link has the following structure:

--- images/
--- mapper.csv
--- words/

The images folder contains images of text lines. The words folder contains txt files with the text corresponding to each image. You can match them by file name. For example, the image 0_1_0.jpg has its transcription in the file 0_1_0.txt.
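The file-name matching described above can be sketched as follows. This is only an illustration: the function name pair_lines, the .jpg/.txt extensions for every file, and UTF-8 encoding of the text files are assumptions, not something the dataset description confirms.

```python
from pathlib import Path


def pair_lines(root):
    """Return (image_path, text) pairs matched by file stem.

    ASSUMPTIONS: images are *.jpg files in <root>/images, transcriptions
    are *.txt files in <root>/words, and text files are UTF-8 encoded.
    """
    root = Path(root)
    pairs = []
    for img in sorted((root / "images").glob("*.jpg")):
        txt = root / "words" / (img.stem + ".txt")
        # Skip images that have no transcription with the same stem.
        if txt.exists():
            pairs.append((img, txt.read_text(encoding="utf-8").strip()))
    return pairs
```

Calling pair_lines on the unpacked data folder would return a list in which, for instance, images/0_1_0.jpg sits next to the contents of words/0_1_0.txt.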

mapper.csv has five columns: [new_name, old_name, train, public, private]. The files behind the data link are named according to the "new_name" column. For the "DigitalPeter" competition (described below) we used different filenames, which are listed in the "old_name" column. The file therefore maps the new names to the old names used in the competition.

FINALLY, if you want to use our dataset for scientific research, you can ignore the "DigitalPeter" competition and the "old_name" column in mapper.csv. Use only the current names from the "new_name" column and the [train, public, private] columns for the train/val/test splits.
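As a rough sketch, the splits could be read from mapper.csv like this. The description does not say how the train, public and private columns are encoded, so the code assumes a header row with the five column names and 0/1 (or True/False) flags; load_splits is a hypothetical helper name.

```python
import csv


def load_splits(mapper_path):
    """Group new_name values by split column.

    ASSUMPTIONS: mapper.csv has a header row naming the five columns,
    and train/public/private hold flags such as "1" or "True".
    """
    splits = {"train": [], "public": [], "private": []}
    with open(mapper_path, newline="", encoding="utf-8") as f:
        for row in csv.DictReader(f):
            for split in splits:
                # A truthy flag marks membership in that split (assumed encoding).
                if row[split].strip().lower() in {"1", "1.0", "true"}:
                    splits[split].append(row["new_name"])
    return splits
```

Under these assumptions, the "train" list would serve as the training set, "public" as validation and "private" as test, following the train/val/test order given in the description.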

Description

The dataset consists of 9694 images and text files. There are 265788 symbols and approximately 50998 words.

Each pair consists of one image file and one text file. File names consist of three numbers separated by underscores (for example 0_1_0): a document number, a page number within that document, and a line number within that page. This naming system was created to help researchers who use our dataset reconstruct the original texts. One can train NLP models on the reconstructed texts to help decrease the HTR model error.
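A minimal sketch of how the naming scheme could be used to rebuild page texts from individual lines. It assumes every file stem has exactly three integer fields in the order document, page, line (as in 0_1_0); parse_name and rebuild_pages are hypothetical helper names.

```python
from collections import defaultdict


def parse_name(stem):
    """Split a stem such as "0_1_0" into (document, page, line) integers.

    ASSUMPTION: the three fields appear in this order.
    """
    doc, page, line = (int(part) for part in stem.split("_"))
    return doc, page, line


def rebuild_pages(lines):
    """Map {stem: text} to {(document, page): [texts in line order]}."""
    pages = defaultdict(list)
    for stem, text in lines.items():
        doc, page, line = parse_name(stem)
        pages[(doc, page)].append((line, text))
    # Sort each page by line number and drop the number itself.
    return {key: [text for _, text in sorted(items)] for key, items in pages.items()}
```

Joining the list for one (document, page) key with newlines gives the page text in reading order, which could then be fed to a language model.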

Here is an example of one line of text.

Here is an example of a segmented document.
