老下头

Hindi OCR synthetic line image-text pair

transfer learning, image text recognition, bert, deit, hindi

734.76MB

Data ID: D17174917784042533

Published: 2024/06/04

The following is the data verification report that the seller chose to provide:

Data description

This dataset contains pairs of Hindi sentence images and their corresponding text. It was created using the TextToImageGenerator repository and is intended for OCR-related tasks. It is inspired by the IAM Handwriting Database. See the TrOCR_HI notebook.

Details

The dataset contains 80,000 images with a total of 630,515 words. According to this research paper, around 90k words or 13k lines are enough to create an effective OCR model. Each image is 900 pixels wide and 64 pixels high. A total of 8 fonts are used: Lohit-Devanagari.ttf, Sura-Regular.ttf, arial-unicode-ms.ttf, NotoSansDevanagari.ttf, adobedevanagari-regular.ttf, Akshar-Unicode.ttf, mangal-2.ttf and gargi.ttf.

Structure

Images are in the output_images folder. The TestSamples folder contains 9 images taken at random from the internet for testing. The data.csv file contains the annotations, with the following columns:

image_file - The name of the image file

text - The sentence used in the image

font_size - Size of the font (in pixels) used in the image

font_file - The font used in the image

word_count - The total number of words present in the image
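
The annotation layout above could be read with a short script like the following. This is a minimal sketch, assuming data.csv has a header row with exactly the five column names listed and that image_file holds a file name relative to output_images. The helper name `load_annotations` and the added `image_path` key are hypothetical and not part of the dataset.

```python
import csv
from pathlib import Path


def load_annotations(csv_path, image_dir="output_images"):
    """Read data.csv and return one dict per line image.

    Assumes a header row with the columns image_file, text, font_size,
    font_file and word_count. The function name and the image_path key
    are illustrative, not something the dataset defines.
    """
    records = []
    # utf-8 is assumed so that the Devanagari text is decoded correctly.
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            records.append({
                "image_path": Path(image_dir) / row["image_file"],
                "text": row["text"],
                "font_size": float(row["font_size"]),
                # int(float(...)) tolerates values stored as "9" or "9.0".
                "word_count": int(float(row["word_count"])),
                "font_file": row["font_file"],
            })
    return records
```

Each returned record pairs an image path with its ground-truth sentence, which is the form most OCR training loops expect.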

Sample line image-text pairs

3.png

Text: नज़रिया: गोरखपुर, नागपुर और दिल्ली के त्रिकोण में फंसा है 2019

19.png

Text: आईएस के शासन में कैसा है मोसुल में जीवन

Font distribution of dataset

[Figure: Font File vs Average Font Size]

[Figure: Font File vs Word Count]
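
The per-font statistics named in this section could be recomputed from data.csv along these lines. This is a minimal sketch, assuming the same five-column header described under Structure. Whether the word-count plot shows a total or an average per font is not stated, so the total is used here as an assumption, and the function name `font_statistics` is hypothetical.

```python
import csv
from collections import defaultdict


def font_statistics(csv_path):
    """Return {font_file: (average font_size, total word_count)}.

    Summing word_count per font is an assumption; the original plot may
    show a different aggregate.
    """
    sizes = defaultdict(list)
    words = defaultdict(int)
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            font = row["font_file"]
            sizes[font].append(float(row["font_size"]))
            words[font] += int(float(row["word_count"]))
    # Every font in sizes has at least one entry, so the division is safe.
    return {
        font: (sum(values) / len(values), words[font])
        for font, values in sizes.items()
    }
```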
