老下头

Hindi OCR synthetic line image-text pair

Tags: transfer learning, image text recognition, BERT, DeiT, Hindi

￥8

734.76MB

Data ID: D17174917784042533

Published: 2024/06/04

This dataset contains pairs of Hindi sentence images and their corresponding text. It was created using the TextToImageGenerator repository and is intended for OCR-related tasks. It is inspired by the IAM Handwriting Database. See the TrOCR_HI notebook.

Details

The dataset contains 80,000 images with a total of 630,515 words. According to this research paper, around 90k words or 13k lines are enough to create an effective OCR model. Each image is 900 pixels wide and 64 pixels high. Eight fonts are used to create this dataset: Lohit-Devanagari.ttf, Sura-Regular.ttf, arial-unicode-ms.ttf, NotoSansDevanagari.ttf, adobedevanagari-regular.ttf, Akshar-Unicode.ttf, mangal-2.ttf and gargi.ttf.
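To put these figures in perspective, the short calculation below compares the dataset size against the thresholds attributed to the cited paper. It is only a back-of-the-envelope sketch: it assumes each image holds exactly one text line (as the dataset title suggests), and the constant names are chosen here for illustration.

```python
# Figures quoted in the dataset description.
TOTAL_IMAGES = 80_000
TOTAL_WORDS = 630_515  # 630,515 (6,30,515 in Indian digit grouping)

# Thresholds attributed to the cited research paper.
MIN_WORDS = 90_000
MIN_LINES = 13_000

# Assumption: one image corresponds to one text line.
avg_words_per_image = TOTAL_WORDS / TOTAL_IMAGES
words_ratio = TOTAL_WORDS / MIN_WORDS
lines_ratio = TOTAL_IMAGES / MIN_LINES

print(f"average words per image: {avg_words_per_image:.2f}")
print(f"words relative to the 90k threshold: {words_ratio:.1f}x")
print(f"lines relative to the 13k threshold: {lines_ratio:.1f}x")
```

Under these assumptions the dataset averages just under eight words per image and exceeds both thresholds several times over.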

Structure

Images are in the output_images folder. The TestSamples folder contains 9 images randomly taken from the internet for testing. The data.csv file contains the annotations, with the following columns:

image_file - The name of the image file

text - The sentence used in the image

font_size - Size of the font (in pixels) used in the image

font_file - The font used in the image

word_count - The total number of words present in the image
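The columns listed above can be read with Python's standard csv module. The following is a minimal sketch, not the dataset author's loader: it assumes data.csv has a header row with exactly these column names, is UTF-8 encoded, and stores font_size and word_count as plain numbers. The `LineSample` class and the function names are hypothetical helpers introduced here.

```python
import csv
from dataclasses import dataclass
from pathlib import Path
from typing import Iterable, List


@dataclass
class LineSample:
    image_file: str   # name of the image file
    text: str         # sentence rendered in the image
    font_size: float  # font size in pixels
    font_file: str    # font used to render the image
    word_count: int   # number of words in the image


def parse_annotations(lines: Iterable[str]) -> List[LineSample]:
    """Parse CSV lines (header row first) into LineSample records."""
    reader = csv.DictReader(lines)
    return [
        LineSample(
            image_file=row["image_file"],
            text=row["text"],
            font_size=float(row["font_size"]),
            font_file=row["font_file"],
            word_count=int(row["word_count"]),
        )
        for row in reader
    ]


def load_annotations(csv_path: str) -> List[LineSample]:
    """Read the annotation file from disk (UTF-8 assumed for Devanagari text)."""
    with Path(csv_path).open(newline="", encoding="utf-8") as handle:
        return parse_annotations(handle)
```

With the archive extracted, `load_annotations("data.csv")` would return one record per image. If image_file holds a bare file name, the image path would be `Path("output_images") / sample.image_file`, though that layout is an assumption based on the folder description.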

Sample line image text pair

Text: नज़रिया: गोरखपुर, नागपुर और दिल्ली के त्रिकोण में फंसा है 2019

Text: आईएस के शासन में कैसा है मोसुल में जीवन

Font distribution of dataset

[Figure: Font File vs Average Font Size]

[Figure: Font File vs Word Count]
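The per-font summaries behind these two charts can be approximated from data.csv. The sketch below groups rows by font_file and reports the image count, mean font size, and total word count per font. It assumes the column names described in the Structure section, and it is not known whether the original word-count chart shows totals or averages, so totals are used here. `FontStats` and `font_distribution` are hypothetical names.

```python
import csv
from collections import defaultdict
from typing import Dict, Iterable, Mapping, NamedTuple


class FontStats(NamedTuple):
    images: int           # number of images rendered with this font
    avg_font_size: float  # mean font size in pixels
    total_words: int      # sum of word_count over those images


def font_distribution(rows: Iterable[Mapping[str, str]]) -> Dict[str, FontStats]:
    """Aggregate annotation rows by font_file."""
    counts = defaultdict(int)
    size_sums = defaultdict(float)
    word_sums = defaultdict(int)
    for row in rows:
        font = row["font_file"]
        counts[font] += 1
        size_sums[font] += float(row["font_size"])
        word_sums[font] += int(row["word_count"])
    return {
        font: FontStats(counts[font], size_sums[font] / counts[font], word_sums[font])
        for font in counts
    }


def font_distribution_from_csv(csv_path: str) -> Dict[str, FontStats]:
    """Convenience wrapper that reads the annotation CSV from disk."""
    with open(csv_path, newline="", encoding="utf-8") as handle:
        return font_distribution(csv.DictReader(handle))
```

Calling `font_distribution_from_csv("data.csv")` would yield one `FontStats` entry per font file, which can then be plotted with any charting library.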
