The following is the data validation report that the seller chose to provide:
Data Description
This dataset contains pairs of Hindi sentence images and their corresponding text. It was created using the TextToImageGenerator repository and is intended for OCR-related tasks. It is inspired by the IAM Handwriting Database. For a usage example, see the TrOCR_HI Notebook.
Details
The dataset contains 80,000 images with a total of 630,515 words (6,30,515 in Indian digit grouping). According to this research paper, around 90k words or 13k lines are enough to create an effective OCR model. Each image is 900 pixels wide and 64 pixels high. Eight fonts are used to create this dataset: Lohit-Devanagari.ttf, Sura-Regular.ttf, arial-unicode-ms.ttf, NotoSansDevanagari.ttf, adobedevanagari-regular.ttf, Akshar-Unicode.ttf, mangal-2.ttf and gargi.ttf.
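To put these figures in proportion, the short Python sketch below works out the average number of words per image and how far the dataset exceeds the thresholds quoted above. It assumes that each image holds exactly one text line, which the description suggests but does not state; the function and constant names are illustrative only.

```python
# Figures quoted in the description above.
TOTAL_IMAGES = 80_000
TOTAL_WORDS = 630_515  # written as 6,30,515 in Indian digit grouping

# Thresholds the cited paper is said to give for an effective OCR model.
ENOUGH_WORDS = 90_000
ENOUGH_LINES = 13_000


def dataset_ratios(images=TOTAL_IMAGES, words=TOTAL_WORDS):
    """Return words per image and the ratio of the dataset to each threshold."""
    return {
        "words_per_image": words / images,
        "words_vs_threshold": words / ENOUGH_WORDS,
        # Assumption: one image corresponds to one text line.
        "lines_vs_threshold": images / ENOUGH_LINES,
    }


ratios = dataset_ratios()
print({name: round(value, 2) for name, value in ratios.items()})
```

Under that assumption, the dataset averages just under 8 words per image and is roughly 7 times the word threshold and 6 times the line threshold.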
Structure
The images are in the output_images folder. The TestSamples folder contains 9 images taken at random from the internet for testing. The data.csv file contains the annotations, with the following columns:
image_file - The name of the image file
text - The sentence used in the image
font_size - The size of the font (in pixels) used in the image
font_file - The font used in the image
word_count - The total number of words present in the image
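As a rough illustration of how the annotation columns listed above might be read, the following Python sketch parses a CSV with those five columns using only the standard library. It assumes that data.csv has a header row with exactly these column names and is UTF-8 encoded; the file name 0.png and the font size 32 in the demo row are made up, and load_annotations is a hypothetical helper, not part of the dataset's tooling.

```python
import csv
import io

EXPECTED_COLUMNS = ["image_file", "text", "font_size", "font_file", "word_count"]


def load_annotations(csv_file):
    """Read annotation rows from an open text file and convert numeric columns."""
    reader = csv.DictReader(csv_file)
    missing = [c for c in EXPECTED_COLUMNS if c not in (reader.fieldnames or [])]
    if missing:
        raise ValueError(f"annotation file is missing columns: {missing}")
    rows = []
    for row in reader:
        # Assumption: font_size is numeric and word_count is a whole number.
        row["font_size"] = float(row["font_size"])
        row["word_count"] = int(row["word_count"])
        rows.append(row)
    return rows


# Demo with an in-memory CSV; the file name and font size are invented.
demo_csv = io.StringIO(
    "image_file,text,font_size,font_file,word_count\n"
    "0.png,आईएस के शासन में कैसा है मोसुल में जीवन,32,gargi.ttf,9\n"
)
rows = load_annotations(demo_csv)
print(rows[0]["font_file"], rows[0]["word_count"])

# With the real file, something like this should work:
# with open("data.csv", encoding="utf-8", newline="") as f:
#     rows = load_annotations(f)
```

Splitting the text column on whitespace gives a quick consistency check against word_count, assuming words were counted that way when the dataset was generated.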
Sample line image and text pairs
Text: नज़रिया: गोरखपुर, नागपुर और दिल्ली के त्रिकोण में फंसा है 2019
Text: आईएस के शासन में कैसा है मोसुल में जीवन
Font distribution of dataset
Font File vs Average Font Size
Font File vs Word Count
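The three summaries named above (font distribution, average font size per font, and word count per font) could be recomputed from the annotation rows along the following lines. This is only a sketch: it assumes "Word Count" means the total number of words per font rather than an average, and the demo rows are invented values, not real dataset entries.

```python
from collections import defaultdict


def summarize_by_font(rows):
    """Aggregate image count, average font size, and total words per font file."""
    totals = defaultdict(lambda: {"images": 0, "size_sum": 0.0, "words": 0})
    for row in rows:
        entry = totals[row["font_file"]]
        entry["images"] += 1
        entry["size_sum"] += float(row["font_size"])
        entry["words"] += int(row["word_count"])
    return {
        font: {
            "images": entry["images"],
            "avg_font_size": entry["size_sum"] / entry["images"],
            "word_count": entry["words"],
        }
        for font, entry in totals.items()
    }


# Invented rows, shaped like the data.csv columns.
demo_rows = [
    {"font_file": "gargi.ttf", "font_size": "30", "word_count": "9"},
    {"font_file": "gargi.ttf", "font_size": "34", "word_count": "7"},
    {"font_file": "mangal-2.ttf", "font_size": "28", "word_count": "11"},
]
summary = summarize_by_font(demo_rows)
for font, stats in summary.items():
    print(font, stats)
```

Each per-font dictionary maps directly onto one of the three charts, so the result can be passed to any plotting library.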
