老下头

IM2LATEX-100K

earth and nature, education

￥20

630.71MB

Dataset ID: D17174933692691404

Published: 2024/06/04

A prebuilt dataset for OpenAI's image-to-LaTeX task. It contains about 100k formulas and their images, split into train, validation, and test sets. The formulas were parsed from the LaTeX sources provided here: http://www.cs.cornell.edu/projects/kddcup/datasets.html (originally from arXiv).

Each image is a fixed-size PNG. The formula is drawn in black and the rest of the image is transparent.

For related tools (e.g. a tokenizer), see this repository: https://github.com/Miffyli/im2latex-dataset
For pre-made evaluation scripts and a built im2latex system, see this repository: https://github.com/harvardnlp/im2markup

The newlines in formulas_im2latex.lst are UNIX-style (\n). Reading the file with any other newline convention yields a slightly wrong number of lines (104563 instead of 103558) and thus breaks the structure used by this dataset. By default, Python 3.x opens text files in universal-newlines mode, which also treats \r and \r\n as line endings. To avoid this, open the file with newline="\n" (e.g. open("formulas_im2latex.lst", newline="\n")).
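A minimal sketch of the difference, using a tiny stand-in file rather than the real formulas_im2latex.lst. The helper name read_formulas, the sample content, and the latin-1 encoding are assumptions made for illustration. The dataset description does not state an encoding; latin-1 is used here only because it decodes any byte sequence without error.

```python
import os
import tempfile


def read_formulas(path, encoding="latin-1"):
    """Return one formula per line, splitting on "\n" only."""
    # newline="\n" makes Python end lines at "\n" only, so a stray "\r"
    # inside a formula stays part of that formula.
    # encoding="latin-1" is an assumption, not something the dataset states.
    with open(path, newline="\n", encoding=encoding) as f:
        return [line.rstrip("\n") for line in f]


# Hypothetical stand-in file: three lines, the second contains a stray "\r".
fd, demo_path = tempfile.mkstemp(suffix=".lst")
with os.fdopen(fd, "wb") as f:
    f.write(b"x^2\ny \r z\nw\n")

strict = read_formulas(demo_path)

# Default open() (newline=None) also treats "\r" as a line ending.
with open(demo_path, encoding="latin-1") as f:
    universal = f.readlines()

os.remove(demo_path)

print(len(strict), len(universal))  # prints: 3 4
```

On the real file, len(read_formulas("formulas_im2latex.lst")) should give the 103558 lines quoted in the description above, if that description is accurate.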

For more information, see the source of this dataset: https://zenodo.org/record/56198#.YHM2xRQzbvd

