ocr
Awesome multilingual OCR toolkits based on PaddlePaddle (a practical, ultra-lightweight OCR system that supports recognition of 80+ languages, provides data annotation and synthesis tools, and supports training and…
Tesseract Open Source OCR Engine (main repository)
pix2tex: Using a ViT to convert images of equations into LaTeX code.
Implementation of Nougat Neural Optical Understanding for Academic Documents
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
A synthetic data generator for text recognition
Official Implementation of SynthTIGER (Synthetic Text Image Generator), ICDAR 2021
Handwriting Synthesis with RNNs ✏️
DocBank: A Benchmark Dataset for Document Layout Analysis
This repo is used to release the ArxivFormula dataset.
UniMERNet: A Universal Network for Real-World Mathematical Expression Recognition
A Comprehensive Toolkit for High-Quality PDF Content Extraction
A collection of papers on document image processing methods, including appearance enhancement, deshadowing, dewarping, deblurring, and binarization.
A one-stop, open-source, high-quality data extraction tool that converts PDF to Markdown and JSON.
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Convert PDF to markdown + JSON quickly with high accuracy
An extremely fast LaTeX formatter written in Rust