Starred repositories
A high-throughput and memory-efficient inference and serving engine for LLMs
Toolkit for linearizing PDFs for LLM datasets/training
Fully open reproduction of DeepSeek-R1
[ICLR'25] Geometric Problem Solving Through Unified Formalized Vision-Language Pre-training
🍒 Cherry Studio is a desktop client that supports multiple LLM providers. Supports deepseek-r1
A programming framework for agentic AI 🤖 PyPi: autogen-agentchat Discord: https://aka.ms/autogen-discord Office Hour: https://aka.ms/autogen-officehour
Janus-Series: Unified Multimodal Understanding and Generation Models
Drop in a screenshot and convert it to clean code (HTML/Tailwind/React/Vue)
A simple screen parsing tool towards pure vision based GUI agent
Official codebase used to develop Vision Transformer, SigLIP, MLP-Mixer, LiT and more.
Agent framework and applications built upon Qwen>=2.0, featuring Function Calling, Code Interpreter, RAG, and Chrome extension.
Vary-tiny codebase built upon LAVIS (for training from scratch) and a dataset of PDF image-text pairs (about 600k, including English and Chinese)
PDF scientific paper translation with preserved formats - AI-based full-text bilingual translation of PDF documents with the layout fully preserved; supports Google/DeepL/Ollama/OpenAI and other services; provides CLI/GUI/Docker/Zotero
Official inference repo for FLUX.1 models
A Comprehensive Benchmark for Document Parsing and Evaluation
OCR Hinders RAG: Evaluating the Cascading Impact of OCR on Retrieval-Augmented Generation
This repository contains a paper collection of the methods for document image processing, including appearance enhancement, deshadowing, dewarping, deblurring, binarization and so on.
Augmentation pipeline for rendering synthetic paper printing, faxing, scanning and copy machine processes
Virtual whiteboard for sketching hand-drawn like diagrams
Inference library for sequence-based table recognition algorithms, integrating table recognition algorithms such as PP-Structure and modelscope.
[CVPR 2024] DocRes: A Generalist Model Toward Unifying Document Image Restoration Tasks
Mobile-Agent: The Powerful Mobile Device Operation Assistant Family
DocGenome: An Open Large-scale Scientific Document Benchmark for Training and Testing Multi-modal Large Models
A curated list of awesome UI agents resources, encompassing Web, App, OS, and beyond (continually updated)
Collects the best currently open-source table recognition models, improves pre-processing and post-processing, and converts the models to ONNX.
Qwen2.5-VL is the multimodal large language model series developed by Qwen team, Alibaba Cloud.
DocLayout-YOLO: Enhancing Document Layout Analysis through Diverse Synthetic Data and Global-to-Local Adaptive Perception