- Institute of Information Engineering, Chinese Academy of Sciences
- Beijing, China
Stars
The repository provides code for running inference with the Meta Segment Anything Model 2 (SAM 2), links for downloading the trained model checkpoints, and example notebooks that show how to use th…
Official code implementation of General OCR Theory: Towards OCR-2.0 via a Unified End-to-end Model
Official implementation of the paper "Track the Answer: Extending TextVQA from Image to Video with Spatio-Temporal Clues"
[2024-NeurIPS] TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control
Generate a transcript for your favourite Manga: Detect manga characters, text blocks and panels. Order panels. Cluster characters. Match texts to their speakers. Perform OCR.
Official Implementation of OCR-free Document Understanding Transformer (Donut) and Synthetic Document Generator (SynthDoG), ECCV 2022
[ECCV 2024] Elysium: Exploring Object-level Perception in Videos via MLLM
The official repo for [CVPR'23] "DeepSolo: Let Transformer Decoder with Explicit Points Solo for Text Spotting" & [ArXiv'23] "DeepSolo++: Let Transformer Decoder with Explicit Points Solo for Multi…
PyTorch implementation of the papers "ARTrack" and "ARTrackV2"
[NeurIPS'24] GoMatching: A Simple Baseline for Video Text Spotting via Long and Short Term Matching
(CVPR 2024) Bridging the Gap Between End-to-End and Two-Step Text Spotting.
[NeurIPS2021] BOVText: A Large-Scale, Multidimensional Multilingual Dataset for Video Text Spotting
✨✨ Scene-Text Grounding for Text-Based Video Question Answering (arxiv)
Qwen2-VL is the multimodal large language model series developed by the Qwen team, Alibaba Cloud.
[IJCV 2024] TransDETR: End-to-end Video Text Spotting with Transformer