Stars
Solve Visual Understanding with Reinforced VLMs
SGLang is a fast serving framework for large language models and vision language models.
Qwen2.5-VL is the multimodal large language model series developed by the Qwen team at Alibaba Cloud.
Aguvis: Unified Pure Vision Agents for Autonomous GUI Interaction
A Flexible Framework for Experiencing Cutting-edge LLM Inference Optimizations
Code for ChatRex: Taming Multimodal LLM for Joint Perception and Understanding
OmniGen: Unified Image Generation. https://arxiv.org/pdf/2409.11340
Includes the code for training and testing the CountGD model from the paper CountGD: Multi-Modal Open-World Counting.
A novel Multimodal Large Language Model (MLLM) architecture, designed to structurally align visual and textual embeddings.
A generative world for general-purpose robotics & embodied AI learning.
Mouse and keyboard recording and automation similar to 按键精灵 (Quick Macro); simulates clicks and keystrokes | automate mouse clicks and keyboard input
DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding
Memory-Guided Diffusion for Expressive Talking Video Generation
[CVPR 2025] DEIM: DETR with Improved Matching for Fast Convergence
[ICLR'23 Spotlight & IJCV'24] MapTR: Structured Modeling and Learning for Online Vectorized HD Map Construction
[CVPR 2025] Truncated Diffusion Model for Real-Time End-to-End Autonomous Driving
A new TensorRT integration framework that makes it easy to integrate many tasks
[ACM MM 2022] Official Rail-DB and Rail-Net
[NeurIPS 2024 Best Paper][GPT beats diffusion🔥] [scaling laws in visual generation📈] Official impl. of "Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction". An *ult…
A C++ framework for programming real-time applications
Python scripts for the Segment Anything 2 (SAM2) model in ONNX
Automate browser-based workflows with LLMs and Computer Vision