Stars
A paper list of recent works on token compression for ViTs and VLMs
A simple framework for experimenting with Reinforcement Learning in Python.
A fork to add multimodal model training to open-r1
[CVPR-2024] Official implementations of CLIP-KD: An Empirical Study of CLIP Model Distillation
LLaVA-Mini is a unified large multimodal model (LMM) that efficiently supports the understanding of images, high-resolution images, and videos.
iLLaVA: An Image is Worth Fewer Than 1/3 Input Tokens in Large Multimodal Models
A framework enabling autonomous Android and computer use with any LLM (local or remote)
A bibliography and survey of the papers surrounding o1
A library for advanced large language model reasoning
Training Large Language Models to Reason in a Continuous Latent Space
Implementation of 🥥 Coconut, Chain of Continuous Thought, in PyTorch
Valley is a cutting-edge multimodal large model designed to handle a variety of tasks involving text, images, and video data.
A Simple Framework of Small-scale Large Multimodal Models for Video Understanding Based on TinyLLaVA_Factory.
A series of technical reports on Slow Thinking with LLMs
Official code for Paper "Mantis: Multi-Image Instruction Tuning" (TMLR2024)
GeoPixel: A Pixel Grounding Large Multimodal Model for Remote Sensing is specifically developed for high-resolution remote sensing image analysis, offering advanced multi-target pixel grounding capabilities.
RUCAIBox/Virgo, forked from Richar-Du/Virgo: official code of *Virgo: A Preliminary Exploration on Reproducing o1-like MLLM*
Everything you need to build state-of-the-art foundation models, end-to-end.
Fully open reproduction of DeepSeek-R1
[ECCV2024] Grounded Multimodal Large Language Model with Localized Visual Tokenization
Clean, minimal, accessible reproduction of DeepSeek R1-Zero