Stars
[CVPR 2024 Oral] InternVL Family: A Pioneering Open-Source Alternative to GPT-4o. An open-source multimodal dialogue model whose performance approaches GPT-4o.
VILA is a family of state-of-the-art vision language models (VLMs) for diverse multimodal AI tasks across the edge, data center, and cloud.
Fast inference from large language models via speculative decoding
Implementation of the paper "Data Engineering for Scaling Language Models to 128K Context"
③ [ICML 2024] [IQA, IAA, VQA] All-in-one foundation model for visual scoring; can be efficiently fine-tuned on downstream datasets.
[ICML 2024] AlphaZero-like tree search can guide large language model decoding and training
E5-V: Universal Embeddings with Multimodal Large Language Models
[ACM'MM 2024 Oral] Official code for "OneChart: Purify the Chart Structural Extraction via One Auxiliary Token"
Intervening Anchor Token: Decoding Strategy in Alleviating Hallucinations for MLLMs
② [CVPR 2024] Low-level visual instruction tuning, with a 200K dataset and a model zoo of fine-tuned checkpoints.
AnchorAttention: Improved attention for LLM long-context training
Official repository for the paper "MG-LLaVA: Towards Multi-Granularity Visual Instruction Tuning" (https://arxiv.org/abs/2406.17770).
[NeurIPS 2024] Repo for the paper "ControlMLLM: Training-Free Visual Prompt Learning for Multimodal Large Language Models"
Official implementation for the CVPR 2023 paper "Re-IQA: Unsupervised Learning for Image Quality Assessment in the Wild"
DepictQA: Depicted Image Quality Assessment with Vision Language Models
[EMNLP 2024 Findings🔥] Official implementation of "LOOK-M: Look-Once Optimization in KV Cache for Efficient Multimodal Long-Context Inference"
[NeurIPS 2024 D&B] Official dataloader and evaluation scripts for LongVideoBench.
Code for the ICML 2024 paper "Video-of-Thought: Step-by-Step Video Reasoning from Perception to Cognition"
A collection of utilities that are not polished implementations but may be useful to users
Gödel Agent: A Self-Referential Agent Framework for Recursive Self-Improvement
A lightweight, flexible Video-MLLM developed by the Tencent QQ Multimedia Research Team.
[ECCV 2024] Official PyTorch implementation of "A Comprehensive Study of Multimodal Large Language Models for Image Quality Assessment"
Implementation for the research paper "Enhancing LLM Reasoning via Critique Models with Test-Time and Training-Time Supervision".
[preprint] We propose Separate Memory and Reasoning, a novel fine-tuning method that combines prompt tuning with LoRA.
FocusLLM: Scaling LLM’s Context by Parallel Decoding
Official PyTorch implementation of "Scaling Up Personalized Image Aesthetic Assessment via Task Vector Customization" (ECCV 2024)
Official code for CVPR 2024 paper, "SC-Tune: Unleashing Self-Consistent Referential Comprehension in Large Vision Language Models"