🔥🔥🔥 A curated list of resources on Multimodal Large Language Models (MLLM), covering datasets, multimodal instruction tuning, multimodal in-context learning, multimodal chain-of-thought, LLM-aided visual reasoning, foundation models, and others.
🔥🔥🔥 This list will be updated in real time.
🔥🔥🔥 A survey paper on MLLM is in preparation and will be released soon.
You are welcome to join our WeChat group for MLLM discussion!
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
MIMIC-IT: Multi-Modal In-Context Instruction Tuning | arXiv | 2023-06-08 | Github | Demo |
Chameleon: Plug-and-Play Compositional Reasoning with Large Language Models | arXiv | 2023-04-19 | Github | Demo |
HuggingGPT: Solving AI Tasks with ChatGPT and its Friends in HuggingFace | arXiv | 2023-03-30 | Github | Demo |
MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action | arXiv | 2023-03-20 | Github | Demo |
Prompting Large Language Models with Answer Heuristics for Knowledge-based Visual Question Answering | CVPR | 2023-03-03 | Github | - |
Visual Programming: Compositional visual reasoning without training | CVPR | 2022-11-18 | Github | Local Demo |
An Empirical Study of GPT-3 for Few-Shot Knowledge-Based VQA | AAAI | 2022-06-28 | Github | - |
Flamingo: a Visual Language Model for Few-Shot Learning | NeurIPS | 2022-04-29 | Github | Demo |
Multimodal Few-Shot Learning with Frozen Language Models | NeurIPS | 2021-06-25 | - | - |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Transfer Visual Prompt Generator across LLMs | arXiv | 2023-05-02 | Github | Demo |
GPT-4 Technical Report | arXiv | 2023-03-15 | - | - |
PaLM-E: An Embodied Multimodal Language Model | arXiv | 2023-03-06 | - | Demo |
Language Is Not All You Need: Aligning Perception with Language Models | arXiv | 2023-02-27 | Github | - |
BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models | arXiv | 2023-01-30 | Github | Demo |
Title | Venue | Date | Code | Demo |
---|---|---|---|---|
Can Large Pre-trained Models Help Vision Models on Perception Tasks? | arXiv | 2023-06-01 | Coming soon | - |
Contextual Object Detection with Multimodal Large Language Models | arXiv | 2023-05-29 | Github | Demo |
On Evaluating Adversarial Robustness of Large Vision-Language Models | arXiv | 2023-05-26 | Github | - |
Evaluating Object Hallucination in Large Vision-Language Models | arXiv | 2023-05-17 | Github | - |
Name | Paper | Link | Notes |
---|---|---|---|
MIMIC-IT | MIMIC-IT: Multi-Modal In-Context Instruction Tuning | Coming soon | Multimodal in-context instruction dataset |
Name | Paper | Link | Notes |
---|---|---|---|
EgoCOT | EmbodiedGPT: Vision-Language Pre-Training via Embodied Chain of Thought | Coming soon | Large-scale embodied planning dataset |
VIP | Let’s Think Frame by Frame: Evaluating Video Chain of Thought with Video Infilling and Prediction | Coming soon | An inference-time dataset that can be used to evaluate VideoCOT |
ScienceQA | Learn to Explain: Multimodal Reasoning via Thought Chains for Science Question Answering | Link | Large-scale multiple-choice dataset of multimodal science questions spanning diverse domains |