VILA: On Pre-training for Visual Language Models

VILA arxiv / VILA Demo / VILA Huggingface

💡 Introduction

VILA is a visual language model (VLM) pretrained with interleaved image-text data at scale, enabling video understanding and multi-image understanding capabilities. VILA is deployable on the edge via AWQ 4-bit quantization and the TinyChat framework. We find: (1) image-text pairs are not enough, and interleaved image-text data is essential; (2) unfreezing the LLM during interleaved image-text pre-training enables in-context learning; (3) re-blending text-only instruction data is crucial to boost both VLM and text-only performance; (4) token compression extends the number of video frames. VILA unveils appealing capabilities, including video reasoning, in-context learning, visual chain-of-thought, and better world knowledge.

💡 News

[2024/06] VILA1.5 is now the best open-source VLM on the MMMU leaderboard and the Video-MME leaderboard!
[2024/05] We moved our previous repo to NVlabs! All future developments will be updated here.
[2024/05] We release VILA-1.5, which offers video understanding capability. VILA-1.5 comes with four model sizes: 3B/8B/13B/40B.
[2024/05] We release AWQ-quantized 4bit VILA-1.5 models. VILA-1.5 is efficiently deployable on diverse NVIDIA GPUs (A100, 4090, 4070 Laptop, Orin, Orin Nano) by TinyChat and TensorRT-LLM backends.
[2024/03] VILA has been accepted by CVPR 2024!
[2024/02] We release AWQ-quantized 4bit VILA models, deployable on Jetson Orin and laptops through TinyChat and TinyChatEngine.
[2024/02] VILA is released. We propose interleaved image-text pretraining that enables multi-image VLM. VILA comes with impressive in-context learning capabilities. We open-source everything, including training code, evaluation code, datasets, and model checkpoints.
[2023/12] Paper is on arXiv!

Performance

Image QA Benchmarks

Model	Prec.	VQAv2	GQA	VizWiz	SQA-I	VQA-T	POPE	MME	MMB	MMB-CN	SEED	SEED-I	MMMU (val)	MMMU (test)	llava-bench	MM-Vet	Average
VILA1.5-3B	fp16	80.4	61.5	53.5	69.0	60.4	85.9	1442.44	63.4	52.7	60.9	67.9	33.3	30.8	75.9	35.4	60.2
VILA1.5-3B-AWQ	int4	80.0	61.1	53.8	67.8	60.4	85.9	1437.34	63.3	51.4	59.8	66.6	32.7	31.1	75.0	37.3	59.9
VILA1.5-3B-S2	fp16	79.8	61.4	61.3	69.6	63.4	85.3	1431.65	62.8	52.2	60.0	66.4	32.8	31.3	76.7	38.6	60.9
VILA1.5-3B-S2-AWQ	int4	79.4	61.3	62.3	69.2	63.0	85.8	1417.06	61.6	51.5	59.1	65.7	33.4	30.4	77.1	36.7	60.5
Llama-3-VILA1.5-8B	fp16	80.9	61.9	58.7	79.9	66.3	84.4	1577.01	72.3	66.2	64.2	71.4	36.9	36.0	80.0	38.3	65.1
Llama-3-VILA1.5-8B-AWQ	int4	80.3	61.7	59.3	79.0	65.4	82.9	1593.65	71.0	64.9	64.0	71.1	36.0	36.1	79.0	37.2	64.5
VILA1.5-13B	fp16	82.8	64.3	62.6	80.1	65.0	86.3	1569.55	74.9	66.3	65.1	72.6	37.9	33.6	80.8	44.3	66.3
VILA1.5-13B-AWQ	int4	82.7	64.5	63.3	79.7	64.7	86.7	1531.35	74.7	66.7	65.1	72.6	37.8	34.0	81.9	46.4	66.5
VILA1.5-40B	fp16	84.3	64.6	62.2	87.2	73.6	87.3	1726.82	82.4	80.2	69.1	75.8	51.9	46.9	81.3	53.0	72.4
VILA1.5-40B-AWQ	int4	84.1	64.4	61.3	86.7	73.2	88.2	1714.79	83.2	79.6	68.9	75.6	49.3	46.2	83.0	51.4	72.1

NOTE: VQAv2 and VizWiz are evaluated on test-dev. The average accuracy is calculated over all datasets, with MME scores divided by 20.
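As a quick check of how the Average column is derived, the sketch below recomputes it for the VILA1.5-3B (fp16) row. The scores are copied from the table; the function name `table_average` is only an illustrative helper, not part of the VILA codebase.

```python
# Scores for VILA1.5-3B (fp16), copied from the Image QA table.
scores = {
    "VQAv2": 80.4, "GQA": 61.5, "VizWiz": 53.5, "SQA-I": 69.0, "VQA-T": 60.4,
    "POPE": 85.9, "MME": 1442.44, "MMB": 63.4, "MMB-CN": 52.7, "SEED": 60.9,
    "SEED-I": 67.9, "MMMU (val)": 33.3, "MMMU (test)": 30.8,
    "llava-bench": 75.9, "MM-Vet": 35.4,
}


def table_average(row):
    """Average all benchmark columns, dividing the MME score by 20 first."""
    normalized = [value / 20 if name == "MME" else value
                  for name, value in row.items()]
    return sum(normalized) / len(normalized)


average = table_average(scores)
print(round(average, 1))  # matches the 60.2 reported in the Average column
```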

Video QA Benchmarks

Model	Prec.	Perception Test	ActivityNet	MSVD	MSRVTT	TGIF
VILA1.5-3B	fp16	39.3	50.2	76.6	57.5	51.7
VILA1.5-3B-S2	fp16	39	50.7	76.9	57.6	51.7
Llama-3-VILA1.5-8B	fp16	41.8	54.3	78.3	60.1	54.1
VILA1.5-13B	fp16	39.3	54.7	77.9	60.2	56
VILA1.5-40B	fp16	41.7	58	80.1	63	58.2

Inference speed (tokens/sec)

Model	Precision	A100	4090	Orin
VILA1.5-3B	fp16	104.6	137.6	25.4
VILA1.5-3B-AWQ	int4	182.8	215.5	42.5
VILA1.5-3B-S2	fp16	104.3	137.2	24.6
VILA1.5-3B-S2-AWQ	int4	180.2	219.3	40.1
Llama-3-VILA1.5-8B	fp16	74.9	57.4	10.2
Llama-3-VILA1.5-8B-AWQ	int4	168.9	150.2	28.7
VILA1.5-13B	fp16	50.9	OOM	6.1
VILA1.5-13B-AWQ	int4	115.9	105.7	20.6
VILA1.5-40B	fp16	OOM	OOM	--
VILA1.5-40B-AWQ	int4	57.0	OOM	--

NOTE: Measured using the TinyChat backend at batch size = 1.
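To read the table as a speedup rather than raw throughput, the following sketch divides the int4 (AWQ) tokens/sec by the fp16 tokens/sec on A100. The numbers are copied from the table; VILA1.5-40B is left out because its fp16 run is OOM. The helper name `awq_speedup` is ours, chosen for illustration.

```python
# Tokens/sec on A100 from the inference-speed table: (fp16, int4 AWQ).
a100_throughput = {
    "VILA1.5-3B": (104.6, 182.8),
    "VILA1.5-3B-S2": (104.3, 180.2),
    "Llama-3-VILA1.5-8B": (74.9, 168.9),
    "VILA1.5-13B": (50.9, 115.9),
}


def awq_speedup(fp16_tps, int4_tps):
    """Return how many times faster the int4 model decodes than fp16."""
    return int4_tps / fp16_tps


speedups = {model: awq_speedup(*tps) for model, tps in a100_throughput.items()}
for model, speedup in speedups.items():
    print(f"{model}: {speedup:.2f}x")
```

On these numbers the relative gain from AWQ is larger for the 8B and 13B models than for the 3B models.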

VILA Examples

Video captioning

7ko9e-AGmbM.12_0_217_out.mp4

Prompt: Elaborate on the visual and narrative elements of the video in detail.

Caption: The video shows a person's hands working on a white surface. They are folding a piece of fabric with a checkered pattern in shades of blue and white. The fabric is being folded into a smaller, more compact shape. The person's fingernails are painted red, and they are wearing a black and red garment. There are also a ruler and a pencil on the surface, suggesting that measurements and precision are involved in the process.

In context learning

Multi-image reasoning

VILA on Jetson Orin

VILA-13B_Orin_deer.mp4.mp4

VILA on RTX 4090

vila_4090_two_cars_3x.mp4

Installation

./environment_setup.sh

or follow the instructions below in order.

conda create -n vila python=3.10 -y # make sure you install python 3.10
conda activate vila

pip install --upgrade pip  # enable PEP 660 support
# optional: skip this step if you prefer the system's built-in nvcc.
conda install -c nvidia cuda-toolkit -y
wget https://github.com/Dao-AILab/flash-attention/releases/download/v2.4.2/flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl
pip install -e .
pip install -e ".[train]"

pip install git+https://github.com/huggingface/transformers@v4.36.2
site_pkg_path=$(python -c 'import site; print(site.getsitepackages()[0])')
cp -rv ./llava/train/transformers_replace/* "$site_pkg_path/transformers/"
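The prebuilt flash-attention wheel used above is tied to a specific Python version (cp310), which is why Python 3.10 is required. As a small sanity check, the sketch below parses the wheel filename and compares its Python tag with the running interpreter. The helper names are illustrative, and the parser assumes a wheel name without the optional build tag, which holds for this file.

```python
import sys

WHEEL = "flash_attn-2.4.2+cu118torch2.0cxx11abiFALSE-cp310-cp310-linux_x86_64.whl"


def parse_wheel_name(filename):
    """Split a wheel filename (without a build tag) into its five fields."""
    stem = filename[: -len(".whl")]
    name, version, python_tag, abi_tag, platform_tag = stem.split("-")
    return {
        "name": name,
        "version": version,
        "python": python_tag,
        "abi": abi_tag,
        "platform": platform_tag,
    }


def interpreter_tag(version_info=sys.version_info):
    """Build the CPython tag (for example cp310) for an interpreter version."""
    return f"cp{version_info[0]}{version_info[1]}"


tags = parse_wheel_name(WHEEL)
if interpreter_tag() == tags["python"]:
    print("Interpreter matches the wheel's Python tag.")
else:
    print(f"Wheel targets {tags['python']}, interpreter is {interpreter_tag()}.")
```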

Training

VILA training consists of three steps, described below. For specific hyperparameters, check the scripts/v1_5 folder.

Step-1: Alignment

We use the LLaVA-CC3M-Pretrain-595K dataset to align the textual and visual modalities.

The stage 1 script takes two parameters and can run on a single 8xA100 node. BASE_MODEL_PATH points to an online or local Hugging Face repository, such as NousResearch/Llama-2-7b-hf. OUTPUT_NAME points to a target directory under checkpoints, where the trained multimodal projector will be saved.

bash scripts/v1_5/paper/1_mm_align.sh [BASE_MODEL_PATH] [OUTPUT_NAME]

Step-2: Pretraining

We use the MMC4 and Coyo datasets to train the VLM with interleaved image-text pairs.

bash scripts/v1_5/paper/2_pretrain_mmc4_coyo.sh [CODE_PATH] [BASE_MODEL_PATH] [STAGE1_PATH] [OUTPUT_NAME]

The stage 2 script takes four arguments. CODE_PATH is the absolute path to our VILA codebase, and BASE_MODEL_PATH has the same meaning as in the stage 1 script. STAGE1_PATH points to the OUTPUT_NAME of stage 1 (i.e. where the stage 1 checkpoint is stored). OUTPUT_NAME is the desired folder name under checkpoints that saves the pretraining checkpoint. The script we provide for this stage is executed on slurm, and we expect it to run on 16 nodes (128 GPUs).

Step-3: Supervised fine-tuning

This is the last stage of VILA training, in which we tune the model to follow multimodal instructions on a subset of M3IT, FLAN, and ShareGPT4V. This stage runs on an 8xA100 node.

bash scripts/v1_5/paper/3_sft.sh [STAGE2_PATH] [OUTPUT_NAME]

The stage 3 script takes in two arguments. STAGE2_PATH points to the OUTPUT_NAME of the stage 2 script (i.e. where the stage 2 checkpoint is stored). OUTPUT_NAME is the desired folder name under checkpoints that stores the final checkpoint.
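To make the hand-off between stages explicit, the sketch below assembles the three command lines so that each stage consumes the previous stage's output. It only prints the commands, since stage 2 is expected to run under slurm. The function name, the example paths, and the assumption that a stage's checkpoint lives at checkpoints/OUTPUT_NAME are ours; adjust them to your setup.

```python
import shlex


def training_commands(code_path, base_model, stage1_name, stage2_name,
                      stage3_name, ckpt_root="checkpoints"):
    """Build the argument lists for the three training stages, in order."""
    # Assumption: each stage writes its checkpoint to <ckpt_root>/<OUTPUT_NAME>.
    stage1_path = f"{ckpt_root}/{stage1_name}"
    stage2_path = f"{ckpt_root}/{stage2_name}"
    return [
        ["bash", "scripts/v1_5/paper/1_mm_align.sh", base_model, stage1_name],
        ["bash", "scripts/v1_5/paper/2_pretrain_mmc4_coyo.sh",
         code_path, base_model, stage1_path, stage2_name],
        ["bash", "scripts/v1_5/paper/3_sft.sh", stage2_path, stage3_name],
    ]


# Hypothetical names, for illustration only.
commands = training_commands(
    code_path="/path/to/VILA",
    base_model="NousResearch/Llama-2-7b-hf",
    stage1_name="vila-stage1",
    stage2_name="vila-stage2",
    stage3_name="vila-stage3",
)
for command in commands:
    print(shlex.join(command))
```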

Evaluations

Image Benchmarks

You can follow the LLaVA-1.5 evaluation guide to download all datasets. After downloading them, please put them under playground/data/eval.

Please make the following changes to the MME evaluation script. Please search for:

data_path='MME_Benchmark_release_version'

and replace it with:

data_path=os.path.join(script_dir, 'MME_Benchmark_release_version')
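The replacement line relies on a script_dir variable. If the script does not already define one, a common way to obtain it is from the script's own location, as in the sketch below. This is an assumption about how script_dir is meant to be set, and the function name and the example file name calculation.py are placeholders.

```python
import os


def mme_data_path(script_file):
    """Resolve the MME data folder relative to the script's own directory."""
    script_dir = os.path.dirname(os.path.abspath(script_file))
    return os.path.join(script_dir, "MME_Benchmark_release_version")


# Inside the evaluation script this would be called with __file__;
# a placeholder path is used here so the example runs on its own.
data_path = mme_data_path(os.path.join("some", "dir", "calculation.py"))
print(data_path)
```

Resolving the path this way lets the script find the data regardless of the current working directory.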

We provide a push-the-button script to perform evaluation on all 10 datasets that do not require GPT-assisted evaluation:

./scripts/v1_5/eval/eval_all.sh [CHECKPOINT_PATH] [MODEL_NAME] [CONV_MODE]

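This script takes three parameters: CHECKPOINT_PATH points to the stage 3 model checkpoint, MODEL_NAME will be the name of the evaluation results, and CONV_MODE is the conversation mode for the checkpoint (the inference commands in this README use llama_3, hermes-2, and vicuna_v1).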

VQAv2 and VizWiz evaluations are hosted on eval.ai. You need to register an account and create a team to be able to submit results.

MMBench and MMBench_CN evaluations are hosted on another evaluation server. Make sure you change the name of the file before each submission; otherwise the server caches results and will keep returning the wrong result.
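One simple way to guarantee a fresh file name for every upload is to copy the prediction file under a timestamped name, as sketched below. The helper and the file names are hypothetical and not part of the VILA scripts; the demo uses a temporary directory so it runs without any real predictions.

```python
import shutil
import tempfile
import time
from pathlib import Path


def unique_submission_copy(prediction_file, model_name):
    """Copy a prediction file next to itself under a timestamped name."""
    src = Path(prediction_file)
    stamp = time.strftime("%Y%m%d-%H%M%S")
    dst = src.with_name(f"{src.stem}_{model_name}_{stamp}{src.suffix}")
    shutil.copyfile(src, dst)
    return dst


# Demo with a dummy file; replace with the real prediction file.
with tempfile.TemporaryDirectory() as workdir:
    dummy = Path(workdir) / "mmbench_predictions.xlsx"
    dummy.write_text("placeholder")
    copy_path = unique_submission_copy(dummy, "my-vila-run")
    print(copy_path.name)
```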

We provide a quick script to automatically organize the prediction files that need to be submitted to servers:

python scripts/v1_5/eval/copy_predictions.py [MODEL_NAME]

You will be able to find the predictions under playground/data/predictions_upload/[MODEL_NAME] after executing this script.

Video Benchmarks

Please follow the evaluation steps in Video-LLaVA for dataset preparation.

./scripts/v1_5/eval/video_chatgpt/run_all.sh [CHECKPOINT_PATH] [MODEL_NAME] [CONV_MODE]
./scripts/v1_5/eval/video_chatgpt/eval_all.sh [MODEL_NAME]

Inference

We provide snippets for quick inference with user prompts and images.

Llama-3-VILA1.5-8B inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/Llama-3-VILA1.5-8b \
    --conv-mode llama_3 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "av.png"

VILA1.5-40B inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-40b \
    --conv-mode hermes-2 \
    --query "<image>\n Please describe the traffic condition." \
    --image-file "av.png"

VILA1.5-3B video inference:

python -W ignore llava/eval/run_vila.py \
    --model-path Efficient-Large-Model/VILA1.5-3b \
    --conv-mode vicuna_v1 \
    --query "<video>\n Please describe this video." \
    --video-file "demo.mp4"
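Each checkpoint in the commands above is paired with a specific --conv-mode. The sketch below records those three pairings and assembles the same command line programmatically. The mapping covers only the checkpoints shown in this section, and the helper name `run_vila_command` is illustrative; the conversation modes for other checkpoints are not stated here.

```python
import shlex

# Pairings taken from the three example commands in this section.
CONV_MODES = {
    "Efficient-Large-Model/Llama-3-VILA1.5-8b": "llama_3",
    "Efficient-Large-Model/VILA1.5-40b": "hermes-2",
    "Efficient-Large-Model/VILA1.5-3b": "vicuna_v1",
}


def run_vila_command(model_path, prompt, media_file, video=False):
    """Build the run_vila.py argument list for an image or video query."""
    token = "<video>" if video else "<image>"
    media_flag = "--video-file" if video else "--image-file"
    return [
        "python", "-W", "ignore", "llava/eval/run_vila.py",
        "--model-path", model_path,
        "--conv-mode", CONV_MODES[model_path],
        # The raw string keeps a literal backslash-n, as the shell examples do.
        "--query", token + r"\n " + prompt,
        media_flag, media_file,
    ]


cmd = run_vila_command(
    "Efficient-Large-Model/VILA1.5-3b",
    "Please describe this video.",
    "demo.mp4",
    video=True,
)
print(shlex.join(cmd))
# To execute from the repository root: subprocess.run(cmd, check=True)
```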

Quantization and Deployment

Our VILA models are quantized by AWQ into 4 bits for efficient inference on the edge. We provide a push-the-button script to quantize VILA with AWQ.

Running VILA on GPUs and edge GPUs (Jetson Orin)

We support AWQ-quantized 4-bit VILA on GPU platforms via TinyChat. We provide a tutorial to run the model with TinyChat after AWQ quantization. We also provide instructions for launching a Gradio server (powered by TinyChat and AWQ) to serve 4-bit quantized VILA models.

Running VILA on laptops

We further support our AWQ-quantized 4bit VILA models on various CPU platforms with both x86 and ARM architectures with our TinyChatEngine. We also provide a detailed tutorial to help the users deploy VILA on different CPUs.

Checkpoints

We release VILA1.5-3B, VILA1.5-3B-S2, Llama-3-VILA1.5-8B, VILA1.5-13B, VILA1.5-40B and the 4-bit AWQ-quantized models VILA1.5-3B-AWQ, VILA1.5-3B-S2-AWQ, Llama-3-VILA1.5-8B-AWQ, VILA1.5-13B-AWQ, VILA1.5-40B-AWQ.

🔒 License

The code is released under the Apache 2.0 license as found in the LICENSE file.
The pretrained weights are released under the CC-BY-NC-SA-4.0 license.
The service is a research preview intended for non-commercial use only, and is subject to the following licenses and terms:
- Model License of LLaMA. For LLAMA3-VILA checkpoints terms of use, please refer to the LLAMA3 License for additional details.
- Terms of Use of the data generated by OpenAI
- Dataset Licenses for each one used during training.

Team


*Yao Lu: Nvidia	*Hongxu Yin: Nvidia	*Ji Lin: OpenAI (work done at Nvidia and MIT)
Wei Ping: Nvidia	Pavlo Molchanov: Nvidia	Andrew Tao: Nvidia
Haotian Tang: MIT	Shang Yang: MIT	Ligeng Zhu: Nvidia, MIT
Wei-Chen Wang: MIT	Fuzhao Xue: Nvidia, NUS	Yunhao Fang: Nvidia, UCSD
Yukang Chen: Nvidia, CUHK	Zhuoyang Zhang: Nvidia, Tsinghua Univ.	Yue Shen: Nvidia
Wei-Ming Chen: Nvidia	Huizi Mao: Nvidia	Baifeng Shi: Nvidia, UC Berkeley
Jan Kautz: Nvidia	Mohammad Shoeybi: Nvidia	Song Han: Nvidia, MIT

Citations

@misc{lin2023vila,
      title={VILA: On Pre-training for Visual Language Models},
      author={Ji Lin and Hongxu Yin and Wei Ping and Yao Lu and Pavlo Molchanov and Andrew Tao and Huizi Mao and Jan Kautz and Mohammad Shoeybi and Song Han},
      year={2023},
      eprint={2312.07533},
      archivePrefix={arXiv},
      primaryClass={cs.CV}
}

Acknowledgement

LLaVA: the codebase we built upon. Thanks for their wonderful work.
InternVL: for open-sourcing InternViT (used in VILA1.5-40b) and the InternVL-SFT data blend (inspired by LLaVA-1.6) used in all VILA1.5 models.
Vicuna: the amazing open-sourced large language model!
Video-ChatGPT: we borrowed the video evaluation script from this repository.
MMC4, COYO-700M, M3IT, OpenORCA/FLAN, ShareGPT4V, WIT, GSM8K-ScRel, VisualGenome, VCR, ScienceQA, Shot2Story, Youcook2, Vatex, ShareGPT-Video for providing datasets used in this research.
