Shaking Up VLMs: Comparing Transformers and Structured State Space Models for Vision & Language Modeling (EMNLP 2024)

[Paper][Model Checkpoints][Data][Training]

Requirements

# Create a conda environment with Python 3.10 (adjust the environment path to your setup)
conda create -p ~/sharedscratch/conda_envs/athnlp2024-cu12 python=3.10

# Install dependencies
poetry install

# Install flash attention / mamba

poe install-flash-attn
poe install-mamba-ssm
poe install-causal-conv1d
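
The three `poe` tasks above build CUDA extensions. As a quick sanity check (not part of the repo), you can confirm that the compiled packages resolve under their usual import names:

```python
# Sanity check (an assumption, not a repo script): verify the CUDA extras installed.
import importlib

for module in ("flash_attn", "mamba_ssm", "causal_conv1d"):
    try:
        importlib.import_module(module)
        print(f"{module}: OK")
    except ImportError as exc:
        print(f"{module}: missing ({exc})")
```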

Model Checkpoints 🤗

Pretrained Checkpoints

- Pythia-VL-1B
- Mamba-VL-790M
- Pythia-VL-1.4B
- Mamba-VL-1.4B
- Pythia-VL-2.8B
- Mamba-VL-2.8B

Instruction-tuned Checkpoints

| Model | COCO | NoCaps | VQAv2 | GQA | V7W (test-T) | VSR | POPE | RefCOCO (testA) | RefCOCO (testB) | RefCOCO+ (testA) | RefCOCO+ (testB) | RefCOCOg | V7W (test-P) | TextCaps | TextVQA | AI2D |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pythia-VL-1B | 132.89 | 97.61 | 72.26 | 53.79 | 81.96 | 72.43 | 86.77 | 76.00 | 62.48 | 45.36 | 47.44 | 67.58 | 83.78 | 92.73 | 35.22 | 77.62 |
| Mamba-VL-790M | 133.81 | 99.00 | 71.67 | 54.95 | 81.82 | 75.39 | 86.77 | 67.84 | 56.35 | 57.97 | 41.43 | 59.16 | 74.01 | 94.30 | 40.72 | 79.27 |
| Pythia-VL-1.4B | 134.06 | 100.72 | 73.57 | 57.05 | 83.06 | 77.72 | 86.40 | 82.43 | 68.39 | 72.35 | 55.16 | 72.56 | 86.13 | 94.60 | 37.54 | 79.27 |
| Mamba-VL-1.4B | 134.76 | 100.56 | 74.46 | 58.44 | 83.78 | 80.18 | 85.32 | 76.60 | 63.48 | 68.40 | 52.11 | 68.82 | 80.18 | 98.68 | 41.30 | 80.86 |
| Pythia-VL-2.8B | 134.97 | 101.27 | 75.08 | 59.76 | 84.34 | 80.86 | 86.87 | 85.39 | 70.82 | 75.39 | 58.62 | 76.24 | 86.61 | 99.74 | 39.14 | 81.57 |
| Mamba-VL-2.8B | 135.53 | 102.00 | 76.08 | 60.41 | 85.31 | 81.45 | 87.33 | 79.29 | 64.97 | 71.64 | 53.94 | 71.27 | 82.50 | 100.47 | 42.14 | 83.71 |
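
The checkpoints are hosted on the Hugging Face Hub (🤗). A minimal sketch for fetching one locally with `huggingface_hub`, assuming you substitute the actual repo id from the checkpoint links:

```python
# Hedged sketch: download a released checkpoint from the Hugging Face Hub.
# The repo id is a placeholder -- replace it with the id linked in the tables above.
from huggingface_hub import snapshot_download

local_path = snapshot_download(
    repo_id="<org>/Mamba-VL-790M",                  # placeholder repo id
    local_dir="storage/checkpoints/mamba-vl-790m",  # any local directory
)
print(local_path)
```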

Data

Instructions for specific datasets

GRIT

GRIT is downloaded using img2dataset. Note that some of the URLs may no longer be available by the time you download the data.

./scripts/download_grit.sh storage/datasets/grit_url_folder storage/datasets/grit
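
The download script above uses img2dataset under the hood. If you prefer to call it directly from Python, a rough sketch looks like this; the column names and sizes are assumptions, and the repo's download_grit.sh is authoritative:

```python
# Hedged sketch: download GRIT with img2dataset's Python API.
# Parameter values are illustrative and may differ from the repo's script.
from img2dataset import download

download(
    url_list="storage/datasets/grit_url_folder",  # folder with the GRIT parquet shards
    input_format="parquet",
    url_col="url",
    caption_col="caption",
    output_folder="storage/datasets/grit",
    output_format="webdataset",
    image_size=256,
    processes_count=8,
    thread_count=64,
)
```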

To avoid training on the whole dataset, filter GRIT by its noun_phrases annotations (see the appendix of the paper for full details):

python prepare_grit_dataset.py \
	--cache_dir /path/to/downloaded/grit \
	--output_folder /path/to/downsampled/grit \
	--downsample_images \
	--check_overlap 
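
The exact noun-phrase criteria are described in the paper's appendix and implemented in prepare_grit_dataset.py. The snippet below is only a hypothetical illustration of that kind of filter, not the repo's code:

```python
# Hypothetical illustration of noun-phrase based filtering (not the repo's implementation).
def keep_example(example: dict, min_phrases: int = 2) -> bool:
    """Keep a GRIT example only if it carries enough distinct noun phrases."""
    phrases = {phrase.lower().strip() for phrase in example.get("noun_phrases", [])}
    return len(phrases) >= min_phrases

examples = [
    {"noun_phrases": ["a dog", "a red frisbee"]},
    {"noun_phrases": ["image"]},
]
print([keep_example(ex) for ex in examples])  # [True, False]
```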

OCRVQA

We also filter out examples from OCRVQA (see the appendix of the paper for details):

python filter_out_ocrvqa_images.py \
	--cache_dir /path/to/downloaded/ocrvqa \
	--output_json /path/to/filtered/ocrvqa/examples
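
The script writes the examples that survive the filter to the JSON file passed via --output_json. A small sketch for inspecting that output; the path and JSON structure here are assumptions:

```python
# Hedged sketch: inspect the filtered OCRVQA output produced above.
# The path is a placeholder and the JSON structure is an assumption.
import json

with open("/path/to/filtered/ocrvqa/examples.json") as fp:
    kept = json.load(fp)
print(f"{len(kept)} OCRVQA entries kept after filtering")
```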

Prepare pretraining dataset

python prepare_dataset.py \
	--dataset_subset llava_pretrain \
	--root_dataset_path storage/datasets \
	--cache_dir storage/datasets/vl_mamba

Prepare instruction tuning dataset

python prepare_dataset.py \
	--dataset_subset instruction_tuning \
	--root_dataset_path storage/datasets \
	--cache_dir storage/datasets/vl_mamba

Prepare a single dataset

python prepare_dataset.py \
	--dataset_subset coco \
	--root_dataset_path storage/datasets \
	--cache_dir storage/datasets/vl_mamba

See DatasetNames in src/vl_mamba/datamodels/datamodels.py for the names of all available datasets.
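
A hedged sketch for listing the valid --dataset_subset values, assuming DatasetNames behaves like a Python Enum (as its name and location suggest):

```python
# Hedged sketch: enumerate the dataset subsets accepted by prepare_dataset.py.
# Assumes DatasetNames is an Enum defined in src/vl_mamba/datamodels/datamodels.py.
from vl_mamba.datamodels.datamodels import DatasetNames

for dataset_name in DatasetNames:
    print(dataset_name.value)
```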

Training

Pretraining

Pythia

./scripts/pretrain_pythia.sh

Mamba

./scripts/pretrain_mamba.sh

Instruction-tuning

Pythia

./scripts/finetune_pythia.sh path/to/pretrained/pythia/model /path/to/dataset/cache /path/to/root/dataset/path /output/model/directory wandb_run_name

Mamba

./scripts/finetune_mamba.sh path/to/pretrained/mamba/model /path/to/dataset/cache /path/to/root/dataset/path /output/model/directory wandb_run_name

Training logs

All pretraining and finetuning logs can be found on wandb. Note that some of the runs were resumed from a previous checkpoint.
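
If you want to pull run metadata programmatically, a minimal sketch using the public wandb API; the entity/project path is a placeholder for the actual wandb project linked above:

```python
# Hedged sketch: list the public pretraining / finetuning runs via the wandb API.
# "<entity>/<project>" is a placeholder for the actual wandb project.
import wandb

api = wandb.Api()
for run in api.runs("<entity>/<project>"):
    print(run.name, run.state)
```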
