To train VILA, we used the following datasets:
| Stage | Datasets |
|---|---|
| 1. Initialize projector | CC3M |
| 2. Pre-training | MMC4-core, COYO-700M, ShareGPT4V_pretrain |
| 3. SFT | LLaVA-Next mixture, VFLAN, WIT, GSM8K-ScRel-SFT, Sherlock, ScienceQA, Shot2Story, Video_ChatGPT, Youcook2, Vatex, ShareGPT_Video |
We use LLaVA-CC3M-Pretrain-595K to train the visual language projector.
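If you have not obtained it yet, this projector pre-training data can be fetched from the Hugging Face Hub; the repo name below refers to the upstream LLaVA release, and the local path is just an example:

huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K --repo-type dataset --local-dir LLaVA-CC3M-Pretrain-595K --local-dir-use-symlinks False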
Due to compute constraints, we pre-train VILA on the smaller MMC4-core set instead of the full MMC4 corpus.
- First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-faces split, and you may need to request access to it.
- Now modify the input and output paths in `mmc4_downloader.py` and run the following script to crawl the MMC4 images:
cd mmc4
python mmc4_downloader.py
Note that because some image URLs have expired, you may end up with only a subset of the full corpus.
Crawling may take a long time. Optionally, you can shard the workload across multiple jobs/machines running concurrently to speed up the process (a simple launcher sketch follows the example below):
# Provide the start and end index of the jsonl shards. There are 23098 - 14 shards in total.
# python mmc4_downloader.py <start_idx> <end_idx>
python mmc4_downloader.py 0 1000 # worker 1
python mmc4_downloader.py 1000 2000 # worker 2
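For example, a simple launcher for several concurrent workers on one machine could look like the loop below; the worker count and per-worker shard range are placeholders, not values required by the scripts:

# illustrative only: 8 background workers, each covering 1000 shard indices
for WORKER in {0..7}; do
  python mmc4_downloader.py $((WORKER * 1000)) $(((WORKER + 1) * 1000)) &
done
wait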
- Filter out invalid samples in MMC4:
python mmc4_filter_and_counter.py
- Merge images and text into a unified pickle file for each shard:
python mmc4_merger.py
- Download the metadata of COYO-700M:
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
- Crawl the COYO images. Note that we only keep the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality.
There are 128 shards of annotations in total. Download each one with the script below (an optional parallel variant follows):
cd coyo
for SHARD in {0..127}; do
python coyo_downloader.py $SHARD
done
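If sequential downloading is too slow, the same loop can also be parallelized, for example with xargs; the worker count below is an arbitrary example:

# optional: process 8 shards at a time instead of sequentially
seq 0 127 | xargs -P 8 -I {} python coyo_downloader.py {}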
- Split downloaded COYO data into multiple shards:
python coyo_splitter.py
We use the `llava_v1_5_mix665k.json` file in our experiments. Please download this dataset from the LLaVA authors:
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset
- Download FLAN datasets:
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
- Preprocess the FLAN dataset (sample 1M examples from the 378M available):
cd sft
python preprocess_flan.py
- Download M3IT datasets:
huggingface-cli download MMInstruction/M3IT --repo-type dataset --local-dir M3IT --local-dir-use-symlinks False
- Preprocess M3IT dataset:
python preprocess_m3it.py
- (Optional) Split FLAN+M3IT into multiple chunks to reduce CPU memory pressure during training:
python split_vflan.py
You can follow this page to prepare the data mixture proposed by LLaVA-Next.
Please follow this page to download the Shot2Story videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset shot2story_shotonly.json --repo-type dataset --local-dir shot2story --local-dir-use-symlinks False
You can follow this page to prepare the Video_ChatGPT dataset.
Please follow this page to download the Youcook2 videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
Please follow this page to download the Vatex videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset vatex_filtered_v3.json --repo-type dataset --local-dir vatex --local-dir-use-symlinks False
You can follow this page to prepare the ShareGPT_Video dataset.
The original WIT data can be obtained from google-research-datasets/wit. We subsample ~538K English samples from the original WIT dataset and curate a JSON file in LLaVA conversation format, which can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset wit_processed_538k.json --repo-type dataset --local-dir WIT --local-dir-use-symlinks False
We add the math dataset GSM8K-ScRel to our SFT stage.
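If you want to fetch the raw GSM8K-ScRel data yourself, one option is the paper's public GitHub release; the repository below is our best pointer to that release rather than a path used by the training scripts:

git clone https://github.com/OFA-Sys/gsm8k-ScRel.git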
The image files for Sherlock can be obtained separately from Visual Genome and VCR. The JSON file in LLaVA conversation format can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset sherlock_317k.json --repo-type dataset --local-dir sherlock --local-dir-use-symlinks False
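As a convenience, the Visual Genome portion of the images can be fetched directly; the URLs below are the standard Visual Genome download links and may change, and the VCR images must be requested from the VCR project page:

# Visual Genome images (released in two parts)
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip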
We use the train split of ScienceQA. The image data for the train split can be obtained from the ScienceQA repository or their Hugging Face repo. The JSON file in LLaVA conversation format can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset scienceqa_train_12k.json --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
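For example, one way to grab a Hugging Face copy of ScienceQA is shown below; the derek-thomas/ScienceQA repo name is an assumption rather than an official requirement, the official GitHub release works just as well, and you may still need to export the images into the folder layout expected by the preprocessing scripts:

huggingface-cli download derek-thomas/ScienceQA --repo-type dataset --local-dir scienceqa_images --local-dir-use-symlinks False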