To train VILA, we used the following datasets:
| Stage | Datasets |
|---|---|
| 1. Initialize projector | CC3M |
| 2. Pre-training | MMC4-core, COYO-700M, ShareGPT4V_pretrain |
| 3. SFT | LLaVA-Next mixture, VFLAN, WIT, GSM8K-ScRel-SFT, Sherlock, ScienceQA, Shot2Story, Video_ChatGPT, Youcook2, Vatex, ShareGPT_Video |
We use LLaVA-CC3M-Pretrain-595K to train the visual language projector.
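If you have not obtained it yet, this projector pre-training data can be fetched from the Hugging Face Hub; the repo name below refers to the upstream LLaVA release, and the local path is just an example:

huggingface-cli download liuhaotian/LLaVA-CC3M-Pretrain-595K --repo-type dataset --local-dir LLaVA-CC3M-Pretrain-595K --local-dir-use-symlinks False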
Due to compute constraints, we pre-train VILA on the smaller MMC4-core set instead of the full MMC4 corpus.
- First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-faces split, and you may need to request access to it.
- Now modify the input and output paths in `mmc4_downloader.py` and run the following script to crawl the MMC4 images:
cd mmc4
python mmc4_downloader.py
Note that because some image URLs have expired, you may end up with only a subset of the full corpus.
Crawling may take a long time. Optionally, you can shard the workload across multiple jobs/machines running concurrently to speed up the process (a simple launcher sketch follows the example below):
# Provide the start and end index of the jsonl shards. There are 23098 - 14 shards in total.
# python mmc4_downloader.py <start_idx> <end_idx>
python mmc4_downloader.py 0 1000 # worker 1
python mmc4_downloader.py 1000 2000 # worker 2
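For example, a simple launcher for several concurrent workers on one machine could look like the loop below; the worker count and per-worker shard range are placeholders, not values required by the scripts:

# illustrative only: 8 background workers, each covering 1000 shard indices
for WORKER in {0..7}; do
  python mmc4_downloader.py $((WORKER * 1000)) $(((WORKER + 1) * 1000)) &
done
wait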
- Filter out invalid samples in MMC4:
python mmc4_filter_and_counter.py
- Merge images and text into a unified pickle file for each shard:
python mmc4_merger.py
- Download the metadata of COYO-700M:
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
- Crawl the COYO images. Note that we only keep the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality.
There are 128 shards of annotations in total. Download each one with the script below (an optional parallel variant follows):
cd coyo
for SHARD in {0..127}; do
python coyo_downloader.py $SHARD
done
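If sequential downloading is too slow, the same loop can also be parallelized, for example with xargs; the worker count below is an arbitrary example:

# optional: process 8 shards at a time instead of sequentially
seq 0 127 | xargs -P 8 -I {} python coyo_downloader.py {}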
- Split downloaded COYO data into multiple shards:
python coyo_splitter.py
We use the `llava_v1_5_mix665k.json` file in our experiments. Please download this dataset from the LLaVA authors:
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset
- Download FLAN datasets:
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
- Preprocess the FLAN dataset (sample 1M examples from the 378M available):
cd sft
python preprocess_flan.py
- Download M3IT datasets:
huggingface-cli download MMInstruction/M3IT --repo-type dataset --local-dir M3IT --local-dir-use-symlinks False
- Preprocess M3IT dataset:
python preprocess_m3it.py
- (Optional) Split FLAN+M3IT into multiple chunks to reduce CPU memory pressure during training:
python split_vflan.py
You can follow this page to prepare the data mixture proposed by LLaVA-Next.
Please follow this page to download the Shot2Story videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset shot2story_shotonly.json --repo-type dataset --local-dir shot2story --local-dir-use-symlinks False
You can follow this page to prepare the Video_ChatGPT dataset.
Please follow this page to download the Youcook2 videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
Please follow this page to download the Vatex videos. The JSON file can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset vatex_filtered_v3.json --repo-type dataset --local-dir vatex --local-dir-use-symlinks False
You can follow this page to prepare the ShareGPT_Video dataset.
The original WIT data can be obtained from google-research-datasets/wit. We subsample ~538K English samples from the original WIT dataset and curate a JSON file in LLaVA conversation format, which can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset wit_processed_538k.json --repo-type dataset --local-dir WIT --local-dir-use-symlinks False
We add the math dataset GSM8K-ScRel to our SFT stage.
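If you want to fetch the raw GSM8K-ScRel data yourself, one option is the paper's public GitHub release; the repository below is our best pointer to that release rather than a path used by the training scripts:

git clone https://github.com/OFA-Sys/gsm8k-ScRel.git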
The image files for Sherlock can be obtained separately from Visual Genome and VCR. The JSON file in LLaVA conversation format can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset sherlock_317k.json --repo-type dataset --local-dir sherlock --local-dir-use-symlinks False
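As a convenience, the Visual Genome portion of the images can be fetched directly; the URLs below are the standard Visual Genome download links and may change, and the VCR images must be requested from the VCR project page:

# Visual Genome images (released in two parts)
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images.zip
wget https://cs.stanford.edu/people/rak248/VG_100K_2/images2.zip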
We use the train split of ScienceQA. The image data for the train split can be obtained from the ScienceQA repository or their Hugging Face repo. The JSON file in LLaVA conversation format can be downloaded with
huggingface-cli download mit-han-lab/vila-dataset scienceqa_train_12k.json --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
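For example, one way to grab a Hugging Face copy of ScienceQA is shown below; the derek-thomas/ScienceQA repo name is an assumption rather than an official requirement, the official GitHub release works just as well, and you may still need to export the images into the folder layout expected by the preprocessing scripts:

huggingface-cli download derek-thomas/ScienceQA --repo-type dataset --local-dir scienceqa_images --local-dir-use-symlinks False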