# Data Preparation for Training VILA

To train VILA, we used the following datasets:

| Stage | Datasets |
| ----- | -------- |
| 1. Initialize projector | CC3M |
| 2. Pre-training | MMC4-core, COYO-700M, ShareGPT4V_pretrain |
| 3. SFT | LLaVA-Next mixture, VFLAN, WIT, GSM8K-ScRel-SFT, Sherlock, ScienceQA, Shot2story, Video_ChatGPT, Youcook2, Vatex, ShareGPT_Video |

## LLaVA-CC3M-Pretrain

We use LLaVA-CC3M-Pretrain-595K to train the visual language projector.

## MMC4-Core Dataset

Due to compute limits, we pre-train VILA on the smaller core subset of MMC4 instead of the full set.

1. First, download the annotations of the MMC4-core dataset from https://github.com/allenai/mmc4. We used the non-fewer-faces split, and you may need to request access here.

2. Modify the input and output paths in mmc4_downloader.py and run the following script to crawl the MMC4 images:

```bash
cd mmc4
python mmc4_downloader.py
```

Note that because some image URLs have expired, you may end up with only a subset of the full corpus.

Crawling may take a long time. Optionally, you can shard the workload over multiple jobs or machines to speed up the process (a Python launcher sketch follows the commands below):

```bash
# provide the start and end index of the jsonl shards (there are 23098 - 14 shards in total)
# python mmc4_downloader.py <start_idx> <end_idx>
python mmc4_downloader.py 0 1000     # worker 1
python mmc4_downloader.py 1000 2000  # worker 2
```
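If you prefer a single entry point, here is a minimal sketch (not part of the repo) of launching several workers from Python. It assumes mmc4_downloader.py accepts the start/end indices shown above and that the shard index range spans 0 to 23084 (23098 - 14):

```python
# Hypothetical launcher, not part of the repo: fan mmc4_downloader.py out
# over NUM_WORKERS processes, each covering a contiguous slice of shards.
import subprocess

NUM_WORKERS = 8
START, END = 0, 23084  # assumed shard index range (23098 - 14 shards in total)
step = (END - START + NUM_WORKERS - 1) // NUM_WORKERS

procs = []
for i in range(NUM_WORKERS):
    lo = START + i * step
    hi = min(lo + step, END)
    # each worker crawls shards [lo, hi), matching the CLI shown above
    procs.append(subprocess.Popen(["python", "mmc4_downloader.py", str(lo), str(hi)]))

for p in procs:
    p.wait()
```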
3. Filter out invalid samples in MMC4:

```bash
python mmc4_filter_and_counter.py
```
4. Merge images and text into a unified pickle file for each shard (illustrated below):

```bash
python mmc4_merger.py
```
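For intuition, here is a heavily simplified sketch of what this merge step does. The real logic lives in mmc4_merger.py; the field names (text_list, image_info, image_name) follow the MMC4 annotation schema, and the output layout here is an assumption:

```python
# Simplified illustration of the merge step (the actual implementation is
# mmc4_merger.py): pack each shard's text and downloaded image bytes into
# a single pickle file.
import json
import pickle
from pathlib import Path

def merge_shard(jsonl_path: str, image_dir: str, out_pkl: str) -> None:
    records = []
    with open(jsonl_path) as f:
        for line in f:
            doc = json.loads(line)
            images = []
            for info in doc.get("image_info", []):
                img = Path(image_dir) / info["image_name"]
                if img.exists():  # keep only images that were crawled successfully
                    images.append(img.read_bytes())
            records.append({"text_list": doc["text_list"], "images": images})
    with open(out_pkl, "wb") as f:
        pickle.dump(records, f)
```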

## COYO-700M Dataset

1. Download the metadata of COYO-700M:

```bash
huggingface-cli download kakaobrain/coyo-700m --repo-type dataset --local-dir coyo-700m --local-dir-use-symlinks False
```
2. Download the COYO images. Note that we only keep the 20% of samples in each shard with the highest CLIP similarity, to balance compute budget and data quality (see the filtering sketch after the download loop below).

There are 128 shards of annotations in total. Download each one with the script:

```bash
cd coyo
for SHARD in {0..127}; do
    python coyo_downloader.py $SHARD
done
```
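As a rough illustration of the CLIP-similarity filter (the actual filtering happens inside coyo_downloader.py), the sketch below keeps the top 20% of rows in one metadata shard. It assumes pandas and the clip_similarity_vitb32 column carried by the COYO-700M parquet files:

```python
# Illustrative only: keep the 20% of samples in a COYO metadata shard with
# the highest CLIP similarity. The real filter lives in coyo_downloader.py.
import pandas as pd

def top_20_percent(parquet_path: str) -> pd.DataFrame:
    df = pd.read_parquet(parquet_path)
    cutoff = df["clip_similarity_vitb32"].quantile(0.80)  # 80th percentile
    return df[df["clip_similarity_vitb32"] >= cutoff]
```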
3. Split the downloaded COYO data into multiple shards:

```bash
python coyo_splitter.py
```

## LLaVA-1.5 Instruction Data

We use llava_v1_5_mix665k.json in our experiments. Please download this dataset from the LLaVA authors:

```bash
huggingface-cli download liuhaotian/LLaVA-Instruct-150K llava_v1_5_mix665k.json --repo-type dataset
```

## VFlan Dataset

### TextFLAN

1. Download FLAN datasets:

```bash
huggingface-cli download Open-Orca/FLAN --repo-type dataset --local-dir FLAN --local-dir-use-symlinks False
```
2. Preprocess the FLAN dataset (sample 1M examples from the 378M total; a rough illustration of the sampling follows the command below):

```bash
cd sft
python preprocess_flan.py
```
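The sketch below shows one way such a subsample could be drawn. It is not the actual preprocess_flan.py; the paths, output name, and uniform per-shard sampling rate are assumptions:

```python
# Hypothetical subsampling sketch, not the real preprocess_flan.py:
# draw ~1M of the 378M FLAN examples by sampling every parquet shard
# at the same rate.
import glob
import pandas as pd

TARGET, TOTAL = 1_000_000, 378_000_000
rate = TARGET / TOTAL

parts = []
for shard in glob.glob("FLAN/**/*.parquet", recursive=True):
    df = pd.read_parquet(shard)
    parts.append(df.sample(frac=rate, random_state=0))

pd.concat(parts).to_json("vflan_1m.jsonl", orient="records", lines=True)
```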

### M3IT Dataset

1. Download M3IT datasets:

```bash
huggingface-cli download MMInstruction/M3IT --repo-type dataset --local-dir M3IT --local-dir-use-symlinks False
```
2. Preprocess the M3IT dataset:

```bash
python preprocess_m3it.py
```

3. (Optional) Split FLAN+M3IT into multiple chunks to reduce CPU memory pressure during training (a chunking sketch follows):

```bash
python split_vflan.py
```
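A minimal sketch of the chunking idea, assuming hypothetical file names (the real logic is in split_vflan.py):

```python
# Hypothetical chunking sketch, not the real split_vflan.py: split one
# large JSON list into fixed-size chunks so training only needs to hold
# a slice in CPU memory at a time.
import json

with open("vflan_1m.json") as f:  # assumed input name
    data = json.load(f)

CHUNK = 100_000
for i in range(0, len(data), CHUNK):
    with open(f"vflan_chunk_{i // CHUNK:03d}.json", "w") as f:
        json.dump(data[i : i + CHUNK], f)
```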

## LLaVA-Next mixture

You can follow this page to prepare the data mixture proposed by LLaVA-Next.

## Shot2story

Please follow this page to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset shot2story_shotonly.json --repo-type dataset --local-dir shot2story --local-dir-use-symlinks False
```

## Video_ChatGPT

You can follow this page to prepare the Video_ChatGPT dataset.

## Youcook2

Please follow this page to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset youcook_filtered_v3.json --repo-type dataset --local-dir youcook2 --local-dir-use-symlinks False
```

## Vatex

Please follow this page to download the videos. The JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset vatex_filtered_v3.json --repo-type dataset --local-dir vatex --local-dir-use-symlinks False
```

## ShareGPT_Video

You can follow this page to prepare the ShareGPT_Video dataset.

## WIT

The original WIT data can be obtained from google-research-datasets/wit. We subsample ~538K English examples from the original WIT dataset and curate a LLaVA conversation format JSON file (see the sketch below), which can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset wit_processed_538k.json --repo-type dataset --local-dir WIT --local-dir-use-symlinks False
```
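For reference, a record in the LLaVA conversation format looks roughly like the following; the values here are illustrative, not taken from wit_processed_538k.json:

```python
# Illustrative LLaVA-conversation-format record (made-up values):
record = {
    "id": "wit_000000",                # hypothetical sample id
    "image": "images/wit_000000.jpg",  # hypothetical image path
    "conversations": [
        {"from": "human", "value": "<image>\nDescribe this image."},
        {"from": "gpt", "value": "A caption derived from the WIT page text."},
    ],
}
```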

## GSM8K-ScRel-SFT

We add math data from gsm8k-ScRel to our SFT stage.

## Sherlock

The image files of Sherlock can be obtained from VisualGenome and VCR separately. The LLaVA conversation format JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset sherlock_317k.json --repo-type dataset --local-dir sherlock --local-dir-use-symlinks False
```

## ScienceQA

We use the train split of ScienceQA. The image data of the train split can be obtained from ScienceQA or their Hugging Face repo. The LLaVA conversation format JSON file can be downloaded with:

```bash
huggingface-cli download mit-han-lab/vila-dataset scienceqa_train_12k.json --repo-type dataset --local-dir scienceqa --local-dir-use-symlinks False
```