This is the code for the paper Unsupervised Vision-and-Language Pre-training Without Parallel Images and Captions (NAACL 2021).
The code is adapted from both the original VisualBERT code and LXMERT. Many thanks to Hao Tan for developing the great LXMERT codebase and hosting some of the data files!
- Vocabulary files for the BUTD detector

```bash
mkdir -p data/vocabs/
wget https://raw.githubusercontent.com/peteanderson80/bottom-up-attention/master/data/genome/1600-400-20/attributes_vocab.txt -P data/vocabs/
wget https://raw.githubusercontent.com/peteanderson80/bottom-up-attention/master/data/genome/1600-400-20/objects_vocab.txt -P data/vocabs/
wget https://raw.githubusercontent.com/peteanderson80/bottom-up-attention/master/data/genome/1600-400-20/relations_vocab.txt -P data/vocabs/
```
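These vocabulary files are plain text, one category per line (to our knowledge, some lines contain comma-separated aliases). A minimal sketch for loading them, assuming the download paths above:

```python
# Minimal sketch: load the BUTD detector vocabularies downloaded above.
# Assumes plain-text files with one category per line; some lines may list
# comma-separated aliases, so we keep only the first one.

def load_vocab(path):
    with open(path) as f:
        return [line.strip().split(",")[0] for line in f if line.strip()]

objects = load_vocab("data/vocabs/objects_vocab.txt")        # 1600 object classes
attributes = load_vocab("data/vocabs/attributes_vocab.txt")  # 400 attribute classes
relations = load_vocab("data/vocabs/relations_vocab.txt")    # 20 relation classes
print(len(objects), len(attributes), len(relations))
```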
- Pre-training caption files. Download the caption files from LXMERT:

```bash
mkdir -p data/lxmert
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_train.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_nominival.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/vgnococo.json -P data/lxmert/
wget nlp.cs.unc.edu/data/lxmert_data/lxmert/mscoco_minival.json -P data/lxmert/
```
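If you want to inspect these files, each one is a JSON list of per-image entries in the LXMERT pre-training format (to the best of our knowledge, each entry carries an `img_id` and a `sentf` dict mapping a source dataset to its sentences). A quick, hedged sanity check:

```python
import json

# Hedged sketch: peek at an LXMERT-style caption file downloaded above.
# Field names ("img_id", "sentf") follow the LXMERT pre-training format;
# verify against your local copy if the layout differs.
with open("data/lxmert/mscoco_minival.json") as f:
    data = json.load(f)

print(len(data), "images")
entry = data[0]
print("keys:", sorted(entry.keys()))
if "sentf" in entry:
    for source, sents in entry["sentf"].items():
        print(source, "->", sents[:1])
```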
- COCO/VG Image Features. In the paper, we used Conceptual Captions for pre-training, but those image features take up more than 800G, so we cannot release them for now. Instead, we provide scripts that run on COCO/VG images and COCO/VG captions. First, download the image feature files from LXMERT.

MSCOCO features:

```bash
mkdir -p data/mscoco_imgfeat
wget nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/train2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/train2014_obj36.zip -d data/mscoco_imgfeat && rm data/mscoco_imgfeat/train2014_obj36.zip
wget nlp.cs.unc.edu/data/lxmert_data/mscoco_imgfeat/val2014_obj36.zip -P data/mscoco_imgfeat
unzip data/mscoco_imgfeat/val2014_obj36.zip -d data && rm data/mscoco_imgfeat/val2014_obj36.zip
```
VG features:

```bash
mkdir -p data/vg_gqa_imgfeat
wget nlp.cs.unc.edu/data/lxmert_data/vg_gqa_imgfeat/vg_gqa_obj36.zip -P data/vg_gqa_imgfeat
unzip data/vg_gqa_imgfeat/vg_gqa_obj36.zip -d data && rm data/vg_gqa_imgfeat/vg_gqa_obj36.zip
```

Then run the script to convert them into HDF5 format (for faster reading):

```bash
python tools/convert_tsv_to_h5.py
```
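For reference, the LXMERT feature files are TSVs with base64-encoded numpy arrays per image (36 boxes with 2048-d features). The conversion is handled by the repository's `tools/convert_tsv_to_h5.py`; below is only a rough sketch of the idea under the LXMERT TSV schema, not the actual script, and the real output layout may differ:

```python
import base64, csv, sys
import h5py
import numpy as np

# Rough sketch of a TSV -> HDF5 conversion, NOT the repository's actual script.
# Assumes the LXMERT TSV schema with base64-encoded float32 arrays; the real
# tools/convert_tsv_to_h5.py may use a different output layout.
FIELDS = ["img_id", "img_h", "img_w", "objects_id", "objects_conf",
          "attrs_id", "attrs_conf", "num_boxes", "boxes", "features"]

csv.field_size_limit(sys.maxsize)  # rows are large because of the encoded arrays

def convert(tsv_path, h5_path):
    with open(tsv_path) as f, h5py.File(h5_path, "w") as h5:
        for row in csv.DictReader(f, FIELDS, delimiter="\t"):
            n = int(row["num_boxes"])
            boxes = np.frombuffer(base64.b64decode(row["boxes"]),
                                  dtype=np.float32).reshape(n, 4)
            feats = np.frombuffer(base64.b64decode(row["features"]),
                                  dtype=np.float32).reshape(n, -1)  # typically 2048-d
            grp = h5.create_group(row["img_id"])
            grp.create_dataset("boxes", data=boxes)
            grp.create_dataset("features", data=feats)
            grp.attrs["img_h"] = int(row["img_h"])
            grp.attrs["img_w"] = int(row["img_w"])

# Example (hypothetical output name):
# convert("data/mscoco_imgfeat/val2014_obj36.tsv", "data/mscoco_imgfeat/val2014_obj36.h5")
```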
- BookCorpus

We got our version of the BookCorpus from VL-BERT. After downloading the file, please put it under `data/lxmert/` as `bc1g.doc`.
- Download the annotation files from LXMERT:

```bash
mkdir -p data/vqa
wget nlp.cs.unc.edu/data/lxmert_data/vqa/train.json -P data/vqa/
wget nlp.cs.unc.edu/data/lxmert_data/vqa/nominival.json -P data/vqa/
wget nlp.cs.unc.edu/data/lxmert_data/vqa/minival.json -P data/vqa/
```
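These follow the LXMERT VQA annotation format (to our understanding, a JSON list where each entry has a question string `sent`, an `img_id`, a `question_id`, and a soft `label` dict mapping answers to scores). A quick, hedged sanity check:

```python
import json

# Hedged sketch: inspect the LXMERT-style VQA annotations downloaded above.
# Field names follow the LXMERT VQA format; double-check against your local file.
with open("data/vqa/minival.json") as f:
    questions = json.load(f)

print(len(questions), "questions")
q = questions[0]
print("keys:", sorted(q.keys()))
print(q.get("sent"), "->", q.get("label"))  # question text -> {answer: score}
```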
- COCO/VG Image Features. Please refer to the instructions in the pre-training data section above.

We recommend using Docker to run the experiments. Use the image `pytorch/pytorch:1.4-cuda10.1-cudnn7-devel` as a starting point.
```bash
pip install yacs easydict pycocotools matplotlib pillow commentjson attrdict boto3 h5py requests scikit-learn ftfy regex tqdm ml_collections msgpack lz4 msgpack_numpy lmdb pandas
conda install --yes -c pytorch torchvision cudatoolkit=10.1 pytorch=1.4.0
```
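After installing, a quick sanity check (a minimal sketch; the exact torchvision version is whatever conda resolves) confirms that the expected PyTorch/CUDA stack is in place:

```python
import torch
import torchvision

# Quick environment sanity check for the setup above (PyTorch 1.4 + CUDA 10.1).
print("torch:", torch.__version__)            # expected to start with 1.4
print("torchvision:", torchvision.__version__)
print("CUDA available:", torch.cuda.is_available())
print("CUDA version:", torch.version.cuda)    # expected 10.1
```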
Below is an example config that runs pre-training on COCO. As the image features of Conceptual Captions take up more than 800G, we cannot release them for now. The config used to train on Conceptual Captions is `configs/pretrain/conceptual_captions.json`.

Command:

```bash
export PYTHONPATH=$PYTHONPATH:src
CUDA_VISIBLE_DEVICES=0 python src/pretrain/lxmert_pretrain.py --multiGPU --output ./snap/test --config ./configs/pretrain/unsupervised.json
```
A model checkpoint trained on Conceptual Captions is available (GoogleDrive).

Caveats: to make training memory-efficient, we use a shared-memory array across processes, so please delete any files under `/dev/shm/` with the prefix `sharearray_`. (This is partly why we recommend Docker, as other people, though highly unlikely, may also be using a shared-memory array with the same name.)
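A minimal cleanup sketch for this caveat, assuming the `/dev/shm/sharearray_*` naming described above:

```python
import glob
import os

# Minimal sketch: remove leftover shared-memory files from a previous run.
# Assumes the sharearray_ prefix under /dev/shm/ described above.
for path in glob.glob("/dev/shm/sharearray_*"):
    print("removing", path)
    os.remove(path)
```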
We provide the commands to fine-tune and evaluate on VQA below.
- Training

Download the pre-trained checkpoint as in the previous section and save it as `snap/pretrain/CC_Unsupervised_LXRT.pth`.

```bash
export PYTHONPATH=$PYTHONPATH:src
CUDA_VISIBLE_DEVICES=0 python src/tasks/vqa.py --multiGPU --output ./snap/vqa_test --config ./configs/vqa.json
```
- Testing on minival

Download the fine-tuned VQA checkpoint (GoogleDrive) and save it as `snap/vqa.pth`.

```bash
export PYTHONPATH=$PYTHONPATH:src
CUDA_VISIBLE_DEVICES=0 python src/tasks/vqa.py --multiGPU --output ./snap/vqa_test --config ./configs/vqa.json --test val --load snap/vqa
```

This should give a score of 0.6807.

- Testing on test

```bash
export PYTHONPATH=$PYTHONPATH:src
CUDA_VISIBLE_DEVICES=0 python src/tasks/vqa.py --multiGPU --output ./snap/vqa_test --config ./configs/vqa.json --test test --load snap/vqa
```
The file `vqa_test/test_predict.json` can be submitted to the official VQA leaderboard. The model we provide should give a test-dev score close to 70.7.
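Before submitting, you can sanity-check that the prediction file matches the standard VQA submission format (a JSON list of entries with "question_id" and "answer"). A minimal sketch, assuming the output path mentioned above:

```python
import json

# Minimal sketch: sanity-check the prediction file before uploading to the
# VQA leaderboard. Assumes the vqa_test/test_predict.json path mentioned above
# (it may live under ./snap/ depending on --output) and the standard submission
# format of [{"question_id": ..., "answer": ...}, ...].
with open("vqa_test/test_predict.json") as f:
    preds = json.load(f)

print(len(preds), "predictions")
assert all({"question_id", "answer"} <= set(p) for p in preds[:100])
print(preds[0])
```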