Official code for the paper "Beyond Text: Frozen Large Language Models in Visual Signal Comprehension".
The proposed V2L Tokenizer can be trained with the following steps (Run.sh):
- Download the few-shot splits and the ImageNet split from Google Drive.
- Confirm that "$imagenet_path" points to the ImageNet1K dataset folder, arranged with the following layout:

    |--ImageNet1K
       |--train
       |  |---n01440764
       |  |---n01443537
       |  |---...
       |--val
       |  |--ILSVRC2012_val_00000001.JPEG
       |  |--ILSVRC2012_val_00000002.JPEG
       |  |--...

- Confirm that "$llama_path" points to the LLaMA-2 model folder, which contains the original model weights and tokenizer.
- Run "step1_epanding_vocabulary_set.py" to expand the vocabulary set of LLaMA-2 with the proposed codebook extension strategy.
- Run "step2_generate_codebook_embedding.py" to generate the vision-language codebook embeddings for the vocabulary sets.
- Run "step3_global_codebook_filtering.py" to filter out the vocabulary entries that have less visual semantics.
- Run "step4_training_v2l_tokenizer.py" to train the V2L Tokenizer on the codebook produced by the three steps above.
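
Before launching the four training steps above, it can help to check that the dataset folder matches the expected layout. The following is a minimal sketch, not part of the released code: `check_imagenet_layout` and `training_commands` are hypothetical helper names, the check assumes ImageNet class folders are named `n` followed by eight digits, and running each script without command-line arguments is an assumption (Run.sh may pass options that are not shown here).

```python
import re
from pathlib import Path

# ImageNet synset folders look like "n01440764": the letter n plus eight digits.
SYNSET_DIR = re.compile(r"^n\d{8}$")

# Script names copied from the steps above.
# Assumption: no command-line arguments are shown; Run.sh may pass some.
TRAINING_STEPS = [
    "step1_epanding_vocabulary_set.py",
    "step2_generate_codebook_embedding.py",
    "step3_global_codebook_filtering.py",
    "step4_training_v2l_tokenizer.py",
]


def check_imagenet_layout(imagenet_path):
    """Return a list of problems found; an empty list means the layout looks right."""
    root = Path(imagenet_path)
    train_dir, val_dir = root / "train", root / "val"
    problems = []

    # train/ should contain one folder per class, named by synset ID.
    if not train_dir.is_dir():
        problems.append(f"missing folder: {train_dir}")
    else:
        class_dirs = [p for p in train_dir.iterdir() if p.is_dir()]
        if not class_dirs:
            problems.append(f"no class folders in {train_dir}")
        bad = sorted(p.name for p in class_dirs if not SYNSET_DIR.match(p.name))
        if bad:
            problems.append(f"unexpected class folder names: {bad[:5]}")

    # val/ should contain the validation images directly, without class folders.
    if not val_dir.is_dir():
        problems.append(f"missing folder: {val_dir}")
    elif not any(p.is_file() and p.suffix == ".JPEG" for p in val_dir.iterdir()):
        problems.append(f"no .JPEG files directly under {val_dir}")

    return problems


def training_commands(python="python"):
    """Build the four step commands in order; nothing is executed here."""
    return [[python, script] for script in TRAINING_STEPS]
```

A caller could print the result of `check_imagenet_layout(...)` for the folder used as "$imagenet_path" and, if the list is empty, pass each entry of `training_commands()` to `subprocess.run(cmd, check=True)` so that a failing step stops the sequence.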
We also provide our codebooks and checkpoints at: https://drive.google.com/drive/folders/1Z8GxE-WMEijJV-JZmqL7AGzsB0gHk4ow?usp=sharing
The proposed V2L Tokenizer can be used for visual signal reconstruction, comprehension, and denoising generation with LLaMA-2:
- Run "eval_reconstruction.py" to evaluate reconstruction performance on the ImageNet1K validation set.
- Run "eval_understanding.py" to evaluate comprehension performance on N-way K-shot classification on mini-ImageNet.
- Run "eval_denoising_generation.py" to evaluate denoising generation performance on a subset of the ImageNet1K validation set.
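
For readers unfamiliar with the N-way K-shot setting used in the comprehension evaluation above, the sketch below shows how a single episode is typically formed: N classes are drawn, K labelled support images are taken per class, and held-out query images from the same classes are then classified. This is a generic illustration only. The repository relies on the downloaded few-shot splits rather than this sampler, and `sample_episode` with its parameters is a hypothetical name.

```python
import random


def sample_episode(images_by_class, n_way, k_shot, n_query=1, seed=None):
    """Sample one N-way K-shot episode.

    images_by_class maps a class name to a list of image identifiers.
    Returns (support, query), each a list of (image, class) pairs.
    """
    rng = random.Random(seed)

    # Only classes with enough images for both support and query are usable.
    eligible = sorted(
        c for c, imgs in images_by_class.items() if len(imgs) >= k_shot + n_query
    )
    if len(eligible) < n_way:
        raise ValueError("not enough classes with k_shot + n_query images")

    support, query = [], []
    for cls in rng.sample(eligible, n_way):
        # Draw without replacement so support and query never share an image.
        picked = rng.sample(images_by_class[cls], k_shot + n_query)
        support += [(img, cls) for img in picked[:k_shot]]
        query += [(img, cls) for img in picked[k_shot:]]
    return support, query
```

With `n_way=2` and `k_shot=3`, for example, the support set holds six labelled images and the query set holds `n_query` images for each of the two classes.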