LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment
Paper | Demo | Model zoo | Instruction | Datasets
- The first figure below shows the architecture of LanguageBind. LanguageBind can be easily extended to segmentation and detection tasks, and potentially to an unlimited number of modalities.
- The second figure shows our proposed VIDAL-10M dataset, which includes five modalities: video, infrared, depth, audio, and language.
[2023.10.04] Code, checkpoints, and the demo are available now! Watch this repository for the latest updates.
- Local demo. We highly recommend trying out our web demo, which incorporates all features currently supported by LanguageBind.
python gradio_app.py --languagebind_weight LanguageBind.pt
- Online demo. We provide an online demo on Hugging Face Spaces. In this demo, you can calculate the similarity between each modality and language, such as audio-to-language, video-to-language, and depth-to-language.
LanguageBind is a language-centric multimodal pretraining approach that uses language as the binding across different modalities, because the language modality is well explored and contains rich semantics.
We propose VIDAL-10M, a dataset of 10 million samples covering video, infrared, depth, audio, and their corresponding language, which greatly expands the data beyond visual modalities.
We make multi-view enhancements to the language. We produce multi-view descriptions that combine meta-data, spatial, and temporal information to greatly enrich the semantics of the language. In addition, we further enhance the language with ChatGPT to create a good semantic space for the language aligned with each modality.
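As a minimal sketch of the language-centric alignment idea, each non-language modality encoder is trained contrastively against the language embeddings of its paired (enhanced) descriptions. The function below is an illustrative, simplified objective, not the repository's training code:

```python
import torch
import torch.nn.functional as F

def language_bind_loss(modality_emb, language_emb, temperature=0.07):
    """Symmetric InfoNCE loss that binds one modality to language.

    modality_emb: (batch, dim) embeddings from a modality encoder (video, audio, ...).
    language_emb: (batch, dim) embeddings of the paired descriptions.
    """
    modality_emb = F.normalize(modality_emb, dim=-1)
    language_emb = F.normalize(language_emb, dim=-1)
    logits = modality_emb @ language_emb.T / temperature   # (batch, batch) similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Each sample should match its own paired description, and vice versa.
    return (F.cross_entropy(logits, targets) + F.cross_entropy(logits.T, targets)) / 2
```

Because every modality is aligned to the same language space, any two modalities become comparable to each other through language without being trained against each other directly.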
- We list the pretrained checkpoints of LanguageBind below. We provide an aggregated weight (LanguageBind) for the online demo and inference. Additionally, LanguageBind can be disassembled into separate branches to handle different tasks (see the sketch after the table).
- We additionally trained a Video-Language model with the LanguageBind method, which is stronger than the one trained on the CLIP4Clip framework.
- The cache comes from OpenCLIP, which we downloaded from Hugging Face. Note that the cache for pretrained weights is essentially the Image-Language weights plus a few additional HF profile files.
| Model | Baidu Yun | Google Cloud | Peking University Yun |
|---|---|---|---|
| LanguageBind | Link | Link | TODO |
| Video-Language (LanguageBind) | Link | Link | Link |
| Video-Language (CLIP4Clip) | Link | Link | Link |
| Audio-Language | Link | Link | Link |
| Depth-Language | Link | Link | Link |
| Thermal (Infrared)-Language | Link | Link | Link |
| Image-Language | Link | Link | Link |
| Cache for pretrained weights | Link | Link | Link |
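As an illustrative sketch only (assuming the aggregated LanguageBind.pt file stores a plain PyTorch state dict and that branch parameters can be identified by name, neither of which is guaranteed by the repository), splitting the aggregated weights into a single branch might look like this:

```python
import torch

# Load the aggregated checkpoint on CPU; the exact key layout is an assumption here.
state_dict = torch.load('LanguageBind.pt', map_location='cpu')
print(f"{len(state_dict)} tensors in the aggregated checkpoint")

# Hypothetical branch split: keep only parameters whose names mention 'video'.
video_branch = {k: v for k, v in state_dict.items() if 'video' in k.lower()}
print(f"{len(video_branch)} tensors selected for the video branch")
```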
The reported model parameters refer to the vision encoder only. Our experiments are based on 3 million video-text pairs from VIDAL-10M, and we train on the CLIP4Clip framework.
Infrared-Language, Depth-Language, and Audio-Language zero-shot classification. We report the top-1 classification accuracy for all datasets.
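Zero-shot classification with these models reduces to nearest-neighbour search in the shared embedding space: encode the class names as text, encode the input (infrared, depth, or audio), and pick the class with the highest similarity. Below is a minimal sketch assuming you already have L2-normalized embeddings; the helper is illustrative and not a repository API:

```python
import torch

def zero_shot_top1(modality_emb, class_text_emb, labels):
    """Top-1 zero-shot accuracy from precomputed embeddings.

    modality_emb:   (N, D) L2-normalized embeddings of depth/thermal/audio inputs.
    class_text_emb: (C, D) L2-normalized language embeddings of the class-name prompts.
    labels:         (N,) ground-truth class indices.
    """
    logits = modality_emb @ class_text_emb.T      # (N, C) cosine similarities
    preds = logits.argmax(dim=-1)                 # most similar class prompt per input
    return (preds == labels).float().mean().item()
```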
- Python >= 3.8
- PyTorch >= 1.13.0
- CUDA Version >= 10.2 (11.6 is recommended)
- Install required packages:
git clone https://github.com/PKU-YuanGroup/LanguageBind
cd LanguageBind
pip install -r requirements.txt
We open source all modal preprocessing code. Here is a simple script for multi-modal inference with LanguageBind.
import torch

# The helper functions below (get_tokenizer, get_*_transform, stack_dict,
# load_and_transform_*) are provided in this repository; `args`, `device`,
# `model`, and HF_HUB_PREFIX are set up in inference.py before this point.
modality_transform = {
    'language': get_tokenizer(HF_HUB_PREFIX + args.model, cache_dir=args.cache_dir),
    'video': get_video_transform(args),
    'audio': get_audio_transform(args),
    'depth': get_depth_transform(args),
    'thermal': get_thermal_transform(args),
    'image': get_image_transform(args),
}

image = ['image1.jpg', 'image2.jpg']
audio = ['audio1.wav', 'audio2.wav']
video = ['video1.mp4', 'video2.mp4']
depth = ['depth1.png', 'depth2.png']
thermal = ['thermal1.jpg', 'thermal2.jpg']
language = ["text1", "text2"]

inputs = {
    'image': stack_dict([load_and_transform_image(i, modality_transform['image']) for i in image], device),
    'video': stack_dict([load_and_transform_video(i, modality_transform['video']) for i in video], device),
    'audio': stack_dict([load_and_transform_audio(i, modality_transform['audio']) for i in audio], device),
    'thermal': stack_dict([load_and_transform_thermal(i, modality_transform['thermal']) for i in thermal], device),
    'depth': stack_dict([load_and_transform_depth(i, modality_transform['depth']) for i in depth], device),
    'language': stack_dict([load_and_transform_text(i, modality_transform['language']) for i in language], device),
}

with torch.no_grad():
    embeddings = model(inputs)

# Pairwise similarity between each modality and language, softmax-normalized over the texts.
print("Video x Language: \n", torch.softmax(embeddings['video'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Image x Language: \n", torch.softmax(embeddings['image'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Depth x Language: \n", torch.softmax(embeddings['depth'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Audio x Language: \n", torch.softmax(embeddings['audio'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
print("Thermal x Language: \n", torch.softmax(embeddings['thermal'] @ embeddings['language'].T, dim=-1).detach().cpu().numpy())
More details are in inference.py. Run the following command to start.
python inference.py --languagebind_weight LanguageBind.pt
The datasets are described in DATASETS.md.
The training and validation instructions are in TRAIN_AND_VALIDATE.md.
- OpenCLIP: an open-source pretraining framework.
- CLIP4Clip: an open-source video-text retrieval framework.
- sRGB-TIR: an open-source framework for generating infrared (thermal) images.
- GLPN: an open-source framework for generating depth images.
- The majority of this project is released under the MIT license as found in the LICENSE file.
- The dataset of this project is released under the CC-BY-NC 4.0 license as found in the DATASET_LICENSE file.
If you find our paper and code useful in your research, please consider giving a star ⭐ and a citation.
@misc{zhu2023languagebind,
  title={LanguageBind: Extending Video-Language Pretraining to N-modality by Language-based Semantic Alignment},
  author={Bin Zhu and Bin Lin and Munan Ning and Yang Yan and Jiaxi Cui and Wang HongFa and Yatian Pang and Wenhao Jiang and Junwu Zhang and Zongwei Li and Cai Wan Zhang and Zhifeng Li and Wei Liu and Li Yuan},
  year={2023},
  eprint={2310.01852},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}