This is the official implementation of Predictive Dynamic Fusion (ICML 2024) by Bing Cao, Yinan Xia, Yi Ding, Changqing Zhang, and Qinghua Hu.
Multimodal fusion is crucial in joint decision-making systems for rendering holistic judgments. Since multimodal data changes in open environments, dynamic fusion has emerged and achieved remarkable progress in numerous applications. However, most existing dynamic multimodal fusion methods lack theoretical guarantees and easily fall into suboptimal solutions, yielding unreliability and instability. To address this issue, we propose a Predictive Dynamic Fusion (PDF) framework for multimodal learning. We revisit multimodal fusion from a generalization perspective and theoretically derive the predictable Collaborative Belief (Co-Belief) with Mono- and Holo-Confidence, which provably reduces the upper bound of the generalization error. Accordingly, we further propose a relative regularization strategy to calibrate the predicted Co-Belief under potential uncertainty. Extensive experiments on multiple benchmarks confirm the superiority of our method.
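To make the fusion idea concrete, below is a minimal, hypothetical PyTorch sketch of confidence-weighted late fusion, in which each modality predicts logits plus a scalar confidence and the fused logits are a confidence-weighted sum. The names and design here (ConfidenceWeightedFusion, the sigmoid confidence head, the normalization) are illustrative assumptions, not the paper's exact Mono-/Holo-Confidence or Co-Belief formulation.

import torch
import torch.nn as nn

class ConfidenceWeightedFusion(nn.Module):
    # Illustrative late fusion: per-modality logits are combined with
    # predicted confidences. A sketch of the general idea only, not the
    # paper's exact Co-Belief computation.
    def __init__(self, encoders, feat_dim, num_classes):
        super().__init__()
        self.encoders = nn.ModuleList(encoders)  # one encoder per modality
        self.heads = nn.ModuleList(
            [nn.Linear(feat_dim, num_classes) for _ in encoders])
        self.confs = nn.ModuleList(
            [nn.Sequential(nn.Linear(feat_dim, 1), nn.Sigmoid())
             for _ in encoders])  # scalar confidence in (0, 1) per modality

    def forward(self, inputs):
        logits, confs = [], []
        for x, enc, head, conf in zip(inputs, self.encoders, self.heads, self.confs):
            feat = enc(x)
            logits.append(head(feat))   # (batch, num_classes)
            confs.append(conf(feat))    # (batch, 1)
        confs = torch.cat(confs, dim=1)                    # (batch, n_modalities)
        weights = confs / confs.sum(dim=1, keepdim=True)   # normalize across modalities
        fused = (weights.unsqueeze(-1) * torch.stack(logits, dim=1)).sum(dim=1)
        return fused, logits, confs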
numpy==1.21.6
Pillow==9.4.0
pytorch_pretrained_bert==0.6.2
scikit_learn==1.0.2
torch==1.11.0+cu113
torchvision==0.12.0+cu113
tqdm==4.65.0
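The pinned dependencies above can be installed with pip (assuming they are collected in a requirements.txt at the repo root); note that the +cu113 builds of torch and torchvision come from the PyTorch CUDA 11.3 wheel index:

pip install -r requirements.txt --extra-index-url https://download.pytorch.org/whl/cu113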
Step 1: Download food101 and MVSA_Single and put them in the datasets folder.
Step 2: Prepare the train/dev/test split jsonl files. We follow the QMF settings and provide them in the corresponding folders.
Step 3 (optional): If you want to use GloVe embeddings for the BoW model, download glove.840B.300d.txt and put it in the datasets/glove_embeds folder. For the BERT model, download bert-base-uncased and put it in the root folder bert-base-uncased/. An illustrative layout is shown after this list.
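Assuming the default paths above, the resulting layout should look like the following (dataset subfolder contents omitted, and exact dataset subfolder names may differ):

datasets/
├── food101/
├── MVSA_Single/
└── glove_embeds/
    └── glove.840B.300d.txt
bert-base-uncased/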
bash ./shells/batch_train_latefusion_pdf.sh
Tip: at the beginning of training, the confidence predictor may output very small values when the batch size is small, so taking the log can produce NaN; this can be mitigated by reducing the learning rate or increasing the weight decay.
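As an additional safeguard (our own suggestion, not part of the released scripts), clamping the predicted confidence away from zero before taking the log keeps the loss finite; the eps value is an illustrative choice:

import torch

def safe_log_confidence(conf: torch.Tensor, eps: float = 1e-6) -> torch.Tensor:
    # Clamp tiny confidence values so log() stays finite early in training.
    # eps = 1e-6 is a hypothetical choice; tune it alongside the learning
    # rate and weight decay as suggested above.
    return torch.log(conf.clamp_min(eps))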
bash ./shells/batch_test_latefusion_pdf.sh
@article{cao2024predictive,
  title={Predictive Dynamic Fusion},
  author={Cao, Bing and Xia, Yinan and Ding, Yi and Zhang, Changqing and Hu, Qinghua},
  journal={arXiv preprint arXiv:2406.04802},
  year={2024}
}
The code is inspired by Provable Dynamic Fusion for Low-Quality Multimodal Data (QMF).