
MSPAN-VideoQA

Multi-Scale Progressive Attention Network for Video Question Answering, ACL 2021.

Zhicheng Guo, Jiaxuan Zhao, Licheng Jiao, Xu Liu, Lingling Li

Setups

  1. Install the Python dependency packages:

    pip install -r requirements.txt
  2. Download the TGIF-QA, MSVD-QA, and MSRVTT-QA datasets, then edit the absolute paths in preprocess/question_features.py, preprocess/appearance_features.py, and preprocess/motion_features.py to match where your data is stored.
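
The variable names below are placeholders, not the repository's actual identifiers; this is only a minimal sketch of the kind of path edit those three scripts expect:

    # Illustrative only -- the real variables live in the preprocess/*.py scripts.
    # Point these at wherever the downloaded datasets sit on disk.
    DATASET_ROOT = '/path/to/datasets'              # placeholder root directory
    TGIF_QA_DIR = DATASET_ROOT + '/tgif-qa'         # TGIF-QA annotations and GIFs
    MSVD_QA_DIR = DATASET_ROOT + '/msvd-qa'         # MSVD-QA annotations and videos
    MSRVTT_QA_DIR = DATASET_ROOT + '/msrvtt-qa'     # MSRVTT-QA annotations and videos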

Preprocessing features

For the three VideoQA datasets above, --dataset accepts three options:

tgif-qa, msvd-qa, and msrvtt-qa.

Depending on the dataset, --question_type accepts five options:

none, action, count, frameqa and transition.
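
As a minimal sketch (assuming standard argparse usage; the defaults and grouping here are not taken from the repository), these flags are typically declared like this:

    import argparse

    # Sketch of the command-line flags shared by the preprocessing scripts.
    parser = argparse.ArgumentParser()
    parser.add_argument('--dataset', required=True,
                        choices=['tgif-qa', 'msvd-qa', 'msrvtt-qa'])
    parser.add_argument('--question_type', default='none',
                        choices=['none', 'action', 'count', 'frameqa', 'transition'])
    args = parser.parse_args()

As in the commands below, msvd-qa and msrvtt-qa use none, while action, count, frameqa, and transition name the TGIF-QA tasks.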

Extracting question features

  1. Download GloVe 300D to preprocess/pretrained/ and process it into a pickle file (a rough sketch of this conversion follows the commands below):

    python preprocess/txt2pickle.py
  2. Extract the question features:

    For TGIF-QA dataset:

    python preprocess/question_features.py \
            --dataset tgif-qa \
            --question_type action \
            --mode total
    python preprocess/question_features.py \
            --dataset tgif-qa \
            --question_type action \
            --mode train
    python preprocess/question_features.py \
            --dataset tgif-qa \
            --question_type action \
            --mode test

    For MSVD-QA/MSRVTT-QA dataset:

    python preprocess/question_features.py \
            --dataset msvd-qa \
            --question_type none \
            --mode total
    python preprocess/question_features.py \
            --dataset msvd-qa \
            --question_type none \
            --mode train
    python preprocess/question_features.py \
            --dataset msvd-qa \
            --question_type none \
            --mode val
    python preprocess/question_features.py \
            --dataset msvd-qa \
            --question_type none \
            --mode test
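
For step 1, the exact logic of preprocess/txt2pickle.py is not shown here; a minimal sketch of the usual GloVe-to-pickle conversion looks like the following (the input and output file names are assumptions):

    import pickle
    import numpy as np

    # Each line of a GloVe 300-d text file is "<word> <300 floats>".
    glove_txt = 'preprocess/pretrained/glove.840B.300d.txt'   # assumed input name
    word2vec = {}
    with open(glove_txt, encoding='utf-8') as f:
        for line in f:
            parts = line.rstrip().split(' ')
            word2vec[parts[0]] = np.asarray(parts[1:], dtype=np.float32)

    with open('preprocess/pretrained/glove.pkl', 'wb') as f:  # assumed output name
        pickle.dump(word2vec, f)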

Extracting visual features

  1. Download the pre-trained 3D-ResNet152 to preprocess/pretrained/.

    You can learn more about this model in the following paper:

    "Would Mega-scale Datasets Further Enhance Spatiotemporal 3D CNNs", arXiv preprint, 2020.

  2. To extract appearance features:

    python preprocess/appearance_features.py \
            --gpu_id 0 \
            --dataset tgif-qa \
            --question_type action \
            --feature_type pool5 \
            --num_frames 16
  3. To extract motion features:

    python preprocess/motion_features.py \
            --gpu_id 0 \
            --dataset tgif-qa \
            --question_type action \
            --num_frames 16
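
The scripts above handle frame decoding and the downloaded 3D-ResNet152 weights themselves. As a rough illustration of the appearance branch only (a torchvision ResNet-152 stands in for the backbone here; the sampling and pooling details are assumptions, not the repository's code):

    import torch
    import torchvision.models as models
    import torchvision.transforms as T

    num_frames = 16
    # Stand-in 2D backbone; drop the classifier to keep the pooled 2048-d feature.
    backbone = models.resnet152(weights=models.ResNet152_Weights.IMAGENET1K_V1)
    backbone.fc = torch.nn.Identity()
    backbone.eval()

    transform = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                           T.Normalize(mean=[0.485, 0.456, 0.406],
                                       std=[0.229, 0.224, 0.225])])

    @torch.no_grad()
    def clip_appearance_features(frames):
        # frames: list of PIL images for one clip; sample num_frames uniformly.
        idx = torch.linspace(0, len(frames) - 1, num_frames).long()
        batch = torch.stack([transform(frames[i]) for i in idx])
        return backbone(batch)          # shape: (num_frames, 2048)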

Training

Choose the appropriate --dataset and --question_type, then start training:

python train.py \
        --dataset tgif-qa \
        --question_type action \
        --T 2 \
        --K 3 \
        --num_scale 8 \
        --num_frames 16 \
        --gpu_id 0 \
        --max_epochs 30 \
        --batch_size 64 \
        --dropout 0.1 \
        --model_id 0 \
        --use_test \
        --use_train

Or, you can run the following command to start training:

sh train_sh/action.sh

You can see the training commands for all datasets and tasks under the train_sh folder.
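
The repository's train.py wires these flags into the full MSPAN model; the following is only a generic sketch of the surrounding training loop (the model, data loader, and optimizer choice are assumptions), with the answer treated as a classification target:

    import torch
    import torch.nn as nn

    def train(model, train_loader, max_epochs=30, lr=1e-4, device='cuda'):
        # Generic VideoQA training loop: cross-entropy over answer classes.
        model.to(device).train()
        optimizer = torch.optim.Adam(model.parameters(), lr=lr)
        criterion = nn.CrossEntropyLoss()
        for epoch in range(max_epochs):
            for visual, question, answer in train_loader:
                logits = model(visual.to(device), question.to(device))
                loss = criterion(logits, answer.to(device))
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()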

Evaluation

You can download our pre-trained models from here.

To evaluate the trained model, run the following command:

sh test_sh/action.sh

You can see the evaluation commands for all datasets and tasks under the test_sh folder.
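
As a generic sketch of how accuracy on a test split might be measured (the model and data loader interfaces here are assumptions, not the repository's test code):

    import torch

    @torch.no_grad()
    def evaluate(model, loader, device='cuda'):
        # Fraction of questions whose top-scoring answer matches the ground truth.
        model.eval()
        correct, total = 0, 0
        for visual, question, answer in loader:
            logits = model(visual.to(device), question.to(device))
            correct += (logits.argmax(dim=-1).cpu() == answer).sum().item()
            total += answer.size(0)
        return correct / total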

Citation

@inproceedings{guo2021multi,
  title={Multi-scale progressive attention network for video question answering},
  author={Guo, Zhicheng and Zhao, Jiaxuan and Jiao, Licheng and Liu, Xu and Li, Lingling},
  booktitle={Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 2: Short Papers)},
  pages={973--978},
  year={2021}
}
