
Implementation for the paper "Hierarchical Conditional Relation Networks for Video Question Answering" (Le et al., CVPR 2020, Oral)


thaolmk54/hcrn-videoqa


Hierarchical Conditional Relation Networks for Video Question Answering

We introduce a general-purpose reusable neural unit called Conditional Relation Network (CRN) that encapsulates and transforms an array of tensorial objects into a new array of the same kind, conditioned on a contextual feature. The flexibility of CRN units is then examined in solving Video Question Answering, a challenging problem requiring joint comprehension of video content and natural language processing.

Illustrations of the CRN unit and the resulting HCRN model for VideoQA:

(Figures: CRN unit; HCRN architecture for VideoQA.)

Check out our paper for details.
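
To make the idea of a unit that maps an array of objects to a new array, conditioned on a context feature, more concrete, here is a toy pure-Python sketch. It is not the repository's implementation: the name `crn_unit`, mean pooling as the aggregator, element-wise multiplication as the conditioning step, and the subset sizes starting at 2 are all illustrative assumptions. The paper gives the actual formulation.

```python
import itertools
import random


def mean_pool(vectors):
    """Average a non-empty list of equal-length vectors element-wise."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def crn_unit(objects, context, k_max, num_subsets, seed=0):
    """Toy conditional relation unit (illustrative assumption, not the repo's code).

    For each subset size k in 2..k_max, sample up to `num_subsets` subsets of
    `objects`, pool each subset, modulate it by `context`, and pool the results.
    Returns one output vector per k, so the output is again an array of vectors.
    """
    rng = random.Random(seed)
    outputs = []
    for k in range(2, k_max + 1):
        all_subsets = list(itertools.combinations(objects, k))
        chosen = rng.sample(all_subsets, min(num_subsets, len(all_subsets)))
        conditioned = []
        for subset in chosen:
            pooled = mean_pool(list(subset))
            # Hypothetical conditioning: element-wise product with the context.
            conditioned.append([p * c for p, c in zip(pooled, context)])
        outputs.append(mean_pool(conditioned))
    return outputs


# Four 3-dimensional "clip" vectors and one 3-dimensional context (e.g. a question).
clips = [[1.0, 0.0, 2.0], [0.0, 1.0, 2.0], [1.0, 1.0, 2.0], [0.0, 0.0, 2.0]]
question = [1.0, 1.0, 0.5]
result = crn_unit(clips, question, k_max=3, num_subsets=2)
```

Because the output is again a list of vectors, such units can be stacked, with the output of one feeding the input of the next.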

Setup

  1. Clone the repository:

    git clone https://github.com/thaolmk54/hcrn-videoqa.git

  2. Download the TGIF-QA, MSRVTT-QA, and MSVD-QA datasets and edit the corresponding paths in the repo to match where you store your data.

  3. Install dependencies:

conda create -n hcrn_videoqa python=3.6
conda activate hcrn_videoqa
conda install -c conda-forge ffmpeg
pip install -r requirements.txt

Experiments with TGIF-QA

Depending on the task, choose question_type from one of 4 options: action, transition, count, frameqa.

Preprocessing visual features

  1. To extract appearance features:

    python preprocess/preprocess_features.py --gpu_id 2 --dataset tgif-qa --model resnet101 --question_type {question_type}

  2. To extract motion features:

    Download the ResNeXt-101 pretrained model (resnext-101-kinetics.pth) and place it in data/preprocess/pretrained/.

    python preprocess/preprocess_features.py --dataset tgif-qa --model resnext101 --image_height 112 --image_width 112 --question_type {question_type}

Note: Extracting visual features takes a long time. You can download our pre-extracted features (action task) from here (link not available) for appearance and here for motion.
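
The two extraction commands above differ only in their flags, so they can be assembled for any question_type with a small helper. This is a convenience sketch, not part of the repository: the function name `feature_commands` and the default `gpu_id=2` (copied from the example above) are assumptions.

```python
QUESTION_TYPES = ("action", "transition", "count", "frameqa")


def feature_commands(question_type, gpu_id=2):
    """Return the two TGIF-QA feature-extraction commands as argument lists."""
    if question_type not in QUESTION_TYPES:
        raise ValueError(f"question_type must be one of {QUESTION_TYPES}")
    script = ["python", "preprocess/preprocess_features.py"]
    # Appearance features (ResNet-101), flags as in step 1.
    appearance = script + [
        "--gpu_id", str(gpu_id), "--dataset", "tgif-qa",
        "--model", "resnet101", "--question_type", question_type,
    ]
    # Motion features (ResNeXt-101), flags as in step 2.
    motion = script + [
        "--dataset", "tgif-qa", "--model", "resnext101",
        "--image_height", "112", "--image_width", "112",
        "--question_type", question_type,
    ]
    return [appearance, motion]


for cmd in feature_commands("action"):
    print(" ".join(cmd))
```

Each returned list can be passed to `subprocess.run(cmd, check=True)` from the repository root.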

Preprocess linguistic features

  1. Download GloVe pretrained 300d word vectors to data/glove/ and process them into a pickle file:

    python txt2pickle.py

  2. Preprocess train/val/test questions:

    python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --glove_pt data/glove/glove.840.300d.pkl --mode train

    python preprocess/preprocess_questions.py --dataset tgif-qa --question_type {question_type} --mode test
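
Step 1 above converts the GloVe text vectors into a pickle file. The repository's txt2pickle.py is authoritative for the format that preprocess_questions.py expects. The following is only a sketch of what such a conversion can look like, assuming the standard GloVe text layout (a token followed by space-separated floats) and a plain `{token: vector}` dict as the pickled object. The names `parse_glove_line` and `glove_to_pickle` are hypothetical.

```python
import pickle


def parse_glove_line(line, dim=300):
    """Split one GloVe text line into (token, vector).

    Splitting from the right keeps tokens that themselves contain spaces intact.
    """
    parts = line.rstrip("\n").rsplit(" ", dim)
    token, values = parts[0], parts[1:]
    if len(values) != dim:
        raise ValueError(f"expected {dim} values, got {len(values)}")
    return token, [float(v) for v in values]


def glove_to_pickle(txt_path, pkl_path, dim=300):
    """Read a GloVe text file and pickle it as a {token: vector} dict."""
    vectors = {}
    with open(txt_path, encoding="utf-8") as f:
        for line in f:
            token, vector = parse_glove_line(line, dim)
            vectors[token] = vector
    with open(pkl_path, "wb") as f:
        pickle.dump(vectors, f)
    return len(vectors)
```

Vectors are stored as Python lists here to keep the sketch dependency-free. Real code would more likely store NumPy arrays.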

Training

Choose the config file configs/{task}.yml that matches one of the 4 tasks: action, transition, count, frameqa. For example, to train on the action task, run the following command:

python train.py --cfg configs/tgif_qa_action.yml
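
The text gives the config path generically as configs/{task}.yml, while the example uses configs/tgif_qa_action.yml. Assuming the other three tasks follow the same `tgif_qa_{task}.yml` naming (an assumption; check the configs/ directory), a small helper can build the command for any task. The function names are hypothetical.

```python
TASKS = ("action", "transition", "count", "frameqa")


def tgif_config_path(task):
    """Return the assumed config path for a TGIF-QA task."""
    if task not in TASKS:
        raise ValueError(f"unknown task {task!r}; expected one of {TASKS}")
    # Assumption: every task follows the naming of configs/tgif_qa_action.yml.
    return f"configs/tgif_qa_{task}.yml"


def run_command(script, task):
    """Build the argument list for a script that takes --cfg, such as train.py."""
    return ["python", script, "--cfg", tgif_config_path(task)]


print(" ".join(run_command("train.py", "action")))
```

The same helper works for validate.py, which takes the same --cfg argument.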

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/tgif_qa_action.yml

Note: A pretrained model for the action task is available here. Save the file in results/expTGIF-QAAction/ckpt/ for evaluation.

Experiments with MSRVTT-QA and MSVD-QA

The following commands run experiments with the MSRVTT-QA dataset; replace msrvtt-qa with msvd-qa to run with the MSVD-QA dataset.

Preprocessing visual features

  1. To extract appearance features:

    python preprocess/preprocess_features.py --gpu_id 2 --dataset msrvtt-qa --model resnet101

  2. To extract motion features:

    python preprocess/preprocess_features.py --dataset msrvtt-qa --model resnext101 --image_height 112 --image_width 112
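
Since the MSRVTT-QA and MSVD-QA commands differ only in the --dataset value, both feature-extraction steps can be generated from one template. This is a sketch, not part of the repository: the name `visual_feature_commands` and the default `gpu_id=2` (taken from the example above) are assumptions.

```python
DATASETS = ("msrvtt-qa", "msvd-qa")


def visual_feature_commands(dataset, gpu_id=2):
    """Return appearance and motion feature-extraction commands for a dataset."""
    if dataset not in DATASETS:
        raise ValueError(f"dataset must be one of {DATASETS}")
    base = ["python", "preprocess/preprocess_features.py"]
    return [
        # Appearance features (ResNet-101).
        base + ["--gpu_id", str(gpu_id), "--dataset", dataset, "--model", "resnet101"],
        # Motion features (ResNeXt-101) at 112x112 input resolution.
        base + ["--dataset", dataset, "--model", "resnext101",
                "--image_height", "112", "--image_width", "112"],
    ]


for dataset in DATASETS:
    for cmd in visual_feature_commands(dataset):
        print(" ".join(cmd))
```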

Preprocess linguistic features

Preprocess train/val/test questions:

python preprocess/preprocess_questions.py --dataset msrvtt-qa --glove_pt data/glove/glove.840.300d.pkl --mode train

python preprocess/preprocess_questions.py --dataset msrvtt-qa --question_type {question_type} --mode val

python preprocess/preprocess_questions.py --dataset msrvtt-qa --question_type {question_type} --mode test

Training

python train.py --cfg configs/msrvtt_qa.yml

Evaluation

To evaluate the trained model, run the following:

python validate.py --cfg configs/msrvtt_qa.yml

Acknowledgement

  • For motion feature extraction, we adapt the ResNeXt-101 model from this repo to our code. Thanks to @kenshohara for releasing the code and the pretrained models.
  • We refer to this repo for preprocessing.
  • Our dataloader implementation is based on this repo.
