This is the official repo of the paper M2CLIP: A Multimodal, Multi-Task Adapting Framework for Video Action Recognition.
@inproceedings{wang2024multimodal,
title={A Multimodal, Multi-Task Adapting Framework for Video Action Recognition},
author={Wang, Mengmeng and Xing, Jiazheng and Jiang, Boyuan and Chen, Jun and Mei, Jianbiao and Zuo, Xingxing and Dai, Guang and Wang, Jingdong and Liu, Yong},
booktitle={Proceedings of the AAAI Conference on Artificial Intelligence},
volume={38},
number={6},
pages={5517--5525},
year={2024}
}
We use conda to manage the Python environment. The dumped configuration is provided at environment.yml.
Some common configurations (e.g., dataset paths, pretrained backbone paths) are set in config.py. We've included an example configuration in config.py.example, which contains all required fields with values left empty. Please copy config.py.example to config.py and fill in the values before running the models.
The data list should be organized as follows:
<video_1> <label_1>
<video_2> <label_2>
...
<video_N> <label_N>
where <video_i> is the path to a video file, and <label_i> is an integer between 0 and NUM_CLASSES - 1 indicating the class of the video.
We release the data lists we used for Kinetics-400 (k400, train list link, val list link) and Something-Something-V2 (ssv2, train list link, val list link), which reflect the class mapping of the released models and the videos available on our side. It is strongly recommended to clean the Kinetics-400 lists first, as some videos may have been taken down from YouTube for various reasons (in the current implementation, training will stop on broken videos).
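If you need to clean the Kinetics-400 lists, a minimal sketch is shown below. It tries to open every listed video with OpenCV and writes out only the readable entries; the function name, the OpenCV-based check, and all paths are illustrative assumptions, not part of this repo.

```python
import os
import cv2  # opencv-python; any decodability check would do, this is only an illustration

def clean_list(list_path, video_root, out_path):
    """Keep only entries whose video file exists and can be opened."""
    kept = []
    with open(list_path) as f:
        for line in f:
            rel_path, label = line.strip().rsplit(" ", 1)
            full_path = os.path.join(video_root, rel_path)
            cap = cv2.VideoCapture(full_path)
            if os.path.isfile(full_path) and cap.isOpened():
                kept.append(f"{rel_path} {label}\n")
            cap.release()
    with open(out_path, "w") as f:
        f.writelines(kept)

# Hypothetical paths -- substitute your own.
clean_list("k400_train_list.txt", "/data/k400/train", "k400_train_list_clean.txt")
```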
After obtaining the videos and the data lists, set the root directories and the list paths in the DATASETS dictionary in config.py (fill in the blanks for k400 and ssv2, or add new items for custom datasets). For each dataset, 5 fields are required:
- TRAIN_ROOT: the root directory that the video paths in the training list are relative to.
- VAL_ROOT: the root directory that the video paths in the validation list are relative to.
- TRAIN_LIST: the path to the training video list.
- VAL_LIST: the path to the validation video list.
- NUM_CLASSES: the number of classes in the dataset.
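For illustration, a minimal sketch of a possible DATASETS entry in config.py is given below. The five field names follow the list above; the overall dictionary layout and the paths are assumptions, so treat config.py.example as the authoritative reference.

```python
# Hypothetical sketch of the DATASETS dictionary in config.py; paths are placeholders.
DATASETS = {
    "k400": {
        "TRAIN_ROOT": "/data/kinetics400/train",
        "VAL_ROOT": "/data/kinetics400/val",
        "TRAIN_LIST": "/data/kinetics400/k400_train_list.txt",
        "VAL_LIST": "/data/kinetics400/k400_val_list.txt",
        "NUM_CLASSES": 400,
    },
    "ssv2": {
        "TRAIN_ROOT": "/data/ssv2/videos",
        "VAL_ROOT": "/data/ssv2/videos",
        "TRAIN_LIST": "/data/ssv2/ssv2_train_list.txt",
        "VAL_LIST": "/data/ssv2/ssv2_val_list.txt",
        "NUM_CLASSES": 174,
    },
}
```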
We use the CLIP checkpoints from the official release. Put the downloaded checkpoint paths in config.py. The currently supported architectures are CLIP-ViT-B/16 (set CLIP_VIT_B16_PATH) and CLIP-ViT-L/14 (set CLIP_VIT_L14_PATH).
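As a minimal sketch, and assuming the checkpoints have been downloaded locally, the corresponding lines in config.py might look like this (the paths are placeholders):

```python
# Hypothetical local paths to the official CLIP checkpoints; adjust to your download location.
CLIP_VIT_B16_PATH = "/checkpoints/clip/ViT-B-16.pt"
CLIP_VIT_L14_PATH = "/checkpoints/clip/ViT-L-14.pt"
```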
We provide some preset scripts with recommended settings in the scripts/ directory. For a detailed description of the command-line arguments, see the help message of main.py.
The CLIP model implementation is modified from the official CLIP repo. This repo is built upon ST-Adapter. Thanks for their awesome work!