All the model weights are saved together with the clip_teacher weights, which are loaded from the CLIP vision encoder. We take the models after K400 masked pretraining and further pretrain them on the following multimodal corpora (a checkpoint-loading sketch follows the list):
- 5M: CC3M + WebVid2M
- 17M: CC3M + CC12M + COCO + VG + SBU + WebVid2M
- 25M: CC3M + CC12M + COCO + VG + SBU + WebVid10M
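Below is a minimal, hypothetical sketch of inspecting such a pretraining checkpoint and separating the video backbone weights from the bundled clip_teacher weights. The checkpoint filename, the `"model"` wrapper key, and the `clip_teacher` key prefix are assumptions for illustration, not the repo's confirmed format.

```python
import torch

# Hypothetical filename; substitute the checkpoint downloaded from aliyun / HF.
ckpt_path = "videomamba_m16_25M_f8_res224.pth"
state = torch.load(ckpt_path, map_location="cpu")

# Pretraining checkpoints commonly wrap the weights under a "model" key (assumption).
state_dict = state.get("model", state)

# Split backbone tensors from clip_teacher tensors so the backbone can be
# loaded on its own; the "clip_teacher" prefix is an assumed naming convention.
backbone = {k: v for k, v in state_dict.items() if not k.startswith("clip_teacher")}
teacher = {k: v for k, v in state_dict.items() if k.startswith("clip_teacher")}
print(f"{len(backbone)} backbone tensors, {len(teacher)} clip_teacher tensors")
```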
| Model | Setting | Weights | Script |
|---|---|---|---|
| VideoMamba-M | 5M | aliyun, 🤗HF | script |
| VideoMamba-M | 17M | aliyun, 🤗HF | script |
| VideoMamba-M | 25M | aliyun, 🤗HF | script |
| Dataset | Retrieval | VideoMamba-M (5M) | VideoMamba-M (17M) | VideoMamba-M (25M) |
|---|---|---|---|---|
| MSRVTT | T2V | R@1: 32.0, R@5: 53.1, R@10: 63.6 | R@1: 34.7, R@5: 58.9, R@10: 68.0 | R@1: 35.6, R@5: 58.1, R@10: 69.5 |
| MSRVTT | V2T | R@1: 28.2, R@5: 47.6, R@10: 56.5 | R@1: 29.5, R@5: 49.9, R@10: 60.1 | R@1: 29.1, R@5: 51.6, R@10: 62.2 |
| DiDeMo | T2V | R@1: 36.6, R@5: 61.7, R@10: 70.3 | R@1: 42.0, R@5: 67.3, R@10: 76.8 | R@1: 43.1, R@5: 68.1, R@10: 77.7 |
| DiDeMo | V2T | R@1: 38.3, R@5: 64.7, R@10: 73.3 | R@1: 42.3, R@5: 68.2, R@10: 76.9 | R@1: 43.8, R@5: 69.7, R@10: 77.8 |
| ActivityNet | T2V | R@1: 35.9, R@5: 61.1, R@10: 72.3 | R@1: 40.1, R@5: 65.7, R@10: 76.1 | R@1: 41.0, R@5: 67.5, R@10: 77.8 |
| ActivityNet | V2T | R@1: 32.8, R@5: 58.8, R@10: 69.9 | R@1: 34.2, R@5: 61.8, R@10: 73.2 | R@1: 37.1, R@5: 65.0, R@10: 75.1 |
| LSMDC | T2V | R@1: 18.0, R@5: 36.1, R@10: 43.4 | R@1: 18.4, R@5: 35.3, R@10: 43.0 | R@1: 20.4, R@5: 37.1, R@10: 45.7 |
| LSMDC | V2T | R@1: 15.9, R@5: 31.0, R@10: 39.2 | R@1: 16.5, R@5: 32.1, R@10: 40.0 | R@1: 17.9, R@5: 34.6, R@10: 42.1 |
| MSVD | T2V | R@1: 38.0, R@5: 68.6, R@10: 79.0 | R@1: 40.3, R@5: 70.0, R@10: 79.7 | R@1: 42.0, R@5: 71.6, R@10: 81.2 |
| MSVD | V2T | R@1: 57.5, R@5: 79.9, R@10: 85.4 | R@1: 61.8, R@5: 81.0, R@10: 87.0 | R@1: 62.7, R@5: 82.8, R@10: 87.6 |
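The R@k numbers above are standard retrieval recalls. A minimal sketch of how they can be computed from a text-video similarity matrix is shown below; it assumes a square matrix with one matching video per text query (diagonal ground truth), which is not necessarily how the repo's evaluation code is implemented.

```python
import torch

def recall_at_k(sim: torch.Tensor, ks=(1, 5, 10)):
    """R@k from a [num_texts, num_videos] similarity matrix,
    assuming text i matches video i (diagonal ground truth)."""
    gt = torch.arange(sim.size(0), device=sim.device)
    # Sort candidates by similarity, then find the rank of the ground truth.
    order = sim.argsort(dim=1, descending=True)
    ranks = (order == gt.unsqueeze(1)).nonzero()[:, 1]
    return {f"R@{k}": (ranks < k).float().mean().item() * 100 for k in ks}

# Toy usage with random similarities in place of real model outputs.
sim = torch.randn(1000, 1000)
print(recall_at_k(sim))      # text-to-video (T2V)
print(recall_at_k(sim.t()))  # video-to-text (V2T)
```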