## Usage of this repo
### Data preparation
You can download the UCF101 and HMDB51 datasets from the official websites: [UCF101](http://crcv.ucf.edu/data/UCF101.php) and [HMDB51](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Then decode the videos into frames.
I highly recommend using the pre-computed optical flow images and resized RGB frames from this [repo](https://github.com/feichtenhofer/twostreamfusion).
If you use the pre-computed frames, the folder structure looks like `path/to/dataset/video_id/frames.jpg`. If you decode frames on your own, the folder structure may instead be `path/to/dataset/class_name/video_id/frames.jpg`, in which case you need to pay extra attention to the corresponding paths during dataset preparation.
For pre-computed frames, find `rgb_folder`, `u_folder` and `v_folder` in `datasets/ucf101.py` (for UCF101) and change them to match your environment. Please note that all of these modalities are expected to exist, even though in some settings the optical flow data are not used to train the model.
If you have not prepared optical flow data, simply setting `u_folder=rgb_folder` and `v_folder=rgb_folder` should avoid path errors.
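For reference, here is a minimal sketch of what those variables might look like in `datasets/ucf101.py`; the concrete paths (and the `jpegs_256`/`tvl1_flow` layout of the twostreamfusion download) are illustrative assumptions, so adjust them to your environment:

```python
# Hypothetical excerpt of datasets/ucf101.py -- paths are placeholders.
# Each modality lives in its own top-level folder:
#   <folder>/<video_id>/<frame>.jpg
rgb_folder = '/path/to/ucf101/jpegs_256/'   # resized RGB frames
u_folder = '/path/to/ucf101/tvl1_flow/u/'   # horizontal optical flow
v_folder = '/path/to/ucf101/tvl1_flow/v/'   # vertical optical flow

# No optical flow prepared? Point both flow folders at the RGB frames
# so the data loader still resolves valid paths:
# u_folder = rgb_folder
# v_folder = rgb_folder
```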
### Train self-supervised learning part
In this way, only testing is conducted using the given model.
**Note**: The accuracies using residual clips are not stable on the validation set (this may also be caused by the limited number of validation samples); the final testing part will use the best model on the validation set.
If everything is fine, you can achieve around 70% accuracy on UCF101. Results will vary somewhat across different random seeds.
## Results
### Retrieval results
The table lists retrieval results on UCF101 *split* 1. We reimplemented CMC and report its results; other results are taken from the corresponding papers. VCOP, VCP, CMC, PRP, and ours are based on the R3D network backbone.
### Recognition results

We only use R3D as our network backbone. In this table, all reported results are pre-trained on UCF101.
Usually, the following also help: 1. using ResNet-18-3D, R(2+1)D, or deeper networks; 2. pre-training on larger datasets; 3. using larger input resolutions; 4. combining with audio or other features.
Method | UCF101 | HMDB51
---|---|---
CMC (3 views) | 59.1 | 26.7
R3D (random) | 54.5 | 23.4
ImageNet-inflated | 60.3 | 30.7
3D ST-puzzle | 65.8 | 33.7
VCOP (R3D) | 64.9 | 29.5
VCOP (R(2+1)D) | 72.4 | 30.9
VCP (R3D) | 66.0 | 31.5
Ours (repeat + res, R3D) | 72.8 | 35.3
Ours (repeat + u, R3D) | 72.7 | 36.8
Ours (shuffle + res, R3D) | **74.4** | **38.3**
Ours (shuffle + v, R3D) | 67.0 | 34.0
PRP (R3D) | 66.5 | 29.7
PRP (R(2+1)D) | 72.1 | 35.0
**Residual clips + 3D CNN** Residual clips with 3D CNNs are effective, especially when training from scratch. More information about this part can be found in [Rethinking Motion Representation: Residual Frames with 3D ConvNets for Better Action Recognition](https://arxiv.org/abs/2001.05661) (an earlier but more detailed version) and [Motion Representation Using Residual Frames with 3D CNN](https://arxiv.org/abs/2006.13017) (a short version with better results).
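As a rough illustration of the idea (a sketch of the papers' formulation, not necessarily this repo's exact preprocessing), a residual clip can be obtained by differencing adjacent frames:

```python
import torch

def to_residual_clip(clip: torch.Tensor) -> torch.Tensor:
    """Convert an RGB clip into a residual clip by frame differencing.

    clip: float tensor of shape (C, T, H, W), the usual 3D-CNN input layout.
    Returns a (C, T - 1, H, W) tensor whose "frames" are differences of
    adjacent input frames, which emphasizes motion over static appearance.
    """
    return clip[:, 1:] - clip[:, :-1]

# Example: a 16-frame clip yields a 15-frame residual clip.
clip = torch.rand(3, 16, 112, 112)
print(to_residual_clip(clip).shape)  # torch.Size([3, 15, 112, 112])
```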
We also reimplemented VCP in this [repo](https://github.com/BestJuly/VCP). By simply using residual clips, significant improvements can be obtained in both video retrieval and video recognition.
## Pretrained weights
We provide pretrained weights from the self-supervised training step: R3D [(Google Drive)](https://drive.google.com/file/d/17c5KJuPFEHt0vCjrMPO3UfS7BN8nNESX/view?usp=sharing).
> With this model, for video retrieval, you should achieve:
> - 33.4% @top1 with `--modality=res --merge=False`
> - 34.8% @top1 with `--modality=rgb --merge=False`
> - 36.5% @top1 with `--modality=res --merge=True`
We may add more pretrained weights to support different network backbones in the future.
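For loading the weights, here is a minimal sketch in PyTorch; the `models.r3d.R3DNet` import path, constructor arguments, and the checkpoint's key layout are assumptions, so adapt them to this repo's actual model definition:

```python
import torch
from models.r3d import R3DNet  # assumed module/class name -- check the repo

model = R3DNet()  # constructor args may differ; match the repo's config
ckpt = torch.load('r3d_ssl_weights.pth', map_location='cpu')

# Some checkpoints wrap the weights in a 'state_dict' entry; handle both cases.
state_dict = ckpt.get('state_dict', ckpt)

# strict=False tolerates missing/extra keys such as projection heads
# that are only used during self-supervised training.
model.load_state_dict(state_dict, strict=False)
model.eval()
```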