Official code for the paper "Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework".
- python 3.7.4
- accimage
## Inter-intra contrastive (IIC) framework
For samples, we have:
- [ ] Inter-positives: samples with **same labels**, not used for self-supervised learning;
- [x] Inter-negatives: **different samples**, or samples with different indexes;
The **inter-intra learning framework** can be extended to
- Different intra-negative generation methods: frame repeating, frame shuffling ... (a sketch follows this list)
- Different backbones: C3D, R3D, R(2+1)D, I3D ...
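To illustrate the intra-negative idea, here is a minimal sketch of frame repeating and frame shuffling on a clip tensor. It assumes clips shaped `(channels, frames, height, width)`; the function name and `mode` argument are hypothetical, not this repo's API.

```
import torch

def make_intra_negative(clip, mode="shuffle"):
    # Break the temporal structure of a (C, T, H, W) clip while keeping appearance.
    t = clip.size(1)
    if mode == "shuffle":
        # frame shuffling: permute the frames in a random order
        return clip[:, torch.randperm(t)]
    if mode == "repeat":
        # frame repeating: pick one frame and repeat it T times
        idx = int(torch.randint(t, (1,)))
        return clip[:, idx:idx + 1].expand(-1, t, -1, -1)
    raise ValueError("unknown mode: %s" % mode)
```

Both variants keep the appearance statistics of the original clip, so the network can only tell them apart by using temporal information.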
## Updates
Oct. 1, 2020 - Results using C3D and R(2+1)D are added; random seeds are now fixed more tightly.
Aug. 26, 2020 - Add pretrained weights for R3D.
## Usage of this repo
> Notification: we have added code to fix random seeds more tightly for better reproducibility. However, the results in our paper were produced with the previous random seed settings, so performance may differ slightly from that reported in the paper. To reproduce the retrieval results in our paper, please use the provided model weights.
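For reference, fixing random seeds tightly in PyTorch usually means seeding every source of randomness and disabling non-deterministic cuDNN behavior. The sketch below shows this common pattern; it is not necessarily the exact code used in this repo.

```
import random
import numpy as np
import torch

def set_seed(seed=0):
    random.seed(seed)                  # Python RNG
    np.random.seed(seed)               # NumPy RNG
    torch.manual_seed(seed)            # CPU RNG
    torch.cuda.manual_seed_all(seed)   # all GPU RNGs
    torch.backends.cudnn.deterministic = True   # deterministic conv kernels
    torch.backends.cudnn.benchmark = False      # disable non-deterministic autotuning
```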
### Data preparation
You can download the UCF101/HMDB51 datasets from the official websites: [UCF101](http://crcv.ucf.edu/data/UCF101.php) and [HMDB51](http://serre-lab.clps.brown.edu/resource/hmdb-a-large-human-motion-database/). Then decode the videos into frames.
I highly recommend the pre-computed optical flow images and resized RGB frames in this [repo](https://github.com/feichtenhofer/twostreamfusion).
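If you decode the videos to frames yourself, any standard decoder works. The sketch below uses OpenCV as one possible option; `cv2` and the file layout here are assumptions for illustration, not requirements of this repo.

```
import os
import cv2

def video_to_frames(video_path, out_dir):
    # Decode one video into numbered JPEG frames.
    os.makedirs(out_dir, exist_ok=True)
    cap = cv2.VideoCapture(video_path)
    i = 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imwrite(os.path.join(out_dir, "frame_%05d.jpg" % i), frame)
        i += 1
    cap.release()
```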
The key code for this part (residual clips) is
```
shift_x = torch.roll(x, 1, 2)  # roll the clip one frame along the temporal axis (dim 2)
x = ((shift_x - x) + 1) / 2    # frame difference in [-1, 1], rescaled to [0, 1]
```
which is slightly different from that in the papers.
We also reimplement VCP in this [repo](https://github.com/BestJuly/VCP). By simply using residual clips, significant improvements can be obtained for both video retrieval and video recognition.
Pretrained weights from the self-supervised training step: R3D (google drive).
Finetuned weights for action recognition: R3D [(google drive)](https://drive.google.com/file/d/12uzHArg5hMGLuEUz36H4fJgGaeN4QyhZ/view?usp=sharing).
> With this model, for video recognition, you should achieve
> 72.7% @top1 with `python ft_classify.py --model=r3d --modality=res --mode=test -ckpt=./path/to/model`