Commit 5dcd194

add codes for training and retrieval
1 parent 9f0151b commit 5dcd194


46 files changed: +58060 -16 lines

README.md

+43 -16
@@ -1,23 +1,29 @@
Official code for the paper "Self-supervised Video Representation Learning Using Inter-intra Contrastive Framework" [ACMMM'20].

[arXiv paper](https://arxiv.org/abs/2008.02531) [Project page](https://bestjuly.github.io/Inter-intra-video-contrastive-learning/)

## Code is being refactored and tested; the finetuning part is coming soon.

## Requirements

> This is my experimental environment.

- PyTorch 1.3.0
- python 3.7.4
- accimage

## Inter-intra contrastive framework
For samples, we have

- [ ] Inter-positives: samples with **same labels**, not used for self-supervised learning;
- [x] Inter-negatives: **different samples**, or samples with different indexes;
- [x] Intra-positives: data from the **same sample**, in different views / from different augmentations;
- [x] Intra-negatives: data from the **same sample** in which some kind of information has been broken; in the video case, temporal information has been destroyed (a sketch follows this list).
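
As a minimal sketch of how such intra-negatives can be generated (not necessarily the exact transform used in this repo), one frame of a `(C, T, H, W)` clip tensor can be repeated over time, or the frames can be shuffled; the helper name and shapes here are assumptions:

```
import torch

def intra_negative(clip: torch.Tensor, mode: str = "repeat") -> torch.Tensor:
    """Break the temporal information of a (C, T, H, W) clip.

    "repeat":  repeat one randomly chosen frame over the whole clip.
    "shuffle": randomly permute the frames along the temporal axis.
    """
    c, t, h, w = clip.shape
    if mode == "repeat":
        idx = torch.randint(0, t, (1,)).item()            # pick one frame
        return clip[:, idx:idx + 1].expand(c, t, h, w).clone()
    if mode == "shuffle":
        return clip[:, torch.randperm(t)]                 # destroy the frame order
    raise ValueError(f"unknown mode: {mode}")

# Example: a fake 16-frame RGB clip of size 112x112.
neg = intra_negative(torch.rand(3, 16, 112, 112), mode="repeat")
```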

Our work makes use of all usable parts (in this classification category) to form an inter-intra contrastive framework. The experiments here are mainly based on Contrastive Multiview Coding.

This framework can be flexibly extended to other contrastive learning methods that use negative samples, such as MoCo and SimCLR.
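
Purely as an illustration of where intra-negatives enter the objective (the training code here is based on CMC's NCE; this is not that implementation), a simplified InfoNCE-style loss could append the intra-negative feature to the in-batch inter-negatives. All names and shapes below are assumptions:

```
import torch
import torch.nn.functional as F

def inter_intra_nce(z1, z2, z_intra_neg, temperature=0.07):
    """Simplified InfoNCE with intra-negatives.

    z1, z2:      (N, D) features of View #1 and View #2 of the same clips (intra-positives).
    z_intra_neg: (N, D) features of the temporally broken clips (intra-negatives).
    Other samples in the batch act as inter-negatives.
    """
    z1 = F.normalize(z1, dim=1)
    z2 = F.normalize(z2, dim=1)
    z_intra_neg = F.normalize(z_intra_neg, dim=1)

    # (N, N) similarities: diagonal = positives, off-diagonal = inter-negatives.
    logits = z1 @ z2.t() / temperature
    # (N, 1) similarities to the intra-negatives, appended as extra negatives.
    intra = (z1 * z_intra_neg).sum(dim=1, keepdim=True) / temperature
    logits = torch.cat([logits, intra], dim=1)             # (N, N + 1)

    targets = torch.arange(z1.size(0), device=z1.device)   # positive sits at column i
    return F.cross_entropy(logits, targets)
```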

![image](https://github.com/BestJuly/Inter-intra-video-contrastive-learning/blob/master/fig/general.png)

## Highlights
### Make the most of data for contrastive learning.
@@ -39,22 +45,40 @@ I highly recommend the pre-computed optical flow images and resized RGB frames
```
python train_ssl.py --dataset=ucf101
```

This is equivalent to

```
python train_ssl.py --dataset=ucf101 --model=r3d --modality=res --neg=repeat
```

This default setting uses frame repeating to generate intra-negative samples for videos, and R3D is used as the backbone. You can use `--model` to try different models.

We use two views in our experiments. View #1 is an RGB video clip; View #2 can be an RGB / residual / optical flow video clip. Residual video clips are the default modality for View #2. You can use `--modality` to try other modalities. Intra-negative samples are generated from View #1.

It may seem weird to use only one optical flow channel, *u* or *v*. We use only one channel so that **only one model** can handle inputs from different modalities. Using different models to handle each modality is also an optional setting.
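
One common way to keep a single backbone for every modality, sketched below purely as an assumption (this may not be what the repo actually does), is to repeat a single-channel input so that RGB, residual, and single-channel flow clips all arrive as 3-channel tensors:

```
import torch

def to_three_channels(x: torch.Tensor) -> torch.Tensor:
    """Make any modality a 3-channel clip so one backbone can consume it.

    x: (C, T, H, W) with C = 3 (RGB / residual) or C = 1 (flow channel u or v).
    """
    if x.size(0) == 3:
        return x
    if x.size(0) == 1:
        return x.expand(3, -1, -1, -1).clone()   # repeat the single channel
    raise ValueError(f"unexpected channel count: {x.size(0)}")

flow_u = torch.rand(1, 16, 112, 112)
clip = to_three_channels(flow_u)                 # (3, 16, 112, 112)
```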

### Retrieve video clips
```
python retrieve_clips.py --ckpt=/path/to/your/model --dataset=ucf101 --merge=True
```
One model is used to handle different views/modalities. You can set `--modality` to decide which modality to use. When setting `--merge=True`, RGB for View #1 and the specific modality for View #2 are jointly used for retrieval.

With the default training setting, it is easy to get over 30% top-1 retrieval accuracy on UCF101 and around 13% top-1 on HMDB51, even without joint retrieval.
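
For reference, the retrieval protocol can be sketched as cosine nearest-neighbour search over extracted clip features; the merged-feature handling and all names below are assumptions, not the exact implementation of `retrieve_clips.py`:

```
import torch
import torch.nn.functional as F

def topk_retrieval(test_feats, train_feats, train_labels, test_labels, k=1):
    """Fraction of test clips whose k nearest training clips contain the correct class.

    test_feats / train_feats: (N_test, D) / (N_train, D) clip features, e.g. the
    concatenation of View #1 and View #2 features when --merge=True.
    """
    test_feats = F.normalize(test_feats, dim=1)
    train_feats = F.normalize(train_feats, dim=1)
    sims = test_feats @ train_feats.t()                  # cosine similarities
    knn = sims.topk(k, dim=1).indices                    # (N_test, k) nearest training clips
    hits = (train_labels[knn] == test_labels.unsqueeze(1)).any(dim=1)
    return hits.float().mean().item()

# Example with random features: top-1 retrieval accuracy.
acc = topk_retrieval(torch.rand(50, 256), torch.rand(200, 256),
                     torch.randint(0, 101, (200,)), torch.randint(0, 101, (50,)), k=1)
```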

### Fine-tune model for video recognition
```
python ft_classify.py --ckpt=/path/to/your/model --dataset=ucf101
```
Testing will be automatically conducted at the end of training.

Or you can use
```
python ft_classify.py --ckpt=/path/to/your/model --dataset=ucf101 --mode=test
```
so that only testing is conducted with the given model.

The accuracy with residual clips is not stable on the validation set, so the final testing uses the best model selected on the validation set.
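
As a rough sketch of what fine-tuning from a self-supervised checkpoint involves (the checkpoint layout, feature dimension, and classifier head below are assumptions, not necessarily what `ft_classify.py` does), the matching backbone weights are copied and a new linear head is trained on top:

```
import torch
import torch.nn as nn

def load_pretrained_backbone(backbone: nn.Module, ckpt_path: str) -> nn.Module:
    """Copy matching backbone weights from a self-supervised checkpoint."""
    state = torch.load(ckpt_path, map_location="cpu")
    state = state.get("model", state)                     # assumed checkpoint layout
    own = backbone.state_dict()
    # Keep only tensors whose names and shapes match the backbone.
    matched = {k: v for k, v in state.items() if k in own and v.shape == own[k].shape}
    own.update(matched)
    backbone.load_state_dict(own)
    return backbone

class Classifier(nn.Module):
    """Pretrained backbone + a new linear head, e.g. 101 classes for UCF101."""
    def __init__(self, backbone: nn.Module, feat_dim: int = 512, num_classes: int = 101):
        super().__init__()
        self.backbone = backbone
        self.fc = nn.Linear(feat_dim, num_classes)

    def forward(self, clip):
        return self.fc(self.backbone(clip))
```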

## Results
### Retrieval results
@@ -106,6 +130,9 @@ x = ((shift_x -x) + 1)/2
```
This is slightly different from the formulation in the papers.
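
A minimal sketch of that residual-clip computation, assuming frames are float tensors in [0, 1] and `shift_x` is the clip shifted by one frame along the temporal axis (the shift direction and names are assumptions):

```
import torch

def residual_clip(x: torch.Tensor, shift: int = 1) -> torch.Tensor:
    """Residual clip for a (C, T, H, W) tensor with values in [0, 1].

    shift_x is x shifted by `shift` frames; the difference is rescaled
    back to [0, 1] via ((shift_x - x) + 1) / 2, as in the snippet above.
    """
    shift_x = torch.roll(x, shifts=-shift, dims=1)   # frame t+shift aligned with frame t
    return ((shift_x - x) + 1) / 2

res = residual_clip(torch.rand(3, 16, 112, 112))     # values stay in [0, 1]
```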

We also reimplement VCP in this [repo](https://github.com/BestJuly/VCP). By simply using residual clips, significant improvements can be obtained for both video retrieval and video recognition.

## Citation
If you find our work helpful for your research, please consider citing the paper
```
@@ -135,4 +162,4 @@ If you find the residual input helpful for video-related tasks, please consider
```

## Acknowledgements
Part of this code is inspired by [CMC](https://github.com/HobbitLong/CMC) and [VCOP](https://github.com/xudejing/video-clip-order-prediction).

data/hmdb51/split/classInd.txt

+51
@@ -0,0 +1,51 @@
1 brush_hair
2 cartwheel
3 catch
4 chew
5 clap
6 climb
7 climb_stairs
8 dive
9 draw_sword
10 dribble
11 drink
12 eat
13 fall_floor
14 fencing
15 flic_flac
16 golf
17 handstand
18 hit
19 hug
20 jump
21 kick
22 kick_ball
23 kiss
24 laugh
25 pick
26 pour
27 pullup
28 punch
29 push
30 pushup
31 ride_bike
32 ride_horse
33 run
34 shake_hands
35 shoot_ball
36 shoot_bow
37 shoot_gun
38 sit
39 situp
40 smile
41 smoke
42 somersault
43 stand
44 swing_baseball
45 sword
46 sword_exercise
47 talk
48 throw
49 turn
50 walk
51 wave
