Implementation of Video Transformer Network (VTN), a simple framework for video classification, built on a Vision Transformer backbone with an additional temporal transformer on top.
Visual Transformer - using timm; can be swapped for any image classifier. The temporal transformer can be one of the following (a minimal sketch of the full pipeline follows this list):
- Longformer - the original transformer used in the paper, sample config
- Linformer - another linear-complexity transformer, added for my own research, sample config
- Transformer - a plain full transformer encoder; with the right configuration, the model can be used as an implementation of Is Space-Time Attention All You Need for Video Understanding?, sample config
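The overall flow is: a 2D backbone embeds each frame independently, a temporal transformer mixes information across frames, and a classification head produces logits. Below is a minimal sketch of that pipeline, not the repo's actual VTN class; the class name, the plain nn.TransformerEncoder standing in for Longformer/Linformer, and mean pooling instead of a CLS token are all illustrative assumptions.

```python
import timm
import torch
import torch.nn as nn

class VTNSketch(nn.Module):  # hypothetical name, not the repo's VTN class
    def __init__(self, backbone='vit_base_patch16_224', num_classes=400,
                 embed_dim=768, depth=3, num_heads=12):
        super().__init__()
        # Spatial backbone: any timm image classifier works; num_classes=0
        # makes timm return per-frame feature vectors instead of logits.
        self.backbone = timm.create_model(backbone, pretrained=False, num_classes=0)
        # Temporal encoder: a plain transformer encoder stands in here for
        # the Longformer/Linformer options listed above.
        layer = nn.TransformerEncoderLayer(d_model=embed_dim, nhead=num_heads,
                                           batch_first=True)
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, video):                       # video: (B, T, C, H, W)
        b, t = video.shape[:2]
        feats = self.backbone(video.flatten(0, 1))  # (B*T, D) per-frame features
        feats = self.temporal(feats.view(b, t, -1)) # (B, T, D) after temporal mixing
        return self.head(feats.mean(dim=1))         # (B, num_classes) logits
```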
Basic dataset loaders (see the loader sketch after this list) for:
- Kinetics-400 (can be used for any Kinetics-xxx dataset)
- Something-Something-V2
- UCF-101
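A minimal sketch of what such a loader can look like, assuming a Kinetics-style directory layout of `<root>/<class_name>/<video>.mp4`; the repo's actual loaders, sampling strategy, and augmentations may differ, and the class name here is hypothetical.

```python
import os
import torch
from torch.utils.data import Dataset
from torchvision.io import read_video
import torchvision.transforms.functional as TF

class ClipDataset(Dataset):  # hypothetical, for illustration only
    def __init__(self, root, num_frames=16, size=224):
        self.classes = sorted(os.listdir(root))
        self.samples = [(os.path.join(root, c, f), i)
                        for i, c in enumerate(self.classes)
                        for f in os.listdir(os.path.join(root, c))]
        self.num_frames, self.size = num_frames, size

    def __len__(self):
        return len(self.samples)

    def __getitem__(self, idx):
        path, label = self.samples[idx]
        video, _, _ = read_video(path, pts_unit='sec')  # (T, H, W, C) uint8
        # Sample num_frames uniformly across the whole video.
        idxs = torch.linspace(0, video.shape[0] - 1, self.num_frames).long()
        clip = video[idxs].permute(0, 3, 1, 2).float() / 255.0  # (T, C, H, W)
        clip = TF.resize(clip, [self.size, self.size], antialias=True)
        return clip, label
```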
```python
import torch
from utils import load_yaml
from model import VTN

# Build the model from a YAML config; the config keys become VTN's kwargs.
cfg = load_yaml('configs/vtn.yaml')
model = VTN(**vars(cfg))

# A dummy clip: (batch, frames, channels, height, width).
video = torch.rand(1, 16, 3, 224, 224)
preds = model(video)  # (1, 400)
```
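The output is a tensor over the 400 Kinetics classes; assuming the model returns raw, unnormalized logits, top-5 predictions can be read off like this:

```python
probs = preds.softmax(dim=-1)  # (1, 400) class probabilities
top5 = probs.topk(5, dim=-1)   # highest-scoring class indices and scores
print(top5.indices, top5.values)
```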
The parameters in the config file are self-explanatory.
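A minimal sketch of what a `load_yaml` helper like the one above might do, assuming it parses the YAML file into an attribute-style namespace so that `vars(cfg)` yields the keyword arguments for `VTN` (the repo's actual utility may differ):

```python
from types import SimpleNamespace
import yaml

def load_yaml(path):
    # Parse the YAML config and expose its top-level keys as attributes.
    with open(path) as f:
        return SimpleNamespace(**yaml.safe_load(f))
```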
Results on Kinetics-400:

| Model | Top-1 | Top-5 | Weights |
|---|---|---|---|
| Longformer-VTN | 78.9% | 93.7% | taken from |
| Transformer-VTN | 78.0% | 93.7% | taken from |
| Linformer-VTN | 75.6% | 92.6% | link |
| Linformer-VTN-MIIL-21k | 76.8% | 93.4% | link |
| Linformer-VTN-21k | 77.2% | 93.4% | |