This repository has been archived by the owner on Jan 17, 2023. It is now read-only.

elb3k/vtn


VTN - PyTorch

Implementation of Video Transformer Network, a simple framework for video classification that pairs a Vision Transformer spatial backbone with an additional temporal transformer.

Spatial Backbone:

Vision Transformer, loaded through timm; it can be swapped for any image classifier.

Temporal Backbone:

  1. Longformer - the transformer used in the original paper, sample config
  2. Linformer - another linear-complexity transformer, included for my own research, sample config
  3. Transformer - a simple full transformer encoder; with the right configuration, the model can serve as an implementation of Is Space-Time Attention All You Need for Video Understanding?, sample config
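The spatial-then-temporal design described above can be sketched in plain PyTorch. This is only an illustration, not the repository's code: `TinyFrameEncoder` and `VideoClassifierSketch` are hypothetical names, the small convolutional frame encoder stands in for the timm Vision Transformer, and the learnable classification token and position embedding are assumptions. The temporal part uses `nn.TransformerEncoder`, which corresponds to the "simple full transformer encoder" option rather than Longformer or Linformer.

```python
import torch
import torch.nn as nn


class TinyFrameEncoder(nn.Module):
    """Hypothetical stand-in for the spatial backbone: one frame -> one feature vector."""

    def __init__(self, dim: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, dim, kernel_size=16, stride=16),  # split the frame into 16x16 patches
            nn.AdaptiveAvgPool2d(1),                       # average the patches into one vector
            nn.Flatten(),                                  # (N, dim, 1, 1) -> (N, dim)
        )

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        return self.net(frames)


class VideoClassifierSketch(nn.Module):
    """Per-frame features followed by a full transformer encoder over time."""

    def __init__(self, dim=64, num_classes=400, depth=2, heads=4, max_frames=64):
        super().__init__()
        self.spatial = TinyFrameEncoder(dim)
        # Assumption: a learnable classification token summarizes the clip.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos_embed = nn.Parameter(torch.zeros(1, max_frames + 1, dim))
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=4 * dim, batch_first=True
        )
        self.temporal = nn.TransformerEncoder(layer, num_layers=depth)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, video: torch.Tensor) -> torch.Tensor:
        b, t, c, h, w = video.shape
        # Fold time into the batch so the image backbone sees ordinary images.
        feats = self.spatial(video.reshape(b * t, c, h, w)).reshape(b, t, -1)
        cls = self.cls_token.expand(b, -1, -1)
        tokens = torch.cat([cls, feats], dim=1) + self.pos_embed[:, : t + 1]
        tokens = self.temporal(tokens)
        # Classify from the classification token only.
        return self.head(tokens[:, 0])


model = VideoClassifierSketch().eval()
with torch.no_grad():
    preds = model(torch.rand(1, 16, 3, 224, 224))
print(preds.shape)  # torch.Size([1, 400])
```

Folding the frame axis into the batch axis is what lets any image classifier act as the spatial backbone: it only ever sees a batch of single images.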

Dataset implementations:

Basic dataset loaders for

  1. Kinetics-400 (can be used for any Kinetics-xxx dataset)
  2. Something-Something-V2
  3. UCF-101
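A video loader has to reduce each video to a fixed number of frames before the model can consume it. How the loaders in this repository choose frames is not stated here, so the following is only a sketch of one common strategy, uniform sampling; `uniform_frame_indices` is a hypothetical helper, not part of the repository.

```python
from typing import List


def uniform_frame_indices(num_video_frames: int, num_clip_frames: int = 16) -> List[int]:
    """Return num_clip_frames indices spread evenly over the video.

    The video is split into equal segments and the middle frame of each
    segment is taken. Videos shorter than the clip repeat frames.
    """
    if num_video_frames <= 0 or num_clip_frames <= 0:
        raise ValueError("frame counts must be positive")
    segment = num_video_frames / num_clip_frames
    return [
        min(num_video_frames - 1, int((i + 0.5) * segment))
        for i in range(num_clip_frames)
    ]


print(uniform_frame_indices(160, 16))
```

The decoded frames at these indices would then be resized and stacked into a tensor of shape (frames, channels, height, width).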

Usage

import torch
from utils import load_yaml
from model import VTN

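# Load the model hyperparameters from the sample config
cfg = load_yaml('configs/vtn.yaml')

# Build the model, passing every config field as a keyword argument
model = VTN(**vars(cfg))

# Random input clip: (batch, frames, channels, height, width)
video = torch.rand(1, 16, 3, 224, 224)

preds = model(video) # (1, 400): one score per class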

The parameters in the config file are self-explanatory.

Results

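| Model                  | Top-1 | Top-5 | Weights    |
|------------------------|-------|-------|------------|
| Longformer-VTN         | 78.9% | 93.7% | taken from |
| Transformer-VTN        | 78.0% | 93.7% | taken from |
| Linformer-VTN          | 75.6% | 92.6% | link       |
| Linformer-VTN-MIIL-21k | 76.8% | 93.4% | link       |
| Linformer-VTN-21k      | 77.2% | 93.4% |            |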
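Top-1 and Top-5 are standard top-k accuracies: a prediction counts as correct when the true class is among the k highest-scoring classes. The evaluation script used for the numbers above is not shown, so this is only a sketch of how such metrics are typically computed; `topk_accuracy` is a hypothetical helper.

```python
import torch


def topk_accuracy(logits: torch.Tensor, targets: torch.Tensor, ks=(1, 5)) -> dict:
    """Fraction of samples whose true class is within the top-k scores.

    logits:  (N, num_classes) class scores
    targets: (N,) integer class labels
    """
    max_k = max(ks)
    top = logits.topk(max_k, dim=1).indices   # (N, max_k), best class first
    hits = top.eq(targets.unsqueeze(1))       # True where a ranked class matches the label
    return {k: hits[:, :k].any(dim=1).float().mean().item() for k in ks}


logits = torch.tensor([
    [0.1, 0.9, 0.3, 0.2, 0.0, 0.5],   # ranking: 1, 5, 2, 3, 0, 4
    [5.0, 4.0, 3.0, 2.0, 1.0, 0.0],   # ranking: 0, 1, 2, 3, 4, 5
])
targets = torch.tensor([5, 0])
print(topk_accuracy(logits, targets))  # {1: 0.5, 5: 1.0}
```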