Official PyTorch implementation of FasterViT: Fast Vision Transformers with Hierarchical Attention.
Ali Hatamizadeh, Greg Heinrich, Hongxu (Danny) Yin, Andrew Tao, Jose M. Alvarez, Jan Kautz, Pavlo Molchanov.
For business inquiries, please visit our website and submit the form: NVIDIA Research Licensing
FasterViT achieves a new SOTA Pareto front in terms of accuracy vs. image throughput (without extra training data).
We introduce a new self-attention mechanism, denoted as Hierarchical Attention (HAT), that captures both short and long-range information by learning cross-window carrier tokens.
- [06.09.2023] 🔥🔥 We have released source code and ImageNet-1K FasterViT models!
- ImageNet-1K training code
- ImageNet-1K pre-trained models
- ImageNet-21K pre-trained models
- ImageNet-21K fine-tune scripts
- Any-resolution FasterViT
- Detection code (DINO) + models
- Segmentation code + models
FasterViT ImageNet-1K Pretrained Models
Name | Acc@1(%) | Acc@5(%) | Throughput(Img/Sec) | Resolution | #Params(M) | FLOPs(G) | Download |
---|---|---|---|---|---|---|---|
FasterViT-0 | 82.1 | 95.9 | 5802 | 224x224 | 31.4 | 3.3 | model |
FasterViT-1 | 83.2 | 96.5 | 4188 | 224x224 | 53.4 | 5.3 | model |
FasterViT-2 | 84.2 | 96.8 | 3161 | 224x224 | 75.9 | 8.7 | model |
FasterViT-3 | 84.9 | 97.2 | 1780 | 224x224 | 159.5 | 18.2 | model |
FasterViT-4 | 85.4 | 97.3 | 849 | 224x224 | 424.6 | 36.6 | model |
FasterViT-5 | 85.6 | 97.4 | 449 | 224x224 | 975.5 | 113.0 | model |
FasterViT-6 | 85.8 | 97.4 | 352 | 224x224 | 1360.0 | 142.0 | model |
All models use crop_pct=0.875. Results are obtained by running inference on ImageNet-1K pretrained models without finetuning.
Name | A-Acc@1(%) | A-Acc@5(%) | R-Acc@1(%) | R-Acc@5(%) | V2-Acc@1(%) | V2-Acc@5(%) |
---|---|---|---|---|---|---|
FasterViT-0 | 23.9 | 57.6 | 45.9 | 60.4 | 70.9 | 90.0 |
FasterViT-1 | 31.2 | 63.3 | 47.5 | 61.9 | 72.6 | 91.0 |
FasterViT-2 | 38.2 | 68.9 | 49.6 | 63.4 | 73.7 | 91.6 |
FasterViT-3 | 44.2 | 73.0 | 51.9 | 65.6 | 75.0 | 92.2 |
FasterViT-4 | 49.0 | 75.4 | 56.0 | 69.6 | 75.7 | 92.7 |
FasterViT-5 | 52.7 | 77.6 | 56.9 | 70.0 | 76.0 | 93.0 |
FasterViT-6 | 53.7 | 78.4 | 57.1 | 70.1 | 76.1 | 93.0 |
A, R and V2 denote ImageNet-A, ImageNet-R and ImageNet-V2 respectively.
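As a side note on the crop_pct=0.875 setting mentioned above: in timm-style evaluation pipelines, the image is typically resized so that the final crop covers that fraction of the resized side, and is then center-cropped to the target resolution. The sketch below works out the resize size for the 224x224 models. The helper name `eval_resize_size` is hypothetical, and the floor rounding is an assumption about the exact preprocessing used in this repository.

```python
import math


def eval_resize_size(crop_size: int, crop_pct: float) -> int:
    """Side length to resize to before center-cropping to `crop_size`.

    Hypothetical helper: assumes the common convention
    resize_size = floor(crop_size / crop_pct).
    """
    if not 0.0 < crop_pct <= 1.0:
        raise ValueError("crop_pct must be in (0, 1]")
    return int(math.floor(crop_size / crop_pct))


# For the 224x224 models listed above with crop_pct=0.875:
resize_side = eval_resize_size(224, 0.875)
print(f"resize to {resize_side}, then center-crop to 224")
```

Under this convention, a 224x224 evaluation crop is taken from an image resized to 256 on its shorter side.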
Please see TRAINING.md for detailed training instructions for all models.
The FasterViT models can be evaluated on the ImageNet-1K validation set using the following:
python validate.py \
--model <model-name> \
--checkpoint <checkpoint-path> \
--data_dir <imagenet-path> \
--batch-size <batch-size-per-gpu>
Here --model is the FasterViT variant (e.g. faster_vit_0_224_1k), --checkpoint is the path to the pretrained model weights, --data_dir is the path to the ImageNet-1K validation set, and --batch-size is the batch size per GPU. We also provide a sample script here.
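The Acc@1 and Acc@5 columns in the tables above are top-1 and top-5 accuracy. For readers unfamiliar with these metrics, the following is a minimal pure-Python sketch of how they are typically computed from per-class scores. It is an illustration only, not the code used by validate.py; the function name `topk_accuracy` and the toy scores are made up for this example.

```python
from typing import Sequence


def topk_accuracy(
    scores: Sequence[Sequence[float]], targets: Sequence[int], k: int
) -> float:
    """Percentage of samples whose true class is among the k highest scores."""
    if len(scores) != len(targets) or not targets:
        raise ValueError("scores and targets must be non-empty and equally long")
    hits = 0
    for row, target in zip(scores, targets):
        # Class indices sorted by descending score; keep the best k.
        topk = sorted(range(len(row)), key=lambda c: row[c], reverse=True)[:k]
        hits += target in topk
    return 100.0 * hits / len(targets)


# Toy example with 4 samples and 3 classes (illustrative values only).
scores = [
    [0.1, 0.7, 0.2],  # best class 1, target 1: top-1 hit
    [0.5, 0.3, 0.2],  # best class 0, target 1: top-1 miss, top-2 hit
    [0.2, 0.3, 0.5],  # best classes 2 and 1, target 0: miss for k <= 2
    [0.6, 0.1, 0.3],  # best class 0, target 0: top-1 hit
]
targets = [1, 1, 0, 0]

acc1 = topk_accuracy(scores, targets, k=1)
acc2 = topk_accuracy(scores, targets, k=2)
print(f"Acc@1 = {acc1:.1f}%, Acc@2 = {acc2:.1f}%")
```

With 1000 ImageNet classes the same logic applies with k=1 and k=5; in practice this is computed on batched tensors rather than Python lists.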
The dependencies can be installed by running:
pip install -r requirements.txt
Please download the ImageNet dataset from its official website. The training and validation images need to have sub-folders for each class with the following structure:
imagenet
├── train
│   ├── class1
│   │   ├── img1.jpeg
│   │   ├── img2.jpeg
│   │   └── ...
│   ├── class2
│   │   ├── img3.jpeg
│   │   └── ...
│   └── ...
└── val
    ├── class1
    │   ├── img4.jpeg
    │   ├── img5.jpeg
    │   └── ...
    ├── class2
    │   ├── img6.jpeg
    │   └── ...
    └── ...
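Before launching training or evaluation, it can help to confirm that the dataset directory follows the layout above. The following is a small sketch using only the Python standard library; the function names `summarize_split` and `check_imagenet_layout` and the set of accepted file extensions are assumptions made for this example and are not part of this repository.

```python
from pathlib import Path
from typing import Dict, Union

# Assumed set of image extensions; adjust to match your copy of the dataset.
IMAGE_SUFFIXES = {".jpeg", ".jpg", ".png"}


def summarize_split(split_dir: Path) -> Dict[str, int]:
    """Map each class sub-folder of `split_dir` to its number of image files."""
    if not split_dir.is_dir():
        raise FileNotFoundError(f"missing split directory: {split_dir}")
    counts: Dict[str, int] = {}
    for class_dir in sorted(p for p in split_dir.iterdir() if p.is_dir()):
        counts[class_dir.name] = sum(
            1
            for f in class_dir.iterdir()
            if f.is_file() and f.suffix.lower() in IMAGE_SUFFIXES
        )
    return counts


def check_imagenet_layout(root: Union[str, Path]) -> Dict[str, Dict[str, int]]:
    """Return per-class image counts for the `train` and `val` splits."""
    root = Path(root)
    return {split: summarize_split(root / split) for split in ("train", "val")}


if __name__ == "__main__":
    # "imagenet" is a placeholder path; point it at your dataset root.
    for split, counts in check_imagenet_layout("imagenet").items():
        print(f"{split}: {len(counts)} classes, {sum(counts.values())} images")
```

For a complete ImageNet-1K copy, one would expect 1000 class folders in each split; a class folder with zero images usually indicates an incomplete extraction.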
This repository is built on top of the timm repository. We thank Ross Wightman for creating and maintaining this high-quality library.
Copyright © 2023, NVIDIA Corporation. All rights reserved.
This work is made available under the NVIDIA Source Code License-NC. Click here to view a copy of this license.
For license information regarding the timm repository, please refer to its repository.
For license information regarding the ImageNet dataset, please see the ImageNet official website.