- Contains implementations of prominent ViT architectures broken down into modular components like encoder, attention mechanism, and decoder.
- Makes it easy to develop custom models by composing components of different architectures.
git clone https://github.com/SforAiDl/vformer.git
cd vformer/
python setup.py install
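To check that the install worked, a quick import from the shell is enough (a minimal sanity check, nothing VFormer-specific):

python -c "import vformer"

The following ViT variants are currently implemented -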
- Vanilla ViT
- Swin Transformer
- Pyramid Vision Transformer
- CrossViT
- Compact Vision Transformer
- Compact Convolutional Transformer
To instantiate and use a Swin Transformer model -
import torch
from vformer.models.classification import SwinTransformer

image = torch.randn(1, 3, 224, 224)  # example input: one 224x224 RGB image

model = SwinTransformer(
    img_size=224,              # input image resolution
    patch_size=4,              # size of each image patch
    in_channels=3,
    n_classes=10,
    embed_dim=96,              # embedding dimension of the first stage
    depths=[2, 2, 6, 2],       # number of blocks per stage
    num_heads=[3, 6, 12, 24],  # attention heads per stage
    window_size=7,
    drop_rate=0.2,
)

logits = model(image)  # shape: (1, 10)
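The model returns raw class logits, so it drops straight into a standard classification loss. Below is a minimal sketch of one training step; the optimizer choice, learning rate, and dummy labels are illustrative assumptions, not part of VFormer:

import torch
import torch.nn.functional as F

labels = torch.randint(0, 10, (1,))  # dummy targets for the 10 classes above
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)  # assumed settings

optimizer.zero_grad()
loss = F.cross_entropy(model(image), labels)  # logits -> scalar loss
loss.backward()
optimizer.step()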
VFormer has a modular design and allows for easy experimentation using blocks/modules of different architectures. For example, you can use just the windowed attention layer or the encoder of the Swin Transformer model, as shown below.
from vformer.attention import WindowAttention

window_attn = WindowAttention(
    dim=128,
    window_size=7,
    num_heads=2,
    # any additional keyword arguments go here
)
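As a rough usage sketch: window attention in the Swin style acts on flattened windows of shape (num_windows * batch, window_size**2, dim). The tensor below follows that convention; check the exact input contract of VFormer's WindowAttention against its docs:

import torch

x = torch.randn(8, 7 * 7, 128)  # 8 windows of 7x7 tokens, each of dim 128
out = window_attn(x)            # expected to keep the shape (8, 49, 128)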
from vformer.encoder import SwinEncoder

swin_encoder = SwinEncoder(
    dim=128,
    input_resolution=(224, 224),
    depth=2,
    num_heads=2,
    window_size=7,
    # any additional keyword arguments go here
)
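These blocks can then be composed into custom models. The sketch below wires the encoder behind a simple pooling-plus-linear head; the module, the (batch, height * width, dim) token layout, and the head are illustrative assumptions rather than VFormer API:

import torch
import torch.nn as nn

class CustomSwinClassifier(nn.Module):  # hypothetical composition example
    def __init__(self, encoder, dim, n_classes):
        super().__init__()
        self.encoder = encoder
        self.head = nn.Linear(dim, n_classes)

    def forward(self, x):
        x = self.encoder(x)  # assumed to return (batch, tokens, dim)
        x = x.mean(dim=1)    # average-pool over tokens
        return self.head(x)

custom_model = CustomSwinClassifier(swin_encoder, dim=128, n_classes=10)
tokens = torch.randn(1, 224 * 224, 128)  # token layout assumed for input_resolution=(224, 224)
logits = custom_model(tokens)            # (1, 10)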
Citations
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
@article{dosovitskiy2020vit,
  title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
  author={Dosovitskiy, Alexey and Beyer, Lucas and Kolesnikov, Alexander and Weissenborn, Dirk and Zhai, Xiaohua and Unterthiner, Thomas and Dehghani, Mostafa and Minderer, Matthias and Heigold, Georg and Gelly, Sylvain and Uszkoreit, Jakob and Houlsby, Neil},
  journal={ICLR},
  year={2021}
}
Swin Transformer: Hierarchical Vision Transformer using Shifted Windows
@article{liu2021Swin,
  title={Swin Transformer: Hierarchical Vision Transformer using Shifted Windows},
  author={Liu, Ze and Lin, Yutong and Cao, Yue and Hu, Han and Wei, Yixuan and Zhang, Zheng and Lin, Stephen and Guo, Baining},
  journal={arXiv preprint arXiv:2103.14030},
  year={2021}
}
Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions
@misc{wang2021pyramid,
  title={Pyramid Vision Transformer: A Versatile Backbone for Dense Prediction without Convolutions},
  author={Wenhai Wang and Enze Xie and Xiang Li and Deng-Ping Fan and Kaitao Song and Ding Liang and Tong Lu and Ping Luo and Ling Shao},
  year={2021},
  eprint={2102.12122},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}
CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification
@inproceedings{chen2021crossvit,
  title={{CrossViT: Cross-Attention Multi-Scale Vision Transformer for Image Classification}},
  author={Chun-Fu (Richard) Chen and Quanfu Fan and Rameswar Panda},
  booktitle={International Conference on Computer Vision (ICCV)},
  year={2021}
}
Escaping the Big Data Paradigm with Compact Transformers
@article{hassani2021escaping,
  title={Escaping the Big Data Paradigm with Compact Transformers},
  author={Ali Hassani and Steven Walton and Nikhil Shah and Abulikemu Abuduweili and Jiachen Li and Humphrey Shi},
  year={2021},
  url={https://arxiv.org/abs/2104.05704},
  eprint={2104.05704},
  archivePrefix={arXiv},
  primaryClass={cs.CV}
}