An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

gupta-abhay/pytorch-vit

Vision Transformers

A PyTorch implementation of the Vision Transformer (ViT), a model that achieves state-of-the-art results in image classification using transformer-style encoders. Associated blog article.
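As a rough illustration of the "16x16 words" idea in the paper title, the sketch below shows how an image can be cut into non-overlapping patches and how many tokens the encoder would then see. It uses plain Python lists instead of tensors so it runs without any dependencies. The names `sequence_length` and `patchify` are hypothetical and are not taken from this repository, and the extra class token is an assumption based on the paper.

```python
# Illustrative sketch only: these helpers are hypothetical and are not part
# of this repository's API.

def sequence_length(image_size: int, patch_size: int = 16, class_token: bool = True) -> int:
    """Number of tokens a ViT-style encoder sees for a square image."""
    if image_size % patch_size != 0:
        raise ValueError("image_size must be divisible by patch_size")
    patches_per_side = image_size // patch_size
    # One token per patch, plus an optional class token (assumed, as in the paper).
    return patches_per_side ** 2 + (1 if class_token else 0)


def patchify(image, patch_size):
    """Split an H x W grid (a list of rows) into flattened, non-overlapping
    patches, returned in row-major order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image dimensions must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[row][col]
                for row in range(top, top + patch_size)
                for col in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


if __name__ == "__main__":
    # A 224x224 image with 16x16 patches gives 14 * 14 patches plus a class token.
    print(sequence_length(224))
    # A tiny 4x4 "image" split into four 2x2 patches.
    tiny = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    print(patchify(tiny, 2))
```

In a real model each flattened patch would then be linearly projected to the embedding dimension before entering the encoder; that step is omitted here.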

ViT

Features

Currently supported:

  • Vanilla ViT
  • Hybrid ViT (with support for BiTResNets as backbone)
  • Hybrid ViT (with support for AxialResNets as backbone)
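To make the difference between the vanilla and hybrid variants concrete, here is a minimal sketch of how the token count changes when patches are taken from a convolutional feature map instead of raw pixels. It assumes the backbone reduces spatial resolution by a fixed total stride and that tokens are 1x1 feature-map positions, as described in the paper. The function name, the `backbone_stride` parameter, and the stride values used are hypothetical; the actual backbones in this repository may downsample differently.

```python
# Illustrative sketch only: the helper and its parameters are hypothetical
# and do not mirror this repository's API.

def hybrid_sequence_length(
    image_size: int,
    backbone_stride: int,
    patch_size: int = 1,
    class_token: bool = True,
) -> int:
    """Token count when patches are cut from a CNN feature map.

    Assumes the backbone output is image_size // backbone_stride on each
    side, which ignores padding details of real convolutional stacks.
    """
    feature_size = image_size // backbone_stride
    if feature_size == 0 or feature_size % patch_size != 0:
        raise ValueError("feature map size must be a positive multiple of patch_size")
    per_side = feature_size // patch_size
    return per_side ** 2 + (1 if class_token else 0)


if __name__ == "__main__":
    # Assumed total stride of 16: a 224x224 input yields a 14x14 feature map.
    print(hybrid_sequence_length(224, backbone_stride=16))
    # Assumed total stride of 32: the same input yields a 7x7 feature map.
    print(hybrid_sequence_length(224, backbone_stride=32))
```

Under these assumptions, a stride-16 backbone produces the same sequence length as a vanilla ViT with 16x16 pixel patches, while a deeper stride shortens the sequence and so reduces attention cost.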

To Do:

  • Training Script
  • Full Axial-ViT

References

  1. BiTResNet
  2. AxialResNet

Citations

@inproceedings{
    anonymous2021an,
    title={An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale},
    author={Anonymous},
    booktitle={Submitted to International Conference on Learning Representations},
    year={2021},
    url={https://openreview.net/forum?id=YicbFdNTTy},
    note={under review}
}