Rethinking Spatial Dimensions of Vision Transformers

Byeongho Heo, Sangdoo Yun, Dongyoon Han, Sanghyuk Chun, Junsuk Choe, Seong Joon Oh

NAVER AI LAB

News

Mar 30, 2021: Code & paper released
Apr 2, 2021: PiT models with pretrained weights were added to the timm repo. PiT models can be used directly with timm>=0.4.7.
Jul 23, 2021: Accepted to ICCV 2021 as a poster
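
As a minimal sketch of the timm usage mentioned in the Apr 2 note, the snippet below builds a PiT model through `timm.create_model` and runs one forward pass. The model name `pit_s_224`, the 224x224 input size, and the helper name `pit_logits_shape` are assumptions made for illustration; check `timm.list_models("pit*")` for the names your installed timm version actually registers.

```python
def pit_logits_shape(model_name="pit_s_224", pretrained=False):
    """Build a PiT model via timm and return the shape of its output logits.

    model_name is an assumed timm identifier; pretrained=False keeps the
    weights random so nothing is downloaded.
    """
    # Imported inside the function so this file loads without timm/torch installed.
    import timm
    import torch

    model = timm.create_model(model_name, pretrained=pretrained)
    model.eval()  # inference mode: disables dropout and similar layers

    # One random RGB image at the resolution this model is assumed to expect.
    dummy = torch.randn(1, 3, 224, 224)
    with torch.no_grad():
        logits = model(dummy)
    return tuple(logits.shape)


if __name__ == "__main__":
    print(pit_logits_shape())
```

Passing `pretrained=True` should load the released weights instead of random ones, which requires network access on first use.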

Abstract

Vision Transformer (ViT) extends the application range of transformers from language processing to computer vision tasks as an alternative architecture to existing convolutional neural networks (CNNs). Because transformer-based architectures are new to computer vision modeling, design conventions for effective architectures have not yet been studied much. From the successful design principles of CNN, we investigate