Vision Transformers (ViT) are inspired by Transformers for natural language processing. Unlike traditional convolutional networks with pyramid architectures, ViT has an isotropic architecture: the representation is not spatially downsampled as it passes through the network.
The steps are as follows; a minimal code sketch is given after the list.
- Split the image into “patches”
- Flatten each patch into a 1-D array
- Project each flattened patch with a dense layer
- Add a positional-information vector to each patch embedding
- Feed all vectors, including a learnable classification token, through a multi-head attention layer
- Connect with a dense layer
- Repeat the attention and dense layers for each Transformer block
- Apply softmax to the classification token’s output to get class probabilities
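Below is a minimal sketch of these steps in PyTorch. The class name `TinyViT` and all hyperparameters (patch size, embedding dimension, number of heads, blocks, and classes) are illustrative choices, not the values from the ViT paper, and details such as normalization placement are simplified.

```python
# A minimal sketch of the ViT forward pass described above (PyTorch).
# Hyperparameters are illustrative, not the ViT paper's settings.
import torch
import torch.nn as nn


class TinyViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, dim=64,
                 heads=4, depth=2, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = 3 * patch_size * patch_size  # flattened RGB patch

        # Steps 1-3: split into patches, flatten, project with a dense layer
        self.patchify = nn.Unfold(kernel_size=patch_size, stride=patch_size)
        self.to_embedding = nn.Linear(patch_dim, dim)

        # Step 4: learned positional information plus a classification token
        self.pos_embedding = nn.Parameter(torch.randn(1, num_patches + 1, dim))
        self.cls_token = nn.Parameter(torch.randn(1, 1, dim))

        # Steps 5-7: repeated blocks of multi-head attention + dense (MLP) layers
        self.blocks = nn.ModuleList([
            nn.ModuleDict({
                "norm1": nn.LayerNorm(dim),
                "attn": nn.MultiheadAttention(dim, heads, batch_first=True),
                "norm2": nn.LayerNorm(dim),
                "mlp": nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(),
                                     nn.Linear(4 * dim, dim)),
            })
            for _ in range(depth)
        ])

        # Step 8: classification head (softmax applied in forward below)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, images):                  # images: (B, 3, H, W)
        patches = self.patchify(images)         # (B, patch_dim, num_patches)
        patches = patches.transpose(1, 2)       # (B, num_patches, patch_dim)
        tokens = self.to_embedding(patches)     # (B, num_patches, dim)

        cls = self.cls_token.expand(tokens.size(0), -1, -1)
        tokens = torch.cat([cls, tokens], dim=1) + self.pos_embedding

        # Isotropic: the token sequence keeps the same shape in every block.
        for blk in self.blocks:
            x = blk["norm1"](tokens)
            attn_out, _ = blk["attn"](x, x, x)
            tokens = tokens + attn_out
            tokens = tokens + blk["mlp"](blk["norm2"](tokens))

        logits = self.head(tokens[:, 0])        # classification token only
        return logits.softmax(dim=-1)           # class probabilities


probs = TinyViT()(torch.randn(2, 3, 32, 32))    # shape (2, 10), rows sum to 1
```

In practice the softmax is usually folded into the loss function during training; it is applied explicitly here only to mirror the final step in the list.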
Compared to ResNet, ViT performs better when the training dataset contains more than roughly 100 million images.