137. Vision Transformers

Vision Transformers (ViT) are inspired by Transformers for natural language processing. Unlike traditional convolutional networks with pyramid architectures, ViT has an isotropic architecture: the representation keeps the same resolution and width through every layer instead of being progressively downsampled.

The steps are the following; a minimal code sketch appears after the list.

  1. Split the image into fixed-size “patches”
  2. Flatten each patch into a 1-D vector
  3. Project each vector with a dense (linear) layer
  4. Add a vector with positional information to each patch embedding
  5. Feed all vectors, including a learnable classification token, into the multi-head attention layer
  6. Connect to a dense layer
  7. Repeat steps 5–6 for each encoder layer
  8. Apply softmax to the output to get class probabilities
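A minimal sketch of these steps in PyTorch is below. All hyperparameters (image size, patch size, embedding dimension, number of layers and heads, class count) are illustrative assumptions, not values from any particular paper; for instance, a 32×32 image with 8×8 patches yields 4×4 = 16 patches.

```python
# Minimal ViT forward pass; sizes are illustrative assumptions.
import torch
import torch.nn as nn


class MiniViT(nn.Module):
    def __init__(self, image_size=32, patch_size=8, in_channels=3,
                 dim=64, depth=2, heads=4, num_classes=10):
        super().__init__()
        num_patches = (image_size // patch_size) ** 2
        patch_dim = in_channels * patch_size * patch_size
        self.patch_size = patch_size

        # Step 3: a dense (linear) layer projects each flattened patch to `dim`.
        self.patch_embed = nn.Linear(patch_dim, dim)
        # Step 4: learned positional embeddings (one per patch, plus one for the class token).
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches + 1, dim))
        # Step 5: a learnable classification token prepended to the sequence.
        self.cls_token = nn.Parameter(torch.zeros(1, 1, dim))

        # Steps 5-7: multi-head attention + dense (feed-forward) blocks, repeated `depth` times.
        layer = nn.TransformerEncoderLayer(
            d_model=dim, nhead=heads, dim_feedforward=dim * 4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=depth)

        # Step 8: classification head; softmax is applied in forward().
        self.head = nn.Linear(dim, num_classes)

    def forward(self, x):  # x: (batch, channels, height, width)
        b, c, h, w = x.shape
        p = self.patch_size
        # Steps 1-2: split the image into p x p patches and flatten each to a 1-D vector.
        x = x.unfold(2, p, p).unfold(3, p, p)           # (b, c, h/p, w/p, p, p)
        x = x.permute(0, 2, 3, 1, 4, 5).reshape(b, -1, c * p * p)
        x = self.patch_embed(x)                          # (b, num_patches, dim)
        cls = self.cls_token.expand(b, -1, -1)
        x = torch.cat([cls, x], dim=1) + self.pos_embed  # prepend class token, add positions
        x = self.encoder(x)
        # Classify from the class token's final representation.
        return self.head(x[:, 0]).softmax(dim=-1)


# Usage: class probabilities for a batch of two 32x32 RGB images.
model = MiniViT()
probs = model(torch.randn(2, 3, 32, 32))
print(probs.shape)  # torch.Size([2, 10])
```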

Compared to ResNets, ViT performs better when the training dataset is larger than roughly 100 million images.