Background
Traditional CNN-backboned architectures struggle to adapt to different inputs dynamically, because the convolutional filters' weights are fully fixed after training.
Vision Transformers removed convolution from the backbone, but because they process information at only a single scale, they are not well suited to dense (pixel-level) prediction. They also require relatively high computational and memory costs.
Pyramid Vision Transformers (PVT) enable transformer-backboned models to generate the multi-scale feature maps needed for dense prediction while reducing computational cost at the same time.
Overall Architecture
PVT consists of 4 stages, each processing a feature map at a different scale.
Each stage has a Patch-Embedding phase and a Transformer-Encoder phase.
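To make the multi-scale behaviour concrete, the short sketch below prints the per-stage feature-map size for a 224x224 input. The patch sizes and embedding dimensions are assumptions taken from a typical PVT configuration (the PVT-Tiny/Small settings), not values stated in the text above.

```python
# Per-stage feature-map sizes of a PVT-style backbone for a 224x224 input.
# patch_sizes and embed_dims are assumed PVT-Tiny/Small hyper-parameters.
H, W = 224, 224
patch_sizes = [4, 2, 2, 2]        # P_1..P_4: each stage patches its own input feature map
embed_dims = [64, 128, 320, 512]  # C_1..C_4

h, w = H, W
for i, (p, c) in enumerate(zip(patch_sizes, embed_dims), start=1):
    h, w = h // p, w // p         # each stage shrinks the spatial size by its patch size
    print(f"stage {i}: feature map {h}x{w}, {c} channels "
          f"(stride {H // h} w.r.t. the input image)")
```

Running this prints feature maps at strides 4, 8, 16 and 32, i.e. the same pyramid of resolutions a CNN backbone would feed to a dense-prediction head.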
Patch-Embedding Phase
If we denote the patch size of the i-th stage as Pi, this phase first evenly divides the Hi-1 x Wi-1 input feature map into (Hi-1 x Wi-1) / Pi^2 non-overlapping patches. Each patch is then flattened and projected to a Ci-dimensional embedding.
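A minimal PyTorch sketch of this step is given below. It assumes the split-and-project is implemented as a strided convolution with kernel size and stride equal to the patch size, which is equivalent to flattening non-overlapping patches and applying a shared linear projection; the class and parameter names are illustrative, not taken from the paper.

```python
import torch
import torch.nn as nn

class PatchEmbed(nn.Module):
    """Patch-embedding sketch: divide the feature map into Pi x Pi patches
    and project each patch to a Ci-dimensional embedding."""

    def __init__(self, patch_size=4, in_channels=3, embed_dim=64):
        super().__init__()
        # kernel = stride = patch size -> one output vector per non-overlapping patch
        self.proj = nn.Conv2d(in_channels, embed_dim,
                              kernel_size=patch_size, stride=patch_size)
        self.norm = nn.LayerNorm(embed_dim)

    def forward(self, x):
        # x: (B, C_{i-1}, H_{i-1}, W_{i-1})
        x = self.proj(x)                  # (B, C_i, H_{i-1}/P_i, W_{i-1}/P_i)
        B, C, H, W = x.shape
        x = x.flatten(2).transpose(1, 2)  # (B, num_patches, C_i) token sequence
        return self.norm(x), (H, W)

# Stage 1 for a 224x224 RGB image: 4x4 patches -> 56*56 = 3136 tokens of dim 64.
tokens, (H, W) = PatchEmbed(4, 3, 64)(torch.randn(1, 3, 224, 224))
print(tokens.shape, H, W)  # torch.Size([1, 3136, 64]) 56 56
```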
Transformer-Encoder Phase
PVT replaces the multi-head attention layer of the original Transformer encoder with a spatial-reduction attention (SRA) layer. Like multi-head attention, SRA receives a query Q, a key K, and a value V, but it reduces the spatial scale of K and V before the attention operation, which lowers the computational and memory cost.
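The sketch below shows one way to implement SRA in PyTorch, assuming the spatial reduction of K and V is done with a strided convolution of kernel/stride equal to a reduction ratio `sr_ratio` (a common implementation choice); treat it as an illustration under those assumptions rather than the paper's exact code.

```python
import torch
import torch.nn as nn

class SpatialReductionAttention(nn.Module):
    """SRA sketch: standard multi-head attention, except K and V are computed
    from a spatially reduced copy of the input tokens."""

    def __init__(self, dim, num_heads=8, sr_ratio=2):
        super().__init__()
        assert dim % num_heads == 0
        self.num_heads = num_heads
        self.head_dim = dim // num_heads
        self.scale = self.head_dim ** -0.5

        self.q = nn.Linear(dim, dim)
        self.kv = nn.Linear(dim, dim * 2)
        self.proj = nn.Linear(dim, dim)

        self.sr_ratio = sr_ratio
        if sr_ratio > 1:
            # Spatial reduction: shrinks the K/V token grid by sr_ratio in each dimension.
            self.sr = nn.Conv2d(dim, dim, kernel_size=sr_ratio, stride=sr_ratio)
            self.norm = nn.LayerNorm(dim)

    def forward(self, x, H, W):
        B, N, C = x.shape  # N = H * W tokens
        q = self.q(x).reshape(B, N, self.num_heads, self.head_dim).transpose(1, 2)

        if self.sr_ratio > 1:
            # Reshape tokens back to a 2-D map, reduce spatially, then flatten again.
            x_ = x.transpose(1, 2).reshape(B, C, H, W)
            x_ = self.sr(x_).reshape(B, C, -1).transpose(1, 2)
            x_ = self.norm(x_)
        else:
            x_ = x

        kv = self.kv(x_).reshape(B, -1, 2, self.num_heads, self.head_dim).permute(2, 0, 3, 1, 4)
        k, v = kv[0], kv[1]  # K and V now have N / sr_ratio^2 tokens

        attn = (q @ k.transpose(-2, -1)) * self.scale       # (B, heads, N, N / sr_ratio^2)
        attn = attn.softmax(dim=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, N, C)   # back to (B, N, C)
        return self.proj(out)

# Example: stage-1 tokens of a 224x224 input (56x56 grid, dim 64), with an
# assumed reduction ratio of 8, so K/V attend over only a 7x7 grid.
x = torch.randn(2, 56 * 56, 64)
y = SpatialReductionAttention(dim=64, num_heads=1, sr_ratio=8)(x, H=56, W=56)
print(y.shape)  # torch.Size([2, 3136, 64])
```

Because the attention matrix is N x (N / sr_ratio^2) instead of N x N, the cost of attention drops by roughly a factor of sr_ratio^2, which is what makes attention over high-resolution, early-stage feature maps affordable.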