PVTv2
The original PVT (v1) has three main limitations:
- When processing high-resolution images, the computational cost is still relatively high.
- It loses the local continuity of the image because it treats the image as a sequence of non-overlapping patches.
- It is inflexible for arbitrary image sizes because its positional encoding has a fixed size.
PVTv2 modifies PVT v1 to address these limitations.
Modifications
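- Linear spatial-reduction attention (Linear SRA): instead of the convolution that PVT v1 applies to reduce the keys and values before multi-head attention, the paper applies average pooling to reduce their spatial dimensions to a fixed size (7×7 in the paper), so the attention cost grows linearly with the number of input tokens.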
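- Overlapping patch embedding: patches are extracted with a zero-padded convolution whose windows overlap their neighbors (stride S, kernel size 2S − 1, padding S − 1), which preserves local continuity.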
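- Convolutional feed-forward network: remove the fixed-size positional encoding and instead rely on zero-padding positional encoding, introduced by adding a 3×3 depth-wise convolution with a padding size of 1 between the first FC layer and the GELU layer of the feed-forward network.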
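To make the first two modifications concrete, here is a minimal shape-arithmetic sketch in plain Python. It is not the authors' implementation: the function names are hypothetical, and the default values (first-stage stride of 4, spatial-reduction ratio of 8, pooling size of 7) are assumptions chosen to resemble a first-stage configuration. It only counts tokens; it does not compute attention.

```python
def conv_out(size, kernel, stride, padding):
    """Output length of a convolution along one axis (standard floor formula)."""
    return (size + 2 * padding - kernel) // stride + 1


def overlapping_patch_grid(height, width, stride):
    """Token grid of an overlapping patch embedding.

    Uses kernel 2S - 1, stride S, and zero padding S - 1, so neighboring
    windows overlap while the output grid stays at (H / S) x (W / S).
    """
    kernel, padding = 2 * stride - 1, stride - 1
    return (conv_out(height, kernel, stride, padding),
            conv_out(width, kernel, stride, padding))


def attention_pairs(height, width, stride=4, sr_ratio=8, pool_size=None):
    """Number of query-key pairs in one attention layer (illustrative only).

    pool_size=None models convolutional SRA (keys shrink by sr_ratio per side);
    otherwise keys are average-pooled to a fixed pool_size x pool_size grid.
    The defaults are assumptions, not values read from any code base.
    """
    h, w = overlapping_patch_grid(height, width, stride)
    queries = h * w
    if pool_size is None:
        keys = (h // sr_ratio) * (w // sr_ratio)
    else:
        keys = pool_size * pool_size
    return queries * keys


if __name__ == "__main__":
    for side in (224, 896):
        print(side,
              overlapping_patch_grid(side, side, 4),
              attention_pairs(side, side),
              attention_pairs(side, side, pool_size=7))
```

Under these assumptions, going from a 224×224 to an 896×896 input (16 times more pixels) multiplies the query-key pairs by 256 with convolutional spatial reduction, but only by 16 with the fixed 7×7 pooled grid, which is the sense in which the pooled variant is linear in the input size.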
Reference: PVT v2: Improved Baselines with Pyramid Vision Transformer