PVTv2
The original PVT (PVT v1) has three main limitations.
- Its computational cost is still relatively high when processing high-resolution images.
- It loses the local continuity of the image because it treats the image as a sequence of non-overlapping patches.
- It cannot flexibly handle arbitrary image sizes because its positional encoding has a fixed size.
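
The first and third limitations can be illustrated with simple arithmetic. The sketch below is only an illustration, not code from the paper: it assumes a first-stage patch size of 4 and a training resolution of 224×224, and the helper names `num_patches` and `pos_table_fits` are hypothetical.

```python
def num_patches(height: int, width: int, patch_size: int) -> int:
    """Number of tokens when the image is cut into non-overlapping square patches."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch_size) * (width // patch_size)


PATCH_SIZE = 4    # assumption: first-stage patch size
TRAIN_SIZE = 224  # assumption: training resolution

# A fixed-size positional table stores one entry per training-time token.
pos_table_rows = num_patches(TRAIN_SIZE, TRAIN_SIZE, PATCH_SIZE)


def pos_table_fits(height: int, width: int) -> bool:
    """True only if the token count matches the fixed positional table."""
    return num_patches(height, width, PATCH_SIZE) == pos_table_rows


for size in (224, 448, 896):
    tokens = num_patches(size, size, PATCH_SIZE)
    print(size, tokens, pos_table_fits(size, size))
```

Doubling the image side quadruples the token count, and any resolution whose token count differs from the training one no longer matches the fixed positional table without some form of resizing.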
 
PVTv2 modifies PVT to address the limitations above.

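Modifications
- Linear spatial-reduction attention: instead of using a convolution to shrink the keys and values before multi-head attention, the paper uses average pooling to reduce the spatial dimensions to a fixed size.
- Overlapping patch embedding: neighboring patches overlap, which preserves local continuity.
- Convolutional feed-forward network: the fixed-size positional encoding is removed. Instead, zero-padding position encoding is introduced by adding a 3×3 depth-wise convolution with a padding size of 1 between the first FC layer and the GELU layer.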
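
A minimal sketch of the arithmetic behind the first two modifications follows. It rests on several assumptions that these notes do not state: the overlapping embedding is taken to be a convolution with stride S, kernel size 2S-1 and zero padding S-1; the pooled key/value map is taken to be 7×7; the convolutional reduction ratio is taken to be 8; and attention cost is counted only as the two matrix products, ignoring the reduction step itself. All function names are hypothetical.

```python
def conv_out(n: int, kernel: int, stride: int, padding: int) -> int:
    """Standard convolution output length along one axis."""
    return (n + 2 * padding - kernel) // stride + 1


def overlapping_patch_grid(height: int, width: int, stride: int) -> tuple[int, int]:
    """Token grid of an overlapping patch embedding (assumed kernel 2S-1, padding S-1)."""
    kernel, padding = 2 * stride - 1, stride - 1
    return (conv_out(height, kernel, stride, padding),
            conv_out(width, kernel, stride, padding))


def attention_mults(num_queries: int, num_keys: int, channels: int) -> int:
    """Leading-order multiply count: Q @ K^T plus attention @ V."""
    return 2 * num_queries * num_keys * channels


def kv_tokens_conv(h: int, w: int, reduction: int) -> int:
    """Key/value tokens after a strided-convolution reduction (grows with the input)."""
    return (h // reduction) * (w // reduction)


def kv_tokens_pooled(pool_size: int = 7) -> int:
    """Key/value tokens after pooling to a fixed size (independent of the input)."""
    return pool_size * pool_size


CHANNELS = 64  # assumption: first-stage embedding width
for size in (224, 448, 896):
    h, w = overlapping_patch_grid(size, size, stride=4)
    conv_cost = attention_mults(h * w, kv_tokens_conv(h, w, reduction=8), CHANNELS)
    pool_cost = attention_mults(h * w, kv_tokens_pooled(), CHANNELS)
    print(size, (h, w), conv_cost, pool_cost)
```

Under these assumptions the overlapping embedding still yields an H/S × W/S grid, and the pooled variant's attention cost grows linearly with the number of query tokens because the key/value count stays fixed, while the convolutional variant's cost grows quadratically.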
 
Reference: PVT v2: Improved Baselines with Pyramid Vision Transformer



