186. PVTv2

August 16, 2022

The original PVT had three main limitations.

Its computational cost is still relatively high when processing high-resolution images.
It loses local continuity of the image because it treats the image as a sequence of non-overlapping patches.
It is inflexible for arbitrary image sizes because its positional encoding has a fixed size.
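
To make the first and third limitations concrete, here is a minimal sketch that counts the tokens produced by a non-overlapping patch grid. The patch size of 4 corresponds to the first stage of PVT; the function name `num_tokens` and the example image sizes are illustrative assumptions, not taken from the paper. The token count changes with the input resolution, so a positional-encoding table of fixed length only matches one resolution, and the number of query-key pairs in plain self-attention (before any spatial reduction) grows quadratically with the token count.

```python
def num_tokens(height: int, width: int, patch: int = 4) -> int:
    """Tokens produced by splitting an image into non-overlapping patch x patch tiles."""
    if height % patch or width % patch:
        raise ValueError("image size must be divisible by the patch size")
    return (height // patch) * (width // patch)


for size in (224, 448):
    n = num_tokens(size, size)
    # Plain self-attention compares every token with every other token.
    print(f"{size}x{size} image: {n} tokens, {n * n} query-key pairs")
```

Doubling the side length quadruples the number of tokens and multiplies the number of query-key pairs by 16.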

PVTv2 is a modified version of PVT that addresses the limitations above.

Modifications

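Instead of applying a strided convolution before the multi-head attention layer, the paper applies average pooling to reduce the spatial dimensions of the keys and values to a fixed size.
Apply overlapping patch embedding: the patch window is enlarged so that adjacent windows overlap, which is implemented as a zero-padded convolution with stride S, kernel size 2S−1, and padding S−1.
Remove the fixed-size positional encoding, and instead apply zero-padding position encoding by adding a 3×3 depth-wise convolution with a padding size of 1 between the first FC layer and the GELU layer of the feed-forward network.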
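
The effect of the three modifications on tensor shapes can be checked with a little arithmetic. The sketch below is shape bookkeeping only, not an implementation of the network. The helper names are hypothetical. It also assumes a stride of 4 for the first patch embedding, a reduction ratio of 8 for the strided-convolution variant, and a pooled size of 7×7 for the average-pooling variant.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def overlap_patch_grid(height: int, width: int, stride: int) -> tuple[int, int]:
    """Token grid after an overlapping patch embedding (kernel 2S-1, stride S, padding S-1)."""
    kernel, padding = 2 * stride - 1, stride - 1
    return (conv_out(height, kernel, stride, padding),
            conv_out(width, kernel, stride, padding))


def kv_tokens_conv(h: int, w: int, ratio: int) -> int:
    """Key/value tokens after a strided convolution with kernel = stride = ratio."""
    return conv_out(h, ratio, ratio, 0) * conv_out(w, ratio, ratio, 0)


def kv_tokens_pool(pool_size: int = 7) -> int:
    """Key/value tokens after average pooling to a fixed pool_size x pool_size grid."""
    return pool_size * pool_size


for size in (224, 448):
    h, w = overlap_patch_grid(size, size, stride=4)
    # A 3x3 depth-wise convolution with padding 1 keeps the grid unchanged,
    # so it can sit inside the feed-forward network without altering shapes.
    assert (conv_out(h, 3, 1, 1), conv_out(w, 3, 1, 1)) == (h, w)
    print(f"{size}x{size} image -> {h}x{w} token grid; "
          f"key/value tokens: strided conv = {kv_tokens_conv(h, w, 8)}, "
          f"average pooling = {kv_tokens_pool()}")
```

With average pooling the number of key/value tokens stays constant as the image grows, whereas with a strided convolution it grows with the input area. Because the zero-padded convolutions work for any input size, nothing in this pipeline depends on a fixed number of tokens.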

Reference: PVT v2: Improved Baselines with Pyramid Vision Transformer