186. PVTv2

The original PVT (PVTv1) had three main limitations.

  1. The computational cost is still relatively high when processing high-resolution images.
  2. It loses the local continuity of the image because it treats the image as a sequence of non-overlapping patches.
  3. It is inflexible for arbitrary image sizes because the positional encoding has a fixed size.
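
The first and third limitations can be made concrete with some token arithmetic. The sketch below is only illustrative: the helper names are hypothetical, and the patch size of 4 and spatial-reduction ratio of 8 are assumed first-stage settings, not values stated in this note.

```python
def num_tokens(image_size, patch_size):
    """Number of tokens when a square image is cut into non-overlapping patches."""
    return (image_size // patch_size) ** 2


def attention_pairs(image_size, patch_size=4, reduction=8):
    """Query-key pairs when keys/values are downsampled by `reduction` per side.

    patch_size=4 and reduction=8 are assumptions made for illustration.
    """
    queries = num_tokens(image_size, patch_size)
    keys = num_tokens(image_size, patch_size * reduction)
    return queries * keys


# Doubling the side length multiplies the number of query-key pairs by 16,
# so the cost is still quadratic in the number of pixels.
growth = attention_pairs(448) // attention_pairs(224)  # 16

# A positional table learned for 224x224 inputs has one entry per token and
# no longer matches the token count of a 448x448 input.
trained_positions = num_tokens(224, 4)  # 3136
needed_positions = num_tokens(448, 4)   # 12544
```

The mismatch between `trained_positions` and `needed_positions` is why a fixed-size encoding has to be resized or replaced when the input resolution changes.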

PVTv2 is a modified version that addresses these limitations.

Modifications

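  1. Linear spatial-reduction attention: instead of using a convolution to shrink the keys and values before multi-head attention, the paper applies average pooling to a fixed spatial size (7×7), so the attention cost grows linearly with the number of pixels.
  2. Overlapping patch embedding: the patch window is enlarged so that adjacent windows overlap, which is implemented as a zero-padded convolution whose kernel is larger than its stride.
  3. Zero-padding position encoding: the fixed-size positional encoding is removed, and a 3×3 depth-wise convolution with a padding size of 1 is added between the first FC layer and the GELU layer of the feed-forward network.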
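
The three modifications can be checked with plain shape arithmetic. This is a minimal sketch rather than the paper's implementation: the helper names are hypothetical, the overlap rule (kernel 2S-1, padding S-1 for stride S) and the reduction ratio of 8 are assumed settings, and the 1-D convolution is a toy stand-in for the 3×3 depth-wise convolution.

```python
def conv_out_size(size, kernel, stride, padding):
    """Output length of a convolution along one spatial axis."""
    return (size + 2 * padding - kernel) // stride + 1


def overlapping_patch_grid(size, stride):
    """Grid side for an overlapping patch embedding.

    Assumes kernel = 2*stride - 1 and padding = stride - 1, so neighbouring
    windows overlap while the output grid stays at size / stride.
    """
    return conv_out_size(size, 2 * stride - 1, stride, stride - 1)


def kv_tokens_sra(h, w, reduction):
    """Key/value tokens when each side is shrunk by `reduction` (conv-based SRA)."""
    return (h // reduction) * (w // reduction)


def kv_tokens_linear_sra(pool=7):
    """Key/value tokens after average pooling to a fixed pool x pool grid."""
    return pool * pool


def zero_padded_conv1d(signal, kernel):
    """1-D 'same' convolution with zero padding (toy depth-wise convolution)."""
    pad = len(kernel) // 2
    padded = [0] * pad + list(signal) + [0] * pad
    return [
        sum(k * padded[i + j] for j, k in enumerate(kernel))
        for i in range(len(signal))
    ]


# 1. Pooling to a fixed grid keeps the key/value count constant,
#    while conv-based reduction grows with the feature map.
kv_small, kv_large = kv_tokens_sra(56, 56, 8), kv_tokens_sra(112, 112, 8)  # 49, 196
kv_fixed = kv_tokens_linear_sra()                                          # 49

# 2. Overlapping windows still yield the usual grid: 224 -> 56 with stride 4.
grid = overlapping_patch_grid(224, 4)  # 56

# 3. A 3-wide kernel with padding 1 keeps the size, and on a constant input
#    the zero padding makes border outputs differ from interior ones.
border_response = zero_padded_conv1d([1, 1, 1, 1, 1], [1, 1, 1])  # [2, 3, 3, 3, 2]
```

The last result hints at why zero padding can stand in for an explicit positional encoding: even on a constant input, the border outputs differ from the interior ones, so the response depends on position, and nothing in the layer is tied to a particular input size.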

Reference: PVT v2: Improved Baselines with Pyramid Vision Transformer