352. Swin Transformer

Abstract

In existing Transformer-based models, tokens are all of a fixed scale, a property unsuitable for vision applications such as semantic segmentation that require dense prediction at the pixel level. In addition, because the computational complexity of self-attention is quadratic in image size, it quickly becomes intractable for high-resolution images.
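
Concretely, the paper quantifies this for an $h \times w$ patch map with $C$ channels and window size $M$, comparing global multi-head self-attention (MSA) with the window-based variant (W-MSA) used by Swin:

$$\Omega(\text{MSA}) = 4hwC^2 + 2(hw)^2C$$

$$\Omega(\text{W-MSA}) = 4hwC^2 + 2M^2hwC$$

The $(hw)^2$ term makes global attention quadratic in the number of patches, while W-MSA is linear in $hw$ for a fixed window size $M$ (the paper defaults to $M = 7$).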

Swin Transformer can conveniently leverage advanced techniques for dense prediction, such as feature pyramid networks (FPN) and U-Net.

Key Elements

  1. Multi-Scale Feature Maps
    Swin Transformer builds hierarchical feature maps by merging neighboring image patches in deeper layers. Computing self-attention only within each local, non-overlapping window makes the complexity linear in input image size (see the patch-merging sketch after this list).
  2. Shifted Windows
    The shifted window partitioning bridges the windows of the preceding layer, providing connections among the local windows (see the shifted-window sketch after this list).
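
The stage transition in item 1 is implemented with a patch-merging layer. Below is a minimal PyTorch sketch of the scheme the paper describes (concatenate each 2x2 group of neighboring patches, then linearly project the 4C channels down to 2C); the `PatchMerging` class and its `(B, H, W, C)` tensor layout here are illustrative, not the official code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Downsample a (B, H, W, C) feature map 2x by concatenating each
    2x2 neighborhood of patches and projecting 4C channels to 2C."""
    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):  # x: (B, H, W, C); H and W assumed even
        x0 = x[:, 0::2, 0::2, :]  # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]  # bottom-left
        x2 = x[:, 0::2, 1::2, :]  # top-right
        x3 = x[:, 1::2, 1::2, :]  # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)  # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))      # (B, H/2, W/2, 2C)
```

Stacking such layers halves the spatial resolution at each stage, which is what yields the FPN/U-Net-compatible feature hierarchy mentioned above.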
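And a minimal sketch of the shifted-window step in item 2: windows are partitioned regularly in one layer, then the feature map is cyclically shifted by half a window (the paper implements this with a cyclic shift, i.e. `torch.roll`) before partitioning in the next layer, so the new windows straddle the old window boundaries. The `window_partition` helper and the toy shapes are assumptions for illustration.

```python
import torch

def window_partition(x, window_size):
    """Split a (B, H, W, C) map into (num_windows*B, ws, ws, C) windows;
    self-attention is then computed independently inside each window."""
    B, H, W, C = x.shape
    x = x.view(B, H // window_size, window_size,
               W // window_size, window_size, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, window_size, window_size, C)

x = torch.randn(1, 8, 8, 96)  # toy (B, H, W, C) feature map
ws = 4                        # window size M

# Layer l: regular partitioning -> 4 disjoint 4x4 windows.
regular = window_partition(x, ws)

# Layer l+1: cyclically shift by M/2 before partitioning, so each new
# window spans parts of several windows from the previous layer.
shifted = torch.roll(x, shifts=(-ws // 2, -ws // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, ws)  # (4, 4, 4, 96)
# Attention runs on these windows (with masking across wrapped regions),
# and the shift is reversed with a +M/2 roll afterwards.
```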

Reference

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows