Abstract
In existing Transformer-based models, tokens are all of a fixed scale, a property unsuitable for vision applications such as semantic segmentation that require dense prediction at the pixel level. In addition, because the computational complexity of self-attention is quadratic in image size, it quickly becomes intractable for high-resolution images.
With hierarchical feature maps and linear complexity, Swin Transformer can conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) and U-Net.
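The paper quantifies this gap with two complexity estimates: global self-attention costs roughly 4hwC² + 2(hw)²C, while window-based self-attention costs 4hwC² + 2M²hwC for M×M windows. The small sketch below just plugs illustrative numbers (not necessarily the paper's exact settings) into those two formulas to show the difference.

```python
# Rough sketch: global vs. window self-attention cost, following the
# complexity estimates reported in the Swin Transformer paper.
# The resolution, channel width, and window size are illustrative values.

def msa_flops(h, w, C):
    """Global multi-head self-attention over an h x w feature map."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window self-attention over non-overlapping M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56      # e.g. a 224x224 image split into 4x4 patches
C, M = 96, 7    # channel width and window size (illustrative)

print(f"global MSA : {msa_flops(h, w, C) / 1e9:.2f} GFLOPs")   # ~2.00
print(f"window MSA : {wmsa_flops(h, w, C, M) / 1e9:.2f} GFLOPs")  # ~0.15
```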
Key Elements
- Multi-Scale Feature Maps
Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers. Its computational complexity is linear in input size because self-attention is computed only within each local window (a patch-merging sketch follows this list).
- Shifted Windows
The window partition is shifted between consecutive layers, bridging the windows of the preceding layer and providing connections among the local windows (the red boxes in the paper's figure); a shifted-window sketch also follows this list.
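To make the patch-merging step concrete, here is a minimal PyTorch sketch of one plausible implementation: each 2×2 group of neighboring patches is concatenated channel-wise and linearly projected from 4C to 2C channels, halving the spatial resolution between stages. Class and variable names are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of patch merging: concatenate each 2x2 group of neighboring
    patches and project 4C channels down to 2C, halving H and W."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```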
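And a minimal sketch of the shifted-window idea, assuming the common cyclic-shift formulation: between consecutive blocks the feature map is rolled by half a window, so the next round of window attention mixes tokens from previously separate windows. Shapes and sizes are illustrative; the real model additionally masks attention for tokens wrapped around by the shift.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

B, H, W, C, M = 1, 8, 8, 32, 4
x = torch.randn(B, H, W, C)

windows = window_partition(x, M)                     # regular window partition
shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, M)       # shifted window partition
print(windows.shape, shifted_windows.shape)          # (4, 16, 32) each
```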
Reference
Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV 2021.