Abstract
In existing Transformer-based models, tokens are all of a fixed scale, a property unsuitable for vision applications such as semantic segmentation that require dense prediction at the pixel level. In addition, because the computational complexity of self-attention is quadratic in image size, it quickly becomes intractable for high-resolution images.
With hierarchical feature maps and linear complexity, Swin Transformer can conveniently leverage advanced techniques for dense prediction such as feature pyramid networks (FPN) and U-Net.
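The paper quantifies this gap with two complexity estimates: global self-attention costs roughly 4hwC² + 2(hw)²C, while window-based self-attention costs 4hwC² + 2M²hwC for M×M windows. The small sketch below just plugs illustrative numbers (not necessarily the paper's exact settings) into those two formulas to show the difference.

```python
# Rough sketch: global vs. window self-attention cost, following the
# complexity estimates reported in the Swin Transformer paper.
# The resolution, channel width, and window size are illustrative values.

def msa_flops(h, w, C):
    """Global multi-head self-attention over an h x w feature map."""
    return 4 * h * w * C**2 + 2 * (h * w) ** 2 * C

def wmsa_flops(h, w, C, M):
    """Window self-attention over non-overlapping M x M windows."""
    return 4 * h * w * C**2 + 2 * M**2 * h * w * C

h = w = 56      # e.g. a 224x224 image split into 4x4 patches
C, M = 96, 7    # channel width and window size (illustrative)

print(f"global MSA : {msa_flops(h, w, C) / 1e9:.2f} GFLOPs")   # ~2.00
print(f"window MSA : {wmsa_flops(h, w, C, M) / 1e9:.2f} GFLOPs")  # ~0.15
```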
Key Elements
- Multi-Scale Feature Maps
Swin Transformer builds hierarchical feature maps by merging image patches in deeper layers. Its computational complexity is linear in input size because self-attention is computed only within each local window (a patch-merging sketch follows this list).
- Shifted Windows
The window partition is shifted between consecutive layers, bridging the windows of the preceding layer and providing connections among the local windows (the red boxes in the paper's figure); a shifted-window sketch also follows this list.
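To make the patch-merging step concrete, here is a minimal PyTorch sketch of one plausible implementation: each 2×2 group of neighboring patches is concatenated channel-wise and linearly projected from 4C to 2C channels, halving the spatial resolution between stages. Class and variable names are illustrative, not the paper's reference code.

```python
import torch
import torch.nn as nn

class PatchMerging(nn.Module):
    """Sketch of patch merging: concatenate each 2x2 group of neighboring
    patches and project 4C channels down to 2C, halving H and W."""

    def __init__(self, dim):
        super().__init__()
        self.norm = nn.LayerNorm(4 * dim)
        self.reduction = nn.Linear(4 * dim, 2 * dim, bias=False)

    def forward(self, x):            # x: (B, H, W, C)
        x0 = x[:, 0::2, 0::2, :]     # top-left patch of each 2x2 group
        x1 = x[:, 1::2, 0::2, :]     # bottom-left
        x2 = x[:, 0::2, 1::2, :]     # top-right
        x3 = x[:, 1::2, 1::2, :]     # bottom-right
        x = torch.cat([x0, x1, x2, x3], dim=-1)    # (B, H/2, W/2, 4C)
        return self.reduction(self.norm(x))        # (B, H/2, W/2, 2C)
```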
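And a minimal sketch of the shifted-window idea, assuming the common cyclic-shift formulation: between consecutive blocks the feature map is rolled by half a window, so the next round of window attention mixes tokens from previously separate windows. Shapes and sizes are illustrative; the real model additionally masks attention for tokens wrapped around by the shift.

```python
import torch

def window_partition(x, M):
    """Split a (B, H, W, C) feature map into non-overlapping M x M windows."""
    B, H, W, C = x.shape
    x = x.view(B, H // M, M, W // M, M, C)
    return x.permute(0, 1, 3, 2, 4, 5).reshape(-1, M * M, C)

B, H, W, C, M = 1, 8, 8, 32, 4
x = torch.randn(B, H, W, C)

windows = window_partition(x, M)                     # regular window partition
shifted = torch.roll(x, shifts=(-M // 2, -M // 2), dims=(1, 2))
shifted_windows = window_partition(shifted, M)       # shifted window partition
print(windows.shape, shifted_windows.shape)          # (4, 16, 32) each
```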
Reference
Liu et al., "Swin Transformer: Hierarchical Vision Transformer using Shifted Windows", ICCV 2021.