221. SimSiam

Abstract

Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. However, previous works typically rely on one of the following strategies to prevent all outputs from “collapsing” to a constant:

  1. Negative sample pairs
  2. Large Batches
  3. Momentum Encoders

This paper shows that none of these strategies is necessary: a simple Siamese network can learn meaningful representations, avoiding collapse with a stop-gradient operation alone.

Architecture

The architecture takes two augmented “views” of the same image. Both views pass through an encoder with shared weights; a predictor MLP is applied after the encoder on the first “view” only, transforming its output to match the other “view”. The model minimizes the negative cosine similarity between the two outputs, applying a stop-gradient to the second “view” so that it acts as a fixed target. In practice the loss is symmetrized, so each view takes a turn on the predictor branch.
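As a concrete illustration, here is a minimal PyTorch sketch of this loss, following the pseudocode in the paper; the encoder `f` (backbone plus projection MLP), the predictor `h`, and the two augmented batches `x1` and `x2` are assumed to be supplied by the caller.

```python
import torch.nn.functional as F

def D(p, z):
    # Negative cosine similarity; z.detach() is the stop-gradient,
    # so no gradient flows through the target branch.
    return -F.cosine_similarity(p, z.detach(), dim=-1).mean()

def simsiam_loss(f, h, x1, x2):
    # x1, x2: two augmented views of the same batch of images.
    z1, z2 = f(x1), f(x2)   # shared-weight encoder outputs
    p1, p2 = h(z1), h(z2)   # predictor outputs
    # Symmetrized loss: each view is predicted from the other,
    # with gradients blocked on the target side of each term.
    return 0.5 * D(p1, z2) + 0.5 * D(p2, z1)
```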

Reference: Exploring Simple Siamese Representation Learning (Chen & He, 2020)