Abstract
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. However, previous works rely on one or more of the following to keep all outputs from "collapsing" to a constant:
- Negative Sampling
- Large Batches
- Momentum Encoders

This paper shows that none of these elements is necessary: a simple Siamese network can learn meaningful representations without them, relying instead on a stop-gradient operation to prevent collapse.

Architecture
The architecture takes two augmented "views" of the same image. Both views pass through the same encoder with shared weights, and only the first view has a predictor after the encoder, which transforms that view's output to match the output of the other view. The model minimizes the negative cosine similarity between the two outputs, and applies stop-gradient only to the second view, so no gradient flows back through the target branch. In the paper the loss is symmetrized by swapping the roles of the two views and averaging.
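
A minimal sketch of this setup, assuming a PyTorch-style model where `encoder` and `predictor` are placeholder modules (these names are illustrative, not from the paper's code):

```python
# Sketch of the stop-gradient Siamese setup described above (assumes PyTorch).
import torch
import torch.nn as nn
import torch.nn.functional as F

def negative_cosine(p, z):
    # D(p, z): negative cosine similarity. z is detached, which is the
    # stop-gradient: no gradient flows back through the target branch.
    z = z.detach()
    return -F.cosine_similarity(p, z, dim=-1).mean()

class SiameseSketch(nn.Module):
    def __init__(self, encoder: nn.Module, predictor: nn.Module):
        super().__init__()
        self.f = encoder    # shared-weight encoder applied to both views
        self.h = predictor  # prediction head applied on one branch only

    def forward(self, x1, x2):
        # Two augmented views of the same image go through the same encoder.
        z1, z2 = self.f(x1), self.f(x2)
        # Predictor on one branch, stop-gradient on the other;
        # symmetrized over both directions as in the paper.
        p1, p2 = self.h(z1), self.h(z2)
        return 0.5 * negative_cosine(p1, z2) + 0.5 * negative_cosine(p2, z1)
```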



