Abstract
Siamese networks have become a common structure in various recent models for unsupervised visual representation learning. However, previous works rely on one of the following strategies to prevent all outputs from “collapsing” to a constant:
- Negative Sampling
- Large Batches
- Momentum Encoders
This paper shows that none of these components is necessary: a simple Siamese network can avoid collapse using only a stop-gradient operation.
Architecture
The architecture takes two augmented “views” of the same image. Both views pass through the same encoder (shared weights), and a prediction MLP is applied after the encoder on the first view to transform its output so that it matches the output of the second view. The model minimizes the negative cosine similarity between the first view’s prediction and the second view’s encoder output, applying stop-gradient to the second view so no gradient flows through that branch. The loss is then symmetrized by swapping the roles of the two views.
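Below is a minimal PyTorch-style sketch of one training step under these assumptions; the `encoder` and `predictor` here are small stand-in MLPs rather than the paper’s actual backbone and projection/prediction heads, and the random tensors stand in for two augmented views of a batch.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def neg_cosine(p, z):
    """Negative cosine similarity; stop-gradient is applied to z via detach()."""
    z = z.detach()                      # stop-gradient: no gradient flows through this branch
    p = F.normalize(p, dim=1)
    z = F.normalize(z, dim=1)
    return -(p * z).sum(dim=1).mean()

# Stand-in encoder f (backbone + projection) and predictor h (prediction MLP).
encoder = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 64))
predictor = nn.Sequential(nn.Linear(64, 32), nn.ReLU(), nn.Linear(32, 64))
opt = torch.optim.SGD(list(encoder.parameters()) + list(predictor.parameters()), lr=0.05)

# x1, x2 stand in for two augmented views of the same batch of images.
x1, x2 = torch.randn(8, 128), torch.randn(8, 128)

z1, z2 = encoder(x1), encoder(x2)       # shared-weight encoder applied to both views
p1, p2 = predictor(z1), predictor(z2)   # predictor transforms one view to match the other

# Symmetrized loss: each view takes a turn as the (stop-gradient) target.
loss = neg_cosine(p1, z2) / 2 + neg_cosine(p2, z1) / 2
opt.zero_grad()
loss.backward()
opt.step()
```

Note that `detach()` is the only mechanism preventing collapse here: there are no negative pairs, no large-batch requirement, and no momentum encoder in this sketch.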