Self-Supervised Learning Tricks

I have been reading some self-supervised learning papers. Several of them use interesting tricks to create self-supervised training signals. This post is dedicated to those tricks.

The first paper I read is SwAV (Swapping Assignments between multiple Views of the same image) [1]. The high-level idea is that we create K clusters with cluster centers \{c_1, \dots, c_K\}. These cluster centers are trainable parameters and are shared across all training batches. In each batch, we distort (augment) each image into two views. Each view is mapped to an embedding z, and the embeddings in the batch are assigned to the K clusters under an equal-partition constraint. As you might guess, the two views of the same original image should end up in the same cluster, so the cluster assignment (code) computed from one view is used as the prediction target for the other view.
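To make the swapped-assignment idea concrete, here is a minimal PyTorch-style sketch of the loss for one pair of views. The names (swav_loss, prototypes, temperature) and the two-view-only setup are simplifications of mine, not the paper's released code; sinkhorn refers to the code-computation step sketched later in this post.

```python
import torch
import torch.nn.functional as F

def swav_loss(z1, z2, prototypes, temperature=0.1):
    """Sketch of SwAV's swapped prediction loss for one pair of views.

    z1, z2:      L2-normalized embeddings of two augmentations, shape (B, D)
    prototypes:  trainable cluster centers C, shape (D, K)
    sinkhorn():  helper (sketched later in this post) that turns similarity
                 scores into equally partitioned soft assignments (codes).
    """
    # Similarity of each embedding to each cluster center: (B, K)
    scores1 = z1 @ prototypes
    scores2 = z2 @ prototypes

    # Codes are computed without gradients; they act as targets.
    with torch.no_grad():
        q1 = sinkhorn(scores1)
        q2 = sinkhorn(scores2)

    # Swapped prediction: predict view 2's code from view 1's scores, and vice versa.
    p1 = F.log_softmax(scores1 / temperature, dim=1)
    p2 = F.log_softmax(scores2 / temperature, dim=1)
    return -0.5 * ((q2 * p1).sum(dim=1).mean() + (q1 * p2).sum(dim=1).mean())
```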

The clustering-based method has an advantage over contrastive learning directly on image features: instead of comparing every pair of image features, each feature is only compared against the K cluster centers, which is a much smaller set of pairwise comparisons.

The interesting trick in SwAV is to partition the images in a batch equally across the K clusters. If we denote C^T Z as the similarity matrix between the batch embeddings Z and the cluster centers C, the problem becomes finding an assignment matrix Q that solves

\max_{Q \in \mathcal{Q}} \text{Tr}(Q^T C^T Z) + \varepsilon H(Q), \qquad \mathcal{Q} = \left\{ Q \in \mathbb{R}_+^{K \times B} \;\middle|\; Q \mathbf{1}_B = \tfrac{1}{K} \mathbf{1}_K, \; Q^T \mathbf{1}_K = \tfrac{1}{B} \mathbf{1}_B \right\},

where B is the batch size, H(Q) is an entropy term that smooths the assignments, and \mathbf{1}_B denotes the vector of ones of dimension B. The constraints enforce that, on average, each cluster is associated with \text{batch size} / K data points. As described in [1], the continuous code Q^* is found with the iterative Sinkhorn-Knopp algorithm.
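Here is a minimal sketch of that iteration, adapted to the shapes used in the loss sketch above; the function name sinkhorn and the defaults epsilon=0.05 and n_iters=3 are illustrative choices of mine, not values the method requires.

```python
import torch

@torch.no_grad()
def sinkhorn(scores, epsilon=0.05, n_iters=3):
    """Sketch of the Sinkhorn-Knopp iteration SwAV uses to compute codes.

    scores: (B, K) similarities C^T Z for one batch.
    Returns soft assignments of shape (B, K) whose columns are balanced,
    i.e. each cluster receives roughly B / K samples.
    """
    Q = torch.exp(scores / epsilon).t()   # (K, B), start from exp(C^T Z / eps)
    K, B = Q.shape
    Q /= Q.sum()                          # make Q a joint distribution

    for _ in range(n_iters):
        # Row step: each cluster (row) should carry total mass 1/K.
        Q /= Q.sum(dim=1, keepdim=True)
        Q /= K
        # Column step: each sample (column) should carry total mass 1/B.
        Q /= Q.sum(dim=0, keepdim=True)
        Q /= B

    return (Q * B).t()                    # each row is a distribution over clusters
```

In [1], a small \varepsilon is preferred because strong entropy regularization pushes the codes toward a uniform, uninformative solution, and only a few normalization rounds are needed in practice since the codes only have to be approximately balanced.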

SwAV has been used to train on 1 billion random Instagram images. The resulting models (called SEER) achieve state-of-the-art top-1 accuracy on ImageNet after fine-tuning [2].

References

[1] Unsupervised Learning of Visual Features by Contrasting Cluster Assignments: https://arxiv.org/pdf/2006.09882.pdf 

[2] Self-supervised Pretraining of Visual Features in the Wild: https://arxiv.org/abs/2103.01988

