Self-Supervised Learning
Clara Gonçalves
April 2024
1 Preliminaries
1.1 Data Annotation
Supervised learning requires very large amounts of annotated data. Annotating such large datasets is a
labor-intensive and time-consuming task, which significantly impacts the cost and feasibility of creating
high-quality training data.
2 Self-Supervised Learning
In self-supervised learning, supervision is derived from the data itself: the model is trained to predict
• Any part of the input from any other part.
• The future from the past.
• The past from the present.
• The top from the bottom.
3 Learning Problems
3.1 Reinforcement Learning
Reinforcement learning learns model parameters through active exploration of the environment, guided by (often sparse) rewards. Examples include Deep Q-Learning and policy-gradient methods such as Actor-Critic.
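As a point of contrast with the self-supervised objectives below, a minimal sketch of the tabular Q-learning update (my own illustration, not part of the original notes); the agent improves its value estimates purely from exploration and observed rewards:

# Minimal sketch (illustration only): tabular Q-learning with epsilon-greedy exploration.
import numpy as np

n_states, n_actions = 16, 4            # assumed small, discrete environment
Q = np.zeros((n_states, n_actions))    # action-value table
alpha, gamma, eps = 0.1, 0.99, 0.1     # learning rate, discount, exploration rate

def q_update(s, a, r, s_next, done):
    # Bellman target: reward plus discounted value of the best next action.
    target = r if done else r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def act(s, rng=np.random.default_rng()):
    # Epsilon-greedy: explore with probability eps, otherwise exploit.
    return int(rng.integers(n_actions)) if rng.random() < eps else int(Q[s].argmax())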
4 Task-Specific Models
4.1 Unsupervised Learning of Depth and Ego-Motion
Train a Convolutional Neural Network (CNN) to jointly predict depth and relative pose from three video
frames. The photometric consistency loss is computed between the warped source views $I_{t-1}$ and $I_{t+1}$ and the target view $I_t$.
The model typically uses a U-Net-style architecture (DispNet) with multi-scale prediction and loss. The final objective combines photometric consistency, smoothness, and explainability losses.
Performance is nearly on par with depth- or pose-supervised methods, though the approach can fail in the presence of dynamic objects because it assumes a static scene.
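A minimal sketch of the photometric consistency and smoothness terms, assuming the source views have already been warped into the target frame using the predicted depth and relative pose; the edge-aware weighting and the function names are my own simplifications, not the original implementation:

# Minimal sketch (simplified, assumed interface): photometric and smoothness losses.
import torch

def photometric_loss(target, warped_source, valid_mask=None):
    # target, warped_source: (B, 3, H, W); warped_source is I_{t-1} or I_{t+1}
    # resampled into the target view via the predicted depth and relative pose.
    diff = (target - warped_source).abs()
    if valid_mask is not None:
        diff = diff * valid_mask       # e.g. explainability or valid-projection mask
    return diff.mean()

def smoothness_loss(disp, image):
    # Penalize disparity gradients, down-weighted at image edges.
    dx_d = (disp[..., :, 1:] - disp[..., :, :-1]).abs()
    dy_d = (disp[..., 1:, :] - disp[..., :-1, :]).abs()
    dx_i = (image[..., :, 1:] - image[..., :, :-1]).abs().mean(1, keepdim=True)
    dy_i = (image[..., 1:, :] - image[..., :-1, :]).abs().mean(1, keepdim=True)
    return (dx_d * torch.exp(-dx_i)).mean() + (dy_d * torch.exp(-dy_i)).mean()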
4.2 Unsupervised Monocular Depth Estimation from Stereo
With naive sampling, monocular depth from stereo supervision produces a disparity map aligned with the target image instead of the input image. The No-LR variant corrects for this but suffers from artifacts.
The proposed approach uses the left image to produce disparities for both images, improving quality by
enforcing mutual consistency. Losses include photometric consistency, disparity smoothness, and left-right
consistency. A comparison with and without the left-right consistency loss is provided.
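A minimal sketch of the left-right consistency term, assuming disparities are expressed as a fraction of image width and warping is done horizontally with grid_sample; helper names and sign conventions are my own assumptions, not the authors' code:

# Minimal sketch (assumed conventions): left-right disparity consistency.
import torch
import torch.nn.functional as F

def warp_horizontally(img, disp):
    # Resample img (B, C, H, W) along x by the signed disparity disp (B, 1, H, W),
    # where disp is a fraction of image width.
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(
        torch.linspace(-1, 1, h, device=img.device),
        torch.linspace(-1, 1, w, device=img.device),
        indexing="ij",
    )
    grid = torch.stack((xs, ys), dim=-1).expand(b, h, w, 2).clone()
    grid[..., 0] = grid[..., 0] + 2 * disp.squeeze(1)  # shift the x sampling coordinates
    return F.grid_sample(img, grid, align_corners=True)

def lr_consistency_loss(disp_left, disp_right):
    # Warp the right disparity map into the left view and compare with the left map.
    disp_right_in_left = warp_horizontally(disp_right, -disp_left)
    return (disp_left - disp_right_in_left).abs().mean()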
5 Pretext Tasks
5.1 Visual Representation Learning by Context Prediction
The goal of context prediction is to predict the relative position of image patches (a discrete set of eight neighboring positions). This requires the model to learn to recognize objects and their parts. Care must be taken to avoid trivial shortcuts, such as edge continuity between adjacent patches.
Chromatic aberration: color channels shift with respect to the position in the image, which the network can exploit as a shortcut. The solution is to randomly drop color channels or project towards grayscale.
Nearest-neighbor retrieval results for a query image, computed on fc6 features, give a qualitative view of the learned representation.
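A minimal sketch of how a training pair for context prediction could be generated, assuming the image is large enough; patch size, gap, and jitter values are illustrative assumptions rather than the exact settings:

# Minimal sketch (assumed settings): sample a centre patch and one of its 8 neighbours;
# the pretext label is the neighbour's index. The gap and random jitter between patches
# help prevent the edge-continuity shortcut.
import random
from PIL import Image

PATCH, GAP, JITTER = 96, 48, 7
OFFSETS = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]

def sample_pair(img: Image.Image):
    stride = PATCH + GAP
    cx = random.randint(stride, img.width - stride - PATCH)   # centre patch location
    cy = random.randint(stride, img.height - stride - PATCH)
    label = random.randrange(8)                               # which neighbour to cut out
    dy, dx = OFFSETS[label]
    jx, jy = random.randint(-JITTER, JITTER), random.randint(-JITTER, JITTER)
    centre = img.crop((cx, cy, cx + PATCH, cy + PATCH))
    nx, ny = cx + dx * stride + jx, cy + dy * stride + jy
    neighbour = img.crop((nx, ny, nx + PATCH, ny + PATCH))
    return centre, neighbour, label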
5.3 Feature Learning by Inpainting
The inpainting task tries to recover a region that has been removed from its context. This is similar to a denoising autoencoder (DAE), but it requires more semantic knowledge because the missing region is large.
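A minimal sketch of the masked reconstruction objective, assuming a central square mask and an encoder-decoder model; the adversarial loss used in the original context-encoder formulation is omitted here:

# Minimal sketch (simplified): reconstruct a removed region, loss on masked pixels only.
import torch

def masked_reconstruction_loss(model, images, mask_size=64):
    # images: (B, 3, H, W); remove a central square region from the context.
    b, c, h, w = images.shape
    y0, x0 = (h - mask_size) // 2, (w - mask_size) // 2
    mask = torch.zeros(1, 1, h, w, device=images.device)
    mask[..., y0:y0 + mask_size, x0:x0 + mask_size] = 1.0
    corrupted = images * (1 - mask)            # context with the region removed
    reconstruction = model(corrupted)          # encoder-decoder predicts the full image
    return (((reconstruction - images) ** 2) * mask).sum() / (mask.sum() * b * c)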
6 Contrastive Learning
6.1 Hope of Generalization
We hope that the pretraining task and the transfer task are aligned. Pretext-task feature performance saturates because the last layers become specific to the pretext task.
6.2 Desiderata
Can we find a more general pretext task? Pre-trained features should represent how images relate to each other and should be invariant to nuisance factors (location, lighting, color). Multiple augmented versions of one reference image, called views, are generated with augmentations such as crop, resize, flip, rotation, cutout, color drop, color jitter, Gaussian noise, Gaussian blur, and Sobel filtering.
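A minimal sketch of generating two views from one reference image with torchvision transforms; the exact augmentation strengths are illustrative assumptions rather than a tuned recipe:

# Minimal sketch (assumed parameters): two augmented views of the same image form a positive pair.
import torchvision.transforms as T

view_transform = T.Compose([
    T.RandomResizedCrop(224),
    T.RandomHorizontalFlip(),
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),   # color jitter
    T.RandomGrayscale(p=0.2),                                    # color drop
    T.GaussianBlur(kernel_size=23),
    T.ToTensor(),
])

def two_views(image):
    # Both views come from the same reference image.
    return view_transform(image), view_transform(image)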
7 A Simple Framework for Contrastive Learning
Cosine similarity is used as the score function. SimCLR uses a projection network g(·) to map the representation h to a space z where the contrastive loss is applied. The projection improves learning: information relevant to downstream tasks is preserved in h but partly discarded in z, so h is used for transfer.
A positive pair is generated by sampling two data augmentation functions and applying them to the same image. Each of the 2N augmented samples in the batch is used in turn as the reference, and the loss is averaged over all of them. The InfoNCE loss uses all non-positive samples in the batch as negatives $x^-$.
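A minimal sketch of this loss, assuming the two views of image i occupy rows 2i and 2i+1 of the projection matrix; this is a simplified NT-Xent implementation, not the official SimCLR code:

# Minimal sketch (assumed layout of positives): NT-Xent / InfoNCE over a batch of 2N views.
import torch
import torch.nn.functional as F

def nt_xent_loss(z, temperature=0.5):
    # z: (2N, D) projections; normalise so dot products become cosine similarities.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature                   # (2N, 2N) similarity matrix
    sim.fill_diagonal_(float("-inf"))               # exclude self-similarity
    # Each sample's positive is its partner view: rows (2i, 2i+1) are pairs.
    pos = torch.arange(z.shape[0], device=z.device) ^ 1
    # Softmax over all other samples in the batch (the negatives) plus the positive.
    return F.cross_entropy(sim, pos)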
Linear evaluation: train the feature encoder on the entire ImageNet training set using SimCLR, freeze it, and train a linear classifier on top with labeled data. Semi-supervised evaluation: train the encoder in the same way, then fine-tune it with 1% or 10% of the ImageNet labels.
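A minimal sketch of one step of the linear-evaluation protocol, with the encoder frozen and only a linear classifier trained; the surrounding data loading and optimizer setup are assumed:

# Minimal sketch (assumed setup): linear probe on top of a frozen encoder.
import torch
import torch.nn as nn

def linear_probe_step(encoder, classifier, optimizer, images, labels):
    encoder.eval()                        # frozen feature encoder f(·)
    with torch.no_grad():
        h = encoder(images)               # representation h (before the projection head)
    logits = classifier(h)                # e.g. nn.Linear(feature_dim, num_classes)
    loss = nn.functional.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()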
A large training batch size is crucial for SimCLR. Large batches cause a large memory footprint during backpropagation, so the ImageNet experiments require distributed training on TPUs.
8 Conclusion
Creating labeled data is time-consuming and expensive. Self-supervised methods overcome this by learning
from data alone. Task-specific models typically minimize photometric consistency measures. Pretext tasks
have been introduced to learn more generic representations. These generic representations can then be fine-
tuned for target tasks. However, classical pretext tasks (e.g., rotation) are often not well aligned with the
target task. Contrastive learning and redundancy reduction are better aligned and produce state-of-the-art
results, closing the gap to fully supervised ImageNet pretraining.