
Self-supervised Learning

Clara Gonçalves
April 2024

1 Preliminaries
1.1 Data Annotation
Supervised learning requires very large amounts of annotated data. Annotating such large datasets is a
labor-intensive and time-consuming task, which significantly impacts the cost and feasibility of creating
high-quality training data.

1.2 Image Classification


Types of human error in image classification include:
• Fine-grained recognition: there are more than 120 breeds of dogs in the dataset; an estimated 28 (37%) of the human errors fall into this category.
• Class unawareness: the annotator may be unaware that the ground-truth class is present as a label option; this accounts for 18 (24%) of the human errors.
• Insufficient training data: the annotator is only shown 13 examples of a class under each category name, which is insufficient for generalization; approximately 4 (5%) of the human errors fall into this category.
Moreover, image classification is expensive. For example, the creation of ImageNet, which contains 14
million images, required an estimated 22 human years of annotation effort.

1.3 Dense Semantic and Instance Annotation


The Cityscapes dataset is an example of dense semantic and instance annotation. The annotation process
is very time-consuming, requiring significant human effort.

1.4 Stereo, Monodepth, and Optical Flow


The KITTI dataset includes stereo, monocular depth, optical flow, and scene flow annotations. Depth
correspondences are particularly challenging for humans to annotate accurately, so LiDAR and manual
object segmentation/tracking are often used.

1.5 Human Labeling is Sparse


Humans can learn very effectively with just a few labeled examples. They learn through interaction and
observation, which allows for rapid generalization even with sparse labeling.

2 In the Context of Computer Vision


2.1 Self-Supervision
Self-supervision involves using the observed data to infer hidden properties of the data. The idea is to obtain
labels from raw, unlabeled data itself by predicting parts of the data from other parts. This includes tasks
such as predicting:

• Any part of the input from any other part.
• The future from the past.
• The past from the present.
• The top from the bottom.
• The occluded from the visible.
• Any part of the input you pretend not to know, from the rest.

2.2 Example: Denoising Autoencoder


A Denoising Autoencoder (DAE) is trained to reconstruct the clean input from a corrupted version of it. After
training, only the encoder is kept and the decoder is discarded.
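A minimal sketch of the idea, assuming PyTorch and hypothetical layer sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy denoising autoencoder: reconstruct the clean input from a corrupted version.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)                  # clean inputs
x_noisy = x + 0.3 * torch.randn_like(x)  # corrupted version
recon = decoder(encoder(x_noisy))        # predict the clean input
loss = F.mse_loss(recon, x)
loss.backward()
# After training, only `encoder` is kept as the learned representation.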

3 Learning Problems
3.1 Reinforcement Learning
Reinforcement learning involves learning model parameters through active exploration and sparse rewards.
Examples include Deep Q-Learning and Policy Gradient methods like Actor-Critic.

3.2 Unsupervised Learning


Unsupervised learning involves learning model parameters using a dataset without labels. Examples include
clustering, dimensionality reduction, and generative models.

3.3 Supervised Learning


Supervised learning involves learning model parameters using a dataset of data-label pairs. Examples include
classification, regression, and structured prediction.

3.4 Self-Supervised Learning


Self-supervised learning involves learning model parameters using a dataset of data-data pairs. Examples
include self-supervised stereo/flow and contrastive learning.

4 Task-Specific Models
4.1 Unsupervised Learning of Depth and Ego-Motion
Train a Convolutional Neural Network (CNN) to jointly predict depth and relative pose from three video
frames. The photometric consistency loss is computed between the warped source views I_{t-1}/I_{t+1} and the
target view I_t.
The model typically uses a traditional U-Net (DispNet) with multi-scale prediction/loss. The final
objective includes photometric consistency, smoothness, and explainability losses.
Performance is nearly on par with depth or pose supervised methods, though it can fail in the presence
of dynamic objects due to the assumption of a static scene.
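A hedged sketch of the photometric consistency term, assuming the source view has already been warped into the target frame using the predicted depth and pose (tensor shapes are hypothetical):

import torch

def photometric_loss(warped_src, target, mask=None):
    # L1 photometric consistency between a source view (I_{t-1} or I_{t+1}) warped
    # into the target frame and the target view I_t.
    diff = (warped_src - target).abs().mean(dim=1, keepdim=True)  # per-pixel error
    if mask is not None:                                          # e.g. explainability mask
        diff = diff * mask
    return diff.mean()

warped = torch.rand(4, 3, 128, 416)  # hypothetical warped source view
target = torch.rand(4, 3, 128, 416)  # target view I_t
print(photometric_loss(warped, target))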

4.2 Unsupervised Monocular Depth Estimation from Stereo
Monodepth from stereo supervision can produce a disparity map aligned with the target instead of the input
due to naive sampling. The No-LR method corrects for this but suffers from artifacts.
The proposed approach uses the left image to produce disparities for both images, improving quality by
enforcing mutual consistency. Losses include photometric consistency, disparity smoothness, and left-right
consistency. Results are compared with and without the left-right consistency loss.
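A simplified sketch of the left-right consistency term, assuming PyTorch and disparities expressed in normalized image coordinates; the warp and sign conventions here are illustrative, not the paper's exact implementation:

import torch
import torch.nn.functional as F

def warp_horizontal(img, disp):
    # Sample `img` at x - disp (disparity in normalized [-1, 1] units); a simplified warp.
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).repeat(b, 1, 1, 1)
    grid[..., 0] = grid[..., 0] - disp.squeeze(1)
    return F.grid_sample(img, grid, align_corners=True)

disp_l = torch.rand(2, 1, 64, 128) * 0.1  # disparities for the left view (hypothetical)
disp_r = torch.rand(2, 1, 64, 128) * 0.1  # disparities for the right view, also predicted from the left image
# Left-right consistency: the right disparity, warped into the left view, should match disp_l.
disp_r_in_l = warp_horizontal(disp_r, disp_l)
lr_loss = (disp_l - disp_r_in_l).abs().mean()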

4.3 Digging Into Self-Supervised Monocular Depth Estimation


Monodepth2 introduces a per-pixel minimum photometric consistency loss over source views to handle occlusions.
Computing all losses at the input resolution reduces texture-copy artifacts. Auto-masking is used to ignore
pixels whose appearance does not change between frames, e.g., when the camera is static or objects move at
the same speed as the camera.
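A sketch of the per-pixel minimum reprojection loss with auto-masking, assuming the per-source photometric errors have already been computed (tensor shapes are hypothetical):

import torch

def min_reprojection_loss(reproj_errors, identity_errors):
    # reproj_errors / identity_errors: (B, num_sources, H, W) photometric errors of the
    # warped and un-warped source views against the target view.
    min_reproj, _ = reproj_errors.min(dim=1)      # per-pixel minimum handles occlusions
    min_identity, _ = identity_errors.min(dim=1)
    # Auto-masking: keep only pixels where warping actually helps, i.e. ignore pixels
    # that do not change between frames (static camera, objects moving with the camera).
    mask = (min_reproj < min_identity).float()
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1)

reproj = torch.rand(2, 2, 64, 128)
identity = torch.rand(2, 2, 64, 128)
print(min_reprojection_loss(reproj, identity))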

4.4 Unsupervised Learning of Optical Flow


Bidirectional training involves sharing weights between both directions to train a universal network for
optical flow that predicts accurate flow in either direction (forward and backward). The data loss compares
flow-warped images to the original images. The consistency loss ensures forward-backward consistency, and
both losses are masked based on the estimated occlusion.
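A sketch of the masked forward-backward consistency term, assuming the backward flow has already been warped into the first frame and an occlusion mask has been estimated (shapes are hypothetical):

import torch

def fb_consistency_loss(flow_fw, flow_bw_warped, occlusion_mask):
    # At non-occluded pixels, the backward flow warped into the first frame should be
    # the negative of the forward flow.
    diff = (flow_fw + flow_bw_warped).abs().sum(dim=1, keepdim=True)
    valid = 1.0 - occlusion_mask                   # mask out estimated occlusions
    return (diff * valid).sum() / valid.sum().clamp(min=1)

flow_fw = torch.rand(2, 2, 64, 128)
flow_bw_warped = -flow_fw + 0.01 * torch.randn(2, 2, 64, 128)
occ = torch.zeros(2, 1, 64, 128)
print(fb_consistency_loss(flow_fw, flow_bw_warped, occ))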

4.5 Self-Supervised Scene Flow Estimation


Scene flow can be estimated in a self-supervised manner by jointly learning monocular depth and optical flow
from stereo videos, trained with smoothness and photometric consistency losses.

5 Pretext Tasks
5.1 Visual Representation Learning by Context Prediction
The goal of context prediction is to predict the relative position of patches (discrete set). This requires the
model to learn to recognize objects and their parts. Care must be taken to avoid trivial shortcuts, such as
edge continuity.
Chromatic aberration: color channels shift relative to one another depending on the position in the image,
which can reveal a patch's location. The solution is to randomly drop color channels or to project towards grayscale.
The learned representation can be evaluated qualitatively by nearest-neighbor retrieval for a query image using fc6 features.
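A rough sketch of how training pairs for context prediction could be sampled (NumPy; patch and gap sizes are hypothetical, and the image is assumed large enough):

import random
import numpy as np

def sample_patch_pair(img, patch=64, gap=32):
    # img: H x W x 3 array. Sample a center patch and one of its 8 neighbors;
    # the label is the neighbor index (8-way classification). The gap between
    # patches (plus jitter in the original method) prevents edge-continuity shortcuts.
    h, w = img.shape[:2]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    stride = patch + gap
    y = random.randint(stride, h - 2 * stride)
    x = random.randint(stride, w - 2 * stride)
    label = random.randrange(8)
    dy, dx = offsets[label]
    center = img[y:y + patch, x:x + patch]
    neighbor = img[y + dy * stride:y + dy * stride + patch,
                   x + dx * stride:x + dx * stride + patch]
    return center, neighbor, label

img = np.random.rand(480, 640, 3)
center, neighbor, label = sample_patch_pair(img)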

5.2 Visual Representation Learning by Solving Jigsaw Puzzles


Input: nine patches permuted using one of N permutations. Output: N-way classification with N ≪ 9!. The
Jigsaw puzzle task predicts one out of 1000 possible random permutations. Permutations are chosen based
on Hamming distance to increase difficulty.
A Siamese architecture is used, with concatenation of features and an MLP for 1000-way classification.
Shortcuts, such as adjacent patches including similar low-level statistics and edge continuity, are prevented
by normalizing patch mean and variance and selecting 64x64 pixel tiles randomly from 85x85 pixel cells.
Chromatic aberration is handled by using grayscale images or spatially jittering each color channel by a
few pixels.
For transfer learning, pre-trained weights are used as initialization for the convolutional layers of AlexNet,
while the remaining layers are trained from scratch (random initialization). Fine-tuning features for PASCAL
VOC classification, detection, and segmentation shows performance approaching fully supervised pre-training
on ImageNet.
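A small-scale sketch of how a well-separated permutation set could be chosen by maximizing Hamming distance (a greedy approximation; the exact numbers are illustrative):

import random
import numpy as np

def select_permutations(n_perms=100, n_tiles=9, n_candidates=1000):
    # Greedily pick permutations that are far from each other in Hamming distance,
    # so the N-way classification stays difficult.
    pool = [np.random.permutation(n_tiles) for _ in range(n_candidates)]
    chosen = [pool.pop(random.randrange(len(pool)))]
    while len(chosen) < n_perms:
        # distance of each candidate to its nearest already-chosen permutation
        dists = [min(int((c != p).sum()) for c in chosen) for p in pool]
        chosen.append(pool.pop(int(np.argmax(dists))))
    return np.stack(chosen)

perms = select_permutations()
print(perms.shape)  # (100, 9)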

5.3 Feature Learning by Inpainting
The inpainting task involves trying to recover a region that has been removed from the context. This is
similar to a DAE but requires more semantic knowledge due to the large missing region.

5.4 Visual Representation Learning by Predicting Image Rotations


The rotation task involves trying to recover the true orientation (4-way classification). To accurately recover
the correct orientation, semantic knowledge is required.
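A minimal sketch of how rotated inputs and their 4-way labels could be generated (PyTorch assumed):

import torch

def rotation_batch(images):
    # images: (B, C, H, W). Produce all four rotations (0/90/180/270 degrees)
    # together with the corresponding 4-way classification labels.
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

imgs = torch.rand(8, 3, 32, 32)
rotated, labels = rotation_batch(imgs)  # (32, 3, 32, 32), (32,)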
Pretext Task: Pretext tasks focus on "visual common sense," such as rearrangement, predicting rotations,
inpainting, and colorization. Models are forced to learn good features of natural images, such as a semantic
representation of an object category, in order to solve the pretext task.
The goal is not pretext task performance, but rather the utility of the learned features for downstream
tasks (classification, detection, segmentation).
Problems: Designing good pretext tasks is tedious and somewhat an art. The learned representations
may not be general.

6 Contrastive Learning
6.1 Hope of Generalization
We hope that the pretraining task and the transfer task are aligned. In practice, the transfer performance of
pretext features saturates, since the last layers become specific to the pretext task.

6.2 Desiderata
Can we find a more general pretext task? Pre-trained features should represent how images relate to each
other and should be invariant to nuisance factors (location, lighting, color). Augmented versions generated
from one reference image are called views.

6.3 Contrastive Learning


Given a chosen score function s(·, ·), we want to learn an encoder f that yields a high score for positive pairs
(x, x+) and a low score for negative pairs (x, x−). Formally,

s(f(x), f(x+)) ≫ s(f(x), f(x−)).


Assuming we have one reference x, one positive x+, and N − 1 negatives x−_j, consider the following
multi-class cross-entropy loss function:

L = −log [ exp(s(f(x), f(x+))) / ( exp(s(f(x), f(x+))) + Σ_{j=1}^{N−1} exp(s(f(x), f(x−_j))) ) ].

This is commonly known as the InfoNCE loss. Minimizing it maximizes a lower bound on the mutual information
between f(x) and f(x+): I(f(x); f(x+)) ≥ log(N) − L.
The key idea is that maximizing mutual information between features extracted from multiple views
forces the features to capture information about higher-level factors. Importantly, the larger the negative
sample size (N), the tighter the bound.
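A sketch of the InfoNCE loss for a single reference, with cosine similarity as the score function (PyTorch assumed; the temperature value is illustrative):

import torch
import torch.nn.functional as F

def info_nce(z_ref, z_pos, z_negs, temperature=0.1):
    # z_ref, z_pos: (D,); z_negs: (N-1, D). Cosine similarity via L2-normalized features.
    z_ref = F.normalize(z_ref, dim=0)
    z_pos = F.normalize(z_pos, dim=0)
    z_negs = F.normalize(z_negs, dim=1)
    pos = (z_ref * z_pos).sum() / temperature      # s(f(x), f(x+))
    negs = z_negs @ z_ref / temperature            # s(f(x), f(x-_j)), j = 1..N-1
    logits = torch.cat([pos.unsqueeze(0), negs])   # the positive is class 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

z_ref, z_pos, z_negs = torch.rand(64), torch.rand(64), torch.rand(127, 64)
print(info_nce(z_ref, z_pos, z_negs))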

6.4 Design Choices


1. Score function: Cosine similarity is commonly used.
2. Examples: how positive and negative pairs are chosen, e.g., augmented views of the same image as
positives and other images in the batch as negatives.
3. Augmentations: crop, resize, flip, rotation, cutout, color drop, color jitter, Gaussian noise/blur, and
Sobel filtering.

7 A Simple Framework for Contrastive Learning
SimCLR uses cosine similarity as the score function and a projection network g(·) that maps the encoder
features h to a space z where the contrastive loss is applied. The projection improves learning: information
relevant to downstream tasks is preserved in h even if it is discarded in z.
A positive pair is generated by sampling two data augmentation functions for each image. Each of the 2N
augmented samples is used in turn as the reference, and the average loss is computed. The InfoNCE loss uses
all non-positive samples in the batch as negatives x−.
Two evaluation protocols are used. Linear evaluation: train the feature encoder on the entire ImageNet
training set using SimCLR, freeze it, and train a linear classifier on top with labeled data. Semi-supervised
evaluation: train the encoder in the same way and fine-tune it with 1% / 10% of the ImageNet labels.
A large training batch size is crucial for SimCLR. Large batches cause a large memory footprint during
backpropagation, requiring distributed training on TPUs for the ImageNet experiments.
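A compact sketch of the batch-wise loss over 2N augmented samples, with a stand-in encoder f and projection head g (all sizes are hypothetical, and this is a simplification rather than the actual SimCLR implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())  # stand-in encoder
g = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))    # projection head

def nt_xent(z, temperature=0.5):
    # z: (2N, D) projections of N images under two augmentations, ordered so that
    # samples i and i+N form a positive pair; all other samples act as negatives.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a sample is never compared against itself
    n = z.size(0) // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

x1, x2 = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)  # two augmented views
h = f(torch.cat([x1, x2]))   # representations h used for downstream tasks
loss = nt_xent(g(h))         # contrastive loss computed on projections z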

7.1 Momentum Contrast


The difference with SimCLR is that Momentum Contrast (MoCo) keeps a running queue (dictionary) of keys
for negative samples (i.e., a ring buffer of minibatches). Gradients are backpropagated only to the query
encoder, not the queue.
Thus, the dictionary can be much larger than the minibatch, but it may become inconsistent as the
encoder changes during training. To improve the consistency of keys in the queue, a momentum update rule
is used.
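A sketch of the momentum update and the key queue, using stand-in encoders (sizes are hypothetical; 0.999 is a commonly used momentum value):

import copy
import torch
import torch.nn as nn

query_encoder = nn.Linear(128, 64)             # stand-in query encoder
key_encoder = copy.deepcopy(query_encoder)     # key encoder starts as a copy
queue = torch.randn(4096, 64)                  # ring buffer of keys (hypothetical size)

def momentum_update(m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; only the query encoder gets gradients.
    with torch.no_grad():
        for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def enqueue(keys):
    # Push the newest minibatch of keys and drop the oldest ones (fixed-size queue).
    global queue
    queue = torch.cat([keys.detach(), queue])[: queue.size(0)]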

7.2 Improved Momentum Contrast


Key insights include the importance of a non-linear projection head and strong data augmentation. Using
both, MoCo v2 outperforms SimCLR with much smaller minibatch sizes.

7.3 Barlow Twins


Inspired by information theory, Barlow Twins tries to reduce redundancy between neurons: each neuron should
be invariant to data augmentation but independent of the other neurons. The idea is to compute the
cross-correlation matrix between the embeddings of two augmented views and encourage it to be the identity
matrix: the diagonal elements make each neuron correlated across augmentations (invariance), while pushing
the off-diagonal elements to zero removes redundancy between neurons.
No negative samples are needed, unlike in contrastive learning, making this a simpler method that
performs on par with the state-of-the-art. It is mildly affected by batch size, not degrading as much as
SimCLR.
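A sketch of this redundancy-reduction objective, assuming PyTorch and an illustrative trade-off weight:

import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same batch of images.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)   # standardize each neuron over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.t() @ z_b / z_a.size(0)          # D x D cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # redundancy-reduction term
    return on_diag + lam * off_diag

z_a, z_b = torch.randn(32, 128), torch.randn(32, 128)
print(barlow_twins_loss(z_a, z_b))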

8 Conclusion
Creating labeled data is time-consuming and expensive. Self-supervised methods overcome this by learning
from data alone. Task-specific models typically minimize photometric consistency measures. Pretext tasks
have been introduced to learn more generic representations. These generic representations can then be fine-
tuned for target tasks. However, classical pretext tasks (e.g., rotation) are often not well aligned with the
target task. Contrastive learning and redundancy reduction are better aligned and produce state-of-the-art
results, closing the gap to fully supervised ImageNet pretraining.
