
Self-supervised Learning

Clara Gonçalves
April 2024

1 Preliminaries
1.1 Data Annotation
Supervised learning requires very large amounts of annotated data. Annotating such large datasets is a
labor-intensive and time-consuming task, which significantly impacts the cost and feasibility of creating
high-quality training data.

1.2 Image Classification


Types of human error in image classification include:
• Fine-grained recognition: there are more than 120 breeds of dogs in the dataset; an estimated 28 (37%) of the human errors fall into this category.
• Class unawareness: the annotator may be unaware that the ground-truth class is present as a label option; this accounts for 18 (24%) of the human errors.
• Insufficient training data: the annotator is only shown 13 examples of a class under each category name, which is insufficient for generalization; approximately 4 (5%) of the human errors fall into this category.
Moreover, image classification is expensive. For example, the creation of ImageNet, which contains 14
million images, required an estimated 22 human years of annotation effort.

1.3 Dense Semantic and Instance Annotation


The Cityscapes dataset is an example of dense semantic and instance annotation. The annotation process
is very time-consuming, requiring significant human effort.

1.4 Stereo, Monodepth, and Optical Flow


The KITTI dataset includes stereo, monocular depth, optical flow, and scene flow annotations. Depth
correspondences are particularly challenging for humans to annotate accurately, so LiDAR and manual
object segmentation/tracking are often used.

1.5 Human Labeling is Sparse


Humans can learn very effectively with just a few labeled examples. They learn through interaction and
observation, which allows for rapid generalization even with sparse labeling.

2 In the Context of Computer Vision


2.1 Self-Supervision
Self-supervision involves using the observed data to infer hidden properties of the data. The idea is to obtain
labels from raw, unlabeled data itself by predicting parts of the data from other parts. This includes tasks
such as predicting:

• Any part of the input from any other part.
• The future from the past.
• The past from the present.
• The top from the bottom.
• The occluded from the visible.
• Any part of the input you pretend not to know, from the rest.

2.2 Example: Denoising Autoencoder


A Denoising Autoencoder (DAE) is trained to reconstruct the clean input from a corrupted version of it. After
training, only the encoder is kept and the decoder is discarded.
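A minimal sketch of the idea, assuming PyTorch and hypothetical layer sizes:

import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy denoising autoencoder: reconstruct the clean input from a corrupted version.
encoder = nn.Sequential(nn.Linear(784, 128), nn.ReLU(), nn.Linear(128, 32))
decoder = nn.Sequential(nn.Linear(32, 128), nn.ReLU(), nn.Linear(128, 784))

x = torch.rand(64, 784)                  # clean inputs
x_noisy = x + 0.3 * torch.randn_like(x)  # corrupted version
recon = decoder(encoder(x_noisy))        # predict the clean input
loss = F.mse_loss(recon, x)
loss.backward()
# After training, only `encoder` is kept as the learned representation.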

3 Learning Problems
3.1 Reinforcement Learning
Reinforcement learning involves learning model parameters through active exploration and sparse rewards.
Examples include Deep Q-Learning and Policy Gradient methods like Actor-Critic.

3.2 Unsupervised Learning


Unsupervised learning involves learning model parameters using a dataset without labels. Examples include
clustering, dimensionality reduction, and generative models.

3.3 Supervised Learning


Supervised learning involves learning model parameters using a dataset of data-label pairs. Examples include
classification, regression, and structured prediction.

3.4 Self-Supervised Learning


Self-supervised learning involves learning model parameters using a dataset of data-data pairs. Examples
include self-supervised stereo/flow and contrastive learning.

4 Task-Specific Models
4.1 Unsupervised Learning of Depth and Ego-Motion
Train a Convolutional Neural Network (CNN) to jointly predict depth and relative pose from three video
frames. The photometric consistency loss is computed between the warped source views I_{t-1}/I_{t+1} and the
target view I_t.
The model typically uses a traditional U-Net (DispNet) with multi-scale prediction/loss. The final
objective includes photometric consistency, smoothness, and explainability losses.
Performance is nearly on par with depth or pose supervised methods, though it can fail in the presence
of dynamic objects due to the assumption of a static scene.
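A hedged sketch of the photometric consistency term, assuming the source view has already been warped into the target frame using the predicted depth and pose (tensor shapes are hypothetical):

import torch

def photometric_loss(warped_src, target, mask=None):
    # L1 photometric consistency between a source view (I_{t-1} or I_{t+1}) warped
    # into the target frame and the target view I_t.
    diff = (warped_src - target).abs().mean(dim=1, keepdim=True)  # per-pixel error
    if mask is not None:                                          # e.g. explainability mask
        diff = diff * mask
    return diff.mean()

warped = torch.rand(4, 3, 128, 416)  # hypothetical warped source view
target = torch.rand(4, 3, 128, 416)  # target view I_t
print(photometric_loss(warped, target))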

4.2 Unsupervised Monocular Depth Estimation from Stereo
Monodepth from stereo supervision can produce a disparity map aligned with the target instead of the input
due to naive sampling. The No-LR method corrects for this but suffers from artifacts.
The proposed approach uses the left image to produce disparities for both images, improving quality by
enforcing mutual consistency. Losses include photometric consistency, disparity smoothness, and left-right
consistency. Results are compared with and without the left-right consistency loss.
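A simplified sketch of the left-right consistency term, assuming PyTorch and disparities expressed in normalized image coordinates; the warp and sign conventions here are illustrative, not the paper's exact implementation:

import torch
import torch.nn.functional as F

def warp_horizontal(img, disp):
    # Sample `img` at x - disp (disparity in normalized [-1, 1] units); a simplified warp.
    b, _, h, w = img.shape
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h), torch.linspace(-1, 1, w), indexing="ij")
    grid = torch.stack((xs, ys), dim=-1).unsqueeze(0).repeat(b, 1, 1, 1)
    grid[..., 0] = grid[..., 0] - disp.squeeze(1)
    return F.grid_sample(img, grid, align_corners=True)

disp_l = torch.rand(2, 1, 64, 128) * 0.1  # disparities for the left view (hypothetical)
disp_r = torch.rand(2, 1, 64, 128) * 0.1  # disparities for the right view, also predicted from the left image
# Left-right consistency: the right disparity, warped into the left view, should match disp_l.
disp_r_in_l = warp_horizontal(disp_r, disp_l)
lr_loss = (disp_l - disp_r_in_l).abs().mean()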

4.3 Digging Into Self-Supervised Monocular Depth Estimation


Monodepth2 introduces a per-pixel minimum photometric consistency loss over source views to handle occlusions.
Computing all losses at the input resolution reduces texture-copy artifacts. Auto-masking is used to ignore
pixels whose appearance does not change between frames, e.g., when the camera is static or objects move at
the same speed as the camera.
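A sketch of the per-pixel minimum reprojection loss with auto-masking, assuming the per-source photometric errors have already been computed (tensor shapes are hypothetical):

import torch

def min_reprojection_loss(reproj_errors, identity_errors):
    # reproj_errors / identity_errors: (B, num_sources, H, W) photometric errors of the
    # warped and un-warped source views against the target view.
    min_reproj, _ = reproj_errors.min(dim=1)      # per-pixel minimum handles occlusions
    min_identity, _ = identity_errors.min(dim=1)
    # Auto-masking: keep only pixels where warping actually helps, i.e. ignore pixels
    # that do not change between frames (static camera, objects moving with the camera).
    mask = (min_reproj < min_identity).float()
    return (mask * min_reproj).sum() / mask.sum().clamp(min=1)

reproj = torch.rand(2, 2, 64, 128)
identity = torch.rand(2, 2, 64, 128)
print(min_reprojection_loss(reproj, identity))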

4.4 Unsupervised Learning of Optical Flow


Bidirectional training involves sharing weights between both directions to train a universal network for
optical flow that predicts accurate flow in either direction (forward and backward). The data loss compares
flow-warped images to the original images. The consistency loss ensures forward-backward consistency, and
both losses are masked based on the estimated occlusion.
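A sketch of the masked forward-backward consistency term, assuming the backward flow has already been warped into the first frame and an occlusion mask has been estimated (shapes are hypothetical):

import torch

def fb_consistency_loss(flow_fw, flow_bw_warped, occlusion_mask):
    # At non-occluded pixels, the backward flow warped into the first frame should be
    # the negative of the forward flow.
    diff = (flow_fw + flow_bw_warped).abs().sum(dim=1, keepdim=True)
    valid = 1.0 - occlusion_mask                   # mask out estimated occlusions
    return (diff * valid).sum() / valid.sum().clamp(min=1)

flow_fw = torch.rand(2, 2, 64, 128)
flow_bw_warped = -flow_fw + 0.01 * torch.randn(2, 2, 64, 128)
occ = torch.zeros(2, 1, 64, 128)
print(fb_consistency_loss(flow_fw, flow_bw_warped, occ))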

4.5 Self-Supervised Scene Flow Estimation


Scene flow can be estimated in a self-supervised manner by jointly learning monocular depth and optical flow
from stereo videos, trained with smoothness and photometric consistency losses.

5 Pretext Tasks
5.1 Visual Representation Learning by Context Prediction
The goal of context prediction is to predict the relative position of patches (discrete set). This requires the
model to learn to recognize objects and their parts. Care must be taken to avoid trivial shortcuts, such as
edge continuity.
Chromatic aberration: color channels shift relative to one another depending on the position in the image,
which can reveal a patch's location. The solution is to randomly drop color channels or to project towards grayscale.
The learned representation can be evaluated qualitatively by nearest-neighbor retrieval for a query image using fc6 features.
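A rough sketch of how training pairs for context prediction could be sampled (NumPy; patch and gap sizes are hypothetical, and the image is assumed large enough):

import random
import numpy as np

def sample_patch_pair(img, patch=64, gap=32):
    # img: H x W x 3 array. Sample a center patch and one of its 8 neighbors;
    # the label is the neighbor index (8-way classification). The gap between
    # patches (plus jitter in the original method) prevents edge-continuity shortcuts.
    h, w = img.shape[:2]
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1), (0, 1), (1, -1), (1, 0), (1, 1)]
    stride = patch + gap
    y = random.randint(stride, h - 2 * stride)
    x = random.randint(stride, w - 2 * stride)
    label = random.randrange(8)
    dy, dx = offsets[label]
    center = img[y:y + patch, x:x + patch]
    neighbor = img[y + dy * stride:y + dy * stride + patch,
                   x + dx * stride:x + dx * stride + patch]
    return center, neighbor, label

img = np.random.rand(480, 640, 3)
center, neighbor, label = sample_patch_pair(img)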

5.2 Visual Representation Learning by Solving Jigsaw Puzzles


Input: nine patches permuted using one of N permutations. Output: N-way classification with N ≪ 9!. The
Jigsaw puzzle task predicts one out of 1000 possible random permutations. Permutations are chosen based
on Hamming distance to increase difficulty.
A Siamese architecture is used, with concatenation of features and an MLP for 1000-way classification.
Shortcuts, such as adjacent patches including similar low-level statistics and edge continuity, are prevented
by normalizing patch mean and variance and selecting 64x64 pixel tiles randomly from 85x85 pixel cells.
Chromatic aberration is handled by using grayscale images or spatially jittering each color channel by a
few pixels.
For transfer learning, pre-trained weights are used as initialization for the convolutional layers of AlexNet,
while the remaining layers are trained from scratch (random initialization). Fine-tuning features for PASCAL
VOC classification, detection, and segmentation shows performance approaching fully supervised pre-training
on ImageNet.
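A small-scale sketch of how a well-separated permutation set could be chosen by maximizing Hamming distance (a greedy approximation; the exact numbers are illustrative):

import random
import numpy as np

def select_permutations(n_perms=100, n_tiles=9, n_candidates=1000):
    # Greedily pick permutations that are far from each other in Hamming distance,
    # so the N-way classification stays difficult.
    pool = [np.random.permutation(n_tiles) for _ in range(n_candidates)]
    chosen = [pool.pop(random.randrange(len(pool)))]
    while len(chosen) < n_perms:
        # distance of each candidate to its nearest already-chosen permutation
        dists = [min(int((c != p).sum()) for c in chosen) for p in pool]
        chosen.append(pool.pop(int(np.argmax(dists))))
    return np.stack(chosen)

perms = select_permutations()
print(perms.shape)  # (100, 9)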

5.3 Feature Learning by Inpainting
The inpainting task involves trying to recover a region that has been removed from the context. This is
similar to a DAE but requires more semantic knowledge due to the large missing region.

5.4 Visual Representation Learning by Predicting Image Rotations


The rotation task involves trying to recover the true orientation (4-way classification). To accurately recover
the correct orientation, semantic knowledge is required.
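A minimal sketch of how rotated inputs and their 4-way labels could be generated (PyTorch assumed):

import torch

def rotation_batch(images):
    # images: (B, C, H, W). Produce all four rotations (0/90/180/270 degrees)
    # together with the corresponding 4-way classification labels.
    rotated = torch.cat([torch.rot90(images, k, dims=(2, 3)) for k in range(4)], dim=0)
    labels = torch.arange(4).repeat_interleave(images.size(0))
    return rotated, labels

imgs = torch.rand(8, 3, 32, 32)
rotated, labels = rotation_batch(imgs)  # (32, 3, 32, 32), (32,)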
Pretext Task: Pretext tasks focus on "visual common sense," such as rearrangement, predicting rotations,
inpainting, and colorization. Models are forced to learn good features of natural images, such as a semantic
representation of an object category, in order to solve the pretext task.
The goal is not pretext task performance, but rather the utility of the learned features for downstream
tasks (classification, detection, segmentation).
Problems: Designing good pretext tasks is tedious and somewhat an art. The learned representations
may not be general.

6 Contrastive Learning
6.1 Hope of Generalization
We hope that the pretraining task and the transfer task are aligned. In practice, the transfer performance of
pretext features saturates, since the last layers become specific to the pretext task.

6.2 Desiderata
Can we find a more general pretext task? Pre-trained features should represent how images relate to each
other and should be invariant to nuisance factors (location, lighting, color). Augmented versions generated
from one reference image are called views.

6.3 Contrastive Learning


Given a chosen score function s(·, ·), we want to learn an encoder f that yields a high score for positive pairs
(x, x+) and a low score for negative pairs (x, x−). Formally,

s(f(x), f(x+)) ≫ s(f(x), f(x−)).


Assuming we have one reference x, one positive x+, and N − 1 negatives x−_j, consider the following
multi-class cross-entropy loss function:

L = −log [ exp(s(f(x), f(x+))) / ( exp(s(f(x), f(x+))) + Σ_{j=1}^{N−1} exp(s(f(x), f(x−_j))) ) ].

This is commonly known as the InfoNCE loss. Minimizing it maximizes a lower bound on the mutual information
between f(x) and f(x+): I(f(x); f(x+)) ≥ log(N) − L.
The key idea is that maximizing mutual information between features extracted from multiple views
forces the features to capture information about higher-level factors. Importantly, the larger the negative
sample size (N), the tighter the bound.
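A sketch of the InfoNCE loss for a single reference, with cosine similarity as the score function (PyTorch assumed; the temperature value is illustrative):

import torch
import torch.nn.functional as F

def info_nce(z_ref, z_pos, z_negs, temperature=0.1):
    # z_ref, z_pos: (D,); z_negs: (N-1, D). Cosine similarity via L2-normalized features.
    z_ref = F.normalize(z_ref, dim=0)
    z_pos = F.normalize(z_pos, dim=0)
    z_negs = F.normalize(z_negs, dim=1)
    pos = (z_ref * z_pos).sum() / temperature      # s(f(x), f(x+))
    negs = z_negs @ z_ref / temperature            # s(f(x), f(x-_j)), j = 1..N-1
    logits = torch.cat([pos.unsqueeze(0), negs])   # the positive is class 0
    return F.cross_entropy(logits.unsqueeze(0), torch.zeros(1, dtype=torch.long))

z_ref, z_pos, z_negs = torch.rand(64), torch.rand(64), torch.rand(127, 64)
print(info_nce(z_ref, z_pos, z_negs))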

6.4 Design Choices


1. Score function: Cosine similarity is commonly used.
2. Examples: how positive and negative pairs are chosen, e.g., augmented views of the same image as
positives and other images in the batch as negatives.
3. Augmentations: crop, resize, flip, rotation, cutout, color drop, color jitter, Gaussian noise/blur, and
Sobel filtering.

7 A Simple Framework for Contrastive Learning
SimCLR uses cosine similarity as the score function and a projection network g(·) that maps the encoder
features h to a space z where the contrastive loss is applied. The projection improves learning: information
relevant to downstream tasks is preserved in h even if it is discarded in z.
A positive pair is generated by sampling two data augmentation functions for each image. Each of the 2N
augmented samples is used in turn as the reference, and the average loss is computed. The InfoNCE loss uses
all non-positive samples in the batch as negatives x−.
Two evaluation protocols are used. Linear evaluation: train the feature encoder on the entire ImageNet
training set using SimCLR, freeze it, and train a linear classifier on top with labeled data. Semi-supervised
evaluation: train the encoder in the same way and fine-tune it with 1% / 10% of the ImageNet labels.
A large training batch size is crucial for SimCLR. Large batches cause a large memory footprint during
backpropagation, requiring distributed training on TPUs for the ImageNet experiments.
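A compact sketch of the batch-wise loss over 2N augmented samples, with a stand-in encoder f and projection head g (all sizes are hypothetical, and this is a simplification rather than the actual SimCLR implementation):

import torch
import torch.nn as nn
import torch.nn.functional as F

f = nn.Sequential(nn.Flatten(), nn.Linear(3 * 32 * 32, 256), nn.ReLU())  # stand-in encoder
g = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 64))    # projection head

def nt_xent(z, temperature=0.5):
    # z: (2N, D) projections of N images under two augmentations, ordered so that
    # samples i and i+N form a positive pair; all other samples act as negatives.
    z = F.normalize(z, dim=1)
    sim = z @ z.t() / temperature
    sim.fill_diagonal_(float("-inf"))  # a sample is never compared against itself
    n = z.size(0) // 2
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(0, n)])
    return F.cross_entropy(sim, targets)

x1, x2 = torch.rand(8, 3, 32, 32), torch.rand(8, 3, 32, 32)  # two augmented views
h = f(torch.cat([x1, x2]))   # representations h used for downstream tasks
loss = nt_xent(g(h))         # contrastive loss computed on projections z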

7.1 Momentum Contrast


The difference with SimCLR is that Momentum Contrast (MoCo) keeps a running queue (dictionary) of keys
for negative samples (i.e., a ring buffer of minibatches). Gradients are backpropagated only to the query
encoder, not the queue.
Thus, the dictionary can be much larger than the minibatch, but it may become inconsistent as the
encoder changes during training. To improve the consistency of keys in the queue, a momentum update rule
is used.
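A sketch of the momentum update and the key queue, using stand-in encoders (sizes are hypothetical; 0.999 is a commonly used momentum value):

import copy
import torch
import torch.nn as nn

query_encoder = nn.Linear(128, 64)             # stand-in query encoder
key_encoder = copy.deepcopy(query_encoder)     # key encoder starts as a copy
queue = torch.randn(4096, 64)                  # ring buffer of keys (hypothetical size)

def momentum_update(m=0.999):
    # theta_k <- m * theta_k + (1 - m) * theta_q; only the query encoder gets gradients.
    with torch.no_grad():
        for p_q, p_k in zip(query_encoder.parameters(), key_encoder.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

def enqueue(keys):
    # Push the newest minibatch of keys and drop the oldest ones (fixed-size queue).
    global queue
    queue = torch.cat([keys.detach(), queue])[: queue.size(0)]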

7.2 Improved Momentum Contrast


Key insights include the importance of a non-linear projection head and strong data augmentation. Using
both, MoCo v2 outperforms SimCLR with much smaller minibatch sizes.

7.3 Barlow Twins


Inspired by information theory, Barlow Twins tries to reduce redundancy between neurons: each neuron should
be invariant to data augmentation but independent of the other neurons. The idea is to compute the
cross-correlation matrix between the embeddings of two augmented views and encourage it to be the identity
matrix: the diagonal elements make each neuron correlated across augmentations (invariance), while pushing
the off-diagonal elements to zero removes redundancy between neurons.
No negative samples are needed, unlike in contrastive learning, making this a simpler method that
performs on par with the state-of-the-art. It is mildly affected by batch size, not degrading as much as
SimCLR.
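A sketch of this redundancy-reduction objective, assuming PyTorch and an illustrative trade-off weight:

import torch

def barlow_twins_loss(z_a, z_b, lam=5e-3):
    # z_a, z_b: (N, D) embeddings of two augmented views of the same batch of images.
    z_a = (z_a - z_a.mean(0)) / z_a.std(0)   # standardize each neuron over the batch
    z_b = (z_b - z_b.mean(0)) / z_b.std(0)
    c = z_a.t() @ z_b / z_a.size(0)          # D x D cross-correlation matrix
    on_diag = (torch.diagonal(c) - 1).pow(2).sum()                # invariance term
    off_diag = (c - torch.diag(torch.diagonal(c))).pow(2).sum()   # redundancy-reduction term
    return on_diag + lam * off_diag

z_a, z_b = torch.randn(32, 128), torch.randn(32, 128)
print(barlow_twins_loss(z_a, z_b))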

8 Conclusion
Creating labeled data is time-consuming and expensive. Self-supervised methods overcome this by learning
from data alone. Task-specific models typically minimize photometric consistency measures. Pretext tasks
have been introduced to learn more generic representations. These generic representations can then be fine-
tuned for target tasks. However, classical pretext tasks (e.g., rotation) are often not well aligned with the
target task. Contrastive learning and redundancy reduction are better aligned and produce state-of-the-art
results, closing the gap to fully supervised ImageNet pretraining.
