
CS60010: Deep Learning

Spring 2023

Sudeshna Sarkar

Self-Supervised Learning
Sudeshna Sarkar
17 Mar 2023
Self-supervised Learning

• Self-supervised learning methods solve “pretext” tasks that produce good features for downstream tasks.
• Learn with supervised learning objectives, e.g., classification, regression.
• Labels of these pretext tasks are generated automatically.
Representation learning

• Learn what?
• How to learn?
• Learn from what?

[Figure: an image (coral, fish) passed through a network; layer 1 and layer 3 activations shown as successively more abstract representations, mapping the image to a compact mental image representation (“im2vec”).]

Represent an image as a neural embedding — a vector/tensor of neural activations (perhaps representing a vector of detected texture patterns or object parts).
Slide credit: Phillip Isola
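To make “im2vec” concrete, here is a minimal sketch (assuming PyTorch/torchvision; the choice of ResNet-18 and of its layer3 as the embedding layer is illustrative, not from the slides) of reading out an intermediate layer’s activations as the image’s neural embedding:

```python
# Sketch: extract a "layer 3" neural embedding from an image.
# Assumptions: PyTorch + torchvision; ResNet-18 / layer3 are illustrative.
import torch
import torchvision.models as models

model = models.resnet18(weights=None).eval()
feats = {}
# A forward hook grabs layer3's activation map and pools it into a vector.
model.layer3.register_forward_hook(
    lambda mod, inp, out: feats.update(emb=out.mean(dim=(2, 3))))

with torch.no_grad():
    model(torch.randn(1, 3, 224, 224))  # stand-in for a real image tensor
embedding = feats["emb"]                # (1, 256) vector of neural activations
```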
Investigating a representation via similarity analysis

How similar are these two images?

How about these two?

[Kriegeskorte et al. 2008]


Slide credit: Phillip Isola
Problem: Supervised Learning is Expensive!



[slide credit: Justin Johnson]
Vision in nature vs. supervised computer vision

Vision in nature (raw unlabeled training data):
+ Cheap
− Noisy
− Harder to interpret

Supervised computer vision (hand-curated training data):
+ Informative
− Expensive
− Limited to teacher’s knowledge

Slide credit: Phillip Isola


Representation Learning

Representations??

Slide credit: Phillip Isola


Unsupervised + Deep Learning

[Figure: unlabeled input data → unsupervised learning machine (trained by SGD on an objective) → pretrained “deep” representation, which must be good for transfer learning. Example objective: data dropout → prediction.]

• Unsupervised / self-supervised learning: predict one part of the data from another part.
Self-supervised pretext tasks

Learn to predict image transformations / complete corrupted images.

1. Solving the pretext tasks allows the model to learn good features.
2. We can automatically generate labels for the pretext tasks.
How to evaluate a self-supervised learning method?

1. Learn good feature extractors from self-supervised pretext tasks, e.g., predicting image rotations.
2. Evaluate the learned feature encoders on downstream target tasks (see the sketch below):
   • Attach a shallow network on top of the feature extractor;
   • train the shallow network on the target task with a small amount of labeled data.
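A minimal sketch of step 2 (assuming PyTorch; `encoder` stands for any pretrained feature extractor that outputs flat feature vectors — names and shapes are illustrative):

```python
# Sketch of the evaluation protocol: freeze the self-supervised encoder,
# attach a shallow (here: linear) head, and train only the head on labels.
# Assumption: `encoder` maps images to flat feat_dim-dimensional vectors.
import torch.nn as nn

def linear_probe(encoder, feat_dim, num_classes):
    for p in encoder.parameters():
        p.requires_grad = False          # keep pretrained features fixed
    encoder.eval()
    head = nn.Linear(feat_dim, num_classes)
    return nn.Sequential(encoder, head)  # optimize only head.parameters()
```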
Pretext task: predict rotations

Hypothesis: a model could recognize the correct rotation of an object only if it has the “visual commonsense” of what the object should look like unperturbed.
Pretext task: predict rotations

Self-supervised learning by rotating the entire input image. The model learns to predict which rotation was applied (4-way classification over 0°, 90°, 180°, 270°).
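A minimal sketch of this pretext task (assuming PyTorch; function names and the model interface are illustrative): every image in the batch is rotated four ways and the model classifies which rotation it sees.

```python
# Sketch of the rotation pretext task (4-way classification).
# Assumption: `model` outputs 4 logits per image.
import torch
import torch.nn.functional as F

def rotation_batch(images):
    """images: (N, C, H, W) -> (4N, C, H, W) rotated views, (4N,) labels."""
    views, labels = [], []
    for k in range(4):  # k * 90 degrees
        views.append(torch.rot90(images, k, dims=(2, 3)))
        labels.append(torch.full((images.size(0),), k, dtype=torch.long))
    return torch.cat(views), torch.cat(labels)

def rotation_loss(model, images):
    views, labels = rotation_batch(images)   # labels come "for free"
    logits = model(views)                    # (4N, 4)
    return F.cross_entropy(logits, labels)
```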
Pretext task: predict rotations — evaluation on semi-supervised learning

• Self-supervised learning on CIFAR10 (entire training set).
• Freeze conv1 + conv2; learn conv3 + linear layers with a subset of labeled CIFAR10 data (classification).
Transfer learned features to supervised learning

• Self-supervised learning with rotation prediction on ImageNet (entire training set) with AlexNet.
• Finetune on labeled data from Pascal VOC 2007.

[Figure: performance on Pascal VOC 2007, comparing self-supervised rotation pretraining against pretraining with full ImageNet supervision and no pretraining.]
Pretext task: predict relative patch locations

The model predicts the relative location of two patches sampled from the same image — a discriminative pretraining task.

Intuition: requires understanding objects and their parts.

Doersch et al, “Unsupervised Visual Representation Learning by Context Prediction”, ICCV 2015
[slide credit: Justin Johnson]
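A minimal sketch of how such training pairs can be generated (assuming PyTorch tensors; the patch size, gap, and sampling scheme are illustrative — the paper adds jitter and other tricks not shown here):

```python
# Sketch of relative-patch-location sampling: a center patch plus one of its
# 8 neighbors; the label is the neighbor's position (8-way classification).
# Assumption: img is a (C, H, W) tensor at least 3*(patch+gap) on each side.
import random

def sample_patch_pair(img, patch=64, gap=16):
    """Returns (center, neighbor, label in 0..7)."""
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    label = random.randrange(8)
    dy, dx = offsets[label]
    step = patch + gap                   # spacing between patch corners
    cy = random.randrange(step, img.size(1) - 2 * step)
    cx = random.randrange(step, img.size(2) - 2 * step)
    center = img[:, cy:cy + patch, cx:cx + patch]
    ny, nx = cy + dy * step, cx + dx * step
    neighbor = img[:, ny:ny + patch, nx:nx + patch]
    return center, neighbor, label
```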
Pretext task: solving “jigsaw puzzles”

(Noroozi & Favaro, 2016)
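A minimal sketch of the idea (assuming PyTorch; the random permutation set here is illustrative — Noroozi & Favaro select a fixed set of maximally different permutations):

```python
# Sketch of the jigsaw pretext task: cut an image into a 3x3 grid of tiles,
# shuffle them by one of K fixed permutations, classify which permutation.
# Assumptions: img is (C, 3*tile, 3*tile); PERMS is an illustrative stand-in.
import random
import torch

K = 100
PERMS = [torch.randperm(9) for _ in range(K)]  # fixed permutation set

def jigsaw_example(img, tile=64):
    """img -> (9, C, tile, tile) shuffled tiles, permutation label."""
    t = img.unfold(1, tile, tile).unfold(2, tile, tile)  # (C, 3, 3, tile, tile)
    t = t.reshape(img.size(0), 9, tile, tile).permute(1, 0, 2, 3)
    label = random.randrange(K)
    return t[PERMS[label]], label                        # reorder the 9 tiles
```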


Pretext task: predict missing pixels (inpainting)

Deepak Pathak, Philipp Krähenbühl, Jeff Donahue, Trevor Darrell, Alexei Efros. CVPR 2016
Feature Learning by Inpainting
Learning to inpaint by reconstruction

Learning to reconstruct the missing pixels


Context Encoders: Learning by Inpainting

Input image → Encoder φ → Decoder ψ → predict missing pixels

• L2 loss: best for feature learning.
• L2 + adversarial loss: best for nice images.

Pathak et al, “Context Encoders: Feature Learning by Inpainting”, CVPR 2016
[slide credit: Justin Johnson]
Learning to inpaint by reconstruction

• Loss = reconstruction + adversarial learning
• Adversarial loss between “real” images and inpainted images
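A minimal sketch of this combined objective (assuming PyTorch; the mask convention, the discriminator `disc`, and the weighting λ are illustrative — the paper weights the adversarial term much lower than reconstruction):

```python
# Sketch of the context-encoder objective: L2 reconstruction on the missing
# region plus an adversarial term that makes inpaintings look "real".
# Assumptions: mask is 1 on missing pixels; disc outputs real/fake logits.
import torch
import torch.nn.functional as F

def context_encoder_loss(pred, target, mask, disc, lambda_adv=0.001):
    # Reconstruction: L2 loss computed only on the masked (missing) pixels.
    rec = F.mse_loss(pred * mask, target * mask)
    # Adversarial: generator tries to make disc label inpaintings as real.
    logits = disc(pred)
    adv = F.binary_cross_entropy_with_logits(logits, torch.ones_like(logits))
    return rec + lambda_adv * adv
```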


Inpainting evaluation

[Figure: inpainting results for the input (context) with reconstruction loss only, adversarial loss only, and reconstruction + adversarial losses.]
Pretext task: image colorization
Summary: pretext tasks from image transformations

• Pretext tasks focus on “visual common sense”, e.g., predicting rotations, inpainting, rearrangement, and colorization.
• The models are forced to learn good features about natural images, e.g., a semantic representation of an object category, in order to solve the pretext tasks.
• We don’t care about the performance of these pretext tasks, but rather how useful the learned features are for downstream tasks (classification, detection, segmentation).
• Problems: (1) coming up with individual pretext tasks is tedious, and (2) the learned representations may not be general.
Pretext tasks from image transformations

• Learned representations may be tied to a specific pretext task! Can we come up with a more general pretext task?
Contrastive representation learning

• Intuition and formulation
• Instance contrastive learning: SimCLR and MoCo
• Sequence contrastive learning: CPC
A more general pretext task?
Contrastive Representation Learning
Contrastive Learning

Assume we don’t have labels for images, but we know whether some pairs of images are similar or dissimilar.

• Similar images should have similar features; dissimilar images should have dissimilar features.
• Let d be the Euclidean distance between the features of two images x₁ and x₂. Then:

  L_S(x₁, x₂) = d²                 (pull features together)
  L_D(x₁, x₂) = max(0, m − d)²     (push features apart, up to margin m)

Hadsell et al, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR 2006
[slide credit: Justin Johnson]
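A minimal sketch of this pairwise loss (assuming PyTorch; y = 1 marks a similar pair, y = 0 a dissimilar one — names and shapes are illustrative):

```python
# Sketch of the margin-based contrastive loss (Hadsell et al., 2006):
# pull similar pairs together (d^2), push dissimilar pairs apart up to m.
# Assumptions: f1, f2 are (N, D) feature batches; y is a (N,) 0/1 tensor.
import torch
import torch.nn.functional as F

def contrastive_loss(f1, f2, y, margin=1.0):
    d = F.pairwise_distance(f1, f2)                   # Euclidean distance
    loss_sim = d.pow(2)                               # L_S = d^2
    loss_dis = torch.clamp(margin - d, min=0).pow(2)  # L_D = max(0, m-d)^2
    return (y * loss_sim + (1 - y) * loss_dis).mean()
```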
Contrastive Learning
Problem: Where to get positive and negative pairs?

Contrastive Learning with Data Augmentation

• Take a batch of N images.
• Create two augmentations of each image, giving 2N views x₁, …, x₍₂N₎.
• Extract features φ(x) for each view.
• Each image tries to predict which of the other 2N − 1 images came from the same original image.

Similarity between x_i and x_j (cosine similarity):

  s_(i,j) = φ(x_i)ᵀ φ(x_j) / (‖φ(x_i)‖ ‖φ(x_j)‖)

If (x_i, x_j) is a positive pair, then the loss for x_i is:

  L_i = −log [ exp(s_(i,j)/τ) / Σ_(k=1, k≠i)^(2N) exp(s_(i,k)/τ) ]     (τ is a temperature)

Interpretation: a cross-entropy loss over the other 2N − 1 elements in the batch!

Hadsell et al, “Dimensionality Reduction by Learning an Invariant Mapping”, CVPR 2006
Wu et al, “Unsupervised Feature Learning by Non-Parametric Instance-Level Discrimination”, CVPR 2018
Van den Oord et al, “Representation Learning with Contrastive Predictive Coding”, NeurIPS 2018
Hjelm et al, “Learning Deep Representations by Mutual Information Estimation and Maximization”, ICLR 2019
Bachman et al, “Learning Representations by Maximizing Mutual Information Across Views”, NeurIPS 2019
Tian et al, “Contrastive Multiview Coding”, ECCV 2020
Henaff et al, “Data-Efficient Image Recognition with Contrastive Predictive Coding”, ICML 2020
He et al, “Momentum Contrast for Unsupervised Visual Representation Learning”, CVPR 2020
Chen et al, “A Simple Framework for Contrastive Learning of Visual Representations”, ICML 2020

[slide credit: Justin Johnson]
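A minimal sketch of this loss (assuming PyTorch; the convention that the two augmented views of image k sit at rows 2k and 2k+1 of the feature matrix is illustrative):

```python
# Sketch of the batch contrastive loss above: cosine similarities between all
# 2N features; for each view, cross-entropy against its positive partner,
# i.e. a softmax over the other 2N-1 elements in the batch.
# Assumption: feats is (2N, D); rows 2k and 2k+1 form a positive pair.
import torch
import torch.nn.functional as F

def batch_contrastive_loss(feats, tau=0.5):
    n = feats.size(0)                               # n = 2N views
    z = F.normalize(feats, dim=1)                   # unit norm: dot = cosine
    sim = (z @ z.t()) / tau                         # s_(i,j)/tau for all pairs
    mask = torch.eye(n, dtype=torch.bool, device=feats.device)
    sim = sim.masked_fill(mask, float("-inf"))      # drop s_(i,i) terms
    pos = torch.arange(n, device=feats.device) ^ 1  # partner: 2k <-> 2k+1
    return F.cross_entropy(sim, pos)                # mean of L_i over batch
```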
