
CSE 465

Lecture 17
Advanced Neural Network Architectures
Multi-Modal Models: CLIP
• CLIP, short for Contrastive Language-Image Pre-training, is a deep learning
model developed by OpenAI
• CLIP’s embeddings for images and text share the same space
• Therefore, it is possible to directly compare embeddings across the two modalities
• This is accomplished by training the model to bring related images and
texts closer together while pushing unrelated ones apart.
CLIP
• CLIP can be used for image classification tasks by associating images with
natural language descriptions
• It allows for more versatile and flexible image retrieval systems where users
can search for images using textual queries
• CLIP can be used to moderate content on online platforms by analyzing
images and accompanying text to identify and filter out inappropriate or
harmful content
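For example, zero-shot image classification with a pretrained CLIP model can be sketched as follows, assuming the Hugging Face transformers library is available; the checkpoint name, image path, and candidate labels are illustrative choices, not part of the lecture.

```python
# Zero-shot classification sketch with a pretrained CLIP model
# (illustrative checkpoint, image path, and labels).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarity scores
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: float(p) for label, p in zip(labels, probs[0])})
```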
How does CLIP work
• Given a batch of N (image, text) pairs, CLIP is trained to predict which of the
N × N possible (image, text) pairings within the batch actually occurred
• To achieve this, CLIP establishes a multi-modal embedding space through
the joint training of an image encoder and text encoder
• The CLIP loss aims to maximize the cosine similarity between the image and
text embeddings for the N genuine pairs in the batch while minimizing the
cosine similarity for the N² − N incorrect pairings
• The optimization process involves using a symmetric cross-entropy loss
function that operates on these similarity scores
Architecture
• CLIP uses two separate architectures as backbones for encoding the vision and
text inputs:
• Image encoder: Represents the neural network architecture (e.g., ResNet or Vision
Transformer) responsible for encoding images.
• Text encoder: Represents the neural network architecture (e.g., CBOW, BERT, or Text
Transformer) responsible for encoding textual information.
• The original CLIP model was trained from scratch, without initializing the
image encoder or the text encoder with pre-trained weights, since it was
trained on a very large dataset (400 million image–text pairs)
Input
• The model takes a batch of n pairs of images and texts as input where:
• I[n, h, w, c]: Represents a minibatch of aligned images
• Here n is the batch size, h is the image height, w is the image width, and c is the number of
channels.
• T[n, l]: Represents a minibatch of aligned texts
• Here n is the batch size, and l is the length of the textual sequence.
• Feature extraction
• Image Encoder: Extracts feature representations from the input images.
• The shape of the feature set is [n, d_i],
• Here d_i is the dimensionality of the image features.
• Text Encoder: Extracts feature representations from the input texts.
• The shape of the text feature set is [n, d_t]
• Here d_t is the dimensionality of the text features.

Learned Projections
• CLIP learns a projection matrix that maps image features into the embedding space
• The resulting embedding is the image's representation in the shared space
• The shape of the matrix is [d_i, d_e]
• Here, d_e is the desired dimensionality of the joint embedding space.
• Similarly, a projection matrix is learned for the text representation
• The shape of this matrix is [d_t, d_e]
• Again, the matrix projects the text representation into the d_e-dimensional space
• The projection operation can be implemented with two linear layers (one per
modality), whose weights are the learned projection matrices
• When adapting CLIP to new datasets, these projection weights are often the only
weights kept trainable (i.e., with active gradients).
• Additionally, the projection layer plays a crucial role in aligning the dimensions of
image and text embeddings, ensuring that they have the same size.
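A minimal sketch of this projection step, assuming PyTorch; the batch size and the dimensions d_i, d_t, and d_e below are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d_i, d_t, d_e = 32, 2048, 512, 512            # illustrative sizes

image_features = torch.randn(n, d_i)             # [n, d_i] from the image encoder
text_features = torch.randn(n, d_t)              # [n, d_t] from the text encoder

# One learned linear projection per modality; the weights play the role of the
# [d_i, d_e] and [d_t, d_e] projection matrices described above.
image_proj = nn.Linear(d_i, d_e, bias=False)
text_proj = nn.Linear(d_t, d_e, bias=False)

# Project both modalities into the joint d_e-dimensional space and L2-normalize,
# so that dot products between embeddings are cosine similarities.
image_emb = F.normalize(image_proj(image_features), dim=-1)   # [n, d_e]
text_emb = F.normalize(text_proj(text_features), dim=-1)      # [n, d_e]
```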
Symmetric Loss Function
• CLIP uses contrastive loss to bring related images and texts closer together
while pushing unrelated ones apart.
• labels = arange(n): Generates labels representing the indices of the batch, since the
i-th image matches the i-th text.
• loss_i = cross_entropy_loss(logits, labels, axis=0): Computes the cross-entropy loss
along the image axis.
• loss_t = cross_entropy_loss(logits, labels, axis=1): Computes the cross-entropy loss
along the text axis.
• loss = (loss_i + loss_t)/2: Computes the symmetric average of the image and text
losses.
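Collecting the steps above, here is a small runnable sketch of the symmetric loss in PyTorch. This is not the authors' implementation; in particular, the fixed temperature value stands in for CLIP's learned temperature parameter.

```python
import torch
import torch.nn.functional as F

def clip_symmetric_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) embeddings.

    image_emb, text_emb: [n, d_e] tensors already projected into the joint space.
    temperature: fixed stand-in for CLIP's learned temperature parameter.
    """
    # L2-normalize so the dot products below are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [n, n] matrix of scaled pairwise cosine similarities
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so the targets are the batch indices
    labels = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

# Usage with random embeddings, just to show the expected shapes
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_symmetric_loss(img, txt))
```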
Contrastive Loss
• The main idea behind contrastive pre-training is to teach the model to
differentiate between “positive” and “negative” pairs. In the context of
CLIP:
• Positive Pairs: Pairs of images and texts that are truly related or semantically
meaningful; after training, these pairs should have high cosine similarity. For
example, an image of a cat paired with a matching description like “a cute cat.”
• Negative Pairs: Pairs where the image and text do not match. For example, an
image of a cat paired with an unrelated description like “a sunny day at the beach.”
• The cosine similarity score provides a measure of how well the input image
and text match in the shared embedding space.
• Higher cosine similarity scores indicate stronger semantic alignment, suggesting that
the text and image are more closely related or that the image corresponds to the
textual description.
• Lower scores suggest weaker alignment or a mismatch.
Contrastive Loss Function
• A contrastive loss is low when the two correlated vectors (the embeddings of a
positive pair) are close to each other in the embedding space, and high otherwise.
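One way to write this loss, consistent with the symmetric cross-entropy formulation on the earlier slides, is sketched below; here sim denotes cosine similarity, τ the temperature, I_i and T_i the image and text embeddings of the i-th pair, and N the batch size.

```latex
\mathcal{L}^{\text{img}\to\text{txt}}_{i}
  = -\log \frac{\exp\!\big(\mathrm{sim}(I_i, T_i)/\tau\big)}
               {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(I_i, T_j)/\tau\big)},
\qquad
\mathcal{L}
  = \frac{1}{2N} \sum_{i=1}^{N}
    \Big( \mathcal{L}^{\text{img}\to\text{txt}}_{i} + \mathcal{L}^{\text{txt}\to\text{img}}_{i} \Big)
```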
Unsupervised Image Embedding
General Visual Representation Learning
• Done from unlabeled image dataset (unsupervised)
• After unsupervised learning, the learned model and image representations
can be used for downstream applications
• Generative modeling
• Generate or otherwise model pixels in the input space
• Pixel-level generation is computationally expensive
• Generating high-fidelity images may not be necessary for representation learning
• Discriminative modeling
• Train networks to perform pretext tasks where inputs and labels are derived from an
unlabeled dataset
• Heuristic-based pretext tasks: rotation prediction, relative patch location prediction,
colorization, solving jigsaw puzzles
• Many heuristics seem ad-hoc and may be limiting
Self-supervised learning
• Autoencoders are a traditional self-supervised learning approach that trains a
model to reconstruct the input from a learned low-dimensional
representation. Variants of autoencoders include denoising autoencoders,
where noise is added to the input during training (a minimal sketch follows this list)
• Data augmentation is a widely used technique in computer vision research,
with numerous techniques such as flipping, cropping, and coloring. Image
overlay (also known as image mixture) is also a type of augmentation
method, but has been less explored compared to others
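A minimal denoising-autoencoder sketch, assuming PyTorch; the layer sizes, noise level, and flattened 28×28 inputs are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder compresses the input to a low-dimensional code; decoder reconstructs it.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(16, 784)                    # a batch of flattened "images"
x_noisy = x + 0.3 * torch.randn_like(x)    # denoising variant: corrupt the input

reconstruction = decoder(encoder(x_noisy))
loss = F.mse_loss(reconstruction, x)       # reconstruct the clean input
loss.backward()
```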
SimCLR Method
• Maximize the agreement of representations under data transformation,
using a contrastive loss in the latent/feature space
• A framework for contrastive representation learning.
• Two separate stochastic data augmentations t, t’ ∼ T are applied to each
example to obtain two correlated views
• A base encoder network f(.) and a projection head g(.) are trained to
maximize agreement between the latent representations
Data augmentation
• SimCLR uses random crop and color distortion for augmentation
• Transforms a given image randomly in two ways, yielding two correlated
views of the same example
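A minimal sketch of such an augmentation pipeline using torchvision; the crop size, jitter strengths, and probabilities are illustrative choices rather than the exact values used in the lecture.

```python
from PIL import Image
import torchvision.transforms as T

# Random crop + color distortion, applied twice to get two correlated views
simclr_augment = T.Compose([
    T.RandomResizedCrop(224),                                   # random crop, resized to 224x224
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

img = Image.new("RGB", (256, 256))   # placeholder image
view_1 = simclr_augment(img)         # two independent draws of the stochastic
view_2 = simclr_augment(img)         # transform give the correlated views x_i, x_j
```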
Base Encoder
• f(x) is the base encoder network that computes the internal representation h
• SimCLR uses a ResNet; however, it is possible to use other networks
• This is the model whose weights we are training for the eventual
downstream task.
Projection Head
• g(h) is a projection network that maps the representation h to a latent space
• SimCLR uses a 2-layer MLP
• The projection head maps the representation vectors to a smaller dimension
before the loss function is applied
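A minimal sketch of such a 2-layer MLP projection head, assuming PyTorch; the 2048-dimensional input (e.g., a ResNet-50 feature) and 128-dimensional output are illustrative sizes.

```python
import torch.nn as nn

# g(h): maps the encoder output h to a smaller latent vector z,
# on which the contrastive loss is computed.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)
```

In the SimCLR setup, the projection head is discarded after pre-training and the encoder output h is used for downstream tasks.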
The SimCLR framework
• Maximize agreement using a contrastive task:
• Given a set {x_k} in which two examples x_i and x_j form a positive pair, identify x_j
among {x_k}_{k≠i} for a given x_i
• Loss function
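SimCLR's loss is the normalized temperature-scaled cross-entropy (NT-Xent) loss. For a positive pair (i, j) among the 2N augmented views, with z = g(f(x)), τ the temperature, and sim the cosine similarity, it can be written as:

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

The total loss is the average of ℓ over all positive pairs (i, j) and (j, i) in the batch.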
