
CSE 465

Lecture 17
Advanced Neural Network Architectures
Multi-Modal Models: CLIP
• CLIP, short for Contrastive Language-Image Pre-training, is a deep learning
model developed by OpenAI
• CLIP’s embeddings for images and text share the same space
• Therefore, it is possible to directly compare embeddings across the two modalities
• This is accomplished by training the model to bring related images and
texts closer together while pushing unrelated ones apart.
CLIP
• CLIP can be used for image classification tasks by associating images with
natural language descriptions
• It allows for more versatile and flexible image retrieval systems where users
can search for images using textual queries
• CLIP can be used to moderate content on online platforms by analyzing
images and accompanying text to identify and filter out inappropriate or
harmful content
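For example, zero-shot image classification with a pretrained CLIP model can be sketched as follows, assuming the Hugging Face transformers library is available; the checkpoint name, image path, and candidate labels are illustrative choices, not part of the lecture.

```python
# Zero-shot classification sketch with a pretrained CLIP model
# (illustrative checkpoint, image path, and labels).
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("example.jpg")  # any RGB image
labels = ["a photo of a cat", "a photo of a dog", "a photo of a car"]

inputs = processor(text=labels, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds the scaled image-text similarity scores
probs = outputs.logits_per_image.softmax(dim=-1)
print({label: float(p) for label, p in zip(labels, probs[0])})
```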
How does CLIP work
• Given a batch of N (image, text) pairs, CLIP is trained to predict which of the
N × N possible (image, text) pairings within the batch actually occurred
• To achieve this, CLIP establishes a multi-modal embedding space through
the joint training of an image encoder and text encoder
• The CLIP loss aims to maximize the cosine similarity between the image and
text embeddings for the N genuine pairs in the batch while minimizing the
cosine similarity for the N² − N incorrect pairings
• The optimization process involves using a symmetric cross-entropy loss
function that operates on these similarity scores
Architecture
• CLIP uses two separate architectures as backbones for encoding the vision and
text inputs:
• Image encoder: Represents the neural network architecture (e.g., ResNet or Vision
Transformer) responsible for encoding images.
• Text encoder: Represents the neural network architecture (e.g., CBOW, BERT, or Text
Transformer) responsible for encoding textual information.
• The original CLIP model was trained from scratch, without initializing the
image encoder or the text encoder with pre-trained weights, since it was
trained on a very large dataset (400 million image–text pairs)
Input
• The model takes a batch of n pairs of images and texts as input where:
• I[n, h, w, c]: Represents a minibatch of aligned images
• Here n is the batch size, h is the image height, w is the image width, and c is the number of
channels.
• T[n, l]: Represents a minibatch of aligned texts
• Here n is the batch size, and l is the length of the textual sequence.
• Feature extraction
• Image Encoder: Extracts feature representations from the input images.
• The shape of the feature set is [n, d_i],
• Here d_i is the dimensionality of the image features.
• Text Encoder: Extracts feature representations from the input texts.
• The shape of the text feature set is [n, d_t]
• Here d_t is the dimensionality of the text features.

Learned Projections
• CLIP learns a projection matrix that maps image features into the embedding space
• The resulting embedding is the image's representation in the shared space
• The shape of the matrix is [d_i, d_e]
• Here, d_e is the desired dimensionality of the joint embedding space.
• Similarly, a projection matrix is learned for the text representation
• The shape of this matrix is [d_t, d_e]
• Again, the matrix projects the text representation into the d_e-dimensional space
• The projection operation can be implemented with two linear layers (one per
modality), whose weights are the learned projection matrices
• When adapting CLIP to new datasets, these projection weights are often the only
weights kept trainable (i.e., with active gradients).
• Additionally, the projection layer plays a crucial role in aligning the dimensions of
image and text embeddings, ensuring that they have the same size.
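A minimal sketch of this projection step, assuming PyTorch; the batch size and the dimensions d_i, d_t, and d_e below are illustrative placeholders.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

n, d_i, d_t, d_e = 32, 2048, 512, 512            # illustrative sizes

image_features = torch.randn(n, d_i)             # [n, d_i] from the image encoder
text_features = torch.randn(n, d_t)              # [n, d_t] from the text encoder

# One learned linear projection per modality; the weights play the role of the
# [d_i, d_e] and [d_t, d_e] projection matrices described above.
image_proj = nn.Linear(d_i, d_e, bias=False)
text_proj = nn.Linear(d_t, d_e, bias=False)

# Project both modalities into the joint d_e-dimensional space and L2-normalize,
# so that dot products between embeddings are cosine similarities.
image_emb = F.normalize(image_proj(image_features), dim=-1)   # [n, d_e]
text_emb = F.normalize(text_proj(text_features), dim=-1)      # [n, d_e]
```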
Symmetric Loss Function
• CLIP uses contrastive loss to bring related images and texts closer together
while pushing unrelated ones apart.
• labels = arange(n): Generates labels representing the indices of the batch, since the
i-th image matches the i-th text.
• loss_i = cross_entropy_loss(logits, labels, axis=0): Computes the cross-entropy loss
along the image axis.
• loss_t = cross_entropy_loss(logits, labels, axis=1): Computes the cross-entropy loss
along the text axis.
• loss = (loss_i + loss_t)/2: Computes the symmetric average of the image and text
losses.
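Collecting the steps above, here is a small runnable sketch of the symmetric loss in PyTorch. This is not the authors' implementation; in particular, the fixed temperature value stands in for CLIP's learned temperature parameter.

```python
import torch
import torch.nn.functional as F

def clip_symmetric_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of matched (image, text) embeddings.

    image_emb, text_emb: [n, d_e] tensors already projected into the joint space.
    temperature: fixed stand-in for CLIP's learned temperature parameter.
    """
    # L2-normalize so the dot products below are cosine similarities
    image_emb = F.normalize(image_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    # [n, n] matrix of scaled pairwise cosine similarities
    logits = image_emb @ text_emb.t() / temperature

    # The i-th image matches the i-th text, so the targets are the batch indices
    labels = torch.arange(image_emb.size(0), device=image_emb.device)

    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

# Usage with random embeddings, just to show the expected shapes
img = torch.randn(8, 512)
txt = torch.randn(8, 512)
print(clip_symmetric_loss(img, txt))
```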
Contrastive Loss
• The main idea behind contrastive pre-training is to teach the model to
differentiate between “positive” and “negative” pairs. In the context of
CLIP:
• Positive Pairs: Pairs of images and texts that are truly related or semantically
meaningful; after training, these pairs should have high cosine similarity. For
example, an image of a cat paired with a matching description like “a cute cat.”
• Negative Pairs: Pairs where the image and text do not match. For example, an
image of a cat paired with an unrelated description like “a sunny day at the beach.”
• The cosine similarity score provides a measure of how well the input image
and text match in the shared embedding space.
• Higher cosine similarity scores indicate stronger semantic alignment, suggesting that
the text and image are more closely related or that the image corresponds to the
textual description.
• Lower scores suggest weaker alignment or a mismatch.
Contrastive Loss Function
• A contrastive loss is low when the two correlated vectors (the embeddings of a
positive pair) are close to each other in the embedding space, and high otherwise.
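One way to write this loss, consistent with the symmetric cross-entropy formulation on the earlier slides, is sketched below; here sim denotes cosine similarity, τ the temperature, I_i and T_i the image and text embeddings of the i-th pair, and N the batch size.

```latex
\mathcal{L}^{\text{img}\to\text{txt}}_{i}
  = -\log \frac{\exp\!\big(\mathrm{sim}(I_i, T_i)/\tau\big)}
               {\sum_{j=1}^{N} \exp\!\big(\mathrm{sim}(I_i, T_j)/\tau\big)},
\qquad
\mathcal{L}
  = \frac{1}{2N} \sum_{i=1}^{N}
    \Big( \mathcal{L}^{\text{img}\to\text{txt}}_{i} + \mathcal{L}^{\text{txt}\to\text{img}}_{i} \Big)
```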
Unsupervised Image Embedding
General Visual Representation Learning
• Done from unlabeled image dataset (unsupervised)
• After unsupervised learning, the learned model and image representations
can be used for downstream applications
• Generative modeling
• Generate or otherwise model pixels in the input space
• Pixel-level generation is computationally expensive
• Generating high-fidelity images may not be necessary for representation learning
• Discriminative modeling
• Train networks to perform pretext tasks where inputs and labels are derived from an
unlabeled dataset
• Heuristic-based pretext tasks: rotation prediction, relative patch location prediction,
colorization, solving jigsaw puzzles
• Many heuristics seem ad-hoc and may be limiting
Self-supervised learning
• Autoencoders are a traditional self-supervised learning approach that trains a
model to reconstruct the input from a learned low-dimensional
representation. Variants of autoencoders include denoising autoencoders,
where noise is added to the input during training (a minimal sketch follows this list)
• Data augmentation is a widely used technique in computer vision research,
with numerous techniques such as flipping, cropping, and coloring. Image
overlay (also known as image mixture) is also a type of augmentation
method, but has been less explored compared to others
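A minimal denoising-autoencoder sketch, assuming PyTorch; the layer sizes, noise level, and flattened 28×28 inputs are illustrative choices.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

# Encoder compresses the input to a low-dimensional code; decoder reconstructs it.
encoder = nn.Sequential(nn.Linear(784, 64), nn.ReLU())
decoder = nn.Sequential(nn.Linear(64, 784), nn.Sigmoid())

x = torch.rand(16, 784)                    # a batch of flattened "images"
x_noisy = x + 0.3 * torch.randn_like(x)    # denoising variant: corrupt the input

reconstruction = decoder(encoder(x_noisy))
loss = F.mse_loss(reconstruction, x)       # reconstruct the clean input
loss.backward()
```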
SimCLR Method
• Maximize the agreement of representations under data transformation,
using a contrastive loss in the latent/feature space
• A framework for contrastive representation learning.
• Two separate stochastic data augmentations t, t’ ∼ T are applied to each
example to obtain two correlated views
• A base encoder network f(.) and a projection head g(.) are trained to
maximize agreement between the latent representations
Data augmentation
• SimCLR uses random crop and color distortion for augmentation
• Transforms a given image randomly in two ways, yielding two correlated
views of the same example
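A minimal sketch of such an augmentation pipeline using torchvision; the crop size, jitter strengths, and probabilities are illustrative choices rather than the exact values used in the lecture.

```python
from PIL import Image
import torchvision.transforms as T

# Random crop + color distortion, applied twice to get two correlated views
simclr_augment = T.Compose([
    T.RandomResizedCrop(224),                                   # random crop, resized to 224x224
    T.RandomApply([T.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),  # color distortion
    T.RandomGrayscale(p=0.2),
    T.ToTensor(),
])

img = Image.new("RGB", (256, 256))   # placeholder image
view_1 = simclr_augment(img)         # two independent draws of the stochastic
view_2 = simclr_augment(img)         # transform give the correlated views x_i, x_j
```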
Base Encoder
• f(x) is the base encoder network that computes the internal representation h
• SimCLR uses a ResNet; however, it is possible to use other networks
• This is the model whose weights we are training for the eventual
downstream task.
Projection Head
• g(h) is a projection network that maps the representation h to a latent space
• SimCLR uses a 2-layer MLP
• The projection head maps the representation vectors to a smaller dimension
before the loss function is applied
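A minimal sketch of such a 2-layer MLP projection head, assuming PyTorch; the 2048-dimensional input (e.g., a ResNet-50 feature) and 128-dimensional output are illustrative sizes.

```python
import torch.nn as nn

# g(h): maps the encoder output h to a smaller latent vector z,
# on which the contrastive loss is computed.
projection_head = nn.Sequential(
    nn.Linear(2048, 2048),
    nn.ReLU(),
    nn.Linear(2048, 128),
)
```

In the SimCLR setup, the projection head is discarded after pre-training and the encoder output h is used for downstream tasks.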
The SimCLR framework
• Maximize agreement using a contrastive task:
• Given a set {x_k} in which two examples x_i and x_j form a positive pair, identify x_j
among {x_k}_{k≠i} for a given x_i
• Loss function
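SimCLR's loss is the normalized temperature-scaled cross-entropy (NT-Xent) loss. For a positive pair (i, j) among the 2N augmented views, with z = g(f(x)), τ the temperature, and sim the cosine similarity, it can be written as:

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

The total loss is the average of ℓ over all positive pairs (i, j) and (j, i) in the batch.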
