Text-image embeddings with OpenAI's CLIP
James Briggs
Aug 11
Learn how to translate between text and images, and back again, with CLIP and vector embeddings
After a few short years of life, children can fathom the concepts behind simple words and
connect them to related images. They can identify the connection between shapes and
textures of the physical world to the abstract symbols of written language.
It’s something we take for granted. Very few (if any) people in the world will remember a
time when these “basic” skills were beyond their capacity.
Computers are different. They can calculate the parameters a rocket needs to traverse
the solar system. But if you ask a computer to find an image of “a dog in the park”, you’re
better off asking NASA for a free ticket to the space station.
At least, that was the case until recently.
In this article, we’re going to take a look at OpenAI’s CLIP, a “multi-modal” model capable
of understanding the relationships and concepts shared between text and images. As we’ll
see, CLIP is more than a fancy parlor trick. It is shockingly capable.
Contrastive Learning?
Contrastive Language-Image Pretraining (CLIP) consists of two models trained in
parallel. A Vision Transformer (ViT) or ResNet model for image embeddings and a
transformer model for language embeddings.
During training, (image, text) pairs are fed into the respective models, and both output a
512-dimensional vector embedding that represents the respective image/text in vector
space.
The contrastive component takes these two vector embeddings and calculates the model
loss from the difference (i.e., the contrast) between them. Both models are then optimized
to minimize this difference for matching (image, text) pairs while keeping mismatched
pairs apart, and therefore learn how to embed similar (image, text) pairs into a similar
vector space.
After this contrastive pretraining process, we are left with CLIP, a multi-modal model
capable of understanding both language and images via a shared vector space.
Using CLIP
OpenAI developed and released the official clip library, which can be found on GitHub.
However, Hugging Face’s transformers library hosts another implementation of CLIP
(also built by OpenAI) that is more commonly used.
The Hugging Face implementation does not use ResNet for image encoding. It uses the
alternative setup of a ViT model paired with the text transformer. We will learn how to use
this implementation by stringing together a simple text-image search script that can be
adapted for image-image, text-text, and image-text modalities.
We will use the “imagenette” dataset, a collection of ~10K images hosted by Hugging
Face.
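A minimal sketch of loading it with the datasets library (the "frgfm/imagenette" repo name and the "full_size" config are assumptions; any imagenette copy on the hub that exposes an 'image' feature will work):

```python
from datasets import load_dataset

# load the imagenette images from the Hugging Face hub
# ("frgfm/imagenette" and "full_size" are assumed; swap in whichever
# imagenette dataset repo you prefer)
imagenette = load_dataset(
    "frgfm/imagenette",
    "full_size",
    split="train"
)
print(imagenette)
```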
That gives us 9469 images ranging from radios to dogs. All of these images are stored in
the 'image' feature as PIL image objects.
Now we initialize CLIP via the transformers library like so:
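A minimal sketch of that setup (the openai/clip-vit-base-patch32 checkpoint is an assumption; any CLIP checkpoint on the hub will work):

```python
import torch
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

# use the fastest hardware available: MPS on Apple silicon, then CUDA, then CPU
device = "mps" if torch.backends.mps.is_available() else \
    "cuda" if torch.cuda.is_available() else "cpu"

# assumed checkpoint name; any CLIP checkpoint works here
model_id = "openai/clip-vit-base-patch32"

# tokenizer for text, processor for images, and the CLIP model itself
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to(device)
```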
The whole device part is setting up our instance to use the fastest hardware
available to us (MPS on M1 chips, CUDA on other GPUs, and CPU as a fallback).
We set the model_id, the name of the CLIP model we want to load.
Then we initialize a tokenizer for preprocessing text, a processor for
preprocessing images, and the CLIP model for producing vector embeddings.
We tokenize the phrase “a dog in the snow” with the tokenizer, then feed the resulting tokens into the model using the get_text_features method.
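A sketch of those two steps, reusing the tokenizer and model initialized above:

```python
# tokenize the text query and move the tensors to our device
text = "a dog in the snow"
inputs = tokenizer(text, return_tensors="pt").to(device)

# create the text embedding
text_emb = model.get_text_features(**inputs)
print(text_emb.shape)  # torch.Size([1, 512])
```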
Here we have a 512-dimensional vector representing the semantic meaning of the phrase
“a dog in the snow”. This is one-half of our text-image search.
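The image side works the same way: we run one of the dataset images through the processor and take a quick look at the result. A sketch (index 0 is an assumption; in the walkthrough the example image is a Sony radio):

```python
import matplotlib.pyplot as plt

# preprocess a single PIL image into the tensor format CLIP expects
image = processor(
    images=imagenette[0]["image"],
    return_tensors="pt"
)["pixel_values"].to(device)

# plot the processed tensor; values are no longer 0-255 RGB,
# so the colors will look off
plt.imshow(image.squeeze(0).permute(1, 2, 0).cpu().numpy())
plt.show()
```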
We can still visualize the processed image. It has been resized, and the pixel “activation”
values are no longer within the typical RGB range of 0–255 that Matplotlib can read, so
colors are not displayed correctly. Nonetheless, we can see that it is the same Sony radio
that we saw before.
Next, we process these inputs with CLIP, this time using the get_image_features
method.
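A sketch of that step:

```python
# create the image embedding
image_emb = model.get_image_features(pixel_values=image)
print(image_emb.shape)  # torch.Size([1, 512])
```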
And with that, we built vector embeddings for text and image with CLIP. With these
embeddings, we can compare their similarity using metrics like Euclidean distance,
cosine similarity, or dot product similarity.
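As a quick sanity check, here is a sketch comparing the two embeddings we just built with PyTorch’s built-in cosine similarity:

```python
import torch.nn.functional as F

# cosine similarity between the text and image embeddings
score = F.cosine_similarity(text_emb, image_emb)
print(score)  # a single similarity value
```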
However, we can’t compare much with just a single example of each, so let’s move on
and test this on a larger sample of images.
We will take 100 images at random from the imagenette data. To do this, we start by
selecting 100 index positions at random and use them to build a list of images.
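A sketch of that sampling step with NumPy (the random seed is an assumption, added only for reproducibility):

```python
import numpy as np

np.random.seed(0)  # assumed seed, for reproducibility
# pick 100 random index positions and pull out the matching PIL images
sample_idx = np.random.randint(0, len(imagenette), 100).tolist()
images = [imagenette[i]["image"] for i in sample_idx]
```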
Now we iterate through these 100 images and create image embeddings with CLIP. We
will add them all to a NumPy array called image_arr.
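A minimal sketch of that loop (images are embedded one at a time; batching is left out for clarity):

```python
import numpy as np

image_arr = None

for img in images:
    # preprocess and embed each image with CLIP
    batch = processor(images=img, return_tensors="pt")["pixel_values"].to(device)
    emb = model.get_image_features(pixel_values=batch)
    emb = emb.squeeze(0).detach().cpu().numpy()
    # stack the embeddings into a single (100, 512) array
    if image_arr is None:
        image_arr = emb[np.newaxis, :]
    else:
        image_arr = np.vstack([image_arr, emb])

print(image_arr.shape)                   # (100, 512)
print(image_arr.min(), image_arr.max())  # min and max values across the array
```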
Printing the minimum and maximum values in our image embeddings, we can see -7.99
and +3.15 respectively. We will be using dot product similarity to compare our vectors,
and for that comparison to be accurate, we need to normalize them first. We do that like so:
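A sketch, dividing each vector by its L2 norm:

```python
import numpy as np

# L2-normalize each image embedding so that dot product behaves
# like cosine similarity
image_arr = image_arr / np.linalg.norm(image_arr, axis=1, keepdims=True)
print(image_arr.min(), image_arr.max())  # values now fall within [-1, 1]
```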
Text-Image Search
As mentioned, we will be using dot product to compare vectors. The text embedding will
act as a “query” with which we will search for the most similar image embeddings.
We start by calculating the dot product similarity between our query and the images:
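A sketch, reusing the text embedding from earlier:

```python
import numpy as np

# move the text embedding to NumPy and score it against every image
text_emb_np = text_emb.detach().cpu().numpy()
scores = np.dot(text_emb_np, image_arr.T)
print(scores.shape)  # (1, 100), one score per image
```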
This gives us 100 scores, i.e., one score for our text embedding against each of the 100
image embeddings. All we do now is sort these scores in descending order and return
the respective top-scoring images.
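A sketch of the ranking step, pulling back the top five matches (five is an arbitrary choice):

```python
import numpy as np

# sort scores in descending order and keep the top five matches
top_k = 5
idx = np.argsort(-scores[0])[:top_k]

for i in idx:
    print(f"score: {scores[0][i]:.3f}")
    # images[i] is the matching PIL image, e.g. display it with images[i].show()
```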
At position #1, we have a dog in the snow, a great result! This is very likely the only image
of a dog in the snow from our sample of 100 images. That is how we perform a text-
image search using CLIP.
CLIP is an amazing model that can be applied across the language-image domains in
any order or combination. We can perform text-text, image-image, and image-text
searches using the same methodology.
In fact, we can do all of those simultaneously by simply adding both image and text
vectors to a single store and then querying with either image or text.
Our approach is great if you’re searching a small number of items. However, it becomes
slow, or even impossible, when we begin searching through more records. To scale to
millions or even billions of records, we need a vector database.
If you’re interested in learning more about multi-modal models, NLP, or vector search,
check out my YouTube channel, reach out on Discord, or follow along with one of my free
courses (links below).