Text-image embeddings with OpenAI's CLIP

towardsdatascience.com/quick-fire-guide-to-multi-modal-ml-with-openais-clip-2dad7e398ac0

James Briggs · Aug 11, 2022 · 5 min read

Quick-fire Guide to Multi-Modal ML With OpenAI’s CLIP

Learn how to translate from text to image and back again with
CLIP and vector embeddings

[Header image modified by the author]

After a few short years of life, children can fathom the concepts behind simple words and
connect them to related images. They can identify the connection between the shapes and
textures of the physical world and the abstract symbols of written language.

It’s something we take for granted. Very few (if any) people in the world will remember a
time when these “basic” skills were beyond their capacity.

Computers are different. They can calculate the parameters a rocket needs to traverse
the solar system. But if you ask a computer to find an image of “a dog in the park”, you’re
better off asking NASA for a free ticket to the space station.

At least, that was the case until recently.

In this article, we’re going to take a look at OpenAI’s CLIP, a “multi-modal” model capable
of understanding the relationships and concepts shared between text and images. As we’ll
see, CLIP is more than a fancy parlor trick. It is shockingly capable.

Contrastive Learning?
Contrastive Language-Image Pretraining (CLIP) consists of two models trained in
parallel: a Vision Transformer (ViT) or ResNet model for image embeddings, and a
transformer model for language embeddings.

During training, (image, text) pairs are fed into the respective models, and both output a
512-dimensional vector embedding that represents the respective image/text in vector
space.

The contrastive component takes these two vector embeddings and calculates the model
loss from the contrast between them: matching (image, text) pairs are pulled together,
while non-matching pairs in the same training batch are pushed apart. Both models are
optimized against this objective and therefore learn to embed related images and text
into nearby regions of vector space.

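To make this concrete, here is a minimal sketch of a CLIP-style contrastive loss in PyTorch. It is illustrative only; the function name, temperature value, and batch handling are assumptions rather than the exact training code used for CLIP.

import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds, text_embeds, temperature=0.07):
    # Normalize so that dot products become cosine similarities
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)

    # Pairwise similarity matrix: entry [i, j] compares image i with text j
    logits = (image_embeds @ text_embeds.T) / temperature

    # Matching pairs sit on the diagonal; everything else acts as a negative
    targets = torch.arange(logits.shape[0], device=logits.device)

    # Symmetric cross-entropy over image->text and text->image directions
    loss_i = F.cross_entropy(logits, targets)
    loss_t = F.cross_entropy(logits.T, targets)
    return (loss_i + loss_t) / 2
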
After this contrastive pretraining process, we are left with CLIP, a multi-modal model
capable of understanding both language and images via a shared vector space.

Using CLIP
OpenAI developed and released the official clip library, which can be found on GitHub.
However, Hugging Face’s transformers library hosts another implementation of CLIP
(also built by OpenAI) that is more commonly used.

The Hugging Face implementation does not use ResNet for image encoding. It uses the
alternative setup of a ViT model paired with the text transformer. We will learn how to use
this implementation by stringing together a simple text-image search script that can be
adapted for image-image, text-text, and image-text modalities.

Loading Data and CLIP


To begin, we will install the libraries needed for our demo, download the dataset, and
initialize CLIP.

pip install -U torch datasets transformers

We will use the “imagenette” dataset, a collection of ~10K images hosted by Hugging
Face.

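As a minimal sketch of loading the data with the datasets library (the dataset identifier and config are assumptions; Imagenette is mirrored on the Hugging Face Hub under a few names):

from datasets import load_dataset

# Load the full-size Imagenette images (Hub identifier assumed)
imagenette = load_dataset("frgfm/imagenette", "full_size", split="train")
print(imagenette)
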
That gives us 9469 images ranging from radios to dogs. All of these images are stored in
the 'image' feature as PIL image objects.

Now we initialize CLIP via the transformers library like so:

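Something along these lines, assuming the openai/clip-vit-base-patch32 checkpoint:

import torch
from transformers import CLIPModel, CLIPProcessor, CLIPTokenizerFast

# Pick the fastest available hardware: MPS on Apple silicon, CUDA on NVIDIA GPUs, else CPU
device = (
    "mps" if torch.backends.mps.is_available()
    else "cuda" if torch.cuda.is_available()
    else "cpu"
)

model_id = "openai/clip-vit-base-patch32"

tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to(device)
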
A few things are happening here:

- The device logic sets up our instance to use the fastest hardware available to us (MPS on Apple silicon, CUDA on NVIDIA GPUs, otherwise CPU).
- We set the model_id, the name of the CLIP checkpoint to load.
- Then we initialize a tokenizer for preprocessing text, a processor for preprocessing images, and the CLIP model itself for producing vector embeddings.

Now we’re ready to begin creating text and image embeddings.

Create Text Embeddings


The text transformer model handles the encoding of our text into meaningful vector
embeddings. To do this, we first tokenize the text to translate it from human-readable text
to transformer-readable tokens.

Then we feed these tokens into the model using the get_text_features method.

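A sketch of those two steps, reusing the tokenizer and model initialized above:

# Tokenize the query and move the tensors onto the same device as the model
inputs = tokenizer("a dog in the snow", return_tensors="pt").to(device)

# Encode the tokens into a single 512-dimensional text embedding
text_emb = model.get_text_features(**inputs)
print(text_emb.shape)  # torch.Size([1, 512])
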
Here we have a 512-dimensional vector representing the semantic meaning of the phrase
“a dog in the snow”. This is one-half of our text-image search.

Create Image Embeddings


The next step is creating image embeddings. Again, this is very straightforward. We swap
the tokenizer for a processor which will give us a resized image tensor called
pixel_values.

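For example, taking one image from the dataset (the index used here is an arbitrary choice):

# Preprocess a single PIL image into a resized, normalized tensor for CLIP
image = imagenette[0]["image"]
inputs = processor(images=image, return_tensors="pt").to(device)
pixel_values = inputs["pixel_values"]
print(pixel_values.shape)  # torch.Size([1, 3, 224, 224])
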
We can still visualize the processed image. It has been resized, and the pixel values have
been normalized outside the 0–255 (or 0.0–1.0) range that Matplotlib expects, so the
colors are not displayed correctly. Nonetheless, we can see that it is the same Sony radio
that we saw before.

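A rough way to eyeball that tensor (assuming Matplotlib is installed):

import matplotlib.pyplot as plt

# Drop the batch dimension and move channels last before plotting;
# the normalized values fall outside Matplotlib's expected range, so colors look off
plt.imshow(pixel_values.squeeze(0).permute(1, 2, 0).cpu())
plt.show()
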
Next, we process these inputs with CLIP, this time using the get_image_features
method.

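Again a sketch, using the pixel_values tensor from above:

# Encode the preprocessed image into a 512-dimensional image embedding
image_emb = model.get_image_features(pixel_values=pixel_values)
print(image_emb.shape)  # torch.Size([1, 512])
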
And with that, we have built vector embeddings for both text and images with CLIP. With these
embeddings, we can measure their similarity using metrics like Euclidean distance,
cosine similarity, or dot product similarity.

However, we can’t compare much with just a single example of each, so let’s move on
and test this on a larger sample of images.

We will take 100 images at random from the imagenette data. To do this, we start by
selecting 100 index positions at random and use them to build a list of images.

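A sketch of the sampling step (the random seed is an arbitrary choice for reproducibility):

import numpy as np

np.random.seed(0)

# Pick 100 random index positions and pull out the corresponding PIL images
sample_idx = np.random.randint(0, len(imagenette), 100).tolist()
images = [imagenette[i]["image"] for i in sample_idx]
print(len(images))  # 100
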
Now we iterate through these 100 images and create image embeddings with CLIP, adding
them all to a NumPy array called image_arr.

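Roughly like this, batching the images to keep memory usage down (the batch size is an assumption):

batch_size = 16
image_arr = None

for i in range(0, len(images), batch_size):
    batch = images[i:i + batch_size]
    # Preprocess the batch and create image embeddings with CLIP
    inputs = processor(images=batch, return_tensors="pt").to(device)
    with torch.no_grad():
        batch_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
    batch_emb = batch_emb.cpu().numpy()
    # Stack everything into a single (100, 512) NumPy array
    image_arr = batch_emb if image_arr is None else np.concatenate([image_arr, batch_emb])

print(image_arr.shape)  # (100, 512)
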
Checking the array, we can see that the minimum and maximum values in our image
embeddings are -7.99 and +3.15 respectively. We will be using dot product similarity
to compare our vectors, and because the dot product is sensitive to vector magnitude, we
first need to normalize the embeddings. We do that like so:

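A minimal version of the normalization:

# Divide each embedding by its L2 norm; dot products then behave like cosine similarity
image_arr = image_arr / np.linalg.norm(image_arr, axis=1, keepdims=True)
print(image_arr.min(), image_arr.max())  # values now fall roughly within [-1, 1]
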
Now we’re ready to compare and search through our vectors.

Text-Image Search
As mentioned, we will be using dot product to compare vectors. The text embedding will
act as a “query” with which we will search for the most similar image embeddings.

We start by calculating the dot product similarity between our query and the images:

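A sketch of the scoring step, reusing the text_emb tensor from earlier:

# Convert the text embedding to NumPy and score it against every image embedding
text_arr = text_emb.detach().cpu().numpy()
scores = np.dot(text_arr, image_arr.T)
print(scores.shape)  # (1, 100)
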
This gives us 100 scores, i.e., one similarity score between the text embedding and each
image embedding. All we need to do now is sort these scores in descending order and return
the respective top-scoring images.

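One way to do that (the top_k value is an arbitrary choice):

# Sort indices from most to least similar and keep the top results
top_k = 5
idx = np.argsort(-scores[0])[:top_k]

for i in idx:
    print(f"score: {scores[0][i]:.3f}")
    images[i].show()  # or display inline with plt.imshow(images[i]) in a notebook
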
At position #1, we have a dog in the snow, a great result! This is very likely the only image
of a dog in the snow from our sample of 100 images. That is how we perform a text-
image search using CLIP.

CLIP is an amazing model that can be applied across the language-image domains in
any order or combination. We can perform text-text, image-image, and image-text
searches using the same methodology.

In fact, we can do all of those simultaneously by simply adding both image and text
vectors to a single store and then querying with either image or text.

Our approach works well when searching over a small number of items. However, it becomes
slow or even impossible as the number of records grows. To scale beyond that, we need a
vector database, which allows us to search across millions or even billions of records.

If you’re interested in learning more about multi-modal models, NLP, or vector search,
check out my YouTube channel, reach out on Discord, or follow along with one of my free
courses (links below).

Thanks for reading!

Natural Language Processing (NLP) for Semantic Search | Pinecone
www.pinecone.io

Embedding Methods for Image Search | Pinecone
www.pinecone.io
