Text-image embeddings with OpenAI's CLIP
James Briggs
Aug 11
Learn how to translate between text and images, and back again, with CLIP and vector embeddings
After a few short years of life, children can fathom the concepts behind simple words and
connect them to related images. They can identify the connection between shapes and
textures of the physical world to the abstract symbols of written language.
It’s something we take for granted. Very few (if any) people in the world will remember a
time when these “basic” skills were beyond their capacity.
Computers are different. They can calculate the parameters a rocket needs to traverse
the solar system. But if you ask a computer to find an image of “a dog in the park”, you’re
better off asking NASA for a free ticket to the space station.
At least, that was the case until recently.
In this article, we’re going to take a look at OpenAI’s CLIP, a “multi-modal” model capable
of understanding the relationships and concepts shared between text and images. As we’ll
see, CLIP is more than a fancy parlor trick. It is shockingly capable.
Contrastive Learning?
Contrastive Language-Image Pretraining (CLIP) consists of two models trained in
parallel. A Vision Transformer (ViT) or ResNet model for image embeddings and a
transformer model for language embeddings.
During training, (image, text) pairs are fed into the respective models, and both output a
512-dimensional vector embedding that represents the respective image/text in vector
space.
The contrastive component takes these two vector embeddings and calculates the model
loss from the difference (i.e., the contrast) between them. Both models are then optimized
to minimize this difference for matching (image, text) pairs while keeping mismatched
pairs apart, and therefore learn how to embed similar (image, text) pairs into a similar
vector space.
After this contrastive pretraining process, we are left with CLIP, a multi-modal model
capable of understanding both language and images via a shared vector space.
Using CLIP
OpenAI developed and released the official clip library, which can be found on GitHub.
However, Hugging Face’s transformers library hosts another implementation of CLIP
(also built by OpenAI) that is more commonly used.
The Hugging Face implementation does not use ResNet for image encoding. It uses the
alternative setup of a ViT model paired with the text transformer. We will learn how to use
this implementation by stringing together a simple text-image search script that can be
adapted for image-image, text-text, and image-text modalities.
We will use the “imagenette” dataset, a collection of ~10K images hosted by Hugging
Face.
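A minimal sketch of loading it with the datasets library (the "frgfm/imagenette" repo name and the "full_size" config are assumptions; any imagenette copy on the hub that exposes an 'image' feature will work):

```python
from datasets import load_dataset

# load the imagenette images from the Hugging Face hub
# ("frgfm/imagenette" and "full_size" are assumed; swap in whichever
# imagenette dataset repo you prefer)
imagenette = load_dataset(
    "frgfm/imagenette",
    "full_size",
    split="train"
)
print(imagenette)
```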
That gives us 9469 images ranging from radios to dogs. All of these images are stored in
the 'image' feature as PIL image objects.
Now we initialize CLIP via the transformers library like so:
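A minimal sketch of that setup (the openai/clip-vit-base-patch32 checkpoint is an assumption; any CLIP checkpoint on the hub will work):

```python
import torch
from transformers import CLIPTokenizerFast, CLIPProcessor, CLIPModel

# use the fastest hardware available: MPS on Apple silicon, then CUDA, then CPU
device = "mps" if torch.backends.mps.is_available() else \
    "cuda" if torch.cuda.is_available() else "cpu"

# assumed checkpoint name; any CLIP checkpoint works here
model_id = "openai/clip-vit-base-patch32"

# tokenizer for text, processor for images, and the CLIP model itself
tokenizer = CLIPTokenizerFast.from_pretrained(model_id)
processor = CLIPProcessor.from_pretrained(model_id)
model = CLIPModel.from_pretrained(model_id).to(device)
```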
The whole device part is setting up our instance to use the fastest hardware
available to us (MPS on M1 chips, CUDA on other GPUs, and CPU as a fallback).
We set the model_id, the name of the CLIP model we want to load.
Then we initialize a tokenizer for preprocessing text, a processor for
preprocessing images, and the CLIP model for producing vector embeddings.
We tokenize the phrase “a dog in the snow” with the tokenizer, then feed the resulting tokens into the model using the get_text_features method.
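A sketch of those two steps, reusing the tokenizer and model initialized above:

```python
# tokenize the text query and move the tensors to our device
text = "a dog in the snow"
inputs = tokenizer(text, return_tensors="pt").to(device)

# create the text embedding
text_emb = model.get_text_features(**inputs)
print(text_emb.shape)  # torch.Size([1, 512])
```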
Here we have a 512-dimensional vector representing the semantic meaning of the phrase
“a dog in the snow”. This is one-half of our text-image search.
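The image side works the same way: we run one of the dataset images through the processor and take a quick look at the result. A sketch (index 0 is an assumption; in the walkthrough the example image is a Sony radio):

```python
import matplotlib.pyplot as plt

# preprocess a single PIL image into the tensor format CLIP expects
image = processor(
    images=imagenette[0]["image"],
    return_tensors="pt"
)["pixel_values"].to(device)

# plot the processed tensor; values are no longer 0-255 RGB,
# so the colors will look off
plt.imshow(image.squeeze(0).permute(1, 2, 0).cpu().numpy())
plt.show()
```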
We can still visualize the processed image. It has been resized, and the pixel “activation”
values are no longer within the typical RGB range of 0–255 that Matplotlib can read, so
colors are not displayed correctly. Nonetheless, we can see that it is the same Sony radio
that we saw before.
Next, we process these inputs with CLIP, this time using the get_image_features
method.
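A sketch of that step:

```python
# create the image embedding
image_emb = model.get_image_features(pixel_values=image)
print(image_emb.shape)  # torch.Size([1, 512])
```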
And with that, we built vector embeddings for text and image with CLIP. With these
embeddings, we can compare their similarity using metrics like Euclidean distance,
cosine similarity, or dot product similarity.
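As a quick sanity check, here is a sketch comparing the two embeddings we just built with PyTorch’s built-in cosine similarity:

```python
import torch.nn.functional as F

# cosine similarity between the text and image embeddings
score = F.cosine_similarity(text_emb, image_emb)
print(score)  # a single similarity value
```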
However, we can’t compare much with just a single example of each, so let’s move on
and test this on a larger sample of images.
We will take 100 images at random from the imagenette data. To do this, we start by
selecting 100 index positions at random and use them to build a list of images.
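A sketch of that sampling step with NumPy (the random seed is an assumption, added only for reproducibility):

```python
import numpy as np

np.random.seed(0)  # assumed seed, for reproducibility
# pick 100 random index positions and pull out the matching PIL images
sample_idx = np.random.randint(0, len(imagenette), 100).tolist()
images = [imagenette[i]["image"] for i in sample_idx]
```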
Now we iterate through these 100 images and create image embeddings with CLIP. We
will add them all to a NumPy array called image_arr.
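A minimal sketch of that loop (images are embedded one at a time; batching is left out for clarity):

```python
import numpy as np

image_arr = None

for img in images:
    # preprocess and embed each image with CLIP
    batch = processor(images=img, return_tensors="pt")["pixel_values"].to(device)
    emb = model.get_image_features(pixel_values=batch)
    emb = emb.squeeze(0).detach().cpu().numpy()
    # stack the embeddings into a single (100, 512) array
    if image_arr is None:
        image_arr = emb[np.newaxis, :]
    else:
        image_arr = np.vstack([image_arr, emb])

print(image_arr.shape)                   # (100, 512)
print(image_arr.min(), image_arr.max())  # min and max values across the array
```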
Printing the minimum and maximum values in our image embeddings, we can see -7.99
and +3.15 respectively. We will be using dot product similarity to compare our vectors,
and for that comparison to be accurate, we need to normalize them first. We do that like so:
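A sketch, dividing each vector by its L2 norm:

```python
import numpy as np

# L2-normalize each image embedding so that dot product behaves
# like cosine similarity
image_arr = image_arr / np.linalg.norm(image_arr, axis=1, keepdims=True)
print(image_arr.min(), image_arr.max())  # values now fall within [-1, 1]
```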
Text-Image Search
As mentioned, we will be using dot product to compare vectors. The text embedding will
act as a “query” with which we will search for the most similar image embeddings.
We start by calculating the dot product similarity between our query and the images:
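A sketch, reusing the text embedding from earlier:

```python
import numpy as np

# move the text embedding to NumPy and score it against every image
text_emb_np = text_emb.detach().cpu().numpy()
scores = np.dot(text_emb_np, image_arr.T)
print(scores.shape)  # (1, 100), one score per image
```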
This gives us 100 scores, i.e., one score for our text embedding against each of the 100
image embeddings. All we do now is sort these scores in descending order and return
the respective top-scoring images.
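A sketch of the ranking step, pulling back the top five matches (five is an arbitrary choice):

```python
import numpy as np

# sort scores in descending order and keep the top five matches
top_k = 5
idx = np.argsort(-scores[0])[:top_k]

for i in idx:
    print(f"score: {scores[0][i]:.3f}")
    # images[i] is the matching PIL image, e.g. display it with images[i].show()
```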
At position #1, we have a dog in the snow, a great result! This is very likely the only image
of a dog in the snow from our sample of 100 images. That is how we perform a text-
image search using CLIP.
CLIP is an amazing model that can be applied across the language-image domains in
any order or combination. We can perform text-text, image-image, and image-text
searches using the same methodology.
In fact, we can do all of those simultaneously by simply adding both image and text
vectors to a single store and then querying with either image or text.
Our approach is great if you’re searching a small number of items. However, it becomes
slow, or even impossible, when we begin searching through more records. To scale to
millions or even billions of records, we need a vector database.
If you’re interested in learning more about multi-modal models, NLP, or vector search,
check out my YouTube channel, reach out on Discord, or follow along with one of my free
courses (links below).