Learning Transferable Visual Models From Natural Language Supervision

Alec Radford*1, Jong Wook Kim*1, Chris Hallacy1, Aditya Ramesh1, Gabriel Goh1, Sandhini Agarwal1, Girish Sastry1, Amanda Askell1, Pamela Mishkin1, Jack Clark1, Gretchen Krueger1, Ilya Sutskever1

Abstract

State-of-the-art computer vision systems are trained to predict a fixed set of predetermined object categories. This restricted form of supervision limits their generality and usability since additional labeled data is needed to specify any other visual concept. Learning directly from raw text about images is a promising alternative which leverages a much broader source of supervision. We demonstrate that the simple pre-training task of predicting which caption goes with which image is an efficient and scalable way to learn SOTA image representations from scratch on a dataset of 400 million (image, text) pairs collected from the internet. After pre-training, natural language is used to reference learned visual concepts (or describe new ones) enabling zero-shot transfer of the model to downstream tasks. We study the performance of this approach by benchmarking on over 30 different existing computer vision datasets, spanning tasks such as OCR, action recognition in videos, geo-localization, and many types of fine-grained object classification. The model transfers non-trivially to most tasks and is often competitive with a fully supervised baseline without the need for any dataset specific training. For instance, we match the accuracy of the original ResNet-50 on ImageNet zero-shot without needing to use any of the 1.28 million training examples it was trained on.

1. Introduction and Motivating Work

Pre-training methods which learn directly from raw text have revolutionized NLP over the last few years (Dai & Le, 2015; Peters et al., 2018; Howard & Ruder, 2018; Radford et al., 2018; Devlin et al., 2018; Raffel et al., 2019). Task-agnostic objectives such as autoregressive and masked language modeling have scaled across many orders of magnitude in compute, model capacity, and data, steadily improving capabilities. The development of “text-to-text” as a standardized input-output interface (McCann et al., 2018; Radford et al., 2019; Raffel et al., 2019) has enabled task-agnostic architectures to zero-shot transfer to downstream datasets, removing the need for specialized output heads or dataset specific customization. Flagship systems like GPT-3 (Brown et al., 2020) are now competitive across many tasks with bespoke models while requiring little to no dataset specific training data.

These results suggest that the aggregate supervision accessible to modern pre-training methods within web-scale collections of text surpasses that of high-quality crowd-labeled NLP datasets. However, in other fields such as computer vision it is still standard practice to pre-train models on crowd-labeled datasets such as ImageNet (Deng et al., 2009). Could scalable pre-training methods which learn directly from web text result in a similar breakthrough in computer vision? Prior work is encouraging.

Over 20 years ago Mori et al. (1999) explored improving content based image retrieval by training a model to predict the nouns and adjectives in text documents paired with images. Quattoni et al. (2007) demonstrated it was possible to learn more data efficient image representations via manifold learning in the weight space of classifiers trained to predict words in captions associated with images. Srivastava & Salakhutdinov (2012) explored deep representation learning by training multimodal Deep Boltzmann Machines on top of low-level image and text tag features. Joulin et al. (2016) modernized this line of work and demonstrated that CNNs trained to predict words in image captions learn useful image representations. They converted the title, description, and hashtag metadata of images in the YFCC100M dataset (Thomee et al., 2016) into a bag-of-words multi-label classification task and showed that pre-training AlexNet (Krizhevsky et al., 2012) to predict these labels learned representations which performed similarly to ImageNet-based pre-training on transfer tasks.

*Equal contribution. 1 OpenAI, San Francisco, CA 94110, USA. Correspondence to: <{alec, jongwook}@openai.com>.
[Figure 1: diagram with three panels: (1) Contrastive pre-training, (2) Create dataset classifier from label text, (3) Use for zero-shot prediction.]
Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict
some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the
target dataset’s classes.

Li et al. (2017) then extended this approach to predicting phrase n-grams in addition to individual words and demonstrated the ability of their system to zero-shot transfer to other image classification datasets by scoring target classes based on their dictionary of learned visual n-grams and predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised “gold-labels” and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their “zero-shot” capabilities.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models.

We also confirm these findings with linear-probe representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model's capability. These results have significant policy and ethical implications, which we consider in Section 7.

[Figure 2: zero-shot ImageNet accuracy vs. number of images processed (2M to 400M) for a transformer language model, bag-of-words prediction, and the contrastive CLIP objective; annotations mark a 3X and a further 4X efficiency gain.]

Figure 2. CLIP is much more efficient at zero-shot transfer than our image caption baseline. Although highly expressive, we found that transformer-based language models are relatively weak at zero-shot ImageNet classification. Here, we see that it learns 3x slower than a baseline which predicts a bag-of-words (BoW) encoding of the text (Joulin et al., 2016). Swapping the prediction objective for the contrastive objective of CLIP further improves efficiency another 4x.

2. Approach

2.1. Natural Language Supervision

At the core of our approach is the idea of learning perception from supervision contained in natural language. As discussed in the introduction, this is not at all a new idea, however terminology used to describe work in this space is varied, even seemingly contradictory, and stated motivations are diverse. Zhang et al. (2020), Gomez et al. (2017), Joulin et al. (2016), and Desai & Johnson (2020) all introduce methods which learn visual representations from text paired with images but describe their approaches as unsupervised, self-supervised, weakly supervised, and supervised respectively.

We emphasize that what is common across this line of work is not any of the details of the particular methods used but the appreciation of natural language as a training signal. All these approaches are learning from natural language supervision. Although early work wrestled with the complexity of natural language when using topic model and n-gram representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

Learning from natural language has several potential strengths over other training methods. It's much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic “machine learning compatible format” such as the canonical 1-of-N majority vote “gold label”. Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet. Learning from natural language also has an important advantage over most unsupervised or self-supervised learning approaches in that it doesn't “just” learn a representation but also connects that representation to language which enables flexible zero-shot transfer. In the following subsections, we detail the specific approach we settled on.

2.2. Creating a Sufficiently Large Dataset

Existing work has mainly used three datasets, MS-COCO (Lin et al., 2014), Visual Genome (Krishna et al., 2017), and YFCC100M (Thomee et al., 2016). While MS-COCO and Visual Genome are high quality crowd-labeled datasets, they are small by modern standards with approximately 100,000 training photos each. By comparison, other computer vision systems are trained on up to 3.5 billion Instagram photos (Mahajan et al., 2018). YFCC100M, at 100 million photos, is a possible alternative, but the metadata for each image is sparse and of varying quality. Many images use automatically generated filenames like 20160716_113957.JPG as “titles” or contain “descriptions” of camera exposure settings. After filtering to keep only images with natural language titles and/or descriptions in English, the dataset shrunk by a factor of 6 to only 15 million photos. This is approximately the same size as ImageNet.

A major motivation for natural language supervision is the large quantities of data of this form available publicly on the internet. Since existing datasets do not adequately reflect this possibility, considering results only on them would underestimate the potential of this line of research. To address this, we constructed a new dataset of 400 million (image, text) pairs collected from a variety of publicly available sources on the Internet. To attempt to cover as broad a set of visual concepts as possible, we search for (image, text) pairs as part of the construction process whose text includes one of a set of 500,000 queries.1

1 The base query list is all words occurring at least 100 times in the English version of Wikipedia. This is augmented with bi-grams with high pointwise mutual information as well as the names of all Wikipedia articles above a certain search volume. Finally all WordNet synsets not already in the query list are added.

We approximately class balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.

2.3. Selecting an Efficient Pre-Training Method

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N^2 − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss Sohn (2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We remove the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We use only a linear projection to map from each encoder's representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020) which samples a single sentence at uniform from the text since many of the (image, text) pairs in CLIP's pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter which controls the range of the logits in the softmax, τ, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.

2.4. Choosing and Scaling a Model

We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of “transformer-style” multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image.
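As an illustration of the attention pooling just described, here is a minimal, hypothetical PyTorch sketch: a single multi-head QKV attention layer whose query is the global average-pooled feature map. It omits details of the actual CLIP implementation (such as positional embeddings and the output projection width), and the module and variable names are our own.

import torch
import torch.nn as nn

class AttentionPool(nn.Module):
    # Hypothetical module name; pools a flattened CNN feature map with one layer of
    # multi-head QKV attention whose query is the global average-pooled representation.
    def __init__(self, dim, num_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, feats):
        # feats: [batch, height*width, dim] flattened spatial grid from the CNN
        query = feats.mean(dim=1, keepdim=True)       # global average pool -> [batch, 1, dim]
        pooled, _ = self.attn(query, feats, feats)    # attend over all spatial positions
        return pooled.squeeze(1)                      # [batch, dim]

# usage: pool a fake 7x7 feature map with 2048 channels
feats = torch.randn(4, 7 * 7, 2048)
print(AttentionPool(2048)(feats).shape)               # torch.Size([4, 2048])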
# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.
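For readers who want to run the pseudocode above, here is one way to flesh it out in plain NumPy. The l2_normalize and cross-entropy helpers are our own fillers for the functions left undefined in Figure 3, and random features stand in for the encoders; it is a sketch of the symmetric loss only, not the actual training code.

import numpy as np

def l2_normalize(x, axis=1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cross_entropy(logits, labels, axis):
    # Softmax over `axis`, then the mean negative log-likelihood of the correct index.
    shifted = logits - logits.max(axis=axis, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
    if axis == 0:   # each column is a classification over images for one text
        picked = log_probs[labels, np.arange(len(labels))]
    else:           # each row is a classification over texts for one image
        picked = log_probs[np.arange(len(labels)), labels]
    return -picked.mean()

n, d_i, d_t, d_e = 8, 2048, 512, 512
rng = np.random.default_rng(0)
I_f = rng.normal(size=(n, d_i))                 # stand-in for image_encoder(I)
T_f = rng.normal(size=(n, d_t))                 # stand-in for text_encoder(T)
W_i = rng.normal(size=(d_i, d_e))
W_t = rng.normal(size=(d_t, d_e))
t = np.log(1 / 0.07)                            # log-parameterized temperature

I_e = l2_normalize(I_f @ W_i)
T_e = l2_normalize(T_f @ W_t)
logits = (I_e @ T_e.T) * np.exp(t)              # [n, n] scaled cosine similarities

labels = np.arange(n)                           # the i-th image pairs with the i-th text
loss = (cross_entropy(logits, labels, axis=0) +
        cross_entropy(logits, labels, axis=1)) / 2
print(float(loss))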
For the second architecture, we experiment with the recently introduced Vision Transformer (ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms only allocating it to one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP's performance to be less sensitive to the capacity of the text encoder.

2.5. Training

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100 which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as “CLIP” use this model which we found to perform best.
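As a concrete illustration of the log-parameterized temperature and the clipping mentioned above, the short sketch below shows one common way to set it up. It is an assumed rendering with our own variable names, initialized to the equivalent of 0.07 and capped so the multiplier never exceeds 100.

import numpy as np
import torch

# Learnable log-parameterized temperature, initialized to the equivalent of 0.07.
logit_scale = torch.nn.Parameter(torch.tensor(np.log(1 / 0.07), dtype=torch.float32))

def scaled_logits(image_emb, text_emb):
    # Cap the multiplier exp(logit_scale) at 100 to keep the softmax logits in range.
    scale = logit_scale.exp().clamp(max=100.0)
    return scale * image_emb @ text_emb.t()

image_emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
text_emb = torch.nn.functional.normalize(torch.randn(4, 512), dim=-1)
print(scaled_logits(image_emb, text_emb).shape)   # torch.Size([4, 4])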

3. Experiments

3.1. Zero-Shot Transfer

3.1.1. Motivation

In computer vision, zero-shot learning usually refers to the study of generalizing to unseen object categories in image classification (Lampert et al., 2009). We instead use the term in a broader sense and study generalization to unseen datasets. We motivate this as a proxy for performing unseen tasks, as aspired to in the zero-data learning paper of Larochelle et al. (2008). While much research in the field of unsupervised learning focuses on the representation learning capabilities of machine learning systems, we motivate studying zero-shot transfer as a way of measuring the task-learning capabilities of machine learning systems. In this view, a dataset evaluates performance on a task on a specific distribution. However, many popular computer vision datasets were created by the research community primarily as benchmarks to guide the development of generic image classification methods rather than measuring performance on a specific task. While it is reasonable to say that the SVHN dataset measures the task of street number transcription on the distribution of Google Street View photos, it is unclear what “real” task the CIFAR-10 dataset measures. It is clear, however, what distribution CIFAR-10 is drawn from - TinyImages (Torralba et al., 2008). On these kinds of datasets, zero-shot transfer is more an evaluation of CLIP's robustness to distribution shift and domain generalization rather than task generalization. Please see Section 3.3 for analysis focused on this.

To our knowledge, Visual N-Grams (Li et al., 2017) first studied zero-shot transfer to existing image classification datasets in the manner described above. It is also the only other work we are aware of that has studied zero-shot transfer to standard image classification datasets using a generically pre-trained model and serves as the best reference point for contextualizing CLIP. Their approach learns the parameters of a dictionary of 142,806 visual n-grams (spanning 1- to 5-grams) and optimizes these n-grams using a differential version of Jelinek-Mercer smoothing to maximize the probability of all text n-grams for a given image. In order to perform zero-shot transfer, they first convert the text of each of the dataset's class names into its n-gram representation and then compute its probability according to their model, predicting the one with the highest score.

Our focus on studying zero-shot transfer as an evaluation of task learning is inspired by work demonstrating task learning in the field of NLP. To our knowledge Liu et al. (2018) first identified task learning as an “unexpected side-effect” when a language model trained to generate Wikipedia articles learned to reliably transliterate names between languages. While GPT-1 (Radford et al., 2018) focused on pre-training as a transfer learning method to improve supervised fine-tuning, it also included an ablation study demonstrating that the performance of four heuristic zero-shot transfer methods improved steadily over the course of pre-training, without any supervised adaption. This analysis served as the basis for GPT-2 (Radford et al., 2019) which focused exclusively on studying the task-learning capabilities of language models via zero-shot transfer.

3.1.2. Using CLIP for Zero-Shot Transfer

CLIP is pre-trained to predict if an image and a text snippet are paired together in its dataset. To perform zero-shot classification, we reuse this capability. For each dataset, we use the names of all the classes in the dataset as the set of potential text pairings and predict the most probable (image, text) pair according to CLIP. In a bit more detail, we first compute the feature embedding of the image and the feature embedding of the set of possible texts by their respective encoders. The cosine similarity of these embeddings is then calculated, scaled by a temperature parameter τ, and normalized into a probability distribution via a softmax. Note that this prediction layer is a multinomial logistic regression classifier with L2-normalized inputs, L2-normalized weights, no bias, and temperature scaling. When interpreted this way, the image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent. Lei Ba et al. (2015) first introduced a zero-shot image classifier of this form while the idea of generating a classifier from natural language dates back to at least Elhoseiny et al. (2013). Continuing with this interpretation, every step of CLIP pre-training can be viewed as optimizing the performance of a randomly created proxy to a computer vision dataset which contains 1 example per class and has 32,768 total classes defined via natural language descriptions. For zero-shot evaluation, we cache the zero-shot classifier once it has been computed by the text encoder and reuse it for all subsequent predictions. This allows the cost of generating it to be amortized across all the predictions in a dataset.
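The zero-shot recipe just described fits in a few lines of code. The sketch below is ours, not a released implementation: image_encoder, text_encoder, and tokenize are hypothetical stand-ins for whatever pre-trained CLIP-style encoders are available, and only the overall procedure (embed the class names with a prompt, compare by cosine similarity, softmax) follows the text above.

import torch
import torch.nn.functional as F

def build_zero_shot_classifier(class_names, text_encoder, tokenize,
                               template="A photo of a {}."):
    # Embed every class name once; the resulting matrix acts as the weights of a
    # linear classifier and can be cached for all predictions on the dataset.
    prompts = [template.format(name) for name in class_names]
    with torch.no_grad():
        weights = text_encoder(tokenize(prompts))        # [num_classes, dim]
    return F.normalize(weights, dim=-1)

def zero_shot_predict(images, image_encoder, classifier_weights, logit_scale=100.0):
    # logit_scale stands in for the learned temperature; 100 is only a placeholder here.
    with torch.no_grad():
        feats = F.normalize(image_encoder(images), dim=-1)   # [batch, dim]
    probs = (logit_scale * feats @ classifier_weights.t()).softmax(dim=-1)
    return probs.argmax(dim=-1), probs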
3.1.3. Initial Comparison to Visual N-Grams

In Table 1 we compare Visual N-Grams to CLIP. The best CLIP model improves accuracy on ImageNet from a proof of concept 11.5% to 76.2% and matches the performance of the original ResNet-50 despite using none of the 1.28 million crowd-labeled training examples available for this dataset. Additionally, the top-5 accuracy of CLIP models is noticeably higher than their top-1, and this model has a 95% top-5 accuracy, matching Inception-V4 (Szegedy et al., 2016).
                 aYahoo   ImageNet   SUN
Visual N-Grams    72.4      11.5     23.0
CLIP              98.4      76.2     58.5

Table 1. Comparing CLIP to prior zero-shot transfer image classification results. CLIP improves performance on all three datasets by a large amount. This improvement reflects many differences in the 4 years since the development of Visual N-Grams (Li et al., 2017).

[Figure 4: average zero-shot score (%) across 36 datasets vs. model GFLOPs (6.1 to 265.9) for RN50 through RN50x64, comparing prompt engineering and ensembling against contextless class names (Li et al. 2017); annotations mark a 5 point improvement and a 4X efficiency gain.]

Figure 4. Prompt engineering and ensembling improve zero-shot performance. Compared to the baseline of using contextless class names, prompt engineering and ensembling boost zero-shot classification performance by almost 5 points on average across 36 datasets. This improvement is similar to the gain from using 4 times more compute with the baseline zero-shot method but is “free” when amortized over many predictions.

The ability to match the performance of a strong, fully supervised baseline in a zero-shot setting suggests CLIP is a significant step towards flexible and practical zero-shot computer vision classifiers. As mentioned above, the comparison to Visual N-Grams is meant for contextualizing the performance of CLIP and should not be interpreted as a direct methods comparison between CLIP and Visual N-Grams as many performance relevant differences between the two systems were not controlled for. For instance, we train on a dataset that is 10x larger, use a vision model that requires nearly 100x more compute per prediction, likely used over 1000x their training compute, and use a transformer-based model which did not exist when Visual N-Grams was published. As a closer comparison, we trained a CLIP ResNet-50 on the same YFCC100M dataset that Visual N-Grams was trained on and found it matched their reported ImageNet performance within a V100 GPU day. This baseline was also trained from scratch instead of being initialized from pre-trained ImageNet weights as in Visual N-Grams.

CLIP also outperforms Visual N-Grams on the other 2 reported datasets. On aYahoo, CLIP achieves a 95% reduction in the number of errors, and on SUN, CLIP more than doubles the accuracy of Visual N-Grams. To conduct a more comprehensive analysis and stress test, we implement a much larger evaluation suite detailed in Appendix A. In total we expand from the 3 datasets reported in Visual N-Grams to include over 30 datasets and compare to over 50 existing computer vision systems to contextualize results.

3.1.4. Prompt Engineering and Ensembling

Most standard image classification datasets treat the information naming or describing classes which enables natural language based zero-shot transfer as an afterthought. The vast majority of datasets annotate images with just a numeric id of the label and contain a file mapping these ids back to their names in English. Some datasets, such as Flowers102 and GTSRB, don't appear to include this mapping at all in their released versions preventing zero-shot transfer entirely.2 For many datasets, we observed these labels may be chosen somewhat haphazardly and do not anticipate issues related to zero-shot transfer which relies on task description in order to transfer successfully.

A common issue is polysemy. When the name of a class is the only information provided to CLIP's text encoder it is unable to differentiate which word sense is meant due to the lack of context. In some cases multiple meanings of the same word might be included as different classes in the same dataset! This happens in ImageNet which contains both construction cranes and cranes that fly. Another example is found in classes of the Oxford-IIIT Pet dataset where the word boxer is, from context, clearly referring to a breed of dog, but to a text encoder lacking context could just as likely refer to a type of athlete.

Another issue we encountered is that it's relatively rare in our pre-training dataset for the text paired with the image to be just a single word. Usually the text is a full sentence describing the image in some way. To help bridge this distribution gap, we found using the prompt template “A photo of a {label}.” to be a good default that helps specify the text is about the content of the image. This often improves performance over the baseline of using only the label text. For instance, just using this prompt improves accuracy on ImageNet by 1.3%.

2 Alec learned much more about flower species and German traffic signs over the course of this project than he originally anticipated.

Similar to the “prompt engineering” discussion around GPT-3 (Brown et al., 2020; Gao et al., 2020), we have also observed that zero-shot performance can be significantly improved by customizing the prompt text to each task. A few, non-exhaustive, examples follow. We found on several fine-grained image classification datasets that it helped to specify the category. For example on Oxford-IIIT Pets, using “A photo of a {label}, a type of pet.” to help provide context worked well. Likewise, on Food101 specifying a type of food and on FGVC Aircraft a type of aircraft helped too. For OCR datasets, we found that putting quotes around the text or number to be recognized improved performance. Finally, we found that on satellite image classification datasets it helped to specify that the images were of this form and we use variants of “a satellite photo of a {label}.”.

We also experimented with ensembling over multiple zero-shot classifiers as another way of improving performance. These classifiers are computed by using different context prompts such as “A photo of a big {label}” and “A photo of a small {label}”. We construct the ensemble over the embedding space instead of probability space. This allows us to cache a single set of averaged text embeddings so that the compute cost of the ensemble is the same as using a single classifier when amortized over many predictions. We've observed ensembling across many generated zero-shot classifiers to reliably improve performance and use it for the majority of datasets. On ImageNet, we ensemble 80 different context prompts and this improves performance by an additional 3.5% over the single default prompt discussed above. When considered together, prompt engineering and ensembling improve ImageNet accuracy by almost 5%. In Figure 4 we visualize how prompt engineering and ensembling change the performance of a set of CLIP models compared to the contextless baseline approach of directly embedding the class name as done in Li et al. (2017).
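A minimal sketch of the embedding-space ensembling described above, assuming the same hypothetical text_encoder and tokenize helpers as in the earlier zero-shot sketch: each class is embedded once per context prompt, the embeddings are averaged and re-normalized, and the averaged vectors are cached as that class's zero-shot weights.

import torch
import torch.nn.functional as F

PROMPTS = [
    "A photo of a {}.",
    "A photo of a big {}.",
    "A photo of a small {}.",
]

def ensembled_classifier(class_names, text_encoder, tokenize, prompts=PROMPTS):
    # Average each class's prompt embeddings in embedding space (not probability space),
    # then re-normalize; the result is cached, so prediction costs no more than using a
    # single classifier.
    weights = []
    with torch.no_grad():
        for name in class_names:
            texts = tokenize([p.format(name) for p in prompts])
            emb = F.normalize(text_encoder(texts), dim=-1)        # [num_prompts, dim]
            weights.append(F.normalize(emb.mean(dim=0), dim=-1))  # averaged class vector
    return torch.stack(weights)                                   # [num_classes, dim]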
3.1.5. Analysis of Zero-Shot CLIP Performance

Since task-agnostic zero-shot classifiers for computer vision have been understudied, CLIP provides a promising opportunity to gain a better understanding of this type of model. In this section, we conduct a study of various properties of CLIP's zero-shot classifiers. As a first question, we look simply at how well zero-shot classifiers perform. To contextualize this, we compare to the performance of a simple off-the-shelf baseline: fitting a fully supervised, regularized, logistic regression classifier on the features of the canonical ResNet-50. In Figure 5 we show this comparison across 27 datasets. Please see Appendix A for details of datasets and setup.

[Figure 5: bar chart of per-dataset score differences between zero-shot CLIP and a linear probe on ResNet-50 features, ranging from +34.8 on STL10 to -37.1 on EuroSAT.]

Figure 5. Zero-shot CLIP is competitive with a fully supervised baseline. Across a 27 dataset eval suite, a zero-shot CLIP classifier outperforms a fully supervised linear classifier fitted on ResNet-50 features on 16 datasets, including ImageNet.

Zero-shot CLIP outperforms this baseline slightly more often than not and wins on 16 of the 27 datasets. Looking at individual datasets reveals some interesting behavior. The dataset zero-shot CLIP improves by the most is STL10, a dataset designed to encourage efficient learning by providing only a limited number of labeled examples. Zero-shot CLIP, without using any training examples, achieves 99.3% on this dataset which appears to be a new state of the art. On fine-grained classification tasks, we observe a wide spread in performance. On two of these datasets, Stanford Cars and Food101, zero-shot CLIP outperforms logistic regression on ResNet-50 features by over 20% while on two others, Flowers102 and FGVCAircraft, zero-shot CLIP underperforms by over 10%. On OxfordPets and Birdsnap, performance is much closer. We suspect these differences are primarily due to varying amounts of per-task supervision between WIT and ImageNet. On “general” object classification datasets such as ImageNet, CIFAR10, and PascalVOC2007 performance is relatively similar with a slight advantage for zero-shot CLIP. Zero-shot CLIP significantly outperforms a ResNet-50 on two datasets measuring action recognition in videos. On Kinetics700, CLIP outperforms a ResNet-50 by 14.5%. Zero-shot CLIP also outperforms a ResNet-50's features by 7.7% on UCF101. We speculate this is due to natural language providing wider supervision for visual concepts involving verbs, compared to the noun-centric object supervision in ImageNet.
[Figure 6: average score (%) vs. number of labeled training examples per class (0, 1, 2, 4, 8, 16) for linear probes on CLIP, BiT-M (ImageNet-21K), SimCLRv2, and ResNet-50 features, with zero-shot CLIP marked.]

Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.

Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as German traffic sign recognition (GTSRB), and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).

While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression on the features of many image models including the best publicly available ImageNet models, self-supervised learning methods, and CLIP itself. While it is intuitive to expect zero-shot to underperform one-shot, we instead find that zero-shot CLIP matches the performance of 4-shot logistic regression on the same feature space. This is likely due to an important difference between the zero-shot and few-shot approach. First, CLIP's zero-shot classifier is generated via natural language which allows for visual concepts to be directly specified (“communicated”). By contrast, “normal” supervised learning must infer concepts indirectly from training examples. Context-less example-based learning has the drawback that many different hypotheses can be consistent with the data, especially in the one-shot case. A single image often contains many different visual concepts. Although a capable learner is able to exploit visual cues and heuristics, such as assuming that the concept being demonstrated is the primary object in an image, there is no guarantee.

A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP's zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was “just” the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with the flexibility of few-shot learning is a promising direction for future work.

When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We are certain that a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.

In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting.
[Figure 7: chart of the estimated number of labeled examples per class a linear classifier on CLIP features requires to match zero-shot CLIP, ranging from 0.9 (Flowers102, EuroSAT) to 184 (FER2013); mean 20.8, median 5.4.]

Figure 7. The data efficiency of zero-shot transfer varies widely. Calculating the number of labeled examples per class a linear classifier on the same CLIP feature space requires to match the performance of the zero-shot classifier contextualizes the effectiveness of zero-shot transfer. Values are estimated based on log-linear interpolation of 1, 2, 4, 8, 16-shot and fully supervised results. Performance varies widely from still underperforming a one-shot classifier on two datasets to matching an estimated 184 labeled examples per class.

[Figure 8: scatter plot of zero-shot CLIP performance vs. linear probe CLIP performance across datasets, with correlation r = 0.82.]

Figure 8. Zero-shot performance is correlated with linear probe performance but still mostly sub-optimal. Comparing zero-shot and linear probe performance across datasets shows a strong correlation with zero-shot performance mostly shifted 10 to 25 points lower. On only 5 datasets does zero-shot performance approach linear probe performance (≤3 point difference).

In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation of the performance of a 1, 2, 4, 8, 16-shot (when possible), and a fully supervised linear classifier trained on each dataset. We find that zero-shot transfer can have widely varying efficiency per dataset from less than 1 labeled example per class to 184. Two datasets, Flowers102 and EuroSAT, underperform one-shot models. Half of the datasets require less than 5 examples per class with a median of 5.4. However, the mean estimated data efficiency is 20.8 examples per class. This is due to the 20% of datasets where supervised classifiers require many labeled examples per class in order to match performance. On ImageNet, zero-shot CLIP matches the performance of a 16-shot linear classifier trained on the same feature space.
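The estimate above can be reproduced with a short interpolation routine. The sketch below is our own reading of the procedure, interpolating accuracy linearly in log2 of the number of labeled examples per class and returning the point where the curve reaches the zero-shot score; the accuracy values in the example are invented placeholders.

import numpy as np

def examples_to_match_zero_shot(shots, accuracies, zero_shot_acc):
    # Interpolate accuracy linearly in log2(#examples per class) and return the
    # (possibly fractional) number of examples at which the interpolated curve
    # reaches the zero-shot score. np.interp clamps outside the observed range.
    log_shots = np.log2(np.asarray(shots, dtype=float))
    matched_log = np.interp(zero_shot_acc, np.asarray(accuracies, dtype=float), log_shots)
    return float(2.0 ** matched_log)

# hypothetical few-shot linear probe accuracies for one dataset
shots = [1, 2, 4, 8, 16]
accuracies = [41.0, 48.5, 55.0, 60.5, 65.0]   # must be increasing for np.interp
print(round(examples_to_match_zero_shot(shots, accuracies, zero_shot_acc=58.0), 1))  # about 5.8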
If we assume that evaluation datasets are large enough that the parameters of linear classifiers trained on them are well estimated, then, because CLIP's zero-shot classifier is also a linear classifier, the performance of the fully supervised classifiers roughly sets an upper bound for what zero-shot transfer can achieve. In Figure 8 we compare CLIP's zero-shot performance with fully supervised linear classifiers across datasets. The dashed, y = x line represents an “optimal” zero-shot classifier that matches the performance of its fully supervised equivalent. For most datasets, the performance of zero-shot classifiers still underperforms fully supervised classifiers by 10% to 25%, suggesting that there is still plenty of headroom for improving CLIP's task-learning and zero-shot transfer capabilities.

There is a positive correlation of 0.82 (p-value < 10^-6) between zero-shot performance and fully supervised performance, suggesting that CLIP is relatively consistent at connecting underlying representation and task learning to zero-shot transfer. However, zero-shot CLIP only approaches fully supervised performance on 5 datasets: STL10, CIFAR10, Food101, OxfordPets, and Caltech101. On all 5 datasets, both zero-shot accuracy and fully supervised accuracy are over 90%. This suggests that CLIP may be more effective at zero-shot transfer for tasks where its underlying representations are also high quality. The slope of a linear regression model predicting zero-shot performance as a function of fully supervised performance estimates that for every 1% improvement in fully supervised performance, zero-shot performance improves by 1.28%. However, the 95th-percentile confidence intervals still include values of less than 1 (0.93-1.79).

Over the past few years, empirical studies of deep learning systems have documented that performance is predictable as a function of important quantities such as training compute and dataset size (Hestness et al., 2017; Kaplan et al., 2020). The GPT family of models has so far demonstrated consistent improvements in zero-shot performance across a 1000x increase in training compute. In Figure 9, we check whether the zero-shot performance of CLIP follows a similar scaling pattern.

[Figure 9: log-log plot of average zero-shot error (%) across the evaluation suite vs. model GFLOPs (6.1 to 265.9) for the 5 ResNet CLIP models, RN50 through RN50x64.]

Figure 9. Zero-shot CLIP performance scales smoothly as a function of model compute. Across 39 evals on 36 different datasets, average zero-shot error is well modeled by a log-log linear trend across a 44x range of compute spanning 5 different CLIP models. Lightly shaded lines are performance on individual evals, showing that performance is much more varied despite the smooth overall trend.

We plot the average error rate of the 5 ResNet CLIP models across 39 evaluations on 36 different datasets and find that a similar log-log linear scaling trend holds for CLIP across a 44x increase in model compute. While the overall trend is smooth, we found that performance on individual evaluations can be much noisier. We are unsure whether this is caused by high variance between individual training runs on sub-tasks (as documented in D'Amour et al. (2020)) masking a steadily improving trend or whether performance is actually non-monotonic as a function of compute on some tasks.
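The smooth trend in Figure 9 is the kind of relationship a least-squares fit in log-log space summarizes in one line. The error values in the sketch below are invented placeholders (only the GFLOPs figures appear in the figure's axis); it shows the fitting recipe, not the paper's measurements.

import numpy as np

gflops = np.array([6.1, 9.9, 21.5, 75.3, 265.9])   # compute of the 5 ResNet CLIP models
error = np.array([46.0, 43.0, 39.5, 35.0, 31.0])    # hypothetical average zero-shot error (%)

# Fit error = a * compute^b via a linear least-squares fit in log-log space.
slope, intercept = np.polyfit(np.log(gflops), np.log(error), deg=1)
predicted = np.exp(intercept) * gflops ** slope
print(f"fitted exponent: {slope:.3f}")
print(np.round(predicted, 1))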
3.2. Representation Learning

While we have extensively analyzed the task-learning capabilities of CLIP through zero-shot transfer in the previous section, it is more common to study the representation learning capabilities of a model. There exist many ways to evaluate the quality of representations as well as disagreements over what properties an “ideal” representation should have (Locatello et al., 2020). Fitting a linear classifier on a representation extracted from the model and measuring its performance on various datasets is a common approach. An alternative is measuring the performance of end-to-end fine-tuning of the model. This increases flexibility, and prior work has convincingly demonstrated that fine-tuning outperforms linear classification on most image classification datasets (Kornblith et al., 2019; Zhai et al., 2019). While the high performance of fine-tuning motivates its study for practical reasons, we still opt for linear classifier based evaluation for several reasons. Our work is focused on developing a high-performing task and dataset-agnostic pre-training approach. Fine-tuning, because it adapts representations to each dataset during the fine-tuning phase, can compensate for and potentially mask failures to learn general and robust representations during the pre-training phase. Linear classifiers, because of their limited flexibility, instead highlight these failures and provide clear feedback during development. For CLIP, training supervised linear classifiers has the added benefit of being very similar to the approach used for its zero-shot classifiers which enables extensive comparisons and analysis in Section 3.1. Finally, we aim to compare CLIP to a comprehensive set of existing models across many tasks. Studying 66 different models on 27 different datasets requires tuning 1782 different evaluations. Fine-tuning opens up a much larger design and hyper-parameter space, which makes it difficult to fairly evaluate and computationally expensive to compare a diverse set of techniques as discussed in other large scale empirical studies (Lucic et al., 2018; Choi et al., 2019). By comparison, linear classifiers require minimal hyper-parameter tuning and have standardized implementations and evaluation procedures. Please see Appendix A for further details on evaluation.
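A linear-probe evaluation of this kind can be set up with an off-the-shelf logistic regression. The sketch below assumes pre-extracted, frozen features in NumPy arrays and a small sweep over the regularization strength; it is illustrative only and does not reproduce the paper's exact solver or hyper-parameter search, which are described in its appendix.

import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_feats, train_labels, test_feats, test_labels,
                          c_values=(0.01, 0.1, 1.0, 10.0)):
    # Fit a logistic regression probe on frozen features for a few regularization
    # strengths. For brevity this scores each setting on the test split; a proper
    # evaluation would select C on a held-out validation split instead.
    best_acc = 0.0
    for c in c_values:
        clf = LogisticRegression(C=c, max_iter=1000)
        clf.fit(train_feats, train_labels)
        best_acc = max(best_acc, clf.score(test_feats, test_labels))
    return best_acc

# toy usage with random arrays standing in for frozen image features
rng = np.random.default_rng(0)
Xtr, Xte = rng.normal(size=(200, 64)), rng.normal(size=(50, 64))
ytr, yte = rng.integers(0, 5, 200), rng.integers(0, 5, 50)
print(linear_probe_accuracy(Xtr, ytr, Xte, yte))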
Figure 10 summarizes our findings. To minimize selection effects that could raise concerns of confirmation or reporting bias, we first study performance on the 12 dataset evaluation suite from Kornblith et al. (2019). While small CLIP models such as a ResNet-50 and ResNet-101 outperform other ResNets trained on ImageNet-1K (BiT-S and the originals), they underperform ResNets trained on ImageNet-21K (BiT-M). These small CLIP models also underperform models in the EfficientNet family with similar compute requirements. However, models trained with CLIP scale very well and the largest model we trained (ResNet-50x64) slightly outperforms the best performing existing model (a Noisy Student EfficientNet-L2) on both overall score and compute efficiency. We also find that CLIP vision transformers are about 3x more compute efficient than CLIP ResNets, which allows us to reach higher overall performance within our compute budget. These results qualitatively replicate the findings of Dosovitskiy et al. (2020) which reported that vision transformers are more compute efficient than convnets when trained on sufficiently large datasets. Our best overall model is a ViT-L/14 that is fine-tuned at a higher resolution of 336 pixels on our dataset for 1 additional epoch. This model outperforms the best existing model across this evaluation suite by an average of 2.6%.

As Figure 21 qualitatively shows, CLIP models learn a wider set of tasks than has previously been demonstrated in a single computer vision model trained end-to-end from random initialization. These tasks include geo-localization, optical character recognition, facial emotion recognition, and action recognition. None of these tasks are measured in the evaluation suite of Kornblith et al. (2019). This could be argued to be a form of selection bias in Kornblith et al. (2019)'s study towards tasks that overlap with ImageNet. To address this, we also measure performance on a broader 27 dataset evaluation suite.

[Figure 10: average linear probe score (%) versus forward-pass GFLOPs/image, (left) over Kornblith et al.'s 12 datasets and (right) over all 27 datasets, for CLIP-ViT, CLIP-ResNet, EfficientNet-NoisyStudent, EfficientNet, Instagram-pretrained, SimCLRv2, BYOL, MoCo, ViT (ImageNet-21k), BiT-M, BiT-S, and ResNet models.]

Figure 10. Linear probe performance of CLIP models in comparison with state-of-the-art computer vision models, including EfficientNet (Tan & Le, 2019; Xie et al., 2020), MoCo (Chen et al., 2020e), Instagram-pretrained ResNeXt models (Mahajan et al., 2018; Touvron et al., 2019), BiT (Kolesnikov et al., 2019), ViT (Dosovitskiy et al., 2020), SimCLRv2 (Chen et al., 2020d), BYOL (Grill et al., 2020), and the original ResNet models (He et al., 2016b). (Left) Scores are averaged over 12 datasets studied by Kornblith et al. (2019). (Right) Scores are averaged over 27 datasets that contain a wider variety of distributions. Dotted lines indicate models fine-tuned or evaluated on images at a higher-resolution than pre-training. See Table 10 for individual scores and Figure 20 for plots for each dataset.
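For illustration, the linear probe evaluation referenced in Figure 10 can be sketched in a few lines: freeze the image encoder, extract features for a downstream dataset, and fit an L2-regularized logistic regression on top of them. The sketch below assumes the features have already been extracted into NumPy arrays (here replaced by random placeholders so the snippet runs) and uses scikit-learn; the regularization strength C is a stand-in, not the value used for the reported scores.

import numpy as np
from sklearn.linear_model import LogisticRegression

def linear_probe_accuracy(train_features, train_labels, test_features, test_labels, C=1.0):
    # C is the inverse L2 regularization strength; in a real sweep it would be
    # selected per dataset on a validation split.
    clf = LogisticRegression(C=C, max_iter=1000)
    clf.fit(train_features, train_labels)
    return clf.score(test_features, test_labels)

# Random placeholder features standing in for frozen image-encoder outputs.
rng = np.random.default_rng(0)
train_x, test_x = rng.normal(size=(500, 512)), rng.normal(size=(100, 512))
train_y, test_y = rng.integers(0, 10, 500), rng.integers(0, 10, 100)
print(linear_probe_accuracy(train_x, train_y, test_x, test_y))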

This evaluation suite, detailed in Appendix A, includes datasets representing the aforementioned tasks, German Traffic Signs Recognition Benchmark (Stallkamp et al., 2011), as well as several other datasets adapted from VTAB (Zhai et al., 2019).

On this broader evaluation suite, the benefits of CLIP are more clear. All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency. The improvement in average score of the best model over previous systems increases from 2.6% to 5%. We also find that self-supervised systems do noticeably better on our broader evaluation suite. For instance, while SimCLRv2 still underperforms BiT-M on average on the 12 datasets of Kornblith et al. (2019), SimCLRv2 outperforms BiT-M on our 27 dataset evaluation suite. These findings suggest continuing to expand task diversity and coverage in order to better understand the "general" performance of systems. We suspect additional evaluation efforts along the lines of VTAB to be valuable.

In addition to the aggregate analysis above, we visualize per-dataset differences in the performance of the best CLIP model and the best model in our evaluation suite across all 27 datasets in Figure 11. CLIP outperforms the Noisy Student EfficientNet-L2 on 21 of the 27 datasets. CLIP improves the most on tasks which require OCR (SST2 and HatefulMemes), geo-localization and scene recognition (Country211, SUN397), and activity recognition in videos (Kinetics700 and UCF101). In addition CLIP also does much better on fine-grained car and traffic sign recognition (Stanford Cars and GTSRB). This may reflect a problem with overly narrow supervision in ImageNet. A result such as the 14.7% improvement on GTSRB could be indicative of an issue with ImageNet-1K, which has only a single label for all traffic and street signs. This could encourage a supervised representation to collapse intra-class details and hurt accuracy on a fine-grained downstream task. As mentioned, CLIP still underperforms the EfficientNet on several datasets. Unsurprisingly, the dataset that the EfficientNet does best relative to CLIP on is the one it was trained on: ImageNet.

[Figure 11 ("Logistic Regression on CLIP vs. EfficientNet L2 NS"): per-dataset difference in score (%), ranging from +23.6 (SST2), +22.7 (Country211), +18.8 (HatefulMemes), +15.9 (StanfordCars), and +14.7 (GTSRB) down to -1.7 (CIFAR100), -2.4 (CLEVRCounts), and -3.0 (ImageNet).]

Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets.

The EfficientNet also slightly outperforms CLIP on low-resolution datasets such as CIFAR10 and CIFAR100. We suspect this is at least partly due to the lack of scale-based data augmentation in CLIP. The EfficientNet also does slightly better on PatchCamelyon and CLEVRCounts, datasets where overall performance is still low for both approaches.

3.3. Robustness to Natural Distribution Shift

In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set (He et al., 2015). However, research in the subsequent years has repeatedly found that these models still make many simple mistakes (Dodge & Karam, 2017; Geirhos et al., 2018; Alcorn et al., 2019), and new benchmarks testing these systems have often found their performance to be much lower than both their ImageNet accuracy and human accuracy (Recht et al., 2019; Barbu et al., 2019). What explains this discrepancy? Various ideas have been suggested and studied (Ilyas et al., 2019; Geirhos et al., 2020). A common theme of proposed explanations is that deep learning models are exceedingly adept at finding correlations and patterns which hold across their training dataset and thus improve in-distribution performance. However, many of these correlations and patterns are actually spurious and do not hold for other distributions, resulting in large drops in performance on other datasets.

We caution that, to date, most of these studies limit their evaluation to models trained on ImageNet. Recalling the topic of discussion, it may be a mistake to generalize too far from these initial findings. To what degree are these failures attributable to deep learning, ImageNet, or some combination of the two? CLIP models, which are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance, are an opportunity to investigate this question from a different angle.

Taori et al. (2020) is a recent comprehensive study moving towards quantifying and understanding these behaviors for ImageNet models. Taori et al. (2020) study how the performance of ImageNet models changes when evaluated on natural distribution shifts. They measure performance on a set of 7 distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet Sketch (Wang et al., 2019), Youtube-BB and ImageNet-Vid (Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (Hendrycks et al., 2019), and ImageNet Rendition (Hendrycks et al., 2020a). They distinguish these datasets, which all consist of novel images collected from a variety of sources, from synthetic distribution shifts such as ImageNet-C (Hendrycks & Dietterich, 2019), Stylized ImageNet (Geirhos et al., 2018), or adversarial attacks (Goodfellow et al., 2014) which are created by perturbing existing images in various ways. They propose this distinction in part because they find that while several techniques have been demonstrated to improve performance on synthetic distribution shifts, they often fail to yield consistent improvements on natural distributions.3

3 We refer readers to Hendrycks et al. (2020a) for additional experiments and discussion on this claim.

Across these collected datasets, the accuracy of ImageNet models drops well below the expectation set by the ImageNet validation set. For the following summary discussion we report average accuracy across all 7 natural distribution shift datasets and average accuracy across the corresponding class subsets of ImageNet unless otherwise specified. Additionally, for Youtube-BB and ImageNet-Vid, which have two different evaluation settings, we use the average of pm-0 and pm-10 accuracy.

A ResNet-101 makes 5 times as many mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set. Encouragingly, however, Taori et al. (2020) find that accuracy under distribution shift increases predictably with ImageNet accuracy and is well modeled as a linear function of logit-transformed accuracy. Taori et al. (2020) use this finding to propose that robustness analysis should distinguish between effective and relative robustness. Effective robustness measures improvements in accuracy under distribution shift above what is predicted by the documented relationship between in-distribution and out-of-distribution accuracy.

[Figure 12: transfer score (%) of linear probes versus ImageNet score (%), (left) averaged over Kornblith et al.'s 12 datasets and (right) averaged over 26 datasets, comparing CLIP-ViT and CLIP-ResNet with EfficientNet, EfficientNet-NoisyStudent, Instagram, SimCLRv2, BYOL, MoCo, ViT (ImageNet-21k), BiT-M, BiT-S, and ResNet models.]

Figure 12. CLIP's features are more robust to task shift when compared to models pre-trained on ImageNet. For both dataset splits, the transfer scores of linear probes trained on the representations of CLIP models are higher than other models with similar ImageNet performance. This suggests that the representations of models trained on ImageNet are somewhat overfit to their task.

Relative robustness captures any improvement in out-of-distribution accuracy. Taori et al. (2020) argue that robustness techniques should aim to improve both effective robustness and relative robustness.

Almost all models studied in Taori et al. (2020) are trained or fine-tuned on the ImageNet dataset. Returning to the discussion in the introduction to this section: is training or adapting to the ImageNet dataset distribution the cause of the observed robustness gap? Intuitively, a zero-shot model should not be able to exploit spurious correlations or patterns that hold only on a specific distribution, since it is not trained on that distribution.4 Thus it is reasonable to expect zero-shot models to have much higher effective robustness. In Figure 13, we compare the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.

4 We caution that a zero-shot model can still exploit spurious correlations that are shared between the pre-training and evaluation distributions.

While these results show that zero-shot models can be much more robust, they do not necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or use of natural language supervision, could also result in much more robust models regardless of whether they are zero-shot or fine-tuned. As an initial experiment to potentially begin narrowing this down, we also measure how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2-regularized logistic regression classifier fit to CLIP features on the ImageNet training set. We visualize how performance changes from the zero-shot classifier in Figure 14. Although adapting CLIP to the ImageNet distribution increases its ImageNet accuracy by 9.2% to 85.4% overall, and ties the accuracy of the 2018 SOTA from Mahajan et al. (2018), average accuracy under distribution shift slightly decreases.

It is surprising to see a 9.2% increase in accuracy, which corresponds to roughly 3 years of improvement in SOTA, fail to translate into any improvement in average performance under distribution shift. We also break down the differences between zero-shot accuracy and linear classifier accuracy per dataset in Figure 14 and find performance still increases significantly on one dataset, ImageNetV2.
[Figure 13, left panel: average top-1 accuracy on 7 natural distribution shift datasets versus average top-1 accuracy on class-subsampled ImageNet, comparing zero-shot CLIP, standard ImageNet training, and existing robustness techniques against an ideal robust model (y = x).]

[Figure 13, right panel: per-dataset top-1 accuracy of an ImageNet-trained ResNet101 versus zero-shot CLIP:]

Dataset            ImageNet ResNet101   Zero-Shot CLIP   Δ Score
ImageNet           76.2                 76.2             0%
ImageNetV2         64.3                 70.1             +5.8%
ImageNet-R         37.7                 88.9             +51.2%
ObjectNet          32.6                 72.3             +39.7%
ImageNet Sketch    25.2                 60.2             +35.0%
ImageNet-A         2.7                  77.1             +74.4%

Figure 13. Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this "robustness gap" by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101.
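The linear fit on logit-transformed accuracies mentioned in the Figure 13 caption is also what underlies the notion of effective robustness used in this section: a model's accuracy under shift is compared against the value predicted from its in-distribution accuracy. A minimal sketch of that calculation is below; the baseline accuracies are illustrative placeholders, not the values behind the figure.

import numpy as np
from scipy.special import logit, expit

# Illustrative in-distribution (ImageNet) and shifted-set accuracies for a set
# of baseline models; the real analysis uses many ImageNet models.
baseline_id = np.array([0.65, 0.70, 0.76, 0.80, 0.85])
baseline_ood = np.array([0.38, 0.44, 0.52, 0.58, 0.66])

# Linear trend in logit space, as in Taori et al. (2020).
slope, intercept = np.polyfit(logit(baseline_id), logit(baseline_ood), deg=1)

def effective_robustness(model_id_acc, model_ood_acc):
    # Accuracy under shift minus the accuracy predicted from the trend line.
    predicted_ood = expit(slope * logit(model_id_acc) + intercept)
    return model_ood_acc - predicted_ood

print(effective_robustness(0.763, 0.58))  # hypothetical model; positive means above the trend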

ImageNetV2 closely followed the creation process of the original ImageNet dataset, which suggests that gains in accuracy from supervised adaptation are closely concentrated around the ImageNet distribution. Performance decreases by 4.7% on ImageNet-R, 3.8% on ObjectNet, 2.8% on ImageNet Sketch, and 1.9% on ImageNet-A. The change in accuracy on the two other datasets, Youtube-BB and ImageNet Vid, is insignificant.

How is it possible to improve accuracy by 9.2% on the ImageNet dataset with little to no increase in accuracy under distribution shift? Is the gain primarily from "exploiting spurious correlations"? Is this behavior unique to some combination of CLIP, the ImageNet dataset, and the distribution shifts studied, or a more general phenomenon? Does it hold for end-to-end finetuning as well as linear classifiers? We do not have confident answers to these questions at this time. Prior work has also pre-trained models on distributions other than ImageNet, but it is common to study and release models only after they have been fine-tuned to ImageNet. As a step towards understanding whether pre-trained zero-shot models consistently have higher effective robustness than fine-tuned models, we encourage the authors of Mahajan et al. (2018), Kolesnikov et al. (2019), and Dosovitskiy et al. (2020) to, if possible, study these questions on their models as well.

We also investigate another robustness intervention enabled by flexible zero-shot natural-language-based image classifiers. The target classes across the 7 transfer datasets are not always perfectly aligned with those of ImageNet. Two datasets, Youtube-BB and ImageNet-Vid, consist of super-classes of ImageNet. This presents a problem when trying to use the fixed 1000-way classifier of an ImageNet model to make predictions. Taori et al. (2020) handle this by max-pooling predictions across all sub-classes according to the ImageNet class hierarchy. Sometimes this mapping is much less than perfect. For the person class in Youtube-BB, predictions are made by pooling over the ImageNet classes for a baseball player, a bridegroom, and a scuba diver. With CLIP we can instead generate a custom zero-shot classifier for each dataset directly based on its class names. In Figure 14 we see that this improves average effective robustness by 5% but is concentrated in large improvements on only a few datasets. Curiously, accuracy on ObjectNet also increases by 2.3%. Although the dataset was designed to closely overlap with ImageNet classes, using the names provided for each class by ObjectNet's creators still helps a small amount compared to using ImageNet class names and pooling predictions when necessary.
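As a concrete illustration of the dataset-specific classifiers just described, the sketch below builds a zero-shot classifier directly from a dataset's class names and labels images by cosine similarity between embeddings. The encode_text and encode_image functions and the prompt template are stand-ins (random projections so the example runs), not the encoders or exact prompts used in this work.

import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def encode_text(texts):
    # Stand-in text encoder returning unit-norm embeddings.
    v = rng.normal(size=(len(texts), DIM))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def encode_image(images):
    # Stand-in image encoder returning unit-norm embeddings.
    v = rng.normal(size=(len(images), DIM))
    return v / np.linalg.norm(v, axis=1, keepdims=True)

def build_zero_shot_classifier(class_names, template="a photo of a {}."):
    # One weight vector per class, taken from the text embedding of its name.
    return encode_text([template.format(name) for name in class_names])

def zero_shot_predict(images, class_weights):
    logits = encode_image(images) @ class_weights.T  # cosine similarities
    return logits.argmax(axis=1)

# A classifier built from a downstream dataset's own class names.
weights = build_zero_shot_classifier(["baseball player", "bridegroom", "scuba diver"])
print(zero_shot_predict([None, None], weights))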
While zero-shot CLIP improves effective robustness, Figure 14 shows that the benefit is almost entirely gone in a fully supervised setting. To better understand this difference, we investigate how effective robustness changes on the continuum from zero-shot to fully supervised. In Figure 15 we visualize the performance of 0-shot, 1-shot, 2-shot, 4-shot, ..., 128-shot, and fully supervised logistic regression classifiers on the best CLIP model's features.
[Figure 14, left panel: average top-1 accuracy on 7 natural distribution shift datasets versus average top-1 accuracy on class-subsampled ImageNet for adaptive zero-shot CLIP, ImageNet zero-shot CLIP, logistic regression CLIP, standard ImageNet training, robustness interventions, and models trained with more data, against an ideal robust model (y = x). Right panel: change from zero-shot ImageNet classifier accuracy (%). Adapt to ImageNet: ImageNet +9.2, ImageNetV2 +5.8, Youtube-BB +0.6, ImageNet Vid -0.5, ImageNet-A -1.9, ImageNet Sketch -2.8, ObjectNet -3.8, ImageNet-R -4.7. Adapt to class shift: Youtube-BB +26.9, ImageNet Vid +8.3, ObjectNet +2.3, all others 0.]

Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by 9.2%, it slightly reduces average robustness. (Left) Customizing zero-shot CLIP to each dataset improves robustness compared to using a single static zero-shot ImageNet classifier and pooling predictions across similar classes as in Taori et al. (2020). CLIP models adapted to ImageNet have similar effective robustness as the best prior ImageNet models. (Right) Details of per dataset changes in accuracy for the two robustness interventions. Adapting to ImageNet increases accuracy on ImageNetV2 noticeably but trades off accuracy on several other distributions. Dataset specific zero-shot classifiers can improve accuracy by a large amount but are limited to only a few datasets that include classes which don't perfectly align with ImageNet categories.

We see that while few-shot models also show higher effective robustness than existing models, this benefit fades as in-distribution performance increases with more training data and is mostly, though not entirely, gone for the fully supervised model. Additionally, zero-shot CLIP is notably more robust than a few-shot model with equivalent ImageNet performance. Across our experiments, high effective robustness seems to result from minimizing the amount of distribution specific training data a model has access to, but this comes at a cost of reducing dataset-specific performance.

Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites (as advocated by Yogatama et al. (2019) and Linzen (2020)) promotes the development of more robust systems and provides a more accurate assessment of performance. We are curious to see if the same results hold for zero-shot models in the field of NLP such as the GPT family. While Hendrycks et al. (2020b) has reported that pre-training improves relative robustness on sentiment analysis, Miller et al. (2020)'s study of the robustness of question answering models under natural distribution shift finds, similar to Taori et al. (2020), little evidence of effective robustness improvements to date.

4. Comparison to Human Performance

How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform in similar evaluation settings to CLIP, we evaluated humans on one of our tasks. We wanted to get a sense of how strong human zero-shot performance is at these tasks, and how much human performance is improved if they are shown one or two image samples. This can help us to compare task difficulty for humans and CLIP, and identify correlations and differences between them.

We had five different humans look at each of 3669 images in the test split of the Oxford-IIIT Pets dataset (Parkhi et al., 2012) and select which of the 37 cat or dog breeds best matched the image (or 'I don't know' if they were completely uncertain). In the zero-shot case the humans were given no examples of the breeds and asked to label them to the best of their ability without an internet search. In the one-shot experiment the humans were given one sample image of each breed and in the two-shot experiment they were given two sample images of each breed.5

5 There is not a perfect correspondence between the human few-shot tasks and the model's few-shot performance since the model cannot refer to sample images in the way that the humans can.

[Figure 15: average top-1 accuracy on 7 natural distribution shift datasets versus average top-1 accuracy on class-subsampled ImageNet for 0-shot, 1-shot, 2-shot, 4-shot, 8-shot, 16-shot, 32-shot, 64-shot, and 128-shot CLIP classifiers, compared with zero-shot CLIP (best model), standard ImageNet training, robustness interventions, and models trained with more data, against an ideal robust model (y = x).]

Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.

                    Accuracy   Majority Vote Accuracy   Accuracy     Majority Vote Accuracy
                               on Full Dataset          on Guesses   on Guesses
Zero-shot human     53.7       57.0                     69.7         63.9
Zero-shot CLIP      93.5       93.5                     93.5         93.5
One-shot human      75.7       80.3                     78.5         81.2
Two-shot human      75.7       85.0                     79.2         86.1

Table 2. Comparison of human performance on Oxford-IIIT Pets. As in Parkhi et al. (2012), the metric is average per-class classification accuracy. Most of the gain in performance when going from the human zero shot case to the human one shot case is on images that participants were highly uncertain on. "Guesses" refers to restricting the dataset to where participants selected an answer other than "I don't know", the "majority vote" is taking the most frequent (exclusive of ties) answer per image.

One possible concern was that the human workers were not sufficiently motivated in the zero-shot task. High human accuracy of 94% on the STL-10 dataset (Coates et al., 2011) and 97-100% accuracy on the subset of attention check images increased our trust in the human workers.

Interestingly, humans went from a performance average of 54% to 76% with just one training example per class, and the marginal gain from an additional training example is minimal. The gain in accuracy going from zero to one shot is almost entirely on images that humans were uncertain about. This suggests that humans "know what they don't know" and are able to update their priors on the images they are most uncertain about based on a single example. Given this, it seems that while CLIP is a promising training strategy for zero-shot performance (Figure 5) and does well on tests of natural distribution shift (Figure 13), there is a large difference between how humans learn from a few examples and the few-shot methods in this paper.

This suggests that there are still algorithmic improvements waiting to be made to decrease the gap between machine and human sample efficiency, as noted by Lake et al. (2016) and others. Because these few-shot evaluations of CLIP don't make effective use of prior knowledge and the humans do, we speculate that finding a method to properly integrate prior knowledge into few-shot learning is an important step in algorithmic improvements to CLIP. To our knowledge, using a linear classifier on top of the features of a high-quality pre-trained model is near state-of-the-art for few shot learning (Tian et al., 2020), which suggests that there is a gap between the best few-shot machine learning methods and human few-shot learning.

If we plot human accuracy vs CLIP's zero shot accuracy (Figure 16), we see that the hardest problems for CLIP are also hard for humans. To the extent that errors are consistent, our hypothesis is that this is due to at least two factors: noise in the dataset (including mislabeled images) and out of distribution images being hard for both humans and models.

5. Data Overlap Analysis

A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is important to investigate since, in a worst-case scenario, a complete copy of an evaluation dataset could leak into the pre-training dataset and invalidate the evaluation as a meaningful test of generalization. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis. Adding a new evaluation would require an expensive re-train or risk reporting an un-quantified benefit due to overlap.

Instead, we document how much overlap occurs and how performance changes due to these overlaps. To do this, we use the following procedure:

[Figure 16: zero-shot CLIP, one-shot human, and zero-shot human accuracy (%) for each of the 37 Oxford-IIIT Pets breeds, ordered by difficulty for CLIP.]

Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured as probability of the correct label.

1) For each evaluation dataset, we run a duplicate detector (see Appendix C) on its examples. We then manually inspect the found nearest neighbors and set a per dataset threshold to keep high precision while maximizing recall. Using this threshold, we then create two new subsets, Overlap, which contains all examples which have a similarity to a training example above the threshold, and Clean, which contains all examples that are below this threshold. We denote the unaltered full dataset All for reference. From this we first record the degree of data contamination as the ratio of the number of examples in Overlap to the size of All.

2) We then compute the zero-shot accuracy of CLIP RN50x64 on the three splits and report All - Clean as our main metric. This is the difference in accuracy due to contamination. When positive, it is our estimate of how much the overall reported accuracy on the dataset was inflated by over-fitting to overlapping data.

3) The amount of overlap is often small, so we also run a binomial significance test where we use the accuracy on Clean as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. We also calculate 99.5% Clopper-Pearson confidence intervals on Dirty as another check.

A summary of this analysis is presented in Figure 17. Out of 35 datasets studied, 9 datasets have no detected overlap at all. Most of these datasets are synthetic or specialized, making them unlikely to be posted as normal images on the internet (for instance MNIST, CLEVR, and GTSRB), or are guaranteed to have no overlap due to containing novel data from after the date our dataset was created (ObjectNet and Hateful Memes). This demonstrates our detector has a low false-positive rate, which is important as false positives would under-estimate the effect of contamination in our analysis. There is a median overlap of 2.2% and an average overlap of 3.2%. Due to this small amount of overlap, overall accuracy is rarely shifted by more than 0.1%, with only 7 datasets above this threshold. Of these, only 2 are statistically significant after Bonferroni correction. The max detected improvement is only 0.6% on Birdsnap, which has the second largest overlap at 12.1%. The largest overlap is for Country211 at 21.5%. This is due to it being constructed out of YFCC100M, which our pre-training dataset contains a filtered subset of. Despite this large overlap there is only a 0.2% increase in accuracy on Country211. This may be because the training text accompanying an example is often not related to the specific task a downstream eval measures. Country211 measures geo-localization ability, but inspecting the training text for these duplicates showed they often do not mention the location of the image.

We are aware of two potential concerns with our analysis. First, our detector is not perfect. While it achieves near 100% accuracy on its proxy training task, and manual inspection plus threshold tuning results in very high precision with good recall among the found nearest-neighbors, we cannot tractably check its recall across 400 million examples. Another potential confounder of our analysis is that the underlying data distribution may shift between the Overlap and Clean subsets. For example, on Kinetics-700 many "overlaps" are in fact all black transition frames. This explains why Kinetics-700 has an apparent 20% accuracy drop on Overlap. We suspect more subtle distribution shifts likely exist. One possibility we noticed on CIFAR-100 is that, due to the very low resolution of its images, many duplicates were false positives of small objects such as birds or planes. Changes in accuracy could instead be due to changes in the class distribution or difficulty of the duplicates. Unfortunately, these distribution and difficulty shifts could also mask the effects of over-fitting.

However, these results closely follow the findings of similar duplicate analysis in previous work on large scale pre-training. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates and found minimal changes in overall performance. Importantly, Kolesnikov et al. (2019) also compared the alternative de-duplication strategy discussed in the introduction to this section with the approach we settled on and observed little difference between the two approaches.
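A minimal sketch of the All - Clean metric and the significance checks from steps 2) and 3) is given below, assuming the per-split counts of correct predictions are already available. The counts are illustrative placeholders; scipy supplies the one-tailed binomial test and a Clopper-Pearson interval.

from scipy.stats import binomtest, beta

def contamination_report(correct_clean, n_clean, correct_overlap, n_overlap):
    acc_clean = correct_clean / n_clean
    acc_all = (correct_clean + correct_overlap) / (n_clean + n_overlap)

    # One-tailed (greater) binomial test with Clean accuracy as the null rate.
    p_value = binomtest(correct_overlap, n_overlap, p=acc_clean,
                        alternative="greater").pvalue

    # 99.5% Clopper-Pearson interval for accuracy on the Overlap subset.
    alpha = 0.005
    lo = beta.ppf(alpha / 2, correct_overlap, n_overlap - correct_overlap + 1)
    hi = beta.ppf(1 - alpha / 2, correct_overlap + 1, n_overlap - correct_overlap)
    return {"All - Clean": acc_all - acc_clean, "p_value": p_value, "overlap_ci": (lo, hi)}

print(contamination_report(correct_clean=8500, n_clean=10000,
                           correct_overlap=190, n_overlap=220))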
[Figure 17: (left) difference in zero-shot accuracy on overlapping vs. clean data (%) and (right) overall accuracy change due to overlap (%), each plotted against detected data overlap (%), with points marked by significance level (p < 1e-3, p < 0.05, p > 0.05); labeled datasets include Birdsnap, CIFAR-100, FER2013, SUN397, Stanford Cars, Country211, ImageNet Sketch, and Kinetics-700.]

Figure 17. Few statistically significant improvements in accuracy due to detected data overlap. (Left) While several datasets have up to ±20% apparent differences in zero-shot accuracy on detected overlapping vs clean examples, only 5 datasets out of 35 total have 99.5% Clopper-Pearson confidence intervals that exclude a 0% accuracy difference. 2 of these datasets do worse on overlapping data. (Right) Since the percentage of detected overlapping examples is almost always in the single digits, the overall test accuracy gain due to overlap is much smaller, with the largest estimated increase being only 0.6% on Birdsnap. Similarly, for only 6 datasets are the accuracy improvements statistically significant when calculated using a one-sided binomial test.

6. Limitations

There are still many limitations to CLIP. While several of these are discussed as part of analysis in various sections, we summarize and collect them here.

On datasets with training splits, the performance of zero-shot CLIP is on average competitive with the simple supervised baseline of a linear classifier on top of ResNet-50 features. On most of these datasets, the performance of this baseline is now well below the overall state of the art. Significant work is still needed to improve the task learning and transfer capabilities of CLIP. While scaling has so far steadily improved performance and suggests a route for continued improvement, we estimate around a 1000x increase in compute is required for zero-shot CLIP to reach overall state-of-the-art performance. This is infeasible to train with current hardware. Further research into improving upon the computational and data efficiency of CLIP will be necessary.

Analysis in Section 3.1 found that CLIP's zero-shot performance is still quite weak on several kinds of tasks. When compared to task-specific models, the performance of CLIP is poor on several types of fine-grained classification such as differentiating models of cars, species of flowers, and variants of aircraft. CLIP also struggles with more abstract and systematic tasks such as counting the number of objects in an image. Finally, for novel tasks which are unlikely to be included in CLIP's pre-training dataset, such as classifying the distance to the nearest car in a photo, CLIP's performance can be near random. We are confident that there are still many, many tasks where CLIP's zero-shot performance is near chance level.

While zero-shot CLIP generalizes well to many natural image distributions as investigated in Section 3.3, we've observed that zero-shot CLIP still generalizes poorly to data that is truly out-of-distribution for it. An illustrative example occurs for the task of OCR as reported in Appendix E. CLIP learns a high quality semantic OCR representation that performs well on digitally rendered text, which is common in its pre-training dataset, as evidenced by performance on Rendered SST2. However, CLIP only achieves 88% accuracy on the handwritten digits of MNIST. An embarrassingly simple baseline of logistic regression on raw pixels outperforms zero-shot CLIP. Both semantic and near-duplicate nearest-neighbor retrieval verify that there are almost no images that resemble MNIST digits in our pre-training dataset. This suggests CLIP does little to address the underlying problem of brittle generalization of deep learning models. Instead CLIP tries to circumvent the problem and hopes that, by training on such a large and varied dataset, all data will be effectively in-distribution. This is a naive assumption that, as MNIST demonstrates, is easy to violate.

Although CLIP can flexibly generate zero-shot classifiers for a wide variety of tasks and datasets, CLIP is still limited to choosing from only those concepts in a given zero-shot classifier. This is a significant restriction compared to a truly flexible approach like image captioning which could generate novel outputs.

Unfortunately, as described in Section 2.3, we found the computational efficiency of the image caption baseline we tried to be much lower than CLIP. A simple idea worth trying is joint training of a contrastive and generative objective with the hope of combining the efficiency of CLIP with the flexibility of a caption model. As another alternative, search could be performed at inference time over many natural language explanations of a given image, similar to the approach proposed in Learning with Latent Language (Andreas et al., 2017).

CLIP also does not address the poor data efficiency of deep learning. Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.

Our methodology has several significant limitations. Despite our focus on zero-shot transfer, we repeatedly queried performance on full validation sets to guide the development of CLIP. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we have reported results on Kornblith et al. (2019)'s 12 dataset evaluation suite as a standardized collection, our main results use a somewhat haphazardly assembled collection of 27 datasets that is undeniably co-adapted with the development and capabilities of CLIP. Creating a new benchmark of tasks designed explicitly to evaluate broad zero-shot transfer capabilities, rather than re-using existing supervised datasets, would help address these issues.

CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models (Bhargava & Forsyth, 2019). We refer readers to Section 7 for detailed analysis and quantification of these behaviors for CLIP as well as discussion of potential mitigation strategies.

While we have emphasized throughout this work that specifying image classifiers through natural language is a flexible and general interface, it has its own limitations. Many complex tasks and visual concepts can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. In our work, we fall back to fitting linear classifiers on top of CLIP's features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting. As discussed in Section 4, this is notably different from human performance, which shows a large increase from a zero to a one shot setting. Future work is needed to develop methods that combine CLIP's strong zero-shot performance with efficient few-shot learning.

7. Broader Impacts

CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. One can give it images of cats and dogs and ask it to classify cats, or give it images taken in a department store and ask it to classify shoplifters, a task with significant social implications and for which AI may be unfit. Like any image classification system, CLIP's performance and fitness for purpose need to be evaluated, and its broader impacts analyzed in context. CLIP also introduces a capability that will magnify and alter such issues: CLIP makes it possible to easily create your own classes for categorization (to 'roll your own classifier') without a need for re-training. This capability introduces challenges similar to those found in characterizing other large-scale generative models like GPT-3 (Brown et al., 2020); models that exhibit non-trivial zero-shot (or few-shot) generalization can have a vast range of capabilities, many of which are made clear only after testing for them.

Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.

In addition to the more than 30 datasets studied in earlier sections of this paper, we evaluate CLIP's performance on the FairFace benchmark and undertake exploratory bias probes. We then characterize the model's performance in a downstream task, surveillance, and discuss its usefulness as compared with other available systems. Many of CLIP's capabilities are omni-use in nature (e.g. OCR can be used to make scanned documents searchable, to power screen reading technologies, or to read license plates). Several of the capabilities measured, from action recognition, object classification, and geo-localization, to facial emotion recognition, can be used in surveillance. Given its social implications, we address this domain of use specifically in the Surveillance section.

We have also sought to characterize the social biases inherent to the model. Our bias tests represent our initial efforts to probe aspects of how the model responds in different scenarios, and are by nature limited in scope.

CLIP and models like it will need to be analyzed in relation to their specific deployments to understand how bias manifests and identify potential interventions. Further community exploration will be required to develop broader, more contextual, and more robust testing schemes so that AI developers can better characterize biases in general purpose computer vision models.

7.1. Bias

Algorithmic decisions, training data, and choices about how classes are defined and taxonomized (which we refer to informally as "class design") can all contribute to and amplify social biases and inequalities resulting from the use of AI systems (Noble, 2018; Bechmann & Bowker, 2019; Bowker & Star, 2000). Class design is particularly relevant to models like CLIP, since any developer can define a class and the model will provide some result.

In this section, we provide preliminary analysis of some of the biases in CLIP, using bias probes inspired by those outlined in Buolamwini & Gebru (2018) and Kärkkäinen & Joo (2019). We also conduct exploratory bias research intended to find specific examples of biases in the model, similar to that conducted by Solaiman et al. (2019).

We start by analyzing the performance of Zero-Shot CLIP on the face image dataset FairFace (Kärkkäinen & Joo, 2019)6 as an initial bias probe, then probe the model further to surface additional biases and sources of biases, including class design.

6 FairFace is a face image dataset designed to balance age, gender, and race, in order to reduce asymmetries common in previous face datasets. It categorizes gender into 2 groups: female and male, and race into 7 groups: White, Black, Indian, East Asian, Southeast Asian, Middle Eastern, and Latino. There are inherent problems with race and gender classifications, as e.g. Bowker & Star (2000) and Keyes (2018) have shown. While FairFace's dataset reduces the proportion of White faces, it still lacks representation of entire large demographic groups, effectively erasing such categories. We use the 2 gender categories and 7 race categories defined in the FairFace dataset in a number of our experiments not in order to reinforce or endorse the use of such reductive categories, but in order to enable us to make comparisons to prior work.

We evaluated two versions of CLIP on the FairFace dataset: a zero-shot CLIP model ("ZS CLIP"), and a logistic regression classifier fitted to FairFace's dataset on top of CLIP's features ("LR CLIP"). We find that LR CLIP gets higher accuracy on the FairFace dataset than both the ResNext-101 32x48d Instagram model ("Linear Probe Instagram") (Mahajan et al., 2018) and FairFace's own model on most of the classification tests we ran7. ZS CLIP's performance varies by category and is worse than that of FairFace's model for a few categories, and better for others. (See Table 3 and Table 4).

7 One challenge with this comparison is that the FairFace model uses binary classes for race ("White" and "Non-White"), instead of breaking down races into finer-grained sub-groups.

Additionally, we test the performance of the LR CLIP and ZS CLIP models across intersectional race and gender categories as they are defined in the FairFace dataset. We find that model performance on gender classification is above 95% for all race categories. Table 5 summarizes these results.

While LR CLIP achieves higher accuracy than the Linear Probe Instagram model on the FairFace benchmark dataset for gender, race and age classification of images by intersectional categories, accuracy on benchmarks offers only one approximation of algorithmic fairness, as Raji et al. (2020) have shown, and often fails as a meaningful measure of fairness in real world contexts. Even if a model has both higher accuracy and lower disparities in performance on different sub-groups, this does not mean it will have lower disparities in impact (Scheuerman et al., 2019). For example, higher performance on underrepresented groups might be used by a company to justify their use of facial recognition, and to then deploy it in ways that affect demographic groups disproportionately. Our use of facial classification benchmarks to probe for biases is not intended to imply that facial classification is an unproblematic task, nor to endorse the use of race, age, or gender classification in deployed contexts.

We also probed the model using classification terms with high potential to cause representational harm, focusing on denigration harms in particular (Crawford, 2017). We carried out an experiment in which the ZS CLIP model was required to classify 10,000 images from the FairFace dataset. In addition to the FairFace classes, we added in the following classes: 'animal', 'gorilla', 'chimpanzee', 'orangutan', 'thief', 'criminal' and 'suspicious person'. The goal of this experiment was to check if harms of denigration disproportionately impact certain demographic subgroups.

We found that 4.9% (confidence intervals between 4.6% and 5.4%) of the images were misclassified into one of the non-human classes we used in our probes ('animal', 'chimpanzee', 'gorilla', 'orangutan'). Out of these, 'Black' images had the highest misclassification rate (approximately 14%; confidence intervals between 12.6% and 16.4%) while all other races had misclassification rates under 8%. People aged 0-20 years had the highest proportion being classified into this category at 14%.

We also found that 16.5% of male images were misclassified into classes related to crime ('thief', 'suspicious person' and 'criminal') as compared to 9.8% of female images. Interestingly, we found that people aged 0-20 years old were more likely to fall under these crime-related classes (approximately 18%) compared to images of people in different age ranges (approximately 12% for people aged 20-60 and 0% for people over 70).
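The tabulation behind the probe above reduces to counting, per demographic group, how often the zero-shot prediction falls into a chosen set of harmful classes. A small sketch follows; the predictions and group labels are illustrative placeholders rather than model outputs or FairFace annotations.

from collections import defaultdict

NON_HUMAN = {"animal", "gorilla", "chimpanzee", "orangutan"}
CRIME_RELATED = {"thief", "criminal", "suspicious person"}

def rate_by_group(predictions, groups, target_classes):
    # Fraction of images in each group whose predicted class is in target_classes.
    hits, totals = defaultdict(int), defaultdict(int)
    for pred, group in zip(predictions, groups):
        totals[group] += 1
        hits[group] += pred in target_classes
    return {group: hits[group] / totals[group] for group in totals}

preds = ["thief", "lawyer", "animal", "teacher", "criminal", "doctor"]
races = ["White", "Black", "Black", "White", "Indian", "Latino"]
print(rate_by_group(preds, races, NON_HUMAN))
print(rate_by_group(preds, races, CRIME_RELATED))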

Model                    Race   Gender   Age
FairFace Model           93.7   94.2     59.7
Linear Probe CLIP        93.4   96.5     63.8
Zero-Shot CLIP           58.3   95.9     57.1
Linear Probe Instagram   90.8   93.2     54.2

Table 3. Percent accuracy on Race, Gender, and Age classification of images in FairFace category 'White'

Model                    Race   Gender   Age
FairFace Model           75.4   94.4     60.7
Linear Probe CLIP        92.8   97.7     63.1
Zero-Shot CLIP           91.3   97.2     54.3
Linear Probe Instagram   87.2   93.9     54.1

Table 4. Percent accuracy on Race, Gender, and Age classification of images in FairFace categories 'Black,' 'Indian,' 'East Asian,' 'Southeast Asian,' 'Middle Eastern,' and 'Latino' (grouped together as FairFace category 'Non-White')

Model                    Gender   Black   White   Indian   Latino   Middle Eastern   Southeast Asian   East Asian   Average
Linear Probe CLIP        Male     96.9    96.4    98.7     96.5     98.9             96.2              96.9         97.2
                         Female   97.9    96.7    97.9     99.2     97.2             98.5              97.3         97.8
                                  97.4    96.5    98.3     97.8     98.4             97.3              97.1         97.5
Zero-Shot CLIP           Male     96.3    96.4    97.7     97.2     98.3             95.5              96.8         96.9
                         Female   97.1    95.3    98.3     97.8     97.5             97.2              96.4         97.0
                                  96.7    95.9    98.0     97.5     98.0             96.3              96.6
Linear Probe Instagram   Male     92.5    94.8    96.2     93.1     96.0             92.7              93.4         94.1
                         Female   90.1    91.4    95.0     94.8     95.0             94.1              94.3         93.4
                                  91.3    93.2    95.6     94.0     95.6             93.4              93.9

Table 5. Percent accuracy on gender classification of images by FairFace race category

We found significant disparities in classifications across races for crime related terms, which is captured in Table 6.

Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category 'child' added to the categories. Our goal here was to see if this category would significantly change the behaviour of the model and shift how the denigration harms are distributed by age. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behaviour the model may exhibit, while also raising overarching questions about the use of face images to automatically classify people along such lines (Blaise Aguera y Arcas & Todorov, 2017).

The results of these probes can change based on the class categories one chooses to include as well as the specific language one uses to describe each class. Poor class design can lead to poor real world performance; this concern is particularly relevant to a model like CLIP, given how easily developers can design their own classes.

We also carried out experiments similar to those outlined by Schwemmer et al. (2020) to test how CLIP treated images of men and women differently using images of Members of Congress. As part of these experiments, we studied how certain additional design decisions such as deciding thresholds for labels can impact the labels output by CLIP and how biases manifest.

We carried out three experiments: we tested for accuracy on gender classification and we tested for how labels were differentially distributed across two different label sets. For our first label set, we used a label set of 300 occupations and for our second label set we used a combined set of labels that Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision returned for all the images.

We first simply looked into gender prediction performance of the model on the images of Members of Congress, in order to check to see if the model correctly recognized men as men and women as women given the image of a person who appeared to be in an official setting/position of power. We found that the model got 100% accuracy on the images. This is slightly better performance than the model's performance on the FairFace dataset.

Category                   Black   White   Indian   Latino   Middle Eastern   Southeast Asian   East Asian
Crime-related Categories   16.4    24.9    24.4     10.8     19.7             4.4               1.3
Non-human Categories       14.4    5.5     7.6      3.7      2.0              1.9               0.0

Table 6. Percent of images classified into crime-related and non-human categories by FairFace Race category. The label set included 7 FairFace race categories each for men and women (for a total of 14), as well as 3 crime-related categories and 4 non-human categories.

Category Label Set                      0-2    3-9    10-19   20-29   30-39   40-49   50-59   60-69   over 70
Default Label Set                       30.3   35.0   29.5    16.3    13.9    18.5    19.1    16.2    10.4
Default Label Set + 'child' category    2.3    4.3    14.7    15.0    13.4    18.2    18.6    15.5    9.4

Table 7. Percent of images classified into crime-related and non-human categories by FairFace Age category, showing comparison between results obtained using a default label set and a label set to which the label 'child' has been added. The default label set included 7 FairFace race categories each for men and women (for a total of 14), 3 crime-related categories and 4 non-human categories.

We hypothesize that one of the reasons for this is that all the images in the Members of Congress dataset were high quality and clear, with the people clearly centered, unlike those in the FairFace dataset.

In order to study how the biases in returned labels depend on the thresholds set for label probability, we did an experiment in which we set threshold values at 0.5% and 4.0%. We found that the lower threshold led to lower quality of labels. However, even the differing distributions of labels under this threshold can hold signals for bias. For example, we find that under the 0.5% threshold labels such as 'nanny' and 'housekeeper' start appearing for women whereas labels such as 'prisoner' and 'mobster' start appearing for men. This points to gendered associations similar to those that have previously been found for occupations (Schwemmer et al., 2020; Nosek et al., 2002; Bolukbasi et al., 2016).

At the higher 4% threshold, the labels with the highest probability across both genders include "lawmaker", "legislator" and "congressman". However, the presence of these biases amongst lower probability labels nonetheless points to larger questions about what 'sufficiently' safe behaviour may look like for deploying such systems.

When given the combined set of labels that Google Cloud Vision (GCV), Amazon Rekognition and Microsoft returned for all the images, similar to the biases Schwemmer et al. (2020) found in GCV systems, we found our system also disproportionately attached labels to do with hair and appearance in general to women more than men. For example, labels such as 'brown hair', 'blonde' and 'blond' appeared significantly more often for women. Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men, such as 'executive' and 'doctor'. Out of the only four occupations that it attached more often to women, three were 'newscaster', 'television presenter' and 'newsreader', and the fourth was 'Judge'. This is again similar to the biases found in GCV and points to historical gendered differences (Schwemmer et al., 2020).

Interestingly, when we lowered the threshold to 0.5% for this set of labels, we found that the labels disproportionately describing men also shifted to appearance oriented words such as 'suit', 'tie' and 'necktie' (Figure 18). Many occupation oriented words such as 'military person' and 'executive', which were not used to describe images of women at the higher 4% threshold, were used for both men and women at the lower 0.5% threshold, which could have caused the change in labels for men. The reverse was not true. Descriptive words used to describe women were still uncommon amongst men.

Design decisions at every stage of building a model impact how biases manifest, and this is especially true for CLIP given the flexibility it offers. In addition to choices about training data and model architecture, decisions about things like class designs and thresholding values can alter the labels a model outputs and as a result heighten or lower certain kinds of harm, such as those described by Crawford (2017). People designing and developing models and AI systems have considerable power. Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.

These experiments are not comprehensive. They illustrate potential issues stemming from class design and other sources of bias, and are intended to spark inquiry.
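The threshold experiment described above can be summarized as keeping only labels whose predicted probability exceeds a cutoff and then comparing how often each retained label is applied to images of women versus men. A sketch under assumed, illustrative scores follows; it is not the pipeline used for these results.

from collections import Counter

def label_counts_by_gender(label_scores, genders, threshold):
    # label_scores: per-image dict of {label: probability}; genders: per-image "F"/"M".
    counts = {"F": Counter(), "M": Counter()}
    for scores, gender in zip(label_scores, genders):
        for label, prob in scores.items():
            if prob >= threshold:
                counts[gender][label] += 1
    return counts

images = [{"lawmaker": 0.31, "blazer": 0.02, "suit": 0.006},
          {"lawmaker": 0.28, "suit": 0.05, "necktie": 0.01}]
genders = ["F", "M"]
print(label_counts_by_gender(images, genders, threshold=0.04))   # higher threshold
print(label_counts_by_gender(images, genders, threshold=0.005))  # lower threshold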

Top labels, Top labels,


images of women images of men
woman man
lady male
female face
looking player
senior citizen black
public speaking head
blonde facial expression
spokesperson suit
blazer photo
laughing military officer
hot walking
magenta photograph
bob cut elder
black hair display
pixie cut tie
pink shoulder
bangs frown
newsreader kid
purple Women necktie Women
blouse Men yellow Men
0 20 40 60 80 100 0 20 40 60 80 100
Frequency (%) Frequency (%)

Figure 18. CLIP performance on Member of Congress images when given the combined returned label set for the images from Google
Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision. The 20 most gendered labels for men and women were
identified with χ2 tests with the threshold at 0.5%. Labels are sorted by absolute frequencies. Bars denote the percentage of images for a
certain label by gender.
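As an illustration of the thresholding and χ2 analysis described above, the sketch below counts, per label, how often the label is returned above a probability threshold for images of women versus men, and ranks labels by a χ2 test on the resulting 2x2 contingency tables. This is a minimal reconstruction for exposition, not the analysis code used here; label_probs, genders, and label_names are hypothetical inputs.

```python
# Sketch: find the most gendered labels at a given probability threshold.
# `label_probs` (n_images x n_labels), `genders`, and `label_names` are
# hypothetical placeholders, not the data used in this analysis.
import numpy as np
from scipy.stats import chi2_contingency

def gendered_labels(label_probs, genders, label_names, threshold=0.005):
    genders = np.asarray(genders)
    present = label_probs >= threshold            # label "returned" for an image
    women, men = genders == "women", genders == "men"
    results = []
    for j, name in enumerate(label_names):
        # 2x2 contingency table: label present/absent vs. image gender
        table = np.array([
            [present[women, j].sum(), (~present[women, j]).sum()],
            [present[men, j].sum(),   (~present[men, j]).sum()],
        ])
        if table[:, 0].sum() == 0 or table[:, 1].sum() == 0:
            continue                              # degenerate table; skip
        chi2, p, _, _ = chi2_contingency(table)
        results.append((name, chi2, p,
                        present[women, j].mean(), present[men, j].mean()))
    # most gender-associated labels first
    return sorted(results, key=lambda r: r[1], reverse=True)
```

Ranking the returned tuples and keeping the top 20 per gender reproduces the kind of list shown in Figure 18, under the stated assumptions about the inputs.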

7.2. Surveillance

We next sought to characterize model performance in relation to a downstream task for which there is significant societal sensitivity: surveillance. Our analysis aims to better embody the characterization approach described above and to help orient the research community towards the potential future impacts of increasingly general purpose computer vision models and aid the development of norms and checks around such systems. Our inclusion of surveillance is not intended to indicate enthusiasm for this domain - rather, we think surveillance is an important domain to try to make predictions about given its societal implications (Zuboff, 2015; Browne, 2015).

We measure the model’s performance on classification of images from CCTV cameras and zero-shot celebrity identification. We first tested model performance on low-resolution images captured from surveillance cameras (e.g. CCTV cameras). We used the VIRAT dataset (Oh et al., 2011) and data captured by Varadarajan & Odobez (2009), which both consist of real world outdoor scenes with non-actors.

Given CLIP’s flexible class construction, we tested 515 surveillance images captured from 12 different video sequences on self-constructed general classes for coarse and fine grained classification. Coarse classification required the model to correctly identify the main subject of the image (i.e. determine if the image was a picture of an empty parking lot, school campus, etc.). For fine-grained classification, the model had to choose between two options constructed to determine if the model could identify the presence/absence of smaller features in the image such as a person standing in the corner.

For coarse classification, we constructed the classes by hand-captioning the images ourselves to describe the contents of the image and there were always at least 6 options for the model to choose from. Additionally, we carried out a ‘stress test’ where the class set included at least one more caption for something that was ‘close’ to the image (for example, ‘parking lot with white car’ vs. ‘parking lot with red car’). We found that the model had a top-1 accuracy of 91.8% on the CCTV images for the initial evaluation. The accuracy dropped significantly to 51.1% for the second evaluation, with the model incorrectly choosing the ‘close’ answer 40.7% of the time.
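As a concrete illustration of the coarse classification setup just described, the sketch below scores a single CCTV frame against hand-written captions, including one deliberately ‘close’ distractor. It uses the publicly released CLIP package as an example interface; the captions and file path are invented placeholders rather than the class sets used in this evaluation.

```python
# Sketch: zero-shot coarse classification of a CCTV frame against
# hand-written captions, including one 'close' stress-test distractor.
# The captions and file path below are invented placeholders.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

captions = [
    "an empty parking lot",
    "a parking lot with a white car",
    "a parking lot with a red car",   # the 'close' stress-test option
    "a school campus",
    "a loading dock",
    "a pedestrian crossing the street",
]

image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)
text = clip.tokenize(captions).to(device)

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

# cosine similarity via normalized dot products, softmaxed into class probabilities
image_features = image_features / image_features.norm(dim=-1, keepdim=True)
text_features = text_features / text_features.norm(dim=-1, keepdim=True)
probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)
print(captions[probs.argmax().item()])
```

Running the same loop with and without the ‘close’ captions in the class set gives the two evaluation conditions contrasted above.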
For fine-grained detection, the zero-shot model performed poorly, with results near random. Note that this experiment was targeted only towards detecting the presence or absence of small objects in image sequences.

We also tested CLIP’s zero-shot performance for ‘in the wild’ identity detection using the CelebA dataset.8

8 Note: The CelebA dataset is more representative of faces with lighter skin tones. Due to the nature of the dataset, we were not able to control for race, gender, age, etc.
We did this to evaluate the model’s performance for identity detection using just the publicly available data it was pre-trained on. While we tested this on a dataset of celebrities who have a larger number of images on the internet, we hypothesize that the number of images in the pre-training data needed for the model to associate faces with names will keep decreasing as models get more powerful (see Table 8), which has significant societal implications (Garvie, 2019). This mirrors recent developments in natural language processing, in which recent large language models trained on Internet data often exhibit a surprising ability to provide information related to relatively minor public figures (Brown et al., 2020).

Model            100 Classes    1k Classes    2k Classes
CLIP L/14            59.2           43.3          42.2
CLIP RN50x64         56.4           39.5          38.4
CLIP RN50x16         52.7           37.4          36.3
CLIP RN50x4          52.8           38.1          37.3

Table 8. CelebA Zero-Shot Top-1 Identity Recognition Accuracy.

We found that the model had 59.2% top-1 accuracy out of 100 possible classes for ‘in the wild’ 8k celebrity images. However, this performance dropped to 43.3% when we increased our class sizes to 1k celebrity names. This performance is not competitive when compared to production level models such as Google’s Celebrity Recognition (Google). However, what makes these results noteworthy is that this analysis was done using only zero-shot identification capabilities based on names inferred from pre-training data - we didn’t use any additional task-specific dataset, and so the (relatively) strong results further indicate that before deploying multimodal models, people will need to carefully study them for behaviors in a given context and domain.
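The protocol behind Table 8 can be sketched as zero-shot retrieval over name prompts, with the class set grown from 100 to 1k to 2k names. The code below is an illustrative approximation, not the evaluation code used here; the prompt template, celeb_names, images, and true_ids are assumptions.

```python
# Sketch: zero-shot identity recognition as retrieval over name prompts,
# mirroring the 100 / 1k / 2k class-set sizes of Table 8.
# `celeb_names`, `images`, and `true_ids` are hypothetical placeholders.
import torch
import clip

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-L/14", device=device)

def top1_accuracy(image_features, true_ids, name_features, n_classes):
    # restrict the label space to the first n_classes names
    logits = image_features @ name_features[:n_classes].T
    pred = logits.argmax(dim=-1)
    valid = true_ids < n_classes            # only score identities inside the set
    return (pred[valid] == true_ids[valid]).float().mean().item()

with torch.no_grad():
    prompts = clip.tokenize([f"a photo of {name}" for name in celeb_names]).to(device)
    name_features = model.encode_text(prompts).float()
    name_features /= name_features.norm(dim=-1, keepdim=True)

    image_features = model.encode_image(images.to(device)).float()
    image_features /= image_features.norm(dim=-1, keepdim=True)

for n in (100, 1000, 2000):
    print(n, top1_accuracy(image_features, true_ids, name_features, n))
```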
CLIP offers significant benefit for tasks that have relatively little data given its zero-shot capabilities. However, large datasets and high performing supervised models exist for many in-demand surveillance tasks such as facial recognition. As a result, CLIP’s comparative appeal for such uses is low. Additionally, CLIP is not designed for common surveillance-relevant tasks like object detection and semantic segmentation. This means it has limited use for certain surveillance tasks when models that are designed with these uses in mind such as Detectron2 (Wu et al., 2019) are widely available.

However, CLIP does unlock a certain aspect of usability given how it removes the need for training data. Thus, CLIP and similar models could enable bespoke, niche surveillance use cases for which no well-tailored models or datasets exist, and could lower the skill requirements to build such applications. As our experiments show, zero-shot CLIP displays non-trivial, but not exceptional, performance on a few surveillance relevant tasks today.

7.3. Future Work

This preliminary analysis is intended to illustrate some of the challenges that general purpose computer vision models pose and to give a glimpse into their biases and impacts. We hope that this work motivates future research on the characterization of the capabilities, shortcomings, and biases of such models, and we are excited to engage with the research community on such questions.

We believe one good step forward is community exploration to further characterize the capabilities of models like CLIP and - crucially - identify application areas where they have promising performance and areas where they may have reduced performance.9 This process of characterization can help researchers increase the likelihood models are used beneficially by:

• Identifying potentially beneficial downstream uses of models early in the research process, enabling other researchers to think about applications.

• Surfacing tasks with significant sensitivity and a large set of societal stakeholders, which may call for intervention by policymakers.

• Better characterizing biases in models, alerting other researchers to areas of concern and areas for interventions.

• Creating suites of tests to evaluate systems like CLIP on, so we can better characterize model capabilities earlier in the development cycle.

• Identifying potential failure modes and areas for further work.

We plan to contribute to this work, and hope this analysis provides some motivating examples for subsequent research.

9 A model could be unfit for use due to inadequate performance or due to the inappropriateness of AI use in the application area itself.

8. Related Work

Any model that leverages written, spoken, signed or any other form of human language as part of its training signal is arguably using natural language as a source of supervision. This is an admittedly extremely broad area and covers most work in the field of distributional semantics including topic models (Blei et al., 2003), word, sentence, and paragraph vectors (Mikolov et al., 2013; Kiros et al., 2015; Le &
Mikolov, 2014), and language models (Bengio et al., 2003). It also includes much of the broader field of NLP that deals with predicting or modeling sequences of natural language in some way. Work in NLP intentionally leveraging natural language supervision in the form of explanations, feedback, instructions, and advice for tasks such as classification (as opposed to the commonly used representation of supervision as a set of arbitrarily encoded discrete category labels) has been explored in many creative and advanced ways. Dialog based learning (Weston, 2016; Li et al., 2016; Hancock et al., 2019) develops techniques to learn from interactive natural language feedback in dialog. Several papers have leveraged semantic parsing to convert natural language explanations into features (Srivastava et al., 2017) or additional training labels (Hancock et al., 2018). More recently, ExpBERT (Murty et al., 2020) uses feature representations produced by conditioning a deep contextual language model on natural language explanations and descriptions of relations to improve performance on the task of relation extraction.

CLIP is an example of using natural language as a training signal for learning about a domain other than language. In this context, the earliest use of the term natural language supervision that we are aware of is the work of Ramanathan et al. (2013) which showed that natural language descriptions could be used alongside other sources of supervision to improve performance on the task of video event understanding. However, as mentioned in the introduction and approach section, methods of leveraging natural language descriptions in computer vision well predate the use of this specific term, especially for image retrieval (Mori et al., 1999) and object classification (Wang et al., 2009). Other early work leveraged tags (but not natural language) associated with images for the task of semantic segmentation (Barnard et al., 2003). More recently, He & Peng (2017) and Liang et al. (2020) demonstrated using natural language descriptions and explanations to improve fine-grained visual classification of birds. Others have investigated how grounded language can be used to improve visual representations and classifiers on the ShapeWorld dataset (Kuhnle & Copestake, 2017; Andreas et al., 2017; Mu et al., 2019). Finally, techniques which combine natural language with reinforcement learning environments (Narasimhan et al., 2015) have demonstrated exciting emergent behaviors such as systematically accomplishing zero-shot tasks (Hill et al., 2019).

CLIP’s pre-training task optimizes for text-image retrieval. This area of research dates back to the mid-90s with the previously mentioned Mori et al. (1999) as representative of early work. While initial efforts focused primarily on predictive objectives, over time research shifted towards learning joint multi-modal embedding spaces with techniques like kernel Canonical Correlation Analysis and various ranking objectives (Weston et al., 2010; Socher & Fei-Fei, 2010; Hodosh et al., 2013). Over time work explored many combinations of training objective, transfer, and more expressive models and steadily improved performance (Frome et al., 2013; Socher et al., 2014; Karpathy et al., 2014; Kiros et al., 2014; Faghri et al., 2017).

Other work has leveraged natural language supervision for domains other than images. Stroud et al. (2020) explores large scale representation learning by training a system to pair descriptive text with videos instead of images. Several works have explored using dense spoken natural language supervision for videos (Miech et al., 2019; 2020b). When considered together with CLIP, these works suggest that large scale natural language supervision is a promising way to learn high quality perceptual systems for many domains. Alayrac et al. (2020) extended this line of work to an additional modality by adding raw audio as an additional supervision source and demonstrated benefits from combining all three sources of supervision.

As part of our work on CLIP we also construct a new dataset of image-text pairs. Modern work on image-text retrieval has relied on a set of crowd-sourced sentence level image caption evaluation datasets like Pascal1K (Rashtchian et al., 2010), Flickr8K (Hodosh et al., 2013), and Flickr30K (Young et al., 2014). However, these datasets are still relatively small and limit achievable performance. Several methods have been proposed to create larger datasets automatically with Ordonez et al. (2011) as a notable early example. In the deep learning era, Mithun et al. (2018) demonstrated an additional set of (image, text) pairs collected from the internet could improve retrieval performance, and several new automatically constructed datasets such as Conceptual Captions (Sharma et al., 2018), LAIT (Qi et al., 2020), and OCR-CC (Yang et al., 2020) have been created. However, these datasets still use significantly more aggressive filtering or are designed for a specific task such as OCR and as a result are still much smaller than WIT with between 1 and 10 million training examples.

A related idea to CLIP is webly supervised learning. This line of work builds image datasets by querying image search engines for terms and uses the queries as the labels for the returned images (Fergus et al., 2005). Classifiers trained on these large but noisily labeled datasets can be competitive with those trained on smaller carefully labeled datasets. These image-query pairs are also often used to improve performance on standard datasets as additional training data (Chen & Gupta, 2015). CLIP also uses search queries as part of its dataset creation process. However, CLIP only uses full text sequences co-occurring with images as supervision rather than just the queries, which are often only a single word or short n-gram. We also restrict this step in CLIP to text-only querying for sub-string matches while most webly supervised work uses standard image search
engines which have their own complex retrieval and filtering pipelines that often involve computer vision systems. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has a notably similar ambition and goal to CLIP.

Finally, CLIP is related to a recent burst of activity on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020). This line of work focuses on richly connecting vision and language in order to solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment. These approaches leverage impressively engineered models which combine 3 (or more) pre-trained subsystems, typically an image feature model, a region proposal / object detection model, and a pre-trained masked language model such as BERT. These systems are then jointly fine-tuned via various training objectives on image-text pairs and applied to the aforementioned tasks and achieve impressive results. CLIP is instead focused on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work.
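As a minimal sketch of this single point of interaction, the snippet below embeds both modalities, L2-normalizes the embeddings, and compares them with a scaled dot product; image_encoder, text_encoder, and logit_scale stand in for the learned components rather than the released weights.

```python
# Minimal sketch of CLIP's only cross-modal interaction: a dot product
# between L2-normalized image and text embeddings in a shared space.
# `image_encoder`, `text_encoder`, and `logit_scale` are placeholders.
import torch
import torch.nn.functional as F

def similarity_logits(images, texts, image_encoder, text_encoder, logit_scale):
    img = F.normalize(image_encoder(images), dim=-1)   # [n, d]
    txt = F.normalize(text_encoder(texts), dim=-1)     # [m, d]
    return logit_scale * img @ txt.T                   # [n, m] scaled cosine similarities
```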
9. Conclusion

We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models, although there is still room for much improvement.

ACKNOWLEDGMENTS

We’d like to thank the millions of people involved in creating the data CLIP is trained on. We’d also like to thank Susan Zhang for her work on image conditional language models while at OpenAI, Ishaan Gulrajani for catching an error in the pseudocode, and Irene Solaiman, Miles Brundage, and Gillian Hadfield for their thoughtful feedback on the broader impacts section of the paper. We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used. Finally, we’d also like to thank the developers of the many software packages used throughout this project including, but not limited to, NumPy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011).

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th {USENIX} symposium on operating systems design and implementation ({OSDI} 16), pp. 265–283, 2016.

Alayrac, J.-B., Recasens, A., Schneider, R., Arandjelović, R., Ramapuram, J., De Fauw, J., Smaira, L., Dieleman, S., and Zisserman, A. Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228, 2020.

Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-S., and Nguyen, A. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4845–4854, 2019.

Andreas, J., Klein, D., and Levine, S. Learning with latent language. arXiv preprint arXiv:1711.00482, 2017.

Assiri, Y. Stochastic optimization of plain convolutional neural networks with simple methods. arXiv preprint arXiv:2001.08856, 2020.

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545, 2019.

Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, pp. 9453–9463, 2019.

Barnard, K., Duygulu, P., Forsyth, D., Freitas, N. d., Blei, D. M., and Jordan, M. I. Matching words and pictures. Journal of machine learning research, 3(Feb):1107–1135, 2003.

Bechmann, A. and Bowker, G. C. Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society, 6(1):205395171881956, January 2019. doi: 10.1177/2053951718819569. URL https://doi.org/10.1177/2053951718819569.

Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
Bhargava, S. and Forsyth, D. Exposing and correcting the Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and
gender bias in image captioning datasets and models. Hinton, G. Big self-supervised models are strong semi-
arXiv preprint arXiv:1912.00578, 2019. supervised learners. arXiv preprint arXiv:2006.10029,
2020d.
Blaise Aguera y Arcas, M. M. and Todorov,
A. Physiognomy’s new clothes. 2017. Chen, X. and Gupta, A. Webly supervised learning of
URL https://medium.com/@blaisea/ convolutional networks. In Proceedings of the IEEE
physiognomys-new-clothes-f2d4b59fdd6a. International Conference on Computer Vision, pp. 1431–
1439, 2015.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet
allocation. Journal of machine Learning research, 3(Jan): Chen, X., Fan, H., Girshick, R., and He, K. Improved
993–1022, 2003. baselines with momentum contrastive learning. arXiv
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and preprint arXiv:2003.04297, 2020e.
Kalai, A. T. Man is to computer programmer as woman Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z.,
is to homemaker? debiasing word embeddings. Advances Cheng, Y., and Liu, J. Uniter: Learning universal image-
in neural information processing systems, 29:4349–4357, text representations. arXiv preprint arXiv:1909.11740,
2016. 2019.
Bowker, G. C. and Star, S. L. Sorting things out: Classifica-
Cheng, G., Han, J., and Lu, X. Remote sensing image scene
tion and its consequences. MIT press, 2000.
classification: Benchmark and state of the art. Proceed-
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, ings of the IEEE, 105(10):1865–1883, 2017.
J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G.,
Askell, A., et al. Language models are few-shot learners. Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J.,
arXiv preprint arXiv:2005.14165, 2020. and Dahl, G. E. On empirical comparisons of optimiz-
ers for deep learning. arXiv preprint arXiv:1910.05446,
Browne, S. Dark Matters: Surveillance of Blackness. Duke 2019.
University Press, 2015.
Coates, A., Ng, A., and Lee, H. An analysis of single-
Bulent Sariyildiz, M., Perez, J., and Larlus, D. Learning layer networks in unsupervised feature learning. In Pro-
visual representations with caption annotations. arXiv ceedings of the fourteenth international conference on
e-prints, pp. arXiv–2008, 2020. artificial intelligence and statistics, pp. 215–223, 2011.
Buolamwini, J. and Gebru, T. Gender shades: Intersec- Crawford, K. The trouble with bias. NIPS 2017
tional accuracy disparities in commercial gender classi- Keynote, 2017. URL https://www.youtube.com/
fication. In Conference on fairness, accountability and watch?v=fMym_BKWQzk.
transparency, pp. 77–91, 2018.
Dai, A. M. and Le, Q. V. Semi-supervised sequence learning.
Carreira, J., Noland, E., Hillier, C., and Zisserman, A. A
In Advances in neural information processing systems,
short note on the kinetics-700 human action dataset. arXiv
pp. 3079–3087, 2015.
preprint arXiv:1907.06987, 2019.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D’Amour, A., Heller, K., Moldovan, D., Adlam, B., Ali-
D., and Sutskever, I. Generative pretraining from pixels. panahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein,
In International Conference on Machine Learning, pp. J., Hoffman, M. D., et al. Underspecification presents
1691–1703. PMLR, 2020a. challenges for credibility in modern machine learning.
arXiv preprint arXiv:2011.03395, 2020.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training
deep nets with sublinear memory cost. arXiv preprint Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
arXiv:1604.06174, 2016. Fei, L. ImageNet: A Large-Scale Hierarchical Image
Database. In CVPR09, 2009.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A
simple framework for contrastive learning of visual rep- Deng, J., Berg, A. C., Satheesh, S., Su, H., Khosla, A.,
resentations. arXiv preprint arXiv:2002.05709, 2020b. and Fei-Fei, L. Ilsvrc 2012, 2012. URL http://www.
image-net.org/challenges/LSVRC/2012/.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and
Hinton, G. Big self-supervised models are strong semi- Desai, K. and Johnson, J. Virtex: Learning visual rep-
supervised learners. arXiv preprint arXiv:2006.10029, resentations from textual annotations. arXiv preprint
2020c. arXiv:2006.06666, 2020.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Geiger, A., Lenz, P., and Urtasun, R. Are we ready for
Pre-training of deep bidirectional transformers for lan- autonomous driving? the kitti vision benchmark suite. In
guage understanding. arXiv preprint arXiv:1810.04805, Conference on Computer Vision and Pattern Recognition
2018. (CVPR), 2012.

Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wich-
and Sutskever, I. Jukebox: A generative model for music. mann, F. A., and Brendel, W. Imagenet-trained cnns are
arXiv preprint arXiv:2005.00341, 2020. biased towards texture; increasing shape bias improves ac-
curacy and robustness. arXiv preprint arXiv:1811.12231,
Divvala, S. K., Farhadi, A., and Guestrin, C. Learning
2018.
everything about anything: Webly-supervised visual con-
cept learning. In Proceedings of the IEEE Conference Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R.,
on Computer Vision and Pattern Recognition, pp. 3270– Brendel, W., Bethge, M., and Wichmann, F. A. Short-
3277, 2014. cut learning in deep neural networks. arXiv preprint
Dodge, S. and Karam, L. A study and comparison of human arXiv:2004.07780, 2020.
and deep learning recognition performance under visual
Gomez, L., Patel, Y., Rusiñol, M., Karatzas, D., and Jawahar,
distortions. In 2017 26th international conference on
C. Self-supervised learning of visual features through
computer communication and networks (ICCCN), pp. 1–
embedding images into text topic spaces. In Proceedings
7. IEEE, 2017.
of the IEEE Conference on Computer Vision and Pattern
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, Recognition, pp. 4230–4239, 2017.
D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M.,
Heigold, G., Gelly, S., et al. An image is worth 16x16 Goodfellow, I. J., Shlens, J., and Szegedy, C. Explain-
words: Transformers for image recognition at scale. arXiv ing and harnessing adversarial examples. arXiv preprint
preprint arXiv:2010.11929, 2020. arXiv:1412.6572, 2014.

Elhoseiny, M., Saleh, B., and Elgammal, A. Write a classi- Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A.,
fier: Zero-shot learning using purely textual descriptions. Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler,
In Proceedings of the IEEE International Conference on D., Lee, D.-H., et al. Challenges in representation learn-
Computer Vision, pp. 2584–2591, 2013. ing: A report on three machine learning contests. Neural
Networks, 64:59–63, 2015.
Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. Vse++: Im-
proving visual-semantic embeddings with hard negatives. Google. Google cloud api: Celebrity recognition. URL
arXiv preprint arXiv:1707.05612, 2017. https://cloud.google.com/vision/docs/
celebrity-recognition.
Fergus, R., Fei-Fei, L., Perona, P., and Zisserman, A. Learn-
ing object categories from google’s image search. In Griewank, A. and Walther, A. Algorithm 799: revolve: an
Tenth IEEE International Conference on Computer Vision implementation of checkpointing for the reverse or ad-
(ICCV’05) Volume 1, volume 2, pp. 1816–1823. IEEE, joint mode of computational differentiation. ACM Trans-
2005. actions on Mathematical Software (TOMS), 26(1):19–45,
2000.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J.,
Ranzato, M., and Mikolov, T. Devise: A deep visual- Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond,
semantic embedding model. In Advances in neural infor- P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo,
mation processing systems, pp. 2121–2129, 2013. Z. D., Azar, M. G., et al. Bootstrap your own latent: A
Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., and Liu, J. new approach to self-supervised learning. arXiv preprint
Large-scale adversarial training for vision-and-language arXiv:2006.07733, 2020.
representation learning. arXiv preprint arXiv:2006.06195,
Ha, D., Dai, A., and Le, Q. V. Hypernetworks. arXiv
2020.
preprint arXiv:1609.09106, 2016.
Gao, T., Fisch, A., and Chen, D. Making pre-trained lan-
guage models better few-shot learners. arXiv preprint Hancock, B., Bringmann, M., Varma, P., Liang, P., Wang,
arXiv:2012.15723, 2020. S., and Ré, C. Training classifiers with natural language
explanations. In Proceedings of the conference. Associ-
Garvie, C., May 2019. URL https://www. ation for Computational Linguistics. Meeting, volume
flawedfacedata.com/. 2018, pp. 1884. NIH Public Access, 2018.
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J. Henaff, O. Data-efficient image recognition with contrastive
Learning from dialogue after deployment: Feed yourself, predictive coding. In International Conference on Ma-
chatbot! arXiv preprint arXiv:1901.05415, 2019. chine Learning, pp. 4182–4192. PMLR, 2020.

Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, Hendrycks, D. and Dietterich, T. Benchmarking neural
R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., network robustness to common corruptions and perturba-
Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van tions. arXiv preprint arXiv:1903.12261, 2019.
Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del
Hendrycks, D. and Gimpel, K. Gaussian error linear units
Rı́o, J., Wiebe, M., Peterson, P., Gérard-Marchant, P.,
(gelus). arXiv preprint arXiv:1606.08415, 2016.
Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H.,
Gohlke, C., and Oliphant, T. E. Array programming Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and
with NumPy. Nature, 585:357–362, 2020. doi: 10.1038/ Song, D. Natural adversarial examples. arXiv preprint
s41586-020-2649-2. arXiv:1907.07174, 2019.
Hays, J. and Efros, A. A. Im2gps: estimating geographic Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F.,
information from a single image. In 2008 ieee confer- Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M.,
ence on computer vision and pattern recognition, pp. 1–8. et al. The many faces of robustness: A critical analy-
IEEE, 2008. sis of out-of-distribution generalization. arXiv preprint
arXiv:2006.16241, 2020a.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep
into rectifiers: Surpassing human-level performance on Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan,
imagenet classification. In Proceedings of the IEEE inter- R., and Song, D. Pretrained transformers improve out-of-
national conference on computer vision, pp. 1026–1034, distribution robustness. arXiv preprint arXiv:2004.06100,
2015. 2020b.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H.,
ing for image recognition. In Proceedings of the IEEE Kianinejad, H., Patwary, M., Ali, M., Yang, Y., and Zhou,
conference on computer vision and pattern recognition, Y. Deep learning scaling is predictable, empirically. arXiv
pp. 770–778, 2016a. preprint arXiv:1712.00409, 2017.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learn- Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick,
ing for image recognition. In Proceedings of the IEEE M., McClelland, J. L., and Santoro, A. Environmental
conference on computer vision and pattern recognition, drivers of systematicity and generalization in a situated
pp. 770–778, 2016b. agent. In International Conference on Learning Repre-
sentations, 2019.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Mo-
mentum contrast for unsupervised visual representation Hodosh, M., Young, P., and Hockenmaier, J. Framing image
learning. In Proceedings of the IEEE/CVF Conference description as a ranking task: Data, models and evaluation
on Computer Vision and Pattern Recognition, pp. 9729– metrics. Journal of Artificial Intelligence Research, 47:
9738, 2020. 853–899, 2013.
Hongsuck Seo, P., Weyand, T., Sim, J., and Han, B. Cplanet:
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M.
Enhancing image geolocalization by combinatorial parti-
Bag of tricks for image classification with convolutional
tioning of maps. In Proceedings of the European Confer-
neural networks. In Proceedings of the IEEE Conference
ence on Computer Vision (ECCV), pp. 536–551, 2018.
on Computer Vision and Pattern Recognition, pp. 558–
567, 2019. Howard, J. and Ruder, S. Universal language model
fine-tuning for text classification. arXiv preprint
He, X. and Peng, Y. Fine-grained image classification via
arXiv:1801.06146, 2018.
combining vision and language. In Proceedings of the
IEEE Conference on Computer Vision and Pattern Recog- Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran,
nition, pp. 5994–6002, 2017. B., and Madry, A. Adversarial examples are not bugs,
they are features. In Advances in Neural Information
Helber, P., Bischke, B., Dengel, A., and Borth, D. Eurosat: Processing Systems, pp. 125–136, 2019.
A novel dataset and deep learning benchmark for land
use and land cover classification. IEEE Journal of Se- Ioffe, S. and Szegedy, C. Batch normalization: Accelerating
lected Topics in Applied Earth Observations and Remote deep network training by reducing internal covariate shift.
Sensing, 12(7):2217–2226, 2019. arXiv preprint arXiv:1502.03167, 2015.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, general visual representations for transfer. arXiv preprint
A. Deep structured output learning for unconstrained text arXiv:1912.11370, 2019.
recognition. arXiv preprint arXiv:1412.5903, 2014.
Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet
Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial models transfer better? In Proceedings of the IEEE
transformer networks. Advances in neural information conference on computer vision and pattern recognition,
processing systems, 28:2017–2025, 2015. pp. 2661–2671, 2019.

Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K.,
Lawrence Zitnick, C., and Girshick, R. Clevr: A diag- Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma,
nostic dataset for compositional language and elementary D. A., et al. Visual genome: Connecting language and
visual reasoning. In Proceedings of the IEEE Confer- vision using crowdsourced dense image annotations. In-
ence on Computer Vision and Pattern Recognition, pp. ternational journal of computer vision, 123(1):32–73,
2901–2910, 2017. 2017.
Joulin, A., Van Der Maaten, L., Jabri, A., and Vasilache, N. Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet
Learning visual features from large weakly supervised classification with deep convolutional neural networks.
data. In European Conference on Computer Vision, pp. In Advances in neural information processing systems,
67–84. Springer, 2016. pp. 1097–1105, 2012.
Kalfaoglu, M., Kalkan, S., and Alatan, A. A. Late temporal Kuhnle, A. and Copestake, A. Shapeworld-a new test
modeling in 3d cnn architectures with bert for action methodology for multimodal language understanding.
recognition. arXiv preprint arXiv:2008.01232, 2020. arXiv preprint arXiv:1704.04517, 2017.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Kärkkäinen, K. and Joo, J. Fairface: Face attribute dataset
Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and for balanced race, gender, and age, 2019.
Amodei, D. Scaling laws for neural language models.
arXiv preprint arXiv:2001.08361, 2020. Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gersh-
man, S. J. Building machines that learn and think like
Karpathy, A., Joulin, A., and Fei-Fei, L. F. Deep fragment people, 2016.
embeddings for bidirectional image sentence mapping.
In Advances in neural information processing systems, Lampert, C. H., Nickisch, H., and Harmeling, S. Learning
pp. 1889–1897, 2014. to detect unseen object classes by between-class attribute
transfer. In 2009 IEEE Conference on Computer Vision
Keyes, O. The misgendering machines: Trans/hci implica- and Pattern Recognition, pp. 951–958. IEEE, 2009.
tions of automatic gender recognition. Proceedings of the
ACM on Human-Computer Interaction, 2(CSCW):1–22, Larochelle, H., Erhan, D., and Bengio, Y. Zero-data learning
2018. of new tasks. 2008.

Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Le, Q. and Mikolov, T. Distributed representations of sen-
Ringshia, P., and Testuggine, D. The hateful memes tences and documents. In International conference on
challenge: Detecting hate speech in multimodal memes. machine learning, pp. 1188–1196, 2014.
arXiv preprint arXiv:2005.04790, 2020.
LeCun, Y. The mnist database of handwritten digits.
Kingma, D. P. and Ba, J. Adam: A method for stochastic http://yann. lecun. com/exdb/mnist/.
optimization. arXiv preprint arXiv:1412.6980, 2014.
Lee, D.-H. Pseudo-label: The simple and efficient semi-
Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying supervised learning method for deep neural networks.
visual-semantic embeddings with multimodal neural lan-
guage models. arXiv preprint arXiv:1411.2539, 2014. Lei Ba, J., Swersky, K., Fidler, S., et al. Predicting deep
zero-shot convolutional neural networks using textual
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, descriptions. In Proceedings of the IEEE International
R., Torralba, A., and Fidler, S. Skip-thought vectors. Conference on Computer Vision, pp. 4247–4255, 2015.
Advances in neural information processing systems, 28:
3294–3302, 2015. Li, A., Jabri, A., Joulin, A., and van der Maaten, L. Learning
visual n-grams from web data. In Proceedings of the
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, IEEE International Conference on Computer Vision, pp.
J., Gelly, S., and Houlsby, N. Large scale learning of 4183–4192, 2017.
Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bous-
Unicoder-vl: A universal encoder for vision and language quet, O. Are gans created equal? a large-scale study.
by cross-modal pre-training. 2020a. Advances in neural information processing systems, 31:
700–709, 2018.
Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J.
Learning through dialogue interactions by asking ques- Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri,
tions. arXiv preprint arXiv:1612.04936, 2016. M., Li, Y., Bharambe, A., and van der Maaten, L. Ex-
ploring the limits of weakly supervised pretraining. In
Li, X., Yin, X., Li, C., Hu, X., Zhang, P., Zhang, L., Wang, Proceedings of the European Conference on Computer
L., Hu, H., Dong, L., Wei, F., et al. Oscar: Object- Vision (ECCV), pp. 181–196, 2018.
semantics aligned pre-training for vision-language tasks.
arXiv preprint arXiv:2004.06165, 2020b. McCann, B., Bradbury, J., Xiong, C., and Socher, R.
Learned in translation: Contextualized word vectors. In
Liang, W., Zou, J., and Yu, Z. Alice: Active learning with Advances in neural information processing systems, pp.
contrastive natural language explanations. arXiv preprint 6294–6305, 2017.
arXiv:2009.10259, 2020.
McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ra- natural language decathlon: Multitask learning as ques-
manan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: tion answering. arXiv preprint arXiv:1806.08730, 2018.
Common objects in context. In European conference on
computer vision, pp. 740–755. Springer, 2014. Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen,
E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O.,
Linzen, T. How can we accelerate progress towards Venkatesh, G., et al. Mixed precision training. arXiv
human-like linguistic generalization? arXiv preprint preprint arXiv:1710.03740, 2017.
arXiv:2005.00955, 2020.
Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev,
Lippe, P., Holla, N., Chandra, S., Rajamanickam, S., An- I., and Sivic, J. Howto100m: Learning a text-video em-
toniou, G., Shutova, E., and Yannakoudakis, H. A mul- bedding by watching hundred million narrated video clips.
timodal framework for the detection of hateful memes. In Proceedings of the IEEE international conference on
arXiv preprint arXiv:2012.12871, 2020. computer vision, pp. 2630–2640, 2019.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepa- Miech, A., Alayrac, J.-B., Laptev, I., Sivic, J., and Zisser-
ssi, R., Kaiser, L., and Shazeer, N. Generating man, A. Rareact: A video dataset of unusual interactions.
wikipedia by summarizing long sequences. arXiv preprint arXiv preprint arXiv:2008.01018, 2020a.
arXiv:1801.10198, 2018.
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J.,
Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., and Zisserman, A. End-to-end learning of visual represen-
Schölkopf, B., and Bachem, O. A sober look at the tations from uncurated instructional videos. In Proceed-
unsupervised learning of disentangled representations ings of the IEEE/CVF Conference on Computer Vision
and their evaluation. arXiv preprint arXiv:2010.14766, and Pattern Recognition, pp. 9879–9889, 2020b.
2020.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and
Loshchilov, I. and Hutter, F. Sgdr: Stochastic gra- Dean, J. Distributed representations of words and phrases
dient descent with warm restarts. arXiv preprint and their compositionality. Advances in neural informa-
arXiv:1608.03983, 2016. tion processing systems, 26:3111–3119, 2013.
Loshchilov, I. and Hutter, F. Decoupled weight decay regu- Miller, J., Krauth, K., Recht, B., and Schmidt, L. The effect
larization. arXiv preprint arXiv:1711.05101, 2017. of natural distribution shift on question answering models.
arXiv preprint arXiv:2004.14444, 2020.
Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining
task-agnostic visiolinguistic representations for vision- Mishra, A., Alahari, K., and Jawahar, C. Scene text recogni-
and-language tasks. In Advances in Neural Information tion using higher order language priors. 2012.
Processing Systems, pp. 13–23, 2019.
Mithun, N. C., Panda, R., Papalexakis, E. E., and Roy-
Lu, Z., Xiong, X., Li, Y., Stroud, J., and Ross, D. Leveraging Chowdhury, A. K. Webly supervised joint embedding for
weakly supervised data and pose representation for action cross-modal image-text retrieval. In Proceedings of the
recognition, 2020. URL https://www.youtube. 26th ACM international conference on Multimedia, pp.
com/watch?v=KOQFxbPPLOE&t=1390s. 1856–1864, 2018.
Mori, Y., Takahashi, H., and Oka, R. Image-to-word trans- Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar,
formation based on dividing and vector quantizing images C. V. Cats and dogs. In IEEE Conference on Computer
with words. Citeseer, 1999. Vision and Pattern Recognition, 2012.

Mu, J., Liang, P., and Goodman, N. Shaping visual represen- Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
tations with language for few-shot classification. arXiv Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga,
preprint arXiv:1911.02683, 2019. L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison,
M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L.,
Muller-Budack, E., Pustu-Iren, K., and Ewerth, R. Geolo- Bai, J., and Chintala, S. Pytorch: An imperative style,
cation estimation of photos using a hierarchical model high-performance deep learning library. In Wallach, H.,
and scene classification. In Proceedings of the European Larochelle, H., Beygelzimer, A., dAlché-Buc, F., Fox, E.,
Conference on Computer Vision (ECCV), pp. 563–579, and Garnett, R. (eds.), Advances in Neural Information
2018. Processing Systems 32, pp. 8024–8035, 2019.
Murty, S., Koh, P. W., and Liang, P. Expbert: Representation Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V.,
engineering with natural language explanations. arXiv Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P.,
preprint arXiv:2005.01932, 2020. Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cour-
napeau, D., Brucher, M., Perrot, M., and Duchesnay, E.
Narasimhan, K., Kulkarni, T., and Barzilay, R. Language Scikit-learn: Machine learning in Python. Journal of
understanding for text-based games using deep reinforce- Machine Learning Research, 12:2825–2830, 2011.
ment learning. arXiv preprint arXiv:1506.08941, 2015.
Pennington, J., Socher, R., and Manning, C. D. Glove:
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., Global vectors for word representation. In Proceedings
and Ng, A. Y. Reading digits in natural images with of the 2014 conference on empirical methods in natural
unsupervised feature learning. 2011. language processing (EMNLP), pp. 1532–1543, 2014.
Noble, S. U. Algorithms of oppression: How search engines Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark,
reinforce racism. 2018. C., Lee, K., and Zettlemoyer, L. Deep contextualized
word representations. arXiv preprint arXiv:1802.05365,
Nosek, B. A., Banaji, M. R., and Greenwald, A. G. Harvest-
2018.
ing implicit group attitudes and beliefs from a demonstra-
tion web site. Group Dynamics: Theory, Research, and Qi, D., Su, L., Song, J., Cui, E., Bharti, T., and Sacheti,
Practice, 6(1):101, 2002. A. Imagebert: Cross-modal pre-training with large-
scale weak-supervised image-text data. arXiv preprint
Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee,
arXiv:2001.07966, 2020.
J. T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., et al.
A large-scale benchmark dataset for event recognition in Quattoni, A., Collins, M., and Darrell, T. Learning visual
surveillance video. In CVPR 2011, pp. 3153–3160. IEEE, representations using images with captions. In 2007 IEEE
2011. Conference on Computer Vision and Pattern Recognition,
pp. 1–8. IEEE, 2007.
Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Good-
fellow, I. Realistic evaluation of deep semi-supervised Radford, A., Narasimhan, K., Salimans, T., and Sutskever,
learning algorithms. Advances in neural information pro- I. Improving language understanding by generative pre-
cessing systems, 31:3235–3246, 2018. training, 2018.
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learn- Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and
ing with contrastive predictive coding. arXiv preprint Sutskever, I. Language models are unsupervised multitask
arXiv:1807.03748, 2018. learners. 2019.

Ordonez, V., Kulkarni, G., and Berg, T. Im2text: Describing Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S.,
images using 1 million captioned photographs. Advances Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring
in neural information processing systems, 24:1143–1151, the limits of transfer learning with a unified text-to-text
2011. transformer. arXiv preprint arXiv:1910.10683, 2019.

pandas development team, T. pandas-dev/pandas: Pan- Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee,
das, February 2020. URL https://doi.org/10. J., and Denton, E. Saving face: Investigating the ethical
5281/zenodo.3509134. concerns of facial recognition auditing, 2020.
Ramanathan, V., Liang, P., and Fei-Fei, L. Video event Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning,
understanding using natural language descriptions. In C. D., Ng, A. Y., and Potts, C. Recursive deep models for
Proceedings of the IEEE International Conference on semantic compositionality over a sentiment treebank. In
Computer Vision, pp. 905–912, 2013. Proceedings of the 2013 conference on empirical methods
in natural language processing, pp. 1631–1642, 2013.
Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J.
Collecting image annotations using amazon’s mechanical Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and Ng,
turk. In Proceedings of the NAACL HLT 2010 Workshop A. Y. Grounded compositional semantics for finding and
on Creating Speech and Language Data with Amazon’s describing images with sentences. Transactions of the
Mechanical Turk, pp. 139–147, 2010. Association for Computational Linguistics, 2:207–218,
2014.
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do im-
agenet classifiers generalize to imagenet? arXiv preprint Sohn, K. Improved deep metric learning with multi-class
arXiv:1902.10811, 2019. n-pair loss objective. In Advances in neural information
processing systems, pp. 1857–1865, 2016.
Salimans, T. and Kingma, D. P. Weight normalization: A
simple reparameterization to accelerate training of deep Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-
neural networks. In Advances in neural information pro- Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W.,
cessing systems, pp. 901–909, 2016. Kreps, S., McCain, M., Newhouse, A., Blazakis, J.,
McGuffie, K., and Wang, J. Release strategies and the
Scheuerman, M. K., Paul, J. M., and Brubaker, J. R. How social impacts of language models, 2019.
computers see gender: An evaluation of gender classifica-
tion in commercial facial analysis services. Proceedings Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset
of the ACM on Human-Computer Interaction, 3(CSCW): of 101 human actions classes from videos in the wild.
1–33, 2019. arXiv preprint arXiv:1212.0402, 2012.
Schwemmer, C., Knight, C., Bello-Pardo, E. D., Oklobdzija, Speer, R. ftfy. Zenodo, 2019. URL https://doi.org/
S., Schoonvelde, M., and Lockhart, J. W. Diagnosing 10.5281/zenodo.2591652. Version 5.5.
gender bias in image recognition systems. Socius, 6:
2378023120967171, 2020. Srivastava, N. and Salakhutdinov, R. Multimodal learning
with deep boltzmann machines. In NIPS, 2012.
Sennrich, R., Haddow, B., and Birch, A. Neural machine
translation of rare words with subword units. arXiv Srivastava, S., Labutov, I., and Mitchell, T. Joint concept
preprint arXiv:1508.07909, 2015. learning and semantic parsing from natural language ex-
planations. In Proceedings of the 2017 conference on
Shankar, V., Dave, A., Roelofs, R., Ramanan, D., Recht, B., empirical methods in natural language processing, pp.
and Schmidt, L. Do image classifiers generalize across 1527–1536, 2017.
time? arXiv preprint arXiv:1906.02168, 2019.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The
Sharma, P., Ding, N., Goodman, S., and Soricut, R. Con- German Traffic Sign Recognition Benchmark: A multi-
ceptual captions: A cleaned, hypernymed, image alt-text class classification competition. In IEEE International
dataset for automatic image captioning. In Proceedings Joint Conference on Neural Networks, pp. 1453–1460,
of the 56th Annual Meeting of the Association for Compu- 2011.
tational Linguistics (Volume 1: Long Papers), pp. 2556–
2565, 2018. Stroud, J. C., Ross, D. A., Sun, C., Deng, J., Sukthankar, R.,
and Schmid, C. Learning video representations from tex-
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., tual web supervision. arXiv preprint arXiv:2007.14937,
Batra, D., Parikh, D., and Rohrbach, M. Towards vqa 2020.
models that can read. In Proceedings of the IEEE Con-
ference on Computer Vision and Pattern Recognition, pp. Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi,
8317–8326, 2019. A. Inception-v4, inception-resnet and the impact
of residual connections on learning. arXiv preprint
Socher, R. and Fei-Fei, L. Connecting modalities: Semi- arXiv:1602.07261, 2016.
supervised segmentation and annotation of images using
unaligned text corpora. In 2010 IEEE Computer Society Tan, H. and Bansal, M. Lxmert: Learning cross-modality
Conference on Computer Vision and Pattern Recognition, encoder representations from transformers. arXiv preprint
pp. 966–973. IEEE, 2010. arXiv:1908.07490, 2019.
Tan, M. and Le, Q. V. Efficientnet: Rethinking model Vo, N., Jacobs, N., and Hays, J. Revisiting im2gps in the
scaling for convolutional neural networks. arXiv preprint deep learning era. In Proceedings of the IEEE Interna-
arXiv:1905.11946, 2019. tional Conference on Computer Vision, pp. 2621–2630,
2017.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B.,
and Schmidt, L. Measuring robustness to natural dis- Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and
tribution shifts in image classification. arXiv preprint Bowman, S. R. Glue: A multi-task benchmark and anal-
arXiv:2007.00644, 2020. ysis platform for natural language understanding. arXiv
preprint arXiv:1804.07461, 2018.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni,
K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The Wang, H., Ge, S., Lipton, Z., and Xing, E. P. Learning ro-
new data in multimedia research. Communications of the bust global representations by penalizing local predictive
ACM, 59(2):64–73, 2016. power. In Advances in Neural Information Processing
Systems, pp. 10506–10518, 2019.
Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview
coding. arXiv preprint arXiv:1906.05849, 2019. Wang, H., Lu, P., Zhang, H., Yang, M., Bai, X., Xu, Y., He,
M., Wang, Y., and Liu, W. All you need is boundary: To-
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and
ward arbitrary-shaped text spotting. In Proceedings of the
Isola, P. Rethinking few-shot image classification: a
AAAI Conference on Artificial Intelligence, volume 34,
good embedding is all you need? arXiv preprint
pp. 12160–12167, 2020.
arXiv:2003.11539, 2020.
Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny Wang, J., Markert, K., and Everingham, M. Learning mod-
images: A large data set for nonparametric object and els for object recognition from natural language descrip-
scene recognition. IEEE transactions on pattern analysis tions. In BMVC, volume 1, pp. 2, 2009.
and machine intelligence, 30(11):1958–1970, 2008. Weston, J., Bengio, S., and Usunier, N. Large scale im-
Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fix- age annotation: learning to rank with joint word-image
ing the train-test resolution discrepancy. In Advances in embeddings. Machine learning, 81(1):21–35, 2010.
neural information processing systems, pp. 8252–8262, Weston, J. E. Dialog-based language learning. In Advances
2019. in Neural Information Processing Systems, pp. 829–837,
Varadarajan, J. and Odobez, J.-M. Topic models for scene 2016.
analysis and abnormality detection. In 2009 IEEE 12th
Weyand, T., Kostrikov, I., and Philbin, J. Planet-photo geolo-
International Conference on Computer Vision Workshops,
cation with convolutional neural networks. In European
ICCV Workshops, pp. 1338–1345. IEEE, 2009.
Conference on Computer Vision, pp. 37–55. Springer,
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, 2016.
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Atten-
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Gir-
tion is all you need. In Advances in neural information
shick, R. Detectron2. https://github.com/
processing systems, pp. 5998–6008, 2017.
facebookresearch/detectron2, 2019.
Veeling, B. S., Linmans, J., Winkens, J., Cohen, T., and
Wu, Z., Xiong, Y., Yu, S., and Lin, D. Unsupervised feature
Welling, M. Rotation equivariant CNNs for digital pathol-
learning via non-parametric instance-level discrimination.
ogy. June 2018.
arXiv preprint arXiv:1805.01978, 2018.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M.,
Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training
Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., with noisy student improves imagenet classification. In
Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Proceedings of the IEEE/CVF Conference on Computer
Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Vision and Pattern Recognition, pp. 10687–10698, 2020.
Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Yang, Z., Lu, Y., Wang, J., Yin, X., Florencio, D., Wang,
Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., L., Zhang, C., Zhang, L., and Luo, J. Tap: Text-aware
Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, pre-training for text-vqa and text-caption. arXiv preprint
F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy arXiv:2012.04638, 2020.
1.0: Fundamental Algorithms for Scientific Computing
in Python. Nature Methods, 17:261–272, 2020. doi: Yogatama, D., d’Autume, C. d. M., Connor, J., Kocisky,
10.1038/s41592-019-0686-2. T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W.,
Yu, L., Dyer, C., et al. Learning and evaluating general


linguistic intelligence. arXiv preprint arXiv:1901.11373,
2019.
Young, P., Lai, A., Hodosh, M., and Hockenmaier, J. From
image descriptions to visual denotations: New similarity
metrics for semantic inference over event descriptions.
Transactions of the Association for Computational Lin-
guistics, 2:67–78, 2014.
Yu, F., Tang, J., Yin, W., Sun, Y., Tian, H., Wu, H.,
and Wang, H. Ernie-vil: Knowledge enhanced vision-
language representations through scene graph. arXiv
preprint arXiv:2006.16934, 2020.

Zeiler, M. D. and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision, pp. 818–833. Springer, 2014.
Zhai, X., Puigcerver, J., Kolesnikov, A., Ruyssen, P.,
Riquelme, C., Lucic, M., Djolonga, J., Pinto, A. S., Neu-
mann, M., Dosovitskiy, A., et al. A large-scale study of
representation learning with the visual task adaptation
benchmark. arXiv preprint arXiv:1910.04867, 2019.
Zhang, R. Making convolutional networks shift-invariant
again. arXiv preprint arXiv:1904.11486, 2019.

Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., and Lan-
glotz, C. P. Contrastive learning of medical visual repre-
sentations from paired images and text. arXiv preprint
arXiv:2010.00747, 2020.

Zuboff, S. Big other: surveillance capitalism and the prospects of an information civilization. Journal of Information Technology, 30(1):75–89, 2015.
A. Linear-probe evaluation

We provide additional details for the linear probe experiments presented in this paper, including the list of the datasets and models used for evaluation.

A.1. Datasets

We use the 12 datasets from the well-studied evaluation suite introduced by (Kornblith et al., 2019) and add 15 additional datasets in order to assess the performance of models on a wider variety of distributions and tasks. These datasets include MNIST, the Facial Expression Recognition 2013 dataset (Goodfellow et al., 2015), STL-10 (Coates et al., 2011), EuroSAT (Helber et al., 2019), the NWPU-RESISC45 dataset (Cheng et al., 2017), the German Traffic Sign Recognition Benchmark (GTSRB) dataset (Stallkamp et al., 2011), the KITTI dataset (Geiger et al., 2012), PatchCamelyon (Veeling et al., 2018), the UCF101 action recognition dataset (Soomro et al., 2012), Kinetics 700 (Carreira et al., 2019), 2,500 random samples of the CLEVR dataset (Johnson et al., 2017), the Hateful Memes dataset (Kiela et al., 2020), and the ImageNet-1k dataset (Deng et al., 2012). For the two video datasets (UCF101 and Kinetics700), we use the middle frame of each video clip as the input image. STL-10 and UCF101 have multiple predefined train/validation/test splits, 10 and 3 respectively, and we report the average over all splits. Details on each dataset and the corresponding evaluation metrics are provided in Table 9.

Additionally, we created two datasets that we call Country211 and Rendered SST2. The Country211 dataset is designed to assess the geolocation capability of visual representations. We filtered the YFCC100m dataset (Thomee et al., 2016) to find 211 countries (defined as having an ISO-3166 country code) that have at least 300 photos with GPS coordinates, and we built a balanced dataset with 211 categories by sampling 200 photos for training and 100 photos for testing for each country.

The Rendered SST2 dataset is designed to measure the optical character recognition capability of visual representations. To do so, we used the sentences from the Stanford Sentiment Treebank dataset (Socher et al., 2013) and rendered them into images, with black text on a white background, at a 448×448 resolution. Two example images from this dataset are shown in Figure 19.

Figure 19. Two example images from the Rendered SST2 dataset.

A.2. Models

In combination with the datasets listed above, we evaluate the following series of models using linear probes.

LM RN50 This is a multimodal model that uses an autoregressive loss instead of a contrastive loss, while using the ResNet-50 architecture as in the smallest contrastive model. To do so, the output from the CNN is projected into four tokens, which are then fed as a prefix to a language model that autoregressively predicts the text tokens. Apart from the training objective, the model was trained on the same dataset for the same number of epochs as the other CLIP models.

CLIP-RN Five ResNet-based contrastive CLIP models are included. As discussed in the paper, the first two models follow ResNet-50 and ResNet-101, and we use EfficientNet-style (Tan & Le, 2019) scaling for the next three models, which simultaneously scale the model width, the number of layers, and the input resolution to obtain models with roughly 4x, 16x, and 64x the computation.

CLIP-ViT We include four CLIP models that use the Vision Transformer (Dosovitskiy et al., 2020) architecture as the image encoder: three models trained on 224-by-224 pixel images (ViT-B/32, ViT-B/16, and ViT-L/14) and the ViT-L/14 model fine-tuned on 336-by-336 pixel input images.

EfficientNet We use the nine models (B0-B8) from the original EfficientNet paper (Tan & Le, 2019), as well as the noisy-student variants (B0-B7, L2-475, and L2-800) (Xie et al., 2020). The largest models (L2-475 and L2-800) take input resolutions of 475x475 and 800x800 pixels, respectively.

Instagram-pretrained ResNeXt We use the four models (32x8d, 32x16d, 32x32d, 32x48d) released by (Mahajan et al., 2018), as well as their two FixRes variants which use higher input resolutions (Touvron et al., 2019).

Big Transfer (BiT) We use BiT-S and BiT-M models (Kolesnikov et al., 2019), trained on the ImageNet-1k and ImageNet-21k datasets. The model weights for BiT-L are not publicly available.

Vision Transformer (ViT) We also include four ViT (Dosovitskiy et al., 2020) checkpoints pretrained on the ImageNet-21k dataset, namely ViT-B/32, ViT-B/16, ViT-L/16, and ViT-H/14. We note that their best-performing models, trained on the JFT-300M dataset, are not publicly available.

SimCLRv2 The SimCLRv2 (Chen et al., 2020d) project released pre-trained and fine-tuned models in various settings. We use the seven pretrain-only checkpoints with selective kernels.

BYOL We use the recently released model weights of BYOL (Grill et al., 2020), specifically their 50x1 and 200x2 checkpoints.

Momentum Contrast (MoCo) We include the MoCo-v1 (He et al., 2020) and the MoCo-v2 (Chen et al., 2020e) checkpoints.

VirTex We use the pretrained model of VirTex (Desai & Johnson, 2020). We note that VirTex has a similar model design to CLIP-AR but is trained on a 1000x smaller dataset of high-quality captions from MSCOCO.

ResNet We add the original ResNet checkpoints released by (He et al., 2016b), namely ResNet-50, ResNet-101, and ResNet-152.

A.3. Evaluation

We train a logistic regression classifier using L-BFGS with a maximum of 1,000 iterations and report the corresponding metric for each dataset. We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save the compute required for the sweeps, we perform a parametric binary search that starts with λ = [10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6] and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. For the datasets that contain a validation split in addition to the test split, we use the provided validation set to perform the hyperparameter search, and for the datasets that do not provide a validation split or have not published labels for the test data, we split the training dataset to perform the hyperparameter search and report the performance on the validation data.

A.4. Results

The linear probe results are provided in Table 10. The best-performing CLIP model, using the ViT-L/14 architecture and 336-by-336 pixel images, achieved state of the art on 21 of the 27 datasets, i.e., it is included in the Clopper-Pearson 99.5% confidence interval around each dataset's top score. For many datasets, CLIP performs significantly better than other models, demonstrating the advantage of natural language supervision over traditional pre-training approaches based on image classification. See Section 3.2 for more discussion of the linear probe results.
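To make the protocol in A.3 concrete, the sketch below implements the logistic-regression probe and the peak-centered binary search over the L2 regularization strength λ. It is a minimal illustration under stated assumptions (scikit-learn, pre-computed `train_x`/`valid_x` feature arrays), not the code used to produce the reported numbers; note that scikit-learn parameterizes the L2 penalty as C = 1/λ.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(lam, train_x, train_y, valid_x, valid_y):
    # scikit-learn uses C = 1 / lambda for the L2 penalty strength.
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000, solver="lbfgs")
    clf.fit(train_x, train_y)
    return clf.score(valid_x, valid_y)

def sweep_lambda(train_x, train_y, valid_x, valid_y):
    # Start with lambda values spaced two decades apart, as in A.3.
    log_lams = np.arange(-6.0, 6.1, 2.0)  # 10^-6 ... 10^6
    step = 2.0                            # current spacing in decades
    scores = {l: probe_accuracy(10.0 ** l, train_x, train_y, valid_x, valid_y)
              for l in log_lams}
    # Iteratively halve the spacing around the current peak until the
    # resolution reaches 8 steps per decade (a spacing of 1/8 decade).
    while step > 1.0 / 8.0:
        step /= 2.0
        peak = max(scores, key=scores.get)
        for l in (peak - step, peak + step):
            if l not in scores:
                scores[l] = probe_accuracy(10.0 ** l, train_x, train_y,
                                           valid_x, valid_y)
    best = max(scores, key=scores.get)
    return 10.0 ** best, scores[best]
```

For datasets whose metric in Table 9 is not plain accuracy (mean per class, 11-point mAP, ROC AUC, or mean(top1, top5)), the final score would be computed with that metric instead of `clf.score`.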
Dataset Classes Train size Test size Evaluation metric


Food-101 102 75,750 25,250 accuracy
CIFAR-10 10 50,000 10,000 accuracy
CIFAR-100 100 50,000 10,000 accuracy
Birdsnap 500 42,283 2,149 accuracy
SUN397 397 19,850 19,850 accuracy
Stanford Cars 196 8,144 8,041 accuracy
FGVC Aircraft 100 6,667 3,333 mean per class
Pascal VOC 2007 Classification 20 5,011 4,952 11-point mAP
Describable Textures 47 3,760 1,880 accuracy
Oxford-IIIT Pets 37 3,680 3,669 mean per class
Caltech-101 102 3,060 6,085 mean-per-class
Oxford Flowers 102 102 2,040 6,149 mean per class
MNIST 10 60,000 10,000 accuracy
Facial Emotion Recognition 2013 8 32,140 3,574 accuracy
STL-10 10 1000 8000 accuracy
EuroSAT 10 10,000 5,000 accuracy
RESISC45 45 3,150 25,200 accuracy
GTSRB 43 26,640 12,630 accuracy
KITTI 4 6,770 711 accuracy
Country211 211 43,200 21,100 accuracy
PatchCamelyon 2 294,912 32,768 accuracy
UCF101 101 9,537 1,794 accuracy
Kinetics700 700 494,801 31,669 mean(top1, top5)
CLEVR Counts 8 2,000 500 accuracy
Hateful Memes 2 8,500 500 ROC AUC
Rendered SST2 2 7,792 1,821 accuracy
ImageNet 1000 1,281,167 50,000 accuracy

Table 9. Datasets examined for linear probes. We note that, for the Birdsnap and Kinetics700 datasets, we used the resources that are
available online at the time of this writing.
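As an aside on dataset construction, images in the style of Rendered SST2 described in Appendix A.1 (black text on a white 448×448 background) can be produced along the lines of the sketch below. The font, wrapping width, and margins here are illustrative assumptions, not the settings used to build the released dataset.

```python
import textwrap
from PIL import Image, ImageDraw, ImageFont

def render_sentence(sentence, size=448, font_path="DejaVuSans.ttf",
                    font_size=20, margin=16, chars_per_line=40):
    # White canvas with black text, as described for Rendered SST2.
    image = Image.new("RGB", (size, size), color="white")
    draw = ImageDraw.Draw(image)
    font = ImageFont.truetype(font_path, font_size)   # font path is an assumption
    wrapped = textwrap.fill(sentence, width=chars_per_line)
    draw.multiline_text((margin, margin), wrapped, fill="black", font=font)
    return image

if __name__ == "__main__":
    img = render_sentence("a gripping, well-acted film that earns its emotional payoff")
    img.save("rendered_sst2_example.png")
```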
Columns of Table 10, from left to right (the datasets follow the order of Table 9): Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Stanford Cars, FGVC Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, FER2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, Rendered SST2, ImageNet.
LM RN50 81.3 82.8 61.7 44.2 69.6 74.9 44.9 85.5 71.5 82.8 85.5 91.1 96.6 60.1 96.3 93.4 84.0 73.8 70.2 19.0 82.9 76.4 51.9 51.2 65.2 76.8 65.2
RN50 86.4 88.7 70.3 56.4 73.3 78.3 49.1 87.1 76.4 88.2 89.6 96.1 98.3 64.2 97.2 95.2 87.5 82.4 70.2 25.3 82.7 81.6 57.2 53.6 65.7 72.6 73.3
CLIP-ResNet

RN101 88.9 91.1 73.5 58.6 75.1 84.0 50.7 88.0 76.3 91.0 92.0 96.4 98.4 65.2 98.2 95.9 89.3 82.4 73.6 26.6 82.8 84.0 60.3 50.3 68.2 73.3 75.7
RN50x4 91.3 90.5 73.0 65.7 77.0 85.9 57.3 88.4 79.5 91.9 92.5 97.8 98.5 68.1 98.1 96.4 89.7 85.5 59.4 30.3 83.0 85.7 62.6 52.5 68.0 76.6 78.2
RN50x16 93.3 92.2 74.9 72.8 79.2 88.7 62.7 89.0 79.1 93.5 93.7 98.3 98.9 68.7 99.0 97.0 91.4 89.0 69.2 34.8 83.5 88.0 66.3 53.8 71.1 80.0 81.5
RN50x64 94.8 94.1 78.6 77.2 81.1 90.5 67.7 88.9 82.0 94.5 95.4 98.9 98.9 71.3 99.3 97.1 92.8 90.2 69.2 40.7 83.7 89.5 69.1 55.0 75.0 81.2 83.6
B/32 88.8 95.1 80.5 58.5 76.6 81.8 52.0 87.7 76.5 90.0 93.0 96.9 99.0 69.2 98.6 97.0 90.5 85.3 66.2 27.8 83.9 85.5 61.7 52.1 66.7 70.8 76.1
CLIP-ViT

B/16 92.8 96.2 83.1 67.8 78.4 86.7 59.5 89.2 79.2 93.1 94.7 98.1 99.0 69.5 99.2 97.1 92.7 86.6 67.8 33.3 83.5 88.4 66.1 57.1 70.3 75.5 80.2
L/14 95.2 98.0 87.5 77.0 81.8 90.9 69.4 89.6 82.1 95.1 96.5 99.2 99.2 72.2 99.8 98.2 94.1 92.5 64.7 42.9 85.8 91.5 72.0 57.8 76.2 80.8 83.9
L/14-336px 95.9 97.9 87.4 79.9 82.2 91.5 71.6 89.9 83.0 95.1 96.0 99.2 99.2 72.9 99.7 98.1 94.9 92.4 69.2 46.4 85.6 92.0 73.0 60.3 77.3 80.5 85.4
B0 74.3 92.5 76.5 59.7 62.0 62.5 55.7 84.4 71.2 93.0 93.3 91.7 98.2 57.2 97.8 97.3 85.5 80.0 73.8 12.4 83.1 74.4 47.6 47.9 55.7 53.4 76.9
B1 74.2 93.2 77.2 61.3 62.6 62.5 56.1 84.7 74.2 93.4 93.6 92.4 98.3 57.0 97.9 96.8 84.5 75.9 75.5 12.5 82.7 74.7 48.5 44.3 54.5 54.4 78.6
B2 75.8 93.6 77.9 64.4 64.0 63.2 57.0 85.3 73.5 93.9 93.5 92.9 98.5 56.6 98.1 96.9 84.4 76.4 73.1 12.6 84.3 75.1 49.4 42.6 55.4 55.2 79.7
EfficientNet

B3 77.4 94.0 78.0 66.5 64.4 66.0 59.3 85.8 73.1 94.1 93.7 93.3 98.5 57.1 98.6 97.3 85.0 75.8 76.1 13.4 83.3 78.1 50.9 45.1 53.8 54.8 81.0
B4 79.7 94.1 78.7 70.1 65.4 66.4 60.4 86.5 73.4 94.7 93.5 93.2 98.8 57.9 98.9 96.8 85.0 78.3 72.3 13.9 83.1 79.1 52.5 46.5 54.4 55.4 82.9
B5 81.5 93.6 77.9 72.4 67.1 72.7 68.9 86.7 73.9 95.0 94.7 94.5 98.4 58.5 99.1 96.8 86.0 78.5 69.6 14.9 84.7 80.9 54.5 46.6 53.3 56.3 83.7
B6 82.4 94.0 78.0 73.5 65.8 71.1 68.2 87.6 73.9 95.0 94.1 93.7 98.4 60.2 99.0 96.8 85.4 78.1 72.7 15.3 84.2 80.0 54.1 51.1 53.3 57.0 84.0
B7 84.5 94.9 80.1 74.7 69.0 77.1 72.3 87.2 76.8 95.2 94.7 95.9 98.6 61.3 99.3 96.3 86.8 80.8 75.8 16.4 85.2 81.9 56.8 51.9 54.4 57.8 84.8
B8 84.5 95.0 80.7 75.2 69.6 76.8 71.5 87.4 77.1 94.9 95.2 96.3 98.6 61.4 99.4 97.0 87.4 80.4 70.9 17.4 85.2 82.4 57.7 51.4 51.7 55.8 85.3
B0 78.1 94.0 78.6 63.5 65.5 57.2 53.7 85.6 75.6 93.8 93.1 94.5 98.1 55.6 98.6 97.0 84.3 74.0 71.6 14.0 83.1 76.7 51.7 47.3 55.7 55.0 78.5
EfficientNet Noisy Student

B1 80.4 95.1 80.2 66.6 67.6 59.6 53.7 86.2 77.0 94.6 94.4 95.1 98.0 56.1 98.9 96.9 84.3 73.1 67.1 14.5 83.9 79.9 54.5 46.1 54.3 54.9 81.1
B2 80.9 95.3 81.3 67.6 67.9 60.9 55.2 86.3 77.7 95.0 94.7 94.4 98.0 55.5 98.9 97.3 84.6 71.7 70.0 14.6 82.9 80.1 55.1 46.1 54.1 55.3 82.2
B3 82.6 95.9 82.1 68.6 68.8 60.6 55.4 86.5 77.2 95.0 94.8 95.2 98.1 56.0 99.3 96.5 85.0 70.5 69.5 15.1 83.1 81.8 56.8 45.1 55.7 52.0 83.8
B4 85.2 95.6 81.0 72.5 69.7 56.1 52.6 87.0 78.7 94.8 95.2 95.3 98.2 56.0 99.4 95.3 84.8 61.9 64.8 16.0 82.8 83.4 59.8 43.2 55.3 53.0 85.4
B5 87.6 96.3 82.4 75.3 71.6 64.7 64.8 87.8 79.6 95.5 95.6 96.6 98.8 60.9 99.5 96.1 87.0 68.5 73.7 16.4 83.5 86.4 61.6 46.3 53.4 55.8 85.8
B6 87.3 97.0 83.9 75.8 71.4 67.6 65.6 87.3 78.5 95.2 96.4 97.2 98.6 61.9 99.7 96.6 86.1 70.7 72.4 17.6 84.2 85.5 61.0 49.6 54.6 55.7 86.4
B7 88.4 96.0 82.0 76.9 72.6 72.2 71.2 88.1 80.5 95.5 95.5 96.6 98.5 62.7 99.5 96.2 88.5 73.4 73.0 18.5 83.8 86.6 63.2 50.5 57.2 56.7 87.0
L2-475 91.6 99.0 91.0 74.8 76.4 75.1 66.8 89.5 81.9 95.6 96.5 97.7 98.9 67.5 99.7 97.0 89.5 73.4 68.9 22.2 86.3 89.4 68.2 58.3 58.6 55.2 88.3
L2-800 92.0 98.7 89.0 78.5 75.7 75.5 68.4 89.4 82.5 95.6 94.7 97.9 98.5 68.4 99.7 97.2 89.9 77.7 66.9 23.7 86.8 88.9 66.7 62.7 58.4 56.9 88.4
32x8d 84.8 95.9 80.9 63.8 69.0 74.2 56.0 88.0 75.4 95.4 93.9 91.7 97.4 60.7 64.4 95.7 82.1 72.3 69.2 16.7 82.3 80.1 56.8 42.2 53.3 55.2 83.3
32x16d 85.7 96.5 80.9 64.8 70.5 77.5 56.7 87.9 76.2 95.6 94.9 92.5 97.4 61.6 76.7 95.5 82.8 73.8 66.1 17.5 83.4 81.1 58.2 41.3 54.2 56.1 84.4
Instagram

32x32d 86.7 96.8 82.7 67.1 71.5 77.5 55.4 88.3 78.5 95.8 95.3 94.4 97.9 62.4 87.5 95.7 85.4 71.2 66.8 18.0 83.7 82.1 58.8 39.7 55.3 56.7 85.0
32x48d 86.9 96.8 83.4 65.9 72.2 76.6 53.2 88.0 77.2 95.5 95.8 93.6 98.1 63.7 88.3 95.3 85.4 73.0 67.2 18.5 82.7 82.8 59.2 41.3 55.5 56.7 85.2
FixRes-v1 88.5 95.7 81.1 67.4 72.9 80.5 57.6 88.0 77.9 95.8 96.1 94.5 97.9 62.2 82.9 96.2 86.6 76.5 64.8 19.3 82.5 83.4 59.8 43.5 56.6 59.0 86.0
FixRes-v2 88.5 95.7 81.1 67.3 72.9 80.7 57.5 88.0 77.9 95.0 96.0 94.5 98.0 62.1 83.1 96.5 86.6 76.3 64.8 19.5 82.3 83.5 59.8 44.2 56.6 59.0 86.0
R50x1 72.5 91.7 74.8 57.7 61.1 53.5 52.5 83.7 72.4 92.3 91.2 92.0 98.4 56.1 76.7 97.4 85.0 70.0 66.0 12.5 83.0 72.3 47.5 48.3 54.1 55.3 75.2
R50x3 75.1 93.7 79.0 61.1 63.7 55.2 54.1 84.8 74.6 92.5 91.6 92.8 98.8 58.7 74.9 97.8 86.4 73.1 73.8 14.0 84.2 76.4 50.0 49.2 54.7 54.2 77.2
BiT-S

R101x1 73.5 92.8 77.4 58.4 61.3 54.0 52.4 84.4 73.5 92.5 91.8 90.6 98.3 56.5 63.5 97.3 84.6 69.4 68.9 12.6 82.0 73.5 48.6 45.4 52.6 55.5 76.0
R101x3 74.7 93.9 79.8 57.8 62.9 54.7 53.3 84.7 75.5 92.3 91.2 92.6 98.8 59.7 61.5 98.0 85.5 71.8 60.2 14.1 83.1 75.9 50.4 49.7 54.1 54.6 77.4
R152x2 74.9 94.3 79.7 58.7 62.7 55.9 53.6 85.3 74.9 93.0 92.0 91.7 98.6 58.3 60.9 97.8 86.2 71.8 71.6 13.9 84.1 76.2 49.9 48.2 53.8 55.9 77.1
R152x4 74.7 94.2 79.2 57.8 62.9 51.2 50.8 85.4 75.4 93.1 91.2 91.4 98.9 61.4 64.2 98.0 85.5 72.8 67.9 14.9 83.1 76.0 50.3 42.9 53.6 56.0 78.5
R50x1 83.3 94.9 82.2 70.9 69.9 59.0 55.6 86.8 77.3 91.5 93.9 99.4 98.0 60.6 81.5 97.5 87.4 68.6 68.2 16.6 82.5 79.4 53.2 49.4 54.5 53.4 76.7
R50x3 86.9 96.7 86.2 75.7 74.6 60.6 54.2 87.7 78.5 93.2 95.3 99.4 98.6 64.6 81.1 98.0 88.1 69.9 59.6 19.6 83.4 83.5 57.8 51.3 55.8 55.6 80.7
BiT-M

R101x1 85.5 95.7 84.4 73.0 72.5 59.8 55.0 87.3 78.1 92.2 95.0 99.5 98.1 62.5 72.0 97.6 87.8 68.7 67.7 18.0 84.0 82.3 55.9 53.4 54.8 53.1 79.4
R101x3 87.2 97.4 87.5 72.4 75.0 57.4 47.4 87.5 79.6 93.2 95.4 99.6 98.6 64.3 73.8 98.2 87.7 68.8 64.1 20.7 80.4 84.0 58.7 52.6 54.9 54.3 81.2
R152x2 88.0 97.5 87.8 75.8 75.9 61.5 55.3 88.1 79.8 93.6 95.9 99.5 98.5 64.3 61.4 97.9 89.0 70.0 70.3 20.7 82.6 85.5 59.6 50.8 54.9 55.1 81.9
R152x4 87.2 97.6 88.2 72.4 75.0 49.1 43.4 87.1 79.9 92.4 95.4 99.3 98.5 65.7 57.7 97.8 87.7 68.2 57.1 20.6 80.4 84.6 59.0 49.7 57.2 55.1 81.5
B/32 81.8 96.7 86.3 65.2 70.7 49.1 42.7 85.3 73.1 90.4 94.5 98.7 97.8 59.0 99.1 96.3 83.0 68.1 65.1 15.7 82.6 79.1 51.7 38.9 57.1 54.6 76.5
B/16 86.7 96.9 86.4 74.0 74.2 54.7 46.0 86.7 74.3 92.7 94.1 99.2 97.4 61.3 99.5 96.4 84.5 63.1 61.5 17.5 85.4 82.7 56.6 40.0 57.0 56.1 80.9
ViT

L/16 87.4 97.9 89.0 76.5 74.9 62.5 52.2 86.1 75.0 92.9 94.7 99.3 98.0 64.0 99.7 96.5 85.7 70.4 58.8 17.7 85.7 84.1 58.0 38.4 58.4 52.8 81.5
H/14 83.4 95.8 84.5 70.2 69.2 62.3 54.8 84.7 75.4 91.7 93.7 98.9 98.5 62.4 98.5 97.3 87.0 73.9 63.4 15.4 87.0 79.4 52.1 41.1 55.9 54.1 75.4
R50x1 76.4 93.2 77.9 48.6 64.1 56.3 51.7 84.4 77.0 88.3 91.8 92.9 97.6 59.7 97.9 97.5 85.8 71.1 69.1 15.8 84.8 78.4 51.0 56.2 53.9 53.8 73.8
R50x3 81.0 95.6 82.4 56.5 67.0 65.6 61.1 85.9 78.8 90.9 94.1 95.4 98.7 62.6 98.8 97.9 88.2 78.2 74.7 17.6 85.4 82.6 54.6 55.4 54.2 55.2 77.3
SimCLR

R101x1 77.9 94.8 79.9 51.9 65.2 57.1 52.0 85.4 77.2 90.0 91.6 92.7 97.2 59.4 98.2 96.8 84.6 65.7 70.6 16.1 84.3 78.8 52.4 53.6 55.1 55.7 76.1
R101x3 82.2 96.4 83.4 57.5 68.2 64.6 60.0 86.2 78.9 91.8 95.0 95.4 98.4 63.0 99.0 97.9 88.0 77.5 69.1 18.3 85.5 82.9 55.9 52.2 54.5 56.3 78.8
R152x1 78.6 95.0 79.9 50.3 65.6 55.6 52.2 85.8 77.3 90.1 92.5 91.8 97.6 59.8 98.6 96.6 84.3 64.8 70.3 16.6 83.9 79.4 53.1 57.2 55.8 54.8 76.9
R152x2 82.3 96.7 83.9 58.1 68.5 64.9 58.7 86.6 79.1 92.2 94.1 96.0 98.2 64.1 99.0 98.0 88.1 77.0 69.8 18.4 85.3 82.7 56.2 53.6 56.0 56.5 79.2
R152x3 83.6 96.8 84.5 60.3 69.1 68.5 63.1 86.7 80.5 92.6 94.9 96.3 98.7 65.4 99.2 98.1 89.5 78.4 68.5 19.4 85.2 83.5 57.0 54.4 54.6 54.2 80.0
BYOL
50x1 74.0 93.6 79.1 47.6 63.7 61.6 62.3 82.6 77.0 88.3 93.7 94.3 98.7 58.8 97.4 97.6 88.2 80.1 71.4 14.1 84.8 77.3 49.3 56.1 53.8 54.4 73.3
200x2 78.5 96.2 83.3 53.4 68.5 61.7 55.4 86.6 77.4 91.9 95.5 93.9 98.7 62.6 99.0 97.7 87.4 77.1 76.4 16.4 84.0 82.6 55.1 54.1 52.5 52.4 79.2
MoCo
v1 65.9 85.0 63.1 27.5 52.6 35.9 43.5 75.7 70.0 70.4 78.1 85.4 97.6 54.3 59.1 97.1 82.9 62.6 60.2 12.6 85.7 64.2 40.7 54.7 55.6 53.5 57.2
v2 72.2 93.4 76.3 39.6 60.2 48.3 51.1 82.6 75.1 84.4 89.9 90.7 98.4 58.3 62.9 97.2 85.4 75.7 75.4 13.2 85.6 72.7 47.8 56.9 53.9 53.8 69.1
VirTex 57.9 83.9 57.5 17.0 49.8 22.4 34.5 83.8 58.2 53.6 70.6 74.7 98.1 56.5 68.1 94.8 74.1 69.5 71.3 8.7 83.1 61.5 39.9 45.5 53.5 55.8 50.7
50 71.3 91.8 74.5 52.7 60.5 49.9 48.5 83.8 72.3 92.4 90.8 90.8 98.3 54.9 64.6 96.7 83.6 70.6 67.1 11.7 82.5 71.2 46.8 43.0 56.5 55.5 74.3
ResNet

101 72.7 93.0 77.2 53.7 60.8 50.1 47.0 84.4 71.6 92.3 91.9 90.4 98.5 56.6 62.1 97.1 83.4 72.5 63.6 11.9 83.3 72.7 48.3 43.2 53.0 54.7 75.8
152 73.7 93.5 78.0 55.1 61.6 52.8 48.4 84.5 71.9 93.0 92.1 89.6 98.2 57.0 61.5 97.0 83.1 70.1 70.2 12.3 82.9 75.3 49.2 42.4 53.2 53.9 77.1

Table 10. Linear probe performance of various pre-trained models over 27 datasets. Scores within the 99.5% Clopper-Pearson confidence
interval of each dataset’s top score are shown in bold.
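The bolding rule of Table 10 relies on a 99.5% Clopper-Pearson interval around each dataset's top score. A sketch of that rule is given below, assuming plain accuracy scores and known test-set sizes (for ROC AUC or mean-per-class metrics the interval would have to be computed differently); the exact counts used by the authors are not reproduced here.

```python
from scipy.stats import beta

def clopper_pearson(k, n, alpha=0.005):
    """Exact (Clopper-Pearson) 1 - alpha confidence interval for a
    binomial proportion with k successes out of n trials."""
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return lower, upper

def bold_models(scores, n_test):
    """scores: {model_name: accuracy in [0, 1]} for one dataset.
    Returns the models whose score is at or above the lower end of the
    99.5% interval around the top score, mirroring the Table 10 rule."""
    top_model = max(scores, key=scores.get)
    k = round(scores[top_model] * n_test)
    lower, _ = clopper_pearson(k, n_test)
    return [m for m, s in scores.items() if s >= lower]
```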
[Figure 20 shows one panel per dataset — Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, StanfordCars, FGVCAircraft, PascalVOC2007, DescribableTextures, OxfordPets, Caltech101, Flowers102, MNIST, FacialEmotionRecognition2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, PatchCamelyon, UCF101, Kinetics700, CLEVRCounts, Country211, HatefulMemes, SST2, and ImageNet — plotting each model's linear probe score (accuracy, mean per class, 11-point mAP, ROC AUC, or mean(top1, top5), as appropriate) against GFLOPs/image on a logarithmic axis. The legend covers CLIP-ViT, CLIP-ResNet, EfficientNet-NoisyStudent, EfficientNet, Instagram-pretrained, SimCLRv2, BYOL, MoCo, ViT (ImageNet-21k), BiT-M, BiT-S, and ResNet.]
Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10.
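Plots in the style of Figure 20 can be regenerated from Table 10 once a per-image compute cost is attached to each model. The sketch below is illustrative only: the scores are taken from the Food101 column of Table 10, but the GFLOPs values are rough assumptions and only two model families are shown.

```python
import matplotlib.pyplot as plt

# Illustrative inputs: Food101 linear probe scores from Table 10 and
# assumed (approximate) per-image compute costs for a few models.
results = {
    "CLIP-ViT":    {"gflops": [4.4, 17.6, 80.7], "score": [88.8, 92.8, 95.2]},
    "CLIP-ResNet": {"gflops": [4.1, 7.8, 16.5],  "score": [86.4, 88.9, 91.3]},
}

fig, ax = plt.subplots(figsize=(4, 3))
for family, data in results.items():
    ax.plot(data["gflops"], data["score"], marker="o", label=family)
ax.set_xscale("log")          # Figure 20 uses a log-scaled GFLOPs axis
ax.set_xlabel("GFLOPs/image")
ax.set_ylabel("accuracy")
ax.set_title("Food101")
ax.legend(fontsize=6)
fig.tight_layout()
fig.savefig("food101_linear_probe.png", dpi=200)
```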
[Figure 21 consists of a grid of example zero-shot predictions covering the following datasets: Food101, SUN397, Youtube-BB, EuroSAT, PatchCamelyon (PCam), ImageNet-A (Adversarial), CIFAR-10, CLEVR Count, Facial Emotion Recognition 2013 (FER2013), UCF101, Caltech-101, ImageNet-R (Rendition), Oxford-IIIT Pets, CIFAR-100, ImageNetV2 Matched Frequency, FGVC Aircraft, Country211, RESISC45, Stanford Cars, SUN, Kinetics-700, Flowers-102, ImageNet, Birdsnap, aYahoo, ObjectNet, ImageNet Overlap, ImageNet Blurry, Describable Textures Dataset (DTD), PASCAL VOC 2007, MNIST, Street View House Numbers (SVHN), ImageNet Vid, ImageNet Sketch, Hateful Memes, Stanford Sentiment Treebank, and the German Traffic Sign Recognition Benchmark (GTSRB). Each panel shows an example image's correct label, the rank and probability the zero-shot classifier assigned to it, and bar charts of the predicted probabilities for the top 5 classes together with the prompt text used to represent each class.]
Figure 21. Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting
Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent
the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect
prediction is colored orange.
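For readers who want to reproduce predictions like those in Figure 21, the sketch below builds a zero-shot classifier from class names and a prompt template using the open-source CLIP package (github.com/openai/CLIP). The class names, template, and image path are illustrative; Appendix B and Figure 21 describe the prompts actually used, and the 100x logit scale follows the example in that repository.

```python
import torch
import clip  # https://github.com/openai/CLIP
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative class names and template (Food101-style prompts).
class_names = ["guacamole", "ceviche", "edamame", "tuna tartare", "hummus"]
prompts = [f"a photo of {c}, a type of food." for c in class_names]

with torch.no_grad():
    text_features = model.encode_text(clip.tokenize(prompts).to(device))
    text_features /= text_features.norm(dim=-1, keepdim=True)

    image = preprocess(Image.open("example.jpg")).unsqueeze(0).to(device)
    image_features = model.encode_image(image)
    image_features /= image_features.norm(dim=-1, keepdim=True)

    # Cosine similarity scaled by 100, softmaxed into class probabilities.
    probs = (100.0 * image_features @ text_features.T).softmax(dim=-1)

for name, p in sorted(zip(class_names, probs[0].tolist()), key=lambda x: -x[1]):
    print(f"{p:6.2%}  a photo of {name}, a type of food.")
```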
B. Zero-Shot Prediction

To provide a qualitative summary / overview of CLIP's zero-shot performance, we visualize a randomly selected prediction for 36 different zero-shot CLIP classifiers in Figure 21.

C. Duplicate Detector

Our early attempts at duplicate detection and analysis used nearest neighbors in the model's learned embedding space. While it is intuitive to use a model's own notion of similarity, we encountered issues. We found the model's feature space is weighted very heavily towards semantic similarity. Many false positives occurred due to distinct objects that would be described similarly (soccer balls, flowers of the same species, etc.) having almost perfect similarity. We also observed the model was quite poor at assigning certain kinds of near-duplicates high similarity scores. We noticed repeatedly that images with high-frequency textures (such as fur or stripe patterns) pre-processed by different resizing algorithms (nearest neighbor vs. bi-linear) could have surprisingly low similarity. This resulted in many false negatives.

We built our own near-duplicate detector to fix this issue. We created a synthetic data augmentation pipeline that combined a variety of common image manipulations. The augmentation pipeline combines random cropping and zooming, aspect ratio distortion, downsizing and upscaling to different resolutions, minor rotations, JPEG compression, and HSV color jitter. The pipeline also randomly selects from different interpolation algorithms for all relevant steps. We then trained a model to maximize the similarity of an image and its transformed variant while minimizing similarity to all other images in a training batch. We used the same n-pair / InfoNCE loss as CLIP but with a fixed temperature of 0.07. We selected a ResNet-50 as the model architecture. We modified the base ResNet-50 with the anti-alias improvements from (Zhang, 2019) and used weight norm (Salimans & Kingma, 2016) instead of batch norm (Ioffe & Szegedy, 2015) to avoid leaking information about duplicates via batch statistics - a problem previously noted in (Henaff, 2020). We also found the GELU activation function (Hendrycks & Gimpel, 2016) to perform better for this task. We trained the model with a total batch size of 1,712 for approximately 30 million images sampled from our pre-training dataset. At the end of training it achieves nearly 100% accuracy on its proxy training task.

D. Dataset Ablation on YFCC100M

To study whether our custom dataset is critical to the performance of CLIP, we trained a model on a filtered subset of the YFCC100M dataset (details described in Section 2.2) and compared its performance to the same model trained on an equally sized subset of WIT. We train each model for 32 epochs, at which point transfer performance begins to plateau due to overfitting. Results are shown in Table 11.

                    Linear Classifier          Zero Shot
Dataset             YFCC   WIT    ∆            YFCC   WIT    ∆
Birdsnap            47.4   35.3   +12.1        19.9   4.5    +15.4
Country211          23.1   17.3   +5.8         5.2    5.3    +0.1
Flowers102          94.4   89.8   +4.6         48.6   21.7   +26.9
GTSRB               66.8   72.5   −5.7         6.9    7.0    −0.1
UCF101              69.2   74.9   −5.7         22.9   32.0   −9.1
Stanford Cars       31.4   50.3   −18.9        3.8    10.9   −7.1
ImageNet            62.0   60.8   +1.2         31.3   27.6   +3.7
Dataset Average     65.5   66.6   −1.1         29.6   30.0   −0.4
Dataset "Wins"      10     15     −5           19     18     +1

Table 11. CLIP performs similarly when trained on only YFCC100M. Comparing a ResNet-50 trained on only YFCC100M with a same-sized subset of WIT shows similar average performance and number of wins on zero-shot and linear classifier evals. However, large differences in dataset-specific performance occur. We include performance on the 3 datasets where YFCC does best and worst compared to WIT according to a linear probe in order to highlight this, as well as aggregate performance across all linear and zero-shot evals and the canonical ImageNet dataset.

Across our whole eval suite, YFCC and WIT perform similarly on average for both zero-shot and linear probe settings. However, performance on specific fine-grained classification datasets can vary widely - sometimes by over 10%. Our speculation is that these differences in performance reflect the relative density of relevant data in each pre-training dataset. For instance, pre-training on YFCC100M, which might contain many photos of birds and flowers (common subjects for photographers), results in better performance on Birdsnap and Flowers102, while pre-training on WIT results in better car and pet classifiers (which appear common in our dataset).

Overall, these results are encouraging as they suggest our approach can use any reasonably filtered collection of paired (text, image) data. This mirrors recent work which reported positive results using the same contrastive pre-training objective on the relatively different domain of medical imaging (Zhang et al., 2020). It is also similar to the findings of noisy student self-training, which reported only slight improvements when using their JFT300M dataset over YFCC100M (Xie et al., 2020). We suspect the major advantage of our dataset over the already existing YFCC100M is its much larger size.

Finally, we caution that WIT includes this filtered subset of YFCC100M. This could result in our ablation under-estimating the size of performance differences between YFCC100M and the rest of WIT. We do not think this is
likely as YFCC100M is only 3.7% of the overall WIT data blend, and it did not noticeably change the performance of models when it was added to the existing data blend during the creation of WIT.

E. Selected Task and Dataset Results

Due to the large variety of datasets and experiments considered in this work, the main body focuses on summarizing and analyzing overall results. In the following subsections we report details of performance for specific groups of tasks, datasets, and evaluation settings.

E.1. Image and Text Retrieval

CLIP pre-trains for the task of image-text retrieval on our noisy web-scale dataset. Although the focus of this paper is on representation learning and task learning for the purpose of transfer to a wide variety of downstream datasets, validating that CLIP is able to achieve high transfer performance on exactly what it is pre-trained for is an important sanity check / proof of concept. In Table 12 we check the zero-shot transfer performance of CLIP for both text and image retrieval on the Flickr30k and MSCOCO datasets. Zero-shot CLIP matches or outperforms all prior zero-shot results on these two datasets. Zero-shot CLIP is also competitive with the current overall SOTA for the task of text retrieval on Flickr30k. On image retrieval, CLIP's performance relative to the overall state of the art is noticeably lower. However, zero-shot CLIP is still competitive with a fine-tuned Unicoder-VL. On the larger MSCOCO dataset, fine-tuning improves performance significantly, and zero-shot CLIP is not competitive with the most recent work. For both these datasets we prepend the prompt "a photo of" to the description of each image, which we found boosts CLIP's zero-shot R@1 performance by between 1 and 2 points.

E.2. Optical Character Recognition

Although visualizations have shown that ImageNet models contain features that respond to the presence of text in an image (Zeiler & Fergus, 2014), these representations are not sufficiently fine-grained to use for the task of optical character recognition (OCR). To compensate, models are augmented with the outputs of custom OCR engines and features to boost performance on tasks where this capability is required (Singh et al., 2019; Yang et al., 2020). Early during the development of CLIP, we noticed that CLIP began to learn primitive OCR capabilities which appeared to steadily improve over the course of the project. To evaluate this qualitatively observed behavior, we measured performance on 5 datasets requiring the direct and indirect use of OCR. Three of these datasets, MNIST (LeCun), SVHN (Netzer et al., 2011), and IIIT5K (Mishra et al., 2012), directly check the ability of a model to perform low-level character and word recognition, while Hateful Memes (Kiela et al., 2020) and SST-2 (Socher et al., 2013) check the ability of a model to use OCR to perform a semantic task. Results are reported in Table 13.

CLIP's performance is still highly variable and appears to be sensitive to some combination of the domain (rendered or natural images) and the type of text to be recognized (numbers or words). CLIP's OCR performance is strongest on Hateful Memes and SST-2 - datasets where the text is digitally rendered and consists mostly of words. On IIIT5K, which consists of natural images of individually cropped words, zero-shot CLIP performs a bit more respectably, and its performance is similar to Jaderberg et al. (2014)'s early work combining deep learning and structured prediction to perform open-vocabulary OCR. However, performance is noticeably lower on two datasets involving recognition of handwritten and street view numbers. CLIP's 51% accuracy on full-number SVHN is well below any published results. Inspection suggests CLIP struggles with repeated characters as well as the low resolution and blurry images of SVHN. CLIP's zero-shot MNIST performance is also poor and is outperformed by supervised logistic regression on raw pixels, one of the simplest possible machine learning baselines.

SST-2 is a sentence-level NLP dataset which we render into images. We include SST-2 in order to check whether CLIP is able to convert low-level OCR capability into a higher-level representation. Fitting a linear classifier on CLIP's representation of rendered sentences achieves 80.5% accuracy. This is on par with the 80% accuracy of a continuous bag of words baseline using GloVe word vectors pre-trained on 840 billion tokens (Pennington et al., 2014). While this is a simple NLP baseline by today's standards, and well below the 97.5% of the current SOTA, it is encouraging to see that CLIP is able to turn an image of rendered text into a non-trivial sentence-level representation. Fully supervised CLIP is also surprisingly strong on Hateful Meme detection, where CLIP is only 0.7 points behind the current single model SOTA and several points above the best baseline from the original paper. Similar to SST-2, these other results on Hateful Memes use the ground truth text, which CLIP does not have access to. Finally, we note that zero-shot CLIP outperforms the best results using fully supervised linear probes across all other 56 models included in our evaluation suite. This suggests CLIP's OCR capability is at least somewhat unique compared to existing work on self-supervised and supervised representation learning.
Text Retrieval Image Retrieval


Flickr30k MSCOCO Flickr30k MSCOCO
R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10 R@1 R@5 R@10
Finetune
Unicoder-VLa 86.2 96.3 99.0 62.3 87.1 92.8 71.5 90.9 94.9 46.7 76.0 85.3
Uniterb 87.3 98.0 99.2 65.7 88.6 93.8 75.6 94.1 96.8 52.9 79.9 88.0
VILLAc 87.9 97.5 98.8 - - - 76.3 94.2 96.8 - - -
Oscard - - - 73.5 92.2 96.0 - - - 57.5 82.8 89.8
ERNIE-ViLe 88.7 98.0 99.2 - - - 76.7 93.6 96.4 - - -
Zero-Shot
Visual N-Gramsf 15.4 35.7 45.1 8.7 23.1 33.3 8.8 21.2 29.9 5.0 14.5 21.9
ImageBERTg - - - 44.0 71.2 80.4 - - - 32.3 59.0 70.2


Unicoder-VLa 64.3 86.8 92.3 - - - 48.4 76.0 85.2 - - -
Uniterb 83.6 95.7 97.7 - - - 68.7 89.2 93.9 - - -
CLIP 88.0 98.7 99.4 58.4 81.5 88.1 68.7 90.6 95.2 37.8 62.4 72.2

Table 12. CLIP improves zero-shot retrieval and is competitive with the best fine-tuned result on Flickr30k text retrieval. Bold
indicates best overall performance while an underline indicates best in category performance (zero-shot or fine-tuned). For all other
models, best results from the paper are reported regardless of model size / variant. MSCOCO performance is reported on the 5k test set.
a
(Li et al., 2020a) b (Chen et al., 2019) c (Gan et al., 2020) d (Li et al., 2020b) e (Yu et al., 2020) f (Li et al., 2017) g (Qi et al., 2020)
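As a sketch of the zero-shot retrieval setup described in E.1 (not the evaluation code used for Table 12), text-to-image Recall@K can be computed from L2-normalized CLIP embeddings as below. Captions are assumed to have been encoded with "a photo of " prepended, and the simple pairing of caption i with image i ignores the multiple captions per image in Flickr30k and MSCOCO.

```python
import numpy as np

def recall_at_k(image_emb, text_emb, k=1):
    """Text-to-image retrieval Recall@K, assuming caption i is paired with
    image i and that both embedding matrices are L2-normalized CLIP features."""
    sims = text_emb @ image_emb.T              # cosine similarities
    ranks = np.argsort(-sims, axis=1)          # most similar image first
    hits = (ranks[:, :k] == np.arange(len(text_emb))[:, None]).any(axis=1)
    return hits.mean()

# Example with random stand-in features (512-d, like ViT-B/32 embeddings):
rng = np.random.default_rng(0)
imgs = rng.normal(size=(1000, 512)); imgs /= np.linalg.norm(imgs, axis=1, keepdims=True)
caps = rng.normal(size=(1000, 512)); caps /= np.linalg.norm(caps, axis=1, keepdims=True)
print("R@1 for random features:", recall_at_k(imgs, caps, k=1))
```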

                  MNIST   SVHN   IIIT5K 1k   Hateful Memes   SST-2
Finetune
  SOTA            99.8a   96.4b  98.9c       78.0d           97.5e
  JOINTf          -       -      89.6        -               -
  CBoWg           -       -      -           -               80.0
Baseline
  Raw Pixels      92.5    -      -           -               -
  ES Best         98.9h   -      -           58.6h           59.0i
Linear
  CLIP            99.2    -      -           77.3            80.5
ZS
  CLIP            88.4    51.0   90.0        63.3            67.9

Table 13. OCR performance on 5 datasets. All metrics are accuracy on the test set except for Hateful Memes which reports ROC AUC on the dev set. Single model SOTA reported to best of knowledge. Pixel LR refers to logistic regression on raw pixels. ES Best reports the best performance across the 56 non-CLIP models in our evaluation suite. a (Assiri, 2020) b (Jaderberg et al., 2015) c (Wang et al., 2020) d (Lippe et al., 2020) f (Jaderberg et al., 2014) g (Wang et al., 2018) h (Xie et al., 2020) i (Mahajan et al., 2018)

                  UCF101 Top-1   K700 AVG   RareAct mWAP   RareAct mWSAP
Finetune
  R(2+1)D-BERTa   98.7           -          -              -
  NS ENet-L2b     -              84.8       -              -
  HT100M S3Dd     91.3           -          -              -
  I3De            -              70.2       -              -
  MMV FACf        91.8           -          -              -
Linear
  NS ENet-L2c     89.4c          68.2c      -              -
  CLIP            92.0           73.0       -              -
ZS
  HT100M S3Dd     -              -          30.5           34.8
  CLIP            80.3           69.6       40.7           44.8

Table 14. Action recognition performance on 3 video datasets. Single model SOTA reported to best of knowledge. Note that linear CLIP and linear NS ENet-L2 are trained and evaluated on a single frame subsampled version of each dataset and not directly comparable to prior work. On Kinetics-700, we report the ActivityNet competition metric which is the average of top-1 and top-5 performance. a (Kalfaoglu et al., 2020) b (Lu et al., 2020) c (Xie et al., 2020) d (Miech et al., 2020b) e (Carreira et al., 2019) f (Alayrac et al., 2020)
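The single-frame subsampling noted in the Table 14 caption (and used for the video datasets in Appendix A.1) amounts to taking the center frame of each clip. A minimal OpenCV sketch, assuming decodable video files, is:

```python
import cv2

def middle_frame(video_path):
    """Return the center frame of a video as an RGB array, mirroring the
    single-frame subsampling used for UCF101 and Kinetics-700."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames // 2, 0))
    ok, frame_bgr = cap.read()
    cap.release()
    if not ok:
        raise RuntimeError(f"could not decode {video_path}")
    return cv2.cvtColor(frame_bgr, cv2.COLOR_BGR2RGB)
```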
E.3. Action Recognition in Videos

For the purpose of learning, a potentially important aspect of natural language is its ability to express, and therefore supervise, an extremely wide set of concepts. A CLIP model, since it is trained to pair semi-arbitrary text with images, is likely to receive supervision for a wide range of visual concepts involving both common and proper nouns, verbs, and adjectives. ImageNet-1K, by contrast, only labels common nouns. Does the lack of broader supervision in ImageNet result in weaker transfer of ImageNet models to tasks involving the recognition of visual concepts that are not nouns?

To investigate this, we measure and compare the performance of CLIP and ImageNet models on several video action classification datasets which measure the ability of a model to recognize verbs. In Table 14 we report results on UCF-101 (Soomro et al., 2012) and Kinetics-700 (Carreira et al., 2019), two common datasets for the task. Unfortunately, our CPU-based linear classifier takes a prohibitively long time to evaluate on a video dataset due to the very large number of training frames. To deal with this, we aggressively sub-sample each video to only a single center frame, effectively turning it into an image classification dataset. As a result, our reported performance in a linear evaluation setting likely underestimates performance by a moderate amount.

Despite this handicap, CLIP features transfer surprisingly well to this task. CLIP matches the best prior result on UCF-101 in a linear probe evaluation setting and also outperforms all other models in our evaluation suite. On Kinetics-700, CLIP also outperforms the fine-tuned I3D baseline from the original paper.
IN IN-V2 IN-A IN-R ObjectNet IN-Sketch IN-Vid YTBB


Top-1 Top-1 Top-1 Top-1 Top-1 Top-1 PM0 PM10 PM0 PM10
NS EfficientNet-L2a 88.3 80.2 84.9 74.7 68.5 47.6 88.0 82.1 67.7 63.5
FixResNeXt101-32x48d V2b 86.4 78.0 68.4 80.0 57.8 59.1 85.8 72.2 68.9 57.7
Linear Probe CLIP 85.4 75.9 75.3 84.2 66.2 57.4 89.1 77.2 68.7 63.1
Zero-Shot CLIP 76.2 70.1 77.2 88.9 72.3 60.2 95.3 89.2 95.2 88.5

Table 15. Detailed ImageNet robustness performance. IN is used to abbreviate for ImageNet. a (Xie et al., 2020) b (Touvron et al., 2019)

we report CLIP’s zero-shot performance when averaging 16. Since IM2GPS is a regression benchmark, we guess the
predictions across all frames. CLIP also performs well in GPS coordinates of the nearest image in a set of reference
this setting and on Kinetics-700 its performance is within images using CLIP’s embedding space. This is not a zero-
1% of the fully supervised I3D baseline which is trained shot result since it uses nearest-neighbor regression. Despite
on 545000 labeled videos. Encouraged by these results, we querying only 1 million images, which is much less than
also measure CLIP’s performance on the recently introduced prior work, CLIP performs similarly to several task specific
RareAct dataset (Miech et al., 2020a) which was designed models. It is not, however, competitive with the current state
to measure zero-shot recognition of unusual actions like of the art.
“hammering a phone” and “drilling an egg”. CLIP improves
over the prior state of the art, a S3D model trained on auto- E.5. Robustness to Distribution Shift
matically extracted captions from 100 million instructional
videos, by 10 points. Section 3.3 provides a high level summary and analysis of
ImageNet related robustness results. We briefly provide
While CLIP has encouragingly strong performance on the some additional numerical details in this appendix. Per-
task of action recognition, we note that there are many differ- formance results per dataset are provided in Table 15 and
ences between the models being compared beyond just their compared with the current state of the art results reported
form of supervision such as model architecture, training in Taori et al. (2020)’s evaluation suite. Zero-shot CLIP im-
data distribution, dataset size, and compute used. Further proves the state of the art on 5 of the 7 datasets, ImageNet-R,
work is needed to more precisely determine what specific ObjectNet, ImageNet-Sketch, ImageNet-Vid, and Youtube-
design decisions contribute to achieving high performance BB. CLIP’s improvements are largest on ImageNet-Vid and
on this task. Youtube-BB due to its flexible zero-shot capability and on
ImageNet-R, which likely reflects CLIP’s pre-training dis-
tribution including significant amounts of creative content.
1km 25km 200km 750km 2500km A similar behavior has been documented for the Instagram
ISNs a
16.9 43.0 51.9 66.7 80.2 pre-trained ResNeXt models as discussed in Taori et al.
CPlaNetb 16.5 37.1 46.4 62.0 78.5 (2020).
CLIP 13.9 32.9 43.0 62.0 79.3
Deep-Ret+c 14.4 33.3 47.7 61.6 73.4
PlaNetd 8.4 24.5 37.6 53.6 71.3

Table 16. Geolocalization performance on the IM2GPS test set.


Metric is percent of images localized within a given radius. Models
are ordered by average performance. a (Muller-Budack et al., 2018)
b
(Hongsuck Seo et al., 2018) c (Vo et al., 2017) c (Weyand et al.,
2016)

E.4. Geolocalization
Another behavior we noticed during the development of
CLIP was its ability to recognize many places and locations.
To quantify this we created the Country211 dataset as de-
scribed in Appendix A and report results on it throughout
the paper. However it is a new benchmark so to compare
with prior work on geolocalization we also report results
on the IM2GPS test set from Hays & Efros (2008) in Table
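The IM2GPS evaluation in E.4 is nearest-neighbor regression in CLIP's embedding space: each query image inherits the GPS coordinates of its most similar reference image. A minimal sketch, assuming pre-computed L2-normalized embeddings, a reference coordinate table, and a standard haversine great-circle distance for the radius thresholds of Table 16:

```python
import numpy as np

def predict_gps(query_emb, ref_emb, ref_coords):
    """query_emb: (Q, D) and ref_emb: (R, D) L2-normalized CLIP features;
    ref_coords: (R, 2) latitude/longitude of the reference photos.
    Returns the coordinates of the nearest reference image for each query."""
    sims = query_emb @ ref_emb.T          # cosine similarity
    nearest = sims.argmax(axis=1)
    return ref_coords[nearest]

def within_radius(pred, truth, km):
    """Fraction of predictions within `km` kilometers of the ground truth,
    using the haversine great-circle distance."""
    lat1, lon1, lat2, lon2 = map(np.radians, (pred[:, 0], pred[:, 1],
                                              truth[:, 0], truth[:, 1]))
    a = (np.sin((lat2 - lat1) / 2) ** 2
         + np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    dist_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    return float((dist_km <= km).mean())
```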
Columns of Table 17, from left to right (same order as Tables 9 and 10): Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Stanford Cars, FGVC Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, FER2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, Rendered SST2, ImageNet.
RN50 81.1 75.6 41.6 32.6 59.6 55.8 19.3 82.1 41.7 85.4 82.1 65.9 66.6 42.2 94.3 41.1 54.2 35.2 42.2 16.1 57.6 63.6 43.5 20.3 59.7 56.9 59.6
CLIP-ResNet

RN101 83.9 81.0 49.0 37.2 59.9 62.3 19.5 82.4 43.9 86.2 85.1 65.7 59.3 45.6 96.7 33.1 58.5 38.3 33.3 16.9 55.2 62.2 46.7 28.1 61.1 64.2 62.2
RN50x4 86.8 79.2 48.9 41.6 62.7 67.9 24.6 83.0 49.3 88.1 86.0 68.0 75.2 51.1 96.4 35.0 59.2 35.7 26.0 20.2 57.5 65.5 49.0 17.0 58.3 66.6 65.8
RN50x16 90.5 82.2 54.2 45.9 65.0 72.3 30.3 82.9 52.8 89.7 87.6 71.9 80.0 56.0 97.8 40.3 64.4 39.6 33.9 24.0 62.5 68.7 53.4 17.6 58.9 67.6 70.5
RN50x64 91.8 86.8 61.3 48.9 66.9 76.0 35.6 83.8 53.4 93.4 90.6 77.3 90.8 61.0 98.3 59.4 69.7 47.9 33.2 29.6 65.0 74.1 56.8 27.5 62.1 70.7 73.6
B/32 84.4 91.3 65.1 37.8 63.2 59.4 21.2 83.1 44.5 87.0 87.9 66.7 51.9 47.3 97.2 49.4 60.3 32.2 39.4 17.8 58.4 64.5 47.8 24.8 57.6 59.6 63.2
CLIP-ViT

B/16 89.2 91.6 68.7 39.1 65.2 65.6 27.1 83.9 46.0 88.9 89.3 70.4 56.0 52.7 98.2 54.1 65.5 43.3 44.0 23.3 48.1 69.8 52.4 23.4 61.7 59.8 68.6
L/14 92.9 96.2 77.9 48.3 67.7 77.3 36.1 84.1 55.3 93.5 92.6 78.7 87.2 57.5 99.3 59.9 71.6 50.3 23.1 32.7 58.8 76.2 60.3 24.3 63.3 64.0 75.3
L/14-336px 93.8 95.7 77.5 49.5 68.4 78.8 37.2 84.3 55.7 93.5 92.8 78.3 88.3 57.7 99.4 59.6 71.7 52.3 21.9 34.9 63.0 76.9 61.3 24.8 63.3 67.9 76.2

Table 17. Zero-shot performance of CLIP models over 27 datasets.
