Learning Transferable Visual Models From Natural Language Supervision
Alec Radford * 1 Jong Wook Kim * 1 Chris Hallacy 1 Aditya Ramesh 1 Gabriel Goh 1 Sandhini Agarwal 1
Girish Sastry 1 Amanda Askell 1 Pamela Mishkin 1 Jack Clark 1 Gretchen Krueger 1 Ilya Sutskever 1
[Figure 1 diagram: an image encoder and a text encoder embed (image, text) pairs such as "Pepper the aussie pup"; at test time, class names such as "plane", "car", "dog", and "bird" are inserted into the prompt "A photo of a {object}." and embedded to form text embeddings T1, T2, T3, …, TN for zero-shot classification.]
Figure 1. Summary of our approach. While standard image models jointly train an image feature extractor and a linear classifier to predict
some label, CLIP jointly trains an image encoder and a text encoder to predict the correct pairings of a batch of (image, text) training
examples. At test time the learned text encoder synthesizes a zero-shot linear classifier by embedding the names or descriptions of the
target dataset’s classes.
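As a concrete illustration of the zero-shot procedure summarized in Figure 1, here is a minimal NumPy sketch. The image_encoder and text_encoder callables are hypothetical stand-ins for trained encoders that return embedding matrices; this is an outline of the idea, not the paper's released code.

import numpy as np

def l2_normalize(x, axis=-1):
    # scale embeddings to unit length so dot products become cosine similarities
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def zero_shot_classify(image, class_names, image_encoder, text_encoder):
    # build the zero-shot classifier by embedding one prompt per class name,
    # then score the image against every class embedding
    prompts = ["A photo of a {}.".format(name) for name in class_names]
    text_emb = l2_normalize(text_encoder(prompts))        # [num_classes, d]
    image_emb = l2_normalize(image_encoder(image[None]))  # [1, d]
    scores = image_emb @ text_emb.T                       # cosine similarities
    return class_names[int(np.argmax(scores))]

With a fixed set of class names the text embeddings can be computed once and reused for every image, which is what makes the synthesized classifier cheap to apply.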
…predicting the one with the highest score. Adopting more recent architectures and pre-training approaches, VirTex (Desai & Johnson, 2020), ICMLM (Bulent Sariyildiz et al., 2020), and ConVIRT (Zhang et al., 2020) have recently demonstrated the potential of transformer-based language modeling, masked language modeling, and contrastive objectives to learn image representations from text.

While exciting as proofs of concept, using natural language supervision for image representation learning is still rare. This is likely because demonstrated performance on common benchmarks is much lower than alternative approaches. For example, Li et al. (2017) reach only 11.5% accuracy on ImageNet in a zero-shot setting. This is well below the 88.4% accuracy of the current state of the art (Xie et al., 2020). It is even below the 50% accuracy of classic computer vision approaches (Deng et al., 2012). Instead, more narrowly scoped but well-targeted uses of weak supervision have improved performance. Mahajan et al. (2018) showed that predicting ImageNet-related hashtags on Instagram images is an effective pre-training task. When fine-tuned to ImageNet these pre-trained models increased accuracy by over 5% and improved the overall state of the art at the time. Kolesnikov et al. (2019) and Dosovitskiy et al. (2020) have also demonstrated large gains on a broader set of transfer benchmarks by pre-training models to predict the classes of the noisily labeled JFT-300M dataset.

This line of work represents the current pragmatic middle ground between learning from a limited amount of supervised "gold-labels" and learning from practically unlimited amounts of raw text. However, it is not without compromises. Both works carefully design, and in the process limit, their supervision to 1000 and 18291 classes respectively. Natural language is able to express, and therefore supervise, a much wider set of visual concepts through its generality. Both approaches also use static softmax classifiers to perform prediction and lack a mechanism for dynamic outputs. This severely curtails their flexibility and limits their "zero-shot" capabilities.

A crucial difference between these weakly supervised models and recent explorations of learning image representations directly from natural language is scale. While Mahajan et al. (2018) and Kolesnikov et al. (2019) trained their models for accelerator years on millions to billions of images, VirTex, ICMLM, and ConVIRT trained for accelerator days on one to two hundred thousand images. In this work, we close this gap and study the behaviors of image classifiers trained with natural language supervision at large scale. Enabled by the large amounts of publicly available data of this form on the internet, we create a new dataset of 400 million (image, text) pairs and demonstrate that a simplified version of ConVIRT trained from scratch, which we call CLIP, for Contrastive Language-Image Pre-training, is an efficient method of learning from natural language supervision. We study the scalability of CLIP by training a series of eight models spanning almost 2 orders of magnitude of compute and observe that transfer performance is a smoothly predictable function of compute (Hestness et al., 2017; Kaplan et al., 2020). We find that CLIP, similar to the GPT family, learns to perform a wide set of tasks during pre-training including OCR, geo-localization, action recognition, and many others. We measure this by benchmarking the zero-shot transfer performance of CLIP on over 30 existing datasets and find it can be competitive with prior task-specific supervised models. We also confirm these findings with linear-probe
representation learning analysis and show that CLIP outperforms the best publicly available ImageNet model while also being more computationally efficient. We additionally find that zero-shot CLIP models are much more robust than equivalent accuracy supervised ImageNet models, which suggests that zero-shot evaluation of task-agnostic models is much more representative of a model's capability. These results have significant policy and ethical implications, which we consider in Section 7.

…representations, improvements in deep contextual representation learning suggest we now have the tools to effectively leverage this abundant source of supervision (McCann et al., 2017).

Learning from natural language has several potential strengths over other training methods. It's much easier to scale natural language supervision compared to standard crowd-sourced labeling for image classification since it does not require annotations to be in a classic "machine learning compatible format" such as the canonical 1-of-N majority vote "gold label". Instead, methods which work on natural language can learn passively from the supervision contained in the vast amount of text on the internet.

[Plot residue: y-axis "Zero-Shot ImageNet Accuracy".]
…balance the results by including up to 20,000 (image, text) pairs per query. The resulting dataset has a similar total word count as the WebText dataset used to train GPT-2. We refer to this dataset as WIT for WebImageText.

(Footnote, truncated: …WordNet synsets not already in the query list are added.)

2.3. Selecting an Efficient Pre-Training Method

State-of-the-art computer vision systems use very large amounts of compute. Mahajan et al. (2018) required 19 GPU years to train their ResNeXt101-32x48d and Xie et al. (2020) required 33 TPUv3 core-years to train their Noisy Student EfficientNet-L2. When considering that both these systems were trained to predict only 1000 ImageNet classes, the task of learning an open set of visual concepts from natural language seems daunting. In the course of our efforts, we found training efficiency was key to successfully scaling natural language supervision and we selected our final pre-training method based on this metric.

Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method. In Figure 2 we show that a 63 million parameter transformer language model, which already uses twice the compute of its ResNet-50 image encoder, learns to recognize ImageNet classes three times slower than a much simpler baseline that predicts a bag-of-words encoding of the same text.

Both these approaches share a key similarity. They try to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images. Recent work in contrastive representation learning for images has found that contrastive objectives can learn better representations than their equivalent predictive objective (Tian et al., 2019). Other work has found that although generative models of images can learn high quality image representations, they require over an order of magnitude more compute than contrastive models with the same performance (Chen et al., 2020a). Noting these findings, we explored training a system to solve the potentially easier proxy task of predicting only which text as a whole is paired with which image and not the exact words of that text. Starting with the same bag-of-words encoding baseline, we swapped the predictive objective for a contrastive objective in Figure 2 and observed a further 4x efficiency improvement in the rate of zero-shot transfer to ImageNet.

Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores. In Figure 3 we include pseudocode of the core of an implementation of CLIP. To our knowledge this batch construction technique and objective was first introduced in the area of deep metric learning as the multi-class N-pair loss (Sohn, 2016), was popularized for contrastive representation learning by Oord et al. (2018) as the InfoNCE loss, and was recently adapted for contrastive (text, image) representation learning in the domain of medical imaging by Zhang et al. (2020).

Due to the large size of our pre-training dataset, over-fitting is not a major concern and the details of training CLIP are simplified compared to the implementation of Zhang et al. (2020). We train CLIP from scratch without initializing the image encoder with ImageNet weights or the text encoder with pre-trained weights. We remove the non-linear projection between the representation and the contrastive embedding space, a change which was introduced by Bachman et al. (2019) and popularized by Chen et al. (2020b). We use only a linear projection to map from each encoder's representation to the multi-modal embedding space. We did not notice a difference in training efficiency between the two versions and speculate that non-linear projections may be co-adapted with details of current image-only self-supervised representation learning methods. We also remove the text transformation function t_u from Zhang et al. (2020), which samples a single sentence uniformly at random from the text, since many of the (image, text) pairs in CLIP's pre-training dataset are only a single sentence. We also simplify the image transformation function t_v. A random square crop from resized images is the only data augmentation used during training. Finally, the temperature parameter τ, which controls the range of the logits in the softmax, is directly optimized during training as a log-parameterized multiplicative scalar to avoid tuning it as a hyper-parameter.

2.4. Choosing and Scaling a Model

We consider two different architectures for the image encoder. For the first, we use ResNet-50 (He et al., 2016a) as the base architecture for the image encoder due to its widespread adoption and proven performance. We make several modifications to the original version using the ResNet-D improvements from He et al. (2019) and the antialiased rect-2 blur pooling from Zhang (2019). We also replace the global average pooling layer with an attention pooling mechanism. The attention pooling is implemented as a single layer of "transformer-style" multi-head QKV attention where the query is conditioned on the global average-pooled representation of the image. For the second architecture, we experiment with the recently introduced Vision Transformer
(ViT) (Dosovitskiy et al., 2020). We closely follow their implementation with only the minor modification of adding an additional layer normalization to the combined patch and position embeddings before the transformer and use a slightly different initialization scheme.

The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019). As a base size we use a 63M-parameter 12-layer 512-wide model with 8 attention heads. The transformer operates on a lower-cased byte pair encoding (BPE) representation of the text with a 49,152 vocab size (Sennrich et al., 2015). For computational efficiency, the max sequence length was capped at 76. The text sequence is bracketed with [SOS] and [EOS] tokens and the activations of the highest layer of the transformer at the [EOS] token are treated as the feature representation of the text, which is layer normalized and then linearly projected into the multi-modal embedding space. Masked self-attention was used in the text encoder to preserve the ability to initialize with a pre-trained language model or add language modeling as an auxiliary objective, though exploration of this is left as future work.

While previous computer vision research has often scaled models by increasing the width (Mahajan et al., 2018) or depth (He et al., 2016a) in isolation, for the ResNet image encoders we adapt the approach of Tan & Le (2019) which found that allocating additional compute across all of width, depth, and resolution outperforms only allocating it to one dimension of the model. While Tan & Le (2019) tune the ratio of compute allocated to each dimension for their EfficientNet architecture, we use a simple baseline of allocating additional compute equally to increasing the width, depth, and resolution of the model. For the text encoder, we only scale the width of the model to be proportional to the calculated increase in width of the ResNet and do not scale the depth at all, as we found CLIP's performance to be less sensitive to the capacity of the text encoder.

# image_encoder - ResNet or Vision Transformer
# text_encoder  - CBOW or Text Transformer
# I[n, h, w, c] - minibatch of aligned images
# T[n, l]       - minibatch of aligned texts
# W_i[d_i, d_e] - learned proj of image to embed
# W_t[d_t, d_e] - learned proj of text to embed
# t             - learned temperature parameter

# extract feature representations of each modality
I_f = image_encoder(I) #[n, d_i]
T_f = text_encoder(T) #[n, d_t]

# joint multimodal embedding [n, d_e]
I_e = l2_normalize(np.dot(I_f, W_i), axis=1)
T_e = l2_normalize(np.dot(T_f, W_t), axis=1)

# scaled pairwise cosine similarities [n, n]
logits = np.dot(I_e, T_e.T) * np.exp(t)

# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss   = (loss_i + loss_t)/2

Figure 3. Numpy-like pseudocode for the core of an implementation of CLIP.

2.5. Training

We train a series of 5 ResNets and 3 Vision Transformers. For the ResNets we train a ResNet-50, a ResNet-101, and then 3 more which follow EfficientNet-style model scaling and use approximately 4x, 16x, and 64x the compute of a ResNet-50. They are denoted as RN50x4, RN50x16, and RN50x64 respectively. For the Vision Transformers we train a ViT-B/32, a ViT-B/16, and a ViT-L/14. We train all models for 32 epochs. We use the Adam optimizer (Kingma & Ba, 2014) with decoupled weight decay regularization (Loshchilov & Hutter, 2017) applied to all weights that are not gains or biases, and decay the learning rate using a cosine schedule (Loshchilov & Hutter, 2016). Initial hyper-parameters were set using a combination of grid searches, random search, and manual tuning on the baseline ResNet-50 model when trained for 1 epoch. Hyper-parameters were then adapted heuristically for larger models due to computational constraints. The learnable temperature parameter τ was initialized to the equivalent of 0.07 from (Wu et al., 2018) and clipped to prevent scaling the logits by more than 100, which we found necessary to prevent training instability. We use a very large minibatch size of 32,768. Mixed-precision (Micikevicius et al., 2017) was used to accelerate training and save memory. To save additional memory, gradient checkpointing (Griewank & Walther, 2000; Chen et al., 2016), half-precision Adam statistics (Dhariwal et al., 2020), and half-precision stochastically rounded text encoder weights were used. The calculation of embedding similarities was also sharded with individual GPUs computing only the subset of the pairwise similarities necessary for their local batch of embeddings. The largest ResNet model, RN50x64, took 18 days to train on 592 V100 GPUs while the largest Vision Transformer took 12 days on 256 V100 GPUs. For the ViT-L/14 we also pre-train at a higher 336 pixel resolution for one additional epoch to boost performance similar to FixRes (Touvron et al., 2019). We denote this model as ViT-L/14@336px. Unless otherwise specified, all results reported in this paper as "CLIP" use this model, which we found to perform best.
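To make the pseudocode in Figure 3 concrete, the following is a minimal, self-contained NumPy sketch of the symmetric contrastive loss, with the logit scale stored as a log-parameterized scalar initialized to the equivalent of a 0.07 temperature and clipped at 100, as described in Section 2.5. The randomly generated embeddings stand in for encoder outputs; this is an illustration, not the actual training code.

import numpy as np

def l2_normalize(x):
    return x / np.linalg.norm(x, axis=1, keepdims=True)

def clip_loss(I_e, T_e, log_scale):
    # I_e, T_e: L2-normalized image/text embeddings of the same batch, shape [n, d_e]
    scale = min(np.exp(log_scale), 100.0)        # clip the logit scale at 100 (Section 2.5)
    logits = scale * I_e @ T_e.T                 # [n, n] scaled pairwise cosine similarities

    def cross_entropy(axis):
        # softmax cross entropy along the given axis with the diagonal as targets
        shifted = logits - logits.max(axis=axis, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))
        return -np.diag(log_probs).mean()

    loss_image = cross_entropy(axis=1)           # image -> text direction
    loss_text = cross_entropy(axis=0)            # text -> image direction
    return (loss_image + loss_text) / 2

rng = np.random.default_rng(0)
I_e = l2_normalize(rng.normal(size=(8, 512)))    # stand-ins for encoder outputs
T_e = l2_normalize(rng.normal(size=(8, 512)))
log_scale = np.log(1 / 0.07)                     # initialization equivalent to a 0.07 temperature
print(clip_loss(I_e, T_e, log_scale))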
Figure 6. Zero-shot CLIP outperforms few-shot linear probes. Zero-shot CLIP matches the average performance of a 4-shot linear classifier trained on the same feature space and nearly matches the best results of a 16-shot linear classifier across publicly available models. For both BiT-M and SimCLRv2, the best performing model is highlighted. Light gray lines are other models in the eval suite. The 20 datasets with at least 16 examples per class were used in this analysis.

…supervision in ImageNet.

Looking at where zero-shot CLIP notably underperforms, we see that zero-shot CLIP is quite weak on several specialized, complex, or abstract tasks such as satellite image classification (EuroSAT and RESISC45), lymph node tumor detection (PatchCamelyon), counting objects in synthetic scenes (CLEVRCounts), self-driving related tasks such as German traffic sign recognition (GTSRB), and recognizing distance to the nearest car (KITTI Distance). These results highlight the poor capability of zero-shot CLIP on more complex tasks. By contrast, non-expert humans can robustly perform several of these tasks, such as counting, satellite image classification, and traffic sign recognition, suggesting significant room for improvement. However, we caution that it is unclear whether measuring zero-shot transfer, as opposed to few-shot transfer, is a meaningful evaluation for difficult tasks that a learner has no prior experience with, such as lymph node tumor classification for almost all humans (and possibly CLIP).

While comparing zero-shot performance to fully supervised models contextualizes the task-learning capabilities of CLIP, comparing to few-shot methods is a more direct comparison, since zero-shot is its limit. In Figure 6, we visualize how zero-shot CLIP compares to few-shot logistic regression…

A potential resolution of this discrepancy between zero-shot and few-shot performance is to use CLIP's zero-shot classifier as a prior for the weights of the few-shot classifier. While adding an L2 penalty towards the generated weights is a straightforward implementation of this idea, we found that hyperparameter optimization would often select for such a large value of this regularizer that the resulting few-shot classifier was "just" the zero-shot classifier. Research into better methods of combining the strength of zero-shot transfer with the flexibility of few-shot learning is a promising direction for future work.

When comparing zero-shot CLIP to few-shot logistic regression on the features of other models, zero-shot CLIP roughly matches the performance of the best performing 16-shot classifier in our evaluation suite, which uses the features of a BiT-M ResNet-152x2 trained on ImageNet-21K. We are certain that a BiT-L model trained on JFT-300M would perform even better but these models have not been publicly released. That a BiT-M ResNet-152x2 performs best in a 16-shot setting is somewhat surprising since, as analyzed in Section 3.2, the Noisy Student EfficientNet-L2 outperforms it in a fully supervised setting by almost 5% on average across 27 datasets.

In addition to studying the average performance of zero-shot CLIP and few-shot logistic regression, we also examine performance on individual datasets. In Figure 7, we show estimates for the number of labeled examples per class that a logistic regression classifier on the same feature space requires to match the performance of zero-shot CLIP. Since zero-shot CLIP is also a linear classifier, this estimates the effective data efficiency of zero-shot transfer in this setting. In order to avoid training thousands of linear classifiers, we estimate the effective data efficiency based on a log-linear interpolation…
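For reference, a few-shot linear probe of the kind compared against above can be fit on frozen features with scikit-learn. This is a minimal sketch assuming precomputed train/test feature matrices; the paper's probes additionally sweep the regularization strength, which is omitted here.

import numpy as np
from sklearn.linear_model import LogisticRegression

def few_shot_probe(train_feats, train_labels, test_feats, test_labels, shots, seed=0):
    # sample `shots` labeled examples per class and fit an L2-regularized
    # logistic regression on the frozen features, then report test accuracy
    rng = np.random.default_rng(seed)
    keep = []
    for c in np.unique(train_labels):
        idx = np.flatnonzero(train_labels == c)
        keep.extend(rng.choice(idx, size=min(shots, idx.size), replace=False))
    keep = np.array(keep)
    probe = LogisticRegression(C=1.0, max_iter=1000)   # C would normally be swept
    probe.fit(train_feats[keep], train_labels[keep])
    return probe.score(test_feats, test_labels)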
[Figure residue: estimated labeled examples per class required to match zero-shot CLIP (FER2013 184, CIFAR10 81, Food101 64, OxfordPets 48, Country211 32, ImageNet 16.0, PCam 14.7, SST2 14.4), plus panels titled "Linear probe average over Kornblith et al.'s 12 datasets" and "Linear probe average over all 27 datasets" with Average Score (%) on the y-axis.]
…Appendix A includes datasets representing the aforementioned tasks, the German Traffic Signs Recognition Benchmark (Stallkamp et al., 2011), as well as several other datasets adapted from VTAB (Zhai et al., 2019).

On this broader evaluation suite, the benefits of CLIP are more clear. All CLIP models, regardless of scale, outperform all evaluated systems in terms of compute efficiency. The improvement in average score of the best model over previous systems increases from 2.6% to 5%. We also find that self-supervised systems do noticeably better on our broader evaluation suite. For instance, while SimCLRv2 still underperforms BiT-M on average on the 12 datasets of Kornblith et al. (2019), SimCLRv2 outperforms BiT-M on our 27 dataset evaluation suite. These findings suggest continuing to expand task diversity and coverage in order to better understand the "general" performance of systems. We suspect additional evaluation efforts along the lines of VTAB to be valuable.

In addition to the aggregate analysis above, we visualize per-dataset differences in the performance of the best CLIP model and the best model in our evaluation suite across all 27 datasets in Figure 11. CLIP outperforms the Noisy Student EfficientNet-L2 on 21 of the 27 datasets. CLIP improves the most on tasks which require OCR (SST2 and HatefulMemes), geo-localization and scene recognition (Country211, SUN397), and activity recognition in videos (Kinetics700 and UCF101). In addition CLIP also does much better on fine-grained car and traffic sign recognition (Stanford Cars and GTSRB). This may reflect a problem with overly narrow supervision in ImageNet. A result such as the 14.7% improvement on GTSRB could be indicative of an issue with ImageNet-1K, which has only a single label for all traffic and street signs. This could encourage a supervised representation to collapse intra-class details and hurt accuracy on a fine-grained downstream task. As mentioned, CLIP still underperforms the EfficientNet on several datasets. Unsurprisingly, the dataset that the EfficientNet does best relative to CLIP on is the one it was trained on: ImageNet.
The EfficientNet also slightly outperforms CLIP on low-resolution datasets such as CIFAR10 and CIFAR100. We suspect this is at least partly due to the lack of scale-based data augmentation in CLIP. The EfficientNet also does slightly better on PatchCamelyon and CLEVRCounts, datasets where overall performance is still low for both approaches.

Figure 11. CLIP's features outperform the features of the best ImageNet model on a wide variety of datasets. Fitting a linear classifier on CLIP's features outperforms using the Noisy Student EfficientNet-L2 on 21 out of 27 datasets. (Chart: logistic regression on CLIP vs. EfficientNet L2 NS, Δ score in %: SST2 +23.6, Country211 +22.7, HatefulMemes +18.8, StanfordCars +15.9, GTSRB +14.7, SUN397 +6.5, Kinetics700 +6.2, RESISC45 +5.1, FER2013 +4.5, Food101 +3.9, FGVCAircraft +3.2, UCF101 +3.1, KITTI Distance +2.3, Birdsnap +1.4, Flowers102 +1.4, Caltech101 +1.3, EuroSAT +0.9, MNIST +0.6, DTD +0.5, VOC2007 +0.5, STL10 +0.0, OxfordPets -0.5, CIFAR10 -0.8, PatchCamelyon -1.2, CIFAR100 -1.7, CLEVRCounts -2.4, ImageNet -3.0.)

3.3. Robustness to Natural Distribution Shift

In 2015, it was announced that a deep learning model exceeded human performance on the ImageNet test set (He et al., 2015). However, research in the subsequent years has repeatedly found that these models still make many simple mistakes (Dodge & Karam, 2017; Geirhos et al., 2018; Alcorn et al., 2019), and new benchmarks testing these systems have often found their performance to be much lower than both their ImageNet accuracy and human accuracy (Recht et al., 2019; Barbu et al., 2019). What explains this discrepancy? Various ideas have been suggested and studied (Ilyas et al., 2019; Geirhos et al., 2020). A common theme of proposed explanations is that deep learning models are exceedingly adept at finding correlations and patterns which hold across their training dataset and thus improve in-distribution performance. However many of these correlations and patterns are actually spurious and do not hold for other distributions, resulting in large drops in performance on other datasets.

We caution that, to date, most of these studies limit their evaluation to models trained on ImageNet. Recalling the topic of discussion, it may be a mistake to generalize too far from these initial findings. To what degree are these failures attributable to deep learning, ImageNet, or some combination of the two? CLIP models, which are trained via natural language supervision on a very large dataset and are capable of high zero-shot performance, are an opportunity to investigate this question from a different angle.

Taori et al. (2020) is a recent comprehensive study moving towards quantifying and understanding these behaviors for ImageNet models. Taori et al. (2020) study how the performance of ImageNet models changes when evaluated on natural distribution shifts. They measure performance on a set of 7 distribution shifts: ImageNetV2 (Recht et al., 2019), ImageNet Sketch (Wang et al., 2019), Youtube-BB and ImageNet-Vid (Shankar et al., 2019), ObjectNet (Barbu et al., 2019), ImageNet Adversarial (Hendrycks et al., 2019), and ImageNet Rendition (Hendrycks et al., 2020a). They distinguish these datasets, which all consist of novel images collected from a variety of sources, from synthetic distribution shifts such as ImageNet-C (Hendrycks & Dietterich, 2019), Stylized ImageNet (Geirhos et al., 2018), or adversarial attacks (Goodfellow et al., 2014), which are created by perturbing existing images in various ways. They propose this distinction in part because they find that while several techniques have been demonstrated to improve performance on synthetic distribution shifts, they often fail to yield consistent improvements on natural distributions.³

³ We refer readers to Hendrycks et al. (2020a) for additional experiments and discussion on this claim.

Across these collected datasets, the accuracy of ImageNet models drops well below the expectation set by the ImageNet validation set. For the following summary discussion we report average accuracy across all 7 natural distribution shift datasets and average accuracy across the corresponding class subsets of ImageNet unless otherwise specified. Additionally, for Youtube-BB and ImageNet-Vid, which have two different evaluation settings, we use the average of pm-0 and pm-10 accuracy.

A ResNet-101 makes 5 times as many mistakes when evaluated on these natural distribution shifts compared to the ImageNet validation set. Encouragingly however, Taori et al. (2020) find that accuracy under distribution shift increases predictably with ImageNet accuracy and is well modeled as a linear function of logit-transformed accuracy. Taori et al. (2020) use this finding to propose that robustness analysis should distinguish between effective and relative robustness. Effective robustness measures improvements in accuracy under distribution shift above what is predicted by the documented relationship between in-distribution and out-of-distribution accuracy. Relative robustness captures any improvement in out-of-distribution accuracy.
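The effective-robustness bookkeeping described above can be sketched as follows, assuming paired in-distribution and shifted accuracies (as fractions) for a set of baseline ImageNet models. This follows the general recipe of fitting a linear trend on logit-transformed accuracies; it is not the exact implementation of Taori et al. (2020).

import numpy as np

def logit(p):
    p = np.asarray(p, dtype=float)
    return np.log(p / (1 - p))

def effective_robustness(baseline_id, baseline_ood, model_id, model_ood):
    # fit ood = a * logit(id) + b (in logit space) on the baseline models,
    # then return how far the candidate model sits above that prediction
    a, b = np.polyfit(logit(baseline_id), logit(baseline_ood), deg=1)
    predicted_ood = 1 / (1 + np.exp(-(a * logit(model_id) + b)))
    return model_ood - predicted_ood

# toy illustration with made-up accuracies (fractions, not percentages)
baseline_id = [0.70, 0.75, 0.80, 0.85]
baseline_ood = [0.40, 0.46, 0.53, 0.60]
print(effective_robustness(baseline_id, baseline_ood, model_id=0.76, model_ood=0.58))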
[Figure residue: two panels, "Linear probe average over Kornblith et al.'s 12 datasets" and "Linear probe average over 26 datasets", plotting Transfer Score (%) against ImageNet Score (%).]
Taori et al. (2020) argue that robustness techniques should aim to improve both effective robustness and relative robustness.

Almost all models studied in Taori et al. (2020) are trained or fine-tuned on the ImageNet dataset. Returning to the discussion in the introduction to this section - is training or adapting to the ImageNet dataset distribution the cause of the observed robustness gap? Intuitively, a zero-shot model should not be able to exploit spurious correlations or patterns that hold only on a specific distribution, since it is not trained on that distribution.⁴ Thus it is reasonable to expect zero-shot models to have much higher effective robustness. In Figure 13, we compare the performance of zero-shot CLIP with existing ImageNet models on natural distribution shifts. All zero-shot CLIP models improve effective robustness by a large amount and reduce the size of the gap between ImageNet accuracy and accuracy under distribution shift by up to 75%.

⁴ We caution that a zero-shot model can still exploit spurious correlations that are shared between the pre-training and evaluation distributions.

While these results show that zero-shot models can be much more robust, they do not necessarily mean that supervised learning on ImageNet causes a robustness gap. Other details of CLIP, such as its large and diverse pre-training dataset or use of natural language supervision, could also result in much more robust models regardless of whether they are zero-shot or fine-tuned. As an initial experiment to potentially begin narrowing this down, we also measure how the performance of CLIP models changes after adapting to the ImageNet distribution via an L2 regularized logistic regression classifier fit to CLIP features on the ImageNet training set. We visualize how performance changes from the zero-shot classifier in Figure 14. Although adapting CLIP to the ImageNet distribution increases its ImageNet accuracy by 9.2% to 85.4% overall, and ties the accuracy of the 2018 SOTA from Mahajan et al. (2018), average accuracy under distribution shift slightly decreases.

It is surprising to see a 9.2% increase in accuracy, which corresponds to roughly 3 years of improvement in SOTA, fail to translate into any improvement in average performance under distribution shift. We also break down the differences between zero-shot accuracy and linear classifier accuracy per dataset in Figure 14 and find performance still increases significantly on one dataset, ImageNetV2.
Figure 13. Zero-shot CLIP is much more robust to distribution shift than standard ImageNet models. (Left) An ideal robust model (dashed line) performs equally well on the ImageNet distribution and on other natural image distributions. Zero-shot CLIP models shrink this "robustness gap" by up to 75%. Linear fits on logit transformed values are shown with bootstrap estimated 95% confidence intervals. (Right) Visualizing distribution shift for bananas, a class shared across 5 of the 7 natural distribution shift datasets. The performance of the best zero-shot CLIP model, ViT-L/14@336px, is compared with a model that has the same performance on the ImageNet validation set, ResNet-101. (Surviving right-panel values — Dataset, ImageNet ResNet101, Zero-Shot CLIP, Δ Score: ImageNet-R, 37.7, 88.9, +51.2%; ObjectNet, 32.6, 72.3, +39.7%.)
ImageNetV2 closely followed the creation process of the original ImageNet dataset, which suggests that gains in accuracy from supervised adaptation are closely concentrated around the ImageNet distribution. Performance decreases by 4.7% on ImageNet-R, 3.8% on ObjectNet, 2.8% on ImageNet Sketch, and 1.9% on ImageNet-A. The change in accuracy on the two other datasets, Youtube-BB and ImageNet Vid, is insignificant.

How is it possible to improve accuracy by 9.2% on the ImageNet dataset with little to no increase in accuracy under distribution shift? Is the gain primarily from "exploiting spurious correlations"? Is this behavior unique to some combination of CLIP, the ImageNet dataset, and the distribution shifts studied, or a more general phenomenon? Does it hold for end-to-end finetuning as well as linear classifiers? We do not have confident answers to these questions at this time. Prior work has also pre-trained models on distributions other than ImageNet, but it is common to study and release models only after they have been fine-tuned to ImageNet. As a step towards understanding whether pre-trained zero-shot models consistently have higher effective robustness than fine-tuned models, we encourage the authors of Mahajan et al. (2018), Kolesnikov et al. (2019), and Dosovitskiy et al. (2020) to, if possible, study these questions on their models as well.

We also investigate another robustness intervention enabled by flexible zero-shot natural-language-based image classifiers. The target classes across the 7 transfer datasets are not always perfectly aligned with those of ImageNet. Two datasets, Youtube-BB and ImageNet-Vid, consist of super-classes of ImageNet. This presents a problem when trying to use the fixed 1000-way classifier of an ImageNet model to make predictions. Taori et al. (2020) handle this by max-pooling predictions across all sub-classes according to the ImageNet class hierarchy. Sometimes this mapping is much less than perfect. For the person class in Youtube-BB, predictions are made by pooling over the ImageNet classes for a baseball player, a bridegroom, and a scuba diver. With CLIP we can instead generate a custom zero-shot classifier for each dataset directly based on its class names. In Figure 14 we see that this improves average effective robustness by 5% but is concentrated in large improvements on only a few datasets. Curiously, accuracy on ObjectNet also increases by 2.3%. Although the dataset was designed to closely overlap with ImageNet classes, using the names provided for each class by ObjectNet's creators still helps a small amount compared to using ImageNet class names and pooling predictions when necessary.
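The two prediction strategies discussed here can be sketched as follows. The subclass_map dictionary and the text_encoder callable are hypothetical placeholders; the sketch only illustrates max-pooling ImageNet sub-class probabilities versus building a classifier directly from a dataset's own class names.

import numpy as np

def pooled_imagenet_predict(imagenet_probs, subclass_map):
    # superclass handling in the style of Taori et al. (2020): take the max
    # probability over the ImageNet sub-classes mapped to each target class;
    # subclass_map is a hypothetical {target_class: [imagenet class indices]} dict
    classes = list(subclass_map)
    pooled = np.stack([imagenet_probs[:, idx].max(axis=1) for idx in subclass_map.values()], axis=1)
    return [classes[i] for i in pooled.argmax(axis=1)]

def custom_zero_shot_predict(image_emb, class_names, text_encoder):
    # CLIP alternative: build the classifier directly from the target dataset's
    # own class names (text_encoder is a hypothetical embedding function)
    text_emb = text_encoder(["A photo of a {}.".format(c) for c in class_names])
    text_emb = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    image_emb = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    return [class_names[i] for i in (image_emb @ text_emb.T).argmax(axis=1)]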
(Figure 14 data — change from zero-shot ImageNet classifier accuracy, %. Adapt to ImageNet: ImageNet +9.2, ImageNetV2 +5.8, Youtube-BB +0.6, ImageNet Vid -0.5, ImageNet-A -1.9, ImageNet Sketch -2.8, ObjectNet -3.8, ImageNet-R -4.7. Adapt to class shift: Youtube-BB +26.9, ImageNet Vid +8.3, ObjectNet +2.3, ImageNet Sketch 0, ImageNet-R 0, ImageNet-A 0, ImageNetV2 0, ImageNet 0. Left panel: average on 7 natural distribution shift datasets (top-1, %) vs. average on class-subsampled ImageNet (top-1, %), comparing Adaptive Zero-Shot CLIP, ImageNet Zero-Shot CLIP, and Logistic Regression CLIP against standard ImageNet training, robustness interventions, models trained with more data, and an ideal robust model (y = x).)
Figure 14. While supervised adaptation to ImageNet increases ImageNet accuracy by 9.2%, it slightly reduces average robustness.
(Left) Customizing zero-shot CLIP to each dataset improves robustness compared to using a single static zero-shot ImageNet classifier
and pooling predictions across similar classes as in Taori et al. (2020). CLIP models adapted to ImageNet have similar effective robustness
as the best prior ImageNet models. (Right) Details of per dataset changes in accuracy for the two robustness interventions. Adapting to
ImageNet increases accuracy on ImageNetV2 noticeably but trades off accuracy on several other distributions. Dataset specific zero-shot
classifiers can improve accuracy by a large amount but are limited to only a few datasets that include classes which don’t perfectly align
with ImageNet categories.
While zero-shot CLIP improves effective robustness, Figure 14 shows that the benefit is almost entirely gone in a fully supervised setting. To better understand this difference, we investigate how effective robustness changes on the continuum from zero-shot to fully supervised. In Figure 15 we visualize the performance of 0-shot, 1-shot, 2-shot, 4-shot, ..., 128-shot, and fully supervised logistic regression classifiers on the best CLIP model's features. We see that while few-shot models also show higher effective robustness than existing models, this benefit fades as in-distribution performance increases with more training data and is mostly, though not entirely, gone for the fully supervised model. Additionally, zero-shot CLIP is notably more robust than a few-shot model with equivalent ImageNet performance. Across our experiments, high effective robustness seems to result from minimizing the amount of distribution-specific training data a model has access to, but this comes at a cost of reducing dataset-specific performance.

Taken together, these results suggest that the recent shift towards large-scale task and dataset agnostic pre-training combined with a reorientation towards zero-shot and few-shot benchmarking on broad evaluation suites (as advocated by Yogatama et al. (2019) and Linzen (2020)) promotes the development of more robust systems and provides a more accurate assessment of performance. We are curious to see if the same results hold for zero-shot models in the field of NLP such as the GPT family. While Hendrycks et al. (2020b) has reported that pre-training improves relative robustness on sentiment analysis, Miller et al. (2020)'s study of the robustness of question answering models under natural distribution shift finds, similar to Taori et al. (2020), little evidence of effective robustness improvements to date.

4. Comparison to Human Performance

How does CLIP compare to human performance and human learning? To get a better understanding of how well humans perform in similar evaluation settings to CLIP, we evaluated humans on one of our tasks. We wanted to get a sense of how strong human zero-shot performance is at these tasks, and how much human performance is improved if they are shown one or two image samples. This can help us to compare task difficulty for humans and CLIP, and identify correlations and differences between them.

We had five different humans look at each of 3669 images in the test split of the Oxford IIT Pets dataset (Parkhi et al., 2012) and select which of the 37 cat or dog breeds best matched the image (or 'I don't know' if they were completely uncertain). In the zero-shot case the humans were given no examples of the breeds and asked to label them to the best of their ability without an internet search. In the one-shot experiment the humans were given one sample image of each breed and in the two-shot experiment they were given two sample images of each breed.⁵

⁵ There is not a perfect correspondence between the human few-shot tasks and the model's few-shot performance since the model cannot refer to sample images in the way that the humans can.
Table 2. Comparison of human performance on Oxford IIT Pets. As in Parkhi et al. (2012), the metric is average per-class classification accuracy. Most of the gain in performance when going from the human zero shot case to the human one shot case is on images that participants were highly uncertain on. "Guesses" refers to restricting the dataset to where participants selected an answer other than "I don't know"; the "majority vote" is taking the most frequent (exclusive of ties) answer per image.

                   Accuracy   Majority Vote Accuracy   Accuracy     Majority Vote Accuracy
                              on Full Dataset          on Guesses   on Guesses
Zero-shot human    53.7       57.0                     69.7         63.9
Zero-shot CLIP     93.5       93.5                     93.5         93.5
One-shot human     75.7       80.3                     78.5         81.2
Two-shot human     75.7       85.0                     79.2         86.1

Figure 15. Few-shot CLIP also increases effective robustness compared to existing ImageNet models but is less robust than zero-shot CLIP. Minimizing the amount of ImageNet training data used for adaption increases effective robustness at the cost of decreasing relative robustness. 16-shot logistic regression CLIP matches zero-shot CLIP on ImageNet, as previously reported in Figure 7, but is less robust.

One possible concern was that the human workers were not sufficiently motivated in the zero-shot task. High human accuracy of 94% on the STL-10 dataset (Coates et al., 2011) and 97-100% accuracy on the subset of attention check images increased our trust in the human workers.

Interestingly, humans went from a performance average of 54% to 76% with just one training example per class, and the marginal gain from an additional training example is minimal. The gain in accuracy going from zero to one shot is almost entirely on images that humans were uncertain about. This suggests that humans "know what they don't know" and are able to update their priors on the images they are most uncertain about based on a single example. Given this, it seems that while CLIP is a promising training strategy for zero-shot performance (Figure 5) and does well on tests of natural distribution shift (Figure 13), there is a large difference between how humans learn from a few examples and the few-shot methods in this paper.

This suggests that there are still algorithmic improvements waiting to be made to decrease the gap between machine and human sample efficiency, as noted by Lake et al. (2016) and others. Because these few-shot evaluations of CLIP don't make effective use of prior knowledge and the humans do, we speculate that finding a method to properly integrate prior knowledge into few-shot learning is an important step in algorithmic improvements to CLIP. To our knowledge, using a linear classifier on top of the features of a high-quality pre-trained model is near state-of-the-art for few-shot learning (Tian et al., 2020), which suggests that there is a gap between the best few-shot machine learning methods and human few-shot learning.

If we plot human accuracy vs CLIP's zero-shot accuracy (Figure 16), we see that the hardest problems for CLIP are also hard for humans. To the extent that errors are consistent, our hypothesis is that this is due to at least two factors: noise in the dataset (including mislabeled images) and out-of-distribution images being hard for both humans and models.

Figure 16. The hardest problems for CLIP also tend to be the hardest problems for humans. Here we rank image categories by difficulty for CLIP as measured by the probability of the correct label. (The categories are the 37 Oxford IIT Pets breeds.)

5. Data Overlap Analysis

A concern with pre-training on a very large internet dataset is unintentional overlap with downstream evals. This is important to investigate since, in a worst-case scenario, a complete copy of an evaluation dataset could leak into the pre-training dataset and invalidate the evaluation as a meaningful test of generalization. One option to prevent this is to identify and remove all duplicates before training a model. While this guarantees reporting true hold-out performance, it requires knowing all possible data which a model might be evaluated on ahead of time. This has the downside of limiting the scope of benchmarking and analysis. Adding a new evaluation would require an expensive re-train or risk reporting an un-quantified benefit due to overlap.

Instead, we document how much overlap occurs and how performance changes due to these overlaps. To do this, we use the following procedure:

1) For each evaluation dataset, we run a duplicate detector (see Appendix C) on its examples. We then manually inspect
the found nearest neighbors and set a per dataset threshold to keep high precision while maximizing recall. Using this threshold, we then create two new subsets, Overlap, which contains all examples which have a similarity to a training example above the threshold, and Clean, which contains all examples that are below this threshold. We denote the unaltered full dataset All for reference. From this we first record the degree of data contamination as the ratio of the number of examples in Overlap to the size of All.

2) We then compute the zero-shot accuracy of CLIP RN50x64 on the three splits and report All - Clean as our main metric. This is the difference in accuracy due to contamination. When positive it is our estimate of how much the overall reported accuracy on the dataset was inflated by over-fitting to overlapping data.

3) The amount of overlap is often small so we also run a binomial significance test where we use the accuracy on Clean as the null hypothesis and compute the one-tailed (greater) p-value for the Overlap subset. We also calculate 99.5% Clopper-Pearson confidence intervals on Dirty as another check.

A summary of this analysis is presented in Figure 17. Out of 35 datasets studied, 9 datasets have no detected overlap at all. Most of these datasets are synthetic or specialized making them unlikely to be posted as normal images on the internet (for instance MNIST, CLEVR, and GTSRB) or …

…out of YFCC100M, which our pre-training dataset contains a filtered subset of. Despite this large overlap there is only a 0.2% increase in accuracy on Country211. This may be because the training text accompanying an example is often not related to the specific task a downstream eval measures. Country211 measures geo-localization ability, but inspecting the training text for these duplicates showed they often do not mention the location of the image.

We are aware of two potential concerns with our analysis. First, our detector is not perfect. While it achieves near 100% accuracy on its proxy training task and manual inspection + threshold tuning results in very high precision with good recall among the found nearest-neighbors, we cannot tractably check its recall across 400 million examples. Another potential confounder of our analysis is that the underlying data distribution may shift between the Overlap and Clean subsets. For example, on Kinetics-700 many "overlaps" are in fact all black transition frames. This explains why Kinetics-700 has an apparent 20% accuracy drop on Overlap. We suspect more subtle distribution shifts likely exist. One possibility we noticed on CIFAR-100 is that, due to the very low resolution of its images, many duplicates were false positives of small objects such as birds or planes. Changes in accuracy could instead be due to changes in the class distribution or difficulty of the duplicates. Unfortunately, these distribution and difficulty shifts could also mask the effects of over-fitting.

However, these results closely follow the findings of similar duplicate analysis in previous work on large scale pre-training. Mahajan et al. (2018) and Kolesnikov et al. (2019) detected similar overlap rates and found minimal changes in overall performance. Importantly, Kolesnikov et al. (2019) also compared the alternative de-duplication strategy discussed in the introduction to this section with the approach we settled on and observed little difference between the two approaches.
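The significance checks in step 3 can be reproduced with standard tools. Below is a minimal scipy-based sketch assuming per-split counts of correct predictions are available; the duplicate detector itself and the thresholds are described in Appendix C and are not reproduced here.

from scipy.stats import beta, binom

def overlap_significance(correct_overlap, n_overlap, correct_clean, n_clean, alpha=0.005):
    # null hypothesis: accuracy on the Overlap split equals the accuracy on Clean
    p_clean = correct_clean / n_clean
    # one-tailed (greater) binomial p-value for the Overlap subset
    p_value = binom.sf(correct_overlap - 1, n_overlap, p_clean)
    # exact (Clopper-Pearson) confidence interval on the Overlap accuracy, 99.5% by default
    k, n = correct_overlap, n_overlap
    lower = beta.ppf(alpha / 2, k, n - k + 1) if k > 0 else 0.0
    upper = beta.ppf(1 - alpha / 2, k + 1, n - k) if k < n else 1.0
    return p_value, (lower, upper)

# toy counts for illustration only
print(overlap_significance(correct_overlap=92, n_overlap=100, correct_clean=8500, n_clean=10000))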
[Figure 17 residue: two panels plotting difference in accuracy on overlapping vs. clean data (%) against detected data overlap (%); Kinetics-700 is the roughly -20% outlier.]
Figure 17. Few statistically significant improvements in accuracy due to detected data overlap. (Left) While several datasets have
up to ±20% apparent differences in zero-shot accuracy on detected overlapping vs. clean examples, only 5 datasets out of 35 total have
99.5% Clopper-Pearson confidence intervals that exclude a 0% accuracy difference. 2 of these datasets do worse on overlapping data.
(Right) Since the percentage of detected overlapping examples is almost always in the single digits, the overall test accuracy gain due to
overlap is much smaller with the largest estimated increase being only 0.6% on Birdsnap. Similarly, for only 6 datasets are the accuracy
improvements statistically significant when calculated using a one-sided binomial test.
…caption baseline we tried to be much lower than CLIP. A simple idea worth trying is joint training of a contrastive and generative objective with the hope of combining the efficiency of CLIP with the flexibility of a caption model. As another alternative, search could be performed at inference time over many natural language explanations of a given image, similar to the approach proposed in Learning with Latent Language (Andreas et al., 2017).

CLIP also does not address the poor data efficiency of deep learning. Instead CLIP compensates by using a source of supervision that can be scaled to hundreds of millions of training examples. If every image seen during training of a CLIP model was presented at a rate of one per second, it would take 405 years to iterate through the 12.8 billion images seen over 32 training epochs. Combining CLIP with self-supervision (Henaff, 2020; Chen et al., 2020c) and self-training (Lee; Xie et al., 2020) methods is a promising direction given their demonstrated ability to improve data efficiency over standard supervised learning.

Our methodology has several significant limitations. Despite our focus on zero-shot transfer, we repeatedly queried performance on full validation sets to guide the development of CLIP. These validation sets often have thousands of examples, which is unrealistic for true zero-shot scenarios. Similar concerns have been raised in the field of semi-supervised learning (Oliver et al., 2018). Another potential issue is our selection of evaluation datasets. While we have reported results on Kornblith et al. (2019)'s 12 dataset evaluation suite as a standardized collection, our main results use a somewhat haphazardly assembled collection of 27 datasets that is undeniably co-adapted with the development and capabilities of CLIP. Creating a new benchmark of tasks designed explicitly to evaluate broad zero-shot transfer capabilities, rather than re-using existing supervised datasets, would help address these issues.

CLIP is trained on text paired with images on the internet. These image-text pairs are unfiltered and uncurated and result in CLIP models learning many social biases. This has been previously demonstrated for image caption models (Bhargava & Forsyth, 2019). We refer readers to Section 7 for detailed analysis and quantification of these behaviors for CLIP as well as discussion of potential mitigation strategies.

While we have emphasized throughout this work that specifying image classifiers through natural language is a flexible and general interface, it has its own limitations. Many complex tasks and visual concepts can be difficult to specify just through text. Actual training examples are undeniably useful but CLIP does not optimize for few-shot performance directly. In our work, we fall back to fitting linear classifiers on top of CLIP's features. This results in a counter-intuitive drop in performance when transitioning from a zero-shot to a few-shot setting. As discussed in Section 4, this is notably different from human performance, which shows a large increase from a zero to a one shot setting. Future work is needed to develop methods that combine CLIP's strong zero-shot performance with efficient few-shot learning.

7. Broader Impacts

CLIP has a wide range of capabilities due to its ability to carry out arbitrary image classification tasks. One can give it images of cats and dogs and ask it to classify cats, or give it images taken in a department store and ask it to classify shoplifters–a task with significant social implications and for which AI may be unfit. Like any image classification system, CLIP's performance and fitness for purpose need to be evaluated, and its broader impacts analyzed in context. CLIP also introduces a capability that will magnify and alter such issues: CLIP makes it possible to easily create your own classes for categorization (to 'roll your own classifier') without a need for re-training. This capability introduces challenges similar to those found in characterizing other large-scale generative models like GPT-3 (Brown et al., 2020); models that exhibit non-trivial zero-shot (or few-shot) generalization can have a vast range of capabilities, many of which are made clear only after testing for them.

Our studies of CLIP in a zero-shot setting show that the model displays significant promise for widely-applicable tasks like image retrieval or search. For example, it can find relevant images in a database given text, or relevant text given an image. Further, the relative ease of steering CLIP toward bespoke applications with little or no additional data or training could unlock a variety of novel applications that are hard for us to envision today, as has occurred with large language models over the past few years.

In addition to the more than 30 datasets studied in earlier sections of this paper, we evaluate CLIP's performance on the FairFace benchmark and undertake exploratory bias probes. We then characterize the model's performance in a downstream task, surveillance, and discuss its usefulness as compared with other available systems. Many of CLIP's capabilities are omni-use in nature (e.g. OCR can be used to make scanned documents searchable, to power screen reading technologies, or to read license plates). Several of the capabilities measured, from action recognition, object classification, and geo-localization, to facial emotion recognition, can be used in surveillance. Given its social implications, we address this domain of use specifically in the Surveillance section.

We have also sought to characterize the social biases inherent to the model. Our bias tests represent our initial efforts to probe aspects of how the model responds in different scenarios, and are by nature limited in scope. CLIP and models like it will need to be analyzed in relation to their specific …
Table 3. Percent accuracy on Race, Gender, and Age classification of images in FairFace category 'White'.

Table 4. Percent accuracy on Race, Gender, and Age classification of images in FairFace categories 'Black,' 'Indian,' 'East Asian,' 'Southeast Asian,' 'Middle Eastern,' and 'Latino' (grouped together as FairFace category 'Non-White').
…classifications across races for crime related terms, which is captured in Table 6.

Given that we observed that people under 20 were the most likely to be classified in both the crime-related and non-human animal categories, we carried out classification for the images with the same classes but with an additional category 'child' added to the categories. Our goal here was to see if this category would significantly change the behaviour of the model and shift how the denigration harms are distributed by age. We found that this drastically reduced the number of images of people under 20 classified in either crime-related categories or non-human animal categories (Table 7). This points to how class design has the potential to be a key factor determining both the model performance and the unwanted biases or behaviour the model may exhibit, while also asking overarching questions about the use of face images to automatically classify people along such lines (Blaise Aguera y Arcas & Todorov, 2017).

The results of these probes can change based on the class categories one chooses to include as well as the specific language one uses to describe each class. Poor class design can lead to poor real world performance; this concern is particularly relevant to a model like CLIP, given how easily developers can design their own classes.

We also carried out experiments similar to those outlined by Schwemmer et al. (2020) to test how CLIP treated images of men and women differently using images of Members of Congress. As part of these experiments, we studied how certain additional design decisions such as deciding thresholds for labels can impact the labels output by CLIP and how biases manifest.

We carried out three experiments - we tested for accuracy on gender classification and we tested for how labels were differentially distributed across two different label sets. For our first label set, we used a label set of 300 occupations and for our second label set we used a combined set of labels that Google Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision returned for all the images.

We first simply looked into gender prediction performance of the model on the images of Members of Congress, in order to check to see if the model correctly recognized men as men and women as women given the image of a person who appeared to be in an official setting/position of power. We found that the model got 100% accuracy on the images. This is slightly better performance than the model's performance on the FairFace dataset.
Table 6. Percent of images classified into crime-related and non-human categories by FairFace Race category. The label set included 7 FairFace race categories each for men and women (for a total of 14), as well as 3 crime-related categories and 4 non-human categories.

Table 7. Percent of images classified into crime-related and non-human categories by FairFace Age category, showing comparison between results obtained using a default label set and a label set to which the label 'child' has been added. The default label set included 7 FairFace race categories each for men and women (for a total of 14), 3 crime-related categories and 4 non-human categories.

Label Set                               0-2    3-9    10-19   20-29   30-39   40-49   50-59   60-69   over 70
Default Label Set                       30.3   35.0   29.5    16.3    13.9    18.5    19.1    16.2    10.4
Default Label Set + 'child' category     2.3    4.3   14.7    15.0    13.4    18.2    18.6    15.5     9.4
In order to study how the biases in returned labels depend on the thresholds set for label probability, we did an experiment in which we set threshold values at 0.5% and 4.0%. We found that the lower threshold led to lower quality labels. However, even the differing distributions of labels under this threshold can hold signals for bias. For example, we find that under the 0.5% threshold labels such as 'nanny' and 'housekeeper' start appearing for women, whereas labels such as 'prisoner' and 'mobster' start appearing for men. This points to gendered associations similar to those that have previously been found for occupations (Schwemmer et al., 2020; Nosek et al., 2002; Bolukbasi et al., 2016).

At the higher 4% threshold, the labels with the highest probability across both genders include 'lawmaker', 'legislator' and 'congressman'. However, the presence of these biases amongst lower probability labels nonetheless points to larger questions about what 'sufficiently' safe behaviour may look like for deploying such systems.

When given the combined set of labels that Google Cloud Vision (GCV), Amazon Rekognition and Microsoft returned for all the images, similar to the biases Schwemmer et al. (2020) found in GCV systems, we found our system also disproportionately attached labels to do with hair and appearance in general to women more than men. For example, labels such as 'brown hair', 'blonde' and 'blond' appeared significantly more often for women. Additionally, CLIP attached some labels that described high status occupations disproportionately more often to men, such as 'executive' and 'doctor'. Out of the only four occupations that it attached more often to women, three were 'newscaster', 'television presenter' and 'newsreader', and the fourth was 'Judge'. This is again similar to the biases found in GCV and points to historical gendered differences (Schwemmer et al., 2020).

Interestingly, when we lowered the threshold to 0.5% for this set of labels, we found that the labels disproportionately describing men also shifted to appearance oriented words such as 'suit', 'tie' and 'necktie' (Figure 18). Many occupation oriented words such as 'military person' and 'executive', which were not used to describe images of women at the higher 4% threshold, were used for both men and women at the lower 0.5% threshold, which could have caused the change in labels for men. The reverse was not true: descriptive words used to describe women were still uncommon amongst men.

Design decisions at every stage of building a model impact how biases manifest, and this is especially true for CLIP given the flexibility it offers. In addition to choices about training data and model architecture, decisions about things like class designs and thresholding values can alter the labels a model outputs and as a result heighten or lower certain kinds of harm, such as those described by Crawford (2017). People designing and developing models and AI systems have considerable power. Decisions about things like class design are a key determiner not only of model performance, but also of how and in what contexts model biases manifest.

These experiments are not comprehensive. They illustrate potential issues stemming from class design and other sources of bias, and are intended to spark inquiry.
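The interplay of class design and probability thresholds described above can be illustrated with a short zero-shot probe. The following is a minimal sketch assuming the open-source clip package; the label set, prompt template, image path, and the 4% / 0.5% thresholds are illustrative placeholders rather than the exact configuration used in our experiments.

```python
# Minimal sketch: zero-shot labeling with a custom label set and a probability threshold.
# Assumes the open-source `clip` package (https://github.com/openai/CLIP) and a local example image.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Illustrative label set; this is where class design choices enter.
labels = ["lawmaker", "doctor", "newscaster", "nanny", "prisoner", "suit"]
prompts = clip.tokenize([f"a photo of a {label}" for label in labels]).to(device)

image = preprocess(Image.open("example_portrait.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, prompts)   # scaled cosine similarities
    probs = logits_per_image.softmax(dim=-1)[0]   # probability over the label set

# The threshold decides which labels are "returned"; lowering it changes the label distribution.
threshold = 0.04   # e.g. 4%; compare against 0.005 (0.5%)
returned = [(labels[i], p.item()) for i, p in enumerate(probs) if p.item() >= threshold]
print(sorted(returned, key=lambda x: -x[1]))
```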
Figure 18. CLIP performance on Member of Congress images when given the combined returned label set for the images from Google
Cloud Vision, Amazon Rekognition and Microsoft Azure Computer Vision. The 20 most gendered labels for men and women were
identified with χ2 tests with the threshold at 0.5%. Labels are sorted by absolute frequencies. Bars denote the percentage of images for a
certain label by gender.
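Figure 18's most gendered labels were identified with χ2 tests over how often each label was returned for images of men versus women. The sketch below shows one way such a test can be computed; the per-gender label counts and totals are hypothetical, and this is not the exact analysis pipeline behind the figure.

```python
# Sketch: rank labels by how strongly their return frequency differs between images of men and women,
# using a chi-squared test on a 2x2 contingency table (label present/absent x gender).
from scipy.stats import chi2_contingency

# Hypothetical counts: label -> (images of women with label, images of men with label)
label_counts = {"blonde": (52, 3), "suit": (10, 180), "official": (90, 95)}
n_women, n_men = 220, 320   # hypothetical totals

results = []
for label, (w, m) in label_counts.items():
    table = [[w, n_women - w], [m, n_men - m]]
    chi2, p, _, _ = chi2_contingency(table)
    results.append((label, chi2, p))

# The most gendered labels have the largest chi-squared statistic (smallest p-value).
for label, chi2, p in sorted(results, key=lambda r: -r[1]):
    print(f"{label}: chi2={chi2:.1f}, p={p:.2g}")
```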
7.2. Surveillance

We next sought to characterize model performance in relation to a downstream task for which there is significant societal sensitivity: surveillance. Our analysis aims to better embody the characterization approach described above, to help orient the research community towards the potential future impacts of increasingly general purpose computer vision models, and to aid the development of norms and checks around such systems. Our inclusion of surveillance is not intended to indicate enthusiasm for this domain; rather, we think surveillance is an important domain to try to make predictions about given its societal implications (Zuboff, 2015; Browne, 2015).

We measure the model's performance on classification of images from CCTV cameras and zero-shot celebrity identification. We first tested model performance on low-resolution images captured from surveillance cameras (e.g. CCTV cameras). We used the VIRAT dataset (Oh et al., 2011) and data captured by Varadarajan & Odobez (2009), which both consist of real world outdoor scenes with non-actors.

Given CLIP's flexible class construction, we tested 515 surveillance images captured from 12 different video sequences on self-constructed general classes for coarse and fine grained classification. Coarse classification required the model to correctly identify the main subject of the image (i.e. determine if the image was a picture of an empty parking lot, school campus, etc.). For fine-grained classification, the model had to choose between two options constructed to determine if the model could identify the presence/absence of smaller features in the image, such as a person standing in the corner.

For coarse classification, we constructed the classes by hand-captioning the images ourselves to describe the contents of the image, and there were always at least 6 options for the model to choose from. Additionally, we carried out a 'stress test' where the class set included at least one more caption for something that was 'close' to the image (for example, 'parking lot with white car' vs. 'parking lot with red car'). We found that the model had a top-1 accuracy of 91.8% on the CCTV images for the initial evaluation. The accuracy dropped significantly, to 51.1%, for the second evaluation, with the model incorrectly choosing the 'close' answer 40.7% of the time.

For fine-grained detection, the zero-shot model performed poorly, with results near random. Note that this experiment was targeted only towards detecting the presence or absence of small objects in image sequences.
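The coarse CCTV evaluation amounts to asking the model to pick the best-matching caption from a small, hand-written class set, with the 'stress test' adding a near-duplicate caption. A minimal sketch follows, again assuming the open-source clip package; the captions and image path are illustrative.

```python
# Sketch: coarse zero-shot classification of a CCTV frame over hand-written captions.
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

captions = [
    "an empty parking lot",
    "a parking lot with a white car",
    "a parking lot with a red car",   # 'close' distractor of the kind used in the stress test
    "a school campus",
    "a crosswalk with pedestrians",
    "an empty street at night",
]
text = clip.tokenize(captions).to(device)
image = preprocess(Image.open("cctv_frame.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)
    probs = logits_per_image.softmax(dim=-1)[0]

top1 = probs.argmax().item()
print(f"top-1 caption: {captions[top1]} ({probs[top1].item():.2%})")
```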
We also tested CLIP's zero-shot performance for 'in the wild' identity detection using the CelebA dataset (see footnote 8). We did

Footnote 8: The CelebA dataset is more representative of faces with lighter skin tones. Due to the nature of the dataset, we were not able to control for race, gender, age, etc.
Mikolov, 2014), and language models (Bengio et al., 2003). It also includes much of the broader field of NLP that deals with predicting or modeling sequences of natural language in some way. Work in NLP intentionally leveraging natural language supervision in the form of explanations, feedback, instructions, and advice for tasks such as classification (as opposed to the commonly used representation of supervision as a set of arbitrarily encoded discrete category labels) has been explored in many creative and advanced ways. Dialog based learning (Weston, 2016; Li et al., 2016; Hancock et al., 2019) develops techniques to learn from interactive natural language feedback in dialog. Several papers have leveraged semantic parsing to convert natural language explanations into features (Srivastava et al., 2017) or additional training labels (Hancock et al., 2018). More recently, ExpBERT (Murty et al., 2020) uses feature representations produced by conditioning a deep contextual language model on natural language explanations and descriptions of relations to improve performance on the task of relation extraction.

CLIP is an example of using natural language as a training signal for learning about a domain other than language. In this context, the earliest use of the term natural language supervision that we are aware of is the work of Ramanathan et al. (2013), which showed that natural language descriptions could be used alongside other sources of supervision to improve performance on the task of video event understanding. However, as mentioned in the introduction and approach section, methods of leveraging natural language descriptions in computer vision well predate the use of this specific term, especially for image retrieval (Mori et al., 1999) and object classification (Wang et al., 2009). Other early work leveraged tags (but not natural language) associated with images for the task of semantic segmentation (Barnard et al., 2003). More recently, He & Peng (2017) and Liang et al. (2020) demonstrated using natural language descriptions and explanations to improve fine-grained visual classification of birds. Others have investigated how grounded language can be used to improve visual representations and classifiers on the ShapeWorld dataset (Kuhnle & Copestake, 2017; Andreas et al., 2017; Mu et al., 2019). Finally, techniques which combine natural language with reinforcement learning environments (Narasimhan et al., 2015) have demonstrated exciting emergent behaviors such as systematically accomplishing zero-shot tasks (Hill et al., 2019).

CLIP's pre-training task optimizes for text-image retrieval. This area of research dates back to the mid-90s, with the previously mentioned Mori et al. (1999) as representative of early work. While initial efforts focused primarily on predictive objectives, over time research shifted towards learning joint multi-modal embedding spaces with techniques like kernel Canonical Correlation Analysis and various ranking objectives (Weston et al., 2010; Socher & Fei-Fei, 2010; Hodosh et al., 2013). Over time work explored many combinations of training objective, transfer, and more expressive models and steadily improved performance (Frome et al., 2013; Socher et al., 2014; Karpathy et al., 2014; Kiros et al., 2014; Faghri et al., 2017).

Other work has leveraged natural language supervision for domains other than images. Stroud et al. (2020) explores large scale representation learning by training a system to pair descriptive text with videos instead of images. Several works have explored using dense spoken natural language supervision for videos (Miech et al., 2019; 2020b). When considered together with CLIP, these works suggest that large scale natural language supervision is a promising way to learn high quality perceptual systems for many domains. Alayrac et al. (2020) extended this line of work to an additional modality by adding raw audio as an additional supervision source and demonstrated benefits from combining all three sources of supervision.

As part of our work on CLIP we also construct a new dataset of image-text pairs. Modern work on image-text retrieval has relied on a set of crowd-sourced sentence level image caption evaluation datasets like Pascal1K (Rashtchian et al., 2010), Flickr8K (Hodosh et al., 2013), and Flickr30K (Young et al., 2014). However, these datasets are still relatively small and limit achievable performance. Several methods have been proposed to create larger datasets automatically, with Ordonez et al. (2011) as a notable early example. In the deep learning era, Mithun et al. (2018) demonstrated that an additional set of (image, text) pairs collected from the internet could improve retrieval performance, and several new automatically constructed datasets such as Conceptual Captions (Sharma et al., 2018), LAIT (Qi et al., 2020), and OCR-CC (Yang et al., 2020) have been created. However, these datasets still use significantly more aggressive filtering or are designed for a specific task such as OCR, and as a result are still much smaller than WIT, with between 1 and 10 million training examples.

A related idea to CLIP is webly supervised learning. This line of work queries image search engines to build image datasets by querying for terms and uses the queries as the labels for the returned images (Fergus et al., 2005). Classifiers trained on these large but noisily labeled datasets can be competitive with those trained on smaller carefully labeled datasets. These image-query pairs are also often used to improve performance on standard datasets as additional training data (Chen & Gupta, 2015). CLIP also uses search queries as part of its dataset creation process. However, CLIP only uses full text sequences co-occurring with images as supervision, rather than just the queries, which are often only a single word or short n-gram. We also restrict this step in CLIP to text-only querying for sub-string matches, while most webly supervised work uses standard image search engines, which have their own complex retrieval and filtering pipelines that often involve computer vision systems. Of this line of work, Learning Everything about Anything: Webly-Supervised Visual Concept Learning (Divvala et al., 2014) has a notably similar ambition and goal as CLIP.
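Conceptually, the text-side querying can be viewed as a simple sub-string filter over candidate (image, text) pairs, where the full co-occurring text, not the query, becomes the supervision. The sketch below is an illustrative toy version of this idea with hypothetical pairs and queries; it is not the actual WIT collection pipeline.

```python
# Toy sketch of sub-string query filtering over candidate (image URL, text) pairs.
from typing import Iterable, Iterator, Tuple

def filter_by_queries(pairs: Iterable[Tuple[str, str]],
                      queries: Iterable[str]) -> Iterator[Tuple[str, str]]:
    """Keep a pair if its text contains any query as a (case-insensitive) sub-string.

    The full co-occurring text, not the matching query, is what serves as supervision.
    """
    lowered = [q.lower() for q in queries]
    for url, text in pairs:
        t = text.lower()
        if any(q in t for q in lowered):
            yield url, text

# Hypothetical usage
candidates = [
    ("http://example.com/1.jpg", "Pepper the aussie pup playing fetch in the park"),
    ("http://example.com/2.jpg", "quarterly sales chart, fiscal year 2019"),
]
queries = ["aussie pup", "golden retriever"]
print(list(filter_by_queries(candidates, queries)))
```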
Finally, CLIP is related to a recent burst of activity on learning joint models of vision and language (Lu et al., 2019; Tan & Bansal, 2019; Chen et al., 2019; Li et al., 2020b; Yu et al., 2020). This line of work focuses on richly connecting vision and language in order to solve complex downstream tasks such as visual question answering, visual commonsense reasoning, or multimodal entailment. These approaches leverage impressively engineered models which combine 3 (or more) pre-trained subsystems, typically an image feature model, a region proposal / object detection model, and a pre-trained masked language model such as BERT. These systems are then jointly fine-tuned via various training objectives on image-text pairs and applied to the aforementioned tasks and achieve impressive results. CLIP is instead focused on learning visual models from scratch via natural language supervision and does not densely connect the two domains with a joint attention model. The only interaction in a CLIP model between the image and text domain is a single dot product in a learned joint embedding space. We are excited to see CLIP hybridized with this line of work.
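This dot-product interaction can be written down directly: both encoders map into a shared embedding space, the embeddings are L2-normalized, and similarity is their inner product. A minimal sketch assuming the open-source clip package (the image path and captions are placeholders):

```python
# Sketch: the only image-text interaction in CLIP is a dot product between normalized embeddings.
import torch
import clip
from PIL import Image

device = "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

image = preprocess(Image.open("photo.jpg")).unsqueeze(0).to(device)          # placeholder path
texts = clip.tokenize(["a photo of a dog", "a photo of a cat"]).to(device)   # placeholder captions

with torch.no_grad():
    image_emb = model.encode_image(image)
    text_emb = model.encode_text(texts)

# L2-normalize, then similarity is a single matrix of dot products (cosine similarities).
image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
similarity = image_emb @ text_emb.T
print(similarity)
```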
9. Conclusion

We have investigated whether it is possible to transfer the success of task-agnostic web-scale pre-training in NLP to another domain. We find that adopting this formula results in similar behaviors emerging in the field of computer vision and discuss the social implications of this line of research. In order to optimize their training objective, CLIP models learn to perform a wide variety of tasks during pre-training. This task learning can then be leveraged via natural language prompting to enable zero-shot transfer to many existing datasets. At sufficient scale, the performance of this approach can be competitive with task-specific supervised models, although there is still room for much improvement.

ACKNOWLEDGMENTS

We'd like to thank the millions of people involved in creating the data CLIP is trained on. We'd also like to thank Susan Zhang for her work on image conditional language models while at OpenAI, Ishaan Gulrajani for catching an error in the pseudocode, and Irene Solaiman, Miles Brundage, and Gillian Hadfield for their thoughtful feedback on the broader impacts section of the paper. We are also grateful to the Acceleration and Supercomputing teams at OpenAI for their critical work on software and hardware infrastructure this project used. Finally, we'd also like to thank the developers of the many software packages used throughout this project, including, but not limited to, Numpy (Harris et al., 2020), SciPy (Virtanen et al., 2020), ftfy (Speer, 2019), TensorFlow (Abadi et al., 2016), PyTorch (Paszke et al., 2019), pandas (pandas development team, 2020), and scikit-learn (Pedregosa et al., 2011).

References

Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: A system for large-scale machine learning. In 12th USENIX symposium on operating systems design and implementation (OSDI 16), pp. 265–283, 2016.
Alayrac, J.-B., Recasens, A., Schneider, R., Arandjelović, R., Ramapuram, J., De Fauw, J., Smaira, L., Dieleman, S., and Zisserman, A. Self-supervised multimodal versatile networks. arXiv preprint arXiv:2006.16228, 2020.
Alcorn, M. A., Li, Q., Gong, Z., Wang, C., Mai, L., Ku, W.-S., and Nguyen, A. Strike (with) a pose: Neural networks are easily fooled by strange poses of familiar objects. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4845–4854, 2019.
Andreas, J., Klein, D., and Levine, S. Learning with latent language. arXiv preprint arXiv:1711.00482, 2017.
Assiri, Y. Stochastic optimization of plain convolutional neural networks with simple methods. arXiv preprint arXiv:2001.08856, 2020.
Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp. 15535–15545, 2019.
Barbu, A., Mayo, D., Alverio, J., Luo, W., Wang, C., Gutfreund, D., Tenenbaum, J., and Katz, B. Objectnet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. In Advances in Neural Information Processing Systems, pp. 9453–9463, 2019.
Barnard, K., Duygulu, P., Forsyth, D., Freitas, N. d., Blei, D. M., and Jordan, M. I. Matching words and pictures. Journal of machine learning research, 3(Feb):1107–1135, 2003.
Bechmann, A. and Bowker, G. C. Unsupervised by any other name: Hidden layers of knowledge production in artificial intelligence on social media. Big Data & Society, 6(1):205395171881956, January 2019. doi: 10.1177/2053951718819569. URL https://doi.org/10.1177/2053951718819569.
Bengio, Y., Ducharme, R., Vincent, P., and Jauvin, C. A neural probabilistic language model. Journal of machine learning research, 3(Feb):1137–1155, 2003.
Bhargava, S. and Forsyth, D. Exposing and correcting the gender bias in image captioning datasets and models. arXiv preprint arXiv:1912.00578, 2019.
Blaise Aguera y Arcas, M. M. and Todorov, A. Physiognomy's new clothes. 2017. URL https://medium.com/@blaisea/physiognomys-new-clothes-f2d4b59fdd6a.
Blei, D. M., Ng, A. Y., and Jordan, M. I. Latent dirichlet allocation. Journal of machine Learning research, 3(Jan):993–1022, 2003.
Bolukbasi, T., Chang, K.-W., Zou, J. Y., Saligrama, V., and Kalai, A. T. Man is to computer programmer as woman is to homemaker? debiasing word embeddings. Advances in neural information processing systems, 29:4349–4357, 2016.
Bowker, G. C. and Star, S. L. Sorting things out: Classification and its consequences. MIT press, 2000.
Brown, T. B., Mann, B., Ryder, N., Subbiah, M., Kaplan, J., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. arXiv preprint arXiv:2005.14165, 2020.
Browne, S. Dark Matters: Surveillance of Blackness. Duke University Press, 2015.
Bulent Sariyildiz, M., Perez, J., and Larlus, D. Learning visual representations with caption annotations. arXiv e-prints, pp. arXiv–2008, 2020.
Buolamwini, J. and Gebru, T. Gender shades: Intersectional accuracy disparities in commercial gender classification. In Conference on fairness, accountability and transparency, pp. 77–91, 2018.
Carreira, J., Noland, E., Hillier, C., and Zisserman, A. A short note on the kinetics-700 human action dataset. arXiv preprint arXiv:1907.06987, 2019.
Chen, M., Radford, A., Child, R., Wu, J., Jun, H., Luan, D., and Sutskever, I. Generative pretraining from pixels. In International Conference on Machine Learning, pp. 1691–1703. PMLR, 2020a.
Chen, T., Xu, B., Zhang, C., and Guestrin, C. Training deep nets with sublinear memory cost. arXiv preprint arXiv:1604.06174, 2016.
Chen, T., Kornblith, S., Norouzi, M., and Hinton, G. A simple framework for contrastive learning of visual representations. arXiv preprint arXiv:2002.05709, 2020b.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020c.
Chen, T., Kornblith, S., Swersky, K., Norouzi, M., and Hinton, G. Big self-supervised models are strong semi-supervised learners. arXiv preprint arXiv:2006.10029, 2020d.
Chen, X. and Gupta, A. Webly supervised learning of convolutional networks. In Proceedings of the IEEE International Conference on Computer Vision, pp. 1431–1439, 2015.
Chen, X., Fan, H., Girshick, R., and He, K. Improved baselines with momentum contrastive learning. arXiv preprint arXiv:2003.04297, 2020e.
Chen, Y.-C., Li, L., Yu, L., Kholy, A. E., Ahmed, F., Gan, Z., Cheng, Y., and Liu, J. Uniter: Learning universal image-text representations. arXiv preprint arXiv:1909.11740, 2019.
Cheng, G., Han, J., and Lu, X. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
Choi, D., Shallue, C. J., Nado, Z., Lee, J., Maddison, C. J., and Dahl, G. E. On empirical comparisons of optimizers for deep learning. arXiv preprint arXiv:1910.05446, 2019.
Coates, A., Ng, A., and Lee, H. An analysis of single-layer networks in unsupervised feature learning. In Proceedings of the fourteenth international conference on artificial intelligence and statistics, pp. 215–223, 2011.
Crawford, K. The trouble with bias. NIPS 2017 Keynote, 2017. URL https://www.youtube.com/watch?v=fMym_BKWQzk.
Dai, A. M. and Le, Q. V. Semi-supervised sequence learning. In Advances in neural information processing systems, pp. 3079–3087, 2015.
D'Amour, A., Heller, K., Moldovan, D., Adlam, B., Alipanahi, B., Beutel, A., Chen, C., Deaton, J., Eisenstein, J., Hoffman, M. D., et al. Underspecification presents challenges for credibility in modern machine learning. arXiv preprint arXiv:2011.03395, 2020.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
Deng, J., Berg, A. C., Satheesh, S., Su, H., Khosla, A., and Fei-Fei, L. Ilsvrc 2012, 2012. URL http://www.image-net.org/challenges/LSVRC/2012/.
Desai, K. and Johnson, J. Virtex: Learning visual representations from textual annotations. arXiv preprint arXiv:2006.06666, 2020.
Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805, 2018.
Dhariwal, P., Jun, H., Payne, C., Kim, J. W., Radford, A., and Sutskever, I. Jukebox: A generative model for music. arXiv preprint arXiv:2005.00341, 2020.
Divvala, S. K., Farhadi, A., and Guestrin, C. Learning everything about anything: Webly-supervised visual concept learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3270–3277, 2014.
Dodge, S. and Karam, L. A study and comparison of human and deep learning recognition performance under visual distortions. In 2017 26th international conference on computer communication and networks (ICCCN), pp. 1–7. IEEE, 2017.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
Elhoseiny, M., Saleh, B., and Elgammal, A. Write a classifier: Zero-shot learning using purely textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2584–2591, 2013.
Faghri, F., Fleet, D. J., Kiros, J. R., and Fidler, S. Vse++: Improving visual-semantic embeddings with hard negatives. arXiv preprint arXiv:1707.05612, 2017.
Fergus, R., Fei-Fei, L., Perona, P., and Zisserman, A. Learning object categories from google's image search. In Tenth IEEE International Conference on Computer Vision (ICCV'05) Volume 1, volume 2, pp. 1816–1823. IEEE, 2005.
Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. Devise: A deep visual-semantic embedding model. In Advances in neural information processing systems, pp. 2121–2129, 2013.
Gan, Z., Chen, Y.-C., Li, L., Zhu, C., Cheng, Y., and Liu, J. Large-scale adversarial training for vision-and-language representation learning. arXiv preprint arXiv:2006.06195, 2020.
Gao, T., Fisch, A., and Chen, D. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
Garvie, C., May 2019. URL https://www.flawedfacedata.com/.
Geiger, A., Lenz, P., and Urtasun, R. Are we ready for autonomous driving? the kitti vision benchmark suite. In Conference on Computer Vision and Pattern Recognition (CVPR), 2012.
Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F. A., and Brendel, W. Imagenet-trained cnns are biased towards texture; increasing shape bias improves accuracy and robustness. arXiv preprint arXiv:1811.12231, 2018.
Geirhos, R., Jacobsen, J.-H., Michaelis, C., Zemel, R., Brendel, W., Bethge, M., and Wichmann, F. A. Shortcut learning in deep neural networks. arXiv preprint arXiv:2004.07780, 2020.
Gomez, L., Patel, Y., Rusiñol, M., Karatzas, D., and Jawahar, C. Self-supervised learning of visual features through embedding images into text topic spaces. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 4230–4239, 2017.
Goodfellow, I. J., Shlens, J., and Szegedy, C. Explaining and harnessing adversarial examples. arXiv preprint arXiv:1412.6572, 2014.
Goodfellow, I. J., Erhan, D., Carrier, P. L., Courville, A., Mirza, M., Hamner, B., Cukierski, W., Tang, Y., Thaler, D., Lee, D.-H., et al. Challenges in representation learning: A report on three machine learning contests. Neural Networks, 64:59–63, 2015.
Google. Google cloud api: Celebrity recognition. URL https://cloud.google.com/vision/docs/celebrity-recognition.
Griewank, A. and Walther, A. Algorithm 799: revolve: an implementation of checkpointing for the reverse or adjoint mode of computational differentiation. ACM Transactions on Mathematical Software (TOMS), 26(1):19–45, 2000.
Grill, J.-B., Strub, F., Altché, F., Tallec, C., Richemond, P. H., Buchatskaya, E., Doersch, C., Pires, B. A., Guo, Z. D., Azar, M. G., et al. Bootstrap your own latent: A new approach to self-supervised learning. arXiv preprint arXiv:2006.07733, 2020.
Ha, D., Dai, A., and Le, Q. V. Hypernetworks. arXiv preprint arXiv:1609.09106, 2016.
Hancock, B., Bringmann, M., Varma, P., Liang, P., Wang, S., and Ré, C. Training classifiers with natural language explanations. In Proceedings of the conference. Association for Computational Linguistics. Meeting, volume 2018, pp. 1884. NIH Public Access, 2018.
Hancock, B., Bordes, A., Mazare, P.-E., and Weston, J. Learning from dialogue after deployment: Feed yourself, chatbot! arXiv preprint arXiv:1901.05415, 2019.
Harris, C. R., Millman, K. J., van der Walt, S. J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N. J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M. H., Brett, M., Haldane, A., Fernández del Río, J., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., and Oliphant, T. E. Array programming with NumPy. Nature, 585:357–362, 2020. doi: 10.1038/s41586-020-2649-2.
Hays, J. and Efros, A. A. Im2gps: estimating geographic information from a single image. In 2008 IEEE conference on computer vision and pattern recognition, pp. 1–8. IEEE, 2008.
He, K., Zhang, X., Ren, S., and Sun, J. Delving deep into rectifiers: Surpassing human-level performance on imagenet classification. In Proceedings of the IEEE international conference on computer vision, pp. 1026–1034, 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016a.
He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778, 2016b.
He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9729–9738, 2020.
He, T., Zhang, Z., Zhang, H., Zhang, Z., Xie, J., and Li, M. Bag of tricks for image classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 558–567, 2019.
He, X. and Peng, Y. Fine-grained image classification via combining vision and language. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 5994–6002, 2017.
Helber, P., Bischke, B., Dengel, A., and Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019.
Henaff, O. Data-efficient image recognition with contrastive predictive coding. In International Conference on Machine Learning, pp. 4182–4192. PMLR, 2020.
Hendrycks, D. and Dietterich, T. Benchmarking neural network robustness to common corruptions and perturbations. arXiv preprint arXiv:1903.12261, 2019.
Hendrycks, D. and Gimpel, K. Gaussian error linear units (gelus). arXiv preprint arXiv:1606.08415, 2016.
Hendrycks, D., Zhao, K., Basart, S., Steinhardt, J., and Song, D. Natural adversarial examples. arXiv preprint arXiv:1907.07174, 2019.
Hendrycks, D., Basart, S., Mu, N., Kadavath, S., Wang, F., Dorundo, E., Desai, R., Zhu, T., Parajuli, S., Guo, M., et al. The many faces of robustness: A critical analysis of out-of-distribution generalization. arXiv preprint arXiv:2006.16241, 2020a.
Hendrycks, D., Liu, X., Wallace, E., Dziedzic, A., Krishnan, R., and Song, D. Pretrained transformers improve out-of-distribution robustness. arXiv preprint arXiv:2004.06100, 2020b.
Hestness, J., Narang, S., Ardalani, N., Diamos, G., Jun, H., Kianinejad, H., Patwary, M., Ali, M., Yang, Y., and Zhou, Y. Deep learning scaling is predictable, empirically. arXiv preprint arXiv:1712.00409, 2017.
Hill, F., Lampinen, A., Schneider, R., Clark, S., Botvinick, M., McClelland, J. L., and Santoro, A. Environmental drivers of systematicity and generalization in a situated agent. In International Conference on Learning Representations, 2019.
Hodosh, M., Young, P., and Hockenmaier, J. Framing image description as a ranking task: Data, models and evaluation metrics. Journal of Artificial Intelligence Research, 47:853–899, 2013.
Hongsuck Seo, P., Weyand, T., Sim, J., and Han, B. Cplanet: Enhancing image geolocalization by combinatorial partitioning of maps. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 536–551, 2018.
Howard, J. and Ruder, S. Universal language model fine-tuning for text classification. arXiv preprint arXiv:1801.06146, 2018.
Ilyas, A., Santurkar, S., Tsipras, D., Engstrom, L., Tran, B., and Madry, A. Adversarial examples are not bugs, they are features. In Advances in Neural Information Processing Systems, pp. 125–136, 2019.
Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
Jaderberg, M., Simonyan, K., Vedaldi, A., and Zisserman, A. Deep structured output learning for unconstrained text recognition. arXiv preprint arXiv:1412.5903, 2014.
Jaderberg, M., Simonyan, K., Zisserman, A., et al. Spatial transformer networks. Advances in neural information processing systems, 28:2017–2025, 2015.
Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R. Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2901–2910, 2017.
Joulin, A., Van Der Maaten, L., Jabri, A., and Vasilache, N. Learning visual features from large weakly supervised data. In European Conference on Computer Vision, pp. 67–84. Springer, 2016.
Kalfaoglu, M., Kalkan, S., and Alatan, A. A. Late temporal modeling in 3d cnn architectures with bert for action recognition. arXiv preprint arXiv:2008.01232, 2020.
Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., and Amodei, D. Scaling laws for neural language models. arXiv preprint arXiv:2001.08361, 2020.
Karpathy, A., Joulin, A., and Fei-Fei, L. F. Deep fragment embeddings for bidirectional image sentence mapping. In Advances in neural information processing systems, pp. 1889–1897, 2014.
Keyes, O. The misgendering machines: Trans/hci implications of automatic gender recognition. Proceedings of the ACM on Human-Computer Interaction, 2(CSCW):1–22, 2018.
Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., and Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. arXiv preprint arXiv:2005.04790, 2020.
Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
Kiros, R., Salakhutdinov, R., and Zemel, R. S. Unifying visual-semantic embeddings with multimodal neural language models. arXiv preprint arXiv:1411.2539, 2014.
Kiros, R., Zhu, Y., Salakhutdinov, R. R., Zemel, R., Urtasun, R., Torralba, A., and Fidler, S. Skip-thought vectors. Advances in neural information processing systems, 28:3294–3302, 2015.
Kolesnikov, A., Beyer, L., Zhai, X., Puigcerver, J., Yung, J., Gelly, S., and Houlsby, N. Large scale learning of general visual representations for transfer. arXiv preprint arXiv:1912.11370, 2019.
Kornblith, S., Shlens, J., and Le, Q. V. Do better imagenet models transfer better? In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 2661–2671, 2019.
Krishna, R., Zhu, Y., Groth, O., Johnson, J., Hata, K., Kravitz, J., Chen, S., Kalantidis, Y., Li, L.-J., Shamma, D. A., et al. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International journal of computer vision, 123(1):32–73, 2017.
Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp. 1097–1105, 2012.
Kuhnle, A. and Copestake, A. Shapeworld - a new test methodology for multimodal language understanding. arXiv preprint arXiv:1704.04517, 2017.
Kärkkäinen, K. and Joo, J. Fairface: Face attribute dataset for balanced race, gender, and age, 2019.
Lake, B. M., Ullman, T. D., Tenenbaum, J. B., and Gershman, S. J. Building machines that learn and think like people, 2016.
Lampert, C. H., Nickisch, H., and Harmeling, S. Learning to detect unseen object classes by between-class attribute transfer. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 951–958. IEEE, 2009.
Larochelle, H., Erhan, D., and Bengio, Y. Zero-data learning of new tasks. 2008.
Le, Q. and Mikolov, T. Distributed representations of sentences and documents. In International conference on machine learning, pp. 1188–1196, 2014.
LeCun, Y. The mnist database of handwritten digits. http://yann.lecun.com/exdb/mnist/.
Lee, D.-H. Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks.
Lei Ba, J., Swersky, K., Fidler, S., et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4247–4255, 2015.
Li, A., Jabri, A., Joulin, A., and van der Maaten, L. Learning visual n-grams from web data. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4183–4192, 2017.
Li, G., Duan, N., Fang, Y., Gong, M., and Jiang, D. Unicoder-vl: A universal encoder for vision and language by cross-modal pre-training. 2020a.
Li, J., Miller, A. H., Chopra, S., Ranzato, M., and Weston, J. Learning through dialogue interactions by asking questions. arXiv preprint arXiv:1612.04936, 2016.
Li, X., Yin, X., Li, C., Hu, X., Zhang, P., Zhang, L., Wang, L., Hu, H., Dong, L., Wei, F., et al. Oscar: Object-semantics aligned pre-training for vision-language tasks. arXiv preprint arXiv:2004.06165, 2020b.
Liang, W., Zou, J., and Yu, Z. Alice: Active learning with contrastive natural language explanations. arXiv preprint arXiv:2009.10259, 2020.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European conference on computer vision, pp. 740–755. Springer, 2014.
Linzen, T. How can we accelerate progress towards human-like linguistic generalization? arXiv preprint arXiv:2005.00955, 2020.
Lippe, P., Holla, N., Chandra, S., Rajamanickam, S., Antoniou, G., Shutova, E., and Yannakoudakis, H. A multimodal framework for the detection of hateful memes. arXiv preprint arXiv:2012.12871, 2020.
Liu, P. J., Saleh, M., Pot, E., Goodrich, B., Sepassi, R., Kaiser, L., and Shazeer, N. Generating wikipedia by summarizing long sequences. arXiv preprint arXiv:1801.10198, 2018.
Locatello, F., Bauer, S., Lucic, M., Rätsch, G., Gelly, S., Schölkopf, B., and Bachem, O. A sober look at the unsupervised learning of disentangled representations and their evaluation. arXiv preprint arXiv:2010.14766, 2020.
Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016.
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
Lu, J., Batra, D., Parikh, D., and Lee, S. Vilbert: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In Advances in Neural Information Processing Systems, pp. 13–23, 2019.
Lu, Z., Xiong, X., Li, Y., Stroud, J., and Ross, D. Leveraging weakly supervised data and pose representation for action recognition, 2020. URL https://www.youtube.com/watch?v=KOQFxbPPLOE&t=1390s.
Lucic, M., Kurach, K., Michalski, M., Gelly, S., and Bousquet, O. Are gans created equal? a large-scale study. Advances in neural information processing systems, 31:700–709, 2018.
Mahajan, D., Girshick, R., Ramanathan, V., He, K., Paluri, M., Li, Y., Bharambe, A., and van der Maaten, L. Exploring the limits of weakly supervised pretraining. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 181–196, 2018.
McCann, B., Bradbury, J., Xiong, C., and Socher, R. Learned in translation: Contextualized word vectors. In Advances in neural information processing systems, pp. 6294–6305, 2017.
McCann, B., Keskar, N. S., Xiong, C., and Socher, R. The natural language decathlon: Multitask learning as question answering. arXiv preprint arXiv:1806.08730, 2018.
Micikevicius, P., Narang, S., Alben, J., Diamos, G., Elsen, E., Garcia, D., Ginsburg, B., Houston, M., Kuchaiev, O., Venkatesh, G., et al. Mixed precision training. arXiv preprint arXiv:1710.03740, 2017.
Miech, A., Zhukov, D., Alayrac, J.-B., Tapaswi, M., Laptev, I., and Sivic, J. Howto100m: Learning a text-video embedding by watching hundred million narrated video clips. In Proceedings of the IEEE international conference on computer vision, pp. 2630–2640, 2019.
Miech, A., Alayrac, J.-B., Laptev, I., Sivic, J., and Zisserman, A. Rareact: A video dataset of unusual interactions. arXiv preprint arXiv:2008.01018, 2020a.
Miech, A., Alayrac, J.-B., Smaira, L., Laptev, I., Sivic, J., and Zisserman, A. End-to-end learning of visual representations from uncurated instructional videos. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9879–9889, 2020b.
Mikolov, T., Sutskever, I., Chen, K., Corrado, G. S., and Dean, J. Distributed representations of words and phrases and their compositionality. Advances in neural information processing systems, 26:3111–3119, 2013.
Miller, J., Krauth, K., Recht, B., and Schmidt, L. The effect of natural distribution shift on question answering models. arXiv preprint arXiv:2004.14444, 2020.
Mishra, A., Alahari, K., and Jawahar, C. Scene text recognition using higher order language priors. 2012.
Mithun, N. C., Panda, R., Papalexakis, E. E., and Roy-Chowdhury, A. K. Webly supervised joint embedding for cross-modal image-text retrieval. In Proceedings of the 26th ACM international conference on Multimedia, pp. 1856–1864, 2018.
Mori, Y., Takahashi, H., and Oka, R. Image-to-word transformation based on dividing and vector quantizing images with words. Citeseer, 1999.
Mu, J., Liang, P., and Goodman, N. Shaping visual representations with language for few-shot classification. arXiv preprint arXiv:1911.02683, 2019.
Muller-Budack, E., Pustu-Iren, K., and Ewerth, R. Geolocation estimation of photos using a hierarchical model and scene classification. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 563–579, 2018.
Murty, S., Koh, P. W., and Liang, P. Expbert: Representation engineering with natural language explanations. arXiv preprint arXiv:2005.01932, 2020.
Narasimhan, K., Kulkarni, T., and Barzilay, R. Language understanding for text-based games using deep reinforcement learning. arXiv preprint arXiv:1506.08941, 2015.
Netzer, Y., Wang, T., Coates, A., Bissacco, A., Wu, B., and Ng, A. Y. Reading digits in natural images with unsupervised feature learning. 2011.
Noble, S. U. Algorithms of oppression: How search engines reinforce racism. 2018.
Nosek, B. A., Banaji, M. R., and Greenwald, A. G. Harvesting implicit group attitudes and beliefs from a demonstration web site. Group Dynamics: Theory, Research, and Practice, 6(1):101, 2002.
Oh, S., Hoogs, A., Perera, A., Cuntoor, N., Chen, C.-C., Lee, J. T., Mukherjee, S., Aggarwal, J., Lee, H., Davis, L., et al. A large-scale benchmark dataset for event recognition in surveillance video. In CVPR 2011, pp. 3153–3160. IEEE, 2011.
Oliver, A., Odena, A., Raffel, C. A., Cubuk, E. D., and Goodfellow, I. Realistic evaluation of deep semi-supervised learning algorithms. Advances in neural information processing systems, 31:3235–3246, 2018.
Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.
Ordonez, V., Kulkarni, G., and Berg, T. Im2text: Describing images using 1 million captioned photographs. Advances in neural information processing systems, 24:1143–1151, 2011.
pandas development team, T. pandas-dev/pandas: Pandas, February 2020. URL https://doi.org/10.5281/zenodo.3509134.
Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. V. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition, 2012.
Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., and Chintala, S. Pytorch: An imperative style, high-performance deep learning library. In Wallach, H., Larochelle, H., Beygelzimer, A., dAlché-Buc, F., Fox, E., and Garnett, R. (eds.), Advances in Neural Information Processing Systems 32, pp. 8024–8035, 2019.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.
Pennington, J., Socher, R., and Manning, C. D. Glove: Global vectors for word representation. In Proceedings of the 2014 conference on empirical methods in natural language processing (EMNLP), pp. 1532–1543, 2014.
Peters, M. E., Neumann, M., Iyyer, M., Gardner, M., Clark, C., Lee, K., and Zettlemoyer, L. Deep contextualized word representations. arXiv preprint arXiv:1802.05365, 2018.
Qi, D., Su, L., Song, J., Cui, E., Bharti, T., and Sacheti, A. Imagebert: Cross-modal pre-training with large-scale weak-supervised image-text data. arXiv preprint arXiv:2001.07966, 2020.
Quattoni, A., Collins, M., and Darrell, T. Learning visual representations using images with captions. In 2007 IEEE Conference on Computer Vision and Pattern Recognition, pp. 1–8. IEEE, 2007.
Radford, A., Narasimhan, K., Salimans, T., and Sutskever, I. Improving language understanding by generative pre-training, 2018.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., and Sutskever, I. Language models are unsupervised multitask learners. 2019.
Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. arXiv preprint arXiv:1910.10683, 2019.
Raji, I. D., Gebru, T., Mitchell, M., Buolamwini, J., Lee, J., and Denton, E. Saving face: Investigating the ethical concerns of facial recognition auditing, 2020.
Ramanathan, V., Liang, P., and Fei-Fei, L. Video event understanding using natural language descriptions. In Proceedings of the IEEE International Conference on Computer Vision, pp. 905–912, 2013.
Rashtchian, C., Young, P., Hodosh, M., and Hockenmaier, J. Collecting image annotations using amazon's mechanical turk. In Proceedings of the NAACL HLT 2010 Workshop on Creating Speech and Language Data with Amazon's Mechanical Turk, pp. 139–147, 2010.
Recht, B., Roelofs, R., Schmidt, L., and Shankar, V. Do imagenet classifiers generalize to imagenet? arXiv preprint arXiv:1902.10811, 2019.
Salimans, T. and Kingma, D. P. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in neural information processing systems, pp. 901–909, 2016.
Scheuerman, M. K., Paul, J. M., and Brubaker, J. R. How computers see gender: An evaluation of gender classification in commercial facial analysis services. Proceedings of the ACM on Human-Computer Interaction, 3(CSCW):1–33, 2019.
Schwemmer, C., Knight, C., Bello-Pardo, E. D., Oklobdzija, S., Schoonvelde, M., and Lockhart, J. W. Diagnosing gender bias in image recognition systems. Socius, 6:2378023120967171, 2020.
Sennrich, R., Haddow, B., and Birch, A. Neural machine translation of rare words with subword units. arXiv preprint arXiv:1508.07909, 2015.
Shankar, V., Dave, A., Roelofs, R., Ramanan, D., Recht, B., and Schmidt, L. Do image classifiers generalize across time? arXiv preprint arXiv:1906.02168, 2019.
Sharma, P., Ding, N., Goodman, S., and Soricut, R. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2556–2565, 2018.
Singh, A., Natarajan, V., Shah, M., Jiang, Y., Chen, X., Batra, D., Parikh, D., and Rohrbach, M. Towards vqa models that can read. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 8317–8326, 2019.
Socher, R. and Fei-Fei, L. Connecting modalities: Semi-supervised segmentation and annotation of images using unaligned text corpora. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 966–973. IEEE, 2010.
Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A. Y., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Proceedings of the 2013 conference on empirical methods in natural language processing, pp. 1631–1642, 2013.
Socher, R., Karpathy, A., Le, Q. V., Manning, C. D., and Ng, A. Y. Grounded compositional semantics for finding and describing images with sentences. Transactions of the Association for Computational Linguistics, 2:207–218, 2014.
Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pp. 1857–1865, 2016.
Solaiman, I., Brundage, M., Clark, J., Askell, A., Herbert-Voss, A., Wu, J., Radford, A., Krueger, G., Kim, J. W., Kreps, S., McCain, M., Newhouse, A., Blazakis, J., McGuffie, K., and Wang, J. Release strategies and the social impacts of language models, 2019.
Soomro, K., Zamir, A. R., and Shah, M. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
Speer, R. ftfy. Zenodo, 2019. URL https://doi.org/10.5281/zenodo.2591652. Version 5.5.
Srivastava, N. and Salakhutdinov, R. Multimodal learning with deep boltzmann machines. In NIPS, 2012.
Srivastava, S., Labutov, I., and Mitchell, T. Joint concept learning and semantic parsing from natural language explanations. In Proceedings of the 2017 conference on empirical methods in natural language processing, pp. 1527–1536, 2017.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. The German Traffic Sign Recognition Benchmark: A multi-class classification competition. In IEEE International Joint Conference on Neural Networks, pp. 1453–1460, 2011.
Stroud, J. C., Ross, D. A., Sun, C., Deng, J., Sukthankar, R., and Schmid, C. Learning video representations from textual web supervision. arXiv preprint arXiv:2007.14937, 2020.
Szegedy, C., Ioffe, S., Vanhoucke, V., and Alemi, A. Inception-v4, inception-resnet and the impact of residual connections on learning. arXiv preprint arXiv:1602.07261, 2016.
Tan, H. and Bansal, M. Lxmert: Learning cross-modality encoder representations from transformers. arXiv preprint arXiv:1908.07490, 2019.
Tan, M. and Le, Q. V. Efficientnet: Rethinking model scaling for convolutional neural networks. arXiv preprint arXiv:1905.11946, 2019.
Taori, R., Dave, A., Shankar, V., Carlini, N., Recht, B., and Schmidt, L. Measuring robustness to natural distribution shifts in image classification. arXiv preprint arXiv:2007.00644, 2020.
Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM, 59(2):64–73, 2016.
Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019.
Tian, Y., Wang, Y., Krishnan, D., Tenenbaum, J. B., and Isola, P. Rethinking few-shot image classification: a good embedding is all you need? arXiv preprint arXiv:2003.11539, 2020.
Torralba, A., Fergus, R., and Freeman, W. T. 80 million tiny images: A large data set for nonparametric object and scene recognition. IEEE transactions on pattern analysis and machine intelligence, 30(11):1958–1970, 2008.
Touvron, H., Vedaldi, A., Douze, M., and Jégou, H. Fixing the train-test resolution discrepancy. In Advances in neural information processing systems, pp. 8252–8262, 2019.
Varadarajan, J. and Odobez, J.-M. Topic models for scene analysis and abnormality detection. In 2009 IEEE 12th International Conference on Computer Vision Workshops, ICCV Workshops, pp. 1338–1345. IEEE, 2009.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in neural information processing systems, pp. 5998–6008, 2017.
Veeling, B. S., Linmans, J., Winkens, J., Cohen, T., and Welling, M. Rotation equivariant CNNs for digital pathology. June 2018.
Virtanen, P., Gommers, R., Oliphant, T. E., Haberland, M., Reddy, T., Cournapeau, D., Burovski, E., Peterson, P., Weckesser, W., Bright, J., van der Walt, S. J., Brett, M., Wilson, J., Millman, K. J., Mayorov, N., Nelson, A. R. J., Jones, E., Kern, R., Larson, E., Carey, C. J., Polat, İ., Feng, Y., Moore, E. W., VanderPlas, J., Laxalde, D., Perktold, J., Cimrman, R., Henriksen, I., Quintero, E. A., Harris, C. R., Archibald, A. M., Ribeiro, A. H., Pedregosa, F., van Mulbregt, P., and SciPy 1.0 Contributors. SciPy 1.0: Fundamental Algorithms for Scientific Computing in Python. Nature Methods, 17:261–272, 2020. doi: 10.1038/s41592-019-0686-2.
Vo, N., Jacobs, N., and Hays, J. Revisiting im2gps in the deep learning era. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2621–2630, 2017.
Wang, A., Singh, A., Michael, J., Hill, F., Levy, O., and Bowman, S. R. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018.
Wang, H., Ge, S., Lipton, Z., and Xing, E. P. Learning robust global representations by penalizing local predictive power. In Advances in Neural Information Processing Systems, pp. 10506–10518, 2019.
Wang, H., Lu, P., Zhang, H., Yang, M., Bai, X., Xu, Y., He, M., Wang, Y., and Liu, W. All you need is boundary: Toward arbitrary-shaped text spotting. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pp. 12160–12167, 2020.
Wang, J., Markert, K., and Everingham, M. Learning models for object recognition from natural language descriptions. In BMVC, volume 1, pp. 2, 2009.
Weston, J., Bengio, S., and Usunier, N. Large scale image annotation: learning to rank with joint word-image embeddings. Machine learning, 81(1):21–35, 2010.
Weston, J. E. Dialog-based language learning. In Advances in Neural Information Processing Systems, pp. 829–837, 2016.
Weyand, T., Kostrikov, I., and Philbin, J. Planet-photo geolocation with convolutional neural networks. In European Conference on Computer Vision, pp. 37–55. Springer, 2016.
Wu, Y., Kirillov, A., Massa, F., Lo, W.-Y., and Girshick, R. Detectron2. https://github.com/facebookresearch/detectron2, 2019.
Wu, Z., Xiong, Y., Yu, S., and Lin, D. Unsupervised feature learning via non-parametric instance-level discrimination. arXiv preprint arXiv:1805.01978, 2018.
Xie, Q., Luong, M.-T., Hovy, E., and Le, Q. V. Self-training with noisy student improves imagenet classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 10687–10698, 2020.
Yang, Z., Lu, Y., Wang, J., Yin, X., Florencio, D., Wang, L., Zhang, C., Zhang, L., and Luo, J. Tap: Text-aware pre-training for text-vqa and text-caption. arXiv preprint arXiv:2012.04638, 2020.
Yogatama, D., d'Autume, C. d. M., Connor, J., Kocisky, T., Chrzanowski, M., Kong, L., Lazaridou, A., Ling, W., Yu, L., Dyer, C., et al. Learning and evaluating general linguistic intelligence. arXiv preprint arXiv:1901.11373, 2019.
Zhang, Y., Jiang, H., Miura, Y., Manning, C. D., and Langlotz, C. P. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.
LM RN50 This is a multimodal model that uses an autoregressive loss instead of a contrastive loss, while using [...]

BYOL We use the recently released model weights of BYOL (Grill et al., 2020), specifically their 50x1 and 200x2 [...]
Figure 19. Two example images from the Rendered SST2 dataset
A.3. Evaluation

We train a logistic regression classifier using L-BFGS, with a maximum of 1,000 iterations, and report the corresponding metric for each dataset. We determine the L2 regularization strength λ using a hyperparameter sweep on the validation sets over the range between 10^-6 and 10^6, with 96 logarithmically spaced steps. To save the compute required for the sweeps, we perform a parametric binary search that starts with λ = [10^-6, 10^-4, 10^-2, 1, 10^2, 10^4, 10^6] and iteratively halves the interval around the peak until it reaches a resolution of 8 steps per decade. The hyperparameter sweeps are performed on a validation split of each dataset. For the datasets that contain a validation split in addition to the test split, we use the provided validation set to perform the hyperparameter search, and for the datasets that do not provide a validation split or have not published labels for the test data, we split the training dataset to perform the hyperparameter search and report the performance on the held-out split.
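A minimal sketch of how such a sweep could be implemented is given below, assuming pre-computed CLIP features and scikit-learn's LogisticRegression, whose C parameter is the inverse of λ; the refinement loop is one illustrative reading of the parametric binary search described above, not the exact released procedure, and all helper names are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def probe_accuracy(lam, train_feats, train_labels, val_feats, val_labels):
    """Fit an L-BFGS logistic regression probe at L2 strength lambda and score it."""
    clf = LogisticRegression(C=1.0 / lam, max_iter=1000, solver="lbfgs")
    clf.fit(train_feats, train_labels)
    return clf.score(val_feats, val_labels)

def sweep_lambda(train_feats, train_labels, val_feats, val_labels, steps_per_decade=8):
    # Coarse grid: 10^-6 ... 10^6 spaced two decades apart, as described above.
    lambdas = [10.0 ** e for e in range(-6, 7, 2)]
    scores = {lam: probe_accuracy(lam, train_feats, train_labels, val_feats, val_labels)
              for lam in lambdas}
    spacing = 2.0  # current grid spacing, in decades
    # Iteratively halve the spacing and probe around the current best lambda
    # until the grid reaches 8 steps per decade (spacing of 1/8 decade).
    while spacing > 1.0 / steps_per_decade:
        spacing /= 2.0
        best = max(scores, key=scores.get)
        for lam in (best * 10.0 ** spacing, best / 10.0 ** spacing):
            if lam not in scores and 1e-6 <= lam <= 1e6:
                scores[lam] = probe_accuracy(lam, train_feats, train_labels,
                                             val_feats, val_labels)
    return max(scores, key=scores.get), scores
```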
Table 9. Datasets examined for linear probes. We note that, for the Birdsnap and Kinetics700 datasets, we used the resources that are
available online at the time of this writing.
Table 10 columns (27 datasets), left to right: Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Stanford Cars, FGVC Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, FER2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, Rendered SST2, ImageNet.
LM RN50 81.3 82.8 61.7 44.2 69.6 74.9 44.9 85.5 71.5 82.8 85.5 91.1 96.6 60.1 96.3 93.4 84.0 73.8 70.2 19.0 82.9 76.4 51.9 51.2 65.2 76.8 65.2
CLIP-ResNet
RN50 86.4 88.7 70.3 56.4 73.3 78.3 49.1 87.1 76.4 88.2 89.6 96.1 98.3 64.2 97.2 95.2 87.5 82.4 70.2 25.3 82.7 81.6 57.2 53.6 65.7 72.6 73.3
RN101 88.9 91.1 73.5 58.6 75.1 84.0 50.7 88.0 76.3 91.0 92.0 96.4 98.4 65.2 98.2 95.9 89.3 82.4 73.6 26.6 82.8 84.0 60.3 50.3 68.2 73.3 75.7
RN50x4 91.3 90.5 73.0 65.7 77.0 85.9 57.3 88.4 79.5 91.9 92.5 97.8 98.5 68.1 98.1 96.4 89.7 85.5 59.4 30.3 83.0 85.7 62.6 52.5 68.0 76.6 78.2
RN50x16 93.3 92.2 74.9 72.8 79.2 88.7 62.7 89.0 79.1 93.5 93.7 98.3 98.9 68.7 99.0 97.0 91.4 89.0 69.2 34.8 83.5 88.0 66.3 53.8 71.1 80.0 81.5
RN50x64 94.8 94.1 78.6 77.2 81.1 90.5 67.7 88.9 82.0 94.5 95.4 98.9 98.9 71.3 99.3 97.1 92.8 90.2 69.2 40.7 83.7 89.5 69.1 55.0 75.0 81.2 83.6
CLIP-ViT
B/32 88.8 95.1 80.5 58.5 76.6 81.8 52.0 87.7 76.5 90.0 93.0 96.9 99.0 69.2 98.6 97.0 90.5 85.3 66.2 27.8 83.9 85.5 61.7 52.1 66.7 70.8 76.1
B/16 92.8 96.2 83.1 67.8 78.4 86.7 59.5 89.2 79.2 93.1 94.7 98.1 99.0 69.5 99.2 97.1 92.7 86.6 67.8 33.3 83.5 88.4 66.1 57.1 70.3 75.5 80.2
L/14 95.2 98.0 87.5 77.0 81.8 90.9 69.4 89.6 82.1 95.1 96.5 99.2 99.2 72.2 99.8 98.2 94.1 92.5 64.7 42.9 85.8 91.5 72.0 57.8 76.2 80.8 83.9
L/14-336px 95.9 97.9 87.4 79.9 82.2 91.5 71.6 89.9 83.0 95.1 96.0 99.2 99.2 72.9 99.7 98.1 94.9 92.4 69.2 46.4 85.6 92.0 73.0 60.3 77.3 80.5 85.4
EfficientNet
B0 74.3 92.5 76.5 59.7 62.0 62.5 55.7 84.4 71.2 93.0 93.3 91.7 98.2 57.2 97.8 97.3 85.5 80.0 73.8 12.4 83.1 74.4 47.6 47.9 55.7 53.4 76.9
B1 74.2 93.2 77.2 61.3 62.6 62.5 56.1 84.7 74.2 93.4 93.6 92.4 98.3 57.0 97.9 96.8 84.5 75.9 75.5 12.5 82.7 74.7 48.5 44.3 54.5 54.4 78.6
B2 75.8 93.6 77.9 64.4 64.0 63.2 57.0 85.3 73.5 93.9 93.5 92.9 98.5 56.6 98.1 96.9 84.4 76.4 73.1 12.6 84.3 75.1 49.4 42.6 55.4 55.2 79.7
B3 77.4 94.0 78.0 66.5 64.4 66.0 59.3 85.8 73.1 94.1 93.7 93.3 98.5 57.1 98.6 97.3 85.0 75.8 76.1 13.4 83.3 78.1 50.9 45.1 53.8 54.8 81.0
B4 79.7 94.1 78.7 70.1 65.4 66.4 60.4 86.5 73.4 94.7 93.5 93.2 98.8 57.9 98.9 96.8 85.0 78.3 72.3 13.9 83.1 79.1 52.5 46.5 54.4 55.4 82.9
B5 81.5 93.6 77.9 72.4 67.1 72.7 68.9 86.7 73.9 95.0 94.7 94.5 98.4 58.5 99.1 96.8 86.0 78.5 69.6 14.9 84.7 80.9 54.5 46.6 53.3 56.3 83.7
B6 82.4 94.0 78.0 73.5 65.8 71.1 68.2 87.6 73.9 95.0 94.1 93.7 98.4 60.2 99.0 96.8 85.4 78.1 72.7 15.3 84.2 80.0 54.1 51.1 53.3 57.0 84.0
B7 84.5 94.9 80.1 74.7 69.0 77.1 72.3 87.2 76.8 95.2 94.7 95.9 98.6 61.3 99.3 96.3 86.8 80.8 75.8 16.4 85.2 81.9 56.8 51.9 54.4 57.8 84.8
B8 84.5 95.0 80.7 75.2 69.6 76.8 71.5 87.4 77.1 94.9 95.2 96.3 98.6 61.4 99.4 97.0 87.4 80.4 70.9 17.4 85.2 82.4 57.7 51.4 51.7 55.8 85.3
EfficientNet Noisy Student
B0 78.1 94.0 78.6 63.5 65.5 57.2 53.7 85.6 75.6 93.8 93.1 94.5 98.1 55.6 98.6 97.0 84.3 74.0 71.6 14.0 83.1 76.7 51.7 47.3 55.7 55.0 78.5
B1 80.4 95.1 80.2 66.6 67.6 59.6 53.7 86.2 77.0 94.6 94.4 95.1 98.0 56.1 98.9 96.9 84.3 73.1 67.1 14.5 83.9 79.9 54.5 46.1 54.3 54.9 81.1
B2 80.9 95.3 81.3 67.6 67.9 60.9 55.2 86.3 77.7 95.0 94.7 94.4 98.0 55.5 98.9 97.3 84.6 71.7 70.0 14.6 82.9 80.1 55.1 46.1 54.1 55.3 82.2
B3 82.6 95.9 82.1 68.6 68.8 60.6 55.4 86.5 77.2 95.0 94.8 95.2 98.1 56.0 99.3 96.5 85.0 70.5 69.5 15.1 83.1 81.8 56.8 45.1 55.7 52.0 83.8
B4 85.2 95.6 81.0 72.5 69.7 56.1 52.6 87.0 78.7 94.8 95.2 95.3 98.2 56.0 99.4 95.3 84.8 61.9 64.8 16.0 82.8 83.4 59.8 43.2 55.3 53.0 85.4
B5 87.6 96.3 82.4 75.3 71.6 64.7 64.8 87.8 79.6 95.5 95.6 96.6 98.8 60.9 99.5 96.1 87.0 68.5 73.7 16.4 83.5 86.4 61.6 46.3 53.4 55.8 85.8
B6 87.3 97.0 83.9 75.8 71.4 67.6 65.6 87.3 78.5 95.2 96.4 97.2 98.6 61.9 99.7 96.6 86.1 70.7 72.4 17.6 84.2 85.5 61.0 49.6 54.6 55.7 86.4
B7 88.4 96.0 82.0 76.9 72.6 72.2 71.2 88.1 80.5 95.5 95.5 96.6 98.5 62.7 99.5 96.2 88.5 73.4 73.0 18.5 83.8 86.6 63.2 50.5 57.2 56.7 87.0
L2-475 91.6 99.0 91.0 74.8 76.4 75.1 66.8 89.5 81.9 95.6 96.5 97.7 98.9 67.5 99.7 97.0 89.5 73.4 68.9 22.2 86.3 89.4 68.2 58.3 58.6 55.2 88.3
L2-800 92.0 98.7 89.0 78.5 75.7 75.5 68.4 89.4 82.5 95.6 94.7 97.9 98.5 68.4 99.7 97.2 89.9 77.7 66.9 23.7 86.8 88.9 66.7 62.7 58.4 56.9 88.4
Instagram
32x8d 84.8 95.9 80.9 63.8 69.0 74.2 56.0 88.0 75.4 95.4 93.9 91.7 97.4 60.7 64.4 95.7 82.1 72.3 69.2 16.7 82.3 80.1 56.8 42.2 53.3 55.2 83.3
32x16d 85.7 96.5 80.9 64.8 70.5 77.5 56.7 87.9 76.2 95.6 94.9 92.5 97.4 61.6 76.7 95.5 82.8 73.8 66.1 17.5 83.4 81.1 58.2 41.3 54.2 56.1 84.4
32x32d 86.7 96.8 82.7 67.1 71.5 77.5 55.4 88.3 78.5 95.8 95.3 94.4 97.9 62.4 87.5 95.7 85.4 71.2 66.8 18.0 83.7 82.1 58.8 39.7 55.3 56.7 85.0
32x48d 86.9 96.8 83.4 65.9 72.2 76.6 53.2 88.0 77.2 95.5 95.8 93.6 98.1 63.7 88.3 95.3 85.4 73.0 67.2 18.5 82.7 82.8 59.2 41.3 55.5 56.7 85.2
FixRes-v1 88.5 95.7 81.1 67.4 72.9 80.5 57.6 88.0 77.9 95.8 96.1 94.5 97.9 62.2 82.9 96.2 86.6 76.5 64.8 19.3 82.5 83.4 59.8 43.5 56.6 59.0 86.0
FixRes-v2 88.5 95.7 81.1 67.3 72.9 80.7 57.5 88.0 77.9 95.0 96.0 94.5 98.0 62.1 83.1 96.5 86.6 76.3 64.8 19.5 82.3 83.5 59.8 44.2 56.6 59.0 86.0
BiT-S
R50x1 72.5 91.7 74.8 57.7 61.1 53.5 52.5 83.7 72.4 92.3 91.2 92.0 98.4 56.1 76.7 97.4 85.0 70.0 66.0 12.5 83.0 72.3 47.5 48.3 54.1 55.3 75.2
R50x3 75.1 93.7 79.0 61.1 63.7 55.2 54.1 84.8 74.6 92.5 91.6 92.8 98.8 58.7 74.9 97.8 86.4 73.1 73.8 14.0 84.2 76.4 50.0 49.2 54.7 54.2 77.2
R101x1 73.5 92.8 77.4 58.4 61.3 54.0 52.4 84.4 73.5 92.5 91.8 90.6 98.3 56.5 63.5 97.3 84.6 69.4 68.9 12.6 82.0 73.5 48.6 45.4 52.6 55.5 76.0
R101x3 74.7 93.9 79.8 57.8 62.9 54.7 53.3 84.7 75.5 92.3 91.2 92.6 98.8 59.7 61.5 98.0 85.5 71.8 60.2 14.1 83.1 75.9 50.4 49.7 54.1 54.6 77.4
R152x2 74.9 94.3 79.7 58.7 62.7 55.9 53.6 85.3 74.9 93.0 92.0 91.7 98.6 58.3 60.9 97.8 86.2 71.8 71.6 13.9 84.1 76.2 49.9 48.2 53.8 55.9 77.1
R152x4 74.7 94.2 79.2 57.8 62.9 51.2 50.8 85.4 75.4 93.1 91.2 91.4 98.9 61.4 64.2 98.0 85.5 72.8 67.9 14.9 83.1 76.0 50.3 42.9 53.6 56.0 78.5
BiT-M
R50x1 83.3 94.9 82.2 70.9 69.9 59.0 55.6 86.8 77.3 91.5 93.9 99.4 98.0 60.6 81.5 97.5 87.4 68.6 68.2 16.6 82.5 79.4 53.2 49.4 54.5 53.4 76.7
R50x3 86.9 96.7 86.2 75.7 74.6 60.6 54.2 87.7 78.5 93.2 95.3 99.4 98.6 64.6 81.1 98.0 88.1 69.9 59.6 19.6 83.4 83.5 57.8 51.3 55.8 55.6 80.7
R101x1 85.5 95.7 84.4 73.0 72.5 59.8 55.0 87.3 78.1 92.2 95.0 99.5 98.1 62.5 72.0 97.6 87.8 68.7 67.7 18.0 84.0 82.3 55.9 53.4 54.8 53.1 79.4
R101x3 87.2 97.4 87.5 72.4 75.0 57.4 47.4 87.5 79.6 93.2 95.4 99.6 98.6 64.3 73.8 98.2 87.7 68.8 64.1 20.7 80.4 84.0 58.7 52.6 54.9 54.3 81.2
R152x2 88.0 97.5 87.8 75.8 75.9 61.5 55.3 88.1 79.8 93.6 95.9 99.5 98.5 64.3 61.4 97.9 89.0 70.0 70.3 20.7 82.6 85.5 59.6 50.8 54.9 55.1 81.9
R152x4 87.2 97.6 88.2 72.4 75.0 49.1 43.4 87.1 79.9 92.4 95.4 99.3 98.5 65.7 57.7 97.8 87.7 68.2 57.1 20.6 80.4 84.6 59.0 49.7 57.2 55.1 81.5
ViT
B/32 81.8 96.7 86.3 65.2 70.7 49.1 42.7 85.3 73.1 90.4 94.5 98.7 97.8 59.0 99.1 96.3 83.0 68.1 65.1 15.7 82.6 79.1 51.7 38.9 57.1 54.6 76.5
B/16 86.7 96.9 86.4 74.0 74.2 54.7 46.0 86.7 74.3 92.7 94.1 99.2 97.4 61.3 99.5 96.4 84.5 63.1 61.5 17.5 85.4 82.7 56.6 40.0 57.0 56.1 80.9
L/16 87.4 97.9 89.0 76.5 74.9 62.5 52.2 86.1 75.0 92.9 94.7 99.3 98.0 64.0 99.7 96.5 85.7 70.4 58.8 17.7 85.7 84.1 58.0 38.4 58.4 52.8 81.5
H/14 83.4 95.8 84.5 70.2 69.2 62.3 54.8 84.7 75.4 91.7 93.7 98.9 98.5 62.4 98.5 97.3 87.0 73.9 63.4 15.4 87.0 79.4 52.1 41.1 55.9 54.1 75.4
SimCLR
R50x1 76.4 93.2 77.9 48.6 64.1 56.3 51.7 84.4 77.0 88.3 91.8 92.9 97.6 59.7 97.9 97.5 85.8 71.1 69.1 15.8 84.8 78.4 51.0 56.2 53.9 53.8 73.8
R50x3 81.0 95.6 82.4 56.5 67.0 65.6 61.1 85.9 78.8 90.9 94.1 95.4 98.7 62.6 98.8 97.9 88.2 78.2 74.7 17.6 85.4 82.6 54.6 55.4 54.2 55.2 77.3
R101x1 77.9 94.8 79.9 51.9 65.2 57.1 52.0 85.4 77.2 90.0 91.6 92.7 97.2 59.4 98.2 96.8 84.6 65.7 70.6 16.1 84.3 78.8 52.4 53.6 55.1 55.7 76.1
R101x3 82.2 96.4 83.4 57.5 68.2 64.6 60.0 86.2 78.9 91.8 95.0 95.4 98.4 63.0 99.0 97.9 88.0 77.5 69.1 18.3 85.5 82.9 55.9 52.2 54.5 56.3 78.8
R152x1 78.6 95.0 79.9 50.3 65.6 55.6 52.2 85.8 77.3 90.1 92.5 91.8 97.6 59.8 98.6 96.6 84.3 64.8 70.3 16.6 83.9 79.4 53.1 57.2 55.8 54.8 76.9
R152x2 82.3 96.7 83.9 58.1 68.5 64.9 58.7 86.6 79.1 92.2 94.1 96.0 98.2 64.1 99.0 98.0 88.1 77.0 69.8 18.4 85.3 82.7 56.2 53.6 56.0 56.5 79.2
R152x3 83.6 96.8 84.5 60.3 69.1 68.5 63.1 86.7 80.5 92.6 94.9 96.3 98.7 65.4 99.2 98.1 89.5 78.4 68.5 19.4 85.2 83.5 57.0 54.4 54.6 54.2 80.0
BYOL
50x1 74.0 93.6 79.1 47.6 63.7 61.6 62.3 82.6 77.0 88.3 93.7 94.3 98.7 58.8 97.4 97.6 88.2 80.1 71.4 14.1 84.8 77.3 49.3 56.1 53.8 54.4 73.3
200x2 78.5 96.2 83.3 53.4 68.5 61.7 55.4 86.6 77.4 91.9 95.5 93.9 98.7 62.6 99.0 97.7 87.4 77.1 76.4 16.4 84.0 82.6 55.1 54.1 52.5 52.4 79.2
MoCo
v1 65.9 85.0 63.1 27.5 52.6 35.9 43.5 75.7 70.0 70.4 78.1 85.4 97.6 54.3 59.1 97.1 82.9 62.6 60.2 12.6 85.7 64.2 40.7 54.7 55.6 53.5 57.2
v2 72.2 93.4 76.3 39.6 60.2 48.3 51.1 82.6 75.1 84.4 89.9 90.7 98.4 58.3 62.9 97.2 85.4 75.7 75.4 13.2 85.6 72.7 47.8 56.9 53.9 53.8 69.1
VirTex 57.9 83.9 57.5 17.0 49.8 22.4 34.5 83.8 58.2 53.6 70.6 74.7 98.1 56.5 68.1 94.8 74.1 69.5 71.3 8.7 83.1 61.5 39.9 45.5 53.5 55.8 50.7
ResNet
50 71.3 91.8 74.5 52.7 60.5 49.9 48.5 83.8 72.3 92.4 90.8 90.8 98.3 54.9 64.6 96.7 83.6 70.6 67.1 11.7 82.5 71.2 46.8 43.0 56.5 55.5 74.3
101 72.7 93.0 77.2 53.7 60.8 50.1 47.0 84.4 71.6 92.3 91.9 90.4 98.5 56.6 62.1 97.1 83.4 72.5 63.6 11.9 83.3 72.7 48.3 43.2 53.0 54.7 75.8
152 73.7 93.5 78.0 55.1 61.6 52.8 48.4 84.5 71.9 93.0 92.1 89.6 98.2 57.0 61.5 97.0 83.1 70.1 70.2 12.3 82.9 75.3 49.2 42.4 53.2 53.9 77.1
Table 10. Linear probe performance of various pre-trained models over 27 datasets. Scores within the 99.5% Clopper-Pearson confidence
interval of each dataset’s top score are shown in bold.
[Figure 20 consists of 27 log-scale panels, one per dataset (Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, StanfordCars, FGVCAircraft, PascalVOC2007, DescribableTextures, OxfordPets, Caltech101, Flowers102, MNIST, FacialEmotionRecognition2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, PatchCamelyon, UCF101, Kinetics700, CLEVRCounts, Country211, HatefulMemes, SST2, ImageNet), each plotting the linear probe score (accuracy, mean per class, mean(top1, top5), or ROC AUC, depending on the dataset) against GFLOPs/image for CLIP-ViT, CLIP-ResNet, EfficientNet-NoisyStudent, EfficientNet, Instagram-pretrained, SimCLRv2, BYOL, MoCo, ViT (ImageNet-21k), BiT-M, BiT-S, and ResNet models.]
Figure 20. Linear probe performance plotted for each of the 27 datasets, using the data from Table 10.
[Figure 21 shows one example from each of 36 datasets, including Food101, SUN397, Youtube-BB, EuroSAT, PASCAL VOC 2007, MNIST, Street View House Numbers (SVHN), ImageNet Vid, ImageNet Sketch, Hateful Memes, Stanford Sentiment Treebank, and the German Traffic Sign Recognition Benchmark (GTSRB). Each panel reports the correct label, its rank and predicted probability under the corresponding zero-shot classifier, and the five highest-scoring class prompts, e.g. "a photo of guacamole, a type of food.", "a centered satellite photo of annual crop land.", "a photo of the number: \"7\".", or "a zoomed in photo of a \"red and white triangle with exclamation mark warning\" traffic sign.".]
Figure 21. Visualization of predictions from 36 CLIP zero-shot classifiers. All examples are random with the exception of reselecting
Hateful Memes to avoid offensive content. The predicted probability of the top 5 classes is shown along with the text used to represent
the class. When more than one template is used, the first template is shown. The ground truth label is colored green while an incorrect
prediction is colored orange.
B. Zero-Shot Prediction

To provide a qualitative overview of CLIP's zero-shot performance, we visualize a randomly selected prediction for 36 different zero-shot CLIP classifiers in Figure 21.
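For reference, the construction behind each of these zero-shot classifiers can be sketched as follows, assuming a text encoder that returns L2-normalized embeddings; the encode_text helper and the fixed logit scale below are illustrative stand-ins rather than the released implementation. Each class name is inserted into one or more prompt templates, the resulting text embeddings are averaged and re-normalized, and an image embedding is scored against every class embedding.

```python
import numpy as np

def build_zero_shot_classifier(class_names, templates, encode_text):
    """Average the (normalized) text embeddings of every template for each class."""
    weights = []
    for name in class_names:
        prompts = [t.format(name) for t in templates]   # e.g. "a photo of a {}."
        emb = encode_text(prompts)                      # (num_templates, d), unit norm
        mean = emb.mean(axis=0)
        weights.append(mean / np.linalg.norm(mean))     # re-normalize the mean embedding
    return np.stack(weights)                            # (num_classes, d)

def top5(image_embedding, class_weights, class_names, logit_scale=100.0):
    """Return the five most probable classes for one unit-normalized image embedding."""
    logits = logit_scale * class_weights @ image_embedding   # scaled cosine similarities
    probs = np.exp(logits - logits.max())
    probs /= probs.sum()
    order = np.argsort(-probs)[:5]
    return [(class_names[i], float(probs[i])) for i in order]
```

The class embeddings only need to be computed once per dataset; every image is then classified with a single matrix-vector product.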
C. Duplicate Detector

Our early attempts at duplicate detection and analysis used [...]

                 Linear Classifier          Zero Shot
Dataset          YFCC   WIT    ∆            YFCC   WIT    ∆
Birdsnap         47.4   35.3   +12.1        19.9   4.5    +15.4
Country211       23.1   17.3   +5.8         5.2    5.3    +0.1
Flowers102       94.4   89.8   +4.6         48.6   21.7   +26.9
GTSRB            66.8   72.5   −5.7         6.9    7.0    −0.1
UCF101           69.2   74.9   −5.7         22.9   32.0   −9.1
Stanford Cars    31.4   50.3   −18.9        3.8    10.9   −7.1
ImageNet         62.0   60.8   +1.2         31.3   27.6   +3.7
Dataset Average  65.5   66.6   −1.1         29.6   30.0   −0.4
Dataset “Wins”   10     15     −5           19     18     +1

[...] likely as YFCC100M is only 3.7% of the overall WIT data blend and it did not noticeably change the performance of models when it was added to the existing data blend during the creation of WIT.
E. Selected Task and Dataset Results

Due to the large variety of datasets and experiments considered in this work, the main body focuses on summarizing and analyzing overall results. In the following subsections we report details of performance for specific groups of tasks, datasets, and evaluation settings.

E.1. Image and Text Retrieval

CLIP pre-trains for the task of image-text retrieval on our noisy web-scale dataset. Although the focus of this paper is on representation learning and task learning for the purpose of transfer to a wide variety of downstream datasets, validating that CLIP is able to achieve high transfer performance on exactly what it is pre-trained for is an important sanity check / proof of concept. In Table 12 we check the zero-shot transfer performance of CLIP for both text and image retrieval on the Flickr30k and MSCOCO datasets. Zero-shot CLIP matches or outperforms all prior zero-shot results on these two datasets. Zero-shot CLIP is also competitive with the current overall SOTA for the task of text retrieval on Flickr30k. On image retrieval, CLIP's performance relative to the overall state of the art is noticeably lower. However, zero-shot CLIP is still competitive with a fine-tuned Unicoder-VL. On the larger MS-COCO dataset fine-tuning improves performance significantly and zero-shot CLIP is not competitive with the most recent work. For both these datasets we prepend the prompt "a photo of" to the description of each image, which we found boosts CLIP's zero-shot R@1 performance between 1 and 2 points.
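As a concrete illustration of this evaluation protocol, the sketch below computes recall@K for image retrieval and R@1 for text retrieval from unit-normalized CLIP embeddings; the caption-to-image index structures and the encode_text call in the final comment are hypothetical stand-ins, and captions are assumed to have been encoded with the "a photo of " prefix described above.

```python
import numpy as np

def image_retrieval_recall_at_k(image_embs, caption_embs, caption_to_image, k=1):
    """For each caption, check whether its ground-truth image is among the top-k images."""
    sims = caption_embs @ image_embs.T                  # (num_captions, num_images)
    topk = np.argsort(-sims, axis=1)[:, :k]             # indices of the k most similar images
    hits = [caption_to_image[i] in topk[i] for i in range(sims.shape[0])]
    return float(np.mean(hits))

def text_retrieval_recall_at_1(image_embs, caption_embs, image_to_captions):
    """For each image, check whether its best-scoring caption is one of its own captions."""
    best = (image_embs @ caption_embs.T).argmax(axis=1)
    hits = [best[i] in image_to_captions[i] for i in range(len(best))]
    return float(np.mean(hits))

# Captions would be embedded as encode_text(["a photo of " + c for c in captions]).
```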
E.2. Optical Character Recognition

Although visualizations have shown that ImageNet models contain features that respond to the presence of text in an image (Zeiler & Fergus, 2014), these representations are not sufficiently fine-grained to use for the task of optical character recognition (OCR). To compensate, models are augmented with the outputs of custom OCR engines and features to boost performance on tasks where this capability is required (Singh et al., 2019; Yang et al., 2020). Early during the development of CLIP, we noticed that CLIP began to learn primitive OCR capabilities which appeared to steadily improve over the course of the project. To evaluate this qualitatively observed behavior, we measured performance on 5 datasets requiring the direct and indirect use of OCR. Three of these datasets, MNIST (LeCun), SVHN (Netzer et al., 2011), and IIIT5K (Mishra et al., 2012), directly check the ability of a model to perform low-level character and word recognition, while Hateful Memes (Kiela et al., 2020) and SST-2 (Socher et al., 2013) check the ability of a model to use OCR to perform a semantic task. Results are reported in Table 13.

CLIP's performance is still highly variable and appears to be sensitive to some combination of the domain (rendered or natural images) and the type of text to be recognized (numbers or words). CLIP's OCR performance is strongest on Hateful Memes and SST-2, datasets where the text is digitally rendered and consists mostly of words. On IIIT5K, which is natural images of individually cropped words, zero-shot CLIP performs a bit more respectably and its performance is similar to Jaderberg et al. (2014)'s early work combining deep learning and structured prediction to perform open-vocabulary OCR. However, performance is noticeably lower on two datasets involving recognition of handwritten and street view numbers. CLIP's 51% accuracy on full number SVHN is well below any published results. Inspection suggests CLIP struggles with repeated characters as well as the low resolution and blurry images of SVHN. CLIP's zero-shot MNIST performance is also poor and is outperformed by supervised logistic regression on raw pixels, one of the simplest possible machine learning baselines.

SST-2 is a sentence level NLP dataset which we render into images. We include SST-2 in order to check whether CLIP is able to convert low level OCR capability into a higher level representation. Fitting a linear classifier on CLIP's representation of rendered sentences achieves 80.5% accuracy. This is on par with the 80% accuracy of a continuous bag of words baseline using GloVe word vectors pre-trained on 840 billion tokens (Pennington et al., 2014). While this is a simple NLP baseline by today's standard, and well below the 97.5% of the current SOTA, it is encouraging to see that CLIP is able to turn an image of rendered text into a non-trivial sentence level representation. Fully supervised CLIP is also surprisingly strong on Hateful Meme detection, where CLIP is only 0.7 points behind the current single model SOTA and several points above the best baseline from the original paper. Similar to SST-2, these other results on Hateful Memes use the ground truth text which CLIP does not have access to. Finally, we note that zero-shot CLIP outperforms the best results using fully supervised linear probes across all other 56 models included in our evaluation suite. This suggests CLIP's OCR capability is at least somewhat unique compared to existing work on self-supervised and supervised representation learning.
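For concreteness, the kind of rendering pipeline behind a dataset like Rendered SST2 can be sketched as below; the canvas size, font, margins, and line wrapping here are illustrative choices and are not necessarily those used to produce the actual dataset (see Figure 19 for real examples).

```python
import textwrap
from PIL import Image, ImageDraw

def render_sentence(text, size=(448, 448), chars_per_line=40):
    """Rasterize a sentence onto a white canvas so it can be fed to an image encoder."""
    img = Image.new("RGB", size, color="white")
    draw = ImageDraw.Draw(img)
    wrapped = "\n".join(textwrap.wrap(text, width=chars_per_line))
    draw.multiline_text((16, 16), wrapped, fill="black")  # uses PIL's default bitmap font
    return img

example = render_sentence("a deeply moving and often funny film .")
# example.save("sst2_example.png")
```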
E.3. Action Recognition in Videos

For the purpose of learning, a potentially important aspect of natural language is its ability to express, and therefore supervise, an extremely wide set of concepts. A CLIP model, since it is trained to pair semi-arbitrary text with images, is [...]
                   Text Retrieval                              Image Retrieval
                   Flickr30k           MSCOCO                  Flickr30k           MSCOCO
                   R@1   R@5   R@10    R@1   R@5   R@10        R@1   R@5   R@10    R@1   R@5   R@10
Uniter b           87.3  98.0  99.2    65.7  88.6  93.8        75.6  94.1  96.8    52.9  79.9  88.0
VILLA c            87.9  97.5  98.8    -     -     -           76.3  94.2  96.8    -     -     -
Oscar d            -     -     -       73.5  92.2  96.0        -     -     -       57.5  82.8  89.8
ERNIE-ViL e        88.7  98.0  99.2    -     -     -           76.7  93.6  96.4    -     -     -
Zero-Shot:
Visual N-Grams f   15.4  35.7  45.1    8.7   23.1  33.3        8.8   21.2  29.9    5.0   14.5  21.9

Table 12. CLIP improves zero-shot retrieval and is competitive with the best fine-tuned result on Flickr30k text retrieval. Bold indicates best overall performance while an underline indicates best in category performance (zero-shot or fine-tuned). For all other models, best results from the paper are reported regardless of model size / variant. MSCOCO performance is reported on the 5k test set.
a (Li et al., 2020a) b (Chen et al., 2019) c (Gan et al., 2020) d (Li et al., 2020b) e (Yu et al., 2020) f (Li et al., 2017) g (Qi et al., 2020)
Table 15. Detailed ImageNet robustness performance. IN is used as an abbreviation for ImageNet. a (Xie et al., 2020) b (Touvron et al., 2019)
[...] we report CLIP's zero-shot performance when averaging predictions across all frames. CLIP also performs well in this setting and on Kinetics-700 its performance is within 1% of the fully supervised I3D baseline which is trained on 545,000 labeled videos. Encouraged by these results, we also measure CLIP's performance on the recently introduced RareAct dataset (Miech et al., 2020a), which was designed to measure zero-shot recognition of unusual actions like "hammering a phone" and "drilling an egg". CLIP improves over the prior state of the art, an S3D model trained on automatically extracted captions from 100 million instructional videos, by 10 points.

While CLIP has encouragingly strong performance on the task of action recognition, we note that there are many differences between the models being compared beyond just their form of supervision, such as model architecture, training data distribution, dataset size, and compute used. Further work is needed to more precisely determine what specific design decisions contribute to achieving high performance on this task.
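One way to read the frame-averaging evaluation described above is sketched below, assuming unit-normalized per-frame image embeddings and a zero-shot class-weight matrix built as in the earlier sketch; averaging per-frame class probabilities (rather than, say, logits or frame embeddings) is an illustrative choice, not necessarily the exact procedure used.

```python
import numpy as np

def video_zero_shot_probs(frame_embeddings, class_weights, logit_scale=100.0):
    """Score every frame against the zero-shot class embeddings and average the
    per-frame class probabilities to obtain a single prediction for the clip."""
    logits = logit_scale * frame_embeddings @ class_weights.T   # (num_frames, num_classes)
    logits -= logits.max(axis=1, keepdims=True)                 # numerical stability
    probs = np.exp(logits)
    probs /= probs.sum(axis=1, keepdims=True)
    return probs.mean(axis=0)                                   # (num_classes,)
```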
E.4. Geolocalization

Another behavior we noticed during the development of CLIP was its ability to recognize many places and locations. To quantify this we created the Country211 dataset as described in Appendix A and report results on it throughout the paper. However, since it is a new benchmark, to compare with prior work on geolocalization we also report results on the IM2GPS test set from Hays & Efros (2008) in Table 16.

              1km    25km   200km   750km   2500km
ISNs a        16.9   43.0   51.9    66.7    80.2
CPlaNet b     16.5   37.1   46.4    62.0    78.5
CLIP          13.9   32.9   43.0    62.0    79.3
Deep-Ret+ c   14.4   33.3   47.7    61.6    73.4
PlaNet d      8.4    24.5   37.6    53.6    71.3

Since IM2GPS is a regression benchmark, we guess the GPS coordinates of the nearest image in a set of reference images using CLIP's embedding space. This is not a zero-shot result since it uses nearest-neighbor regression. Despite querying only 1 million images, which is much less than prior work, CLIP performs similarly to several task-specific models. It is not, however, competitive with the current state of the art.
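A minimal sketch of this nearest-neighbor lookup is given below, assuming a reference set of unit-normalized CLIP image embeddings with known GPS coordinates (all names hypothetical); the second helper shows the great-circle distance check behind the km-threshold metrics in the table above.

```python
import numpy as np

def predict_gps(query_embedding, reference_embeddings, reference_coords):
    """Return the GPS coordinates of the reference image closest in CLIP embedding space."""
    sims = reference_embeddings @ query_embedding        # cosine similarity to every reference image
    return reference_coords[int(np.argmax(sims))]        # (latitude, longitude) of the nearest neighbor

def within_threshold(pred, target, threshold_km):
    """Haversine great-circle distance between two (lat, lon) pairs, in kilometers."""
    lat1, lon1, lat2, lon2 = map(np.radians, (*pred, *target))
    a = (np.sin((lat2 - lat1) / 2) ** 2 +
         np.cos(lat1) * np.cos(lat2) * np.sin((lon2 - lon1) / 2) ** 2)
    distance_km = 2 * 6371.0 * np.arcsin(np.sqrt(a))
    return distance_km <= threshold_km
```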
E.5. Robustness to Distribution Shift

Section 3.3 provides a high-level summary and analysis of ImageNet-related robustness results. We briefly provide some additional numerical details in this appendix. Performance results per dataset are provided in Table 15 and compared with the current state of the art results reported in Taori et al. (2020)'s evaluation suite. Zero-shot CLIP improves the state of the art on 5 of the 7 datasets: ImageNet-R, ObjectNet, ImageNet-Sketch, ImageNet-Vid, and Youtube-BB. CLIP's improvements are largest on ImageNet-Vid and Youtube-BB due to its flexible zero-shot capability, and on ImageNet-R, which likely reflects CLIP's pre-training distribution including significant amounts of creative content. A similar behavior has been documented for the Instagram pre-trained ResNeXt models as discussed in Taori et al. (2020).
Zero-shot performance of the CLIP models over the same 27 datasets. Columns, left to right: Food101, CIFAR10, CIFAR100, Birdsnap, SUN397, Stanford Cars, FGVC Aircraft, VOC2007, DTD, Oxford Pets, Caltech101, Flowers102, MNIST, FER2013, STL10, EuroSAT, RESISC45, GTSRB, KITTI, Country211, PCam, UCF101, Kinetics700, CLEVR, HatefulMemes, Rendered SST2, ImageNet.
CLIP-ResNet
RN50 81.1 75.6 41.6 32.6 59.6 55.8 19.3 82.1 41.7 85.4 82.1 65.9 66.6 42.2 94.3 41.1 54.2 35.2 42.2 16.1 57.6 63.6 43.5 20.3 59.7 56.9 59.6
RN101 83.9 81.0 49.0 37.2 59.9 62.3 19.5 82.4 43.9 86.2 85.1 65.7 59.3 45.6 96.7 33.1 58.5 38.3 33.3 16.9 55.2 62.2 46.7 28.1 61.1 64.2 62.2
RN50x4 86.8 79.2 48.9 41.6 62.7 67.9 24.6 83.0 49.3 88.1 86.0 68.0 75.2 51.1 96.4 35.0 59.2 35.7 26.0 20.2 57.5 65.5 49.0 17.0 58.3 66.6 65.8
RN50x16 90.5 82.2 54.2 45.9 65.0 72.3 30.3 82.9 52.8 89.7 87.6 71.9 80.0 56.0 97.8 40.3 64.4 39.6 33.9 24.0 62.5 68.7 53.4 17.6 58.9 67.6 70.5
RN50x64 91.8 86.8 61.3 48.9 66.9 76.0 35.6 83.8 53.4 93.4 90.6 77.3 90.8 61.0 98.3 59.4 69.7 47.9 33.2 29.6 65.0 74.1 56.8 27.5 62.1 70.7 73.6
CLIP-ViT
B/32 84.4 91.3 65.1 37.8 63.2 59.4 21.2 83.1 44.5 87.0 87.9 66.7 51.9 47.3 97.2 49.4 60.3 32.2 39.4 17.8 58.4 64.5 47.8 24.8 57.6 59.6 63.2
B/16 89.2 91.6 68.7 39.1 65.2 65.6 27.1 83.9 46.0 88.9 89.3 70.4 56.0 52.7 98.2 54.1 65.5 43.3 44.0 23.3 48.1 69.8 52.4 23.4 61.7 59.8 68.6
L/14 92.9 96.2 77.9 48.3 67.7 77.3 36.1 84.1 55.3 93.5 92.6 78.7 87.2 57.5 99.3 59.9 71.6 50.3 23.1 32.7 58.8 76.2 60.3 24.3 63.3 64.0 75.3
L/14-336px 93.8 95.7 77.5 49.5 68.4 78.8 37.2 84.3 55.7 93.5 92.8 78.3 88.3 57.7 99.4 59.6 71.7 52.3 21.9 34.9 63.0 76.9 61.3 24.8 63.3 67.9 76.2