0% found this document useful (0 votes)
60 views10 pages

Zhou Conditional Prompt Learning For Vision-Language Models CVPR 2022 Paper

Uploaded by

reghecampfluca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
60 views10 pages

Zhou Conditional Prompt Learning For Vision-Language Models CVPR 2022 Paper

Uploaded by

reghecampfluca
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 10

Conditional Prompt Learning for Vision-Language Models

Kaiyang Zhou Jingkang Yang Chen Change Loy Ziwei Liu


S-Lab, Nanyang Technological University, Singapore
{kaiyang.zhou, jingkang001, ccloy, ziwei.liu}@ntu.edu.sg

Abstract method focuses on closed-set visual concepts, limiting the


model to a pre-defined list of categories, and is unscalable
With the rise of powerful pre-trained vision-language when it comes to new categories unseen during training.
models like CLIP, it becomes essential to investigate ways In contrast, for vision-language models1 like CLIP [40]
to adapt these models to downstream datasets. A recently and ALIGN [24], the classification weights are diametri-
proposed method named Context Optimization (CoOp) in- cally generated by a parameterized text encoder (e.g., a
troduces the concept of prompt learning—a recent trend in Transformer [48]) through prompting [34]. For instance, to
NLP—to the vision domain for adapting pre-trained vision- differentiate pet images containing different breeds of dogs
language models. Specifically, CoOp turns context words and cats, one can adopt a prompt template like “a photo of
in a prompt into a set of learnable vectors and, with only a a {class}, a type of pet” as input to the text encoder, and
few labeled images for learning, can achieve huge improve- as a result, class-specific weights for classification can be
ments over intensively-tuned manual prompts. In our study synthesized by filling in the “{class}” token with real class
we identify a critical problem of CoOp: the learned con- names. Compared to discrete labels, vision-language mod-
text is not generalizable to wider unseen classes within the els’ source of supervision comes from natural language,
same dataset, suggesting that CoOp overfits base classes which allows open-set visual concepts to be broadly ex-
observed during training. To address the problem, we pro- plored and has been proven effective in learning transferable
pose Conditional Context Optimization (CoCoOp), which representations [24, 40].
extends CoOp by further learning a lightweight neural net- With the rise of such powerful vision-language models,
work to generate for each image an input-conditional token the community has recently started to investigate potential
(vector). Compared to CoOp’s static prompts, our dynamic solutions to efficiently adapt these models to downstream
prompts adapt to each instance and are thus less sensitive datasets [14, 53, 56, 62]. To fit web-scale data, such as the
to class shift. Extensive experiments show that CoCoOp 400 million pairs of images and texts used by CLIP, vision-
generalizes much better than CoOp to unseen classes, even language models are purposefully designed to have high ca-
showing promising transferability beyond a single dataset; pacity, entailing that the model size would be enormous,
and yields stronger domain generalization performance as typically with hundreds of millions of parameters or even
well. Code is available at https://github.com/ billions. Therefore, fine-tuning the entire model, as often
KaiyangZhou/CoOp. adopted in deep learning research [18], is impractical and
might even damage the well-learned representation space.
A safer approach is to tune a prompt by adding some
1. Introduction context that is meaningful to a task, like “a type of pet” for
the pet dataset mentioned above, which has been found ef-
Recent research in large-scale vision-language pre-
fective in improving performance [40]. However, prompt
training has achieved striking performance in zero-shot im-
engineering is extremely time-consuming and inefficient as
age recognition [13,24,33,40], demonstrating a potential in
it has to be based on trial and error, and does not guaran-
learning open-world visual concepts for such a paradigm.
tee an optimal prompt either. To automate prompt engi-
The key design lies in how visual concepts are modeled.
neering, Zhou et al. [62] have recently explored the concept
In traditional supervised learning where labels are dis-
of prompt learning—a recent trend in NLP [15, 25, 30, 32,
cretized, each category is associated with a randomly initial-
44, 60]—for adapting pre-trained vision-language models.
ized weight vector that is learned to minimize the distance
Their approach, Context Optimization (CoOp), turns con-
with images containing the same category. Such a learning
1 We follow existing studies [13,24,33,40] to refer to CLIP-like models

Corresponding author as vision-language models.

16816
<latexit sha1_base64="8143cVdce6SsJFqV+hM9gRzsI3Q=">AAADNHicjVJNb9MwGHbC1whsbHDkYtFNKpcq6WFwnNQLF2BIdJuURJXjuKtVJ47sN+2qyH+EK/wG/gsSN8SV34CTBZS0HHilWI+fx+/zxK+cFIJr8P1vjnvn7r37D/Yeeo8e7x88OTx6eqFlqSibUimkukqIZoLnbAocBLsqFCNZIthlspzU+uWKKc1l/hE2BYszcp3zOacELDU7cvZPjsMoyaqVmQXxMf67GdebaJFK0B32rWXDCNgNNNGVYqmpQBGeY7uINdmYeOR5Hc9hDW7My553l9zN6Kj/lRWtGocGLlrY9CXzaiIn8n1hrPaH2XZ8x9aYCqI108bMDgf+yG8K74KgBQPU1rkd4EGUSlpmLIfGJQz8AuKKKOBUMONFpWYFoUtyzUILc5IxHVdNvsEnlknxXCr75YAbtttRkUzrTZbYkxmBhd7WavJfWljC/HVc8bwogeX0NmheCgwS168Ap1wxCmJjAaGK23/FdEEUoWDfSi8lyXp3qLJSAFdy3WcTKZdAEm08O8Fge1674GI8Ck5H/ofx4GzYznIPPUcv0BAF6BU6Q2/QOZoi6oDzyfnsfHG/ut/dH+7P26Ou0/Y8Q71yf/0GApsFgw==</latexit>

CoCoOp
<latexit sha1_base64="nJGHb+fSeYbFHtGpvQfsLkYuVT0=">AAADMnicjVLLbtQwFHXCq6RQWliysZhWGjajZBaFZaVuugGKxLSVkmjkOE7HGjuO7JuZjqL8R7fwDfwM7BBbPgInHVAyw4IrJTr3HN9zkisnheAGfP+b4967/+Dho53H3u6Tp3vP9g+eXxhVasomVAmlrxJimOA5mwAHwa4KzYhMBLtM5qeNfrlg2nCVf4JVwWJJrnOecUrAUtMDZ/foMIwSWS3qaRAf4r/NuGmiWarAdNh3lg0jYDfQRleapXUFmvAc25dYklUdjzyv4zlswE39uufdJbczOup/ZUWL1qGFszVs55KsOlUfitoqf/pNv/dsiakgxjBT19P9gT/y28LbIFiDAVrXuV3fXpQqWkqWQ+sSBn4BcUU0cCpY7UWlYQWhc3LNQgtzIpmJqza/xkeWSXGmtH1ywC3bnaiINGYlE3tSEpiZTa0h/6WFJWRv44rnRQksp3dBWSkwKNzcAZxyzSiIlQWEam6/FdMZ0YSCvSm9lET2/qGSpQCu1bLPJkrNgSSm9uwGg819bYOL8Sg4Hvkfx4OT4XqXO+gleoWGKEBv0Ak6Q+dogqijnVvns/PF/ep+d3+4P++Ous565gXqlfvrN8MdBL0=</latexit>

Zero-shot CoOp
<latexit sha1_base64="IJtCaraVVLlW6HHagghvMZyBOHc=">AAACOHicbVA9TwJBEN3DL8QvxNLmIjGxkdxRqCWJjaUmokQgZHeZkw27t5fdOYVc+Cu2+hv8J3Z2xtZf4AJXiPqSSV7em8nMPJZIYTEI3rzC0vLK6lpxvbSxubW9U96t3FidGg5NrqU2LUYtSBFDEwVKaCUGqGISbtnwfOrfPoCxQsfXOE6gq+h9LCLBKTqpV650EEbIouwOjD62A42TXrka1IIZ/L8kzEmV5Ljs7Xrbnb7mqYIYuaTWtsMgwW5GDQouYVLqpBYSyof0HtqOxlSB7Waz4yf+oVP6fqSNqxj9mfpzIqPK2rFirlNRHNjf3lT8z2unGJ11MxEnKULM54uiVPqo/WkSfl8Y4CjHjlBuhLvV5wNqKEeX18IWphZ+yFQqURj9uKgyrYdImZ2UXILh77z+kpt6LTypBVf1auMoz7JI9skBOSIhOSUNckEuSZNwMiJP5Jm8eK/eu/fhfc5bC14+s0cW4H19AzV1rS8=</latexit>

Base classes
<latexit sha1_base64="0ivCj/QEdXYHnnI7rZsyExeY+nE=">AAADQXicbVLPb9MwFHYyfowCYxs3uFh0k8olSnroJk6DXbgAQ6LbpCSqHMdZrdpxZL+0q6Ic+Gu4wt/AX8GfwA1x5YKbdVO67klxvve99z7bn5wUghvw/V+Ou3Hv/oOHm486j5883Xq2vbN7alSpKRtSJZQ+T4hhgudsCBwEOy80IzIR7CyZHC/qZ1OmDVf5F5gXLJbkIucZpwQsNdpxXuzvhVEiq2k9CuI9fJP0F0k0ThWYFvvBsiFownNsFzEj89jrdFoSvQW4rF+vSLXJdclW9S7paNoMNHB8DSNgl5Bk1bH6VNStvPk3rlSapXX1kc0wFcQYZuqmj8liXL2ltNSEzt/gg4F3OLCF6/l31sqbgdF21/f8JvA6CJagi5ZxYt3cilJFS8lyaETCwC8grogGTgWrO1FpWEHohFyw0MKcSGbiqjlvjfctk+JMafvlgBu2PVERacxcJrZTEhib27UFeVctLCE7jCueFyWwnF5tlJUCg8KLJ4FTrhkFMbeAUM3tWTEdE2sQ2IezsksiV+5QyVIA12q2yiZKTYAkpu5YB4Pbfq2D074XDDz/c7971Ft6uYleoleohwJ0gI7Qe3SChog6X51vznfnh/vT/e3+cf9etbrOcuY5Wgn3339dNAbl</latexit>

<latexit sha1_base64="f1SUBf8cVkB8dE/1Q5lxwIU6Dy4=">AAADOXicbVJNb9NAEF2br2KgtOXIZUVaKVwiO4e04lTUCxegSKStZFvRer1JVtn1WrvjpJbl38IVfgO/hCM3xJU/wMYJECeMZOvNm5k3u0+b5IIb8P1vjnvn7r37D/Yeeo8eP9l/enB4dGVUoSkbUiWUvkmIYYJnbAgcBLvJNSMyEew6mV0s69dzpg1X2UcocxZLMsn4mFMClhodOkfHYZTIal6PgvgY/036yySapgrMBvvWsiHRms+JwBMCLO553sk/he4S3NYvW0qb5K7iRjWMgN1Cc6dKs7SuQBOeYfsTC1LWq13RvFFo4PQPbAaTcXWh3uf1Rr4t+I4tMBXEGGbqpo/JfFq9prTQhJav8OmgdzaoRwcdv+c3gXdBsAYdtI5L6+J+lCpaSJZBIx8Gfg5xRTRwKljtRYVhOaEzMmGhhRmRzMRVc7Aan1gmxWOl7ZcBbtjNiYpIY0qZ2E5JYGq2a0vyf7WwgPFZXPEsL4BldLVoXAgMCi+fAk65ZhREaQGhmtuzYjol1gmwD6a1JZGtO1SyEMC1WrTZRKkZkMTUnnUw2PZrF1z1e8Gg53/od867ay/30HP0AnVRgE7ROXqDLtEQUad0PjmfnS/uV/e7+8P9uWp1nfXMM9QK99dv+RAE2A==</latexit> <latexit sha1_base64="zX3nn3Sqn+w/qDQYjs0d4xdW0sQ=">AAADKHicbVLLjtMwFHXCawgwD1iysWgrlU2VdNEZsRo0GzbAINGZkZKochy3sWrHkX3TThXlF9jCN/A17NBs+RKcTIF2ypWSnHvuvcfOsZNCcAO+f+O49+4/ePho77H35Omz/YPDo+cXRpWasjFVQumrhBgmeM7GwEGwq0IzIhPBLpP5WVO/XDBtuMo/w6pgsSSznE85JWCpyZHj9LphlMhqUU+CuIv/JsMmibJUgdlg31s2BE14ju1LLMkqHnjeP4V+A67r11tKm+Su4kY1JFrzBRF4RoA1wr1utGj7W5j9gRGwa0im1Zn6WNQbefttPak0S+vqA1tiKogxzNRtH5NFVr2ltNSErt7g49HgZFRPDjv+wG8D74JgDTpoHefWtP0oVbSULIdWPgz8AuKKaOBUsNqLSsMKQudkxkILcyKZiat2YzXuWSbFU6XtkwNu2c2JikhjVjKxnZJAZu7WGvJ/tbCE6Ulc8bwogeX0dqFpKTAo3Jw8TrlmFMTKAkI1t3vFNCPWCbD3Y2uVRG79QyVLAVyr5TabKDUHkpjasw4Gd/3aBRfDQTAa+J+GndP+2ss99BK9Qn0UoGN0it6hczRG1MmcL85X55v73f3h/nRvbltdZz3zAm2F++s3r/P8tA==</latexit>

<latexit sha1_base64="+xVRXtBc7t53www2R1akeGF/K1k=">AAACSnicbVC7TgMxEPSFVwivACWNRUCiiu4ogBKJhjJIJCAdp2jP8SVW7PPJ3guKTvkBvoYWvoEf4DfoEA3OoyAJI9kezexqvRNnUlj0/U+vtLK6tr5R3qxsbe/s7lX3D1pW54bxJtNSm8cYLJci5U0UKPljZjioWPKHuH8z9h8G3Fih03scZjxS0E1FIhigk9rVkxAiGmY9jdq9OnHXWABjxAAk7QLyqN6u1vy6PwFdJsGM1MgMjfa+t/vU0SxXPEUmwdow8DOMCjAomOSjylNueQasD10eOpqC4jYqJuuM6KlTOjTRxp0U6UT921GAsnaoYlepAHt20RuL/3lhjslVVIg0y5GnbDooySVFTcfZ0I4wnKEcOgLMCPdXynpggKFLcG5KrOZ2KFQuURj9PK/GWvcRYjuquASDxbyWSeu8HlzU/bvz2vXZLMsyOSLH5IwE5JJck1vSIE3CyAt5JW/k3fvwvrxv72daWvJmPYdkDqXVX8LtsjE=</latexit>

[a] [photo] [of] [a] [arrival gate]. [v1 ] [v2 ] . . . [vM ] [arrival gate]. [v1 (x)] [v2 (x)] . . . [vM (x)] [arrival gate].
.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

...
<latexit sha1_base64="CrDUidhbV+gYx4dWg1zM4wwFCGM=">AAADNnicjVLLbtQwFHXCqwToA5ZsLKaVhs0omUVhWTEbNogiMW2lJBrZjtOxxokj+2baUeQ/YQvfwK+wYYfY8gk46YAyHRZcydHxub7nOEemlRQGwvCb59+5e+/+g52HwaPHT3b39g+enhlVa8anTEmlLygxXIqST0GA5BeV5qSgkp/TxaTtny+5NkKVH2FV8bQgl6XIBSPgqNmBt3d0GCe0aJZ2FqWH+O9m3G6SeabA9Nh3jo0T4NfQWTeaZ7YBTUSJ3UdekZVNR0HQ0xy24Nq+3NDuk9seve5/eSXLTiEI/mg5spujeTNRE/W+sj2mp0hlzW3zxsWHmSTGcGPtbH8QjsKu8DaI1mCA1nXqItxNMsXqgpfQqcRRWEHaEA2CSW6DpDa8ImxBLnnsYEkKbtKmu4HFR47JcK60WyXgju1PNKQwZlVQd7IgMDe3ey35r15cQ/46bURZ1cBLdmOU1xKDwu07wJnQnIFcOUCYFu6umM2JJgzca9lwocXGPzRFLUFodbXJUqUWQKixgUswup3XNjgbj6LjUfhhPDgZrrPcQc/RCzREEXqFTtBbdIqmiHlL75P32fvif/W/+z/8nzdHfW898wxtlP/rN106BlU=</latexit>
. . .
<latexit sha1_base64="oyFSkVR5pVK+BaFeHpURW7SEkEo=">AAACR3icbVBNTwIxEO3iF+IX6NFLlZhwIrsc1COJF4+YyEeybEi3dKGh3W7aWQ3ZcPbXeNXf4E/wV3gzHu0CBwEnaeflvZnMzAsTwQ247qdT2Nre2d0r7pcODo+OT8qV045RqaasTZVQuhcSwwSPWRs4CNZLNCMyFKwbTu5yvfvEtOEqfoRpwgJJRjGPOCVgqUH5wicB9pOxAmWziuyXE1Yes6EmIqgPylW37s4DbwJvCapoGa1BxTnuDxVNJYuBCmKM77kJBBnRwKlgs1I/NSwhdEJGzLcwJpKZIJvfMsNXlhniSGn7YsBz9m9HRqQxUxnaSmmXNOtaTv6n+SlEt0HG4yQFFtPFoCgVGBTOjcFDrhkFMbWAUM3trpiOiSYUrH0rU0K5ckMmUwFcq+dVNlRqAiQ0s5J10Fv3axN0GnXvuu4+NKrN2tLLIjpHl6iGPHSDmugetVAbUfSCXtEbenc+nC/n2/lZlBacZc8ZWomC8wtbS7EJ</latexit> <latexit sha1_base64="w2uawpOynxeIAr2p+Zz3wz9wQgQ=">AAADNnicbVJNb9NAEF2br2KgH3DksiKtFC6WnUNacSrqhQvQSk1bybai9XrSrLLrtXbXSSPL/4Qr/Ab+ChduiCs/gY0TIE4YydabN/Nmdp82LTjTJgi+Oe69+w8ePtp57D15+mx3b//g+ZWWpaIwoJJLdZMSDZzlMDDMcLgpFBCRcrhOJ2eL+vUUlGYyvzTzAhJBbnM2YpQYSw0PnL3DKE5FNa2HYXKI/ya9RRKPM2n0GvvespGVjiFThCe+5x39k3cX4K5+3RqzTm6PW6tGsYE701yoUpDVlVGE5dj++IzM6+WueNpMaOD4D2yE6ag6kx+Lei3fHPgBZphyojXouukDUYyrt5SWitD5G3zc90/69XC/E/hBE3gbhCvQQas4txbuxpmkpYDcNOOjMChMUhFlGOVQe3GpoSB0Qm4hsjAnAnRSNQer8ZFlMjySyn65wQ27rqiI0HouUtsprO96s7Yg/1eLSjM6SSqWF6WBnC4XjUqOjcSLd4AzpoAaPreAUMXsWTEdE+uEsa+ltSUVrTtUouSGKTlrs6mUE0NSXXvWwXDTr21w1fPDvh9c9Dqn3ZWXO+gleoW6KETH6BS9Q+dogKgzdT45n50v7lf3u/vD/blsdZ2V5gVqhfvrN5lQA7A=</latexit>

<latexit sha1_base64="Sds3QJMIKgV/HNcJ+p5gIs1s934=">AAADJHicbVJNj9MwEHXC11Jg6cKRi0VbqVyqpIfuitOivXABFonurpREleNMNlbtOLKddqsof4Ar/AZ+DTfEgQs/BeFkC7RbRorz5s34efySuOBMG8/74bi3bt+5e2/vfufBw0f7j7sHT860LBWFKZVcqouYaOAsh6lhhsNFoYCImMN5PD9p6ucLUJrJ/INZFRAJcpmzlFFiLDXr/hr0gzAW1aKe+VEf/03GTRJmiTR6g31j2cAownJsF74kq2jU6fxTGDbgqn6xpbRJ7ipuVAM7VAaJIrxRHfTDRdvcwuwPDA1cmTitTuS7ot7I23frR6Ugqau3sMSUE61B120fiCKrXlFaKkJXL/HhZHQ0qWfdnjfy2sC7wF+DHlrH6ezA2Q8TSUsBuWnlA98rTFQRZRjlUHfCUkNB6JxcQmBhTgToqGoHq/HAMglOpbJPbnDLbu6oiNB6JWLbKawX+matIf9XC0qTHkUVy4vSQE6vD0pLjo3EzVfHCVNADV9ZQKhidlZMM2KdMPbf2DolFlt3qETJDVNyuc3GUs4NiXXdsQ76N/3aBWfjkT8Zee/HvePh2ss99Aw9R0Pko0N0jF6jUzRF1Emcj84n57P7xf3qfnO/X7e6znrPU7QV7s/f+gr8gQ==</latexit>

[a] [photo] [of] [a] [cathedral]. [v1 ] [v2 ] . . . [vM ] [cathedral]. [v1 (x)] [v2 (x)] . . . [vM (x)] [cathedral].
<latexit sha1_base64="9fR6HmABMleNcUoWa6oyKg/n/6s=">AAACPHicbZDLTgIxFIY7eEO8gSZu3DQSE1ZkBhO8rDBuXGIilwQI6ZQCDe100p7RkJGXcavP4Hu4d2fcurbALAT8kyZ//nNOTs/nh4IbcN0PJ7W2vrG5ld7O7Ozu7R9kc4d1oyJNWY0qoXTTJ4YJHrAacBCsGWpGpC9Ywx/dTuuNR6YNV8EDjEPWkWQQ8D6nBGzUzR63mQyH8Q2lkSZ0fI3LV8Xz8qSbzbtFdya8arzE5FGiajfn7Ld7ikaSBUAFMabluSF0YqKBU8EmmXZkWEjoiAxYy9qASGY68eyACT6zSQ/3lbYvADxL/07ERBozlr7tlASGZrk2Df+rtSLoX3ZiHoQRsIDOF/UjgUHhKQ3c45pREGNrCNXc/hXTIbEkwDJb2OLLhRtiGQngWj0tpr5SIyC+mWQsQW+Z16qpl4peuejel/KVQsIyjU7QKSogD12gCrpDVVRDFD2jF/SK3px359P5cr7nrSknmTlCC3J+fgEWTa10</latexit>

<latexit sha1_base64="mCP733lRKW729yXdlkP1ECjR4Hs=">AAADS3icjVLBbtNAEF27FEqA0sKRy4qkUrhYdg6l4lTUCxegSKStZFvRer2pV9n1WrvjpJblL+BruMI38AF8BzfEgY0bkJNwYKS13ryZebN+2qQQ3IDvf3fcnTu7d+/t3e89ePho//HB4ZMLo0pN2ZgqofRVQgwTPGdj4CDYVaEZkYlgl8nsbFm/nDNtuMo/QlWwWJLrnE85JWCpyaEzOBqEUSLreTMJ4gH+m4yWSZSlCkyHfWvZMAJ2A+3qWrO0qUETnmP7EQtSNbHX63U0h0tw07xY0+6S2zs61f/aFc1bhRZmf2A7mEzrM/W+aDr5puA7tsBUEGOYaWxfxGSR1a8pLTWh1St84nvHfjM56Pue3wbeBsEK9NEqzq2v+1GqaClZDq16GPgFxDXRwKlgTS8qDSsInZFrFlqYE8lMXLf3avCRZVI8VdqeHHDLdidqIo2pZGI7JYHMbNaW5L9qYQnTk7jmeVECy+ntomkpMCi8fBw45ZpREJUFhGpu74ppRqwTYJ/Q2pZErv1DLUsBXKvFOpsoNQOSmKZnHQw2/doGFyMvOPb8D6P+6XDl5R56hp6jIQrQS3SK3qBzNEbU+eR8dr44X91v7g/3p/vrttV1VjNP0Vrs7P4GptUMcA==</latexit>

<latexit sha1_base64="kASuoAz9VVVabPM2RlbBZuw3XFo=">AAADS3icjVLBbtNAEF27FIqB0sKRy4qkUrhEdlQR4FTUCxegSKStZFvRer1uVtn1WrvjpJHlL+BruMI38AF8BzfEgY0bkJNwYKS13ryZebN+2qQQ3IDvf3fcnVu7t+/s3fXu3X+w//Dg8NG5UaWmbESVUPoyIYYJnrMRcBDsstCMyESwi2R6uqxfzJg2XOUfYVGwWJKrnGecErDU+NDpHnXDKJHVrB4HcRf/TQbLJJqkCkyLfWvZMAJ2Dc3qSrO0rkATnmP7EXOyqOO+57U0e0twXT9b026T2zta1f/aFc0ahQZO/sBmMMmqU/W+qFv5puA7NsdUEGOYqW1fxGQxqV5TWmpCF6/w8GV/eFyPDzp+328Cb4NgBTpoFWfW1/0oVbSULIdGPQz8AuKKaOBUsNqLSsMKQqfkioUW5kQyE1fNvWp8ZJkUZ0rbkwNu2PZERaQxC5nYTklgYjZrS/JftbCE7EVc8bwogeX0ZlFWCgwKLx8HTrlmFMTCAkI1t3fFdEKsE2Cf0NqWRK79QyVLAVyr+TqbKDUFkpjasw4Gm35tg/NBP3je9z8MOie9lZd76Al6inooQEN0gt6gMzRC1PnkfHa+OF/db+4P96f766bVdVYzj9Fa7Oz+Br3HDH0=</latexit>

Accuracy: 69.36 Accuracy: 80.60 Accuracy: 79.74


<latexit sha1_base64="YrdloPpTXXetAZ2qlWj/z7dT0lQ=">AAADOHicbVLLbtNAFB2bVwlQGliyGZFWCpvIziKtWLXqhg1QJNJWiq1oPLlJRpnxWDPXecjyr7CFb+BP2LFDbPkCJm5ATtMr2T73nPsYH02SSWExCH54/r37Dx4+2nvcePL02f7zg+aLS6tzw6HPtdTmOmEWpEihjwIlXGcGmEokXCWz87V+NQdjhU4/4yqDWLFJKsaCM3TUsOk1jw4HUaKKeTkM40P6P+muk2g60mhr7HvHDtAwkVL3kgu2ijuNRm1Eew2W5ZutUXVyd2RNvWt0NK8aKjj9ByOEJSbj4lx/zMpaXn0rVwoDo7L4AAvKJbMWbFnVgcqmxRnnuWF89ZYe9zonPSecGSPmTNIJQxgetIJOUAXdBeEGtMgmLpyJ+9FI81xBitWyQRhkGBfMoOASykaUW8gYn7EJDBxMmQIbF9UxS3rkmBEda+OeFGnF1jsKpqxdqcRVKoZTe1tbk3dpgxzHJ3Eh0ixHSPnNonEuKWq6vgl0JAxwlCsHGDfCnZXyKXO+oLsvW1sStfUPhcolCqMX22yi9QxZYsuGczC87dcuuOx2wl4n+NRtnbY3Xu6RV+Q1aZOQHJNT8o5ckD7h3tL74n31vvnf/Z/+L//3TanvbXpekq3w//wF8BACsw==</latexit>

Arrival gate Cathedral


<latexit sha1_base64="X9DUCorVWpoVUM8/d/6DB0Zwkhw=">AAADNXicbVJNb9MwGHbC1wiwDzhysegmlUuU9NBNnIZ64QIMiW6TkqhynLerVTuObKddFeWXcIXfwG/hwA1x5S/gZAWl614pzvM+76cfOS040yYIfjjuvfsPHj7aeew9efpsd2//4Pm5lqWiMKaSS3WZEg2c5TA2zHC4LBQQkXK4SOejJn6xAKWZzD+bVQGJIFc5mzJKjKUmB87u0WEUp6Ja1JMwOcT/nUHjxLNMGt1h31s2MoqwHNuDL8kq8T2v06LfgOv69UarLrndshO9q3W8aAtaOPsHYwPXJp1WI/mxqDt++29VqRRkdfUBlphyojXous0DUcyqt5SWitDVG3w89E+GNjAiZgaZInyy3wv8oDW8DcI16KG1nTUKxpmkpYDctJOiMChMUhFlGOVQe3GpoSB0Tq4gsjAnAnRStTvW+MgyGZ5KZb/c4JbtVlREaL0Sqc0UdkV9O9aQd8Wi0kxPkorlRWkgpzeDpiXHRuLmGeCMKaCGrywgVDG7K6YzYkUx9rFsTEnFxh0qUXLDlFxusqmUc0NSXXtWwfC2XtvgfOCHQz/4NOid9tda7qCX6BXqoxAdo1P0Dp2hMaJO6Xxxvjrf3O/uT/eX+/sm1XXWNS/Qhrl//gKV5AGL</latexit>

(a) Both CoOp and CoCoOp work well on the base classes observed during training and beat manual prompts by a significant margin.
<latexit sha1_base64="8143cVdce6SsJFqV+hM9gRzsI3Q=">AAADNHicjVJNb9MwGHbC1whsbHDkYtFNKpcq6WFwnNQLF2BIdJuURJXjuKtVJ47sN+2qyH+EK/wG/gsSN8SV34CTBZS0HHilWI+fx+/zxK+cFIJr8P1vjnvn7r37D/Yeeo8e7x88OTx6eqFlqSibUimkukqIZoLnbAocBLsqFCNZIthlspzU+uWKKc1l/hE2BYszcp3zOacELDU7cvZPjsMoyaqVmQXxMf67GdebaJFK0B32rWXDCNgNNNGVYqmpQBGeY7uINdmYeOR5Hc9hDW7My553l9zN6Kj/lRWtGocGLlrY9CXzaiIn8n1hrPaH2XZ8x9aYCqI108bMDgf+yG8K74KgBQPU1rkd4EGUSlpmLIfGJQz8AuKKKOBUMONFpWYFoUtyzUILc5IxHVdNvsEnlknxXCr75YAbtttRkUzrTZbYkxmBhd7WavJfWljC/HVc8bwogeX0NmheCgwS168Ap1wxCmJjAaGK23/FdEEUoWDfSi8lyXp3qLJSAFdy3WcTKZdAEm08O8Fge1674GI8Ck5H/ofx4GzYznIPPUcv0BAF6BU6Q2/QOZoi6oDzyfnsfHG/ut/dH+7P26Ou0/Y8Q71yf/0GApsFgw==</latexit>

CoCoOp
<latexit sha1_base64="nJGHb+fSeYbFHtGpvQfsLkYuVT0=">AAADMnicjVLLbtQwFHXCq6RQWliysZhWGjajZBaFZaVuugGKxLSVkmjkOE7HGjuO7JuZjqL8R7fwDfwM7BBbPgInHVAyw4IrJTr3HN9zkisnheAGfP+b4967/+Dho53H3u6Tp3vP9g+eXxhVasomVAmlrxJimOA5mwAHwa4KzYhMBLtM5qeNfrlg2nCVf4JVwWJJrnOecUrAUtMDZ/foMIwSWS3qaRAf4r/NuGmiWarAdNh3lg0jYDfQRleapXUFmvAc25dYklUdjzyv4zlswE39uufdJbczOup/ZUWL1qGFszVs55KsOlUfitoqf/pNv/dsiakgxjBT19P9gT/y28LbIFiDAVrXuV3fXpQqWkqWQ+sSBn4BcUU0cCpY7UWlYQWhc3LNQgtzIpmJqza/xkeWSXGmtH1ywC3bnaiINGYlE3tSEpiZTa0h/6WFJWRv44rnRQksp3dBWSkwKNzcAZxyzSiIlQWEam6/FdMZ0YSCvSm9lET2/qGSpQCu1bLPJkrNgSSm9uwGg819bYOL8Sg4Hvkfx4OT4XqXO+gleoWGKEBv0Ak6Q+dogqijnVvns/PF/ep+d3+4P++Ous565gXqlfvrN8MdBL0=</latexit>

New classes Zero-shot CoOp


<latexit sha1_base64="tQWd+dbC3pYqVRjIn2dwUiR7CO4=">AAADQHicbVJNb9QwEHXCV1mgtHBCXCy2lZZLlOxhW3Eq6oULUCS2rbSJVo7jbay148ie7DaKIvFruMJv4F/wD7ghrpzwpluUdDtSnDdvZp7tJ8e54AZ8/6fj3rl77/6DrYe9R4+fbD/d2X12alShKRtTJZQ+j4lhgmdsDBwEO881IzIW7CyeH6/qZwumDVfZZyhzFklykfEZpwQsNd11XuzvTcJYVot6GkR7+H8yXCVhmigwLfa9ZSegCc+wXcSSlJHX67UkBitwWb/uSLXJTclW9TbpcNEMNDC9hiGwS4hn1bH6mNetvPk3rlSaJXX1gS0xFcQYZuqmj8k8rd5SWmhCyzf4YOQdjmzher7dP93p+57fBN4EwRr00TpOrJnbYaJoIVkGjcgk8HOIKqKBU8HqXlgYlhM6JxdsYmFGJDNR1Ry3xvuWSfBMaftlgBu2PVERaUwpY9spCaTmZm1F3labFDA7jCqe5QWwjF5tNCsEBoVXLwInXDMKorSAUM3tWTFNifUH7Lvp7BLLzh0qWQjgWi27bKzUHEhs6p51MLjp1yY4HXrByPM/DftHg7WXW+gleoUGKEAH6Ai9QydojKjzxfnqfHO+uz/cX+5v989Vq+usZ56jTrh//wFJJgaK</latexit> <latexit sha1_base64="IJtCaraVVLlW6HHagghvMZyBOHc=">AAACOHicbVA9TwJBEN3DL8QvxNLmIjGxkdxRqCWJjaUmokQgZHeZkw27t5fdOYVc+Cu2+hv8J3Z2xtZf4AJXiPqSSV7em8nMPJZIYTEI3rzC0vLK6lpxvbSxubW9U96t3FidGg5NrqU2LUYtSBFDEwVKaCUGqGISbtnwfOrfPoCxQsfXOE6gq+h9LCLBKTqpV650EEbIouwOjD62A42TXrka1IIZ/L8kzEmV5Ljs7Xrbnb7mqYIYuaTWtsMgwW5GDQouYVLqpBYSyof0HtqOxlSB7Waz4yf+oVP6fqSNqxj9mfpzIqPK2rFirlNRHNjf3lT8z2unGJ11MxEnKULM54uiVPqo/WkSfl8Y4CjHjlBuhLvV5wNqKEeX18IWphZ+yFQqURj9uKgyrYdImZ2UXILh77z+kpt6LTypBVf1auMoz7JI9skBOSIhOSUNckEuSZNwMiJP5Jm8eK/eu/fhfc5bC14+s0cW4H19AzV1rS8=</latexit>

<latexit sha1_base64="bYymbds63ROIjnQFB2XBSZyp6V0=">AAADNnicbVJNb9NAEF2brxKgH3DksiKtFC6WnUNacSrqhQvQSk1bybai9XrTrLLrtXbHSSPL/4Qr/Ab+ChduiCs/gbUTIE4YydabNzNvdp82yQU34PvfHPfe/QcPH+087jx5+mx3b//g+ZVRhaZsSJVQ+iYhhgmesSFwEOwm14zIRLDrZHpW169nTBuusktY5CyW5DbjY04JWGp04OwdhlEiy1k1CuJD/Dfp10k0SRWYNfa9ZcM5z1I8JlrGXqdz9G+8V4O76nVLZp3cllurhhGwO2guVGqWViVowjNsf2JOFtVyVzRrFBo4+QObwWRcnqmPebWWbwp+YHNMBTGGmarpYzKflG8pLTShizf4eOCdDKrRftf3/CbwNghWoItWcW4t3I1SRQvJMmjkw8DPIS6JBk4FqzpRYVhO6JTcstDCjEhm4rI5WIWPLGPdVNp+GeCGXZ8oiTRmIRPbKQlMzGatJv9XCwsYn8Qlz/ICWEaXi8aFwKBw/Q5wyjWjIBYWEKq5PSumE2KdAPtaWlsS2bpDKQsBXKt5m02UmgJJTNWxDgabfm2Dq74XDDz/ot897a283EEv0SvUQwE6RqfoHTpHQ0SdmfPJ+ex8cb+6390f7s9lq+usZl6gVri/fgMcsQOA</latexit>

[v1 ] [v2 ] . . . [vM ] [wind farm].


<latexit sha1_base64="knM3erqhsh6bC2DEtwwf35p1uZE=">AAADJHicbVJNb9MwGHbC1ygwOjhysWgrlUuU9NBNnIZ24QIMiW6TkqhyHHe1aseR/aZdFOUPcIXfwK/hhjhw4acgnKxAu/JKdp73eV8/th8nyQU34Ps/HPfW7Tt37+3d7zx4+Gj/cffgyZlRhaZsQpVQ+iIhhgmesQlwEOwi14zIRLDzZHHS1M+XTBuusg9Q5iyW5DLjM04JWGra/TXoh1Eiq2U9DeI+/puMmiSapwrMBvvGsiFowjNsJ7EiZex1Ov8Uhg24ql9sKW2Su4ob1XDFsxTPiJaN6qAfLdvmFs7/wAjYFSSz6kS9y+uNvP22flSapXX1lq0wFcQYZuq2j8l8Xr2itNCEli/x4dg7GtfTbs/3/DbwLgjWoIfWcTo9cPajVNFCsgxa+TDwc4grooFTwepOVBiWE7oglyy0MCOSmbhqD1bjgWXsDZW2IwPcspsrKiKNKWViOyWBublZa8j/1cICZkdxxbO8AJbR641mhcCgcPPqOOWaURClBYRqbs+K6ZxYJ8D+G1u7JHLrDpUsBHCtVttsotQCSGLqjnUwuOnXLjgbecHY89+PesfDtZd76Bl6joYoQIfoGL1Gp2iCqJM6H51Pzmf3i/vV/eZ+v251nfWap2gr3J+/AZGr/FE=</latexit>

<latexit sha1_base64="9z/5HEKimh9eQz1kLtmR+HpiKRA=">AAACR3icbVC7TgMxEPSFd3glUNIYIiSq6I4CKJFoKINEINJxivYcH7Hix8neA0Wn1HwNLXwDn8BX0CFKnJCCEEayPZrd1XgnzaVwGIbvQWVhcWl5ZXWtur6xubVdq+/cOFNYxtvMSGM7KTguheZtFCh5J7ccVCr5bTq4GNdvH7h1wuhrHOY8UXCvRSYYoJe6tf0YEhrnfYPGvybz11h4FLpHM7AqaXZrjbAZTkDnSTQlDTJFq1sPtu56hhWKa2QSnIujMMekBIuCST6q3hWO58AGcM9jTzUo7pJyssuIHnrFWxvrj0Y6UX9PlKCcG6rUdyrAvvtbG4v/1eICs7OkFDovkGv2Y5QVkqKh42BoT1jOUA49AWaF/ytlfbDA0Mc345KqmR1KVUgU1jzOqqkxA4TUjao+wehvXvPk5rgZnTTDq+PG+dE0y1WyRw7IEYnIKTknl6RF2oSRJ/JMXshr8BZ8BJ/B109rJZjO7JIZVIJvBjyw2Q==</latexit>

[a] [photo] [of] [a] [wind farm]. [v1 (x)] [v2 (x)] . . . [vM (x)] [wind farm].
.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

.. <latexit sha1_base64="4TW2Q2kvoIwAfSidpPz6R0O1N8s=">AAADAHicjVFNj9MwEHXC1xJg6cKRi0W7UrlUSQ/AsVIvXBCLRHdXaqLKdpytVTuO7El3qygXfg03xJV/Ar8GJ1uhZMuBkWy9efPxxmNaSGEhDH95/r37Dx4+OnocPHn67Pj54OTFudWlYXzBtNTmkhLLpcj5AgRIflkYThSV/IJu5k38YsuNFTr/AruCJ4pc5SITjICjVoPfp6NlTFW1rVdRMsJ/nWnjxOtUg+2wHx27jIHfQKtcGZ7WFRgicuwueU12dTIJgk7PcQNu6je93l3yUKMT/Q+tUbxtGzjVNplm1VzP9aeiXg2G4SRsDR+CaA+GaG9nqxPvOE41KxXPgUli7TIKC0gqYkAwyesgLi0vCNuQK750MCeK26Rqx6vxqWNSnGnjTg64ZbsVFVHW7hR1mYrA2t6NNeS/YssSsvdJJfKiBJ6zW6GslBg0bn4Up8JwBnLnAGFGuFkxWxNDGLh/76lQ1XtDpUoJwujrPku13gChtg7cBqO7+zoE59NJ9HYSfp4OZ+P9Lo/QK/QajVGE3qEZ+oDO0AIxb+ZlnvYK/6v/zf/u/7hN9b19zUvUM//nHw/58a8=</latexit>

...
<latexit sha1_base64="CrDUidhbV+gYx4dWg1zM4wwFCGM=">AAADNnicjVLLbtQwFHXCqwToA5ZsLKaVhs0omUVhWTEbNogiMW2lJBrZjtOxxokj+2baUeQ/YQvfwK+wYYfY8gk46YAyHRZcydHxub7nOEemlRQGwvCb59+5e+/+g52HwaPHT3b39g+enhlVa8anTEmlLygxXIqST0GA5BeV5qSgkp/TxaTtny+5NkKVH2FV8bQgl6XIBSPgqNmBt3d0GCe0aJZ2FqWH+O9m3G6SeabA9Nh3jo0T4NfQWTeaZ7YBTUSJ3UdekZVNR0HQ0xy24Nq+3NDuk9seve5/eSXLTiEI/mg5spujeTNRE/W+sj2mp0hlzW3zxsWHmSTGcGPtbH8QjsKu8DaI1mCA1nXqItxNMsXqgpfQqcRRWEHaEA2CSW6DpDa8ImxBLnnsYEkKbtKmu4HFR47JcK60WyXgju1PNKQwZlVQd7IgMDe3ey35r15cQ/46bURZ1cBLdmOU1xKDwu07wJnQnIFcOUCYFu6umM2JJgzca9lwocXGPzRFLUFodbXJUqUWQKixgUswup3XNjgbj6LjUfhhPDgZrrPcQc/RCzREEXqFTtBbdIqmiHlL75P32fvif/W/+z/8nzdHfW898wxtlP/rN106BlU=</latexit>

. . .
<latexit sha1_base64="CXyZvQbzqGgRKR8uT6+QZOyCWjs=">AAADOnicbVJNj9MwEHXC1xJgP9gjF4vuSuVSJT10V5wW7YULsEh0d6UkqhzX3Vq148ietBui/Beu8Bv4I1y5Ia78AJy0QNMyUqI3b2be2E9OMsEN+P43x71z9979BzsPvUePn+zu7R88vTQq15QNqRJKXyfEMMFTNgQOgl1nmhGZCHaVzM7r+tWcacNV+gGKjMWS3KR8wikBS40OnMOjMEpkOa9GQXyE/yb9OommYwVmjX1j2RA04Sm2P7EgRdzzvON/Et0a3FYvWlLr5LbkWjWMgN1Cc6lSs3FVtlZVy13RvFFo4PQPbAaTSXmu3mXVWr4p+JYtMBXEGGaqpo/JbFq+ojTXhBYv8cmgdzqoRvsdv+c3gbdBsAIdtIoLa+NuNFY0lyyFRj4M/AzikmjgVLDKi3LDMkJn5IaFFqZEMhOXzcEqfGyZMZ4obb8UcMOuT5REGlPIxHZKAlOzWavJ/9XCHCanccnTLAeW0uWiSS4wKFy/BTzmmlEQhQWEam7PiumUWCfAvpjWlkS27lDKXADXatFmE6VmQBJTedbBYNOvbXDZ7wWDnv++3znrrrzcQc/Qc9RFATpBZ+g1ukBDRJ2Pzifns/PF/ep+d3+4P5etrrOaOUStcH/9BowyBWc=</latexit>

[v1 ] [v2 ] . . . [vM ] [train railway].


<latexit sha1_base64="w8KDLMXs6SaYywgazGIUXWr+c4c=">AAADKXicbVLLjtMwFHXCawgwL5ZsLNpKZVMlXXRGrGY0GzbAINGZkZqoclx3atWOI/umnSjKN7CFb+Br2AFbfgQnUyCZcqU4555777F9kjgV3IDv/3Dce/cfPHy089h78vTZ7t7+weGFUZmmbEyVUPoqJoYJnrAxcBDsKtWMyFiwy3h5VtUvV0wbrpKPkKcskuQ64XNOCVhqeuC4ve4kjGWxKqdB1MV/k2GVhIuZAtNg31p2AprwBNtFrEkeDTzvn0K/Ajflq5ZSk9xWbFS3lXvdcFUP1HDxB4bAbiCeF2fqfVo28vpdm1JoNiuLd2yNqSDGMFPWfUymi+KU0kwTmr/GR6PB8aic7nf8gV8H3gbBBnTQJs6ta7vhTNFMsgRq+UngpxAVRAOngpVemBmWErok12xiYUIkM1FRH6zEPcvM8Fxp+ySAa7Y5URBpTC5j2ykJLMzdWkX+rzbJYH4cFTxJM2AJvd1ongkMClefHs+4ZhREbgGhmtuzYrog1gmwP0hrl1i27lDITADXat1mY6WWQGJTetbB4K5f2+BiOAhGA//DsHPS33i5g16gl6iPAnSETtAbdI7GiDrc+eR8dr64X91v7nf3522r62xmnqNWuL9+AynP/UM=</latexit>

[v1 (x)] [v2 (x)] . . . [vM (x)] [train railway].


<latexit sha1_base64="J6UHTDMJ1GOsD18m3zfvpkS9P0s=">AAACS3icbVBLSgNBEO2J//hLdOmmMQiuwowLdRlw41LBmMA4hJpOj2nSn6G7xhCGnMDTuNUzeADP4U5c2IlZGPVBVz1eVVFdL82lcBiGb0FlaXlldW19o7q5tb2zW6vv3TpTWMbbzEhjuyk4LoXmbRQoeTe3HFQqeScdXkzrnQdunTD6Bsc5TxTca5EJBuilXu0ohoTG+cCg8dlkPkwFtCA09UGOYJw0e7VG2AxnoH9JNCcNMsdVrx7s3PUNKxTXyCQ4F0dhjkkJFgWTfFK9KxzPgQ3hnseealDcJeXsngk98kqfZsb6p5HO1J8TJSjnxir1nQpw4H7XpuJ/tbjA7Dwphc4L5Jp9L8oKSdHQqTm0LyxnKMeeALPC/5WyAVhg6C1c2JKqhRtKVUgU1owW1dSYIULqJlXvYPTbr7/k9qQZnTbD65NG63ju5To5IIfkmETkjLTIJbkibcLII3kiz+QleA3eg4/g87u1Esxn9skCKitf3saywA==</latexit>

[a] [photo] [of] [a] [train railway].


<latexit sha1_base64="spEv4ej9mg3LJ30AXIGKzolRdcY=">AAACPHicbZDLTgIxFIY7eENUBE3cuGkkJqzIDAYxrjBuXGIilwQI6ZQCDe100p7RkJGXcavP4Hu4d2fcurZcFgL+SZM//zknp+fzQ8ENuO6Hk9jY3NreSe6m9vYP0oeZ7FHdqEhTVqNKKN30iWGCB6wGHARrhpoR6QvW8Ee303rjkWnDVfAA45B1JBkEvM8pARt1MydtJsNhfENppAkdX+NyqXBRmnQzObfgzoTXjbcwObRQtZt10u2eopFkAVBBjGl5bgidmGjgVLBJqh0ZFhI6IgPWsjYgkplOPDtggs9t0sN9pe0LAM/SvxMxkcaMpW87JYGhWa1Nw/9qrQj6V52YB2EELKDzRf1IYFB4SgP3uGYUxNgaQjW3f8V0SCwJsMyWtvhy6YZYRgK4Vk/Lqa/UCIhvJilL0FvltW7qxYJ3WXDvi7lKfsEyiU7RGcojD5VRBd2hKqohip7RC3pFb8678+l8Od/z1oSzmDlGS3J+fgEPPa1w</latexit>

<latexit sha1_base64="oKJTGrxy+9QoHd9nSFUvs70YTHk=">AAADS3icjVLNbtNAEF67FEqA/sCRy4qkUrhEdg5pxamoFy5AkUhbybai9WZTr7LrtXbHSS3LT8DTcIVn4AF4Dm6IA2s3ICfhwEhrffPNzDfrTxtnghvwvO+Ou3Nv9/6DvYedR4+f7B8cHj29NCrXlI2pEkpfx8QwwVM2Bg6CXWeaERkLdhXPz+v61YJpw1X6EYqMRZLcpHzGKQFLTY6c3nEvCGNZLqqJH/Xw32RYJ2EyVWBa7FvLBiGwW2hWl5pNqxI04Sm2H7EkRRUNOp2WZr8Gt9XLNe02ub2jVf2vXeGiUWhg8gc2g/GsPFfvs6qVbwq+Y0tMBTGGmcr2hUxmSfma0lwTWrzCJ6PB6aiaHHa9gdcE3gb+CnTRKi6sr/vhVNFcshQa9cD3MohKooFTwapOmBuWETonNyywMCWSmahs7lXhY8tM8Uxpe1LADdueKIk0ppCx7ZQEErNZq8l/1YIcZqdRydMsB5bSu0WzXGBQuH4ceMo1oyAKCwjV3N4V04RYJ8A+obUtsVz7h1LmArhWy3U2VmoOJDZVxzrob/q1DS6HA3808D4Mu2f9lZd76Dl6gfrIRyfoDL1BF2iMqPPJ+ex8cb6639wf7k/3112r66xmnqG12Nn9Db2/DH0=</latexit>

<latexit sha1_base64="1srz8pVJrnh6c3fD9AduK7Gk5D0=">AAADS3icjVLBbtNAEF27FEqA0sKRy4qkUrhEdiRK21NRL1yAIpG2UmxF6/W6XmXXa+2Ok0aWv4Cv4QrfwAfwHdwQB9ZuQE7CgZHWevNm5s36aaNccAOe991xt+5s3723c7/z4OGj3cd7+08ujCo0ZSOqhNJXETFM8IyNgINgV7lmREaCXUbTs7p+OWPacJV9hEXOQkmuM55wSsBSk32nd9AbB5EsZ9XED3v4bzKskyCNFZgW+9ay4wDYDTSrS83iqgRNeIbtR8zJogoHnU5Ls1+Dm+rFinab3NzRqv7XrmDWKDQw/QObwSgpz9T7vGrl64Lv2BxTQYxhprJ9AZN5Wr6mtNCELk7w4cvB0XE12et6A68JvAn8JeiiZZxbX3eDWNFCsgwa9bHv5RCWRAOnglWdoDAsJ3RKrtnYwoxIZsKyuVeFDywT40RpezLADdueKIk0ZiEj2ykJpGa9VpP/qo0LSI7Ckmd5ASyjt4uSQmBQuH4cOOaaURALCwjV3N4V05RYJ8A+oZUtkVz5h1IWArhW81U2UmoKJDJVxzror/u1CS6GA/9w4H0Ydk/7Sy930DP0HPWRj16hU/QGnaMRos4n57PzxfnqfnN/uD/dX7etrrOceYpWYmv7N796DH4=</latexit>

Wind farm Accuracy: 75.35 Accuracy: 65.89 Accuracy: 76.86


<latexit sha1_base64="RkBZMghrkxbWYtXoTL4CTyYAvLU=">AAADNXicbVLLbtNAFB2bVzHQByzZjEgrhU1kZ5FWrIq6YQMUiTSVbCsaTybNKDMea+Y6qWX5S9jCN/AtLNghtvwCYzcgu+mVbJ97zn2MjybJBDfg+z8c9979Bw8f7Tz2njx9tru3f/D8wqhcUzamSih9mRDDBE/ZGDgIdplpRmQi2CRZntX6ZMW04Sr9DEXGYkmuUj7nlIClpgfO7tFhGCWyXFXTID7E/5NhnUSLmQLTYt9bNgRNeIrtS6xJEQ88rzWiX4Pr6nVnVJvcHtlS7xodrZqGBi7+wQjYNSTz8kx9zKpW3nwbV0rNZlX5ga0xFcQYZqqmjslsUb6lNNeEFm/w8WhwMrLChKczPCdaTvd7/sBvAm+DYAN6aBPntYPRTNFcshSaTWHgZxCXRAOnglVelBuWEbokVyy0MCWSmbhszljhI8vYxUrbJwXcsO2OkkhjCpnYSklgYW5rNXmXFuYwP4lLnmY5sJTeLJrnAoPC9TXAM64ZBVFYQKjm9qyYLog1Bexl6WxJZOcfSpkL4Fqtu2yi1BJIYirPOhjc9msbXAwHwWjgfxr2TvsbL3fQS/QK9VGAjtEpeofO0RhRJ3e+OF+db+5396f7y/19U+o6m54XqBPun79BNQFb</latexit>

Train railway
<latexit sha1_base64="8F2SgDUKU2CwjF7V4lnnVuQDSfM=">AAADOXicbVLLbtNAFB2bVzHQF0s2I9JKYRPZWaQVq6Ju2ABFatpKthWNJ5NmlBmPNXOd1LL8LWzhG/gSluwQW36AsRuQ3fRKts895z7GR5Nkghvw/R+O++Dho8dPtp56z56/2N7Z3du/MCrXlI2pEkpfJcQwwVM2Bg6CXWWaEZkIdpksTmv9csm04So9hyJjsSTXKZ9xSsBSkz1n//AgjBJZLqtJEB/g/8mwTqL5VIFpsR8sG4ImPMX2JVakiAee1xrRr8FN9aYzqk1ujmyp942Olk1DA+f/YATsBpJZeao+ZVUrb76NK6Vm06r8yFaYCmIMM1VTx2Q2L99RmmtCi7f4aDQ4HlnhvL13stvzB34TeBMEa9BD6zizLm5HU0VzyVJotoWBn0FcEg2cClZ5UW5YRuiCXLPQwpRIZuKyOWeFDy0zxTOl7ZMCbth2R0mkMYVMbKUkMDd3tZq8TwtzmB3HJU+zHFhKbxfNcoFB4foq4CnXjIIoLCBUc3tWTOfEGgP2wnS2JLLzD6XMBXCtVl02UWoBJDGVZx0M7vq1CS6Gg2A08D8Peyf9tZdb6BV6jfooQEfoBL1HZ2iMqFM4X5yvzjf3u/vT/eX+vi11nXXPS9QJ989fXHMDQg==</latexit>

(b) The instance-conditional prompts learned by CoCoOp are much more generalizable than CoOp to the unseen classes.

Figure 1. Motivation of our research: to learn generalizable prompts. The images are randomly selected from SUN397 [55], which is
a widely-used scene recognition dataset.

text words in a prompt into a set of learnable vectors, taking conditional prompts are more generalizable: they are op-
advantage of the differentiable nature of neural networks. timized to characterize each instance (more robust to class
With only a few labeled images for learning, CoOp achieves shift) rather than to serve only for some specific classes.
huge improvements over intensively-tuned manual prompts We present comprehensive experiments on 11 datasets
across a wide range of image recognition datasets. covering a diverse set of visual recognition tasks. Specifi-
In our study, we identify a critical problem of CoOp: the cally, we design a base-to-new generalization setting where
learned context is not generalizable to wider unseen classes a model is first learned using base classes and then tested
within the same task. Figure 1 illustrates the problem: the on completely new classes. Compared with the zero-shot
context learned by CoOp works well in distinguishing the method [40] and CoOp [62], our approach achieves the best
base classes like “arrival gate” and “cathedral” but suffers a overall performance (Table 1). Importantly, CoCoOp gains
significant drop in accuracy when it is transferred to the new significant improvements over CoOp in unseen classes (Fig-
(unseen) classes, such as “wind farm” and “train railway”— ure 3(a)), allowing the gap between manual and learning-
even though the task’s nature remains the same, i.e., recog- based prompts to be substantially reduced.
nizing scenes. The results suggest that the learned context In a more challenging scenario where the context learned
overfits the base classes, thus failing to capture more gen- for one task is directly transferred to other tasks with dras-
eralizable elements that are vital for broader scene recog- tically different classes, CoCoOp still beats CoOp with a
nition. We argue that such a problem is caused by CoOp’s clear margin (Table 2), suggesting that instance-conditional
static design: the context, which is fixed once learned, is prompts are more transferable and have the potential to suc-
optimized only for a specific set of (training) classes. On ceed at larger scale. CoCoOp also obtains stronger domain
the contrary, the manually-designed prompts adopted by the generalization performance than CoOp (Table 3), further
zero-shot method are relatively generalizable. justifying the strengths of dynamic prompts.
To address the weak generalizability problem, we intro- In summary, our research provides timely insights into
duce a novel concept: conditional prompt learning. The key the generalizability problem in prompt learning, and cru-
idea is to make a prompt conditioned on each input instance cially, demonstrates the effectiveness of a simple idea in
(image) rather than fixed once learned. To make the model various problem scenarios. We hope our approach and the
parameter-efficient, we introduce a simple yet effective im- findings presented in this work can pave the way for future
plementation of conditional prompt learning. Specifically, research in generalizable—and transferable—prompt learn-
we extend CoOp by further learning a lightweight neural ing.
network to generate for each image an input-conditional to-
ken (vector), which is combined with the learnable con- 2. Related Work
text vectors. We call our approach Conditional Context
Vision-Language Models We mainly review studies fo-
Optimization (CoCoOp).2 An overview is shown in Fig-
cused on aligning images and texts to learn a joint embed-
ure 2. Interestingly, the paradigm of CoCoOp is analogous
ding space [24, 40, 59]. The idea of cross-modality align-
to image captioning [49], which explains why instance-
ment is certainly not new and has been investigated since
2 Pronounced as /k@U­ku:p/. nearly a decade ago—though with dramatically different

16817
Conditional Context Optimization (CoCoOp)

technologies than today. context tokens

A typical vision-language model consists of three key el- <latexit sha1_base64="HOGtH0eXP5R5+EEu2fQHZwuL4xw=">AAACF3icbVA9TwJBEN3DL8Qv1NJmI5hYkTsKtSSxscREPiJcyN6ywIbdvcvuHIZc7l/YWOhfsTO2lv4TSxe4QsCXTPLy3kxm5gWR4AZc99vJbWxube/kdwt7+weHR8Xjk6YJY01Zg4Yi1O2AGCa4Yg3gIFg70ozIQLBWML6d+a0J04aH6gGmEfMlGSo+4JSAlR7L3UAmk7TnlXvFkltx58DrxMtICWWo94o/3X5IY8kUUEGM6XhuBH5CNHAqWFroxoZFhI7JkHUsVUQy4yfzi1N8YZU+HoTalgI8V/9OJEQaM5WB7ZQERmbVm4n/eZ0YBjd+wlUUA1N0sWgQCwwhnr2P+1wzCmJqCaGa21sxHRFNKNiQlrYEcumHRMYCuA6f0oKNylsNZp00qxXvquLeV0u1ahZaHp2hc3SJPHSNaugO1VEDUaTQM3pFb86L8+58OJ+L1pyTzZyiJThfv5cLoGY=</latexit>

v1 v2 <latexit sha1_base64="/ObT7ntJtfRdwD1X3OgKDxBQKY0=">AAACF3icbVC7TgJBFJ3FF+ILtbSZCCZWZHcLtSSxscREHhE2ZHaYhQnz2MzMYshm/8LGQn/Fztha+ieWDrCFgCe5yck59+bee8KYUW1c99spbGxube8Ud0t7+weHR+Xjk5aWicKkiSWTqhMiTRgVpGmoYaQTK4J4yEg7HN/O/PaEKE2leDDTmAQcDQWNKEbGSo/VXsjTSdb3q/1yxa25c8B14uWkAnI0+uWf3kDihBNhMENadz03NkGKlKGYkazUSzSJER6jIelaKhAnOkjnF2fwwioDGEllSxg4V/9OpIhrPeWh7eTIjPSqNxP/87qJiW6ClIo4MUTgxaIoYdBIOHsfDqgi2LCpJQgram+FeIQUwsaGtLQl5Es/pDxhhir5lJVsVN5qMOuk5de8q5p771fqfh5aEZyBc3AJPHAN6uAONEATYCDAM3gFb86L8+58OJ+L1oKTz5yCJThfv5i2oGc=</latexit>

...
<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>
<latexit sha1_base64="t4hsqUX1Tf4ub/K3HNBJTidAYC8=">AAACF3icbVA9TwJBEJ3zE/ELtbTZCCZW5I5CLUlsbEwwkY8IF7K37MGG3b3L7h6GXPgXNhb6V+yMraX/xNIFrhDwJZO8vDeTmXlBzJk2rvvtrK1vbG5t53byu3v7B4eFo+OGjhJFaJ1EPFKtAGvKmaR1wwynrVhRLAJOm8HwZuo3R1RpFskHM46pL3BfspARbKz0WOoEIh1NunelbqHolt0Z0CrxMlKEDLVu4afTi0giqDSEY63bnhsbP8XKMMLpJN9JNI0xGeI+bVsqsaDaT2cXT9C5VXoojJQtadBM/TuRYqH1WAS2U2Az0MveVPzPaycmvPZTJuPEUEnmi8KEIxOh6fuoxxQlho8twUQxeysiA6wwMTakhS2BWPghFQk3TEVPk7yNylsOZpU0KmXvsuzeV4rVShZaDk7hDC7Agyuowi3UoA4EJDzDK7w5L8678+F8zlvXnGzmBBbgfP0Cxb+ggg==</latexit>

vM [CLASS] . Text Encoder

ements: two for image and text encoding while the third is + + ... +
<latexit sha1_base64="MdQQkHPETjye9ifJOXGRj5eF6M4=">AAACF3icbVC7SgNBFJ31GeMrammzGARBCLsp1DJgYxnBPDBZwuxkNhkyj2XmrhqW/QsbC/0VO7G19E8snSRbmMQDFw7n3Mu994QxZwY879tZWV1b39gsbBW3d3b39ksHh02jEk1ogyiudDvEhnImaQMYcNqONcUi5LQVjq4nfuuBasOUvINxTAOBB5JFjGCw0n0X6BOEUXqe9Uplr+JN4S4TPydllKPeK/10+4okgkogHBvT8b0YghRrYITTrNhNDI0xGeEB7VgqsaAmSKcXZ+6pVfpupLQtCe5U/TuRYmHMWIS2U2AYmkVvIv7ndRKIroKUyTgBKslsUZRwF5Q7ed/tM00J8LElmGhmb3XJEGtMwIY0tyUUcz+kIuHAtHrMijYqfzGYZdKsVvyLindbLdeqeWgFdIxO0Bny0SWqoRtURw1EkETP6BW9OS/Ou/PhfM5aV5x85gjNwfn6BZuRoQE=</latexit> <latexit sha1_base64="MdQQkHPETjye9ifJOXGRj5eF6M4=">AAACF3icbVC7SgNBFJ31GeMrammzGARBCLsp1DJgYxnBPDBZwuxkNhkyj2XmrhqW/QsbC/0VO7G19E8snSRbmMQDFw7n3Mu994QxZwY879tZWV1b39gsbBW3d3b39ksHh02jEk1ogyiudDvEhnImaQMYcNqONcUi5LQVjq4nfuuBasOUvINxTAOBB5JFjGCw0n0X6BOEUXqe9Uplr+JN4S4TPydllKPeK/10+4okgkogHBvT8b0YghRrYITTrNhNDI0xGeEB7VgqsaAmSKcXZ+6pVfpupLQtCe5U/TuRYmHMWIS2U2AYmkVvIv7ndRKIroKUyTgBKslsUZRwF5Q7ed/tM00J8LElmGhmb3XJEGtMwIY0tyUUcz+kIuHAtHrMijYqfzGYZdKsVvyLindbLdeqeWgFdIxO0Bny0SWqoRtURw1EkETP6BW9OS/Ou/PhfM5aV5x85gjNwfn6BZuRoQE=</latexit>

<latexit sha1_base64="MdQQkHPETjye9ifJOXGRj5eF6M4=">AAACF3icbVC7SgNBFJ31GeMrammzGARBCLsp1DJgYxnBPDBZwuxkNhkyj2XmrhqW/QsbC/0VO7G19E8snSRbmMQDFw7n3Mu994QxZwY879tZWV1b39gsbBW3d3b39ksHh02jEk1ogyiudDvEhnImaQMYcNqONcUi5LQVjq4nfuuBasOUvINxTAOBB5JFjGCw0n0X6BOEUXqe9Uplr+JN4S4TPydllKPeK/10+4okgkogHBvT8b0YghRrYITTrNhNDI0xGeEB7VgqsaAmSKcXZ+6pVfpupLQtCe5U/TuRYmHMWIS2U2AYmkVvIv7ndRKIroKUyTgBKslsUZRwF5Q7ed/tM00J8LElmGhmb3XJEGtMwIY0tyUUcz+kIuHAtHrMijYqfzGYZdKsVvyLindbLdeqeWgFdIxO0Bny0SWqoRtURw1EkETP6BW9OS/Ou/PhfM5aV5x85gjNwfn6BZuRoQE=</latexit>

<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

...
<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

⇡ ⇡ ... ⇡
related to the design of loss functions. In early days, mod-
<latexit sha1_base64="4KzQHU+3d8F0bpLXiRHf1+GMPSk=">AAACF3icbVA9T8MwEL3wWcpXgZHFokViqpIOwFiJhbFI9EM0UeW4TmvVdiLbAVVR/wULA/wVNsTKyD9hxG0z0JYnnfT03p3u7oUJZ9q47reztr6xubVd2Cnu7u0fHJaOjls6ThWhTRLzWHVCrClnkjYNM5x2EkWxCDlth6Obqd9+pEqzWN6bcUIDgQeSRYxgY6WHih+KzE/YpNIrld2qOwNaJV5OypCj0Sv9+P2YpIJKQzjWuuu5iQkyrAwjnE6KfqppgskID2jXUokF1UE2u3iCzq3SR1GsbEmDZurfiQwLrccitJ0Cm6Fe9qbif143NdF1kDGZpIZKMl8UpRyZGE3fR32mKDF8bAkmitlbERlihYmxIS1sCcXCD5lIuWEqfpoUbVTecjCrpFWrepdV965Wrtfy0ApwCmdwAR5cQR1uoQFNICDhGV7hzXlx3p0P53PeuubkMyewAOfrF+USoJU=</latexit> <latexit sha1_base64="4KzQHU+3d8F0bpLXiRHf1+GMPSk=">AAACF3icbVA9T8MwEL3wWcpXgZHFokViqpIOwFiJhbFI9EM0UeW4TmvVdiLbAVVR/wULA/wVNsTKyD9hxG0z0JYnnfT03p3u7oUJZ9q47reztr6xubVd2Cnu7u0fHJaOjls6ThWhTRLzWHVCrClnkjYNM5x2EkWxCDlth6Obqd9+pEqzWN6bcUIDgQeSRYxgY6WHih+KzE/YpNIrld2qOwNaJV5OypCj0Sv9+P2YpIJKQzjWuuu5iQkyrAwjnE6KfqppgskID2jXUokF1UE2u3iCzq3SR1GsbEmDZurfiQwLrccitJ0Cm6Fe9qbif143NdF1kDGZpIZKMl8UpRyZGE3fR32mKDF8bAkmitlbERlihYmxIS1sCcXCD5lIuWEqfpoUbVTecjCrpFWrepdV965Wrtfy0ApwCmdwAR5cQR1uoQFNICDhGV7hzXlx3p0P53PeuubkMyewAOfrF+USoJU=</latexit>

<latexit sha1_base64="4KzQHU+3d8F0bpLXiRHf1+GMPSk=">AAACF3icbVA9T8MwEL3wWcpXgZHFokViqpIOwFiJhbFI9EM0UeW4TmvVdiLbAVVR/wULA/wVNsTKyD9hxG0z0JYnnfT03p3u7oUJZ9q47reztr6xubVd2Cnu7u0fHJaOjls6ThWhTRLzWHVCrClnkjYNM5x2EkWxCDlth6Obqd9+pEqzWN6bcUIDgQeSRYxgY6WHih+KzE/YpNIrld2qOwNaJV5OypCj0Sv9+P2YpIJKQzjWuuu5iQkyrAwjnE6KfqppgskID2jXUokF1UE2u3iCzq3SR1GsbEmDZurfiQwLrccitJ0Cm6Fe9qbif143NdF1kDGZpIZKMl8UpRyZGE3fR32mKDF8bAkmitlbERlihYmxIS1sCcXCD5lIuWEqfpoUbVTecjCrpFWrepdV965Wrtfy0ApwCmdwAR5cQR1uoQFNICDhGV7hzXlx3p0P53PeuubkMyewAOfrF+USoJU=</latexit>

<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

els for processing images and texts are often designed and ...
<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

...
<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

also learned independently, with their outputs connected by meta token


<latexit sha1_base64="4KzQHU+3d8F0bpLXiRHf1+GMPSk=">AAACF3icbVA9T8MwEL3wWcpXgZHFokViqpIOwFiJhbFI9EM0UeW4TmvVdiLbAVVR/wULA/wVNsTKyD9hxG0z0JYnnfT03p3u7oUJZ9q47reztr6xubVd2Cnu7u0fHJaOjls6ThWhTRLzWHVCrClnkjYNM5x2EkWxCDlth6Obqd9+pEqzWN6bcUIDgQeSRYxgY6WHih+KzE/YpNIrld2qOwNaJV5OypCj0Sv9+P2YpIJKQzjWuuu5iQkyrAwjnE6KfqppgskID2jXUokF1UE2u3iCzq3SR1GsbEmDZurfiQwLrccitJ0Cm6Fe9qbif143NdF1kDGZpIZKMl8UpRyZGE3fR32mKDF8bAkmitlbERlihYmxIS1sCcXCD5lIuWEqfpoUbVTecjCrpFWrepdV965Wrtfy0ApwCmdwAR5cQR1uoQFNICDhGV7hzXlx3p0P53PeuubkMyewAOfrF+USoJU=</latexit>


Image Encoder ...
<latexit sha1_base64="i4zpmuRm7l6oli4lXwkT2OQ8Jsk=">AAACFXicbVC7TgJBFJ31ifhCLW02ggkV2aVQSxIbS0zkkcCGzM7OwoR5bGbuasiGn7Cx0F+xM7bW/omlA2wh4ElucnLOvbn3njDhzIDnfTsbm1vbO7uFveL+weHRcenktG1UqgltEcWV7obYUM4kbQEDTruJpliEnHbC8e3M7zxSbZiSDzBJaCDwULKYEQxW6lb6o0iBqQxKZa/mzeGuEz8nZZSjOSj99CNFUkElEI6N6fleAkGGNTDC6bTYTw1NMBnjIe1ZKrGgJsjm907dS6tEbqy0LQnuXP07kWFhzESEtlNgGJlVbyb+5/VSiG+CjMkkBSrJYlGccheUO3vejZimBPjEEkw0s7e6ZIQ1JmAjWtoSiqUfMpFyYFo9TYs2Kn81mHXSrtf8q5p3Xy83qnloBXSOLlAV+egaNdAdaqIWIoijZ/SK3pwX5935cD4XrRtOPnOGluB8/QImmZ+d</latexit>

extra modules (losses) for alignment. Images are often en-


coded using hand-crafted descriptors [10, 45] or neural net- Meta-Net

works [12, 29], while texts are encoded using, for instance,
pre-trained word vectors [12, 45] or the frequency-based
TF-IDF features [10, 29]. In terms of cross-modality align- Figure 2. Our approach, Conditional Context Optimization (Co-
ment, common approaches include metric learning [12], CoOp), consists of two learnable components: a set of context
multi-label classification [16, 26], and n-gram language vectors and a lightweight neural network (Meta-Net) that gener-
learning [31]. Recently, a study suggests that training the ates for each image an input-conditional token.
vision part with an image captioning loss can make the vi-
sual representation more transferable [7].
Recent vision-language models [13,24,33,40] bridge the where the main idea is to turn a prompt into a set of contin-
two modalities by learning two encoders jointly. Also, the uous vectors that can be end-to-end optimized with respect
models are now built with much larger neural networks. As to an objective function. See Liu et al. [34] for a more com-
discussed in Zhou et al. [62], recent successes in vision- prehensive survey.
language models are mainly attributed to the developments In computer vision, prompt learning is a nascent research
in i) Transformers [48], ii) contrastive representation learn- direction that has only been explored very recently [27, 42,
ing [4, 17, 20], and iii) web-scale training datasets [24, 40]. 56,58,62]. Our research is built on top of CoOp [62], which
A representative approach is CLIP [40], which trains two is the earliest work to bring continuous prompt learning
neural network-based encoders using a contrastive loss to to the vision domain for adaptation of pre-trained vision-
match pairs of images and texts. After consuming 400 mil- language models. Crucially, our approach solves the weak
lion data pairs, the CLIP model demonstrates a remark- generalizability problem of CoOp [62], based on a simple
able zero-shot image recognition capability. Similar to idea of conditional prompt learning—which to our knowl-
CoOp [62], our approach is orthogonal to the research of edge is also novel in the context of NLP and thus could be
CLIP-like models [13,24,33,40], aiming to offer an efficient of interest to the NLP community as well.
solution for adapting pre-trained vision-language models to
downstream applications.
Zero-Shot Learning (ZSL) is another relevant research
Prompt Learning This topic originates from the NLP
area where the goal is similar to ours, i.e., to recognize novel
domain. The motivation was to view pre-trained language
classes by training only on base classes [3,51,54,57]. More-
models, such as BERT [8] or GPT [41], as knowledge
over, the generalization problem where a model trained on
bases from which information useful to downstream tasks
base classes often fails on novel classes is also linked to
is elicited [39]. Concretely, given a pre-trained language
the “seen-class bias” issue raised in the ZSL literature [54].
model, the task is often formulated as a “fill-in-the-blank”
The most common approach to ZSL is to learn a semantic
cloze test, such as asking the model to predict the masked
space based on auxiliary information such as attributes [23]
token in “No reason to watch. It was [MASK]” as either
or word embeddings [12, 52]. Different from existing ZSL
“positive” or “negative” for sentiment classification. The
methods, our work addresses the emerging problem of
key lies in how to design the underlined part, known as
adapting large vision-language models and uses drastically
prompt (template), in such a format familiar to the model.
different techniques based on prompting.
Instead of manually designing a prompt, research in
prompt learning aims to automate the process with the help
of affordable-sized labeled data. Jiang et al. [25] use text 3. Methodology
mining and paraphrasing to generate a group of candidate
prompts, within which the optimal ones are chosen to have An overview of our approach is shown in Figure 2. Be-
the highest training accuracy. Shin et al. [44] propose Au- low we first provide brief reviews on CLIP [40], which is
toPrompt, a gradient-based approach that selects from a vo- the base model used in this paper, and CoOp [62]. Then, we
cabulary the best tokens that cause the greatest changes in present the technical details of our approach as well as the
gradients based on the label likelihood. Our research is most rationale behind the design. Same as CoOp, our approach
related to continuous prompt learning methods [30, 32, 60], is applicable to broader CLIP-like vision-language models.

16818
3.1. Reviews of CLIP and CoOp ents can be propagated all the way back to update the con-
text vectors. Note that the base model of CLIP is frozen in
Contrastive Language-Image Pre-training known as
the entire training process (ours too).
CLIP [41], has well demonstrated the potential of learning
open-set visual concepts. CLIP is built using two encoders, 3.2. CoCoOp: Conditional Context Optimization
one for image and the other for text, as shown in Figure 2.
The image encoder can be either a ResNet [18] or a ViT [9], CoOp is a data-efficient approach allowing the context
which is used to transform an image into a feature vector. vectors to be trained with only a few labeled images in a
The text encoder is a Transformer [48], which takes as input downstream dataset. However, as discussed CoOp is not
a sequence of word tokens and again produces a vectorized generalizable to wider unseen classes within the same task.
representation. We argue that instance-conditional context can generalize
During training, CLIP adopts a contrastive loss to learn better because it shifts the focus away from a specific set of
a joint embedding space for the two modalities. Specifi- classes—for reducing overfitting—to each input instance,
cally, for a mini-batch of image-text pairs, CLIP maximizes and hence to the entire task.
for each image the cosine similarity with the matched text A straightforward way to implement CoCoOp is to build
while minimizes the cosine similarities with all other un- M neural networks to get M context tokens. However, such
matched texts, and the loss is computed in a similar fashion a design would require M × the size of a neural network,
for each text too. After training, CLIP can be used for zero- which is much larger than having M context vectors as in
shot image recognition. Let x be image features generated CoOp. Here we propose a parameter-efficient design that
by the image encoder and {wi }K works very well in practice. Specifically, on top of the M
i=1 a set of weight vectors
produced by the text encoder, each representing a category context vectors, we further learn a lightweight neural net-
(suppose there are K categories in total). In particular, each work, called Meta-Net, to generate for each input a condi-
wi is derived from a prompt, such as “a photo of a {class}” tional token (vector), which is then combined with the con-
where the “{class}” token is filled with the i-th class name. text vectors. See Figure 2 for a sketch of the architecture.
The prediction probability is then Let hθ (·) denote the Meta-Net parameterized by θ, each
context token is now obtained by vm (x) = vm + π where
π = hθ (x) and m ∈ {1, 2, ..., M }. The prompt for the
\label {eq:pred_zs} p(y | \bm {x}) = \frac {\exp (\operatorname {sim} (\bm {x}, \bm {w}_y) / \tau )}{\sum _{i=1}^K \exp (\operatorname {sim} (\bm {x}, \bm {w}_i) / \tau )}, (1) i-th class is thus conditioned on the input, i.e., ti (x) =
{v1 (x), v2 (x), . . . , vM (x), ci }. The prediction probability
where sim(·, ·) denotes cosine similarity and τ is a learned is computed as
temperature parameter.
Context Optimization (CoOp) aims to overcome the \label {eq:pred_cocoop} p(y | \bm {x}) = \frac {\exp (\operatorname {sim} (\bm {x}, g(\bm {t}_y (\bm {x}))) / \tau )}{\sum _{i=1}^K \exp (\operatorname {sim} (\bm {x}, g(\bm {t}_i (\bm {x})) / \tau )}. (3)
inefficiency problem in prompt engineering for better adapt-
ing pre-trained vision-language models to downstream ap- During training, we update the context vectors {vm }M
m=1
plications [62]. The key idea in CoOp is to model each con- together with the Meta-Net’s parameters θ. In this work,
text token using a continuous vector that can be end-to-end the Meta-Net is built with a two-layer bottleneck structure
learned from data. Concretely, instead of using “a photo (Linear-ReLU-Linear), with the hidden layer reducing the
of a” as the context, CoOp introduces M learnable context input dimension by 16×. The input to the Meta-Net is sim-
vectors, {v1 , v2 , . . . , vM }, each having the same dimension ply the output features produced by the image encoder. We
with the word embeddings. The prompt for the i-th class, leave exploration of more advanced designs for future work.
denoted by ti , now becomes ti = {v1 , v2 , . . . , vM , ci }
where ci is the word embedding(s) for the class name. The 4. Experiments
context vectors are shared among all classes.3 Let g(·) de-
note the text encoder, the prediction probability is then Our approach is mainly evaluated in the following three
problem settings: 1) generalization from base to new classes
\label {eq:pred_coop} p(y | \bm {x}) = \frac {\exp (\operatorname {sim} (\bm {x}, g(\bm {t}_y)) / \tau )}{\sum _{i=1}^K \exp (\operatorname {sim} (\bm {x}, g(\bm {t}_i) / \tau )}. (2) within a dataset (Section 4.1); 2) cross-dataset transfer (Sec-
tion 4.2); 3) domain generalization (Section 4.3). All mod-
els used in our experiments are based on the open-source
To adapt CLIP to a downstream image recognition CLIP [40].4 Before discussing the results, we provide the
dataset, a cross-entropy loss can be used as the learning ob- details of the experimental setup below.
jective. Since the text encoder g(·) is differentiable, gradi-
Datasets For the first two settings, i.e., base-to-new gen-
3 CoOp has an alternative version that learns class-specific context, eralization and cross-dataset transfer, we use the 11 image
which is not considered here because it is not straightforward to transfer
class-specific context to unseen classes. 4 https://github.com/openai/CLIP.

16819
Table 1. Comparison of CLIP, CoOp and CoCoOp in the base-to-new generalization setting. For learning-based methods (CoOp and
CoCoOp), their prompts are learned from the base classes (16 shots). The results strongly justify the strong generalizability of conditional
prompt learning. H: Harmonic mean (to highlight the generalization trade-off [54]).

(a) Average over 11 datasets. (b) ImageNet. (c) Caltech101.

Base New H Base New H Base New H


CLIP 69.34 74.22 71.70 CLIP 72.43 68.14 70.22 CLIP 96.84 94.00 95.40
CoOp 82.69 63.22 71.66 CoOp 76.47 67.88 71.92 CoOp 98.00 89.81 93.73
CoCoOp 80.47 71.69 75.83 CoCoOp 75.98 70.43 73.10 CoCoOp 97.96 93.81 95.84

(d) OxfordPets. (e) StanfordCars. (f) Flowers102.

Base New H Base New H Base New H


CLIP 91.17 97.26 94.12 CLIP 63.37 74.89 68.65 CLIP 72.08 77.80 74.83
CoOp 93.67 95.29 94.47 CoOp 78.12 60.40 68.13 CoOp 97.60 59.67 74.06
CoCoOp 95.20 97.69 96.43 CoCoOp 70.49 73.59 72.01 CoCoOp 94.87 71.75 81.71

(g) Food101. (h) FGVCAircraft. (i) SUN397.

Base New H Base New H Base New H


CLIP 90.10 91.22 90.66 CLIP 27.19 36.29 31.09 CLIP 69.36 75.35 72.23
CoOp 88.33 82.26 85.19 CoOp 40.44 22.30 28.75 CoOp 80.60 65.89 72.51
CoCoOp 90.70 91.29 90.99 CoCoOp 33.41 23.71 27.74 CoCoOp 79.74 76.86 78.27

(j) DTD. (k) EuroSAT. (l) UCF101.

Base New H Base New H Base New H


CLIP 53.24 59.90 56.37 CLIP 56.48 64.05 60.03 CLIP 70.53 77.50 73.85
CoOp 79.44 41.18 54.24 CoOp 92.19 54.74 68.69 CoOp 84.69 56.05 67.46
CoCoOp 77.01 56.00 64.85 CoCoOp 87.49 60.04 71.21 CoCoOp 82.33 73.45 77.64

recognition datasets as in Zhou et al. [62], which cover a di- prompts. It is worth mentioning that the manual prompt
verse set of recognition tasks. Specifically, the benchmark for each dataset was intensively tuned using all classes in
includes ImageNet [6] and Caltech101 [11] for classifica- the test data [40].
tion on generic objects; OxfordPets [38], StanfordCars [28],
Flowers102 [36], Food101 [2] and FGVCAircraft [35] for Training Details Our implementation is based on
fine-grained classification; SUN397 [55] for scene recogni- CoOp’s code.5 Throughout the experiments, we use the best
tion; UCF101 [46] for action recognition; DTD [5] for tex- available vision backbone in CLIP, i.e., ViT-B/16. Zhou et
ture classification; and finally EuroSAT [19] for satellite im- al. [62] have suggested that a shorter context length and
agery recognition. For domain generalization experiments, a good initialization can lead to better performance and
we use ImageNet as the source dataset and four other vari- stronger robustness to domain shift. Therefore, we fix the
ants of ImageNet that contain different types of domain shift context length to 4 and initialize the context vectors using
as the target datasets, namely ImageNetV2 [43], ImageNet- the pre-trained word embeddings of “a photo of a” for both
Sketch [50], ImageNet-A [22] and ImageNet-R [21]. CoOp and CoCoOp. Due to the instance-conditional de-
Following Zhou et al. [62], we randomly sample for each sign, our approach is slow to train and consumes much more
dataset a few-shot training set while using the original test GPU memory than CoOp. Therefore, to ensure the model
set for testing. We only evaluate the highest shot number can fit into a GPU and meanwhile reduce the training time,
studied in Zhou et al. [62], i.e., 16 shots, which is sufficient we train CoCoOp with batch size of 1 for 10 epochs. Such
to justify our approach. For learning-based models, the re- a limitation is discussed in more detail in Section 5.
sults are averaged over three runs.
4.1. Generalization From Base to New Classes
Baselines The direct rival to our approach is CoOp [62], Solving the weak generalizability problem of CoOp is
which essentially learns static prompts (in comparison the main focus in this research. On each of the 11 datasets,
to our dynamic prompts). The zero-shot method, i.e.,
CLIP [40] is also compared, which is based on manual 5 https://github.com/KaiyangZhou/CoOp.

16820
(a) (b)

Figure 3. Comprehensive comparisons of CoCoOp and CoOp in the base-to-new generalization setting. (a) CoCoOp is able to gain
consistent improvements over CoOp in unseen classes on all datasets. (b) CoCoOp’s declines in base accuracy are mostly under 3%, which
are far outweighed by the gains in generalization.

we split the classes equally into two groups, one as base CoCoOp’s Gains in Generalization Far Outweigh
classes and the other as new classes. Learning-based mod- Losses in Base Accuracy In comparison to CoOp, per-
els, i.e., CoOp and CoCoOp, are trained using only the base formance drops in the base classes occur for CoCoOp on
classes while evaluation is conducted on the base and new most datasets (see Figure 3(b)). This is reasonable because
classes separately to test generalizability. The detailed re- CoOp optimizes specifically for base classes whereas Co-
sults are shown in Table 1. CoOp optimizes for each instance in order to gain more
generalization over an entire task. But it is worth noting
Failures of CoOp in Unseen Classes The split does not
that on the 9 datasets where CoCoOp’s base accuracy drops
guarantee that the two class groups are equally difficult, as
below CoOp’s, most losses are under 3% (precisely on 6
evidenced in CLIP’s bumpy results: the base and new ac-
out of 9 datasets), which are far outweighed by the gains in
curacy numbers are dramatically different.6 Nonetheless,
unseen classes shown in Figure 3(a); even for those where
CoOp’s new accuracy is consistently much weaker than the
CoCoOp suffers the biggest losses, the boosts in generaliza-
base accuracy on nearly all datasets, leaving a huge gap of
tion are mostly significant enough to turn the averages into
almost 20% on average (82.69% vs 63.22%). Despite main-
positives, e.g., StanfordCars sees the worst base accuracy
taining an advantage over CLIP in terms of average perfor-
drop of -7.63% but has the third-highest accuracy gain of
mance, CoOp’s gains in the base classes are nearly zeroed
+13.19% in the new classes, which together bring a 5.56%
out by the catastrophic failures in the new classes, highlight-
positive improvement for CoCoOp.
ing the need to improve generalizability for learning-based
prompts. CoCoOp Is More Compelling Than CLIP When tak-
ing into account both the base and new classes, CoCoOp
CoCoOp Significantly Narrows Generalization Gap shows a gain of more than 4% over CLIP (75.83% vs
As shown in Table 1(a), CoCoOp improves the accuracy 71.70), suggesting that instance-conditional prompts have
in unseen classes from 63.22% to 71.69%, which largely a better potential in capturing more generalizable elements
reduces the gap with manual prompts. The results confirm that are relevant for a recognition task. Theoretically,
that instance-conditional prompts are more generalizable. learning-based prompts have a much higher risk of overfit-
A more detailed breakdown of per-dataset improvement is ting base classes than manual prompts. Therefore, CLIP is a
visualized in Figure 3(a) where we observe more than 10% strong competitor to beat in unseen classes. Different from
increases in accuracy on 5 out of 11 datasets. Notably, on CoOp, we obtain promising results for CoCoOp: the new
the challenging ImageNet dataset, CoCoOp’s surge from accuracy is even better than CLIP’s on 4 out of 11 datasets
67.88% to 70.43% represents a non-trivial progress (the (i.e., ImageNet, OxfordPets, Food101 and SUN397) and not
70.43% accuracy even surpasses CLIP’s 68.14%). too far away from CLIP’s on the rest except FGVCAircraft
6 For convenience, we refer to base accuracy as the performance in base where the gap between manual and learning-based prompts
classes; and similarly for new accuracy. is generally large. In the ablation study on context length,

16821
Table 2. Comparison of prompt learning methods in the cross-dataset transfer setting. Prompts applied to the 10 target datasets are
learned from ImageNet (16 images per class). Clearly, CoCoOp demonstrates better transferability than CoOp. ∆ denotes CoCoOp’s gain
over CoOp.

Source Target

FGVCAircraft
StanfordCars

Flowers102
Caltech101

OxfordPets
ImageNet

EuroSAT
Food101

SUN397

UCF101

Average
DTD
CoOp [62] 71.51 93.70 89.14 64.51 68.71 85.30 18.47 64.15 41.92 46.39 66.55 63.88
CoCoOp 71.02 94.43 90.14 65.32 71.88 86.06 22.94 67.36 45.73 45.37 68.21 65.74
∆ -0.49 +0.73 +1.00 +0.81 +3.17 +0.76 +4.47 +3.21 +3.81 -1.02 +1.66 +1.86

Table 3. Comparison of manual and learning-based prompts in domain generalization. CoOp and CoCoOp use as training data 16
images from each of the 1,000 classes on ImageNet. In general, CoCoOp is more domain-generalizable than CoOp.

Source Target
Learnable? ImageNet ImageNetV2 ImageNet-Sketch ImageNet-A ImageNet-R
CLIP [40] 66.73 60.83 46.15 47.77 73.96
CoOp [62] ✓ 71.51 64.20 47.99 49.71 75.21
CoCoOp ✓ 71.02 64.07 48.75 50.63 76.18

we find that FGVCAircraft benefits from longer context, is much lower, such as FGVCAircraft and DTD (contain-
which is aligned with the findings in Zhou et al. [62]. ing various textures) where the accuracy numbers are well
To close or even overturn the gaps between manual and below 50%. Nonetheless, CoCoOp exhibits much stronger
learning-based prompts in unseen classes, more efforts are transferability than CoOp on the two mentioned datasets as
required and we hope the insights presented in this research well as on most other fine-grained or specialized datasets.
can help the community tackle the generalizability issue in
prompt learning. 4.3. Domain Generalization
Generalization to out-of-distribution data is a capability
4.2. Cross-Dataset Transfer essential for machine learning models to succeed in prac-
Having demonstrated CoCoOp’s generalizability within tical applications [47, 61]. Zhou et al. [62] have revealed
a dataset, we further show that CoCoOp has the potential to that their learnable prompts are more robust than manual
transfer beyond a single dataset, which is a much more chal- prompts to domain shift. We are also interested to know if
lenging problem because the fundamentals can be totally instance-conditional prompts still maintain the advantages
changed across different datasets (e.g., from object recog- as in previous experiments.
nition to texture classification). We only consider prompt Following Zhou et al. [62], we evaluate CoCoOp’s do-
learning methods in this setting. main generalization performance by transferring the con-
We compare CoCoOp with CoOp by transferring con- text learned from ImageNet to the four specially designed
text learned from ImageNet, with all 1,000 classes used, to benchmarks. We also include the comparison with CLIP.
each of the other 10 datasets. The results are detailed in Table 3 shows the results. Both prompt learning methods
Table 2. On the source dataset, the two models perform clearly beat CLIP on all target datasets. Compared to CoOp,
similarly. Whereas on the target datasets, CoCoOp mostly CoCoOp performs slightly worse on ImageNetV2 but bet-
outperforms CoOp by a clear margin. Since the ImageNet ter on the other three. The results confirm that instance-
classes mainly contain objects, as well as a fair amount of conditional prompts are more domain-generalizable.
dog breeds, it is reasonable to see high accuracy for both
models on the relevant target datasets including Caltech101
4.4. Further Analysis
and OxfordPets. Class-Incremental Test We consider a practical prob-
By comparison, the performance on other datasets with lem scenario where the recognition targets originally com-
distant—and more fine-grained or specialized—categories posed of base classes are expanded to include completely

16822
Table 4. Recognition accuracy (average over 11 datasets) on a might question if the improvements simply come from an
combination of base and new classes. The learnable models only increased learning capacity. To clear the doubt, we remove
have access to training data from base classes. the Meta-Net part and increase the number of context tokens
in CoOp to the maximum such that CoOp’s and CoCoOp’s
Learnable? Accuracy sizes are similar. The results in Table 5 show that increasing
CLIP [40] 65.22 the parameter size is not the key.
CoOp [62] ✓ 65.55
CoCoOp ✓ 69.13 5. Limitations
The first limitation is about training efficiency: CoCoOp
is slow to train and would consume a significant amount
of GPU memory if the batch size is set larger than one.
The reason is because CoCoOp is based on an instance-
conditional design that requires for each image an indepen-
dent forward pass of instance-specific prompts through the
text encoder. This is much less efficient than CoOp that only
needs a single forward pass of prompts through the text en-
coder for an entire mini-batch of any size.
(a) Ablation on initialization. (b) Ablation on context length.
The second limitation is that on 7 out of the 11 datasets
Figure 4. Ablation studies. (see Table 1), CoCoOp’s performance in unseen classes still
lags behind CLIP’s, indicating that more efforts are needed
Table 5. CoCoOp (last row) vs a bigger CoOp on ImageNet. from the community to fully close or overturn the gaps be-
tween manual and learning-based prompts.
Model # params Base New H
CoOp (ctx=4) 2,048 76.47 67.88 71.92
6. Discussion and Conclusion
CoOp (ctx=60) 30,720 76.16 65.34 70.34 Our research addresses an important issue that arises
CoOp (ctx=4) + Meta-Net 34,816 75.98 70.43 73.10 with the availability of large pre-trained AI models, i.e.,
how to adapt them to downstream applications. These mod-
els, also called foundation models [1], have received in-
new classes. This problem is relevant to the existing creasing attention from academia and industry in both the
continual learning literature [37] but different in that the vision and NLP communities because they are so powerful
model here does not have access to any training data from in terms of their capabilities for diverse downstream tasks.
new classes and needs to perform zero-shot recognition on However, foundation models are costly to pre-train in terms
them. We compare CLIP, CoOp and CoCoOp using the of data scale and compute resources; and typically contain
11 datasets. The average results are reported in Table 4. an enormous number of parameters in order to develop suf-
Clearly, CoOp loses competitiveness against CLIP as their ficient capacity. For instance, the CLIP model [40] based
performance is similar but the former needs training data. on ViT-B/16 used in our experiments has a whopping 150M
Again, CoCoOp beats the two competitors with a signifi- parameter size. These factors together highlight the need for
cant margin. research of efficient adaptation methods for democratizing
Initialization We compare word embeddings-based ini- foundation models.
tialization with random initialization, which samples from a Our studies, which follow the line of parameter-efficient
zero-mean Gaussian distribution with 0.02 standard devia- prompt learning [62], provide timely insights into the gen-
tion. Figure 4(a) suggests that a proper initialization is more eralizability issue of static prompts, and more importantly,
beneficial to both the base and new classes. demonstrate that a simple design based on conditional
prompt learning performs superbly in a variety of prob-
Context Length Following Zhou et al. [62], we study 4, lem scenarios, including generalization from base to new
8 and 16 context tokens. For fair comparison, we use ran- classes, cross-dataset prompt transfer, and domain general-
dom initialization for all context tokens. Figure 4(b) sum- ization.
marizes the results on the 11 datasets. The differences in the
Acknowledgements This work is supported by NTU
base classes are fairly small whereas in the new classes the
NAP, and under the RIE2020 Industry Alignment Fund – In-
models with a longer context length clearly perform better.
dustry Collaboration Projects (IAF-ICP) Funding Initiative,
CoCoOp vs a Bigger CoOp Since CoCoOp introduces as well as cash and in-kind contribution from the industry
more parameters than CoOp, namely the Meta-Net, one partner(s).

16823
References visual features through embedding images into text topic
spaces. In CVPR, 2017. 3
[1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Alt-
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross
man, Simran Arora, Sydney von Arx, Michael S Bernstein,
Girshick. Momentum contrast for unsupervised visual rep-
Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al.
resentation learning. In CVPR, 2020. 3
On the opportunities and risks of foundation models. arXiv
preprint arXiv:2108.07258, 2021. 8 [18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Deep residual learning for image recognition. In CVPR,
Food-101–mining discriminative components with random 2016. 1, 4
forests. In ECCV, 2014. 5 [19] Patrick Helber, Benjamin Bischke, Andreas Dengel, and
[3] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Damian Borth. Eurosat: A novel dataset and deep learning
Sha. An empirical study and analysis of generalized zero- benchmark for land use and land cover classification. IEEE
shot learning for object recognition in the wild. In ECCV, Journal of Selected Topics in Applied Earth Observations
2016. 3 and Remote Sensing, 2019. 5
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- [20] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali
offrey Hinton. A simple framework for contrastive learning Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den
of visual representations. In ICML, 2020. 3 Oord. Data-efficient image recognition with contrastive pre-
[5] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy dictive coding. In ICML, 2020. 3
Mohamed, and Andrea Vedaldi. Describing textures in the [21] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada-
wild. In CVPR, 2014. 5 vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu,
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt,
and Li Fei-Fei. Imagenet: A large-scale hierarchical image and Justin Gilmer. The many faces of robustness: A critical
database. In CVPR, 2009. 5 analysis of out-of-distribution generalization. ICCV, 2021. 5
[7] Karan Desai and Justin Johnson. Virtex: Learning visual [22] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein-
representations from textual annotations. In CVPR, 2021. 3 hardt, and Dawn Song. Natural adversarial examples. In
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina CVPR, 2021. 5
Toutanova. Bert: Pre-training of deep bidirectional trans- [23] Dat Huynh and Ehsan Elhamifar. Fine-grained generalized
formers for language understanding. In NAACL, 2019. 3 zero-shot learning via dense attribute-based attention. In
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, CVPR, 2020. 3
Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, [24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh,
Mostafa Dehghani, Matthias Minderer, Georg Heigold, Syl- Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom
vain Gelly, et al. An image is worth 16x16 words: Trans- Duerig. Scaling up visual and vision-language representation
formers for image recognition at scale. In ICLR, 2021. 4 learning with noisy text supervision. In ICML, 2021. 1, 2, 3
[10] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. [25] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neu-
Write a classifier: Zero-shot learning using purely textual big. How can we know what language models know? ACL,
descriptions. In ICCV, 2013. 3 2020. 1, 3
[11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- [26] Armand Joulin, Laurens Van Der Maaten, Allan Jabri, and
ative visual models from few training examples: An incre- Nicolas Vasilache. Learning visual features from large
mental bayesian approach tested on 101 object categories. In weakly supervised data. In ECCV, 2016. 3
CVPR-W, 2004. 5
[27] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi
[12] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio,
Xie. Prompting visual-language models for efficient video
Jeff Dean, Marc’Aurelio Ranzato, and Tomas Mikolov. De-
understanding. arXiv preprint arXiv:2112.04478, 2021. 3
vise: A deep visual-semantic embedding model. NeurIPS,
2013. 3 [28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei.
3d object representations for fine-grained categorization. In
[13] Andreas Fürst, Elisabeth Rumetshofer, Viet Tran, Hubert
ICCV-W, 2013. 5
Ramsauer, Fei Tang, Johannes Lehner, David Kreil, Michael
Kopp, Günter Klambauer, Angela Bitto-Nemling, et al. [29] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting
Cloob: Modern hopfield networks with infoloob outperform deep zero-shot convolutional neural networks using textual
clip. arXiv preprint arXiv:2110.11316, 2021. 1, 3 descriptions. In ICCV, 2015. 3
[14] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao [30] Brian Lester, Rami Al-Rfou, and Noah Constant. The power
Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. of scale for parameter-efficient prompt tuning. arXiv preprint
Clip-adapter: Better vision-language models with feature arXiv:2104.08691, 2021. 1, 3
adapters. arXiv preprint arXiv:2110.04544, 2021. 1 [31] Ang Li, Allan Jabri, Armand Joulin, and Laurens van der
[15] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre- Maaten. Learning visual n-grams from web data. In ICCV,
trained language models better few-shot learners. arXiv 2017. 3
preprint arXiv:2012.15723, 2020. 1 [32] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimiz-
[16] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis ing continuous prompts for generation. arXiv preprint
Karatzas, and CV Jawahar. Self-supervised learning of arXiv:2101.00190, 2021. 1, 3

16824
[33] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Polosukhin. Attention is all you need. In NeurIPS, 2017. 1,
Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Su- 3, 4
pervision exists everywhere: A data efficient contrastive [49] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Du-
language-image pre-training paradigm. arXiv preprint mitru Erhan. Show and tell: A neural image caption gen-
arXiv:2110.05208, 2021. 1, 3 erator. In CVPR, 2015. 2
[34] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hi- [50] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P
roaki Hayashi, and Graham Neubig. Pre-train, prompt, and Xing. Learning robust global representations by penalizing
predict: A systematic survey of prompting methods in nat- local predictive power. In NeurIPS, 2019. 5
ural language processing. arXiv preprint arXiv:2107.13586, [51] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao.
2021. 1, 3 A survey of zero-shot learning: Settings, methods, and ap-
[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew plications. TIST, 2019. 3
Blaschko, and Andrea Vedaldi. Fine-grained visual classi- [52] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot
fication of aircraft. arXiv preprint arXiv:1306.5151, 2013. recognition via semantic embeddings and knowledge graphs.
5 In CVPR, 2018. 3
[36] Maria-Elena Nilsback and Andrew Zisserman. Automated [53] Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook
flower classification over a large number of classes. In Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok
ICVGIP, 2008. 5 Namkoong, and Ludwig Schmidt. Robust fine-tuning
[37] German I Parisi, Ronald Kemker, Jose L Part, Christopher of zero-shot models. arXiv preprint arXiv:2109.01903,
Kanan, and Stefan Wermter. Continual lifelong learning with 2021. 1
neural networks: A review. Neural Networks, 2019. 8 [54] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot
[38] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and learning-the good, the bad and the ugly. In CVPR, 2017. 3,
CV Jawahar. Cats and dogs. In CVPR, 2012. 5 5
[39] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton [55] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva,
Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian and Antonio Torralba. Sun database: Large-scale scene
Riedel. Language models as knowledge bases? In EMNLP, recognition from abbey to zoo. In CVPR, 2010. 2, 5
2019. 3 [56] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Seng Chua, and Maosong Sun. Cpt: Colorful prompt tun-
Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, ing for pre-trained vision-language models. arXiv preprint
Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- arXiv:2109.11797, 2021. 1, 3
ing transferable visual models from natural language super- [57] Kai Yi, Xiaoqian Shen, Yunhao Gou, and Mohamed El-
vision. In ICML, 2021. 1, 2, 3, 4, 5, 7, 8 hoseiny. Exploring hierarchical graph representation for
[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario large-scale zero-shot image classification. arXiv preprint
Amodei, Ilya Sutskever, et al. Language models are unsu- arXiv:2203.01386, 2022. 3
pervised multitask learners. OpenAI blog, 2019. 3, 4 [58] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xu-
[42] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong peng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li.
Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Pointclip: Point cloud understanding by clip. arXiv preprint
Denseclip: Language-guided dense prediction with context- arXiv:2112.02413, 2021. 3
aware prompting. In CVPR, 2022. 3 [59] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D
[43] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Manning, and Curtis P Langlotz. Contrastive learning of
Vaishaal Shankar. Do imagenet classifiers generalize to im- medical visual representations from paired images and text.
agenet? In ICML, 2019. 5 arXiv preprint arXiv:2010.00747, 2020. 2
[44] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric [60] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual
Wallace, and Sameer Singh. Autoprompt: Eliciting knowl- probing is [mask]: Learning vs. learning to recall. In NAACL,
edge from language models with automatically generated 2021. 1, 3
prompts. In EMNLP, 2020. 1, 3 [61] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and
[45] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bas- Chen Change Loy. Domain generalization in vision: A sur-
tani, Christopher D Manning, and Andrew Y Ng. Zero-shot vey. arXiv preprint arXiv:2103.02503, 2021. 7
learning through cross-modal transfer. In NeurIPS, 2013. 3 [62] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei
[46] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Liu. Learning to prompt for vision-language models. arXiv
Ucf101: A dataset of 101 human actions classes from videos preprint arXiv:2109.01134, 2021. 1, 2, 3, 4, 5, 7, 8
in the wild. arXiv preprint arXiv:1212.0402, 2012. 5
[47] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Car-
lini, Benjamin Recht, and Ludwig Schmidt. Measuring ro-
bustness to natural distribution shifts in image classification.
In NeurIPS, 2020. 7
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszko-
reit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia

16825

You might also like