
Conditional Prompt Learning for Vision-Language Models

Kaiyang Zhou Jingkang Yang Chen Change Loy Ziwei Liu


S-Lab, Nanyang Technological University, Singapore
{kaiyang.zhou, jingkang001, ccloy, ziwei.liu}@ntu.edu.sg
arXiv:2203.05557v2 [cs.CV] 6 Oct 2022

Abstract

With the rise of powerful pre-trained vision-language models like CLIP, it becomes essential to investigate ways to adapt these models to downstream datasets. A recently proposed method named Context Optimization (CoOp) introduces the concept of prompt learning—a recent trend in NLP—to the vision domain for adapting pre-trained vision-language models. Specifically, CoOp turns context words in a prompt into a set of learnable vectors and, with only a few labeled images for learning, can achieve huge improvements over intensively-tuned manual prompts. In our study we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same dataset, suggesting that CoOp overfits base classes observed during training. To address the problem, we propose Conditional Context Optimization (CoCoOp), which extends CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector). Compared to CoOp's static prompts, our dynamic prompts adapt to each instance and are thus less sensitive to class shift. Extensive experiments show that CoCoOp generalizes much better than CoOp to unseen classes, even showing promising transferability beyond a single dataset, and yields stronger domain generalization performance as well. Code is available at https://github.com/KaiyangZhou/CoOp.

1. Introduction

Recent research in large-scale vision-language pre-training has achieved striking performance in zero-shot image recognition [13, 24, 33, 40], demonstrating a potential in learning open-world visual concepts for such a paradigm. The key design lies in how visual concepts are modeled. In traditional supervised learning where labels are discretized, each category is associated with a randomly initialized weight vector that is learned to minimize the distance with images containing the same category. Such a learning method focuses on closed-set visual concepts, limiting the model to a pre-defined list of categories, and is unscalable when it comes to new categories unseen during training.

In contrast, for vision-language models1 like CLIP [40] and ALIGN [24], the classification weights are directly generated by a parameterized text encoder (e.g., a Transformer [48]) through prompting [34]. For instance, to differentiate pet images containing different breeds of dogs and cats, one can adopt a prompt template like "a photo of a {class}, a type of pet" as input to the text encoder, and as a result, class-specific weights for classification can be synthesized by filling in the "{class}" token with real class names. Compared to discrete labels, vision-language models' source of supervision comes from natural language, which allows open-set visual concepts to be broadly explored and has been proven effective in learning transferable representations [24, 40].
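To make the prompting mechanism above concrete, here is a minimal zero-shot classification sketch using the open-source CLIP package (https://github.com/openai/CLIP). The class names, image path, and pet-style template are illustrative placeholders, and the factor of 100 approximates CLIP's learned logit scale (roughly 1/τ).

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/16", device=device)

# Hypothetical class names; the template synthesizes one classification
# weight per class by filling in the {class} slot.
class_names = ["abyssinian", "beagle", "persian", "pug"]
text_tokens = clip.tokenize(
    [f"a photo of a {c}, a type of pet." for c in class_names]
).to(device)

image = preprocess(Image.open("pet.jpg")).unsqueeze(0).to(device)  # placeholder path

with torch.no_grad():
    image_feat = model.encode_image(image)
    text_feat = model.encode_text(text_tokens)   # these act as the weights w_i
    image_feat = image_feat / image_feat.norm(dim=-1, keepdim=True)
    text_feat = text_feat / text_feat.norm(dim=-1, keepdim=True)
    # Cosine similarities scaled by ~1/tau, then softmax over classes
    probs = (100.0 * image_feat @ text_feat.T).softmax(dim=-1)

print({c: float(p) for c, p in zip(class_names, probs[0])})
```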
With the rise of such powerful vision-language models, the community has recently started to investigate potential solutions to efficiently adapt these models to downstream datasets [14, 53, 56, 63]. To fit web-scale data, such as the 400 million pairs of images and texts used by CLIP, vision-language models are purposefully designed to have high capacity, entailing that the model size would be enormous, typically with hundreds of millions of parameters or even billions. Therefore, fine-tuning the entire model, as often adopted in deep learning research [18], is impractical and might even damage the well-learned representation space.

A safer approach is to tune a prompt by adding some context that is meaningful to a task, like "a type of pet" for the pet dataset mentioned above, which has been found effective in improving performance [40]. However, prompt engineering is extremely time-consuming and inefficient as it has to be based on trial and error, and does not guarantee an optimal prompt either. To automate prompt engineering, Zhou et al. [63] have recently explored the concept of prompt learning—a recent trend in NLP [15, 25, 30, 32, 44, 60]—for adapting pre-trained vision-language models.

Corresponding author.
1 We follow existing studies [13, 24, 33, 40] to refer to CLIP-like models as vision-language models.

[Figure 1 shows example prompts and accuracies on SUN397 scenes. Zero-shot CLIP uses the manual prompt "a photo of a [class].", CoOp uses the learned static prompt "[v1] [v2] ... [vM] [class].", and CoCoOp uses the instance-conditional prompt "[v1(x)] [v2(x)] ... [vM(x)] [class]."]

(a) Base classes ("arrival gate", "cathedral"): zero-shot 69.36%, CoOp 80.60%, CoCoOp 79.74%. Both CoOp and CoCoOp work well on the base classes observed during training and beat manual prompts by a significant margin.

(b) New classes ("wind farm", "train railway"): zero-shot 75.35%, CoOp 65.89%, CoCoOp 76.86%. The instance-conditional prompts learned by CoCoOp are much more generalizable than CoOp to the unseen classes.

Figure 1. Motivation of our research: to learn generalizable prompts. The images are randomly selected from SUN397 [55], which is a widely-used scene recognition dataset.

Their approach, Context Optimization (CoOp), turns context words in a prompt into a set of learnable vectors, taking advantage of the differentiable nature of neural networks. With only a few labeled images for learning, CoOp achieves huge improvements over intensively-tuned manual prompts across a wide range of image recognition datasets.

In our study, we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same task. Figure 1 illustrates the problem: the context learned by CoOp works well in distinguishing the base classes like "arrival gate" and "cathedral" but suffers a significant drop in accuracy when it is transferred to the new (unseen) classes, such as "wind farm" and "train railway"—even though the task's nature remains the same, i.e., recognizing scenes. The results suggest that the learned context overfits the base classes, thus failing to capture more generalizable elements that are vital for broader scene recognition. We argue that such a problem is caused by CoOp's static design: the context, which is fixed once learned, is optimized only for a specific set of (training) classes. On the contrary, the manually-designed prompts adopted by the zero-shot method are relatively generalizable.

To address the weak generalizability problem, we introduce a novel concept: conditional prompt learning. The key idea is to make a prompt conditioned on each input instance (image) rather than fixed once learned. To make the model parameter-efficient, we introduce a simple yet effective implementation of conditional prompt learning. Specifically, we extend CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector), which is combined with the learnable context vectors. We call our approach Conditional Context Optimization (CoCoOp).2 An overview is shown in Figure 2. Interestingly, the paradigm of CoCoOp is analogous to image captioning [49], which explains why instance-conditional prompts are more generalizable: they are optimized to characterize each instance (more robust to class shift) rather than to serve only for some specific classes.

We present comprehensive experiments on 11 datasets covering a diverse set of visual recognition tasks. Specifically, we design a base-to-new generalization setting where a model is first learned using base classes and then tested on completely new classes. Compared with the zero-shot method [40] and CoOp [63], our approach achieves the best overall performance (Table 1). Importantly, CoCoOp gains significant improvements over CoOp in unseen classes (Figure 3(a)), allowing the gap between manual and learning-based prompts to be substantially reduced.

In a more challenging scenario where the context learned for one task is directly transferred to other tasks with drastically different classes, CoCoOp still beats CoOp with a clear margin (Table 2), suggesting that instance-conditional prompts are more transferable and have the potential to succeed at larger scale. CoCoOp also obtains stronger domain generalization performance than CoOp (Table 3), further justifying the strengths of dynamic prompts.

In summary, our research provides timely insights into the generalizability problem in prompt learning, and crucially, demonstrates the effectiveness of a simple idea in various problem scenarios. We hope our approach and the findings presented in this work can pave the way for future research in generalizable—and transferable—prompt learning.

2 Pronounced as /kəʊˈkuːp/.

2. Related Work

Vision-Language Models  We mainly review studies focused on aligning images and texts to learn a joint embedding space [24, 40, 59].

The idea of cross-modality alignment is certainly not new and has been investigated since nearly a decade ago—though with dramatically different technologies than today. A typical vision-language model consists of three key elements: two for image and text encoding while the third is related to the design of loss functions. In early days, models for processing images and texts are often designed and also learned independently, with their outputs connected by extra modules (losses) for alignment. Images are often encoded using hand-crafted descriptors [10, 45] or neural networks [12, 29], while texts are encoded using, for instance, pre-trained word vectors [12, 45] or the frequency-based TF-IDF features [10, 29]. In terms of cross-modality alignment, common approaches include metric learning [12], multi-label classification [16, 26], and n-gram language learning [31]. Recently, a study suggests that training the vision part with an image captioning loss can make the visual representation more transferable [7].

Recent vision-language models [13, 24, 33, 40] bridge the two modalities by learning two encoders jointly. Also, the models are now built with much larger neural networks. As discussed in Zhou et al. [63], recent successes in vision-language models are mainly attributed to the developments in i) Transformers [48], ii) contrastive representation learning [4, 17, 20], and iii) web-scale training datasets [24, 40]. A representative approach is CLIP [40], which trains two neural network-based encoders using a contrastive loss to match pairs of images and texts. After consuming 400 million data pairs, the CLIP model demonstrates a remarkable zero-shot image recognition capability. Similar to CoOp [63], our approach is orthogonal to the research of CLIP-like models [13, 24, 33, 40], aiming to offer an efficient solution for adapting pre-trained vision-language models to downstream applications.

Prompt Learning  This topic originates from the NLP domain. The motivation was to view pre-trained language models, such as BERT [8] or GPT [41], as knowledge bases from which information useful to downstream tasks is elicited [39]. Concretely, given a pre-trained language model, the task is often formulated as a "fill-in-the-blank" cloze test, such as asking the model to predict the masked token in "No reason to watch. It was [MASK]" as either "positive" or "negative" for sentiment classification. The key lies in how to design the underlined part, known as prompt (template), in such a format familiar to the model. Instead of manually designing a prompt, research in prompt learning aims to automate the process with the help of affordable-sized labeled data. Jiang et al. [25] use text mining and paraphrasing to generate a group of candidate prompts, within which the optimal ones are chosen to have the highest training accuracy. Shin et al. [44] propose AutoPrompt, a gradient-based approach that selects from a vocabulary the best tokens that cause the greatest changes in gradients based on the label likelihood. Our research is most related to continuous prompt learning methods [30, 32, 60], where the main idea is to turn a prompt into a set of continuous vectors that can be end-to-end optimized with respect to an objective function. See Liu et al. [34] for a more comprehensive survey.

In computer vision, prompt learning is a nascent research direction that has only been explored very recently [27, 42, 56, 58, 63]. Our research is built on top of CoOp [63], which is the earliest work to bring continuous prompt learning to the vision domain for adaptation of pre-trained vision-language models. Crucially, our approach solves the weak generalizability problem of CoOp [63], based on a simple idea of conditional prompt learning—which to our knowledge is also novel in the context of NLP and thus could be of interest to the NLP community as well.

Zero-Shot Learning (ZSL)  is another relevant research area where the goal is similar to ours, i.e., to recognize novel classes by training only on base classes [3, 51, 54, 57]. Moreover, the generalization problem where a model trained on base classes often fails on novel classes is also linked to the "seen-class bias" issue raised in the ZSL literature [54]. The most common approach to ZSL is to learn a semantic space based on auxiliary information such as attributes [23] or word embeddings [12, 52]. Different from existing ZSL methods, our work addresses the emerging problem of adapting large vision-language models and uses drastically different techniques based on prompting.

[Figure 2 depicts the CoCoOp architecture: the image encoder produces image features that are fed to the Meta-Net, whose output (the meta token π) is added to each of the learnable context tokens v1, v2, ..., vM before the [CLASS] token; the resulting prompt is passed through the text encoder.]

Figure 2. Our approach, Conditional Context Optimization (CoCoOp), consists of two learnable components: a set of context vectors and a lightweight neural network (Meta-Net) that generates for each image an input-conditional token.

3. Methodology

An overview of our approach is shown in Figure 2. Below we first provide brief reviews on CLIP [40], which is the base model used in this paper, and CoOp [63]. Then, we present the technical details of our approach as well as the rationale behind the design. Same as CoOp, our approach is applicable to broader CLIP-like vision-language models.

3.1. Reviews of CLIP and CoOp

Contrastive Language-Image Pre-training, known as CLIP [40], has well demonstrated the potential of learning open-set visual concepts. CLIP is built using two encoders, one for image and the other for text, as shown in Figure 2. The image encoder can be either a ResNet [18] or a ViT [9], which is used to transform an image into a feature vector. The text encoder is a Transformer [48], which takes as input a sequence of word tokens and again produces a vectorized representation.

During training, CLIP adopts a contrastive loss to learn a joint embedding space for the two modalities. Specifically, for a mini-batch of image-text pairs, CLIP maximizes for each image the cosine similarity with the matched text while minimizing the cosine similarities with all other unmatched texts, and the loss is computed in a similar fashion for each text too. After training, CLIP can be used for zero-shot image recognition. Let x be image features generated by the image encoder and {w_i}_{i=1}^K a set of weight vectors produced by the text encoder, each representing a category (suppose there are K categories in total). In particular, each w_i is derived from a prompt, such as "a photo of a {class}" where the "{class}" token is filled with the i-th class name. The prediction probability is then

p(y|x) = \frac{\exp(\mathrm{sim}(x, w_y)/\tau)}{\sum_{i=1}^{K} \exp(\mathrm{sim}(x, w_i)/\tau)},    (1)

where sim(·, ·) denotes cosine similarity and τ is a learned temperature parameter.

Context Optimization (CoOp)  aims to overcome the inefficiency problem in prompt engineering for better adapting pre-trained vision-language models to downstream applications [63]. The key idea in CoOp is to model each context token using a continuous vector that can be end-to-end learned from data. Concretely, instead of using "a photo of a" as the context, CoOp introduces M learnable context vectors, {v_1, v_2, ..., v_M}, each having the same dimension as the word embeddings. The prompt for the i-th class, denoted by t_i, now becomes t_i = {v_1, v_2, ..., v_M, c_i} where c_i is the word embedding(s) for the class name. The context vectors are shared among all classes.3 Let g(·) denote the text encoder; the prediction probability is then

p(y|x) = \frac{\exp(\mathrm{sim}(x, g(t_y))/\tau)}{\sum_{i=1}^{K} \exp(\mathrm{sim}(x, g(t_i))/\tau)}.    (2)

To adapt CLIP to a downstream image recognition dataset, a cross-entropy loss can be used as the learning objective. Since the text encoder g(·) is differentiable, gradients can be propagated all the way back to update the context vectors. Note that the base model of CLIP is frozen in the entire training process (ours too).

3 CoOp has an alternative version that learns class-specific context, which is not considered here because it is not straightforward to transfer class-specific context to unseen classes.
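As a rough sketch of how Eqs. (1)-(2) can be realized in code (not the authors' exact implementation): the M context vectors are ordinary learnable parameters concatenated with each class-name embedding before the frozen text encoder, and the prediction is a softmax over temperature-scaled cosine similarities. Here `text_encoder` and `class_name_embeds` are assumed stand-ins for the corresponding pieces of a CLIP-like model.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class CoOpPromptLearner(nn.Module):
    """Shared learnable context {v_1, ..., v_M} prepended to class-name embeddings.

    `class_name_embeds` is an assumed list of per-class word-embedding tensors
    (shape: num_tokens x ctx_dim), e.g. taken from CLIP's token-embedding table.
    """
    def __init__(self, n_ctx: int, ctx_dim: int, class_name_embeds):
        super().__init__()
        self.ctx = nn.Parameter(torch.empty(n_ctx, ctx_dim).normal_(std=0.02))
        self.class_name_embeds = class_name_embeds

    def forward(self):
        # t_i = {v_1, v_2, ..., v_M, c_i} for every class i
        return [torch.cat([self.ctx, c_i], dim=0) for c_i in self.class_name_embeds]

def prediction_probs(image_feat, prompts, text_encoder, tau=0.01):
    """Eq. (2): softmax over temperature-scaled cosine similarities.

    `text_encoder` is an assumed callable mapping a token-embedding sequence to
    a single text feature g(t_i); the CLIP backbone itself stays frozen.
    """
    w = torch.stack([text_encoder(t_i) for t_i in prompts])   # (K, d)
    x = F.normalize(image_feat, dim=-1)
    w = F.normalize(w, dim=-1)
    return F.softmax(x @ w.t() / tau, dim=-1)                 # p(y|x)
```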

3.2. CoCoOp: Conditional Context Optimization

CoOp is a data-efficient approach allowing the context vectors to be trained with only a few labeled images in a downstream dataset. However, as discussed, CoOp is not generalizable to wider unseen classes within the same task. We argue that instance-conditional context can generalize better because it shifts the focus away from a specific set of classes—for reducing overfitting—to each input instance, and hence to the entire task.

A straightforward way to implement CoCoOp is to build M neural networks to get M context tokens. However, such a design would require M × the size of a neural network, which is much larger than having M context vectors as in CoOp. Here we propose a parameter-efficient design that works very well in practice. Specifically, on top of the M context vectors, we further learn a lightweight neural network, called Meta-Net, to generate for each input a conditional token (vector), which is then combined with the context vectors. See Figure 2 for a sketch of the architecture.

Let h_θ(·) denote the Meta-Net parameterized by θ; each context token is now obtained by v_m(x) = v_m + π, where π = h_θ(x) and m ∈ {1, 2, ..., M}. The prompt for the i-th class is thus conditioned on the input, i.e., t_i(x) = {v_1(x), v_2(x), ..., v_M(x), c_i}. The prediction probability is computed as

p(y|x) = \frac{\exp(\mathrm{sim}(x, g(t_y(x)))/\tau)}{\sum_{i=1}^{K} \exp(\mathrm{sim}(x, g(t_i(x)))/\tau)}.    (3)

During training, we update the context vectors {v_m}_{m=1}^M together with the Meta-Net's parameters θ. In this work, the Meta-Net is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing the input dimension by 16×. The input to the Meta-Net is simply the output features produced by the image encoder. We leave exploration of more advanced designs for future work.
els used in our experiments are based on the open-source
To adapt CLIP to a downstream image recognition CLIP [40].4 Before discussing the results, we provide the
dataset, a cross-entropy loss can be used as the learning ob- details of the experimental setup below.
jective. Since the text encoder g(·) is differentiable, gradi-
Datasets For the first two settings, i.e., base-to-new gen-
3 CoOp has an alternative version that learns class-specific context, eralization and cross-dataset transfer, we use the 11 image
which is not considered here because it is not straightforward to transfer
class-specific context to unseen classes. 4 https://github.com/openai/CLIP.

Table 1. Comparison of CLIP, CoOp and CoCoOp in the base-to-new generalization setting. For learning-based methods (CoOp and CoCoOp), their prompts are learned from the base classes (16 shots). The results clearly demonstrate the strong generalizability of conditional prompt learning. H: Harmonic mean (to highlight the generalization trade-off [54]).

(a) Average over 11 datasets. (b) ImageNet. (c) Caltech101.

Base New H Base New H Base New H


CLIP 69.34 74.22 71.70 CLIP 72.43 68.14 70.22 CLIP 96.84 94.00 95.40
CoOp 82.69 63.22 71.66 CoOp 76.47 67.88 71.92 CoOp 98.00 89.81 93.73
CoCoOp 80.47 71.69 75.83 CoCoOp 75.98 70.43 73.10 CoCoOp 97.96 93.81 95.84

(d) OxfordPets. (e) StanfordCars. (f) Flowers102.

Base New H Base New H Base New H


CLIP 91.17 97.26 94.12 CLIP 63.37 74.89 68.65 CLIP 72.08 77.80 74.83
CoOp 93.67 95.29 94.47 CoOp 78.12 60.40 68.13 CoOp 97.60 59.67 74.06
CoCoOp 95.20 97.69 96.43 CoCoOp 70.49 73.59 72.01 CoCoOp 94.87 71.75 81.71

(g) Food101. (h) FGVCAircraft. (i) SUN397.

Base New H Base New H Base New H


CLIP 90.10 91.22 90.66 CLIP 27.19 36.29 31.09 CLIP 69.36 75.35 72.23
CoOp 88.33 82.26 85.19 CoOp 40.44 22.30 28.75 CoOp 80.60 65.89 72.51
CoCoOp 90.70 91.29 90.99 CoCoOp 33.41 23.71 27.74 CoCoOp 79.74 76.86 78.27

(j) DTD. (k) EuroSAT. (l) UCF101.

Base New H Base New H Base New H


CLIP 53.24 59.90 56.37 CLIP 56.48 64.05 60.03 CLIP 70.53 77.50 73.85
CoOp 79.44 41.18 54.24 CoOp 92.19 54.74 68.69 CoOp 84.69 56.05 67.46
CoCoOp 77.01 56.00 64.85 CoCoOp 87.49 60.04 71.21 CoCoOp 82.33 73.45 77.64
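For reference, the H column is the harmonic mean of the base and new accuracies, H = 2 · Base · New / (Base + New); the quick check below reproduces the three H values of Table 1(a).

```python
# Reproduce the H column of Table 1(a) from the Base and New accuracies.
def harmonic_mean(base: float, new: float) -> float:
    return 2 * base * new / (base + new)

print(f"{harmonic_mean(69.34, 74.22):.2f}")  # 71.70 (CLIP,   Table 1a)
print(f"{harmonic_mean(82.69, 63.22):.2f}")  # 71.66 (CoOp,   Table 1a)
print(f"{harmonic_mean(80.47, 71.69):.2f}")  # 75.83 (CoCoOp, Table 1a)
```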

Datasets  For the first two settings, i.e., base-to-new generalization and cross-dataset transfer, we use the 11 image recognition datasets as in Zhou et al. [63], which cover a diverse set of recognition tasks. Specifically, the benchmark includes ImageNet [6] and Caltech101 [11] for classification on generic objects; OxfordPets [38], StanfordCars [28], Flowers102 [36], Food101 [2] and FGVCAircraft [35] for fine-grained classification; SUN397 [55] for scene recognition; UCF101 [46] for action recognition; DTD [5] for texture classification; and finally EuroSAT [19] for satellite imagery recognition. For domain generalization experiments, we use ImageNet as the source dataset and four variants of ImageNet that contain different types of domain shift as the target datasets, namely ImageNetV2 [43], ImageNet-Sketch [50], ImageNet-A [22] and ImageNet-R [21].

Following Zhou et al. [63], we randomly sample for each dataset a few-shot training set while using the original test set for testing. We only evaluate the highest shot number studied in Zhou et al. [63], i.e., 16 shots, which is sufficient to justify our approach. For learning-based models, the results are averaged over three runs.

Baselines  The direct rival to our approach is CoOp [63], which essentially learns static prompts (in comparison to our dynamic prompts). The zero-shot method, i.e., CLIP [40], is also compared, which is based on manual prompts. It is worth mentioning that the manual prompt for each dataset was intensively tuned using all classes in the test data [40].

Training Details  Our implementation is based on CoOp's code.⁵ Throughout the experiments, we use the best available vision backbone in CLIP, i.e., ViT-B/16. Zhou et al. [63] have suggested that a shorter context length and a good initialization can lead to better performance and stronger robustness to domain shift. Therefore, we fix the context length to 4 and initialize the context vectors using the pre-trained word embeddings of "a photo of a" for both CoOp and CoCoOp. Due to the instance-conditional design, our approach is slow to train and consumes much more GPU memory than CoOp. Therefore, to ensure the model can fit into a GPU and meanwhile reduce the training time, we train CoCoOp with a batch size of 1 for 10 epochs. Such a limitation is discussed in more detail in Section 5.

⁵ https://github.com/KaiyangZhou/CoOp.
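As a sketch of the word-embedding initialization just described, assuming the open-source clip package cited above is installed (the variable names are ours, not from the released code), the four context vectors can be seeded with the token embeddings of "a photo of a":

```python
# Initialize the M=4 context vectors from the pre-trained word embeddings of
# "a photo of a" (a sketch assuming OpenAI's open-source `clip` package).
import clip
import torch

model, _ = clip.load("ViT-B/16", device="cpu")
with torch.no_grad():
    tokens = clip.tokenize("a photo of a")           # (1, 77) token ids
    embeds = model.token_embedding(tokens).float()   # (1, 77, 512) word embeddings
# Skip the start-of-text token; take the 4 context-word embeddings.
ctx_init = embeds[0, 1:1 + 4, :]                     # (4, 512)
ctx = torch.nn.Parameter(ctx_init.clone())           # learnable context vectors
# Random-init alternative used in the Section 4.4 ablation: torch.randn(4, 512) * 0.02
```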
Figure 3. Comprehensive comparisons of CoCoOp and CoOp in the base-to-new generalization setting. (a) CoCoOp is able to gain consistent improvements over CoOp in unseen classes on all datasets. (b) CoCoOp's declines in base accuracy are mostly under 3%, which are far outweighed by the gains in generalization.

4.1. Generalization From Base to New Classes

Solving the weak generalizability problem of CoOp is the main focus of this research. On each of the 11 datasets, we split the classes equally into two groups, one as base classes and the other as new classes. Learning-based models, i.e., CoOp and CoCoOp, are trained using only the base classes, while evaluation is conducted on the base and new classes separately to test generalizability. The detailed results are shown in Table 1.

Failures of CoOp in Unseen Classes  The split does not guarantee that the two class groups are equally difficult, as evidenced by CLIP's bumpy results: the base and new accuracy numbers are dramatically different.⁶ Nonetheless, CoOp's new accuracy is consistently much weaker than its base accuracy on nearly all datasets, leaving a huge gap of almost 20% on average (82.69% vs 63.22%). Despite maintaining an advantage over CLIP in terms of average performance, CoOp's gains in the base classes are nearly zeroed out by the catastrophic failures in the new classes, highlighting the need to improve generalizability for learning-based prompts.

CoCoOp Significantly Narrows the Generalization Gap  As shown in Table 1(a), CoCoOp improves the accuracy in unseen classes from 63.22% to 71.69%, which largely reduces the gap with manual prompts. The results confirm that instance-conditional prompts are more generalizable. A more detailed breakdown of per-dataset improvement is visualized in Figure 3(a), where we observe increases in accuracy of more than 10% on 5 out of 11 datasets. Notably, on the challenging ImageNet dataset, CoCoOp's surge from 67.88% to 70.43% represents non-trivial progress (the 70.43% accuracy even surpasses CLIP's 68.14%).

CoCoOp's Gains in Generalization Far Outweigh Losses in Base Accuracy  In comparison to CoOp, performance drops in the base classes occur for CoCoOp on most datasets (see Figure 3(b)). This is reasonable because CoOp optimizes specifically for base classes, whereas CoCoOp optimizes for each instance in order to gain more generalization over an entire task. But it is worth noting that on the 9 datasets where CoCoOp's base accuracy drops below CoOp's, most losses are under 3% (precisely on 6 out of 9 datasets), which are far outweighed by the gains in unseen classes shown in Figure 3(a); even for those where CoCoOp suffers the biggest losses, the boosts in generalization are mostly significant enough to turn the averages into positives, e.g., StanfordCars sees the worst base accuracy drop of -7.63% but has the third-highest accuracy gain of +13.19% in the new classes, which together bring a 5.56% positive improvement for CoCoOp.

CoCoOp Is More Compelling Than CLIP  When taking into account both the base and new classes, CoCoOp shows a gain of more than 4% over CLIP (75.83% vs 71.70%), suggesting that instance-conditional prompts have a better potential in capturing more generalizable elements that are relevant for a recognition task. Theoretically, learning-based prompts have a much higher risk of overfitting base classes than manual prompts. Therefore, CLIP is a strong competitor to beat in unseen classes.

⁶ For convenience, we refer to base accuracy as the performance in base classes, and similarly for new accuracy.
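For concreteness, the base-to-new protocol used throughout this subsection can be sketched as follows; the helper names are hypothetical stand-ins, not the actual evaluation pipeline.

```python
# Sketch of the base-to-new protocol (hypothetical helpers, not the authors' pipeline):
# split the classes in half, learn prompts on the base half only, then evaluate on the
# base and new classes separately.
def base_to_new_eval(classnames, train_prompt_learner, evaluate):
    half = len(classnames) // 2
    base, new = classnames[:half], classnames[half:]
    model = train_prompt_learner(base)                  # 16 labeled images per base class
    return evaluate(model, base), evaluate(model, new)  # base accuracy, new accuracy
```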

Table 2. Comparison of prompt learning methods in the cross-dataset transfer setting. Prompts applied to the 10 target datasets are learned from ImageNet (16 images per class). Clearly, CoCoOp demonstrates better transferability than CoOp. ∆ denotes CoCoOp's gain over CoOp.

                        CoOp [63]   CoCoOp   ∆
Source: ImageNet        71.51       71.02    -0.49
Target: Caltech101      93.70       94.43    +0.73
Target: OxfordPets      89.14       90.14    +1.00
Target: StanfordCars    64.51       65.32    +0.81
Target: Flowers102      68.71       71.88    +3.17
Target: Food101         85.30       86.06    +0.76
Target: FGVCAircraft    18.47       22.94    +4.47
Target: SUN397          64.15       67.36    +3.21
Target: DTD             41.92       45.73    +3.81
Target: EuroSAT         46.39       45.37    -1.02
Target: UCF101          66.55       68.21    +1.66
Target average          63.88       65.74    +1.86

Table 3. Comparison of manual and learning-based prompts in domain generalization. CoOp and CoCoOp use as training data 16
images from each of the 1,000 classes on ImageNet. In general, CoCoOp is more domain-generalizable than CoOp.

             Learnable?   Source      Target
                          ImageNet    ImageNetV2   ImageNet-Sketch   ImageNet-A   ImageNet-R
CLIP [40]                 66.73       60.83        46.15             47.77        73.96
CoOp [63]    ✓            71.51       64.20        47.99             49.71        75.21
CoCoOp       ✓            71.02       64.07        48.75             50.63        76.18

Different from CoOp, we obtain promising results for CoCoOp: the new accuracy is even better than CLIP's on 4 out of 11 datasets (i.e., ImageNet, OxfordPets, Food101 and SUN397) and not too far from CLIP's on the rest, except FGVCAircraft, where the gap between manual and learning-based prompts is generally large. In the ablation study on context length, we find that FGVCAircraft benefits from longer context, which is aligned with the findings in Zhou et al. [63]. To close or even overturn the gaps between manual and learning-based prompts in unseen classes, more effort is required, and we hope the insights presented in this research can help the community tackle the generalizability issue in prompt learning.

4.2. Cross-Dataset Transfer

Having demonstrated CoCoOp's generalizability within a dataset, we further show that CoCoOp has the potential to transfer beyond a single dataset, which is a much more challenging problem because the fundamentals can change completely across datasets (e.g., from object recognition to texture classification). We only consider prompt learning methods in this setting.

We compare CoCoOp with CoOp by transferring context learned from ImageNet, with all 1,000 classes used, to each of the other 10 datasets. The results are detailed in Table 2. On the source dataset, the two models perform similarly, whereas on the target datasets, CoCoOp mostly outperforms CoOp by a clear margin. Since the ImageNet classes mainly contain objects, as well as a fair amount of dog breeds, it is reasonable to see high accuracy for both models on the relevant target datasets, including Caltech101 and OxfordPets.

By comparison, the performance on other datasets with distant, more fine-grained or specialized categories is much lower, such as FGVCAircraft and DTD (containing various textures), where the accuracy numbers are well below 50%. Nonetheless, CoCoOp exhibits much stronger transferability than CoOp on these two datasets as well as on most other fine-grained or specialized datasets.

4.3. Domain Generalization

Generalization to out-of-distribution data is a capability essential for machine learning models to succeed in practical applications [47, 62]. Zhou et al. [63] have revealed that their learnable prompts are more robust than manual prompts to domain shift. We are also interested in whether instance-conditional prompts maintain the advantage observed in the previous experiments.

Following Zhou et al. [63], we evaluate CoCoOp's domain generalization performance by transferring the context learned from ImageNet to the four specially designed benchmarks. We also include the comparison with CLIP. Table 3 shows the results. Both prompt learning methods clearly beat CLIP on all target datasets. Compared to CoOp, CoCoOp performs slightly worse on ImageNetV2 but better on the other three. The results confirm that instance-conditional prompts are more domain-generalizable.

Table 4. Recognition accuracy (average over 11 datasets) on a combination of base and new classes. The learnable models only have access to training data from base classes.

             Learnable?   Accuracy
CLIP [40]                 65.22
CoOp [63]    ✓            65.55
CoCoOp       ✓            69.13

Table 5. CoCoOp (last row) vs a bigger CoOp on ImageNet.

Model                      # params   Base    New     H
CoOp (ctx=4)               2,048      76.47   67.88   71.92
CoOp (ctx=60)              30,720     76.16   65.34   70.34
CoOp (ctx=4) + Meta-Net    34,816     75.98   70.43   73.10

Figure 4. Ablation studies. (a) Ablation on initialization. (b) Ablation on context length.

4.4. Further Analysis

Class-Incremental Test  We consider a practical problem scenario where the recognition targets, originally composed of base classes, are expanded to include completely new classes. This problem is relevant to the existing continual learning literature [37] but different in that the model here does not have access to any training data from the new classes and needs to perform zero-shot recognition on them. We compare CLIP, CoOp and CoCoOp using the 11 datasets. The average results are reported in Table 4. Clearly, CoOp loses competitiveness against CLIP, as their performance is similar but the former needs training data. Again, CoCoOp beats the two competitors by a significant margin.

Initialization  To understand the impact of initialization, we conduct an ablation study comparing word-embedding-based initialization and random initialization while keeping all other parameters identical. For random initialization, we follow Zhou et al. [63] and sample from a zero-mean Gaussian distribution with 0.02 standard deviation. Figure 4(a) shows the base-to-new generalization results averaged over the 11 datasets, which suggest that a proper initialization is more beneficial to both the base and new classes. Note that the findings from Figure 4 only represent the overall trend; each individual dataset might show a different result.

Context Length  The ablation study on context length is also carried out in the base-to-new generalization setting. Following Zhou et al. [63], we study 4, 8 and 16 context tokens. For fair comparison, we use random initialization for all context tokens. Figure 4(b) summarizes the results on the 11 datasets. The differences in the base classes are fairly small, whereas in the new classes the models with a longer context length clearly perform better. From Figure 4(a) and (b) we observe that using 8 randomly initialized context tokens is marginally better than using 4 properly initialized context tokens, suggesting that a further boost might be possible if we initialize 8 context tokens with word embeddings.

CoCoOp vs a Bigger CoOp  Since CoCoOp introduces more parameters than CoOp, namely the Meta-Net, one might question whether the improvements simply come from an increased learning capacity. To clear the doubt, we remove the Meta-Net part and increase the number of context tokens in CoOp to the maximum such that CoOp's and CoCoOp's sizes are similar. The results in Table 5 show that increasing the parameter size is not the key.

5. Limitations

The first limitation concerns training efficiency: CoCoOp is slow to train and would consume a significant amount of GPU memory if the batch size were set larger than one. The reason is that CoCoOp is based on an instance-conditional design that requires, for each image, an independent forward pass of instance-specific prompts through the text encoder. This is much less efficient than CoOp, which only needs a single forward pass of prompts through the text encoder for an entire mini-batch of any size.

The second limitation is that on 7 out of the 11 datasets (see Table 1), CoCoOp's performance in unseen classes still lags behind CLIP's, indicating that more effort is needed from the community to fully close or overturn the gaps between manual and learning-based prompts.
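As a back-of-envelope check, the parameter counts in Table 5 and the efficiency gap described above can be reproduced under the assumption of 512-dimensional context vectors and image features (ViT-B/16) and bias-free Meta-Net layers; the batch size and class count below are purely illustrative.

```python
# Reproduce the "# params" column of Table 5 and contrast the per-step text-encoder
# work of CoOp vs CoCoOp. Assumes 512-d context vectors / image features (ViT-B/16)
# and a bias-free Linear-ReLU-Linear Meta-Net with a 16x bottleneck (our assumptions).
dim = 512

coop_ctx4 = 4 * dim                                # 2,048 learnable values
coop_ctx60 = 60 * dim                              # 30,720
meta_net = dim * (dim // 16) + (dim // 16) * dim   # 512*32 + 32*512 = 32,768
cocoop = 4 * dim + meta_net                        # 34,816
print(coop_ctx4, coop_ctx60, cocoop)

# Text-encoder forward passes per training step (K classes, illustrative batch size B):
K, B = 1000, 32
print("CoOp prompts encoded per step:  ", K)       # prompts shared across the batch
print("CoCoOp prompts encoded per step:", B * K)   # instance-specific prompts
```

The second print pair illustrates why the authors train CoCoOp with a batch size of 1: the number of prompts pushed through the text encoder grows with the batch size.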
Table 6. Domain generalization results on DOSCO-2k, a recently proposed benchmark focusing on broader contextual domain shift.
Among the three approaches, CoOp and its follow-up, CoCoOp, contain learnable components while CLIP here denotes the zero-shot
model. Both CoOp and CoCoOp use four learnable context tokens initialized with the word embeddings of “a photo of a”. Bold denotes
the best performance on each dataset for a specific architecture.

P-Air P-Cars P-Ctech P-Ins P-Mam P-Pets P-UCF Avg


ResNet-50
CLIP 16.1 56.1 86.7 62.7 59.7 84.0 60.6 60.9
CoOp 22.1 60.7 89.4 66.3 61.6 83.8 69.2 64.7
CoCoOp 20.1 59.8 90.4 67.9 63.8 87.6 69.1 65.5
ResNet-101
CLIP 17.5 63.2 89.5 62.4 62.2 84.2 61.3 62.9
CoOp 24.6 68.2 92.0 68.3 65.4 88.2 72.7 68.5
CoCoOp 22.5 65.2 93.3 69.9 67.5 88.6 71.5 68.4
ViT-B/32
CLIP 18.2 60.1 91.6 61.3 61.8 85.5 61.3 62.8
CoOp 24.0 63.0 93.6 67.3 65.7 88.5 74.5 68.1
CoCoOp 19.5 60.4 93.8 69.8 67.3 88.5 72.7 67.4
ViT-B/16
CLIP 24.4 64.9 92.6 67.5 67.9 87.4 66.1 67.2
CoOp 32.4 72.4 94.7 73.2 72.1 90.1 78.2 73.3
CoCoOp 30.4 68.7 94.8 73.5 73.6 91.6 76.3 72.7

6. Discussion and Conclusion

Our research addresses an important issue that arises with the availability of large pre-trained AI models, i.e., how to adapt them to downstream applications. These models, also called foundation models [1], have received increasing attention from academia and industry in both the vision and NLP communities because they are so powerful in terms of their capabilities for diverse downstream tasks. However, foundation models are costly to pre-train in terms of data scale and compute resources, and they typically contain an enormous number of parameters in order to develop sufficient capacity. For instance, the CLIP model [40] based on ViT-B/16 used in our experiments has a whopping 150M parameters. These factors together highlight the need for research on efficient adaptation methods for democratizing foundation models.

Our studies, which follow the line of parameter-efficient prompt learning [63], provide timely insights into the generalizability issue of static prompts and, more importantly, demonstrate that a simple design based on conditional prompt learning performs superbly in a variety of problem scenarios, including generalization from base to new classes, cross-dataset prompt transfer, and domain generalization.

In terms of future work, one direction is to further develop conditional prompt learning with a potentially more efficient implementation that can accelerate training as well as enhance generalizability. The cross-dataset transfer experiments indicate that instance-conditional prompts are more transferable than static prompts across tasks of varying natures. Therefore, it would be interesting to see if such an idea can scale to, e.g., a bigger Meta-Net, larger-scale training images, and even heterogeneous training data mixed from different datasets.

Acknowledgements  This work is supported by NTU NAP, and under the RIE2020 Industry Alignment Fund – Industry Collaboration Projects (IAF-ICP) Funding Initiative, as well as cash and in-kind contribution from the industry partner(s).

Appendix

A. Results on DOSCO-2k

DOSCO-2k  The DOSCO (DOmain Shift in COntext) benchmark [64] contains 7 image recognition datasets, which cover a wide range of classification problems, such as generic object recognition, fine-grained recognition of aircraft models, and action recognition. Unlike existing domain generalization datasets, where the domain labels are manually defined and often limited to image style variations, DOSCO-2k focuses on broader contextual domain shift, which is automatically detected by a neural network pre-trained on the Places dataset [61]. Following Zhou et al. [64], we use the 2k version, where the training and validation splits in each dataset have 2,000 images in total (1,600 for training and 400 for validation).

Results  We study three methods' domain generalization performance on DOSCO-2k: CLIP, CoOp and CoCoOp. All models are trained on the training set, and the checkpoints with the best validation performance are used for final testing in unseen domains. Table 6 shows the results for four different architectures. It is clear that the two learnable methods outperform the zero-shot method by a large margin, despite having only a small number of parameters to tune. CoCoOp beats CoOp on 4 out of 7 datasets, but CoOp's average performance is higher. In summary, the results suggest that efficient adaptation methods like CoOp and CoCoOp have great potential in tackling transfer learning problems.

References

[1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
[3] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[5] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
[11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR-W, 2004.
[12] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. NeurIPS, 2013.
[13] Andreas Fürst, Elisabeth Rumetshofer, Viet Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, David Kreil, Michael Kopp, Günter Klambauer, Angela Bitto-Nemling, et al. Cloob: Modern hopfield networks with infoloob outperform clip. arXiv preprint arXiv:2110.11316, 2021.
[14] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[15] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
[16] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, and CV Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR, 2017.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[20] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020.
[21] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021.
[22] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
[23] Dat Huynh and Ehsan Elhamifar. Fine-grained generalized zero-shot learning via dense attribute-based attention. In CVPR, 2020.
[24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[25] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? ACL, 2020.
[26] Armand Joulin, Laurens Van Der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
[27] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478, 2021.
[28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV-W, 2013.
[29] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[30] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[31] Ang Li, Allan Jabri, Armand Joulin, and Laurens van der Maaten. Learning visual n-grams from web data. In ICCV, 2017.
[32] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[33] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
[34] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[36] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[37] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[38] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
[39] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? In EMNLP, 2019.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[42] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, 2022.
[43] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019.
[44] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP, 2020.
[45] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D Manning, and Andrew Y Ng. Zero-shot learning through cross-modal transfer. In NeurIPS, 2013.
[46] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[47] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In NeurIPS, 2020.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[49] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[50] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
[51] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shot learning: Settings, methods, and applications. TIST, 2019.
[52] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, 2018.
[53] Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903, 2021.
[54] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
[55] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[56] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
[57] Kai Yi, Xiaoqian Shen, Yunhao Gou, and Mohamed Elhoseiny. Exploring hierarchical graph representation for large-scale zero-shot image classification. arXiv preprint arXiv:2203.01386, 2022.
[58] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. arXiv preprint arXiv:2112.02413, 2021.
[59] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.
[60] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [mask]: Learning vs. learning to recall. In NAACL, 2021.
[61] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[62] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. arXiv preprint arXiv:2103.02503, 2021.
[63] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
[64] Kaiyang Zhou, Yuanhan Zhang, Yuhang Zang, Jingkang Yang, Chen Change Loy, and Ziwei Liu. On-device domain generalization. arXiv preprint arXiv:2209.07521, 2022.
