[Figure 1(a): base classes such as "arrival gate" and "cathedral", with the prompt used by each method. Zero-shot: "[a] [photo] [of] [a] [class]."; CoOp: "[v1] [v2] ... [vM] [class]."; CoCoOp: "[v1(x)] [v2(x)] ... [vM(x)] [class]."]
(a) Both CoOp and CoCoOp work well on the base classes observed during training and beat manual prompts by a significant margin.
[Figure 1(b): new (unseen) classes such as "wind farm" and "train railway", with the same prompt formats.]
(b) The instance-conditional prompts learned by CoCoOp are much more generalizable than CoOp to the unseen classes.
Figure 1. Motivation of our research: to learn generalizable prompts. The images are randomly selected from SUN397 [55], which is
a widely-used scene recognition dataset.
text words in a prompt into a set of learnable vectors, taking advantage of the differentiable nature of neural networks. With only a few labeled images for learning, CoOp achieves huge improvements over intensively-tuned manual prompts across a wide range of image recognition datasets.

In our study, we identify a critical problem of CoOp: the learned context is not generalizable to wider unseen classes within the same task. Figure 1 illustrates the problem: the context learned by CoOp works well in distinguishing the base classes like "arrival gate" and "cathedral" but suffers a significant drop in accuracy when it is transferred to the new (unseen) classes, such as "wind farm" and "train railway", even though the task's nature remains the same, i.e., recognizing scenes. The results suggest that the learned context overfits the base classes, thus failing to capture more generalizable elements that are vital for broader scene recognition. We argue that such a problem is caused by CoOp's static design: the context, which is fixed once learned, is optimized only for a specific set of (training) classes. On the contrary, the manually-designed prompts adopted by the zero-shot method are relatively generalizable.

To address the weak generalizability problem, we introduce a novel concept: conditional prompt learning. The key idea is to make a prompt conditioned on each input instance (image) rather than fixed once learned. To make the model parameter-efficient, we introduce a simple yet effective implementation of conditional prompt learning. Specifically, we extend CoOp by further learning a lightweight neural network to generate for each image an input-conditional token (vector), which is combined with the learnable context vectors. We call our approach Conditional Context Optimization (CoCoOp).² An overview is shown in Figure 2. Interestingly, the paradigm of CoCoOp is analogous to image captioning [49], which explains why instance-conditional prompts are more generalizable: they are optimized to characterize each instance (more robust to class shift) rather than to serve only for some specific classes.

We present comprehensive experiments on 11 datasets covering a diverse set of visual recognition tasks. Specifically, we design a base-to-new generalization setting where a model is first learned using base classes and then tested on completely new classes. Compared with the zero-shot method [40] and CoOp [63], our approach achieves the best overall performance (Table 1). Importantly, CoCoOp gains significant improvements over CoOp in unseen classes (Figure 3(a)), allowing the gap between manual and learning-based prompts to be substantially reduced.

In a more challenging scenario where the context learned for one task is directly transferred to other tasks with drastically different classes, CoCoOp still beats CoOp with a clear margin (Table 2), suggesting that instance-conditional prompts are more transferable and have the potential to succeed at larger scale. CoCoOp also obtains stronger domain generalization performance than CoOp (Table 3), further justifying the strengths of dynamic prompts.

In summary, our research provides timely insights into the generalizability problem in prompt learning, and crucially, demonstrates the effectiveness of a simple idea in various problem scenarios. We hope our approach and the findings presented in this work can pave the way for future research in generalizable and transferable prompt learning.

² Pronounced as /kəʊkuːp/.

2. Related Work

Vision-Language Models We mainly review studies focused on aligning images and texts to learn a joint embedding space [24, 40, 59].
[Figure 2. Overview of Conditional Context Optimization (CoCoOp): learnable context tokens v1, v2, ..., vM are combined with a meta token π that a lightweight Meta-Net generates from the Image Encoder's output; see Section 3.2 for details.]

The idea of cross-modality alignment is certainly not new and has been investigated since

A typical vision-language model consists of three key elements: two for image and text encoding while the third is related to the design of loss functions. In early days, models for processing images and texts are often designed and also learned independently, with their outputs connected by
3.1. Reviews of CLIP and CoOp

Contrastive Language-Image Pre-training, known as CLIP [41], has well demonstrated the potential of learning open-set visual concepts. CLIP is built using two encoders, one for image and the other for text, as shown in Figure 2. The image encoder can be either a ResNet [18] or a ViT [9], which is used to transform an image into a feature vector. The text encoder is a Transformer [48], which takes as input a sequence of word tokens and again produces a vectorized representation.

During training, CLIP adopts a contrastive loss to learn a joint embedding space for the two modalities. Specifically, for a mini-batch of image-text pairs, CLIP maximizes for each image the cosine similarity with the matched text while minimizing the cosine similarities with all other unmatched texts, and the loss is computed in a similar fashion for each text too. After training, CLIP can be used for zero-shot image recognition. Let x be image features generated by the image encoder and {w_i}_{i=1}^{K} a set of weight vectors produced by the text encoder, each representing a category (suppose there are K categories in total). In particular, each w_i is derived from a prompt, such as "a photo of a {class}" where the "{class}" token is filled with the i-th class name. The prediction probability is then

    p(y|x) = exp(sim(x, w_y) / τ) / Σ_{i=1}^{K} exp(sim(x, w_i) / τ),    (1)

where sim(·, ·) denotes cosine similarity and τ is a learned temperature parameter.

Context Optimization (CoOp) aims to overcome the inefficiency problem in prompt engineering for better adapting pre-trained vision-language models to downstream applications [63]. The key idea in CoOp is to model each context token using a continuous vector that can be end-to-end learned from data. Concretely, instead of using "a photo of a" as the context, CoOp introduces M learnable context vectors, {v_1, v_2, ..., v_M}, each having the same dimension as the word embeddings. The prompt for the i-th class, denoted by t_i, now becomes t_i = {v_1, v_2, ..., v_M, c_i} where c_i is the word embedding(s) for the class name. The context vectors are shared among all classes.³ Let g(·) denote the text encoder; the prediction probability is then

    p(y|x) = exp(sim(x, g(t_y)) / τ) / Σ_{i=1}^{K} exp(sim(x, g(t_i)) / τ).    (2)

To adapt CLIP to a downstream image recognition dataset, a cross-entropy loss can be used as the learning objective. Since the text encoder g(·) is differentiable, gradients can be propagated all the way back to update the context vectors. Note that the base model of CLIP is frozen in the entire training process (ours too).

3.2. CoCoOp: Conditional Context Optimization

CoOp is a data-efficient approach allowing the context vectors to be trained with only a few labeled images in a downstream dataset. However, as discussed, CoOp is not generalizable to wider unseen classes within the same task. We argue that instance-conditional context can generalize better because it shifts the focus away from a specific set of classes (for reducing overfitting) to each input instance, and hence to the entire task.

A straightforward way to implement CoCoOp is to build M neural networks to get M context tokens. However, such a design would require M × the size of a neural network, which is much larger than having M context vectors as in CoOp. Here we propose a parameter-efficient design that works very well in practice. Specifically, on top of the M context vectors, we further learn a lightweight neural network, called Meta-Net, to generate for each input a conditional token (vector), which is then combined with the context vectors. See Figure 2 for a sketch of the architecture.

Let h_θ(·) denote the Meta-Net parameterized by θ; each context token is now obtained by v_m(x) = v_m + π, where π = h_θ(x) and m ∈ {1, 2, ..., M}. The prompt for the i-th class is thus conditioned on the input, i.e., t_i(x) = {v_1(x), v_2(x), ..., v_M(x), c_i}. The prediction probability is computed as

    p(y|x) = exp(sim(x, g(t_y(x))) / τ) / Σ_{i=1}^{K} exp(sim(x, g(t_i(x))) / τ).    (3)

During training, we update the context vectors {v_m}_{m=1}^{M} together with the Meta-Net's parameters θ. In this work, the Meta-Net is built with a two-layer bottleneck structure (Linear-ReLU-Linear), with the hidden layer reducing the input dimension by 16×. The input to the Meta-Net is simply the output features produced by the image encoder. We leave exploration of more advanced designs for future work.

³ CoOp has an alternative version that learns class-specific context, which is not considered here because it is not straightforward to transfer class-specific context to unseen classes.
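To make Section 3.2 concrete, the following PyTorch-style sketch (a simplified illustration, not the released implementation) shows the Meta-Net, the instance-conditional context v_m(x) = v_m + π, and the similarity-based prediction of Eq. (3); Eq. (2) corresponds to dropping the Meta-Net term and Eq. (1) to using fixed, manually prompted text features. The text-encoder interface, feature dimension, and logit scale are assumptions made purely for illustration.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class MetaNet(nn.Module):
    """Two-layer bottleneck (Linear-ReLU-Linear); the hidden layer reduces
    the input dimension by 16x, as described in Section 3.2."""

    def __init__(self, feat_dim=512, ctx_dim=512):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, feat_dim // 16),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim // 16, ctx_dim),
        )

    def forward(self, image_feat):              # (B, feat_dim)
        return self.net(image_feat)             # meta token pi = h_theta(x), (B, ctx_dim)


def cocoop_logits(image_feat, ctx, class_embeds, meta_net, text_encoder, logit_scale=100.0):
    """Eq. (3): logits are scaled cosine similarities between an image feature
    and per-class text features computed from instance-conditional prompts.

    image_feat:   (B, D) image-encoder outputs (frozen CLIP)
    ctx:          (M, D) shared learnable context vectors {v_m}
    class_embeds: (K, D) class-name word embeddings (one token per class here)
    text_encoder: stand-in for the frozen text encoder g(.)
    logit_scale:  plays the role of 1/tau (placeholder value)
    """
    B = image_feat.size(0)
    K = class_embeds.size(0)

    pi = meta_net(image_feat)                                # (B, D)
    ctx_cond = ctx.unsqueeze(0) + pi.unsqueeze(1)            # v_m(x) = v_m + pi, (B, M, D)

    # t_i(x) = {v_1(x), ..., v_M(x), c_i}: one prompt per class, per image.
    prompts = torch.cat(
        [ctx_cond.unsqueeze(1).expand(-1, K, -1, -1),                      # (B, K, M, D)
         class_embeds.unsqueeze(0).unsqueeze(2).expand(B, -1, -1, -1)],    # (B, K, 1, D)
        dim=2,
    )                                                        # (B, K, M+1, D)

    # Each image needs its own pass of K prompts through g(.), which is why
    # CoCoOp is slower than CoOp (see Section 5).
    text_feats = text_encoder(prompts.flatten(0, 1)).view(B, K, -1)        # (B, K, D)

    x = F.normalize(image_feat, dim=-1)
    w = F.normalize(text_feats, dim=-1)
    sims = torch.einsum("bd,bkd->bk", x, w)                  # cosine similarities
    return logit_scale * sims                                # softmax/cross-entropy on top


# Example wiring with dummy stand-ins (shapes only):
ctx = nn.Parameter(0.02 * torch.randn(4, 512))               # M = 4 context vectors
meta = MetaNet(512, 512)
dummy_g = lambda tok: tok.mean(dim=1)                        # placeholder for the frozen text encoder
logits = cocoop_logits(torch.randn(2, 512), ctx, torch.randn(100, 512), meta, dummy_g)
```

In such a sketch, only ctx and the Meta-Net parameters would receive gradients; the CLIP encoders remain frozen, and the per-image forward pass of K prompts through the text encoder is what makes CoCoOp slower and more memory-hungry than CoOp (Section 5).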
4. Experiments

Our approach is mainly evaluated in the following three problem settings: 1) generalization from base to new classes within a dataset (Section 4.1); 2) cross-dataset transfer (Section 4.2); 3) domain generalization (Section 4.3). All models used in our experiments are based on the open-source CLIP [40].⁴ Before discussing the results, we provide the details of the experimental setup below.

⁴ https://github.com/openai/CLIP.

Table 1. Comparison of CLIP, CoOp and CoCoOp in the base-to-new generalization setting. For the learning-based methods (CoOp and CoCoOp), the prompts are learned from the base classes (16 shots). The results clearly justify the strong generalizability of conditional prompt learning. H: harmonic mean (to highlight the generalization trade-off [54]).
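The harmonic mean H in Table 1 balances base and new accuracy; as a worked example, plugging in CoOp's average base and new accuracies quoted in Section 4.1:

```latex
H = \frac{2 \cdot \mathrm{base} \cdot \mathrm{new}}{\mathrm{base} + \mathrm{new}},
\qquad \text{e.g. } \frac{2 \times 82.69 \times 63.22}{82.69 + 63.22} \approx 71.66.
```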
Datasets For the first two settings, i.e., base-to-new generalization and cross-dataset transfer, we use the 11 image recognition datasets as in Zhou et al. [63], which cover a diverse set of recognition tasks. Specifically, the benchmark includes ImageNet [6] and Caltech101 [11] for classification on generic objects; OxfordPets [38], StanfordCars [28], Flowers102 [36], Food101 [2] and FGVCAircraft [35] for fine-grained classification; SUN397 [55] for scene recognition; UCF101 [46] for action recognition; DTD [5] for texture classification; and finally EuroSAT [19] for satellite imagery recognition. For domain generalization experiments, we use ImageNet as the source dataset and four other variants of ImageNet that contain different types of domain shift as the target datasets, namely ImageNetV2 [43], ImageNet-Sketch [50], ImageNet-A [22] and ImageNet-R [21].

Following Zhou et al. [63], we randomly sample for each dataset a few-shot training set while using the original test set for testing. We only evaluate the highest shot number studied in Zhou et al. [63], i.e., 16 shots, which is sufficient to justify our approach. For learning-based models, the results are averaged over three runs.

Baselines The direct rival to our approach is CoOp [63], which essentially learns static prompts (in comparison to our dynamic prompts). The zero-shot method, i.e., CLIP [40], is also compared, which is based on manual prompts. It is worth mentioning that the manual prompt for each dataset was intensively tuned using all classes in the test data [40].

Training Details Our implementation is based on CoOp's code.⁵ Throughout the experiments, we use the best available vision backbone in CLIP, i.e., ViT-B/16. Zhou et al. [63] have suggested that a shorter context length and a good initialization can lead to better performance and stronger robustness to domain shift. Therefore, we fix the context length to 4 and initialize the context vectors using the pre-trained word embeddings of "a photo of a" for both CoOp and CoCoOp. Due to the instance-conditional design, our approach is slow to train and consumes much more GPU memory than CoOp. Therefore, to ensure the model can fit into a GPU and meanwhile reduce the training time, we train CoCoOp with a batch size of 1 for 10 epochs. Such a limitation is discussed in more detail in Section 5.

⁵ https://github.com/KaiyangZhou/CoOp.
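For convenience, the setup described above can be summarized as the following illustrative configuration; the key names are placeholders assembled from the text, not options from the released code.

```python
# Illustrative summary of the experimental setup described in Section 4
# (key names are placeholders, not flags of the released CoOp/CoCoOp code).
EXPERIMENT_CONFIG = {
    "backbone": "ViT-B/16",       # best available CLIP vision backbone
    "n_ctx": 4,                   # context length fixed to 4 tokens
    "ctx_init": "a photo of a",   # context initialized from these word embeddings
    "shots": 16,                  # 16 labeled images per base class
    "batch_size": 1,              # required by the instance-conditional design (GPU memory)
    "epochs": 10,
    "num_runs": 3,                # results averaged over three runs
}
```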
Figure 3. Comprehensive comparisons of CoCoOp and CoOp in the base-to-new generalization setting. (a) CoCoOp is able to gain
consistent improvements over CoOp in unseen classes on all datasets. (b) CoCoOp’s declines in base accuracy are mostly under 3%, which
are far outweighed by the gains in generalization.
4.1. Generalization From Base to New Classes

Solving the weak generalizability problem of CoOp is the main focus in this research. On each of the 11 datasets, we split the classes equally into two groups, one as base classes and the other as new classes. Learning-based models, i.e., CoOp and CoCoOp, are trained using only the base classes while evaluation is conducted on the base and new classes separately to test generalizability. The detailed results are shown in Table 1.

Failures of CoOp in Unseen Classes The split does not guarantee that the two class groups are equally difficult, as evidenced in CLIP's bumpy results: the base and new accuracy numbers are dramatically different.⁶ Nonetheless, CoOp's new accuracy is consistently much weaker than the base accuracy on nearly all datasets, leaving a huge gap of almost 20% on average (82.69% vs 63.22%). Despite maintaining an advantage over CLIP in terms of average performance, CoOp's gains in the base classes are nearly zeroed out by the catastrophic failures in the new classes, highlighting the need to improve generalizability for learning-based prompts.

CoCoOp Significantly Narrows the Generalization Gap As shown in Table 1(a), CoCoOp improves the accuracy in unseen classes from 63.22% to 71.69%, which largely reduces the gap with manual prompts. The results confirm that instance-conditional prompts are more generalizable. A more detailed breakdown of per-dataset improvement is visualized in Figure 3(a), where we observe increases in accuracy of more than 10% on 5 out of 11 datasets. Notably, on the challenging ImageNet dataset, CoCoOp's surge from 67.88% to 70.43% represents non-trivial progress (the 70.43% accuracy even surpasses CLIP's 68.14%).

CoCoOp's Gains in Generalization Far Outweigh Losses in Base Accuracy In comparison to CoOp, performance drops in the base classes occur for CoCoOp on most datasets (see Figure 3(b)). This is reasonable because CoOp optimizes specifically for base classes whereas CoCoOp optimizes for each instance in order to gain more generalization over an entire task. But it is worth noting that on the 9 datasets where CoCoOp's base accuracy drops below CoOp's, most losses are under 3% (precisely on 6 out of 9 datasets), which are far outweighed by the gains in unseen classes shown in Figure 3(a); even for those where CoCoOp suffers the biggest losses, the boosts in generalization are mostly significant enough to turn the averages into positives, e.g., StanfordCars sees the worst base accuracy drop of -7.63% but has the third-highest accuracy gain of +13.19% in the new classes, which together bring a 5.56% positive improvement for CoCoOp.

CoCoOp Is More Compelling Than CLIP When taking into account both the base and new classes, CoCoOp shows a gain of more than 4% over CLIP (75.83% vs 71.70%), suggesting that instance-conditional prompts have a better potential in capturing more generalizable elements that are relevant for a recognition task. Theoretically, learning-based prompts have a much higher risk of overfitting base classes than manual prompts. Therefore, CLIP is a strong competitor to beat in unseen classes.

⁶ For convenience, we refer to base accuracy as the performance in base classes; and similarly for new accuracy.
Table 2. Comparison of prompt learning methods in the cross-dataset transfer setting. Prompts applied to the 10 target datasets are
learned from ImageNet (16 images per class). Clearly, CoCoOp demonstrates better transferability than CoOp. ∆ denotes CoCoOp’s gain
over CoOp.
              Source     Target
              ImageNet   Caltech101  OxfordPets  StanfordCars  Flowers102  Food101  FGVCAircraft  SUN397  DTD    EuroSAT  UCF101  Average
CoOp [63]     71.51      93.70       89.14       64.51         68.71       85.30    18.47         64.15   41.92  46.39    66.55   63.88
CoCoOp        71.02      94.43       90.14       65.32         71.88       86.06    22.94         67.36   45.73  45.37    68.21   65.74
∆             -0.49      +0.73       +1.00       +0.81         +3.17       +0.76    +4.47         +3.21   +3.81  -1.02    +1.66   +1.86
Table 3. Comparison of manual and learning-based prompts in domain generalization. CoOp and CoCoOp use as training data 16
images from each of the 1,000 classes on ImageNet. In general, CoCoOp is more domain-generalizable than CoOp.
            Learnable?  ImageNet (source)  ImageNetV2  ImageNet-Sketch  ImageNet-A  ImageNet-R
CLIP [40]               66.73              60.83       46.15            47.77       73.96
CoOp [63]   ✓           71.51              64.20       47.99            49.71       75.21
CoCoOp      ✓           71.02              64.07       48.75            50.63       76.18
Different from CoOp, we obtain promising results for CoCoOp: the new accuracy is even better than CLIP's on 4 out of 11 datasets (i.e., ImageNet, OxfordPets, Food101 and SUN397) and not too far away from CLIP's on the rest except FGVCAircraft, where the gap between manual and learning-based prompts is generally large. In the ablation study on context length, we find that FGVCAircraft benefits from longer context, which is aligned with the findings in Zhou et al. [63]. To close or even overturn the gaps between manual and learning-based prompts in unseen classes, more efforts are required, and we hope the insights presented in this research can help the community tackle the generalizability issue in prompt learning.

4.2. Cross-Dataset Transfer

dog breeds, it is reasonable to see high accuracy for both models on the relevant target datasets including Caltech101 and OxfordPets.

By comparison, the performance on other datasets with distant and more fine-grained or specialized categories is much lower, such as FGVCAircraft and DTD (containing various textures) where the accuracy numbers are well below 50%. Nonetheless, CoCoOp exhibits much stronger transferability than CoOp on the two mentioned datasets as well as on most other fine-grained or specialized datasets.

4.3. Domain Generalization
Table 4. Recognition accuracy (average over 11 datasets) on a combination of base and new classes. The learnable models only have access to training data from base classes.

            Learnable?  Accuracy
CLIP [40]               65.22
CoOp [63]   ✓           65.55
CoCoOp      ✓           69.13

Table 5. CoCoOp (last row) vs a bigger CoOp on ImageNet.

Model                     # params  Base   New    H
CoOp (ctx=4)              2,048     76.47  67.88  71.92
CoOp (ctx=60)             30,720    76.16  65.34  70.34
CoOp (ctx=4) + Meta-Net   34,816    75.98  70.43  73.10

Figure 4. Ablation studies. (a) Ablation on initialization. (b) Ablation on context length.

4.4. Further Analysis

Class-Incremental Test We consider a practical problem scenario where the recognition targets originally composed of base classes are expanded to include completely new classes. This problem is relevant to the existing continual learning literature [37] but different in that the model here does not have access to any training data from new classes and needs to perform zero-shot recognition on them. We compare CLIP, CoOp and CoCoOp using the 11 datasets. The average results are reported in Table 4. Clearly, CoOp loses competitiveness against CLIP as their performance is similar but the former needs training data. Again, CoCoOp beats the two competitors with a significant margin.

Initialization To understand the impact of initialization, we conduct an ablation study by comparing word embeddings-based initialization and random initialization while keeping all other parameters identical. For random initialization, we follow Zhou et al. [63] to sample from a zero-mean Gaussian distribution with 0.02 standard deviation. Figure 4(a) shows the base-to-new generalization results averaged over the 11 datasets, which suggest that a proper initialization is more beneficial to both the base and new classes. Note that the findings from Figure 4 only represent the overall trend, while each individual dataset might have a different result.

Context Length The ablation study on context length is also carried out in the base-to-new generalization setting. Following Zhou et al. [63], we study 4, 8 and 16 context tokens. For fair comparison, we use random initialization for all context tokens. Figure 4(b) summarizes the results on the 11 datasets. The differences in the base classes are fairly small, whereas in the new classes the models with a longer context length clearly perform better. From Figure 4(a) and (b) we observe that using 8 randomly initialized context tokens is marginally better than using 4 properly initialized context tokens, suggesting that a further boost might be possible if we initialize 8 context tokens with word embeddings.

CoCoOp vs a Bigger CoOp Since CoCoOp introduces more parameters than CoOp, namely the Meta-Net, one might question if the improvements simply come from an increased learning capacity. To clear the doubt, we remove the Meta-Net part and increase the number of context tokens in CoOp to the maximum such that CoOp's and CoCoOp's sizes are similar (with a 512-dimensional context, 4 and 60 tokens correspond to 2,048 and 30,720 parameters, while the two Meta-Net linear layers account for the additional 512 × 32 + 32 × 512 = 32,768, giving 34,816 in Table 5). The results in Table 5 show that increasing the parameter size is not the key.

5. Limitations

The first limitation is about training efficiency: CoCoOp is slow to train and would consume a significant amount of GPU memory if the batch size is set larger than one. The reason is that CoCoOp is based on an instance-conditional design that requires, for each image, an independent forward pass of instance-specific prompts through the text encoder. This is much less efficient than CoOp, which only needs a single forward pass of prompts through the text encoder for an entire mini-batch of any size.

The second limitation is that on 7 out of the 11 datasets (see Table 1), CoCoOp's performance in unseen classes still lags behind CLIP's, indicating that more efforts are needed from the community to fully close or overturn the gaps between manual and learning-based prompts.

6. Discussion and Conclusion

Our research addresses an important issue that arises with the availability of large pre-trained AI models, i.e., how to adapt them to downstream applications. These models, also called foundation models [1], have received increasing attention from academia and industry in both the vision and NLP communities because they are so powerful in terms of their capabilities for diverse downstream tasks. However, foundation models are costly to pre-train in terms of data scale and compute resources; and typically contain
Table 6. Domain generalization results on DOSCO-2k, a recently proposed benchmark focusing on broader contextual domain shift.
Among the three approaches, CoOp and its follow-up, CoCoOp, contain learnable components while CLIP here denotes the zero-shot
model. Both CoOp and CoCoOp use four learnable context tokens initialized with the word embeddings of “a photo of a”. Bold denotes
the best performance on each dataset for a specific architecture.
References

[1] Rishi Bommasani, Drew A Hudson, Ehsan Adeli, Russ Altman, Simran Arora, Sydney von Arx, Michael S Bernstein, Jeannette Bohg, Antoine Bosselut, Emma Brunskill, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021.
[2] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, 2014.
[3] Wei-Lun Chao, Soravit Changpinyo, Boqing Gong, and Fei Sha. An empirical study and analysis of generalized zero-shot learning for object recognition in the wild. In ECCV, 2016.
[4] Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In ICML, 2020.
[5] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, 2014.
[6] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, 2009.
[7] Karan Desai and Justin Johnson. Virtex: Learning visual representations from textual annotations. In CVPR, 2021.
[8] Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In NAACL, 2019.
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021.
[10] Mohamed Elhoseiny, Babak Saleh, and Ahmed Elgammal. Write a classifier: Zero-shot learning using purely textual descriptions. In ICCV, 2013.
[11] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR-W, 2004.
[12] Andrea Frome, Greg S Corrado, Jon Shlens, Samy Bengio, Jeff Dean, Marc'Aurelio Ranzato, and Tomas Mikolov. Devise: A deep visual-semantic embedding model. In NeurIPS, 2013.
[13] Andreas Fürst, Elisabeth Rumetshofer, Viet Tran, Hubert Ramsauer, Fei Tang, Johannes Lehner, David Kreil, Michael Kopp, Günter Klambauer, Angela Bitto-Nemling, et al. Cloob: Modern hopfield networks with infoloob outperform clip. arXiv preprint arXiv:2110.11316, 2021.
[14] Peng Gao, Shijie Geng, Renrui Zhang, Teli Ma, Rongyao Fang, Yongfeng Zhang, Hongsheng Li, and Yu Qiao. Clip-adapter: Better vision-language models with feature adapters. arXiv preprint arXiv:2110.04544, 2021.
[15] Tianyu Gao, Adam Fisch, and Danqi Chen. Making pre-trained language models better few-shot learners. arXiv preprint arXiv:2012.15723, 2020.
[16] Lluis Gomez, Yash Patel, Marçal Rusiñol, Dimosthenis Karatzas, and CV Jawahar. Self-supervised learning of visual features through embedding images into text topic spaces. In CVPR, 2017.
[17] Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In CVPR, 2020.
[18] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, 2016.
[19] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.
[20] Olivier J. Hénaff, Aravind Srinivas, Jeffrey De Fauw, Ali Razavi, Carl Doersch, S. M. Ali Eslami, and Aäron van den Oord. Data-efficient image recognition with contrastive predictive coding. In ICML, 2020.
[21] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. In ICCV, 2021.
[22] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adversarial examples. In CVPR, 2021.
[23] Dat Huynh and Ehsan Elhamifar. Fine-grained generalized zero-shot learning via dense attribute-based attention. In CVPR, 2020.
[24] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc V Le, Yunhsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, 2021.
[25] Zhengbao Jiang, Frank F Xu, Jun Araki, and Graham Neubig. How can we know what language models know? In ACL, 2020.
[26] Armand Joulin, Laurens Van Der Maaten, Allan Jabri, and Nicolas Vasilache. Learning visual features from large weakly supervised data. In ECCV, 2016.
[27] Chen Ju, Tengda Han, Kunhao Zheng, Ya Zhang, and Weidi Xie. Prompting visual-language models for efficient video understanding. arXiv preprint arXiv:2112.04478, 2021.
[28] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV-W, 2013.
[29] Jimmy Lei Ba, Kevin Swersky, Sanja Fidler, et al. Predicting deep zero-shot convolutional neural networks using textual descriptions. In ICCV, 2015.
[30] Brian Lester, Rami Al-Rfou, and Noah Constant. The power of scale for parameter-efficient prompt tuning. arXiv preprint arXiv:2104.08691, 2021.
[31] Ang Li, Allan Jabri, Armand Joulin, and Laurens van der Maaten. Learning visual n-grams from web data. In ICCV, 2017.
[32] Xiang Lisa Li and Percy Liang. Prefix-tuning: Optimizing continuous prompts for generation. arXiv preprint arXiv:2101.00190, 2021.
[33] Yangguang Li, Feng Liang, Lichen Zhao, Yufeng Cui, Wanli Ouyang, Jing Shao, Fengwei Yu, and Junjie Yan. Supervision exists everywhere: A data efficient contrastive language-image pre-training paradigm. arXiv preprint arXiv:2110.05208, 2021.
[34] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. arXiv preprint arXiv:2107.13586, 2021.
[35] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013.
[36] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In ICVGIP, 2008.
[37] German I Parisi, Ronald Kemker, Jose L Part, Christopher Kanan, and Stefan Wermter. Continual lifelong learning with neural networks: A review. Neural Networks, 2019.
[38] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, 2012.
[39] Fabio Petroni, Tim Rocktäschel, Patrick Lewis, Anton Bakhtin, Yuxiang Wu, Alexander H Miller, and Sebastian Riedel. Language models as knowledge bases? In EMNLP, 2019.
[40] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, 2021.
[41] Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 2019.
[42] Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In CVPR, 2022.
[43] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In ICML, 2019.
[44] Taylor Shin, Yasaman Razeghi, Robert L Logan IV, Eric Wallace, and Sameer Singh. Autoprompt: Eliciting knowledge from language models with automatically generated prompts. In EMNLP, 2020.
[45] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D Manning, and Andrew Y Ng. Zero-shot learning through cross-modal transfer. In NeurIPS, 2013.
[46] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.
[47] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In NeurIPS, 2020.
[48] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
[49] Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. In CVPR, 2015.
[50] Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. In NeurIPS, 2019.
[51] Wei Wang, Vincent W Zheng, Han Yu, and Chunyan Miao. A survey of zero-shot learning: Settings, methods, and applications. TIST, 2019.
[52] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, 2018.
[53] Mitchell Wortsman, Gabriel Ilharco, Mike Li, Jong Wook Kim, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, and Ludwig Schmidt. Robust fine-tuning of zero-shot models. arXiv preprint arXiv:2109.01903, 2021.
[54] Yongqin Xian, Bernt Schiele, and Zeynep Akata. Zero-shot learning - the good, the bad and the ugly. In CVPR, 2017.
[55] Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In CVPR, 2010.
[56] Yuan Yao, Ao Zhang, Zhengyan Zhang, Zhiyuan Liu, Tat-Seng Chua, and Maosong Sun. Cpt: Colorful prompt tuning for pre-trained vision-language models. arXiv preprint arXiv:2109.11797, 2021.
[57] Kai Yi, Xiaoqian Shen, Yunhao Gou, and Mohamed Elhoseiny. Exploring hierarchical graph representation for large-scale zero-shot image classification. arXiv preprint arXiv:2203.01386, 2022.
[58] Renrui Zhang, Ziyu Guo, Wei Zhang, Kunchang Li, Xupeng Miao, Bin Cui, Yu Qiao, Peng Gao, and Hongsheng Li. Pointclip: Point cloud understanding by clip. arXiv preprint arXiv:2112.02413, 2021.
[59] Yuhao Zhang, Hang Jiang, Yasuhide Miura, Christopher D Manning, and Curtis P Langlotz. Contrastive learning of medical visual representations from paired images and text. arXiv preprint arXiv:2010.00747, 2020.
[60] Zexuan Zhong, Dan Friedman, and Danqi Chen. Factual probing is [mask]: Learning vs. learning to recall. In NAACL, 2021.
[61] Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba. Places: A 10 million image database for scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(6):1452–1464, 2017.
[62] Kaiyang Zhou, Ziwei Liu, Yu Qiao, Tao Xiang, and Chen Change Loy. Domain generalization: A survey. arXiv preprint arXiv:2103.02503, 2021.
[63] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. arXiv preprint arXiv:2109.01134, 2021.
[64] Kaiyang Zhou, Yuanhan Zhang, Yuhang Zang, Jingkang Yang, Chen Change Loy, and Ziwei Liu. On-device domain generalization. arXiv preprint arXiv:2209.07521, 2022.