Large Language Models Are Good Prompt Learners

Abstract

Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets. Code will be made available at: https://github.com/zhaohengz/LLaMP.

Figure 1. Demonstration of LLaMP: (a) LLMs can provide visual descriptions for fine-grained object categories. Human: "In one sentence, describe the distinctive appearance of a Yak-40, a type of aircraft." LLaMA: "The Yak-40 has a unique trijet configuration with a large passenger window section and a sloping nose, along with three engines mounted on the rear of the aircraft, creating an unmistakable silhouette in the sky." Extracted noun phrases: trijet configuration, large passenger window section, sloping nose, three engines. (b) Zero-shot base-to-novel generalization benefits from the LLM knowledge (Base/Novel/HM accuracies of CLIP, CLIP+LLM, and LLaMP).
1. Introduction

Low-shot image classification tasks, including few-shot and zero-shot variants, aim to learn from a set of class names along with a limited or empty set of images. Such capacities are crucial for the extension and generalization of vision systems. Vision-Language (VL) models trained on large-scale web data, such as CLIP [32] and ALIGN [14], provide a new paradigm due to their generalization capabilities that include zero-shot classification, and have been used in recent work [17–19, 23, 24, 48, 49]. Due to the scarcity of images for training, methods built for both tasks rely heavily on category names alone as the source of class-specific knowledge, resulting in a shortage of distinguishable descriptions. Meanwhile, Large Language Models (LLMs), e.g. GPT-4 [28] and LLaMA [38, 39], have demonstrated their encyclopedic knowledge and can thus provide linguistic visual descriptions for objects. Here, we investigate how to leverage LLMs for low-shot image classification.

The emergence of prompt learning has provided an efficient way to adapt large pre-trained models. Previous work has explored various strategies to prompt vision-language (VL) models, including vision-conditioned text prompt learning [48], joint VL prompt learning [18] and self-regulated VL prompts [19]. On the text side, regardless of the learning strategy, learned prompt vectors are shared across all categories. The only difference among text inputs is the class name. In low-shot scenarios where visual data is limited, the extraction of class-specific knowledge from textual inputs becomes essential. However, the current paradigm, which relies on the CLIP text encoder to distinguish between class names, faces challenges, particularly with fine-grained target categories. For example, in FGVCAircraft [25], the class name "Yak-40" can barely provide any information for recognizing the object.

Large Language Models, trained with large text corpora, are good candidates to serve as the complement. As in Fig. 1a, when queried about the "Yak-40", the LLM generates a sentence detailing the visual appearance of the Yak-40 that can be further parsed into noun phrases and integrated into text prompts, providing richer information compared with the ordinary prompt. We also show in Fig. 1b that by simply incorporating noun phrases extracted from an LLM's response, the performance of the ordinary CLIP model is improved by more than 1% without any training. Although recent prompt-learning-based methods have shown notable improvements, it is non-trivial to apply them to textual visual descriptions generated by LLMs. Thus, instead of directly taking LLM generations as the textual input, we aim at producing class-specific representations by adapting LLMs to low-shot image classification.

One challenge of this adaptation is the domain gap between vision and language. When trained exclusively with textual corpora, the latent feature space of an LLM significantly diverges from that of its visual counterpart. Even worse, the data scarcity under the low-shot scenario makes it virtually impossible to align the two spaces through a plain contrastive loss. We argue that the CLIP text encoder, which is trained to project features from the language domain into the joint VL domain, can serve as the bridge. Thus, we propose the LLaMP framework, Large Language Models as Prompt learners, which leverages LLMs to learn informative prompts for CLIP models. In LLaMP, we treat the LLM as the prompt learner of the CLIP text encoder. More specifically, for each object category, LLaMP extracts corresponding knowledge from the LLM and yields class-specific prompt vectors, which are further combined with class-agnostic prompt embeddings (as in previous approaches) and encoded by the CLIP text encoder. We design an efficient tuning pipeline to avoid fully fine-tuning the language model while performing effective adaptation.

Following the protocol in [48, 49], we evaluate LLaMP in two typical scenarios: zero-shot base-to-novel generalization [49] and few-shot image classification. For each scenario, we run LLaMP on 11 datasets covering a spectrum of tasks. On average, LLaMP achieves a 1.3% boost on the harmonic mean against the state-of-the-art PSRC [19], and 9.6% over the ordinary CLIP [32], on base-to-novel generalization. We also observe an average improvement of 0.94% on 16-shot image classification.

In summary, our approach makes use of Large Language Models to improve performance in low-shot image classification scenarios. The main contributions are: i) To the best of our knowledge, we are the first to investigate how to use the encyclopedic knowledge inherent in Large Language Models (LLMs) to enhance low-shot image classification; ii) We design a framework, LLaMP, to effectively adapt LLMs for image classification, without training the entire language model, and achieve state-of-the-art results in both few-shot and zero-shot settings; iii) We conduct extensive analysis investigating the effectiveness of each component of LLaMP, and discuss the optimal setup for LLM-aided image classification.

2. Related Work

Large Language Models (LLMs). Recent years have witnessed remarkable progress in scaling up the size and capabilities of LLMs. Zhang et al. [47] first introduced a suite of transformers pre-trained at scale, followed by PaLM [6]. ChatGPT/GPT-4 [27, 28] emerged as a milestone conversational model, demonstrating impressive abilities as a generalist. Vicuna [5] further advanced by learning from ChatGPT, while LLaMA [38] demonstrated that larger-scale training yields stronger foundation models. The subsequent LLaMA-2 [39] and PaLM-2 [2] achieved further gains in scale, efficiency and reasoning. Most recently, Almazrouei et al. [1] released Falcon, a 40B model.

Zero-Shot Learning (ZSL). ZSL stands in contrast to traditional fully-supervised paradigms. Instead of relying on direct visual training samples, it leverages side information that can be drawn from a multitude of non-visual domains, including attributes [22], word embeddings [36, 40], and descriptive texts [34]. Zhang et al. [46] designed an embedding model to bridge the gap between seen and unseen categories. Concurrently, studies like [4, 44, 50] have spotlighted that generative models can produce features for unseen categories. Moreover, Graph Convolution Networks (GCN) [20] have been explored in research such as [16, 40] for further generalization.

Prompt Learning. With the progress in large-scale vision-language models, such as CLIP [32] and ALIGN [14], which reveal their capacity for zero-shot transferability, prompt learning has emerged as an efficient learning scheme, where learnable prompts are appended to the input to fine-tune models. For low-shot image classification, CoOp [49] and CoCoOp [48], which model context words as learnable vectors to automate prompt engineering, have shown significant improvements over regular CLIP. MaPLe [18] further employed a hierarchical multi-modal prompting strategy across transformer blocks for progressive feature modeling. Kan et al. [17] incorporated external knowledge by designing knowledge-aware prompts and an adaptation head for better generalization. Lee et al. [23] used masked attention to prevent internal representation shift for better generalization. Khattak et al. [19] further improved prompt learning by guiding prompts to balance task-specific and task-agnostic knowledge via mutual agreement maximization and prompt ensembling.
3. Approach

3.1. Preliminaries

Similar to previous CLIP-based learning approaches, we consider the classification problem as an image-text matching problem. We denote the image encoder and the text encoder, in CLIP-like models, as F and G, parameterized by θ_F and θ_G, respectively. An input image x ∈ R^{C×H×W} is split into M equal-sized patches which are converted into a sequence of embeddings x̃ = {e_cls, e_1, e_2, ..., e_M}. The visual input sequence x̃ is encoded by the image encoder, producing the image feature f = F(x̃). On the text side, the text label y and the associated name is formatted as "A photo of [STH]" and tokenized into a sequence of tokens ỹ = {t_bos, t_1, t_2, ..., t_L, t_eos}, where L is the length of input tokens. The input sequence is then encoded into g = G(ỹ). For image classification, target class labels {1, 2, ..., C} are encoded into text features g_i. The classification is done by picking the class that has the highest similarity with the vision feature: ŷ = argmax_i C(f, g_i), where C is the softmax cosine-similarity function C(f, g_i) = exp(f·g_i/τ) / Σ_{j=1}^{C} exp(f·g_j/τ) with temperature τ.
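To make this matching rule concrete, the PyTorch sketch below implements the classification step on pre-computed features. It is illustrative only: random tensors stand in for the outputs of the CLIP encoders, and the temperature value is a placeholder rather than CLIP's learned one.

```python
import torch
import torch.nn.functional as F

def classify(image_feat: torch.Tensor, text_feats: torch.Tensor, tau: float = 0.01):
    """Zero-shot classification: pick the class whose text feature is most similar to f.

    image_feat: (d,) image feature f from the image encoder.
    text_feats: (C, d) text features g_1..g_C, one per class name.
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    f = F.normalize(image_feat, dim=-1)
    g = F.normalize(text_feats, dim=-1)
    logits = (f @ g.t()) / tau          # (C,) temperature-scaled similarities
    probs = logits.softmax(dim=-1)      # softmax cosine-similarity function C(f, g_i)
    return int(probs.argmax()), probs

# Toy usage with random stand-in features (d = 512, C = 3 classes).
pred, probs = classify(torch.randn(512), torch.randn(3, 512))
```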
Multimodal Prompt Learning. Given the size of the CLIP model, fine-tuning the entire model becomes infeasible. As both image and text encoders are built with the standard transformer architecture, prompt learning, which tunes the model by combining trainable prompts with hidden states, has been applied to the text encoder [48, 49], the image encoder [15, 41, 42], or both [18, 19, 33]. Similar to [19, 33], we build our method following the vision-language prompting paradigm, with deep prompting [15, 19], which inserts prompts not only at the input layer but also at later encoder layers.

More specifically, for each transformer layer that takes prompts, we define V learnable visual prompts p_v = {p_v^1, p_v^2, ..., p_v^V} and T learnable language prompts p_t = {p_t^1, p_t^2, ..., p_t^T}. For the i-th vision encoder layer, visual prompts p_v^i are appended to the input embeddings: x̃_p^i = {e_cls^i, e_1^i, e_2^i, ..., e_M^i, p_v^i}. The prompt-augmented vision feature, f_p = F(x̃_p), is produced by jointly encoding prompts and the image. As the ViT [9] architecture in CLIP adopts the bi-directional attention mechanism, the placement of p_v has no effect on f_p. On the language side, prompts are concatenated with the input of the i-th text encoder layer: ỹ_p^i = {t_bos^i, p_t^i, t_1^i, t_2^i, ..., t_L^i, t_eos^i}. ỹ_p is further processed by the text encoder, resulting in the prompt-augmented language feature g_p = G(ỹ_p). More specifically, prompts to the first layer p_t^1 are initialized with the embeddings of "A photo of a".
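As a concrete illustration of deep prompting on the vision side, the sketch below appends a set of learnable prompt tokens to the token sequence entering a transformer layer and strips them again before the next layer, which then adds its own prompts. The encoder block here is a generic nn.TransformerEncoderLayer used as a stand-in for a CLIP ViT block, so the dimensions, prompt count, and initialization are assumptions rather than the exact LLaMP configuration.

```python
import torch
import torch.nn as nn

class PromptedLayer(nn.Module):
    """Appends V learnable prompt tokens to the input of one transformer layer."""

    def __init__(self, d_model: int = 768, n_prompts: int = 4):
        super().__init__()
        # Stand-in block; in practice the pre-trained CLIP ViT layer is reused, not re-created.
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, d_model))  # p_v for this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + M, d) -- [CLS] token followed by M patch embeddings
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([x, p], dim=1)                    # {e_cls, e_1, ..., e_M, p_v}
        x = self.block(x)
        return x[:, : -self.prompts.size(0)]            # drop prompt slots; the next layer adds its own

out = PromptedLayer()(torch.randn(2, 197, 768))          # ViT-B/16 at 224x224: 196 patches + [CLS]
```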
Low-Rank Adaptation (LoRA) [13]. As a parameter-efficient tuning technique, LoRA is designed to adapt large transformer models without updating the original model weights. The LoRA technique is, in particular, applied to linear projection layers. More specifically, for a linear layer with weight W_0 ∈ R^{d×k}, LoRA creates ∆W by learning two low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k}:

\bm{h} = (W_0 + \Delta W)\bm{x} = W_0\bm{x} + BA\bm{x}. \quad (1)

We adopt a hybrid tuning scheme on the vision encoder, which performs prompt learning on the first few layers and applies LoRA on the rest.
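A minimal LoRA wrapper implementing Eq. (1) is sketched below. It omits the scaling factor and dropout used in the reference LoRA implementation, and the rank and dimensions are illustrative; it is not the exact module used in LLaMP.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x (Eq. 1), with W0 frozen and only A, B trained."""

    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # original weights stay frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(0.01 * torch.randn(r, k))    # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))           # B in R^{d x r}; zero init => dW = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

layer = LoRALinear(nn.Linear(768, 768), r=4)
y = layer(torch.randn(2, 768))                             # same shape as the original layer's output
```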
3.2. Adaptive Prompt Learning with LLMs

The goal of prompt tuning is to find a set of optimal prompts p = {p_v, p_t} which maximizes the log-likelihood of P(x, y | θ_F, θ_G) over the target downstream distribution (x, y) ~ (X, Y):

\bm{p} = \argmax_{\bm{p}} \mathbb{E}_{(\bm{x},\bm{y})\sim(\bm{X},\bm{Y})} \log \mathcal{C}(\mathcal{F}(\bm{x};\bm{p}_v), \mathcal{G}(\bm{y};\bm{p}_t)) \quad (2)

However, the p optimized through Eqn. 2 has two issues. First, p is shared across all categories of the downstream task, while the optimal prompt for each category might be different. Second, in low-shot scenarios, p is usually empirically estimated from a limited training set X^train with limited categories {1, 2, ..., C^base}, and therefore such p can often be over-fitted to the small training set X^train and fail to generalize to novel categories outside {1, 2, ..., C^base}.

To overcome these problems, we propose to learn a meta function on the language side, p_t = Θ(y), which can adaptively estimate the optimal prompt for each category. An intuitive way to estimate proper prompts p for category name y is to take advantage of the knowledge of a pre-trained Large Language Model (LLM) D and extract discriminative descriptions of category y. For example, given the input text z: "Describe {y}",

\bm{p}_t = \{p_1, p_2, ..., p_k\} = \mathcal{D}(\bm{z}), \quad (3)

where p_i is sequentially generated by D such that

\begin{aligned} p_i &= \mathcal{D}(\bm{z}, t_1, ..., t_{i-1}) = \mathcal{D}^{(i)}(\bm{z}) \\ t_i &= \mathcal{M}(p_i), \end{aligned} \quad (4)

where D^{(i)} is the i-th forward iteration of D, and M maps continuous hidden states into discrete language tokens. To accelerate the process and obtain p in one pass, we approximate the above process with K learnable prompts p_l = {θ_1, ..., θ_K} so that

\bm{p}_t = \Theta(y) = \mathcal{D}(\{\theta_1, ..., \theta_K\} | \bm{z}). \quad (5)

Discussion. While Large Language Models (LLMs) possess robust foundational knowledge within the linguistic domain, it is not feasible to directly substitute the text encoder of CLIP with an LLM. The reason lies in the inherent divergence between the LLM's latent space, which is purely language-oriented, and the image-focused latent space of vision encoders. Attempting a direct alignment via contrastive learning would require an extensive dataset that is typically beyond the scope of low-shot learning. To bridge this gap, we introduce LLaMP, an adaptive prompt learning framework that leverages the LLM to craft class-specific prompt vectors to reinforce the text encoder for low-shot image classification.
Figure 2. An overview of the LLaMP framework: We first generate the knowledge cache by passing the query prompt (e.g. "Describe a Chevrolet Corvette ZR1 2012") through the LLM D and use the knowledge cache to encode p_l, resulting in the adaptive prompts h̃_l^i = W^i h_l + b^i for the CLIP text encoder. h̃_l is combined with the regular learnable prompts of G to generate the final text feature vector g_p. The image feature vector f_p is obtained through a hybrid tuning strategy combining prompt learning and low-rank adaptation (LoRA).
3.3. The LLaMP Framework

Fig. 2 shows an overview of the LLaMP framework. For convenience, we denote the decoder-only LLM as D. The input to the decoder D consists of two components: textual prompts y in the form of sentences, tokenized as ỹ, and learnable prompts p_l. We append the prompt embeddings to the end of the input sequence and obtain the last hidden states of D as the feature h_l:

\bm{h}_l = \mathcal{D}(\bm{\tilde{y}}, \bm{p}_l)[L+1:L+K], \quad L = \text{Length}(\bm{\tilde{y}}). \quad (6)

Hidden states of D are then mapped to the input space of the CLIP text encoder by the projection matrix W ∈ R^{d_1×d_2}, where d_1 and d_2 are respectively the hidden sizes of the LLM D and the CLIP text encoder G. A set of prompt-specific biases b ∈ R^{K×d_2} is added:

\bm{\tilde{h}}_l = W\bm{h}_l + b. \quad (7)

We combine h̃_l from the LLM with regular learnable prompts, as in previous approaches [19], to construct the input for the CLIP text encoder. Similar to deep prompting [15, 19], we create layer-specific prompts through different W matrices and b vectors. For the i-th layer, we let h̃_l^i = W^i h_l + b^i, and the entire sequence is constructed as

\bm{\tilde{y}}_l^i = \{\bm{t}_{bos}^i, \bm{p}_t^i, \bm{t}_1^i, \bm{t}_2^i, \dots, \bm{t}_L^i, \bm{\tilde{h}}_l^i, \bm{t}_{eos}^i\}. \quad (8)
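The layer-wise projection of Eqs. (7) and (8) can be sketched as below. The hidden sizes (4096 for a LLaMA-2-7B decoder, 512 for the CLIP text encoder) and the prompt depth of 9 match the values reported in the implementation details, but the module itself is a simplified stand-in for the actual LLaMP code.

```python
import torch
import torch.nn as nn

class LayerwisePromptProjection(nn.Module):
    """Maps LLM hidden states h_l to per-layer CLIP prompts: h~_l^i = W^i h_l + b^i."""

    def __init__(self, d_llm: int = 4096, d_clip: int = 512, n_layers: int = 9):
        super().__init__()
        # One (W^i, b^i) pair per prompted CLIP text-encoder layer.
        self.proj = nn.ModuleList(nn.Linear(d_llm, d_clip) for _ in range(n_layers))

    def forward(self, h_l: torch.Tensor):
        # h_l: (K, d_llm) hidden states of the K learnable LLM prompts.
        return [proj(h_l) for proj in self.proj]       # each element: (K, d_clip)

h_l = torch.randn(16, 4096)                            # e.g. K = 16 prompts from a LLaMA-2-7B decoder
per_layer = LayerwisePromptProjection()(h_l)
# per_layer[i] is inserted next to the class tokens and the shared prompts p_t^i
# to build the i-th text-encoder input of Eq. (8).
```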
LLM Knowledge Cache. A Large Language Model (LLM), as implied by its name, typically comprises billions of parameters. For example, the most compact LLaMA [38, 39] model has 7B parameters. Thus, even performing prompt learning on an LLM becomes impractical: the memory consumption to store gradients for back-propagation can go beyond the limit of mainstream GPUs. Instead, the causal attention mechanism inherent in decoder-only LLMs, where the embedding of an input token only depends on the preceding tokens, facilitates a feasible workaround. As previously mentioned, the prompt embeddings p_l are appended to the end of the text tokens ỹ. According to the causal attention mechanism, ỹ is encoded independently of p_l. Thus, we design a two-stage process, where we create the LLM knowledge cache by passing ỹ through D and leverage the cache to convert p_l into class-specific embeddings for the CLIP text encoder G.

To compute the attention of a token, the only dependency is the Key and Value vectors of the preceding tokens. Thus, we adopt the KV-cache [31, 43], a technique used in inference acceleration of LLMs, to create the knowledge cache. At the first stage, we pass the text tokens ỹ through the language model D and save the Keys and Values as the knowledge cache for the second stage. Once computed, the knowledge cache remains fixed throughout the entire training process and bears the information that is needed for further computation. Thus, in LLaMP, we leverage the knowledge cache obtained at the first stage to generate class-specific prompt embeddings.

At the second stage, we create class-specific prompt embeddings from the pre-computed knowledge cache. As p_l is not initialized in the natural language domain, it need not pass through the entire LLM; instead, we insert those prompts p_l into the last layer of the LLM, D_N. This is achieved by encoding them alongside the cache from ỹ, as in

\bm{H}_l = \mathcal{D}_N(\bm{K}_{\bm{\tilde{y}}}, \bm{V}_{\bm{\tilde{y}}}, \bm{p}_l), \quad (9)

where K_ỹ, V_ỹ represent the knowledge cache. This design enables LLaMP to efficiently learn informative prompt embeddings for the CLIP encoder G, incurring modest training costs compared with training the entire LLM, while maintaining the essential knowledge inherent in the LLM decoder D.
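The two-stage computation can be illustrated with the single-head attention sketch below. It is a heavily simplified stand-in, assuming freshly initialized projection weights and a random placeholder for the LLM hidden states of the class description; a real implementation would reuse the pre-trained weights of the last LLaMA decoder layer (with multi-head attention, rotary embeddings, normalization and the FFN) and its standard KV cache.

```python
import math
import torch
import torch.nn as nn

d = 4096                                         # hidden size of a LLaMA-7B decoder layer
W_q, W_o = nn.Linear(d, d), nn.Linear(d, d)      # trainable in LLaMP
W_k, W_v = nn.Linear(d, d), nn.Linear(d, d)      # frozen
for m in (W_k, W_v):
    for p in m.parameters():
        p.requires_grad = False

# Stage 1 (run once per class): pass the tokenized description through the frozen LLM
# and store the last layer's Keys/Values as the knowledge cache.
with torch.no_grad():
    desc_hidden = torch.randn(1, 42, d)          # placeholder for the LLM hidden states of y~
    K_cache, V_cache = W_k(desc_hidden), W_v(desc_hidden)

# Stage 2 (every training step): encode the K learnable prompts against the fixed cache.
p_l = nn.Parameter(0.02 * torch.randn(1, 16, d))  # K = 16 learnable prompts

def encode_prompts(p_l: torch.Tensor) -> torch.Tensor:
    q = W_q(p_l)
    k = torch.cat([K_cache, W_k(p_l)], dim=1)     # cached keys + the prompts' own keys
    v = torch.cat([V_cache, W_v(p_l)], dim=1)
    attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(d), dim=-1)
    return W_o(attn @ v)                          # rough analogue of H_l in Eq. (9)

H_l = encode_prompts(p_l)                         # (1, 16, d), fed to the projection of Eq. (7)
```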
Training Targets of LLaMP. Although the training strategy in Eqn. 9 has reduced the number of learnable parameters, a full decoder layer inside an LLM still consists of an enormous number of parameters. For example, one layer in LLaMA-7B bears 200M parameters, making training of the entire layer costly. Moreover, as the goal is to leverage the knowledge from the LLM, altering a full layer can lead to the loss of knowledge. As shown in Fig. 2, a typical decoder layer has two major components: the self-attention module, consisting of Query, Key, Value and Output projection layers, and the Feed-Forward Network (FFN). LLaMP targets the Query and Output projection layers inside the self-attention module. By updating the Query layer, the LLM prompts p_l learn to distill pertinent information from the knowledge cache, and the Output layer projects it to the latent space. We keep the Key and Value layers frozen to ensure the alignment between p_l and the knowledge cache. We leave the FFN unchanged to preserve the knowledge. Further discussion of these choices is provided in Sec. 4.3.
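In code, this freezing policy amounts to toggling requires_grad by parameter name on the last decoder layer. The sketch below uses a dummy module whose sub-module names mirror the common LLaMA implementation (q_proj, k_proj, v_proj, o_proj, and an FFN); the naming is an assumption and would need to match the actual backbone.

```python
import torch.nn as nn

class DummyDecoderLayer(nn.Module):
    """Minimal stand-in mirroring the sub-module names of a LLaMA decoder layer."""

    def __init__(self, d: int = 32):
        super().__init__()
        self.self_attn = nn.ModuleDict({k: nn.Linear(d, d)
                                        for k in ("q_proj", "k_proj", "v_proj", "o_proj")})
        self.mlp = nn.ModuleDict({k: nn.Linear(d, d)
                                  for k in ("gate_proj", "up_proj", "down_proj")})

def set_trainable(layer: nn.Module, trainable_keys=("q_proj", "o_proj")) -> None:
    # Only the Query and Output projections receive gradients; the K/V projections and
    # the FFN stay frozen to preserve alignment with the pre-computed knowledge cache.
    for name, param in layer.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)

layer = DummyDecoderLayer()
set_trainable(layer)
print(sorted({n.split(".")[1] for n, p in layer.named_parameters() if p.requires_grad}))
# ['o_proj', 'q_proj']
```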
Textual Priors from Pre-Generated Responses. We extend the initial prompt, "In one sentence, describe the distinctive appearance of [STH]", by incorporating the response generated by the language model into the input sequence. This approach enriches the base content: the generated text provides a clear and explicit description of the object's appearance, acting as a valuable informative prior for language model adaptation. However, it is common for responses from an LLM to include filler words like "sure" for sentence-structure coherence. To refine the input, we parse the noun phrases from the LLM's response with spaCy [12], an NLP engine, and merge them with the initial prompt, forming a more focused and informative language prior.
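Noun-phrase extraction with spaCy can be as simple as the sketch below, which applies the library's noun_chunks iterator to a response like the one in Fig. 1a; the pipeline name and the prompt template are illustrative.

```python
import spacy

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

response = ("The Yak-40 has a unique trijet configuration with a large passenger "
            "window section and a sloping nose.")

# Keep the noun phrases only, discarding filler words an LLM may prepend (e.g. "Sure, ...").
noun_phrases = [chunk.text for chunk in nlp(response).noun_chunks]
prompts = [f"A photo of a Yak-40 with {np}." for np in noun_phrases]
```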
Textual Augmentations. Following the insights of Khattak et al. [19], which highlight the performance benefits of diverse textual inputs, we aim to further augment the text inputs used in the CLIP text encoder. Our approach, building upon the methods in [19, 48], incorporates hand-crafted templates and expands their diversity through a two-step process: i) We introduce noun phrases into the existing templates for CLIP, for example, transforming "A photo of [STH]" into "A photo of [STH] with [NP]", thereby enriching the descriptive content; ii) We create a variety of new prompt templates for the LLM, similar to "In one sentence, describe the distinctive appearance of [STH]", through GPT-4 [28], to further diversify the text input.

3.4. Training and Inference

Similar to PSRC [19], our objective function consists of three components: the main cross-entropy loss L_CE, a feature-level L1 regularization L_l1, and a soft distillation loss L_dist. Given C training categories and N training samples, L_CE is defined as

\mathcal{L}_{CE} = -\frac{1}{\mathcal{N}}\sum_i \log\frac{\exp(\bm{f_p}^i \cdot \bm{g_p}^i/\tau)}{\sum_j \exp(\bm{f_p}^i \cdot \bm{g_p}^j/\tau)}. \quad (10)

The L1 regularization is computed between the learned features f_p, g_p and the pre-trained CLIP features f̂, ĝ:

\mathcal{L}_{l1} = \frac{1}{\mathcal{N}}\sum_i \lambda_v|\bm{f_p}^i - \bm{\hat{f}}^i| + \frac{1}{C}\sum_i \lambda_t|\bm{g_p}^i - \bm{\hat{g}}^i|, \quad (11)

where λ_v and λ_t are coefficients. The prediction of LLaMP is further bound by the KL-divergence between the predicted distributions of LLaMP and vanilla CLIP:

\mathcal{L}_{dist} = \lambda_{dist} D_{KL}(\bm{f_p} \cdot \bm{g_p}, \bm{\hat{f}} \cdot \bm{\hat{g}}). \quad (12)

We sum all three losses up as the final objective function: L = L_CE + L_l1 + L_dist.

During training, we randomly sample one LLM template as the input of LLaMP for each batch. For inference, we compute the probability distribution predicted from each input template and average them.
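The overall objective can be sketched as follows on pre-computed features. The λ coefficients are the values listed in the implementation details; the temperature, feature handling, and reductions are simplifications of what the released code would do.

```python
import torch
import torch.nn.functional as F

def llamp_objective(f_p, g_p, f_hat, g_hat, labels, tau=0.01,
                    lambda_v=10.0, lambda_t=25.0, lambda_dist=2.5):
    """Cross-entropy + L1 feature regularization + KL distillation (Eqs. 10-12).

    f_p, f_hat: (N, d) learned / frozen-CLIP image features (assumed L2-normalized).
    g_p, g_hat: (C, d) learned / frozen-CLIP text features, one per training class.
    labels:     (N,) ground-truth class indices.
    """
    logits = f_p @ g_p.t() / tau
    loss_ce = F.cross_entropy(logits, labels)                                  # Eq. (10)

    loss_l1 = (lambda_v * (f_p - f_hat).abs().mean()
               + lambda_t * (g_p - g_hat).abs().mean())                        # Eq. (11)

    logits_hat = f_hat @ g_hat.t() / tau
    loss_dist = lambda_dist * F.kl_div(logits.log_softmax(dim=-1),
                                       logits_hat.softmax(dim=-1),
                                       reduction="batchmean")                  # Eq. (12)

    return loss_ce + loss_l1 + loss_dist

loss = llamp_objective(torch.randn(8, 512), torch.randn(10, 512),
                       torch.randn(8, 512), torch.randn(10, 512),
                       torch.randint(0, 10, (8,)))
```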
4. Experiments

4.1. Experiment Setup

Datasets. Similar to previous work [18, 19, 48], we evaluate LLaMP over a spectrum of classification tasks on 11 datasets, including ImageNet [8] and Caltech101 [10] for generic image classification, OxfordPets [29], StanfordCars [21], Flowers102 [26], Food101 [3], and FGVCAircraft [25] for fine-grained classification, SUN397 [45] for scene recognition, UCF101 [37] for action recognition, DTD [7] for texture classification, and EuroSAT [11] for satellite image recognition.

Scenarios & Metrics. We evaluate LLaMP on two typical low-shot scenarios: zero-shot base-to-novel generalization and few-shot image classification. In zero-shot base-to-novel generalization, the base classes are seen during training, while the novel classes are unseen. We measure model performance through the accuracies on base and novel classes, and the harmonic mean of the two. For few-shot classification, we assess the accuracy with 16 shots per class.

Implementation Details. We build LLaMP through the PyTorch [30] framework.
Method          Average            ImageNet [8]       Caltech101 [10]    OxfordPets [29]
                Base  Novel  HM    Base  Novel  HM    Base  Novel  HM    Base  Novel  HM
CLIP [32] 69.34 74.22 71.70 72.43 68.14 70.22 96.84 94.00 95.40 91.17 97.26 94.12
CoOp [49] 82.69 63.22 71.66 76.47 67.88 71.92 98.00 89.81 93.73 93.67 95.29 94.47
CoCoOp [48] 80.47 71.69 75.83 75.98 70.43 73.10 97.96 93.81 95.84 95.20 97.69 96.43
KAPT∗ [17] 78.41 70.52 74.26 71.10 65.20 68.02 97.10 93.53 95.28 93.13 96.53 94.80
ProDA [24] 81.56 72.30 76.65 76.66 70.54 73.47 97.74 94.36 96.02 95.43 97.76 96.58
MaPLe [18] 82.28 75.14 78.55 75.40 70.32 72.72 98.27 93.23 95.68 95.43 97.83 96.62
RPO [23] 81.13 75.00 77.78 76.60 71.57 74.00 97.97 94.37 96.03 94.63 97.50 96.05
PSRC [19] 84.26 76.10 79.97 77.60 70.73 74.01 98.10 94.03 96.02 95.33 97.30 96.30
LLaMP 85.16 77.71 81.27 77.99 71.27 74.48 98.45 95.85 97.13 96.31 97.74 97.02
∆ w.r.t. PSRC +0.90 +1.61 +1.30 +0.39 +0.54 +0.47 +0.35 +1.82 +1.11 +0.98 +0.44 +0.72
Table 1. Comparison with state-of-the-art methods on base-to-novel generalization. LLaMP shows strong generalization results over
previous approaches on 11 image classification tasks. Absolute gains over PSRC are indicated in blue. ∗ KAPT is trained with ViT-B/32
image encoder instead of ViT-B/16.
All models are trained with 2 NVIDIA A100 40GB GPUs. For LLaMP, we adopt LLaMA2-7B [39] as the language model D, and ViT-B/16 [9] as the image encoder, following [18, 19, 48, 49]. On the text side, we set the prompt learning depth to 9. To tune the vision encoder, we adopt the hybrid tuning scheme which performs deep prompt learning on the first 6 layers and LoRA on the rest. Similar to [13], LoRA is applied to the Query and Value projection layers inside attention modules. The number of p_l prompts, K, is set to 16. We set a global learning rate of 2e-4 with a batch size of 8. The learning rate of the LoRA modules is set to 2e-5. λ_t, λ_v and λ_dist are set to 25, 10 and 2.5, respectively.

4.2. Quantitative Evaluation

Zero-Shot Base-to-Novel Generalization. LLaMP outperforms existing state-of-the-art prompt learning methods on most metrics across the 11 classification datasets in the base-to-novel generalization benchmark.
16-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 78.79 (65.02) 67.31 95.43 85.34 80.44 97.37 82.90 45.36 73.28 69.96 87.21 82.11
CoOp [49] 79.89 (73.82) 71.87 95.57 91.87 83.07 97.07 84.20 43.40 74.67 69.87 84.93 82.23
CoCoOp [48] 74.90 (70.70) 70.83 95.16 93.34 71.57 87.84 87.25 31.21 72.15 63.04 73.32 78.14
MaPLe [18] 81.79 (75.58) 72.33 96.00 92.83 83.57 97.00 85.33 48.40 75.53 71.33 92.33 85.03
PSRC [19] 82.87 (77.90) 73.17 96.07 93.67 83.83 97.60 87.50 50.83 77.23 72.73 92.43 86.47
LLaMP 83.81 (78.50) 73.49 97.08 94.21 86.07 98.06 87.62 56.07 77.02 74.17 91.31 86.84
Table 2. Few-shot classification results with 16 shots. Numbers in brackets indicate the average performance over 1/2/4/8/16 shots.
As shown in Tab. 1, compared to the latest model PSRC [19], LLaMP achieves average gains of 0.90% in base accuracy, 1.61% in novel accuracy, and 1.30% in harmonic mean. Moreover, LLaMP consistently achieves higher harmonic means (HM) compared to other models. These improvements indicate that our approach better balances performance on base and novel data, thus achieving stronger generalization compared to prior prompt learning techniques.

In particular, LLaMP excels on fine-grained datasets requiring detailed analysis. On FGVCAircraft, LLaMP surpasses PSRC by 4.57% on base accuracy and 1.75% on HM, highlighting its strong understanding of detailed aircraft features. Furthermore, on EuroSAT, LLaMP achieves improvements of 9.76% and 5.28% on novel accuracy and HM, respectively. We also observe similar performance gains on StanfordCars, where LLaMP outperforms PSRC by 3.29% on base accuracy and 1.31% on HM. The information embedded in the LLM enables LLaMP to capture and utilize the rich semantic information necessary for distinguishing between closely related categories.

Few-Shot Classification. LLaMP also achieves improvements across these classification datasets on few-shot classification tasks, as in Tab. 2, with an average classification accuracy of 83.81%. Notably, on FGVCAircraft and StanfordCars, LLaMP shows a significant improvement over PSRC, further demonstrating that the knowledge from language models benefits the recognition of fine-grained object categories, which aligns with our observation on zero-shot base-to-novel generalization. Moreover, on DTD, where MaPLe and PSRC achieve around 72% accuracy, LLaMP achieves a higher accuracy of 74.17%, underscoring its ability to recognize textures.

4.3. Ablation Study

Is the knowledge from the LLM helping? In Tab. 3, we show that the knowledge from the LLM helps in two ways: without training, the performance of the ordinary CLIP model can be improved by introducing noun phrases; the LLaMP framework shows further improvement after training.

Noun phrases are parsed from the LLM's responses to the prompt "Describe [STH]". We then use the template "A photo of [STH] with [NP]" to generate the NP-augmented text embedding for CLIP, and take the average of all augmented embeddings for classification. In Tab. 3 we show that even ordinary CLIP can benefit from incorporating LLMs' knowledge.

Furthermore, the comparison between LLaMP and LLaMP without the LLM indicates that merely integrating LoRA [13] into the vision encoder is not beneficial. "LLaMP without LLM" is essentially an ordinary prompt learning model plus LoRAs in the vision encoder. We show that the improved vision encoding capacity only helps when the quality of the text embeddings is enhanced by incorporating LLMs' knowledge through LLaMP.

Method    LLM    Base     Novel    HM
CLIP             69.34    74.22    71.70
CLIP      ✓      70.95    74.93    72.79
LLaMP            82.21    76.44    79.22
LLaMP     ✓      85.16    77.71    81.27

Table 3. Ablation study on the LLM knowledge.

Decoder Training Strategy. We categorize the trainable parameters of D_N into four groups: learnable prompts (LP), Query and Output projections (QO), Key and Value projections (KV), and the feed-forward network (FFN). Tab. 4 indicates that LLaMP can achieve desirable results by just learning the prompts of D. Going one step further, adding QO into the optimization achieves the best performance. Although other

LP   QO   KV   FFN   %      Base     Novel    HM
✓                    0.03   85.00    77.29    80.96
✓    ✓    ✓          33     85.20    77.45    81.14
✓    ✓         ✓     83     85.05    77.73    81.22
✓    ✓    ✓    ✓     100    85.23    77.56    81.21
✓    ✓               17     85.16    77.71    81.27

Table 4. Ablation study on the training strategy. "%" indicates the ratio of parameters trained compared to fully tuning a layer.
Method    Priors    Base     Novel    HM
LLaMP     ✗         84.90    77.59    81.08
LLaMP     Plain     85.26    77.56    81.22
LLaMP     NP        85.16    77.71    81.27

Table 5. Ablation study on pre-generated text priors. ✗ refers to "without textual priors" and NP stands for noun phrases.
Figure 3. Effect of LLM Prompts on Harmonic Mean. 16 prompts achieve the most balanced performance.

Method      Base     Novel    HM
LLM Only    81.74    35.82    49.81
LLaMP       85.16    77.71    81.27

[Qualitative figure: Image / Noun Phrases / Heatmap, Classname: An-12]
References

[1] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model with state-of-the-art performance. Technical report, Technology Innovation Institute, 2023. 2
[2] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. 2
[3] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014. 5, 6, 7, 1, 2
[4] Shiming Chen, Wenjie Wang, Beihao Xia, Qinmu Peng, Xinge You, Feng Zheng, and Ling Shao. Free: Feature refinement for generalized zero-shot learning. In ICCV, pages 122–131, 2021. 2
[5] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 2
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014. 5, 6, 7, 2
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009. 5, 6, 7, 2
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3, 6
[10] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR workshop, pages 178–178. IEEE, 2004. 5, 6, 7, 2
[11] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 5, 6, 7, 2
[12] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020. 5
[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 3, 6, 7, 8
[14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021. 1, 2
[15] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022. 3, 4
[16] Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P Xing. Rethinking knowledge graph propagation for zero-shot learning. In CVPR, pages 11487–11496, 2019. 2
[17] Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, and Feng Zheng. Knowledge-aware prompt tuning for generalizable vision-language models. In ICCV, pages 15670–15680, 2023. 1, 2, 6
[18] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, pages 19113–19122, 2023. 1, 2, 3, 5, 6, 7
[19] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, pages 15190–15200, 2023. 1, 2, 3, 4, 5, 6, 7, 8
[20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. OpenReview.net. 2
[21] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshops, pages 554–561, 2013. 5, 6, 7, 2
[22] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2013. 2
[23] Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyeong Choi, Sanghyeok Lee, and Hyunwoo J Kim. Read-only prompt optimization for vision-language few-shot learning. In ICCV, pages 1401–1411, 2023. 1, 2, 6
[24] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022. 1, 6
[25] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 2, 5, 6, 7, 1
[26] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008. 5, 6, 7, 2
[27] OpenAI. Chatgpt, 2023. Available at https://openai.com/chatgpt. 2
[28] OpenAI. Gpt-4 technical report, 2023. 1, 2, 5
[29] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012. 5, 6, 7, 2
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, pages 8026–8037, 2019. 5
[31] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023. 4
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 1, 2, 6, 7
[33] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023. 3
[34] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016. 2
[35] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017. 8
[36] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D Manning, and Andrew Y Ng. Zero-shot learning through cross-modal transfer. arXiv preprint arXiv:1301.3666, 2013. 2
[37] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5, 6, 7, 2
[38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 2, 4
[39] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1, 2, 4, 6
[40] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, pages 6857–6866, 2018. 2
[41] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022. 3
[42] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022. 3
[43] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics. 4
[44] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, pages 5542–5551, 2018. 2
[45] Jianxiong Xiao, Krista A Ehinger, James Hays, Antonio Torralba, and Aude Oliva. Sun database: Exploring a large collection of scene categories. IJCV, 119(1):3–22, 2016. 5, 6, 7, 2
[46] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 2021–2030, 2017. 2
[47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 2
[48] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022. 1, 2, 3, 5, 6, 7
[49] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022. 1, 2, 3, 6, 7
[50] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, pages 1004–1013, 2018. 2
Large Language Models are Good Prompt Learners
for Low-Shot Image Classification
Supplementary Material
Introduction

In the supplementary material, we provide extra discussions that did not fit in the main paper due to the space limitation, including: i) an ablation study on the textual augmentation described in Sec. 3.3; ii) few-shot classification results on 8/4/2/1 shots with comparisons with previous methods.

Few-shot Classification

In addition to the 16-shot classification results reported in the main paper, we present few-shot classification results with 8/4/2/1 shots in Tab. 9 and compare LLaMP against previous baseline models.

Results in Tab. 9 show that LLaMP outperforms previous SOTAs under all settings on average over all 11 benchmarks, with a 0.88% improvement with 8 shots. In particular, we observe that LLaMP surpasses PSRC [19] consistently on FGVCAircraft (Aircraft) [25] and Food [3] with all numbers of shots. This observation aligns with our argument in the main paper that the knowledge from LLMs provides richer semantic information for fine-grained classification.
8-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 74.47 62.23 93.41 78.36 73.67 96.10 79.79 39.35 69.08 63.46 84.43 79.34
CoOp [49] 76.98 70.63 94.37 91.27 79.30 94.97 82.67 39.00 71.53 64.77 78.07 80.20
CoCoOp [48] 72.96 70.63 95.04 93.45 70.44 84.30 86.97 26.61 70.84 58.89 68.21 77.14
MaPLe [18] 78.89 70.30 95.20 92.57 79.47 95.80 83.60 42.00 73.23 66.50 87.73 81.37
PSRC [19] 80.69 72.33 95.67 93.50 80.97 96.27 86.90 43.27 75.73 69.87 88.80 84.30
LLaMP 81.57 72.30 96.57 93.69 82.15 96.20 87.39 47.48 75.18 71.14 91.15 84.06
4-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 68.01 54.85 92.05 71.17 63.38 92.02 73.19 32.33 63.00 55.71 77.09 73.28
CoOp [49] 74.02 68.73 94.40 92.57 74.47 92.17 84.47 30.83 69.97 58.70 70.80 77.10
CoCoOp [48] 71.21 70.39 94.98 92.81 69.39 78.40 86.88 24.79 70.21 55.04 65.56 74.82
MaPLe [18] 75.37 67.70 94.43 91.90 75.30 92.67 81.77 34.87 70.67 61.00 84.50 78.47
PSRC [19] 78.35 71.07 95.27 93.43 77.13 93.87 86.17 37.47 74.00 65.53 86.30 81.57
LLaMP 78.83 71.37 95.84 93.61 76.79 93.96 87.17 40.02 74.05 66.37 86.16 81.80
2-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 57.98 44.88 89.01 58.37 50.28 85.07 61.51 26.41 53.70 40.76 61.98 65.78
CoOp [49] 70.65 67.07 93.07 89.80 70.50 87.33 84.40 26.20 66.53 53.60 65.17 73.43
CoCoOp [48] 67.65 69.78 94.82 92.64 68.37 75.79 86.22 15.06 69.03 52.17 46.74 73.51
MaPLe [18] 72.58 65.10 93.97 90.87 71.60 88.93 81.47 30.90 67.10 55.50 78.30 74.60
PSRC [19] 75.29 69.77 94.53 92.50 73.40 91.17 85.70 31.70 71.60 59.97 79.37 78.50
LLaMP 75.89 70.12 95.66 92.75 72.20 89.16 86.33 33.41 72.64 61.29 81.71 79.56
1-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 45.83 32.13 79.88 44.06 35.66 69.74 43.96 19.61 41.58 34.59 49.23 53.66
CoOp [49] 67.56 66.33 92.60 90.37 67.43 77.53 84.33 21.37 66.77 50.23 54.93 71.23
CoCoOp [48] 66.79 69.43 93.83 91.27 67.22 72.08 85.65 12.68 68.33 48.54 55.33 70.30
MaPLe [18] 69.27 62.67 92.57 89.10 66.60 83.30 80.50 26.73 64.77 52.13 71.80 71.83
PSRC [19] 72.32 68.13 93.67 92.00 69.40 85.93 84.87 27.67 69.67 56.23 73.13 74.80
LLaMP 72.42 69.12 94.59 91.91 70.02 84.03 85.83 30.39 69.69 54.98 70.36 75.72
Table 9. Few-shot classification results with 8/4/2/1 shots. All numbers, except ours, are obtained from [19].