Large Language Models Are Good Prompt Learners

Abstract

Low-shot image classification, where training images are limited or inaccessible, has benefited from recent progress on pre-trained vision-language (VL) models with strong generalizability, e.g. CLIP. Prompt learning methods built with VL models generate text features from the class names that only have confined class-specific information. Large Language Models (LLMs), with their vast encyclopedic knowledge, emerge as the complement. Thus, in this paper, we discuss the integration of LLMs to enhance pre-trained VL models, specifically on low-shot classification. However, the domain gap between language and vision blocks the direct application of LLMs. Thus, we propose LLaMP, Large Language Models as Prompt learners, that produces adaptive prompts for the CLIP text encoder, establishing it as the connecting bridge. Experiments show that, compared with other state-of-the-art prompt learning methods, LLaMP yields better performance on both zero-shot generalization and few-shot image classification, over a spectrum of 11 datasets. Code will be made available at: https://github.com/zhaohengz/LLaMP.

Figure 1. Demonstration of LLaMP: (a) LLMs can provide visual descriptions for fine-grained object categories. Human: "In one sentence, describe the distinctive appearance of a Yak-40, a type of aircraft." LLaMA: "The Yak-40 has a unique trijet configuration with a large passenger window section and a sloping nose, along with three engines mounted on the rear of the aircraft, creating an unmistakable silhouette in the sky." Extracted noun phrases: trijet configuration, large passenger window section, sloping nose, three engines. (b) Zero-shot base-to-novel generalization benefits from the LLM knowledge (Base/Novel/HM accuracies of CLIP, CLIP+LLM, and LLaMP).
1. Introduction

Low-shot image classification tasks, including few-shot and zero-shot variants, aim to learn from a set of class names along with a limited or empty set of images. Such capacities are crucial for the extension and generalization of vision systems. Vision-Language (VL) models trained on large-scale web data, such as CLIP [32] and ALIGN [14], provide a new paradigm due to their generalization capabilities that include zero-shot classification, and have been used in recent work [17–19, 23, 24, 48, 49]. Due to the scarcity of images for training, methods built for both tasks rely heavily on category names alone as the source of class-specific knowledge, resulting in a shortage of distinguishable descriptions. Meanwhile, Large Language Models (LLMs), e.g. GPT-4 [28] and LLaMA [38, 39], have demonstrated their encyclopedic knowledge and can thus provide linguistic visual descriptions for objects. Here, we investigate how to leverage LLMs for low-shot image classification.

The emergence of prompt learning has provided an efficient way to adapt large pre-trained models. Previous work has explored various strategies to prompt vision-language (VL) models, including vision-conditioned text prompt learning [48], joint VL prompt learning [18] and self-regulated VL prompts [19]. On the text side, regardless of the learning strategy, learned prompt vectors are shared across all categories. The only difference among text inputs is the class name. In low-shot scenarios where visual data is limited, the extraction of class-specific knowledge from textual inputs becomes essential. However, the current paradigm, which relies on the CLIP text encoder to distinguish between class names, faces challenges, particularly with fine-grained target categories. For example, in FGVCAircraft [25], the class name "Yak-40" can barely provide any information for recognizing the object.

Large Language Models, trained with large text corpora, are good candidates to serve as the complement. As in Fig. 1a, when queried about the "Yak-40", the LLM generates a sentence detailing the visual appearance of the Yak-40 that can be further parsed into noun phrases and integrated into text prompts, providing richer information compared with the ordinary prompt. We also show in Fig. 1b that by simply incorporating noun phrases extracted from an LLM's response, the performance of the ordinary CLIP model is improved by more than 1% without any training. Although recent prompt-learning-based methods have shown notable improvements, it is non-trivial to apply them to textual visual descriptions generated by LLMs. Thus, instead of directly taking LLM generations as the textual input, we aim at producing class-specific representations by adapting LLMs to low-shot image classification.

One challenge of this adaptation is the domain gap between vision and language. When trained exclusively with textual corpora, the latent feature space of an LLM significantly diverges from that of its visual counterpart. Even worse, the data scarcity under the low-shot scenario makes it virtually impossible to align the two spaces through a plain contrastive loss. We argue that the CLIP text encoder, which is trained to project features from the language domain into the joint VL domain, can serve as the bridge. Thus, we propose the LLaMP framework, Large Language Models as Prompt learners, which leverages LLMs to learn informative prompts for CLIP models. In LLaMP, we treat the LLM as the prompt learner of the CLIP text encoder. More specifically, for each object category, LLaMP extracts corresponding knowledge from the LLM and yields class-specific prompt vectors, which are further combined with class-agnostic prompt embeddings (as in previous approaches) and encoded by the CLIP text encoder. We design an efficient tuning pipeline to avoid fully fine-tuning the language model while performing effective adaptation.

Following the protocol in [48, 49], we evaluate LLaMP in two typical scenarios: zero-shot base-to-novel generalization [49] and few-shot image classification. For each scenario, we run LLaMP on 11 datasets covering a spectrum of tasks. On average, LLaMP achieves a 1.3% boost on the harmonic mean against the state-of-the-art PSRC [19], and 9.6% over the ordinary CLIP [32], on base-to-novel generalization. We also observe an average improvement of 0.94% on 16-shot image classification.

In summary, our approach makes use of Large Language Models to improve performance in low-shot image classification scenarios. The main contributions are: i) To the best of our knowledge, we are the first to investigate how to use the encyclopedic knowledge inherent in Large Language Models (LLMs) to enhance low-shot image classification; ii) We design a framework, LLaMP, to effectively adapt LLMs for image classification, without training the entire language model, and achieve state-of-the-art results in both few-shot and zero-shot settings; iii) We conduct extensive analysis investigating the effectiveness of each component of LLaMP, and discuss the optimal setup for LLM-aided image classification.

2. Related Work

Large Language Models (LLMs). Recent years have witnessed remarkable progress in scaling up the size and capabilities of LLMs. Zhang et al. [47] first introduced a suite of transformers pre-trained at scale, followed by PaLM [6]. ChatGPT/GPT-4 [27, 28] emerged as a milestone conversational model, demonstrating impressive abilities as a generalist. Vicuna [5] further advanced by learning from ChatGPT, while LLaMA [38] demonstrated that larger-scale training yields stronger foundation models. The subsequent LLaMA-2 [39] and PaLM-2 [2] achieved further gains in scale, efficiency and reasoning. Most recently, Almazrouei et al. [1] released Falcon, a 40B model.

Zero-Shot Learning (ZSL). ZSL stands in contrast to traditional fully-supervised paradigms. Instead of relying on direct visual training samples, it leverages side information that can be drawn from a multitude of non-visual domains, including attributes [22], word embeddings [36, 40], and descriptive texts [34]. Zhang et al. [46] designed an embedding model to bridge the gap between seen and unseen categories. Concurrently, studies like [4, 44, 50] have spotlighted that generative models can produce features for unseen categories. Moreover, Graph Convolution Networks (GCN) [20] have been explored in research such as [16, 40] for further generalization.

Prompt Learning. With the progress in large-scale vision-language models, such as CLIP [32] and ALIGN [14], which reveal their capacity for zero-shot transferability, prompt learning has emerged as an efficient learning scheme, where learnable prompts are appended to the input to fine-tune models. For low-shot image classification, CoOp [49] and CoCoOp [48], which model context words as learnable vectors to automate prompt engineering, have shown significant improvements over regular CLIP. MaPLe [18] further employed a hierarchical multi-modal prompting strategy across transformer blocks for progressive feature modeling. Kan et al. [17] incorporated external knowledge by designing knowledge-aware prompts and an adaptation head for better generalization. Lee et al. [23] used masked attention to prevent internal representation shift for better generalization. Khattak et al. [19] further improved prompt learning by guiding prompts to balance task-specific and task-agnostic knowledge via mutual agreement maximization and prompt ensembling.
3. Approach

3.1. Preliminaries

Similar to previous CLIP-based learning approaches, we consider the classification problem as an image-text matching problem. We denote the image encoder and the text encoder, in CLIP-like models, as F and G, parameterized by θ_F and θ_G, respectively. An input image x ∈ R^{C×H×W} is split into M equal-sized patches which are converted into a sequence of embeddings x̃ = {e_cls, e_1, e_2, ..., e_M}. The visual input sequence x̃ is encoded by the image encoder, producing the image feature f = F(x̃). On the text side, the text label y and the associated name is formatted as "A photo of [STH]" and tokenized into a sequence of tokens ỹ = {t_bos, t_1, t_2, ..., t_L, t_eos}, where L is the length of input tokens. The input sequence is then encoded into g = G(ỹ). For image classification, target class labels {1, 2, ..., C} are encoded into text features g_i. The classification is done by picking the class that has the highest similarity with the vision feature: ŷ = argmax_i C(f, g_i), where C is the softmax cosine-similarity function C(f, g_i) = exp(f·g_i/τ) / Σ_{j=1}^{C} exp(f·g_j/τ) with temperature τ.
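To make this matching rule concrete, the PyTorch sketch below implements the classification step on pre-computed features. It is illustrative only: random tensors stand in for the outputs of the CLIP encoders, and the temperature value is a placeholder rather than CLIP's learned one.

```python
import torch
import torch.nn.functional as F

def classify(image_feat: torch.Tensor, text_feats: torch.Tensor, tau: float = 0.01):
    """Zero-shot classification: pick the class whose text feature is most similar to f.

    image_feat: (d,) image feature f from the image encoder.
    text_feats: (C, d) text features g_1..g_C, one per class name.
    """
    # Cosine similarity reduces to a dot product after L2 normalization.
    f = F.normalize(image_feat, dim=-1)
    g = F.normalize(text_feats, dim=-1)
    logits = (f @ g.t()) / tau          # (C,) temperature-scaled similarities
    probs = logits.softmax(dim=-1)      # softmax cosine-similarity function C(f, g_i)
    return int(probs.argmax()), probs

# Toy usage with random stand-in features (d = 512, C = 3 classes).
pred, probs = classify(torch.randn(512), torch.randn(3, 512))
```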
Multimodal Prompt Learning. Given the size of the CLIP model, fine-tuning the entire model becomes infeasible. As both image and text encoders are built with the standard transformer architecture, prompt learning, which tunes the model by combining trainable prompts with hidden states, has been applied to the text encoder [48, 49], the image encoder [15, 41, 42], or both [18, 19, 33]. Similar to [19, 33], we build our method following the vision-language prompting paradigm, with deep prompting [15, 19], which inserts prompts not only at the input layer but also at later encoder layers.

More specifically, for each transformer layer that takes prompts, we define V learnable visual prompts p_v = {p_v^1, p_v^2, ..., p_v^V} and T learnable language prompts p_t = {p_t^1, p_t^2, ..., p_t^T}. For the i-th vision encoder layer, visual prompts p_v^i are appended to the input embeddings: x̃_p^i = {e_cls^i, e_1^i, e_2^i, ..., e_M^i, p_v^i}. The prompt-augmented vision feature, f_p = F(x̃_p), is produced by jointly encoding prompts and the image. As the ViT [9] architecture in CLIP adopts the bi-directional attention mechanism, the placement of p_v has no effect on f_p. On the language side, prompts are concatenated with the input of the i-th text encoder layer: ỹ_p^i = {t_bos^i, p_t^i, t_1^i, t_2^i, ..., t_L^i, t_eos^i}. ỹ_p is further processed by the text encoder, resulting in the prompt-augmented language feature g_p = G(ỹ_p). More specifically, prompts to the first layer p_t^1 are initialized with the embeddings of "A photo of a".
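As a concrete illustration of deep prompting on the vision side, the sketch below appends a set of learnable prompt tokens to the token sequence entering a transformer layer and strips them again before the next layer, which then adds its own prompts. The encoder block here is a generic nn.TransformerEncoderLayer used as a stand-in for a CLIP ViT block, so the dimensions, prompt count, and initialization are assumptions rather than the exact LLaMP configuration.

```python
import torch
import torch.nn as nn

class PromptedLayer(nn.Module):
    """Appends V learnable prompt tokens to the input of one transformer layer."""

    def __init__(self, d_model: int = 768, n_prompts: int = 4):
        super().__init__()
        # Stand-in block; in practice the pre-trained CLIP ViT layer is reused, not re-created.
        self.block = nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
        self.prompts = nn.Parameter(0.02 * torch.randn(n_prompts, d_model))  # p_v for this layer

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, 1 + M, d) -- [CLS] token followed by M patch embeddings
        p = self.prompts.unsqueeze(0).expand(x.size(0), -1, -1)
        x = torch.cat([x, p], dim=1)                    # {e_cls, e_1, ..., e_M, p_v}
        x = self.block(x)
        return x[:, : -self.prompts.size(0)]            # drop prompt slots; the next layer adds its own

out = PromptedLayer()(torch.randn(2, 197, 768))          # ViT-B/16 at 224x224: 196 patches + [CLS]
```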
Low-Rank Adaptation (LoRA) [13]. As a parameter-efficient tuning technique, LoRA is designed to adapt large transformer models without updating the original model weights. The LoRA technique is, in particular, applied to linear projection layers. More specifically, for a linear layer with weight W_0 ∈ R^{d×k}, LoRA creates ∆W by learning two low-rank matrices B ∈ R^{d×r} and A ∈ R^{r×k}:

\bm{h} = (W_0 + \Delta W)\bm{x} = W_0\bm{x} + BA\bm{x}. \quad (1)

We adopt a hybrid tuning scheme on the vision encoder, which performs prompt learning on the first few layers and applies LoRA on the rest.
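A minimal LoRA wrapper implementing Eq. (1) is sketched below. It omits the scaling factor and dropout used in the reference LoRA implementation, and the rank and dimensions are illustrative; it is not the exact module used in LLaMP.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """h = W0 x + B A x (Eq. 1), with W0 frozen and only A, B trained."""

    def __init__(self, base: nn.Linear, r: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                        # original weights stay frozen
        d, k = base.out_features, base.in_features
        self.A = nn.Parameter(0.01 * torch.randn(r, k))    # A in R^{r x k}
        self.B = nn.Parameter(torch.zeros(d, r))           # B in R^{d x r}; zero init => dW = 0 at start

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + x @ self.A.t() @ self.B.t()

layer = LoRALinear(nn.Linear(768, 768), r=4)
y = layer(torch.randn(2, 768))                             # same shape as the original layer's output
```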
3.2. Adaptive Prompt Learning with LLMs

The goal of prompt tuning is to find a set of optimal prompts p = {p_v, p_t} which maximizes the log-likelihood of P(x, y | θ_F, θ_G) over the target downstream distribution (x, y) ~ (X, Y):

\bm{p} = \argmax_{\bm{p}} \mathbb{E}_{(\bm{x},\bm{y})\sim(\bm{X},\bm{Y})} \log \mathcal{C}(\mathcal{F}(\bm{x};\bm{p}_v), \mathcal{G}(\bm{y};\bm{p}_t)) \quad (2)

However, the p optimized through Eqn. 2 has two issues. First, p is shared across all categories of the downstream task, while the optimal prompt for each category might be different. Second, in low-shot scenarios, p is usually empirically estimated from a limited training set X^train with limited categories {1, 2, ..., C^base}, and therefore such p can often be over-fitted to the small training set X^train and fail to generalize to novel categories outside {1, 2, ..., C^base}.

To overcome these problems, we propose to learn a meta function on the language side, p_t = Θ(y), which can adaptively estimate the optimal prompt for each category. An intuitive way to estimate proper prompts p for category name y is to take advantage of the knowledge of a pre-trained Large Language Model (LLM) D and extract discriminative descriptions of category y. For example, given the input text z: "Describe {y}",

\bm{p}_t = \{p_1, p_2, ..., p_k\} = \mathcal{D}(\bm{z}), \quad (3)

where p_i is sequentially generated by D such that

\begin{aligned} p_i &= \mathcal{D}(\bm{z}, t_1, ..., t_{i-1}) = \mathcal{D}^{(i)}(\bm{z}) \\ t_i &= \mathcal{M}(p_i), \end{aligned} \quad (4)

where D^{(i)} is the i-th forward iteration of D, and M maps continuous hidden states into discrete language tokens. To accelerate the process and obtain p in one pass, we approximate the above process with K learnable prompts p_l = {θ_1, ..., θ_K} so that

\bm{p}_t = \Theta(y) = \mathcal{D}(\{\theta_1, ..., \theta_K\} | \bm{z}). \quad (5)

Discussion. While Large Language Models (LLMs) possess robust foundational knowledge within the linguistic domain, it is not feasible to directly substitute the text encoder of CLIP with an LLM. The reason lies in the inherent divergence between the LLM's latent space, which is purely language-oriented, and the image-focused latent space of vision encoders. Attempting a direct alignment via contrastive learning would require an extensive dataset that is typically beyond the scope of low-shot learning. To bridge this gap, we introduce LLaMP, an adaptive prompt learning framework that leverages the LLM to craft class-specific prompt vectors to reinforce the text encoder for low-shot image classification.
Figure 2. An overview of the LLaMP framework: We first generate the knowledge cache by passing the query prompt (e.g. "Describe a Chevrolet Corvette ZR1 2012") through the LLM D and use the knowledge cache to encode p_l, resulting in the adaptive prompts h̃_l^i = W^i h_l + b^i for the CLIP text encoder. h̃_l is combined with the regular learnable prompts of G to generate the final text feature vector g_p. The image feature vector f_p is obtained through a hybrid tuning strategy combining prompt learning and low-rank adaptation (LoRA).
3.3. The LLaMP Framework

Fig. 2 shows an overview of the LLaMP framework. For convenience, we denote the decoder-only LLM as D. The input to the decoder D consists of two components: textual prompts y in the form of sentences, tokenized as ỹ, and learnable prompts p_l. We append the prompt embeddings to the end of the input sequence and obtain the last hidden states of D as the feature h_l:

\bm{h}_l = \mathcal{D}(\bm{\tilde{y}}, \bm{p}_l)[L+1:L+K], \quad L = \text{Length}(\bm{\tilde{y}}). \quad (6)

Hidden states of D are then mapped to the input space of the CLIP text encoder by the projection matrix W ∈ R^{d_1×d_2}, where d_1 and d_2 are respectively the hidden sizes of the LLM D and the CLIP text encoder G. A set of prompt-specific biases b ∈ R^{K×d_2} is added:

\bm{\tilde{h}}_l = W\bm{h}_l + b. \quad (7)

We combine h̃_l from the LLM with regular learnable prompts, as in previous approaches [19], to construct the input for the CLIP text encoder. Similar to deep prompting [15, 19], we create layer-specific prompts through different W matrices and b vectors. For the i-th layer, we let h̃_l^i = W^i h_l + b^i, and the entire sequence is constructed as

\bm{\tilde{y}}_l^i = \{\bm{t}_{bos}^i, \bm{p}_t^i, \bm{t}_1^i, \bm{t}_2^i, \dots, \bm{t}_L^i, \bm{\tilde{h}}_l^i, \bm{t}_{eos}^i\}. \quad (8)
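The layer-wise projection of Eqs. (7) and (8) can be sketched as below. The hidden sizes (4096 for a LLaMA-2-7B decoder, 512 for the CLIP text encoder) and the prompt depth of 9 match the values reported in the implementation details, but the module itself is a simplified stand-in for the actual LLaMP code.

```python
import torch
import torch.nn as nn

class LayerwisePromptProjection(nn.Module):
    """Maps LLM hidden states h_l to per-layer CLIP prompts: h~_l^i = W^i h_l + b^i."""

    def __init__(self, d_llm: int = 4096, d_clip: int = 512, n_layers: int = 9):
        super().__init__()
        # One (W^i, b^i) pair per prompted CLIP text-encoder layer.
        self.proj = nn.ModuleList(nn.Linear(d_llm, d_clip) for _ in range(n_layers))

    def forward(self, h_l: torch.Tensor):
        # h_l: (K, d_llm) hidden states of the K learnable LLM prompts.
        return [proj(h_l) for proj in self.proj]       # each element: (K, d_clip)

h_l = torch.randn(16, 4096)                            # e.g. K = 16 prompts from a LLaMA-2-7B decoder
per_layer = LayerwisePromptProjection()(h_l)
# per_layer[i] is inserted next to the class tokens and the shared prompts p_t^i
# to build the i-th text-encoder input of Eq. (8).
```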
LLM Knowledge Cache. A Large Language Model (LLM), as implied by its name, typically comprises billions of parameters. For example, the most compact LLaMA [38, 39] model has 7B parameters. Thus, even performing prompt learning on an LLM becomes impractical: the memory consumption to store gradients for back-propagation can go beyond the limit of mainstream GPUs. Instead, the causal attention mechanism inherent in decoder-only LLMs, where the embedding of an input token only depends on the preceding tokens, facilitates a feasible workaround. As previously mentioned, the prompt embeddings p_l are appended to the end of the text tokens ỹ. According to the causal attention mechanism, ỹ is encoded independently of p_l. Thus, we design a two-stage process, where we create the LLM knowledge cache by passing ỹ through D and leverage the cache to convert p_l into class-specific embeddings for the CLIP text encoder G.

To compute the attention of a token, the only dependency is the Key and Value vectors of the preceding tokens. Thus, we adopt the KV-cache [31, 43], a technique used in inference acceleration of LLMs, to create the knowledge cache. At the first stage, we pass the text tokens ỹ through the language model D and save the Keys and Values as the knowledge cache for the second stage. Once computed, the knowledge cache remains fixed throughout the entire training process and bears the information that is needed for further computation. Thus, in LLaMP, we leverage the knowledge cache obtained at the first stage to generate class-specific prompt embeddings.

At the second stage, we create class-specific prompt embeddings from the pre-computed knowledge cache. As p_l is not initialized in the natural language domain, it need not pass through the entire LLM; instead, we insert those prompts p_l into the last layer of the LLM, D_N. This is achieved by encoding them alongside the cache from ỹ, as in

\bm{H}_l = \mathcal{D}_N(\bm{K}_{\bm{\tilde{y}}}, \bm{V}_{\bm{\tilde{y}}}, \bm{p}_l), \quad (9)

where K_ỹ, V_ỹ represent the knowledge cache. This design enables LLaMP to efficiently learn informative prompt embeddings for the CLIP encoder G, incurring modest training costs compared with training the entire LLM, while maintaining the essential knowledge inherent in the LLM decoder D.
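The two-stage computation can be illustrated with the single-head attention sketch below. It is a heavily simplified stand-in, assuming freshly initialized projection weights and a random placeholder for the LLM hidden states of the class description; a real implementation would reuse the pre-trained weights of the last LLaMA decoder layer (with multi-head attention, rotary embeddings, normalization and the FFN) and its standard KV cache.

```python
import math
import torch
import torch.nn as nn

d = 4096                                         # hidden size of a LLaMA-7B decoder layer
W_q, W_o = nn.Linear(d, d), nn.Linear(d, d)      # trainable in LLaMP
W_k, W_v = nn.Linear(d, d), nn.Linear(d, d)      # frozen
for m in (W_k, W_v):
    for p in m.parameters():
        p.requires_grad = False

# Stage 1 (run once per class): pass the tokenized description through the frozen LLM
# and store the last layer's Keys/Values as the knowledge cache.
with torch.no_grad():
    desc_hidden = torch.randn(1, 42, d)          # placeholder for the LLM hidden states of y~
    K_cache, V_cache = W_k(desc_hidden), W_v(desc_hidden)

# Stage 2 (every training step): encode the K learnable prompts against the fixed cache.
p_l = nn.Parameter(0.02 * torch.randn(1, 16, d))  # K = 16 learnable prompts

def encode_prompts(p_l: torch.Tensor) -> torch.Tensor:
    q = W_q(p_l)
    k = torch.cat([K_cache, W_k(p_l)], dim=1)     # cached keys + the prompts' own keys
    v = torch.cat([V_cache, W_v(p_l)], dim=1)
    attn = torch.softmax(q @ k.transpose(1, 2) / math.sqrt(d), dim=-1)
    return W_o(attn @ v)                          # rough analogue of H_l in Eq. (9)

H_l = encode_prompts(p_l)                         # (1, 16, d), fed to the projection of Eq. (7)
```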
Training Targets of LLaMP. Although the training strategy in Eqn. 9 has reduced the number of learnable parameters, a full decoder layer inside an LLM still consists of an enormous number of parameters. For example, one layer in LLaMA-7B bears 200M parameters, making training of the entire layer costly. Moreover, as the goal is to leverage the knowledge from the LLM, altering a full layer can lead to the loss of knowledge. As shown in Fig. 2, a typical decoder layer has two major components: the self-attention module, consisting of Query, Key, Value and Output projection layers, and the Feed-Forward Network (FFN). LLaMP targets the Query and Output projection layers inside the self-attention module. By updating the Query layer, the LLM prompts p_l learn to distill pertinent information from the knowledge cache, and the Output layer projects it to the latent space. We keep the Key and Value layers frozen to ensure the alignment between p_l and the knowledge cache. We leave the FFN unchanged to preserve the knowledge. Further discussion of these choices is provided in Sec. 4.3.
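In code, this freezing policy amounts to toggling requires_grad by parameter name on the last decoder layer. The sketch below uses a dummy module whose sub-module names mirror the common LLaMA implementation (q_proj, k_proj, v_proj, o_proj, and an FFN); the naming is an assumption and would need to match the actual backbone.

```python
import torch.nn as nn

class DummyDecoderLayer(nn.Module):
    """Minimal stand-in mirroring the sub-module names of a LLaMA decoder layer."""

    def __init__(self, d: int = 32):
        super().__init__()
        self.self_attn = nn.ModuleDict({k: nn.Linear(d, d)
                                        for k in ("q_proj", "k_proj", "v_proj", "o_proj")})
        self.mlp = nn.ModuleDict({k: nn.Linear(d, d)
                                  for k in ("gate_proj", "up_proj", "down_proj")})

def set_trainable(layer: nn.Module, trainable_keys=("q_proj", "o_proj")) -> None:
    # Only the Query and Output projections receive gradients; the K/V projections and
    # the FFN stay frozen to preserve alignment with the pre-computed knowledge cache.
    for name, param in layer.named_parameters():
        param.requires_grad = any(key in name for key in trainable_keys)

layer = DummyDecoderLayer()
set_trainable(layer)
print(sorted({n.split(".")[1] for n, p in layer.named_parameters() if p.requires_grad}))
# ['o_proj', 'q_proj']
```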
Textual Priors from Pre-Generated Responses. We extend the initial prompt, "In one sentence, describe the distinctive appearance of [STH]", by incorporating the response generated by the language model into the input sequence. This approach enriches the base content: the generated text provides a clear and explicit description of the object's appearance, acting as a valuable informative prior for language model adaptation. However, it is common for responses from an LLM to include filler words like "sure" for sentence-structure coherence. To refine the input, we parse the noun phrases from the LLM's response with spaCy [12], an NLP engine, and merge them with the initial prompt, forming a more focused and informative language prior.
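Noun-phrase extraction with spaCy can be as simple as the sketch below, which applies the library's noun_chunks iterator to a response like the one in Fig. 1a; the pipeline name and the prompt template are illustrative.

```python
import spacy

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load("en_core_web_sm")

response = ("The Yak-40 has a unique trijet configuration with a large passenger "
            "window section and a sloping nose.")

# Keep the noun phrases only, discarding filler words an LLM may prepend (e.g. "Sure, ...").
noun_phrases = [chunk.text for chunk in nlp(response).noun_chunks]
prompts = [f"A photo of a Yak-40 with {np}." for np in noun_phrases]
```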
Textual Augmentations. Following the insights of Khattak et al. [19], which highlight the performance benefits of diverse textual inputs, we aim to further augment the text inputs used in the CLIP text encoder. Our approach, building upon the methods in [19, 48], incorporates hand-crafted templates and expands their diversity through a two-step process: i) We introduce noun phrases into the existing templates for CLIP, for example, transforming "A photo of [STH]" into "A photo of [STH] with [NP]", thereby enriching the descriptive content; ii) We create a variety of new prompt templates for the LLM, similar to "In one sentence, describe the distinctive appearance of [STH]", through GPT-4 [28], to further diversify the text input.

3.4. Training and Inference

Similar to PSRC [19], our objective function consists of three components: the main cross-entropy loss L_CE, a feature-level L1 regularization L_l1, and a soft distillation loss L_dist. Given C training categories and N training samples, L_CE is defined as

\mathcal{L}_{CE} = -\frac{1}{\mathcal{N}}\sum_i \log\frac{\exp(\bm{f_p}^i \cdot \bm{g_p}^i/\tau)}{\sum_j \exp(\bm{f_p}^i \cdot \bm{g_p}^j/\tau)}. \quad (10)

The L1 regularization is computed between the learned features f_p, g_p and the pre-trained CLIP features f̂, ĝ:

\mathcal{L}_{l1} = \frac{1}{\mathcal{N}}\sum_i \lambda_v|\bm{f_p}^i - \bm{\hat{f}}^i| + \frac{1}{C}\sum_i \lambda_t|\bm{g_p}^i - \bm{\hat{g}}^i|, \quad (11)

where λ_v and λ_t are coefficients. The prediction of LLaMP is further bound by the KL-divergence between the predicted distributions of LLaMP and vanilla CLIP:

\mathcal{L}_{dist} = \lambda_{dist} D_{KL}(\bm{f_p} \cdot \bm{g_p}, \bm{\hat{f}} \cdot \bm{\hat{g}}). \quad (12)

We sum all three losses up as the final objective function: L = L_CE + L_l1 + L_dist.

During training, we randomly sample one LLM template as the input of LLaMP for each batch. For inference, we compute the probability distribution predicted from each input template and average them.
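The overall objective can be sketched as follows on pre-computed features. The λ coefficients are the values listed in the implementation details; the temperature, feature handling, and reductions are simplifications of what the released code would do.

```python
import torch
import torch.nn.functional as F

def llamp_objective(f_p, g_p, f_hat, g_hat, labels, tau=0.01,
                    lambda_v=10.0, lambda_t=25.0, lambda_dist=2.5):
    """Cross-entropy + L1 feature regularization + KL distillation (Eqs. 10-12).

    f_p, f_hat: (N, d) learned / frozen-CLIP image features (assumed L2-normalized).
    g_p, g_hat: (C, d) learned / frozen-CLIP text features, one per training class.
    labels:     (N,) ground-truth class indices.
    """
    logits = f_p @ g_p.t() / tau
    loss_ce = F.cross_entropy(logits, labels)                                  # Eq. (10)

    loss_l1 = (lambda_v * (f_p - f_hat).abs().mean()
               + lambda_t * (g_p - g_hat).abs().mean())                        # Eq. (11)

    logits_hat = f_hat @ g_hat.t() / tau
    loss_dist = lambda_dist * F.kl_div(logits.log_softmax(dim=-1),
                                       logits_hat.softmax(dim=-1),
                                       reduction="batchmean")                  # Eq. (12)

    return loss_ce + loss_l1 + loss_dist

loss = llamp_objective(torch.randn(8, 512), torch.randn(10, 512),
                       torch.randn(8, 512), torch.randn(10, 512),
                       torch.randint(0, 10, (8,)))
```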
4. Experiments

4.1. Experiment Setup

Datasets. Similar to previous work [18, 19, 48], we evaluate LLaMP over a spectrum of classification tasks on 11 datasets, including ImageNet [8] and Caltech101 [10] for generic image classification, OxfordPets [29], StanfordCars [21], Flowers102 [26], Food101 [3], and FGVCAircraft [25] for fine-grained classification, SUN397 [45] for scene recognition, UCF101 [37] for action recognition, DTD [7] for texture classification, and EuroSAT [11] for satellite image recognition.

Scenarios & Metrics. We evaluate LLaMP on two typical low-shot scenarios: zero-shot base-to-novel generalization and few-shot image classification. In zero-shot base-to-novel generalization, the base classes are seen during training, while the novel classes are unseen. We measure model performance through the accuracies on base and novel classes, and the harmonic mean of the two. For few-shot classification, we assess the accuracy with 16 shots per class.

Implementation Details. We build LLaMP through the PyTorch [30] framework.
Method          Average            ImageNet [8]       Caltech101 [10]    OxfordPets [29]
                Base  Novel  HM    Base  Novel  HM    Base  Novel  HM    Base  Novel  HM
CLIP [32] 69.34 74.22 71.70 72.43 68.14 70.22 96.84 94.00 95.40 91.17 97.26 94.12
CoOp [49] 82.69 63.22 71.66 76.47 67.88 71.92 98.00 89.81 93.73 93.67 95.29 94.47
CoCoOp [48] 80.47 71.69 75.83 75.98 70.43 73.10 97.96 93.81 95.84 95.20 97.69 96.43
KAPT∗ [17] 78.41 70.52 74.26 71.10 65.20 68.02 97.10 93.53 95.28 93.13 96.53 94.80
ProDA [24] 81.56 72.30 76.65 76.66 70.54 73.47 97.74 94.36 96.02 95.43 97.76 96.58
MaPLe [18] 82.28 75.14 78.55 75.40 70.32 72.72 98.27 93.23 95.68 95.43 97.83 96.62
RPO [23] 81.13 75.00 77.78 76.60 71.57 74.00 97.97 94.37 96.03 94.63 97.50 96.05
PSRC [19] 84.26 76.10 79.97 77.60 70.73 74.01 98.10 94.03 96.02 95.33 97.30 96.30
LLaMP 85.16 77.71 81.27 77.99 71.27 74.48 98.45 95.85 97.13 96.31 97.74 97.02
∆ w.r.t. PSRC +0.90 +1.61 +1.30 +0.39 +0.54 +0.47 +0.35 +1.82 +1.11 +0.98 +0.44 +0.72
Table 1. Comparison with state-of-the-art methods on base-to-novel generalization. LLaMP shows strong generalization results over
previous approaches on 11 image classification tasks. Absolute gains over PSRC are indicated in blue. ∗ KAPT is trained with ViT-B/32
image encoder instead of ViT-B/16.
All models are trained with 2 NVIDIA A100 40GB GPUs. For LLaMP, we adopt LLaMA2-7B [39] as the language model D, and ViT-B/16 [9] as the image encoder, following [18, 19, 48, 49]. On the text side, we set the prompt learning depth to 9. To tune the vision encoder, we adopt the hybrid tuning scheme which performs deep prompt learning on the first 6 layers and LoRA on the rest. Similar to [13], LoRA is applied to the Query and Value projection layers inside attention modules. The number of p_l prompts, K, is set to 16. We set a global learning rate of 2e-4 with a batch size of 8. The learning rate of the LoRA modules is set to 2e-5. λ_t, λ_v and λ_dist are set to 25, 10 and 2.5, respectively.

4.2. Quantitative Evaluation

Zero-Shot Base-to-Novel Generalization. LLaMP outperforms existing state-of-the-art prompt learning methods on most metrics across the 11 classification datasets in the base-to-novel generalization benchmark.
16-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 78.79 (65.02) 67.31 95.43 85.34 80.44 97.37 82.90 45.36 73.28 69.96 87.21 82.11
CoOp [49] 79.89 (73.82) 71.87 95.57 91.87 83.07 97.07 84.20 43.40 74.67 69.87 84.93 82.23
CoCoOp [48] 74.90 (70.70) 70.83 95.16 93.34 71.57 87.84 87.25 31.21 72.15 63.04 73.32 78.14
MaPLe [18] 81.79 (75.58) 72.33 96.00 92.83 83.57 97.00 85.33 48.40 75.53 71.33 92.33 85.03
PSRC [19] 82.87 (77.90) 73.17 96.07 93.67 83.83 97.60 87.50 50.83 77.23 72.73 92.43 86.47
LLaMP 83.81 (78.50) 73.49 97.08 94.21 86.07 98.06 87.62 56.07 77.02 74.17 91.31 86.84
Table 2. Few-shot classification results with 16 shots. Numbers in brackets indicate the average performance over 1/2/4/8/16 shots.
As shown in Tab. 1, compared to the latest model PSRC [19], LLaMP achieves average gains of 0.90% in base accuracy, 1.61% in novel accuracy, and 1.30% in harmonic mean. Moreover, LLaMP consistently achieves higher harmonic means (HM) compared to other models. These improvements indicate that our approach better balances performance on base and novel data, thus achieving stronger generalization compared to prior prompt learning techniques.

In particular, LLaMP excels on fine-grained datasets requiring detailed analysis. On FGVCAircraft, LLaMP surpasses PSRC by 4.57% on base accuracy and 1.75% on HM, highlighting its strong understanding of detailed aircraft features. Furthermore, on EuroSAT, LLaMP achieves improvements of 9.76% and 5.28% on novel accuracy and HM, respectively. We also observe similar performance gains on StanfordCars, where LLaMP outperforms PSRC by 3.29% on base accuracy and 1.31% on HM. The information embedded in the LLM enables LLaMP to capture and utilize the rich semantic information necessary for distinguishing between closely related categories.

Few-Shot Classification. LLaMP also achieves improvements across these classification datasets on few-shot classification tasks, as in Tab. 2, with an average classification accuracy of 83.81%. Notably, on FGVCAircraft and StanfordCars, LLaMP shows a significant improvement over PSRC, further demonstrating that the knowledge from language models benefits the recognition of fine-grained object categories, which aligns with our observation on zero-shot base-to-novel generalization. Moreover, on DTD, where MaPLe and PSRC achieve around 72% accuracy, LLaMP achieves a higher accuracy of 74.17%, underscoring its ability to recognize textures.

4.3. Ablation Study

Is the knowledge from the LLM helping? In Tab. 3, we show that the knowledge from the LLM helps in two ways: without training, the performance of the ordinary CLIP model can be improved by introducing noun phrases; the LLaMP framework shows further improvement after training.

Noun phrases are parsed from the LLM's responses to the prompt "Describe [STH]". We then use the template "A photo of [STH] with [NP]" to generate the NP-augmented text embedding for CLIP, and take the average of all augmented embeddings for classification. In Tab. 3 we show that even ordinary CLIP can benefit from incorporating LLMs' knowledge.

Furthermore, the comparison between LLaMP and LLaMP without the LLM indicates that merely integrating LoRA [13] into the vision encoder is not beneficial. "LLaMP without LLM" is essentially an ordinary prompt learning model plus LoRAs in the vision encoder. We show that the improved vision encoding capacity only helps when the quality of the text embeddings is enhanced by incorporating LLMs' knowledge through LLaMP.

Method    LLM    Base     Novel    HM
CLIP             69.34    74.22    71.70
CLIP      ✓      70.95    74.93    72.79
LLaMP            82.21    76.44    79.22
LLaMP     ✓      85.16    77.71    81.27

Table 3. Ablation study on the LLM knowledge.

Decoder Training Strategy. We categorize the trainable parameters of D_N into four groups: learnable prompts (LP), Query and Output projections (QO), Key and Value projections (KV), and the feed-forward network (FFN). Tab. 4 indicates that LLaMP can achieve desirable results by just learning the prompts of D. Going one step further, adding QO into the optimization achieves the best performance. Although other

LP   QO   KV   FFN   %      Base     Novel    HM
✓                    0.03   85.00    77.29    80.96
✓    ✓    ✓          33     85.20    77.45    81.14
✓    ✓         ✓     83     85.05    77.73    81.22
✓    ✓    ✓    ✓     100    85.23    77.56    81.21
✓    ✓               17     85.16    77.71    81.27

Table 4. Ablation study on the training strategy. "%" indicates the ratio of parameters trained compared to fully tuning a layer.
Method    Priors    Base     Novel    HM
LLaMP     ✗         84.90    77.59    81.08
LLaMP     Plain     85.26    77.56    81.22
LLaMP     NP        85.16    77.71    81.27

Table 5. Ablation study on pre-generated text priors. ✗ refers to "without textual priors" and NP stands for noun phrases.
Figure 3. Effect of LLM Prompts on Harmonic Mean. 16 prompts achieve the most balanced performance.

Method      Base     Novel    HM
LLM Only    81.74    35.82    49.81
LLaMP       85.16    77.71    81.27

[Qualitative figure: Image / Noun Phrases / Heatmap, Classname: An-12]
References

[1] Ebtesam Almazrouei, Hamza Alobeidli, Abdulaziz Alshamsi, Alessandro Cappelli, Ruxandra Cojocaru, Merouane Debbah, Etienne Goffinet, Daniel Heslow, Julien Launay, Quentin Malartic, et al. Falcon-40b: an open large language model with state-of-the-art performance. Technical report, Technology Innovation Institute, 2023. 2
[2] Rohan Anil, Andrew M Dai, Orhan Firat, Melvin Johnson, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report. arXiv preprint arXiv:2305.10403, 2023. 2
[3] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101 – mining discriminative components with random forests. In ECCV, pages 446–461. Springer, 2014. 5, 6, 7, 1, 2
[4] Shiming Chen, Wenjie Wang, Beihao Xia, Qinmu Peng, Xinge You, Feng Zheng, and Ling Shao. Free: Feature refinement for generalized zero-shot learning. In ICCV, pages 122–131, 2021. 2
[5] Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, 2023. 2
[6] Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022. 2
[7] Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In CVPR, pages 3606–3613, 2014. 5, 6, 7, 2
[8] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In CVPR, pages 248–255. IEEE, 2009. 5, 6, 7, 2
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3, 6
[10] Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In CVPR workshop, pages 178–178. IEEE, 2004. 5, 6, 7, 2
[11] Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 5, 6, 7, 2
[12] Matthew Honnibal, Ines Montani, Sofie Van Landeghem, and Adriane Boyd. spaCy: Industrial-strength natural language processing in Python. 2020. 5
[13] Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021. 3, 6, 7, 8
[14] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In ICML, pages 4904–4916. PMLR, 2021. 1, 2
[15] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan, and Ser-Nam Lim. Visual prompt tuning. In European Conference on Computer Vision, pages 709–727. Springer, 2022. 3, 4
[16] Michael Kampffmeyer, Yinbo Chen, Xiaodan Liang, Hao Wang, Yujia Zhang, and Eric P Xing. Rethinking knowledge graph propagation for zero-shot learning. In CVPR, pages 11487–11496, 2019. 2
[17] Baoshuo Kan, Teng Wang, Wenpeng Lu, Xiantong Zhen, Weili Guan, and Feng Zheng. Knowledge-aware prompt tuning for generalizable vision-language models. In ICCV, pages 15670–15680, 2023. 1, 2, 6
[18] Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In CVPR, pages 19113–19122, 2023. 1, 2, 3, 5, 6, 7
[19] Muhammad Uzair Khattak, Syed Talal Wasim, Muzammal Naseer, Salman Khan, Ming-Hsuan Yang, and Fahad Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In ICCV, pages 15190–15200, 2023. 1, 2, 3, 4, 5, 6, 7, 8
[20] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In ICLR, 2017. OpenReview.net. 2
[21] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In ICCV workshops, pages 554–561, 2013. 5, 6, 7, 2
[22] Christoph H Lampert, Hannes Nickisch, and Stefan Harmeling. Attribute-based classification for zero-shot visual object categorization. TPAMI, 36(3):453–465, 2013. 2
[23] Dongjun Lee, Seokwon Song, Jihee Suh, Joonmyeong Choi, Sanghyeok Lee, and Hyunwoo J Kim. Read-only prompt optimization for vision-language few-shot learning. In ICCV, pages 1401–1411, 2023. 1, 2, 6
[24] Yuning Lu, Jianzhuang Liu, Yonggang Zhang, Yajing Liu, and Xinmei Tian. Prompt distribution learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 5206–5215, 2022. 1, 6
[25] Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 2, 5, 6, 7, 1
[26] Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages 722–729. IEEE, 2008. 5, 6, 7, 2
[27] OpenAI. Chatgpt, 2023. Available at https://openai.com/chatgpt. 2
[28] OpenAI. Gpt-4 technical report, 2023. 1, 2, 5
[29] Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In CVPR, pages 3498–3505. IEEE, 2012. 5, 6, 7, 2
[30] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. NeurIPS, pages 8026–8037, 2019. 5
[31] Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023. 4
[32] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In ICML, pages 8748–8763. PMLR, 2021. 1, 2, 6, 7
[33] Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023. 3
[34] Scott Reed, Zeynep Akata, Honglak Lee, and Bernt Schiele. Learning deep representations of fine-grained visual descriptions. In CVPR, pages 49–58, 2016. 2
[35] Ramprasaath R Selvaraju, Michael Cogswell, Abhishek Das, Ramakrishna Vedantam, Devi Parikh, and Dhruv Batra. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV, pages 618–626, 2017. 8
[36] Richard Socher, Milind Ganjoo, Hamsa Sridhar, Osbert Bastani, Christopher D Manning, and Andrew Y Ng. Zero-shot learning through cross-modal transfer. arXiv preprint arXiv:1301.3666, 2013. 2
[37] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. Ucf101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012. 5, 6, 7, 2
[38] Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023. 1, 2, 4
[39] Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023. 1, 2, 4, 6
[40] Xiaolong Wang, Yufei Ye, and Abhinav Gupta. Zero-shot recognition via semantic embeddings and knowledge graphs. In CVPR, pages 6857–6866, 2018. 2
[41] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. Dualprompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, pages 631–648. Springer, 2022. 3
[42] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022. 3
[43] Thomas Wolf, Lysandre Debut, Victor Sanh, Julien Chaumond, Clement Delangue, Anthony Moi, Pierric Cistac, Tim Rault, Rémi Louf, Morgan Funtowicz, Joe Davison, Sam Shleifer, Patrick von Platen, Clara Ma, Yacine Jernite, Julien Plu, Canwen Xu, Teven Le Scao, Sylvain Gugger, Mariama Drame, Quentin Lhoest, and Alexander M. Rush. Transformers: State-of-the-art natural language processing. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 38–45, Online, 2020. Association for Computational Linguistics. 4
[44] Yongqin Xian, Tobias Lorenz, Bernt Schiele, and Zeynep Akata. Feature generating networks for zero-shot learning. In CVPR, pages 5542–5551, 2018. 2
[45] Jianxiong Xiao, Krista A Ehinger, James Hays, Antonio Torralba, and Aude Oliva. Sun database: Exploring a large collection of scene categories. IJCV, 119(1):3–22, 2016. 5, 6, 7, 2
[46] Li Zhang, Tao Xiang, and Shaogang Gong. Learning a deep embedding model for zero-shot learning. In CVPR, pages 2021–2030, 2017. 2
[47] Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022. 2
[48] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In CVPR, pages 16816–16825, 2022. 1, 2, 3, 5, 6, 7
[49] Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. IJCV, 130(9):2337–2348, 2022. 1, 2, 3, 6, 7
[50] Yizhe Zhu, Mohamed Elhoseiny, Bingchen Liu, Xi Peng, and Ahmed Elgammal. A generative adversarial approach for zero-shot learning from noisy texts. In CVPR, pages 1004–1013, 2018. 2
Large Language Models are Good Prompt Learners
for Low-Shot Image Classification
Supplementary Material
Introduction

In the supplementary material, we provide extra discussions that did not fit in the main paper due to the space limitation, including: i) an ablation study on the textual augmentation described in Sec. 3.3; ii) few-shot classification results on 8/4/2/1 shots with comparisons with previous methods.

Few-shot Classification

In addition to the 16-shot classification results reported in the main paper, we present few-shot classification results with 8/4/2/1 shots in Tab. 9 and compare LLaMP against previous baseline models.

Results in Tab. 9 show that LLaMP outperforms previous SOTAs under all settings on average over all 11 benchmarks, with a 0.88% improvement with 8 shots. In particular, we observe that LLaMP surpasses PSRC [19] consistently on FGVCAircraft (Aircraft) [25] and Food [3] with all numbers of shots. This observation aligns with our argument in the main paper that the knowledge from LLMs provides richer semantic information for fine-grained classification.
8-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 74.47 62.23 93.41 78.36 73.67 96.10 79.79 39.35 69.08 63.46 84.43 79.34
CoOp [49] 76.98 70.63 94.37 91.27 79.30 94.97 82.67 39.00 71.53 64.77 78.07 80.20
CoCoOp [48] 72.96 70.63 95.04 93.45 70.44 84.30 86.97 26.61 70.84 58.89 68.21 77.14
MaPLe [18] 78.89 70.30 95.20 92.57 79.47 95.80 83.60 42.00 73.23 66.50 87.73 81.37
PSRC [19] 80.69 72.33 95.67 93.50 80.97 96.27 86.90 43.27 75.73 69.87 88.80 84.30
LLaMP 81.57 72.30 96.57 93.69 82.15 96.20 87.39 47.48 75.18 71.14 91.15 84.06
4-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 68.01 54.85 92.05 71.17 63.38 92.02 73.19 32.33 63.00 55.71 77.09 73.28
CoOp [49] 74.02 68.73 94.40 92.57 74.47 92.17 84.47 30.83 69.97 58.70 70.80 77.10
CoCoOp [48] 71.21 70.39 94.98 92.81 69.39 78.40 86.88 24.79 70.21 55.04 65.56 74.82
MaPLe [18] 75.37 67.70 94.43 91.90 75.30 92.67 81.77 34.87 70.67 61.00 84.50 78.47
PSRC [19] 78.35 71.07 95.27 93.43 77.13 93.87 86.17 37.47 74.00 65.53 86.30 81.57
LLaMP 78.83 71.37 95.84 93.61 76.79 93.96 87.17 40.02 74.05 66.37 86.16 81.80
2-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 57.98 44.88 89.01 58.37 50.28 85.07 61.51 26.41 53.70 40.76 61.98 65.78
CoOp [49] 70.65 67.07 93.07 89.80 70.50 87.33 84.40 26.20 66.53 53.60 65.17 73.43
CoCoOp [48] 67.65 69.78 94.82 92.64 68.37 75.79 86.22 15.06 69.03 52.17 46.74 73.51
MaPLe [18] 72.58 65.10 93.97 90.87 71.60 88.93 81.47 30.90 67.10 55.50 78.30 74.60
PSRC [19] 75.29 69.77 94.53 92.50 73.40 91.17 85.70 31.70 71.60 59.97 79.37 78.50
LLaMP 75.89 70.12 95.66 92.75 72.20 89.16 86.33 33.41 72.64 61.29 81.71 79.56
1-Shot Classification

Method  Average  ImageNet [8]  Caltech101 [10]  OxfordPets [29]  StanfordCars [21]  Flowers102 [26]  Food101 [3]  Aircraft [25]  SUN397 [45]  DTD [7]  EuroSAT [11]  UCF101 [37]
CLIP [32] 45.83 32.13 79.88 44.06 35.66 69.74 43.96 19.61 41.58 34.59 49.23 53.66
CoOp [49] 67.56 66.33 92.60 90.37 67.43 77.53 84.33 21.37 66.77 50.23 54.93 71.23
CoCoOp [48] 66.79 69.43 93.83 91.27 67.22 72.08 85.65 12.68 68.33 48.54 55.33 70.30
MaPLe [18] 69.27 62.67 92.57 89.10 66.60 83.30 80.50 26.73 64.77 52.13 71.80 71.83
PSRC [19] 72.32 68.13 93.67 92.00 69.40 85.93 84.87 27.67 69.67 56.23 73.13 74.80
LLaMP 72.42 69.12 94.59 91.91 70.02 84.03 85.83 30.39 69.69 54.98 70.36 75.72
Table 9. Few-shot classification results with 8/4/2/1 shots. All numbers, except ours, are obtained from [19].