
Efficient Few-Shot Continual Learning in Vision-Language Models

Aristeidis Panos 1 Rahaf Aljundi 2 Daniel Olmeda Reino 2 Richard E. Turner 1

1 University of Cambridge  2 Toyota Motor Europe. Correspondence to: Aristeidis Panos <[email protected]>. Preprint. Under review.

arXiv:2502.04098v1 [cs.CV] 6 Feb 2025

Abstract

Vision-language models (VLMs) excel in tasks such as visual question answering and image captioning. However, VLMs are often limited by their use of pretrained image encoders, like CLIP, leading to image understanding errors that hinder overall performance. On top of that, real-world applications often require the model to be continuously adapted as new and often limited data continuously arrive. To address this, we propose LoRSU (Low-Rank Adaptation with Structured Updates), a robust and computationally efficient method for selectively updating image encoders within VLMs. LoRSU introduces structured and localized parameter updates, effectively correcting performance on previously error-prone data while preserving the model's general robustness. Our approach leverages theoretical insights to identify and update only the most critical parameters, achieving significant resource efficiency. Specifically, we demonstrate that LoRSU reduces computational overhead by over 25x compared to full VLM updates, without sacrificing performance. Experimental results on VQA tasks in the few-shot continual learning setting validate LoRSU's scalability, efficiency, and effectiveness, making it a compelling solution for image encoder adaptation in resource-constrained environments.

1. Introduction

Large Language Models (LLMs) have revolutionized natural language understanding and generation, enabling significant advancements across diverse applications. As intelligent agents are increasingly expected to operate in real-world multimodal environments, integrating visual understanding becomes essential. Vision-Language Models (VLMs) extend LLMs by incorporating visual information, either through pre-trained vision encoders or end-to-end multimodal training. These models have demonstrated state-of-the-art performance on vision-language tasks such as visual question answering (VQA) and image captioning, highlighting their potential for general-purpose multimodal reasoning (Chen et al., 2024; Wang et al., 2024a).

Approaches that rely on pretrained image encoders typically use variants of the CLIP model (Radford et al., 2021), which is kept frozen in the vision-language binding process (Liu et al., 2024b). CLIP is a widely deployed vision transformer that has strong zero-shot capabilities in various tasks and domains. However, several existing works have highlighted various weaknesses of CLIP on out-of-domain data (Liu et al., 2024b; Zhu et al., 2023; Chen et al., 2023; Li et al., 2023; Tong et al., 2024). When deploying VLMs as visual assistants in new domains, it is then expected that VLMs can be updated using a few images gathered from the target environment whenever deficiencies are noted.

Continual learning allows a model to be continuously updated as new data from new tasks or domains are encountered. Recent literature on continual learning (CL) of vision-language models focuses on updating either the LLM (Srivastava et al., 2024) or language projection layers (Das et al., 2024), maintaining a frozen image encoder.

In vision-language models, the LLM component provides reasoning and factual knowledge, while the image encoder's role is to extract robust and accurate visual features. In this work, we argue that adapting VLMs to new visual domains or tasks is more effective and efficient when the image encoder is updated rather than the LLM. Figure 1 highlights this issue using images from the Toyota Smart Home (TSI) dataset (Das et al., 2019): in the first column, LLaVA (Liu et al., 2024a) struggles to recognize the person's action in the original image but accurately describes the same action in a generated image from OpenAI's DALL·E 2. This example underscores that the visual shift, rather than the LLM's understanding of the action, is the main source of weakness.

Motivated by the above limitations, we introduce a novel parameter-efficient fine-tuning (PEFT) method called LoRSU (Low-Rank Adaptation with Structured Updates) for selectively updating specific modules of the transformer blocks of image encoders within VLMs. The right column of Figure 1 illustrates the (correct) responses of LLaVA after updating the image encoder separately with our method on a low number of samples from the TSI dataset, compared to the pretrained LLaVA's (wrong) response.


[Figure 1 shows two examples for the question "What is this person doing?" with ground truth "Cooking on a stove". For the DALL·E 2 generated image, LLaVA answers: "This person is cooking a meal in a kitchen." For the original TSI image, LLaVA answers: "The person is standing at a kitchen sink, washing dishes.", while after the update LLaVA-LoRSU answers: "The person is standing in a kitchen, preparing food. They are using a pot to cook something, possibly stirring the contents of the pot."]

Figure 1. (Left) Responses of the pretrained LLaVA to samples from the TSI dataset (bottom) compared to DALL·E 2 generated images (top) for the 'cooking on a stove' class. (Right) LLaVA's correct response to the same TSI image after fine-tuning LLaVA using LoRSU.

Through extensive experiments, we demonstrate that updating the image encoder is essential for improving the performance of the VLM that relies on it. More importantly, this approach is computationally efficient, as the image encoder has significantly fewer parameters compared to the language model, and the method is less prone to forgetting, especially of the LLM knowledge.

We evaluated our approach on various VQA tasks, comparing to state-of-the-art CL methods and the PEFT baseline LoRA (Hu et al., 2021) on various few-shot CL settings. We show significant improvements of the full VLM model on all settings and very low rates of forgetting without using any replay buffer of data from the previous tasks. By selectively updating the image encoder, our method provides a robust and efficient solution for handling visual shifts. This targeted adaptation strategy avoids the need to modify the entire model, preserving existing knowledge whilst ensuring strong performance in new domains.

The contributions of the paper are as follows:

• We propose LoRSU, a novel replay-free PEFT method tailored for few-shot continual learning.
• We introduce two new VQA datasets, TSI and DALLE, created to expose the limitations of pre-trained image encoders in VLMs.
• We conduct the first large-scale study of few-shot CL in VLMs, evaluating LoRSU across ten diverse VQA datasets and benchmarking against state-of-the-art PEFT and CL methods. LoRSU consistently outperforms all baselines.

2. Related Work

Few-shot Continual Learning. Our work falls within the continual learning literature, where a model needs to be updated incrementally as new data arrive, accumulating knowledge over tasks and reducing forgetting of previously acquired information (De Lange et al., 2021).

Continual Learning for Multimodal Language Models. Wu et al. (2024) provide a survey on continual learning for LLMs, highlighting challenges of computational efficiency and forgetting. Srivastava et al. (2024) explored continual multi-modal learning on VQA datasets, keeping the vision encoder frozen. He et al. (2023b) examined continual instruction tuning with sequential VQA datasets, proposing a method where the projection head is expanded for each new task. Das et al. (2024) introduced a pseudo-rehearsal strategy for vision-language models, updating only the language projection layer. Our method adapts only the vision encoder, preserving language capabilities.

Continual Learning with Few-Shot Updates. Verwimp et al. (2023) posit that an ideal continual learning solution would enable continual correction of a model's mistakes at a lower computational cost than retraining from scratch. However, most continual few-shot learning from pre-trained models focuses on classification tasks and introduces solutions that cannot scale to large multimodal models. Panos et al. (2023) update the vision encoder on the first task only, later adapting a covariance matrix for incoming tasks. Goswami et al. (2024) calibrate the covariance matrix for new classes based on semantic similarity. Zhao et al. (2024) introduce few and slow updates, proposing a transfer loss function and a cross-classification loss to mitigate catastrophic forgetting. Few-shot updates can also be viewed through the lens of model editing (Sinitsin et al., 2020). MEND (Mitchell et al., 2022) scales model editing to large language models by transforming the gradient obtained from fine-tuning through a low-rank decomposition fed to auxiliary networks designed to make fast, local edits to a pre-trained model, requiring a set of unrelated examples to prevent forgetting. ROME (Meng et al., 2022) applies causal tracing to identify layers where incorrect factual knowledge is stored, applying a low-rank update. However, ROME does not scale to continual updates or non-association types of updates. Cheng et al. (2023) studied multi-modal editing, showing negligible deterioration in multi-modal task performance when updating language models but severe forgetting when updating vision encoders. To the contrary, our method focuses on adapting the vision encoder rather than updating the factual knowledge in the LLM, yet achieves strong performance gains and negligible forgetting.

Continual Learning of Pre-Trained Image Encoders. SPT (He et al., 2023a) estimates a mask of updates based on parameter sensitivity, performing low-rank or sparse updates. SPU (Zhang et al., 2024) localizes updates to the first feed-forward layer of each transformer block, inspired by knowledge neuron theory (Dai et al., 2021). Our approach generalizes updates to all layers, selecting relevant parameters and maintaining gradient norms, combined with LoRA on selected attention heads for adaptivity and stability, achieving SOTA performance on continual few-shot multimodal tasks.


Figure 2. LoRSU mechanism: After computing the gradient ∇_θ L_t(θ) over the target dataset at time t, LoRSU picks a small number of attention heads and a small number of parameters from the first linear layer of the MLP module in the transformer block based on the magnitudes of the gradients ∇_{W_Attn} L_t and ∇_{W_fc1} L_t, respectively. Computational efficiency is ensured by introducing LoRA adapters to the attention weight matrices.

3. Low-Rank Adaptation with Structured Updates

Few-shot continual learning is a highly practical and challenging scenario, where models must incrementally adapt to new tasks with limited supervision while retaining previously acquired knowledge. This setting closely mirrors real-world applications, such as interactive AI assistants and autonomous systems, where models receive a continuous stream of novel data but only sparse supervision per update. To address the challenge of efficiently fine-tuning large-scale visual encoders and transformer-based models under the few-shot continual learning setting, without causing catastrophic forgetting (i.e., degradation in performance on previously learned tasks), we propose a novel parameter-efficient fine-tuning method called Low-Rank Adaptation with Structured Updates (LoRSU).

LoRSU updates specific parameters within each transformer block in a resource-efficient manner, mitigating the risk of generic knowledge loss when fine-tuning for new tasks. Specifically, we selectively update a subset of parameters from the first linear layer in the MLP block of each transformer layer, as proposed in (Zhang et al., 2024). While this approach reduces the fine-tuning burden, it may limit model flexibility as the remaining parameters in the transformer block remain fixed. To enhance flexibility, we further update the most informative attention heads based on the gradient of the task-specific loss.

More specifically, let D_t = {x_n, y_n}_{n=1}^{N_t} be the dataset for the current task t, where x_n is an image with text description y_n. We define L(θ; D_t) := L_t(θ) as the loss used for training the model, and θ ∈ R^d is the full set of the model's parameters. The standard Multi-head Self-Attention mechanism (MSA) (Vaswani et al., 2017), comprised of H D_h-dimensional heads, is defined as the concatenation of multiple self-attention (SA) blocks, where q^(i) = W_q^(i) Z^⊤, k^(i) = W_k^(i) Z^⊤, v^(i) = W_v^(i) Z^⊤ ∈ R^{D_h × N} are the query, key, and value matrices, which are used to compute the self-attention outputs as follows:

    A^(i) = softmax( q^(i)⊤ k^(i) / √D_h ) ∈ R^{N × N},                  (1)
    SA_i(Z) = A^(i) v^(i)⊤ ∈ R^{N × D_h},   i = 1, . . . , H.            (2)

Z ∈ R^{N × D} is the input matrix of N tokens of dimension D, and W_q^(i), W_k^(i), and W_v^(i) are the query, key, and value matrices of learnable parameters for head i, respectively. The final MSA function is defined as MSA(Z) = Concat[SA_1(Z), . . . , SA_H(Z)] W_o ∈ R^{N × D}, with W_o ∈ R^{H D_h × D}.
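To make the head structure that LoRSU operates on concrete, the following is a minimal sketch of the multi-head self-attention of Eqs. (1)–(2); it is written in plain PyTorch with illustrative tensor names (Wq, Wk, Wv, Wo) that mirror the notation above and are not taken from any particular CLIP implementation.

```python
import math
import torch

def multi_head_self_attention(Z, Wq, Wk, Wv, Wo):
    """Eqs. (1)-(2): Z is (N, D); Wq/Wk/Wv are (H, Dh, D); Wo is (H*Dh, D)."""
    H, Dh, D = Wq.shape
    heads = []
    for i in range(H):
        q = Wq[i] @ Z.T                                      # (Dh, N)
        k = Wk[i] @ Z.T                                      # (Dh, N)
        v = Wv[i] @ Z.T                                      # (Dh, N)
        A = torch.softmax(q.T @ k / math.sqrt(Dh), dim=-1)   # (N, N), Eq. (1)
        heads.append(A @ v.T)                                # (N, Dh), Eq. (2)
    return torch.cat(heads, dim=-1) @ Wo                     # (N, D)

# toy example
N, D, H, Dh = 4, 8, 2, 4
Z = torch.randn(N, D)
Wq, Wk, Wv = (torch.randn(H, Dh, D) for _ in range(3))
Wo = torch.randn(H * Dh, D)
print(multi_head_self_attention(Z, Wq, Wk, Wv, Wo).shape)    # torch.Size([4, 8])
```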


Since we care to update the parameters of the heads that cause the largest changes in L_t(θ), we compute the gradient of the loss with respect to the parameters of each head and then update only those heads with the largest cumulative contribution to the loss change. Since the matrices W_q^(i), W_k^(i), W_v^(i) are all the parameters of head i, we can define an importance score for each head by adding the squared values of their corresponding gradients G_q^(i) = ∇_{W_q^(i)} L_t, G_k^(i) = ∇_{W_k^(i)} L_t, and G_v^(i) = ∇_{W_v^(i)} L_t, as follows:

    s_i = Σ_{m,l} [ (G_q^(i)[m,l])² + (G_k^(i)[m,l])² + (G_v^(i)[m,l])² ].          (3)

We provide a theoretical justification of (3) in the next section. We update only the top-k heads, I ⊂ {1, . . . , H}, based on their importance scores {s_1, . . . , s_H}, on the current task. Nevertheless, the number of parameters remains high due to the large weight matrices. Therefore, we parametrize the original weights using LoRA (Hu et al., 2021) to further reduce the computational burden. The matrices W_q^(i), W_k^(i), W_v^(i), i ∈ I, are now defined as

    W_α^(i) = W_α^(i) + A_α^(i) B_α^(i),   α ∈ {q, k, v}.                            (4)

Finally, to ensure that we only update W_q^(i), W_k^(i), W_v^(i), ∀i ∈ I, we use a binary mask on the gradient vector with respect to all parameters of all attention heads. We keep the projection matrix W_o frozen. We note that most modern implementations of transformer blocks concatenate the three attention weight matrices W_q, W_k, W_v into one, and thus we only need to apply LoRA once to this concatenated matrix.

Regarding the first linear layer in the MLP module, W_fc1 ∈ R^{d×D}, we mask the gradients of W_fc1 so that only the most important parameters for the current task are updated, i.e., we use the following biased gradient update:

    ∇̂_{W_fc1} L_t = M_fc1 ⊙ ∇_{W_fc1} L_t,                                          (5)

where M_fc1 ∈ {0,1}^{d×D} is a zero-one mask built by choosing a proportion of the largest squared values of ∇_{W_fc1} L_t in a similar manner as in (Zhang et al., 2024), and ⊙ is the Hadamard product.

Theoretical justification. The importance scores in (3) can be derived from the following constrained (binary) optimization problem¹:

    p* = argmax_{p ∈ {0,1}^d}  ∥p ⊙ ∇_W L(θ_0)∥² / ∥∇_W L(θ_0)∥²,                    (6)
    s.t.  ∪_{ℓ=1}^G I_ℓ ⊂ {1, 2, . . . , d},   I_i ∩ I_j = ∅, ∀i ≠ j,
          C = Σ_{ℓ=1}^G c_ℓ,   c_ℓ ≤ |I_ℓ| ∀ℓ,   ∥p∥_0 ≤ C,

where θ_0 is the vector of the pretrained parameters before using D_t for fine-tuning the model. The groups of parameters I_i correspond to the parameters of a specific module (e.g. Self-Attention or MLP projector) we aim to learn, hence the constraint of mutual exclusiveness, I_i ∩ I_j = ∅, between different pairs of parameter groups. Also note that we are allowed to choose a subset c_ℓ of the parameters of a specific group I_ℓ, which is the underlying mechanism by which LoRSU chooses attention heads and parameters of fc1. The mask p* is chosen so that the gradient norm of the masked gradients is as large as possible under the sparsity constraints. We prove in appendix A that the indices of the non-zero values of p* can be found using the importance scores in (3) and the magnitudes of the gradients with respect to the fc1 parameters.

¹ For notational simplicity, we assume a single transformer block for this case.
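The sketch below illustrates the two selection steps described above on a single transformer block: per-head importance scores as in Eq. (3) with top-k head selection, and the binary gradient mask on the first MLP layer as in Eq. (5). The attribute names (`attn_qkv`, `fc1`) and the fused [q; k; v] row layout are assumptions for illustration, not the exact LLaVA/CLIP module names, and the LoRA adapters of Eq. (4) are omitted for brevity.

```python
import torch

def select_heads_and_fc1_mask(block, loss, num_heads, top_k=2, sparsity=0.1):
    """One LoRSU selection step on a single (hypothetical) transformer block.

    block.attn_qkv.weight : fused (3*D, D) query/key/value matrix
    block.fc1.weight      : first MLP layer, shape (d, D)
    """
    # Gradients of the task loss w.r.t. the block parameters.
    g_qkv, g_fc1 = torch.autograd.grad(
        loss, [block.attn_qkv.weight, block.fc1.weight], retain_graph=True
    )

    # Eq. (3): squared-gradient mass per attention head, summed over q, k and v.
    D = g_qkv.shape[1]
    Dh = D // num_heads
    g_heads = g_qkv.view(3, num_heads, Dh, D)       # group each head's q/k/v rows
    scores = (g_heads ** 2).sum(dim=(0, 2, 3))      # (num_heads,) importance scores s_i
    top_heads = torch.topk(scores, k=top_k).indices # heads selected for fine-tuning

    # Eq. (5): keep only the largest squared entries of the fc1 gradient.
    flat = (g_fc1 ** 2).flatten()
    n_keep = max(1, int(sparsity * flat.numel()))
    threshold = torch.topk(flat, n_keep).values.min()
    fc1_mask = (g_fc1 ** 2 >= threshold).float()    # {0,1} mask M_fc1 (ties may add a few extra entries)

    return top_heads, fc1_mask
```

During fine-tuning, the fc1 gradients would then be multiplied elementwise by `fc1_mask`, and only LoRA adapters attached to the selected heads would be trained, as described above.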


4. Experiments

We conduct a series of experiments under three different few-shot continual learning (CL) settings (CL-5, CL-20, and CL-50 shots) to thoroughly investigate the performance of LoRSU based on ten VQA datasets. By adopting this paradigm, we aim to assess the adaptability and efficiency of LoRSU under constrained learning conditions, ensuring that it remains both computationally feasible and effective in improving downstream performance. Specifically, we seek to address the following key questions: 1) How does our method, LoRSU, compare to other fine-tuning and CL baselines that use the CLIP loss to update the image encoder? 2) Does updating the image encoder separately and then reintegrating it into the corresponding VLM enhance downstream VQA performance? 3) What is the effect of using the perplexity loss instead of the CLIP loss to update the image encoder? 4) What are the benefits of choosing a subset of attention heads to be fine-tuned using LoRSU? and 5) What are the computational benefits of LoRSU?

4.1. Datasets

We evaluate the performance of LoRSU on ten visual question answering (VQA) datasets falling into two broad categories: regular VQA datasets and classification datasets converted to VQA datasets.

Regular VQA datasets. We consider four standard VQA datasets used for benchmarking VLMs' performance (Duan et al., 2024): VSR (Liu et al., 2023), the Visual Spatial Reasoning corpus, consists of caption-image pairs labeled as True or False, where each caption describes the spatial relation between two objects in the image; VLMs evaluate whether the caption accurately reflects the image. HM (Kiela et al., 2020) is the Hateful Memes dataset designed to detect multimodal hateful memes. MMVP (Tong et al., 2024), the Multimodal Visual Patterns dataset, is a challenging benchmark which has been built on images that CLIP perceives as similar despite their clear visual differences. VisOnly (Kamoi et al., 2024) is a novel dataset created to directly assess the visual perception abilities of VLMs in answering questions about geometric and numerical details in scientific figures. This dataset allows us to assess fine-grained visual perception in VLMs independently of other abilities, such as reasoning, making it the most challenging among the previously mentioned datasets.

Classification-to-VQA datasets. We convert four popular multi-class classification datasets into multiple-choice VQA problems, where each question has five choices and the VLM is tasked with selecting the correct answer. These datasets are introduced as examples of scenarios where visual domain shifts are encountered, allowing us to examine the utility of updating the image encoder; a critical consideration often overlooked in many standard VQA datasets. The datasets include: GTS (Stallkamp et al., 2012), the German Traffic Sign dataset, which Zhang et al. (2024) considered as an out-of-distribution dataset for CLIP pretraining; CAn (Wang et al., 2024b), a recent dataset created to test CLIP's robustness with animal images containing realistic spurious features such as unexpected backgrounds; AIR (Maji et al., 2013), a fine-grained aircraft classification dataset; and ESAT (Helber et al., 2019), a dataset of satellite images used for land cover classification.

TSI & DALLE. In addition to these existing datasets, we introduce two novel VQA datasets: TSI and DALLE, both designed to explore the effects of domain shift. The TSI (Das et al., 2019) dataset was preprocessed as a classification dataset, where the goal is to recognize the activity depicted in each image. Frames were extracted from videos to create a training set of approximately 10K images and a test set of approximately 5K images, encompassing 27 distinct activity classes. The DALLE dataset, constructed by querying OpenAI's model DALL·E 2, includes representative images generated from 22 activity classes appearing in TSI. For each activity, we generated 30 images, resulting in a total of 660 images designated exclusively for evaluation purposes.

We follow the common practice in few-shot continual learning (Panos et al., 2023) to construct the sequences. We divide each dataset into 5 sets of disjoint classes/categories and consider 5/20/50-shot settings where only 5/20/50 images per class in the current set are used for fine-tuning the model. More details on how we split each of these datasets for the CL settings are provided in appendix C.

4.2. Experimental Setting

Metrics. While standard metrics in the CL literature exist to evaluate general performance (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), VLMs exhibit generic knowledge across various domains beyond the one being adapted, making it crucial to evaluate how adaptation impacts their overall performance. These metrics do not measure the change in performance relative to the model's initial state prior to the learning process. To address this, we use the zero-shot accuracy of each VQA dataset as the benchmark baseline and report the change in accuracy on the test split of the target dataset, so positive values indicate an improvement in accuracy. This approach enables us to quantify the model's ability to accumulate knowledge, using the pretrained model as the reference point; we name this metric Target Improvement (TI) accuracy. We also calculate the average accuracy change on the test splits of the remaining datasets, when fine-tuning on a specific dataset, to estimate average forgetting of generic knowledge or possible positive backward transfer (De Lange et al., 2021); we call this metric Control Change (CC) accuracy, where 'control' refers to the control datasets we use to calculate the average accuracy change. TI and CC are computed based on the fine-tuned VLM after the last session of CL. We also consider standard CL performance metrics such as Average Accuracy (ACC) and Backward Transfer (BWT) (Lopez-Paz & Ranzato, 2017) to examine how accuracy and forgetting evolve through continuous adaptation. Notice that these metrics, in contrast to TI and CC, focus on the accuracy and forgetting during continual adaptation and do not take into account the performance of the fine-tuned model on other datasets.
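As a concrete reading of the two metrics, the sketch below computes TI and CC from per-dataset accuracies: `zero_shot` holds the pretrained model's accuracies and `final` the accuracies of the fine-tuned VLM after the last CL session. The dataset names and the "final" numbers in the toy example are placeholders (only the GTS/ESAT zero-shot values come from Table 2's caption).

```python
def target_improvement_and_control_change(zero_shot, final, target):
    """TI: accuracy change on the fine-tuned (target) dataset.
    CC: average accuracy change over the remaining (control) datasets."""
    ti = final[target] - zero_shot[target]
    controls = [d for d in zero_shot if d != target]
    cc = sum(final[d] - zero_shot[d] for d in controls) / len(controls)
    return ti, cc

# toy example (accuracies in %)
zero_shot = {"GTS": 75.4, "ESAT": 76.4, "VSR": 60.0}
final     = {"GTS": 81.8, "ESAT": 76.6, "VSR": 59.8}
print(target_improvement_and_control_change(zero_shot, final, "GTS"))
```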


Implementation details. Please see Appendix B.

Models. For our experiments, we consider the popular Vision-Language Model LLaVA-v1.5 (Liu et al., 2024b) that leverages a frozen CLIP image encoder. Specifically, LLaVA utilizes a frozen OpenAI-CLIP-L-14 (Radford et al., 2021) with an LLM (Vicuna-7b (Chiang et al., 2023)). The two modules are connected through a two-layer MLP projector that aligns image and text features. The LLM and the MLP projector are optimized during the visual instruction tuning while CLIP remains frozen. LLaVA concatenates adjacent tokens from CLIP-L-14 and processes them with an MLP projector as input to LLama-2 (7B-chat) (Touvron et al., 2023); the MLP projector and the language model are optimized while the image encoder remains frozen.

Baselines. We compare LoRSU to the following methods that also use the CLIP loss to fine-tune the image encoder:

• LN (Perez et al., 2018; Panos et al., 2023) is used for both few-shot and CL. Only the image encoder LayerNorm modules' parameters are optimized.
• F-FT is the standard fine-tuning technique where all image encoder parameters undergo gradient updates.
• F-EWC fine-tunes all the image encoder parameters with EWC regularization (Kirkpatrick et al., 2017).
• LoRA (Hu et al., 2021) is a popular PEFT method which parameterizes incremental updates by two low-dimensional matrices and only fine-tunes them.
• AdaLoRA (Zhang et al., 2023) dynamically adjusts the low-rank update budget allocation during training.
• SPU (Zhang et al., 2024) is a PEFT baseline, specifically designed to tackle catastrophic forgetting in CL scenarios, that utilizes structured sparsity based on gradient information to fine-tune the most significant parameters of the fc1 module in the transformer block.

Table 1. Performance comparison of LoRSU with the CLIP loss against baselines fine-tuning the image encoder using the same loss. We report the Target Improvement (TI) and Control Change (CC) accuracies across three different continual learning (CL) settings. Each cell gives TI (↑) / CC (↑); the highest accuracies across methods for each dataset are the LoRSU or SPU columns in most rows.

Setting  FT Dataset   LN           F-FT          F-EWC         LoRA          AdaLoRA      SPU          LoRSU
                      TI    CC     TI     CC     TI     CC     TI     CC     TI    CC     TI    CC     TI    CC
CL-5     GTS          3.5  -1.5    3.7   -6.5    5.0  -11.5    0.7   -4.8   -0.9  -4.9    5.4  -0.6    6.4  -0.7
         TSI          0.8   0.0    7.4   -1.1    8.5   -1.0   -0.1   -2.8    1.1   0.2    0.9   0.1    3.2   0.1
         CAn         -2.4  -0.2   -2.4   -2.2  -16.7   -9.4   -1.3   -4.6   -1.0  -0.1   -0.4   0.1    0.3   0.3
         AIR          0.3  -1.6    2.0   -2.7    2.9   -2.8    1.3   -3.7    0.4   0.0    3.1   0.1    4.8   0.4
         ESAT         4.2   0.6  -10.3   -1.4   -8.4   -2.1   -1.6   -0.7    1.9   0.1    4.5   0.1    6.8   0.2
CL-20    GTS          5.2  -5.9    4.6   -7.3    6.7  -15.6    2.5  -10.5    0.2  -2.2    7.9  -1.3    8.6  -1.0
         TSI          5.1  -1.9   15.3   -3.4   16.0  -32.5    8.5   -4.4    1.3  -9.6    7.8  -0.3   10.6  -0.1
         CAn         -2.4  -0.4    0.3   -2.9    0.1   -5.1   -2.3   -5.4   -3.5  -2.5    0.1   0.5    1.1   0.3
         AIR         -0.2  -3.0    9.3   -1.8   10.2   -2.0    5.3   -2.7    2.7  -0.7    3.0  -0.2    5.9  -0.5
         ESAT         0.9  -0.1  -24.9   -1.7  -22.0   -3.8  -11.5   -0.5   -6.8  -2.7    5.4   0.3    6.6   0.2
CL-50    GTS          4.8  -6.5    3.4   -9.8    5.3  -12.9    3.1  -11.1    1.0  -3.3    7.7  -1.5    9.7  -1.3
         TSI          7.0  -3.0   17.2   -4.6   22.4  -13.4   18.2   -6.3    7.9  -1.9   12.2  -0.5   19.1  -0.3
         CAn         -5.7  -3.3   -1.0   -4.9    0.6   -9.7   -0.4   -4.4   -1.8  -0.8    0.6  -0.3    1.3  -0.5
         AIR          1.8  -3.9   10.0   -3.1   10.9   -3.3    7.8   -3.8    4.6  -0.9    6.2  -0.6    8.2  -0.7
         ESAT         4.6   0.1  -41.4   -3.3  -38.1   -2.0  -14.5   -3.6  -17.3  -2.4    5.8   0.1    7.0   0.2

4.3. CLIP-based Updates

We evaluate the performance of the Vision-Language Model (VLM) when only the image encoder is fine-tuned using the CLIP loss in a CL setting. This experiment compares six strong CLIP-based baselines with our proposed method, LoRSU. Table 1 reports the average accuracies of TI/CC over three runs; detailed results can be found in appendix D.

We observe that LoRSU consistently achieves superior TI scores across datasets and CL settings, underscoring its ability to enhance task-specific performance effectively. Furthermore, LoRSU maintains CC accuracies that take consistently small negative or even positive values, highlighting its capacity to preserve or slightly improve performance on control datasets while fine-tuning on target datasets. Even in datasets where other methods struggle (e.g., CAn, ESAT), LoRSU often performs better, maintaining positive or close-to-neutral TI and CC scores. For instance, in ESAT (CL-50), containing challenging satellite images, LoRSU achieves the highest TI (7.0) with a positive CC (0.2), outperforming SPU (TI=5.8, CC=0.1) and all other methods.

Table 2. Average accuracy (ACC) (↑) and backward transfer (BWT) (↑) scores (%). For reference, the ACC of the pretrained model on GTS and ESAT is 75.4 and 76.4, respectively, while BWT is zero for all cases.

Setting  FT Dataset   LoRA           SPU            LoRSU
                      ACC    BWT     ACC    BWT     ACC    BWT
CL-5     GTS          79.2   -7.1    80.8    0.5    81.1    0.4
         ESAT         73.8   -3.4    79.8    1.5    82.2    2.0
CL-20    GTS          77.2   -9.1    82.8   -0.6    83.5   -0.4
         ESAT         64.1  -18.3    82.0    2.0    82.7    0.1
CL-50    GTS          79.3  -10.3    83.8   -0.7    84.7   -0.5
         ESAT         61.4  -27.8    81.2   -2.4    82.1   -0.8

Additional metrics. We assess the performance of LoRSU against LoRA and SPU in terms of ACC and BWT across two out-of-domain datasets, GTS and ESAT. Since LoRA and SPU have a similar number of trainable parameters to LoRSU and competitive performance in our previous experiment, we choose those for comparison. Table 2 shows that LoRSU performs well with respect to these metrics, following similar patterns as TI and CC in Table 1. LoRSU achieves the best performance on ACC while exhibiting minimal forgetting with the least negative BWT values. Similar patterns are observed on extra datasets in appendix D.2.
the highest TI (7.0) with a positive CC (0.2), outperforming imal forgetting with the least negative BWT values. Similar


Table 3. Performance comparison between LoRSU using the CLIP loss (LoRSU) or the perplexity loss (LoRSU-Ppl) and other baselines that fine-tune only the vision encoder (LoRA, LoRA-Ppl), only the LLM (LoRA-L), or both of them (LoRA-F). We report the Target Improvement (TI) and Control Change (CC) for each CL setting. † and ‡ denote classification-to-VQA and regular VQA datasets, respectively.

Setting  FT Dataset   LoRA-L        LoRA          LoRSU         LoRA-Ppl      LoRA-F        LoRSU-Ppl
                      TI    CC      TI     CC     TI    CC      TI     CC     TI     CC     TI    CC
CL-5     GTS†        -4.1  -0.2     0.7   -4.8    6.4  -0.7    -7.5   -3.0   -2.7   -1.8    1.6  -1.0
         TSI†         6.0  -0.1    -0.1   -2.8    3.2   0.1    10.9   -2.4   -8.0   -2.4   13.1   1.5
         CAn†        -3.3  -0.2    -1.3   -4.6    0.3   0.3    -3.5   -5.5   -4.1   -1.6    0.2  -0.2
         AIR†        -1.7   0.3     1.3   -3.7    4.8   0.4    -0.7   -1.5    9.6   -1.9    5.8  -0.2
         ESAT†       -0.2  -0.1    -1.6   -0.7    6.8   0.2    -0.6    0.4    5.4   -0.5    3.7   0.1
         VSR‡        16.8  -0.6     0.5   -4.0    0.4   0.2    10.2  -12.5   18.0  -10.6   10.5  -1.2
         HM‡          7.4  -2.7    -0.4   -6.8    0.6   0.4    -1.2   -1.2    6.0   -4.5   -0.8   0.2
         VisOnly‡    -0.4  -0.1    -1.1   -4.5    0.9   0.1     0.3   -0.3    0.2   -0.4    2.7   0.7
CL-20    GTS†        -1.4   0.1     2.5  -10.5    8.6  -1.0    -0.5   -6.4   -1.4   -0.8    3.9  -0.7
         TSI†         5.9   0.0     8.5   -4.4   10.6  -0.1     6.5  -11.6    2.9   -3.1   13.9  -0.6
         CAn†        -1.9  -0.6    -2.3   -5.4    1.1   0.3    -3.7   -8.8   -2.1   -1.7    0.5  -1.2
         AIR†         3.7   0.3     5.3   -2.7    5.9  -0.5     4.8   -3.5   16.3   -0.3    6.0  -0.3
         ESAT†        0.7   0.4   -11.5   -0.5    6.6   0.2    -1.2   -0.1   -4.6   -0.0    2.9  -0.1
         VSR‡        22.2   1.0     0.4   -3.9    0.1  -0.2    19.5   -0.3   23.3   -5.1   22.9  -1.6
         HM‡         10.6  -2.2    -1.8   -5.8    0.7   0.2    10.7   -0.1   11.7   -1.4   10.9  -0.2
         VisOnly‡    -2.3   0.7    -1.0   -4.7    0.2   0.1    -2.0    0.5   -1.0    0.2    1.7   0.5
CL-50    GTS†        -0.7  -0.3     3.1  -11.1    9.7  -1.3    -1.4   -6.7   -3.9   -2.1    6.9  -0.4
         TSI†         9.9  -0.0    18.2   -6.3   19.1  -0.4    -1.6  -16.5   15.1   -0.7   22.0  -1.1
         CAn†        -1.8  -0.7    -0.4   -4.4    1.3  -0.5    -1.8   -9.8   -2.1   -1.1    1.0  -3.4
         AIR†         4.6   0.4     7.8   -3.8    8.2  -0.7     6.2   -3.1   17.9   -0.9    8.9  -0.4
         ESAT†        1.0   0.2   -14.5   -3.6    7.0   0.2     1.7    0.2   -9.5   -0.6   -0.7  -0.5
         VSR‡        21.9   1.0     0.4   -4.5    2.3  -0.3    20.2   -5.3   21.0    1.1   23.4  -3.6
         HM‡         10.2  -2.1     0.7   -4.5    0.3   0.2    12.5   -1.5   12.3   -3.7   12.2   0.2
         VisOnly‡    -2.4   0.6    -0.2   -6.8    0.3  -0.1    -2.0    0.7    0.2    0.2    0.3   0.1

4.4. CLIP-based vs. Perplexity-based Updates

Traditionally, LLMs and VLMs achieve impressive performance through fine-tuning with the perplexity loss. LoRA is the standard PEFT method for this purpose, and thus we consider three extra LoRA variants plus LoRSU-Ppl which all utilize the perplexity loss to update the model.

• LoRA-L applies LoRA adapters to all weight matrices of the LLM and thus the perplexity loss is required.
• LoRA-Ppl is the same method as LoRA but this time the perplexity loss is used to update the adapters.
• LoRA-F applies LoRA adapters to all weight matrices of the LLM, the image encoder, and the MLP projector.

We evaluate how LoRSU and LoRA perform compared to their perplexity-based counterparts, LoRSU-Ppl and LoRA-Ppl, respectively. Furthermore, we seek to explore how these methods compare to parameter-efficient fine-tuning approaches when either the entire VLM (LoRA-F) or only the LLM component (LoRA-L) is updated.
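To make the distinction between the two training signals concrete, here is a schematic sketch of the two losses, assuming precomputed image and text features for the CLIP loss and answer-token logits from the full VLM for the perplexity loss; it is an illustration of the general recipes rather than the exact training code used here.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_feats, text_feats, temperature=0.07):
    """Symmetric contrastive (CLIP) loss on L2-normalized features.
    Only the vision encoder needs forward/backward passes when the text
    features come from the frozen text tower."""
    image_feats = F.normalize(image_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    logits = image_feats @ text_feats.T / temperature            # (B, B)
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.T, targets)) / 2

def perplexity_loss(answer_logits, answer_ids):
    """Token-level cross-entropy on the answer tokens produced by the LLM;
    requires a pass through both the vision encoder and the LLM."""
    return F.cross_entropy(answer_logits.view(-1, answer_logits.shape[-1]),
                           answer_ids.view(-1))
```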


The results in Table 3 highlight the strong and robust performance of LoRSU and LoRSU-Ppl compared to other baseline methods across various settings. Both LoRSU and LoRSU-Ppl achieve minimal negative or even positive changes in CC, indicating reduced catastrophic forgetting and improved retention of generic knowledge compared to baselines. The table reports the average accuracies of TI/CC over three runs, with exact results provided in appendix D.

The use of the perplexity loss in LoRSU-Ppl demonstrates a considerable improvement in TI accuracy over LoRSU when fine-tuned for VQA datasets. For instance, LoRSU-Ppl achieves 10% higher TI accuracy than LoRSU on VSR. We hypothesize that the perplexity loss acts as an additional signal that optimizes the image encoder to complement the frozen language model more effectively, improving the alignment between visual and textual modalities in VQA.

However, we observe that LoRSU achieves a balance between task-specific improvements and generalization, consistently demonstrating higher CC accuracy compared to LoRSU-Ppl across most datasets. Lastly, although LoRA-F achieves high TI scores on many datasets, it suffers significantly from forgetting, underscoring the importance of LoRSU's structured updates in CL scenarios.

4.5. The Choice of Attention Heads

Given that LoRSU's mechanism of choosing attention heads is a key point of its success, we conduct an ablation study on the different strategies for selecting attention heads during the fine-tuning process. In this experiment, we compare LoRSU's performance to two new variants of LoRSU, namely LoRSU-Rand and LoRSU-AAH. LoRSU-Rand randomly chooses which attention heads (k = 2 heads) to fine-tune, while LoRSU-AAH fine-tunes all the available attention heads (16 in total) in each transformer block. For extra results on the sensitivity of the number of LoRSU's optimal attention heads k, see appendix E.2.

Table 4. Comparison of the importance of choosing a small subset of attention heads. The GTS dataset is used for fine-tuning. We include error bars over 3 runs.

Setting   Score     LoRSU-Rand    LoRSU-AAH     LoRSU
CL-5      TI (↑)     4.1 ±0.4      5.9 ±0.8      6.4 ±1.3
          CC (↑)    -1.0 ±0.5     -0.9 ±0.3     -0.7 ±0.6
CL-20     TI (↑)     6.2 ±0.6      7.5 ±0.6      8.6 ±0.9
          CC (↑)    -1.4 ±0.3     -0.7 ±0.4     -1.0 ±0.5
CL-50     TI (↑)     7.8 ±0.4      9.1 ±0.1      9.7 ±0.1
          CC (↑)    -1.7 ±0.2     -0.9 ±0.2     -1.3 ±0.1

The results in Table 4 demonstrate that LoRSU's targeted approach is performant, balancing task-specific improvements (TI) and the retention of generic knowledge (CC). Random selection (LoRSU-Rand) fails to generalize well, while fine-tuning all attention heads (LoRSU-AAH) adds unnecessary computational overhead with less effective generalization. LoRSU outperforms both of the variants in TI while LoRSU-AAH is marginally better in CC. Additional experiments that investigate the robustness of LoRSU in terms of the number of training epochs can be found in appendix E.3; ablation studies of other hyperparameters of LoRSU are given in appendix E.

4.6. Computational Efficiency

[Figure 3: bar chart comparing TFlops and trainable parameters (M) for LoRA-F, LoRSU-Ppl, and LoRSU.]

Figure 3. TFlops and trainable parameters comparison between LoRSU with CLIP loss (LoRSU), perplexity loss (LoRSU-Ppl), and LoRA-F.

In Figure 3, we assess the computational benefits of LoRSU using the CLIP loss compared to baseline methods. We focus on two key metrics: trainable parameters and floating-point operations per second (TFLOPs).

LoRSU requires 25× fewer computation resources than LoRA-F and LoRSU-Ppl, demonstrating the suitability of using the CLIP loss when computational resources are limited. Unlike the perplexity loss, which necessitates forward and backward passes through both the vision encoder and the LLM, the CLIP loss operates solely on the vision encoder, significantly reducing computational overhead. This makes LoRSU more scalable, enabling efficient continual learning even in resource-constrained environments.
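The trainable-parameter side of this comparison can be reproduced for any of the fine-tuning variants with a simple count over parameters that have gradients enabled, as sketched below (`model` stands for whichever module combination a method updates).

```python
def count_trainable_parameters(model):
    """Number of parameters (in millions) that will receive gradient updates."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad) / 1e6

# e.g. count_trainable_parameters(vision_encoder) after freezing everything
# except the selected LoRA adapters and the masked fc1 entries.
```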


5. Discussion

In this work, we introduced LoRSU, a novel parameter-efficient fine-tuning method specifically designed for few-shot continual learning scenarios with VLMs. Unlike existing approaches, LoRSU operates without relying on a replay buffer, making it uniquely suited for resource-constrained settings. Through extensive experiments, we demonstrate that LoRSU achieves both computational efficiency and the preservation of the model's generic knowledge by using localized and structured updates. LoRSU outperforms 12 baselines in over 80% of evaluations across 10 datasets and 3 settings, achieving the highest TI accuracies in most cases while maintaining stable or even positive CC accuracies. To the best of our knowledge, we are the first to explore few-shot continual learning of VLMs.

Whilst we focus on CLIP and LLaVA due to computational constraints, our method is generic to any transformer model, and we plan to extend it to other VLMs and image encoders. Another promising direction is using a smaller LLM proxy model in perplexity-based methods like LoRSU-Ppl, which has shown strong VQA performance. This could improve scalability and LoRSU's use in resource-limited settings. Finally, LoRSU's binary mask-based structured updates ensure efficient, precise parameter updates, but scaling to larger architectures like LLMs poses challenges. Replacing binary masks with more scalable solutions for vast parameter spaces will be crucial to manage memory and processing demands, offering opportunities for further refinement.

Impact Statement

Authors are required to include a statement of the potential broader impact of their work, including its ethical aspects and future societal consequences. This statement should be in an unnumbered section at the end of the paper (co-located with Acknowledgements – the two may appear in either order, but both must be before References), and does not count toward the paper page limit. In many cases, where the ethical impacts and expected societal implications are those that are well established when advancing the field of Machine Learning, substantial discussion is not required, and a simple statement such as the following will suffice: "This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none which we feel must be specifically highlighted here."

The above statement can be used verbatim in such cases, but we encourage authors to think about whether there is content which does warrant further discussion, as this statement will be apparent if the paper is later flagged for ethics review.

References

Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547, 2018.

Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., and Elhoseiny, M. MiniGPT-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198, 2024.

Cheng, S., Tian, B., Liu, Q., Chen, X., Wang, Y., Chen, H., and Zhang, N. Can we edit multimodal large language models? In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13877–13888, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.856. URL https://aclanthology.org/2023.emnlp-main.856/.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing GPT-4 with 90%* ChatGPT quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.

Das, D., Talon, D., Mancini, M., Wang, Y., and Ricci, E. One VLM to keep it learning: Generation and balancing for data-free continual visual question answering. arXiv preprint arXiv:2411.02210, 2024.

Das, S., Dai, R., Koperski, M., Minciullo, L., Garattoni, L., Bremond, F., and Francesca, G. Toyota Smarthome: Real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 833–842, 2019.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al. VLMEvalKit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11198–11201, 2024.

Goswami, D., Twardowski, B., and Van De Weijer, J. Calibrating higher-order statistics for few-shot class-incremental learning with pre-trained vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4075–4084, 2024.


He, H., Cai, J., Zhang, J., Tao, D., and Zhuang, B. Sensitivity-aware visual parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11825–11835, 2023a.

He, J., Guo, H., Tang, M., and Wang, J. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206, 2023b.

Helber, P., Bischke, B., Dengel, A., and Borth, D. EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. LoRA: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Kamoi, R., Zhang, Y., Das, S. S. S., Zhang, R. H., and Zhang, R. VisOnlyQA: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947, 2024.

Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., and Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624, 2020.

Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Li, J., Li, D., Savarese, S., and Hoi, S. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742. PMLR, 2023.

Liu, F., Emerson, G. E. T., and Collier, N. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023.

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. LLaVA-NeXT: Improved reasoning, OCR, and world knowledge, 2024a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024b.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017.

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, University of Oxford, 2013.

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.

Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0DcZxeWfOPt.

Panos, A., Kobe, Y., Reino, D. O., Aljundi, R., and Turner, R. E. First session adaptation: A strong replay-free baseline for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18820–18830, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. PyTorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Sinitsin, A., Plokhotnyuk, V., Pyrkin, D., Popov, S., and Babenko, A. Editable neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJedXaEtvS.

Srivastava, S., Harun, M. Y., Shrestha, R., and Kanan, C. Improving multimodal large language models using continual learning. arXiv preprint arXiv:2410.19925, 2024.


Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.

Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., and Xie, S. Eyes wide shut? Exploring the visual shortcomings of multimodal LLMs. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 9568–9578, 2024.

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023.

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

Verwimp, E., Aljundi, R., Ben-David, S., Bethge, M., Cossu, A., Gepperth, A., Hayes, T. L., Hüllermeier, E., Kanan, C., Kudithipudi, D., et al. Continual learning: Applications and the road forward. arXiv preprint arXiv:2311.11908, 2023.

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and Lin, J. Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191, 2024a.

Wang, Q., Lin, Y., Chen, Y., Schmidt, L., Han, B., and Zhang, T. Do CLIPs always generalize better than ImageNet models? arXiv preprint arXiv:2403.11497, 2024b.

Wu, T., Luo, L., Li, Y.-F., Pan, S., Vu, T.-T., and Haffari, G. Continual learning for large language models: A survey. arXiv preprint arXiv:2402.01364, 2024.

Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N., He, P., Cheng, Y., Chen, W., and Zhao, T. AdaLoRA: Adaptive budget allocation for parameter-efficient fine-tuning. arXiv preprint arXiv:2303.10512, 2023.

Zhang, W., Janson, P., Aljundi, R., and Elhoseiny, M. Overcoming generic knowledge loss with selective parameter update. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24046–24056, 2024.

Zhao, L., Zhang, X., Yan, K., Ding, S., and Huang, W. SAFE: Slow and fast parameter-efficient tuning for continual learning with pre-trained models. arXiv preprint arXiv:2411.02175, 2024.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. MiniGPT-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.


A. Proof of the optimal mask p*

Definition A.1. The operator TOP-C : R^d → R^d, for 1 ≤ C ≤ d, is defined as

    (TOP-C(x))_{π(i)} := x_{π(i)} if i ≤ C, and 0 otherwise,

where x = (x_1, . . . , x_d)^⊤ ∈ R^d and π is a permutation of {1, 2, . . . , d} such that |x_{π(i)}| ≥ |x_{π(i+1)}|, for i = 1, . . . , d − 1, i.e. the TOP-C operator keeps only the C largest elements of x in magnitude and truncates the rest to zero.

Lemma A.2. For any x ∈ R^d − {0}, 1 ≤ C ≤ d, the optimal mask

    p* = argmax_{p ∈ {0,1}^d}  ∥p ⊙ x∥² / ∥x∥²,   s.t. ∥p∥_0 ≤ C,

has zeros everywhere except the C largest elements of x in magnitude.

Proof. Rewrite the optimization problem as

    max_{p ∈ {0,1}^d}  Σ_{i=1}^d p_i x_i²,   s.t. Σ_{i=1}^d p_i ≤ C.

Notice that this is a trivial binary knapsack problem with maximum weight capacity C and weights equal to one. Hence, the maximum is attained when we pick the top C maximal x_i² elements.

Remark A.3. It holds that TOP-C(x) = p* ⊙ x.

Corollary A.4. The optimal mask p* in (6) has zeros everywhere except for the indices i ∈ {j : ∃ℓ ∈ {1, . . . , G} such that j ∈ {π_ℓ(1), . . . , π_ℓ(c_ℓ)}}, where π_ℓ is the same permutation as in Definition A.1 for the set of indices I_ℓ.

Proof. The result follows from the mutual exclusiveness of I_ℓ in the constraints of (6) and Lemma A.2.
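A small numerical sketch of Definition A.1 and Corollary A.4: TOP-C keeps the C largest-magnitude entries, and the group-wise optimal mask is obtained by applying it independently within each disjoint parameter group. The group and budget choices in the example are arbitrary illustrations.

```python
import torch

def top_c(x, C):
    """Definition A.1: zero out everything except the C largest-magnitude entries."""
    out = torch.zeros_like(x)
    idx = torch.topk(x.abs(), C).indices
    out[idx] = x[idx]
    return out

def groupwise_mask(x, groups, budgets):
    """Corollary A.4: within each disjoint index group, keep its c_l largest-magnitude entries."""
    p = torch.zeros_like(x)
    for idx, c in zip(groups, budgets):
        keep = torch.topk(x[idx].abs(), c).indices
        p[idx[keep]] = 1.0
    return p

x = torch.tensor([0.2, -3.0, 1.5, 0.1, -0.7, 2.2])
print(top_c(x, 2))                        # keeps -3.0 and 2.2
groups = [torch.tensor([0, 1, 2]), torch.tensor([3, 4, 5])]
print(groupwise_mask(x, groups, [1, 1]))  # keeps the largest-magnitude entry of each group
```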

B. Implementation Details
We describe below the implementation details of section 4.

• All the experiments are conducted on a single NVIDIA A100 GPU.

• We have included error bars over three runs for all experiments.

• We use PyTorch (Paszke et al., 2019) to implement all the algorithms.

• We use Adam (Kingma, 2014) as the optimizer for the methods that utilize the CLIP loss for fine-tuning and
AdamW (Loshchilov, 2017) for those that use the perplexity loss.

• A learning rate scheduler of Cosine Annealing with Warmup is employed for all methods.

• For all experiments, we set the learning rate to 1 × 10−5 and 2 × 10−5 for LoRSU and LoRSU-Ppl, respectively.

• We set batch size to 16 for all methods that fine-tune the vision encoder through CLIP loss. We reduce the batch size to
8 for those methods that fine-tune the vision encoder through perplexity loss or those that fine-tune the LLM. This was
due to GPU memory limitations.

• All methods run for 20, 15, and 10 epochs for the CL-5, CL-20, and CL-50 settings, respectively.

• For LoRA (-Ppl), we set rank r = 64 while LoRA-L and LoRA-F use r = 8, for all experiments.


• For AdaLoRA, we set the initial rank to 70 and the final average rank to 64.
• The adapters of LoRA and AdaLoRA are applied to all weight matrices of each of the transformer blocks.
• For SPU, we use sparsity=15% for all experiments.
• For LoRSU (-Ppl) we use sparsity=10%, rank=64, and we pick the top-2 attention heads for all experiments.

The choice of the above hyperparameters ensures that LoRA (-Ppl), LoRA-L, LoRA-F, AdaLoRA, SPU, and LoRSU (-Ppl) have a similar number of trainable parameters.
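A minimal sketch of the optimizer and learning-rate schedule listed above (Adam with cosine annealing and linear warmup), implemented with a LambdaLR since PyTorch does not ship a single combined scheduler; the warmup length is an assumed placeholder rather than a value from the paper.

```python
import math
import torch

def make_optimizer_and_scheduler(params, lr=1e-5, total_steps=1000, warmup_steps=100):
    optimizer = torch.optim.Adam(params, lr=lr)

    def lr_lambda(step):
        # linear warmup followed by cosine annealing to zero
        if step < warmup_steps:
            return (step + 1) / warmup_steps
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
    return optimizer, scheduler
```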

C. Datasets
Details on all datasets used in section 4 are presented here.

C.1. TSI & DALLE


We start with the description of how we constructed our newly introduced VQA datasets TSI and DALLE.

TSI. To extract images from the videos of the Toyota Smart Home dataset (TSI), we discretized each video clip into 2
frames per second and then selected the frame in the middle of the total time duration of the video clip. In Table 5 we
describe the actions that were selected and the corresponding prompt used for CLIP classification. We also note that a few actions were dropped to avoid ambiguous classes.
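A sketch of the middle-frame extraction described above using OpenCV: the clip is discretized at 2 frames per second and the sampled frame closest to the clip's midpoint is kept (paths and the exact sampling details are illustrative assumptions).

```python
import cv2

def middle_frame_at_2fps(video_path):
    """Discretize a clip at 2 frames/second and return the frame at the middle of its duration."""
    cap = cv2.VideoCapture(video_path)
    native_fps = cap.get(cv2.CAP_PROP_FPS)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    step = max(1, int(round(native_fps / 2.0)))   # one sample every 0.5 s
    sampled = list(range(0, n_frames, step))
    if not sampled:
        cap.release()
        return None
    mid_idx = sampled[len(sampled) // 2]
    cap.set(cv2.CAP_PROP_POS_FRAMES, mid_idx)
    ok, frame = cap.read()
    cap.release()
    return frame if ok else None
```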

DALLE. We generated images from DALL·E 2 using OpenAI python package and we used the prompt “A person {a}”
where a ∈ { using a white coffee machine, eating, cutting bread, stirring the pot, holding a glass, watching TV, holding a
bottle, walking, making tea, cutting food, holding a cup, using a laptop, lying down, holding a can, person holding a black
kettle, reading a book, cleaning up, sitting down, using a tablet, boiling water in a black kettle, using a cordless phone,
washing dishes}.
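A sketch of how the DALLE evaluation images could be generated with the OpenAI Python client (a v1.x-style client and its `images.generate` endpoint are assumed here; the exact interface used by the authors may differ, and only a few of the 22 activities are listed).

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

activities = ["using a white coffee machine", "eating", "cutting bread"]  # ... 22 classes in total

image_urls = {}
for activity in activities:
    urls = []
    for _ in range(30):  # 30 images per activity class (660 in total over 22 classes)
        response = client.images.generate(
            model="dall-e-2",
            prompt=f"A person {activity}",
            n=1,
            size="512x512",
        )
        urls.append(response.data[0].url)
    image_urls[activity] = urls
```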
In Table 6, we present the average number of images per session used to update the model for each CL setting. Finally,
Table 7 provides characteristics of the datasets used for evaluating performance.

C.2. Continual Learning Splits


For the continual learning settings of section 4, we split all datasets into five non-overlapping continual learning (CL) splits
based on the classes/categories of each dataset. Unless stated otherwise, we use the training split of each dataset to construct
these CL splits.
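A sketch of how such class-incremental few-shot splits can be built: classes are partitioned into five disjoint sessions and, for each session, only N images per class (N = 5/20/50) are drawn from the training split. The (image, label) dataset format is an assumption for illustration.

```python
import random
from collections import defaultdict

def make_cl_splits(dataset, class_sessions, shots, seed=0):
    """dataset: iterable of (image, label); class_sessions: list of 5 disjoint label lists;
    shots: images kept per class (5, 20, or 50). Returns one few-shot subset per session."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for image, label in dataset:
        by_class[label].append(image)

    sessions = []
    for labels in class_sessions:
        subset = []
        for label in labels:
            images = by_class[label][:]
            rng.shuffle(images)
            subset += [(img, label) for img in images[:shots]]
        sessions.append(subset)
    return sessions
```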

GTS (Stallkamp et al., 2012). We split the 43 classes of GTS as follows:

• Session 1: [25, 2, 11, 1, 40, 27, 5, 9, 17].


• Session 2: [32, 29, 20, 39, 21, 15, 23, 10, 3].
• Session 3: [18, 38, 42, 14, 22, 35, 34, 19, 33].
• Session 4: [12, 26, 41, 0, 37, 6, 13, 24].
• Session 5: [30, 28, 31, 7, 16, 4, 36, 8].

TSI (Das et al., 2019). We split the 27 action categories of TSI as follows:

• Session 1: [WatchTV, Laydown, Sitdown, Pour.Fromkettle, Enter, Drink.Frombottle].


• Session 2: [Eat.Attable, Pour.Frombottle, Cook.Cleandishes, Maketea.Boilwater, Leave, Cook.Cleanup].
• Session 3: [Maketea.Insertteabag, Makecoffee.Pourwater, Drink.Fromcan, Readbook, Cutbread].


Original Class name/Action Generated Caption


Cook.Cleandishes washing dishes
Cook.Cleanup cleaning up
Cook.Cut cutting food
Cook.Stir stirring the pot
Cook.Usestove ✗
Cook.Cutbread cutting bread
Drink.Frombottle holding a bottle
Drink.Fromcan holding a can
Drink.Fromcup holding a cup
Drink.Fromglass holding a glass
Eat.Attable eating
Eat.Snack ✗
Enter walking
Getup ✗
Laydown lying down
Leave walking
Makecoffee.Pourgrains using a white coffee machine
Makecoffee.Pourwater using a white coffee machine
Maketea.Boilwater boiling water in a black kettle
Maketea.Insertteabag making tea
Pour.Frombottle holding a bottle
Pour.Fromcan holding a can
Pour.Fromkettle holding a black kettle
Readbook reading a book
Sitdown sitting down
Takepills ✗
Uselaptop using a laptop
Usetablet using a tablet
Usetelephone using a cordless phone
Walk walking
WatchTV watching TV

Table 5. The original action names of the Toyota Smarthome dataset and their corresponding captions used to create the Toyota Smarthome
Images (TSI) dataset. We use ✗ to denote the actions that are ambiguous and were not used to build the TSI dataset. The final prompt is
created as “The person in this image is {caption}”.


• Session 4: [Drink.Fromcup, Drink.Fromglass, Usetablet, Pour.Fromcan, Usetelephone].

• Session 5: [Walk, Cook.Stir, Makecoffee.Pourgrains, Cook.Cut, Uselaptop].

CAn (Wang et al., 2024b). The 45 classes of CAn are split as follows:

• Session 1: [102, 9, 20, 56, 23, 30, 357, 291, 144].

• Session 2: [41, 293, 42, 49, 54, 57, 70, 279, 305].

• Session 3: [71, 10, 76, 79, 349, 16, 81, 83, 100].

• Session 4: [130, 30, 133, 150, 275, 276, 58, 277, 80].

• Session 5: [39, 290, 37, 296, 316, 337, 89, 360, 128].

The indices of CAn correspond to those of ImageNet (Deng et al., 2009) since the dataset was built based on these 45 animal
classes of ImageNet.

AIR (Maji et al., 2013). We split the 100 aircraft types of AIR as follows:

• Session 1: [23, 8, 11, 7, 48, 13, 1, 91, 94, 54, 16, 63, 52, 41, 80, 2, 47, 87, 78, 66].

• Session 2: [19, 6, 24, 10, 59, 30, 22, 29, 83, 37, 93, 81, 43, 99, 86, 28, 34, 88, 44, 14].

• Session 3: [84, 70, 4, 20, 15, 21, 31, 76, 57, 67, 73, 50, 69, 25, 98, 46, 96, 0, 72, 35].

• Session 4: [58, 92, 3, 95, 56, 90, 26, 40, 55, 89, 75, 71, 60, 42, 9, 82, 39, 18, 77, 68].

• Session 5: [32, 79, 12, 85, 36, 17, 64, 27, 74, 45, 61, 38, 51, 62, 65, 33, 5, 53, 97, 49].

ESAT (Helber et al., 2019). We split the 10 different land terrain classes of ESAT as follows:

• Session 1: [0, 1].

• Session 2: [2, 3].

• Session 3: [4, 5].

• Session 4: [6, 7].

• Session 5: [8, 9].

DALLE. This dataset was only used for performance evaluation (as a control dataset), not for fine-tuning.

VSR (Liu et al., 2023). The images of this VQA dataset are labeled according to 36 different categories that describe the
dominant object of the image. We create the CL splits as follows:

• Session 1: [oven, dining table, spoon, boat, cake, donut, sandwich].

• Session 2: [fire hydrant, elephant, airplane, truck, apple, hot dog, sheep].

• Session 3: [kite, baseball glove, cow, tie, scissors, toaster, tv].

• Session 4: [bicycle, banana, couch, teddy bear, bus, umbrella, bird].

• Session 5: [potted plant, bowl, broccoli, bottle, knife, orange, person, pizza].


Table 6. Average number of images per session (5 sessions in total) for each dataset used for fine-tuning.
FT Dataset
Setting GTS TSI CAn AIR ESAT VSR HM VisOnly
CL-5 43.0 27.0 45.0 100.0 10.0 100.0 100.0 7.0
CL-20 170.0 84.0 180.0 400.0 40.0 274.6 300.0 28.0
CL-50 430.0 253.8 450.0 1000.0 100.0 485.2 600.0 70.0

Table 7. Characteristics of the datasets used for performance evaluation in section 4.


Eval Datasets GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
# Samples 3,990 4,908 1,796 3,333 17,000 660 1,222 2,000 150 1,150
# Classes 43 27 45 100 10 27 36 N/A N/A 7

HM (Kiela et al., 2020). For the hateful memes dataset, there is no labeling information that would allow us to split the images in a meaningful way, so we randomly split the training images into five disjoint sets to create our final CL splits.
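A minimal sketch of this random split is given below; the size of the HM training set and the seed are assumptions used only for illustration.

```python
# Illustrative sketch of the random HM split into five disjoint session sets.
import random

num_hm_train_images = 8500                  # hypothetical size of the HM training split
indices = list(range(num_hm_train_images))
random.Random(0).shuffle(indices)
chunk = len(indices) // 5
hm_sessions = [indices[i * chunk:(i + 1) * chunk] for i in range(5)]  # five disjoint sets
```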

MMVP (Tong et al., 2024). This is the only dataset without a training split, and it comprises just 300 images. For this reason, we only used it for evaluation in the experiments of the main paper. However, for completeness, we include results in Table 21 where we fine-tune on it. We use 150 images for training, equally split into five sessions, and the remaining 150 images for evaluation. Thus, the setting can be considered a 30-shot CL setting.

VisOnly (Kamoi et al., 2024). This dataset groups its samples into seven categories describing the nature of the geometric and numerical information in scientific figures. We created the splits as follows:

• Session 1: Geometry-Triangle.
• Session 2: Geometry-Quadrilateral.
• Session 3: Geometry-Length.
• Session 4: Geometry-Angle.
• Session 5: [Geometry-Area, 3D-Size, 3D-Angle].

D. Detailed Results
D.1. CLIP-based Updates+
The detailed accuracies for all baselines and datasets used to create Table 1 of the main paper can be found in Tables 8
through 12.

D.2. Extra ACC and BWT results


In Table 13 we present ACC and BWT results on additional datasets together with the ones in the main paper. The results follow the same patterns as in section 4, with LoRSU demonstrating the most consistent performance in both ACC and BWT compared to the other two baselines. SPU is close to LoRSU in terms of BWT but lags significantly behind in ACC.
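For reference, the standard definitions from the continual learning literature are assumed here. Let $R_{t,i}$ denote the accuracy on the data of session $i$ after training up to session $t$, with $T = 5$ sessions in total. Then

\[
\mathrm{ACC} = \frac{1}{T}\sum_{i=1}^{T} R_{T,i}, \qquad
\mathrm{BWT} = \frac{1}{T-1}\sum_{i=1}^{T-1}\bigl(R_{T,i} - R_{i,i}\bigr),
\]

so that a negative BWT indicates forgetting of earlier sessions. These expressions are reproduced only as a reading aid for Table 13.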

D.3. CLIP-based vs. Perplexity-based Updates+


The detailed accuracies for all baselines and datasets used to create Table 3 of the main paper can be found in Tables 14
through 18. We have also included results on fine-tuning the model using the MMVP dataset in Table 21.


Table 8. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use GTS dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting FT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LN 79.1±1.2 53.6±0.5 81.2±0.6 61.0±1.2 58.9±0.9 91.1±1.3 51.9±1.5 62.7±1.1 59.6±0.2 31.8±0.4
F-FT 79.3±0.6 55.1±0.8 76.8±1.3 58.8±1.0 25.6±0.9 89.2±1.2 51.7±0.9 62.1±0.8 56.4±0.4 30.9±0.2
F-EWC 80.6±0.6 37.4±1.3 63.2±0.7 55.8±1.4 26.1±1.4 81.5±1.1 51.8±1.4 61.2±0.6 53.8±0.4 31.2±0.4
CL-5 LoRA 76.3±0.8 52.6±1.4 73.3±0.6 56.7±1.2 49.3±0.8 87.1±1.3 51.8±1.2 61.3±1.2 58.1±0.3 31.6±0.4
AdaLoRA 74.7±0.9 49.7±0.7 79.6±0.9 56.3±0.8 42.5±0.8 91.6±1.1 52.0±0.8 60.9±1.2 57.1±0.3 31.7±0.2
SPU 81.0±1.4 53.7±1.5 82.5±0.7 61.0±1.0 67.8±0.6 91.6±1.3 52.0±0.6 62.0±1.3 58.2±0.2 31.6±0.2
LoRSU 82.0±1.3 53.5±1.3 82.4±0.8 60.8±1.4 66.6±0.9 91.5±1.4 51.6±0.7 61.7±1.4 59.8±0.2 31.6±0.2
LN 80.8±0.6 49.5±0.7 77.7±1.0 59.7±0.5 32.7±0.6 89.8±0.9 51.8±0.7 62.3±0.3 57.5±0.1 31.2±0.2
F-FT 80.2±0.8 54.5±0.7 74.9±0.8 57.2±1.0 23.2±0.7 86.7±0.4 51.9±0.9 61.6±1.0 58.3±0.2 31.7±0.3
F-EWC 82.3±0.9 35.5±0.9 55.7±0.4 35.4±0.3 28.7±0.9 72.4±0.8 51.6±0.7 60.9±0.8 53.5±0.2 31.0±0.3
CL-20 LoRA 78.1±0.8 55.6±0.3 59.0±0.9 47.6±0.4 26.0±0.6 83.6±0.8 52.1±0.5 62.1±1.0 53.7±0.3 30.8±0.2
AdaLoRA 75.8±0.8 51.9±0.5 79.3±0.9 59.3±0.4 62.1±0.4 90.7±1.0 51.6±0.5 61.1±0.6 57.7±0.2 31.7±0.2
SPU 83.5±0.6 53.1±0.6 82.2±0.7 60.7±0.8 62.0±0.4 91.5±0.4 51.9±0.5 61.8±0.7 58.8±0.2 31.5±0.2
LoRSU 84.2±0.9 52.9±0.6 82.2±0.5 60.7±0.6 64.7±0.6 90.8±0.5 51.9±0.4 61.7±0.5 59.5±0.1 31.6±0.2
LN 80.4±0.2 50.4±0.1 74.9±0.1 58.3±0.0 30.4±0.3 89.0±0.1 51.8±0.0 62.0±0.3 58.7±0.1 31.4±0.1
F-FT 79.0±0.1 48.9±0.2 65.0±0.2 55.0±0.3 23.5±0.0 86.8±0.2 52.0±0.1 60.8±0.1 54.9±0.1 30.7±0.1
F-EWC 80.9±0.2 45.2±0.4 60.5±0.4 43.2±0.0 26.9±0.3 78.5±0.1 52.0±0.0 58.7±0.1 52.9±0.0 31.7±0.1
CL-50 LoRA 78.7±0.0 50.7±0.0 62.1±0.2 47.4±0.1 24.2±0.2 82.9±0.3 51.7±0.3 61.0±0.2 54.3±0.1 30.8±0.0
AdaLoRA 76.6±0.4 50.4±0.0 79.0±0.2 57.4±0.1 58.3±0.1 90.4±0.2 51.6±0.2 61.8±0.3 55.4±0.1 31.8±0.1
SPU 83.3±0.3 53.8±0.2 81.8±0.2 61.1±0.4 58.8±0.0 91.0±0.2 51.8±0.4 62.1±0.1 59.5±0.1 32.2±0.1
LoRSU 85.3±0.1 54.2±0.1 81.9±0.2 60.5±0.2 61.4±0.3 91.0±0.1 51.7±0.2 62.2±0.4 58.9±0.1 31.8±0.1

Table 9. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use TSI dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting FT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LN 75.4±1.0 53.9±0.6 82.6±1.3 60.0±1.0 75.9±0.8 91.1±1.3 51.7±1.4 61.9±1.0 58.4±0.3 30.9±0.3
F-FT 73.8±0.5 60.5±1.1 81.6±0.9 59.5±1.5 70.4±1.0 91.1±1.2 51.8±0.9 61.5±1.3 56.9±0.2 31.3±0.3
F-EWC 74.9±1.1 61.6±1.0 82.1±1.1 58.8±0.9 72.3±1.2 89.9±1.4 51.9±0.9 62.4±1.4 55.5±0.4 31.5±0.3
CL-5 LoRA 73.4±1.0 53.0±0.9 80.2±0.6 58.8±0.7 59.1±1.4 90.2±1.1 51.6±1.3 61.2±1.4 56.7±0.4 31.7±0.4
AdaLoRA 75.6±0.8 54.2±0.6 82.6±1.1 60.0±1.3 75.7±1.3 91.1±1.2 51.6±0.9 62.1±1.0 59.5±0.3 31.7±0.2
SPU 75.4±0.7 54.0±1.1 83.0±1.3 60.1±0.6 75.7±1.5 91.3±1.3 51.9±1.4 61.7±0.9 58.5±0.4 31.6±0.4
LoRSU 75.9±0.9 56.3±0.7 82.7±0.9 60.8±1.0 76.2±1.4 91.3±1.2 51.6±0.9 61.7±0.8 57.7±0.3 31.2±0.3
LN 72.9±0.5 58.2±0.5 78.9±0.9 56.8±0.4 69.3±0.9 91.4±0.8 51.6±0.8 62.6±0.5 56.3±0.3 31.3±0.2
F-FT 72.1±0.7 68.4±0.4 80.0±0.7 55.4±0.4 58.8±0.8 88.4±0.3 51.8±0.6 62.3±0.5 56.9±0.2 31.2±0.3
F-EWC 23.3±0.6 69.1±0.6 20.4±0.7 20.1±0.6 24.2±0.6 17.7±0.7 51.7±0.7 56.9±0.8 49.6±0.3 31.1±0.1
CL-20 LoRA 68.5±0.7 61.6±0.3 76.7±0.9 55.3±0.7 55.6±0.6 88.8±0.8 51.9±0.3 61.4±0.6 59.1±0.3 31.1±0.3
AdaLoRA 70.3±0.5 54.4±0.4 72.4±0.5 43.6±0.8 34.6±0.7 77.0±0.3 52.2±0.9 62.6±0.4 57.0±0.1 31.9±0.3
SPU 75.5±0.7 60.9±0.8 82.3±0.4 59.2±0.5 73.7±1.0 91.2±0.7 51.7±0.8 61.8±0.9 58.2±0.3 32.0±0.2
LoRSU 75.9±0.6 63.7±0.4 82.8±0.8 60.4±0.3 73.4±0.6 90.9±0.6 51.7±0.4 61.5±0.7 58.8±0.2 31.9±0.2
LN 73.0±0.2 60.1±0.2 79.6±0.3 57.7±0.4 61.3±0.4 89.6±0.4 51.9±0.0 61.3±0.0 55.5±0.1 31.3±0.1
F-FT 72.5±0.4 70.3±0.1 78.3±0.4 53.4±0.0 50.6±0.2 89.1±0.3 52.3±0.3 61.1±0.2 57.1±0.1 31.7±0.0
F-EWC 48.0±0.3 75.5±0.2 59.5±0.4 38.8±0.1 42.6±0.3 82.5±0.0 52.5±0.1 56.4±0.3 55.4±0.1 31.3±0.1
CL-50 LoRA 66.1±0.2 71.3±0.3 76.0±0.1 56.0±0.1 44.5±0.2 88.9±0.3 51.8±0.1 60.4±0.2 56.3±0.1 31.6±0.1
AdaLoRA 73.1±0.2 61.0±0.0 80.6±0.0 52.0±0.4 72.2±0.3 88.9±0.3 51.7±0.2 62.0±0.4 59.1±0.0 31.2±0.1
SPU 75.4±0.0 65.3±0.1 81.8±0.1 59.7±0.2 72.3±0.1 90.8±0.2 51.9±0.1 61.9±0.4 58.0±0.1 31.8±0.0
LoRSU 75.3±0.2 72.2±0.4 82.4±0.3 59.7±0.3 72.5±0.3 90.8±0.3 51.7±0.2 61.7±0.4 58.5±0.1 31.7±0.0


Table 10. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use CAn dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting FT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LN 74.3±1.5 52.9±1.4 80.3±1.4 58.9±0.7 72.4±1.2 91.1±0.8 52.0±0.9 61.5±1.2 61.7±0.3 32.1±0.4
F-FT 73.5±1.1 50.6±0.9 80.3±0.8 56.5±0.6 63.1±0.6 91.3±1.5 51.7±1.4 61.8±0.8 58.4±0.2 31.3±0.4
F-EWC 65.9±1.5 39.1±0.7 66.0±1.3 40.0±0.9 41.7±0.7 86.2±0.8 51.8±1.3 59.9±1.0 57.6±0.4 31.3±0.2
CL-5 LoRA 69.7±1.4 44.8±1.1 81.4±0.7 56.9±1.0 50.7±1.3 92.9±1.3 52.0±1.0 61.8±1.5 56.5±0.4 31.3±0.4
AdaLoRA 75.5±1.4 53.2±0.7 81.7±0.6 60.1±0.7 72.0±1.2 92.1±0.9 51.9±1.4 61.8±1.5 59.0±0.3 31.9±0.3
SPU 76.0±0.9 53.2±0.6 82.3±1.1 60.3±1.3 75.7±0.9 91.3±1.3 51.7±0.8 61.5±1.2 58.4±0.3 31.4±0.4
LoRSU 75.2±0.8 52.7±0.9 83.0±1.0 60.1±0.7 76.8±1.0 91.8±1.4 51.6±1.1 62.3±1.2 58.7±0.3 31.4±0.4
LN 72.9±0.5 54.0±0.9 80.3±0.6 57.3±0.4 73.3±0.4 90.7±0.4 51.8±0.8 61.9±0.9 61.0±0.1 31.4±0.1
F-FT 72.9±0.5 47.9±0.6 83.0±0.7 56.9±0.9 62.7±0.9 90.6±0.9 51.9±0.4 61.3±0.4 56.5±0.2 31.5±0.3
F-EWC 70.1±1.0 48.7±0.4 82.8±0.5 51.1±0.8 54.8±0.9 88.3±0.7 51.8±1.0 57.0±0.8 59.6±0.3 31.2±0.3
CL-20 LoRA 67.5±0.6 48.9±0.6 80.4±0.4 57.3±0.9 39.7±0.4 91.1±0.6 51.8±0.9 61.7±0.3 60.1±0.2 31.9±0.3
AdaLoRA 72.5±1.0 51.5±1.0 79.2±0.4 54.1±1.0 65.5±0.7 90.6±0.8 51.7±0.9 61.9±0.9 56.5±0.3 31.7±0.3
SPU 75.0±0.5 53.5±0.3 82.8±0.8 59.9±0.6 76.1±0.9 91.6±0.9 51.6±0.6 61.9±0.4 61.8±0.2 31.6±0.3
LoRSU 75.3±0.8 53.1±0.9 83.8±0.9 58.8±1.0 75.5±0.7 92.0±0.3 51.9±0.4 62.3±0.6 60.4±0.2 31.6±0.2
LN 71.1±0.1 50.4±0.3 77.0±0.3 57.5±0.3 57.9±0.1 89.7±0.1 51.6±0.1 62.4±0.3 56.1±0.1 31.9±0.0
F-FT 70.1±0.1 48.9±0.3 81.7±0.0 56.2±0.2 47.5±0.1 89.9±0.3 52.0±0.1 61.2±0.1 57.7±0.1 31.1±0.1
F-EWC 61.7±0.0 43.9±0.3 83.3±0.4 46.2±0.3 38.9±0.2 87.5±0.1 51.8±0.3 55.8±0.3 54.7±0.1 30.7±0.1
CL-50 LoRA 66.8±0.2 47.8±0.3 82.3±0.2 55.7±0.0 52.0±0.3 91.0±0.3 51.7±0.3 61.6±0.2 60.2±0.0 31.6±0.1
AdaLoRA 73.5±0.0 49.9±0.1 80.9±0.4 55.7±0.4 77.8±0.1 93.1±0.0 51.5±0.1 61.4±0.3 56.9±0.0 31.6±0.1
SPU 75.2±0.2 53.2±0.0 83.3±0.3 59.3±0.2 73.1±0.3 91.4±0.4 51.7±0.3 61.7±0.1 58.5±0.1 31.6±0.1
LoRSU 75.0±0.2 51.8±0.1 84.0±0.4 58.5±0.2 72.7±0.3 91.9±0.3 51.7±0.1 62.3±0.4 58.1±0.0 31.7±0.1

Table 11. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use AIR dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting FT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LN 73.4±0.8 51.3±1.2 80.2±0.6 60.7±1.5 66.9±0.7 91.3±0.6 51.9±0.9 62.4±1.2 58.5±0.2 30.6±0.2
F-FT 72.5±1.2 50.5±0.5 79.9±0.9 62.4±0.9 60.7±1.4 90.6±0.5 51.7±0.9 60.9±1.1 58.3±0.4 31.4±0.3
F-EWC 74.9±1.2 52.4±0.8 71.5±1.2 63.3±1.0 63.8±1.0 90.7±1.5 51.2±0.5 61.2±0.8 58.1±0.4 31.4±0.4
CL-5 LoRA 70.9±0.9 52.7±0.6 79.0±0.7 61.7±0.5 48.8±0.7 90.6±0.6 52.0±0.9 62.5±0.8 60.0±0.3 31.1±0.2
AdaLoRA 75.0±1.0 53.3±0.8 83.7±0.9 60.8±0.8 75.2±1.5 91.7±1.0 51.6±0.8 61.6±0.8 56.9±0.3 31.9±0.4
SPU 76.2±0.6 53.0±1.3 83.0±0.8 63.5±0.8 75.3±0.7 91.5±1.5 51.5±0.6 61.5±0.8 58.1±0.3 31.5±0.4
LoRSU 76.2±0.8 53.4±1.4 82.5±1.0 65.2±1.3 76.0±0.9 91.8±0.8 51.6±0.8 62.1±1.1 59.0±0.4 31.2±0.3
LN 70.3±0.9 53.7±0.6 77.9±1.0 60.2±0.4 56.3±0.7 90.6±0.3 51.7±1.0 62.8±0.7 58.1±0.1 31.8±0.3
F-FT 73.0±0.6 54.1±0.6 80.3±0.9 69.7±0.5 62.7±0.5 90.0±0.4 51.9±0.3 61.8±0.4 58.9±0.1 31.4±0.1
F-EWC 71.2±0.5 53.9±1.0 79.3±0.4 70.6±1.0 64.6±0.7 89.7±0.6 51.7±0.4 61.5±0.5 58.9±0.3 31.4±0.2
CL-20 LoRA 71.8±0.9 51.1±0.8 78.6±0.3 65.7±0.4 63.4±0.8 89.9±1.0 51.7±0.3 62.3±0.3 56.2±0.2 31.5±0.2
AdaLoRA 73.4±0.8 51.6±0.6 81.2±0.9 63.1±0.6 73.8±0.8 90.8±0.5 52.1±0.4 62.7±0.8 57.7±0.2 31.2±0.1
SPU 75.7±0.4 52.2±0.7 82.0±0.8 63.4±0.9 72.6±0.6 91.7±0.6 51.8±0.6 62.2±0.5 59.0±0.2 31.4±0.2
LoRSU 75.7±0.9 52.6±0.9 81.4±0.7 66.3±0.7 73.0±0.8 90.9±0.8 51.9±0.8 61.8±0.8 56.9±0.1 31.6±0.3
LN 69.6±0.4 54.0±0.1 76.9±0.3 62.2±0.2 50.9±0.2 90.2±0.0 52.0±0.3 62.8±0.4 57.7±0.1 31.5±0.1
F-FT 71.2±0.3 50.3±0.3 78.3±0.2 70.4±0.4 59.9±0.0 90.1±0.1 51.9±0.1 61.8±0.3 57.5±0.1 31.3±0.1
F-EWC 71.8±0.2 51.6±0.1 78.3±0.0 71.3±0.2 57.6±0.2 90.2±0.2 51.7±0.1 61.1±0.2 57.4±0.1 31.5±0.0
CL-50 LoRA 69.8±0.0 54.7±0.0 77.0±0.3 68.2±0.3 51.6±0.1 90.0±0.1 52.0±0.4 62.4±0.0 57.1±0.1 31.5±0.1
AdaLoRA 74.2±0.3 52.0±0.2 82.4±0.1 65.0±0.2 72.6±0.0 91.9±0.1 51.7±0.2 60.7±0.1 55.6±0.0 31.3±0.0
SPU 75.2±0.2 52.2±0.4 82.6±0.3 66.6±0.4 70.0±0.2 91.6±0.3 51.9±0.2 62.0±0.3 57.6±0.0 31.8±0.0
LoRSU 75.4±0.4 52.7±0.3 81.6±0.2 68.6±0.3 69.7±0.3 91.5±0.2 51.7±0.4 62.2±0.1 58.7±0.1 31.1±0.1


Table 12. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use ESAT dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting FT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LN 75.8±0.9 53.2±0.6 82.6±1.1 60.0±1.3 80.3±1.0 92.7±1.0 51.9±0.7 61.7±0.5 60.4±0.4 31.8±0.3
F-FT 69.1±0.8 50.5±0.6 80.8±1.1 57.7±1.5 65.8±0.6 91.3±1.5 51.8±0.7 62.0±0.7 58.8±0.3 30.4±0.2
F-EWC 66.3±0.9 52.1±1.4 79.3±1.0 56.8±1.3 67.7±1.3 90.9±0.8 51.9±1.3 62.0±1.2 55.4±0.2 30.9±0.4
CL-5 LoRA 73.2±1.3 49.3±1.2 80.6±0.9 60.4±1.1 74.5±0.8 92.3±1.3 52.0±1.1 61.6±1.1 57.4±0.4 31.4±0.3
AdaLoRA 75.9±0.5 52.4±1.4 82.4±0.5 60.5±0.8 78.0±1.3 91.5±0.9 51.6±0.8 61.5±1.3 59.0±0.4 30.9±0.2
SPU 75.8±0.8 53.2±1.4 82.8±1.4 60.5±1.5 80.6±0.9 91.5±1.1 51.7±0.6 61.7±1.5 57.5±0.4 31.5±0.2
LoRSU 76.2±1.0 53.6±1.1 82.5±1.2 60.8±0.8 82.9±1.0 91.5±0.9 51.6±0.9 61.3±0.7 57.7±0.4 31.9±0.4
LN 74.5±0.5 52.6±0.7 82.5±0.5 58.8±0.7 77.0±0.4 92.4±0.5 51.9±1.0 62.5±0.5 58.0±0.3 31.2±0.1
F-FT 66.5±0.8 51.1±0.7 79.1±0.4 56.7±0.6 51.2±0.7 92.0±0.4 51.6±0.6 61.4±0.8 60.1±0.1 31.5±0.2
F-EWC 69.3±0.3 51.2±1.0 60.5±0.8 57.1±0.6 54.1±0.6 89.7±0.6 51.9±0.6 60.9±0.7 58.4±0.2 31.8±0.2
CL-20 LoRA 71.1±0.7 50.9±0.5 80.3±1.0 59.4±0.7 64.6±0.7 91.1±0.7 52.0±0.4 62.3±0.6 62.3±0.2 31.3±0.1
AdaLoRA 70.0±0.6 47.3±0.8 78.4±0.9 51.7±0.4 69.3±0.5 91.3±0.7 51.7±0.9 60.8±0.9 58.1±0.2 31.6±0.1
SPU 75.6±0.9 53.1±0.3 82.8±0.9 59.9±0.8 81.5±0.6 92.3±0.4 51.9±0.5 61.5±0.8 58.8±0.2 31.7±0.1
LoRSU 75.3±1.0 53.7±0.8 82.8±0.4 60.7±0.8 82.7±0.7 91.6±0.6 51.6±0.4 61.5±0.4 58.4±0.2 31.4±0.2
LN 73.1±0.3 53.0±0.2 82.0±0.1 59.1±0.2 80.7±0.0 92.4±0.2 51.8±0.3 62.0±0.1 60.4±0.0 32.0±0.0
F-FT 58.0±0.4 50.3±0.0 76.8±0.1 57.2±0.2 34.7±0.1 89.7±0.0 51.7±0.2 61.6±0.2 58.1±0.0 31.6±0.1
F-EWC 59.0±0.1 64.5±0.1 77.2±0.1 56.3±0.1 38.0±0.2 87.3±0.2 51.9±0.2 60.7±0.2 58.2±0.1 31.8±0.0
CL-50 LoRA 62.8±0.3 47.2±0.4 72.4±0.4 54.4±0.2 61.6±0.4 90.2±0.3 51.7±0.2 62.0±0.1 60.8±0.0 30.9±0.1
AdaLoRA 67.2±0.2 49.3±0.3 78.8±0.3 56.9±0.3 58.8±0.3 89.6±0.3 51.8±0.1 61.9±0.2 56.0±0.1 31.6±0.0
SPU 75.1±0.3 53.4±0.2 82.5±0.2 60.2±0.3 81.9±0.1 92.3±0.3 51.8±0.1 61.6±0.1 57.1±0.1 31.9±0.0
LoRSU 75.4±0.3 53.9±0.1 83.1±0.2 60.3±0.1 83.1±0.1 92.1±0.1 51.6±0.2 61.2±0.0 57.6±0.0 31.1±0.0

Table 13. Average accuracy (ACC) and backward transfer (BWT) scores (%) for LLaVA with the fine-tuned CLIP-L-14. Each column
indicates the setting and fine-tuning method. We include error bars over 3 runs.

FT Method
Zr-Shot LoRA SPU LoRSU
Setting FT Dataset
ACC (↑) BWT (↑) ACC (↑) BWT (↑) ACC (↑) BWT (↑) ACC (↑) BWT (↑)
GTS 75.4 0.0 79.2±0.7 −7.1±0.8 80.8±0.5 0.5±0.6 81.1±0.6 0.4±0.7
TSI 54.0 0.0 55.5±0.9 −2.5±0.6 55.5±0.6 0.2±0.5 57.0±0.8 0.5±0.6
CL-5
AIR 60.4 0.0 59.2±0.8 −2.1±0.7 64.7±0.5 2.8±0.6 65.0±0.7 2.5±0.6
ESAT 76.4 0.0 73.8±0.9 −3.4±0.6 79.8±0.6 1.5±0.7 82.2±0.7 2.0±0.6
GTS 75.4 0.0 77.2±0.4 −9.1±0.5 82.8±0.4 −0.6±0.3 83.5±0.6 −0.4±0.3
TSI 54.0 0.0 60.6±0.3 −7.2±0.4 60.1±0.5 −1.7±0.3 62.1±0.3 −0.9±0.4
CL-20
AIR 60.4 0.0 64.3±0.4 −3.6±0.6 65.2±0.7 1.1±0.4 65.4±0.3 0.9±0.4
ESAT 76.4 0.0 64.1±0.5 −18.3±0.7 82.0±0.4 2.0±0.2 82.7±0.5 0.1±0.3
GTS 75.4 0.0 79.3±0.3 −10.3±0.5 83.8±0.2 −0.7±0.1 84.7±0.3 −0.5±0.2
TSI 54.0 0.0 67.0±0.3 −8.1±0.6 61.8±0.2 −1.9±0.3 67.9±0.2 −1.1±0.3
CL-50
AIR 60.4 0.0 65.6±0.4 −6.1±0.3 67.1±0.3 0.5±0.2 67.7±0.3 0.7±0.3
ESAT 76.4 0.0 61.4±0.3 −27.8±0.4 81.2±0.3 −2.4±0.2 82.1±0.4 −0.8±0.2


Table 14. Exact accuracy scores (%) for each baseline used to fine-tune the model on the GTS dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 71.5±1.2 52.3±0.5 81.2±0.6 60.0±1.2 75.5±0.9 91.5±1.3 51.9±1.5 61.2±1.1 57.6±0.3 32.2±0.5
LoRA 76.3±0.8 52.6±1.4 73.3±0.6 56.7±1.2 49.3±0.8 87.1±1.3 51.8±1.2 61.3±1.2 58.1±0.3 31.6±0.4
LoRSU 82.0±1.3 53.5±1.3 82.4±0.8 60.8±1.4 66.6±0.9 91.5±1.4 51.6±0.7 61.7±1.4 59.8±0.2 31.6±0.2
CL-5
LoRA-Ppl 68.1±0.8 54.5±1.4 80.7±0.6 59.3±1.2 52.8±0.8 90.7±1.3 51.7±1.2 60.7±1.2 54.8±0.4 33.4±0.5
LoRA-F 72.9±0.9 54.0±0.7 81.5±0.9 59.6±0.8 61.9±0.8 90.3±1.1 51.9±0.8 60.9±1.2 58.4±0.4 31.1±0.3
LoRSU-Ppl 77.2±1.4 55.1±1.5 82.1±0.7 58.9±1.0 67.0±0.6 90.9±1.3 51.8±0.6 61.6±1.3 58.7±0.3 30.4±0.3
LoRA-L 74.2±0.9 52.2±0.9 82.1±0.5 59.6±1.0 75.9±0.6 91.8±1.0 51.6±0.4 62.1±0.9 59.1±0.2 31.8±0.2
LoRA 78.1±0.8 55.6±0.3 59.0±0.9 47.6±0.4 26.0±0.6 83.6±0.8 52.1±0.5 62.1±1.0 53.7±0.3 30.8±0.2
LoRSU 84.2±0.9 52.9±0.6 82.2±0.5 60.7±0.6 64.7±0.6 90.8±0.5 51.9±0.4 61.7±0.5 59.5±0.1 31.6±0.2
CL-20
LoRA-Ppl 75.1±0.9 50.4±0.9 75.8±0.4 56.5±0.3 40.1±0.9 89.7±0.8 51.6±0.7 57.8±0.8 54.2±0.2 31.5±0.4
LoRA-F 74.2±0.8 52.7±0.3 80.1±0.9 59.5±0.4 66.0±0.6 90.1±0.8 52.1±0.5 64.7±1.0 60.4±0.4 32.3±0.2
LoRSU-Ppl 79.5±0.8 56.1±0.5 82.1±0.9 59.8±0.4 66.1±0.4 90.8±1.0 51.7±0.5 62.1±0.6 59.0±0.3 31.5±0.3
LoRA-L 74.9±0.2 51.7±0.2 81.8±0.2 59.8±0.3 75.8±0.1 91.5±0.0 52.0±0.1 61.1±0.2 57.4±0.1 31.8±0.1
LoRA 78.7±0.0 50.7±0.0 62.1±0.2 47.4±0.1 24.2±0.2 82.9±0.3 51.7±0.3 61.0±0.2 54.3±0.1 30.8±0.0
LoRSU 85.3±0.1 54.2±0.1 81.9±0.2 60.5±0.2 61.4±0.3 91.0±0.1 51.7±0.2 62.2±0.4 58.9±0.1 31.8±0.1
CL-50
LoRA-Ppl 74.2±0.1 49.4±0.2 76.0±0.2 57.9±0.3 37.2±0.0 89.5±0.2 51.7±0.1 57.7±0.1 55.6±0.1 29.8±0.1
LoRA-F 71.7±0.2 51.7±0.4 80.8±0.4 58.3±0.0 60.9±0.3 90.8±0.1 52.1±0.0 63.3±0.1 57.5±0.0 30.9±0.1
LoRSU-Ppl 82.5±0.0 55.8±0.0 82.1±0.2 59.9±0.1 65.4±0.2 91.0±0.3 51.6±0.3 61.7±0.2 62.3±0.1 32.2±0.0

Table 15. Exact accuracy scores (%) for each baseline used to fine-tune the model on the TSI dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 76.0±1.5 59.1±0.6 82.7±0.9 60.7±0.7 75.9±0.9 91.5±1.0 51.5±0.9 63.6±1.2 54.1±0.4 31.2±0.4
LoRA 73.4±1.0 53.0±0.9 80.2±0.6 58.8±0.7 59.1±1.4 90.2±1.1 51.6±1.3 61.2±1.4 56.7±0.4 31.7±0.4
LoRSU 75.9±0.9 56.3±0.7 82.7±0.9 60.8±1.0 76.2±1.4 91.3±1.2 51.6±0.9 61.7±0.8 57.7±0.3 31.2±0.3
CL-5
LoRA-Ppl 75.0±1.0 64.0±0.6 82.8±1.3 58.4±1.0 60.8±0.8 88.7±1.3 51.6±1.4 61.5±1.0 55.0±0.4 32.2±0.4
LoRA-F 75.3±0.5 45.1±1.1 82.5±0.9 57.2±1.5 73.2±1.0 83.9±1.2 53.8±0.9 64.3±1.3 45.6±0.3 30.9±0.4
LoRSU-Ppl 76.1±1.1 66.2±1.0 83.9±1.1 66.1±0.9 76.1±1.2 91.1±1.4 52.0±0.9 64.4±1.4 60.8±0.5 31.1±0.4
LoRA-L 76.1±0.7 59.0±0.6 82.4±0.4 60.8±0.4 75.7±0.9 91.3±0.7 51.5±0.9 63.9±1.0 55.4±0.3 30.8±0.3
LoRA 68.5±0.7 61.6±0.3 76.7±0.9 55.3±0.7 55.6±0.6 88.8±0.8 51.9±0.3 61.4±0.6 59.1±0.3 31.1±0.3
LoRSU 75.9±0.6 63.7±0.4 82.8±0.8 60.4±0.3 73.4±0.6 90.9±0.6 51.7±0.4 61.5±0.7 58.8±0.2 31.9±0.2
CL-20
LoRA-Ppl 62.1±0.6 59.6±0.5 71.9±0.6 48.3±0.7 42.5±1.0 75.8±0.8 51.6±0.6 49.0±0.5 49.7±0.3 32.4±0.2
LoRA-F 76.1±0.5 56.0±0.5 82.8±0.9 58.2±0.4 67.7±0.9 87.5±0.8 51.6±0.8 64.4±0.5 40.3±0.4 31.2±0.2
LoRSU-Ppl 76.4±0.7 67.0±0.4 83.0±0.7 57.4±0.4 74.0±0.8 88.1±0.3 51.8±0.6 63.6±0.5 57.6±0.2 30.8±0.3
LoRA-L 76.4±0.2 63.0±0.2 81.9±0.2 60.5±0.2 75.6±0.2 91.1±0.2 51.7±0.2 64.1±0.3 55.6±0.2 30.9±0.0
LoRA 66.1±0.2 71.3±0.3 76.0±0.1 56.0±0.1 44.5±0.2 88.9±0.3 51.8±0.1 60.4±0.2 56.3±0.1 31.6±0.1
LoRSU 75.3±0.2 72.2±0.4 82.4±0.3 59.7±0.3 72.5±0.3 90.8±0.3 51.7±0.2 61.7±0.4 58.5±0.1 31.7±0.0
CL-50
LoRA-Ppl 46.3±0.3 51.5±0.3 63.4±0.1 40.1±0.1 41.3±0.4 73.9±0.2 51.7±0.3 49.5±0.3 40.2±0.1 32.7±0.1
LoRA-F 74.0±0.2 68.2±0.1 81.6±0.3 59.2±0.0 75.1±0.2 88.5±0.2 56.8±0.1 65.0±0.3 50.8±0.1 30.4±0.1
LoRSU-Ppl 75.8±0.2 75.1±0.2 82.1±0.3 56.0±0.4 74.2±0.4 86.0±0.4 52.0±0.0 63.2±0.0 58.1±0.1 30.2±0.1


Table 16. Exact accuracy scores (%) for each baseline used to fine-tune the model on the CAn dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 75.5±1.4 53.1±0.8 79.4±1.4 59.2±0.6 75.2±0.9 91.5±1.1 52.4±1.3 60.2±1.1 57.7±0.5 32.1±0.3
LoRA 69.7±1.4 44.8±1.1 81.4±0.7 56.9±1.0 50.7±1.3 92.9±1.3 52.0±1.0 61.8±1.5 56.5±0.4 31.3±0.4
LoRSU 75.2±0.8 52.7±0.9 83.0±1.0 60.1±0.7 76.8±1.0 91.8±1.4 51.6±1.1 62.3±1.2 58.7±0.3 31.4±0.4
CL-5
LoRA-Ppl 65.8±1.1 50.7±0.6 79.2±0.5 48.4±1.4 63.0±1.2 86.7±1.3 51.8±1.0 57.2±1.4 52.5±0.3 32.4±0.4
LoRA-F 70.1±0.6 52.2±0.7 78.6±0.7 50.9±0.9 73.4±0.8 91.3±1.0 54.7±0.8 62.2±1.4 58.0±0.5 31.3±0.3
LoRSU-Ppl 74.6±0.9 51.3±1.4 82.9±1.2 58.4±1.2 77.7±1.2 91.8±1.3 51.5±1.1 64.7±1.4 56.5±0.6 29.8±0.3
LoRA-L 73.6±1.0 52.2±0.9 80.8±0.9 56.7±0.4 74.7±0.8 91.7±0.5 52.2±0.6 60.9±0.8 59.1±0.3 31.9±0.4
LoRA 67.5±0.6 48.9±0.6 80.4±0.4 57.3±0.9 39.7±0.4 91.1±0.6 51.8±0.9 61.7±0.3 60.1±0.2 31.9±0.3
LoRSU 75.3±0.8 53.1±0.9 83.8±0.9 58.8±1.0 75.5±0.7 92.0±0.3 51.9±0.4 62.3±0.6 60.4±0.2 31.6±0.2
CL-20
LoRA-Ppl 65.6±0.9 47.0±0.7 79.0±0.4 46.0±0.6 58.9±0.8 82.5±0.8 51.9±0.7 43.9±1.0 52.5±0.4 30.4±0.3
LoRA-F 69.4±0.9 54.9±0.4 80.6±0.4 50.4±0.5 72.0±0.8 91.2±0.5 51.9±0.9 64.3±1.0 57.0±0.3 31.6±0.3
LoRSU-Ppl 72.4±0.6 49.2±0.4 83.2±0.7 56.4±0.9 75.5±0.6 91.8±0.9 51.6±0.5 61.0±0.8 57.7±0.3 31.6±0.3
LoRA-L 73.8±0.1 51.6±0.2 80.9±0.2 56.9±0.1 74.9±0.2 91.3±0.3 51.7±0.2 61.2±0.3 58.0±0.1 32.4±0.1
LoRA 66.8±0.2 47.8±0.3 82.3±0.2 55.7±0.0 52.0±0.3 91.0±0.3 51.7±0.3 61.6±0.2 60.2±0.0 31.6±0.1
LoRSU 75.0±0.2 51.8±0.1 84.0±0.4 58.5±0.2 72.7±0.3 91.9±0.3 51.7±0.1 62.3±0.4 58.1±0.0 31.7±0.1
CL-50
LoRA-Ppl 56.2±0.4 36.4±0.0 80.9±0.1 48.5±0.3 54.1±0.3 78.1±0.2 53.6±0.4 62.3±0.3 48.4±0.1 32.4±0.1
LoRA-F 69.2±0.2 52.0±0.2 80.6±0.1 53.7±0.3 74.4±0.1 90.7±0.2 51.8±0.4 66.5±0.0 58.7±0.1 31.4±0.1
LoRSU-Ppl 74.9±0.4 49.7±0.4 83.7±0.0 42.5±0.4 74.9±0.2 91.2±0.3 51.2±0.3 52.2±0.4 58.5±0.2 32.3±0.2

Table 17. Exact accuracy scores (%) for each baseline used to fine-tune the model on the AIR dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 75.6±0.7 54.4±0.5 81.8±1.1 58.7±0.9 75.7±1.4 92.0±1.4 51.6±0.9 61.0±0.6 59.1±0.3 32.2±0.5
LoRA 70.9±0.9 52.7±0.6 79.0±0.7 61.7±0.5 48.8±0.7 90.6±0.6 52.0±0.9 62.5±0.8 60.0±0.3 31.1±0.2
LoRSU 76.2±0.8 53.4±1.4 82.5±1.0 65.2±1.3 76.0±0.9 91.8±0.8 51.6±0.8 62.1±1.1 59.0±0.4 31.2±0.3
CL-5
LoRA-Ppl 74.9±0.8 54.2±1.2 79.1±0.5 59.7±0.9 68.5±0.9 90.8±1.3 51.8±0.7 62.0±0.7 55.1±0.5 31.1±0.5
LoRA-F 72.3±0.5 50.6±1.3 78.7±1.4 70.0±1.3 64.4±0.9 90.9±0.6 54.9±1.3 57.7±1.1 62.0±0.6 32.2±0.5
LoRSU-Ppl 75.6±1.0 54.6±1.2 79.8±1.0 66.2±0.5 76.4±1.1 90.6±1.3 51.7±1.3 60.1±0.9 58.8±0.4 31.1±0.4
LoRA-L 75.4±0.3 53.6±0.4 82.2±1.0 64.1±1.0 75.7±0.5 92.2±0.3 51.5±0.5 61.5±0.8 58.9±0.2 31.9±0.3
LoRA 71.8±0.9 51.1±0.8 78.6±0.3 65.7±0.4 63.4±0.8 89.9±1.0 51.7±0.3 62.3±0.3 56.2±0.2 31.5±0.2
LoRSU 75.7±0.9 52.6±0.9 81.4±0.7 66.3±0.7 73.0±0.8 90.9±0.8 51.9±0.8 61.8±0.8 56.9±0.1 31.6±0.3
CL-20
LoRA-Ppl 72.1±0.5 48.0±0.8 72.7±0.4 65.2±1.0 65.1±0.5 90.4±0.3 51.8±0.6 61.5±0.8 55.8±0.1 31.7±0.1
LoRA-F 74.5±0.8 53.0±0.3 82.0±0.6 76.7±0.6 74.9±0.9 91.1±0.3 52.4±0.6 59.3±0.8 59.6±0.4 31.3±0.3
LoRSU-Ppl 76.1±0.8 55.5±0.5 78.7±0.8 66.4±0.6 75.7±0.6 91.6±1.0 51.5±0.3 59.8±0.5 58.1±0.4 31.2±0.4
LoRA-L 75.6±0.2 53.8±0.1 83.5±0.1 65.0±0.0 75.7±0.1 92.0±0.0 51.8±0.2 61.1±0.1 58.7±0.1 32.3±0.0
LoRA 69.8±0.0 54.7±0.0 77.0±0.3 68.2±0.3 51.6±0.1 90.0±0.1 52.0±0.4 62.4±0.0 57.1±0.1 31.5±0.1
LoRSU 75.4±0.4 52.7±0.3 81.6±0.2 68.6±0.3 69.7±0.3 91.5±0.2 51.7±0.4 62.2±0.1 58.7±0.1 31.1±0.1
CL-50
LoRA-Ppl 74.4±0.1 50.9±0.4 76.8±0.2 66.6±0.3 65.4±0.2 91.3±0.1 51.6±0.1 57.2±0.2 53.7±0.1 31.5±0.1
LoRA-F 74.6±0.3 53.2±0.2 80.7±0.4 78.3±0.1 71.4±0.2 91.4±0.0 52.9±0.4 60.0±0.2 57.4±0.0 31.1±0.2
LoRSU-Ppl 75.1±0.2 54.5±0.1 78.0±0.4 69.3±0.1 75.7±0.1 91.5±0.1 51.7±0.0 61.5±0.1 58.2±0.0 30.8±0.0


Table 18. Exact accuracy scores (%) for each baseline used to fine-tune the model on the ESAT dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 75.4±0.7 52.2±1.4 82.8±0.6 60.6±1.5 75.9±1.1 91.7±0.9 51.5±0.6 60.2±0.8 57.6±0.5 31.6±0.4
LoRA 73.2±1.3 49.3±1.2 80.6±0.9 60.4±1.1 74.5±0.8 92.3±1.3 52.0±1.1 61.6±1.1 57.4±0.4 31.4±0.3
LoRSU 76.2±1.0 53.6±1.1 82.5±1.2 60.8±0.8 82.9±1.0 91.5±0.9 51.6±0.9 61.3±0.7 57.7±0.4 31.9±0.4
CL-5
LoRA-Ppl 76.0±0.7 52.6±1.0 82.6±1.3 60.4±1.4 75.5±0.9 91.9±1.0 51.8±0.9 62.8±0.8 59.0±0.4 31.6±0.5
LoRA-F 74.3±1.3 51.5±1.4 81.1±1.0 60.3±1.1 81.5±1.2 90.8±1.2 51.9±1.2 61.9±1.2 57.7±0.2 31.3±0.5
LoRSU-Ppl 75.6±1.4 52.3±0.6 82.0±1.2 60.5±1.0 79.8±1.1 92.3±0.5 51.8±1.2 62.2±1.4 57.7±0.4 31.3±0.4
LoRA-L 75.9±0.8 52.4±0.9 82.7±0.7 60.8±1.0 76.8±0.3 91.3±0.5 51.7±0.5 60.4±0.9 61.5±0.3 31.6±0.3
LoRA 71.1±0.7 50.9±0.5 80.3±1.0 59.4±0.7 64.6±0.7 91.1±0.7 52.0±0.4 62.3±0.6 62.3±0.2 31.3±0.1
LoRSU 75.3±1.0 53.7±0.8 82.8±0.4 60.7±0.8 82.7±0.7 91.6±0.6 51.6±0.4 61.5±0.4 58.4±0.2 31.4±0.2
CL-20
LoRA-Ppl 75.5±0.9 51.6±0.7 82.0±0.4 59.3±0.6 74.9±0.3 91.6±0.5 51.7±0.6 62.8±0.5 57.0±0.1 32.1±0.1
LoRA-F 74.8±0.7 52.7±1.0 81.6±0.8 59.4±0.9 71.5±0.7 91.0±0.8 51.7±0.7 63.4±0.8 58.9±0.2 31.0±0.2
LoRSU-Ppl 74.1±1.0 52.0±0.9 82.5±0.7 59.8±0.8 79.0±0.7 92.1±0.7 51.8±0.9 61.8±0.4 58.7±0.4 31.6±0.3
LoRA-L 75.6±0.2 53.0±0.1 82.7±0.3 60.6±0.3 77.1±0.2 91.5±0.2 51.7±0.1 60.7±0.0 59.8±0.1 31.4±0.1
LoRA 62.8±0.3 47.2±0.4 72.4±0.4 54.4±0.2 61.6±0.4 90.2±0.3 51.7±0.2 62.0±0.1 60.8±0.0 30.9±0.1
LoRSU 75.4±0.3 53.9±0.1 83.1±0.2 60.3±0.1 83.1±0.1 92.1±0.1 51.6±0.2 61.2±0.0 57.6±0.0 31.1±0.0
CL-50
LoRA-Ppl 74.9±0.3 51.7±0.3 81.9±0.2 59.8±0.2 77.8±0.1 92.1±0.3 51.8±0.2 62.9±0.3 59.4±0.2 31.9±0.1
LoRA-F 73.6±0.0 51.8±0.3 81.2±0.0 58.1±0.1 66.6±0.3 90.7±0.1 51.6±0.1 63.7±0.3 58.4±0.1 30.5±0.0
LoRSU-Ppl 72.9±0.1 51.1±0.3 81.3±0.4 59.4±0.4 75.4±0.2 91.6±0.2 51.7±0.1 62.7±0.4 57.5±0.1 32.1±0.0

Table 19. Exact accuracy scores (%) for each baseline used to fine-tune the model on the HM dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 76.5±1.0 51.5±1.1 83.2±1.2 60.5±0.8 75.7±1.0 90.9±0.9 51.6±0.9 68.6±0.7 34.4±0.5 31.1±0.5
LoRA 68.8±0.8 47.0±1.0 70.5±0.8 51.7±1.1 54.1±0.6 89.1±0.8 52.2±1.5 60.8±0.8 54.7±0.6 30.5±0.3
LoRSU 75.7±1.2 54.1±1.1 82.9±0.6 60.7±1.0 76.3±1.1 92.2±0.6 51.5±0.9 61.8±1.2 58.1±0.2 31.9±0.5
CL-5
LoRA-Ppl 76.2±0.6 48.4±1.4 82.5±1.2 57.2±0.9 72.8±0.9 90.9±0.9 51.8±1.0 60.0±1.0 56.4±0.4 33.1±0.4
LoRA-F 71.8±1.1 47.8±0.8 79.9±1.5 57.6±1.0 63.2±1.1 90.1±1.0 48.0±0.7 67.2±0.9 49.0±0.3 31.5±0.2
LoRSU-Ppl 76.6±1.0 51.7±1.3 83.6±1.4 60.3±0.6 75.2±0.8 90.8±1.0 51.7±1.3 60.4±1.4 60.7±0.5 31.2±0.2
LoRA-L 75.1±0.9 50.5±0.3 82.1±0.9 59.3±0.8 65.1±0.6 91.8±0.4 51.9±0.5 71.8±0.8 52.8±0.3 31.7±0.2
LoRA 68.1±1.0 46.8±0.8 76.3±0.4 56.4±0.8 49.6±0.7 87.3±0.6 51.7±0.4 59.4±0.4 59.7±0.3 31.4±0.3
LoRSU 76.1±0.8 53.0±0.7 82.7±0.5 60.4±0.6 75.7±0.4 92.1±0.7 51.8±0.8 61.9±0.5 58.4±0.2 31.5±0.2
CL-20
LoRA-Ppl 77.0±0.9 52.0±0.4 83.9±0.5 63.6±0.7 73.4±0.5 90.5±0.3 53.1±0.7 71.9±0.7 54.1±0.2 31.1±0.4
LoRA-F 75.6±0.4 50.9±0.5 80.6±0.5 60.8±0.5 71.2±0.7 90.9±0.7 52.2±0.7 72.9±0.7 53.6±0.3 31.6±0.1
LoRSU-Ppl 76.1±0.8 49.8±0.9 83.5±0.9 59.8±0.6 76.1±0.9 91.0±0.9 51.7±0.6 72.1±0.4 59.5±0.2 30.5±0.4
LoRA-L 75.8±0.2 49.5±0.3 83.4±0.3 59.9±0.3 71.1±0.3 89.9±0.3 51.7±0.1 71.4±0.2 48.7±0.1 31.1±0.0
LoRA 72.7±0.3 47.1±0.2 72.6±0.2 56.7±0.3 60.4±0.1 89.7±0.3 51.9±0.1 61.9±0.1 57.1±0.2 31.1±0.0
LoRSU 75.3±0.3 53.2±0.1 83.3±0.2 60.5±0.1 74.9±0.1 92.2±0.1 51.6±0.2 61.5±0.0 58.9±0.1 31.3±0.0
CL-50
LoRA-Ppl 76.6±0.2 49.3±0.4 81.9±0.3 60.3±0.4 72.7±0.2 89.8±0.3 52.5±0.2 73.7±0.3 52.7±0.0 30.9±0.1
LoRA-F 74.1±0.1 52.0±0.3 80.6±0.2 57.0±0.1 63.5±0.3 88.7±0.1 53.0±0.4 73.5±0.2 46.0±0.1 31.8±0.0
LoRSU-Ppl 76.0±0.1 50.4±0.1 83.4±0.1 60.6±0.4 76.4±0.1 91.4±0.2 51.9±0.1 73.4±0.4 59.8±0.1 32.0±0.1


Table 20. Exact accuracy scores (%) for each baseline used to fine-tune the model on the VSR dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 75.3±0.7 59.9±1.4 81.0±1.1 56.2±0.5 66.8±1.3 90.1±1.3 68.3±1.1 65.0±1.4 57.6±0.3 32.5±0.4
LoRA 72.6±1.3 49.5±1.5 78.2±0.8 57.5±1.5 55.0±0.9 88.8±0.7 52.0±1.0 61.9±1.5 59.7±0.3 30.4±0.5
LoRSU 75.6±0.7 52.2±1.4 82.2±0.6 60.1±0.9 77.9±0.6 91.1±1.1 51.9±1.3 62.2±1.5 58.4±0.3 31.7±0.3
CL-5
LoRA-Ppl 65.8±0.7 48.7±0.8 65.4±1.3 33.8±1.4 48.8±0.5 81.7±1.2 61.7±0.5 56.2±0.7 43.6±0.2 32.8±0.4
LoRA-F 76.0±0.9 64.5±0.8 81.2±1.3 57.6±0.6 69.7±1.5 89.4±0.8 69.5±1.0 12.8±0.5 30.3±0.5 13.0±0.3
LoRSU-Ppl 73.6±0.7 57.5±1.1 80.3±1.1 57.8±1.3 73.1±1.3 90.7±1.1 62.0±1.5 57.4±0.5 57.9±0.6 30.3±0.4
LoRA-L 77.1±0.8 54.7±0.9 84.5±0.9 61.4±0.5 75.5±0.7 90.9±0.8 73.7±0.5 64.5±0.8 56.9±0.2 32.6±0.4
LoRA 72.6±0.7 54.5±0.9 76.6±0.8 57.4±0.7 57.3±0.4 87.9±0.8 51.9±0.7 59.0±0.5 57.6±0.2 31.3±0.4
LoRSU 74.9±0.6 54.6±0.5 82.1±0.8 58.5±0.7 75.5±0.5 91.6±0.5 51.6±0.6 62.4±0.7 57.5±0.2 30.9±0.2
CL-20
LoRA-Ppl 74.9±0.4 62.2±0.4 82.4±0.3 58.2±0.7 70.5±0.7 89.0±0.6 71.0±0.8 64.8±0.5 55.8±0.2 28.6±0.2
LoRA-F 75.4±0.5 60.6±0.5 80.9±0.9 56.6±0.9 63.1±0.7 88.2±0.6 74.8±0.5 48.7±0.9 50.1±0.4 20.2±0.2
LoRSU-Ppl 72.6±0.8 52.7±0.5 81.6±0.8 60.3±0.5 69.4±0.7 89.6±0.5 74.4±0.9 62.5±0.8 57.1±0.3 29.7±0.4
LoRA-L 77.2±0.3 56.5±0.1 84.5±0.0 61.4±0.2 76.4±0.2 91.5±0.3 73.4±0.1 65.3±0.2 54.4±0.1 31.5±0.1
LoRA 73.4±0.1 53.8±0.0 74.6±0.4 56.7±0.1 56.2±0.1 87.0±0.2 51.9±0.0 59.2±0.2 57.6±0.1 30.8±0.0
LoRSU 75.3±0.1 54.7±0.1 81.6±0.1 58.3±0.2 75.7±0.1 91.4±0.4 53.8±0.2 62.1±0.3 57.3±0.1 30.8±0.0
CL-50
LoRA-Ppl 71.7±0.1 48.7±0.1 75.1±0.2 46.3±0.4 64.6±0.3 87.9±0.2 71.7±0.4 61.9±0.2 55.1±0.1 30.9±0.0
LoRA-F 76.3±0.3 64.2±0.2 84.5±0.4 58.1±0.3 69.6±0.1 90.1±0.1 72.5±0.3 64.6±0.1 61.4±0.1 30.6±0.1
LoRSU-Ppl 72.1±0.2 49.8±0.1 74.8±0.3 57.6±0.0 71.0±0.4 88.2±0.1 74.9±0.1 58.3±0.2 55.4±0.2 30.0±0.2

Table 21. Exact accuracy scores (%) for each baseline used to fine-tune the model on the MMVP dataset under the 30-shot continual
learning setting described in C.2.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 75.5 52.8 82.0 60.5 76.0 91.5 51.5 63.6 57.7 30.6
LoRA-Ppl 75.5 53.6 83.0 60.3 75.6 91.1 51.5 63.1 60.7 31.7
CL
LoRA-F 75.2 52.9 81.3 60.5 74.3 90.4 51.6 63.6 60.0 31.4
LoRSU-Ppl 75.1 52.0 81.2 57.4 75.2 90.2 51.7 63.9 60.3 30.8


Table 22. Exact accuracy scores (%) for each baseline used to fine-tune the model on the VisOnly dataset under three different continual
learning (5, 20, 50 shots) settings. We include error bars over 3 runs.

VQA Datasets (Acc %)


Setting PEFT Method GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3
LoRA-L 76.5±1.2 51.9±0.7 82.4±1.4 60.5±1.5 76.1±1.0 91.5±0.6 51.6±0.9 60.3±1.0 57.6±0.3 31.3±0.4
LoRA 70.9±1.4 52.1±1.2 77.5±1.3 55.6±0.6 52.6±0.8 89.3±0.6 51.7±0.8 61.7±0.7 56.9±0.6 30.9±0.5
LoRSU 75.9±0.7 53.1±0.8 82.5±0.7 60.4±1.0 76.1±1.5 91.9±0.8 51.5±1.3 61.3±1.2 58.9±0.4 31.5±0.2
CL-5
LoRA-Ppl 76.3±1.1 50.7±1.1 82.2±0.9 61.0±1.3 73.4±0.9 91.7±1.3 52.1±1.1 59.3±1.3 58.0±0.2 35.0±0.5
LoRA-F 76.0±0.8 51.1±1.4 82.9±1.1 59.9±0.7 71.2±1.2 91.7±1.1 51.6±1.3 60.8±0.7 58.4±0.2 34.9±0.4
LoRSU-Ppl 76.2±1.1 53.0±0.9 83.4±0.7 61.3±1.4 76.6±0.8 92.3±0.5 52.0±1.0 61.6±0.7 60.7±0.3 32.0±0.5
LoRA-L 77.8±1.0 53.0±0.8 83.4±0.4 62.1±0.6 75.5±0.8 91.6±0.4 52.4±0.9 61.2±0.6 55.6±0.3 32.5±0.3
LoRA 73.3±0.9 49.3±0.4 77.9±0.6 56.4±0.6 47.7±0.8 91.2±0.6 51.8±0.8 61.5±0.6 57.0±0.3 32.8±0.1
LoRSU 75.7±0.5 53.3±0.7 82.0±0.5 60.0±0.5 76.1±0.6 91.9±0.9 51.7±0.6 61.6±0.3 58.2±0.3 31.5±0.4
CL-20
LoRA-Ppl 78.0±0.4 52.8±0.4 83.7±0.6 60.9±0.7 74.3±0.4 91.5±0.7 51.9±0.5 61.7±0.7 56.0±0.2 32.8±0.3
LoRA-F 77.4±0.6 51.7±0.9 83.7±0.6 59.7±0.7 73.9±0.9 91.2±0.5 53.4±0.4 62.0±0.9 56.9±0.4 31.0±0.3
LoRSU-Ppl 76.7±0.5 53.7±0.4 83.8±0.6 61.4±0.3 75.5±0.6 91.2±0.8 51.8±0.3 61.9±0.9 59.6±0.4 31.3±0.2
LoRA-L 76.4±0.4 54.5±0.3 84.1±0.3 61.3±0.0 73.9±0.1 91.5±0.1 51.9±0.3 62.8±0.1 55.4±0.0 32.1±0.1
LoRA 70.0±0.1 46.8±0.0 70.5±0.1 51.0±0.2 50.9±0.0 88.1±0.0 52.0±0.3 61.2±0.3 57.8±0.2 31.7±0.1
LoRSU 75.6±0.4 53.1±0.1 81.7±0.3 58.2±0.1 75.3±0.2 91.8±0.3 51.7±0.1 62.1±0.1 58.3±0.1 31.9±0.0
CL-50
LoRA-Ppl 76.9±0.4 54.6±0.1 84.1±0.3 60.5±0.2 74.9±0.4 91.2±0.3 51.8±0.3 62.5±0.3 56.0±0.1 33.0±0.0
LoRA-F 77.1±0.0 53.0±0.2 83.9±0.4 60.9±0.1 73.1±0.1 92.2±0.3 51.9±0.2 61.4±0.4 58.0±0.0 32.5±0.1
LoRSU-Ppl 76.1±0.3 51.5±0.2 81.6±0.1 60.2±0.0 75.6±0.2 92.2±0.3 52.0±0.2 61.2±0.3 58.3±0.0 33.5±0.1

E. Extra Ablation Studies


E.1. Ablation on the rank r of LoRSU
In Table 23, we investigate how the rank r of LoRSU affects performance. As the rank r increases, the VQA accuracy on the target dataset slightly improves, peaking at r = 64; beyond that, performance slightly decreases. Performance on the other datasets remains relatively stable, with small fluctuations.

E.2. Ablation on the number of optimal attention heads of LoRSU


In Table 24, we examine how the number of attention heads chosen for fine-tuning affects LoRSU’s performance. We notice that more attention heads marginally improve the performance of the model, while the extra flexibility can cause more forgetting, e.g., on ESAT.

E.3. Robustness on the Choice of Attention Heads


We show in Table 25 that LoRSU’s mechanism for choosing the most important attention heads provides a clear advantage in terms of robustness over its two variants, LoRSU-Rand and LoRSU-AAH. We can see that TI and CC decline at a lower rate than those of LoRSU-Rand and LoRSU-AAH as we increase the number of training epochs. As expected, LoRSU-Rand appears to be the least robust method, since the random choice of attention heads makes it more unstable.
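As a reading aid for Table 25, and assuming TI and CC follow their usual definitions in the main paper, Target Improvement measures the accuracy gain on the fine-tuning (target) dataset relative to the zero-shot model, while Control Change measures the corresponding accuracy change on a control dataset (here ESAT):

\[
\mathrm{TI} = \mathrm{Acc}^{\,\mathrm{fine\text{-}tuned}}_{\mathrm{target}} - \mathrm{Acc}^{\,\mathrm{zero\text{-}shot}}_{\mathrm{target}}, \qquad
\mathrm{CC} = \mathrm{Acc}^{\,\mathrm{fine\text{-}tuned}}_{\mathrm{control}} - \mathrm{Acc}^{\,\mathrm{zero\text{-}shot}}_{\mathrm{control}},
\]

so larger TI is better, and CC values closer to zero (or positive) indicate less forgetting on the control data.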

F. TSI vs. DALLE


In Figures 4 through 7, we present example images from TSI and DALLE for different actions. In general, we observe that TSI consists of natural, unposed images of senior individuals performing daily tasks, reflecting real-life scenarios. The images are broader, showing the surrounding environment, which is crucial for context. On the other hand, DALLE images are idealized or stylized. Their focus is narrower, with emphasis on the object of the action (e.g., tablet, glass, etc.).


Table 23. Ablation study over the effect of the rank r used by LoRSU to fine-tune the image encoder, CLIP-L-14. We report the VQA
accuracies of the last session in the 50-shot CL setting. The accuracies on the target dataset are in red color. For this experiment, we use
two attention heads to fine-tune with LoRSU.

VQA Datasets (Acc %)


FT Dataset rank (r) GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
8 83.0 53.2 81.3 60.9 61.0 91.2 51.5 61.6 60.0 31.6
16 83.9 53.4 81.5 60.2 54.0 91.4 51.5 62.1 60.7 31.6
32 84.8 53.1 81.9 60.5 58.0 90.6 51.6 61.8 58.7 31.5
GTS
64 84.9 53.2 81.3 60.7 61.7 90.9 51.5 61.9 59.3 31.3
128 84.3 53.2 81.8 60.6 56.8 91.5 51.6 61.8 58.7 31.2
256 84.5 53.1 81.5 61.1 51.5 90.3 51.6 62.0 58.7 31.6
8 75.2 67.2 82.0 59.2 71.6 91.1 51.5 61.6 58.0 31.5
16 75.4 68.0 82.3 59.1 71.0 90.6 51.6 61.6 56.7 31.2
32 74.9 68.9 81.8 59.3 70.1 91.2 51.5 61.6 58.0 31.6
TSI
64 75.3 72.1 82.0 59.3 72.3 90.5 51.6 61.4 58.0 31.6
128 75.1 65.8 81.7 59.0 70.0 90.6 51.5 62.1 56.7 31.6
256 75.4 66.4 82.3 59.6 72.0 91.2 51.5 62.1 56.7 31.5
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3

Table 24. Ablation study over the effect of the number of attention heads used by LoRSU to fine-tune the image encoder. We report the
VQA accuracies of the last session in the 50-shot CL setting. The accuracies on the target dataset are in red color. For this experiment, we
use r = 64 for the rank of LoRSU.

VQA Datasets (Acc %)


FT Dataset # heads GTS TSI CAn AIR ESAT DALLE VSR HM MMVP VisOnly
0 83.1 52.7 82.2 60.8 60.6 91.1 51.6 61.7 59.3 31.6
1 83.9 53.8 82.0 60.7 55.4 91.2 51.6 61.6 60.0 31.8
2 84.9 53.2 81.3 60.7 61.7 90.9 51.5 61.9 59.3 31.3
GTS
4 84.7 53.5 81.0 60.5 60.5 90.6 51.5 61.8 58.7 31.5
8 84.9 52.9 81.2 60.5 58.8 90.5 51.5 61.6 59.3 31.5
16 85.0 53.1 81.3 60.0 59.2 90.6 51.5 61.6 56.7 31.3
0 75.1 64.2 82.1 59.3 72.2 90.8 51.5 61.8 57.3 31.5
1 75.3 64.8 81.9 59.5 74.0 90.5 51.5 61.6 58.0 32.0
2 75.3 72.1 82.0 59.3 72.3 90.5 51.6 61.4 58.0 31.6
TSI
4 74.9 66.8 82.2 58.9 74.0 90.5 51.5 62.1 58.0 31.4
8 74.7 67.4 81.7 59.1 71.5 91.2 51.5 62.2 58.0 31.7
16 75.3 65.2 81.8 59.9 69.1 90.5 51.5 61.6 58.0 31.3
Zr-Shot 75.6 53.1 82.7 60.4 76.1 91.1 51.5 61.2 58.0 31.3


Table 25. Robustness comparison of LoRSU with respect to the number of training epochs. We consider LoRSU, LoRSU-Rand, where the k attention heads are chosen randomly, and LoRSU-AAH, where all attention heads are chosen for fine-tuning. We use 50 shots on GTS for each method and report the Target Improvement (TI) on this dataset and the Control Change (CC) using only ESAT as a control dataset. We include error bars over 3 runs.

LoRSU-Rand LoRSU-AAH LoRSU


# Epochs
TI (↑) CC (↑) TI (↑) CC (↑) TI (↑) CC (↑)
2 5.2±0.9 −11.1±1.1 6.1±0.3 −11.6±0.7 5.6±0.4 −9.7±0.8
5 7.6±0.8 −15.0±0.9 9.3±0.4 −15.6±0.6 8.6±0.3 −12.6±0.5
10 7.8±0.5 −18.1±0.8 9.1±0.1 −19.6±0.5 9.7±0.1 −14.3±0.7
20 5.9±0.6 −20.0±0.7 8.1±0.1 −21.5±0.6 7.4±0.2 −15.7±0.6

Figure 4. Instances of the ‘Use Laptop’ action: (a) TSI, (b) DALLE.

Figure 5. Instances of the ‘Watching TV’ action: (a) TSI, (b) DALLE.


Figure 6. Instances of the ‘Use Tablet’ action: (a) TSI, (b) DALLE.

Figure 7. Instances of the ‘Use a telephone’ action: (a) TSI, (b) DALLE.
