Efficient Few-Shot Continual Learning in Vision-Language Models
Figure 1. (Left) Responses of the pretrained LLaVA to samples from the TSI dataset (bottom) compared to DALL·E 2 generated images (top) for the 'cooking on a stove' class. (Right) LLaVA's correct response to the same TSI image after fine-tuning LLaVA using LoRSU.
[Figure content: Q: "What is this person doing?"; GT: "Cooking on a stove". LLaVA on the DALL·E 2 image: "This person is cooking a meal in a kitchen." LLaVA on the TSI image: "The person is standing at a kitchen sink, washing dishes." LLaVA-LoRSU after the update: "The person is standing in a kitchen, preparing food. They are using a pot to cook something, possibly stirring the contents of the pot."]

... a low number of samples from the TSI dataset compared to the pretrained LLaVA's (wrong) response.

Through extensive experiments, we demonstrate that updating the image encoder is essential for improving the performance of the VLM that relies on it. More importantly, this approach is computationally efficient, as the image encoder has significantly fewer parameters than the language model, and the method is less prone to forgetting, especially of the LLM's knowledge.

We evaluated our approach on various VQA tasks, comparing against state-of-the-art CL methods and the PEFT baseline LoRA (Hu et al., 2021) under several few-shot CL settings. We show significant improvements of the full VLM on all settings and very low rates of forgetting, without using any replay buffer of data from previous tasks. By selectively updating the image encoder, our method provides a robust and efficient solution for handling visual shifts. This targeted adaptation strategy avoids the need to modify the entire model, preserving existing knowledge whilst ensuring strong performance in new domains.

The contributions of the paper are as follows:

• We propose LoRSU, a novel replay-free PEFT method tailored for few-shot continual learning.

• We introduce two new VQA datasets, TSI and DALLE, created to expose the limitations of pre-trained image encoders in VLMs.

• We conduct the first large-scale study of few-shot CL in VLMs, evaluating LoRSU across ten diverse VQA datasets and benchmarking against state-of-the-art PEFT and CL methods. LoRSU consistently outperforms all baselines.

Few-shot Continual Learning. Our work falls within the continual learning literature, where a model needs to be updated incrementally as new data arrive, accumulating knowledge over tasks and reducing forgetting of previously acquired information (De Lange et al., 2021).

Continual Learning for Multimodal Language Models. Wu et al. (2024) provide a survey on continual learning for LLMs, highlighting challenges of computational efficiency and forgetting. Srivastava et al. (2024) explored continual multi-modal learning on VQA datasets, keeping the vision encoder frozen. He et al. (2023b) examined continual instruction tuning with sequential VQA datasets, proposing a method where the projection head is expanded for each new task. Das et al. (2024) introduced a pseudo-rehearsal strategy for vision-language models, updating only the language projection layer. Our method adapts only the vision encoder, preserving language capabilities.

Continual Learning with Few-Shot Updates. Verwimp et al. (2023) posit that an ideal continual learning solution would enable continual correction of a model's mistakes at a lower computational cost than retraining from scratch. However, most continual few-shot learning from pre-trained models focuses on classification tasks and introduces solutions that cannot scale to large multimodal models. Panos et al. (2023) update the vision encoder on the first task only, later adapting a covariance matrix for incoming tasks. Goswami et al. (2024) calibrate the covariance matrix for new classes based on semantic similarity. Zhao et al. (2024) introduce few and slow updates, proposing a transfer loss function and a cross-classification loss to mitigate catastrophic forgetting. Few-shot updates can also be viewed through the lens of model editing (Sinitsin et al., 2020). MEND (Mitchell et al., 2022) scales model editing to large language models by transforming the gradient obtained from fine-tuning through a low-rank decomposition fed to auxiliary networks designed to make fast, local edits to a pre-trained model, requiring a set of unrelated examples to prevent forgetting. ROME (Meng et al., 2022) applies causal tracing to identify layers where incorrect factual knowledge is stored, applying a low-rank update. However, ROME does not scale to continual updates or non-association types of updates. Cheng et al. (2023) studied multi-modal editing, showing negligible deterioration in multi-modal task performance when updating language models but severe forgetting when updating vision encoders. In contrast, our method focuses on adapting the vision encoder rather than updating the factual knowledge in the LLM, yet achieves strong performance gains and negligible forgetting.

Continual Learning of Pre-Trained Image Encoders. SPT (He et al., 2023a) estimates a mask of updates based on parameter sensitivity, performing low-rank or
Figure 2. LoRSU mechanism: after computing the gradient ∇_θ L_t(θ) over the target dataset at time t, LoRSU picks a small number of attention heads and a small number of parameters from the first linear layer of the MLP module in the transformer block, based on the magnitude of the gradients ∇_{W_Attn} L_t and ∇_{W_fc1} L_t, respectively. Computational efficiency is ensured by introducing LoRA adapters to the attention weight matrices.

sparse updates. SPU (Zhang et al., 2024) localizes updates to the first feed-forward layer of each transformer block, inspired by knowledge neuron theory (Dai et al., 2021). Our approach generalizes updates to all layers, selecting relevant parameters and maintaining gradient norms, combined with LoRA on selected attention heads for adaptivity and stability, achieving SOTA performance on continual few-shot multimodal tasks.

3. Low-Rank Adaptation with Structured Updates

Few-shot continual learning is a highly practical and challenging scenario, where models must incrementally adapt to new tasks with limited supervision while retaining previously acquired knowledge. This setting closely mirrors real-world applications, such as interactive AI assistants and autonomous systems, where models receive a continuous stream of novel data but only sparse supervision per update. To address the challenge of efficiently fine-tuning large-scale visual encoders and transformer-based models under the few-shot continual learning setting, without causing catastrophic forgetting (i.e., degradation in performance on previously learned tasks), we propose a novel parameter-efficient fine-tuning method called Low-Rank Adaptation with Structured Updates (LoRSU).

LoRSU updates specific parameters within each transformer block in a resource-efficient manner, mitigating the risk of generic knowledge loss when fine-tuning for new tasks. Specifically, we selectively update a subset of parameters from the first linear layer in the MLP block of each transformer layer, as proposed in (Zhang et al., 2024). While this approach reduces the fine-tuning burden, it may limit model flexibility, as the remaining parameters in the transformer block remain fixed. To enhance flexibility, we further update the most informative attention heads based on the gradient of the task-specific loss.

More specifically, let D_t = {x_n, y_n}_{n=1}^{N_t} be the dataset for the current task t, where x_n is an image with text description y_n. We define L(θ; D_t) := L_t(θ) as the loss used for training the model, and θ ∈ R^d is the full set of the model's parameters. The standard Multi-head Self-Attention mechanism (MSA) (Vaswani et al., 2017), comprised of H D_h-dimensional heads, is defined as the concatenation of multiple self-attention (SA) blocks, where q^(i) = W_q^(i) Z^⊤, k^(i) = W_k^(i) Z^⊤, v^(i) = W_v^(i) Z^⊤ ∈ R^{D_h×N} are the query, key and value matrices, which are used to compute the self-attention outputs as follows:

A^(i) = softmax(q^(i)⊤ k^(i) / √D_h) ∈ R^{N×N},   (1)
SA_i(Z) = A^(i) v^(i)⊤ ∈ R^{N×D_h},   i = 1, . . . , H.   (2)

Z ∈ R^{N×D} is the input matrix of N tokens of dimension D, and W_q^(i), W_k^(i), and W_v^(i) are the query, key, and value matrices of learnable parameters for head i, respectively. The final MSA function is defined as MSA(Z) = Concat[SA_1(Z), . . . , SA_H(Z)] W_o ∈ R^{N×D}, with W_o ∈ R^{HD_h×D}.
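To make the notation above concrete, the following is a minimal PyTorch sketch of the MSA computation in (1)-(2), assuming the per-head rows of W_q, W_k, W_v are stored contiguously; the head layout and function name are illustrative, not the actual CLIP implementation.

```python
import math
import torch

def multi_head_self_attention(Z, Wq, Wk, Wv, Wo, H):
    """MSA of eqs. (1)-(2): Z is (N, D); Wq, Wk, Wv and Wo are (H*Dh, D)."""
    N, D = Z.shape
    Dh = Wq.shape[0] // H
    q = (Wq @ Z.T).view(H, Dh, N)   # q^(i) in R^{Dh x N}, one block per head
    k = (Wk @ Z.T).view(H, Dh, N)
    v = (Wv @ Z.T).view(H, Dh, N)
    A = torch.softmax(q.transpose(1, 2) @ k / math.sqrt(Dh), dim=-1)  # (H, N, N), eq. (1)
    SA = A @ v.transpose(1, 2)                                        # (H, N, Dh), eq. (2)
    concat = torch.cat([SA[i] for i in range(H)], dim=-1)             # Concat over heads, (N, H*Dh)
    return concat @ Wo                                                # MSA(Z) in R^{N x D}
```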
Since we aim to update the parameters of the heads that cause the largest changes in L_t(θ), we compute the gradient of the loss with respect to the parameters of each head and then update only those heads with the largest cumulative contribution to the loss change. Since the matrices W_q^(i), W_k^(i), W_v^(i) are all the parameters of head i, we can define an importance score for each head by adding the squared values of their corresponding gradients G_q^(i) = ∇_{W_q^(i)} L_t,
G_k^(i) = ∇_{W_k^(i)} L_t, and G_v^(i) = ∇_{W_v^(i)} L_t, as follows:

s_i = Σ_{m,l} [ (G_q^(i)[m, l])² + (G_k^(i)[m, l])² + (G_v^(i)[m, l])² ].   (3)

We provide a theoretical justification of (3) in the next section. We update only the top-k heads, based on their importance scores {s_1, . . . , s_H}, and denote by I ⊂ {1, . . . , H} the set of heads selected for the current task. Nevertheless, the number of parameters remains high due to the large weight matrices. Therefore, we parametrize the original weights using LoRA (Hu et al., 2021) to further reduce the computational burden. The matrices W_q^(i), W_k^(i), W_v^(i), i ∈ I, are now defined as

W_α^(i)′ = W_α^(i) + A_α^(i) B_α^(i),   α ∈ {q, k, v}.   (4)
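A minimal PyTorch sketch of the head-selection step just described: computing the scores of (3) from the gradients of the query/key/value weights and keeping the top-k heads that will receive the LoRA adapters of (4). The per-head row layout and the function names are assumptions for illustration, not the exact LoRSU implementation.

```python
import torch

@torch.no_grad()
def head_importance_scores(grad_Wq, grad_Wk, grad_Wv, num_heads):
    """Importance score s_i of eq. (3) for each attention head.

    Each gradient has shape (H * Dh, D); the rows of head i are assumed contiguous,
    so reshaping to (H, Dh * D) groups the entries head by head.
    """
    scores = torch.zeros(num_heads, device=grad_Wq.device)
    for g in (grad_Wq, grad_Wk, grad_Wv):
        scores += g.view(num_heads, -1).pow(2).sum(dim=1)  # sum of squared gradient entries
    return scores  # s_1, ..., s_H

def select_top_heads(scores, k=2):
    """Indices I of the k heads with the largest scores (k = 2 in our experiments)."""
    return torch.topk(scores, k=k).indices.tolist()

# Only the q/k/v matrices of the selected heads are then re-parametrized as in (4)
# with low-rank factors A, B; all other heads stay frozen.
```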
Finally, to ensure that we only update W_q^(i), W_k^(i), W_v^(i), ∀i ∈ I, we use a binary mask on the gradient vector with respect to all parameters of all attention heads. We keep the projection matrix W_o frozen. We note that most modern implementations of transformer blocks concatenate the three attention weight matrices W_q, W_k, W_v into one, and thus we only need to apply LoRA once to this concatenated matrix.

Regarding the first linear layer in the MLP module, W_fc1 ∈ R^{d×D}, we mask the gradients of W_fc1 so that only the most important parameters for the current task are updated, i.e., we use the following biased gradient update:

∇̂_{W_fc1} L_t = M_fc1 ⊙ ∇_{W_fc1} L_t,   (5)

where M_fc1 ∈ {0, 1}^{d×D} is a zero-one mask built by choosing a proportion of the largest squared values of ∇_{W_fc1} L_t, in a similar manner as in (Zhang et al., 2024), and ⊙ is the Hadamard product.
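For the fc1 update of (5), a sketch of how the mask M_fc1 can be built from the task gradient and applied as a Hadamard product before the optimizer step; the 10% keep ratio matches Appendix B, and whether the mask is recomputed per step or once per task is an implementation choice not fixed here.

```python
import torch

def apply_fc1_gradient_mask(fc1_weight: torch.nn.Parameter, keep_ratio: float = 0.10):
    """Zero out all but the top `keep_ratio` fraction of gradient entries of W_fc1 (eq. 5)."""
    g = fc1_weight.grad
    if g is None:
        return
    k = max(1, int(keep_ratio * g.numel()))
    threshold = torch.topk(g.flatten().pow(2), k).values.min()  # k-th largest squared value
    mask = (g.pow(2) >= threshold).to(g.dtype)                  # M_fc1 in {0,1}^{d x D}
    fc1_weight.grad = mask * g                                  # Hadamard product, eq. (5)
```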
Theoretical justification. The importance scores in (3) can be derived from the following constrained (binary) optimization problem¹

p* = argmax_{p ∈ {0,1}^d} ‖p ⊙ ∇_W L(θ_0)‖² / ‖∇_W L(θ_0)‖²,   (6)
s.t. ⋃_{ℓ=1}^G I_ℓ ⊂ {1, 2, . . . , d},   I_i ∩ I_j = ∅, ∀i ≠ j,
and C = Σ_{ℓ=1}^G c_ℓ,   c_ℓ ≤ |I_ℓ| ∀ℓ,   ‖p‖_0 ≤ C,

where θ_0 is the vector of the pretrained parameters before using D_t for fine-tuning the model. The groups of parameters I_i correspond to the parameters of a specific module (e.g., Self-Attention or MLP projector) we aim to learn, hence the mutual exclusiveness constraint, I_i ∩ I_j = ∅, between different pairs of parameter groups. Also note that we are allowed to choose a subset of c_ℓ parameters from a specific group I_ℓ, which is the underlying mechanism by which LoRSU chooses attention heads and parameters of fc1. The mask p* is chosen so that the norm of the masked gradient is as large as possible under the sparsity constraints. We prove in Appendix A that the indices of the non-zero values of p* can be found using the importance scores in (3) and the magnitudes of the gradients with respect to the fc1 parameters.

¹For notational simplicity, we assume a single transformer block for this case.

4. Experiments

We conduct a series of experiments under three different few-shot continual learning (CL) settings (CL-5, CL-20, and CL-50 shots) to thoroughly investigate the performance of LoRSU based on ten VQA datasets. By adopting this paradigm, we aim to assess the adaptability and efficiency of LoRSU under constrained learning conditions, ensuring that it remains both computationally feasible and effective in improving downstream performance. Specifically, we seek to address the following key questions: 1) How does our method, LoRSU, compare to other fine-tuning and CL baselines that use the CLIP loss to update the image encoder? 2) Does updating the image encoder separately and then reintegrating it into the corresponding VLM enhance downstream VQA performance? 3) What is the effect of using the perplexity loss instead of the CLIP loss to update the image encoder? 4) What are the benefits of choosing a subset of attention heads to be fine-tuned using LoRSU? and 5) What are the computational benefits of LoRSU?

4.1. Datasets

We evaluate the performance of LoRSU on ten visual question answering (VQA) datasets falling into two broad categories: regular VQA datasets and classification datasets converted to VQA datasets.

Regular VQA datasets. We consider four standard VQA datasets used for benchmarking VLMs' performance (Duan et al., 2024): VSR (Liu et al., 2023), the Visual Spatial Reasoning corpus, consists of caption-image pairs labeled as True or False, where each caption describes the spatial relation between two objects in the image; VLMs evaluate whether the caption accurately reflects the image. HM (Kiela et al., 2020) is the Hateful Memes dataset, designed to detect multimodal hateful memes. MMVP (Tong et al., 2024), the Multimodal Visual Patterns dataset, is a challenging benchmark built on images that CLIP perceives as similar despite their clear visual differences. VisOnly (Kamoi et al., 2024) is a novel dataset created to directly assess the visual perception abilities of VLMs in
answering questions about geometric and numerical details in scientific figures. This dataset allows us to assess fine-grained visual perception in VLMs independently of other abilities, such as reasoning, making it the most challenging among the previously mentioned datasets.

Classification-to-VQA datasets. We convert four popular multi-class classification datasets into multiple-choice VQA problems, where each question has five choices and the VLM is tasked with selecting the correct answer. These datasets are introduced as examples of scenarios where visual domain shifts are encountered, allowing us to examine the utility of updating the image encoder; a critical consideration often overlooked in many standard VQA datasets. The datasets include: GTS (Stallkamp et al., 2012), the German Traffic Sign dataset, which Zhang et al. (2024) considered as an out-of-distribution dataset for CLIP pretraining; CAn (Wang et al., 2024b), a recent dataset created to test CLIP's robustness with animal images containing realistic spurious features such as unexpected backgrounds; AIR (Maji et al., 2013), a fine-grained aircraft classification dataset; and ESAT (Helber et al., 2019), a dataset of satellite images used for land cover classification.

TSI & DALLE. In addition to these existing datasets, we introduce two novel VQA datasets: TSI and DALLE, both designed to explore the effects of domain shift. The TSI (Das et al., 2019) dataset was preprocessed as a classification dataset, where the goal is to recognize the activity depicted in each image. Frames were extracted from videos to create a training set of approximately 10K images and a test set of approximately 5K images, encompassing 27 distinct activity classes. The DALLE dataset, constructed by querying OpenAI's model DALL·E 2, includes representative images generated from 22 activity classes appearing in TSI. For each activity, we generated 30 images, resulting in a total of 660 images designated exclusively for evaluation purposes.

We follow the common practice in few-shot continual learning (Panos et al., 2023) to construct the sequences. We divide each dataset into 5 sets of disjoint classes/categories and consider 5/20/50-shot settings where only 5/20/50 images per class in the current set are used for fine-tuning the model. More details on how we split each of these datasets for the CL settings are provided in Appendix C.

4.2. Experimental Setting

Metrics. While standard metrics exist in the CL literature to evaluate general performance (Lopez-Paz & Ranzato, 2017; Chaudhry et al., 2018), VLMs exhibit generic knowledge across various domains beyond the one being adapted, making it crucial to evaluate how adaptation impacts their overall performance. These metrics do not measure the change in performance relative to the model's initial state prior to the learning process. To address this, we use the zero-shot accuracy of each VQA dataset as the benchmark baseline and report the change in accuracy on the test split of the target dataset, so positive values indicate an improvement in accuracy. This approach enables us to quantify the model's ability to accumulate knowledge, using the pretrained model as the reference point; we name this metric Target Improvement (TI) accuracy. We also calculate the average accuracy change on the test splits of the remaining datasets, when fine-tuning on a specific dataset, to estimate average forgetting of generic knowledge or possible positive backward transfer (De Lange et al., 2021); we call this metric Control Change (CC) accuracy, where 'control' refers to the control datasets we use to calculate the average accuracy change. TI and CC are computed based on the fine-tuned VLM after the last session of CL. We also consider standard CL performance metrics such as Average Accuracy (ACC) and Backward Transfer (BWT) (Lopez-Paz & Ranzato, 2017) to examine how accuracy and forgetting evolve through continuous adaptation. Notice that these metrics, in contrast to TI and CC, focus on the accuracy and forgetting during continual adaptation and do not take into account the performance of the fine-tuned model on other datasets.
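The metrics above reduce to simple accuracy differences; the sketch below spells them out, with TI and CC as defined in this section and ACC/BWT in their standard form from Lopez-Paz & Ranzato (2017). Function and argument names are illustrative.

```python
def target_improvement(target_acc_after, target_acc_zero_shot):
    """TI: accuracy change on the target dataset's test split after the last CL session."""
    return target_acc_after - target_acc_zero_shot

def control_change(control_acc_after, control_acc_zero_shot):
    """CC: average accuracy change over the control (non-target) datasets."""
    deltas = [control_acc_after[d] - control_acc_zero_shot[d] for d in control_acc_after]
    return sum(deltas) / len(deltas)

def acc_bwt(R):
    """ACC and BWT from the T x T matrix R, where R[t][i] is the accuracy on the
    test data of session i after training on session t."""
    T = len(R)
    acc = sum(R[T - 1][i] for i in range(T)) / T
    bwt = sum(R[T - 1][i] - R[i][i] for i in range(T - 1)) / (T - 1)
    return acc, bwt
```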
Implementation details. Please see Appendix B.

Models. For our experiments, we consider the popular Vision-Language Model LLaVA-v1.5 (Liu et al., 2024b), which leverages a frozen CLIP image encoder. Specifically, LLaVA utilizes a frozen OpenAI CLIP-L-14 (Radford et al., 2021) with an LLM (Vicuna-7b (Chiang et al., 2023)). The two modules are connected through a two-layer MLP projector that aligns image and text features. The LLM and the MLP projector are optimized during the visual instruction tuning while CLIP remains frozen. LLaVA concatenates adjacent tokens from CLIP-L-14 and processes them with an MLP projector as input to LLama-2 (7B-chat) (Touvron et al., 2023); the MLP projector and the language model are optimized while the image encoder remains frozen.
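For orientation, a deliberately simplified PyTorch sketch of the LLaVA-style wiring assumed in our experiments: a CLIP vision tower, a two-layer MLP projector, and an LLM consuming the projected visual tokens. Module and argument names are illustrative, not the actual LLaVA code; the vision encoder is the only part LoRSU updates.

```python
import torch
import torch.nn as nn

class MinimalVLM(nn.Module):
    def __init__(self, vision_encoder, llm, vis_dim, llm_dim):
        super().__init__()
        self.vision_encoder = vision_encoder              # e.g. CLIP-L-14 image tower
        self.projector = nn.Sequential(                   # two-layer MLP projector
            nn.Linear(vis_dim, llm_dim), nn.GELU(), nn.Linear(llm_dim, llm_dim)
        )
        self.llm = llm                                    # language model (kept frozen here)

    def forward(self, images, text_embeds):
        vis_tokens = self.projector(self.vision_encoder(images))  # map patch features to LLM space
        inputs = torch.cat([vis_tokens, text_embeds], dim=1)      # prepend visual tokens
        return self.llm(inputs_embeds=inputs)                     # HF-style call, assumed interface
```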
Baselines. We compare LoRSU to the following methods that also use the CLIP loss to fine-tune the image encoder (a sketch of this loss is given after the list):

• LN (Perez et al., 2018; Panos et al., 2023), used for both few-shot and CL; only the parameters of the image encoder's LayerNorm modules are optimized.

• F-FT, the standard fine-tuning technique where all image encoder parameters undergo gradient updates.

• F-EWC, which fine-tunes all the image encoder parameters with EWC regularization (Kirkpatrick et al., 2017).

• LoRA (Hu et al., 2021), a popular PEFT method that parameterizes incremental updates by two low-dimensional matrices and only fine-tunes them.

• AdaLoRA (Zhang et al., 2023), which dynamically adjusts the low-rank update budget allocation during training.

• SPU (Zhang et al., 2024), a PEFT baseline specifically designed to tackle catastrophic forgetting in CL scenarios, which utilizes structured sparsity based on gradient information to fine-tune the most significant parameters of the fc1 module in the transformer block.
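As a reference for the baselines above, a minimal sketch of one common instantiation of the CLIP loss used to fine-tune the image encoder: the frozen text encoder embeds one prompt per class and images are classified against these embeddings with a cross-entropy loss. It assumes a CLIP-like model exposing encode_image, encode_text, and logit_scale, as in open-source CLIP implementations; the exact loss variant used by each baseline may differ.

```python
import torch
import torch.nn.functional as F

def clip_classification_loss(clip_model, images, class_prompt_tokens, labels):
    img = F.normalize(clip_model.encode_image(images), dim=-1)        # trainable image tower
    with torch.no_grad():                                             # text tower stays frozen
        txt = F.normalize(clip_model.encode_text(class_prompt_tokens), dim=-1)
    logits = clip_model.logit_scale.exp() * img @ txt.t()             # cosine-similarity logits
    return F.cross_entropy(logits, labels)                            # labels: class index per image
```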
Table 1. Performance comparison of LoRSU with the CLIP loss against baselines fine-tuning the image encoder using the same loss. We report the Target Improvement (TI) and Control Change (CC) accuracies across three different continual learning (CL) settings. Greener shades indicate higher positive values, while redder shades signify lower negative values. The highest accuracies across methods for each dataset are underlined.

| Setting | FT Dataset | LN TI | LN CC | F-FT TI | F-FT CC | F-EWC TI | F-EWC CC | LoRA TI | LoRA CC | AdaLoRA TI | AdaLoRA CC | SPU TI | SPU CC | LoRSU TI | LoRSU CC |
| CL-5 | GTS | 3.5 | -1.5 | 3.7 | -6.5 | 5.0 | -11.5 | 0.7 | -4.8 | -0.9 | -4.9 | 5.4 | -0.6 | 6.4 | -0.7 |
| CL-5 | TSI | 0.8 | 0.0 | 7.4 | -1.1 | 8.5 | -1.0 | -0.1 | -2.8 | 1.1 | 0.2 | 0.9 | 0.1 | 3.2 | 0.1 |
| CL-5 | CAn | -2.4 | -0.2 | -2.4 | -2.2 | -16.7 | -9.4 | -1.3 | -4.6 | -1.0 | -0.1 | -0.4 | 0.1 | 0.3 | 0.3 |
| CL-5 | AIR | 0.3 | -1.6 | 2.0 | -2.7 | 2.9 | -2.8 | 1.3 | -3.7 | 0.4 | 0.0 | 3.1 | 0.1 | 4.8 | 0.4 |
| CL-5 | ESAT | 4.2 | 0.6 | -10.3 | -1.4 | -8.4 | -2.1 | -1.6 | -0.7 | 1.9 | 0.1 | 4.5 | 0.1 | 6.8 | 0.2 |
| CL-20 | GTS | 5.2 | -5.9 | 4.6 | -7.3 | 6.7 | -15.6 | 2.5 | -10.5 | 0.2 | -2.2 | 7.9 | -1.3 | 8.6 | -1.0 |
| CL-20 | TSI | 5.1 | -1.9 | 15.3 | -3.4 | 16.0 | -32.5 | 8.5 | -4.4 | 1.3 | -9.6 | 7.8 | -0.3 | 10.6 | -0.1 |
| CL-20 | CAn | -2.4 | -0.4 | 0.3 | -2.9 | 0.1 | -5.1 | -2.3 | -5.4 | -3.5 | -2.5 | 0.1 | 0.5 | 1.1 | 0.3 |
| CL-20 | AIR | -0.2 | -3.0 | 9.3 | -1.8 | 10.2 | -2.0 | 5.3 | -2.7 | 2.7 | -0.7 | 3.0 | -0.2 | 5.9 | -0.5 |
| CL-20 | ESAT | 0.9 | -0.1 | -24.9 | -1.7 | -22.0 | -3.8 | -11.5 | -0.5 | -6.8 | -2.7 | 5.4 | 0.3 | 6.6 | 0.2 |
| CL-50 | GTS | 4.8 | -6.5 | 3.4 | -9.8 | 5.3 | -12.9 | 3.1 | -11.1 | 1.0 | -3.3 | 7.7 | -1.5 | 9.7 | -1.3 |
| CL-50 | TSI | 7.0 | -3.0 | 17.2 | -4.6 | 22.4 | -13.4 | 18.2 | -6.3 | 7.9 | -1.9 | 12.2 | -0.5 | 19.1 | -0.3 |
| CL-50 | CAn | -5.7 | -3.3 | -1.0 | -4.9 | 0.6 | -9.7 | -0.4 | -4.4 | -1.8 | -0.8 | 0.6 | -0.3 | 1.3 | -0.5 |
| CL-50 | AIR | 1.8 | -3.9 | 10.0 | -3.1 | 10.9 | -3.3 | 7.8 | -3.8 | 4.6 | -0.9 | 6.2 | -0.6 | 8.2 | -0.7 |
| CL-50 | ESAT | 4.6 | 0.1 | -41.4 | -3.3 | -38.1 | -2.0 | -14.5 | -3.6 | -17.3 | -2.4 | 5.8 | 0.1 | 7.0 | 0.2 |

4.3. CLIP-based Updates

We evaluate the performance of the Vision-Language Model (VLM) when only the image encoder is fine-tuned using the CLIP loss in a CL setting. This experiment compares six strong CLIP-based baselines with our proposed method, LoRSU. Table 1 reports the average TI/CC accuracies over three runs; detailed results can be found in Appendix D.

We observe that LoRSU consistently achieves superior TI scores across datasets and CL settings, underscoring its ability to enhance task-specific performance effectively. Furthermore, LoRSU maintains CC accuracies that take consistently small negative or even positive values, highlighting its capacity to preserve or slightly improve performance on control datasets while fine-tuning on target datasets. Even on datasets where other methods struggle (e.g., CAn, ESAT), LoRSU often performs better, maintaining positive or close-to-neutral TI and CC scores. For instance, on ESAT (CL-50), which contains challenging satellite images, LoRSU achieves the highest TI (7.0) with a positive CC (0.2), outperforming SPU (TI=5.8, CC=0.1) and all other methods.

Table 2. Average accuracy (ACC) (↑) and backward transfer (BWT) (↑) scores (%). For reference, the ACC of the pretrained model on GTS and ESAT is 75.4 and 76.4, respectively, while BWT is zero for all cases. The highest scores across methods are underlined.

| Setting | FT Dataset | LoRA ACC | LoRA BWT | SPU ACC | SPU BWT | LoRSU ACC | LoRSU BWT |
| CL-5 | GTS | 79.2 | -7.1 | 80.8 | 0.5 | 81.1 | 0.4 |
| CL-5 | ESAT | 73.8 | -3.4 | 79.8 | 1.5 | 82.2 | 2.0 |
| CL-20 | GTS | 77.2 | -9.1 | 82.8 | -0.6 | 83.5 | -0.4 |
| CL-20 | ESAT | 64.1 | -18.3 | 82.0 | 2.0 | 82.7 | 0.1 |
| CL-50 | GTS | 79.3 | -10.3 | 83.8 | -0.7 | 84.7 | -0.5 |
| CL-50 | ESAT | 61.4 | -27.8 | 81.2 | -2.4 | 82.1 | -0.8 |

Additional metrics. We assess the performance of LoRSU against LoRA and SPU in terms of ACC and BWT across two out-of-domain datasets, GTS and ESAT. Since LoRA and SPU have a similar number of trainable parameters to LoRSU and competitive performance in our previous experiment, we choose them for comparison. Table 2 shows that LoRSU performs well with respect to these metrics, following similar patterns as TI and CC in Table 1. LoRSU achieves the best performance on ACC while exhibiting minimal forgetting with the least negative BWT values. Similar
Table 3. Performance comparison between LoRSU using the CLIP loss (LoRSU) or the perplexity loss (LoRSU-Ppl) and other baselines that fine-tune only the vision encoder (LoRA, LoRA-Ppl), only the LLM (LoRA-L), or both of them (LoRA-F). We report the Target Improvement (TI) and Control Change (CC) for each CL setting. † and ‡ denote classification-to-VQA and regular VQA datasets, respectively. The highest accuracies across methods for each dataset are underlined.

| Setting | FT Dataset | LoRA-L TI | LoRA-L CC | LoRA TI | LoRA CC | LoRSU TI | LoRSU CC | LoRA-Ppl TI | LoRA-Ppl CC | LoRA-F TI | LoRA-F CC | LoRSU-Ppl TI | LoRSU-Ppl CC |
| CL-5 | GTS† | -4.1 | -0.2 | 0.7 | -4.8 | 6.4 | -0.7 | -7.5 | -3.0 | -2.7 | -1.8 | 1.6 | -1.0 |
| CL-5 | TSI† | 6.0 | -0.1 | -0.1 | -2.8 | 3.2 | 0.1 | 10.9 | -2.4 | -8.0 | -2.4 | 13.1 | 1.5 |
| CL-5 | CAn† | -3.3 | -0.2 | -1.3 | -4.6 | 0.3 | 0.3 | -3.5 | -5.5 | -4.1 | -1.6 | 0.2 | -0.2 |
| CL-5 | AIR† | -1.7 | 0.3 | 1.3 | -3.7 | 4.8 | 0.4 | -0.7 | -1.5 | 9.6 | -1.9 | 5.8 | -0.2 |
| CL-5 | ESAT† | -0.2 | -0.1 | -1.6 | -0.7 | 6.8 | 0.2 | -0.6 | 0.4 | 5.4 | -0.5 | 3.7 | 0.1 |
| CL-5 | VSR‡ | 16.8 | -0.6 | 0.5 | -4.0 | 0.4 | 0.2 | 10.2 | -12.5 | 18.0 | -10.6 | 10.5 | -1.2 |
| CL-5 | HM‡ | 7.4 | -2.7 | -0.4 | -6.8 | 0.6 | 0.4 | -1.2 | -1.2 | 6.0 | -4.5 | -0.8 | 0.2 |
| CL-5 | VisOnly‡ | -0.4 | -0.1 | -1.1 | -4.5 | 0.9 | 0.1 | 0.3 | -0.3 | 0.2 | -0.4 | 2.7 | 0.7 |
| CL-20 | GTS† | -1.4 | 0.1 | 2.5 | -10.5 | 8.6 | -1.0 | -0.5 | -6.4 | -1.4 | -0.8 | 3.9 | -0.7 |
| CL-20 | TSI† | 5.9 | 0.0 | 8.5 | -4.4 | 10.6 | -0.1 | 6.5 | -11.6 | 2.9 | -3.1 | 13.9 | -0.6 |
| CL-20 | CAn† | -1.9 | -0.6 | -2.3 | -5.4 | 1.1 | 0.3 | -3.7 | -8.8 | -2.1 | -1.7 | 0.5 | -1.2 |
| CL-20 | AIR† | 3.7 | 0.3 | 5.3 | -2.7 | 5.9 | -0.5 | 4.8 | -3.5 | 16.3 | -0.3 | 6.0 | -0.3 |
| CL-20 | ESAT† | 0.7 | 0.4 | -11.5 | -0.5 | 6.6 | 0.2 | -1.2 | -0.1 | -4.6 | -0.0 | 2.9 | -0.1 |
| CL-20 | VSR‡ | 22.2 | 1.0 | 0.4 | -3.9 | 0.1 | -0.2 | 19.5 | -0.3 | 23.3 | -5.1 | 22.9 | -1.6 |
| CL-20 | HM‡ | 10.6 | -2.2 | -1.8 | -5.8 | 0.7 | 0.2 | 10.7 | -0.1 | 11.7 | -1.4 | 10.9 | -0.2 |
| CL-20 | VisOnly‡ | -2.3 | 0.7 | -1.0 | -4.7 | 0.2 | 0.1 | -2.0 | 0.5 | -1.0 | 0.2 | 1.7 | 0.5 |
| CL-50 | GTS† | -0.7 | -0.3 | 3.1 | -11.1 | 9.7 | -1.3 | -1.4 | -6.7 | -3.9 | -2.1 | 6.9 | -0.4 |
| CL-50 | TSI† | 9.9 | -0.0 | 18.2 | -6.3 | 19.1 | -0.4 | -1.6 | -16.5 | 15.1 | -0.7 | 22.0 | -1.1 |
| CL-50 | CAn† | -1.8 | -0.7 | -0.4 | -4.4 | 1.3 | -0.5 | -1.8 | -9.8 | -2.1 | -1.1 | 1.0 | -3.4 |
| CL-50 | AIR† | 4.6 | 0.4 | 7.8 | -3.8 | 8.2 | -0.7 | 6.2 | -3.1 | 17.9 | -0.9 | 8.9 | -0.4 |
| CL-50 | ESAT† | 1.0 | 0.2 | -14.5 | -3.6 | 7.0 | 0.2 | 1.7 | 0.2 | -9.5 | -0.6 | -0.7 | -0.5 |
| CL-50 | VSR‡ | 21.9 | 1.0 | 0.4 | -4.5 | 2.3 | -0.3 | 20.2 | -5.3 | 21.0 | 1.1 | 23.4 | -3.6 |
| CL-50 | HM‡ | 10.2 | -2.1 | 0.7 | -4.5 | 0.3 | 0.2 | 12.5 | -1.5 | 12.3 | -3.7 | 12.2 | 0.2 |
| CL-50 | VisOnly‡ | -2.4 | 0.6 | -0.2 | -6.8 | 0.3 | -0.1 | -2.0 | 0.7 | 0.2 | 0.2 | 0.3 | 0.1 |
patterns are observed on extra datasets in Appendix D.2.

4.4. CLIP-based vs. Perplexity-based Updates

Traditionally, LLMs and VLMs achieve impressive performance through fine-tuning with the perplexity loss. LoRA is the standard PEFT method for this purpose, and thus we consider three extra LoRA variants, plus LoRSU-Ppl, which all utilize the perplexity loss to update the model (a configuration sketch for these variants is given below):

• LoRA-L applies LoRA adapters to all weight matrices of the LLM, and thus the perplexity loss is required.

• LoRA-Ppl is the same method as LoRA, but this time the perplexity loss is used to update the adapters.

• LoRA-F applies LoRA adapters to all weight matrices of the LLM, the image encoder, and the MLP projector.

We evaluate how LoRSU and LoRA perform compared to their perplexity-based counterparts, LoRSU-Ppl and LoRA-Ppl, respectively. Furthermore, we seek to explore how these methods compare to parameter-efficient fine-tuning approaches when either the entire VLM (LoRA-F) or only the LLM component (LoRA-L) is updated.
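A hedged sketch of how such LoRA variants can be configured with the HuggingFace peft library; the target module names below are typical for LLaMA-style LLMs and are assumptions, not the exact modules used in our experiments.

```python
from peft import LoraConfig, get_peft_model

def add_lora(model, target_modules, r):
    """Attach LoRA adapters to the named sub-modules; all other weights stay frozen."""
    config = LoraConfig(r=r, lora_alpha=2 * r, lora_dropout=0.0,
                        target_modules=target_modules, bias="none")
    return get_peft_model(model, config)

# LoRA-L : adapters on the LLM's weight matrices only (trained with the perplexity loss), r = 8
# llm = add_lora(llm, ["q_proj", "k_proj", "v_proj", "o_proj",
#                      "gate_proj", "up_proj", "down_proj"], r=8)
# LoRA-Ppl: same adapters as LoRA (vision encoder only, r = 64) but trained with the perplexity loss
# LoRA-F : adapters on the LLM, the vision encoder, and the MLP projector (r = 8)
```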
The results in Table 3 highlight the strong and robust performance of LoRSU and LoRSU-Ppl compared to other baseline methods across various settings. Both LoRSU and LoRSU-Ppl achieve minimal negative or even positive changes in CC, indicating reduced catastrophic forgetting
Table 4. Comparison of the importance of choosing a small subset of attention heads. The GTS dataset is used for fine-tuning. We include error bars over 3 runs. The highest accuracies across methods are underlined.

| Setting | Scores | LoRSU-Rand | LoRSU-AAH | LoRSU |
| CL-5 | TI (↑) | 4.1 ±0.4 | 5.9 ±0.8 | 6.4 ±1.3 |
| CL-5 | CC (↑) | -1.0 ±0.5 | -0.9 ±0.3 | -0.7 ±0.6 |
| CL-20 | TI (↑) | 6.2 ±0.6 | 7.5 ±0.6 | 8.6 ±0.9 |
| CL-20 | CC (↑) | -1.4 ±0.3 | -0.7 ±0.4 | -1.0 ±0.5 |
| CL-50 | TI (↑) | 7.8 ±0.4 | 9.1 ±0.1 | 9.7 ±0.1 |
| CL-50 | CC (↑) | -1.7 ±0.2 | -0.9 ±0.2 | -1.3 ±0.1 |

[Figure: bar chart comparing TFlops and trainable parameters (M) for LoRA-F, LoRSU-Ppl, and LoRSU.]
constraints, our method is generic to any transformer model, and we plan to extend it to other VLMs and image encoders. Another promising direction is using a smaller LLM proxy model in perplexity-based methods like LoRSU-Ppl, which has shown strong VQA performance. This could improve scalability and LoRSU's use in resource-limited settings. Finally, LoRSU's binary-mask-based structured updates ensure efficient, precise parameter updates, but scaling to larger architectures like LLMs poses challenges. Replacing binary masks with more scalable solutions for vast parameter spaces will be crucial to manage memory and processing demands, offering opportunities for further refinement.

Impact Statement

This paper presents work whose goal is to advance the field of Machine Learning. There are many potential societal consequences of our work, none of which we feel must be specifically highlighted here.
References

Chaudhry, A., Dokania, P. K., Ajanthan, T., and Torr, P. H. Riemannian walk for incremental learning: Understanding forgetting and intransigence. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 532–547, 2018.

Chen, J., Zhu, D., Shen, X., Li, X., Liu, Z., Zhang, P., Krishnamoorthi, R., Chandra, V., Xiong, Y., and Elhoseiny, M. Minigpt-v2: Large language model as a unified interface for vision-language multi-task learning. arXiv preprint arXiv:2310.09478, 2023.

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al. Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 24185–24198, 2024.

Cheng, S., Tian, B., Liu, Q., Chen, X., Wang, Y., Chen, H., and Zhang, N. Can we edit multimodal large language models? In Bouamor, H., Pino, J., and Bali, K. (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 13877–13888, Singapore, December 2023. Association for Computational Linguistics. doi: 10.18653/v1/2023.emnlp-main.856. URL https://aclanthology.org/2023.emnlp-main.856/.

Chiang, W.-L., Li, Z., Lin, Z., Sheng, Y., Wu, Z., Zhang, H., Zheng, L., Zhuang, S., Zhuang, Y., Gonzalez, J. E., et al. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality. See https://vicuna.lmsys.org (accessed 14 April 2023), 2(3):6, 2023.

Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696, 2021.

Das, D., Talon, D., Mancini, M., Wang, Y., and Ricci, E. One vlm to keep it learning: Generation and balancing for data-free continual visual question answering. arXiv preprint arXiv:2411.02210, 2024.

Das, S., Dai, R., Koperski, M., Minciullo, L., Garattoni, L., Bremond, F., and Francesca, G. Toyota smarthome: Real-world activities of daily living. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 833–842, 2019.

De Lange, M., Aljundi, R., Masana, M., Parisot, S., Jia, X., Leonardis, A., Slabaugh, G., and Tuytelaars, T. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(7):3366–3385, 2021.

Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pp. 248–255, 2009. doi: 10.1109/CVPR.2009.5206848.

Duan, H., Yang, J., Qiao, Y., Fang, X., Chen, L., Liu, Y., Dong, X., Zang, Y., Zhang, P., Wang, J., et al. Vlmevalkit: An open-source toolkit for evaluating large multi-modality models. In Proceedings of the 32nd ACM International Conference on Multimedia, pp. 11198–11201, 2024.

Goswami, D., Twardowski, B., and Van De Weijer, J. Calibrating higher-order statistics for few-shot class-incremental learning with pre-trained vision transformers.
In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4075–4084, 2024.

He, H., Cai, J., Zhang, J., Tao, D., and Zhuang, B. Sensitivity-aware visual parameter-efficient fine-tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 11825–11835, 2023a.

He, J., Guo, H., Tang, M., and Wang, J. Continual instruction tuning for large multimodal models. arXiv preprint arXiv:2311.16206, 2023b.

Helber, P., Bischke, B., Dengel, A., and Borth, D. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019.

Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., and Chen, W. Lora: Low-rank adaptation of large language models. arXiv preprint arXiv:2106.09685, 2021.

Kamoi, R., Zhang, Y., Das, S. S. S., Zhang, R. H., and Zhang, R. Visonlyqa: Large vision language models still struggle with visual perception of geometric information. arXiv preprint arXiv:2412.00947, 2024.

Kiela, D., Firooz, H., Mohan, A., Goswami, V., Singh, A., Ringshia, P., and Testuggine, D. The hateful memes challenge: Detecting hate speech in multimodal memes. Advances in Neural Information Processing Systems, 33:2611–2624, 2020.

Kingma, D. P. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu, A. A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

Li, J., Li, D., Savarese, S., and Hoi, S. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pp. 19730–19742. PMLR, 2023.

Liu, F., Emerson, G. E. T., and Collier, N. Visual spatial reasoning. Transactions of the Association for Computational Linguistics, 2023.

Liu, H., Li, C., Li, Y., Li, B., Zhang, Y., Shen, S., and Lee, Y. J. Llava-next: Improved reasoning, ocr, and world knowledge, 2024a.

Liu, H., Li, C., Wu, Q., and Lee, Y. J. Visual instruction tuning. Advances in Neural Information Processing Systems, 36, 2024b.

Lopez-Paz, D. and Ranzato, M. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017.

Loshchilov, I. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, University of Oxford, 2013.

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in gpt. Advances in Neural Information Processing Systems, 35:17359–17372, 2022.

Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In International Conference on Learning Representations, 2022. URL https://openreview.net/forum?id=0DcZxeWfOPt.

Panos, A., Kobe, Y., Reino, D. O., Aljundi, R., and Turner, R. E. First session adaptation: A strong replay-free baseline for class-incremental learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 18820–18830, 2023.

Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., et al. Pytorch: An imperative style, high-performance deep learning library. Advances in Neural Information Processing Systems, 32, 2019.

Perez, E., Strub, F., De Vries, H., Dumoulin, V., and Courville, A. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 32, 2018.

Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pp. 8748–8763. PMLR, 2021.

Sinitsin, A., Plokhotnyuk, V., Pyrkin, D., Popov, S., and Babenko, A. Editable neural networks. In International Conference on Learning Representations, 2020. URL https://openreview.net/forum?id=HJedXaEtvS.

Srivastava, S., Harun, M. Y., Shrestha, R., and Kanan, C. Improving multimodal large language models using continual learning. arXiv preprint arXiv:2410.19925, 2024.
Stallkamp, J., Schlipsing, M., Salmen, J., and Igel, C. Man vs. computer: Benchmarking machine learning algorithms for traffic sign recognition. Neural Networks, 32:323–332, 2012.
Tong, S., Liu, Z., Zhai, Y., Ma, Y., LeCun, Y., and Xie,
S. Eyes wide shut? exploring the visual shortcomings
of multimodal llms. In Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition,
pp. 9568–9578, 2024.
Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi,
A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P.,
Bhosale, S., et al. Llama 2: Open foundation and fine-
tuned chat models. arXiv preprint arXiv:2307.09288,
2023.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones,
L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. At-
tention is all you need. Advances in neural information
processing systems, 30, 2017.
Verwimp, E., Aljundi, R., Ben-David, S., Bethge, M., Cossu,
A., Gepperth, A., Hayes, T. L., Hüllermeier, E., Kanan, C.,
Kudithipudi, D., et al. Continual learning: Applications
and the road forward. arXiv preprint arXiv:2311.11908,
2023.
Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen,
K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du,
M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., and
Lin, J. Qwen2-vl: Enhancing vision-language model’s
perception of the world at any resolution. arXiv preprint
arXiv:2409.12191, 2024a.
Wang, Q., Lin, Y., Chen, Y., Schmidt, L., Han, B., and
Zhang, T. Do clips always generalize better than imagenet
models? arXiv preprint arXiv:2403.11497, 2024b.
Wu, T., Luo, L., Li, Y.-F., Pan, S., Vu, T.-T., and Haffari, G.
Continual learning for large language models: A survey.
arXiv preprint arXiv:2402.01364, 2024.
Zhang, Q., Chen, M., Bukharin, A., Karampatziakis, N.,
He, P., Cheng, Y., Chen, W., and Zhao, T. Adalora:
Adaptive budget allocation for parameter-efficient fine-
tuning. arXiv preprint arXiv:2303.10512, 2023.
Zhang, W., Janson, P., Aljundi, R., and Elhoseiny, M. Over-
coming generic knowledge loss with selective parameter
update. In Proceedings of the IEEE/CVF Conference
on Computer Vision and Pattern Recognition, pp. 24046–
24056, 2024.
Zhao, L., Zhang, X., Yan, K., Ding, S., and Huang, W. Safe: Slow and fast parameter-efficient tuning for continual learning with pre-trained models. arXiv preprint arXiv:2411.02175, 2024.

Zhu, D., Chen, J., Shen, X., Li, X., and Elhoseiny, M. Minigpt-4: Enhancing vision-language understanding with advanced large language models. arXiv preprint arXiv:2304.10592, 2023.
Definition A.1 (TOP-S operator). For any x ∈ R^d and 1 ≤ S ≤ d, the TOP-S operator is defined element-wise as

(TOP-S(x))_{π(i)} = x_{π(i)} if i ≤ S, and 0 otherwise,

where x = (x_1, . . . , x_d)^⊤ ∈ R^d and π is a permutation of {1, 2, . . . , d} such that |x_{π(i)}| ≥ |x_{π(i+1)}| for i = 1, . . . , d − 1, i.e., the TOP-S operator keeps only the S largest elements of x in magnitude and truncates the rest to zero.

Lemma A.2. For any x ∈ R^d − {0}, 1 ≤ C ≤ d, the optimal mask

p* = argmax_{p ∈ {0,1}^d} ‖p ⊙ x‖² / ‖x‖²,   s.t. ‖p‖_0 ≤ C,

has non-zero entries exactly at the indices of the C largest elements of x in magnitude.

Proof. Notice that this is a trivial binary knapsack problem with maximum weight capacity C and weights equal to one. Hence, the maximum is attained when we pick the top C maximal x_i² elements.

Remark A.3. The solution of (6) is obtained by selecting, within each group I_ℓ, the c_ℓ parameters with the largest squared gradient values; for the attention-head groups this reduces to keeping the heads with the largest importance scores in (3), and for fc1 to keeping the entries with the largest gradient magnitudes.

Proof. The result follows from the mutual exclusiveness of I_ℓ in the constraints of (6) and Lemma A.2.
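A small PyTorch illustration of Lemma A.2 and the group-wise selection behind (6): within each parameter group, the optimal binary mask simply keeps the entries with the largest squared values (for LoRSU, squared gradient magnitudes). Tensor names and the 10% budget are illustrative.

```python
import torch

def topk_binary_mask(values, budget):
    """Optimal mask of Lemma A.2: keep the `budget` largest entries of `values`."""
    flat = values.flatten()
    mask = torch.zeros_like(flat)
    mask[torch.topk(flat, k=min(budget, flat.numel())).indices] = 1.0
    return mask.view_as(values)

# Group-wise solution of (6): because the groups I_l are disjoint, the overall mask is the
# concatenation of per-group top-c_l masks, e.g. (hypothetical tensors):
# mask_fc1 = topk_binary_mask(grad_fc1.pow(2), budget=int(0.10 * grad_fc1.numel()))
# For the attention groups, whole heads are kept according to the scores in (3).
```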
B. Implementation Details
We describe below the implementation details of section 4.
• We have included error bars over three runs for all experiments.
• We use Adam (Kingma, 2014) as the optimizer for the methods that utilize the CLIP loss for fine-tuning and AdamW (Loshchilov, 2017) for those that use the perplexity loss.
• A learning rate scheduler of Cosine Annealing with Warmup is employed for all methods.
• For all experiments, we set the learning rate to 1 × 10−5 and 2 × 10−5 for LoRSU and LoRSU-Ppl, respectively.
• We set batch size to 16 for all methods that fine-tune the vision encoder through CLIP loss. We reduce the batch size to
8 for those methods that fine-tune the vision encoder through perplexity loss or those that fine-tune the LLM. This was
due to GPU memory limitations.
• All methods run for 20, 15, and 10 epochs for the CL-5, CL-20, and CL-50 settings, respectively.
• For LoRA (-Ppl), we set rank r = 64 while LoRA-L and LoRA-F use r = 8, for all experiments.
• For AdaLoRA, we set the initial rank to 70 and the final average rank to 64.
• The adapters of LoRA and AdaLoRA are applied to all weight matrices of each of the transformer blocks.
• For SPU, we use sparsity=15% for all experiments.
• For LoRSU (-Ppl) we use sparsity=10%, rank=64, and we pick the top-2 attention heads for all experiments.
The choice of the above hyperparameters ensures that LoRA (-Ppl), LoRA-L, LoRA-F, AdaLoRA, SPU, and LoRSU (-Ppl) have a similar number of trainable parameters.
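A sketch of the optimizer and warmup-plus-cosine schedule described above, implemented manually via LambdaLR to avoid depending on a specific library helper; the warmup length is an assumption, as it is not specified above.

```python
import math
import torch

def cosine_with_warmup(optimizer, warmup_steps, total_steps):
    def lr_factor(step):
        if step < warmup_steps:
            return (step + 1) / max(1, warmup_steps)                  # linear warmup
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))             # cosine decay to zero
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_factor)

# e.g. LoRSU with the CLIP loss: Adam on the trainable parameters only, lr = 1e-5
# trainable = [p for p in model.parameters() if p.requires_grad]
# optimizer = torch.optim.Adam(trainable, lr=1e-5)
# scheduler = cosine_with_warmup(optimizer, warmup_steps=50, total_steps=num_epochs * steps_per_epoch)
```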
C. Datasets
Details on all datasets used in section 4 are presented here.
TSI. To extract images from the videos of the Toyota Smart Home dataset (TSI), we discretized each video clip into 2
frames per second and then selected the frame in the middle of the total time duration of the video clip. In Table 5 we
describe the actions that were selected and the corresponding prompt used for CLIP classification. We also note that we dropped a few actions to avoid ambiguous classes.
DALLE. We generated images from DALL·E 2 using OpenAI python package and we used the prompt “A person {a}”
where a ∈ { using a white coffee machine, eating, cutting bread, stirring the pot, holding a glass, watching TV, holding a
bottle, walking, making tea, cutting food, holding a cup, using a laptop, lying down, holding a can, person holding a black
kettle, reading a book, cleaning up, sitting down, using a tablet, boiling water in a black kettle, using a cordless phone,
washing dishes}.
In Table 6, we present the average number of images per session used to update the model for each CL setting. Finally,
Table 7 provides characteristics of the datasets used for evaluating performance.
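For reference, a sketch of how the k-shot continual learning sessions are assembled from a class split; in our experiments the per-dataset class splits are the fixed ones listed below rather than random, so the shuffling here is purely illustrative.

```python
import random
from collections import defaultdict

def build_cl_sessions(samples, class_groups, shots, seed=0):
    """samples: list of (image_path, class_id); class_groups: 5 disjoint lists of class ids.

    Returns one few-shot training set per session with `shots` images per class.
    """
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for path, cls in samples:
        by_class[cls].append(path)
    sessions = []
    for group in class_groups:
        session = []
        for cls in group:
            imgs = list(by_class[cls])
            rng.shuffle(imgs)
            session.extend((p, cls) for p in imgs[:shots])  # keep only `shots` images per class
        sessions.append(session)
    return sessions
```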
TSI (Das et al., 2019). We split the 27 action categories of TSI as follows:
Table 5. The original action names of the Toyota Smarthome dataset and their corresponding captions used to create the Toyota Smarthome
Images (TSI) dataset. We use ✗ to denote the actions that are ambiguous and were not used to build the TSI dataset. The final prompt is
created as “The person in this image is {caption}”.
CAn (Wang et al., 2024b). The 45 classes of CAn are split as follows:
• Session 2: [41, 293, 42, 49, 54, 57, 70, 279, 305].
• Session 3: [71, 10, 76, 79, 349, 16, 81, 83, 100].
• Session 4: [130, 30, 133, 150, 275, 276, 58, 277, 80].
• Session 5: [39, 290, 37, 296, 316, 337, 89, 360, 128].
The indices of CAn correspond to those of ImageNet (Deng et al., 2009) since the dataset was built based on these 45 animal
classes of ImageNet.
AIR (Maji et al., 2013). We split the 100 aircraft types of AIR as follows:
• Session 1: [23, 8, 11, 7, 48, 13, 1, 91, 94, 54, 16, 63, 52, 41, 80, 2, 47, 87, 78, 66].
• Session 2: [19, 6, 24, 10, 59, 30, 22, 29, 83, 37, 93, 81, 43, 99, 86, 28, 34, 88, 44, 14].
• Session 3: [84, 70, 4, 20, 15, 21, 31, 76, 57, 67, 73, 50, 69, 25, 98, 46, 96, 0, 72, 35].
• Session 4: [58, 92, 3, 95, 56, 90, 26, 40, 55, 89, 75, 71, 60, 42, 9, 82, 39, 18, 77, 68].
• Session 5: [32, 79, 12, 85, 36, 17, 64, 27, 74, 45, 61, 38, 51, 62, 65, 33, 5, 53, 97, 49].
ESAT (Helber et al., 2019). We split the 10 different land terrain classes of ESAT as follows:
DALLE. This dataset was only used for performance evaluation (control dataset), and not fine-tuning.
VSR (Liu et al., 2023). The images of this VQA dataset are labeled according to 36 different categories that describe the
dominant object of the image. We create the CL splits as follows:
• Session 2: [fire hydrant, elephant, airplane, truck, apple, hot dog, sheep].
• Session 5: [potted plant, bowl, broccoli, bottle, knife, orange, person, pizza].
Table 6. Average number of images per session (5 sessions in total) for each dataset used for fine-tuning.

| Setting | GTS | TSI | CAn | AIR | ESAT | VSR | HM | VisOnly |
| CL-5 | 43.0 | 27.0 | 45.0 | 100.0 | 10.0 | 100.0 | 100.0 | 7.0 |
| CL-20 | 170.0 | 84.0 | 180.0 | 400.0 | 40.0 | 274.6 | 300.0 | 28.0 |
| CL-50 | 430.0 | 253.8 | 450.0 | 1000.0 | 100.0 | 485.2 | 600.0 | 70.0 |
HM (Kiela et al., 2020). For the Hateful Memes dataset, since there is no labeling information that would allow us to split the images in a meaningful way, we randomly split the training images into five disjoint sets to create our final CL splits.
MMVP (Tong et al., 2024). This is the only dataset for which no training split is available, and it comprises just 300 images. For this reason, we only used it for evaluation in our experiments in the main paper. However, for completeness, we included results in Table 21 where we fine-tune on it. We use 150 images for training, which are equally split into five sessions, and the remaining 150 images are used for evaluation. Thus, the setting can be considered a 30-shot CL setting.
VisOnly (Kamoi et al., 2024). This dataset categorizes its samples into seven categories describing the nature of the
geometric and numerical information in scientific figures. We created the splits as follows:
• Session 1: Geometry-Triangle.
• Session 2: Geometry-Quadrilateral.
• Session 3: Geometry-Length
• Session 4: Geometry-Angle.
• Session 5: [Geometry-Area, 3D-Size, 3D-Angle].
D. Detailed Results
D.1. CLIP-based Updates
The detailed accuracies for all baselines and datasets used to create Table 1 of the main paper can be found in Tables 8
through 12.
Table 8. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use GTS dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.
Table 9. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use TSI dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.
Table 10. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use CAn dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.
Table 11. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use AIR dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.
Table 12. Accuracy scores (%) for LLaVA with the pretrained (Zr-Shot) or fine-tuned image encoder. All baselines use ESAT dataset for
fine-tuning the image encoder (the LLM remains frozen) via CLIP loss. We include error bars over 3 runs.
Table 13. Average accuracy (ACC) and backward transfer (BWT) scores (%) for LLaVA with the fine-tuned CLIP-L-14. Each column
indicates the setting and fine-tuning method. We include error bars over 3 runs.
| Setting | FT Dataset | Zr-Shot ACC (↑) | Zr-Shot BWT (↑) | LoRA ACC (↑) | LoRA BWT (↑) | SPU ACC (↑) | SPU BWT (↑) | LoRSU ACC (↑) | LoRSU BWT (↑) |
| CL-5 | GTS | 75.4 | 0.0 | 79.2±0.7 | −7.1±0.8 | 80.8±0.5 | 0.5±0.6 | 81.1±0.6 | 0.4±0.7 |
| CL-5 | TSI | 54.0 | 0.0 | 55.5±0.9 | −2.5±0.6 | 55.5±0.6 | 0.2±0.5 | 57.0±0.8 | 0.5±0.6 |
| CL-5 | AIR | 60.4 | 0.0 | 59.2±0.8 | −2.1±0.7 | 64.7±0.5 | 2.8±0.6 | 65.0±0.7 | 2.5±0.6 |
| CL-5 | ESAT | 76.4 | 0.0 | 73.8±0.9 | −3.4±0.6 | 79.8±0.6 | 1.5±0.7 | 82.2±0.7 | 2.0±0.6 |
| CL-20 | GTS | 75.4 | 0.0 | 77.2±0.4 | −9.1±0.5 | 82.8±0.4 | −0.6±0.3 | 83.5±0.6 | −0.4±0.3 |
| CL-20 | TSI | 54.0 | 0.0 | 60.6±0.3 | −7.2±0.4 | 60.1±0.5 | −1.7±0.3 | 62.1±0.3 | −0.9±0.4 |
| CL-20 | AIR | 60.4 | 0.0 | 64.3±0.4 | −3.6±0.6 | 65.2±0.7 | 1.1±0.4 | 65.4±0.3 | 0.9±0.4 |
| CL-20 | ESAT | 76.4 | 0.0 | 64.1±0.5 | −18.3±0.7 | 82.0±0.4 | 2.0±0.2 | 82.7±0.5 | 0.1±0.3 |
| CL-50 | GTS | 75.4 | 0.0 | 79.3±0.3 | −10.3±0.5 | 83.8±0.2 | −0.7±0.1 | 84.7±0.3 | −0.5±0.2 |
| CL-50 | TSI | 54.0 | 0.0 | 67.0±0.3 | −8.1±0.6 | 61.8±0.2 | −1.9±0.3 | 67.9±0.2 | −1.1±0.3 |
| CL-50 | AIR | 60.4 | 0.0 | 65.6±0.4 | −6.1±0.3 | 67.1±0.3 | 0.5±0.2 | 67.7±0.3 | 0.7±0.3 |
| CL-50 | ESAT | 76.4 | 0.0 | 61.4±0.3 | −27.8±0.4 | 81.2±0.3 | −2.4±0.2 | 82.1±0.4 | −0.8±0.2 |
Table 14. Exact accuracy scores (%) for each baseline used to fine-tune the model on the GTS dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 15. Exact accuracy scores (%) for each baseline used to fine-tune the model on the TSI dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 16. Exact accuracy scores (%) for each baseline used to fine-tune the model on the CAn dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 17. Exact accuracy scores (%) for each baseline used to fine-tune the model on the AIR dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 18. Exact accuracy scores (%) for each baseline used to fine-tune the model on the ESAT dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 19. Exact accuracy scores (%) for each baseline used to fine-tune the model on the HM dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 20. Exact accuracy scores (%) for each baseline used to fine-tune the model on the VSR dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 21. Exact accuracy scores (%) for each baseline used to fine-tune the model on the MMVP dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 22. Exact accuracy scores (%) for each baseline used to fine-tune the model on the VisOnly dataset under three different continual
learning (5, 10, 50 shots) settings. We include error bars over 3 runs.
Table 23. Ablation study over the effect of the rank r used by LoRSU to fine-tune the image encoder, CLIP-L-14. We report the VQA
accuracies of the last session in the 50-shot CL setting. The accuracies on the target dataset are in red color. For this experiment, we use
two attention heads to fine-tune with LoRSU.
Table 24. Ablation study over the effect of the number of attention heads used by LoRSU to fine-tune the image encoder. We report the
VQA accuracies of the last session in the 50-shot CL setting. The accuracies on the target dataset are in red color. For this experiment, we
use r = 64 for the rank of LoRSU.
Table 25. Robustness comparison of LoRSU with respect to the number of training epochs. We consider LoRSU, LoRSU-Rand where the
k attention heads are chosen randomly and LoRSU-AAH where all the attention heads are chosen for fine tuning. We use 50 shots on the
GTS for each method and we report the Target Improvement (TI) on this dataset and the Control Change (CC) using only ESAT as a
control dataset. We include error bars over 3 runs.