
This CVPR paper is the Open Access version, provided by the Computer Vision Foundation. Except for this watermark, it is identical to the accepted version; the final published version of the proceedings is available on IEEE Xplore.

Convolutional Prompting meets Language Models for Continual Learning


Anurag Roy¹   Riddhiman Moulick¹   Vinay K. Verma²*   Saptarshi Ghosh¹   Abir Das¹
¹IIT Kharagpur, ²IML Amazon India
{anurag_roy@,riddhimanmoulick@kgpian.,saptarshi@cse.,abir@cse.}iitkgp.ac.in, [email protected]

Abstract

Continual Learning (CL) enables machine learning models to learn from continuously shifting new training data in the absence of data from old tasks. Recently, pretrained vision transformers combined with prompt tuning have shown promise for overcoming catastrophic forgetting in CL. These approaches rely on a pool of learnable prompts which can be inefficient in sharing knowledge across tasks, leading to inferior performance. In addition, the lack of fine-grained layer-specific prompts does not allow these to fully express the strength of the prompts for CL. We address these limitations by proposing ConvPrompt, a novel convolutional prompt creation mechanism that maintains layer-wise shared embeddings, enabling both layer-specific learning and better concept transfer across tasks. The intelligent use of convolution enables us to maintain a low parameter overhead without compromising performance. We further leverage Large Language Models to generate fine-grained text descriptions of each category, which are used to get task similarity and dynamically decide the number of prompts to be learned. Extensive experiments demonstrate the superiority of ConvPrompt, which improves over the state of the art (SOTA) by ∼3% with significantly less parameter overhead. We also perform strong ablations over various modules to disentangle the importance of different components.¹

1. Introduction

In this constantly changing world, computer vision models must adapt to new and emerging concepts. However, such models often suffer from catastrophic forgetting [15, 31, 40], a phenomenon where previously learned concepts are forgotten when adapting to novel concepts. A trivial solution to this problem is to have separate models for each new task. However, this would require task identities to be available during inference, which may not be very practical. Another way to tackle this would be to keep all the data that the model was trained on, add the new data to it, and then train the model again from the beginning. Naturally, this will incur increased storage and computation as well. A line of work attempting to maintain a balance keeps a few samples from previous tasks and uses them for rehearsing the previous concepts while learning a new task [4, 6, 8, 16, 19, 29, 36, 38, 42, 56]. However, storing samples from older tasks may not always be feasible, especially where long-term storage of data is not permitted due to privacy, security or legislative concerns [17]. Therefore, developing rehearsal-free CL approaches requiring no storage of old data has become desirable.

Recently, there has been a surge of prompt tuning based approaches [9, 10, 20, 23, 34, 46, 47, 53, 54] that leverage pre-trained transformers and show promising performance without using any rehearsal data. Prompt tuning, originally introduced in NLP [28], attaches small learnable vectors to a pre-trained frozen model to properly reuse the already learned representations. Although promising for continual learning, these approaches suffer from the following drawbacks: (1) learning task-specific and task-shared information in separate layers ignores possible interaction between task-specific and task-shared components [43, 49], and (2) always learning a fixed number of prompts per task, irrespective of the tasks' similarity with the previous ones, means that the redundant prompts can lead to overfitting, specifically in cases where the tasks are highly similar, as usually seen in fine-grained datasets.

In this work, we propose ConvPrompt, which leverages task-specific prompts that are generated by convolution over task-shared parameters. We also exploit similarity between tasks to control the generation of prompts. Specifically, task-shared knowledge is modeled by a learnable embedding matrix at each layer. Prompts, on the other hand, are task-specific and are generated by convolving learnable kernels over the task-shared embeddings. The shared embeddings are free to adapt with the different tasks. The prompt-generating convolution kernels, on the other hand, are set aside once learned for a task, and a new set of kernels is employed for the next task. Such a design enables the shared embeddings to capture common concepts while allowing the convolution operations to capture the task-specific concepts from the common concepts.

* Work started before joining Amazon.
¹ Project page: https://cvir.github.io/projects/convprompt
Moreover, we employ language models to get the similarity between tasks in order to determine the number of trainable convolution kernels. Such an expansion strategy allows the model to be parameter efficient, introducing only the necessary number of new parameters as and when needed. The benefits of our approach are three-fold: (1) it facilitates knowledge transfer across tasks using task-shared embeddings; (2) convolutional prompt generation promotes efficient adaptation to new tasks with low parameter overhead; and (3) exploiting similar tasks via large language models results in a further reduction in parameters and superior performance.

Extensive experimentation across several rehearsal-free continual learning benchmarks shows that our ConvPrompt approach achieves significant performance gains compared to state-of-the-art approaches. On average, across the 3 benchmark datasets and different experimental settings, we outperform the state-of-the-art prompt based CL approaches (e.g., CODA-Prompt [46]) by a margin of ∼3% while requiring a significantly lower number of parameters.

Our contributions can be summarized as follows:
• We propose a local prompt creation mechanism by applying convolution over task-invariant global parameters, enabling an efficient transfer of local and global concepts across tasks that helps new tasks adapt better.
• We incorporate a novel language based task-similarity prediction, for the first time in continual learning, which helps to reduce the model parameters significantly without sacrificing performance and without adding significant pre-processing overhead.
• Extensive ablations and experiments over various standard datasets and experimental settings show the superiority of our approach by a significant margin.

2. Related Work

Continual Learning: CL approaches can be classified into 3 broad categories. (1) Regularization-based methods tackle catastrophic forgetting by applying regularizers that prioritize the preservation of important parameters associated with previously learned tasks. By minimizing interference between new and old knowledge through the introduction of penalty terms [1, 2, 24, 26, 58, 59] or by constraining the direction of parameter updates [14, 45], these methods encourage important parameters to remain in close proximity to previous solutions. While regularization based methods have shown promising results for smaller numbers of tasks, their performance can be less satisfactory when confronted with challenging scenarios involving a large number of tasks. (2) Dynamic architecture-based methods learn new tasks by assigning distinct parameters to each task [13, 55–57]. While these approaches exhibit the capability to learn extended sequences of tasks, they may incur substantial memory and computational overhead. Also, most approaches under this category require the information of which task an image belongs to during inference, which may be unrealistic. (3) Rehearsal-based methods [3, 4, 12, 20–22, 41, 52] store a few representative training samples from previous tasks in a buffer, which is then used for training alongside the current task. Though effective, these approaches are limited by the size of the buffer and the length of the task sequences. They are also not particularly suitable for scenarios with data privacy requirements. In contrast, ConvPrompt addresses rehearsal-free continual learning by intelligently utilizing prompts on pre-trained models.

Recently, vision transformers have performed very well in CL [12, 22, 43, 47, 52, 55]. The authors of [33] examined the impact of attention heads, while Dytox [12] acquires new skills by expanding special task tokens. LVT [52] introduced inter-task attention in vision transformers. Both methods, however, require additional memory to store previous training instances. MEAT [55] uses a parameter isolation approach while learning new tasks. However, the model's expandability is limited, restricting the number of tasks it can learn, and it requires task-ids to be known during inference. ContraCon [43] uses convolutional re-weighting of the self-attention weights. However, inference is costly as it uses augmentation-based entropic task identification.

Prompt Learning: Prompt-based CL offers robust protection against catastrophic forgetting by incorporating a small number of model instructions called prompts, rather than directly modifying encoder parameters [9, 10, 20, 23, 34, 46, 47, 53, 54]. Initial approaches like L2P [54] and DualPrompt [53] employ a prompt pool from which prompts are selected. These methods match input data to prompts without relying on task identification, using local clustering based optimization. Recently, S-Prompts [51] used prompts to continually learn in a domain incremental learning scenario, which involves learning the same set of classes under covariate distribution shifts. CODA-Prompt [46] builds on [53, 54] and applies soft attention on the prompts towards end-to-end continual learning. ProgressivePrompts [37] progressively learns new prompt tokens for each incoming task, but assumes the presence of the task-id during inference. A contemporary work, LGCL [23], uses handcrafted prompts from outside and contrastively learns to bring the output representation of the transformer close to the prompt. However, without indigenous prompt learning, this approach can act only as a plugin to existing CL approaches, with incremental improvements in performance. Our approach, ConvPrompt, stands out from the rest in its proficiency in knowledge sharing across tasks, its ability for on-the-fly prompt generation, and its effective handling of the additional parameters required.

3. Preliminaries
Figure 1. The proposed ConvPrompt architecture. The layer-wise [CLS] embedding is passed through the projection network to generate task-specific query representations, which are then matched with the prompt keys using cosine similarity to get the similarity values of the prompt generators. The prompt generators are applied over the shared embeddings to generate input-specific prompt components. A weighted average of these prompt components is calculated, with the corresponding cosine similarity values as weights, to get the input-specific key and value prompts which are applied to the pre-trained model. The learnable components are highlighted in green; the frozen components are highlighted in grey.

Continual Learning: Continual Learning (CL) trains models on tasks arriving sequentially over time without forgetting the previous ones. Each task t ∈ {1, . . . , T} contains training samples {(x_i^t, y_i^t)} where x_i^t is the i-th sample of the t-th task and y_i^t ∈ C^t is the corresponding label. The sets of class labels for different tasks are mutually exclusive, i.e., C^0 ∩ C^1 ∩ . . . ∩ C^T = ∅. We address the challenging rehearsal-free and Class-Incremental Learning (CIL) setting of CL, where a trained model f needs to predict the label y = f(x) for an unseen test sample x, regardless of its task and without access to training data from previous tasks.

Transformer Architecture: We build our approach on top of a pre-trained Vision Transformer (ViT) [11]. A transformer starts by dividing an input image x into a set of N fixed-sized patches z_1 ∈ R^{N×d}, which are then embedded into a d-dimensional space with positional encoding. A single encoder layer of ViT consists of stacks of Multi-Head Self-Attention (MHSA), layer normalization and Feed Forward Network (FFN) blocks with residual connections. Given the input z_l at the l-th layer, the output z_{l+1} is generated and goes into the next layer as input, for l ∈ {1, 2, · · · , L} where L is the total number of encoder layers. At the l-th layer, the MHSA block computes self-attention on the input z_l by using H separate self-attention heads. The self-attention values from head h ∈ {1, 2, · · · , H} at layer l are given by

A(Q_{l,h}, K_{l,h}, V_{l,h}) = \mathrm{softmax}\left(\frac{Q_{l,h} K_{l,h}^{T}}{\sqrt{d_k}}\right) V_{l,h}    (1)

where Q_{l,h} = z_l W^Q_{l,h}, K_{l,h} = z_l W^K_{l,h} and V_{l,h} = z_l W^V_{l,h} are the query, key and value, with learnable weights W^Q_{l,h}, W^K_{l,h} and W^V_{l,h} ∈ R^{d×d_h} respectively. d_h = d/H is the dimension of the key, query and value vectors. The activations from the different attention heads are then concatenated and residually added to the input z_l before performing layer normalization. The resulting activations are passed through an FFN block consisting of two linear layers and an activation function (usually GELU). After another residual connection and layer normalization, the output z_{l+1} of the l-th layer is generated.

Prompt and Prefix Tuning: Prefix or prompt tuning aims to learn continuous vectors while the pre-trained transformer is kept frozen. It prepends l_p learnable vectors to the original keys and values of the self-attention heads at every layer. Specifically, prefix vectors P^{(K)}_{l,h}, P^{(V)}_{l,h} ∈ R^{l_p×d_h} of length l_p are concatenated with the original key K_{l,h} and value V_{l,h} respectively. The self-attention values from head h at layer l are then computed as A(Q_{l,h}, [P^{(K)}_{l,h}, K_{l,h}], [P^{(V)}_{l,h}, V_{l,h}]) following Eqn. 1. Unlike existing prompt based approaches that directly learn the additional vectors [46, 53, 54], our work focuses on creating prompts by maintaining a balance between the old knowledge of the system and the new information to be added on an ad-hoc basis.
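For concreteness, the prefix-tuned attention described above can be sketched as follows. This is a minimal PyTorch-style illustration for a single head at a single layer; the tensor names and shapes (q, k, v, prefix_k, prefix_v) are assumptions made for the example and do not correspond to the authors' released code.

```python
import torch
import torch.nn.functional as F

def prefix_attention(q, k, v, prefix_k, prefix_v):
    """Self-attention of Eqn. 1 with l_p prefix vectors prepended to keys/values.

    q, k, v:            (N, d_h)   query/key/value of one head at one layer
    prefix_k, prefix_v: (l_p, d_h) learnable prefix vectors P^(K), P^(V)
    """
    d_h = q.shape[-1]
    k = torch.cat([prefix_k, k], dim=0)            # [P^(K), K] -> (l_p + N, d_h)
    v = torch.cat([prefix_v, v], dim=0)            # [P^(V), V] -> (l_p + N, d_h)
    attn = F.softmax(q @ k.T / d_h ** 0.5, dim=-1)  # (N, l_p + N)
    return attn @ v                                 # (N, d_h)

# Toy usage: 197 tokens ([CLS] + patches), head dimension 64, prefix length 20.
q = torch.randn(197, 64); k = torch.randn(197, 64); v = torch.randn(197, 64)
out = prefix_attention(q, k, v, torch.randn(20, 64), torch.randn(20, 64))
print(out.shape)  # torch.Size([197, 64])
```

Only the prefix vectors would be trained in plain prefix tuning; the frozen backbone supplies q, k and v.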
4. Methodology

In this work, we propose a prompt-based CL approach (ConvPrompt) that generates prompt vectors for each new task in combination with knowledge learned previously (ref. Fig. 1). Knowledge from previous tasks is modeled by a learnable embedding matrix shared between all tasks. The task-specific prompts are created by employing a convolution operation on the shared embedding matrix. While existing works [53, 54] have shown that prompting a pre-trained transformer performs well, they rely on a single set of prompt vectors that needs to compress all necessary information of an image into one single set. Rather than one set of l_p prompt vectors, following CODA-Prompt [46], we have M such sets, which we call prompt components. However, unlike CODA-Prompt, the prompt components are not directly learned in our approach. Instead, they are generated from previous task knowledge by employing M convolution kernels (known as the prompt generators) in each head of each layer. A weighted combination of the prompt components provides the final l_p prompt vectors, where the weights come from the [CLS] embedding at each layer of the ViT corresponding to the input image. The weighting not only allows us to optimize the model end-to-end but also provides a unique blend of previous task knowledge and the input image.

4.1. Prompt Generation

The prompt vectors in our approach are dynamically generated by small convolution operations. Convolution is performed between two learnable components – (i) shared embeddings and (ii) prompt generators. Corresponding to each head h ∈ {1, 2, · · · , H} in each layer l ∈ {1, 2, · · · , L} of the transformer, there are shared embedding matrices SE^K_{l,h} and SE^V_{l,h}, respectively for the keys and the values. These matrices are shared across tasks. The prompt generators are single-channel convolution kernels which are applied on the shared embeddings. We learn a set of M prompt generators for each head in each layer, for both keys and values. For head h and layer l, the prompt generators are denoted as G^K_{l,h,m} and G^V_{l,h,m}, where m ∈ {1, 2, · · · , M}. The shared embeddings are of dimension (l_p + k − 1) × (d_h + k − 1), where k is the size of the convolution kernels of the prompt generators. The convolution operation results in prompt components PC^K_{l,h,m} and PC^V_{l,h,m} of size l_p × d_h. The convolutional prompt generators not only enable us to maintain a good trade-off between performance and a low per-task parameter requirement but also make use of the inherent inductive bias of convolution for structured prompt creation [48].

4.2. Prompt Weighting

Instead of compressing all the information into one set of prompt vectors, we make use of M prompt components to get the final prompt vectors at each head of each layer. For each prompt component PC^K_{l,h,m} (and PC^V_{l,h,m}), we generate weights (between −1 and 1) which can be interpreted as the relative importance of the particular prompt component in blending them together. We employ M learnable keys, referred to as prompt keys π ∈ R^{d_π}, for this purpose. The prompt keys work on the image features expressed as a nonlinear function of the [CLS] token at each layer. The [CLS] token is passed through a Projection Network (PN_ϕ), parameterized by ϕ, consisting of a two-layer fully-connected neural network with a ReLU activation [35] in between. The output PN_ϕ([CLS]) maps the input image to the same space as the prompt keys, where a cosine similarity is taken between PN_ϕ([CLS]) and the M prompt keys in each layer l, resulting in the importance values {s_{l,1}, s_{l,2}, · · · , s_{l,M}}. The prompt components obtained as a result of the convolution operation between the prompt generators and the shared embeddings are weighed by the M similarity scores to get the final prompts P^{(K)}_{l+1,h} and P^{(V)}_{l+1,h} as follows:

P^{(K)}_{l+1,h} = \sum_{m=1}^{M} s_{l,m} \, PC^{K}_{l,h,m}; \qquad P^{(V)}_{l+1,h} = \sum_{m=1}^{M} s_{l,m} \, PC^{V}_{l,h,m}    (2)

While existing works have linearly combined prompt components to get the final prompt, the similarity scores were generated using the same final [CLS] token across all layers. However, this requires a full forward pass through the transformer solely for getting the similarity scores, and then another pass with the final prompts at each layer for the final predictions, resulting in a total of two passes. In contrast, we utilize the [CLS] embedding from each layer, enabling us to generate the final prompts for the subsequent layers in one single pass, resulting in a significant reduction of computation. The similarity values generated from the image help different prompt components to focus on specific features towards the final prompt [46]. Building on this, we use a non-linearly learned projection network that better captures complex tasks, as shown empirically.
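The generation and weighting steps of Secs. 4.1 and 4.2 for one head's key prompt can be sketched as below. This is a minimal sketch under stated assumptions (single-channel k × k generators, a d/2–d/4 projection network, and illustrative module and variable names); it is not the official implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConvPromptLayer(nn.Module):
    """Key-prompt generation for one head of one layer (Secs. 4.1-4.2).

    The shared embedding SE^K is common to all tasks; each of the M prompt
    generators is a single-channel k x k convolution kernel producing one
    prompt component PC of size l_p x d_h, and the components are blended
    with the cosine-similarity weights of Eqn. 2.
    """
    def __init__(self, lp=20, dh=64, M=5, k=17, d_cls=768):
        super().__init__()
        self.shared_emb = nn.Parameter(torch.randn(1, 1, lp + k - 1, dh + k - 1))
        self.generators = nn.Parameter(torch.randn(M, 1, k, k))      # G_{l,h,m}
        self.prompt_keys = nn.Parameter(torch.randn(M, d_cls // 4))  # pi_m
        self.proj = nn.Sequential(                                   # projection network PN_phi
            nn.Linear(d_cls, d_cls // 2), nn.ReLU(),
            nn.Linear(d_cls // 2, d_cls // 4))

    def forward(self, cls_token):                    # cls_token: (d_cls,) per-layer [CLS]
        # M prompt components, each l_p x d_h (valid convolution, no padding)
        pc = F.conv2d(self.shared_emb, self.generators).squeeze(0)   # (M, l_p, d_h)
        # similarity weights s_{l,m} between PN([CLS]) and the prompt keys
        query = self.proj(cls_token)
        s = F.cosine_similarity(query.unsqueeze(0), self.prompt_keys, dim=-1)  # (M,)
        return (s.view(-1, 1, 1) * pc).sum(dim=0)    # weighted sum -> P^(K), (l_p, d_h)

layer = ConvPromptLayer()
prompt_k = layer(torch.randn(768))
print(prompt_k.shape)  # torch.Size([20, 64])
```

The same module, duplicated for values and repeated per head and per prompted layer, would feed the prefix-attention sketch shown earlier.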
Method | Split CIFAR-100 AT (↑) | Split CIFAR-100 FT (↓) | Split CUB-200 AT (↑) | Split CUB-200 FT (↓) | Nparam (↓) Train/Total
Joint-FT (upper bound) | 93.22 ± 0.16 | – | 88.00 ± 0.15 | – | 100/100
Seq-FT | 8.6 ± 0.43 | 42.67 ± 0.13 | 23.87 ± 0.54 | 62.52 ± 0.57 | 100/100
ER (buffer size 5000) | 82.30 ± 0.42 | 16.30 ± 0.24 | 60.73 ± 0.23 | 8.71 ± 0.65 | 100/100
LwF [27] | 64.56 ± 1.23 | 25.27 ± 1.32 | 48.73 ± 1.46 | 25.18 ± 0.31 | 100/100
L2P [54] | 82.76 ± 1.17 | 7.86 ± 0.39 | 62.21 ± 1.92 | 7.12 ± 0.33 | 0.7/100.7
L2P + LGCL [23] | 84.33 ± 0.06 | 5.83 ± 0.23 | – | – | 0.7/100.7
DualPrompt [53] | 85.07 ± 0.49 | 5.57 ± 0.20 | 66.00 ± 0.57 | 4.4 ± 0.31 | 1.3/101.3
DualPrompt + LGCL [23] | 87.23 ± 0.21 | 5.10 ± 0.15 | – | – | 1.3/101.3
CODA-Prompt [46] | 87.00 ± 0.38 | 4.78 ± 0.24 | 74.40 ± 0.74 | 6.40 ± 0.34 | 4.6/104.6
ConvPrompt | 88.87 ± 0.33 | 4.75 ± 0.15 | 80.2 ± 0.52 | 5.6 ± 0.38 | 2.0/102.0

Table 1. Results (%) on CIFAR-100 and CUB-200. Reported results are for 10 tasks with a supervised ImageNet-21k pretrained ViT as the backbone. AT denotes the average accuracy and FT the forgetting. Nparam denotes the percentage of trainable/final parameters w.r.t. that of the ViT model. The Nparam values are dynamic for our approach and the reported value (2.0/102.0) is the average of the values on Split CIFAR-100 (2.2/102.2) and Split CUB-200 (1.8/101.8).

4.3. Language Guided Prompting

To maintain a balance between learning new tasks and preserving knowledge accumulated from old tasks, we make use of both task-shared as well as task-specific parameters. In our work, PN_ϕ and SE act as shared parameters facilitating inter-task information sharing, while, as new tasks come in, the previously learnt prompt generators and prompt keys are frozen, thus making them task-specialized. A new set of these is freshly learned for every incoming task.

Let the total number of prompt generators per layer learnt till the (t − 1)-th task be M_{t−1}, while the number of prompt generators learnt for task i only is J_i. Naturally, M_{t−1} = \sum_{i=1}^{t−1} J_i, and all M_{t−1} prompt generators are kept frozen when the new set of J_t prompt generators is learned. Notwithstanding the prompt generators' role in mitigating catastrophic forgetting, it is also crucial to keep the increase in parameters in check with an increasing number of tasks. Ideally, if a task is similar to a task seen earlier, then the prompt generators learnt previously can be reused. Thus, in contrast to previous works [46] which learn a fixed number of prompt components per task, we learn a dynamic number of prompt generators depending on the similarity of the task with the previous ones. The task similarity can be naively modeled by comparing the visual features of the images from different tasks. We present an alternative framework to get task similarity cheaply using Large Language Models (LLMs) such as GPT-3 [5], which show remarkable world knowledge on a variety of topics.

Our key insight is that we can use language as a tool to get descriptions of visual attributes of different classes and use these to find task similarity. Visual attributes are additional semantic knowledge that articulates the visual concepts associated with each category. For example, some visual attributes of a bee are 'black and yellow stripes', 'two pairs of wings', 'three body segments', etc. Instead of manually writing these, we query GPT-3 to get the attributes for each set of classes in each task as they arrive. This is computationally cheap, requires no additional training, and is scalable to a large number of classes. Inspired by works like [30, 32], the class attributes are generated by using the query – "What are useful features for distinguishing a [class name] in a photo?" for each class in a task. We generate the BERT embeddings of these attributes and store them in a pool for all seen tasks. For each attribute of the current task t, the cosine similarity with all stored attribute embeddings till task t − 1 is computed. The similarity sim_t of task t with the previous tasks is the mean of such maximum similarity values across all attributes of task t. Let J_max denote the maximum number of prompt generators per task. Then J_t is given by (1 − sim_t) J_max, i.e., higher similarity leads to a lower number of prompts. Leveraging linguistic knowledge, we reduce the learnable parameters if the classes in a task have a high overlap with the previously encountered ones.

4.4. Regularization and Final Objective

To prevent over-writing of concepts captured by previous tasks in the global task-shared PN_ϕ and SE, we need to ensure that, while learning the current task, these parameters deviate less from those of the previous tasks. To achieve this, when learning task t, we regularize the set of parameters ϕ_t of the projection network and the shared embeddings SE_t to have a low l1 distance from those of the previous task as follows:

\mathcal{L}_r(\phi_t, \phi_{t-1}) = \|\phi_{t-1} - \phi_t\|_1; \qquad \mathcal{L}_r(SE_t, SE_{t-1}) = \|SE_{t-1} - SE_t\|_1    (3)

SE_t denotes the shared embedding parameters at task t for all heads and layers combined, and for both keys and values. Note that although the suffix t used with the shared embeddings may indicate that SE is task-specific, it is not; "shared" signifies that the same set of embeddings is used to learn the shared semantics among different tasks. We incrementally update the same SE for each task, using a copy of the SE from the immediately preceding task for the regularization in Eqn. 3, which is discarded after training. ϕ_t denotes the projection network parameters at the t-th task. The final objective is

\mathcal{L}_{cls}(f(x), y) + \mathbb{1}(t > 1)\,\lambda\,[\mathcal{L}_r(\phi_t, \phi_{t-1}) + \mathcal{L}_r(SE_t, SE_{t-1})]    (4)

where \mathcal{L}_{cls} denotes the classification loss, t the task-id and λ ∈ [0, 1] the hyper-parameter to weigh the loss components. The indicator function \mathbb{1}(t > 1) denotes the fact that the regularization is applied after the first task.
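The language guided expansion of Sec. 4.3 can be summarized in a short helper. The sketch below assumes that the GPT-3 generated attribute descriptions have already been embedded (e.g., with BERT); the function name, the flooring at one generator, and the rounding are illustrative assumptions rather than details taken from the paper.

```python
import torch
import torch.nn.functional as F

def num_new_generators(curr_attr_emb, prev_attr_emb, j_max=5):
    """Decide J_t = (1 - sim_t) * J_max as in Sec. 4.3.

    curr_attr_emb: (A_t, d)    embeddings of the current task's class attributes
    prev_attr_emb: (A_prev, d) pooled attribute embeddings of all previous tasks
    """
    if prev_attr_emb is None or len(prev_attr_emb) == 0:
        return j_max                                  # first task: learn the full budget
    # cosine similarity of every current attribute with every stored attribute
    sim = F.cosine_similarity(curr_attr_emb.unsqueeze(1),
                              prev_attr_emb.unsqueeze(0), dim=-1)   # (A_t, A_prev)
    sim_t = sim.max(dim=1).values.mean().item()       # mean of per-attribute maxima
    # round to an integer and keep at least one generator (assumed convention)
    return max(1, round((1.0 - sim_t) * j_max))

# Toy usage with random 768-d "attribute embeddings".
j_t = num_new_generators(torch.randn(12, 768), torch.randn(40, 768))
print(j_t)
```

A highly similar incoming task thus triggers only a few new generators, while a dissimilar one receives close to the full budget J_max.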
5. Experiments

Datasets: We evaluate ConvPrompt on the benchmark datasets ImageNet-R [18] and CIFAR-100 [25], as well as on the fine-grained dataset CUB-200 [50], in the CIL setup. ImageNet-R [18] is formed from 200 subcategories of ImageNet [44] but with images from different domains such as cartoon, graffiti and origami. It also includes some hard examples from ImageNet that standard models fail to classify. It contains 24,000 training images and 6,000 test images. Following [53, 60], we split the 200 classes into 10 tasks, with each task containing 20 classes. CIFAR-100 [25], a widely used dataset in continual learning, contains 100 classes, each having 500 training and 100 test images. Following [46, 53, 54, 60], we use the 10 task setup of CIFAR-100 with each task containing 10 classes. CUB-200 [50] is a fine-grained dataset containing 200 classes of different bird species with 5,994 training images and 5,794 test images. Following [60], we use the 10 task setup of CUB-200 with each task containing 20 classes.

Method | 5 Task AT (↑) | 5 Task FT (↓) | 10 Task AT (↑) | 10 Task FT (↓) | 20 Task AT (↑) | 20 Task FT (↓) | Nparam (↓) Train/Total
Joint-FT (upper bound) | 79.6 ± 0.87 | – | 79.6 ± 0.87 | – | 79.6 ± 0.87 | – | 100/100
Seq-FT | 21.82 ± 0.85 | 76.26 ± 0.37 | 11.42 ± 0.76 | 78.32 ± 0.64 | 8.75 ± 0.42 | 82.21 ± 0.96 | 100/100
ER (buffer size 5000) | 70.53 ± 0.68 | 17.47 ± 0.35 | 64.32 ± 0.65 | 22.35 ± 0.97 | 53.26 ± 0.83 | 34.21 ± 0.82 | 100/100
LwF [27] | 49.75 ± 0.65 | 41.36 ± 0.27 | 39.27 ± 1.92 | 51.23 ± 0.34 | 30.29 ± 1.82 | 60.32 ± 0.86 | 100/100
L2P [54] | 67.43 ± 0.11 | 5.12 ± 0.62 | 63.49 ± 0.40 | 6.85 ± 0.42 | 59.38 ± 0.50 | 5.89 ± 0.36 | 0.7/100.7
L2P + LGCL [23] | – | – | 62.51 ± 0.05 | 8.9 ± 0.17 | – | – | 0.7/100.7
DualPrompt [53] | 70.42 ± 0.88 | 4.1 ± 0.33 | 68.50 ± 0.52 | 5.14 ± 0.18 | 63.21 ± 0.49 | 5.28 ± 0.45 | 1.3/101.3
DualPrompt + LGCL [23] | – | – | 69.46 ± 0.04 | 4.2 ± 0.06 | – | – | 1.3/101.3
CODA-Prompt [46] | 75.38 ± 0.34 | 6.08 ± 0.36 | 74.24 ± 0.56 | 4.92 ± 0.21 | 70.86 ± 0.42 | 6.87 ± 0.25 | 4.6/104.6
ConvPrompt | 79.10 ± 0.47 | 3.08 ± 0.11 | 77.86 ± 0.25 | 4.33 ± 0.24 | 75.1 ± 0.39 | 4.1 ± 0.29 | 2.2/102.2

Table 2. Results (%) on ImageNet-R. Reported results are for the 5, 10 and 20 task splits of ImageNet-R with a supervised ImageNet-21k pretrained ViT as the backbone. AT denotes the average accuracy and FT the forgetting. Nparam denotes the percentage of trainable/final parameters w.r.t. that of the ViT model. The Nparam values vary for ConvPrompt with the number of tasks and the reported value is the average of the values for 5 tasks (1.64/101.64), 10 tasks (2.0/102.0) and 20 tasks (2.88/102.88).

Training and Implementation Details: We use the ViT-B/16 [11] model pre-trained on ImageNet-21k [39] as the backbone over which ConvPrompt is applied. Our projection network is a two-layer neural network having d/2 and d/4 neurons in these two layers respectively, where the input ([CLS] token) is d-dimensional. A ReLU [35] activation function is applied between the two layers. Our approach applies prompts to 7 layers of the pre-trained ViT, as our ablation experiments show that prompting more layers does not improve the performance although the parameter overhead increases. We train each task in CIFAR-100, ImageNet-R and CUB-200 for 10, 10 and 60 epochs respectively. The hyperparameter J_max is set to 5, acting as an upper limit on the number of prompt components needed per task. The hyperparameter λ, which is used to weigh the regularization terms in Eqn. 4, is set to 0.01. The experiments leading to our choice of λ and other additional experimental results can be found in the supplementary material. We present our results after conducting five random trials, where task orders were randomly selected for each run. The mean ± std values are reported.

Metrics Used: We report the average accuracy A_T and forgetting F_T calculated over the T tasks for all our experiments. Specifically, after training on T tasks is completed, A_T and F_T are calculated as follows:

A_T = \frac{1}{T} \sum_{t=1}^{T} S_{t,T}, \qquad F_T = \frac{1}{T-1} \sum_{t=1}^{T-1} \max_{t' \in \{1,\dots,T-1\}} \left( S_{t,t'} - S_{t,T} \right)    (5)

where S_{t,T} is the test classification accuracy on task t after the model has been trained on task T. In other words, average accuracy measures the average accuracy over all tasks after training on the last task, and forgetting measures the average drop in accuracy of a task, after training on the last task, from the maximum accuracy it attained.

Comparisons: We evaluate our approach against several rehearsal-free approaches. These include Learning without Forgetting (LwF) [27], Learning to Prompt (L2P) [54], DualPrompt [53] and CODA-Prompt [46]. We also evaluate against LGCL [23], which uses language based prompts and acts as a plugin to L2P and DualPrompt. Additionally, we compare our approach with a rehearsal-based approach, Experience Replay (ER) [7], with buffer size 5000. In addition, we report Joint-FT and Seq-FT performances as they serve as bounds on the performance in many situations. In Joint-FT, the ViT model is trained jointly on the training data of all the tasks combined, serving as an upper bound on the performance. Seq-FT represents fine-tuning the ViT model sequentially using only the new task's training data and is thus severely affected by catastrophic forgetting.

5.1. Results and Analysis

As shown in Table 1 and Table 2, ConvPrompt outperforms both rehearsal-free and rehearsal-based approaches significantly. On average, ConvPrompt outperforms the existing state-of-the-art CODA-Prompt [46] by ∼3% while using only ∼40% of the trainable parameters used by CODA-Prompt. Our approach shows slightly more forgetting in the 10-task setup of Split CUB-200. However, even with more forgetting, our approach is able to outperform the existing approaches by at least a 3% margin, confirming that our design enables efficient adaptation to new tasks by preventing overfitting, which leads to a higher maximum accuracy attained on each task. This indicates that our convolutional prompt creation mechanism is able to utilize the shared inter-task concepts better than CODA-Prompt [46] and DualPrompt [53].
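The average accuracy and forgetting values reported in the tables follow Eqn. 5. A minimal helper computing both from the matrix of per-task accuracies is sketched below; the function name and the toy accuracy matrix are assumptions made for illustration.

```python
def average_accuracy_and_forgetting(S):
    """Compute A_T and F_T of Eqn. 5 from S[t][tp], the accuracy on task t
    (0-indexed) measured after training on task tp. S is a T x T list of lists."""
    T = len(S)
    A_T = sum(S[t][T - 1] for t in range(T)) / T
    F_T = sum(
        max(S[t][tp] for tp in range(T - 1)) - S[t][T - 1]
        for t in range(T - 1)
    ) / (T - 1)
    return A_T, F_T

# Toy usage with a 3-task run (entries below the diagonal are unused).
S = [[90.0, 85.0, 82.0],
     [ 0.0, 88.0, 84.0],
     [ 0.0,  0.0, 91.0]]
print(average_accuracy_and_forgetting(S))  # approximately (85.67, 6.0)
```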
Dataset | SLCA | SLCA + ConvPrompt
Split CIFAR-100 | 91.53 ± 0.28 | 90.60 ± 0.35
Split ImageNet-R | 77.00 ± 0.33 | 78.5 ± 0.37
Split CUB-200 | 84.71 ± 0.40 | 87.12 ± 0.31

Table 3. Results with SLCA: We report the AT values for SLCA and for SLCA + ConvPrompt. ConvPrompt, when applied on top of SLCA, improves its performance by 1–2%.

Method | AT (↑) | FT (↓) | Nparam (↓)
Upper Bound | 79.7 ± 0.15 | – | 100/100
ViT (linear probing) (Lower Bound) | 62.1 ± 0.52 | 5.74 ± 0.23 | 0.02/100
+ SE | 67.3 ± 0.58 | 5.19 ± 0.21 | 0.7/100.7
+ SE + NN | 73.82 ± 0.84 | 9.92 ± 0.48 | 14.9/114.9
+ SE + Conv | 73.92 ± 0.65 | 9.43 ± 0.33 | 3.4/103.4
+ SE + Conv + PN (1-layer) | 76.28 ± 0.23 | 4.46 ± 0.17 | 3.5/103.5
+ SE + Conv + PN (2-layer) | 77.96 ± 0.54 | 4.75 ± 0.57 | 3.7/103.7
ViT + SE + Conv + PN (2-layer) + Task-Sim (ConvPrompt) | 77.86 ± 0.25 | 4.33 ± 0.24 | 2.0/102.0

Table 4. Ablation over ImageNet-R 10 tasks: We report AT and FT averaged over 5 trials. We also report the number of trainable parameters of each of the variants. ViT denotes the ViT pre-trained on ImageNet-21k over which we apply the different components.

Comparison with SLCA: We also conducted a comparison with a recent state-of-the-art approach, Slow Learner with Classifier Alignment (SLCA) [60], that does not use prompts for CL. Instead, it fine-tunes the whole network with a smaller learning rate for the representation layers and a larger learning rate for the classification layer. This approach has demonstrated superior performance compared to existing continual prompt-tuning methods. As it involves full network tuning, it is computationally expensive, making it impractical in resource-constrained scenarios. In contrast, our approach is viable in situations where compute is limited. In this experiment, we show that our approach can also exploit differential learning rates and classifier alignment as SLCA does and provide further improvement in performance. Specifically, in such cases, we learn the prompts along with the transformer weights during fine-tuning. The results in Table 3 demonstrate that our approach combined with SLCA outperforms it on two out of the three datasets, achieving the new state of the art, while on the CIFAR-100 dataset it is almost at par with SLCA.

5.2. Ablation Studies and Other Analysis

We perform all our ablation studies on the 10 task setup of the ImageNet-R dataset unless otherwise mentioned. Throughout the study, we consider the ImageNet-21k pre-trained ViT-B/16 model, onto which we gradually integrate our modules to showcase their importance. Table 4 shows the results.

Exploiting Shared Embeddings across Tasks: Adding the shared embeddings on top of the ViT-B/16 baseline already gives better performance than L2P [54], which uses a pool of prompts (ref. row '+SE'). This shows that knowledge shared across tasks, when properly regularized, can itself be better than purely individual prompts.

Role of Prompt Generators: Next, we add the task-specific prompt generators on top of the task-shared embeddings. We experimented with two functional forms of the prompt generators – (a) convolution kernels and (b) neural networks. Both types show significant improvement (ref. rows '+SE+NN' and '+SE+Conv'). However, using a neural network is naturally heavier on compute compared to a convolutional prompt generator, as shown in the rightmost column. Overall, the addition of task-specific prompt generators leads to a significant improvement of ∼6% over the task-shared-only approach and an improvement of ∼3% over DualPrompt [53], which also uses both task-shared and task-specific knowledge. This reaffirms our hypothesis that prompt generation over shared embeddings helps in the effective sharing of inter-task concepts for better adaptation to newer tasks.

Significance of Prompt Weighting: We then add our prompt weighting mechanism, which gives a further improvement of ∼3% (ref. the two rows starting with '+SE+Conv+PN'). This signifies that our prompt weighting exploits important prompts better. To investigate whether a non-linearity in the projection network helps, we compare the performance of a linear projection network (single-layer fully-connected neural net) with a non-linear one (two-layer fully-connected neural net with a ReLU in between) and observe that the non-linearity is naturally better with complex visual data, albeit with a slight increase in parameter count.

Language helps reduce parameters: Finally, we add the language driven task-similarity to dynamically determine the number of prompt generators (ref. the last row of Table 4), which makes up the full ConvPrompt. This leads to a significant parameter reduction while maintaining the same performance. On datasets containing very similar tasks (e.g., CUB-200) the performance also gets boosted (∼1%), possibly due to less overfitting as a result of the reduced number of parameters.

Effect of Prompt Length: We analyze the effect of the length (l_p) of the prompt vectors on the performance of different prompt based CL approaches, including ours. For this purpose, we ran experiments with different prompt lengths, increasing the length in multiples of 4 from 4 to 40. As can be seen in Fig. 2a, ConvPrompt performs the best across all values of l_p, and the performance has an increasing trend till the prompt length reaches 20, after which it saturates. So, we use l_p = 20 unless otherwise mentioned.

Effect of Increasing Layers to Prompt: To analyze whether prompting every layer helps, we applied prompts to different numbers of layers of the pre-trained backbone for ConvPrompt and the closely related approaches (ref. Fig. 2b). Specifically, starting with prompts in the first 5 layers, we go on to apply prompts till the last layer (i.e., the 12th layer). As seen, ConvPrompt's performance peaks at 7 layers before stagnating, while for DualPrompt and CODA-Prompt the performance decreases after 5 layers and then stagnates.
Figure 2. Ablation over Split ImageNet-R 10 tasks: (a) Average accuracy vs. prompt length: the performance peaks at 8 for CODA-Prompt and at 20 for the rest of the models. (b) Average accuracy vs. number of layers prompts are applied to: the performance of ConvPrompt peaks at 7 layers while it peaks at 5 for the other models. (c) Average accuracy vs. maximum number of prompt components per task: the performance of ConvPrompt peaks at J_max = 5 prompts per task, with the final prompt count after 10 tasks reaching 18 owing to task similarity. (d) Average accuracy vs. kernel size of the prompt generators: the performance of ConvPrompt peaks at kernel size 17.

Effect of Number of Prompt Generators: We analyze the effect of increasing the maximum number of prompt generators J_max (ref. Fig. 2c). We observe that the performance peaks at J_max = 5, after which it decreases.

Effect of Kernel Size of Prompt Generators: To understand the impact of the prompt generator's kernel size, we varied the convolution kernel size k from 5 to 25 in steps of 2. As can be seen in Fig. 2d, convolution kernels of size 17 give the best results, with the performance slightly decreasing and then stagnating for higher values. We chose kernel size 17 as the default value since it leads to a better trade-off between the number of trainable parameters and performance.

Effect of Prompting before or after Projection: We analyze the effect of concatenating the prompt vectors P^{(K)}_{l,h} and P^{(V)}_{l,h} with the original key K_{l,h} and value V_{l,h} respectively, vis-à-vis concatenating them with the input z_l which is then projected to get the keys and values. Performance does not vary much if the projection is performed after concatenation (C→P) compared to the other way round (P→C) (Table 6). Naturally, computation is higher (by ∼0.2B MACs) in (C→P) due to bigger matrix-vector multiplications. With low compute overhead and comparable performance, (P→C) is advantageous and is used in our experiments.

Datasets | P→C AT (↑) | P→C FT (↓) | C→P AT (↑) | C→P FT (↓)
CIFAR-100 | 88.87 ± 0.33 | 4.75 ± 0.15 | 88.24 ± 0.31 | 3.86 ± 0.34
ImageNet-R | 77.86 ± 0.25 | 4.33 ± 0.24 | 77.76 ± 0.28 | 3.65 ± 0.27
CUB-200 | 80.2 ± 0.52 | 5.6 ± 0.38 | 80.1 ± 0.45 | 5.7 ± 0.26

Table 6. Prefixes before and after projection on the 10-task setup.

Best way to measure Task Similarity: As the similarity between tasks plays a crucial role in performance as well as in the number of additional parameters trained, we tried different avenues to measure task similarity. Specifically, we experimented with (i) class-label based task similarity and (ii) image-based task similarity. In the first approach, instead of taking GPT-3 generated attributes, we directly used the class labels, and in the second we extracted visual features using a pretrained ViT-B/16 and used them to measure similarity with previous tasks. Table 5 shows the performance and parameter requirements of these on 10-task ImageNet-R. It can be seen that the attribute-based class similarity leads to the introduction of the fewest parameters as well as the best performance.

Method | AT (↑) | FT (↓) | Nparam (↓)
class-label based sim | 76.93 ± 0.45 | 4.49 ± 0.57 | 2.22/102.22
image-based sim | 77.18 ± 0.42 | 4.38 ± 0.47 | 2.29/102.29
Class Attribute based sim (ConvPrompt) | 77.86 ± 0.25 | 4.33 ± 0.24 | 2.00/102.00

Table 5. ImageNet-R 10 tasks: Comparison of different task-similarity measures. We report AT and FT averaged over 5 trials.

Comparing Inference Time: We compared the inference time with competing prompt tuning based approaches, namely L2P [54], DualPrompt [53] and CODA-Prompt [46]. Specifically, for a model, we compute the number of MACs (Multiply and Accumulate Operations) an input image consumes during inference after the model has been trained on all the 10 tasks of the ImageNet-R dataset. As can be seen in Table 7, ConvPrompt requires the least compute, backed by the single pass it needs for prompting.

Method | MACs
L2P [54] | 35.85B
DualPrompt [53] | 33.72B
CODA-Prompt [46] | 33.72B
ConvPrompt | 17.98B

Table 7. Comparison of Inference Times: MAC (multiply-accumulate) operations required for the evaluation of tasks once the model has been fully trained on the 10 tasks of Split ImageNet-R.

6. Conclusion

In this paper, we proposed ConvPrompt, a novel convolutional prompt generation mechanism coupled with a task similarity based expansion strategy for rehearsal-free CL. Different from the existing approaches, our approach creates prompts in each layer by applying convolution over task-shared embeddings, enabling better knowledge transfer across tasks. Moreover, our expansion strategy with LLM driven task similarity ensures that this performance boost is achieved without a significant increase in the number of learnable parameters. Extensive experimentation showed that ConvPrompt outperforms SOTA baselines significantly while requiring fewer additional parameters.
References

[1] Rahaf Aljundi, Francesca Babiloni, Mohamed Elhoseiny, Marcus Rohrbach, and Tinne Tuytelaars. Memory aware synapses: Learning what (not) to forget. In The European Conference on Computer Vision (ECCV), 2018. 2
[2] Rahaf Aljundi, Klaas Kelchtermans, and Tinne Tuytelaars. Task-free continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11254–11263, 2019. 2
[3] Rahaf Aljundi, Min Lin, Baptiste Goujaud, and Yoshua Bengio. Gradient based sample selection for online continual learning. Advances in Neural Information Processing Systems, 32, 2019. 2
[4] Jihwan Bang, Heesu Kim, YoungJoon Yoo, Jung-Woo Ha, and Jonghyun Choi. Rainbow memory: Continual learning with a memory of diverse samples. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8218–8227, 2021. 1, 2
[5] Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901, 2020. 5
[6] Pietro Buzzega, Matteo Boschini, Angelo Porrello, and Simone Calderara. Rethinking experience replay: a bag of tricks for continual learning. In 2020 25th International Conference on Pattern Recognition (ICPR), pages 2180–2187, 2021. 1
[7] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and Marc'Aurelio Ranzato. On tiny episodic memories in continual learning. arXiv preprint arXiv:1902.10486, 2019. 6
[8] Arslan Chaudhry, Marcus Rohrbach, Mohamed Elhoseiny, Thalaiyasingam Ajanthan, Puneet K Dokania, Philip HS Torr, and M Ranzato. Continual learning with tiny episodic memories. In International Conference on Machine Learning, 2019. 1
[9] Haoran Chen, Zuxuan Wu, Xintong Han, Menglin Jia, and Yu-Gang Jiang. Promptfusion: Decoupling stability and plasticity for continual learning. arXiv preprint arXiv:2303.07223, 2023. 1, 2
[10] Marco D'Alessandro, Alberto Alonso, Enrique Calabrés, and Mikel Galar. Multimodal parameter-efficient few-shot class incremental learning. arXiv preprint arXiv:2303.04751, 2023. 1, 2
[11] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021. 3, 6
[12] Arthur Douillard, Alexandre Ramé, Guillaume Couairon, and Matthieu Cord. Dytox: Transformers for continual learning with dynamic token expansion. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2
[13] Verma et al. Efficient feature transformations for discriminative and generative continual learning. In CVPR, 2021. 2
[14] Mehrdad Farajtabar, Navid Azizan, Alex Mott, and Ang Li. Orthogonal gradient descent for continual learning. In International Conference on Artificial Intelligence and Statistics, pages 3762–3773. PMLR, 2020. 2
[15] Ian J. Goodfellow, Mehdi Mirza, Da Xiao, Aaron Courville, and Yoshua Bengio. An empirical investigation of catastrophic forgetting in gradient-based neural networks, 2015. 1
[16] Tyler L Hayes, Kushal Kafle, Robik Shrestha, Manoj Acharya, and Christopher Kanan. Remind your neural network to prevent catastrophic forgetting. In European Conference on Computer Vision, pages 466–483. Springer, 2020. 1
[17] Jiangpeng He and Fengqing Zhu. Exemplar-free online continual learning. arXiv preprint arXiv:2202.05491, 2022. 1
[18] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization. ICCV, 2021. 5
[19] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified classifier incrementally via rebalancing. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 1
[20] Zhiyuan Hu, Jiancheng Lyu, Dashan Gao, and Nuno Vasconcelos. Pop: Prompt of prompts for continual learning, 2023. 1, 2
[21] David Isele and Akansel Cosgun. Selective experience replay for lifelong learning. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018.
[22] Kishaan Jeeveswaran, Prashant Bhat, Bahram Zonooz, and Elahe Arani. Birt: Bio-inspired replay in vision transformers for continual learning. arXiv preprint arXiv:2305.04769, 2023. 2
[23] Muhammad Gul Zain Ali Khan, Muhammad Ferjad Naeem, Luc Van Gool, Didier Stricker, Federico Tombari, and Muhammad Zeshan Afzal. Introducing language guidance in prompt-based continual learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11463–11473, 2023. 1, 2, 5, 6
[24] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017. 2
[25] Alex Krizhevsky et al. Learning multiple layers of features from tiny images. Citeseer, 2009. 5
[26] Sang-Woo Lee, Jin-Hwa Kim, Jaehyun Jun, Jung-Woo Ha, and Byoung-Tak Zhang. Overcoming catastrophic forgetting by incremental moment matching. Advances in Neural Information Processing Systems, 30, 2017. 2
[27] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017. 5, 6
[28] Pengfei Liu, Weizhe Yuan, Jinlan Fu, Zhengbao Jiang, Hiroaki Hayashi, and Graham Neubig. Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys, 55(9):1–35, 2023. 1
[29] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning. Advances in Neural Information Processing Systems, 30, 2017. 1
[30] Chengzhi Mao, Revant Teotia, Amrutha Sundar, Sachit Menon, Junfeng Yang, Xin Wang, and Carl Vondrick. Doubly right object recognition: A why prompt for visual rationales. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2722–2732, 2023. 5
[31] Michael McCloskey and Neal J. Cohen. Catastrophic interference in connectionist networks: The sequential learning problem. Psychology of Learning and Motivation, 24:109–165, 1989. 1
[32] Sachit Menon and Carl Vondrick. Visual classification via description from large language models. arXiv preprint arXiv:2210.07183, 2022. 5
[33] Seyed Iman Mirzadeh, Arslan Chaudhry, Dong Yin, Timothy Nguyen, Razvan Pascanu, Dilan Gorur, and Mehrdad Farajtabar. Architecture matters in continual learning. arXiv, 2022. 2
[34] Jun-Yeong Moon, Keon-Hee Park, Jung Uk Kim, and Gyeong-Moon Park. Online class incremental learning on stochastic blurry task boundary via mask and visual prompt tuning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11731–11741, 2023. 1, 2
[35] Vinod Nair and Geoffrey E. Hinton. Rectified linear units improve restricted Boltzmann machines. In International Conference on Machine Learning, pages 807–814, Madison, WI, USA, 2010. Omnipress. 4, 6
[36] Ameya Prabhu, Philip HS Torr, and Puneet K Dokania. GDumb: A simple approach that questions our progress in continual learning. In European Conference on Computer Vision, pages 524–540. Springer, 2020. 1
[37] Anastasia Razdaibiedina, Yuning Mao, Rui Hou, Madian Khabsa, Mike Lewis, and Amjad Almahairi. Progressive prompts: Continual learning for language models. In International Conference on Learning Representations, 2023. 2
[38] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H Lampert. iCaRL: Incremental classifier and representation learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2001–2010, 2017. 1
[39] Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv, 2021. 6
[40] Anthony V. Robins. Catastrophic forgetting, rehearsal and pseudorehearsal. Connection Science, 7:123–146, 1995. 1
[41] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. Advances in Neural Information Processing Systems, 32, 2019. 2
[42] David Rolnick, Arun Ahuja, Jonathan Schwarz, Timothy Lillicrap, and Gregory Wayne. Experience replay for continual learning. In Advances in Neural Information Processing Systems, 2019. 1
[43] Anurag Roy, Vinay Verma, Sravan Voonna, Kripabandhu Ghosh, Saptarshi Ghosh, and Abir Das. Exemplar-free continual transformer with convolutions. In International Conference on Computer Vision (ICCV), 2023. 1, 2
[44] Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of Computer Vision, 115(3), 2015. 5
[45] Gobinda Saha, Isha Garg, and Kaushik Roy. Gradient projection memory for continual learning. In International Conference on Learning Representations, 2021. 2
[46] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim, Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: Continual decomposed attention-based prompting for rehearsal-free continual learning. arXiv preprint arXiv:2211.13218, 2022. Accepted for publication at CVPR 2023. 1, 2, 3, 4, 5, 6, 8
[47] Yu-Ming Tang, Yi-Xing Peng, and Wei-Shi Zheng. When prompt-based incremental learning does not meet strong pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1706–1716, 2023. 1, 2
[48] Yun-Yun Tsai, Chengzhi Mao, and Junfeng Yang. Convolutional visual prompt for robust visual perception. In Neural Information Processing Systems, 2023. 4
[49] Vinay Kumar Verma, Kevin J Liang, Nikhil Mehta, Piyush Rai, and Lawrence Carin. Efficient feature transformations for discriminative and generative continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13865–13875, 2021. 1
[50] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. California Institute of Technology, 2011. 5, 6
[51] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts learning with pre-trained transformers: An Occam's razor for domain incremental learning. In Conference on Neural Information Processing Systems (NeurIPS), 2022. 2
[52] Zhen Wang, Liu Liu, Yiqun Duan, Yajing Kong, and Dacheng Tao. Continual learning with lifelong vision transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 171–181, 2022. 2
[53] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, et al. DualPrompt: Complementary prompting for rehearsal-free continual learning. In European Conference on Computer Vision, 2022. 1, 2, 3, 4, 5, 6, 7, 8
[54] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. Learning to prompt for continual learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 139–149, 2022. 1, 2, 3, 4, 5, 6, 7, 8
[55] Mengqi Xue, Haofei Zhang, Jie Song, and Mingli Song. Meta-attention for ViT-backed continual learning. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2
[56] Shipeng Yan, Jiangwei Xie, and Xuming He. DER: Dynamically expandable representation for class incremental learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2021. 1
[57] Jaehong Yoon, Eunho Yang, Jeongtae Lee, and Sung Ju Hwang. Lifelong learning with dynamically expandable networks. In International Conference on Learning Representations (ICLR), 2018. 2
[58] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017. 2
[59] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995. PMLR, 2017. 2
[60] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: Slow learner with classifier alignment for continual learning on a pre-trained model. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023. 5, 6, 7
