RanPAC
[email protected], [email protected],
[email protected], [email protected],
[email protected]
Abstract
Continual learning (CL) aims to incrementally learn different tasks (such as clas-
sification) in a non-stationary data stream without forgetting old ones. Most CL
works focus on tackling catastrophic forgetting under a learning-from-scratch
paradigm. However, with the increasing prominence of foundation models, pre-
trained models equipped with informative representations have become available
for various downstream requirements. Several CL methods based on pre-trained
models have been explored, either utilizing pre-extracted features directly (which
makes bridging distribution gaps challenging) or incorporating adaptors (which
may be subject to forgetting). In this paper, we propose a concise and effective
approach for CL with pre-trained models. Given that forgetting occurs during
parameter updating, we contemplate an alternative approach that exploits training-
free random projectors and class-prototype accumulation, which thus bypasses
the issue. Specifically, we inject a frozen Random Projection layer with nonlin-
ear activation between the pre-trained model’s feature representations and output
head, which captures interactions between features with expanded dimensional-
ity, providing enhanced linear separability for class-prototype-based CL. We also
demonstrate the importance of decorrelating the class-prototypes to reduce the
distribution disparity when using pre-trained representations. These techniques
prove to be effective and circumvent the problem of forgetting for both class- and
domain-incremental continual learning. Compared to previous methods applied to
pre-trained ViT-B/16 models, we reduce final error rates by between 20% and 62%
on seven class-incremental benchmark datasets, despite not using any rehearsal
memory. We conclude that the full potential of pre-trained models for simple,
effective, and fast continual learning has not hitherto been fully tapped. Code is
available at https://github.com/RanPAC/RanPAC.
1 Introduction
Continual Learning (CL) is the subfield of machine learning within which models must learn from a
distribution of training samples and/or supervision signals that change over time (often divided into a
distinct set of T episodes/tasks/stages) while remaining performant on anything learned previously
during training [51, 53]. Traditional training methods do not work well for CL because parameter
updates become biased to newer samples, overwriting what was learned previously. Moreover, training
on sequential disjoint sets of data means there is no opportunity to learn differences between samples
from different stages [48]. These effects are often characterised as ‘catastrophic forgetting’ [51].
Our contributions are summarised as follows:
1. We examine in detail the CP strategy for CL with pre-trained networks, show that it benefits from
injecting a Random Projection layer followed by nonlinear activation, and illustrate why. We also
analyse why it is important to follow the lead of [37] to linearly transform CPs, via decorrelation
using second-order feature statistics.
2. We show that random projections are particularly useful when also using PETL methods with
first-session training (see Section 2.1). Accuracy with this combination approaches the CL joint
training upper bound on some datasets (Table 2). For ViT-B/16 pre-trained models, we report the
highest-to-date rehearsal-free CL accuracies on all class-incremental and domain-incremental
datasets we tested on, with large margins when compared to past CP strategies.
3. We highlight the flexibility of our resulting CL algorithm, RanPAC; it works with arbitrary
feature vectors (e.g. ViT, ResNet, CLIP), and is applicable to diverse CL scenarios including
class-incremental (Section 5), domain-incremental (Section 5) and task-agnostic CL (Appendix F).
2 Related Work
2.1 Three strategies for CL with strong pre-trained models
Prompting of transformer networks: Using a ViT-B/16 network [9], Learning To Prompt (L2P) [57],
and DualPrompt [56] reported large improvements over the best CL methods that do not leverage
pre-trained models, by training a small pool of prompts that update through the CL process. CODA-
Prompt [46], S-Prompt [55] and PromptFusion [5] then built on these, showing improvements in
performance.
Careful fine-tuning of the backbone: SLCA [63] achieved superior accuracy to prompting strategies by
fine-tuning a ViT backbone with a lower learning rate than in the classifier head. However, it was
found that the use of softmax necessitated the introduction of a ‘classifier alignment’ method, which
incurs a high memory cost in the form of a feature covariance matrix for every class. Another example
of this strategy used selective fine-tuning of some ViT attention blocks [47], combined with a traditional
CL method, L2 parameter regularization. Fine-tuning was also applied to the CLIP vision-language
model, combined with the well-established CL method LwF [8].
Class-Prototype (CP) accumulation: Subsequent to L2P, it was pointed out for CL image classi-
fiers [20] that comparable performance can be achieved by appending a nearest class mean (NCM)
classifier to a ViT model’s feature outputs (see also [39]). This strategy can be significantly boosted
by combining with Parameter-Efficient Transfer Learning (PETL) methods (originally proposed for
NLP models in a non-CL context [17, 12]) trained only on the first CL stage (‘first-session training’)
to bridge any domain gap [65, 37]. The three PETL methods considered by [65] for transformer
networks, and the FiLM method used by [37] for CNNs have in common with the first strategy
(prompting) that they require learning of new parameters, but avoid updating any parameters of
the backbone pre-trained network. Importantly, [37] also showed that a simple NCM classifier is
easily surpassed in accuracy by also accumulating the covariance matrix of embedding features,
and learning a linear classifier head based on linear discriminant analysis (LDA) [35]. The simple
and computationally lightweight algorithm of [37] enables CL to proceed after the first session in
a perfect manner relative to the union of all training episodes, with the possibility of catastrophic
forgetting avoided entirely.
CPs are well suited to CL generally [42, 7, 31, 16] and for application to pre-trained models [20, 65,
37], because when the model from which feature vectors are extracted is frozen, CPs accumulated
across T tasks will be identical regardless of the ordering of the tasks. Moreover, their memory cost
is low compared with using a rehearsal buffer, the strategy integral to many CL methods [53].
As mentioned, the original non-CL usage of a frozen RP layer followed by nonlinear projection, as
in [43, 4, 18], had different motivations from ours, characterized by the following three properties. First,
keeping weights frozen removes the computational cost of training them. Second, when combined
with a linear output layer, the mean-square-error-optimal output weights can be learned by exact
numerical computation using all training data simultaneously (see Appendix B.3) instead of iteratively.
Third, nonlinearly activating random projections of features is motivated by the
assumption that nonlinear random interactions between features may be more linearly separable than
the original features. Analysis of the special case of pair-wise interactions induced by nonlinearity
can be found in [32], and mathematical properties for general nonlinearities (with higher order
interactions) have also been discussed extensively, e.g. [4, 18].
3 Background
3.1 Continual learning problem setup
For CL, using conventional cross-entropy loss by linear probing or fine-tuning the feature representa-
tions of a frozen pre-trained model creates risks of task-recency bias [31] and catastrophic forgetting.
Benefiting from the high-quality representations of a pre-trained model, the most straightforward
Class-Prototype (CP) strategy is to use Nearest Class Mean (NCM) classifiers [58, 34], as applied and
investigated by [65, 20]. CPs for each class are usually constructed by averaging the extracted feature
vectors over training samples within each class, which we denote for class y as c̄y . At inference, the
class of a test sample is determined by finding the highest similarity between its representation and
the set of CPs. For example, [65] use cosine similarity to find the predicted class for a test sample,
y_test = argmax_{y′ ∈ {1,...,K}} s_{y′},    s_y := f_test^⊤ c̄_y / (‖f_test‖ · ‖c̄_y‖).    (1)
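For concreteness, the NCM decision rule of Eqn. (1) amounts to a cosine-similarity nearest-prototype lookup. The following is a minimal NumPy sketch with random features standing in for pre-trained embeddings; all names and sizes are illustrative, not from the paper's codebase:

```python
import numpy as np

rng = np.random.default_rng(0)
L, K = 768, 10  # feature dimension and number of classes (illustrative)

# Class prototypes c̄_y: per-class means of extracted feature vectors.
features = rng.normal(size=(200, L))   # toy "training" features
labels = rng.integers(0, K, size=200)
prototypes = np.stack([features[labels == y].mean(axis=0) for y in range(K)])

def ncm_predict(f_test, prototypes):
    """Nearest-class-mean prediction via cosine similarity, as in Eqn. (1)."""
    sims = (prototypes @ f_test) / (np.linalg.norm(prototypes, axis=1)
                                    * np.linalg.norm(f_test))
    return int(np.argmax(sims))

y_hat = ncm_predict(features[0], prototypes)
```

Because the prototypes are simple per-class means, they can be accumulated across tasks without any gradient updates, which is what makes this baseline forgetting-free.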
However, it is also not difficult to go beyond NCM within the same general CL strategy, by leveraging
second-order feature statistics [11, 37]. For example, [37] finds consistently better CL results with pre-
trained CNNs than NCM using an incremental version [36, 11] of Linear Discriminant Analysis (LDA)
classification [35], in which the covariance matrix of the extracted features is continually updated.
Under mild assumptions, LDA is equivalent to comparing feature vectors and class-prototypes using
Mahalanobis distance (see Appendix B.4), i.e. different from the cosine similarity used by [65].
We will also use incrementally calculated second-order feature statistics, but create a simplification
compared with LDA (see Appendix B.4), by using the Gram matrix of the features, G, and cy (CPs
with averaging dropped), to obtain the predicted label
y_test = argmax_{y′ ∈ {1,...,K}} s_{y′},    s_y := f_test^⊤ G^{−1} c_y.    (2)
Like LDA, but unlike cosine similarity, this form makes use of a training set to ‘calibrate’ similarities.
This objective has a basis in long established theory for least square error predictions of one-hot
encoded class labels [35] (see Appendix B.3). Similar to incremental LDA [36, 11], during CL
training, we describe in Section 4.3 how the Gram matrix and the CPs corresponding to cy can easily
be updated progressively with each task. Note that Eqn. (2) is expressed in terms of the maximum
number of classes after T tasks, K. However, for CIL, it can be calculated after completion of tasks
t < T with fewer classes than K. For DIL, all K classes are often available in all tasks.
4 The proposed approach and theoretical insights: RanPAC
We show in Section 5 that Eqn. (2) leads to better results than NCM. We attribute this to the fact
that raw CPs are often highly correlated between classes, resulting in poorly calibrated cosine
similarities, whereas the use of LDA or Eqn (2) mostly removes correlations between CPs, creating
better separability between classes. To illustrate these insights, we use the example of a ViT-B/16
transformer model [9] pre-trained on ImageNet-21K with its classifier head removed, and data from
the well-established 200-class Split Imagenet-R CIL benchmark [56].
For comparison with CP approaches, we jointly trained, on all 200 classes, a linear probe softmax
classifier. We treat the weights of the joint probe as class-prototypes for this exercise and then
find the Pearson correlation coefficients between each pair of prototypes as shown for the first
10 classes in Fig. 2 (right). Compared with the linear probe, very high off-diagonal correlations
are clearly observed when using NCM, with a mean value more than twice that of the original
ImageNet-1K training data treated in the same way. This illustrates the extent of the domain shift
for the downstream dataset. However, these correlations are mostly removed when using Eqn (2).
Fig. 2 (left) shows that high correlation coefficients coincide with poorly calibrated cosine similarities
between class-prototypes and training samples, both for true-class comparisons (similarities between
a sample’s feature vector and the CP for the class label corresponding to that sample) and inter-class
comparisons (similarities between a sample’s feature vector and the class prototypes for the set of N−1
classes not equal to the sample’s class label). However, when using Eqn. (2), the result (third row) is
to increase the training-set accuracy from 64% to 75%, coinciding with reduced overlap between inter-
and true-class similarity distributions, and significantly reduced off-diagonal correlation coefficients
between CPs. The net result is linear classification weights that are much closer to those produced by
the jointly-trained linear probe. These results are consistent with known mathematical relationships
between Eqn (2) and decorrelation, which we outline in Appendix B.4.4.
CP methods that use raw c̄y for CL assume a Gaussian distribution with isotropic covariance.
Empirically, we have found that when used with pre-trained models this assumption is invalid
(Figure 2). One can learn a non-linear function (e.g. using SGD training of a neural network) with
where W_(i) denotes the i-th column. That is, the expected inner products of projections of these two
features can be decomposed. Now, we have W_(i) ∼ N(0, σ²I), thus E_W[W_(i)] = E_W[W_(j)] = 0,
and the second term in Eqn. (3) vanishes. Further, Σ_{i=1}^{M} E_W[W_(i)²] = Mσ². We can make two
observations (theoretical details are provided in Appendix B.2):
1. As M increases, the likelihood that the norm of any projected feature approaches the variance
increases. In other words, the projected vectors in higher dimensions almost surely reside on the
boundary of the distribution, almost equidistant from the mean (the distribution approaches an
isotropic Gaussian).
2. As M increases, it is more likely for angles between two randomly projected instances to be
distinct (i.e. the inner products in the projected space are more likely to be larger than some
constant).
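Observation 1 can be checked numerically: the relative spread (coefficient of variation) of the projected norm ‖W⊤f‖ shrinks as M grows. The following Monte Carlo sketch is our own toy check, not code from the paper:

```python
import numpy as np

rng = np.random.default_rng(2)
L, trials = 50, 100
f = rng.normal(size=L)

# For W with i.i.d. N(0, 1) entries, ||W^T f||^2 is ||f||^2 times a chi-squared
# variable with M degrees of freedom, so its relative spread decays like 1/sqrt(M).
cv = {}
for M in (10, 1000):
    norms = np.array([np.linalg.norm(rng.normal(size=(L, M)).T @ f)
                      for _ in range(trials)])
    cv[M] = norms.std() / norms.mean()  # coefficient of variation of the norm
```

The spread at M = 1000 is roughly an order of magnitude smaller than at M = 10, consistent with projected vectors concentrating at a nearly fixed distance from the mean.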
This discussion can readily be extended to incorporate nonlinearity. The benefit of incorporating
nonlinearity is to (i) incorporate interaction terms [54, 50], and (ii) simulate higher-dimensional
projections that otherwise could be prohibitively large in number. To see the latter point, denoting by
ϕ the nonlinearity of interest, we have ϕ(W^⊤ f) ≈ Ŵ^⊤ f̂, where f̂ is obtained from a linear expansion
using a Taylor series and Ŵ is the corresponding projection. The Taylor expansion of the nonlinear
function ϕ gives rise to higher-order interactions between dimensions. Although vectors of interaction
terms can be formed directly, as in the methods of [38, 54], this is computationally prohibitive for
non-trivial L. Hence, the use of nonlinear projections of the form h_test := ϕ(f_test^⊤ W) is a convenient
alternative, known to work effectively in non-CL contexts [43, 4, 18, 32].
Using the random projection discussed above, with ϕ(·) as an element-wise nonlinear activation
function, given feature sample f_{t,n} we obtain length-M representations for CL training in each task,
h_{t,n} := ϕ(f_{t,n}^⊤ W) (Fig. 1). For inference, h_test := ϕ(f_test^⊤ W) is used in s_y in Eqn. (2) instead
of ftest . We define H as an M × N matrix in which columns are formed from all hk,n and for
convenience refer to only the final H after all N samples are used. We now have an M × M Gram
matrix for the features, G = HH⊤ . The random projections discussed above are sampled once and
left frozen throughout all CL stages.
Like the covariance matrix updates in streaming LDA applied to CL [11, 37], variables are updated
either for individual samples, or one entire CL stage, D_t, at a time. We introduce the matrix C to denote
the concatenated column vectors of all the c_y. Rather than the covariance, S, we update the Gram matrix
and the CPs in C iteratively. This can be achieved either one task at a
time, or one sample at a time, since we can express G and C as summations over outer products as
G = Σ_{t=1}^{T} Σ_{n=1}^{N_t} h_{t,n} ⊗ h_{t,n},    C = Σ_{t=1}^{T} Σ_{n=1}^{N_t} h_{t,n} ⊗ y_{t,n}.    (4)
Both C and G will be invariant to the sequence in which the entirety of N training samples are
presented, a property ideal for CL algorithms.
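The order invariance of G and C in Eqn. (4) is easy to verify in a toy NumPy sketch (synthetic projected features h; the variable names are ours):

```python
import numpy as np

rng = np.random.default_rng(3)
M_dim, K = 20, 4  # projected-feature dimension and number of classes

# Three toy "tasks", each a batch of (h, one-hot y) pairs.
tasks = []
for t in range(3):
    H_t = rng.normal(size=(30, M_dim))
    Y_t = np.eye(K)[rng.integers(0, K, size=30)]
    tasks.append((H_t, Y_t))

def accumulate(task_order):
    """Eqn. (4): update G and C one task at a time via sums of outer products."""
    G = np.zeros((M_dim, M_dim))
    C = np.zeros((M_dim, K))
    for H_t, Y_t in task_order:
        G += H_t.T @ H_t  # sum of h ⊗ h over the task's samples
        C += H_t.T @ Y_t  # sum of h ⊗ y
    return G, C

G_fwd, C_fwd = accumulate(tasks)
G_rev, C_rev = accumulate(tasks[::-1])  # reversed task ordering
```

Since both quantities are plain sums over samples, any permutation of tasks (or of individual samples) yields identical G and C.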
The aforementioned origins of Eqn. (2) in least-squares theory are of practical use; we find it works best to
use ridge regression [35] and calculate the l2-regularized inverse, (G + λI)^{−1}, where I denotes the
identity matrix. The value of λ is chosen systematically using cross-validation—see Appendix C. The revised
score for CL can be rewritten as
s_y = ϕ(f_test^⊤ W)(G + λI)^{−1} c_y.    (5)
In matrix form the scores can be expressed as predictions for each class label as ypred = htest Wo .
Different to streaming LDA, our approach has the benefits of (i) removing the need for bias calcula-
tions from the score; (ii) updating the Gram matrix instead of the covariance avoids outer products of
means; and (iii) the form Wo = (G + λI)−1 C arises as a closed form solution for mean-square-error
loss with l2 -regularization (see Appendix B.3), in contrast to NCM, where no such theoretical result
exists. Phase 2 in Algorithm 1 summarises the above CL calculations.
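The closed-form connection to ridge regression can be verified directly: W_o = (G + λI)^{-1} C zeroes the gradient of the l2-regularized mean-square-error objective. A toy check (our own, with arbitrary sizes):

```python
import numpy as np

rng = np.random.default_rng(4)
N, M_dim, K, lam = 100, 15, 3, 0.5

H = rng.normal(size=(N, M_dim))            # projected features (rows = samples)
Y = np.eye(K)[rng.integers(0, K, size=N)]  # one-hot targets

G = H.T @ H
C = H.T @ Y
W_o = np.linalg.solve(G + lam * np.eye(M_dim), C)

# Gradient of  ||H W - Y||_F^2 + lam * ||W||_F^2  at W = W_o:
# 2 (H^T H W - H^T Y + lam W) = 2 ((G + lam I) W_o - C), which should vanish.
grad = 2 * (H.T @ (H @ W_o - Y) + lam * W_o)
```

This is the theoretical result referenced above: the Gram-matrix solution is exactly the l2-regularized least-squares minimizer, with no iterative training required.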
Application of Eqn. (5) is superficially similar to AdaptMLP modules [6], but instead of a bottleneck
layer, we expand to dimensionality M > L, since past applications found this to be necessary to
compensate for W being random rather than learned [43, 4, 18, 32]. As discussed in the introduction,
the use of a random and training-free weights layer is particularly well suited to CL.
The value of transforming the original features to nonlinear random projections is illustrated in
Fig. 3. Features for the T = 10 split ImageNet-R CIL dataset were extracted from a pre-trained
ViT-B/16 [9] network and Eqn. (5) applied after each of the T = 10 tasks. Fig. 3(a) shows the
typical CL average accuracy trend, whereby accuracy falls as more tasks are added. When a nonlinear
activation, ϕ(·) is used (e.g. ReLU or squaring), performance improves as M increases, but when
the nonlinear activation is omitted, accuracy is no better than not using RP, even with M very large.
On the other hand, if dimensionality is reduced without nonlinearity (in this case from 768 to 500),
performance drops below the No-RP case, highlighting that RPs that reduce dimensionality in this
application lead to poor performance.
Fig. 3(b) casts light on why nonlinearity is important. We use only the first 100 extracted features per
sample and compare application of Eqn. (2) to raw feature vectors (black trace) and to pair-wise interaction
terms, formed from the flattened cross-product of each extracted feature vector (blue trace). The
latter significantly outperforms the former. Moreover, when Eqn. (5) is used instead (red trace),
the drop in performance compared with flattened cross products is relatively small. Although this
suggests exhaustively creating products of features instead of using RPs, doing so is computationally infeasible.
As an alternative, RP is a convenient and computationally cheap means to create nonlinear feature
interactions that enhance linear separability, with particular value in CL with pre-trained models.
Use of an RP layer has the benefit of being model agnostic, e.g. it can be applied to any feature
extractor. As we show, it can be applied orthogonally to PETL methods. PETL is very appealing for
CL, particularly approaches that do not alter any learned parameters of the original pre-trained model,
such as [10, 28]. We combine RP with a PETL method trained using CL-compatible ‘first-session’
training, as carried out by [65, 37]. This means training PETL parameters only on the first CL task,
D1 , and then freezing them thereafter (see Phase 1 of Algorithm 1). The rationale is that the training
data and labels in the first task may be more representative of the downstream dataset than that
used to train the original pre-trained model. If a new dataset drifts considerably, e.g. as in DIL, the
Figure 3: Impact of RP compared to alternatives. (a) Using only Phase 2 of Algorithm 1, we
show average accuracy (see Results) after each of T = 10 tasks in the split ImageNet-R dataset. The
data illustrates the value of nonlinearity combined with large numbers of RPs, M . (b) Illustration
that interaction terms created from feature vectors extracted from frozen pre-trained models contain
important information that can be mostly recovered when RP and nonlinearity are used.
benefits of PETL may be reduced because of this choice. But the assumption is that the domain gap
between the pre-training dataset and the new dataset is significant enough to still provide a benefit.
First-session training of PETL parameters requires a temporary linear output layer, learned using
SGD with cross-entropy loss and softmax and only K1 classes, which is discarded prior to Phase 2.
For transformer nets, we experiment with the same three methods as [65], i.e. AdaptFormer [6],
SSF [28], and VPT [21]. For details of these methods see the cited references and Appendix D.
Unlike [65] we do not concatenate adapted and unadapted features, as we found this added minimal
value when using RP.
The RP layer increases the total memory required by the class prototypes by a factor of
1 + (M − L)/L, while an additional LM parameters are frozen and untrainable. For typical values of
L = 768 and K = 200, the injection of an RP layer of size M = 10000 therefore creates ∼10 times the
number of trainable parameters, and adds ∼10M non-trainable parameters. Although this is a significant
number of new parameters, it is still small compared with the overall size of the ViT-B/16 model, with
its 84 million parameters. Moreover, the weights of W can be bipolar instead of Gaussian and, if stored
using a single bit for each element, contribute only a tiny fraction of additional model memory. During
training only, we also require updating an M × M Gram matrix, which is smaller than the K (L × L)
covariance matrices of the SLCA fine-tuning approach [63] if M < √K L. L2P [57], DualPrompt [56]
and ADaM [65] each use ∼0.3M–0.5M trainable parameters. For K = 200 classes and M = 10000,
RanPAC uses substantially more (∼2–2.5M, depending on the PETL method), comparable to
CODA-Prompt [46]. RanPAC also uses 10M untrained parameters,
but we highlight that these are not trainable. Moreover, M need not be as high as 10000 (Table A5).

Algorithm 1 RanPAC Training
Input: Sequence of T tasks, D = {D_1, . . . , D_T}, pre-trained model, and PETL method
Phase 1: ‘First-session’ PETL adaptation.
  for sample n = 1, . . . , N_1 in D_1 do
    Extract feature vector, f_{1,n}
    Use in SGD training of PETL parameters
  end for
Phase 2: Continual Learning with RPs.
  Create frozen L × M RP weights, W ∼ N(0, 1)
  for task t = 1, . . . , T do
    for sample n = 1, . . . , N_t in D_t do
      Extract feature vector, f_{t,n}
      Apply h_{t,n} = ϕ(f_{t,n}^⊤ W)
      Update G and C matrices using Eqn. (4)
    end for
    Optimize λ (Appendix C) and compute (G + λI)^{−1}
  end for
5 Experiments
We have applied Algorithm 1 to both CIL and DIL benchmarks. For the pre-trained model, we
experiment mainly with two ViT-B/16 models [9] as per [65]: one self-supervised on ImageNet-21K,
and another with supervised fine-tuning on ImageNet-1K. Comparisons are made using a standard CL
metric, Average Accuracy [30], which is defined as A_t = (1/t) Σ_{i=1}^{t} R_{t,i}, where R_{t,i} is the classification
accuracy on the i-th task following training on the t-th task. We report final accuracy, A_T, in the
main paper, with analysis of each A_t and R_{t,i} left for Appendix F, along with forgetting metrics and
analysis of variability across seeds. Appendix F also shows that our method works well with ResNet
and CLIP backbones.
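The Average Accuracy metric can be computed from a matrix of per-task accuracies R_{t,i}; below is a minimal helper with hypothetical accuracy values:

```python
def average_accuracy(R, t):
    """A_t = (1/t) * sum_{i=1}^{t} R_{t,i}: mean accuracy over the first t tasks,
    evaluated after training on task t. R is 1-indexed conceptually; here R[t-1][i-1]."""
    return sum(R[t - 1][i - 1] for i in range(1, t + 1)) / t

# Hypothetical accuracies: R[t-1][i-1] = accuracy on task i after training task t.
R = [
    [0.95, 0.0,  0.0 ],
    [0.90, 0.92, 0.0 ],
    [0.88, 0.90, 0.91],
]
A_3 = average_accuracy(R, 3)  # final accuracy A_T for T = 3
```

Reading down a column of R (e.g. task 1: 0.95 → 0.90 → 0.88) shows per-task forgetting, while A_t summarises performance across all tasks seen so far.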
We use split datasets previously used for CIL or DIL (see citations in Tables 1 and 3); details are
provided in Appendix E. We use M = 10000 in Algorithm 1 except where stated; investigation of
scaling with M is in Appendix F.5. All listed ‘Joint’ results are from non-CL training on
the entirety of D, using cross-entropy loss and softmax.
Key indicative results for CIL are shown in Table 1, for T = 10, with equally sized stages (except
VTAB which is T = 5, identical to [65]). For T = 5 and T = 20, see Appendix F. For each dataset,
our best method surpasses the accuracy of prompting methods and the CP methods of [20] and [65],
by large margins. Ablations of Algorithm 1 listed in Table 1 show that inclusion of our RP layer
results in error-rate reductions of between 11% and 28% when PETL is used. The gain is reduced
otherwise, but is ≥ 8%, except for VTAB. Table 1 also highlights the limitations of NCM.
Method CIFAR100 IN-R IN-A CUB OB VTAB Cars
Joint full fine-tuning 93.8% 86.6% 70.8% 90.5% 83.8% 92.0% 86.9%
SLCA [63] 91.5% 77.0% - 84.7%∗ - - 67.7%
Ours (Algorithm 1) 92.2% 78.1% 61.8% 90.3% 79.9% 92.6% 77.7%
Table 2: Comparison with fine-tuning strategies for CIL. Results for SLCA are taken directly from [63].
‘Joint’ means full fine-tuning of the entire pre-trained ViT network, i.e. a continual-learning upper
bound. Notes: L2 [47] achieved 76.1% on ImageNet-R, but reported no other ViT results. Cells
containing ‘-’ indicate results that SLCA did not report, and no codebase is available yet. Superscript ∗:
the number of CUB training samples used by [63] is smaller than that defined by [65], which we use.
6 Conclusion
We have demonstrated that feature representations extracted from pre-trained foundation models such
as ViT-B/16 have not previously achieved their full potential for continual learning. Application of
our simple and rehearsal-free class-prototype strategy, RanPAC, results in significantly reduced error
rates on diverse CL benchmark datasets, without risk of forgetting in the pre-trained model. These
findings highlight the benefits of CP strategies for CL with pre-trained models.
Limitations: The value of Eqns. (4) and (5) is completely reliant on the supply of a good generic
feature extractor. For this reason, they are unlikely to be as powerful if used in CL methods that train
networks from scratch. However, it is possible that existing CL methods that utilise self-supervised
learning, or that otherwise create good feature-extractor backbones, could leverage similar approaches
to ours for downstream CL tasks. As discussed in Section 4.5, RanPAC uses additional parameters
compared to methods like L2P. However, this is arguably worth trading-off for the simplicity of
implementation and low-cost training.
Future Work: Examples in Appendix F show that our method works well with other CL protocols
including: (i) task-agnostic, i.e. CL without task boundaries during training (e.g. Gaussian scheduled
CIFAR-100), (ii) use of a non one-hot-encoded target, e.g. language embeddings in the CLIP model,
and (iii) regression targets, which requires extending the conceptualisation of class-prototypes to
generic feature prototypes. Each of these has a lot of potential extensions and room for exploration.
Other interesting experiments that we leave for future work include investigation of combining our
approach with prompting methods, and (particularly for DIL) investigation of whether training PETL
parameters beyond the first session is feasible without catastrophic forgetting. Few-shot learning with
pre-trained models [1, 45] may also benefit from a RanPAC-like algorithm.
Acknowledgements
This work was supported by the Centre for Augmented Reasoning at the Australian Institute for
Machine Learning, established by a grant from the Australian Government Department of Education.
Dong Gong is the recipient of an Australian Research Council Discovery Early Career Award (project
number DE230101591) funded by the Australian Government. We would like to thank Sebastien
Wong of DST Group, Australia, and Lingqiao Liu of the Australian Institute for Machine Learning,
The University of Adelaide, for valuable suggestions and discussion.
References
[1] Peyman Bateni, Raghav Goyal, Vaden Masrani, Frank Wood, and Leonid Sigal. Improved
few-shot visual classification. In Proc. Conference on Computer Vision and Pattern Recognition
(CVPR), 2020. 10
[2] Pietro Buzzega, Matteo Boschini, Angelo Porrello, Davide Abati, and Simone Calderara. Dark
experience for general continual learning: A strong, simple baseline. Advances in Neural
Information Processing Systems, 33:15920–15930, 2020. 2
[3] Arslan Chaudhry, Puneet K. Dokania, Thalaiyasingam Ajanthan, and Philip H. S. Torr. Rie-
mannian walk for incremental learning: Understanding forgetting and intransigence. In Proc.
European Conference on Computer Vision (ECCV), September 2018. 21
[4] C. L. Philip Chen. A rapid supervised learning neural network for function interpolation and
approximation. IEEE Transactions on Neural Networks, 7:1220–1230, 1996. 2, 3, 4, 6, 7
[5] Haoran Chen, Zuxuan Wu, Xintong Han, Menglin Jia, and Yu-Gang Jiang. PromptFusion:
Decoupling stability and plasticity for continual learning. arXiv:2303.07223, Submitted 13
March 2023, 2023. 3, 9
[6] Shoufa Chen, Chongjian GE, Zhan Tong, Jiangliu Wang, Yibing Song, Jue Wang, and Ping
Luo. AdaptFormer: Adapting vision transformers for scalable visual recognition. In S. Koyejo,
S. Mohamed, A. Agarwal, D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural
Information Processing Systems, volume 35, pages 16664–16678. Curran Associates, Inc., 2022.
2, 7, 8, 20, 21
[7] Matthias De Lange and Tinne Tuytelaars. Continual prototype evolution: Learning online from
non-stationary data streams. In Proc. IEEE/CVF International Conference on Computer Vision
(ICCV), pages 8230–8239, 2021. 3
[8] Yuxuan Ding, Lingqiao Liu, Chunna Tian, Jingyuan Yang, and Haoxuan Ding. Don’t stop
learning: Towards continual learning for the CLIP model. arXiv: 2207.09248, 2022. 2, 3
[9] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai,
Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly,
Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image
recognition at scale. arXiv:2010.11929 (ICLR 2020), 2020. 3, 5, 7, 8
[10] Beyza Ermis, Giovanni Zappella, Martin Wistuba, Aditya Rawal, and Cedric Archambeau.
Memory efficient continual learning with transformers. In S. Koyejo, S. Mohamed, A. Agarwal,
D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems,
volume 35, pages 10629–10642. Curran Associates, Inc., 2022. 2, 7
[11] Tyler L. Hayes and Christopher Kanan. Lifelong machine learning with deep streaming linear
discriminant analysis. arXiv:1909.01520, 2020. 2, 4, 7
[12] Junxian He, Chunting Zhou, Xuezhe Ma, Taylor Berg-Kirkpatrick, and Graham Neubig. To-
wards a unified view of parameter-efficient transfer learning. arXiv: 2110.04366 (ICLR 2022),
2022. 3
11
[13] Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kadavath, Frank Wang, Evan Dorundo,
Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, Dawn Song, Jacob Steinhardt, and Justin
Gilmer. The many faces of robustness: A critical analysis of out-of-distribution generalization.
In Proc. IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC,
Canada, October 10-17, 2021, pages 8320–8329. IEEE, 2021. 21
[14] Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Steinhardt, and Dawn Song. Natural adver-
sarial examples. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR), pages 15262–15271, June 2021. 21
[15] Guillaume Hocquet, Olivier Bichler, and Damien Querlioz. OvA-INN: Continual learning
with invertible neural networks. In Proc. International Joint Conference on Neural Networks
(IJCNN), pages 1–7, 2020. 2
[16] Saihui Hou, Xinyu Pan, Chen Change Loy, Zilei Wang, and Dahua Lin. Learning a unified
classifier incrementally via rebalancing. In Proc. 2019 IEEE/CVF Conference on Computer
Vision and Pattern Recognition (CVPR), pages 831–839, 2019. 3
[17] Edward Hu, Yelong Shen, Phil Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Lu Wang, and Weizhu
Chen. LoRA: Low-rank adaptation of large language models. arXiv:2106.09685 (ICLR 2022),
2021. 3
[18] Guang-Bin Huang, Qin-Yu Zhu, and Chee-Kheong Siew. Extreme learning machine: Theory
and applications. Neurocomputing, 70:489–501, 2006. 2, 3, 4, 6, 7
[19] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan
Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi,
Ali Farhadi, and Ludwig Schmidt. OpenCLIP, July 2021. URL https://doi.org/10.5281/
zenodo.5143773. 29
[20] Paul Janson, Wenxuan Zhang, Rahaf Aljundi, and Mohamed Elhoseiny. A simple baseline that
questions the use of pretrained-models in continual learning. arXiv: 2210.04428 (2022 NeurIPS
Workshop on Distribution Shifts), 2023. 2, 3, 4, 9, 10, 16
[21] Menglin Jia, Luming Tang, Bor-Chun Chen, Claire Cardie, Serge Belongie, Bharath Hariharan,
and Ser-Nam Lim. Visual prompt tuning. In Proc. European Conference on Computer Vision
(ECCV), 2022. 2, 8, 20, 21
[22] Shibo Jie, Zhi-Hong Deng, and Ziheng Li. Alleviating representational shift for continual
fine-tuning. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition
(CVPR) Workshops, pages 3810–3819, June 2022. 2
[23] Agnan Kessy, Alex Lewin, and Korbinian Strimmer. Optimal whitening and decorrelation. The
American Statistician, 72:309–314, 2018. 19
[24] Simon Kornblith, Jonathon Shlens, and Quoc V. Le. Do better imagenet models transfer better?
In Proc. IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2019, Long
Beach, CA, USA, June 16-20, 2019, pages 2661–2671. Computer Vision Foundation / IEEE,
2019. 2
[25] Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3D object representations for fine-
grained categorization. In Proc. IEEE International Conference on Computer Vision Workshops,
pages 554–561, 2013. 21
[26] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master’s thesis, Dept of
CS, University of Toronto. See http://www.cs.toronto.edu/~kriz/cifar.html, 2009.
21
[27] Chuqiao Li, Zhiwu Huang, Danda Pani Paudel, Yabin Wang, Mohamad Shahbazi, Xiaopeng
Hong, and Luc Van Gool. A continual deepfake detection benchmark: Dataset, methods, and
essentials. arXiv:2205.05467, 2022. 9, 22
[28] Dongze Lian, Daquan Zhou, Jiashi Feng, and Xinchao Wang. Scaling & shifting your features:
A new baseline for efficient model tuning. In S. Koyejo, S. Mohamed, A. Agarwal, D. Belgrave,
K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems, volume 35,
pages 109–123. Curran Associates, Inc., 2022. 2, 7, 8, 20, 21
[29] Vincenzo Lomonaco and Davide Maltoni. Core50: A new dataset and benchmark for contin-
uous object recognition. In Sergey Levine, Vincent Vanhoucke, and Ken Goldberg, editors,
Proceedings of the 1st Annual Conference on Robot Learning, volume 78 of Proceedings of
Machine Learning Research, pages 17–26. PMLR, 13–15 Nov 2017. 9, 22
[30] David Lopez-Paz and Marc'Aurelio Ranzato. Gradient episodic memory for continual learning.
In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett,
editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates,
Inc., 2017. 9, 21
[31] Zheda Mai, Ruiwen Li, Hyunwoo Kim, and Scott Sanner. Supervised contrastive replay:
Revisiting the nearest class mean classifier in online class-incremental continual learning.
In Proc. IEEE Conference on Computer Vision and Pattern Recognition Workshops, CVPR
Workshops 2021, virtual, June 19-25, 2021, pages 3589–3599. Computer Vision Foundation /
IEEE, 2021. 3, 4
[32] Mark D. McDonnell, Robby G. McKilliam, and Philip de Chazal. On the importance of pair-
wise feature correlations for image classification. In Proc. 2016 International Joint Conference
on Neural Networks (IJCNN), pages 2290–2297, 2016. 2, 4, 6, 7
[33] Sanket Vaibhav Mehta, Darshan Patil, Sarath Chandar, and Emma Strubell. An empirical
investigation of the role of pre-training in lifelong learning. arXiv:2112.09153, 2021. 2
[34] Thomas Mensink, Jakob J. Verbeek, Florent Perronnin, and Gabriela Csurka. Metric learning
for large scale image classification: Generalizing to new classes at near-zero cost. In Proc.
European Conference on Computer Vision, 2012. 4
[35] Kevin P. Murphy. Machine Learning: A Probabilistic Perspective. MIT Press, Cambridge, MA, 2012. 3, 4, 7, 18, 19
[36] Shaoning Pang, S. Ozawa, and N. Kasabov. Incremental linear discriminant analysis for
classification of data streams. IEEE Transactions on Systems, Man, and Cybernetics, Part B
(Cybernetics), 35:905–914, 2005. 4
[37] Aristeidis Panos, Yuriko Kobe, Daniel Olmeda Reino, Rahaf Aljundi, and Richard E. Turner.
First session adaptation: A strong replay-free baseline for class-incremental learning. In Proc.
IEEE/CVF International Conference on Computer Vision (ICCV), pages 18820–18830, October
2023. 2, 3, 4, 6, 7, 10, 16, 28
[38] Yoh-Han Pao and Yoshiyasu Takefuji. Functional-link net computing: Theory, system architecture, and functionalities. Computer, 25:76–79, 1992. 6
[39] Francesco Pelosin. Simpler is better: Off-the-shelf continual learning through pretrained
backbones. arXiv:2205.01586, 2022. 2, 3
[40] Xingchao Peng, Qinxun Bai, Xide Xia, Zijun Huang, Kate Saenko, and Bo Wang. Moment
matching for multi-source domain adaptation. In Proc. IEEE International Conference on
Computer Vision, pages 1406–1415, 2019. 9, 22
[41] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In Proc. International Conference on Machine
Learning, pages 8748–8763. PMLR, 2021. 29
[42] Sylvestre-Alvise Rebuffi, Alexander Kolesnikov, Georg Sperl, and Christoph H. Lampert. iCaRL:
Incremental classifier and representation learning. In Proc. IEEE Conference on Computer
Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, pages
5533–5542. IEEE Computer Society, 2017. 3, 21
[43] Wouter F. Schmidt, Martin A. Kraaijveld, and Robert P. W. Duin. Feedforward neural networks
with random weights. In Proc. 11th IAPR International Conference on Pattern Recognition.
Vol.II. Conference B: Pattern Recognition Methodology and Systems, pages 1–4, 1992. 2, 3, 6, 7
[44] Murray Shanahan, Christos Kaplanis, and Jovana Mitrovic. Encoders and ensembles for
task-free continual learning. arXiv:2105.13327, 2021. 27
[45] Aliaksandra Shysheya, John Bronskill, Massimiliano Patacchiola, Sebastian Nowozin, and
Richard E Turner. FiT: Parameter efficient few-shot transfer learning for personalized and
federated image classification. arXiv:2206.08671, 2023. 10
[46] James Seale Smith, Leonid Karlinsky, Vyshnavi Gutta, Paola Cascante-Bonilla, Donghyun Kim,
Assaf Arbelle, Rameswar Panda, Rogerio Feris, and Zsolt Kira. CODA-Prompt: COntinual De-
composed Attention-based prompting for rehearsal-free continual learning. In Proc. IEEE/CVF
Conference on Computer Vision and Pattern Recognition (CVPR), pages 11909–11919, June
2023. 2, 3, 8, 9, 10, 16
[47] James Seale Smith, Junjiao Tian, Shaunak Halbe, Yen-Chang Hsu, and Zsolt Kira. A closer
look at rehearsal-free continual learning. arXiv:2203.17269 (revised 3 April 2023). 2, 3, 9, 10, 16
[48] Albin Soutif-Cormerais, Marc Masana, Joost van de Weijer, and Bartlomiej Twardowski. On
the importance of cross-task features for class-incremental learning. arXiv:2106.11930, 2021. 1
[49] Hai-Long Sun, Da-Wei Zhou, Han-Jia Ye, and De-Chuan Zhan. PILOT: A Pre-Trained Model-
Based Continual Learning Toolbox. arXiv:2309.07117, 2023. 9
[50] Michael Tsang, Dehua Cheng, and Yan Liu. Detecting statistical interactions from neural
network weights. In Proc. Intl Conference on Learning Representations (ICLR), 2018. 6
[51] Gido M. van de Ven, Tinne Tuytelaars, and Andreas S. Tolias. Three types of incremental
learning. Nature Machine Intelligence, 4:1185–1197, 2022. 1, 2, 4
[52] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-
UCSD Birds-200-2011 Dataset. Technical report, California Institute of Technology, 2011.
Technical Report CNS-TR-2011-001. 21
[53] Liyuan Wang, Xingxing Zhang, Hang Su, and Jun Zhu. A comprehensive survey of continual
learning: Theory, method and application. arXiv: 2302.00487, 2023. 1, 2, 3
[54] Ruoxi Wang, Bin Fu, Gang Fu, and Mingliang Wang. Deep & cross network for ad click
predictions. arXiv:1708.05123 (ADKDD’17), 2017. 6
[55] Yabin Wang, Zhiwu Huang, and Xiaopeng Hong. S-Prompts learning with pre-trained transform-
ers: An Occam’s razor for domain incremental learning. In S. Koyejo, S. Mohamed, A. Agarwal,
D. Belgrave, K. Cho, and A. Oh, editors, Advances in Neural Information Processing Systems,
volume 35, pages 5682–5695. Curran Associates, Inc., 2022. 2, 3, 10, 16
[56] Zifeng Wang, Zizhao Zhang, Sayna Ebrahimi, Ruoxi Sun, Han Zhang, Chen-Yu Lee, Xiaoqi
Ren, Guolong Su, Vincent Perot, Jennifer Dy, and Tomas Pfister. DualPrompt: Complementary
prompting for rehearsal-free continual learning. In Computer Vision – ECCV 2022: 17th
European Conference, Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part XXVI, page
631–648, Berlin, Heidelberg, 2022. Springer-Verlag. ISBN 978-3-031-19808-3. 2, 3, 5, 8, 9,
16, 21
[57] Zifeng Wang, Zizhao Zhang, Chen-Yu Lee, Han Zhang, Ruoxi Sun, Xiaoqi Ren, Guolong Su,
Vincent Perot, Jennifer G. Dy, and Tomas Pfister. Learning to prompt for continual learning. In
Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New
Orleans, LA, USA, June 18-24, 2022, pages 139–149. IEEE, 2022. 2, 3, 8, 9, 16, 27
[58] Andrew R. Webb and Keith D. Copsey. Statistical Pattern Recognition. John Wiley & Sons,
Ltd, 2011. 4
[59] Tz-Ying Wu, Gurumurthy Swaminathan, Zhizhong Li, Avinash Ravichandran, Nuno Vascon-
celos, Rahul Bhotika, and Stefano Soatto. Class-incremental learning with strong pre-trained
models. In Proc. IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR),
pages 9601–9610, 2022. 2
[60] Qingsen Yan, Dong Gong, Yuhang Liu, Anton van den Hengel, and Javen Qinfeng Shi. Learning
Bayesian sparse networks with full experience replay for continual learning. In Proc. IEEE/CVF
Conference on Computer Vision and Pattern Recognition, pages 109–118, 2022. 2
[61] Chen Zeno, Itay Golan, Elad Hoffer, and Daniel Soudry. Task agnostic continual learning using
online variational Bayes. arXiv:1803.10123, 2019. 27
[62] Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario
Lucic, Josip Djolonga, André Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, Lucas
Beyer, Olivier Bachem, Michael Tschannen, Marcin Michalski, Olivier Bousquet, Sylvain Gelly,
and Neil Houlsby. The visual task adaptation benchmark. arXiv:1910.04867, 2019. 21
[63] Gengwei Zhang, Liyuan Wang, Guoliang Kang, Ling Chen, and Yunchao Wei. SLCA: Slow
learner with classifier alignment for continual learning on a pre-trained model. In Proc.
IEEE/CVF International Conference on Computer Vision (ICCV), pages 19148–19158, October
2023. 2, 3, 8, 9, 10, 16, 21
[64] Yuanhan Zhang, Zhenfei Yin, Jing Shao, and Ziwei Liu. Benchmarking omni-vision repre-
sentation through the lens of visual realms. In Shai Avidan, Gabriel J. Brostow, Moustapha
Cissé, Giovanni Maria Farinella, and Tal Hassner, editors, Computer Vision - ECCV 2022 - 17th
European Conference, Tel Aviv, Israel, October 23-27, 2022, Proceedings, Part VII, volume
13667 of Lecture Notes in Computer Science, pages 594–611. Springer, 2022. 21
[65] Da-Wei Zhou, Han-Jia Ye, De-Chuan Zhan, and Ziwei Liu. Revisiting class-incremental learning
with pre-trained models: Generalizability and adaptivity are all you need. arXiv:2303.07338,
Submitted 13 March, 2023. 2, 3, 4, 6, 7, 8, 9, 10, 16, 18, 20, 21, 25, 26, 28
Appendices
Figure A1 provides a graphical overview of the two phases in Algorithm 1. Table A1 provides a
summary of different strategies for leveraging pre-trained models for Continual Learning (CL) and
how our own method, RanPAC, compares.
[Figure A1: schematic of the two phases. Phase 1 adapts the pre-trained model (PTM) with a PETL method while the original layers remain frozen; in Phase 2, extracted features f pass through the frozen random projection W with nonlinear activation to give activated features h, which are classified via decorrelated class-prototypes Wo to produce the output y.]
Table A1: Comparison of strategies for leveraging pre-trained models in CL.

|                          | Prompting [57, 56, 46, 55] | Fine-tuning [63, 47] | CP [20, 65, 37] | CP + RP (RanPAC) |
|--------------------------|:--:|:--:|:--:|:--:|
| No Rehearsal Buffer      | ✓ | ✓ | ✓ | ✓ |
| Pre-trained model frozen | ✓ |   | ✓ | ✓ |
| Transformers and CNNs    |   | ? | ✓ | ✓ |
| Simplicity               |   |   | ✓ | ✓ |
| Parameter-Efficient      | ✓ |   | ✓ | ✓ |
| Theoretical support      |   |   |   | ✓ |
| SOTA Performance         |   |   |   | ✓ |
Appendix B Theoretical support
B.1 Chernoff bound of the norm of the projected vectors
We provide further details for the discussion in Section 4.2. Using the Chernoff bound, the norm of the projected vector satisfies
\[
P\left(\left|\,\|\mathbf{W}^\top \mathbf{f}\| - \mathbb{E}_{\mathbf{W}}\|\mathbf{W}^\top \mathbf{f}\|\,\right| > \epsilon\sigma^2\right) \le 2\exp\left(-\frac{\epsilon^2\sigma^2}{2M+\epsilon}\right). \tag{6}
\]
This bound relates the dimensionality to the expected variation in the norm of the projected vectors. For fixed σ and ϵ, as M increases, the right-hand side approaches 1, indicating that it is more likely for the norm of the projected vector to lie within a desired distance of its expectation. In other words, projected vectors in higher dimensions almost surely reside on the boundary of the distribution, at a similar distance to the mean (the distribution is a better Gaussian fit).
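This concentration is easy to check numerically. The sketch below is an illustrative simulation only (the feature dimension, trial count, and projection sizes are arbitrary choices, not values from our experiments): it draws Gaussian random projection matrices W and measures how the relative spread of ‖W⊤f‖ shrinks as M grows.

```python
import numpy as np

rng = np.random.default_rng(0)
L = 16                                   # input feature dimension (illustrative)
f = rng.normal(size=L)

def relative_spread(M, trials=200):
    """Std/mean of ||W^T f|| over random draws of an L x M Gaussian W."""
    norms = np.array([np.linalg.norm(rng.normal(size=(L, M)).T @ f)
                      for _ in range(trials)])
    return norms.std() / norms.mean()

# Concentration improves as the projected dimension M increases.
assert relative_spread(1000) < relative_spread(10)
```

For Gaussian entries the squared norm follows a scaled chi-squared distribution with M degrees of freedom, whose relative spread decays like 1/√M, which is what the simulation reflects.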
The Gram matrix of the projected vectors can be obtained by considering the inner product of any two vectors f, f′. As presented in Eqn. (3), this is derived as:
\begin{align}
\mathbb{E}_{\mathbf{W}}\left[(\mathbf{W}^\top \mathbf{f})^\top(\mathbf{W}^\top \mathbf{f}')\right] &= \mathbb{E}_{\mathbf{W}}\left[(\mathbf{f}^\top \mathbf{W})(\mathbf{W}^\top \mathbf{f}')\right] \nonumber\\
&= \mathbb{E}_{\mathbf{W}}\Bigg[\sum_{i}^{M} \mathbf{W}_{(i)}^2\, \mathbf{f}^\top\mathbf{f}' + \sum_{i\neq j}^{M} \mathbf{W}_{(i)}^\top \mathbf{W}_{(j)}\, \mathbf{f}^\top\mathbf{f}'\Bigg] \nonumber\\
&= \underbrace{\sum_{i}^{M}\mathbb{E}_{\mathbf{W}}\left[\mathbf{W}_{(i)}^2\right]\mathbf{f}^\top\mathbf{f}'}_{=M\sigma^2\mathbf{f}^\top\mathbf{f}'} + \underbrace{\sum_{i\neq j}^{M}\mathbb{E}_{\mathbf{W}}\left[\mathbf{W}_{(i)}\right]^\top\mathbb{E}_{\mathbf{W}}\left[\mathbf{W}_{(j)}\right]\mathbf{f}^\top\mathbf{f}'}_{=0}, \tag{7}
\end{align}
where the second term is zero for any two zero-mean independently drawn random vectors W(i), W(j). We can derive the following from this expansion:
1. Using the Chernoff inequality, for any two vectors, we have
\begin{align}
P\left(\left|(\mathbf{W}^\top\mathbf{f})^\top(\mathbf{W}^\top\mathbf{f}') - \mathbb{E}_{\mathbf{W}}\left[(\mathbf{W}^\top\mathbf{f})^\top(\mathbf{W}^\top\mathbf{f}')\right]\right| > \epsilon' M\sigma^2\right) &\le 2\exp\left(-\frac{\epsilon'^2 M\sigma^2}{2+\epsilon'}\right) \nonumber\\
P\left(\left|(\mathbf{W}^\top\mathbf{f})^\top(\mathbf{W}^\top\mathbf{f}') - M\sigma^2\,\mathbf{f}^\top\mathbf{f}'\right| > \epsilon' M\sigma^2\right) &\le 2\exp\left(-\frac{\epsilon'^2 M\sigma^2}{2+\epsilon'}\right) \nonumber\\
P\left(\left|\frac{(\mathbf{W}^\top\mathbf{f})^\top(\mathbf{W}^\top\mathbf{f}')}{M\sigma^2} - \mathbf{f}^\top\mathbf{f}'\right| > \epsilon'\right) &\le 2\exp\left(-\frac{\epsilon'^2 M\sigma^2}{2+\epsilon'}\right). \tag{8}
\end{align}
This bound indicates that the inner product of the projections concentrates around Mσ²f⊤f′. In other words, as M increases it becomes increasingly unlikely for the inner product of two vectors to equal that of their projections, since the projection inflates inner products by a factor of approximately Mσ².
2. As M increases, it is more likely for the inner product of any two randomly projected instances to be distinct (i.e. the inner products in the projected space are more likely to be larger than some constant). That is because, using Markov's inequality, for a larger M it is easier to choose a larger ϵ₂ that satisfies
\[
P\left(\left|(\mathbf{W}^\top\mathbf{f})^\top(\mathbf{W}^\top\mathbf{f}')\right| \ge \epsilon_2\right) \le \frac{M\sigma^2}{\epsilon_2}. \tag{9}
\]
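The concentration behaviour can be checked with a short simulation (illustrative only; the feature dimension, σ, and trial counts are arbitrary choices): the inner product of the projections, normalised by Mσ², tracks f⊤f′ ever more closely as M grows.

```python
import numpy as np

rng = np.random.default_rng(1)
L = 16                                        # input feature dimension (illustrative)
f, f2 = rng.normal(size=L), rng.normal(size=L)
sigma = 1.0                                   # std of the random weights in W

def gram_error(M, trials=200):
    """Mean |(W^T f)^T (W^T f') / (M sigma^2) - f^T f'| over random draws of W."""
    errs = []
    for _ in range(trials):
        W = rng.normal(scale=sigma, size=(L, M))
        errs.append(abs((W.T @ f) @ (W.T @ f2) / (M * sigma**2) - f @ f2))
    return float(np.mean(errs))

# The normalised inner product concentrates around f^T f' as M grows (cf. Eqn. (8)).
assert gram_error(2000) < gram_error(20)
```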
B.3 Connection to least squares
B.4 Connection to Linear Discriminant Analysis, Mahalanobis distance and ZCA whitening
There are two reasons why we use Eqn. (2) for Algorithm 1 rather than utilizing LDA. First and foremost is the fact that our formulation is mean-square-error optimal, as per Section B.3. Second, use of the inverted Gram matrix results in two convenient simplifications compared with LDA: (i) the simple form of Eqn. (2) can be used for inference, instead of a form in which biases calculated from class-prototypes are required, and (ii) accumulation of updates to the Gram matrix and class-prototypes during the CL process, as in Eqn. (4), is more efficient than using a covariance matrix.
We now work through the theory that leads to these conclusions. The insight gained is that using Eqn. (2) is equivalent to learning a linear classifier optimized with a mean-square-error loss function and ℓ2 regularization, applied to the feature vectors of the training set. These derivations apply in both a CL and a non-CL context. For CL, the key is to realise that CPs and second-order statistics can be accumulated identically for the same overall set of data, regardless of the sequence in which it arrives for training.
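The order-invariance property can be illustrated with a toy sketch (synthetic random vectors stand in for activated features h; the dimensions, sample count, and fixed λ are arbitrary): accumulating the Gram matrix G and prototype matrix C sample-by-sample in two different orders yields the identical classifier Wo.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, n = 20, 4, 40                         # illustrative dimensions and sample count
H = rng.normal(size=(n, M))                 # stand-in for activated feature vectors h
Y = np.eye(K)[rng.integers(0, K, size=n)]   # one-hot labels
lam = 1.0                                   # fixed ridge parameter for this demo

def classifier_after_stream(order):
    """Accumulate G and C one sample at a time, as in Eqn. (4), then solve for Wo."""
    G, C = np.zeros((M, M)), np.zeros((M, K))
    for i in order:
        G += np.outer(H[i], H[i])
        C += np.outer(H[i], Y[i])
    return np.linalg.solve(G + lam * np.eye(M), C)

Wo_a = classifier_after_stream(np.arange(n))        # one task ordering
Wo_b = classifier_after_stream(rng.permutation(n))  # a different ordering
assert np.allclose(Wo_a, Wo_b)   # same data, any arrival order: identical classifier
```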
B.4.2 Relationship to LDA and Mahalanobis distance
The form of Eqn. (2) resembles Linear Discriminant Analysis (LDA) classification [35, Eqn. (4.38), p. 104]. For LDA, the score from which the predicted class for f_test is chosen is commonly expressed in a weighting and bias form
\begin{align}
\psi_y &= \mathbf{f}_{\text{test}}^\top \mathbf{a} + b \tag{18}\\
&= \mathbf{f}_{\text{test}}^\top \mathbf{S}^{-1}\bar{\mathbf{c}}_y - 0.5\,\bar{\mathbf{c}}_y^\top \mathbf{S}^{-1}\bar{\mathbf{c}}_y + \log(\pi_y), \quad y = 1,\dots,K, \tag{19}
\end{align}
where S is the M × M covariance matrix for the M-dimensional feature vectors and π_y is the frequency of class y.
Finding the maximum ψ_y is equivalent to a minimization involving the Mahalanobis distance, \(d_M := \sqrt{(\mathbf{f}_{\text{test}} - \bar{\mathbf{c}}_y)^\top \mathbf{S}^{-1} (\mathbf{f}_{\text{test}} - \bar{\mathbf{c}}_y)}\), between a test vector and the CPs, i.e. minimizing
\begin{align}
\hat{\psi}_y &= d_M^2 - \log(\pi_y^2) \tag{20}\\
&= (\mathbf{f}_{\text{test}} - \bar{\mathbf{c}}_y)^\top \mathbf{S}^{-1}(\mathbf{f}_{\text{test}} - \bar{\mathbf{c}}_y) - \log(\pi_y^2), \quad y = 1,\dots,K. \tag{21}
\end{align}
This form highlights that if all classes are equiprobable, minimizing the Mahalanobis distance suffices for LDA classification.
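As a concrete check, the sketch below (with hypothetical prototypes, covariance, and class frequencies; all values are made up for illustration) evaluates both the weight-and-bias form of Eqns. (18)-(19) and the Mahalanobis form of Eqns. (20)-(21). Since ψ̂_y = f_test⊤S⁻¹f_test − 2ψ_y, maximising ψ_y and minimising ψ̂_y select the same class.

```python
import numpy as np

rng = np.random.default_rng(0)
M, K = 8, 3                                  # feature dim and classes (illustrative)
c = rng.normal(size=(K, M))                  # hypothetical class prototypes c_y
A = rng.normal(size=(M, M))
S = A @ A.T + M * np.eye(M)                  # shared covariance (positive definite)
S_inv = np.linalg.inv(S)
pi = np.array([0.5, 0.3, 0.2])               # class frequencies pi_y
f = rng.normal(size=M)                       # a test feature vector

# Weight-and-bias form, Eqns. (18)-(19).
psi = np.array([f @ S_inv @ c[y] - 0.5 * c[y] @ S_inv @ c[y] + np.log(pi[y])
                for y in range(K)])

# Mahalanobis form, Eqns. (20)-(21).
psi_hat = np.array([(f - c[y]) @ S_inv @ (f - c[y]) - np.log(pi[y] ** 2)
                    for y in range(K)])

# Both forms pick the same class, since psi_hat_y = f^T S^{-1} f - 2 psi_y.
assert np.argmax(psi) == np.argmin(psi_hat)
```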
With reference to the final step in Algorithm 1, we optimized λ as follows. For each task t in Phase 2, the training data for that task was randomly split in the ratio 80:20. We swept λ over 17 orders of magnitude, namely λ ∈ {10⁻⁸, 10⁻⁷, …, 10⁸}, and for each value of λ used C and G updated with only the first 80% of the training data for task t to calculate Wo = (G + λI)⁻¹C. We then calculated the mean square error between targets and predictions of the form h⊤Wo on the remaining 20% of the training data, and chose the value of λ that minimized this error. Hence, λ is updated after every task, and is computed in a manner compatible with CL, i.e. without access to data from previous training tasks. It is worth noting that optimizing λ to a value between orders of magnitude could potentially boost accuracies slightly further. Note also that choosing λ using only data from the current task may not be optimal relative to non-CL learning on the same data, in which case the obvious difference would be to optimize λ on a subset of training data drawn from the entire training set.
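The selection procedure above can be sketched as follows (synthetic random features stand in for the activated features h of the current task; the dimensions and sample count are illustrative, not values from our experiments):

```python
import numpy as np

rng = np.random.default_rng(0)
M, K, n = 30, 5, 200                         # illustrative dimensions and task size
H = rng.normal(size=(n, M))                  # stand-in for activated features h
Y = np.eye(K)[rng.integers(0, K, size=n)]    # one-hot targets

# 80:20 split of the current task's training data.
idx = rng.permutation(n)
tr, va = idx[:160], idx[160:]
G = H[tr].T @ H[tr]                          # Gram matrix from the 80% split
C = H[tr].T @ Y[tr]                          # prototype (target correlation) matrix

best_lam, best_mse = None, float("inf")
for lam in [10.0 ** p for p in range(-8, 9)]:       # 17 orders of magnitude
    Wo = np.linalg.solve(G + lam * np.eye(M), C)    # Wo = (G + lambda I)^{-1} C
    mse = float(np.mean((H[va] @ Wo - Y[va]) ** 2)) # MSE on the held-out 20%
    if mse < best_mse:
        best_lam, best_mse = lam, mse

assert best_lam is not None                  # a lambda was selected from the sweep
```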
For Phase 2 in Algorithm 1, the training data was used as follows. For each sample, features were extracted from the frozen pre-trained model in order to update the G and C matrices. We then computed Wo using matrix inversion and multiplication. Hence, no SGD-based weight updates are required in Phase 2.
For Phase 1 in Algorithm 1, we used SGD to train the parameters of PETL methods, namely AdaptFormer [6], SSF [28], and VPT [21]. For each of these, we used a batch size of 48, a learning rate of 0.01, weight decay of 0.0005, momentum of 0.9, and a cosine annealing schedule finishing at a learning rate of 0. Generally we trained for 20 epochs, but in some experiments we reduced this to fewer epochs when overfitting was clear. When using these methods, a softmax output layer and cross-entropy loss were used. The number of classes was equal to the number in the first task, i.e. N1. The resulting trained weights and head were discarded prior to commencing Phase 2 of Algorithm 1.
For the reported data in Table 1 using linear probes, we used a batch size of 128, a learning rate of 0.01 in the classification head, weight decay of 0.0005, momentum of 0.9, and trained for 30 epochs. For full fine-tuning (Table 2), we used the same settings, but additionally used a learning rate of 0.0001 in the body (the pre-trained weights of the ViT backbone). We used a fixed learning rate schedule, with no reductions in learning rate. For fine-tuning, we found that a lower learning rate in the body than in the head was essential for best performance.
Data augmentation during training for all datasets included random resizing followed by cropping to 224 × 224 pixels, and random horizontal flips. For inference, images were resized to a short side of 256 pixels and then center-cropped to 224 × 224 for all datasets except CIFAR100, whose images were simply resized from the original 32 × 32 to 224 × 224.
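For illustration, this preprocessing can be approximated in pure NumPy (our experiments use standard library implementations; the nearest-neighbour resize here is a deliberate simplification):

```python
import numpy as np

rng = np.random.default_rng(0)

def resize_nn(img, new_h, new_w):
    """Nearest-neighbour resize of an H x W x C array (simplified interpolation)."""
    h, w = img.shape[:2]
    rows = (np.arange(new_h) * h / new_h).astype(int)
    cols = (np.arange(new_w) * w / new_w).astype(int)
    return img[rows][:, cols]

def train_transform(img, out=224):
    """Random (resized) crop to out x out, then a random horizontal flip."""
    h, w = img.shape[:2]
    s = int(rng.integers(min(h, w) // 2, min(h, w) + 1))   # random crop size
    top = int(rng.integers(0, h - s + 1))
    left = int(rng.integers(0, w - s + 1))
    crop = img[top:top + s, left:left + s]
    crop = resize_nn(crop, out, out)
    return crop[:, ::-1] if rng.random() < 0.5 else crop   # random horizontal flip

def eval_transform(img, short=256, out=224):
    """Resize the short side to 256 pixels, then centre-crop to out x out."""
    h, w = img.shape[:2]
    nh, nw = (short, round(w * short / h)) if h <= w else (round(h * short / w), short)
    img = resize_nn(img, nh, nw)
    top, left = (nh - out) // 2, (nw - out) // 2
    return img[top:top + out, left:left + out]

img = rng.integers(0, 256, size=(300, 400, 3), dtype=np.uint8)
assert train_transform(img).shape == (224, 224, 3)
assert eval_transform(img).shape == (224, 224, 3)
```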
Given our primary comparison is with results from [65], we use the same seed for our main experi-
ments, i.e. a seed value of 1993. This enables us to obtain identical results as [65] in our ablations.
However, we note that we use Average Accuracy as our primary metric, whereas [65] in their public
repository calculate overall accuracy after each task which can be slightly different to Average
Accuracy. For investigation of variability in Section F, we also use seeds 1994 and 1995.
The inference speed of RanPAC differs negligibly from that of the original pre-trained network, because both the RP layer and the output linear classification head (composed of decorrelated class-prototypes) are implemented as simple fully-connected layers on top of the underlying network. For training, Phase 1 trains PETL parameters using SGD for 20 epochs on (1/T)-th of the training set, so it is much faster than joint training. Phase 2 is generally only slightly slower than running all training data through the network in inference mode, because the backbone is frozen. The slowest part is the repeated inversion of the Gram matrix during selection of λ, but even for M = 10000 this takes on the order of 1 minute per task on a CPU, and could easily be optimized further if needed. In our view, the efficiency and simplicity of our approach compare very favorably with the alternatives.
C.4 Compute
All experiments were conducted on a single PC running Ubuntu 22.04.2 LTS, with 32 GB of RAM and an Intel Core i9-13900KF processor (32 threads). Acceleration was provided by a single NVIDIA GeForce RTX 4090 GPU.
Appendix D Parameter-Efficient Transfer Learning (PETL) methods
We experiment with the same three methods as [65], i.e. AdaptFormer [6], SSF [28], and VPT [21].
Details can be found in [65]. For VPT, we use the deep version, with prompt length 5. For
AdaptFormer, we use the same settings as in [65], i.e. with projected dimension equal to 64.
Appendix E Datasets
E.1 Class Incremental Learning (CIL) Datasets
The seven CIL datasets we use are summarised in Table A2. For ImageNet-A, CUB, OmniBenchmark and VTAB, we used the specific train-validation splits defined and outlined in detail by [65]. Those four datasets, plus ImageNet-R (the CL split created by [56]), were downloaded from links provided at https://github.com/zhoudw-zdw/RevisitingCIL. CIFAR100 was accessed through torchvision. Stanford Cars was downloaded from https://ai.stanford.edu/~jkrause/cars/car_dataset.html.
For Stanford Cars and T = 10, we use 16 classes in the first task and 20 in each of the 9 subsequent tasks. It is interesting to note that VTAB has characteristics of both CIL and DIL. Unlike the three DIL datasets we use, VTAB introduces new disjoint sets of classes in each of its 5 tasks, with each task originating in a different domain. For this reason we use only T = 5 for VTAB, whereas we explore T = 5, T = 10 and T = 20 for the other CIL datasets.
Table A2: CIL Datasets. We list references for the original source of each dataset and for split CL
versions of them. In the column headers, N is the total number of training samples, K is the number
of classes following training on all tasks, and # val samples is the number of validation samples in
the standard validation sets.
For DIL, we list the domains for each dataset in Table A3. Further details can be found in the cited
references in the first column of Table A3. As in previous work, validation data includes samples from
each domain for CDDB-Hard, and DomainNet, but three entire domains are reserved for CORe50.
We provide results measured by Average Accuracy and Average Forgetting. Average Accuracy is defined as [30]
\[
A_t = \frac{1}{t}\sum_{i=1}^{t} R_{t,i}, \tag{26}
\]
where R_{t,i} is the classification accuracy on the i-th task following training on the t-th task. Average Forgetting is defined as [3]
\[
F_t = \frac{1}{t-1}\sum_{i=1}^{t-1} \max_{t'\in\{1,2,\dots,t-1\}}\left(R_{t',i} - R_{t,i}\right). \tag{27}
\]
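In code, these metrics can be computed from a matrix of task accuracies R_{t,i} as follows (the accuracy values in the example are made up for illustration):

```python
def average_accuracy(R, t):
    """A_t of Eqn. (26): mean accuracy over tasks 1..t after training on task t.
    R[t][i] is the accuracy on task i after training on task t (1-indexed)."""
    return sum(R[t][i] for i in range(1, t + 1)) / t

def average_forgetting(R, t):
    """F_t of Eqn. (27): mean drop from the best past accuracy on each old task.
    The max runs over the earlier tasks t' >= i for which R[t'][i] is defined."""
    return sum(max(R[tp][i] for tp in range(i, t)) - R[t][i]
               for i in range(1, t)) / (t - 1)

# Hypothetical accuracy matrix for T = 3 tasks.
R = {1: {1: 0.90},
     2: {1: 0.85, 2: 0.80},
     3: {1: 0.80, 2: 0.75, 3: 0.70}}
assert abs(average_accuracy(R, 3) - 0.75) < 1e-9
assert abs(average_forgetting(R, 3) - 0.075) < 1e-9
```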
Table A3: DIL Datasets. K is the number of classes, all of which are included in each task; T is the total number of tasks; N is the total number of training samples across all tasks; and Nval is the number of validation samples, either overall or per task. Also shown are the number of training samples in each task, Nt, and the domain names for the corresponding tasks.

CORe50 [29] (nearly class balanced, ∼2400 per class): K = 50, T = 8, N = 119894; Nval = 44972, with domains S3, S7 and S10 reserved for validation.
N1 = 14989 (S1); N2 = 14986 (S2); N3 = 14995 (S4); N4 = 14966 (S5); N5 = 14989 (S6); N6 = 14984 (S8); N7 = 14994 (S9); N8 = 14991 (S11).

CDDB-Hard [27] (class balanced in both train and val): K = 2, T = 5, N = 16068; standard validation set, Nval = 5353 in total.
N1 = 6000 (gaugan), 2000 val; N2 = 2400 (biggan), 800 val; N3 = 6208 (wild), 2063 val; N4 = 1200 (whichfaceisreal), 400 val; N5 = 260 (san), 90 val.

DomainNet [40] (imbalanced): K = 345, T = 6, N = 409832; standard validation set, Nval = 176743 in total.
N1 = 120906 (Real), 52041 val; N2 = 120750 (Quickdraw), 51750 val; N3 = 50416 (Painting), 21850 val; N4 = 48212 (Sketch), 20916 val; N5 = 36023 (Infograph), 15582 val; N6 = 33525 (Clipart), 14604 val.

CORe50 was downloaded from http://bias.csr.unibo.it/maltoni/download/core50/core50_imgs.npz. CDDB was downloaded from https://coral79.github.io/CDDB_web/. DomainNet was downloaded from http://ai.bu.edu/M3SDA/#dataset; we used the version labelled as "cleaned version, recommended."
Note that for CIL, we calculate R_{t,i} as the accuracy on the subset of classes in D_i.
For DIL, since all classes are present in each task, R_{t,i} has a different nature and depends on dataset conventions. For CORe50, the validation set consists of three domains not used during training (S3, S7 and S10, as per Table A3), and in this case each R_{t,i} is calculated on the entire validation set. Therefore R_{t,i} is constant across i, and A_t equals this common value. For CDDB-Hard and DomainNet, for the results in Table 3, we treated the validation data in the same way as for CORe50. However, it is also interesting to calculate accuracies for validation subsets in each domain; we provide such results in the following subsection.
Fig. A2 shows, for three different random seeds, Average Accuracy and Average Forgetting after each CIL task, matching Table 1 in Section 5, without PETL (Phase 1). Because PETL is not used, the only random variability is (i) the choice of which classes are randomly assigned to which task, and (ii) the random values sampled for the weights in W. We find that after the final task, the Average Accuracy is identical for each random seed when all classes have the same number of samples, and nearly identical otherwise. Clearly the randomness in W has negligible impact. For all datasets except VTAB, the value of RP is clear, i.e. RP (black traces) delivers higher accuracies by the final task than not using RP, or than using only NCM (blue traces). The benefits of second-order statistics even without RP are also evident (magenta traces), with Average Accuracies better than NCM. Note that VTAB always has the same classes assigned to the same tasks, which is why only one repetition is shown. VTAB also differs in its Average Accuracy trend: accuracy does not clearly decrease as the number of tasks increases, possibly because the distinct domain of each task makes confusions between classes in different tasks less likely.
Fig. A3 shows comparisons when Phase 1 (PETL) is used. The same trends are apparent, except that
due to the SGD training required for PETL parameters, greater variability in Average Accuracy after
the final task can be seen. Nevertheless, the benefits of using RP are clearly evident.
[Figure A2: seven panels (CIFAR100, CUB, ImageNet-R, ImageNet-A, OmniBenchmark, VTAB, Cars), each plotting Average Accuracy A_t (%) and Average Forgetting F_t (%) after each task D1, …, D10 (D1, …, D5 for VTAB), with traces for 'NCM Only', 'No RP', and 'RP, M=10000'.]
Figure A2: Average Accuracies and Forgetting after each Task for CIL datasets (no Phase 1).
The left column shows Average Accuracies for three random seeds after each of T tasks for the seven
CIL datasets, without using Phase 1. The right column shows Average Forgetting.
[Figure A3: the same seven panels as Figure A2, with traces for 'PETL+NCM', 'PETL+No RP', and 'RanPAC'.]
Figure A3: Average Accuracies and Forgetting after each Task for CIL datasets (with Phase
1). The left column shows Average Accuracies for three random seeds after each of T tasks for the
seven CIL datasets, for the best choice of PETL method in Phase 1. The right column shows Average
Forgetting.
Fig. A4 shows how accuracy on individual domains changes as tasks are added during training on the DIL dataset CDDB-Hard. The figure shows that after the training data for a particular domain is first used, accuracy on that domain's validation data tends to increase. In some cases forgetting is evident, e.g. for the 'wild' domain after training on T4 and T5. The figure also shows that averaging accuracies over individual domains ('mean over domains') is significantly lower than 'Overall accuracy'. For this particular dataset, 'Average accuracy' is potentially a misleading metric, as it does not take into account the much lower number of validation samples in some domains; e.g. even though performance on 'san' increases after training on it, it remains under 60%, which is poor for a binary classification task.
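The gap between the two summaries comes purely from weighting. Using the CDDB-Hard validation counts from Table A3 with hypothetical per-domain accuracies (the accuracy values below are made up for illustration):

```python
# Validation sample counts per CDDB-Hard domain (from Table A3) and
# hypothetical per-domain accuracies, with 'san' performing poorly.
n_val = {'gaugan': 2000, 'biggan': 800, 'wild': 2063, 'whichfaceisreal': 400, 'san': 90}
acc = {'gaugan': 0.95, 'biggan': 0.90, 'wild': 0.85, 'whichfaceisreal': 0.90, 'san': 0.55}

mean_over_domains = sum(acc.values()) / len(acc)                   # unweighted mean
overall = sum(n_val[d] * acc[d] for d in n_val) / sum(n_val.values())

# A tiny domain such as 'san' (90 samples) drags the unweighted mean down far
# more than it affects the sample-weighted overall accuracy.
assert overall > mean_over_domains
```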
Fig. A5 shows the same accuracies by domain for DomainNet. Interestingly, unlike CDDB-Hard, accuracy on each domain generally increases as new tasks are learned. This suggests that the pre-trained model is highly capable of forming similar feature representations for the different domains in DomainNet, such that increasing amounts of training data make it easier to discriminate between classes. The likely difference with CDDB-Hard is that it is a binary classification task in which discriminating between the 'real' and 'fake' classes is inherently more difficult, and is not reflected in the data used to train the pre-trained model.
[Figure A4: accuracy (%) after each task T1 (gaugan) to T5 (san), with traces for the domains gaugan, biggan, wild, whichfaceisreal and san, plus 'Mean over domains' and 'Average Accuracy'.]
Figure A4: Accuracies for DIL dataset CDDB-Hard. Results are shown for the full RanPAC
algorithm, using VPT as the PETL method, and M = 10000. Each task, Ti corresponds to learning
on a new domain as shown on the x-axis. The accuracies shown in colors are those for individual
domains, following training on the domain shown on the x-axis. ‘Mean over domains’ is the average
of the five domain accuracies after each task.
As shown in Fig. A2, when Phase 1 is excluded, the final Average Accuracy after the final task has
negligible variability despite different random assignments of classes to tasks. This is primarily a
consequence of Eqn. (4) being invariant to the order in which a completed set of data from all T tasks is
used in Algorithm 1. As mentioned, the influence of different random values in W is negligible. The
same effect occurs if the data is split into different numbers of tasks, e.g. T = 5, T = 10 and T = 20,
provided all classes have an equal number of samples, as in CIFAR100.
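This order invariance is easy to verify numerically. The sketch below is a simplified stand-in for the accumulation in Algorithm 1, with random vectors in place of real projected features: the Gram matrix G and target matrix C are accumulated under two different task splits and sample orderings, and the accumulated statistics, and hence the ridge solution, are unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, n = 64, 10, 200                  # projection dim, classes, samples
H = rng.standard_normal((n, M))        # stand-in for projected features
Y = np.eye(N)[rng.integers(0, N, n)]   # one-hot labels

def accumulate(order, n_tasks):
    """Stream the data as n_tasks 'tasks' in the given sample order."""
    G = np.zeros((M, M))
    C = np.zeros((M, N))
    for idx in np.array_split(order, n_tasks):
        G += H[idx].T @ H[idx]         # Gram-matrix update
        C += H[idx].T @ Y[idx]         # prototype/target update
    return G, C

G1, C1 = accumulate(np.arange(n), 5)             # 5 tasks, natural order
G2, C2 = accumulate(rng.permutation(n), 20)      # 20 tasks, shuffled order
print(np.allclose(G1, G2), np.allclose(C1, C2))  # True True

lam = 1.0                                        # ridge parameter lambda
Wo = np.linalg.solve(G1 + lam * np.eye(M), C1)   # output weights
```

Since the updates are commutative sums, any partitioning of the same data yields the same G and C, so the final weights depend only on the completed dataset, not on task order or count.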
Therefore, in this section we show in Table A4 results for RanPAC only for the case where variability
has a greater effect, i.e. when Phase 1 is included and AdaptMLP is chosen. VTAB is excluded from
this analysis, since it is a special case where it makes sense to consider only T = 5. The
comparison results for L2P, DualPrompt and ADaM are copied from [65].
Performance when AdaptMLP is used tends to be better for T = 5 and worse for T = 20. This is
consistent with the fact that more classes are used for first-session training with the PETL method
when T = 5. By comparison, T = 20 is often on par with not using the PETL method at all,
indicating that the first session strategy may have little value if insufficient data diversity is available
within the first task.
[Figure A5 plot: accuracy (%) after each task T1 (real), T2 (quickdraw), T3 (painting), T4 (sketch), T5 (infograph), T6 (clipart), with traces for each individual domain plus ‘Mean over domains’ and ‘Average Accuracy’.]
Figure A5: Accuracies for DIL dataset DomainNet. Results are shown for the full RanPAC
algorithm, using VPT as the PETL method, and M = 10000. Each task, Ti , corresponds to learning
on a new domain as shown on the x-axis. The accuracies shown in colors are those for individual
domains, following training on the domain shown on the x-axis. ‘Mean over domains’ is the average
of the six domain accuracies after each task.
Dataset         Method       T = 5    T = 10   T = 20   No Phase 1
CIFAR100        RanPAC       92.4%    92.2%    90.8%    89.0%
                AdaM         88.5%    87.5%    85.2%    –
                DualPrompt   86.9%    84.1%    81.2%    –
                L2P          87.0%    84.6%    79.9%    –
                NCM          88.6%    87.8%    86.6%    83.4%
ImageNet-R      RanPAC       79.9%    77.9%    74.5%    71.8%
                AdaM         74.3%    72.9%    70.5%    –
                DualPrompt   72.3%    71.0%    68.6%    –
                L2P          73.6%    72.4%    69.3%    –
                NCM          74.0%    71.2%    64.7%    61.2%
ImageNet-A      RanPAC       63.0%    58.6%    58.9%    58.2%
                AdaM         56.1%    54.0%    51.5%    –
                DualPrompt   46.6%    45.4%    42.7%    –
                L2P          45.7%    42.5%    38.5%    –
                NCM          54.8%    49.7%    49.3%    49.3%
CUB             RanPAC       90.6%    90.3%    89.7%    89.9%
                AdaM         87.3%    87.1%    86.7%    –
                DualPrompt   73.7%    68.5%    66.5%    –
                L2P          69.7%    65.2%    56.3%    –
                NCM          87.0%    87.0%    86.9%    86.7%
OmniBenchmark   RanPAC       79.6%    79.9%    79.4%    78.2%
                AdaM         75.0%∗   74.5%    73.5%    –
                DualPrompt   69.4%∗   65.5%    64.4%    –
                L2P          67.1%∗   64.7%    60.2%    –
                NCM          75.1%    74.2%    73.0%    73.2%
Cars            RanPAC       69.6%    67.4%    67.2%    67.1%
                NCM          41.0%    38.0%    37.9%    37.9%
Table A4: Comparison of CIL results for different numbers of tasks. For this table, the AdaptMLP
PETL method was used for RanPAC. The comparison results for L2P, DualPrompt and ADaM are
copied from [65]; for ADaM, the best-performing PETL method was used. In most cases, Average
Accuracy decreases as T increases, with T = 20 typically not significantly better than the case of
no PETL (“No Phase 1”). Values marked with ∗ indicate data for T = 6 tasks instead of T = 5.
Note that accuracies for comparison methods for Cars were not available from [65].
F.4 Task Agnostic Continual Learning
Unlike CIL and DIL, ‘task agnostic’ CL is a scenario where there is no clear concept of a ‘task’
during training [61]. The concept is also known as ‘task-free’ [44]. It contrasts with standard CIL
where although inference is task agnostic, training is applied to disjoint sets of classes, described as
tasks. To illustrate the flexibility of RanPAC, we show here that it is simple to apply it to task agnostic
continual learning. We use the Gaussian scheduled CIFAR100 protocol of [57], which was adapted
from [44]. We use 200 ‘micro-tasks’ which sample from a gradually shifting subset of classes, with
5 batches of 48 samples in each micro-task. There are different possible choices for how to apply
Algorithm 1. For instance, the ‘first session’ for Phase 1 could be defined as a particular total number
of samples trained on, e.g. 10% of the anticipated total number of samples. Then in Phase 2, the outer
for loop over tasks could be replaced by a loop over all batches, or removed entirely. In both cases,
the result for G and C will be unaffected. The greater challenge is in determining λ, but generally,
for a large number of samples, λ can be small or zero. For a small number of training samples, a
queue of samples could be maintained such that the oldest sample in the queue is used to update G
and C, with all newer samples used to compute λ when inference is required, and then all samples in
the buffer added to G and C.
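A minimal sketch of this task-free variant follows, with the outer task loop of Algorithm 1 replaced by a loop over batches. Random vectors stand in for projected features, and λ is simply held constant rather than optimized, as suggested above for large sample counts.

```python
import numpy as np

rng = np.random.default_rng(1)
M, N = 64, 10
G = np.zeros((M, M))
C = np.zeros((M, N))
lam = 100.0                          # held constant rather than optimized

def predict(h, G, C, lam):
    """Solve for output weights from current statistics and classify."""
    Wo = np.linalg.solve(G + lam * np.eye(M), C)
    return (h @ Wo).argmax(axis=1)

# 200 micro-task batches of 48 samples; no task boundaries are used anywhere.
for step in range(200):
    h = rng.standard_normal((48, M))         # projected features (stand-in)
    y = np.eye(N)[rng.integers(0, N, 48)]    # one-hot labels
    G += h.T @ h                             # identical updates to the
    C += h.T @ y                             # task-based loop
    preds = predict(h, G, C, lam)            # inference is possible after any batch
```

Because G and C are plain running sums, the final statistics match what the task-based loop would produce on the same data, which is why the protocol change requires no modification to the method itself.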
Here, for simplicity we illustrate application to Gaussian-scheduled CIFAR100 without any Phase
1. Fig. A6 shows how test accuracy changes through training both with and without RP. The green
trace illustrates how the number of classes seen in at least one sample increases gradually through
training, instead of in steps as in CIL. The red traces show validation accuracy on the entirety of
the validation set. As expected, this increases as training becomes exposed to more classes. The black
traces show the accuracy on only the classes seen so far through training. By the end of training, the
red and black traces converge, as is expected. Fluctuations in black traces can be partially attributed
to not optimizing λ. The final accuracies with and without RP match the values for T = 10 CIL
shown in Table 1.
Figure A6: Task agnostic example. Application of Phase 2 of Algorithm 1 to the Gaussian-
scheduled CIFAR100 task-agnostic CL protocol.
Table A5 shows, for the example of split CIFAR100, that it is important to ensure M is sufficiently
large. In order to surpass the accuracy obtained without RP (see ‘No RPs or Phase 1’ in Table 1), M
needs to be larger than 1250.
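For reference, the frozen random-projection layer whose width M is being varied here can be sketched as follows. A ReLU is assumed for the nonlinearity, and d = 768 matches ViT-B/16 features; the projection matrix is sampled once and never trained.

```python
import numpy as np

rng = np.random.default_rng(0)
d, M = 768, 10000                    # backbone feature dim; RP width M
W = rng.standard_normal((d, M))      # frozen: sampled once, never trained

def project(f):
    """Nonlinear random projection h = relu(f W) (ReLU assumed here)."""
    return np.maximum(f @ W, 0.0)

f = rng.standard_normal((5, d))      # a batch of backbone features
h = project(f)                       # shape (5, 10000)
```

Increasing M expands the dimensionality of h and hence the capacity of the subsequent linear solve, which is consistent with the monotonic accuracy gains in Table A5.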
Fig. A7 shows how performance varies with PETL method and ViT-B/16 backbone. For some
datasets, there is a clearly superior PETL method. For example, for CIFAR100, AdaptMLP gives
better results than SSF or VPT for both backbones; for Cars, VPT is best; and for ImageNet-A, SSF
is best. There is also an interesting outlier for VTAB, where VPT with the ImageNet-21K backbone
failed badly. This variability by PETL method suggests that the choice of method for first-session CL
strategies should be investigated in depth in future work.

M        Accuracy
100      71.6%
200      80.3%
400      83.9%
800      86.2%
1250     86.8%
2500     87.7%
5000     88.4%
10000    88.8%
15000    89.0%

Table A5: Scaling with M for split CIFAR100. The table shows final Average Accuracy for T = 10
and no Phase 1, with λ = 100 held constant for all M , using ViT-B/16 trained on ImageNet-21K.
Fig. A7 also makes it clear that the same backbone is not optimal for all datasets. For example, the
ViT-B/16 model fine-tuned on ImageNet-1K is best for ImageNet-A, but the ViT-B/16 model trained
on ImageNet-21K is best for CIFAR100 and OmniBenchmark. For Cars, the best backbone depends
on the PETL method.
Fig. A8 summarises how the two ViT networks compare. For all datasets and method variants, we
plot the Average Accuracy after the final task for one pre-trained ViT network (self-supervised on
ImageNet-21K) against the other (fine-tuned on ImageNet-1K). Consistent with Fig. A7, the best
choice of backbone is both dataset-dependent and method-dependent.
Unlike prompting strategies, our approach works with any feature extractor, e.g. both pre-trained
transformer networks and pre-trained convolutional neural networks. To illustrate this, Tables A6
and A7 show CIL results for ResNet50 and ResNet152, respectively, pre-trained on ImageNet.
We used T = 10 tasks (except for VTAB, which uses T = 5 tasks). Although this is different to the
T = 20 tasks used for ResNets by [65], the accuracies we report for NCM are very comparable to
those in [65]. As with the results for pre-trained ViT-B/16 networks, the use of random projections and
second-order statistics both provide significant performance boosts compared with NCM alone. We
do not use Phase 1 of Algorithm 1 here, but as shown by [37, 65], this is feasible for diverse PETL
methods for convolutional neural networks. Interestingly, ResNet152 with RP produces reduced
accuracies compared to ResNet50 on CUB, OmniBenchmark and VTAB. It is possible that this
would be remedied by seeking an optimal value of M , whereas for simplicity we chose M = 10000.
Note that unlike the pre-trained ViT-B/16 models, the pre-trained ResNets require preprocessing to
normalize the input images.
28
ResNet results for DIL are provided in Table A8.
To further verify the general applicability of our method, we show in Table A9 CIL results from
using a CLIP [41] vision model as the backbone pre-trained model. The same general trend as for
pre-trained ViT-B/16 models and ResNets can be seen: use of RPs (Phase 2 in Algorithm 1)
produces better accuracies than NCM alone. Interestingly, the results for Cars are substantially better
with the CLIP vision backbone than for ViT-B/16 networks. It is possible that data from a very
similar domain to the Cars dataset was included in the training of the CLIP vision model. The CLIP
result for Phase 2 only (see ablations in Table 1) is also better than ViT-B/16 for ImageNet-R, but for
all other datasets, ViT-B/16 has higher accuracy. Note that unlike the pre-trained ViT-B/16 model,
the pre-trained CLIP vision model requires preprocessing to normalize the input images.
CLIP results for DIL are provided in Table A10. The CLIP result outperforms that for Phase 2 only
for the ViT-B/16 model for CDDB-Hard and DomainNet, but not CORe50 (Table 3).
F.9 Experiments with regression targets using CLIP vision and language models
Until this point, we have defined the matrix C as containing Class-Prototypes (CPs), i.e. C has N
columns representing averaged feature vectors of length M . However, with reference to Eqn. (13),
the assumed targets for regression, Ytrain , can be replaced by different targets. Here, we illustrate
this using CLIP language model representations as targets, using OpenAI’s CLIP ViT-B/16 model.
Using CIFAR100 as our example, we randomly project CLIP’s length-512 vision-model representa-
tions as usual, but also use the length-512 language-model representations of the 100 class names,
averaged over templates as in [41]. We create a target matrix with 512 columns, in which each row
is the language model’s length-512 representation for the class of the corresponding sample. We then
solve for Wo ∈ RM ×512 using this target instead of Ytrain .
When the resulting Wo is applied to a test sample, the result is a length-512 prediction of a language-
model representation. To translate this into a class prediction, we then apply CLIP in a standard
zero-shot manner, i.e. we calculate the cosine similarity between the prediction and each class’s
normalized length-512 language-model representation.
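A sketch of this procedure follows. Random vectors stand in for both the projected vision features and the class-name text embeddings, and a smaller M than the M = 5000 of our experiment is used to keep the example light.

```python
import numpy as np

rng = np.random.default_rng(0)
M, N, d_txt, n = 1000, 100, 512, 2000             # RP dim, classes, text dim, samples
H = np.maximum(rng.standard_normal((n, M)), 0.0)  # projected vision features (stand-in)
labels = rng.integers(0, N, n)

T = rng.standard_normal((N, d_txt))               # class-name text embeddings (stand-in)
T /= np.linalg.norm(T, axis=1, keepdims=True)     # normalized, as in zero-shot CLIP

Y = T[labels]                                     # one length-512 target row per sample
lam = 1.0
Wo = np.linalg.solve(H.T @ H + lam * np.eye(M), H.T @ Y)  # M x 512 regression weights

# Inference: predict a text embedding, then match classes by cosine similarity.
h_test = np.maximum(rng.standard_normal((4, M)), 0.0)
pred = h_test @ Wo
pred /= np.linalg.norm(pred, axis=1, keepdims=True)
cls = (pred @ T.T).argmax(axis=1)                 # zero-shot-style class prediction
```

The only change from the class-prototype formulation is the target matrix: one-hot rows are replaced by language-model embeddings, and the argmax over logits is replaced by cosine matching against the class embeddings.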
The resulting final Average Accuracy for M = 5000 and T = 10 is 77.5%. In comparison, CLIP’s
zero-shot accuracy on the same data is 68.6%, which highlights that there is value in using the training
data to modify the vision model’s outputs. When RP is not used, the resulting final Average Accuracy
is 71.4%.
In future work, we will investigate whether these preliminary results from applying Algorithm 1 to a
combination of pre-trained vision and language models can translate into demonstrable benefits for
continual learning.
Accuracies reported in Tables 1-3 have been updated since the original submission to reflect those
published in our code repository at https://github.com/zhoudw-zdw/RevisitingCIL. Note
that accuracy results achieved can vary by ∼ ±1% for different random seeds, even for the same
class ordering, especially for PETL methods.
[Figure A7 plots: Average Accuracy At (%) and Average Forgetting Ft (%) after each task D1–D10, with panels for CIFAR100, CUB, ImageNet-R, ImageNet-A and VTAB, and traces for AdaptMLP, SSF and VPT with the IN21K and IN1K backbones.]
Figure A7: Comparison of PETL Methods and two ViT models. For the seven CIL datasets, and
T = 10, the figure shows Average Accuracy and Average Forgetting after each task, for each of
AdaptMLP, SSF and VPT. It shows this information for the two ViT-B/16 pre-trained backbones we
primarily investigated.
[Figure A8 scatter plot: Average Accuracy (%) with the ViT-B/16 backbone fine-tuned on ImageNet-1K (y-axis) vs. pre-trained on ImageNet-21K (x-axis), with markers for NCM, No RP, and RP (M = 10000).]
Figure A8: Comparison of backbone ViTs. Scatter plot of results for ViT-B/16 models pre-trained
on ImageNet1K vs ImageNet21K.