
Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation

Zhengyuan Xie¹, Haiquan Lu¹, Jia-wen Xiao¹, Enguang Wang¹, Le Zhang³, and Xialei Liu¹,²(B)

¹ VCIP, CS, Nankai University
² NKIARI, Shenzhen Futian
³ SICE, UESTC
{xiezhengyuan}@mail.nankai.edu.cn, {xialei}@nankai.edu.cn

arXiv:2407.14142v1 [cs.CV] 19 Jul 2024

Abstract. Class incremental semantic segmentation aims to preserve old knowledge while learning new tasks; however, it is impeded by catastrophic forgetting and background shift. Prior works indicate the pivotal importance of initializing new classifiers and mainly focus on transferring knowledge from the background classifier or preparing classifiers for future classes, neglecting the flexibility and variance of new classifiers. In this paper, we propose a new classifier pre-tuning (NeST) method applied before the formal training process, learning a transformation from old classifiers to generate new classifiers for initialization rather than directly tuning the parameters of new classifiers. Our method can make new classifiers align with the backbone and adapt to the new data, preventing drastic changes in the feature extractor when learning new classes. Besides, we design a strategy considering the cross-task class similarity to initialize the matrices used in the transformation, helping achieve the stability-plasticity trade-off. Experiments on the Pascal VOC 2012 and ADE20K datasets show that the proposed strategy can significantly improve the performance of previous methods. The code is available at https://github.com/zhengyuan-xie/ECCV24_NeST.

Keywords: Class incremental learning · Semantic segmentation

1 Introduction
Nowadays, segmentation models driven by deep neural networks have achieved notable success in various fields [9, 10, 33]. However, these models are usually trained and tested in a static environment [20], which is contrary to real-world scenarios [47]. When facing a continuous data stream [24], a model needs to handle several novel classes. Blending new and previous data to retrain the model, also known as Joint Training, ensures the performance of the model. However, the computational cost becomes prohibitive when dealing with large datasets. Simply fine-tuning the model may lead to the loss of previously acquired knowledge, a phenomenon known as catastrophic forgetting [29, 36].
(Figure 1: four classifier-initialization schemes for step t from the step t-1 classifiers — BG CLS Init, AUX CLS Init, Channel Selection, and NeST (Ours).)
Fig. 1: Different classifier initialization methods for class incremental semantic seg-
mentation. MiB [5] directly uses background classifiers to initialize new classifiers.
Some methods [2,6] train an auxiliary classifier for future classes. AWT [21] selects the
most relevant weights from the background classifier for new classifiers’ initialization
by gradient-based attribution. Our new classifier pre-tuning method learns a transfor-
mation from all old classifiers to generate new classifiers for initialization.

In particular, semantic segmentation also faces the problem of background shift [5] (i.e., pixels labeled as background in the current step may belong to previous or future classes).
Class incremental semantic segmentation (CISS) [5, 6, 16, 55] has been pro-
posed to address the challenges of catastrophic forgetting and background shift.
It requires the segmentation model to learn concepts of new classes while pre-
serving old knowledge. If no restrictions are imposed, parameters crucial to old
classes may be updated in the wrong direction [28], causing the model to for-
get previously learned knowledge. On the contrary, focusing only on memorizing old knowledge may limit the model's learning of new classes. Thus, the key issue in continual learning is actually the trade-off between stability and plasticity [22, 28].
As far as stability is concerned, existing methods rely on old models learned
from the previous step for knowledge transfer [5, 16, 50, 55], e.g. weight initial-
ization. Yet the old model does not have the ability to recognize new classes,
finding a proper way to initialize the newly added classifier is crucial. Random
initialization may cause the misalignment between new classifiers and features,
leading to training instabilities [5]. In accordance with this situation, as Fig. 1
shows, MiB [5] utilizes the background classifier to initialize new classifiers, as
future classes may appear in current data, labeled as the background. However,
it may cause the model to make incorrect predictions for true background pix-
els [21]. AWT [21] selects the most relevant weight for initialization from the old
background classifier via a gradient-based attribution technique, yet it neglects
other old classifiers and brings a huge memory cost. Some methods train an auxiliary classifier to initialize new classifiers [2, 6], but there is still a bias between the auxiliary classifier and the true future classifiers because no future data with ground truth is available. What's more, the above initialization methods treat
each new classifier equally, ignoring the differences between new classes.
From the above observations, we propose the New claSsifier pre-Tuning (NeST) method to make new classifiers better align with the backbone and adapt to the training data, preventing drastic changes in the feature extractor caused by an unstable training process.
To better utilize the old knowledge, instead of
directly tuning the parameters of new classifiers, we tune a linear transformation
from all old classifiers to generate new classifiers. Specifically, before the formal
training process at the current step, we assign two transformation matrices, i.e.,
an importance matrix, and a projection matrix, to each new class. As trainable
parameters, the importance matrix learns a weighted score associated with each
channel of the old classifier pertinent to the new class, while the projection
matrix learns a linear combination from the weighted old classifiers to generate
the new classifier. Subsequently, we use data from the current step to learn the
transformation from old classifiers to each new classifier. Finally, we employ
learned importance matrices and projection matrices to generate new classifiers
from old ones, and then leverage the derived weights for classifier initialization
of the formal training process. By using these two matrices, new classifiers are
generated via old knowledge and are different from each other.
To achieve the stability-plasticity trade-off, we also find the initial values of these transformation matrices to be crucial and propose an initialization strategy that considers cross-task class similarities between old and new classes, thus facilitating the learning of new classifiers.
To summarize, the main contributions of our paper are threefold:
1. We propose a new classifier pre-tuning (NeST) method that learns a trans-
formation from all old classifiers to generate new classifiers with previous
knowledge before the formal training process.
2. We further optimize the initialization of transformation matrices, striking a
balance between stability and plasticity.
3. We conduct experiments on Pascal VOC 2012 and ADE20K. Results show
that NeST can be readily integrated into other approaches and significantly
enhance their performance.

2 Related Work
Semantic Segmentation. In recent years, the development of deep learning techniques has greatly improved the performance of semantic segmentation models.
Fully Convolutional Networks (FCN) [34] pioneers semantic segmentation, and
a series of convolution-based segmentation networks have achieved high per-
formance in many benchmarks [8, 19, 23, 41, 48, 57]. More recently, the Trans-
former architecture has become ever more popular and reached remarkable per-
formance [11,13,32,40,44,51,59]. Thus, we conduct experiments on two different
backbones: ResNet and Swin-Transformer.
Class Incremental Learning. Class incremental learning tackles the catas-
trophic forgetting of the task with ever-increasing categories. The whole training
process is divided into several steps and in each step, the model is required to
learn one or more classes, which is the most challenging. Existing methods can
be categorized into three groups. The model expansion methods [26, 27, 52, 54]
enlarge the model size in incremental steps to learn new knowledge while pre-
serving old knowledge in fixed parameters. The rehearsal based methods store a
series of exemplars [3, 45] or prototypes along the task sequence [61], or use generative networks to maintain old knowledge [35, 43]. The parameter regularization methods either constrain the learning of some parameters [7, 30] or use knowledge distillation techniques [1, 16, 17, 49].
Class Incremental Semantic Segmentation. CISS requires a segmenta-
tion model to recognize all learned classes in continuous learning steps. Different
from classification tasks, segmentation has its own background shift [5] issue.
MiB [5] first proposes unbiased cross entropy and distillation to address back-
ground shift. PLOP [16] uses pseudo-label and intermediate features to transfer
old knowledge. SDR [38] proposes a contrastive-learning-based method, min-
imizing intra-class feature distances. RCIL [55] introduces reparameterization
to continual semantic segmentation with a complementary network structure.
GSC [14] explores incremental semantic segmentation considering the gradient
and semantic compensation. EWF [50] fuses old and new knowledge with a
weight fusion strategy. Many methods have explored the classifier initialization
in CISS. SSUL [6] employs auxiliary data to train an ‘unknown’ classifier and uses it to initialize new classifiers. DKD [2] proposes a decomposed knowledge distillation, improving rigidity and stability. AWT [21] applies gradient-based attribution to transfer the relevant weights of old classifiers to new classifiers. Different from the methods above, we propose a new classifier pre-tuning method, achieving a trade-off between plasticity and stability.

3 Method
3.1 Preliminaries
Problem Definition. Following [5, 16], CISS contains several steps $\{t\}_{t=1}^{n}$ that receive a sequential data stream $\{\mathcal{D}_t\}_{t=1}^{n}$ with classes $\{\mathcal{C}_t\}_{t=1}^{n}$. In each step $t$, the model is required to learn $|\mathcal{C}_t|$ new classes, while old training data is not available. For an image in the current step, pixels belonging to $\mathcal{C}_t$ are labeled with their ground-truth classes, leaving all other pixels labeled as background. Finally, after the last step, the model is tested on the data of all learned classes. At step $t$, the segmentation model is composed of a feature extractor $f_\theta^t$ and a classifier $h_\phi^t$; $\phi_t$ is newly added and the weights of the previous classifiers are $\phi_{1:t-1}$. In this paper, our main concern is the initialization of $\phi_t$.
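To make the formulation concrete, the following minimal PyTorch sketch (our own illustration, not the authors' released code; names such as `IncrementalSegHead` and `expand` are hypothetical) shows a pixel-wise classifier implemented as a 1×1 convolution that is extended by |C_t| output channels at each step while the previous weights φ_{1:t-1} are carried over unchanged.

```python
import torch
import torch.nn as nn

class IncrementalSegHead(nn.Module):
    """Pixel-wise classifier h_phi that grows by |C_t| classes per step."""
    def __init__(self, feat_dim: int, num_base_classes: int):
        super().__init__()
        # A 1x1 conv acts as a per-pixel linear classifier over d-dim features.
        self.cls = nn.Conv2d(feat_dim, num_base_classes, kernel_size=1)

    @torch.no_grad()
    def expand(self, num_new: int, new_weight=None, new_bias=None):
        """Add num_new output channels; keep old weights phi_{1:t-1} intact."""
        old = self.cls
        new_cls = nn.Conv2d(old.in_channels, old.out_channels + num_new, kernel_size=1)
        new_cls.weight[: old.out_channels] = old.weight
        new_cls.bias[: old.out_channels] = old.bias
        if new_weight is not None:            # e.g. weights generated by NeST
            new_cls.weight[old.out_channels:] = new_weight
        if new_bias is not None:
            new_cls.bias[old.out_channels:] = new_bias
        self.cls = new_cls

    def forward(self, feats):                  # feats: (B, d, H, W)
        return self.cls(feats)                 # logits: (B, num_classes, H, W)
```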
Revisiting Previous Initialization Methods. In this part, we illustrate
the shortcomings of previous initialization methods and further clarify our mo-
tivation. As mentioned before, classes in Ct will appear as background in pre-
vious steps, which is known as the background shift. According to this premise,
previous methods either try to find a better way of utilizing the background
classifier [5, 21], or train an auxiliary classifier for future classes, both neglecting
the differences between new classes [2, 6]. Meanwhile, there will also be a mis-
matching issue between new classifiers and the feature extractor as the methods
above do not have training processes with data from new classes. This may lead
to drastic parameter changes in the feature extractor when training with new
data, thus undermining previous knowledge.
(Figure 2: diagram with three panels — New Classifier Generation, Transformation Matrix Initialization, and New Classifier Pre-tuning.)
Fig. 2: Illustration of our new classifier pre-tuning (NeST) method. The left side of
the figure is an iteration of the new classifier pre-tuning process. The right side of
the figure represents the cross-task class similarity-based initialization of importance
matrices and projection matrices before the pre-tuning process.

3.2 New Classifier Pre-Tuning

Considering the observations above, we propose a simple yet effective method, NeST, to improve the performance of CISS. In the pre-tuning stage before the formal training process, we first learn to generate each new classifier via old classifiers and then tune it with new data in each forward pass.
New Classifier Generation. As shown in Fig. 2, for each new class c ∈ Ct ,
we assign an importance matrix and a projection matrix, learning the transfor-
mation from all old classifiers in each forward pass as follows:

\mathbf{w}_c = (\mathbf{M}_c \odot \mathbf{W}_{old})\mathbf{P}_c\,,   (1)

where $\mathbf{w}_c \in \mathbb{R}^d$ denotes the weight of new class $c$, $\mathbf{M}_c \in \mathbb{R}^{d \times n_{old}}$ denotes the importance matrix of class $c$, $n_{old} = \sum_{i=1}^{t-1}|\mathcal{C}_i| + 1$ denotes the number of old classes including the background, $\odot$ denotes the Hadamard product, $\mathbf{W}_{old} \in \mathbb{R}^{d \times n_{old}}$ denotes the weight matrix of all old classifiers, $\mathbf{P}_c \in \mathbb{R}^{n_{old} \times 1}$ denotes the projection matrix of class $c$, and $d$ denotes the output dimension of $f_\theta^{t-1}$. $\mathbf{M}_c$ reflects the importance of each channel of the old classifiers for the new class $c$, and $\mathbf{P}_c$ combines the weighted old classifiers, completing weight generation for the new class. After generating the new weights, we concatenate them as the parameters of $\phi_t$ as follows:

\phi_t = \mathrm{concat}[\mathbf{w}_{n_{old}}, \ldots, \mathbf{w}_{n_{old}+|\mathcal{C}_t|-1}]\,,   (2)
where concat[·] is the concatenation operation. Specifically, since the old background class may have covered new classes in previous steps, the background score during pre-tuning may be high and prevent the model from learning new classes; therefore, we also learn a transformation from the old background classifier to a new background classifier during the pre-tuning process as follows:

\hat{\mathbf{w}}_0 = (\mathbf{M}_0 \odot \mathbf{w}_0)\mathbf{P}_0\,,   (3)

where $\mathbf{w}_0 \in \mathbb{R}^d$ denotes the previous background classifier, $\mathbf{M}_0 \in \mathbb{R}^{d \times 1}$ and $\mathbf{P}_0 \in \mathbb{R}^{1 \times 1}$ are matrices specifically for the background, and $\hat{\mathbf{w}}_0 \in \mathbb{R}^d$ is the new background classifier prepared for the current task.
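A minimal sketch of the weight-generation step in Eqs. (1)-(3), assuming the old classifier weights are stored as a d × n_old matrix; the function names are ours, not from the released implementation.

```python
import torch

def generate_new_classifier(W_old: torch.Tensor,   # (d, n_old) old classifier weights
                            M_c: torch.Tensor,     # (d, n_old) importance matrix of class c
                            P_c: torch.Tensor      # (n_old, 1) projection matrix of class c
                            ) -> torch.Tensor:
    """Eq. (1): w_c = (M_c ⊙ W_old) P_c, a d-dim weight for new class c."""
    return (M_c * W_old) @ P_c                      # (d, 1)

def generate_step_classifiers(W_old, importance, projection):
    """Concatenate the generated weights of all new classes (Eq. (2))."""
    weights = [generate_new_classifier(W_old, M_c, P_c)
               for M_c, P_c in zip(importance, projection)]
    return torch.cat(weights, dim=1)                # (d, |C_t|)

# The background classifier is transformed analogously (Eq. (3)):
# w0_hat = (M_0 * w_0) @ P_0, with w_0 of shape (d, 1), M_0 (d, 1), P_0 (1, 1).
```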
Pre-tuning and Initialization. In each forward pass, as Fig. 2 shows, after
generating weights of new classifiers, we feed training data from the current step
to the old model equipped with generated classifiers, using the output to calcu-
late the loss and then backpropagating the loss to update learnable parameters.
The loss function used in the pre-tuning is unbiased cross entropy Lunce [5], as
it can help to avoid the overfitting problem [39]:

\mathcal{L}_{unce} = -\frac{1}{|\mathcal{I}|}\sum_{i\in\mathcal{I}} \log \tilde{q}_x^t(i, y_i)\,,   (4)

where I denotes the pixel set of an image, yi denotes the ground-truth label of
pixel i at current step, q̃xt denotes the modified output of the current model. It
should be noted that during the whole pre-tuning process, the importance matri-
ces and projection matrices are learnable and will be updated, while other com-
ponents such as the feature extractor remain frozen. After the pre-tuning process,
we use importance matrices and projection matrices to generate weights of each
new classifier for initialization, and additional parameters will be removed.
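The pre-tuning loop can then be sketched as below (our simplified illustration under several assumptions: the frozen old model exposes a `backbone` attribute, the classifier is bias-free, Weight Align is omitted, and `generate_step_classifiers` comes from the sketch after Eq. (3)); the actual training schedule follows the implementation details in Sec. 4.1.

```python
import torch

def pretune(old_model, W_old, importance, projection,
            loader, unbiased_ce, epochs=5, lr=1e-3, device="cuda"):
    """Sketch of NeST pre-tuning: only the transformation matrices are updated."""
    params = list(importance) + list(projection)
    for p in params:
        p.requires_grad_(True)
    opt = torch.optim.SGD(params, lr=lr)

    old_model.eval()                                   # feature extractor stays frozen
    for _ in range(epochs):
        for images, labels in loader:
            images, labels = images.to(device), labels.to(device)
            with torch.no_grad():
                feats = old_model.backbone(images)     # (B, d, H, W), assumed attribute
            new_w = generate_step_classifiers(W_old, importance, projection)  # (d, |C_t|)
            all_w = torch.cat([W_old, new_w], dim=1)   # old + generated new classifiers
            logits = torch.einsum("bdhw,dc->bchw", feats, all_w)
            loss = unbiased_ce(logits, labels)         # unbiased cross entropy, Eq. (4)
            opt.zero_grad()
            loss.backward()
            opt.step()

    with torch.no_grad():                              # weights used to initialize phi_t
        return generate_step_classifiers(W_old, importance, projection)
```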

3.3 Cross-Task Class Similarity-based Transformation Matrix Initialization
We find the initialization of the importance matrices and projection matrices to be crucial. Random initialization disregards the importance of different channels among old classifiers and results in poor performance. AWT [21] reveals that only a few channels of the background classifier contribute to the classification of new classes; redundant channels may hinder the learning of new classifiers. Meanwhile, for different new classes, the important channels of the old classifiers should be different. Therefore, we introduce a cross-task class similarity-based transformation matrix initialization method. The core idea is that if an old class is more similar to a new class, the contribution of this old classifier during the pre-tuning process should be more significant, and thus greater initial weights should be assigned to the corresponding positions in the importance matrix and projection matrix.
Hence, we employ the predictions made by the old model as cross-task class
similarity scores for each new class pixel, aiming to assess the resemblance be-
tween the new class and every old class. As shown in Fig. 2, before learning
transformation, we feed each training image in the current step to the old model
and get the corresponding output of the layer before $h_\phi^{t-1}$. Then, considering the
classification process of a new class pixel embedding pu ∈ Rd extracted from the
old model, assuming that the total number of old classes is nold , the matrix mul-
tiplication can be decomposed into an element-wise multiplication (also called
Hadamard product) between pu and the old classifier weight Wold ∈ Rd×nold
and a sum-softmax operation, as follows:

\mathbf{H}_u = \mathbf{W}_{old} \odot \mathbf{p}_u'\,, \quad \mathbf{s}_u = \mathrm{softmax}(\mathrm{sum}(\mathbf{H}_u))^{\top}\,,   (5)

where $\mathbf{p}_u' \in \mathbb{R}^{d \times n_{old}}$ denotes broadcasting $\mathbf{p}_u$ by $n_{old}$ times, $\mathbf{H}_u \in \mathbb{R}^{d \times n_{old}}$ denotes the result of the Hadamard product, $\mathrm{sum}$ denotes summing $\mathbf{H}_u$ along the channel dimension, and $\mathbf{s}_u \in \mathbb{R}^{n_{old}}$ denotes the classification scores of $\mathbf{p}_u$. We consider that, for each column of $\mathbf{H}_u$ (corresponding to an old class $i$), the positive elements contribute to $\mathbf{p}_u$ being classified into class $i$. Thus, to select relevant channels, we set all positions with positive values to 1 and obtain $\mathbf{H}_u^{mask}$, which can be regarded as a binary mask, as follows:

\mathbf{H}_u^{mask}(i, j) = \begin{cases} 1, & \mathbf{H}_u(i, j) > 0 \\ 0, & \text{otherwise}\,. \end{cases}   (6)

Finally, by utilizing the ground-truth masks of the current step, for each pixel embedding belonging to new class $c_{new}$ we calculate the Hadamard product between its corresponding $\mathbf{H}_u^{mask}$ and predicted score, then average the results to get the importance matrix $\mathbf{M}_{c_{new}}$ of class $c_{new}$, as follows:

\mathbf{M}_{c_{new}} = \frac{1}{N} \sum_{\mathbf{p}_u} \mathbf{H}_u^{mask} \odot \mathbf{s}_u'\,,   (7)

where $\mathbf{p}_u$ denotes a pixel embedding belonging to the new class $c_{new}$, $\mathbf{s}_u' \in \mathbb{R}^{d \times n_{old}}$ denotes broadcasting the predicted score $\mathbf{s}_u$ of $\mathbf{p}_u$ by $d$ times, and $N$ denotes the number of pixels belonging to the new class $c_{new}$. We use $\mathbf{s}_u'$ because, in this way, an old class with a very small score will not contribute too much to the initialization of the new class, as the similarity between these two classes is relatively low. For the projection matrix $\mathbf{P}_{c_{new}}$, we sum up $\mathbf{M}_{c_{new}}$ along the channel dimension and apply the softmax function to get the weight scores for the old classes, as follows:

\mathbf{P}_{c_{new}} = \left(\mathrm{softmax}(\mathrm{sum}(\mathbf{M}_{c_{new}}))\right)^{\top}\,.   (8)

Then we use Pcnew ∈ Rnold ×1 to initialize the projection matrix of class cnew as
these scores reflect the degree of similarity between the new class and old classes,
deciding which old class the model should focus on for knowledge transfer. We
provide the pseudo-code of our NeST in Algorithm 1.
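Before the pseudo-code, a minimal sketch of Eqs. (5)-(8) for one new class is given below (our own illustration; the function name and tensor layout are assumptions, not the released code).

```python
import torch
import torch.nn.functional as F

def init_matrices_for_class(pix_embeds: torch.Tensor,  # (N, d) embeddings of pixels of class c_new
                            W_old: torch.Tensor        # (d, n_old) old classifier weights
                            ):
    """Cross-task class similarity-based initialization of (M_c, P_c)."""
    # Eq. (5): per-pixel Hadamard product and classification scores.
    H = pix_embeds.unsqueeze(2) * W_old.unsqueeze(0)     # (N, d, n_old)
    s = F.softmax(H.sum(dim=1), dim=1)                   # (N, n_old)

    # Eq. (6): binary mask of channels that push the pixel toward each old class.
    H_mask = (H > 0).float()                             # (N, d, n_old)

    # Eq. (7): average of mask * broadcast score over the N pixels.
    M_c = (H_mask * s.unsqueeze(1)).mean(dim=0)          # (d, n_old)

    # Eq. (8): projection weights from the channel-summed importance.
    P_c = F.softmax(M_c.sum(dim=0), dim=0).unsqueeze(1)  # (n_old, 1)
    return M_c, P_c
```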
Algorithm 1 Pseudo-code of the proposed NeST.

Input: Training samples D_t = {(x_i, y_i)} of current task t, feature extractor f_θ^0, classifier h_φ^0, task number T, new class set C_t.
Output: initial weight of new classifier φ_t
1:  for t ∈ {1, 2, ..., T} do
2:    for i ∈ C_t do                        ▷ Initializing transformation matrices (Sec. 3.3)
3:      (M_i, P_i) ← Init(D_t, h_φ^{t-1}, f_θ^{t-1})
4:    end for
5:    while not converged do                ▷ Pre-tuning new classifiers (Sec. 3.2)
6:      train (M, P) by minimizing L_unce
7:    end while
8:    φ_t ← Transform(M, P, φ_{1:t-1})      ▷ Initializing new classifiers
9:    f_θ^t ← f_θ^{t-1}
10:   h_φ^t ← (h_φ^{t-1}, φ_t)
11:   while not converged do                ▷ Formal training process
12:     train f_θ^t and h_φ^t
13:   end while
14: end for

4 Experiments

4.1 Experimental setup

Protocols. In CISS, the whole training process contains T steps, and the
task of each step may contain one or more classes. In step t, pixels belonging
to previous steps are labeled as background. In evaluation, the model needs to
identify all learned classes. There are two settings: Disjoint and Overlapped [5].
Disjoint setting assumes that in the current step, images do not contain any
pixel belonging to classes that will be learned in the future. Overlapped setting
is more realistic, as future classes may appear in images from the current step,
and we mainly evaluate our method on Overlapped setting.
Datasets. Pascal VOC 2012 dataset [18] contains 20 classes including 10,582
images for training and 1449 images for validation. ADE20K dataset [60] contains
150 classes including 20,210 images for training and 2000 images for validation.
In CISS, the training setting can be described as X-Y, where X is the number of classes trained in the initial step and Y is the number of classes trained in each incremental step. We conduct experiments with the 10-1, 15-1, 15-5 and 19-1
settings on Pascal VOC 2012 dataset [18]. For ADE20K dataset [60], we validate
our method on 100-50, 100-10, 100-5 and 50-50 settings.
Implementation Details. Following previous methods [5, 16, 55], we use
DeepLabV3 [8] with ResNet-101 [25] as our segmentation model. We also use
Swin-B [32] as the Transformer-based backbone. Following [5,16], we use random
crop and horizontal flip for data augmentation. We train the model with a batch
size of 24 on 4 GPUs for all experiments. We use SGD as the optimizer in our
experiments. The learning rate is set to 0.02 for the initial step and 0.001 for
incremental steps, both for Pascal VOC 2012 and ADE20K. We also use the poly schedule for learning rate decay.
Table 1: The mIoU (%) of the last step on the Pascal VOC dataset for four different
CISS scenarios. * implies results from the re-implementation of the official code.

Method  Backbone | 10-1 (11 steps): 0-10  11-20  all | 15-1 (6 steps): 0-15  16-20  all | 15-5 (2 steps): 0-15  16-20  all | 19-1 (2 steps): 0-19  20  all
Joint Res101 78.8 77.7 78.3 79.8 73.4 78.3 79.8 73.4 78.3 78.2 80.0 78.3
ILT [37] Res101 7.2 3.7 5.5 9.6 7.8 9.2 67.8 40.6 61.3 68.2 12.3 65.5
SDR [38] Res101 32.4 17.1 25.1 44.7 21.8 39.2 75.4 52.6 69.9 69.1 32.6 67.4
PLOP+UCD [53] Res101 42.3 28.3 35.3 66.3 21.6 55.1 75.0 51.8 69.2 75.9 39.5 74.0
SPPA [31] Res101 - - - 66.2 23.3 56.0 78.1 52.9 72.1 76.5 36.2 74.6
MiB+AWT [21] Res101 33.2 18.0 26.0 59.1 17.2 49.1 77.3 52.9 71.5 - - -
ALIFE [39] Res101 - - - 64.4 34.9 57.4 77.2 52.5 71.3 76.6 49.4 75.3
GSC [14] Res101 50.6 17.3 34.7 72.1 24.4 60.8 78.3 54.2 72.6 76.9 42.7 75.3
MiB [5] Res101 12.2 13.1 12.6 38.0 13.5 32.2 76.4 49.4 70.0 71.2 22.1 68.9
MiB* [5] Res101 10.4 9.9 10.2 45.2 15.7 38.2 76.8 49.1 70.2 71.6 28.6 69.6
MiB+NeST (Ours) Res101 52.3 21.0 37.4 61.7 20.4 51.8 77.1 50.1 70.7 71.7 28.2 69.7
PLOP [16] Res101 44.0 15.5 30.5 65.1 21.1 54.6 75.7 51.7 70.1 75.4 37.4 73.5
PLOP* [16] Res101 45.9 17.1 32.2 66.8 22.3 56.2 77.0 50.9 70.8 75.7 39.4 74.0
PLOP+NeST (Ours) Res101 54.2 17.8 36.9 72.2 33.7 63.1 77.6 55.8 72.4 77.0 49.1 75.7
RCIL [55] Res101 55.4 15.1 34.3 70.6 23.7 59.4 78.8 52.0 72.4 77.0 31.5 74.7
RCIL* [55] Res101 47.8 17.0 33.1 69.9 23.9 58.9 78.8 52.4 72.5 76.8 28.9 74.5
RCIL+NeST (Ours) Res101 51.4 20.9 36.8 71.9 28.0 61.4 79.0 52.8 72.8 77.0 33.3 74.9
Joint Swin-B 80.4 79.7 80.1 81.1 76.7 80.1 81.1 76.7 80.1 80.0 80.7 80.1
MiB* [5] Swin-B 11.4 18.9 15.0 35.0 43.2 36.9 80.7 66.5 77.3 79.2 60.2 78.3
MiB+NeST (Ours) Swin-B 65.2 35.8 51.2 77.0 53.3 71.4 81.2 67.4 77.9 79.7 60.0 78.8
PLOP* [16] Swin-B 37.8 23.1 30.8 74.1 52.1 68.9 80.1 68.1 77.2 77.0 65.8 76.4
PLOP+NeST (Ours) Swin-B 64.3 28.3 47.2 76.8 57.2 72.2 80.5 70.8 78.2 79.6 70.2 79.1

For Pascal VOC 2012, we pre-tune for 5
epochs with a learning rate of 0.001. For ADE20K, pre-tuning lasts 15 epochs,
with a learning rate of 0.1 for experiments involving RCIL and Swin-B, and
0.5 for other experiments. Further details are provided in the supplementary
material.

4.2 Results
In this section, we apply NeST to three classic methods, MiB [5], PLOP [16] and
RCIL [55] with different backbones.
Pascal VOC 2012. We conduct experiments on 10-1, 15-1, 15-5 and 19-1
settings. As shown in Tab. 1, with NeST, MiB [5], PLOP [16] and RCIL [55] can
achieve huge performance gains. On the 15-1 setting with Deeplab, by simply
using the proposed new classifier strategy, we can improve the performance of
MiB, PLOP, and RCIL by 13.6%, 6.9%, and 2.5%, respectively. In the more
difficult 10-1 scenario, NeST can boost the performance of MiB by a large mar-
gin, achieving an improvement of 27.2%. Meanwhile, results with Swin-B show
that NeST is also suitable for Transformer-based models. On MiB with Swin-B,
NeST can improve the performance by 34.5% on the 15-1 setting and 36.2% on the 10-1 setting.
Fig. 3: The mIoU (%) at each step for the settings 15-1 (a) and 10-1 (b) on PASCAL VOC 2012, comparing MiB, PLOP, and RCIL with and without NeST.

Table 2: The mIoU (%) of the last step on the ADE20K dataset for four different
CISS scenarios. * implies results from the re-implementation of the official code.

Method  Backbone | 100-50 (2 steps): 0-100  101-150  all | 100-10 (6 steps): 0-100  101-150  all | 100-5 (11 steps): 0-100  101-150  all | 50-50 (3 steps): 0-50  51-150  all
Joint [55] Res101 44.3 28.2 38.9 44.3 28.2 38.9 44.3 28.2 38.9 51.1 33.3 38.9
ILT [37] Res101 18.3 14.8 17.0 0.1 2.9 1.1 0.1 1.3 0.5 13.6 6.2 9.7
PLOP+UCD [53] Res101 42.1 15.8 33.3 40.8 15.2 32.3 - - - 47.1 24.1 31.8
SPPA [31] Res101 42.9 19.9 35.2 41.0 12.5 31.5 - - - 49.8 23.9 32.5
MiB+AWT [21] Res101 40.9 24.7 35.6 39.1 21.3 33.2 38.6 16.0 31.1 46.6 26.9 33.5
ALIFE [39] Res101 42.2 23.1 35.9 41.0 22.8 35.0 - - - 49.0 25.7 33.6
GSC [14] Res101 42.4 19.2 34.8 40.8 17.6 32.6 39.5 11.2 30.2 46.2 26.2 33.0
MiB [5] Res101 40.5 17.2 32.8 38.2 11.1 29.2 36.0 5.6 25.9 45.6 21.0 29.3
MiB* [5] Res101 40.5 23.5 34.9 37.8 12.1 29.3 35.8 6.0 25.9 45.9 23.9 31.3
MiB+NeST (Ours) Res101 40.3 24.6 35.1 40.2 20.6 33.7 39.9 18.0 32.7 45.6 26.8 33.2
PLOP [16] Res101 41.9 14.9 32.9 40.5 13.6 31.6 39.1 7.8 28.7 48.8 21.0 30.4
PLOP* [16] Res101 42.2 15.3 33.3 41.0 13.7 32.0 39.6 8.2 29.2 48.5 20.8 30.2
PLOP+NeST (Ours) Res101 42.2 24.3 36.3 40.9 22.0 34.7 39.3 17.4 32.0 48.7 27.7 34.8
RCIL [55] Res101 42.3 18.8 34.5 39.3 17.6 32.1 38.5 11.5 29.6 48.3 25.0 32.5
RCIL* [55] Res101 42.3 16.5 33.8 39.9 15.7 31.9 39.1 12.2 30.2 48.2 23.6 31.9
RCIL+NeST (Ours) Res101 42.3 22.8 35.8 40.7 19.0 33.5 39.4 15.5 31.5 48.2 27.4 34.4
Joint Swin-B 43.4 31.9 39.6 43.4 31.9 39.6 43.4 31.9 39.6 50.7 33.9 39.6
MiB* Swin-B 42.7 26.1 37.2 40.2 15.0 31.8 39.1 8.6 29.0 48.3 26.8 34.1
MiB+NeST (Ours) Swin-B 42.8 27.8 37.9 41.8 23.8 35.9 40.5 19.9 33.7 49.7 29.3 36.2
PLOP* Swin-B 43.4 17.1 34.7 41.4 17.7 33.6 39.7 13.6 31.0 50.5 24.1 33.0
PLOP+NeST (Ours) Swin-B 43.5 26.5 37.9 41.7 24.2 35.9 39.7 18.3 32.6 50.6 28.9 36.2

In Tab. 1, the results across all settings for old and new
classes demonstrate that NeST significantly enhances stability by aligning new
classifiers with the existing backbone through the pre-tuning process, thereby
mitigating the forgetting of old knowledge. We also report the mIoU of our
method and the baselines at each step. As shown in Fig. 3, the performance of our method surpasses that of the baselines throughout the whole training process.
ADE20K. To further evaluate NeST, we conduct experiments on the more
challenging settings of the ADE20K dataset. Results of the 100-50, 100-10, 100-5 and 50-50 settings are shown in Tab. 2.
Fig. 4: The feature map similarity (a) and training loss (b) over training epochs for MiB and MiB+NeST on the Pascal VOC 2012 15-1 overlapped setting.

NeST outperforms other competing methods. On the most difficult 100-5 setting, NeST improves MiB [5], PLOP [16] and RCIL [55] by 6.8%, 2.8% and 1.3%, respectively, which indicates that NeST is also applicable to scenarios with a large number of classes.
Effectiveness of Our Method. It is crucial for pre-tuning new classifiers
before the formal training step, not only for learning new classes but also for
bridging the stability gap [15], which means that the model will undergo a short
period of forgetting when the task changes and then gradually recover. This
phenomenon is also discovered in CISS scenarios [2,50]. At the beginning of each
formal training step, a well-tuned new classifier can make the loss smaller, thus
there will also be less impact on other model parameters as the gradient values
become smaller. As the model’s parameters are inherited from the old model,
the old knowledge learned from previous steps can be preserved. We conduct
experiments on the 15-1 setting with the baseline MiB [5] and MiB+NeST. For a fair and clear comparison, we plot the losses and cosine feature similarities by the mean and standard deviation of each epoch during the formal training process of the second step, while in the first step we use the original MiB for all experiments. As
shown in Fig. 4a, the feature similarity of NeST exceeds that of MiB throughout
the entire recovery process, demonstrating the ability of NeST to help bridge the
stability gap in the model. Fig. 4b shows that during the formal training process,
the loss of NeST is lower than the loss of MiB, helping the model converge faster
and learn better.

4.3 Ablation Study

Different Classifier Initialization Methods. We compare NeST with other initialization strategies. As shown in Tab. 3, random initialization gets the worst performance because of the misalignment in the early stages of training. Initialization by the background classifier achieves better performance. Directly tuning the weights of new classifiers initialized by the background classifier before the formal training step (Two-Stage) brings a slight increase in performance.
Unlike the aforementioned approaches, NeST enables new classifiers to adapt to the training data and enhances the performance of the baseline by 16.5% on old classes and 4.7% on new classes.

Table 3: Ablation study of different classifier initialization strategies. The performance is reported on the Pascal VOC 2012 15-1 overlapped setting with MiB [5].

Method           0-15  16-20  all
Random           43.5   4.2   34.1
Background [5]   45.2  15.7   38.2
Two-Stage        46.0  15.3   38.7
NeST (Ours)      61.7  20.4   51.8

Table 4: Comparison between different transformation matrix initialization methods. The performance is reported on the Pascal VOC 2012 15-1 overlapped setting.

Method                         0-15  16-20  all
Random Matrix Initialization   53.8   7.5   42.8
NeST (Ours)                    61.7  20.4   51.8

Transformation Matrices Initialization. According to Tab. 4, if we use random initialization instead of our designed strategy, the overall performance is still higher than that of the baseline [5]. However, the mIoU of the new classes learned in incremental steps is significantly lower than the baseline's. This means that randomly initializing the importance and projection matrices only helps to maintain the stability of the model. The second row in Tab. 4 shows that initializing the importance and projection matrices with our proposed strategy increases the performance of new classes to 20.4%, while also improving the performance of the old classes. The design of initial values for the matrices used in the transformation facilitates the model's learning of new classes. Thus, by initializing the matrices with our strategy, NeST can achieve the trade-off between stability and plasticity.
Component-Level Ablation Study. We conduct a component-level ablation study on the two kinds of matrices, as shown in Tab. 5. Eliminating the importance matrix means that we set its values to 1 and freeze it, i.e., we only learn a combination of old classifiers to obtain new classifiers. Similarly, eliminating the projection matrix means that we average the weighted old classifiers to generate new classifiers. The experimental results demonstrate that when channels are treated equally, the performance drops dramatically because the differences between channels are ignored.
Computational Costs. The number of extra parameters can be represented as n_new × n_old × (d + 1) + d + n_new + 1, where n_old increases linearly with the number of steps while n_new is always fixed.
Fig. 5: Qualitative comparisons on the Pascal VOC 2012 15-1 overlapped setting. Columns correspond to steps 0-5 (base classes, then potted plant, sheep, sofa, train, tv/monitor); rows show the image with ground truth, MiB, and MiB+NeST.

Table 5: Component-level ablation study on the Pascal VOC 2012 15-1 overlapped setting.

Method     Projection Matrix  Importance Matrix  0-15  16-20  all
baseline   -                  -                  45.2  15.7   38.2
Variants   ✓                  -                  53.3  15.4   44.3
           -                  ✓                  59.7  19.2   50.1
           ✓                  ✓                  61.7  20.4   51.8

The additional FLOPs (about 15.6K) and parameters (about 5.4K) are negligible during the pre-tuning process of step 5 on the 15-1 setting. The whole training process of MiB+NeST takes an additional 2.7% (3.7 minutes) of the total training time (2.3 hours). Note that after pre-tuning, the extra parameters are removed and the formal training process is the same as MiB.
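As a rough sanity check on the reported 5.4K figure (our own arithmetic, assuming a classifier dimension d = 256, which matches the importance-matrix shape shown in Fig. 6), step 5 of the 15-1 setting has n_old = 20 (15 base classes, the background, and the 4 classes added in steps 1-4) and n_new = 1:

```latex
n_{new} \cdot n_{old} \cdot (d+1) + d + n_{new} + 1
  = 1 \cdot 20 \cdot 257 + 256 + 1 + 1
  = 5140 + 258
  = 5398 \approx 5.4\text{K}.
```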
Discussion on AWT v.s. NeST. AWT [21] utilizes the new training data and the old background classifier to generate new classifiers. Here we discuss the differences between AWT and NeST. AWT employs gradient-based attribution to transfer the most relevant weights to new classifiers. However, it treats each new class equally, neglecting the role of other old classifiers and the differences between new classes. Meanwhile, the gradient-based attribution technique introduces a huge memory cost: AWT cannot run on an RTX 3090 even with a batch size of 1. NeST is learning-based, which helps the generated classifier weights better align with the backbone. Moreover, as shown in Tab. 6, the memory cost is acceptable, and NeST is slightly faster than AWT.
Table 6: The memory costs and additional training time on the Pascal VOC 2012 15-1 overlapped setting.

Method             Memory per GPU  Additional Time
MiB+AWT            > 24 GB         6%
MiB+NeST (Ours)    6.47 GB         2.7%

Table 7: The average performance of five different class orders on Pascal VOC 2012
15-1 overlapped setting.

Method A B C D E avg
MiB 38.2 28.2 38.5 38.0 52.0 39.0
MiB+NeST (Ours) 51.8 43.4 50.4 60.0 59.8 51.5

Robustness of Different Class Orders. In CISS scenarios, the class order is crucial, as the learning process of one class may boost or damage the performance of another class. To validate the robustness of NeST, we conduct experiments on the 15-1 setting with five different class orders. As shown in Tab. 7, the performance of NeST is consistently higher than that of the baseline.

4.4 Visualization
In Fig. 5, we show the visualization results of our proposed method based on MiB [5]. Samples are selected from step 0 and step 5 to show the stability and plasticity of our NeST. In the first two rows, the classes person and horse are learned in step 0; in the following steps, the baseline MiB [5] gradually forgets concepts learned in step 0, while NeST helps the model preserve old knowledge. In the last two rows, the class chair is learned in step 0 and the class tv/monitor is learned in step 5. After learning the class train, MiB has almost forgotten the knowledge of the class chair, while NeST gives correct predictions for most of the pixels. In the last step, NeST helps the model learn the new class tv/monitor better than MiB, showing the plasticity of our method.

5 Conclusions
In this work, we propose a simple yet effective new classifier pre-tuning method
that can enhance the CISS ability by learning a transformation from old classi-
fiers to new classifiers. We further find a way to initialize matrices by utilizing the
information of cross-task class similarities between old classes and new classes,
helping the model achieve the stability-plasticity trade-off. Experiments on two
datasets show that NeST can significantly improve the performance of baselines,
and it can be easily applied to other CISS methods. While our approach can lead
to huge performance gains, the limitation is that it introduces additional com-
putational overhead. In future work, we will try our method in other continual
learning tasks.
Acknowledgments

This work is funded by NSFC (NO. 62206135), Key Program for International
Cooperation of Ministry of Science and Technology, China (NO. 2024YFE0100700),
Young Elite Scientists Sponsorship Program by CAST (NO. 2023QNRC001),
Tianjin Natural Science Foundation (NO. 23JCQNJC01470), and the Funda-
mental Research Funds for the Central Universities (Nankai University). Com-
putation is supported by the Supercomputing Center of Nankai University.

References

1. Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory
aware synapses: Learning what (not) to forget. In: Eur. Conf. Comput. Vis. pp.
139–154 (2018)
2. Baek, D., Oh, Y., Lee, S., Lee, J., Ham, B.: Decomposed knowledge distillation for
class-incremental semantic segmentation. In: Adv. Neural Inform. Process. Syst.
(2022)
3. Bang, J., Kim, H., Yoo, Y., Ha, J.W., Choi, J.: Rainbow memory: Continual learn-
ing with a memory of diverse samples. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 8218–8227 (2021)
4. Cermelli, F., Cord, M., Douillard, A.: Comformer: Continual learning in semantic
and panoptic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3010–
3020 (2023)
5. Cermelli, F., Mancini, M., Bulo, S.R., Ricci, E., Caputo, B.: Modeling the back-
ground for incremental learning in semantic segmentation. In: IEEE Conf. Comput.
Vis. Pattern Recog. pp. 9233–9242 (2020)
6. Cha, S., Yoo, Y., Moon, T., et al.: Ssul: Semantic segmentation with unknown label
for exemplar-based class-incremental learning. In: Adv. Neural Inform. Process.
Syst. vol. 34, pp. 10919–10930 (2021)
7. Chaudhry, A., Dokania, P.K., Ajanthan, T., Torr, P.H.: Riemannian walk for in-
cremental learning: Understanding forgetting and intransigence. In: Eur. Conf.
Comput. Vis. pp. 532–547 (2018)
8. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se-
mantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
9. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder
with atrous separable convolution for semantic image segmentation. In: Eur. Conf.
Comput. Vis. pp. 801–818 (2018)
10. Chen, Y.H., Chen, W.Y., Chen, Y.T., Tsai, B.C., Frank Wang, Y.C., Sun, M.: No
more discrimination: Cross city adaptation of road scene segmenters. In: Int. Conf.
Comput. Vis. pp. 1992–2001 (2017)
11. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention
mask transformer for universal image segmentation. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 1290–1299 (2022)
12. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention
mask transformer for universal image segmentation. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 1290–1299 (2022)
13. Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for
semantic segmentation. In: Adv. Neural Inform. Process. Syst. vol. 34, pp. 17864–
17875 (2021)
14. Cong, W., Cong, Y., Dong, J., Sun, G., Ding, H.: Gradient-semantic compensation
for incremental semantic segmentation. IEEE Trans. Multimedia (2023)
15. De Lange, M., van de Ven, G., Tuytelaars, T.: Continual evaluation for lifelong
learning: Identifying the stability gap. arXiv preprint arXiv:2205.13452 (2022)
16. Douillard, A., Chen, Y., Dapogny, A., Cord, M.: Plop: Learning without forgetting
for continual semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2021)
17. Douillard, A., Cord, M., Ollion, C., Robert, T., Valle, E.: Podnet: Pooled outputs
distillation for small-tasks incremental learning. In: IEEE Conf. Comput. Vis. Pat-
tern Recog. pp. 86–102. Springer (2020)
18. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results. http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html (2012)
19. Gao, S.H., Cheng, M.M., Zhao, K., Zhang, X.Y., Yang, M.H., Torr, P.: Res2net:
A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell.
43(2), 652–662 (2019)
20. Geng, C., Huang, S.j., Chen, S.: Recent advances in open set recognition: A survey.
IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3614–3631 (2020)
21. Goswami, D., Schuster, R., van de Weijer, J., Stricker, D.: Attribution-aware weight
transfer: A warm-start initialization for class-incremental semantic segmentation.
In: WACV. pp. 3195–3204 (2023)
22. Grossberg, S.T.: Studies of mind and brain: Neural principles of learning, per-
ception, development, cognition, and motor control, vol. 70. Springer Science &
Business Media (2012)
23. Guo, M.H., Lu, C.Z., Hou, Q., Liu, Z., Cheng, M.M., Hu, S.M.: Segnext: Rethinking
convolutional attention design for semantic segmentation. In: Adv. Neural Inform.
Process. Syst. vol. 35, pp. 1140–1156 (2022)
24. Hadsell, R., Rao, D., Rusu, A.A., Pascanu, R.: Embracing change: Continual learn-
ing in deep neural networks. Trends in cognitive sciences 24(12), 1028–1040 (2020)
25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conf. Comput. Vis. Pattern Recog. (2016)
26. Hu, Z., Li, Y., Lyu, J., Gao, D., Vasconcelos, N.: Dense network expansion for class
incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11858–11867
(2023)
27. Hung, C.Y., Tu, C.H., Wu, C.E., Chen, C.H., Chan, Y.M., Chen, C.S.: Compacting,
picking and growing for unforgetting continual learning. In: Adv. Neural Inform.
Process. Syst. vol. 32 (2019)
28. Kim, D., Han, B.: On the stability-plasticity dilemma of class-incremental learning.
In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 20196–20204 (2023)
29. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu,
A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming
catastrophic forgetting in neural networks. Proceedings of the national academy of
sciences 114(13), 3521–3526 (2017)
30. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach.
Intell. 40(12), 2935–2947 (2017)
31. Lin, Z., Wang, Z., Zhang, Y.: Continual semantic segmentation via structure pre-
serving and projected feature alignment. In: Eur. Conf. Comput. Vis. pp. 345–361.
Springer (2022)
32. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Int. Conf.
Comput. Vis. pp. 10012–10022 (2021)
33. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3431–3440 (2015)
34. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3431–3440 (2015)
35. Maracani, A., Michieli, U., Toldo, M., Zanuttigh, P.: Recall: Replay-based continual
learning in semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 7026–7035 (2021)
36. McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks:
The sequential learning problem. In: Psychology of learning and motivation, vol. 24,
pp. 109–165. Elsevier (1989)
37. Michieli, U., Zanuttigh, P.: Incremental learning techniques for semantic segmen-
tation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 0–0 (2019)
38. Michieli, U., Zanuttigh, P.: Continual semantic segmentation via repulsion-
attraction of sparse and disentangled latent representations. In: IEEE Conf. Com-
put. Vis. Pattern Recog. pp. 1114–1124 (2021)
39. Oh, Y., Baek, D., Ham, B.: Alife: Adaptive logit regularizer and feature replay for
incremental semantic segmentation. In: Adv. Neural Inform. Process. Syst. vol. 35,
pp. 14516–14528 (2022)
40. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction.
In: Int. Conf. Comput. Vis. pp. 12179–12188 (2021)
41. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
42. Shang, C., Li, H., Meng, F., Wu, Q., Qiu, H., Wang, L.: Incrementer: Transformer
for class-incremental semantic segmentation with knowledge distillation focusing
on old class. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7214–7224 (2023)
43. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative
replay. In: Adv. Neural Inform. Process. Syst. vol. 30 (2017)
44. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for seman-
tic segmentation. In: Int. Conf. Comput. Vis. pp. 7262–7272 (2021)
45. Verwimp, E., De Lange, M., Tuytelaars, T.: Rehearsal revealed: The limits and
merits of revisiting samples in continual learning. In: Int. Conf. Comput. Vis. pp.
9385–9394 (2021)
46. Vinogradova, K., Dibrov, A., Myers, G.: Towards interpretable semantic segmenta-
tion via gradient-weighted class activation mapping (student abstract). In: AAAI.
pp. 13943–13944 (2020)
47. Wang, E., Peng, Z., Xie, Z., Yang, F., Liu, X., Cheng, M.M.: Unlocking the multi-modal potential of clip for generalized category discovery (2024), https://arxiv.org/abs/2403.09974
48. Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y.,
Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)
49. Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., Fu, Y.: Large scale incre-
mental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 374–382 (2019)
50. Xiao, J.W., Zhang, C.B., Feng, J., Liu, X., van de Weijer, J., Cheng, M.M.: End-
points weight fusion for class incremental semantic segmentation. In: IEEE Conf.
Comput. Vis. Pattern Recog. pp. 7204–7213 (2023)
51. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer:
Simple and efficient design for semantic segmentation with transformers. Adv. Neu-
ral Inform. Process. Syst. 34, 12077–12090 (2021)
52. Yan, S., Xie, J., He, X.: Der: Dynamically expandable representation for class
incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3014–3023
(2021)
53. Yang, G., Fini, E., Xu, D., Rota, P., Ding, M., Nabi, M., Alameda-Pineda, X.,
Ricci, E.: Uncertainty-aware contrastive distillation for incremental semantic seg-
mentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 2567–2581 (2022)
54. Yoon, J., Yang, E., Lee, J., Hwang, S.J.: Lifelong learning with dynamically ex-
pandable networks. arXiv preprint arXiv:1708.01547 (2017)
55. Zhang, C.B., Xiao, J.W., Liu, X., Chen, Y.C., Cheng, M.M.: Representation com-
pensation networks for continual semantic segmentation. In: IEEE Conf. Comput.
Vis. Pattern Recog. pp. 7053–7064 (2022)
56. Zhang, Z., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Coinseg: Contrast inter-and intra-
class representations for incremental segmentation. In: Int. Conf. Comput. Vis. pp.
843–853 (2023)
57. Zhang, Z., Zhang, X., Peng, C., Xue, X., Sun, J.: Exfuse: Enhancing feature fusion
for semantic segmentation. In: Eur. Conf. Comput. Vis. pp. 269–284 (2018)
58. Zhao, B., Xiao, X., Gan, G., Zhang, B., Xia, S.T.: Maintaining discrimination and
fairness in class incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 13208–13217 (2020)
59. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T.,
Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
6881–6890 (2021)
60. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing
through ade20k dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 633–641
(2017)
61. Zhu, F., Zhang, X.Y., Wang, C., Yin, F., Liu, C.L.: Prototype augmentation and
self-supervision for incremental learning. In: IEEE Conf. Comput. Vis. Pattern
Recog. pp. 5871–5880 (2021)
A Baseline Details
In this section, we introduce the baselines used in experiments.
MiB. In MiB [5], two kinds of loss are used to model the background, i.e.,
Lunce and Lunkd :

\mathcal{L}_{unce} = -\frac{1}{|\mathcal{I}|} \sum_{i\in\mathcal{I}} \log \tilde{q}_x^t(i, y_i)\,, \qquad
\mathcal{L}_{unkd} = -\frac{1}{|\mathcal{I}|} \sum_{i\in\mathcal{I}} \sum_{c\in\bigcup_{j=1}^{t-1}\mathcal{C}_j} q_x^{t-1}(i, c) \log \hat{q}_x^t(i, c)\,,   (9)

where $\mathcal{I}$ denotes the pixel set of an image, $y_i \in \{c_{bg}\} \cup \mathcal{C}_t$ denotes the ground-truth label of pixel $i$, $q_x^t$ denotes the output of the model at step $t$, and $\tilde{q}_x^t$ and $\hat{q}_x^t$ denote the modified outputs of the current model, considering the old classes for the cross-entropy loss and the new classes for the knowledge distillation loss.
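A sketch of the first term is given below (our reading of MiB's unbiased cross entropy; the background is assumed to sit at index 0, and ignore-index handling is omitted): pixels labeled as background aggregate the probability mass of the background and all old classes.

```python
import torch
import torch.nn.functional as F

def unbiased_ce(logits: torch.Tensor,   # (B, C, H, W) logits over bg + old + new classes
                labels: torch.Tensor,   # (B, H, W) long labels in {0 (bg)} ∪ C_t
                num_old: int            # number of old classes, excluding the background
                ) -> torch.Tensor:
    """Background pixels may hide old classes, so their target probability is the
    sum of the background and old-class probabilities (Eq. (9), first term)."""
    log_p = F.log_softmax(logits, dim=1)
    # log of the summed probability over {bg} ∪ old classes
    bg_and_old = torch.logsumexp(log_p[:, : num_old + 1], dim=1)   # (B, H, W)
    picked = log_p.gather(1, labels.unsqueeze(1)).squeeze(1)       # (B, H, W)
    picked = torch.where(labels == 0, bg_and_old, picked)
    return -picked.mean()
```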
PLOP. Different from MiB [5], PLOP [16] utilizes pseudo-labeling to address
the issue of background shift, as follows:

\mathcal{L}_{pseudo} = -\frac{\nu}{WH}\sum_{w,h}^{W,H}\sum_{c\in\mathcal{C}_t}\tilde{S}(w,h,c)\log\hat{S}^t(w,h,c)\,,   (10)

where Ŝ denotes the prediction of the model and S̃ denotes the pseudo-labels
generated by the old model in the previous step.
It also distills intermediate features by Local POD, as follows:

\mathcal{L}_{LocalPod} = \frac{1}{L}\sum_{l=1}^{L}\left\| \Phi(f_l^t(I)) - \Phi(f_l^{t-1}(I)) \right\|\,,   (11)

where $L$ denotes the number of layers, $\Phi$ denotes the Local POD embedding extraction, and $f_l^t(I)$ denotes the output feature of layer $l$ for the input $I$.
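A simplified sketch of the pooled-distillation idea behind Eq. (11) follows (a single-scale width/height pooling; PLOP's full Local POD additionally pools over local regions at multiple scales, so this is only an approximation of the embedding $\Phi$):

```python
import torch

def pod_embedding(feat: torch.Tensor) -> torch.Tensor:
    """Concatenate width- and height-pooled activations of a feature map.
    feat: (B, C, H, W); output: (B, C, H + W). Single-scale simplification."""
    width_pool = feat.mean(dim=3)                       # (B, C, H)
    height_pool = feat.mean(dim=2)                      # (B, C, W)
    return torch.cat([width_pool, height_pool], dim=2)

def pod_distillation_loss(feats_new, feats_old) -> torch.Tensor:
    """Eq. (11)-style loss averaged over L layers of intermediate features."""
    losses = [torch.norm(pod_embedding(fn) - pod_embedding(fo), p=2)
              for fn, fo in zip(feats_new, feats_old)]
    return torch.stack(losses).mean()
```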
RCIL. RCIL [55] decouples the remembering of old knowledge and the learning
of new knowledge by adding a parallel module composed of a convolution layer
and a normalization layer for each 3 × 3 convolution module. At step 0, all pa-
rameters are trainable. At the beginning of each incremental step, two branches
of the old model are fused into one frozen branch to memorize the old knowl-
edge, while the other branch is learnable. A drop path strategy is also used when
fusing the outputs of two branches, which can be denoted as:

x_{out} = \eta \cdot x_1 + (1-\eta)\cdot x_2\,,   (12)

where $x_{out}$ denotes the fused output, $x_1$ and $x_2$ denote the outputs of the two branches, and $\eta$ denotes a channel-wise weight vector. During training, $\eta$ is sampled from the set $\{0, 0.5, 1\}$, and for evaluation $\eta$ is set to 0.5. RCIL also proposes Pooled Cube Knowledge Distillation, which applies average pooling over the spatial and channel dimensions.
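A minimal sketch of the channel-wise fusion in Eq. (12), with η sampled from {0, 0.5, 1} during training and fixed to 0.5 at evaluation (our illustration; the sampling granularity in the official code may differ):

```python
import torch

def fuse_branches(x1: torch.Tensor, x2: torch.Tensor, training: bool) -> torch.Tensor:
    """Eq. (12): x_out = eta * x1 + (1 - eta) * x2 with a channel-wise eta.
    x1, x2: (B, C, H, W) outputs of the frozen and the learnable branch."""
    C = x1.shape[1]
    if training:
        # sample eta per channel from {0, 0.5, 1} (drop-path style)
        choices = torch.tensor([0.0, 0.5, 1.0], device=x1.device)
        eta = choices[torch.randint(0, 3, (1, C, 1, 1), device=x1.device)]
    else:
        eta = torch.full((1, C, 1, 1), 0.5, device=x1.device)
    return eta * x1 + (1.0 - eta) * x2
```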
B Results and Analysis of Disjoint Settings


In Disjoint settings, the background classifier at each step never sees any future class, so MiB's initialization struggles in these settings, while our NeST leverages semantic knowledge from the old classifiers to generate new classifiers for initialization, and the pre-tuning process also benefits the stability of the model. Results of NeST and the baselines on the 15-1 Disjoint and 10-1 Disjoint settings are shown in Tab. 8, indicating that NeST can significantly improve the performance of previous methods in Disjoint settings.

Table 8: Results of Disjoint settings on the Pascal VOC 2012 dataset.

Method       15-1 Disjoint  10-1 Disjoint
MiB          38.6            2.0
MiB+NeST     41.0           20.5
PLOP         40.7           12.6
PLOP+NeST    52.7           22.4

C Experiments of COCO-Stuff 10K


To demonstrate that our NeST also applies to scenarios with more classes, we introduce another dataset for class incremental semantic segmentation, COCO-Stuff 10K, to evaluate the effectiveness of our method. COCO-Stuff 10K includes 80 thing classes and 91 stuff classes and is a subset of the original COCO-Stuff dataset. We evaluate NeST on the 80-91 (2 steps) overlapped setting, which contains more classes than ADE20K and Pascal VOC 2012. As shown in Tab. 9, our method can handle scenarios with more classes in one step.

Table 9: Results of the 80-91 overlapped setting on the COCO-Stuff 10K dataset.

Method       0-80  81-171  all
MiB          40.2  18.6    28.9
MiB+NeST     41.9  20.7    30.7
PLOP         46.1  17.0    30.8
PLOP+NeST    46.0  18.8    31.6

D Comparisons with Transformer-based SOTA Methods


In recent years, many Transformer-based CISS methods have emerged; here we briefly discuss the differences between NeST and these methods. Comformer [4]
uses a universal segmentation model, Mask2Former [12], to perform mask classification for continual panoptic segmentation and continual semantic segmentation.
CoinSeg [56] introduces a pretrained Mask2Former model as a class-agnostic mask generator. Incrementer [42] sequentially adds tokens of new classes and performs dot products between features and the updated class tokens to generate segmentation predictions. Our baseline is based on a simple per-pixel classification model, SETR, with ViT-B as the backbone; equipped with NeST, it achieves SOTA performance, as shown in Tab. 10. Moreover, NeST has the potential to be integrated into these Transformer-based methods, and we leave this as future work.

Table 10: Comparisons with Transformer-based SOTA methods.

Method            Backbone  Model                 15-1  15-5  10-1
CoinSeg           Swin-B    Deeplab+Mask2Former   75.5  77.6  70.5
Incrementer       ViT-B     Segmenter             75.5  79.9  70.2
MiB               ViT-B     SETR                  53.3  80.2  25.5
MiB+NeST (Ours)   ViT-B     SETR                  76.5  80.3  71.9

E More Implementation Details

Weight Align. To prevent the new classifiers' weights from becoming too large during the pre-tuning process, we apply Weight Aligning (WA) [58] as follows:

\hat{w}_{new} = w_{new} \cdot \frac{\mathrm{Mean}(\mathrm{Norm}_{old})}{\mathrm{Mean}(\mathrm{Norm}_{new})}\,,   (13)

where $\mathrm{Norm}_{old}$ and $\mathrm{Norm}_{new}$ denote the norms of the old and new classifiers' weights, and $\mathrm{Mean}(\cdot)$ denotes taking the mean value. The experimental results in Tab. 11 show that WA can correct the biased weights and thus boost the performance of NeST.
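A minimal sketch of the WA correction in Eq. (13), assuming the classifier weights are stored row-wise as (n_old, d) and (n_new, d) matrices (a layout we choose for illustration):

```python
import torch

def weight_align(w_old: torch.Tensor, w_new: torch.Tensor) -> torch.Tensor:
    """Rescale new classifier weights so that their mean norm matches the mean
    norm of the old classifier weights (Weight Aligning, Eq. (13))."""
    mean_norm_old = w_old.norm(dim=1).mean()
    mean_norm_new = w_new.norm(dim=1).mean()
    return w_new * (mean_norm_old / mean_norm_new)
```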

Table 11: Ablation study of Weight Align for NeST. All performances are reported on the 15-1 setting.

Method              0-15  16-20  all
MiB+NeST w/o WA     58.4  10.9   47.1
MiB+NeST w/ WA      61.7  20.4   51.8
PLOP+NeST w/o WA    72.5  32.4   62.9
PLOP+NeST w/ WA     72.2  33.7   63.1
Fix old classifiers. We find that the pseudo-labeling strategy may change
the geometric structure of old classifiers severely, which has a detrimental impact
on our method. This phenomenon is particularly obvious on the Pascal VOC
2012 dataset. To preserve the old knowledge learned in previous steps, following EWF [50], we fix the old classifiers during the formal training steps in the Pascal VOC 2012 settings. Relevant experimental results are shown in Tab. 12.

Table 12: Ablation study of fixing previous classifiers for our method based on PLOP [16]. All performances are reported on the 15-1 setting.

Method              0-15  16-20  all
PLOP w/ fix         56.9  11.3   46.0
PLOP+NeST w/o fix   66.8  20.2   55.7
PLOP+NeST w/ fix    72.2  33.7   63.1
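One simple way to keep the old classifiers fixed while the new channels keep training is to zero their gradients with a hook, as in the hedged sketch below (our own illustration; the paper does not specify the exact mechanism, and an optimizer with weight decay or momentum would additionally need the old rows excluded):

```python
import torch
import torch.nn as nn

def freeze_old_classifier_rows(cls_conv: nn.Conv2d, num_old: int) -> None:
    """Zero the gradient of the first num_old output channels of the 1x1
    classifier so that old class weights stay fixed during formal training."""
    def weight_hook(grad: torch.Tensor) -> torch.Tensor:
        grad = grad.clone()
        grad[:num_old] = 0
        return grad
    cls_conv.weight.register_hook(weight_hook)

    if cls_conv.bias is not None:
        def bias_hook(grad: torch.Tensor) -> torch.Tensor:
            grad = grad.clone()
            grad[:num_old] = 0
            return grad
        cls_conv.bias.register_hook(bias_hook)
```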

F Further Analysis

Fig. 6: Visualization of the importance matrix on ADE20K 100-5 step1.

Effectiveness of the importance matrix. For plasticity, we learn to generate a new classifier from relevant old classifiers. In particular, the importance matrix can capture the semantic relationship between old and new classifiers at the channel level. To verify this, we visualize the importance matrix M ∈ R^{100×256} of the new class van on ADE20K 100-5 step 1. We normalize the absolute values to [0, 1]; a lighter color means a higher value. As shown in Fig. 6, the bg class (row 0) and the car class (row 21) make the largest contributions. This is intuitive, as the class van may appear in old data labeled as bg, and van and the old class car are close in semantic relationship.
Different class orders. To evaluate the effectiveness of our method, following PLOP [16], we use five different class orders of the Pascal VOC 2012 15-1 overlapped setting, as follows:

A: [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20],
B: [0, 12, 9, 20, 7, 15, 8, 14, 16, 5, 19, 4, 1, 13, 2, 11, 17, 3, 6, 18, 10],
C: [0, 13, 19, 15, 17, 9, 8, 5, 20, 4, 3, 10, 11, 18, 16, 7, 12, 14, 6, 1, 2],
D: [0, 15, 3, 2, 12, 14, 18, 20, 16, 11, 1, 19, 8, 10, 7, 17, 6, 5, 13, 9, 4],
E: [0, 7, 5, 3, 9, 13, 12, 14, 19, 10, 2, 1, 4, 16, 8, 17, 15, 18, 6, 11, 20].   (14)

Fig. 7: More qualitative results (image with ground truth, MiB, and MiB+NeST at steps 0-5). All experiments are conducted on the 15-1 setting.

More qualitative results. More qualitative results are shown in Fig. 7 and Fig. 8. By applying the pre-tuning process, our method can help the model preserve old knowledge. Moreover, to validate the effectiveness of the matrix initialization, we also visualize the class activation map for the last class tv/monitor. To visualize segmentation CAMs, we adopt the method proposed in [46]. As shown in Fig. 9, with the designed matrix initialization strategy, the model pays more attention to areas of new classes.
Fig. 8: More qualitative results. All experiments are conducted on the 15-1 setting.
Fig. 9: Class activation maps for the last class tv/monitor on the 15-1 setting w/o (top row) and w/ (bottom row) our matrix initialization strategy.
