Early Preparation Pays Off: New Classifier Pre-tuning for Class Incremental Semantic Segmentation
Z. Xie et al.
1 VCIP, CS, Nankai University
2 NKIARI, Shenzhen Futian
3 SICE, UESTC
{xiezhengyuan}@mail.nankai.edu.cn, {xialei}@nankai.edu.cn
1 Introduction
Nowadays, segmentation models driven by deep neural networks have achieved
notable success in various fields [9, 10, 33]. However, these models are usually
trained and tested in a static environment [20], which is contrary to real-world
scenarios [47]. When facing a continuous data stream [24], a model needs to handle several novel classes. Blending new and previous data to retrain the model, also known as Joint Training, maintains the performance of the model. However, the computational cost becomes prohibitive when dealing with large datasets. Simply fine-tuning the model, on the other hand, may lead to the loss of previously acquired knowledge, a phenomenon known as catastrophic forgetting.
Fig. 1: Different classifier initialization methods for class incremental semantic seg-
mentation. MiB [5] directly uses background classifiers to initialize new classifiers.
Some methods [2,6] train an auxiliary classifier for future classes. AWT [21] selects the
most relevant weights from the background classifier for new classifiers’ initialization
by gradient-based attribution. Our new classifier pre-tuning method learns a transfor-
mation from all old classifiers to generate new classifiers for initialization.
2 Related Work
Semantic Segmentation. In recent years, the development of deep learning techniques has greatly advanced semantic segmentation. Fully Convolutional Networks (FCN) [34] pioneered end-to-end semantic segmentation, and a series of convolution-based segmentation networks have achieved strong performance on many benchmarks [8, 19, 23, 41, 48, 57]. More recently, the Transformer architecture has become increasingly popular and has reached remarkable performance [11, 13, 32, 40, 44, 51, 59]. Accordingly, we conduct experiments on two different backbones: ResNet and Swin Transformer.
Class Incremental Learning. Class incremental learning tackles catastrophic forgetting in tasks with ever-increasing categories. The whole training process is divided into several steps, and in each step the model is required to learn one or more classes while preserving knowledge of earlier ones, which is highly challenging. Existing methods can be categorized into three groups. Model expansion methods [26, 27, 52, 54] enlarge the model in incremental steps to learn new knowledge while preserving old knowledge in fixed parameters. Rehearsal-based methods store a
series of exemplars [3, 45] or prototypes along the task sequence [61], or use generative networks to maintain old knowledge [35, 43]. Parameter regularization methods either constrain the learning of some parameters [7, 30] or use knowledge distillation techniques [1, 16, 17, 49].
Class Incremental Semantic Segmentation. CISS requires a segmenta-
tion model to recognize all learned classes in continuous learning steps. Different
from classification tasks, segmentation has its own background shift [5] issue.
MiB [5] first proposes unbiased cross entropy and distillation to address back-
ground shift. PLOP [16] uses pseudo-label and intermediate features to transfer
old knowledge. SDR [38] proposes a contrastive-learning-based method, min-
imizing intra-class feature distances. RCIL [55] introduces reparameterization
to continual semantic segmentation with a complementary network structure.
GSC [14] explores incremental semantic segmentation considering the gradient
and semantic compensation. EWF [50] fuses old and new knowledge with a
weight fusion strategy. Many methods have explored the classifier initialization
in CISS. SSUL [6] employs auxiliary data to train an 'unknown' classifier and uses it to initialize new classifiers. DKD [2] proposes a decomposed knowledge distillation, improving both rigidity and stability. AWT [21] applies gradient-based attribution to transfer the relevant weights of old classifiers to new classifiers. Different from the methods above, we propose a new classifier pre-tuning method that achieves a better trade-off between plasticity and stability.
3 Method
3.1 Preliminaries
Problem Definition. Following [5, 16], CISS consists of several steps $\{t\}_{t=1}^{n}$ that receive a sequential data stream $\{\mathcal{D}_t\}_{t=1}^{n}$ with class sets $\{\mathcal{C}_t\}_{t=1}^{n}$. In each step $t$, the model is required to learn $|\mathcal{C}_t|$ new classes, while old training data is not available. For an image in the current step, pixels belonging to $\mathcal{C}_t$ are labeled with their ground-truth classes, and all other pixels are labeled as background. After the last step, the model is tested on the data of all learned classes.
At step $t$, the segmentation model is composed of a feature extractor $f_\theta^t$ and a classifier $h_\phi^t$, where $\phi_t$ denotes the newly added classifier weights and $\phi_{1:t-1}$ the weights of previous classifiers. In this paper, our main concern is the initialization of $\phi_t$.
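To make the background-shift protocol concrete, the following is a minimal sketch (not the authors' code; the tensor layout and the background index 0 are our assumptions) of how a step-$t$ ground-truth map is formed: pixels of classes outside $\mathcal{C}_t$ are relabeled as background.

```python
import torch

def remap_labels(gt, current_classes, bg_index=0):
    """gt: (H, W) integer label map containing classes from all steps.
    Classes not in the current class set C_t are mapped to background."""
    remapped = torch.full_like(gt, bg_index)
    for c in current_classes:
        remapped[gt == c] = c
    return remapped

# Example for a 15-1-style step where only class 16 is being learned:
# pixels of old class 3 and future class 17 both collapse into background.
gt = torch.tensor([[3, 16], [17, 0]])
print(remap_labels(gt, {16}))  # tensor([[ 0, 16], [ 0,  0]])
```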
Revisiting Previous Initialization Methods. In this part, we illustrate
the shortcomings of previous initialization methods and further clarify our mo-
tivation. As mentioned before, classes in Ct will appear as background in pre-
vious steps, which is known as the background shift. According to this premise,
previous methods either try to find a better way of utilizing the background
classifier [5, 21], or train an auxiliary classifier for future classes, both neglecting
the differences between new classes [2, 6]. Meanwhile, a mismatch arises between the new classifiers and the feature extractor, since the methods above never train the new classifiers on data from the new classes. This may lead to drastic parameter changes in the feature extractor when training with new data, thus undermining previous knowledge.
Fig. 2: Illustration of our new classifier pre-tuning (NeST) method. The left side of the figure shows one iteration of the new classifier pre-tuning process. The right side shows the cross-task class similarity-based initialization of the importance and projection matrices before the pre-tuning process.
Concretely, for each new class $c$, NeST generates its classifier weight from all old classifier weights $\mathbf{W}_{old}$ through a learnable importance matrix $\mathbf{M}_c$ and a projection matrix $\mathbf{P}_c$:
\begin{aligned} \label{eqn:transform} \mathbf{w}_{c} = (\mathbf{M}_c \odot \mathbf{W}_{old})\,\mathbf{P}_c\,, \end{aligned} \qquad (1)
and the new classifier is then initialized as the concatenation of the generated weights:
\begin{aligned} \phi_t = \mathrm{concat}[\mathbf{w}_{n_{old}}, \ldots, \mathbf{w}_{n_{old}+\left|\mathcal{C}_{t}\right| - 1}]\,. \end{aligned} \qquad (2)
The background classifier is transformed analogously from the old background weight $\mathbf{w}_0$:
\begin{aligned} \label{eqn:transform_bg} \hat{\mathbf{w}}_{0} = (\mathbf{M}_0 \odot \mathbf{w}_{0})\,\mathbf{P}_0\,. \end{aligned} \qquad (3)
During pre-tuning, the generated classifiers are supervised by the unbiased cross-entropy loss [5]:
\begin{aligned} \label{eqn:unce} \mathcal{L}_{unce} = -\frac{1}{\left| \mathcal{I} \right|} \sum_{i\in \mathcal{I}}\log \tilde{q}_x^t(i, y_i)\,, \end{aligned} \qquad (4)
where $\mathcal{I}$ denotes the pixel set of an image, $y_i$ denotes the ground-truth label of pixel $i$ at the current step, and $\tilde{q}_x^t$ denotes the modified output of the current model. It
should be noted that during the whole pre-tuning process, the importance matri-
ces and projection matrices are learnable and will be updated, while other com-
ponents such as the feature extractor remain frozen. After the pre-tuning process,
we use importance matrices and projection matrices to generate weights of each
new classifier for initialization, and additional parameters will be removed.
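As a reference, the following PyTorch-style sketch shows how the transformation of Eqs. (1)-(2) could be implemented during pre-tuning. It is an illustration under our own naming, not the released NeST code, and the uniform initialization of M and P here is only a placeholder for the similarity-based initialization described next.

```python
import torch
import torch.nn as nn

class NewClassifierGenerator(nn.Module):
    """Generates new classifier weights from frozen old ones, Eqs. (1)-(2)."""

    def __init__(self, w_old, num_new):
        super().__init__()
        d, n_old = w_old.shape
        self.register_buffer("w_old", w_old)  # frozen old classifiers, (d, n_old)
        # Learnable importance matrices M_c and projection matrices P_c
        # (placeholder init; NeST initializes them from cross-task similarity).
        self.M = nn.Parameter(torch.ones(num_new, d, n_old))
        self.P = nn.Parameter(torch.full((num_new, n_old, 1), 1.0 / n_old))

    def forward(self):
        # Eq. (1): w_c = (M_c ⊙ W_old) P_c for every new class c,
        # stacked into the new classifier phi_t as in Eq. (2).
        new_w = torch.matmul(self.M * self.w_old, self.P)  # (num_new, d, 1)
        return new_w.squeeze(-1)                           # (num_new, d)

# During pre-tuning only M and P receive gradients (the backbone and W_old stay
# frozen); after pre-tuning, forward() yields the initialization of phi_t and
# the generator is discarded.
gen = NewClassifierGenerator(torch.randn(256, 16), num_new=1)
phi_t = gen()  # (1, 256)
```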
To initialize this transformation, we feed each training image of the current step to the old model and take the corresponding output of the layer before $h_\phi^{t-1}$. Consider the classification of a new-class pixel embedding $\mathbf{p}_u \in \mathbb{R}^d$ extracted from the old model. Assuming the total number of old classes is $n_{old}$, the matrix multiplication can be decomposed into an element-wise multiplication (the Hadamard product) between $\mathbf{p}_u$ and the old classifier weight $\mathbf{W}_{old} \in \mathbb{R}^{d \times n_{old}}$, followed by a sum-softmax operation, as follows:
\begin{split} \label{cross_task_similarity} \mathbf{H}_{u} &= (\mathbf{W}_{old} \odot \mathbf{p}_u')\,,\\ \mathbf{s}_u &= \mathrm{softmax}(\mathrm{sum}(\mathbf{H}_{u}))^{\top}\,, \end{split} \qquad (5)
where $\mathbf{p}_u'$ denotes $\mathbf{p}_u$ broadcast to $\mathbb{R}^{d \times n_{old}}$. We then binarize $\mathbf{H}_u$ to keep only the positive responses:
\mathbf{H}_u^{mask}(i, j)= \begin{cases} 1, & \mathbf{H}_u(i, j)>0\,, \\ 0, & \text{otherwise}\,. \end{cases} \label{H_mask} \qquad (6)
Finally, by utilizing the ground-truth masks of the current step, for each pixel embedding belonging to a new class $c_{new}$, we calculate the Hadamard product between its corresponding $\mathbf{H}^{mask}$ and predicted score, then average the results to obtain the importance matrix weight $\mathbf{M}_{c_{new}}$ of class $c_{new}$, as follows:
\begin{aligned} \label{eqn:H_avg} \mathbf{M}_{c_{new}} = \frac{1}{N} \sum_{\mathbf{p}_u} \mathbf{H}_u^{mask} \odot \mathbf{s}_u'\,, \end{aligned} \qquad (7)
where $\mathbf{p}_u$ denotes a pixel embedding belonging to the new class $c_{new}$, $\mathbf{s}_u' \in \mathbb{R}^{d \times n_{old}}$ denotes the predicted score $\mathbf{s}_u$ of $\mathbf{p}_u$ broadcast $d$ times, and $N$ denotes the number of pixels belonging to the new class $c_{new}$. We use $\mathbf{s}_u'$ so that an old class with a very small score contributes little to the initialization of the new class, since the similarity between the two classes is relatively low. For the projection matrix weight $\mathbf{P}_{c_{new}}$, we sum $\mathbf{M}_{c_{new}}$ along the channel dimension and apply the softmax function to obtain the weight scores for old classes, as follows:
\begin{aligned} \mathbf{P}_{c_{new}} = (\mathrm{softmax}(\mathrm{sum}(\mathbf{M}_{c_{new}})))^{\top}\,. \end{aligned} \qquad (8)
We then use $\mathbf{P}_{c_{new}} \in \mathbb{R}^{n_{old} \times 1}$ to initialize the projection matrix of class $c_{new}$, as these scores reflect the degree of similarity between the new class and the old classes, deciding which old classes the model should focus on for knowledge transfer. We provide the pseudo-code of NeST in Algorithm 1.
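The similarity-based initialization of Eqs. (5)-(8) can be summarized in a few lines. The sketch below follows the paper's notation but uses our own function name and assumes the pixel embeddings of one new class have already been collected from the frozen old model.

```python
import torch

def init_importance_and_projection(pix_emb, w_old):
    """pix_emb: (N, d) embeddings of one new class from the frozen old model.
    w_old:   (d, n_old) old classifier weights (background included).
    Returns M_{c_new} of shape (d, n_old) and P_{c_new} of shape (n_old, 1)."""
    H = w_old.unsqueeze(0) * pix_emb.unsqueeze(-1)          # (N, d, n_old), Eq. (5)
    s = torch.softmax(H.sum(dim=1), dim=-1)                 # (N, n_old), sum-softmax
    H_mask = (H > 0).float()                                # (N, d, n_old), Eq. (6)
    M = (H_mask * s.unsqueeze(1)).mean(dim=0)               # (d, n_old), Eq. (7)
    P = torch.softmax(M.sum(dim=0), dim=-1).unsqueeze(-1)   # (n_old, 1), Eq. (8)
    return M, P

M_c, P_c = init_importance_and_projection(torch.randn(1024, 256), torch.randn(256, 16))
```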
4 Experiments
4.1 Experimental Setup
Protocols. In CISS, the whole training process contains $T$ steps, and the task of each step may contain one or more classes. In step $t$, pixels belonging to previous steps are labeled as background. At evaluation time, the model needs to identify all learned classes. There are two settings: Disjoint and Overlapped [5]. The Disjoint setting assumes that in the current step, images do not contain any pixel belonging to classes that will be learned in the future. The Overlapped setting is more realistic, as future classes may appear in images of the current step; we therefore mainly evaluate our method in the Overlapped setting.
Datasets. The Pascal VOC 2012 dataset [18] contains 20 classes, with 10,582 images for training and 1,449 images for validation. The ADE20K dataset [60] contains 150 classes, with 20,210 images for training and 2,000 images for validation. In CISS, a training setting is described as X-Y, where X is the number of classes learned in the initial step and Y is the number of classes learned in each incremental step. We conduct experiments with the 10-1, 15-1, 15-5, and 19-1 settings on Pascal VOC 2012 [18]. For ADE20K [60], we validate our method on the 100-50, 100-10, 100-5, and 50-50 settings.
Implementation Details. Following previous methods [5, 16, 55], we use
DeepLabV3 [8] with ResNet-101 [25] as our segmentation model. We also use
Swin-B [32] as the Transformer-based backbone. Following [5,16], we use random
crop and horizontal flip for data augmentation. We train the model with a batch
size of 24 on 4 GPUs for all experiments. We use SGD as the optimizer in our
experiments. The learning rate is set to 0.02 for the initial step and 0.001 for
incremental steps, both for Pascal VOC 2012 and ADE20K. We also use the poly schedule as the learning rate decay strategy.
Table 1: The mIoU (%) of the last step on the Pascal VOC dataset for four different
CISS scenarios. * implies results from the re-implementation of the official code.
For Pascal VOC 2012, we pre-tune for 5
epochs with a learning rate of 0.001. For ADE20K, pre-tuning lasts 15 epochs,
with a learning rate of 0.1 for experiments involving RCIL and Swin-B, and
0.5 for other experiments. Further details are provided in the supplementary
material.
4.2 Results
In this section, we apply NeST to three classic methods, MiB [5], PLOP [16] and
RCIL [55] with different backbones.
Pascal VOC 2012. We conduct experiments on 10-1, 15-1, 15-5 and 19-1
settings. As shown in Tab. 1, with NeST, MiB [5], PLOP [16], and RCIL [55] all achieve substantial performance gains. On the 15-1 setting with DeepLab, by simply
using the proposed new classifier strategy, we can improve the performance of
MiB, PLOP, and RCIL by 13.6%, 6.9%, and 2.5%, respectively. In the more
difficult 10-1 scenario, NeST can boost the performance of MiB by a large mar-
gin, achieving an improvement of 27.2%. Meanwhile, results with Swin-B show
that NeST is also suitable for Transformer-based models. On MiB with Swin-B,
NeST improves the performance by 34.5% on the 15-1 setting and 36.2% on the 10-1 setting.
Fig. 3: The mIoU (%) at each step on Pascal VOC 2012 for the 15-1 (a) and 10-1 (b) settings.
Table 2: The mIoU (%) of the last step on the ADE20K dataset for four different
CISS scenarios. * implies results from the re-implementation of the official code.
In Tab. 1, the results across all settings for old and new
classes demonstrate that NeST significantly enhances stability by aligning new
classifiers with the existing backbone through the pre-tuning process, thereby
mitigating the forgetting of old knowledge. We also report the mIoU of our
method and the baselines at each step. As shown in Fig. 3, the performance of our method surpasses that of the baselines throughout the whole training process.
ADE20K. To further evaluate NeST, we conduct experiments on the more
challenging settings of the ADE20K dataset. Results of the 100-50, 100-10, 100-5, and 50-50 settings are shown in Tab. 2.
Fig. 4: The feature map similarity (a) and training loss (b) on the Pascal VOC 2012 15-1 overlapped setting.
NeST outperforms the competing methods.
On the most difficult 100-5 setting, NeST improves MiB [5], PLOP [16], and RCIL [55] by 6.8%, 2.8%, and 1.3%, respectively, which indicates that NeST is also applicable to scenarios with a large number of classes.
Effectiveness of Our Method. Pre-tuning new classifiers before the formal training step is crucial, not only for learning new classes but also for bridging the stability gap [15], i.e., the phenomenon that the model undergoes a short period of forgetting when the task changes and then gradually recovers. This phenomenon has also been observed in CISS scenarios [2, 50]. At the beginning of each formal training step, a well-tuned new classifier yields a smaller loss; since the gradient values are smaller, other model parameters are also less affected. As the model's parameters are inherited from the old model, the old knowledge learned in previous steps can thus be preserved. We conduct experiments on the 15-1 setting with the baseline MiB [5] and MiB+NeST. For a fair and clear comparison, we plot the per-epoch mean and standard deviation of the loss and the cosine feature similarity during the formal training of the second step; in the first step, the original MiB is used for all experiments. As
shown in Fig. 4a, the feature similarity of NeST exceeds that of MiB throughout
the entire recovery process, demonstrating the ability of NeST to help bridge the
stability gap in the model. Fig. 4b shows that during the formal training process,
the loss of NeST is lower than the loss of MiB, helping the model converge faster
and learn better.
Table 5: Component level ablation study on Pascal VOC 2012 15-1 overlapped setting.
while $n_{new}$ is always fixed. The additional FLOPs (about 15.6K) and parameters (about 5.4K) are negligible during the pre-tuning process of Step 5 on
the 15-1 setting. The whole training process of MiB+NeST takes an additional
2.7% (3.7 minutes) of the total training time (2.3 hours). Note that after pre-
tuning, extra parameters are removed and the formal training process is the
same as MiB.
Discussion on AWT vs. NeST. AWT [21] utilizes the new training data and the old background classifier to generate new classifiers. Here we discuss the differences between AWT and NeST. AWT employs gradient-based attribution to transfer the most relevant weights to new classifiers. However, it treats each new class equally, neglecting the role of the other old classifiers and the differences between new classes. Meanwhile, the gradient-based attribution technique introduces a huge memory cost: AWT cannot run on an RTX 3090 even with a batch size of 1. NeST is learning-based, which better aligns the generated classifier weights with the backbone. Moreover, as shown in Tab. 6, the memory cost is acceptable, and NeST is slightly faster than AWT.
Table 6: The memory costs and additional training time on Pascal VOC 2012 15-1
overlapped setting.
Table 7: The average performance of five different class orders on Pascal VOC 2012
15-1 overlapped setting.
Method A B C D E avg
MiB 38.2 28.2 38.5 38.0 52.0 39.0
MiB+NeST (Ours) 51.8 43.4 50.4 60.0 59.8 51.5
4.4 Visualization
In Fig. 5, we show the visualization results of our proposed method based on
MiB [5]. Samples are selected from step 0 and from step 5 to show the stabil-
ity and plasticity of our NeST. In the first two rows, classes person and horse
are learned in step 0; in the following steps, the baseline MiB [5] gradually forgets the concepts learned in step 0, while NeST helps the model preserve old knowledge. In the last two rows, class chair is learned in step 0 and class tv/monitor is learned in step 5. After learning class train, MiB has almost forgotten the knowledge of class chair, while NeST gives correct predictions for most of the pixels. In the last step, NeST helps the model learn the new class tv/monitor better than MiB, showing the plasticity of our method.
5 Conclusions
In this work, we propose a simple yet effective new classifier pre-tuning method that enhances CISS by learning a transformation from old classifiers to new classifiers. We further propose a way to initialize the transformation matrices by utilizing cross-task class similarities between old and new classes, helping the model achieve a better stability-plasticity trade-off. Experiments on two datasets show that NeST significantly improves the performance of baselines, and it can be easily applied to other CISS methods. While our approach leads to large performance gains, a limitation is that it introduces additional computational overhead. In future work, we will explore our method in other continual learning tasks.
Acknowledgments
This work is funded by NSFC (NO. 62206135), Key Program for International
Cooperation of Ministry of Science and Technology, China (NO. 2024YFE0100700),
Young Elite Scientists Sponsorship Program by CAST (NO. 2023QNRC001),
Tianjin Natural Science Foundation (NO. 23JCQNJC01470), and the Funda-
mental Research Funds for the Central Universities (Nankai University). Com-
putation is supported by the Supercomputing Center of Nankai University.
References
1. Aljundi, R., Babiloni, F., Elhoseiny, M., Rohrbach, M., Tuytelaars, T.: Memory
aware synapses: Learning what (not) to forget. In: Eur. Conf. Comput. Vis. pp.
139–154 (2018)
2. Baek, D., Oh, Y., Lee, S., Lee, J., Ham, B.: Decomposed knowledge distillation for
class-incremental semantic segmentation. In: Adv. Neural Inform. Process. Syst.
(2022)
3. Bang, J., Kim, H., Yoo, Y., Ha, J.W., Choi, J.: Rainbow memory: Continual learn-
ing with a memory of diverse samples. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 8218–8227 (2021)
4. Cermelli, F., Cord, M., Douillard, A.: Comformer: Continual learning in semantic
and panoptic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3010–
3020 (2023)
5. Cermelli, F., Mancini, M., Bulo, S.R., Ricci, E., Caputo, B.: Modeling the back-
ground for incremental learning in semantic segmentation. In: IEEE Conf. Comput.
Vis. Pattern Recog. pp. 9233–9242 (2020)
6. Cha, S., Yoo, Y., Moon, T., et al.: Ssul: Semantic segmentation with unknown label
for exemplar-based class-incremental learning. In: Adv. Neural Inform. Process.
Syst. vol. 34, pp. 10919–10930 (2021)
7. Chaudhry, A., Dokania, P.K., Ajanthan, T., Torr, P.H.: Riemannian walk for in-
cremental learning: Understanding forgetting and intransigence. In: Eur. Conf.
Comput. Vis. pp. 532–547 (2018)
8. Chen, L.C., Papandreou, G., Kokkinos, I., Murphy, K., Yuille, A.L.: Deeplab: Se-
mantic image segmentation with deep convolutional nets, atrous convolution, and
fully connected crfs. IEEE Trans. Pattern Anal. Mach. Intell. 40(4), 834–848 (2017)
9. Chen, L.C., Zhu, Y., Papandreou, G., Schroff, F., Adam, H.: Encoder-decoder
with atrous separable convolution for semantic image segmentation. In: Eur. Conf.
Comput. Vis. pp. 801–818 (2018)
10. Chen, Y.H., Chen, W.Y., Chen, Y.T., Tsai, B.C., Frank Wang, Y.C., Sun, M.: No
more discrimination: Cross city adaptation of road scene segmenters. In: Int. Conf.
Comput. Vis. pp. 1992–2001 (2017)
11. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention
mask transformer for universal image segmentation. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 1290–1299 (2022)
12. Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention
mask transformer for universal image segmentation. In: IEEE Conf. Comput. Vis.
Pattern Recog. pp. 1290–1299 (2022)
13. Cheng, B., Schwing, A., Kirillov, A.: Per-pixel classification is not all you need for
semantic segmentation. In: Adv. Neural Inform. Process. Syst. vol. 34, pp. 17864–
17875 (2021)
14. Cong, W., Cong, Y., Dong, J., Sun, G., Ding, H.: Gradient-semantic compensation
for incremental semantic segmentation. IEEE Trans. Multimedia (2023)
15. De Lange, M., van de Ven, G., Tuytelaars, T.: Continual evaluation for lifelong
learning: Identifying the stability gap. arXiv preprint arXiv:2205.13452 (2022)
16. Douillard, A., Chen, Y., Dapogny, A., Cord, M.: Plop: Learning without forgetting
for continual semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog.
(2021)
17. Douillard, A., Cord, M., Ollion, C., Robert, T., Valle, E.: Podnet: Pooled outputs
distillation for small-tasks incremental learning. In: IEEE Conf. Comput. Vis. Pat-
tern Recog. pp. 86–102. Springer (2020)
18. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.:
The PASCAL Visual Object Classes Challenge 2012 (VOC2012) Results.
http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html
(2012)
19. Gao, S.H., Cheng, M.M., Zhao, K., Zhang, X.Y., Yang, M.H., Torr, P.: Res2net:
A new multi-scale backbone architecture. IEEE Trans. Pattern Anal. Mach. Intell.
43(2), 652–662 (2019)
20. Geng, C., Huang, S.j., Chen, S.: Recent advances in open set recognition: A survey.
IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3614–3631 (2020)
21. Goswami, D., Schuster, R., van de Weijer, J., Stricker, D.: Attribution-aware weight
transfer: A warm-start initialization for class-incremental semantic segmentation.
In: WACV. pp. 3195–3204 (2023)
22. Grossberg, S.T.: Studies of mind and brain: Neural principles of learning, per-
ception, development, cognition, and motor control, vol. 70. Springer Science &
Business Media (2012)
23. Guo, M.H., Lu, C.Z., Hou, Q., Liu, Z., Cheng, M.M., Hu, S.M.: Segnext: Rethinking
convolutional attention design for semantic segmentation. In: Adv. Neural Inform.
Process. Syst. vol. 35, pp. 1140–1156 (2022)
24. Hadsell, R., Rao, D., Rusu, A.A., Pascanu, R.: Embracing change: Continual learn-
ing in deep neural networks. Trends in cognitive sciences 24(12), 1028–1040 (2020)
25. He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition.
In: IEEE Conf. Comput. Vis. Pattern Recog. (2016)
26. Hu, Z., Li, Y., Lyu, J., Gao, D., Vasconcelos, N.: Dense network expansion for class
incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 11858–11867
(2023)
27. Hung, C.Y., Tu, C.H., Wu, C.E., Chen, C.H., Chan, Y.M., Chen, C.S.: Compacting,
picking and growing for unforgetting continual learning. In: Adv. Neural Inform.
Process. Syst. vol. 32 (2019)
28. Kim, D., Han, B.: On the stability-plasticity dilemma of class-incremental learning.
In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 20196–20204 (2023)
29. Kirkpatrick, J., Pascanu, R., Rabinowitz, N., Veness, J., Desjardins, G., Rusu,
A.A., Milan, K., Quan, J., Ramalho, T., Grabska-Barwinska, A., et al.: Overcoming
catastrophic forgetting in neural networks. Proceedings of the national academy of
sciences 114(13), 3521–3526 (2017)
30. Li, Z., Hoiem, D.: Learning without forgetting. IEEE Trans. Pattern Anal. Mach.
Intell. 40(12), 2935–2947 (2017)
31. Lin, Z., Wang, Z., Zhang, Y.: Continual semantic segmentation via structure pre-
serving and projected feature alignment. In: Eur. Conf. Comput. Vis. pp. 345–361.
Springer (2022)
32. Liu, Z., Lin, Y., Cao, Y., Hu, H., Wei, Y., Zhang, Z., Lin, S., Guo, B.: Swin
transformer: Hierarchical vision transformer using shifted windows. In: Int. Conf.
Comput. Vis. pp. 10012–10022 (2021)
33. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3431–3440 (2015)
34. Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic
segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3431–3440 (2015)
35. Maracani, A., Michieli, U., Toldo, M., Zanuttigh, P.: Recall: Replay-based continual
learning in semantic segmentation. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 7026–7035 (2021)
36. McCloskey, M., Cohen, N.J.: Catastrophic interference in connectionist networks:
The sequential learning problem. In: Psychology of learning and motivation, vol. 24,
pp. 109–165. Elsevier (1989)
37. Michieli, U., Zanuttigh, P.: Incremental learning techniques for semantic segmen-
tation. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 0–0 (2019)
38. Michieli, U., Zanuttigh, P.: Continual semantic segmentation via repulsion-
attraction of sparse and disentangled latent representations. In: IEEE Conf. Com-
put. Vis. Pattern Recog. pp. 1114–1124 (2021)
39. Oh, Y., Baek, D., Ham, B.: Alife: Adaptive logit regularizer and feature replay for
incremental semantic segmentation. In: Adv. Neural Inform. Process. Syst. vol. 35,
pp. 14516–14528 (2022)
40. Ranftl, R., Bochkovskiy, A., Koltun, V.: Vision transformers for dense prediction.
In: Int. Conf. Comput. Vis. pp. 12179–12188 (2021)
41. Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomed-
ical image segmentation. In: MICCAI. pp. 234–241. Springer (2015)
42. Shang, C., Li, H., Meng, F., Wu, Q., Qiu, H., Wang, L.: Incrementer: Transformer
for class-incremental semantic segmentation with knowledge distillation focusing
on old class. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 7214–7224 (2023)
43. Shin, H., Lee, J.K., Kim, J., Kim, J.: Continual learning with deep generative
replay. In: Adv. Neural Inform. Process. Syst. vol. 30 (2017)
44. Strudel, R., Garcia, R., Laptev, I., Schmid, C.: Segmenter: Transformer for seman-
tic segmentation. In: Int. Conf. Comput. Vis. pp. 7262–7272 (2021)
45. Verwimp, E., De Lange, M., Tuytelaars, T.: Rehearsal revealed: The limits and
merits of revisiting samples in continual learning. In: Int. Conf. Comput. Vis. pp.
9385–9394 (2021)
46. Vinogradova, K., Dibrov, A., Myers, G.: Towards interpretable semantic segmenta-
tion via gradient-weighted class activation mapping (student abstract). In: AAAI.
pp. 13943–13944 (2020)
47. Wang, E., Peng, Z., Xie, Z., Yang, F., Liu, X., Cheng, M.M.: Unlocking the multi-
modal potential of clip for generalized category discovery (2024), https://arxiv.
org/abs/2403.09974
48. Wang, J., Sun, K., Cheng, T., Jiang, B., Deng, C., Zhao, Y., Liu, D., Mu, Y.,
Tan, M., Wang, X., et al.: Deep high-resolution representation learning for visual
recognition. IEEE Trans. Pattern Anal. Mach. Intell. 43(10), 3349–3364 (2020)
49. Wu, Y., Chen, Y., Wang, L., Ye, Y., Liu, Z., Guo, Y., Fu, Y.: Large scale incre-
mental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 374–382 (2019)
50. Xiao, J.W., Zhang, C.B., Feng, J., Liu, X., van de Weijer, J., Cheng, M.M.: End-
points weight fusion for class incremental semantic segmentation. In: IEEE Conf.
Comput. Vis. Pattern Recog. pp. 7204–7213 (2023)
51. Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: Segformer:
Simple and efficient design for semantic segmentation with transformers. Adv. Neu-
ral Inform. Process. Syst. 34, 12077–12090 (2021)
52. Yan, S., Xie, J., He, X.: Der: Dynamically expandable representation for class
incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 3014–3023
(2021)
53. Yang, G., Fini, E., Xu, D., Rota, P., Ding, M., Nabi, M., Alameda-Pineda, X.,
Ricci, E.: Uncertainty-aware contrastive distillation for incremental semantic seg-
mentation. IEEE Trans. Pattern Anal. Mach. Intell. 45(2), 2567–2581 (2022)
54. Yoon, J., Yang, E., Lee, J., Hwang, S.J.: Lifelong learning with dynamically ex-
pandable networks. arXiv preprint arXiv:1708.01547 (2017)
55. Zhang, C.B., Xiao, J.W., Liu, X., Chen, Y.C., Cheng, M.M.: Representation com-
pensation networks for continual semantic segmentation. In: IEEE Conf. Comput.
Vis. Pattern Recog. pp. 7053–7064 (2022)
56. Zhang, Z., Gao, G., Jiao, J., Liu, C.H., Wei, Y.: Coinseg: Contrast inter-and intra-
class representations for incremental segmentation. In: Int. Conf. Comput. Vis. pp.
843–853 (2023)
57. Zhang, Z., Zhang, X., Peng, C., Xue, X., Sun, J.: Exfuse: Enhancing feature fusion
for semantic segmentation. In: Eur. Conf. Comput. Vis. pp. 269–284 (2018)
58. Zhao, B., Xiao, X., Gan, G., Zhang, B., Xia, S.T.: Maintaining discrimination and
fairness in class incremental learning. In: IEEE Conf. Comput. Vis. Pattern Recog.
pp. 13208–13217 (2020)
59. Zheng, S., Lu, J., Zhao, H., Zhu, X., Luo, Z., Wang, Y., Fu, Y., Feng, J., Xiang, T.,
Torr, P.H., et al.: Rethinking semantic segmentation from a sequence-to-sequence
perspective with transformers. In: IEEE Conf. Comput. Vis. Pattern Recog. pp.
6881–6890 (2021)
60. Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing
through ade20k dataset. In: IEEE Conf. Comput. Vis. Pattern Recog. pp. 633–641
(2017)
61. Zhu, F., Zhang, X.Y., Wang, C., Yin, F., Liu, C.L.: Prototype augmentation and
self-supervision for incremental learning. In: IEEE Conf. Comput. Vis. Pattern
Recog. pp. 5871–5880 (2021)
A Baseline Details
In this section, we introduce the baselines used in experiments.
MiB. In MiB [5], two kinds of losses are used to model the background, i.e., $\mathcal{L}_{unce}$ and $\mathcal{L}_{unkd}$:
\begin{aligned} \label{eqn:unce_supp} &\mathcal{L}_{unce}=-\frac{1}{\left| \mathcal{I} \right|} \sum_{i\in \mathcal{I}}\log \tilde{q}_x^t(i, y_i)\,, \\ &\mathcal{L}_{unkd}=-\frac{1}{\left|\mathcal{I}\right|}\sum_{i\in \mathcal{I}}\sum_{c\in \bigcup_{j=1}^{t-1}\mathcal{C}_j} q_{x}^{t-1}(i, c)\log \hat{q}_x^t(i, c)\,, \end{aligned} \qquad (9)
where $\mathcal{I}$ denotes the pixel set of an image, $y_i \in \{c_{bg}\} \cup \mathcal{C}_t$ denotes the ground-truth label of pixel $i$, $q_x^t$ denotes the output of the model at step $t$, and $\tilde{q}_x^t$ and $\hat{q}_x^t$ denote the modified outputs of the current model, which account for the old classes in the cross-entropy loss and for the new classes in the knowledge distillation loss, respectively.
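For illustration, a simplified sketch of the unbiased cross-entropy term is given below. It assumes channel 0 is background, channels 1..n_old are old classes, and the remaining channels are the current classes; the function name is ours, not MiB's code.

```python
import torch
import torch.nn.functional as F

def unbiased_ce(logits, target, n_old):
    """logits: (B, C, H, W) over [bg, old classes, new classes];
    target: (B, H, W) with values in {0} ∪ C_t (old classes appear as 0)."""
    log_prob = F.log_softmax(logits, dim=1)
    # Fold background and old-class probabilities into a single background entry,
    # since old-class pixels are labeled as background at the current step.
    bg = torch.logsumexp(log_prob[:, : n_old + 1], dim=1, keepdim=True)
    log_prob = torch.cat([bg, log_prob[:, n_old + 1:]], dim=1)  # (B, 1 + n_new, H, W)
    # Remap new-class labels c -> c - n_old so they index the folded tensor.
    remapped = torch.where(target > 0, target - n_old, target)
    return F.nll_loss(log_prob, remapped)
```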
PLOP. Different from MiB [5], PLOP [16] utilizes pseudo-labeling to address
the issue of background shift, as follows:
\begin {aligned} \mathcal {L}_{pseudo}=-\frac {\nu }{WH}\sum _{w,h}^{W, H}\sum _{c\in \mathcal {C}_t}\tilde {S}(w,h,c)\log \hat {S}^t(w, h, c)\,, \end {aligned} (10)
where Ŝ denotes the prediction of the model and S̃ denotes the pseudo-labels
generated by the old model in the previous step.
It also distills intermediate features by Local POD, as follows:
\begin {aligned} \mathcal {L}_{LocalPod}=\frac {1}{L}\sum _{l=1}^L\left |\left | \Phi (f_l^t(I))-\Phi (f_l^{t-1}(I)) \right |\right |\,, \end {aligned} (11)
where $L$ denotes the number of layers, $\Phi$ denotes the Local POD embedding extraction operation, and $f_l^t(I)$ denotes the output feature of layer $l$ for input $I$.
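As a rough illustration of the POD-style distillation in Eq. (11), the sketch below uses a single-scale, global variant: features are pooled along width and height, concatenated, and matched with an L2 distance. The real Local POD additionally operates on local windows at multiple scales, so this shows only the core idea, not PLOP's implementation.

```python
import torch

def pod_embedding(feat):
    """feat: (B, C, H, W) -> concatenation of width- and height-pooled slices."""
    return torch.cat([feat.mean(dim=3), feat.mean(dim=2)], dim=-1)  # (B, C, H + W)

def pod_distillation(feats_new, feats_old):
    """Average L2 distance between pooled embeddings over a list of layers."""
    losses = [torch.norm(pod_embedding(fn) - pod_embedding(fo), p=2)
              for fn, fo in zip(feats_new, feats_old)]
    return torch.stack(losses).mean()
```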
RCIL. RCIL [55] decouples the remembering of old knowledge and the learning of new knowledge by adding, for each 3 × 3 convolution module, a parallel module composed of a convolution layer and a normalization layer. At step 0, all parameters are trainable. At the beginning of each incremental step, the two branches of the old model are fused into one frozen branch to memorize the old knowledge, while the other branch remains learnable. A drop path strategy is also used when fusing the outputs of the two branches, which can be denoted as:
\begin {aligned} x_{out} = \eta \cdot x_1 + (1-\eta )\cdot x_2\,, \end {aligned} (12)
where $x_{out}$ denotes the fused output, $x_1$ and $x_2$ denote the outputs of the two branches, and $\eta$ denotes a channel-wise weight vector. During training, $\eta$ is sampled from the set {0, 0.5, 1}, and for evaluation $\eta$ is set to 0.5. RCIL also proposes a Pooled Cube Knowledge Distillation, which applies average pooling over the spatial and channel dimensions.
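A minimal sketch of this two-branch fusion with channel-wise drop path (Eq. (12)) is shown below; the branch modules are placeholders and the class name is ours, not RCIL's code.

```python
import torch
import torch.nn as nn

class FusedBranches(nn.Module):
    def __init__(self, frozen_branch, trainable_branch):
        super().__init__()
        self.frozen, self.trainable = frozen_branch, trainable_branch
        for p in self.frozen.parameters():
            p.requires_grad_(False)  # the old-knowledge branch stays fixed

    def forward(self, x):
        x1, x2 = self.frozen(x), self.trainable(x)
        if self.training:
            # Channel-wise eta drawn from {0, 0.5, 1} during training.
            choices = torch.tensor([0.0, 0.5, 1.0], device=x.device)
            idx = torch.randint(0, 3, (1, x1.shape[1], 1, 1), device=x.device)
            eta = choices[idx]
        else:
            eta = 0.5  # fixed fusion weight at evaluation
        return eta * x1 + (1 - eta) * x2

fused = FusedBranches(nn.Conv2d(64, 64, 3, padding=1), nn.Conv2d(64, 64, 3, padding=1))
out = fused(torch.randn(2, 64, 32, 32))
```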
Weight Align. To prevent the new classifiers' weights from becoming too large during the pre-tuning process, we apply Weight Aligning (WA) [58] as follows:
\begin {aligned} \hat {w}_{new} = w_{new} \cdot \frac {Mean(Norm_{old})}{Mean(Norm_{new})} \end {aligned} (13)
where $\mathrm{Norm}_{old}$ and $\mathrm{Norm}_{new}$ denote the norms of the old and new classifiers' weights, and $\mathrm{Mean}(\cdot)$ denotes taking the mean value. Relevant experimental results in Tab. 11 show that WA corrects the biased weights and thus boosts the performance of NeST.
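A short sketch of this rescaling (Eq. (13)) is given below, assuming classifier weights are stored row-wise; the function name is illustrative.

```python
import torch

def weight_align(w_old, w_new):
    """w_old: (n_old, d) and w_new: (n_new, d) classifier weight rows.
    Rescales new weights so their mean norm matches that of the old ones."""
    gamma = w_old.norm(dim=1).mean() / w_new.norm(dim=1).mean()
    return w_new * gamma
```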
Table 11: Ablation study of Weight Align for NeST. All performances are reported
on the 15-1 setting.
Fix old classifiers. We find that the pseudo-labeling strategy may severely change the geometric structure of the old classifiers, which has a detrimental impact on our method. This phenomenon is particularly obvious on the Pascal VOC 2012 dataset. To preserve the old knowledge learned in previous steps, following EWF [50], we fix the old classifiers during the formal training steps of the Pascal VOC 2012 settings. Relevant experimental results are shown in Tab. 12.
Table 12: Ablation study of fixing previous classifiers for our method based on
PLOP [16]. All performances are reported on the 15-1 setting.
F Further Analysis
Fig. 7: More qualitative results. All experiments are conducted on the 15-1 setting.
The five class orders (A-E) used in Tab. 7 are defined on the Pascal VOC 2012 15-1 setting, as follows:
\begin{aligned} \label{eqn:class_order} A&:[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20]\,,\\ B&:[0, 12, 9, 20, 7, 15, 8, 14, 16, 5, 19, 4, 1, 13, 2, 11, 17, 3, 6, 18, 10]\,,\\ C&:[0, 13, 19, 15, 17, 9, 8, 5, 20, 4, 3, 10, 11, 18, 16, 7, 12, 14, 6, 1, 2]\,,\\ D&:[0, 15, 3, 2, 12, 14, 18, 20, 16, 11, 1, 19, 8, 10, 7, 17, 6, 5, 13, 9, 4]\,,\\ E&:[0, 7, 5, 3, 9, 13, 12, 14, 19, 10, 2, 1, 4, 16, 8, 17, 15, 18, 6, 11, 20]\,. \end{aligned} \qquad (14)
Fig. 8: More qualitative results. All experiments are conducted on the 15-1 setting.
Fig. 9: Class activation maps for the last class tv/monitor on the 15-1 setting without (top row) and with (bottom row) our matrix initialization strategy.