
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

Continual Federated Learning Based on Knowledge Distillation

Yuhang Ma1, Zhongle Xie1, Jue Wang1, Ke Chen1 and Lidan Shou1,2
1 College of Computer Science and Technology, Zhejiang University
2 State Key Laboratory of CAD&CG, Zhejiang University
{myh0032, xiezl, zjuwangjue, chenk, should}@zju.edu.cn

Abstract

Federated learning (FL) is a promising approach for learning a shared global model on decentralized data owned by multiple clients without exposing their privacy. In real-world scenarios, data accumulated at the client-side varies in distribution over time. As a consequence, the global model tends to forget the knowledge obtained from previous tasks while learning new tasks, showing signs of “catastrophic forgetting”. Previous studies in centralized learning use techniques such as data replay and parameter regularization to mitigate catastrophic forgetting. Unfortunately, these techniques cannot adequately solve the non-trivial problem in FL. We propose Continual Federated Learning with Distillation (CFeD) to address catastrophic forgetting under FL. CFeD performs knowledge distillation on both the clients and the server, with each party independently having an unlabeled surrogate dataset, to mitigate forgetting. Moreover, CFeD assigns different learning objectives, namely learning the new task and reviewing old tasks, to different clients, aiming to improve the learning ability of the model. The results show that our method performs well in mitigating catastrophic forgetting and achieves a good trade-off between the two objectives.

Figure 1: (a) An overview of an FL system to predict the usage habits on mobile devices. (b) Distribution of screen time in two weeks for a specific user. (c) Local update suffers catastrophic forgetting while learning from new data.

1 Introduction
Federated Learning (FL) [McMahan et al., 2017] is proposed as a solution to learn a shared model using decentralized data owned by multiple clients without disclosing their private data. Figure 1 illustrates an example of an FL task to infer the usage habits of mobile device users. The sensitive data, namely a stream of the temporal histogram of screen time, is collected on mobile devices (clients) and used to train a local model. Meanwhile, a global model is built on a central server, which leverages the local models submitted by the clients without accessing any client's private data. During each round of training, the server first broadcasts the global model to the clients. Then, each client independently uses its local data to update the model and submits the model update to the server. Finally, the server aggregates these updates to produce a new global model for the next round.

Although FL can protect data privacy well, its performance is at risk in practice due to the following issues. First, clients participating in one round of training may become unavailable in the next round due to network failure, causing variation in the training data distribution between consecutive rounds. Second, the data accumulated at the client-side may vary over time, and can even be considered as a new task with a different data distribution or different labels, which poses significant challenges to the adaptability of the model. Furthermore, due to the inaccessibility of the raw data, minimizing the loss on the new task may increase the loss on old tasks. These issues all lead to underperformance of the global model, a phenomenon known as “catastrophic forgetting”.


Specifically, catastrophic forgetting in FL systems is observed in two main categories, namely intra-task forgetting and inter-task forgetting. (1) Intra-task forgetting occurs when two different subsets of clients are involved in two consecutive rounds. In Fig. 1(a), since a client does not participate in a training round, the new global model may forget knowledge obtained from this client in previous rounds and thus performs poorly on its local data. (2) Inter-task forgetting occurs when clients accumulate new data with different domains or different labels. As shown in Fig. 1(c), due to the different distribution of the screen time data in week 2, the performance of the new global model on week 1 data is degraded. It should be noted that the Non-IID issue brings more challenges to both kinds of forgetting.

Catastrophic forgetting in FL is a non-trivial problem because the conventional approaches to catastrophic forgetting, namely Continual Learning (CL), cannot be easily applied in FL due to privacy and resource constraints. Some recent attempts on this topic, such as [Shoham et al., 2019; Usmanova et al., 2021], do not solve the problem adequately since they are designed to address either intra-task forgetting or inter-task forgetting.

In this paper, we propose a framework called Continual Federated Learning with Distillation (CFeD) to mitigate catastrophic forgetting in both the intra-task and inter-task categories when learning new tasks. Specifically, CFeD leverages knowledge distillation in two ways based on unlabeled public datasets called the surrogate datasets. First, while learning the new task, each client transfers the knowledge of old tasks from the model converged on the last task into the new model via its own surrogate dataset to mitigate inter-task forgetting. Meanwhile, CFeD assigns the two objectives to different clients to improve the performance, called the clients division mechanism. Second, the server also maintains another independent surrogate dataset to fine-tune the aggregated model in each round by distilling the knowledge learned in the current and last rounds into the new aggregated one, called server distillation.

The main contributions of this paper are as follows:

• We extend continual learning to the federated learning scenario and define Continual Federated Learning (CFL) to address catastrophic forgetting when learning a series of tasks. (Section 3)

• We propose a CFL framework called CFeD, which employs knowledge distillation based on surrogate datasets to mitigate catastrophic forgetting both at the server-side and client-side. In each round, the inter-task forgetting is mitigated by assigning clients to learning the new task or to reviewing the old tasks. The intra-task forgetting is mitigated by applying a distillation scheme at the server-side. (Section 4)

• We evaluate two scenarios of CFL by varying the data distribution and adding categories on text and image classification datasets. CFeD outperforms existing FL methods in overcoming forgetting without sacrificing the ability to learn new tasks. (Section 5)

2 Related Work

2.1 Continual Learning
Continual Learning (CL) aims to solve the stability-plasticity dilemma when learning a sequence of tasks [Delange et al., 2021]. Models with strong stability forget little but perform poorly on the new task. In contrast, models with better plasticity can adapt quickly to the new task but tend to forget old ones. Existing CL methods can generally be divided into three categories: data replay, parameter separation and parameter regularization.

The core idea of data replay methods [Chaudhry et al., 2018] is to store and replay raw data from old tasks to mitigate forgetting. However, storing old data directly violates privacy and storage restrictions. Parameter separation methods [Hung et al., 2019] overcome catastrophic forgetting by assigning different subsets of the model parameters to deal with each task. However, separation methods result in an unbounded increase of parameters with the arrival of new tasks, which can quickly become unacceptable in FL.

The regularization-based methods limit the updating process by punishing the updates on important parameters [Kirkpatrick et al., 2017] or by adding a knowledge distillation [Hinton et al., 2015] loss to the objective function [Li and Hoiem, 2017; Lee et al., 2019; Zhang et al., 2020] to learn the knowledge from the old model. However, the parameter importance is difficult to evaluate precisely. Some distillation-based methods perform distillation based on the new task data, but their efficacy drops significantly when domains vary greatly. Other methods that leverage unlabeled external data address these difficulties. CFeD adopts knowledge distillation with public datasets and proposes a clients division mechanism to reduce the cost of time and computation.

2.2 Federated Learning
Recent studies have considered introducing Continual Learning into FL to improve the performance of models under Non-IID data [Shoham et al., 2019]. [Yoon et al., 2021] proposed Federated Continual Learning and focused on multiple continual learning agents that use each other's indirect experience to enhance the continual learning performance of their local models, rather than jointly training a better global model. Therefore, the purpose of their study is to obtain a collection of local models for the participating clients. While our work looks literally similar, our research problem is very different, as we aim at learning a better global model. Thus we decide not to compare with it in our experimental study.

Based on FedAvg, [Jeong et al., 2018] proposed a Federated Distillation framework to enhance communication efficiency without compromising performance. There are also several works leveraging additional datasets constructed from public datasets [Li and Wang, 2019; Lin et al., 2020] or original local data [Itahara et al., 2020] to aid distillation. Compared with the above works, our proposed method to mitigate catastrophic forgetting is orthogonal and could work together with them.

3 Problem Definition
In FL, a central server and K clients cooperate to train a model on task T through R rounds of communication. The optimization objective can be written as follows:

\[
\min_{\theta} \sum_{k=1}^{K} \frac{n_k}{n_a}\, L(T^k; \theta) \tag{1}
\]

where T^k refers to the collection of n_k training samples at the k-th client and n_a is the sum of all n_k.


Here we define Continual Federated Learning (CFL), which aims to train a global model via iterative communication between the server and clients on a series of tasks {T_1, T_2, T_3, ...} accumulated at the client-side.

When the t-th task arrives, data of previous tasks becomes unavailable for training. We define the global model parameters obtained from the previous task as θ_{t−1}, and the new task as T_t = ∪_{k=1}^{K} T_t^k, where T_t^k contains the newly collected data at each client.

The goal is to train a global model with a minimized loss on the new task T_t as well as the old tasks {T_1, ..., T_{t−1}}. The optimization objective is achieved by minimizing the losses of the K clients on all local tasks up to time t through iterative server-client communication. The global model parameters can be obtained as follows:

\[
\theta_t = \arg\min_{\theta} \sum_{k=1}^{K} \frac{n_k}{n_a} \sum_{i=1}^{t} L(T_i^k; \theta) \tag{2}
\]

The global model at task T_t is expected to achieve no higher loss on historical tasks than that at time t − 1, that is, ∑_{i=1}^{t−1} L(T_i; θ_t) ≤ ∑_{i=1}^{t−1} L(T_i; θ_{t−1}).

However, in real-world scenarios, due to the limitation on accessing previous data, CFL suffers catastrophic forgetting at the “intra-task” and “inter-task” levels. Formally, intra-task forgetting means that after the updates of the r-th round, the global model gets a higher loss than at the (r−1)-th round on the local dataset of the k-th client: L(T_t^k; θ_{t,r}) > L(T_t^k; θ_{t,r−1}), especially when the distribution across clients is Non-IID. Inter-task forgetting means that the loss of the model at time t on the old tasks is higher than that at time t − 1: ∑_{i=1}^{t−1} L(T_i^k; θ_t) > ∑_{i=1}^{t−1} L(T_i^k; θ_{t−1}).
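In practice, objectives of the form of Eqs. (1)-(2) are typically optimized with FedAvg-style rounds, the setting CFeD builds on: each client minimizes its local loss and the server averages the returned parameters with weights n_k / n_a. The snippet below is a minimal sketch of that weighted-averaging step, assuming PyTorch state dicts; the function name and interface are illustrative rather than taken from the authors' released code.

```python
import torch

def aggregate(client_states, client_sizes):
    """Weighted average of client parameters with weights n_k / n_a (cf. Eqs. (1)-(2))."""
    n_a = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        global_state[name] = sum(
            (n_k / n_a) * state[name].float()
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```
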
4 Proposed Method
To tackle catastrophic forgetting in FL, we propose a framework named CFeD (Continual Federated learning with Distillation). As shown in Figure 2, the core idea is to use the model of the last task to predict the surrogate dataset, and to treat the outputs as pseudo-labels for knowledge distillation so as to review the knowledge of unavailable data. To improve the learning ability of the global model and fully utilize computation resources, learning the new task and reviewing old tasks can be assigned to different clients, and clients without enough new-task data can simply review the old tasks. Moreover, a server distillation mechanism is proposed to mitigate the intra-task forgetting in Non-IID data. The aggregated global model is fine-tuned to mimic the outputs of the global model of the last round and the local models of the current round.

The surrogate dataset should be collected from public datasets for privacy, and should cover as many features as possible or be similar to the old tasks to ensure the effectiveness of distillation. Since the model parameters do not increase, there is no additional communication cost compared with FedAvg.

Figure 2: Continual federated learning with knowledge distillation.

4.1 Clients Distillation
The distillation process of CFeD assumes that there is a surrogate dataset X_s^k at the k-th client. For each sample x_s ∈ X_s^k, its label at time t is y_{s,t} = f(x_s; θ_{t−1}). Thus we obtain a per-client set of sample pairs S_t^k = {(x_s, y_{s,t}), ∀x_s ∈ X_s^k}. A distillation term L_d(S_t^k; θ) is added to approximate the loss on old tasks ∑_{i=1}^{t−1} L(T_i^k; θ). For a specific surrogate sample pair (x_s, y_{s,t}) at time t, the distillation loss can be formalized as a modified version of the cross-entropy loss:

\[
L_d((x_s, y_{s,t}); \theta) = -\sum_{i=1}^{l} y'^{(i)}_{s,t} \log \hat{y}'^{(i)}_{s,t} \tag{3}
\]

where l is the number of target classes, y'^{(i)}_{s,t} is the modified surrogate label, and ŷ'^{(i)}_{s,t} is the modified output of the model on the surrogate sample x_s. The latter two are defined as:

\[
y'^{(i)}_{s,t} = \frac{\big(y^{(i)}_{s,t}\big)^{1/T}}{\sum_{j=1}^{l} \big(y^{(j)}_{s,t}\big)^{1/T}}, \qquad
\hat{y}'^{(i)}_{s,t} = \frac{\big(\hat{y}^{(i)}_{s,t}\big)^{1/T}}{\sum_{j=1}^{l} \big(\hat{y}^{(j)}_{s,t}\big)^{1/T}} \tag{4}
\]

where ŷ^{(i)}_{s,t} is the i-th element of f(x_s; θ), T is the temperature of distillation, and a greater T value amplifies minor logits so as to increase the information provided by the teacher model.
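The following is a minimal PyTorch sketch of the distillation loss in Eqs. (3)-(4). It assumes the teacher's softmax outputs y_{s,t} on the surrogate samples have already been produced by the old model f(·; θ_{t−1}); the helper names are our own, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sharpen(probs, T):
    """Eq. (4): raise each probability to the power 1/T and renormalize."""
    p = probs.pow(1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def distillation_loss(student_logits, teacher_probs, T=2.0):
    """Eq. (3): cross-entropy between the modified teacher and student distributions."""
    y_mod = sharpen(teacher_probs, T)                         # y'_{s,t}
    y_hat_mod = sharpen(F.softmax(student_logits, dim=1), T)  # y-hat'_{s,t}
    return -(y_mod * y_hat_mod.clamp_min(1e-12).log()).sum(dim=1).mean()

# Usage: teacher_probs = F.softmax(old_model(x_s), dim=1).detach(),
# i.e. y_{s,t} = f(x_s; theta_{t-1}) computed once per task.
```
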


Based on the above notions, CFeD computes the total loss on all clients at time t by substituting the unknown losses on old tasks in Equation 2 with the distillation losses on the per-client surrogate datasets. Formally, the total loss is:

\[
\sum_{k=1}^{K} \frac{n_k}{n_a}\Big(L(T_t^k; \theta) + \sum_{i=1}^{t-1} L(T_i^k; \theta)\Big)
\;\propto\;
\sum_{k=1}^{K} \frac{n_k}{n_a}\Big(L(T_t^k; \theta) + \lambda_d L_d(S_t^k; \theta)\Big) \tag{5}
\]

where λ_d (default 0.1) is a pre-defined parameter to weight the distillation loss.

4.2 Clients Division Mechanism
In real-world scenarios, different clients accumulate data at different speeds. While some clients are ready to learn a new task, others may not have gained enough data for the new task and thus cannot be effectively utilized in learning it.

To leverage the under-utilized computation resources, CFeD treats learning the new task and reviewing old tasks as two individual objectives. The framework assigns one of the two objectives to each client so that some clients only perform reviewing. Further, this kind of division may improve the exploration of the model on different objectives and help the model depart from previous local minima. Formally, regarding the division mechanism, Equation 5 can be expanded and rewritten as:

\[
\sum_{k=1}^{N} \frac{n_k}{n_a}\, L_d(S_t^k; \theta) + \sum_{k=N+1}^{K} \frac{n_k}{n_a}\, L(T_t^k; \theta) \tag{6}
\]

where N = ⌈αK⌉, α ∈ [0, 1]. We introduce the factor α to describe the proportion of clients involved in reviewing per round.
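As a rough illustration of Eq. (6), the sketch below divides the K clients sampled in a round into reviewers (the first N = ⌈αK⌉, which optimize the distillation term) and learners (which optimize the new-task loss), reusing `distillation_loss` from the Section 4.1 sketch. The random assignment, batch handling and all names are our own assumptions rather than the paper's implementation.

```python
import math
import random
import torch
import torch.nn.functional as F

def divide_clients(sampled_clients, alpha):
    """Eq. (6): the first N = ceil(alpha * K) clients review old tasks, the rest learn the new task."""
    K = len(sampled_clients)
    N = math.ceil(alpha * K)
    shuffled = random.sample(sampled_clients, K)   # random order, no replacement
    return shuffled[:N], shuffled[N:]              # (reviewers, learners)

def local_step(model, old_model, batch, role, T=2.0):
    """Per-batch local loss: reviewers optimize the distillation term, learners the new-task CE."""
    if role == "review":
        x_s = batch                                        # unlabeled surrogate inputs
        with torch.no_grad():
            teacher_probs = F.softmax(old_model(x_s), dim=1)
        return distillation_loss(model(x_s), teacher_probs, T)
    x_new, y_new = batch                                   # labeled new-task data
    return F.cross_entropy(model(x_new), y_new)
```
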
4.3 Server Distillation
Although the clients division mechanism allows spare computing resources to be utilized, we must note that some active clients are not learning the new task in a given round. This may harm the performance on the new task and lead to severe intra-task forgetting, especially in the Non-IID scenario. The reason is that, since the labels of the data on different clients may not intersect each other, the global model fitting one client (learning the new task) may easily exhibit forgetting on another (reviewing the old tasks, or learning the new task on data of different labels).

To mitigate such performance degradation on the new task, a client may take the naive solution of increasing its local training iterations. However, increasing the number of epochs of local updates on Non-IID data can easily cause overfitting and destabilize the performance of the global model.

Motivated by the paradigm of mini-batch iterative updates in centralized training [Toneva et al., 2018], we propose Server Distillation (SD) to mitigate the intra-task forgetting and stabilize the performance of the aggregated model at the server-side. In our approach, the server also maintains a lightweight, unlabeled public dataset X_s, similar to the surrogate datasets on the clients. After the r-th round of aggregating the K local models collected from the clients, the server divides X_s into K + 1 batches, assigns K of them to the received local models and one to the global model of the last round, and collects the outputs Ŷ_{s,r} of the above-mentioned models on their assigned batches to construct a labeled set of sample pairs for the server distillation, denoted as S_r = {(x_s, y_{s,r}), ∀x_s ∈ X_s}. Next, the aggregated global model θ_r is iteratively updated with the distillation loss L_d(S_r; θ).

With server distillation, the global model is able to further retrieve the knowledge from the local models and the previous global model, thus mitigating intra-task forgetting.
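The sketch below illustrates the server-distillation step just described: the server splits its surrogate inputs into K + 1 batches, uses the K received local models plus the previous global model as teachers on their assigned batches, and fine-tunes the aggregated model with the resulting distillation loss. It reuses `distillation_loss` from the Section 4.1 sketch; the batching, optimizer and names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def server_distillation(agg_model, prev_global, local_models, surrogate_inputs,
                        T=2.0, lr=1e-3, steps=1):
    """Fine-tune the aggregated model to mimic the K local models and the previous
    global model on K + 1 disjoint batches of the server-side surrogate data X_s."""
    teachers = list(local_models) + [prev_global]            # K local models + last global model
    batches = torch.chunk(surrogate_inputs, len(teachers))   # K + 1 batches of X_s
    opt = torch.optim.SGD(agg_model.parameters(), lr=lr)
    for _ in range(steps):
        for teacher, x in zip(teachers, batches):
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(x), dim=1)  # soft labels y_{s,r}
            loss = distillation_loss(agg_model(x), teacher_probs, T)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return agg_model
```
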
5 Experiments
In this section, we evaluate CFeD and baseline approaches extended from traditional continual learning methods on text and image classification tasks.

5.1 Datasets and Tasks
We consider both text and image classification datasets. THUCNews [Li et al., 2006] contains 14 categories of Chinese news data collected from Sina News RSS between 2005 and 2011. SogouCS [SogouLabs, 2012] contains 18 categories of 511218 Chinese news articles from 2012. Sina2019 contains 30533 Chinese news articles from 2019 crawled from Sina News by ourselves. The NLPIR Weibo Corpus [NLPIR, 2017] consists of 230000 samples obtained from Sina Weibo and Tencent Weibo, two Chinese social media sites. We use it as a surrogate dataset across different tasks. CIFAR-10 [Krizhevsky, 2009] contains 60000 images with 10 classes. CIFAR-100 [Krizhevsky, 2009] contains 60000 images with 100 classes. Caltech-256 [Griffin et al., 2007] contains 30608 images with 256 classes and serves as the surrogate dataset in image classification. (Our code and datasets are publicly available at https://github.com/lianziqt/CFeD.)

We design task sequences to be learned in two different scenarios: Domain-IL indicates the case where the input distributions continually vary in the sequence; Class-IL indicates the case where new classes incrementally emerge in the sequence. Tasks on the text datasets are denoted by ‘Tx’ while those on the image datasets are denoted by ‘Ix’, where ‘x’ is the task sequence ID. Most task sequences in the experiments are short, containing two or three classification tasks.

5.2 Compared Methods
We choose the following approaches for evaluation:
(1) Finetuning: A naive method that trains the model on tasks sequentially. (2) FedAvg: An FL method in which each client learns tasks in sequence and the server aggregates the local models from the clients. (3) MultiHead: A CL method training individual classifiers for each task, requiring task labels to specify the output during the inference phase. FedMultiHead denotes FedAvg with MultiHead applied to clients. (4) EWC: A regularization-based method [Kirkpatrick et al., 2017] that uses the Fisher information matrix to estimate the importance of parameters. FedEWC denotes FedAvg with EWC applied to clients.


(5) LwF: A distillation-based method. Instead of unlabeled data, LwF leverages the new task data to perform distillation. FedLwF denotes FedAvg with LwF applied to clients. (6) DMC: Deep Model Consolidation [Zhang et al., 2020], a Class-IL CL method that first trains a separate model only for the new task, and then leverages an unlabeled public dataset to distill the knowledge of the old tasks and the new task into a new combined model. FedDMC denotes FedAvg with DMC applied to clients. (7) CFeD: our method; CFeD(C) denotes the centralized version of our method.

Note that MultiHead and FedMultiHead require task labels during inference to know which task the current input belongs to. Moreover, multiple classifiers inevitably bring more parameters in the Domain-IL scenario. Owing to this additional information, their performance can be seen as a target for the other methods.

5.3 Experimental Settings
We use TextCNN [Kim, 2014] or ResNet-18 [He et al., 2016] followed by fully-connected layers for classification. Each task trains the model for R = 20 rounds. For the local update at each client, the number of learning epochs is 10 in Domain-IL and 40 in Class-IL. Unless otherwise stated, the constraint factor λ of the EWC method is set to 100000. The temperature of distillation is set to 2 by default.

For the configuration of FL, we assume that there are 100 clients, and only a random 10% of the clients are sampled to participate in each training round. The training dataset and the surrogate dataset are both divided into 200 shards, either randomly (IID) or sorted by category (Non-IID). In each experiment, every client selects two shards of data on each task as the local dataset and also two shards of the surrogate dataset as the local surrogate. In particular, the server also selects two shards for server distillation in the Non-IID distribution. All of the above selections are conducted randomly.

For each task, we select 70% of the data as the training set, 10% as the validation set and the rest as the test set. The global model is evaluated on the test set at the end of each training round. All experiments are repeated for 3 runs with different random seeds.
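To make the sharding scheme above concrete, here is a small sketch of how a dataset could be split into 200 shards (random order for IID, sorted by label for Non-IID) with each of the 100 clients drawing two shards. It only illustrates the described setup under our own naming and is not the released preprocessing code.

```python
import numpy as np

def make_shards(labels, num_shards=200, iid=True, seed=0):
    """Split sample indices into shards: random order for IID, sorted by class for Non-IID."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels)) if iid else np.argsort(labels)
    return np.array_split(idx, num_shards)

def assign_shards(shards, num_clients=100, shards_per_client=2, seed=0):
    """Give each client `shards_per_client` distinct shards (200 shards serve 100 clients)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(shards))
    return [np.concatenate([shards[order[c * shards_per_client + j]]
                            for j in range(shards_per_client)])
            for c in range(num_clients)]
```
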

                    Domain-IL scenario       Class-IL scenario
                    Domain-T1  Domain-T2     Class-T1  Class-T2  Class-I1  Class-I2
MultiHead           94.66      93.40         95.84     95.66     51.81     57.04
Finetuning          85.96      91.32         48.00     32.43     36.74     30.58
EWC                 87.42      91.98         47.96     32.42     36.35     30.72
LwF                 90.58      92.01         48.59     39.10     35.31     26.32
DMC                 -          -             48.37     41.92     37.81     27.37
CFeD(C)lr=1e-6      82.48      83.20         71.21     63.45     37.88     28.36
CFeD(C)lr=1e-3      94.49      92.95         62.22     34.26     33.75     22.21
FedMultiHead*       81.83      91.04         96.26     96.07     56.97     60.43
FedAvg              86.50      92.60         48.25     32.61     32.53     29.28
FedEWC              84.76      92.06         48.24     32.51     30.37     28.62
FedLwF              87.35      92.41         59.39     44.76     33.97     23.46
FedDMC              -          -             46.58     10.46     16.50     8.33
FedDMCfull          -          -             56.95     50.87     40.13     29.18
CFeD                92.34      94.15         85.81     83.80     40.51     32.33

Table 1: The average accuracy on learned tasks (%) of different methods (global models for FL). The top 7 rows are centralized methods, while the bottom 7 rows are FL methods. * indicates methods with additional information (task labels). DMC is only suitable for the Class-IL scenario.

5.4 Experimental Results

Effect on Mitigating Inter-task Forgetting
We evaluate all methods in both centralized and FL scenarios. Table 1 summarizes the average accuracy over the tasks learned so far after the model learns the second and third tasks sequentially, both in Domain-IL (left) and Class-IL (right). By comparing the results of the different methods, it can be seen that CFeD exceeds the other baselines on average accuracy.

In the Domain-IL scenario (left half of Table 1), CFeD exceeds the FedAvg, FedLwF, FedEWC and FedDMC methods on average accuracy, being close to FedMultiHead. Moreover, the average accuracy of all methods improves after the model continually learns Domain-T2. This result implies that the new task of Domain-T2 may cover some features of the old tasks, which helps the models review the old knowledge; notably, CFeD still outperforms the other baselines.

In the Class-IL scenario (right half of Table 1), we can see that the average accuracy of FedAvg and FedEWC both drop significantly. The reason is that the labels of the old task are not available, and the model quickly overfits the new task. In contrast, CFeD outperforms the other baselines, indicating the benefit of leveraging the surrogate dataset to get reasonable soft labels for old tasks.


We notice that the performance of FedDMC drops significantly in Class-IL. Since the model consolidation of DMC only uses surrogate data for distillation, i.e., no new task data, to learn new tasks and review old tasks, its performance is significantly limited by the surrogate dataset size (in our case, each client only has 2300 surrogate samples). To verify this, we construct a variant FedDMCfull where every client has access to the entire surrogate dataset. Under such a setting, FedDMCfull achieves considerable improvement as each client has more data. However, our approach still outperforms it, showing the robustness of CFeD to the surrogate dataset size.

Effect on Mitigating Intra-task Forgetting
To illustrate the effect of our proposed approach against intra-task forgetting, we compare three methods: FedAvg, CFeD, and CFeD with Server Distillation (namely CFeD+SD), both in the Domain-IL and Class-IL scenarios with a Non-IID distribution. Figure 3 shows the accuracy on new tasks and the average accuracy on learned tasks of the model during the learning process. The results show that CFeD+SD improves the performance on mitigating forgetting without sacrificing the ability to learn the new task. Moreover, the performance of all methods in the Non-IID distribution degrades significantly, but CFeD+SD is more stable than the other two methods. Owing to the clients division and server distillation, CFeD+SD achieves higher average accuracy without sacrificing plasticity.

Figure 3: The performance of FedAvg (blue), CFeD (green) and CFeD+SD (red) on both the Domain-IL and Class-IL scenarios under Non-IID distribution.

Clients Division Mechanism
To evaluate the effect of the clients division mechanism, Table 2 shows more detailed results of both the accuracy on new tasks and the average accuracy, to illustrate the trade-off of CFeD between stability and plasticity (see Section 2.1). It can be seen that, in Class-IL, CFeD(C) also suffers the dilemma between plasticity and stability: CFeD(C)lr=1e-6 cannot learn well on the new tasks and CFeD(C)lr=1e-3 performs poorly on the average accuracy. In contrast, CFeD strikes a good balance between plasticity and stability owing to the clients division mechanism.

                    Class-T1          Class-T2
                    Avg     New       Avg     New
CFeD(C)lr=1e-6      71.21   59.40     63.45   66.60
CFeD(C)lr=1e-3      62.22   93.90     34.26   93.34
CFeDlr=1e-6         85.81   95.30     83.80   97.22

Table 2: The average accuracy on learned tasks (Avg, %) and test accuracy on the new task (New, %) under different learning rates.

Varying Surrogate Data Size
To see how the surrogate data size affects the performance of CFeD, we vary the number of shards (β) of surrogate data assigned to each client from 2 (default) to 40. To reduce the experiment time, we set the local number of epochs to 10.

The experimental results are shown in Figure 4. Generally, the performance on new tasks drops when β increases. However, this is not the case for the old tasks. When varying β from 2 to 40, the performance on old tasks improves first and then decreases. The optimal value is around 10 for T1 and 20 for T2. It is worth noting that when β is small, the average accuracy on both tasks reaches a peak value and then diminishes slowly as learning proceeds (bottom-left subfigure of Figure 4). This indicates that the model learns the new task quickly (reaching the peak) and then gradually forgets the old tasks, which offsets the performance gained from the new task. The forgetting is apparently postponed in task sequence T1 when we enlarge β. But the effect of postponing is not obvious due to the large number of tasks in T2.

Figure 4: Results of varying the surrogate ratio (β denotes the number of selected shards).

6 Conclusions
In this paper, we tackle the problem of catastrophic forgetting in federated learning over a series of tasks. We proposed a Continual Federated Learning framework named CFeD, which leverages knowledge distillation based on surrogate datasets, to address the problem. Our approach allows clients to review the knowledge learned in the old model by optimizing the distillation loss based on their own surrogate datasets. The server also performs distillation to mitigate intra-task forgetting. To further improve the learning ability of the model, clients can be assigned to either learning the new task or reviewing the old tasks separately. The experimental results showed that our proposed approach outperforms the baselines in mitigating catastrophic forgetting and achieves a good trade-off between stability and plasticity. For future work, we will further enhance our approach to overcome intra-task forgetting on Non-IID data and reduce its training costs.

Acknowledgments
This work was supported by the Key Research and Development Program of Zhejiang Province of China (No. 2021C01009), the National Natural Science Foundation of China (Grant No. 62050099), and the Fundamental Research Funds for the Central Universities.

References

[Chaudhry et al., 2018] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.

[Delange et al., 2021] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.

[Griffin et al., 2007] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.

[Hung et al., 2019] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pages 13647–13657, 2019.

[Itahara et al., 2020] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. arXiv preprint arXiv:2008.06180, 2020.

[Jeong et al., 2018] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv preprint arXiv:1811.11479, 2018.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.

[Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[Krizhevsky, 2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[Lee et al., 2019] Kibok Lee, Kimin Lee, Jinwoo Shin, and Honglak Lee. Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 312–321, 2019.

[Li and Hoiem, 2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

[Li and Wang, 2019] Daliang Li and Junpu Wang. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.

[Li et al., 2006] Jingyang Li, Maosong Sun, and Xian Zhang. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In Proceedings of the International Conference on ACL, pages 545–552, 2006.

[Lin et al., 2020] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.

[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

[NLPIR, 2017] NLPIR. NLPIR Weibo corpus. http://www.nlpir.org/, 2017. Accessed: 2017-12-03.

[Shoham et al., 2019] Neta Shoham, Tomer Avidor, Aviv Keren, Nadav Israel, Daniel Benditkis, Liron Mor-Yosef, and Itai Zeitak. Overcoming forgetting in federated learning on non-IID data. arXiv preprint arXiv:1910.07796, 2019.

[SogouLabs, 2012] SogouLabs. Sohu news data (SogouCS). http://www.sogou.com/labs/resource/cs.php, 2012. Accessed: 2020-06-22.

[Toneva et al., 2018] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.

[Usmanova et al., 2021] Anastasiia Usmanova, François Portet, Philippe Lalanda, and German Vega. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv preprint arXiv:2109.04197, 2021.

[Yoon et al., 2021] Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. Federated continual learning with weighted inter-client transfer. In International Conference on Machine Learning, pages 12073–12086. PMLR, 2021.

[Zhang et al., 2020] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In The IEEE Winter Conference on Applications of Computer Vision, pages 1131–1140, 2020.

