
Proceedings of the Thirty-First International Joint Conference on Artificial Intelligence (IJCAI-22)

Continual Federated Learning Based on Knowledge Distillation

Yuhang Ma1, Zhongle Xie1, Jue Wang1, Ke Chen1 and Lidan Shou1,2
1 College of Computer Science and Technology, Zhejiang University
2 State Key Laboratory of CAD&CG, Zhejiang University
{myh0032, xiezl, zjuwangjue, chenk, should}@zju.edu.cn

Abstract

Federated learning (FL) is a promising approach for learning a shared global model on decentralized data owned by multiple clients without exposing their privacy. In real-world scenarios, data accumulated at the client-side varies in distribution over time. As a consequence, the global model tends to forget the knowledge obtained from previous tasks while learning new tasks, showing signs of “catastrophic forgetting”. Previous studies in centralized learning use techniques such as data replay and parameter regularization to mitigate catastrophic forgetting. Unfortunately, these techniques cannot adequately solve the non-trivial problem in FL. We propose Continual Federated Learning with Distillation (CFeD) to address catastrophic forgetting under FL. CFeD performs knowledge distillation on both the clients and the server, with each party independently having an unlabeled surrogate dataset, to mitigate forgetting. Moreover, CFeD assigns different learning objectives, namely learning the new task and reviewing old tasks, to different clients, aiming to improve the learning ability of the model. The results show that our method performs well in mitigating catastrophic forgetting and achieves a good trade-off between the two objectives.

Figure 1: (a) An overview of an FL system to predict the usage habits on mobile devices. (b) Distribution of screen time in two weeks for a specific user. (c) Local update suffers catastrophic forgetting while learning from new data.

1 Introduction
Federated Learning (FL) [McMahan et al., 2017] is proposed as a solution to learn a shared model using decentralized data owned by multiple clients without disclosing their private data. Figure 1 illustrates an example of an FL task to infer the usage habits of mobile device users. The sensitive data, namely a stream of the temporal histogram of screen time, is collected on mobile devices (clients) and used to train a local model. Meanwhile, a global model is built on a central server, which leverages the local models submitted by the clients without accessing any client's private data. During each round of training, the server first broadcasts the global model to the clients. Then, each client independently uses its local data to update the model and submits the model update to the server. Finally, the server aggregates these updates to produce a new global model for the next round.

Although FL can protect data privacy well, its performance is at risk in practice due to the following issues. First, clients participating in one round of training may become unavailable in the next round due to network failure, causing variation in the training data distribution between consecutive rounds. Second, the data accumulated at the client-side may vary over time, and can even be considered as a new task with a different data distribution or different labels, which poses significant challenges to the adaptability of the model. Furthermore, due to the inaccessibility of the raw data, minimizing the loss on the new task may increase the loss on old tasks. These issues all lead to underperformance of the global model, a phenomenon known as “catastrophic forgetting”.


Specifically, catastrophic forgetting in FL systems is observed in two main categories, namely intra-task forgetting and inter-task forgetting. (1) Intra-task forgetting occurs when two different subsets of clients are involved in two consecutive rounds. In Fig. 1(a), since a client does not participate in a training round, the new global model may forget knowledge obtained from this client in previous rounds and thus performs poorly on its local data. (2) Inter-task forgetting occurs when clients accumulate new data with different domains or different labels. As shown in Fig. 1(c), due to the different distribution of the screen time data in week 2, the performance of the new global model on week 1 data is degraded. It should be noted that the Non-IID issue brings more challenges to both kinds of forgetting.

Catastrophic forgetting in FL is a non-trivial problem because the conventional approaches to catastrophic forgetting, namely Continual Learning (CL), cannot be easily applied in FL due to privacy and resource constraints. Some recent attempts on this topic, such as [Shoham et al., 2019; Usmanova et al., 2021], do not solve the problem adequately since they are designed to address either intra-task forgetting or inter-task forgetting.

In this paper, we propose a framework called Continual Federated Learning with Distillation (CFeD) to mitigate catastrophic forgetting in both the intra-task and inter-task categories when learning new tasks. Specifically, CFeD leverages knowledge distillation in two ways based on unlabeled public datasets called the surrogate datasets. First, while learning the new task, each client transfers the knowledge of old tasks from the model converged on the last task into the new model via its own surrogate dataset to mitigate inter-task forgetting. Meanwhile, CFeD assigns the two objectives to different clients to improve the performance, called the clients division mechanism. Second, the server also maintains another independent surrogate dataset to fine-tune the aggregated model in each round by distilling the knowledge learned in the current and last rounds into the new aggregated one, called server distillation.

The main contributions of this paper are as follows:

• We extend continual learning to the federated learning scenario and define Continual Federated Learning (CFL) to address catastrophic forgetting when learning a series of tasks. (Section 3)

• We propose a CFL framework called CFeD, which employs knowledge distillation based on surrogate datasets to mitigate catastrophic forgetting both at the server-side and client-side. In each round, the inter-task forgetting is mitigated by assigning clients to learning the new task or to reviewing the old tasks. The intra-task forgetting is mitigated by applying a distillation scheme at the server-side. (Section 4)

• We evaluate two scenarios of CFL by varying the data distribution and adding categories on text and image classification datasets. CFeD outperforms existing FL methods in overcoming forgetting without sacrificing the ability to learn new tasks. (Section 5)

2 Related Work

2.1 Continual Learning
Continual Learning (CL) aims to solve the stability-plasticity dilemma when learning a sequence of tasks [Delange et al., 2021]. Models with strong stability forget little but perform poorly on the new task. In contrast, models with better plasticity can adapt quickly to the new task but tend to forget old ones. Existing CL methods can generally be divided into three categories: data replay, parameter separation and parameter regularization.

The core idea of data replay methods [Chaudhry et al., 2018] is to store and replay raw data from old tasks to mitigate forgetting. However, storing old data directly violates privacy and storage restrictions. Parameter separation methods [Hung et al., 2019] overcome catastrophic forgetting by assigning different subsets of the model parameters to deal with each task. However, separation methods result in an unbounded increase of parameters with the arrival of new tasks, which can quickly become unacceptable in FL.

The regularization-based methods limit the updating process by punishing the updates on important parameters [Kirkpatrick et al., 2017] or by adding a knowledge distillation [Hinton et al., 2015] loss to the objective function [Li and Hoiem, 2017; Lee et al., 2019; Zhang et al., 2020] to learn the knowledge from the old model. However, the parameter importance is difficult to evaluate precisely. Some distillation-based methods perform distillation based on the new task data, but their efficacy drops significantly when domains vary greatly. Other methods that leverage unlabeled external data address these difficulties. CFeD adopts knowledge distillation with public datasets and proposes a clients division mechanism to reduce the cost of time and computation.

2.2 Federated Learning
Recent studies have considered introducing Continual Learning into FL to improve the performance of models under Non-IID data [Shoham et al., 2019]. [Yoon et al., 2021] proposed Federated Continual Learning and focused on multiple continual learning agents that use each other's indirect experience to enhance the continual learning performance of their local models, rather than jointly training a better global model. Therefore, the purpose of their study is to obtain a collection of local models for the participating clients. While our work looks literally similar, our research problem is very different, as we aim at learning a better global model. Thus we decide not to compare with it in our experimental study.

Based on FedAvg, [Jeong et al., 2018] proposed a Federated Distillation framework to enhance communication efficiency without compromising performance. There are also several works leveraging additional datasets constructed from public datasets [Li and Wang, 2019; Lin et al., 2020] or original local data [Itahara et al., 2020] to aid distillation. Compared with the above works, our proposed method to mitigate catastrophic forgetting is orthogonal and could work together with them.

3 Problem Definition
In FL, a central server and K clients cooperate to train a model on task T through R rounds of communication. The optimization objective can be written as follows:

\[
\min_{\theta} \sum_{k=1}^{K} \frac{n_k}{n_a}\, L(T^k; \theta) \tag{1}
\]

where T^k refers to the collection of n_k training samples at the k-th client and n_a is the sum of all n_k.


Here we define Continual Federated Learning (CFL), which aims to train a global model via iterative communication between the server and clients on a series of tasks {T_1, T_2, T_3, ...} accumulated at the client-side.

When the t-th task arrives, data of previous tasks becomes unavailable for training. We define the global model parameters obtained from the previous task as θ_{t−1}, and the new task as T_t = ∪_{k=1}^{K} T_t^k, where T_t^k contains the newly collected data at each client.

The goal is to train a global model with a minimized loss on the new task T_t as well as the old tasks {T_1, ..., T_{t−1}}. The optimization objective is achieved by minimizing the losses of the K clients on all local tasks up to time t through iterative server-client communication. The global model parameters can be obtained as follows:

\[
\theta_t = \arg\min_{\theta} \sum_{k=1}^{K} \frac{n_k}{n_a} \sum_{i=1}^{t} L(T_i^k; \theta) \tag{2}
\]

The global model at task T_t is expected to achieve no higher loss on historical tasks than that at time t − 1, that is, ∑_{i=1}^{t−1} L(T_i; θ_t) ≤ ∑_{i=1}^{t−1} L(T_i; θ_{t−1}).

However, in real-world scenarios, due to the limitation on accessing previous data, CFL suffers catastrophic forgetting at the “intra-task” and “inter-task” levels. Formally, intra-task forgetting means that after the updates of the r-th round, the global model gets a higher loss than at the (r−1)-th round on the local dataset of the k-th client: L(T_t^k; θ_{t,r}) > L(T_t^k; θ_{t,r−1}), especially when the distribution across clients is Non-IID. Inter-task forgetting means that the loss of the model at time t on the old tasks is higher than that at time t − 1: ∑_{i=1}^{t−1} L(T_i^k; θ_t) > ∑_{i=1}^{t−1} L(T_i^k; θ_{t−1}).
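In practice, objectives of the form of Eqs. (1)-(2) are typically optimized with FedAvg-style rounds, the setting CFeD builds on: each client minimizes its local loss and the server averages the returned parameters with weights n_k / n_a. The snippet below is a minimal sketch of that weighted-averaging step, assuming PyTorch state dicts; the function name and interface are illustrative rather than taken from the authors' released code.

```python
import torch

def aggregate(client_states, client_sizes):
    """Weighted average of client parameters with weights n_k / n_a (cf. Eqs. (1)-(2))."""
    n_a = float(sum(client_sizes))
    global_state = {}
    for name in client_states[0]:
        global_state[name] = sum(
            (n_k / n_a) * state[name].float()
            for state, n_k in zip(client_states, client_sizes)
        )
    return global_state
```
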
4 Proposed Method
To tackle catastrophic forgetting in FL, we propose a framework named CFeD (Continual Federated learning with Distillation). As shown in Figure 2, the core idea is to use the model of the last task to predict the surrogate dataset, and to treat the outputs as pseudo-labels for knowledge distillation so as to review the knowledge of unavailable data. To improve the learning ability of the global model and fully utilize computation resources, learning the new task and reviewing old tasks can be assigned to different clients, and clients without enough new-task data can simply review the old tasks. Moreover, a server distillation mechanism is proposed to mitigate the intra-task forgetting in Non-IID data. The aggregated global model is fine-tuned to mimic the outputs of the global model of the last round and the local models of the current round.

The surrogate dataset should be collected from public datasets for privacy, and should cover as many features as possible or be similar to the old tasks to ensure the effectiveness of distillation. Since the model parameters do not increase, there is no additional communication cost compared with FedAvg.

Figure 2: Continual federated learning with knowledge distillation.

4.1 Clients Distillation
The distillation process of CFeD assumes that there is a surrogate dataset X_s^k at the k-th client. For each sample x_s ∈ X_s^k, its label at time t is y_{s,t} = f(x_s; θ_{t−1}). Thus we obtain a per-client set of sample pairs S_t^k = {(x_s, y_{s,t}), ∀x_s ∈ X_s^k}. A distillation term L_d(S_t^k; θ) is added to approximate the loss on old tasks ∑_{i=1}^{t−1} L(T_i^k; θ). For a specific surrogate sample pair (x_s, y_{s,t}) at time t, the distillation loss can be formalized as a modified version of the cross-entropy loss:

\[
L_d((x_s, y_{s,t}); \theta) = -\sum_{i=1}^{l} y'^{(i)}_{s,t} \log \hat{y}'^{(i)}_{s,t} \tag{3}
\]

where l is the number of target classes, y'^{(i)}_{s,t} is the modified surrogate label, and ŷ'^{(i)}_{s,t} is the modified output of the model on the surrogate sample x_s. The latter two are defined as:

\[
y'^{(i)}_{s,t} = \frac{\big(y^{(i)}_{s,t}\big)^{1/T}}{\sum_{j=1}^{l} \big(y^{(j)}_{s,t}\big)^{1/T}}, \qquad
\hat{y}'^{(i)}_{s,t} = \frac{\big(\hat{y}^{(i)}_{s,t}\big)^{1/T}}{\sum_{j=1}^{l} \big(\hat{y}^{(j)}_{s,t}\big)^{1/T}} \tag{4}
\]

where ŷ^{(i)}_{s,t} is the i-th element of f(x_s; θ), T is the temperature of distillation, and a greater T value amplifies minor logits so as to increase the information provided by the teacher model.
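The following is a minimal PyTorch sketch of the distillation loss in Eqs. (3)-(4). It assumes the teacher's softmax outputs y_{s,t} on the surrogate samples have already been produced by the old model f(·; θ_{t−1}); the helper names are our own, not the authors' implementation.

```python
import torch
import torch.nn.functional as F

def sharpen(probs, T):
    """Eq. (4): raise each probability to the power 1/T and renormalize."""
    p = probs.pow(1.0 / T)
    return p / p.sum(dim=1, keepdim=True)

def distillation_loss(student_logits, teacher_probs, T=2.0):
    """Eq. (3): cross-entropy between the modified teacher and student distributions."""
    y_mod = sharpen(teacher_probs, T)                         # y'_{s,t}
    y_hat_mod = sharpen(F.softmax(student_logits, dim=1), T)  # y-hat'_{s,t}
    return -(y_mod * y_hat_mod.clamp_min(1e-12).log()).sum(dim=1).mean()

# Usage: teacher_probs = F.softmax(old_model(x_s), dim=1).detach(),
# i.e. y_{s,t} = f(x_s; theta_{t-1}) computed once per task.
```
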


Based on the above notions, CFeD computes the total loss on all clients at time t by substituting the unknown losses on old tasks in Equation 2 with the distillation losses on the per-client surrogate datasets. Formally, the total loss is:

\[
\sum_{k=1}^{K} \frac{n_k}{n_a}\Big(L(T_t^k; \theta) + \sum_{i=1}^{t-1} L(T_i^k; \theta)\Big)
\;\propto\;
\sum_{k=1}^{K} \frac{n_k}{n_a}\Big(L(T_t^k; \theta) + \lambda_d L_d(S_t^k; \theta)\Big) \tag{5}
\]

where λ_d (default 0.1) is a pre-defined parameter to weight the distillation loss.

4.2 Clients Division Mechanism
In real-world scenarios, different clients accumulate data at different speeds. While some clients are ready to learn a new task, others may not have gained enough data for the new task and thus cannot be effectively utilized in learning it.

To leverage the under-utilized computation resources, CFeD treats learning the new task and reviewing old tasks as two individual objectives. The framework assigns one of the two objectives to each client so that some clients only perform reviewing. Further, this kind of division may improve the exploration of the model on different objectives and help the model depart from previous local minima. Formally, regarding the division mechanism, Equation 5 can be expanded and rewritten as:

\[
\sum_{k=1}^{N} \frac{n_k}{n_a}\, L_d(S_t^k; \theta) + \sum_{k=N+1}^{K} \frac{n_k}{n_a}\, L(T_t^k; \theta) \tag{6}
\]

where N = ⌈αK⌉, α ∈ [0, 1]. We introduce the factor α to describe the proportion of clients involved in reviewing per round.
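As a rough illustration of Eq. (6), the sketch below divides the K clients sampled in a round into reviewers (the first N = ⌈αK⌉, which optimize the distillation term) and learners (which optimize the new-task loss), reusing `distillation_loss` from the Section 4.1 sketch. The random assignment, batch handling and all names are our own assumptions rather than the paper's implementation.

```python
import math
import random
import torch
import torch.nn.functional as F

def divide_clients(sampled_clients, alpha):
    """Eq. (6): the first N = ceil(alpha * K) clients review old tasks, the rest learn the new task."""
    K = len(sampled_clients)
    N = math.ceil(alpha * K)
    shuffled = random.sample(sampled_clients, K)   # random order, no replacement
    return shuffled[:N], shuffled[N:]              # (reviewers, learners)

def local_step(model, old_model, batch, role, T=2.0):
    """Per-batch local loss: reviewers optimize the distillation term, learners the new-task CE."""
    if role == "review":
        x_s = batch                                        # unlabeled surrogate inputs
        with torch.no_grad():
            teacher_probs = F.softmax(old_model(x_s), dim=1)
        return distillation_loss(model(x_s), teacher_probs, T)
    x_new, y_new = batch                                   # labeled new-task data
    return F.cross_entropy(model(x_new), y_new)
```
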
4.3 Server Distillation
Although the clients division mechanism allows spare computing resources to be utilized, we must note that some active clients are not learning the new task in a given round. This may harm the performance on the new task and lead to severe intra-task forgetting, especially in the Non-IID scenario. The reason is that, since the labels of the data on different clients may not intersect each other, the global model fitting one client (learning the new task) may easily exhibit forgetting on another (reviewing the old tasks, or learning the new task on data of different labels).

To mitigate such performance degradation on the new task, a client may take the naive solution of increasing its local training iterations. However, increasing the number of epochs of local updates on Non-IID data can easily cause overfitting and destabilize the performance of the global model.

Motivated by the paradigm of mini-batch iterative updates in centralized training [Toneva et al., 2018], we propose Server Distillation (SD) to mitigate the intra-task forgetting and stabilize the performance of the aggregated model at the server-side. In our approach, the server also maintains a lightweight, unlabeled public dataset X_s, similar to the surrogate datasets on the clients. After the r-th round of aggregating the K local models collected from the clients, the server divides X_s into K + 1 batches, assigns K of them to the received local models and one to the global model of the last round, and collects the outputs Ŷ_{s,r} of the above-mentioned models on their assigned batches to construct a labeled set of sample pairs for the server distillation, denoted as S_r = {(x_s, y_{s,r}), ∀x_s ∈ X_s}. Next, the aggregated global model θ_r is iteratively updated with the distillation loss L_d(S_r; θ).

With server distillation, the global model is able to further retrieve the knowledge from the local models and the previous global model, thus mitigating intra-task forgetting.
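The sketch below illustrates the server-distillation step just described: the server splits its surrogate inputs into K + 1 batches, uses the K received local models plus the previous global model as teachers on their assigned batches, and fine-tunes the aggregated model with the resulting distillation loss. It reuses `distillation_loss` from the Section 4.1 sketch; the batching, optimizer and names are assumptions, not the authors' code.

```python
import torch
import torch.nn.functional as F

def server_distillation(agg_model, prev_global, local_models, surrogate_inputs,
                        T=2.0, lr=1e-3, steps=1):
    """Fine-tune the aggregated model to mimic the K local models and the previous
    global model on K + 1 disjoint batches of the server-side surrogate data X_s."""
    teachers = list(local_models) + [prev_global]            # K local models + last global model
    batches = torch.chunk(surrogate_inputs, len(teachers))   # K + 1 batches of X_s
    opt = torch.optim.SGD(agg_model.parameters(), lr=lr)
    for _ in range(steps):
        for teacher, x in zip(teachers, batches):
            with torch.no_grad():
                teacher_probs = F.softmax(teacher(x), dim=1)  # soft labels y_{s,r}
            loss = distillation_loss(agg_model(x), teacher_probs, T)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return agg_model
```
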
5 Experiments
In this section, we evaluate CFeD and baseline approaches extended from traditional continual learning methods on text and image classification tasks.

5.1 Datasets and Tasks
We consider both text and image classification datasets. THUCNews [Li et al., 2006] contains 14 categories of Chinese news data collected from Sina News RSS between 2005 and 2011. SogouCS [SogouLabs, 2012] contains 18 categories of 511218 Chinese news articles from 2012. Sina2019 contains 30533 Chinese news articles from 2019 crawled from Sina News by ourselves. The NLPIR Weibo Corpus [NLPIR, 2017] consists of 230000 samples obtained from Sina Weibo and Tencent Weibo, two Chinese social media sites. We use it as a surrogate dataset across different tasks. CIFAR-10 [Krizhevsky, 2009] contains 60000 images with 10 classes. CIFAR-100 [Krizhevsky, 2009] contains 60000 images with 100 classes. Caltech-256 [Griffin et al., 2007] contains 30608 images with 256 classes and serves as the surrogate dataset in image classification. (Our code and datasets are publicly available at https://github.com/lianziqt/CFeD.)

We design task sequences to be learned in two different scenarios: Domain-IL indicates the case where the input distributions continually vary in the sequence; Class-IL indicates the case where new classes incrementally emerge in the sequence. Tasks on the text datasets are denoted by ‘Tx’ while those on the image datasets are denoted by ‘Ix’, where ‘x’ is the task sequence ID. Most task sequences in the experiments are short, containing two or three classification tasks.

5.2 Compared Methods
We choose the following approaches for evaluation:
(1) Finetuning: A naive method that trains the model on tasks sequentially. (2) FedAvg: An FL method in which each client learns tasks in sequence and the server aggregates the local models from the clients. (3) MultiHead: A CL method training individual classifiers for each task, requiring task labels to specify the output during the inference phase. FedMultiHead denotes FedAvg with MultiHead applied to clients. (4) EWC: A regularization-based method [Kirkpatrick et al., 2017] that uses the Fisher information matrix to estimate the importance of parameters. FedEWC denotes FedAvg with EWC applied to clients.


(5) LwF: A distillation-based method. Instead of unlabeled data, LwF leverages the new task data to perform distillation. FedLwF denotes FedAvg with LwF applied to clients. (6) DMC: Deep Model Consolidation [Zhang et al., 2020], a Class-IL CL method that first trains a separate model only for the new task, and then leverages an unlabeled public dataset to distill the knowledge of the old tasks and the new task into a new combined model. FedDMC denotes FedAvg with DMC applied to clients. (7) CFeD: our method; CFeD(C) denotes the centralized version of our method.

Note that MultiHead and FedMultiHead require task labels during inference to know which task the current input belongs to. Moreover, multiple classifiers inevitably bring more parameters in the Domain-IL scenario. Owing to this additional information, their performance can be seen as a target for the other methods.

5.3 Experimental Settings
We use TextCNN [Kim, 2014] or ResNet-18 [He et al., 2016] followed by fully-connected layers for classification. Each task trains the model for R = 20 rounds. For the local update at each client, the number of learning epochs is 10 in Domain-IL and 40 in Class-IL. Unless otherwise stated, the constraint factor λ of the EWC method is set to 100000. The temperature of distillation is set to 2 by default.

For the configuration of FL, we assume that there are 100 clients, and only a random 10% of the clients are sampled to participate in each training round. The training dataset and the surrogate dataset are both divided into 200 shards, either randomly (IID) or sorted by category (Non-IID). In each experiment, every client selects two shards of data on each task as the local dataset and also two shards of the surrogate dataset as the local surrogate. In particular, the server also selects two shards for server distillation in the Non-IID distribution. All of the above selections are conducted randomly.

For each task, we select 70% of the data as the training set, 10% as the validation set and the rest as the test set. The global model is evaluated on the test set at the end of each training round. All experiments are repeated for 3 runs with different random seeds.
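To make the sharding scheme above concrete, here is a small sketch of how a dataset could be split into 200 shards (random order for IID, sorted by label for Non-IID) with each of the 100 clients drawing two shards. It only illustrates the described setup under our own naming and is not the released preprocessing code.

```python
import numpy as np

def make_shards(labels, num_shards=200, iid=True, seed=0):
    """Split sample indices into shards: random order for IID, sorted by class for Non-IID."""
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(labels)) if iid else np.argsort(labels)
    return np.array_split(idx, num_shards)

def assign_shards(shards, num_clients=100, shards_per_client=2, seed=0):
    """Give each client `shards_per_client` distinct shards (200 shards serve 100 clients)."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(shards))
    return [np.concatenate([shards[order[c * shards_per_client + j]]
                            for j in range(shards_per_client)])
            for c in range(num_clients)]
```
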

                    Domain-IL scenario       Class-IL scenario
                    Domain-T1  Domain-T2     Class-T1  Class-T2  Class-I1  Class-I2
MultiHead           94.66      93.40         95.84     95.66     51.81     57.04
Finetuning          85.96      91.32         48.00     32.43     36.74     30.58
EWC                 87.42      91.98         47.96     32.42     36.35     30.72
LwF                 90.58      92.01         48.59     39.10     35.31     26.32
DMC                 -          -             48.37     41.92     37.81     27.37
CFeD(C)lr=1e-6      82.48      83.20         71.21     63.45     37.88     28.36
CFeD(C)lr=1e-3      94.49      92.95         62.22     34.26     33.75     22.21
FedMultiHead*       81.83      91.04         96.26     96.07     56.97     60.43
FedAvg              86.50      92.60         48.25     32.61     32.53     29.28
FedEWC              84.76      92.06         48.24     32.51     30.37     28.62
FedLwF              87.35      92.41         59.39     44.76     33.97     23.46
FedDMC              -          -             46.58     10.46     16.50     8.33
FedDMCfull          -          -             56.95     50.87     40.13     29.18
CFeD                92.34      94.15         85.81     83.80     40.51     32.33

Table 1: The average accuracy on learned tasks (%) of different methods (global models for FL). The top 7 rows are centralized methods, while the bottom 7 rows are FL methods. * indicates methods with additional information (task labels). DMC is only suitable for the Class-IL scenario.

5.4 Experimental Results

Effect on Mitigating Inter-task Forgetting
We evaluate all methods in both centralized and FL scenarios. Table 1 summarizes the average accuracy over the tasks learned so far after the model learns the second and third tasks sequentially, both in Domain-IL (left) and Class-IL (right). By comparing the results of the different methods, it can be seen that CFeD exceeds the other baselines on average accuracy.

In the Domain-IL scenario (left half of Table 1), CFeD exceeds the FedAvg, FedLwF, FedEWC and FedDMC methods on average accuracy, being close to FedMultiHead. Moreover, the average accuracy of all methods improves after the model continually learns Domain-T2. This result implies that the new task of Domain-T2 may cover some features of the old tasks, which helps the models review the old knowledge; notably, CFeD still outperforms the other baselines.

In the Class-IL scenario (right half of Table 1), we can see that the average accuracy of FedAvg and FedEWC both drop significantly. The reason is that the labels of the old task are not available, and the model quickly overfits the new task. In contrast, CFeD outperforms the other baselines, indicating the benefit of leveraging the surrogate dataset to get reasonable soft labels for old tasks.


We notice that the performance of FedDMC drops significantly in Class-IL. Since the model consolidation of DMC only uses surrogate data for distillation, i.e., no new task data, to learn new tasks and review old tasks, its performance is significantly limited by the surrogate dataset size (in our case, each client only has 2300 surrogate samples). To verify this, we construct a variant FedDMCfull where every client has access to the entire surrogate dataset. Under such a setting, FedDMCfull achieves considerable improvement as each client has more data. However, our approach still outperforms it, showing the robustness of CFeD to the surrogate dataset size.

Effect on Mitigating Intra-task Forgetting
To illustrate the effect of our proposed approach against intra-task forgetting, we compare three methods: FedAvg, CFeD, and CFeD with Server Distillation (namely CFeD+SD), both in the Domain-IL and Class-IL scenarios with a Non-IID distribution. Figure 3 shows the accuracy on new tasks and the average accuracy on learned tasks of the model during the learning process. The results show that CFeD+SD improves the performance on mitigating forgetting without sacrificing the ability to learn the new task. Moreover, the performance of all methods in the Non-IID distribution degrades significantly, but CFeD+SD is more stable than the other two methods. Owing to the clients division and server distillation, CFeD+SD achieves higher average accuracy without sacrificing plasticity.

Figure 3: The performance of FedAvg (blue), CFeD (green) and CFeD+SD (red) on both the Domain-IL and Class-IL scenarios under Non-IID distribution.

Clients Division Mechanism
To evaluate the effect of the clients division mechanism, Table 2 shows more detailed results of both the accuracy on new tasks and the average accuracy, to illustrate the trade-off of CFeD between stability and plasticity (see Section 2.1). It can be seen that, in Class-IL, CFeD(C) also suffers the dilemma between plasticity and stability: CFeD(C)lr=1e-6 cannot learn well on the new tasks and CFeD(C)lr=1e-3 performs poorly on the average accuracy. In contrast, CFeD strikes a good balance between plasticity and stability owing to the clients division mechanism.

                    Class-T1          Class-T2
                    Avg     New       Avg     New
CFeD(C)lr=1e-6      71.21   59.40     63.45   66.60
CFeD(C)lr=1e-3      62.22   93.90     34.26   93.34
CFeDlr=1e-6         85.81   95.30     83.80   97.22

Table 2: The average accuracy on learned tasks (Avg, %) and test accuracy on the new task (New, %) under different learning rates.

Varying Surrogate Data Size
To see how the surrogate data size affects the performance of CFeD, we vary the number of shards (β) of surrogate data assigned to each client from 2 (default) to 40. To reduce the experiment time, we set the local number of epochs to 10.

The experimental results are shown in Figure 4. Generally, the performance on new tasks drops when β increases. However, this is not the case for the old tasks. When varying β from 2 to 40, the performance on old tasks improves first and then decreases. The optimal value is around 10 for T1 and 20 for T2. It is worth noting that when β is small, the average accuracy on both tasks reaches a peak value and then diminishes slowly as learning proceeds (bottom-left subfigure of Figure 4). This indicates that the model learns the new task quickly (reaching the peak) and then gradually forgets the old tasks, which offsets the performance gained from the new task. The forgetting is apparently postponed in task sequence T1 when we enlarge β. But the effect of postponing is not obvious due to the large number of tasks in T2.

Figure 4: Results of varying the surrogate ratio (β denotes the number of selected shards).

6 Conclusions
In this paper, we tackle the problem of catastrophic forgetting in federated learning over a series of tasks. We proposed a Continual Federated Learning framework named CFeD, which leverages knowledge distillation based on surrogate datasets, to address the problem. Our approach allows clients to review the knowledge learned in the old model by optimizing the distillation loss based on their own surrogate datasets. The server also performs distillation to mitigate intra-task forgetting. To further improve the learning ability of the model, clients can be assigned to either learning the new task or reviewing the old tasks separately. The experimental results showed that our proposed approach outperforms the baselines in mitigating catastrophic forgetting and achieves a good trade-off between stability and plasticity. For future work, we will further enhance our approach to overcome intra-task forgetting on Non-IID data and reduce its training costs.

Acknowledgments
This work was supported by the Key Research and Development Program of Zhejiang Province of China (No. 2021C01009), the National Natural Science Foundation of China (Grant No. 62050099), and the Fundamental Research Funds for the Central Universities.

References

[Chaudhry et al., 2018] Arslan Chaudhry, Marc'Aurelio Ranzato, Marcus Rohrbach, and Mohamed Elhoseiny. Efficient lifelong learning with A-GEM. arXiv preprint arXiv:1812.00420, 2018.

[Delange et al., 2021] Matthias Delange, Rahaf Aljundi, Marc Masana, Sarah Parisot, Xu Jia, Ales Leonardis, Greg Slabaugh, and Tinne Tuytelaars. A continual learning survey: Defying forgetting in classification tasks. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 1–1, 2021.

[Griffin et al., 2007] Gregory Griffin, Alex Holub, and Pietro Perona. Caltech-256 object category dataset. Technical report, California Institute of Technology, 2007.

[He et al., 2016] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[Hinton et al., 2015] Geoffrey Hinton, Oriol Vinyals, Jeff Dean, et al. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2(7), 2015.

[Hung et al., 2019] Ching-Yi Hung, Cheng-Hao Tu, Cheng-En Wu, Chien-Hung Chen, Yi-Ming Chan, and Chu-Song Chen. Compacting, picking and growing for unforgetting continual learning. In Advances in Neural Information Processing Systems, pages 13647–13657, 2019.

[Itahara et al., 2020] Sohei Itahara, Takayuki Nishio, Yusuke Koda, Masahiro Morikura, and Koji Yamamoto. Distillation-based semi-supervised federated learning for communication-efficient collaborative training with non-IID private data. arXiv preprint arXiv:2008.06180, 2020.

[Jeong et al., 2018] Eunjeong Jeong, Seungeun Oh, Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. Communication-efficient on-device machine learning: Federated distillation and augmentation under non-IID private data. arXiv preprint arXiv:1811.11479, 2018.

[Kim, 2014] Yoon Kim. Convolutional neural networks for sentence classification. CoRR, abs/1408.5882, 2014.

[Kirkpatrick et al., 2017] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[Krizhevsky, 2009] A. Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

[Lee et al., 2019] Kibok Lee, Kimin Lee, Jinwoo Shin, and Honglak Lee. Overcoming catastrophic forgetting with unlabeled data in the wild. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 312–321, 2019.

[Li and Hoiem, 2017] Zhizhong Li and Derek Hoiem. Learning without forgetting. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(12):2935–2947, 2017.

[Li and Wang, 2019] Daliang Li and Junpu Wang. FedMD: Heterogenous federated learning via model distillation. arXiv preprint arXiv:1910.03581, 2019.

[Li et al., 2006] Jingyang Li, Maosong Sun, and Xian Zhang. A comparison and semi-quantitative analysis of words and character-bigrams as features in Chinese text categorization. In Proceedings of the International Conference on ACL, pages 545–552, 2006.

[Lin et al., 2020] Tao Lin, Lingjing Kong, Sebastian U Stich, and Martin Jaggi. Ensemble distillation for robust model fusion in federated learning. Advances in Neural Information Processing Systems, 33:2351–2363, 2020.

[McMahan et al., 2017] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. Communication-efficient learning of deep networks from decentralized data. In Artificial Intelligence and Statistics, pages 1273–1282. PMLR, 2017.

[NLPIR, 2017] NLPIR. NLPIR Weibo corpus. http://www.nlpir.org/, 2017. Accessed: 2017-12-03.

[Shoham et al., 2019] Neta Shoham, Tomer Avidor, Aviv Keren, Nadav Israel, Daniel Benditkis, Liron Mor-Yosef, and Itai Zeitak. Overcoming forgetting in federated learning on non-IID data. arXiv preprint arXiv:1910.07796, 2019.

[SogouLabs, 2012] SogouLabs. Sohu news data (SogouCS). http://www.sogou.com/labs/resource/cs.php, 2012. Accessed: 2020-06-22.

[Toneva et al., 2018] Mariya Toneva, Alessandro Sordoni, Remi Tachet des Combes, Adam Trischler, Yoshua Bengio, and Geoffrey J Gordon. An empirical study of example forgetting during deep neural network learning. arXiv preprint arXiv:1812.05159, 2018.

[Usmanova et al., 2021] Anastasiia Usmanova, François Portet, Philippe Lalanda, and German Vega. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv preprint arXiv:2109.04197, 2021.

[Yoon et al., 2021] Jaehong Yoon, Wonyong Jeong, Giwoong Lee, Eunho Yang, and Sung Ju Hwang. Federated continual learning with weighted inter-client transfer. In International Conference on Machine Learning, pages 12073–12086. PMLR, 2021.

[Zhang et al., 2020] Junting Zhang, Jie Zhang, Shalini Ghosh, Dawei Li, Serafettin Tasci, Larry Heck, Heming Zhang, and C-C Jay Kuo. Class-incremental learning via deep model consolidation. In The IEEE Winter Conference on Applications of Computer Vision, pages 1131–1140, 2020.

