
Personalized Federated Learning: A Combinational Approach

Sone Kyaw Pye
School of Computer Science and Engineering
Nanyang Technological University
Singapore
[email protected]

Han Yu
School of Computer Science and Engineering
Nanyang Technological University
Singapore
[email protected]

ABSTRACT

Federated learning (FL) is a distributed machine learning approach in which multiple clients collaboratively train a shared model. Such a system has the advantage of more training data from multiple clients, but data can be non-identically and independently distributed (non-i.i.d.). Privacy- and integrity-preserving features such as differential privacy (DP) and robust aggregation (RA) are commonly used in FL. In this work, we show that, on common deep learning tasks, the performance of FL models differs amongst clients and situations, and that FL models can sometimes perform worse than local models due to non-i.i.d. data. Secondly, we show that incorporating DP and RA degrades performance further. We then conduct an ablation study on the performance impact of different combinations of common personalization approaches for FL, such as finetuning, mixture-of-experts ensemble, multi-task learning, and knowledge distillation. We observe that certain combinations of personalization approaches are more impactful in certain scenarios while others always improve performance, and that combined approaches are better than individual ones. Most clients obtained better performance with combined personalized FL and recovered from the performance degradation caused by non-i.i.d. data, DP, and RA.

CCS CONCEPTS
• Computing methodologies ~ Artificial intelligence ~ Distributed artificial intelligence

KEYWORDS
Federated Learning, Computer Vision, Natural Language Processing, Deep Learning

1 Introduction

Federated Learning (FL) is a distributed machine learning (ML) approach that involves multiple users, referred to as clients, collaboratively training a global model without transferring data from local storage to a central server [1]. Ideally, an FL model performs better than individual models trained only on each client's data because it benefits from more training data. FL can be further classified into two scenarios: cross-device and cross-silo FL. The difference between them is the number of clients, with the latter having significantly fewer clients but more data per client [2]. FL's distributed data paradigm contrasts with traditional ML, which requires data to be stored in a single location, raising concerns about communication costs and privacy. Transferring data from devices to a central location can incur significant communication overheads, and such data movement might contravene privacy laws such as the European General Data Protection Regulation [3] and the Health Insurance Portability and Accountability Act [4]. FL addresses these concerns as there is no need for centralized data storage, causing it to garner increasing interest recently and resulting in various emerging FL applications across multiple fields.

Two challenges in FL are the non-identically and independently distributed (non-i.i.d.) nature of data and the model heterogeneity caused by dissimilar clients needing specific models for their unique contexts. These challenges cause the FL model to perform poorly for certain clients, as it fails to adapt to the unique data distributions of individual clients. As a result, local models can perform better than the FL model for those clients [5], reducing their incentive to participate in FL. FL also faces privacy concerns, as FL models have been shown to leak training data [6]. To counteract this, differential privacy (DP) has been proposed. FL is also vulnerable to adversarial attacks, and robust aggregation (RA) has been suggested as a defense against some of them [7]. However, incorporating DP and RA degrades performance, which disincentivizes clients from participating in FL, since there might not be a significant, or even a positive, performance gain with these additional measures in place.

One way to overcome performance degradation is the personalization of FL models. There are several personalization approaches, such as finetuning (FT), mixture of experts (MoE), multi-task learning (MTL), and knowledge distillation (KD), each of which has been shown to yield positive performance gains for FL models [7]. However, to our knowledge, no work has been done to combine personalization approaches, which could provide further performance gains.

This paper addresses this gap and evaluates various combinations of personalization approaches in scenarios of plain FL, differentially private FL (DP-FL), and robust aggregation FL (RA-FL). We observe that existing personalization approaches affect different aspects of the FL process: MoE does not affect the FL model, FT further trains the FL model after federated training, and KD and MTL modify FT. Therefore, we try out all possible combinations of these approaches. Our main contributions in this paper are as follows:

• We demonstrate that, for certain clients, FL does not provide enough performance incentive to be part of the federation of clients, and that incorporating DP and RA can further reduce that incentive due to performance degradation.
• We propose combination approaches comprising common personalization approaches.
• We empirically show that these combinations yield better performance gains than standalone personalization approaches and compensate for performance degradation.
• We observe that certain combinations are more impactful in certain scenarios and tasks while others improve performance across the board, and that a combination of approaches always tends to be better than individual ones.

The rest of the paper is organized as follows. Section 2 presents background on FL, DP, RA, and personalization approaches. Section 3 presents our experimental setup, Section 4 showcases results and analysis, and the conclusion is presented in Section 5.

2 Background

2.1 Federated Learning

A typical FL training process encompasses:
1. Selection of training participants: In each FL training round t = 1 ... T, the server randomly samples m clients. This selection only occurs in cross-device FL, as cross-silo FL involves all clients due to the small number of total clients.
2. Distribution of Initial Global Model: Selected clients from Step 1 download the latest model G^{t-1} from the server.
3. Local Training: Each client trains G^{t-1} for K epochs using local data and computes an update P_i^t to G^{t-1}.
4. Aggregation: The server collects the updates and averages them into an updated global model G^t using Federated Averaging (FedAvg) with aggregation learning rate η:

   G^t = G^{t-1} + (η/m) Σ_{i=1}^{m} (P_i^t − G^{t-1})    (1)

This process can go on as long as new data is available for training from clients and there are clients eligible for training.
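As a concrete illustration of the aggregation in Equation (1), a minimal sketch operating on PyTorch state dictionaries might look like the following; the function and variable names are our own and not taken from the paper's implementation.

```python
import torch

def fedavg_aggregate(global_state, client_states, eta=1.0):
    """Sketch of Equation (1): G^t = G^{t-1} + (eta/m) * sum_i (P_i^t - G^{t-1})."""
    m = len(client_states)
    new_state = {}
    for name, g_prev in global_state.items():
        if not torch.is_floating_point(g_prev):
            # Integer buffers (e.g. BatchNorm counters) are copied unchanged.
            new_state[name] = g_prev.clone()
            continue
        delta = sum(c[name] - g_prev for c in client_states)
        new_state[name] = g_prev + (eta / m) * delta
    return new_state
```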
2.2 Differential Privacy & Robust Aggregation

DP limits the information learnable about clients from model updates or FL models [8]. However, DP degrades the performance of the FL model. In formal terms, differential privacy (DP) provides an (ε, δ) privacy guarantee when the federated mechanism M and two sets of users Q, Q' that differ by one participant produce models in any set G with probabilities that satisfy:

   Pr[M(Q) ∈ G] ≤ e^ε Pr[M(Q') ∈ G] + δ    (2)

Incorporating DP into FL involves clipping each client's update and adding Gaussian noise N(0, σ). Referencing Equation 1, aggregation is modified as follows:

   G^t = G^{t-1} + (η/m) Σ_{i=1}^{m} Clip(P_i^t − G^{t-1}, S) + N(0, σ)    (3)

where S is the clipping bound and N(0, σ) is the noise added. These values depend on the number of clients: the fewer the clients, the larger the magnitude of clipping and noise added to preserve privacy [8]. This makes DP incompatible with cross-silo FL, with its small number of clients.
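A rough sketch of Equation (3) is shown below. For brevity, clipping is applied per parameter tensor, whereas DP-FedAvg in [8] clips the whole flattened update; the clipping bound, noise scale, and all names are illustrative placeholders.

```python
import torch

def dp_fedavg_aggregate(global_state, client_states, eta=1.0, clip_bound=1.0, noise_sigma=0.01):
    """Sketch of Equation (3): clip each client's update to norm S, average, add N(0, sigma) noise."""
    m = len(client_states)
    new_state = {}
    for name, g_prev in global_state.items():
        if not torch.is_floating_point(g_prev):
            new_state[name] = g_prev.clone()
            continue
        clipped_sum = torch.zeros_like(g_prev)
        for c in client_states:
            delta = c[name] - g_prev
            # Clip(delta, S): rescale the update if its L2 norm exceeds the bound S.
            scale = min(1.0, clip_bound / (delta.norm().item() + 1e-12))
            clipped_sum += delta * scale
        noise = noise_sigma * torch.randn_like(g_prev)
        new_state[name] = g_prev + (eta / m) * clipped_sum + noise
    return new_state
```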
RA is a suggested defense against poisoning attacks by malicious clients. RA replaces FedAvg: instead of averaging updates like FedAvg, the geometric median is used [9]. Typically, poisoning attacks involve scaling model weights or using poisoned data to train the FL model before sending the poisoned updates for aggregation. RA reduces the impact that statistical outliers have on model weights, as only the median weight, to which outliers do not contribute, is used. RA is represented as:

   G^t = G^{t-1} + η (P̃^t − G^{t-1})    (4)

where P̃^t is the element-wise median of the updates acquired by the server performing the aggregation in round t of FL training. RA has also been shown to degrade the performance of models [7].
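A minimal sketch of the aggregation in Equation (4) is given below, using the element-wise median across client updates (the coordinate-wise variant of the median-based rules discussed in [9]); names are ours.

```python
import torch

def robust_aggregate(global_state, client_states, eta=1.0):
    """Sketch of Equation (4): G^t = G^{t-1} + eta * (P_median^t - G^{t-1})."""
    new_state = {}
    for name, g_prev in global_state.items():
        if not torch.is_floating_point(g_prev):
            new_state[name] = g_prev.clone()
            continue
        stacked = torch.stack([c[name] for c in client_states], dim=0)
        p_median = stacked.median(dim=0).values  # element-wise median over clients
        new_state[name] = g_prev + eta * (p_median - g_prev)
    return new_state
```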
2.3 Personalization of FL Models

Numerous personalization approaches have been proposed, and most can be categorized into the following archetypes:

Finetuning (FT): The FL model, after federated training, is further trained on the client's local data. The intuition is akin to transfer learning, where knowledge acquired from a global pool of data is leveraged to learn better local features instead of learning from scratch on a limited local pool of data [10]. A variant of FT, called freeze-base FT (FB), involves freezing some model layers, such as the base layers, and leaving only the top layers unfrozen [7].
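Freeze-base FT can be sketched as follows: freeze every parameter except a chosen set of top layers and finetune only those on local data. The layer prefix and hyperparameters below are placeholders, not values from the paper.

```python
import torch
from torchvision.models import resnet18

def freeze_base(model, trainable_prefixes=("fc",)):
    """Freeze-base FT: keep only the listed top layers trainable."""
    for name, param in model.named_parameters():
        param.requires_grad = any(name.startswith(p) for p in trainable_prefixes)
    return model

# Example: finetune only the classifier head of the downloaded FL model.
fl_model = freeze_base(resnet18(num_classes=10))
optimizer = torch.optim.SGD(
    (p for p in fl_model.parameters() if p.requires_grad), lr=0.01, momentum=0.9
)
```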
Multi-task learning (MTL): An MTL problem involves solving multiple related tasks together using commonalities across tasks [11]. In FL, training the FL model and personalizing it can be treated as related tasks [12]. In [7], the FL training process is treated as task X and personalization for a client as task Y to formulate an MTL problem. The aim is to take G^T, which is optimized for X, and optimize it for Y, producing a personalized model A.

This optimization can be viewed as an extension of FT with a different loss function. To address possible catastrophic forgetting [13] of X while optimizing for Y, elastic weight consolidation [14] is used [7] to reduce the rate of learning on layers/weights critical for X. As such, the cross-entropy loss is augmented:

   l(A, x) = L_cross(A, x) + Σ_i (λ/2) F_i (A_i − G_i^T)^2    (5)

where λ is the importance of task X vs. Y, F is the Fisher information matrix, and i is the label of each parameter.
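A sketch of the augmented loss in Equation (5) is shown below: standard cross-entropy plus an elastic-weight penalty that anchors parameters the Fisher information marks as important for the global task X. The Fisher estimates and global parameter snapshot are assumed to be precomputed, and all names are illustrative.

```python
import torch
import torch.nn.functional as F

def mtl_ewc_loss(personal_model, logits, targets, global_params, fisher, lam=0.4):
    """Sketch of Equation (5): L_cross + sum_i (lambda/2) * F_i * (A_i - G_i^T)^2."""
    loss = F.cross_entropy(logits, targets)
    for name, param in personal_model.named_parameters():
        penalty = fisher[name] * (param - global_params[name]) ** 2
        loss = loss + (lam / 2.0) * penalty.sum()
    return loss
```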
Knowledge distillation (KD): KD involves extracting the learned features of a teacher model to teach a student model [15]. In FL, by treating the FL model (G^T) as the teacher and the personalized model (A) as the student and using a loss function from the knowledge distillation literature, KD can be viewed as an extension to FT. Like MTL, the cross-entropy loss function is augmented as such:

   l(A, x) = α K^2 L_cross(A, x) + (1 − α) KL(σ(G^T(x)/K), σ(A(x)/K))    (6)

where KL is the Kullback-Leibler divergence loss, σ is the softmax, α is the weight parameter, and K is the temperature constant.
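Equation (6) translates into a loss along these lines, where the FL model G^T provides temperature-softened targets for the personalized model A; the values of alpha and the temperature below are placeholders.

```python
import torch
import torch.nn.functional as F

def kd_loss(student_logits, teacher_logits, targets, alpha=0.5, temperature=2.0):
    """Sketch of Equation (6): alpha*K^2*L_cross plus (1-alpha)*KL between softened outputs."""
    hard = alpha * (temperature ** 2) * F.cross_entropy(student_logits, targets)
    soft = (1.0 - alpha) * F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),  # student A(x)/K
        F.softmax(teacher_logits / temperature, dim=1),      # teacher G^T(x)/K
        reduction="batchmean",
    )
    return hard + soft
```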
Mixture of Experts (MoE): MoE treats personalization as an ensemble learning task, where a local model trained only on the client's data is used together with the FL model [16]. The local model and the FL model are put into a weighted average ensemble, which can be expressed as:

   y = α G^T(x) + (1 − α) DE_i(x)

where x is the testing data, G^T is the FL model, DE_i is the domain expert, α is the weight, and y is the prediction/output.
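The weighted-average ensemble above can be sketched as a simple wrapper module; blending logits rather than, say, softmax outputs, and keeping alpha fixed, are our simplifying assumptions.

```python
import torch
import torch.nn as nn

class MoEEnsemble(nn.Module):
    """Sketch of y = alpha * G^T(x) + (1 - alpha) * DE_i(x)."""

    def __init__(self, fl_model, domain_expert, alpha=0.5):
        super().__init__()
        self.fl_model = fl_model            # global FL model G^T
        self.domain_expert = domain_expert  # local model trained only on the client's data
        self.alpha = alpha

    def forward(self, x):
        return self.alpha * self.fl_model(x) + (1.0 - self.alpha) * self.domain_expert(x)
```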
Meta-learning: A meta-learner trains a model on similar tasks with the aim of adapting to a new but similar task quickly despite limited data for the new task [17]. For FL, meta-learning considers personalization for each client as a similar task [18].

Although these approaches have been studied individually, no study has been conducted to explore the efficacy of combining personalization approaches. As certain archetypes, such as meta-learning, are incompatible with others, they will not be included in our study. The compatible approaches (FT, MTL, KD, MoE) have not been studied on both cross-silo and cross-device FL scenarios, and a comparison across scenarios, tasks, and combinations of personalization approaches has not been done before.

3 Methodology

This study explores how different combinations of personalization approaches impact the performance of FL across various tasks/scenarios. The tasks and datasets used are listed in Table 1.

Table 1: FL Tasks and Datasets

Scenario       Task                   Dataset         Clients
Cross-Silo     Image Classification   Office          3
                                      DomainNet       5
               Text Classification    Cross-Sector    3
                                      Cross-Product   5
Cross-Device   Image Classification   CIFAR-10        100
               Next Word Prediction   Reddit          80,000

3.1 Datasets

As datasets made explicitly for FL are still rare, it is common to retrofit ML datasets into FL ones by dividing them into subsets, each representing a client. Domain adaptation datasets, which are already divided into domains, can also be used, with each domain representing a client. The subsections that follow elaborate on the six datasets we used.

Office Dataset: This dataset contains three domains/clients: Amazon, Webcam, and DSLR, representing the cross-silo FL scenario for image classification. Each client contains images from Amazon or images taken using a webcam or DSLR camera [19]. The unequal number and different origin of images for each client simulate the non-i.i.d. nature of FL data.

DomainNet Dataset: This dataset contains five domains/clients: Infograph, Painting, Quickdraw, Real, and Sketch [20], with each client having a different form of visual representation for the same classes of objects. This dataset is used for the cross-silo FL scenario for the task of image classification. Only a subset of the entire DomainNet dataset was used for this project in the interest of time and available computation resources. The subset has seventeen randomly chosen classes but retains the five domains of the full dataset.

Cross-Product Dataset: This dataset comprises five smaller consumer review datasets of different product categories, each acting as a client: Amazon-branded products, Alexa-branded products, food, phones, and headphones [21-25]. Reviews are rated 1 to 5. The datasets were obtained from Kaggle. This dataset is used for cross-silo FL for text classification.

Cross-Sector Dataset: This dataset comprises three smaller consumer review datasets of different customer service-related sectors, each acting as a client: Amazon, Yelp, and Hotel [21,26,27]. All reviews are rated 1 to 5. The datasets were obtained from Kaggle. This dataset is used for the cross-silo FL setting for the text classification task.

CIFAR-10 FL Dataset: CIFAR-10 [28] is a well-known dataset for image classification. For cross-device FL, the training set is divided into 100 subsets, each acting as a client. Following [29], to simulate non-i.i.d.-ness amongst clients, each client is allocated images from each class using a Dirichlet distribution with α = 0.9. As for evaluating the FL model on each client, unlike the other datasets where each client has its own test set, the original CIFAR-10 test set is used, with the model's per-class accuracy multiplied by the corresponding class's ratio in the client's training set and summed up.
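A sketch of this partitioning and evaluation scheme is shown below. Apart from the Dirichlet parameter of 0.9 and the 100 clients stated above, the helper names and the use of NumPy are our own choices.

```python
import numpy as np

def dirichlet_partition(labels, num_clients=100, num_classes=10, alpha=0.9, seed=0):
    """Split sample indices among clients, drawing per-class client proportions from Dir(alpha)."""
    rng = np.random.default_rng(seed)
    client_indices = [[] for _ in range(num_clients)]
    for c in range(num_classes):
        idx = np.where(labels == c)[0]          # labels: NumPy array of class labels
        rng.shuffle(idx)
        proportions = rng.dirichlet(alpha * np.ones(num_clients))
        cuts = (np.cumsum(proportions)[:-1] * len(idx)).astype(int)
        for client_id, part in enumerate(np.split(idx, cuts)):
            client_indices[client_id].extend(part.tolist())
    return client_indices

def client_weighted_accuracy(per_class_accuracy, client_labels, num_classes=10):
    """Weight the model's per-class test accuracy by each class's share of the client's training set."""
    counts = np.bincount(client_labels, minlength=num_classes)
    ratios = counts / counts.sum()
    return float(np.dot(per_class_accuracy, ratios))
```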
Reddit Dataset: For next-word prediction with cross-device FL, the dataset from [29], which is made up of posts by 80,000 Reddit users from November 2017, is used [30]. A corpus of the 50,000 most frequent words is used for the task, with the rest replaced by the <unk> token. The data for each user was split into training and testing sets in a ratio of 90:10.

3.2 Tasks & Model Architecture

Image Classification: The ResNet18 model architecture [31] was used for image classification tasks, with randomly initialized weights. Stochastic gradient descent (SGD) was used as the optimizer for all experiments, as most federated learning works currently use SGD. The metric was top-1 accuracy.

The FL model was trained for 100 rounds for cross-silo FL, with all clients participating in every round. For cross-device FL, the FL model was trained for 1000 rounds, with each round involving ten randomly selected clients. For both scenarios, local training for each client was for two epochs.

Text Classification: A CNN model with word embeddings for sentence classification [32] was used. The FL model was trained for 20 rounds, with two epochs of local training for each client and all clients participating in every round.

Next-Word Prediction: For next-word prediction, a two-layer LSTM model with 200 hidden units and 10 million parameters was used, following [29]. The FL model was trained for 2000 rounds with 100 randomly selected clients participating in each round. For personalization of the FL model, 8000 clients were randomly selected for the personalization experiments rather than using all clients.
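A model along the lines described above might look like the following sketch. The vocabulary size of 50,000 and the 200 hidden units come from the text; the embedding size and the weight tying (which keeps the parameter count near the reported 10 million) are our assumptions.

```python
import torch
import torch.nn as nn

class NextWordLSTM(nn.Module):
    """Two-layer LSTM language model roughly matching the description in Section 3.2."""

    def __init__(self, vocab_size=50_000, embed_dim=200, hidden_dim=200, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers=num_layers, batch_first=True)
        self.decoder = nn.Linear(hidden_dim, vocab_size)
        # Tying decoder and embedding weights is an assumption on our part;
        # the paper only states the total parameter count.
        self.decoder.weight = self.embedding.weight

    def forward(self, token_ids, hidden=None):
        embedded = self.embedding(token_ids)          # (batch, seq_len, embed_dim)
        output, hidden = self.lstm(embedded, hidden)  # (batch, seq_len, hidden_dim)
        return self.decoder(output), hidden           # logits over the 50k-word vocabulary
```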
3.3 Personalization Approaches

In coming up with combinations of different personalization approaches, there is a need to consider the cross-compatibility of the different approaches and where they augment the traditional FL process. FT is universally compatible and is the basis for the other approaches except for MoE. FB is a modification of FT, so it is universally compatible as well. MTL and KD modify FT/FB's loss function, so they are mutually exclusive in combinations. MoE comes in after local personalization is done through a combination of FT/FB and KD/MTL. As such, the combinations of personalization approaches to be explored can be found in Table 2, and a sketch of how such a combination is assembled follows the table.
Table 2: Combinations of Approaches

Combination   FT   FB   KD   MTL   MoE
1             ✓
2             ✓         ✓
3             ✓              ✓
4                  ✓
5                  ✓    ✓
6                  ✓         ✓
7                                   ✓
8             ✓                     ✓
9             ✓         ✓           ✓
10            ✓              ✓      ✓
11                 ✓                ✓
12                 ✓    ✓           ✓
13                 ✓         ✓      ✓
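The way these pieces compose can be summarized in a sketch like the one below, under our assumptions: a finetuning pass (FT, or FB if the optimizer only receives the unfrozen parameters) with a pluggable loss (plain cross-entropy, or the KD/MTL losses sketched in Section 2.3), optionally wrapped by the MoEEnsemble class from the earlier sketch. All names are placeholders tying the earlier sketches together.

```python
import torch.nn.functional as F

def personalize(fl_model, train_loader, optimizer, epochs=2, loss_fn=None,
                domain_expert=None, alpha=0.5):
    """Sketch of one row of Table 2: FT/FB with a pluggable loss, optionally followed by MoE."""
    loss_fn = loss_fn or (lambda logits, targets: F.cross_entropy(logits, targets))
    fl_model.train()
    for _ in range(epochs):                     # two local epochs, as in Section 3.2
        for inputs, targets in train_loader:
            optimizer.zero_grad()
            loss = loss_fn(fl_model(inputs), targets)
            loss.backward()
            optimizer.step()
    if domain_expert is not None:               # MoE: ensemble with the local domain expert
        return MoEEnsemble(fl_model, domain_expert, alpha=alpha)
    return fl_model
```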

4 Results & Discussion


4.1 Local Models vs. FL Models
As mentioned in Section 1, local models trained only on
individual clients' data can sometimes perform better than FL
models. This is due to the FL model failing to adapt to that
particular client's unique data distribution.
We expect cross-silo FL to see a more significant discrepancy between local and FL models, with local models outperforming FL models to a greater extent than in cross-device FL. This is because each client in cross-silo FL has a sizeable local dataset that can be used to train a decently performing local model, while each client in cross-device FL tends to have far fewer local training samples. The type of client in cross-silo FL, usually a large organization, also implies more significant statistical heterogeneity amongst clients, since different organizations might use different data collection methods, software, and hardware. This contrasts with cross-device FL, where clients tend to use the same kind of application and generate data of a more similar nature.

Our experimental results support the hypothesis above, with all cross-silo tasks having the FL model underperform compared to local models, as seen in Figure 1. For cross-device tasks, the FL model underperformed for less than 1% of clients for the image classification and next-word prediction tasks. Such underperformance for cross-silo tasks is concerning for FL, as it will disincentivize clients from joining the federation. Any client leaving or not joining the federation will have a noticeable effect, since each client contributes a significant proportion of the overall dataset compared to the cross-device scenario.

Figure 1: Performance comparison between local and FL models

4.2 FL with DP and RA

With DP and RA, performance degrades further, as expected and as seen in Figure 2, which compares the average accuracy of tasks for normal FL against DP-FL and RA-FL where applicable. The results suggest that RA-FL causes more degradation of performance than DP-FL in cross-device tasks. This could be due to the non-i.i.d. data, together with median aggregation, which takes the median update belonging to a single client, even though that client might have a skewed distribution of data.

Figure 2: Performance comparison between FL, RA-FL, and DP-FL models

In cross-silo scenarios, the degradation caused by RA-FL is not as significant. This could be due to the small number of clients involved, so there is a higher chance of a client's update being the median in each round, resulting in a greater degree of representation of each client's distribution in the eventual FL model, so average performance does not degrade as much. The results show an inherent trade-off between enhanced privacy/integrity measures and FL models' performance.

4.3 Individual Personalization of FL Models

The results of the experiments described in Section 3 can be found in Table 3 and Table 4 for cross-silo and cross-device FL, respectively. For each column, the combination(s) with the highest accuracy is bolded. The subsections that follow analyze and discuss points of interest in the results.
4.3.1 Finetuning Recovers Performance. As seen in Table 3 and Table 4, finetuning the FL model (FL+FT) helps the FL model's performance recover from the degradation. After finetuning, the average performance of the FL model was also better than that of the local models for all tasks across both cross-silo and cross-device FL. This shows that finetuning as a personalization approach works and can be universally applied. Finetuning is also simple to implement, since it can make use of the same algorithm used for local training during FL, and all clients would be able to implement such an approach without much difficulty. This makes finetuning something that should always be implemented along with FL in real-life applications.

4.3.2 Mixture of Experts – Another Universal Personalization Approach. Mixture of experts (FL+MoE) is a personalization approach that requires an additional local model to be trained for each client. With the training and evaluation algorithms and setup already implemented for FL training, training a local model for MoE only requires additional space and training time from clients. In terms of performance gains, MoE is another personalization approach that yields improvements across both cross-silo and cross-device FL scenarios and tasks, as seen in Table 3 and Table 4. Compared to finetuning, MoE does not overwrite or customize the features learned in FL but rather complements them separately through the weighted average. So, for instances where the testing data contains features that are not commonly found in the training data, the FL model may be weighted higher, and for instances where the testing data contains features found in the training data, the domain expert model may be weighted higher.

4.3.3 Performance of Knowledge Distillation and Multi-Task Learning. Both KD and MTL as implemented for this project are extensions of FT, with the loss function being the main differentiating factor between the three. For both cross-silo and cross-device FL, both approaches seem to have little or no positive impact on performance and sometimes even degrade the personalized FL model. This could be explained, once again, by the non-i.i.d.-ness of the datasets.

Since each client has its own unique distribution of data and features, approaches like KD and MTL, which put further emphasis on deriving learned features from the generalized global FL model, would not be useful. However, it is notable that when these approaches are combined with MoE, there seem to be better improvements in performance. This is elaborated on further in the subsequent sections.

4.4 Combining Personalization Approaches

4.4.1 Combining FT and MoE. Earlier subsections have shown that FT and MoE are universally suitable to be implemented for all FL scenarios and tasks explored in this project. Combining these two approaches alone also gave the best performance out of all combinations tested for five out of eight cross-silo FL setups, as seen in Table 3.

Combining FT and the MoE ensemble allows the best features of both approaches to be combined, with the domain-specific features being used in cases where the domain expert model is very sure of its prediction, and the expanded and finetuned feature space from the finetuned FL model being used in cases where the domain expert model is not too sure of its prediction.

4.4.2 Combinations with FB. For both cross-silo and cross-device FL, just as FL+FB on its own performed worse than FL+FT, combination approaches with FB can be seen in Tables 3 and 4 to not perform as well, with normal FT combinations beating FB combinations most of the time. Even in cases where combinations with FB performed the best, other combinations' performance is not too far off or is equal to it. Hence, FB should not be used in most situations for combination approaches.

Table 3: Average accuracy (%) of clients for cross-silo FL ablation study


Scenario             Cross-Silo
Dataset              Cross-Sector       Cross-Product      DomainNet          Office
Local Model          64.13              71.76              58.36              81.71
Approach             FL       RA-FL     FL       RA-FL     FL       RA-FL     FL       RA-FL
FL                   63.44    60.09     70.43    70.80     45.40    43.31     50.48    48.67
FL + FT              64.58    64.65     72.61    72.94     59.45    59.68     84.31    82.69
FL + FT + KD         63.99    63.30     71.22    71.36     55.94    56.40     83.72    81.82
FL + FT + MTL        64.33    63.30     71.14    71.94     59.57    59.09     83.05    86.92
FL + FB              64.11    63.49     71.16    71.32     56.64    57.38     83.48    77.44
FL + FB + KD         63.48    62.89     70.80    71.16     56.02    56.14     83.48    77.91
FL + FB + MTL        63.73    60.74     70.45    70.82     56.59    56.80     83.48    77.91
FL + MoE             65.90    65.56     73.20    73.24     59.68    59.58     83.01    83.10
FL + FT + MoE        67.01    66.08     73.66    73.70     61.42    61.26     86.01    84.09
FL + FT + KD + MoE   65.91    65.12     73.26    73.20     61.19    60.66     85.56    85.68
FL + FT + MTL + MoE  66.05    65.60     73.29    73.51     61.69    61.43     85.20    87.15
FL + FB + MoE        66.26    65.64     73.24    73.30     61.65    61.01     85.56    84.54
FL + FB + KD + MoE   65.82    65.35     73.26    73.24     61.19    60.77     85.56    85.26
FL + FB + MTL + MoE  65.97    65.69     73.27    73.31     61.20    61.09     86.01    85.14
Table 4: Average accuracy (%) of clients for cross-device FL ablation study
Scenario             Cross-Device
Dataset              CIFAR-10                     Reddit
Local Model          55.65                        4.27
Approach             FL       DP-FL    RA-FL     FL       DP-FL    RA-FL
FL                   91.84    81.40    69.86     19.02    17.46    16.50
FL + FT              94.23    87.02    78.06     19.77    19.00    17.03
FL + FT + KD         89.48    80.35    71.46     20.29    18.69    17.76
FL + FT + MTL        94.16    87.20    79.00     20.38    19.18    17.54
FL + FB              93.77    87.94    79.37     20.33    19.06    17.52
FL + FB + KD         88.56    78.21    78.28     19.70    18.31    17.02
FL + FB + MTL        93.58    87.02    78.07     20.16    18.81    17.39
FL + MoE             92.70    84.01    74.14     19.68    18.22    16.95
FL + FT + MoE        94.38    87.07    78.30     20.16    19.32    17.48
FL + FT + KD + MoE   91.32    91.19    75.28     20.55    18.95    17.94
FL + FT + MTL + MoE  96.05    91.19    79.29     20.81    19.53    18.01
FL + FB + MoE        94.01    88.02    79.54     20.53    19.23    17.69
FL + FB + KD + MoE   93.85    88.02    78.54     20.08    18.66    17.25
FL + FB + MTL + MoE  93.80    88.02    73.69     20.50    19.14    17.63

4.4.3 Combinations with KD and MTL. Although KD and MTL on their own are not effective personalization approaches, when combined with MoE there is a greater degree of performance improvement. This could be due to KD and MTL acting to create a personalized FL model that is more influenced by the global pool of data, since both approaches introduce additional influences in the loss function based on the original global FL model. This effect would allow an even wider distribution of features to be accessible to the MoE ensemble, thus increasing performance to a greater degree than FL+FT+MoE. This effect is more pronounced in cross-device FL, with the KD/MTL combinations obtaining the best performance in five out of six setups, compared to three out of eight setups for cross-silo FL.

Between KD and MTL, MTL appears to be the better personalization approach in terms of performance, with MTL outperforming KD in 39 setups, KD outperforming MTL in 12 setups, and the two being of equal performance in 5 setups. As such, MTL should be used in cross-device FL scenarios and tasks when possible. KD does not appear to be a viable alternative to MTL in terms of performance. For cross-silo FL scenarios and tasks, the suitability of MTL would be limited to tasks such as image classification and not text classification.

4.4.4 Effect of Combination Approaches on RA-FL and DP-FL. All combination approaches explored managed to compensate for the degradation caused by the additional privacy and integrity features of RA-FL and DP-FL. Such an effect would re-incentivize clients which might not have joined, or might have left, when the setup was plain FL without personalization. These individual and combination approaches also do not incur much additional overhead in terms of resources like time and space, and the mechanism for implementing them is already mostly available through the implementation of the FL framework.

Combination approaches clearly provide a better performance gain compared to individual ones, and therefore FL framework implementations should always try to include combination approaches as part of their personalization solutions.

4.4.5 Best Combination Approach for Cross-Silo and Cross-Device FL. Across the eight tasks in cross-silo FL, for image classification tasks the best combination of personalization approaches contains FT/FB, MoE, and MTL. For text classification tasks, the best combination of personalization approaches contains just FT with MoE. Across the six tasks in cross-device FL, the best combination of personalization approaches contains FT, MoE, and MTL.

5 Conclusions and Future Work

The success of federated learning systems is dependent on the number of clients that participate, and this is influenced by the benefits that clients get in terms of performance gains, as well as the protections such systems offer, such as privacy and integrity. We have shown that, due to the statistical heterogeneity present across clients' data and the addition of privacy and integrity protection, the performance of FL systems can suffer, sometimes to the point where non-participation is favored. Personalization of FL models, either through standalone approaches or combined ones, can reverse this performance degradation and even bring additional gains in performance, without the need for significant additional resources. Among the combinations of personalization approaches explored for both cross-silo and cross-device FL, combinations with finetuning, mixture of experts, and multi-task learning gave the best performance gains.

Future work could take the form of further exploration of different FL tasks beyond the domains of computer vision and natural language processing, of more varied personalization approaches that go beyond the typical archetypes presented here, or of vertical FL scenarios, since this project is solely focused on the horizontal FL scenario.
Acknowledgements
This research is supported, in part, by the National Research Foundation, Singapore under its AI Singapore Programme (AISG2-RP-2020-019); the Joint NTU-WeBank Research Centre on Fintech (NWJ-2020-008); the Nanyang Assistant Professorship (NAP); the RIE 2020 Advanced Manufacturing and Engineering Programmatic Fund (A20G8b0102), Singapore; and the SDU-NTU Centre for AI Research (C-FAIR), Shandong University, China. Any opinions, findings and conclusions or recommendations expressed in this material are those of the authors and do not reflect the views of the funding agencies.
REFERENCES [26] "Hotel Reviews | Kaggle", Kaggle.com. [Online]. Available:
[1] H. McMahan, E. Moore, D. Ramage, S. Hampson and B. Arcas. https://www.kaggle.com/datafiniti/hotel-reviews. [Accessed: 6- Mar- 2021]
"Communication-efficient learning of deep networks from decentralized data", [27] "Yelp Reviews Dataset | Kaggle", Kaggle.com. [Online]. Available:
in Proceedings of the 20th International Conference on Artificial Intelligence https://www.kaggle.com/omkarsabnis/yelp-reviews-dataset. [Accessed: 6- Mar-
and Statistics, pages 1273–1282, 2017. Available: 2021]
https://research.google/pubs/pub44822/. [Accessed: 6- Mar- 2021]. [28] A. Krizhevsky, "Learning Multiple Layers of Features from Tiny Images",
[2] P. Kairouz et al., "Advances and Open Problems in Federated Learning", 2009.
Google Research, 2020. [Online]. Available: [29] E. Bagdasaryan and V. Shmatikov, "Differential Privacy Has Disparate Impact
https://research.google/pubs/pub49232/. [Accessed: 6- Mar- 2021]. on Model Accuracy", arXiv.org, 2021. [Online]. Available:
[3] "General Data Protection Regulation (GDPR) Compliance Guidelines", https://arxiv.org/abs/1905.12101. [Accessed: 6- Mar- 2021].
GDPR.eu. [Online]. Available: https://gdpr.eu. [Accessed: 6- Mar- 2021]. [30] "Reddit comments". [Online]. Available:
[4] G. Annas, "HIPAA Regulations — A New Era of Medical-Record Privacy?", https://bigquery.cloud.google.com/dataset/fh-bigquery:reddit_comments.
New England Journal of Medicine, vol. 348, no. 15, pp. 1486-1490, 2003. [Accessed: 6- Mar- 2021]
[5] T. Li, A. Sahu, M. Zaheer, M. Sanjabi, A. Talwalkar and V. Smith, "Federated [31] K. He, X. Zhang, S. Ren, and J. Sun, "Deep Residual Learning for Image
Optimization in Heterogeneous Networks", arXiv.org, 2018. [Online]. Recognition," arXiv.org, 2015, [Online]. Available:
Available: https://arxiv.org/abs/1812.06127. [Accessed: 6- Mar- 2021]. http://arxiv.org/abs/1512.03385. [Accessed: 6- Mar- 2021]
[6] L. Melis, C. Song, E. De Cristofaro and V. Shmatikov, "Exploiting Unintended [32] Y. Kim, "Convolutional Neural Networks for Sentence Classification,"
Feature Leakage in Collaborative Learning", arXiv.org, 2019. [Online]. arXiv.org 2014. [Online]. Available: http://arxiv.org/abs/1408.5882. [Accessed:
Available: https://arxiv.org/abs/1805.04049. [Accessed: 6- Mar- 2021]. 6- Mar- 2021]
[7] T. Yu, E. Bagdasaryan and V. Shmatikov, "Salvaging Federated Learning by
Local Adaptation", arXiv.org, 2021. [Online]. Available:
https://arxiv.org/abs/2002.04758. [Accessed: 6- Mar- 2021].
[8] H. McMahan, D. Ramage, K. Talwar and L. Zhang, "Learning differentially
private recurrent language models", in International Conference on Learning
Representations (ICLR), 2018 [Online]. Available:
https://arxiv.org/abs/1710.06963. [Accessed: 6- Mar- 2021].
[9] X. Chen, T. Chen, H. Sun, Z. Wu and M. Hong, "Distributed training with
heterogeneous data: Bridging median and mean based algorithms", Arxiv.org,
2019. [Online]. Available: https://arxiv.org/pdf/1906.01736v1.pdf. [Accessed:
6- Mar- 2021].
[10] K. Wang, R. Mathews, C. Kiddon, H. Eichner, F. Beaufays and D. Ramage,
"Federated Evaluation of On-device Personalization", arXiv.org, 2019.
[Online]. Available: https://arxiv.org/abs/1910.10252v1. [Accessed: 6- Mar-
2021].
[11] R. Caruana, "Multi-task learning", Cs.cornell.edu, 1997. [Online]. Available:
https://www.cs.cornell.edu/~caruana/mlj97.pdf. [Accessed: 6- Mar- 2021]
[12] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, "Federated multi-task
learning", in Advances in Neural Information Processing Systems, 2017
[Online]. Available: https://proceedings.neurips.cc/paper/7029-federated-multi-
task-learning.pdf. [Accessed: 6- Mar- 2021]
[13] R. French, "Catastrophic forgetting in connectionist networks", Trends in
Cognitive Sciences, vol. 3, no. 4, pp. 128-135, 1999.
[14] J. Kirkpatrick, et al. "Overcoming catastrophic forgetting in neural networks".
Proc. NAS, 114(13):3521–3526, 2017.
[15] G. Hinton, O. Vinyals and J. Dean, "Distilling the Knowledge in a Neural
Network", arXiv.org, 2015. [Online]. Available:
https://arxiv.org/abs/1503.02531v1. [Accessed: 6- Mar- 2021]
[16] D. Peterson, P. Kanani, and V. J. Marathe, "Private Federated Learning with
Domain Adaptation", arXiv.org, 2019. [Online] Available:
http://arxiv.org/abs/1912.06733. [Accessed: 11- Mar- 2021]
[17] C. Finn, P. Abbeel, and S. Levine, "Model-Agnostic Meta-Learning for Fast
Adaptation of Deep Networks," arXiv.org, 2017. [Online]. Available:
http://arxiv.org/abs/1703.03400. [Accessed: 6- Mar- 2021]
[18] Y. Jiang, J. Konečný, K. Rush, and S. Kannan, "Improving Federated Learning
Personalization via Model Agnostic Meta Learning," arXiv.org, 2019. [Online].
Available: http://arxiv.org/abs/1909.12488. [Accessed: 6- Mar- 2021]
[19] "Domain Adaptation - UC Berkeley", Domain Adaptation Project. [Online].
Available: https://people.eecs.berkeley.edu/~jhoffman/domainadapt. [Accessed:
6- Mar- 2021]
