FjORD
Abstract
Federated Learning (FL) has been gaining significant traction across different ML
tasks, ranging from vision to keyboard predictions. In large-scale deployments,
client heterogeneity is a fact and constitutes a primary problem for fairness, training
performance and accuracy. Although significant efforts have been made towards tackling statistical data heterogeneity, the diversity in the processing capabilities
and network bandwidth of clients, termed system heterogeneity, has remained
largely unexplored. Current solutions either disregard a large portion of available
devices or set a uniform limit on the model’s capacity, restricted by the least capable
participants. In this work, we introduce Ordered Dropout, a mechanism that
achieves an ordered, nested representation of knowledge in deep neural networks
(DNNs) and enables the extraction of lower-footprint submodels without the need for
retraining. We further show that, for linear maps, our Ordered Dropout is equivalent
to SVD. We employ this technique, along with a self-distillation methodology, in
the realm of FL in a framework called FjORD. FjORD alleviates the problem of
client system heterogeneity by tailoring the model width to the client’s capabilities.
Extensive evaluation on both CNNs and RNNs across diverse modalities shows
that FjORD consistently leads to significant performance gains over state-of-the-art
baselines, while maintaining its nested structure.
1 Introduction
Over the past few years, advances in deep learning have revolutionised the way we interact with every-
day devices. Much of this success relies on the availability of large-scale training infrastructures and
the collection of vast amounts of training data. However, users and providers are becoming increas-
ingly aware of the privacy implications of this ever-increasing data collection, leading to the creation
of various privacy-preserving initiatives by service providers [3] and government regulators [10].
Federated Learning (FL) [46] is a relatively new subfield of machine learning (ML) that allows
the training of models without the data leaving the users’ devices; instead, FL allows users to
collaboratively train a model by moving the computation to them. At each round, participating
devices download the latest model and compute an updated model using their local data. These locally
trained models are then sent from the participating devices back to a central server where updates are
* Indicates equal contribution.
† Work done while an intern at Samsung AI Center.
Figure 1: FjORD employs OD to tailor the amount of computation to the capabilities of each participating device.
Figure 2: Ordered vs. Random Dropout. In this example, the left-most features are used by more devices during training, creating a natural ordering to the importance of these features.
aggregated for the next round’s global model. Until now, a lot of research effort has been invested with the
sole goal of maximising the accuracy of the global model [46, 42, 39, 31, 63], while complementary
mechanisms have been proposed to ensure privacy and robustness [6, 14, 47, 48, 27, 4].
A key challenge of deploying FL in the wild is the vast heterogeneity of devices [38], ranging from
low-end IoT to flagship mobile devices. Despite this fact, the widely accepted norm in FL is that
the local models have to share the same architecture as the global model. Under this assumption,
developers typically opt to either drop low-tier devices from training, hence introducing training
bias due to unseen data [30], or limit the global model’s size to accommodate the slowest clients,
leading to degraded accuracy due to the restricted model capacity [8]. In addition to these limitations,
variability in sample sizes, computation load and data transmission speeds further contributes to a
very unbalanced training environment. Finally, the resulting model might not be as efficient as
models specifically tailored to the capabilities of each device tier to meet the minimum processing-
performance requirements [34].
In this work, we introduce FjORD (Fig. 1), a novel adaptive training framework that enables hetero-
geneous devices to participate in FL by dynamically adapting model size – and thus computation,
memory and data exchange sizes – to the available client resources. To this end, we introduce Ordered
Dropout (OD), a mechanism for run-time ordered (importance-based) pruning, which enables us
to extract and train submodels in a nested manner. As such, OD enables all devices to participate
in the FL process independently of their capabilities by training a submodel of the original DNN,
while still contributing knowledge to the global model. Alongside OD, we propose a self-distillation
method from the maximal supported submodel on a device to enhance the feature extraction of
smaller submodels. Finally, our framework has the additional benefit of producing models that can be
dynamically scaled during inference, based on the hardware and load constraints of the device.
Our evaluation shows that FjORD enables significant accuracy benefits over the baselines across
diverse datasets and networks, while allowing for the extraction of submodels of varying FLOPs and
sizes without the need for retraining.
2 Motivation
Despite the progress on the accuracy front, the unique deployment challenges of FL still set a limit to
the attainable performance. FL is typically deployed on either siloed setups, such as among hospitals,
or on mobile devices in the wild [7]. In this work, we focus on the latter setting. Hence, while
cloud-based distributed training uses powerful high-end clients [19], in FL these are commonly
substituted by resource-constrained and heterogeneous embedded devices.
In this respect, FL deployment is currently hindered by the vast heterogeneity of client hardware [66,
28, 7]. On the one hand, different mobile hardware leads to significantly varying processing speed [1],
in turn leading to longer waits upon aggregation of updates (i.e. stragglers). At the same time, devices
of mid and low tiers might not even be able to support larger models, e.g. the model does not fit
in memory or processing is slow, and, thus, are either excluded or dropped upon timeouts from
the training process, together with their unique data. More interestingly, the resources available to
participating devices may also correlate with the demographic and socio-economic profile of their owners,
which makes the exclusion of such clients unfair [30] in terms of participation. Analogously to
device load and heterogeneity, a similar trend can be traced in the downstream (model) and upstream
(updates) network communication in FL, which can be an additional substantial bottleneck for the
training procedure [55].
3 Ordered Dropout
In this paper, we first introduce the tools that act as enablers for heterogeneous federated training.
Concretely, we have devised a mechanism of importance-based pruning for the easy extraction
of subnetworks from the original, specially trained model, each with a different computational
and memory footprint. We name this technique Ordered Dropout (OD), as it orders knowledge
representation in nested submodels of the original network.
More specifically, our technique starts by sampling a value (denoted by p) from a distribution of
candidate values. Each of these values corresponds to a specific submodel, which in turn gets
translated to a specific computational and memory footprint (see Table 1b). Such sampled values
and associations are depicted in Fig. 2. Contrary to conventional Random Dropout (RD), our technique drops
adjacent components of the model instead of random neurons, which translates to computational
benefits³ in today’s linear algebra libraries and to higher accuracy, as shown later.
3.1 Ordered Dropout Mechanics
The proposed OD method is parametrised with respect to: i) the value of the dropout rate p ∈ (0, 1]
per layer, ii) the set of candidate values P, such that p ∈ P, and iii) the sampling method of p over the
set of candidate values, such that p ∼ D_P, where D_P is the distribution over P.
A primary hyperparameter of OD is the dropout rate p, which defines how much of each layer is to be
included, with the rest of the units dropped in a structured and ordered manner. The value of p is
selected by sampling from the dropout distribution D_P, which is represented by a set of discrete values
P = {s_1, s_2, ..., s_|P|} such that 0 < s_1 < ... < s_|P| ≤ 1 and probabilities P(p = s_i) > 0, ∀i ∈ [|P|],
such that ∑_{i=1}^{|P|} P(p = s_i) = 1. For instance, a uniform distribution over P is denoted by p ∼ U_P
(i.e. D = U). In our experiments we use a uniform distribution over the set P = {i/k}_{i=1}^{k}, which we
refer to as U_k (or uniform-k). The discrete nature of the distribution stems from the innately discrete
number of neurons or filters to be selected. The selection of the set P is discussed in the next subsection.
The dropout rate p can be constant across all layers or configured individually per layer l, leading
to p^l ∼ D_P^l. As such an approach opens up the search space dramatically, we refer the reader to NAS
techniques [69] and continue with the same p value across network layers for simplicity, without
hurting the generality of our approach.
Given a p value, a pruned p-subnetwork can be directly obtained as follows. For each⁴ layer l with
width⁵ K_l, the submodel for a given p has all neurons/filters with indices {0, 1, ..., ⌈p · K_l⌉ − 1}
included and {⌈p · K_l⌉, ..., K_l − 1} pruned. Moreover, the unnecessary connections between
pruned neurons/filters are also removed⁶. We denote a pruned p-subnetwork F_p with its weights w_p,
where F and w are the original network and weights, respectively. Importantly, contrary to existing
pruning techniques [18, 35, 49], a p-subnetwork from OD can be directly obtained post-training
without the need to fine-tune, thus eliminating the requirement to access any labelled data.
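To make these mechanics concrete, the following is a minimal PyTorch sketch (not the authors' implementation; the class and helper names are ours) of how a p-subnetwork can be obtained by keeping only the left-most ⌈p·K⌉ units of a fully-connected layer and sampling p from a uniform-k candidate set.

```python
import math
import torch
import torch.nn as nn
import torch.nn.functional as F

# Candidate widths P = {i/k} for k = 5, sampled uniformly (uniform-5).
P = [i / 5 for i in range(1, 6)]

def sample_p() -> float:
    """Sample p ~ U_P over the discrete candidate set."""
    return P[torch.randint(len(P), (1,)).item()]

class ODLinear(nn.Linear):
    """Fully-connected layer whose forward pass keeps only the first
    ceil(p * K) neurons, i.e. an ordered (left-aligned) submodel."""

    def forward(self, x: torch.Tensor, p_in: float = 1.0, p_out: float = 1.0):
        k_in = math.ceil(p_in * self.in_features)     # surviving inputs
        k_out = math.ceil(p_out * self.out_features)  # surviving outputs
        weight = self.weight[:k_out, :k_in]           # contiguous top-left block
        bias = self.bias[:k_out] if self.bias is not None else None
        return F.linear(x[..., :k_in], weight, bias)

# The same parameters serve every width p -- no retraining needed.
layer = ODLinear(64, 32)
p = sample_p()
y = layer(torch.randn(8, 64), p_out=p)   # y has ceil(p * 32) features
```

Because the kept units form a contiguous block, the forward pass remains a dense matrix multiplication over a slice, which is what makes OD compatible with standard linear algebra libraries.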
3.2 Training OD Formulation
We propose two ways to train an OD-enabled network: i) plain OD and ii) knowledge distillation OD
training (OD w/ KD). In the first approach, in each step we first sample p ∼ DP ; then we perform
the forward and backward pass using the p-reduced network Fp ; finally we update the submodel’s
weights using the selected optimiser. Since sampling a p-reduced network provides us significant
computational savings on average, we can exploit this reduction to further boost accuracy. Therefore,
in the second approach we exploit the nested structure of OD, i.e. p1 < p2 =⇒ Fp1 ⊂ Fp2 and
allow for the bigger capacity supermodel to teach the sampled p-reduced network at each iteration
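A minimal sketch of the two training variants follows, assuming a model whose forward pass accepts a width argument p as in the layer sketch above; the distillation loss weighting (alpha, temperature) is illustrative and not the paper's tuned configuration.

```python
import torch
import torch.nn.functional as F

def od_train_step(model, batch, optimizer, sample_p, p_max=1.0,
                  use_kd=False, alpha=0.5, temperature=1.0):
    """One OD step: sample a width p, run the p-submodel, update its weights.
    With use_kd=True, the maximal submodel (width p_max) acts as the teacher
    for the sampled p-reduced student (self-distillation)."""
    x, y = batch
    optimizer.zero_grad()

    p = min(sample_p(), p_max)             # width trainable at this step
    logits_p = model(x, p=p)               # forward through F_p only
    loss = F.cross_entropy(logits_p, y)

    if use_kd:
        with torch.no_grad():              # teacher pass, no gradients
            logits_max = model(x, p=p_max)
        kd = F.kl_div(F.log_softmax(logits_p / temperature, dim=-1),
                      F.softmax(logits_max / temperature, dim=-1),
                      reduction="batchmean") * temperature ** 2
        loss = (1 - alpha) * loss + alpha * kd

    loss.backward()                        # only F_p's weights receive gradients
    optimizer.step()
    return loss.item()
```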
³ OD, through its nested pruning scheme that requires neither additional data structures for bookkeeping nor complex and costly data layout transformations, can capitalise directly on existing and highly optimised dense matrix multiplication libraries.
⁴ Note that p affects the number of output channels/neurons and thus the number of input channels/neurons of the next layer. Furthermore, OD is not applied to the input and last layer, in order to maintain the same dimensionality.
⁵ i.e. neurons for fully-connected layers (linear and recurrent) and filters for convolutional layers. RNN cells can be seen as a set of linear feedforward layers with activation and composition functions.
⁶ For BatchNorm, we maintain a separate set of statistics for every dropout rate p. This has only a marginal effect on the number of parameters and can be used in a privacy-preserving manner [41].
Figure 3: OD w/ KD vs. end-to-end trained submodels (SM) in the centralised setting: accuracy (left and middle panels) and negative perplexity (right panel) as a function of the submodel width p.
intractable at scale. Therefore, we cluster devices of similar capabilities together and subsequently
associate a single p^i_max value with each cluster. This clustering can be done heuristically (i.e. based
on the specifications of the device) or via benchmarking of the model on the actual device, and is
considered a system-design decision for our paper. As smartphones nowadays run a multitude of
simultaneous tasks [43], our framework can further support the modelling of transient device load by
reducing the associated p^i_max, which essentially brings the capabilities of the device to a lower tier at
run time, thus bringing real-time adaptability to FjORD.
Concretely, the discrete candidate values of P depend on i) the number of clusters and corresponding
device tiers, ii) the different load levels being modelled, and iii) the size of the network itself,
i.e. for each tier i there exists a p^i_max beyond which the network cannot be supported by its devices. In this paper, we
treat the former two as invariants (assumed to be given by the service provider), but provide results
across different numbers and distributions of clusters, models and datasets.
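As an illustration of this design decision, the sketch below maps hypothetical device tiers to p_max values and samples from the conditional distribution D_P | D_P ≤ p_max; the cluster names and capacities are assumptions for illustration, not measured values.

```python
import random

P = [i / 5 for i in range(1, 6)]                 # candidate widths (uniform-5)

# Hypothetical tier-to-capacity mapping; in practice p_max^c comes from
# device specifications or on-device benchmarking of the model.
CLUSTER_PMAX = {"low": 0.2, "mid": 0.6, "high": 1.0}

def sample_p_conditional(p_max, candidates=P):
    """Sample p ~ D_P | D_P <= p_max (uniform over the feasible widths)."""
    return random.choice([p for p in candidates if p <= p_max])

# Transient load can be modelled by temporarily lowering a device's p_max,
# effectively demoting it to a lower tier at run time.
p = sample_p_conditional(CLUSTER_PMAX["mid"])    # p in {0.2, 0.4, 0.6}
```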
3.5 Preliminary Results
Here, we present some results to showcase the performance of OD in the centralised non-FL training
setting (i.e. the server has access to all training data) across three tasks, explained in detail in § 5.
Concretely, we run OD with distribution D_P = U_5 (uniform distribution over the set {i/5}_{i=1}^{5}) and
compare it with end-to-end trained submodels (SM) trained in isolation at the given width of the
model. Fig. 3 shows that, across the three datasets, the best attained performance of OD at every
width p is very close to the performance of the baseline models. We extend this comparison against
Random Dropout in the Appendix. We note at this point that the submodel baselines are trained
from scratch, explicitly optimised for that given width with no possibility to jump across widths, while
our OD model was trained using a single training loop and offers the ability to switch between
accuracy-computation points without the need to retrain.
4 FjORD
Building on top of OD, we introduce FjORD, a framework for federated training over
heterogeneous clients. We subsequently describe FjORD’s workflow, further documented in Alg. 1.
As a starting point, the global model architecture, F, is initialised with weights w_0, either randomly
or via a pretrained network. The dropout rate space P is selected along with the distribution D_P with
|P| discrete candidate values, with each p corresponding to a subnetwork of the global model with
varying FLOPs and parameters. Next, the participating devices are clustered into |C_tiers| tiers and a
p^c_max value is associated with each cluster c. The resulting p^c_max represents the maximum capacity of
the network that devices in this cluster can handle without violating a latency or memory constraint.
At the beginning of each communication round t, the set of participating devices S_t is determined,
which either consists of all available clients A_t or contains only a random subset of A_t, based on the
server’s capacity. Next, the server broadcasts the current model to the set of clients S_t, with each client
i receiving w_{p^i_max}. On the client side, each client runs E local iterations and, at each local iteration k,
device i samples p^(i,k) from the conditional distribution D_P | D_P ≤ p^i_max, which accounts for its limited
capability. Subsequently, each client updates the respective weights (w_{p^(i,k)}) of the local submodel
using the FedAvg [46] update rule. In this step, other strategies [39, 63, 31] can be interchangeably
employed. At the end of the local iterations, each device sends its update back to the server.
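Putting the client side together, a condensed sketch of the local phase follows, reusing the helpers sketched in § 3 (od_train_step, sample_p_conditional); it is illustrative of the workflow rather than the reference implementation.

```python
from itertools import islice

def run_local_round(model, loader, optimizer, p_i_max, E, use_kd=True):
    """Client i: E local iterations, each on a p^(i,k)-submodel with
    p^(i,k) <= p_i_max, before returning the update to the server."""
    for batch in islice(loader, E):
        od_train_step(model, batch, optimizer,
                      sample_p=lambda: sample_p_conditional(p_i_max),
                      p_max=p_i_max, use_kd=use_kd)
    # Only the weights of the p_i_max-submodel need to be transmitted;
    # the full state dict is returned here for simplicity.
    return model.state_dict()
```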
Finally, the server aggregates these communicated changes and updates the global model, to be
distributed in the next global federated round to a different subset of devices. Heterogeneity of
devices leads to heterogeneity in the model updates and, hence, we need to account for that in the
global aggregation step. To this end, we utilise the following aggregation rule:

$$ w^{t+1}_{s_j} \setminus w^{t+1}_{s_{j-1}} \;=\; \mathrm{WA}\Big( \big\{\, w^{(i,t,E)}_{s_j} \setminus w^{(i,t,E)}_{s_{j-1}} \,\big\}_{i \in S^j_t} \Big) \qquad (1) $$

where w_{s_j} \setminus w_{s_{j-1}} are the weights that belong to F_{s_j} but not to F_{s_{j-1}}, w^{t+1} are the global weights
at communication round t + 1, w^{(i,t,E)} are the weights on client i at communication round t after E
local iterations, S^j_t = {i ∈ S_t : p^i_max ≥ s_j} is the set of clients that have the capacity to update w_{s_j}, and
WA stands for weighted average, where the weights are proportional to the amount of data on each client.
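To make Eq. (1) concrete, below is a minimal NumPy sketch of the nested aggregation for a single layer along its output-channel axis; it assumes each client update already carries meaningful values only in its first ⌈p^i_max·K⌉ rows, and the variable names are ours.

```python
import math
import numpy as np

def aggregate_nested(global_w, client_ws, client_pmax, client_ndata, widths):
    """Eq. (1) for one layer: for every width band (s_{j-1}, s_j], average the
    corresponding rows over S_t^j = {i : p_max^i >= s_j}, weighted by data size.

    global_w:     (K, ...) array, current global weights of the layer
    client_ws:    dict client_id -> updated weights, same shape as global_w
    client_pmax:  dict client_id -> p_max^i of that client
    client_ndata: dict client_id -> number of local samples (WA weights)
    widths:       sorted candidate widths s_1 < ... < s_|P|
    """
    K = global_w.shape[0]
    new_w = global_w.copy()
    prev = 0
    for s_j in widths:
        hi = math.ceil(s_j * K)                  # rows of w_{s_j} \ w_{s_{j-1}}
        capable = [c for c, p in client_pmax.items() if p >= s_j]   # S_t^j
        if capable:
            total = sum(client_ndata[c] for c in capable)
            new_w[prev:hi] = sum(client_ndata[c] / total * client_ws[c][prev:hi]
                                 for c in capable)
        prev = hi
    return new_w
```

Bands that no participating client can update simply keep the previous global weights, which matches the nested structure of the submodels.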
Communication Savings. In addition to the computational savings (§ 3.4), OD also yields
communication savings. First, for the server-to-client transfer, every device with p^i_max < 1 observes
a reduction of approximately 1/(p^i_max)^2 in the downstream transferred data due to the smaller model
size (§ 3.4). Accordingly, the upstream client-to-server transfer is decreased by the same factor of
1/(p^i_max)^2, as only the gradient updates of the unpruned units are transmitted.
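As a back-of-the-envelope check (with illustrative layer sizes, not the paper's models): for a hidden layer whose fan-in and fan-out both scale with p, a client with p^i_max = 0.5 exchanges roughly a quarter of the full layer's weights in each direction.

```python
import math

def od_layer_params(p, k_in=256, k_out=256):
    """Weight count of the p-submodel block for one hidden layer
    (both fan-in and fan-out shrink with p)."""
    return math.ceil(p * k_out) * math.ceil(p * k_in)

full, half = od_layer_params(1.0), od_layer_params(0.5)
print(full / half)   # 4.0 -> ~1/(p_max)^2 reduction in transferred data
```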
Algorithm 1: FjORD (Proposed Framework)
Input: F, w_0, D_P, T, E
1 for t ← 0 to T − 1 do // Global rounds
2   Server selects a subset of clients S_t ⊂ A_t
3   Server broadcasts the weights of the p^i_max-submodel to each client i ∈ S_t
4   for k ← 0 to E − 1 do // Local iterations
5     ∀i ∈ S_t: device i samples p^(i,k) ∼ D_P | D_P ≤ p^i_max and updates the weights of its local submodel
6   end
7   ∀i ∈ S_t: device i sends the updated weights w^(i,t,E) to the server
8   Server updates w^{t+1} as in Eq. (1)
9 end
5.1 Experimental Setup
[Table (caption only): MACs and parameters per p-reduced network.]
Infrastructure. FjORD was implemented on top of the Flower (v0.14dev) [5] framework and
PyTorch (v1.4.0) [51]. We run all our experiments on a private cloud cluster, consisting of Nvidia
V100 GPUs. To scale to hundreds of clients on a single machine, we optimised Flower so that
clients only allocate GPU resources when actively participating in a federated client round. We report
average performance and the standard deviation across three runs for all experiments. To model client
availability, we run up to 100 Flower clients in parallel and sample 10% at each global round, with
the ability for clients to switch identity at the beginning of each round to overprovision for larger
federated datasets. Furthermore, we model client heterogeneity by assigning each client to one of the
device clusters. We provide the following setups:
[Figure: FjORD (w/ KD and w/ eFD) vs. the eFD baseline; accuracy (left, middle) and negative perplexity (right) across submodel widths p.]
[Figure: Effect of knowledge distillation — FjORD w/ KD vs. plain FjORD; accuracy (left, middle) and negative perplexity (right) across submodel widths p.]
[Figure: FjORD w/ KD under uniform-5 vs. uniform-10 dropout distributions and under drop scales 0.5 vs. 1; accuracy and negative perplexity across submodel widths p.]
Computation-Communication Co-optimisation. A few works aim to co-optimise both the compu-
tational and bandwidth costs. PruneFL [29] proposes an unstructured pruning method. Despite the
similarity to our work in terms of pruning, this method assumes a common pruned model across all
clients at a given round, thus not allowing more powerful devices to update more weights. Hence, the
pruned model needs to meet the constraints of the least capable devices, which severely limits the
model capacity. Moreover, the adopted unstructured sparsity is difficult to translate to processing
speed gains [67]. Federated Dropout [8] randomly sparsifies the global model before sharing it
with the clients. Similarly to PruneFL, Federated Dropout does not consider the system diversity
and distributes the same model size to all clients. Thus, it is restricted by the low-end devices or
excludes them altogether from the FL process. Additionally, Federated Dropout does not translate to
computational benefits at inference time, since the whole model is deployed after federated training.
Contrary to the presented works, our framework embraces the client heterogeneity, instead of treating
it as a limitation, and thus pushes the boundaries of FL deployment in terms of fairness, scalability
and performance by tailoring the model size to the device at hand, both at training and inference time,
in a “train-once-deploy-everywhere” manner.
7 Conclusions & Future Work
In this work, we have introduced FjORD, a federated learning method for training over heterogeneous
devices. To this end, FjORD builds on top of our Ordered Dropout technique as a means of
extracting submodels of smaller footprint from a main model, in a way where training a part of the model also
contributes to training the whole. We show that our Ordered Dropout is equivalent to SVD for linear
mappings and demonstrate that FjORD’s performance in both the local (non-federated) and federated settings exceeds that
of competing techniques, while maintaining flexibility across different environment setups.
In the future, we plan to investigate how FjORD can be deployed and extended to future-gen devices
and models in a life-long manner, the interplay between system and data heterogeneity for OD-based
personalisation as well as alternative dynamic inference techniques for tackling system heterogeneity.
Broader Impact
Our work has a dual broader societal impact: i) on privacy and fairness in participation and ii)
on the environment. On the one hand, centralised DNN training [19] has been the norm for a
long time, mainly facilitated by the advances in server-grade accelerator design and cheap storage.
However, this paradigm comes with a set of disadvantages, both in terms of data privacy and energy
consumption. With mobile and embedded devices becoming more capable and FL becoming a viable
alternative [3, 7], one can leverage the free compute cycles of client devices to train models on-device,
without data ever leaving the device premises. These devices, being typically battery-powered,
operate under a more constrained power envelope compared to data-center accelerators [1]. Moreover,
these devices are already deployed in the wild, but typically not used for training purposes. What
FjORD contributes is the ability for even less capable devices to participate in the training process,
thus increasing the representation of low-tier devices (and by extension the correlated demographic
groups), as well as adding to the overall compute capabilities of the distributed system as a whole,
potentially offsetting part of the carbon footprint of centralised training data centers [52].
However, moving the computation cost from the service provider to the user of the device is a
non-negligible step, and the user should be made aware of what their device is being used for, especially if
they are contributing to the knowledge of a model they do not own. Moreover, while many large data
centers [11, 15] are increasingly dependent on renewable resources for meeting their power demands,
this might not be the case for household electricity, which may impede the sustainability of training
on device, at least in the short run.
Funding Disclosure
This work was entirely performed at and funded by Samsung AI.
References
[1] Mario Almeida, Stefanos Laskaridis, Ilias Leontiadis, Stylianos I. Venieris, and Nicholas D.
Lane. EmBench: Quantifying Performance Variations of Deep Neural Networks Across Modern
Commodity Devices. In The 3rd International Workshop on Deep Learning for Mobile Systems
and Applications (EMDL), 2019.
[2] Mohammad Mohammadi Amiri, Deniz Gunduz, Sanjeev R Kulkarni, and H Vincent Poor.
Federated Learning with Quantized Global Model Updates. arXiv preprint arXiv:2006.10672,
2020.
[3] Apple. Learning with Privacy at Scale. In Differential Privacy Team Technical Report, 2017.
[4] Eugene Bagdasaryan, Andreas Veit, Yiqing Hua, Deborah Estrin, and Vitaly Shmatikov. How
To Backdoor Federated Learning. In Proceedings of the Twenty Third International Conference
on Artificial Intelligence and Statistics (AISTATS), pages 2938–2948, 2020.
[5] Daniel J Beutel, Taner Topal, Akhil Mathur, Xinchi Qiu, Titouan Parcollet, and Nicholas D Lane.
Flower: A Friendly Federated Learning Research Framework. arXiv preprint arXiv:2007.14390,
2020.
[6] Keith Bonawitz et al. Practical Secure Aggregation for Privacy-Preserving Machine Learning. In
Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security
(CCS), 2017.
[7] Keith Bonawitz et al. Towards Federated Learning at Scale: System Design. In Proceedings of
Machine Learning and Systems (MLSys), 2019.
[8] Sebastian Caldas, Jakub Konečný, Brendan McMahan, and Ameet Talwalkar. Expanding the
Reach of Federated Learning by Reducing Client Resource Requirements. In NeurIPS Workshop
on Federated Learning for Data Privacy and Confidentiality, 2018.
[9] Łukasz Dudziak, Mohamed S Abdelfattah, Ravichander Vipperla, Stefanos Laskaridis, and
Nicholas D Lane. ShrinkML: End-to-End ASR Model Compression Using Reinforcement
Learning. In INTERSPEECH, pages 2235–2239, 2019.
[10] European Commission. GDPR: 2018 Reform of EU Data Protection Rules.
[11] Facebook. Software, servers, systems, sensors, and science: Facebook’s recipe for hyperefficient
data centers. https://tech.fb.com/hyperefficient-data-centers/, 2021. Accessed:
January 10, 2022.
[12] Alireza Fallah, Aryan Mokhtari, and Asuman Ozdaglar. Personalized Federated Learning with
Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. Advances in Neural
Information Processing Systems (NeurIPS), 2020.
[13] Biyi Fang, Xiao Zeng, and Mi Zhang. NestDNN: Resource-Aware Multi-Tenant On-Device
Deep Learning for Continuous Mobile Vision. In Proceedings of the 24th Annual International
Conference on Mobile Computing and Networking (MobiCom), pages 115–127, 2018.
[14] Robin C. Geyer, Tassilo J. Klein, and Moin Nabi. Differentially Private Federated Learning: A
Client Level Perspective. In NeurIPS Workshop on Machine Learning on the Phone and other
Consumer Devices (MLPCD), 2017.
[15] Google. Google datacenters efficiency. https://www.google.co.uk/about/datacenters/efficiency/, 2021. Accessed: January 10, 2022.
[16] Yiwen Guo, Anbang Yao, and Yurong Chen. Dynamic Network Surgery for Efficient DNNs. In
Advances in Neural Information Processing Systems (NeurIPS), pages 1387–1395, 2016.
[17] Pengchao Han, Shiqiang Wang, and Kin K Leung. Adaptive Gradient Sparsification for Efficient
Federated Learning: An Online Learning Approach. In IEEE International Conference on
Distributed Computing Systems (ICDCS), 2020.
[18] Song Han, Jeff Pool, John Tran, and William Dally. Learning both Weights and Connections for
Efficient Neural Network. In Advances in Neural Information Processing Systems (NeurIPS),
pages 1135–1143, 2015.
[19] K. Hazelwood et al. Applied Machine Learning at Facebook: A Datacenter Infrastructure
Perspective. In IEEE International Symposium on High Performance Computer Architecture
(HPCA), 2018.
[20] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep Residual Learning for Image
Recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 770–778, 2016.
[21] Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the Knowledge in a Neural Network.
In NeurIPS Deep Learning Workshop, 2014.
[22] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural computation,
9(8):1735–1780, 1997.
[23] Samuel Horváth, Chen-Yu Ho, Ľudovít Horváth, Atal Narayan Sahu, Marco Canini, and
Peter Richtárik. Natural Compression for Distributed Deep Learning. arXiv preprint
arXiv:1905.10988, 2019.
[24] Samuel Horváth, Aaron Klein, Peter Richtárik, and Cédric Archambeau. Hyperparameter
transfer learning with adaptive complexity. In International Conference on Artificial Intelligence
and Statistics, pages 1378–1386. PMLR, 2021.
[25] Samuel Horváth and Peter Richtárik. A Better Alternative to Error Feedback for Communication-
Efficient Distributed Learning. In International Conference on Learning Representations, 2021.
[26] Kevin Hsieh, Amar Phanishayee, Onur Mutlu, and Phillip Gibbons. The Non-IID Data Quagmire
of Decentralized Machine Learning. In International Conference on Machine Learning (ICML),
2020.
[27] R. Hu, Y. Guo, H. Li, Q. Pei, and Y. Gong. Personalized Federated Learning With Differential
Privacy. IEEE Internet of Things Journal (JIOT), 7(10):9530–9539, 2020.
[28] Andrey Ignatov, Radu Timofte, Andrei Kulik, Seungsoo Yang, Ke Wang, Felix Baum, Max Wu,
Lirong Xu, and Luc Van Gool. AI Benchmark: All About Deep Learning on Smartphones in
2019. In International Conference on Computer Vision Workshops (ICCVW), 2019.
[29] Yuang Jiang, Shiqiang Wang, Bong Jun Ko, Wei-Han Lee, and Leandros Tassiulas. Model
Pruning Enables Efficient Federated Learning on Edge Devices. In Workshop on Scalability,
Privacy, and Security in Federated Learning (SpicyFL), NeurIPS, 2020.
[30] Peter Kairouz, H Brendan McMahan, Brendan Avent, Aurélien Bellet, Mehdi Bennis, Arjun Nitin
Bhagoji, Keith Bonawitz, Zachary Charles, Graham Cormode, Rachel Cummings,
et al. Advances and open problems in federated learning. arXiv preprint arXiv:1912.04977,
2019.
[31] Sai Praneeth Karimireddy, Satyen Kale, Mehryar Mohri, Sashank Reddi, Sebastian Stich, and
Ananda Theertha Suresh. SCAFFOLD: Stochastic Controlled Averaging for Federated Learning.
In International Conference on Machine Learning (ICML), 2020.
[32] Jakub Konečný, H. Brendan McMahan, Felix X. Yu, Peter Richtarik, Ananda Theertha Suresh,
and Dave Bacon. Federated Learning: Strategies for Improving Communication Efficiency. In
NeurIPS Workshop on Private Multi-Party Machine Learning, 2016.
[33] Alex Krizhevsky, Geoffrey Hinton, et al. Learning multiple layers of features from tiny images.
2009.
[34] Stefanos Laskaridis, Stylianos I. Venieris, Hyeji Kim, and Nicholas D. Lane. HAPI: Hardware-
Aware Progressive Inference. In International Conference on Computer-Aided Design (ICCAD),
2020.
[35] Namhoon Lee, Thalaiyasingam Ajanthan, and Philip Torr. SNIP: Single-Shot Network Pruning
based on Connection Sensitivity. In International Conference on Learning Representations
(ICLR), 2019.
[36] Daliang Li and Junpu Wang. FedMD: Heterogenous Federated Learning via Model Distillation.
In NeurIPS 2019 Workshop on Federated Learning for Data Privacy and Confidentiality, 2019.
[37] Hao Li, Asim Kadav, Igor Durdanovic, Hanan Samet, and Hans Peter Graf. Pruning Filters for
Efficient ConvNets. In International Conference on Learning Representations (ICLR), 2016.
[38] Tian Li, Anit Kumar Sahu, Ameet Talwalkar, and Virginia Smith. Federated Learning: Challenges,
Methods, and Future Directions. IEEE Signal Processing Magazine, 2020.
[39] Tian Li, Anit Kumar Sahu, Manzil Zaheer, Maziar Sanjabi, Ameet Talwalkar, and Virginia Smith.
Federated Optimization in Heterogeneous Networks. In Proceedings of Machine Learning and
Systems (MLSys), 2020.
[40] Tian Li, Maziar Sanjabi, Ahmad Beirami, and Virginia Smith. Fair Resource Allocation in
Federated Learning. In International Conference on Learning Representations (ICLR), 2020.
[41] Xiaoxiao Li, Meirui Jiang, Xiaofei Zhang, Michael Kamp, and Qi Dou. FedBN: Federated
Learning on Non-IID Features via Local Batch Normalization. In International Conference
on Learning Representations (ICLR), 2021.
[42] Paul Pu Liang, Terrance Liu, Liu Ziyin, Nicholas B Allen, Randy P Auerbach, David Brent,
Ruslan Salakhutdinov, and Louis-Philippe Morency. Think Locally, Act Globally: Federated
Learning with Local and Global Representations. In NeurIPS 2019 Workshop on Federated
Learning, 2019.
[43] Robert LiKamWa and Lin Zhong. Starfish: Efficient Concurrency Support for Computer Vision
Applications. In Proceedings of the 13th Annual International Conference on Mobile Systems,
Applications, and Services (MobiSys), pages 213–226, 2015.
[44] Yujun Lin, Song Han, Huizi Mao, Yu Wang, and Bill Dally. Deep Gradient Compression:
Reducing the Communication Bandwidth for Distributed Training. In International Conference
on Learning Representations (ICLR), 2018.
[45] Bing Luo, Xiang Li, Shiqiang Wang, Jianwei Huang, and Leandros Tassiulas. Cost-Effective
Federated Learning Design. In INFOCOM, 2021.
[46] Brendan McMahan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas.
Communication-Efficient Learning of Deep Networks from Decentralized Data. In Proceedings
of the 20th International Conference on Artificial Intelligence and Statistics (AISTATS), 2017.
[47] H. Brendan McMahan, Daniel Ramage, Kunal Talwar, and Li Zhang. Learning Differentially
Private Recurrent Language Models. In International Conference on Learning Representations
(ICLR), 2018.
[48] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. Exploiting
Unintended Feature Leakage in Collaborative Learning. In IEEE Symposium on Security and
Privacy (SP), pages 691–706, 2019.
[49] Pavlo Molchanov, Arun Mallya, Stephen Tyree, Iuri Frosio, and Jan Kautz. Importance
Estimation for Neural Network Pruning. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition (CVPR), pages 11264–11272, 2019.
[50] Takayuki Nishio and Ryo Yonetani. Client Selection for Federated Learning with Heterogeneous
Resources in Mobile Edge. In IEEE International Conference on Communications (ICC), 2019.
[51] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan,
Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, Alban Desmaison, Andreas
Kopf, Edward Yang, Zachary DeVito, Martin Raison, Alykhan Tejani, Sasank Chilamkurthy,
Benoit Steiner, Lu Fang, Junjie Bai, and Soumith Chintala. PyTorch: An Imperative Style,
High-Performance Deep Learning Library. In Advances in Neural Information Processing
Systems (NeurIPS), pages 8026–8037, 2019.
[52] Xinchi Qiu, Titouan Parcollet, Daniel J. Beutel, Taner Topal, Akhil Mathur, and Nicholas D.
Lane. A first look into the carbon footprint of federated learning. CoRR, abs/2010.06537, 2020.
[53] Aditya Rajagopal, Diederik Vink, Stylianos Venieris, and Christos-Savvas Bouganis. Multi-
Precision Policy Enforced Training (MuPPET) : A Precision-Switching Strategy for Quantised
Fixed-Point Training of CNNs. In Proceedings of the 37th International Conference on Machine
Learning (ICML), pages 7943–7952, 2020.
[54] Oren Rippel, Michael Gelbart, and Ryan Adams. Learning Ordered Representations with
Nested Dropout. In International Conference on Machine Learning (ICML), pages 1746–1754,
2014.
[55] F. Sattler, S. Wiedemann, K. R. Müller, and W. Samek. Robust and Communication-Efficient
Federated Learning From Non-i.i.d. Data. IEEE Transactions on Neural Networks and Learning
Systems (TNNLS), 31(9):3400–3413, 2020.
[56] Reza Shokri and Vitaly Shmatikov. Privacy-Preserving Deep Learning. In Proceedings of
the 22nd ACM SIGSAC Conference on Computer and Communications Security (CCS), pages
1310–1321, 2015.
[57] Sidak Pal Singh and Martin Jaggi. Model Fusion via Optimal Transport. Advances in Neural
Information Processing Systems (NeurIPS), 33, 2020.
[58] Virginia Smith, Chao-Kai Chiang, Maziar Sanjabi, and Ameet S Talwalkar. Federated Multi-
Task Learning. In Advances in Neural Information Processing Systems (NeurIPS), 2017.
[59] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov.
Dropout: A Simple Way to Prevent Neural Networks from Overfitting. Journal of Machine
Learning Research (JMLR), 15(56):1929–1958, 2014.
[60] Martin J Wainwright, Michael Jordan, and John C Duchi. Privacy Aware Learning. In Advances
in Neural Information Processing Systems (NeurIPS), 2012.
[61] Hongyi Wang, Scott Sievert, Shengchao Liu, Zachary Charles, Dimitris Papailiopoulos, and
Stephen Wright. ATOMO: Communication-Efficient Learning via Atomic Sparsification.
Advances in Neural Information Processing Systems (NeurIPS), 2018.
[62] Hongyi Wang, Mikhail Yurochkin, Yuekai Sun, Dimitris Papailiopoulos, and Yasaman Khazaeni.
Federated Learning with Matched Averaging. In International Conference on Learning
Representations (ICLR), 2020.
[63] Jianyu Wang, Qinghua Liu, Hao Liang, Gauri Joshi, and H Vincent Poor. Tackling the Objective
Inconsistency Problem in Heterogeneous Federated Optimization. Advances in Neural
Information Processing Systems (NeurIPS), 2020.
[64] Kuan Wang, Zhijian Liu, Yujun Lin, Ji Lin, and Song Han. HAQ: Hardware-Aware Automated
Quantization with Mixed Precision. In Proceedings of the IEEE conference on Computer Vision
and Pattern Recognition (CVPR), pages 8612–8620, 2019.
[65] Shiqiang Wang, Tiffany Tuor, Theodoros Salonidis, Kin K Leung, Christian Makaya, Ting
He, and Kevin Chan. Adaptive Federated Learning in Resource Constrained Edge Computing
Systems. IEEE Journal on Selected Areas in Communications (JSAC), 37(6), 2019.
[66] C. Wu et al. Machine Learning at Facebook: Understanding Inference at the Edge. In IEEE
International Symposium on High Performance Computer Architecture (HPCA), 2019.
[67] Zhuliang Yao, Shijie Cao, Wencong Xiao, Chen Zhang, and Lanshun Nie. Balanced Sparsity
for Efficient DNN Inference on GPU. In AAAI Conference on Artificial Intelligence (AAAI),
volume 33, pages 5676–5683, 2019.
[68] Mikhail Yurochkin, Mayank Agarwal, Soumya Ghosh, Kristjan Greenewald, Nghia Hoang,
and Yasaman Khazaeni. Bayesian Nonparametric Federated Learning of Neural Networks. In
International Conference on Machine Learning (ICML), pages 7252–7261. PMLR, 2019.
[69] Barret Zoph and Quoc Le. Neural Architecture Search with Reinforcement Learning. In
International Conference on Learning Representations (ICLR), 2017.