Accelerating DNN Training in Wireless Federated Edge Learning
J. Ren, G. Yu, and G. Ding are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: {renjinke, yuguanding, guangyaoding}@zju.edu.cn). Corresponding author: G. Yu.
Abstract
The training task for classical machine learning models, such as deep neural networks (DNNs), is generally implemented at a remote, computationally adequate cloud center for centralized learning, which is typically time-consuming and resource-hungry. It also incurs serious privacy issues and long communication latency since massive data must be transmitted to the centralized node. To overcome these shortcomings, we consider a newly-emerged framework, namely federated edge learning (FEEL), in which the edge server aggregates the local learning updates instead of users' raw data. Aiming at accelerating the training process while guaranteeing the learning accuracy, we first define a novel performance evaluation criterion, called learning efficiency, and formulate a training acceleration optimization problem in the CPU scenario, where each user device is equipped with a CPU. Closed-form expressions for joint batchsize selection and communication resource allocation are developed and some insightful results are highlighted. Further, we extend our learning framework to the GPU scenario and propose a novel training function to characterize the learning property of general GPU modules. The optimal solution in this case is shown to have a similar structure to that of the CPU scenario, suggesting that our proposed algorithm is applicable in more general systems. Finally, extensive experiments validate our theoretical analysis and demonstrate that our proposal can reduce the training time and improve the learning accuracy simultaneously.
Index Terms
Federated edge learning, learning efficiency, training acceleration, learning accuracy, batchsize
selection, resource allocation.
I. INTRODUCTION
With AlphaGo defeating the world's top Go player and the troika Y. Bengio, G. Hinton, and Y. LeCun winning the 2018 ACM A.M. Turing Award, artificial intelligence (AI) has become
the most cutting-edge technique in both academia and industry communities and is envisioned
as a revolutionary innovation enabling a smart earth in the future [1]. The implementation of
AI in wireless networks is one of the most fundamental research directions, leading the trends
of communication and computation convergence [2]–[4]. The key idea of implementing AI in
wireless networks is leveraging the rich data collected by massive distributed user devices to
learn appropriate AI models for network planning and optimization. Various conceptual and
engineering AI breakthroughs have been applied for wireless network design, such as channel
estimation [5], signal detection [6], and resource allocation [7], [8].
Despite the substantial progress in AI techniques, current learning algorithms demand enormous computation and memory resources for data processing. However, the training data in wireless networks are unevenly distributed over a large number of resource-constrained user devices, where each device only owns a small fraction of the data [9]. These two hostile conditions make it hard to implement AI algorithms on user devices. The conventional solution generally offloads the local training data to a remote cloud center for centralized learning. Nevertheless, this method suffers from two key disadvantages. On the one hand, the latency for data transmission is typically very large because of the limited communication resource. On the other hand, the private information contained in the training data may be leaked since the cloud center can inevitably be attacked by malicious third parties. Hence, the classical cloud-based learning framework is no longer suitable for scenarios where data privacy is of paramount importance, such as intelligent healthcare systems and smart banking systems [10].
To address the first issue, an innovative architecture called mobile edge computing (MEC) has been developed by implementing cloud computation capability at the network edge and migrating learning tasks from the cloud center to the edge server [11]. By this means, the communication latency can be significantly reduced [12], the mobile energy consumption can be extensively saved [13], and the core network congestion can be notably relieved [14]. To overcome the second deficiency, a novel distributed learning framework, namely federated learning (FL), has been recently proposed in [15]. The key idea of this framework is to globally aggregate the local learning updates (gradients or parameters) trained on user devices at a centralized node while keeping the privacy-sensitive raw data at the local devices. In this way, the benefit of shared models trained from the affluent data can be reaped and the computation resources of both user devices and cloud servers can be collectively exploited [16]. Motivated by this, the effective collaboration between MEC and FL, referred to as federated edge learning (FEEL), has
learning efficiency is also developed as the ratio between the global loss decay and the
end-to-end latency, which can well evaluate the system learning performance.
• We theoretically analyze the detailed expression of the learning efficiency in the CPU
scenario and formulate a training acceleration problem under both communication and
learning resource budgets. The closed-form expressions for joint batchsize selection and
communication resource allocation are also derived. Specifically, the optimal batchsize is
proved to scale linearly with the local training speed and to increase with both the training priority ratio and the uplink data rate to the power of −1/2.
• We extend the training acceleration problem to the GPU scenario and develop a new training
function to characterize the relation between the training latency and the batchsize of general
GPU modules. The corresponding solution in this scenario is shown to have a similar structure to that in the CPU scenario, revealing that our proposed algorithm can be applied in more general systems.
• The proposed algorithms in both CPU and GPU scenarios are implemented in software.
Several classical DNN models are used to test the system performance based on a real
image dataset. The experimental results demonstrate that our proposed scheme can attain a
better learning performance than some benchmark schemes.
The rest of this paper is organized as follows. In Section II, we introduce the FEEL system
and establish the DNN model and the channel model. In Section III, we quantitatively analyze
the training process and formulate the training acceleration optimization problem in the CPU
scenario. The closed-form solution for the CPU scenario is developed in Section IV. In Section
V, we extend the training acceleration problem to the GPU scenario and discuss the solution
in this case. Section VI presents the experimental results and the whole paper is concluded in
Section VII.
As depicted in Fig. 1, we consider an FEEL system comprising one edge server and K
distributed single-antenna user devices, denoted by the set K = {1, 2, . . . , K}. A shared DNN
model (e.g., convolutional neural networks, CNN) needs to be collaboratively trained by these
devices.

Fig. 1. The FEEL system: each device performs local gradient calculation on its own dataset and the edge server aggregates the local gradients into the global gradient.

By interacting with its own user, each device collects a number of labelled data samples and constitutes its local dataset D_k = {(x_1, y_1), . . . , (x_{N_k}, y_{N_k})}, where x_i is the training sample and y_i represents the corresponding ground-truth label.
To accomplish the training task, two schemes have been widely employed: 1) centralized learning, i.e., each device directly uploads its raw data to the base station (BS) for global training, and the updated model is then multicast back to each device; 2) individual learning, i.e., each device trains an independent model on its local dataset without any collaboration. The former faces a severe privacy issue since the edge server may inevitably be attacked by malicious third parties. In contrast, the latter is capable of avoiding privacy disclosure but suffers from the isolated-data-island problem, so the learning accuracy and reliability cannot be guaranteed. To deal with the above two issues, we adopt the FEEL scheme to accelerate the training task and improve the learning accuracy simultaneously. In this scheme, the edge server only aggregates the local
gradients without centrally collecting the raw data.1 For convenience, we define the following five steps as a training period, which is performed repeatedly until a satisfactory learning accuracy is achieved. The detailed procedures are summarized as follows, and a short simulation sketch of one period follows the list.

1 Due to the sparsity of the gradient, the communication overhead can be significantly reduced by gradient compression methods [23]. Therefore, we aggregate the gradients instead of the parameters in this paper.
• Step 1 (Local Gradient Calculation): In each training period, say the n-th period, each device first selects the training data from its local dataset, performs the forward-backward propagation algorithm, and then derives the local gradient vector g_k^w[n], where w is the parameter set of the DNN model.
• Step 2 (Local Gradient Uploading): After quantizing and compressing the gradient vector, each device transmits its local gradient to the edge server via a multiple access scheme, such as time division multiple access (TDMA) or orthogonal frequency division multiple access (OFDMA).
• Step 3 (Global Gradient Aggregation): The edge server receives the gradient vectors from
all user devices and then aggregates (averages) them as the global gradient, as
g^w[n] = \frac{1}{|\cup_k D_k|} \sum_{k=1}^{K} |D_k| \, g_k^w[n].   (1)
• Step 4 (Global Gradient Downloading): After finishing the gradient aggregation, the edge
server delivers the global gradient g w [n] to the BS, which broadcasts it to all user devices.
• Step 5 (Local Model Updating): Each device runs the stochastic gradient descent (SGD)
algorithm based on the received global gradient. Mathematically, the local DNN models are updated by
w_k[n+1] = w_k[n] + \eta[n] \, g^w[n], \quad \forall k \in \mathcal{K},   (2)
where \eta[n] denotes the learning rate in the n-th training period.
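For concreteness, the following minimal Python sketch simulates one such training period with NumPy. The quadratic toy loss, dataset sizes, and learning rate are illustrative assumptions rather than the models used in Section VI, and the update sign follows a standard descent step.

import numpy as np

rng = np.random.default_rng(0)
K, p = 4, 10                                   # number of devices and parameters
datasets = [rng.normal(size=(n, p)) for n in (20, 30, 25, 25)]   # toy local datasets D_k
w = np.zeros(p)                                # shared model parameters w[n]
eta = 0.1                                      # learning rate eta[n]

def local_gradient(w, D):
    # Step 1: gradient of a toy quadratic loss 0.5*||D w - 1||^2 / |D_k|, standing in
    # for the forward-backward propagation of a real DNN.
    return D.T @ (D @ w - 1.0) / len(D)

# Steps 1-2: each device computes and "uploads" its local gradient g_k^w[n].
local_grads = [local_gradient(w, D) for D in datasets]

# Step 3: the edge server aggregates the local gradients as in (1),
# weighting each device by its dataset size |D_k|.
sizes = np.array([len(D) for D in datasets])
g_global = sum(s * g for s, g in zip(sizes, local_grads)) / sizes.sum()

# Steps 4-5: the global gradient is broadcast and every device applies the same update of (2).
w = w - eta * g_global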
B. DNN Model
In this work, we take a generalized fully-connected DNN model for analysis. To comprehen-
sively characterize the network structure, we denote L as the number of layers, where the i-th layer is equipped with n_i neurons. Then, the number of weights connecting the i-th and (i+1)-th layers and the number of biases added to the neurons of the (i+1)-th layer are n_i n_{i+1} and n_{i+1}, respectively. Therefore, the total number of parameters (weights and biases) is given by
p = \sum_{i=1}^{L-1} (n_i + 1) \, n_{i+1}.   (3)
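As a quick numerical illustration of (3), the short Python snippet below counts the parameters of a small fully-connected network; the layer widths are arbitrary examples.

# Layer widths n_1, ..., n_L of a toy fully-connected DNN (illustrative values).
n = [784, 128, 64, 10]

# Weights between layers i and i+1 plus biases of layer i+1: (n_i + 1) * n_{i+1}.
p = sum((n[i] + 1) * n[i + 1] for i in range(len(n) - 1))
print(p)  # 109386 parameters for this example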
We assume that the DNN model is deployed in each user device and the edge server. Moreover,
the local loss function of each device that measures the training error is defined as
L_k(w, D_k) = \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} \ell(w, x_i, y_i),   (4)
where ℓ(w, xi , yi ) is the sample-wise loss function that quantifies the prediction error between
the learning output (via input x_i and parameter w) and the ground-truth label y_i. Accordingly,
the global loss function at the edge server can be expressed as
L(w) = \frac{1}{|\cup_k D_k|} \sum_{k \in \mathcal{K}} |D_k| \, L_k(w, D_k).   (5)
The target of the training task is to optimize the parameters towards minimizing the global
loss function L(w) via the SGD algorithm. Further, the gradient vector of each device can be
expressed as
g_k^w = \nabla L_k(w, D_k),   (6)
where \nabla denotes the gradient operator. Note that each parameter has a counterpart gradient entry, so the total number of gradient entries at each device exactly equals p in (3).
C. Communication Model
Without loss of generality, we adopt the typical TDMA method for data transmission in this paper. Let P_k^U denote the transmit power of device k for gradient uploading and h_k^U denote the uplink channel power gain. Accordingly, denote P_k^D as the transmit power of the BS for transmitting the global gradient to device k and h_k^D as the downlink channel power gain.
It should be emphasized that the durations of both uplink and downlink time-slots are relatively
short (e.g., the frame duration of LTE protocol is 10 ms), during which the channel power gains
keep fixed. However, the time duration of one training period is usually on the time scale of seconds because of the high computational complexity of running the SGD algorithm as well as the limited computation resources of user devices. In view of this, each training period will experience multiple time-slots. Since this work focuses on accelerating the training task from the long-term learning perspective, the channel dynamics would not affect the learning performance too much. Toward this end, we employ the average uplink and downlink data rates instead of the instantaneous ones,
where E_h represents the expectation over channel fading, W denotes the system bandwidth, and N_0 denotes the variance of the additive white Gaussian noise (AWGN).
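As an illustration, an average uplink rate can be estimated numerically. The sketch below assumes the standard Shannon formula R_k^U = E_h[W log2(1 + P_k^U h_k^U / (N_0 W))] with Rayleigh small-scale fading; this formula, the device distance, and all numerical values are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(1)
W = 10e6                        # system bandwidth W [Hz]
N0 = 10 ** (-174 / 10) * 1e-3   # noise power spectral density [W/Hz] (assumed)
P_tx = 10 ** (28 / 10) * 1e-3   # uplink transmit power P_k^U = 28 dBm in watts
path_loss_db = 128.1 + 37.6 * np.log10(0.15)     # device at d = 0.15 km (assumed)

# Average over Rayleigh small-scale fading: h = path_loss * |g|^2 with g ~ CN(0, 1).
g = (rng.normal(size=100000) + 1j * rng.normal(size=100000)) / np.sqrt(2)
h = 10 ** (-path_loss_db / 10) * np.abs(g) ** 2

R_uplink = np.mean(W * np.log2(1 + P_tx * h / (N0 * W)))
print(f"average uplink rate ~ {R_uplink / 1e6:.1f} Mbit/s")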
Thus far, we have elaborated the detailed procedures of the FEEL scheme and established
both the learning and communication models. Next, we will analyze the training task and formulate the training acceleration optimization problem. In our work, the edge server is always equipped with a powerful GPU, whereas the training module of each user device can be either a CPU or a GPU. Therefore, we will investigate the CPU and GPU scenarios separately in the sequel.
In this section, we will first investigate the CPU scenario where each device is only equipped
with CPU for DNN training. The training acceleration problem is formulated to maximize the
system learning efficiency. Some insightful results about network planning are also discussed.
To accelerate the training process and achieve a satisfactory learning accuracy, we adopt the mini-batch SGD algorithm in this paper. The major difficulty in performing this algorithm is the selection of the hyperparameter, i.e., the training batchsize, which greatly affects the learning accuracy and is therefore worth optimizing.
To quantitatively assess the training performance, we first define an auxiliary function, namely
global loss decay, as
∆L[n] = L(w[n]) − L(w[n − 1]), (9)
which represents the difference of the global loss function across the n-th training period. For
brevity, we rewrite ∆L[n] as ∆L since our following analysis is based on one training period.
Now we analyze the training performance of the introduced FEEL system. Recall that in one
training period, each device first selects a subset of data, namely one batch for local gradient
calculation. Denote Bk as the number of data involved in the batch of device k. Then the edge
server aggregates the gradient information from all devices and calculates the global gradient.
Therefore, the FEEL system is capable of processing B = \sum_{k=1}^{K} B_k data samples in one training period,
which is referred to as global batchsize. Note that the target of the training task is toward
minimizing the global loss function. Then according to [29], the relation between the global loss
decay function and the global batchsize can be approximately evaluated as
\Delta L = \xi \sqrt{B} = \xi \sqrt{\sum_{k=1}^{K} B_k},   (10)
where ξ is a coefficient determined by the specific structure of the DNN model. We shall remark
that the global loss decay does not increase linearly with the global batchsize. This is because the learning rate should adapt to the batchsize to ensure the convergence of the mini-batch SGD algorithm and to guarantee the learning accuracy as well [30].
• Local Gradient Calculation Latency: Each device runs the forward-backward propagation algorithm over its batch of B_k samples. Denoting C^L as the number of CPU cycles required to process one data sample and f_k^C as the CPU frequency of device k, the local gradient calculation latency can be expressed as
t_k^{L,CPU} = \frac{B_k C^L}{f_k^C}.   (11)
• Local Gradient Upload Latency: The local gradient vector of each device should be quantized and compressed before transmission.2 Denote the average number of quantization bits for each gradient as d and let r denote the compression ratio, defined as the ratio between the compressed gradient data size and the overall raw gradient data size. Then the total data size of each local gradient vector is s = r × d × p. Recall that we adopt TDMA channel access for data transmission. Let T_f^U denote the length of each uplink radio frame (usually 10 ms in LTE standards) and τ_k^U denote the time-slot duration allocated to device k. Therefore, the local gradient upload latency can be expressed as
t_k^U = \frac{s T_f^U}{\tau_k^U R_k^U}.   (12)

2 Note that the computational complexity of the gradient quantization and compression algorithm is very low [23]. Thus, the corresponding latency can be omitted as compared with the local gradient upload latency.
• Global Gradient Download Latency: When the edge server finishes gradient aggregation, it broadcasts the global gradient vector to each device. To be consistent with the uploading procedure, we use d bits to quantize each global gradient entry and leverage the same gradient compression technique. Similarly, let T_f^D denote the length of each downlink radio frame and denote τ_k^D as the time-slot duration allocated to device k for global gradient downloading. Thus, the global gradient download latency can be expressed as
t_k^D = \frac{s T_f^D}{\tau_k^D R_k^D}.   (13)
• Local Model Update Latency: After receiving the global gradient, each device starts to
update its local model via the gradient-descent method, as presented in (2). Denote M^C as the number of CPU cycles that are required for the local model update. Then, the local model update latency is given by
t_k^{M,CPU} = \frac{M^C}{f_k^C}.   (14)
It should be noted that the edge server has powerful GPU modules so that the gradient aggregation latency can be reasonably neglected. Moreover, the gradient aggregation at the edge server cannot start until the local gradient vectors of all devices have been received. Therefore, in one training period, the end-to-end latency of each device can be expressed as
T_k = \max_{k' \in \mathcal{K}} \left\{ t_{k'}^{L,CPU} + t_{k'}^U \right\} + t_k^D + t_k^{M,CPU}, \quad \forall k \in \mathcal{K}.   (15)
Accordingly, the end-to-end latency of the FEEL system in one training period is given by
T = \max_{k \in \mathcal{K}} T_k = \max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}.   (16)
C. Problem Formulation
In this paper, we aim at accelerating the training task while guaranteeing the learning accuracy.
To better reflect the training performance, we first define a novel evaluation criterion as follows.
Definition 1: The training performance of the FEEL system can be evaluated by the learning
efficiency, which is defined as
E = \frac{\Delta L}{T}.   (17)
Remark 1: The learning efficiency can be interpreted as the average global loss decay rate in
the duration of one training period. Therefore, improving the learning efficiency is equivalent to
accelerating the training process. In view of this, the learning efficiency is an appropriate metric
to evaluate the system training performance.
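To tie (10)-(17) together, the following Python sketch evaluates the global loss decay, the end-to-end latency, and the learning efficiency of the CPU scenario for a given batchsize and time-slot allocation. Every numerical value (ξ, C^L, M^C, s, the rates, and the CPU frequencies) is an assumed placeholder rather than a setting from Section VI.

import numpy as np

def learning_efficiency_cpu(B_k, tau_U, tau_D, f_C, R_U, R_D,
                            xi=0.05, C_L=1e7, M_C=1e8, s=8e6,
                            T_f_U=0.01, T_f_D=0.01):
    """Learning efficiency E = Delta_L / T of one CPU training period, per (10)-(17)."""
    B_k, tau_U, tau_D = map(np.asarray, (B_k, tau_U, tau_D))
    f_C, R_U, R_D = map(np.asarray, (f_C, R_U, R_D))

    delta_L = xi * np.sqrt(B_k.sum())              # global loss decay, (10)
    t_L = B_k * C_L / f_C                          # local gradient calculation, (11)
    t_U = s * T_f_U / (tau_U * R_U)                # local gradient upload, (12)
    t_D = s * T_f_D / (tau_D * R_D)                # global gradient download, (13)
    t_M = M_C / f_C                                # local model update, (14)
    T = np.max(t_L + t_U) + np.max(t_D + t_M)      # end-to-end latency, (16)
    return delta_L / T                             # learning efficiency, (17)

# Example: three devices with equal batchsizes and equal time-slot shares.
E = learning_efficiency_cpu(B_k=[32, 32, 32],
                            tau_U=[0.01 / 3] * 3, tau_D=[0.01 / 3] * 3,
                            f_C=[0.7e9, 1.4e9, 2.1e9],
                            R_U=[50e6, 80e6, 100e6], R_D=[100e6] * 3)
print(E)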
Based on the above analysis, the objective of training acceleration can be transformed into the
learning efficiency maximization. Therefore, the optimization problem can be mathematically
formulated as
P1: \max_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; E = \frac{\xi \sqrt{B}}{\max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}},   (18a)
s.t. \quad \sum_{k=1}^{K} \tau_k^U \le T_f^U,   (18b)
\sum_{k=1}^{K} \tau_k^D \le T_f^D,   (18c)
\sum_{k=1}^{K} B_k = B,   (18d)
1 \le B_k \le B^{max}, \; \forall k \in \mathcal{K},   (18e)
where (18b) and (18c) represent the uplink and downlink communication resource limitations,
respectively, (18d) gives the overall number of data that can be processed in one training
period, and (18e) bounds the minimum and maximum batchsizes of each device, where B max is
determined by the memory size and the CPU configuration of each device.
It can be observed that the objective function in problem P1 is complicated and non-convex, making it hard to solve in general. In the next section, we will decompose the problem into two subproblems and devise efficient algorithms to solve them individually.
In this section, we first analyze the mathematical characteristics of problem P1 and then
equivalently decompose it into two subproblems. The closed-form solutions for both subproblems
are derived individually and some insightful results are also discussed.
A. Problem Decomposition
The main challenge in solving problem P1 is that the denominator of the objective function
is non-smooth. For ease of decomposition, (18a) can be rewritten as
\min_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; \frac{\max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}}{\xi \sqrt{B}}.   (19)
It can be seen that the local gradient upload latency tUk is determined only by the uplink
communication resource τkU , while the global gradient download latency tDk is determined only
by the downlink communication resource τkD and is independent of other variables. Meanwhile,
τkU and τkD are generally independent of each other according to (18b) and (18c). Toward this
end, one training period can be divided into two subperiods. The first subperiod aims to perform
local gradient calculation and uploading, which can be formulated as
P2: \min_{\{\tau_k^U, B_k, B\}} \; \max_{k \in \mathcal{K}} \; \frac{t_k^{L,CPU} + t_k^U}{\xi \sqrt{B}},   (20)
s.t. (18b), (18d), (18e), and (18f).
The second subperiod is to download global gradient and update local DNN model, and thus
can be formulated as
P3: \min_{\{\tau_k^D, B\}} \; \max_{k \in \mathcal{K}} \; \frac{t_k^D + t_k^{M,CPU}}{\xi \sqrt{B}},   (21)
s.t. (18c) and (18f).
It needs to be emphasized that the value of B in subproblem P3 should match that in subproblem
P2 . Therefore, the global batchsize can be regarded as a global variable which will be optimized
at last. In view of this, we will analyze the two subproblems in the sequel.
B. Solution to Subproblem P2
We can find that subproblem P2 is generally a min-max optimization problem and is hard to solve directly. To make it more tractable, we first define E^U as the maximum reciprocal of the uplink learning efficiency among all user devices, i.e., E^U = \max_{k \in \mathcal{K}} \left\{ \frac{t_k^{L,CPU} + t_k^U}{\xi \sqrt{B}} \right\}. Then, by the parametric algorithm [31], subproblem P2 can be transformed into
P4: \min_{\{\tau_k^U, B_k, B, E^U\}} \; E^U,   (22a)
s.t. \quad t_k^{L,CPU} + t_k^U \le \xi \sqrt{B} \, E^U, \; \forall k \in \mathcal{K},   (22b)
It can be easily verified that constraint (22b) is non-convex, resulting in a non-convex optimization
problem. Recall that the values of B in subproblems P2 and P3 are identical. Therefore, we first
keep B fixed and then determine the joint optimal batchsize selection and uplink communication
resource allocation policy. When B is fixed, the problem P4 becomes a convex one, as presented
in the following lemma.
Lemma 1: Given the value of B, problem P4 can be converted into a classical convex
optimization problem.
Proof: The proof is straightforward: when the value of B is fixed, the objective function (22a) is convex and all constraints become affine. The detailed derivation is omitted for brevity.
Lemma 1 is essential for solving problem P4 by applying the fractional optimization method.
Moreover, classical convex optimization algorithms can be also utilized. To better characterize
the structure of the solution and gain more insightful results, we first define some significant
auxiliary indicators for each device, as
• Local training speed is defined as the speed of performing the local forward-backward propagation algorithm at each device, i.e., V_k = \frac{f_k^C}{C^L}.
• Training priority ratio is defined as the ratio between the local computation capacity and the overall devices' computation capacity, i.e., \rho_k = \frac{f_k^C}{\sum_{k=1}^{K} f_k^C}.
Based on the above definitions, the optimal solution to problem P4 with fixed B can be described
as follows.
Theorem 1: The joint optimal batchsize selection and uplink communication resource allocation
device trains with the same batchsize as well as being allocated with identical time-slot resource,
we can obtain the upper bound of E U∗ . On the other hand, we relax the constraint (18e) and
further apply the Karush-Kuhn-Tucker (KKT) conditions to achieve the lower bound of E U∗ .
With mathematical analysis, the range of E U∗ could be expressed in the following corollary,
which is proved in Appendix B.
Based on Corollary 1, we further investigate another two extreme cases to determine the range
of µ∗ . The first one corresponds to the scenario where the optimal batchsize for all devices are
Bk∗ = 1, ∀k. The second one corresponds to the scenario of Bk∗ = B max , ∀k. Then the range of
µ∗ can be expressed as the function of E U∗ , which is presented in the following corollary and
is also proved in Appendix B.
Corollary 2: When there exists at least one device whose optimal batchsize is in the interval
(1, B max ), the value of µ∗ should satisfy
\min_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - B^{max} \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\} \le \mu^* \le \max_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - 1 \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (25)
Remark 4: From Corollary 2, we can observe that the values of μ* and E^{U*} are tightly coupled. Note that even though this result is derived under the assumption that there is at least one device whose batchsize lies between 1 and B^{max}, it is still applicable since the extreme cases of B_k = 1, ∀k and B_k = B^{max}, ∀k rarely occur in practice. On the other hand, these two extreme cases correspond to the cases of B = K and B = K B^{max}, whose solutions can be directly obtained via the KKT conditions.
Based on the above analysis, we now develop an effective two-dimensional search algorithm
to solve subproblem P2 , as described in Table I. The main idea is to update the values of training
batchsize and uplink communication resource in each iteration until the time-sharing constraint
and the global batchsize limitation are both satisfied. It is easy to prove that this algorithm has a computational complexity of O(K \log^2(1/\epsilon)), where \epsilon is the maximum tolerance, and thus can be easily implemented in practical systems.
TABLE I
THE SOLUTION TO SUBPROBLEM P2
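One plausible realization of the two-dimensional search in Python is sketched below: the inner bisection tunes μ so that batchsizes of the form in (42) sum to B, and the outer bisection tunes E^U so that the time-slots obtained from a tight constraint (38) share the uplink frame. The constants, search ranges, and iteration counts are assumptions, and the listing is a sketch rather than the exact procedure of Table I.

import numpy as np

def solve_P2(B, V, rho, R_U, s=8e6, T_f_U=0.01, xi=0.05, B_max=128):
    """Two-dimensional bisection search for subproblem P2 with a fixed global batchsize B."""
    V, rho, R_U = map(np.asarray, (V, rho, R_U))
    dL = xi * np.sqrt(B)                                    # Delta_L = xi * sqrt(B)

    def batchsizes(E_U, mu):                                # candidate B_k*, cf. (42)
        raw = (dL * E_U - np.sqrt(dL * s * T_f_U * mu / (rho * R_U))) * V
        return np.clip(raw, 1.0, B_max)

    def slots(E_U, B_k):                                    # tau_k^U from a tight (38)
        return s * T_f_U / (R_U * (dL * E_U - B_k / V))

    def mu_for(E_U):                                        # inner bisection on mu
        lo_m, hi_m = 0.0, 1e12
        for _ in range(80):
            mu = 0.5 * (lo_m + hi_m)
            if batchsizes(E_U, mu).sum() > B:
                lo_m = mu                                   # batchsizes too large -> raise mu
            else:
                hi_m = mu
        return mu

    lo_E, hi_E = 1e-6, 1e6
    for _ in range(80):                                     # outer bisection on E_U
        E_U = 0.5 * (lo_E + hi_E)
        B_k = batchsizes(E_U, mu_for(E_U))
        feasible = (batchsizes(E_U, 0.0).sum() >= B         # B reachable at this E_U
                    and np.all(dL * E_U > B_k / V)          # positive upload-time budget
                    and slots(E_U, B_k).sum() <= T_f_U)     # frame not over-used
        hi_E, lo_E = (E_U, lo_E) if feasible else (hi_E, E_U)
    B_k = batchsizes(hi_E, mu_for(hi_E))
    return B_k, slots(hi_E, B_k), hi_E

# Toy example with three devices (V_k = f_k^C / C^L and rho_k = f_k^C / sum_j f_j^C).
B_k, tau_U, E_U = solve_P2(B=96, V=[70.0, 140.0, 210.0],
                           rho=[1 / 6, 2 / 6, 3 / 6], R_U=[50e6, 80e6, 100e6])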
In this part, we will first solve subproblem P3 and then discuss the global solution to the original problem P1. Similar to subproblem P2, we denote E^D = \max_{k \in \mathcal{K}} \left\{ \frac{t_k^D + t_k^{M,CPU}}{\xi \sqrt{B}} \right\} as
the maximum reciprocal of each downlink learning efficiency among all devices. Then, the
subproblem P3 can be reformulated as
P5: \min_{\{\tau_k^D, B, E^D\}} \; E^D,   (26a)
s.t. \quad t_k^D + t_k^{M,CPU} \le \xi \sqrt{B} \, E^D, \; \forall k \in \mathcal{K},   (26b)
Given the value of B, the mathematical characteristic of problem P5 is similar to that of problem
P4 . Therefore, it can be solved using the KKT conditions. The closed-form solution to P5 is
presented in the following theorem. Note that the proof is similar to that of Theorem 1, which
is omitted here due to page limits.
Theorem 2: The optimal downlink communication resource allocation policy is given by
\tau_k^{D*} = \left[ \frac{s}{R_k^D \left( \Delta L E^{D*} - \frac{M^C}{f_k^C} \right)} \right]^{+} T_f^D,   (27)
where E^{D*} is the optimal value associated with the time-sharing constraint \sum_{k=1}^{K} \tau_k^{D*} = T_f^D.
Remark 5: (Consistent Resource Allocation) Theorem 2 indicates that the optimal downlink resource decreases with the downlink data rate. The reason is similar to that in Theorem 1. Note that by this means, the local model updating of all devices can be accomplished simultaneously. Combining this with the results in Theorem 1, we can conclude that the overall end-to-end latency of each device should be identical, leading to a synchronous training system.
So far, we have presented the closed-form expressions of the joint optimal batchsize selection and uplink/downlink time-slot allocation as functions of the global batchsize B. As a result, problem P1 reduces to a univariate optimization problem in B and thus can be solved effectively by a classical gradient descent algorithm.
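As a simple stand-in for the gradient descent step, the univariate search over B can also be organized as a coarse grid search, as sketched below; uplink_E and downlink_E are hypothetical wrappers that return the optimal objective values E^{U*} and E^{D*} of subproblems P2 and P3 for a given B (e.g., by calling the routine sketched after Table I).

def best_global_batchsize(K, B_max, uplink_E, downlink_E, step=8):
    """Coarse search over the global batchsize B in [K, K*B_max]; each candidate is
    scored by the learning efficiency of (18a), which equals 1 / (E^U* + E^D*)."""
    scores = [(1.0 / (uplink_E(B) + downlink_E(B)), B)
              for B in range(K, K * B_max + 1, step)]
    return max(scores)[1]   # batchsize with the largest learning efficiency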
In this section, we consider the scenario where each device is equipped with a GPU for DNN training. We will first propose a novel GPU training function and then extend the training acceleration problem to this case. In particular, the optimal solution is proved to have a similar structure to that in the CPU scenario.
Unlike the serial mode of general CPUs, GPUs perform computation in the parallel mode.
In this situation, the local gradient calculation latency is no longer proportional to the training
batchsize. Specifically, when the training batchsize is small, the GPU can directly process all the data simultaneously, leading to a constant training latency. On the contrary, the latency grows once the training batchsize exceeds the maximum volume of data that the GPU can process at a time. With this consideration, we propose the following function to capture the relation between the local gradient calculation latency and the training batchsize, as presented in the following assumption and shown in Fig. 2(a).

Fig. 2. Local gradient calculation latency w.r.t. training batchsize of general GPUs.
Assumption 1: In the GPU scenario, the relation between the local gradient calculation latency
and the training batchsize of each device is given by
t_k^{L,GPU} = \begin{cases} t_k^{\ell}, & \text{if } 1 \le B_k \le B_k^{th} \\ t_k^{h}(B_k) \triangleq c_k \left( B_k - B_k^{th} \right) + t_k^{\ell}, & \text{if } B_k^{th} < B_k \le B^{max} \end{cases} \quad \forall k \in \mathcal{K},   (28)
where t_k^{\ell}, c_k, and B_k^{th} are three coefficients determined by the specific DNN structure (e.g., the
number of layers and the number of neurons) and the concrete GPU configuration (e.g., the
video memory size and the number of cores).
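A direct Python transcription of the piecewise model in (28) is given below; the coefficient values are arbitrary placeholders rather than measured ones.

def gpu_gradient_latency(B_k, t_l=0.05, c=0.002, B_th=64, B_max=512):
    """Local gradient calculation latency of (28): constant in the data bound
    region, then growing linearly in the compute bound region."""
    if not 1 <= B_k <= B_max:
        raise ValueError("batchsize outside [1, B_max]")
    if B_k <= B_th:
        return t_l                       # data bound region
    return c * (B_k - B_th) + t_l        # compute bound region

# Latency stays at t_l up to B_th = 64, then grows by c seconds per extra sample.
print([round(gpu_gradient_latency(B), 3) for B in (16, 64, 96, 128)])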
Remark 6: As discussed earlier, when the training batchsize is below a threshold Bkth , the
local gradient calculation latency keeps fixed because of GPU’s parallel execution capability.
In this case, the data samples are inadequate such that the computation resource is not fully
exploited. Therefore, we name it as the data bound region. On the other hand, the local gradient
calculation latency grows linearly when the training batchsize exceeds the threshold. This is because, in this case, the computation resource (e.g., the video memory) cannot support processing all the data samples simultaneously. We shall note that the local gradient calculation latency does not increase in a ladder form. The reason is intuitive: the operations of data reading and transferring between the memory modules and the processing modules require additional time, leading to an overall linear trend. Consequently, we name this region the compute bound region.
To validate the results in Assumption 1, we implement three classic DNN models, i.e., DenseNet, GoogleNet, and PNASNet, on a Linux server equipped with three NVIDIA GeForce GTX 1080 Ti GPUs. The experimental results are depicted in Fig. 2(b). It can be observed that
the local gradient calculation latency of each model first keeps invariant and then increases almost
linearly with the training batchsize. This result shows that the curves obtained via experiments
fit the theoretical model in (28) very well, which demonstrates the applicability of the proposed
GPU training function.
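If one wished to reproduce such a fit from measured latencies, the coefficients of (28) can be estimated with a simple threshold scan plus least squares, as in the sketch below; the measurement arrays are synthetic placeholders, not the values recorded on the GTX 1080 Ti server.

import numpy as np

def fit_gpu_latency_model(batchsizes, latencies):
    """Estimate (t_l, c, B_th) of (28): for each candidate threshold, fit the flat part
    by its mean and the linear tail by least squares, and keep the smallest error."""
    B = np.asarray(batchsizes, dtype=float)
    t = np.asarray(latencies, dtype=float)
    best = None
    for B_th in B[1:-1]:
        flat, tail = B <= B_th, B > B_th
        t_l = t[flat].mean()
        c = np.sum((B[tail] - B_th) * (t[tail] - t_l)) / np.sum((B[tail] - B_th) ** 2)
        pred = np.where(flat, t_l, c * (B - B_th) + t_l)
        err = np.sum((pred - t) ** 2)
        if best is None or err < best[0]:
            best = (err, t_l, c, B_th)
    return best[1:]

# Synthetic measurements that roughly follow the trend described above.
B = np.array([8, 16, 32, 64, 96, 128, 192, 256])
t = np.array([0.051, 0.050, 0.052, 0.050, 0.110, 0.180, 0.310, 0.440])
print(fit_gpu_latency_model(B, t))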
Similar to the CPU scenario, the learning efficiency in the GPU scenario can be also defined
as the ratio between the global loss decay and the end-to-end latency of each training period.
Note that the data sizes of both the local gradient g_k^w and the global gradient g^w in the GPU scenario are identical to those in the CPU scenario. Thus, the local gradient upload latency and the global
gradient download latency can be still expressed as (12) and (13), respectively. Besides, let fkG
(in floating-point operations per second) denote the computation capability of the GPU module
of device k and denote M G as the number of floating-point operations that are required for
model updating. Then, the local model update latency in this case can be expressed as
t_k^{M,GPU} = \frac{M^G}{f_k^G}.   (29)
Based on the above analysis, the end-to-end latency of each training period in the GPU scenario
is given by
T = \max_{k \in \mathcal{K}} \left\{ t_k^{L,GPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,GPU} \right\}.   (30)
The above problem is not easy to solve since the local gradient calculation latency t_k^{L,GPU} is a piecewise function of the training batchsize. Nevertheless, we can exploit the
mathematical characteristic of the learning efficiency in (31a) and derive a necessary condition
of the solution to problem P6 , as summarized in the following lemma.
Lemma 2: In the GPU scenario, the optimal batchsize Bk∗ for each device shall locate in the
compute bound region, i.e., Bkth ≤ Bk∗ ≤ B max .
Proof: Please refer to Appendix C.
The result in Lemma 2 coincides with the empirical result in practical systems that the
computation resource of all devices should be fully exploited to achieve the largest learning
efficiency. Accordingly, the data bound region can be neglected and P6 can be reformulated as
P7: \max_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; \hat{E} = \frac{\xi \sqrt{B}}{\max_{k \in \mathcal{K}} \left\{ t_k^{h} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,GPU} \right\}},   (32a)
s.t. (18b), (18c), (18d), and (18f),   (32b)
With a comprehensive comparison, we can observe that the structures of problems P1 and P7 are identical except for the expressions of the local gradient calculation latency. Fortunately, both kinds of latency are affine in the batchsize, giving the two problems essentially the same structure. Therefore, the GPU scenario reduces to the CPU scenario and the algorithms for solving problem P1 are still applicable with a minor modification of the delay expressions. The detailed derivations and results are omitted due to page limits.
VI. EXPERIMENTS
A. Experiment Settings
In this section, we conduct experiments to validate our theoretical analysis and evaluate the
performance of the proposed algorithms. The system setup is summarized as follows unless
otherwise specified. We consider a single-cell network with a radius of 200 m. The BS is
located in the center of the network and each device is uniformly distributed in the cell. The
channel gains of both the uplink and downlink channels are generated following the path-loss model PL [dB] = 128.1 + 37.6 log10(d [km]), where the small-scale fading is Rayleigh distributed with unit variance. The transmit powers of the uplink and downlink channels are both 28 dBm. The system bandwidth is W = 10 MHz and the channel noise power spectral density is N0 = −174 dBm/Hz. The lengths of each uplink and downlink radio frame are both set as T_f^U = T_f^D = 10 ms according to the LTE standards. For each device, the average number of quantization bits for each gradient is d = 64 and the maximum batchsize is bounded by B^{max} = 128 data samples.
To test the system performance, we choose three commonly-used DNN models: DenseNet121,
ResNet18, and MobileNetV2 for image classification. The well-known dataset CIFAR-10 is used
for model training, which consists of 50000 training images and 10000 validation images of 32
× 32 pixels in 10 classes. Standard data augmentation and preprocessing methods are employed
to improve the test accuracy, and effective deep gradient compression methods are also used
to reduce the communication overhead [23]. Moreover, to simulate the distributed mobile data,
we study two typical ways of partitioning CIFAR-10 data over devices [27]: 1) IID case, where
all data samples are first shuffled and then partitioned into K equal parts, and each device is
assigned one particular part; 2) non-IID case, where all data samples are first sorted by class label and then divided into 2K shards of size 25000/K, and each device is thereafter assigned two shards. Note that the latter is a pathological non-IID partition since most devices only obtain samples from two classes.
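The two partitioning strategies can be reproduced with a few lines of NumPy, as sketched below; the labels array stands in for the CIFAR-10 training labels and the shard bookkeeping follows the description above.

import numpy as np

def partition_iid(labels, K, rng):
    """IID case: shuffle all sample indices and split them into K equal parts."""
    idx = rng.permutation(len(labels))
    return np.array_split(idx, K)

def partition_non_iid(labels, K, rng):
    """Non-IID case: sort the indices by class label, cut them into 2K shards,
    and assign each device two randomly chosen shards."""
    idx = np.argsort(labels)
    shards = np.array_split(idx, 2 * K)
    order = rng.permutation(2 * K)
    return [np.concatenate([shards[order[2 * k]], shards[order[2 * k + 1]]])
            for k in range(K)]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=50000)        # stand-in for CIFAR-10 labels
iid_parts = partition_iid(labels, K=12, rng=rng)
non_iid_parts = partition_non_iid(labels, K=12, rng=rng)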
An important test of the proposed scheme is whether it is applicable for training different
DNN models, i.e., the generalization ability. Toward this end, we implement the above three
DNN models in an FEEL system with K = 12 user devices. Specifically, the CPU frequencies
of all devices are configured as: 4 devices with 0.7 GHz, 4 devices with 1.4 GHz, and 4 devices
with 2.1 GHz.
We test both the loss convergence speed and learning accuracy improvement of each DNN
model with different learning rates. To be more realistic, we implement the non-IID case and
the results are presented in Fig. 3. It can be observed that our proposed scheme is capable
of attaining a satisfactory learning accuracy with a fast loss convergence speed for all three
models. Moreover, the ultimate learning accuracy can be well guaranteed with different learning
rates. This result demonstrates the strong generalization ability of the proposed algorithm, which
promotes its wide implementation in practical systems.
Fig. 3. Global training loss and test accuracy with different learning rates.

In this part, we test the performance improvement of our proposed scheme as compared with three benchmark schemes in the scenarios of K = 6 and K = 12 devices, respectively. Without loss of generality, we take a pre-trained DenseNet121 with 86% initial learning accuracy for the following test. Both the IID and non-IID cases are implemented. The detailed procedures of
each scheme can be summarized as follows.
• Individual Learning: Each device trains its DNN model until the local loss function con-
verges. Then the edge server aggregates all local models, averages the parameters, and
transmits the results to each device.
• Model-Based FL: Each device trains its DNN model on its local dataset for one epoch. Thereafter, the model parameters of each device are transmitted to the edge server for aggregation. The results are then transmitted back to each device for the next training epoch; this repeats until the global loss function converges [18].
• Gradient-Based FL: Each device trains its DNN model by running a one-step SGD algorithm. Then the gradients of each device are transmitted to the edge server for aggregation. After that, the global gradient is transmitted back to each device. This process is repeated until the global loss function converges [32].
It is clear that the main difference between our proposed scheme and the gradient-based FL
scheme is that we take into account the joint batchsize selection and communication resource
allocation.
We use two metrics to evaluate the training performance of each scheme, i.e., the test accuracy
and the training speedup, which is defined as the ratio between the training speed of each scheme
and that of the individual learning scheme.
TABLE II
TRAINING PERFORMANCE OF DIFFERENT SCHEMES

(a) K = 6
Scheme               | IID test accuracy | IID training speedup | Non-IID test accuracy | Non-IID training speedup
Individual learning  | 90.65%            | 1×                   | 89.16%                | 1×
Model-based FL       | 91.28%            | 0.29×                | 90.22%                | 0.37×
Gradient-based FL    | 91.60%            | 0.53×                | 91.51%                | 0.69×
Our proposed scheme  | 91.48%            | 1.09×                | 91.43%                | 1.03×

(b) K = 12
Scheme               | IID test accuracy | IID training speedup | Non-IID test accuracy | Non-IID training speedup
Individual learning  | 90.68%            | 1×                   | 89.91%                | 1×
Model-based FL       | 92.01%            | 0.32×                | 91.16%                | 0.31×
Gradient-based FL    | 92.27%            | 0.68×                | 91.81%                | 0.67×
Our proposed scheme  | 92.34%            | 1.16×                | 92.12%                | 1.26×
Table II(a) shows the training performance of each scheme in the scenario of K = 6 devices.
As compared with the individual learning scheme, our proposed scheme can speed up the training
task by about 1.09 times with about 0.83% learning accuracy improvement in the IID case,
and can speed up the training task by about 1.03 times with about 2.27% learning accuracy
improvement in the non-IID case. These results demonstrate that the proposed scheme can
accelerate the training process and improve the learning accuracy simultaneously. Moreover,
compared with the model-based FL scheme, our scheme can achieve about 3.76 times training
speed in the IID case, and can achieve about 2.78 times training speed in the non-IID case,
both with a slight learning accuracy improvement. The reason can be explained as follows.
The gradients of each device can be deeply compressed because of their sparsity; therefore, the communication overhead for gradient transmission is much smaller than that for parameter
transmission. Besides, the gradients contain more information than parameters and match the
essential operations of the mini-batch SGD algorithm closely. Thus, the learning accuracy of the
proposed scheme is better than that of the model-based scheme. On the other hand, compared
with the gradient-based scheme, our scheme is capable of achieving about 2 times and 1.5
times training speed in the IID and non-IID cases, respectively, both with almost the same
learning accuracy. It is rather intuitive that our scheme optimally selects the training batchsize
to effectively balance the communication overhead and the training workload. On the contrary, the gradient-based scheme trains each local model using the whole dataset without considering the tradeoff between communication and learning.
Fig. 4. Performance comparison among different training schemes in the IID case: (a) global training loss vs. training time; (b) test accuracy vs. training time.
We further test the training performance of each scheme in the scenario of K = 12 devices. The
results are presented in Table II(b). It can be observed that our proposed scheme can still achieve
the fastest training speed as well as attain a satisfactory learning accuracy improvement among
all schemes, which demonstrates the superiority and scalability of our proposal. Moreover, as
the number of devices increases, both the learning accuracy improvement and the training speed
improvement of the proposed scheme become more evident as compared with the individual
learning scheme. It is because with more devices, more computation resources can be exploited
to fully train the DNN model in our scheme, resulting in a higher learning accuracy. In particular,
the accuracy gap between the IID and non-IID cases in the individual learning scheme is larger than those in the other schemes. As we can imagine, the individual learning scheme cannot properly grasp the characteristics of the non-IID data because it performs local training without any collaboration. Conversely, the other three schemes dynamically share the local learning updates so that their accuracy gaps between the IID and non-IID cases are much smaller. This result also suggests the superiority of the proposed scheme in dealing with non-IID data in distributed FL systems.
Fig. 5. Performance comparison among different training schemes in the non-IID case: (a) global training loss vs. training time; (b) test accuracy vs. training time.

An important test is whether the proposed joint batchsize selection and communication resource allocation policy can accelerate the training task while guaranteeing the learning accuracy in the GPU scenario. Therefore, we turn to compare the training performance of the proposed scheme with three baseline schemes in the scenario of K = 6 devices. Similarly, DenseNet121 is tested and both the IID and non-IID cases are experimented. The three baseline schemes are
summarized as follows.
• Online learning scheme, where the training batchsize of each device is Bk = 1, ∀k.
• Full batchsize scheme, where the training batchsize of each device is Bk = B max = 128, ∀k.
• Random batchsize scheme, where each device randomly selects a training batchsize between
1 and 128 in each training period.
Fig. 4 and Fig. 5 show the loss convergence speed and the learning accuracy in the IID
case and non-IID case, respectively. We can observe from all plots that the proposed scheme
can achieve the fastest training speed while attaining the highest learning accuracy at the same
time. The reason is that our proposed scheme can optimally allocate the time-slot resource and effectively select an appropriate training batchsize to balance the communication overhead and the training workload. Further, the learning accuracies of the proposed scheme in the IID and non-IID cases are almost the same given enough training time. This is because, in the proposed scheme, the edge server can generally aggregate the data information with different kinds of distributions. Therefore, the learning accuracy in the non-IID case does not deteriorate as compared with that in the IID case. This also suggests the applicability of the proposed scheme to hyperparameter adjustment in practical systems with non-IID data.
VII. CONCLUSION
This paper aims at accelerating the DNN training task in the framework of FEEL, which not
only protects users’ privacy but also improves the system learning efficiency. The key idea is
to jointly optimize the local training batchsize and wireless resource allocation to achieve a fast
training performance while maintaining the learning accuracy. In the common CPU scenario,
we formulate a training acceleration optimization problem after analyzing the global loss decay
and the end-to-end latency of each training period. The closed-form solution for joint batchsize
selection and communication resource allocation is then developed by problem decomposition
and classical optimization tools. Some insightful results are also discussed to provide meaningful guidelines for hyperparameter adjustment and network planning. To gain more insights, we further
extend our framework to the GPU scenario and develop a new GPU training function. The optimal
solution in this case can be derived by fine-tuning the results in the CPU scenario, indicating that
our proposed algorithm is applicable in more general systems. Our studies in this work provide
an important step towards the implementation of AI in wireless communication systems.
APPENDIX A
PROOF OF THEOREM 1
Let L denote the Lagrangian of problem P4, where λ_k, ∀k, μ, and γ are the Lagrange multipliers associated with constraints (22b), (18b), and (18d), respectively. Denote {B_k^*, τ_k^{U*}} as the optimal solution to P4. Note that the uplink communication resource τ_k^U is non-negative while the batchsize of each device is bounded in the interval [1, B^{max}]. Therefore, applying the KKT conditions gives the following necessary and sufficient conditions:
\frac{\partial \mathcal{L}}{\partial E^{U*}} = 1 - \xi \sqrt{B} \sum_{k=1}^{K} \lambda_k^* = 0,   (34)

\frac{\partial \mathcal{L}}{\partial \tau_k^{U*}} = -\lambda_k^* \frac{s T_f^U}{R_k^U (\tau_k^{U*})^2} + \mu^* \; \begin{cases} \ge 0, & \tau_k^{U*} = 0 \\ = 0, & \tau_k^{U*} > 0 \end{cases} \quad \forall k \in \mathcal{K},   (35)

\frac{\partial \mathcal{L}}{\partial B_k^*} = \lambda_k^* \frac{C^L}{f_k^C} + \gamma^* \; \begin{cases} \ge 0, & B_k^* = 1 \\ = 0, & B_k^* \in (1, B^{max}) \\ \le 0, & B_k^* = B^{max} \end{cases} \quad \forall k \in \mathcal{K},   (36)

\lambda_k^* \left( \frac{B_k^* C^L}{f_k^C} + \frac{s T_f^U}{\tau_k^{U*} R_k^U} - \xi \sqrt{B} E^{U*} \right) = 0, \quad \lambda_k^* \ge 0, \; \forall k \in \mathcal{K},   (37)

\frac{B_k^* C^L}{f_k^C} + \frac{s T_f^U}{\tau_k^{U*} R_k^U} \le \xi \sqrt{B} E^{U*}, \quad \forall k \in \mathcal{K},   (38)

\mu^* \left( \sum_{k=1}^{K} \tau_k^{U*} - T_f^U \right) = 0, \quad \mu^* \ge 0,   (39)

\sum_{k=1}^{K} \tau_k^{U*} \le T_f^U, \quad \tau_k^{U*} \ge 0, \; \forall k \in \mathcal{K},   (40)

\sum_{k=1}^{K} B_k^* = B, \quad 1 \le B_k^* \le B^{max}, \; \forall k \in \mathcal{K}.   (41)

With simple mathematical calculation, we can derive the optimal batchsize selection policy as

B_k^* = \left[ \left( \Delta L E^{U*} - \left( \frac{\Delta L s T_f^U \mu^*}{\rho_k R_k^U} \right)^{\frac{1}{2}} \right) V_k \right]_{1}^{B^{max}}.   (42)

Moreover, the minimum E^U is achieved when the "≤" in (38) is set to "=". Combining (38) and (42), we can obtain the optimal time-slot allocation in Theorem 1.
APPENDIX B
A. Proof of Corollary 1
Since the goal of problem P4 is to minimize E^U, the optimal E^{U*} is no greater than E_h^U, which can be regarded as an upper bound of E^{U*}.
2) Case B (Infinite memory resource)
In this case, the memory of each device is sufficient so that the batchsize limitation (18e) can
be relaxed. Further, the convexity of problem P4 can be preserved, making the KKT conditions
still effective. With derivations similar to those in Appendix A, we can obtain the objective value as
E_\ell^U = \frac{1}{\Delta L} \left( \frac{B C^L}{\sum_{k=1}^{K} f_k^C} + s \left( \sum_{k=1}^{K} \sqrt{\frac{\rho_k}{R_k^U}} \right)^2 \right).   (44)
Due to the fact that the objective value will not increase after constraint relaxation, E^{U*} is no less than E_\ell^U. Thus, E_\ell^U can be viewed as a lower bound of E^{U*}. Combining (43) and (44), we can express the range of E^{U*} as in Corollary 1, which ends the proof.
B. Proof of Corollary 2
According to Theorem 1, we can observe that the optimal batchsize has a threshold-based
structure. Similar to the proof of Corollary 1, we also consider two cases.
1) Case A (Online learning): In this case, we assume that the optimal batchsize of each device is B_k^* = 1, ∀k. According to (23), this case happens when \left( \Delta L E^{U*} - \left( \frac{\Delta L s T_f^U \mu^*}{\rho_k R_k^U} \right)^{\frac{1}{2}} \right) V_k \le 1, ∀k, resulting in

\mu^* \ge \mu_h = \max_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - 1 \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (45)

2) Case B (Full batchsize): In this case, the optimal batchsize of each device is B_k^* = B^{max}, ∀k, which, by a similar argument, leads to

\mu^* \le \mu_\ell = \min_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - B^{max} \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (46)
Based on the above analysis, when there exists at least one device whose batchsize is between
1 and B^{max}, the value of μ^* will satisfy μ_\ell ≤ μ^* ≤ μ_h, which completes the proof.
APPENDIX C
PROOF OF LEMMA 2

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436-444, May 2015.
[2] M. G. Kibria, K. Nguyen, G. P. Villardi, O. Zhao, K. Ishizu, and F. Kojima, “Big data analytics, machine learning, and
artificial intelligence in next-generation wireless networks,” IEEE Access, vol. 6, pp. 32328-32338, May 2018.
[3] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless networks: A comprehensive survey,” IEEE Commun.
Surv. Tut., vol. 20, no. 4, pp. 2595-2621, Fourthquarter 2018.
[4] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Commun. Surv.
Tut., early access, doi: 10.1109/COMST.2019.2904897.
[5] H. He, C. Wen, S. Jin, and G. Y. Li, “Deep learning-based channel estimation for beamspace mmWave massive MIMO
systems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852-855, Oct. 2018.
[6] H. Ye, G. Y. Li, and B.-H. F. Juang, “Power of deep learning for channel estimation and signal detection in OFDM
systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114-117, Feb. 2018.
[7] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allocation in V2V communications,” in Proc. IEEE Int.
Conf. Commun. (ICC), Kansas City, MO, May 2018, pp. 1-6.
[8] M. Lee, Y. Xiong, G. Yu, and G. Y. Li, “Deep neural networks for linear sum assignment problems,” IEEE Wireless
Commun. Lett., vol. 7, no. 6, pp. 962-965, Dec. 2018.
[9] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” CoRR, vol. abs/1812.02858,
2018. [Online]. Available: http://arxiv.org/abs/1812.02858
[10] W. House, “Consumer data privacy in a networked world: A framework for protecting privacy and promoting innovation
in the global digital economy,” J. Privacy and Confidentiality, 2013.
[11] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication
perspective,” IEEE Commun. Surv. Tut., vol. 19, no. 4, pp. 2322-2358, Aug. 2017.
[12] J. Ren, G. Yu, Y. Cai, and Y. He, “Latency optimization for resource allocation in mobile-edge computation offloading,”
IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5506-5519, Aug. 2018.
[13] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,”
IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397-1411, Mar. 2017.
[14] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for
multi-user mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994-6009, Sep. 2017.
[15] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federated learning of deep networks using model averaging,”
CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/1602.05629
[16] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for
on-device intelligence,” CoRR, vol. abs/1610.02527, 2016. [Online]. Available: http://arxiv.org/abs/1610.02527
[17] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an intelligent edge: Wireless communication meets
machine learning,” arXiv preprint arXiv:1809.00343, 2018.
[18] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep
networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Statist. (AISTATS), Fort Lauderdale, FL, USA,
Apr. 2017, pp. 1-10.
[19] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for
improving communication efficiency,” in Proc. of NIPS Workshop on Private Multi-Party Machine Learning, 2016. [Online].
Available: https://arxiv.org/abs/1610.05492
[20] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter
server,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2014, pp. 19-27.
[21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization
and encoding,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 1707-1718.
[22] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth
for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[23] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning
with minimal communication,” arXiv preprint arXiv:1805.08768, 2018.
[24] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint
arXiv:1806.00582, 2018.
[25] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Adv. Neural Inf. Process.
Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 4427-4437.
[26] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “On-device federated learning via blockchain and its latency analysis.” arXiv
preprint arXiv:1808.03949, 2018.
[27] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadband analog aggregation for federated edge learning,” arXiv preprint
arXiv:1812.11494, 2019.
[28] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive
control for resource-constrained distributed machine learning,” in Proc. IEEE Int. Conf. Comput. Comnun. (INFOCOM),
Honolulu, HI, 2018, pp. 63-71.
[29] M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini-batch training for stochastic optimization,” in Proc. ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2014, pp. 661-670.
[30] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. “Accurate, large
minibatch SGD: Training imagenet in 1 hour.” arXiv preprint arXiv:1706.02677, 2017.
[31] S. Geisser and W. M. Johnson, Modes of Parametric Statistical Inference. Hoboken, NJ, USA: Wiley, 2006.
[32] J. Dean et al., “Large scale distributed deep networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Lake Tahoe, NV,
USA, Dec. 2012, pp. 1223-1231.