Accelerating DNN Training in Wireless Federated Edge Learning
J. Ren, G. Yu, and G. Ding are with the College of Information Science and Electronic Engineering, Zhejiang University, Hangzhou 310027, China (e-mail: {renjinke, yuguanding, guangyaoding}@zju.edu.cn). Corresponding author: G. Yu.
Abstract
The training task for classical machine learning models, such as deep neural networks (DNNs), is generally implemented at a remote, computationally adequate cloud center for centralized learning, which is typically time-consuming and resource-hungry. It also incurs serious privacy issues and long communication latency since massive data must be transmitted to the centralized node. To overcome these shortcomings, we consider a newly-emerged framework, namely federated edge learning (FEEL), in which the edge server aggregates the local learning updates instead of users' raw data. Aiming at accelerating the training process while guaranteeing the learning accuracy, we first define a novel performance evaluation criterion, called learning efficiency, and formulate a training acceleration optimization problem in the CPU scenario, where each user device is equipped with a CPU. Closed-form expressions for joint batchsize selection and communication resource allocation are developed and some insightful results are highlighted. Further, we extend our learning framework to the GPU scenario and propose a novel training function to characterize the learning property of general GPU modules. The optimal solution in this case is shown to have a similar structure to that of the CPU scenario, suggesting that our proposed algorithm is applicable in more general systems. Finally, extensive experiments validate our theoretical analysis and demonstrate that our proposal can reduce the training time and improve the learning accuracy simultaneously.
Index Terms
Federated edge learning, learning efficiency, training acceleration, learning accuracy, batchsize
selection, resource allocation.
I. INTRODUCTION
With AlphaGo defeating the world's top Go player and the troika Y. Bengio, G. Hinton, and Y. LeCun winning the 2018 ACM A.M. Turing Award, artificial intelligence (AI) has become
the most cutting-edge technique in both academia and industry communities and is envisioned
as a revolutionary innovation enabling a smart earth in the future [1]. The implementation of
AI in wireless networks is one of the most fundamental research directions, leading the trends
of communication and computation convergence [2]–[4]. The key idea of implementing AI in
wireless networks is leveraging the rich data collected by massive distributed user devices to
learn appropriate AI models for network planning and optimization. Various conceptual and
engineering AI breakthroughs have been applied for wireless network design, such as channel
estimation [5], signal detection [6], and resource allocation [7], [8].
Despite the substantial progress in AI techniques, current learning algorithms demand enormous computation and memory resources for data processing. However, the training data in wireless networks are unevenly distributed over a large number of resource-constrained user devices, where each device only owns a small fraction of the data [9]. These two hostile conditions make it hard to implement AI algorithms on user devices. The conventional solution generally offloads the local training data to a remote cloud center for centralized learning. Nevertheless, this method suffers from two key disadvantages. On the one hand, the latency for data transmission is typically very large because of the limited communication resource. On the other hand, the private information contained in the training data may be leaked since the cloud center can inevitably be attacked by malicious third parties. Hence, the classical cloud-based learning framework is no longer suitable for scenarios where data privacy is of paramount importance, such as intelligent healthcare systems and smart banking systems [10].
To address the first issue, an innovative architecture called mobile edge computing (MEC) has been developed by implementing cloud computation capability at the network edge and migrating learning tasks from the cloud center to the edge server [11]. By this means, the communication latency can be significantly reduced [12], the mobile energy consumption can be extensively saved [13], and the core network congestion can be notably relieved [14]. To overcome the second deficiency, a novel distributed learning framework, namely federated learning (FL), has been recently proposed in [15]. The key idea of this framework is to globally aggregate the local learning updates (gradients or parameters) trained on user devices at a centralized node while keeping the privacy-sensitive raw data at the local devices. In this way, the benefit of shared models trained from the affluent data can be reaped and the computation resources of both user devices and cloud servers can be collectively exploited [16]. Motivated by this, the effective collaboration between MEC and FL, referred to as federated edge learning (FEEL), has
learning efficiency is also developed as the ratio between the global loss decay and the
end-to-end latency, which can well evaluate the system learning performance.
• We theoretically analyze the detailed expression of the learning efficiency in the CPU
scenario and formulate a training acceleration problem under both communication and
learning resource budgets. The closed-form expressions for joint batchsize selection and
communication resource allocation are also derived. Specifically, the optimal batchsize is
proved to scale linearly with the local training speed and to increase with both the training priority ratio and the uplink data rate to the power of −1/2.
• We extend the training acceleration problem to the GPU scenario and develop a new training
function to characterize the relation between the training latency and the batchsize of general
GPU modules. The corresponding solution in this scenario is shown to have a similar structure to that in the CPU scenario, revealing that our proposed algorithm can be applied in more general systems.
• The proposed algorithms in both CPU and GPU scenarios are implemented in software.
Several classical DNN models are used to test the system performance based on a real
image dataset. The experimental results demonstrate that our proposed scheme can attain a
better learning performance than some benchmark schemes.
The rest of this paper is organized as follows. In Section II, we introduce the FEEL system
and establish the DNN model and the channel model. In Section III, we quantitatively analyze
the training process and formulate the training acceleration optimization problem in the CPU
scenario. The closed-form solution for the CPU scenario is developed in Section IV. In Section
V, we extend the training acceleration problem to the GPU scenario and discuss the solution
in this case. Section VI presents the experimental results and the whole paper is concluded in
Section VII.
As depicted in Fig. 1, we consider an FEEL system comprising one edge server and K
distributed single-antenna user devices, denoted by the set K = {1, 2, . . . , K}. A shared DNN
model (e.g., convolutional neural networks, CNN) needs to be collaboratively trained by these
devices.

Fig. 1. The FEEL system: each device performs local gradient calculation on its own dataset and the edge server aggregates the local gradients into the global gradient.

By interacting with its own user, each device collects a number of labelled data samples and constitutes its local dataset D_k = {(x_1, y_1), . . . , (x_{N_k}, y_{N_k})}, where x_i is the training sample and y_i represents the corresponding ground-truth label.
To accomplish the training task, two schemes have been widely employed: 1) centralized learning, i.e., each device directly uploads its raw data to the base station (BS) for global training, and the updated model is then multicast back to each device; 2) individual learning, i.e., each device trains an independent model on its local dataset without any collaboration. The former faces a severe privacy issue since the edge server may inevitably be attacked by malicious third parties. In contrast, the latter is capable of avoiding privacy disclosure but suffers from the isolated-data-island problem, so the learning accuracy and reliability cannot be guaranteed. To deal with the above two issues, we adopt the FEEL scheme to accelerate the training task and improve the learning accuracy simultaneously. In this scheme, the edge server only aggregates the local
gradients without centrally collecting the raw data.1 For convenience, we define the following five steps as a training period, which is performed repeatedly until a satisfactory learning accuracy is achieved. The detailed procedures are summarized as follows, and a short simulation sketch of one period follows the list.

1 Due to the sparsity of the gradient, the communication overhead can be significantly reduced by gradient compression methods [23]. Therefore, we aggregate the gradients instead of the parameters in this paper.
• Step 1 (Local Gradient Calculation): In each training period, say the n-th period, each device first selects the training data from its local dataset, performs the forward-backward propagation algorithm, and then derives the local gradient vector g_k^w[n], where w is the parameter set of the DNN model.
• Step 2 (Local Gradient Uploading): After quantizing and compressing the gradient vector, each device transmits its local gradient to the edge server via a multiple access scheme, such as time division multiple access (TDMA) or orthogonal frequency division multiple access (OFDMA).
• Step 3 (Global Gradient Aggregation): The edge server receives the gradient vectors from
all user devices and then aggregates (averages) them as the global gradient, as
g^w[n] = \frac{1}{|\cup_k D_k|} \sum_{k=1}^{K} |D_k| \, g_k^w[n].   (1)
• Step 4 (Global Gradient Downloading): After finishing the gradient aggregation, the edge
server delivers the global gradient g w [n] to the BS, which broadcasts it to all user devices.
• Step 5 (Local Model Updating): Each device runs the stochastic gradient descent (SGD)
algorithm based on the received global gradient. Mathematically, the local DNN models are updated by
w_k[n+1] = w_k[n] + \eta[n] \, g^w[n], \quad \forall k \in \mathcal{K},   (2)
where \eta[n] denotes the learning rate in the n-th training period.
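For concreteness, the following minimal Python sketch simulates one such training period with NumPy. The quadratic toy loss, dataset sizes, and learning rate are illustrative assumptions rather than the models used in Section VI, and the update sign follows a standard descent step.

import numpy as np

rng = np.random.default_rng(0)
K, p = 4, 10                                   # number of devices and parameters
datasets = [rng.normal(size=(n, p)) for n in (20, 30, 25, 25)]   # toy local datasets D_k
w = np.zeros(p)                                # shared model parameters w[n]
eta = 0.1                                      # learning rate eta[n]

def local_gradient(w, D):
    # Step 1: gradient of a toy quadratic loss 0.5*||D w - 1||^2 / |D_k|, standing in
    # for the forward-backward propagation of a real DNN.
    return D.T @ (D @ w - 1.0) / len(D)

# Steps 1-2: each device computes and "uploads" its local gradient g_k^w[n].
local_grads = [local_gradient(w, D) for D in datasets]

# Step 3: the edge server aggregates the local gradients as in (1),
# weighting each device by its dataset size |D_k|.
sizes = np.array([len(D) for D in datasets])
g_global = sum(s * g for s, g in zip(sizes, local_grads)) / sizes.sum()

# Steps 4-5: the global gradient is broadcast and every device applies the same update of (2).
w = w - eta * g_global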
B. DNN Model
In this work, we take a generalized fully-connected DNN model for analysis. To comprehen-
sively characterize the network structure, we denote L as the number of layers, where the i-th layer is equipped with n_i neurons. Then, the number of weights connecting the i-th and (i+1)-th layers and the number of biases added to the neurons of the (i+1)-th layer are n_i n_{i+1} and n_{i+1}, respectively. Therefore, the total number of parameters (weights and biases) is given by
p = \sum_{i=1}^{L-1} (n_i + 1) \, n_{i+1}.   (3)
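As a quick numerical illustration of (3), the short Python snippet below counts the parameters of a small fully-connected network; the layer widths are arbitrary examples.

# Layer widths n_1, ..., n_L of a toy fully-connected DNN (illustrative values).
n = [784, 128, 64, 10]

# Weights between layers i and i+1 plus biases of layer i+1: (n_i + 1) * n_{i+1}.
p = sum((n[i] + 1) * n[i + 1] for i in range(len(n) - 1))
print(p)  # 109386 parameters for this example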
We assume that the DNN model is deployed in each user device and the edge server. Moreover,
the local loss function of each device that measures the training error is defined as
L_k(w, D_k) = \frac{1}{|D_k|} \sum_{(x_i, y_i) \in D_k} \ell(w, x_i, y_i),   (4)
where ℓ(w, xi , yi ) is the sample-wise loss function that quantifies the prediction error between
the learning output (via input x_i and parameter w) and the ground-truth label y_i. Accordingly,
the global loss function at the edge server can be expressed as
L(w) = \frac{1}{|\cup_k D_k|} \sum_{k \in \mathcal{K}} |D_k| \, L_k(w, D_k).   (5)
The target of the training task is to optimize the parameters towards minimizing the global
loss function L(w) via the SGD algorithm. Further, the gradient vector of each device can be
expressed as
g_k^w = \nabla L_k(w, D_k),   (6)
where \nabla denotes the gradient operator. Note that each parameter has a counterpart gradient entry, so the total number of gradient entries at each device exactly equals p in (3).
C. Communication Model
Without loss of generality, we adopt the typical TDMA method for data transmission in this paper. Let P_k^U denote the transmit power of device k for gradient uploading and h_k^U denote the uplink channel power gain. Accordingly, denote P_k^D as the transmit power of the BS for transmitting the global gradient to device k and h_k^D as the downlink channel power gain.
It should be emphasized that the durations of both uplink and downlink time-slots are relatively
short (e.g., the frame duration of LTE protocol is 10 ms), during which the channel power gains
keep fixed. However, the time duration of one training period is usually on the time scale of seconds because of the high computational complexity of running the SGD algorithm as well as the limited computation resources of user devices. In view of this, each training period will experience multiple time-slots. Since this work focuses on accelerating the training task from the long-term learning perspective, the channel dynamics would not affect the learning performance too much. Toward this end, we employ the average uplink and downlink data rates instead of the instantaneous ones,
where E_h represents the expectation over channel fading, W denotes the system bandwidth, and N_0 denotes the variance of the additive white Gaussian noise (AWGN).
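As an illustration, an average uplink rate can be estimated numerically. The sketch below assumes the standard Shannon formula R_k^U = E_h[W log2(1 + P_k^U h_k^U / (N_0 W))] with Rayleigh small-scale fading; this formula, the device distance, and all numerical values are assumptions made only for illustration.

import numpy as np

rng = np.random.default_rng(1)
W = 10e6                        # system bandwidth W [Hz]
N0 = 10 ** (-174 / 10) * 1e-3   # noise power spectral density [W/Hz] (assumed)
P_tx = 10 ** (28 / 10) * 1e-3   # uplink transmit power P_k^U = 28 dBm in watts
path_loss_db = 128.1 + 37.6 * np.log10(0.15)     # device at d = 0.15 km (assumed)

# Average over Rayleigh small-scale fading: h = path_loss * |g|^2 with g ~ CN(0, 1).
g = (rng.normal(size=100000) + 1j * rng.normal(size=100000)) / np.sqrt(2)
h = 10 ** (-path_loss_db / 10) * np.abs(g) ** 2

R_uplink = np.mean(W * np.log2(1 + P_tx * h / (N0 * W)))
print(f"average uplink rate ~ {R_uplink / 1e6:.1f} Mbit/s")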
Thus far, we have elaborated the detailed procedures of the FEEL scheme and established
both the learning and communication models. Next, we will analyze the training task and formulate the training acceleration optimization problem. In our work, the edge server is always equipped with a powerful GPU, whereas the training module of each user device can be either a CPU or a GPU. Therefore, we will investigate the CPU and GPU scenarios separately in the sequel.
In this section, we will first investigate the CPU scenario where each device is only equipped
with CPU for DNN training. The training acceleration problem is formulated to maximize the
system learning efficiency. Some insightful results about network planning are also discussed.
To accelerate the training process and achieve a satisfactory learning accuracy, we adopt the mini-batch SGD algorithm in this paper. The major difficulty in performing this algorithm is the selection of the hyperparameter, i.e., the training batchsize, which greatly affects the learning accuracy and is therefore worth optimizing.
To quantitatively assess the training performance, we first define an auxiliary function, namely
global loss decay, as
∆L[n] = L(w[n]) − L(w[n − 1]), (9)
which represents the difference of the global loss function across the n-th training period. For
brevity, we rewrite ∆L[n] as ∆L since our following analysis is based on one training period.
Now we analyze the training performance of the introduced FEEL system. Recall that in one
training period, each device first selects a subset of data, namely one batch for local gradient
calculation. Denote Bk as the number of data involved in the batch of device k. Then the edge
server aggregates the gradient information from all devices and calculates the global gradient.
Therefore, the FEEL system is capable of processing B = \sum_{k=1}^{K} B_k data samples in one training period,
which is referred to as global batchsize. Note that the target of the training task is toward
minimizing the global loss function. Then according to [29], the relation between the global loss
decay function and the global batchsize can be approximately evaluated as
\Delta L = \xi \sqrt{B} = \xi \sqrt{\sum_{k=1}^{K} B_k},   (10)
where ξ is a coefficient determined by the specific structure of the DNN model. We shall remark
that the global loss decay does not increase linearly with the global batchsize. This is because the learning rate should adapt to the batchsize to ensure the convergence of the mini-batch SGD algorithm and to guarantee the learning accuracy as well [30].
• Local Gradient Calculation Latency: Each device runs the forward-backward propagation algorithm over its batch of B_k samples. Denoting C^L as the number of CPU cycles required to process one data sample and f_k^C as the CPU frequency of device k, the local gradient calculation latency can be expressed as
t_k^{L,CPU} = \frac{B_k C^L}{f_k^C}.   (11)
• Local Gradient Upload Latency: The local gradient vector of each device should be quantized and compressed before transmission.2 Denote the average number of quantization bits for each gradient as d and let r denote the compression ratio, defined as the ratio between the compressed gradient data size and the overall raw gradient data size. Then the total data size of each local gradient vector is s = r × d × p. Recall that we adopt TDMA channel access for data transmission. Let T_f^U denote the length of each uplink radio frame (usually 10 ms in LTE standards) and τ_k^U denote the time-slot duration allocated to device k. Therefore, the local gradient upload latency can be expressed as
t_k^U = \frac{s T_f^U}{\tau_k^U R_k^U}.   (12)

2 Note that the computational complexity of the gradient quantization and compression algorithm is very low [23]. Thus, the corresponding latency can be omitted as compared with the local gradient upload latency.
• Global Gradient Download Latency: When the edge server finishes gradient aggregation, it broadcasts the global gradient vector to each device. To be consistent with the uploading procedure, we use d bits to quantize each global gradient entry and leverage the same gradient compression technique. Similarly, let T_f^D denote the length of each downlink radio frame and denote τ_k^D as the time-slot duration allocated to device k for global gradient downloading. Thus, the global gradient download latency can be expressed as
t_k^D = \frac{s T_f^D}{\tau_k^D R_k^D}.   (13)
• Local Model Update Latency: After receiving the global gradient, each device starts to
update its local model via the gradient-descent method, as presented in (2). Denote M^C as the number of CPU cycles that are required for the local model update. Then, the local model update latency is given by
t_k^{M,CPU} = \frac{M^C}{f_k^C}.   (14)
It should be noted that the edge server has powerful GPU modules so that the gradient aggregation latency can be reasonably neglected. Moreover, the gradient aggregation at the edge server cannot start until the local gradient vectors of all devices have been received. Therefore, in one training period, the end-to-end latency of each device can be expressed as
T_k = \max_{k' \in \mathcal{K}} \left\{ t_{k'}^{L,CPU} + t_{k'}^U \right\} + t_k^D + t_k^{M,CPU}, \quad \forall k \in \mathcal{K}.   (15)
Accordingly, the end-to-end latency of the FEEL system in one training period is given by
T = \max_{k \in \mathcal{K}} T_k = \max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}.   (16)
C. Problem Formulation
In this paper, we aim at accelerating the training task while guaranteeing the learning accuracy.
To better reflect the training performance, we first define a novel evaluation criterion as follows.
Definition 1: The training performance of the FEEL system can be evaluated by the learning
efficiency, which is defined as
E = \frac{\Delta L}{T}.   (17)
Remark 1: The learning efficiency can be interpreted as the average global loss decay rate in
the duration of one training period. Therefore, improving the learning efficiency is equivalent to
accelerating the training process. In view of this, the learning efficiency is an appropriate metric
to evaluate the system training performance.
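To tie (10)-(17) together, the following Python sketch evaluates the global loss decay, the end-to-end latency, and the learning efficiency of the CPU scenario for a given batchsize and time-slot allocation. Every numerical value (ξ, C^L, M^C, s, the rates, and the CPU frequencies) is an assumed placeholder rather than a setting from Section VI.

import numpy as np

def learning_efficiency_cpu(B_k, tau_U, tau_D, f_C, R_U, R_D,
                            xi=0.05, C_L=1e7, M_C=1e8, s=8e6,
                            T_f_U=0.01, T_f_D=0.01):
    """Learning efficiency E = Delta_L / T of one CPU training period, per (10)-(17)."""
    B_k, tau_U, tau_D = map(np.asarray, (B_k, tau_U, tau_D))
    f_C, R_U, R_D = map(np.asarray, (f_C, R_U, R_D))

    delta_L = xi * np.sqrt(B_k.sum())              # global loss decay, (10)
    t_L = B_k * C_L / f_C                          # local gradient calculation, (11)
    t_U = s * T_f_U / (tau_U * R_U)                # local gradient upload, (12)
    t_D = s * T_f_D / (tau_D * R_D)                # global gradient download, (13)
    t_M = M_C / f_C                                # local model update, (14)
    T = np.max(t_L + t_U) + np.max(t_D + t_M)      # end-to-end latency, (16)
    return delta_L / T                             # learning efficiency, (17)

# Example: three devices with equal batchsizes and equal time-slot shares.
E = learning_efficiency_cpu(B_k=[32, 32, 32],
                            tau_U=[0.01 / 3] * 3, tau_D=[0.01 / 3] * 3,
                            f_C=[0.7e9, 1.4e9, 2.1e9],
                            R_U=[50e6, 80e6, 100e6], R_D=[100e6] * 3)
print(E)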
Based on the above analysis, the objective of training acceleration can be transformed into the
learning efficiency maximization. Therefore, the optimization problem can be mathematically
formulated as
P1: \max_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; E = \frac{\xi \sqrt{B}}{\max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}},   (18a)
s.t. \quad \sum_{k=1}^{K} \tau_k^U \le T_f^U,   (18b)
\sum_{k=1}^{K} \tau_k^D \le T_f^D,   (18c)
\sum_{k=1}^{K} B_k = B,   (18d)
1 \le B_k \le B^{max}, \; \forall k \in \mathcal{K},   (18e)
where (18b) and (18c) represent the uplink and downlink communication resource limitations,
respectively, (18d) gives the overall number of data that can be processed in one training
period, and (18e) bounds the minimum and maximum batchsizes of each device, where B max is
determined by the memory size and the CPU configuration of each device.
It can be observed that the objective function in problem P1 is complicated and non-convex, making it hard to solve in general. In the next section, we will decompose the problem into two subproblems and devise efficient algorithms to solve them individually.
In this section, we first analyze the mathematical characteristics of problem P1 and then
equivalently decompose it into two subproblems. The closed-form solutions for both subproblems
are derived individually and some insightful results are also discussed.
A. Problem Decomposition
The main challenge in solving problem P1 is that the denominator of the objective function
is non-smooth. For ease of decomposition, (18a) can be rewritten as
\min_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; \frac{\max_{k \in \mathcal{K}} \left\{ t_k^{L,CPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,CPU} \right\}}{\xi \sqrt{B}}.   (19)
It can be seen that the local gradient upload latency tUk is determined only by the uplink
communication resource τkU , while the global gradient download latency tDk is determined only
by the downlink communication resource τkD and is independent of other variables. Meanwhile,
τkU and τkD are generally independent of each other according to (18b) and (18c). Toward this
end, one training period can be divided into two subperiods. The first subperiod aims to perform
local gradient calculation and uploading, which can be formulated as
P2: \min_{\{\tau_k^U, B_k, B\}} \; \max_{k \in \mathcal{K}} \; \frac{t_k^{L,CPU} + t_k^U}{\xi \sqrt{B}},   (20)
s.t. (18b), (18d), (18e), and (18f).
The second subperiod is to download global gradient and update local DNN model, and thus
can be formulated as
P3: \min_{\{\tau_k^D, B\}} \; \max_{k \in \mathcal{K}} \; \frac{t_k^D + t_k^{M,CPU}}{\xi \sqrt{B}},   (21)
s.t. (18c) and (18f).
It needs to be emphasized that the value of B in subproblem P3 should match that in subproblem
P2 . Therefore, the global batchsize can be regarded as a global variable which will be optimized
at last. In view of this, we will analyze the two subproblems in the sequel.
B. Solution to Subproblem P2
We can find that subproblem P2 is generally a min-max optimization problem and is hard to solve directly. To make it more tractable, we first define E^U as the maximum reciprocal of the uplink learning efficiency among all user devices, i.e., E^U = \max_{k \in \mathcal{K}} \left\{ \frac{t_k^{L,CPU} + t_k^U}{\xi \sqrt{B}} \right\}. Then, by the parametric algorithm [31], subproblem P2 can be transformed into
P4: \min_{\{\tau_k^U, B_k, B, E^U\}} \; E^U,   (22a)
s.t. \quad t_k^{L,CPU} + t_k^U \le \xi \sqrt{B} \, E^U, \; \forall k \in \mathcal{K},   (22b)
It can be easily verified that constraint (22b) is non-convex, resulting in a non-convex optimization
problem. Recall that the values of B in subproblems P2 and P3 are identical. Therefore, we first
keep B fixed and then determine the joint optimal batchsize selection and uplink communication
resource allocation policy. When B is fixed, the problem P4 becomes a convex one, as presented
in the following lemma.
Lemma 1: Given the value of B, problem P4 can be converted into a classical convex
optimization problem.
Proof: The proof is straightforward: when the value of B is fixed, the objective function (22a) is convex and all constraints become affine. The detailed derivation is omitted for brevity.
Lemma 1 is essential for solving problem P4 by applying the fractional optimization method.
Moreover, classical convex optimization algorithms can be also utilized. To better characterize
the structure of the solution and gain more insightful results, we first define some significant
auxiliary indicators for each device, as
• Local training speed is defined as the speed of performing the local forward-backward propagation algorithm at each device, i.e., V_k = \frac{f_k^C}{C^L}.
• Training priority ratio is defined as the ratio between the local computation capacity and the overall devices' computation capacity, i.e., \rho_k = \frac{f_k^C}{\sum_{k=1}^{K} f_k^C}.
Based on the above definitions, the optimal solution to problem P4 with fixed B can be described
as follows.
Theorem 1: The joint optimal batchsize selection and uplink communication resource allocation
device trains with the same batchsize as well as being allocated with identical time-slot resource,
we can obtain the upper bound of E U∗ . On the other hand, we relax the constraint (18e) and
further apply the Karush-Kuhn-Tucker (KKT) conditions to achieve the lower bound of E U∗ .
With mathematical analysis, the range of E U∗ could be expressed in the following corollary,
which is proved in Appendix B.
Based on Corollary 1, we further investigate another two extreme cases to determine the range
of µ∗ . The first one corresponds to the scenario where the optimal batchsize for all devices are
Bk∗ = 1, ∀k. The second one corresponds to the scenario of Bk∗ = B max , ∀k. Then the range of
µ∗ can be expressed as the function of E U∗ , which is presented in the following corollary and
is also proved in Appendix B.
Corollary 2: When there exists at least one device whose optimal batchsize is in the interval
(1, B max ), the value of µ∗ should satisfy
\min_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - B^{max} \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\} \le \mu^* \le \max_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - 1 \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (25)
Remark 4: From Corollary 2, we can observe that the values of μ* and E^{U*} are tightly coupled. Note that even though this result is derived under the assumption that there is at least one device whose batchsize lies between 1 and B^{max}, it is still applicable since the extreme cases of B_k = 1, ∀k and B_k = B^{max}, ∀k rarely occur in practice. On the other hand, these two extreme cases correspond to the cases of B = K and B = K B^{max}, whose solutions can be directly obtained via the KKT conditions.
Based on the above analysis, we now develop an effective two-dimensional search algorithm
to solve subproblem P2 , as described in Table I. The main idea is to update the values of training
batchsize and uplink communication resource in each iteration until the time-sharing constraint
and the global batchsize limitation are both satisfied. It is easy to prove that this algorithm has a computational complexity of O(K \log^2(1/\epsilon)), where \epsilon is the maximum tolerance, and thus can be easily implemented in practical systems.
TABLE I
THE SOLUTION TO SUBPROBLEM P2
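One plausible realization of the two-dimensional search in Python is sketched below: the inner bisection tunes μ so that batchsizes of the form in (42) sum to B, and the outer bisection tunes E^U so that the time-slots obtained from a tight constraint (38) share the uplink frame. The constants, search ranges, and iteration counts are assumptions, and the listing is a sketch rather than the exact procedure of Table I.

import numpy as np

def solve_P2(B, V, rho, R_U, s=8e6, T_f_U=0.01, xi=0.05, B_max=128):
    """Two-dimensional bisection search for subproblem P2 with a fixed global batchsize B."""
    V, rho, R_U = map(np.asarray, (V, rho, R_U))
    dL = xi * np.sqrt(B)                                    # Delta_L = xi * sqrt(B)

    def batchsizes(E_U, mu):                                # candidate B_k*, cf. (42)
        raw = (dL * E_U - np.sqrt(dL * s * T_f_U * mu / (rho * R_U))) * V
        return np.clip(raw, 1.0, B_max)

    def slots(E_U, B_k):                                    # tau_k^U from a tight (38)
        return s * T_f_U / (R_U * (dL * E_U - B_k / V))

    def mu_for(E_U):                                        # inner bisection on mu
        lo_m, hi_m = 0.0, 1e12
        for _ in range(80):
            mu = 0.5 * (lo_m + hi_m)
            if batchsizes(E_U, mu).sum() > B:
                lo_m = mu                                   # batchsizes too large -> raise mu
            else:
                hi_m = mu
        return mu

    lo_E, hi_E = 1e-6, 1e6
    for _ in range(80):                                     # outer bisection on E_U
        E_U = 0.5 * (lo_E + hi_E)
        B_k = batchsizes(E_U, mu_for(E_U))
        feasible = (batchsizes(E_U, 0.0).sum() >= B         # B reachable at this E_U
                    and np.all(dL * E_U > B_k / V)          # positive upload-time budget
                    and slots(E_U, B_k).sum() <= T_f_U)     # frame not over-used
        hi_E, lo_E = (E_U, lo_E) if feasible else (hi_E, E_U)
    B_k = batchsizes(hi_E, mu_for(hi_E))
    return B_k, slots(hi_E, B_k), hi_E

# Toy example with three devices (V_k = f_k^C / C^L and rho_k = f_k^C / sum_j f_j^C).
B_k, tau_U, E_U = solve_P2(B=96, V=[70.0, 140.0, 210.0],
                           rho=[1 / 6, 2 / 6, 3 / 6], R_U=[50e6, 80e6, 100e6])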
In this part, we will first solve subproblem P3 and then discuss the global solution to the original problem P1. Similar to subproblem P2, we denote E^D = \max_{k \in \mathcal{K}} \left\{ \frac{t_k^D + t_k^{M,CPU}}{\xi \sqrt{B}} \right\} as
the maximum reciprocal of each downlink learning efficiency among all devices. Then, the
subproblem P3 can be reformulated as
P5: \min_{\{\tau_k^D, B, E^D\}} \; E^D,   (26a)
s.t. \quad t_k^D + t_k^{M,CPU} \le \xi \sqrt{B} \, E^D, \; \forall k \in \mathcal{K},   (26b)
Given the value of B, the mathematical characteristic of problem P5 is similar to that of problem
P4 . Therefore, it can be solved using the KKT conditions. The closed-form solution to P5 is
presented in the following theorem. Note that the proof is similar to that of Theorem 1, which
is omitted here due to page limits.
Theorem 2: The optimal downlink communication resource allocation policy is given by
\tau_k^{D*} = \left[ \frac{s}{R_k^D \left( \Delta L E^{D*} - \frac{M^C}{f_k^C} \right)} \right]^{+} T_f^D,   (27)
where E^{D*} is the optimal value associated with the time-sharing constraint \sum_{k=1}^{K} \tau_k^{D*} = T_f^D.
Remark 5: (Consistent Resource Allocation) Theorem 2 indicates that the optimal downlink resource decreases with the downlink data rate. The reason is similar to that in Theorem 1. Note that by this means, the local model updating of all devices can be accomplished simultaneously. Combining this with the results in Theorem 1, we can conclude that the overall end-to-end latency of each device should be identical, leading to a synchronous training system.
So far, we have presented the closed-form expressions of the joint optimal batchsize selection and uplink/downlink time-slot allocation as functions of the global batchsize B. As a result, problem P1 reduces to a univariate optimization problem in B and thus can be solved effectively by a classical gradient descent algorithm.
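As a simple stand-in for the gradient descent step, the univariate search over B can also be organized as a coarse grid search, as sketched below; uplink_E and downlink_E are hypothetical wrappers that return the optimal objective values E^{U*} and E^{D*} of subproblems P2 and P3 for a given B (e.g., by calling the routine sketched after Table I).

def best_global_batchsize(K, B_max, uplink_E, downlink_E, step=8):
    """Coarse search over the global batchsize B in [K, K*B_max]; each candidate is
    scored by the learning efficiency of (18a), which equals 1 / (E^U* + E^D*)."""
    scores = [(1.0 / (uplink_E(B) + downlink_E(B)), B)
              for B in range(K, K * B_max + 1, step)]
    return max(scores)[1]   # batchsize with the largest learning efficiency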
In this section, we consider the scenario where each device is equipped with a GPU for DNN training. We will first propose a novel GPU training function and then extend the training acceleration problem to this case. In particular, the optimal solution is proved to have a similar structure to that in the CPU scenario.
Unlike the serial mode of general CPUs, GPUs perform computation in the parallel mode.
In this situation, the local gradient calculation latency is no longer proportional to the training
batchsize. Specifically, when the training batchsize is small, the GPU can directly process all the data simultaneously, leading to a constant training latency. On the contrary, the latency grows once the training batchsize exceeds the maximum volume of data that the GPU can process at a time. With this consideration, we propose the following function to capture the relation between the local gradient calculation latency and the training batchsize, as presented in the following assumption and shown in Fig. 2(a).

Fig. 2. Local gradient calculation latency w.r.t. training batchsize of general GPUs.
Assumption 1: In the GPU scenario, the relation between the local gradient calculation latency
and the training batchsize of each device is given by
t_k^{L,GPU} = \begin{cases} t_k^{\ell}, & \text{if } 1 \le B_k \le B_k^{th} \\ t_k^{h}(B_k) \triangleq c_k \left( B_k - B_k^{th} \right) + t_k^{\ell}, & \text{if } B_k^{th} < B_k \le B^{max} \end{cases} \quad \forall k \in \mathcal{K},   (28)
where t_k^{\ell}, c_k, and B_k^{th} are three coefficients determined by the specific DNN structure (e.g., the
number of layers and the number of neurons) and the concrete GPU configuration (e.g., the
video memory size and the number of cores).
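A direct Python transcription of the piecewise model in (28) is given below; the coefficient values are arbitrary placeholders rather than measured ones.

def gpu_gradient_latency(B_k, t_l=0.05, c=0.002, B_th=64, B_max=512):
    """Local gradient calculation latency of (28): constant in the data bound
    region, then growing linearly in the compute bound region."""
    if not 1 <= B_k <= B_max:
        raise ValueError("batchsize outside [1, B_max]")
    if B_k <= B_th:
        return t_l                       # data bound region
    return c * (B_k - B_th) + t_l        # compute bound region

# Latency stays at t_l up to B_th = 64, then grows by c seconds per extra sample.
print([round(gpu_gradient_latency(B), 3) for B in (16, 64, 96, 128)])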
Remark 6: As discussed earlier, when the training batchsize is below a threshold Bkth , the
local gradient calculation latency keeps fixed because of GPU’s parallel execution capability.
In this case, the data samples are inadequate such that the computation resource is not fully
exploited. Therefore, we name it as the data bound region. On the other hand, the local gradient
calculation latency grows linearly when the training batchsize exceeds the threshold. This is because, in this case, the computation resource (e.g., the video memory) cannot support processing all the data samples simultaneously. We shall note that the local gradient calculation latency does not increase in a ladder form. The reason is intuitive: the operations of data reading and transferring between the memory modules and the processing modules require additional time, leading to an overall linear trend. Consequently, we name this region the compute bound region.
To validate the results in Assumption 1, we implement three classic DNN models, i.e., DenseNet, GoogleNet, and PNASNet, on a Linux server equipped with three NVIDIA GeForce GTX 1080 Ti GPUs. The experimental results are depicted in Fig. 2(b). It can be observed that
the local gradient calculation latency of each model first keeps invariant and then increases almost
linearly with the training batchsize. This result shows that the curves obtained via experiments
fit the theoretical model in (28) very well, which demonstrates the applicability of the proposed
GPU training function.
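If one wished to reproduce such a fit from measured latencies, the coefficients of (28) can be estimated with a simple threshold scan plus least squares, as in the sketch below; the measurement arrays are synthetic placeholders, not the values recorded on the GTX 1080 Ti server.

import numpy as np

def fit_gpu_latency_model(batchsizes, latencies):
    """Estimate (t_l, c, B_th) of (28): for each candidate threshold, fit the flat part
    by its mean and the linear tail by least squares, and keep the smallest error."""
    B = np.asarray(batchsizes, dtype=float)
    t = np.asarray(latencies, dtype=float)
    best = None
    for B_th in B[1:-1]:
        flat, tail = B <= B_th, B > B_th
        t_l = t[flat].mean()
        c = np.sum((B[tail] - B_th) * (t[tail] - t_l)) / np.sum((B[tail] - B_th) ** 2)
        pred = np.where(flat, t_l, c * (B - B_th) + t_l)
        err = np.sum((pred - t) ** 2)
        if best is None or err < best[0]:
            best = (err, t_l, c, B_th)
    return best[1:]

# Synthetic measurements that roughly follow the trend described above.
B = np.array([8, 16, 32, 64, 96, 128, 192, 256])
t = np.array([0.051, 0.050, 0.052, 0.050, 0.110, 0.180, 0.310, 0.440])
print(fit_gpu_latency_model(B, t))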
Similar to the CPU scenario, the learning efficiency in the GPU scenario can be also defined
as the ratio between the global loss decay and the end-to-end latency of each training period.
Note that the data sizes of both the local gradient g_k^w and the global gradient g^w in the GPU scenario are identical to those in the CPU scenario. Thus, the local gradient upload latency and the global
gradient download latency can be still expressed as (12) and (13), respectively. Besides, let fkG
(in floating-point operations per second) denote the computation capability of the GPU module
of device k and denote M G as the number of floating-point operations that are required for
model updating. Then, the local model update latency in this case can be expressed as
t_k^{M,GPU} = \frac{M^G}{f_k^G}.   (29)
Based on the above analysis, the end-to-end latency of each training period in the GPU scenario
is given by
T = \max_{k \in \mathcal{K}} \left\{ t_k^{L,GPU} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,GPU} \right\}.   (30)
The above problem is not easy to solve since the local gradient calculation latency t_k^{L,GPU} is a piecewise function of the training batchsize. Nevertheless, we can exploit the
mathematical characteristic of the learning efficiency in (31a) and derive a necessary condition
of the solution to problem P6 , as summarized in the following lemma.
Lemma 2: In the GPU scenario, the optimal batchsize Bk∗ for each device shall locate in the
compute bound region, i.e., Bkth ≤ Bk∗ ≤ B max .
Proof: Please refer to Appendix C.
The result in Lemma 2 coincides with the empirical result in practical systems that the
computation resource of all devices should be fully exploited to achieve the largest learning
efficiency. Accordingly, the data bound region can be neglected and P6 can be reformulated as
P7: \max_{\{\tau_k^U, \tau_k^D, B_k, B\}} \; \hat{E} = \frac{\xi \sqrt{B}}{\max_{k \in \mathcal{K}} \left\{ t_k^{h} + t_k^U \right\} + \max_{k \in \mathcal{K}} \left\{ t_k^D + t_k^{M,GPU} \right\}},   (32a)
s.t. (18b), (18c), (18d), and (18f),   (32b)
With a comprehensive comparison, we can observe that the structures of problems P1 and P7 are identical except for the expressions of the local gradient calculation latency. Fortunately, both kinds of latency are affine in the batchsize, giving the two problems essentially the same structure. Therefore, the GPU scenario reduces to the CPU scenario and the algorithms for solving problem P1 are still applicable with a minor modification of the delay expressions. The detailed derivations and results are omitted due to page limits.
VI. EXPERIMENTS
A. Experiment Settings
In this section, we conduct experiments to validate our theoretical analysis and evaluate the
performance of the proposed algorithms. The system setup is summarized as follows unless
otherwise specified. We consider a single-cell network with a radius of 200 m. The BS is
located in the center of the network and each device is uniformly distributed in the cell. The
channel gains of both the uplink and downlink channels are generated following the path-loss model PL [dB] = 128.1 + 37.6 log10(d [km]), where the small-scale fading is Rayleigh distributed with unit variance. The transmit powers of the uplink and downlink channels are both 28 dBm. The system bandwidth is W = 10 MHz and the channel noise power spectral density is N0 = −174 dBm/Hz. The lengths of each uplink and downlink radio frame are both set as T_f^U = T_f^D = 10 ms according to the LTE standards. For each device, the average number of quantization bits for each gradient is d = 64 and the maximum batchsize is bounded by B^{max} = 128 data samples.
To test the system performance, we choose three commonly-used DNN models: DenseNet121,
ResNet18, and MobileNetV2 for image classification. The well-known dataset CIFAR-10 is used
for model training, which consists of 50000 training images and 10000 validation images of 32
× 32 pixels in 10 classes. Standard data augmentation and preprocessing methods are employed
to improve the test accuracy, and effective deep gradient compression methods are also used
to reduce the communication overhead [23]. Moreover, to simulate the distributed mobile data,
we study two typical ways of partitioning CIFAR-10 data over devices [27]: 1) IID case, where
all data samples are first shuffled and then partitioned into K equal parts, and each device is
assigned one particular part; 2) non-IID case, where all data samples are first sorted by class label and then divided into 2K shards of size 25000/K, and each device is thereafter assigned two shards. Note that the latter is a pathological non-IID partition since most devices only obtain samples from two classes.
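The two partitioning strategies can be reproduced with a few lines of NumPy, as sketched below; the labels array stands in for the CIFAR-10 training labels and the shard bookkeeping follows the description above.

import numpy as np

def partition_iid(labels, K, rng):
    """IID case: shuffle all sample indices and split them into K equal parts."""
    idx = rng.permutation(len(labels))
    return np.array_split(idx, K)

def partition_non_iid(labels, K, rng):
    """Non-IID case: sort the indices by class label, cut them into 2K shards,
    and assign each device two randomly chosen shards."""
    idx = np.argsort(labels)
    shards = np.array_split(idx, 2 * K)
    order = rng.permutation(2 * K)
    return [np.concatenate([shards[order[2 * k]], shards[order[2 * k + 1]]])
            for k in range(K)]

rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=50000)        # stand-in for CIFAR-10 labels
iid_parts = partition_iid(labels, K=12, rng=rng)
non_iid_parts = partition_non_iid(labels, K=12, rng=rng)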
An important test of the proposed scheme is whether it is applicable for training different
DNN models, i.e., the generalization ability. Toward this end, we implement the above three
DNN models in an FEEL system with K = 12 user devices. Specifically, the CPU frequencies
of all devices are configured as: 4 devices with 0.7 GHz, 4 devices with 1.4 GHz, and 4 devices
with 2.1 GHz.
We test both the loss convergence speed and learning accuracy improvement of each DNN
model with different learning rates. To be more realistic, we implement the non-IID case and
the results are presented in Fig. 3. It can be observed that our proposed scheme is capable
of attaining a satisfactory learning accuracy with a fast loss convergence speed for all three
models. Moreover, the ultimate learning accuracy can be well guaranteed with different learning
rates. This result demonstrates the strong generalization ability of the proposed algorithm, which
promotes its wide implementation in practical systems.
Fig. 3. Global training loss and test accuracy with different learning rates.

In this part, we test the performance improvement of our proposed scheme as compared with three benchmark schemes in the scenarios of K = 6 and K = 12 devices, respectively. Without loss of generality, we take a pre-trained DenseNet121 with 86% initial learning accuracy for the following test. Both the IID and non-IID cases are implemented. The detailed procedures of
each scheme can be summarized as follows.
• Individual Learning: Each device trains its DNN model until the local loss function con-
verges. Then the edge server aggregates all local models, averages the parameters, and
transmits the results to each device.
• Model-Based FL: Each device trains its DNN model on its local dataset for one epoch. Thereafter, the model parameters of each device are transmitted to the edge server for aggregation. The results are then transmitted back to each device for the next training epoch; this repeats until the global loss function converges [18].
• Gradient-Based FL: Each device trains its DNN model by running a one-step SGD algorithm. Then the gradients of each device are transmitted to the edge server for aggregation. After that, the global gradient is transmitted back to each device. This process is repeated until the global loss function converges [32].
It is clear that the main difference between our proposed scheme and the gradient-based FL
scheme is that we take into account the joint batchsize selection and communication resource
allocation.
We use two metrics to evaluate the training performance of each scheme, i.e., the test accuracy
and the training speedup, which is defined as the ratio between the training speed of each scheme
and that of the individual learning scheme.
TABLE II
TRAINING PERFORMANCE OF DIFFERENT SCHEMES

(a) K = 6
Scheme               | IID test accuracy | IID training speedup | Non-IID test accuracy | Non-IID training speedup
Individual learning  | 90.65%            | 1×                   | 89.16%                | 1×
Model-based FL       | 91.28%            | 0.29×                | 90.22%                | 0.37×
Gradient-based FL    | 91.60%            | 0.53×                | 91.51%                | 0.69×
Our proposed scheme  | 91.48%            | 1.09×                | 91.43%                | 1.03×

(b) K = 12
Scheme               | IID test accuracy | IID training speedup | Non-IID test accuracy | Non-IID training speedup
Individual learning  | 90.68%            | 1×                   | 89.91%                | 1×
Model-based FL       | 92.01%            | 0.32×                | 91.16%                | 0.31×
Gradient-based FL    | 92.27%            | 0.68×                | 91.81%                | 0.67×
Our proposed scheme  | 92.34%            | 1.16×                | 92.12%                | 1.26×
Table II(a) shows the training performance of each scheme in the scenario of K = 6 devices.
As compared with the individual learning scheme, our proposed scheme can speed up the training
task by about 1.09 times with about 0.83% learning accuracy improvement in the IID case,
and can speed up the training task by about 1.03 times with about 2.27% learning accuracy
improvement in the non-IID case. These results demonstrate that the proposed scheme can
accelerate the training process and improve the learning accuracy simultaneously. Moreover,
compared with the model-based FL scheme, our scheme can achieve about 3.76 times training
speed in the IID case, and can achieve about 2.78 times training speed in the non-IID case,
both with a slight learning accuracy improvement. The reason can be explained as follows.
The gradients of each device can be deeply compressed because of their sparsity; therefore, the communication overhead for gradient transmission is much smaller than that for parameter
transmission. Besides, the gradients contain more information than parameters and match the
essential operations of the mini-batch SGD algorithm closely. Thus, the learning accuracy of the
proposed scheme is better than that of the model-based scheme. On the other hand, compared
with the gradient-based scheme, our scheme is capable of achieving about 2 times and 1.5
times training speed in the IID and non-IID cases, respectively, both with almost the same
learning accuracy. It is rather intuitive that our scheme optimally selects the training batchsize
to effectively balance the communication overhead and the training workload. On the contrary, the gradient-based scheme trains each local model using the whole dataset without considering the tradeoff between communication and learning.
Fig. 4. Performance comparison among different training schemes in the IID case: (a) global training loss vs. training time; (b) test accuracy vs. training time.
We further test the training performance of each scheme in the scenario of K = 12 devices. The
results are presented in Table II(b). It can be observed that our proposed scheme can still achieve
the fastest training speed as well as attain a satisfactory learning accuracy improvement among
all schemes, which demonstrates the superiority and scalability of our proposal. Moreover, as
the number of devices increases, both the learning accuracy improvement and the training speed
improvement of the proposed scheme become more evident as compared with the individual
learning scheme. It is because with more devices, more computation resources can be exploited
to fully train the DNN model in our scheme, resulting in a higher learning accuracy. In particular,
the accuracy gap between the IID and non-IID cases in the individual learning scheme is larger than those in the other schemes. As we can imagine, the individual learning scheme cannot properly grasp the characteristics of the non-IID data because it performs local training without any collaboration. Conversely, the other three schemes dynamically share the local learning updates so that their accuracy gaps between the IID and non-IID cases are much smaller. This result also suggests the superiority of the proposed scheme in dealing with non-IID data in distributed FL systems.
Fig. 5. Performance comparison among different training schemes in the non-IID case: (a) global training loss vs. training time; (b) test accuracy vs. training time.

An important test is whether the proposed joint batchsize selection and communication resource allocation policy can accelerate the training task while guaranteeing the learning accuracy in the GPU scenario. Therefore, we turn to compare the training performance of the proposed scheme with three baseline schemes in the scenario of K = 6 devices. Similarly, DenseNet121 is tested and both the IID and non-IID cases are experimented. The three baseline schemes are
summarized as follows.
• Online learning scheme, where the training batchsize of each device is Bk = 1, ∀k.
• Full batchsize scheme, where the training batchsize of each device is Bk = B max = 128, ∀k.
• Random batchsize scheme, where each device randomly selects a training batchsize between
1 and 128 in each training period.
Fig. 4 and Fig. 5 show the loss convergence speed and the learning accuracy in the IID
case and non-IID case, respectively. We can observe from all plots that the proposed scheme
can achieve the fastest training speed while attaining the highest learning accuracy at the same
time. The reason is that our proposed scheme can optimally allocate the time-slot resource and effectively select an appropriate training batchsize to balance the communication overhead and the training workload. Further, the learning accuracies of the proposed scheme in the IID and non-IID cases are almost the same given enough training time. This is because, in the proposed scheme, the edge server can generally aggregate the data information with different kinds of distributions. Therefore, the learning accuracy in the non-IID case does not deteriorate as compared with that in the IID case. This also suggests the applicability of the proposed scheme to hyperparameter adjustment in practical systems with non-IID data.
VII. CONCLUSION
This paper aims at accelerating the DNN training task in the framework of FEEL, which not
only protects users’ privacy but also improves the system learning efficiency. The key idea is
to jointly optimize the local training batchsize and wireless resource allocation to achieve a fast
training performance while maintaining the learning accuracy. In the common CPU scenario,
we formulate a training acceleration optimization problem after analyzing the global loss decay
and the end-to-end latency of each training period. The closed-form solution for joint batchsize
selection and communication resource allocation is then developed by problem decomposition
and classical optimization tools. Some insightful results are also discussed to provide meaningful guidelines for hyperparameter adjustment and network planning. To gain more insights, we further
extend our framework to the GPU scenario and develop a new GPU training function. The optimal
solution in this case can be derived by fine-tuning the results in the CPU scenario, indicating that
our proposed algorithm is applicable in more general systems. Our studies in this work provide
an important step towards the implementation of AI in wireless communication systems.
APPENDIX A
PROOF OF THEOREM 1
Let L denote the Lagrangian of problem P4, where λ_k, ∀k, μ, and γ are the Lagrange multipliers associated with constraints (22b), (18b), and (18d), respectively. Denote {B_k^*, τ_k^{U*}} as the optimal solution to P4. Note that the uplink communication resource τ_k^U is non-negative while the batchsize of each device is bounded in the interval [1, B^{max}]. Therefore, applying the KKT conditions gives the following necessary and sufficient conditions:
\frac{\partial \mathcal{L}}{\partial E^{U*}} = 1 - \xi \sqrt{B} \sum_{k=1}^{K} \lambda_k^* = 0,   (34)

\frac{\partial \mathcal{L}}{\partial \tau_k^{U*}} = -\lambda_k^* \frac{s T_f^U}{R_k^U (\tau_k^{U*})^2} + \mu^* \; \begin{cases} \ge 0, & \tau_k^{U*} = 0 \\ = 0, & \tau_k^{U*} > 0 \end{cases} \quad \forall k \in \mathcal{K},   (35)

\frac{\partial \mathcal{L}}{\partial B_k^*} = \lambda_k^* \frac{C^L}{f_k^C} + \gamma^* \; \begin{cases} \ge 0, & B_k^* = 1 \\ = 0, & B_k^* \in (1, B^{max}) \\ \le 0, & B_k^* = B^{max} \end{cases} \quad \forall k \in \mathcal{K},   (36)

\lambda_k^* \left( \frac{B_k^* C^L}{f_k^C} + \frac{s T_f^U}{\tau_k^{U*} R_k^U} - \xi \sqrt{B} E^{U*} \right) = 0, \quad \lambda_k^* \ge 0, \; \forall k \in \mathcal{K},   (37)

\frac{B_k^* C^L}{f_k^C} + \frac{s T_f^U}{\tau_k^{U*} R_k^U} \le \xi \sqrt{B} E^{U*}, \quad \forall k \in \mathcal{K},   (38)

\mu^* \left( \sum_{k=1}^{K} \tau_k^{U*} - T_f^U \right) = 0, \quad \mu^* \ge 0,   (39)

\sum_{k=1}^{K} \tau_k^{U*} \le T_f^U, \quad \tau_k^{U*} \ge 0, \; \forall k \in \mathcal{K},   (40)

\sum_{k=1}^{K} B_k^* = B, \quad 1 \le B_k^* \le B^{max}, \; \forall k \in \mathcal{K}.   (41)

With simple mathematical calculation, we can derive the optimal batchsize selection policy as

B_k^* = \left[ \left( \Delta L E^{U*} - \left( \frac{\Delta L s T_f^U \mu^*}{\rho_k R_k^U} \right)^{\frac{1}{2}} \right) V_k \right]_{1}^{B^{max}}.   (42)

Moreover, the minimum E^U is achieved when the "≤" in (38) is set to "=". Combining (38) and (42), we can obtain the optimal time-slot allocation in Theorem 1.
APPENDIX B
A. Proof of Corollary 1
Since the goal of problem P4 is to minimize E^U, the optimal E^{U*} is no greater than E_h^U, which can be regarded as an upper bound of E^{U*}.
2) Case B (Infinite memory resource)
In this case, the memory of each device is sufficient so that the batchsize limitation (18e) can
be relaxed. Further, the convexity of problem P4 can be preserved, making the KKT conditions
still effective. With derivations similar to those in Appendix A, we can obtain the objective value as
E_\ell^U = \frac{1}{\Delta L} \left( \frac{B C^L}{\sum_{k=1}^{K} f_k^C} + s \left( \sum_{k=1}^{K} \sqrt{\frac{\rho_k}{R_k^U}} \right)^2 \right).   (44)
Due to the fact that the objective value will not increase after constraint relaxation, E^{U*} is no less than E_\ell^U. Thus, E_\ell^U can be viewed as a lower bound of E^{U*}. Combining (43) and (44), we can express the range of E^{U*} as in Corollary 1, which ends the proof.
B. Proof of Corollary 2
According to Theorem 1, we can observe that the optimal batchsize has a threshold-based
structure. Similar to the proof of Corollary 1, we also consider two cases.
1) Case A (Online learning): In this case, we assume that the optimal batchsize of each device is B_k^* = 1, ∀k. According to (23), this case happens when \left( \Delta L E^{U*} - \left( \frac{\Delta L s T_f^U \mu^*}{\rho_k R_k^U} \right)^{\frac{1}{2}} \right) V_k \le 1, ∀k, resulting in

\mu^* \ge \mu_h = \max_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - 1 \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (45)

2) Case B (Full batchsize): In this case, the optimal batchsize of each device is B_k^* = B^{max}, ∀k, which, by a similar argument, leads to

\mu^* \le \mu_\ell = \min_{k \in \mathcal{K}} \left\{ \frac{\left( \Delta L V_k E^{U*} - B^{max} \right)^2 \rho_k R_k^U}{\Delta L s T_f^U V_k^2} \right\}.   (46)
Based on the above analysis, when there exists at least one device whose batchsize is between
1 and B^{max}, the value of μ^* will satisfy μ_\ell ≤ μ^* ≤ μ_h, which completes the proof.
APPENDIX C
PROOF OF LEMMA 2

REFERENCES
[1] Y. LeCun, Y. Bengio, and G. Hinton, “Deep learning,” Nature, vol. 521, pp. 436-444, May 2015.
[2] M. G. Kibria, K. Nguyen, G. P. Villardi, O. Zhao, K. Ishizu, and F. Kojima, “Big data analytics, machine learning, and
artificial intelligence in next-generation wireless networks,” IEEE Access, vol. 6, pp. 32328-32338, May 2018.
[3] Q. Mao, F. Hu, and Q. Hao, “Deep learning for intelligent wireless networks: A comprehensive survey,” IEEE Commun.
Surv. Tut., vol. 20, no. 4, pp. 2595-2621, Fourthquarter 2018.
[4] C. Zhang, P. Patras, and H. Haddadi, “Deep learning in mobile and wireless networking: A survey,” IEEE Commun. Surv.
Tut., early access, doi: 10.1109/COMST.2019.2904897.
[5] H. He, C. Wen, S. Jin, and G. Y. Li, “Deep learning-based channel estimation for beamspace mmWave massive MIMO
systems,” IEEE Wireless Commun. Lett., vol. 7, no. 5, pp. 852-855, Oct. 2018.
[6] H. Ye, G. Y. Li, and B.-H. F. Juang, “Power of deep learning for channel estimation and signal detection in OFDM
systems,” IEEE Wireless Commun. Lett., vol. 7, no. 1, pp. 114-117, Feb. 2018.
[7] H. Ye and G. Y. Li, “Deep reinforcement learning for resource allocation in V2V communications,” in Proc. IEEE Int.
Conf. Commun. (ICC), Kansas City, MO, May 2018, pp. 1-6.
[8] M. Lee, Y. Xiong, G. Yu, and G. Y. Li, “Deep neural networks for linear sum assignment problems,” IEEE Wireless
Commun. Lett., vol. 7, no. 6, pp. 962-965, Dec. 2018.
[9] J. Park, S. Samarakoon, M. Bennis, and M. Debbah, “Wireless network intelligence at the edge,” CoRR, vol. abs/1812.02858,
2018. [Online]. Available: http://arxiv.org/abs/1812.02858
[10] W. House, “Consumer data privacy in a networked world: A framework for protecting privacy and promoting innovation
in the global digital economy,” J. Privacy and Confidentiality, 2013.
[11] Y. Mao, C. You, J. Zhang, K. Huang, and K. B. Letaief, “A survey on mobile edge computing: The communication
perspective,” IEEE Commun. Surv. Tut., vol. 19, no. 4, pp. 2322-2358, Aug. 2017.
[12] J. Ren, G. Yu, Y. Cai, and Y. He, “Latency optimization for resource allocation in mobile-edge computation offloading,”
IEEE Trans. Wireless Commun., vol. 17, no. 8, pp. 5506-5519, Aug. 2018.
[13] C. You, K. Huang, H. Chae, and B.-H. Kim, “Energy-efficient resource allocation for mobile-edge computation offloading,”
IEEE Trans. Wireless Commun., vol. 16, no. 3, pp. 1397-1411, Mar. 2017.
[14] Y. Mao, J. Zhang, S. H. Song, and K. B. Letaief, “Stochastic joint radio and computational resource management for
multi-user mobile-edge computing systems,” IEEE Trans. Wireless Commun., vol. 16, no. 9, pp. 5994-6009, Sep. 2017.
[15] H. B. McMahan, E. Moore, D. Ramage, and B. A. y Arcas, “Federated learning of deep networks using model averaging,”
CoRR, vol. abs/1602.05629, 2016. [Online]. Available: http://arxiv.org/abs/1602.05629
[16] J. Konečnỳ, H. B. McMahan, D. Ramage, and P. Richtárik, “Federated optimization: Distributed machine learning for
on-device intelligence,” CoRR, vol. abs/1610.02527, 2016. [Online]. Available: http://arxiv.org/abs/1610.02527
[17] G. Zhu, D. Liu, Y. Du, C. You, J. Zhang, and K. Huang, “Towards an intelligent edge: Wireless communication meets
machine learning,” arXiv preprint arXiv:1809.00343, 2018.
[18] H. B. McMahan, E. Moore, D. Ramage, S. Hampson, and B. A. y Arcas, “Communication-efficient learning of deep
networks from decentralized data,” in Proc. 20th Int. Conf. Artif. Intell. Statist. (AISTATS), Fort Lauderdale, FL, USA,
Apr. 2017, pp. 1-10.
[19] J. Konečnỳ, H. B. McMahan, F. X. Yu, P. Richtárik, A. T. Suresh, and D. Bacon, “Federated learning: Strategies for
improving communication efficiency,” in Proc. of NIPS Workshop on Private Multi-Party Machine Learning, 2016. [Online].
Available: https://arxiv.org/abs/1610.05492
[20] M. Li, D. G. Andersen, A. J. Smola, and K. Yu, “Communication efficient distributed machine learning with the parameter
server,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Montreal, QC, Canada, 2014, pp. 19-27.
[21] D. Alistarh, D. Grubic, J. Li, R. Tomioka, and M. Vojnovic, “Qsgd: Communication-efficient sgd via gradient quantization
and encoding,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 1707-1718.
[22] Y. Lin, S. Han, H. Mao, Y. Wang, and W. J. Dally, “Deep gradient compression: Reducing the communication bandwidth
for distributed training,” arXiv preprint arXiv:1712.01887, 2017.
[23] F. Sattler, S. Wiedemann, K.-R. Müller, and W. Samek, “Sparse binary compression: Towards distributed deep learning
with minimal communication,” arXiv preprint arXiv:1805.08768, 2018.
[24] Y. Zhao, M. Li, L. Lai, N. Suda, D. Civin, and V. Chandra, “Federated learning with non-iid data,” arXiv preprint
arXiv:1806.00582, 2018.
[25] V. Smith, C.-K. Chiang, M. Sanjabi, and A. S. Talwalkar, “Federated multi-task learning,” in Proc. Adv. Neural Inf. Process.
Syst. (NIPS), Long Beach, CA, USA, Dec. 2017, pp. 4427-4437.
[26] H. Kim, J. Park, M. Bennis, and S.-L. Kim, “On-device federated learning via blockchain and its latency analysis.” arXiv
preprint arXiv:1808.03949, 2018.
[27] G. Zhu, Y. Wang, and K. Huang, “Low-latency broadband analog aggregation for federated edge learning,” arXiv preprint
arXiv:1812.11494, 2019.
[28] S. Wang, T. Tuor, T. Salonidis, K. K. Leung, C. Makaya, T. He, and K. Chan, “When edge meets learning: Adaptive
control for resource-constrained distributed machine learning,” in Proc. IEEE Int. Conf. Comput. Comnun. (INFOCOM),
Honolulu, HI, 2018, pp. 63-71.
[29] M. Li, T. Zhang, Y. Chen, and A. J. Smola, “Efficient mini-batch training for stochastic optimization,” in Proc. ACM
SIGKDD Int. Conf. Knowl. Discovery Data Mining, New York, NY, USA, Aug. 2014, pp. 661-670.
[30] P. Goyal, P. Dollár, R. Girshick, P. Noordhuis, L. Wesolowski, A. Kyrola, A. Tulloch, Y. Jia, and K. He. “Accurate, large
minibatch SGD: Training imagenet in 1 hour.” arXiv preprint arXiv:1706.02677, 2017.
[31] S. Geisser and W. M. Johnson, Modes of Parametric Statistical Inference. Hoboken, NJ, USA: Wiley, 2006.
[32] J. Dean et al., “Large scale distributed deep networks,” in Proc. Adv. Neural Inf. Process. Syst. (NIPS), Lake Tahoe, NV,
USA, Dec. 2012, pp. 1223-1231.