
Future Generation Computer Systems 125 (2021) 908–920


Self-aware distributed deep learning framework for heterogeneous IoT edge devices

Yi Jin a, Jiawei Cai a, Jiawei Xu a, Yuxiang Huan a,b,∗, Yulong Yan a, Bin Huang a, Yongliang Guo a, Lirong Zheng a,∗, Zhuo Zou a,∗
a School of Information Science and Technology, State Key Laboratory of ASIC and System, Fudan University, Shanghai, China
b School of Electrical Engineering and Computer Science, KTH Royal Institute of Technology, Stockholm, Sweden

Article info

Article history:
Received 26 December 2020
Received in revised form 4 June 2021
Accepted 7 July 2021
Available online 14 July 2021

Keywords:
Internet of Things (IoT)
Edge computing
Distributed deep learning
Deep neural networks
Self-awareness

Abstract

Implementing artificial intelligence (AI) in the Internet of Things (IoT) involves a move from the cloud to the heterogeneous and low-power edge, following an urgent demand for deploying complex training tasks in a distributed and reliable manner. This work proposes a self-aware distributed deep learning (DDL) framework for IoT applications, which is applicable to heterogeneous edge devices, aiming to improve adaptivity and amortize the training cost. The self-aware design, including the dynamic self-organizing approach and the self-healing method, enhances the system reliability and resilience. Three typical edge devices are adopted with cross-platform Docker deployment: Personal Computers (PC) for general computing devices, Raspberry Pi 4Bs (Rpi) for resource-constrained edge devices, and Jetson Nanos (Jts) for AI-enabled edge devices. Benchmarked with ResNet-32 on CIFAR-10, the training efficiency of tested distributed clusters is increased by 8.44× compared to the standalone Rpi. The cluster with 11 heterogeneous edge devices achieves a training efficiency of 200.4 images/s and an accuracy of 92.45%. Results prove that the self-organizing approach functions well with dynamic changes like devices being removed or added. The self-healing method is evaluated with various stabilities, cluster scales, and breakdown cases, testifying that the reliability can be largely enhanced for extensively distributed deployments. The proposed DDL framework shows excellent performance for training implementation with heterogeneous edge devices in IoT applications with high-degree scalability and reliability.

© 2021 Elsevier B.V. All rights reserved.

1. Introduction

The Internet of Things (IoT) and Cyber–Physical Systems (CPSs) extend the physical world to cyberspace with vast devices from the cloud to the edge. The gap between the massive volume of data created by users/things every day and the data center traffic in the cloud is continuously growing, due to excessive data volume that may overburden cloud infrastructure [1]. It is necessary to solve the imbalance of energy efficiency, bandwidth, privacy, and real-time performance in crowdsourced IoT systems and CPSs. Many works present the notion of edge computing to settle the matter with its local processing and quick responses, thereby relieving the pressures of cloud services [2,3]. The practices of deep neural networks (DNNs) in IoT applications bring brand new prospects to the era of Big Data and intelligent computing. However, the computing resources for IoT applications are limited and dispersive, and thus fail to meet traditional DNN requirements [4]. There are two viable means to solve the issue: reducing the complexity of DNN models and exploiting the potential of edge participants. The former has been extensively studied with increasingly compact and lightweight DNN models. The latter can be achieved by optimizing the system with distributed solutions to schedule edge resources effectively for large-scale IoT applications.

Distributed deep learning (DDL) is thereupon proposed to balance intensive computations and limited computing resources in multi-agent networks for deep learning inference and training tasks [5–7], as demonstrated in Fig. 1. Many works have gained state-of-the-art performance for DDL inference in IoT applications with dedicated hardware [8]. The practices in DDL training include Microsoft's Tiresias [9], Google's TensorFlow [10], and Intel's BigDL [11]. Training is much more computationally prohibitive than inference. It is difficult to implement training in resource-constrained IoT scenarios due to the discordance between the complex model and the lightweight hardware. For example, the inference efficiency of NVIDIA V100 is 2.7× that of its training efficiency [12]. However, the tendency to deploy DDL training on the edge is prospective along with the growth in

∗ Corresponding authors.
E-mail addresses: [email protected] (Y. Jin), [email protected] (Y. Huan), [email protected] (L. Zheng), [email protected] (Z. Zou).

https://doi.org/10.1016/j.future.2021.07.010
0167-739X/© 2021 Elsevier B.V. All rights reserved.

Fig. 1. DDL implementation in IoT applications.

Fig. 2. Parallel methods for distributed deployment. (a) Data parallelism; (b)
model parallelism; and (c) pipeline parallelism.
emerging AI devices for IoT applications, and is further propelled by the underlying benefit of privacy improvement [13]. Federated learning (FL) enables DDL training in large-scale edge networks, yet the training is implemented in a collaborative manner with both the edge, and the cloud or a relatively centralized master node [14–16]. The problem of additional hardware cost or communication overhead may be induced by FL solutions in the cloud–edge cooperative pattern. To minimize the cost of purchasing extra high-performance AI devices, this work aims to exploit the unoccupied computing resources available in an already existing IoT environment with less effort, and in tasks that are not specially designed for AI.

To realize the deployment of DDL training only with existing edge devices, the following challenges must be solved: the heterogeneity of edge devices and the dynamic requirements of IoT systems. Edge devices adopt diverse instruction sets and operating systems with various performances and power consumption requirements. The computing capabilities of emerging AI devices for IoT applications vary by orders of magnitude from giga operations per second (GOPS) to tera operations per second (TOPS) [17]. The problem of cross-platform implementation should be resolved so that the differences in energy, cost, performance, resiliency, and security can be tolerated [18,19]. The self-aware system is promising for settling the dynamic system changes because it can reinforce the system reliability for coordinating the multi-agent networks autonomously and incorporating elements of intelligence. When combined with the distributed solutions, self-awareness can make the system function correctly within desired constraints despite highly dynamic changes [20,21].

In this paper, we propose a DDL framework for training deployment that can harness the intelligence of existing heterogeneous edge devices and tolerate dynamic system changes by optimizing training accuracy, enhancing system efficiency, and strengthening reliability with self-awareness. The key contributions are as follows:

• A DDL framework is designed for training on the edge to maximize overall performance while balancing training efficiency and accuracy. The optimized training accuracy with ResNet-32 on CIFAR-10 reaches 92.69% for 1PC_7Rpi. 1PC_3Jts7Rpi has a training efficiency of 200.4 images/s (sX_kYtZ represents the cluster with s X as the server, k Y and t Z as the workers);
• A self-aware design including the dynamic self-organizing approach and the self-healing method is utilized to enhance the system reliability and resilience. The experimental results verify that the system malfunction risk can be largely reduced and the reliability can be maintained with dynamic changes;
• A broad spectrum of system compositions with heterogeneous edge devices is applicable in our DDL framework with Docker implementation to improve cross-platform adaptivity for practical IoT applications.

The rest of the paper is organized as follows. Related works and technical backgrounds are reviewed in Section 2. Section 3 discusses the proposed self-aware DDL framework. Experimental results are analyzed in Section 4, and this is followed by a conclusion in Section 5.

2. Related works and technical backgrounds

In this section, the related works on DDL implementation in emerging IoT and edge applications are presented.

2.1. Distributed deployment with parallelism

The value of neural networks lies in their high parallelism. As shown in Fig. 2, the parallel solutions include data parallelism, model parallelism, and pipeline parallelism [8,32].

Data parallelism (DP). Each worker runs the whole model while different workers process various inputs. The typical method partitions the source data into multiple mini-batches, similar to the stochastic gradient descent (SGD) [33].

Model parallelism (MP). Different models or their splits are deployed on various workers which have the same inputs for processing. The dimensional variants such as height, width, and channel of DNN models can be divided to conserve memory and computing resources.

Pipeline Parallelism (PP). This combines data parallelism and model parallelism. The entire model is partitioned into several partial models assigned to various devices, and the input dataset is also split into micro-batches. The training process is streamed along the worker sequence to build the pipeline.

Data parallelism can fulfill the requirements for IoT applications with heterogeneous nodes and is supported by a majority of DL frameworks thanks to fewer hardware-level limitations and better adaptivity. Model parallelism and pipeline parallelism are less frequently used in traditional hardware because they often require a custom model-specific distribution layout and a higher network bandwidth.

2.2. Distributed system architecture

The distributed system architecture focuses on parameter synchronization and network scalability. Fig. 3 shows two widely adopted architectures, i.e. the centralized architecture and the decentralized architecture [8].

Centralized architecture. As the most prominent architecture of data-parallel DL systems, the centralized architecture has two major parts: parameter servers (PSs) and workers [34]. The PS is responsible for storing model information, receiving local gradients from workers, and updating new model parameters. The worker takes charge of storing partial training data, pulling the latest model from the PS, computing local gradients, and sending them to the PS. The PS determines the bottleneck of the network bandwidth, which may result in network congestion and

Table 1
Comparison with related works of DDL solutions.
Work | Distributed pattern | Composition | Platform | Model | Task | Location | Reliability
MoDNN [22] | Centralized MP | Edge devices (LG Nexus 5) | MXNet | VGG-16 | Inference | Edge | –
NoNN [23] | Decentralized PP | Raspberry Pi-3 | PyTorch | ResNet (WRN40-4) | Inference | Edge | –
Neurosurgeon [24] | PP | Jetson TK1 + NVIDIA Tesla | – | AlexNet, VGG16, DeepFace, ... | Inference | Edge | –
JointDNN [25] | PP | Jetson TX2 + NVIDIA Tesla K40C | – | AlexNet, VGG16, DeepSpeech, ... | Training/Inference | Cloud + Edge | –
DeepThings [26] | Centralized MP | Raspberry Pi-3 | – | YOLOv2, Darknet | Inference | Edge | –
BlueConnect [27] | Decentralized DP | Heterogeneous GPUs | Caffe2 | ResNet-50 | Training | Cloud | –
ELDLF [28] | Centralized DP | Heterogeneous GPUs | TensorFlow | ResNet-50 | Training | Cloud | –
AdaptiveFL [29] | Centralized DP | Raspberry Pi-3 + laptop | TensorFlow | SVM, K-means, ... | Training | Cloud + Edge | –
CE-FedAvg [30] | Centralized DP | Raspberry Pi-2B/3B + desktop | TensorFlow | MNIST-CNN | Training | Cloud + Edge | –
FedProx [31] | Centralized DP | Intel Xeon CPUs + NVidia 1080Ti GPUs | TensorFlow | LSTM, logistic regression | Training | Cloud | Convergence guarantee
This work | Centralized DP | Heterogeneous edge devices | TensorFlow | ResNet-32 | Training | Edge | Self-aware design

bandwidth waste caused by broadcast operations. This architecture is adopted by many approaches such as TensorFlow [10], GeePS [35], and Poseidon [36].

Decentralized architecture. The decentralized architecture links all nodes without a central server, where the source data is split into small slices and each node only communicates with its neighbors. The bandwidth utilization can be greatly optimized when the number of nodes increases, since it guarantees that the available upload and download bandwidths of each node are fully utilized. This architecture has been studied by the Ring-AllReduce design from Baidu [37], the PowerAI Vision from IBM [38], and the Horovod architecture from Uber [39].

Fig. 3. Distributed system architecture. (a) Centralized architecture and (b) decentralized architecture.

Many approaches have been introduced for DDL with the abovementioned parallel methods and system architectures, as listed in Table 1. MoDNN [22] designs a local distributed computing system to partition trained DNN models onto several mobile devices to accelerate inference. Meanwhile, NoNN [23] presents a distributed IoT learning paradigm that compresses a pre-trained deep network into several modules without loss of accuracy. A lightweight scheduler is proposed in Neurosurgeon [24] with horizontal distribution automatically partitioning the DNN model between mobile devices and data centers. JointDNN [25] adopts an efficient engine for collaborative computation between mobile devices and the cloud, and DeepThings [26] adaptively implements the vertically distributed CNN-based inference on resource-constrained edge clusters. BlueConnect [27] and ELDLF [28] discuss the DDL framework for heterogeneous multi-GPU clusters to effectively improve resource utilization. AdaptiveFL and CE-FedAvg are FL approaches that utilize edge devices for DDL training with cloud and edge collaborations. FedProx tackles the system heterogeneity in FL networks with convergence guarantees.

2.3. Heterogeneity of emerging edge devices

Multitudinous edge devices are continually employed to fulfill various IoT applications. Their diversities are embodied in three aspects: the operating system, instruction set architecture, and quality of experience (QoE) [40]. The commonly used operating systems are Windows, Linux, Android, and WinCE. The instruction set architecture comprises complex instruction set computing (CISC), reduced instruction set computing (RISC), explicitly parallel instruction computing (EPIC), and very long instruction word (VLIW). The QoE of edge devices differs a lot in end-to-end delay, computing performance, power consumption, and service life. Many edge devices also possess the sensing ability for information collection ranging over visual, auditory, tactile, and olfactory perception [41], bringing a wide variety of source data that can be used for IoT applications.

Driven by AI development, studies focusing on the emerging hardware for DNN applications have advanced in recent decades, covering CPU, GPU, SoCs, FPGAs, and dedicated AI processors [42]. These devices can support sophisticated DL algorithms with diverse performances, whose computing capabilities range from tens of GOPS to hundreds of TOPS. The scope of power consumption is from the milliwatt level for specialized low-power hardware to the hectowatt level for powerful GPUs. To reduce the power consumption and improve the processing performance, specialized hardware, especially the one with domain-specific architecture, is considered for DNN processor/accelerator design. By balancing the generality, performance, and power, these low-power intelligent devices can be adopted on the edge to explore the potential of always-on IoT applications. Solutions have been proposed to implement the training on resource-constrained edge devices [29] or designing low-power domain-specific AI accelerators for training in IoT applications [43]. The perspective of utilizing more emerging edge devices for training inspires us to propose a distributed deep learning framework to realize the edge-based training with less dependency on the cloud; we accomplish this by utilizing existing edge resources instead of spending extra cost buying additional AI devices or servers.

2.4. Self-awareness in distributed IoT

Continuous environment changes and uncertainties in IoT lead to complex effects that must be controlled to provide performance and application guarantees. Self-awareness as a paradigm is proposed to cope with such complexity in CPSs, covering adaptive industrial monitoring, system resource arrangement, and on-chip storage management. Self-aware computing systems can sense themselves and the environment, including their state, possible actions, and run-time behavior, as well as acting for goals or responding to changes based on their reasoning [44,45].

When combined with distributed IoT applications, self-awareness is promising for the task of coordinating multiple agents and dealing with unknown accidents, as discussed by

Fig. 4. System architecture for DDL implementation with heterogeneous edge devices.

many researchers focusing on the deployment of self-awareness in distributed systems. The self-organizing behavior and heterogeneous configuration of smart camera networks that use strategies based on social and economic knowledge to target communication activity are studied in [20]. SAMBA proposes a self-aware health monitoring and bio-inspired coordination for distributed automation systems to increase the adaptivity to rapidly changing environments and conditions of CPSs [21]. SACA presents a self-aware communication architecture for sustainable distributed networking over IoT devices with prolonged connectivity [46]. Furthermore, a self-aware and self-adaptive cloud autoscaling system is designed for cloud-based services and applications through various configurations of cloud software and provisions of hardware resources [47]. These practices reveal and explore the potential of self-awareness in the sophisticated IoT system to tackle the challenges of dynamic changes and achieve superior performance.

3. Proposed DDL framework

It is essential to answer the following pivotal question: How can we implement a distributed training system with heterogeneous edge devices for a given DNN model where (1) the system topology can well fit within the performance budgets with a small overhead and protect individual privacy, (2) the cross-platform/cross-device implementation can be easily realized for heterogeneous edge devices, and (3) the proposed approach can achieve considerable self-awareness? Table 1 shows that the training is mostly deployed with high-performance GPUs rather than resource-constrained edge devices in previous works. Besides, the edge nodes are usually utilized for easy AI applications such as inference, or work collaboratively with the cloud server for training. The distributed approaches with edge devices usually consider homogeneous rather than heterogeneous cases. Therefore, the main contribution of this work over existing state-of-the-art works is that we delve into these issues by proposing a DDL framework with heterogeneous resource-constrained edge devices for training implementation to balance accuracy optimization, training acceleration, and system reliability. The self-aware design in the proposed framework includes two parts to make the system function more correctly and appropriately, and within desired constraints despite dynamic device changes. Compared with most FL and other distributed solutions, (1) our framework provides the reconfigurable role assignment for any device so that the light-weight edge device can be assigned as either a PS or a worker, and is interchangeable in a real-time manner; (2) there is no need to use any dedicated server to reuse the existing edge resource for deploying the training tasks and reducing the cost; and (3) the whole training process is deployed completely on the edge in a non-hierarchical manner without any help from the cloud.

3.1. Baseline DDL system architecture

This work adopts the centralized architecture because it can be better adapted to practical applications in heterogeneous edge scenarios. The decentralized architecture can hardly be fulfilled under the same circumstances since it requires that the performances among network devices are about the same and that each gradient computation is completed before the gradient fusion and communication. Considering that the training speed is limited by the slowest device in the synchronous mode and that the computing capabilities of edge devices vary significantly, the asynchronous mode is used to effectively solve the problem of disparity in processing speed while keeping the accuracy almost unchanged, so that PSs do not need to gather all sub-gradients at the same time. The proposed DDL framework for DNN training with heterogeneous edge devices in the IoT is demonstrated in Fig. 4. The whole system is partitioned into multiple clusters, where mutual networking and computing are performed inside autonomously. Each cluster processes independent DNN training tasks, containing several PSs and affiliated workers. The training process contains forward and backward computations that are executed on the worker side.

Note that in the traditional centralized architecture, the role of each device is certain and usually remains unchanged throughout. However, in an edge computing scenario where the system composition often changes anytime and anywhere, a fixed configuration for device roles becomes inadaptable. The motivation of this work is to exploit the existing edge computing resources to the utmost extent by bringing down the traditional high-performance requirement for servers, while considering that most of the edge devices are quite resource-limited and unsuitable as traditional computation-intensive servers. A PS in the proposed DDL framework can be any device in the cluster as long as it is able to handle the workloads of gradients gathering and parameters updating in the system. Besides, the role of either a PS or a worker for each device is reconfigurable, depending on the real-time system status and requirements. The number of PSs M, workers N, and their ratio M/N can be altered depending on the application scale. The processing efficiency of the whole system can be restricted by the total bandwidth of servers when the ratio M/N is comparably small; meanwhile, a large ratio may lead to a waste of computing resources. When the training speed is

encumbered by the bandwidth bottleneck of servers in a cluster, it is able to transfer the assignment of some device by changing a worker to a PS, or adding an unoccupied device as a PS for cluster extension. In this way, the pressure on servers can be relieved through reconfigurable role assignment, thus enabling the distributed system to dynamically adjust itself for changes. However, it is preferable to use as few PSs as possible if their bandwidth is enough for the communications of the adopted distributed training task, so as to maximize the computing capability of the worker group and avoid attenuation of computing capability. The optimal ratio can be obtained by balancing factors such as bandwidth limitation, networking size of the learning model, and the batch size of the training dataset.

The distributed workflow between PSs and affiliated workers is demonstrated in Fig. 5, where PSs and workers in each cluster have different functions. Here, servers are responsible for initializing the model, counting the number of training steps completed, monitoring the session, and saving and restoring model checkpoints to recover from failures. Almost all computation-intensive tasks are carried out by workers, whereas the performance requirements of the PS are relatively lower. The information exchange between PSs and workers includes two procedures: (1) the partial gradients are transmitted from workers to PSs and (2) the updated model weights are transmitted from PSs to workers. For each iteration, the uploaded information from worker i to PSs is ∆wi to protect the data privacy on the edge. The partial gradient ∆wi computed by worker i refers to ∂Li/∂wi, where wi is the last round weights and Li is the achieved loss function computed over the mini-batch i. PSs are responsible for computing the w′ for the updated model with relieved computational tasks taken by the worker group, according to the updating rule w′i = wi − η∆wi [48], where η is the learning rate and w′i is the updated weights. Therefore, model updating can be accomplished by PSs without extra effort as it is not necessary for PSs to communicate or synchronize with each other.

Fig. 5. Distributed workflow between PSs and affiliated workers.

This work adopts a dynamic task assignment to ensure that the idle state can be shortened as much as possible, thereby maximizing the system training speed by exploiting the usage of each participant. Before initiating the training deployment of a cluster, a pretesting procedure is executed that goes along each device to acquire its information, including the computing capability, the bandwidth for communication, and the device status. The computing capability and the bandwidth information of each device are acquired by running a lightweight testing program. If any device is found to be offline, the device will be removed from the training task by deleting its information from the token.

Fig. 6 illustrates the dynamic task assignment among workers as first come, first served. The epochs taken by each device vary according to its computing capability and the training condition such as the bandwidth limitation and the device functioning status. Each worker receives one task for each update round and obtains its next task assignment from PSs only after the current task is fulfilled, and corresponding results are received by PSs.

Fig. 6. Dynamic task assignment for distributed training processes among workers.

The system variants involved and their definitions are listed in Table 2. At the very start, server S distributes the model parameters P to affiliated workers. It aggregates all sub-gradients G and then updates the new model parameters P after calculations. All workers W in the system load the distributive training sub-dataset M and obtain the latest model parameters P preserved and updated by the PS S. The computations on the edge are processed locally by workers W for training results and partial gradients G, with the latter being offloaded to the superior PS. The system workflow of the training process is shown in Algorithm 1.

Algorithm 1 System workflow
Input: input workloads at time t
Output: preserve and offload training results at time t
Initialization: empty the memory structure; initialize the DNN models
Parameter Server:
1: while t < iteration do
2:   for i = 1 to M parameter servers do
3:     for j = 1 to N workers do
4:       parameter server Si distributes latest model parameters Pj(t) to worker Wj;
5:       parameter server Si aggregates sub-gradients Gij(t) from worker Wj;
6:       parameter server Si computes and updates new parameters Pj(t + 1);
7:     end for
8:   end for
9:   update time t = t + 1;
10: end while
Worker:
1: while t < iteration do
2:   for j = 1 to N workers do
3:     worker Wj loads training sub-dataset Mj;
4:     worker Wj loads model parameters Pj(t);
5:     worker Wj executes computations for results and sub-gradients Gij(t);
6:     worker Wj offloads sub-gradients Gij(t) to its parameter server Si;
7:   end for
8:   update time t = t + 1;
9: end while
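For illustration, the following minimal Python sketch mimics the asynchronous workflow of Algorithm 1 and the updating rule w′i = wi − η∆wi on a single machine: one parameter-server loop applies sub-gradients as they arrive, while worker threads stand in for edge devices. The linear model, synthetic data, and helper names (make_shards, compute_gradient) are illustrative assumptions and not the actual implementation of this work.

import queue
import threading
import numpy as np

def make_shards(X, y, n_workers):
    # Data parallelism: split the training set into per-worker sub-datasets M_j.
    idx = np.array_split(np.arange(len(X)), n_workers)
    return [(X[i], y[i]) for i in idx]

def compute_gradient(w, Xb, yb):
    # Partial gradient dL/dw of a mean-squared-error loss on one mini-batch.
    return 2.0 * Xb.T @ (Xb @ w - yb) / len(yb)

def worker(shard, pull_params, push_grad, iterations):
    Xj, yj = shard
    for _ in range(iterations):
        w = pull_params()                # load the latest parameters P_j(t) from the PS
        g = compute_gradient(w, Xj, yj)  # local computation on the worker side
        push_grad(g)                     # offload the sub-gradient G_ij(t) to the PS

def run_cluster(X, y, n_workers=3, iterations=50, lr=0.1):
    w = np.zeros(X.shape[1])
    grads = queue.Queue()
    lock = threading.Lock()

    def pull_params():
        with lock:
            return w.copy()

    def push_grad(g):
        grads.put(g)

    threads = [threading.Thread(target=worker,
                                args=(s, pull_params, push_grad, iterations))
               for s in make_shards(X, y, n_workers)]
    for t in threads:
        t.start()
    # Asynchronous PS: apply w' = w - eta * grad as each sub-gradient arrives,
    # without waiting to gather the gradients of all workers.
    for _ in range(n_workers * iterations):
        g = grads.get()
        with lock:
            w -= lr * g
    for t in threads:
        t.join()
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(600, 4))
    true_w = np.array([1.0, -2.0, 0.5, 3.0])
    y = X @ true_w + 0.01 * rng.normal(size=600)
    print("recovered weights:", np.round(run_cluster(X, y), 2))

In an actual deployment each worker runs on a separate edge device and exchanges gradients over the network; the sketch only mirrors the control flow of Algorithm 1.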

Table 2
System variants.
Notation | Definition
Training set M | Training sub-dataset distributed across the workers for data-parallel computing
Parameter server S | Server nodes that store, update, and distribute model parameters for the cluster
Worker W | Worker nodes that execute training computations for various applications
Gradients G | Partitions that contain the gradients of each worker for local preservation and global update
Parameters P | Partitions that contain the parameters of the training model preserved by worker nodes and updated by server nodes

Fig. 7. The implementation procedure for heterogeneous edge devices on top of Docker.

In addition, the proposed DDL framework can improve the privacy problem since all input resources are stored and processed locally on the edge side. Unlike traditional cloud-based DNN implementations where the source data collected at the edge are sent to the cloud server for further analysis, our design ensures that the only information exchanges among nodes are intermediate partial gradients and updated global gradients, as demonstrated in the aforementioned distributed workflow. These gradients of DNN models cannot be utilized to reveal the original data. In this way, any additional data protection methods such as data encryption are superfluous, and resources can be saved for dedicated hardware design and communicational overhead for data security. In practical IoT applications, each edge device can possess the absolute control of the data obtained by itself without data sharing or transfer. Hence, the privacy problem can be settled natively with our DDL framework by isolating input data from the information interchanges.

3.2. System scalability management

Due to the heterogeneity of edge networks, system implementation on various devices becomes the choke point for design scalability. To solve this problem, the technology of Docker is adopted in this work to improve cross-platform adaptivity. Docker is a lightweight and easily accessible virtualization technology whose functions lie in packaging required applications and environments into one file [49]. It supports multiple platforms, instruction set architectures, and operating systems with good isolation between inside processes and system resources [50]. The implementation procedure for heterogeneous edge devices on top of Docker is demonstrated in Fig. 7. The required development environments are packaged as a Docker image file, including Python, TensorFlow, and the programs for the distributed deployment. If any device is to be added into the distributed system, it needs to install Docker based on its system and instruction set type, and then load the saved image file for quick setup. After the necessary environments contained in the image file are all established, the final configuration is completed by filling the related device information into the program, such as the device IP address. By utilizing Docker in system development, the preparation time of our design can be largely saved and the differences between devices can be settled to ensure the system's compatibility and maintainability.

Assume that the total training workload for one cluster is Q_training images and the training efficiency of each worker is E_i images/s (i ∈ [1, n]). The communication time between each worker and server is T_comm,i. The total system training time T_distributed should be

\[
\begin{cases}
\sum_i Q_i = Q_{training} \\
T_{distributed} = \max_i \left\{ \dfrac{Q_i}{E_i} + T_{comm,i} \right\}
\end{cases} \tag{1}
\]

To verify the optimized effectiveness of our distributed design, the homogeneous clusters are evaluated where each worker owns the same training efficiency E and communication time T_comm. The total training time is

\[
T_{homo} = \frac{Q_i}{E_i} + T_{comm} = \frac{Q_{training}}{NE} + T_{comm}. \tag{2}
\]

The training time for the standalone device is

\[
T_{stan} = \frac{Q_{training}}{E}. \tag{3}
\]

Therefore, by increasing the number of workers, our distributed design can speed up the system performance with the speed-up ratio R_speedup for the homogeneously distributed cluster compared to the standalone device

\[
R_{speedup} = \frac{T_{stan}}{T_{homo}} = \frac{N Q_{training}}{Q_{training} + N E T_{comm}}. \tag{4}
\]

When the communicational overhead is far less than the computing time, the speed-up ratio is close to N. However, the system optimization may be implicated for non-negligible communication costs induced by inefficient transmission technology of networks, or the encountered bandwidth bottleneck of PSs. The problem of communicational overhead and limited bandwidth can be severe when dealing with an exaggerated cluster scale. For homogeneous clusters, if the total training workload is too small compared to the communication workload or the training efficiency of workers, the distributed training will be counterproductive. This happens when workers that own high computing capabilities complete the training tasks quickly but spend plenty of time on communications. The related variables should be carefully determined to maximize the speed-up ratio in practical distributed applications, by using appropriate transmission technology, selecting PSs with large bandwidth, and balancing the worker scale.

Moreover, the performance differences among workers should not be excessive since all workers share the same global gradients to prevent the problem of stale model. Due to the gradient dispersion, the training process is hard to converge, and may lead to the local optimal solution. In order to maximize the processing performance of the distributed system, the following requirements should be satisfied:

\[
\begin{cases}
\sum C_{ps} \ge \alpha M \cdot \sum f_w \\
\sum B_{ps} \ge M \cdot \sum f_w \\
\min(C_w) \ge \beta \cdot \max(C_w)
\end{cases} \tag{5}
\]

Here, C_ps and C_w are the computing capabilities of the server group and worker group, respectively. f_w is the update frequency of each worker, M is the model size, and B_ps is the bandwidth limitation of each PS. α represents the conversion factor between

the model, while β specifies the suggested performance ratio between the worst worker and the best worker in the cluster. The workers that do not conform with the requirements presented in Eq. (5) should be removed from the DDL queue to avoid being an encumbrance. An appropriate cluster constitution should be selected to maximize the system reliability on the basis of the used communication method, the system bandwidth limitation, as well as the reliability of participant devices.

3.3. Self-aware design

The self-aware design combined in our DDL framework includes two parts: a self-organizing approach with the token delivery for system scalability management, and a self-healing method for system reliability and resilience management, with a token delivered along all devices in the cluster.

The self-organizing approach is designed to realize the self-aware system scalability and the dynamic arrangement, as demonstrated in Fig. 8. The distributed system can be self-organized with the token delivery, thus dynamically monitoring the status of each node. The token delivery is carried out with the virtually closed chain structure, where the device order is determined by a token that preserves the device status and corresponding IP address in the system. The token is expressed as follows:

\[
token = \{workergroup: [info^{w}_{1}, info^{w}_{2}, \ldots, info^{w}_{k}],\; servergroup: [info^{s}_{1}, info^{s}_{2}, \ldots, info^{s}_{t}]\}, \tag{6}
\]

where all devices are divided into two parts: the worker group (k devices) and the server group (t devices). The token delivery mechanism is explained in Algorithm 2. The initial token is obtained by running the aforementioned pretesting program, where the configuration parameters needed include the total number of devices, the IP address of each device, and the pre-defined PS/worker ratio. The computational and communicational performance, device status, and IP address of each device in the system are preserved in the token with a ranked order. Based on the total device number and the pre-defined PS/worker ratio given by users, the program gets the appropriate number of PSs and workers for the distributed training task. Each time the program is launched by a device in accordance with the given IP address book, the necessary device information is obtained, and added to the initial token. When all devices have been tested, the last device will rank all devices' information collected in the initial token based on the computing capability and the bandwidth for communication, and automatically divide all devices into two groups (the server group and the worker group) based on the sorted information. System changes, including any device being removed or added, can be regulated by modulating the token information. If any device is found offline or added, the device that holds the token deletes the corresponding information from the token or adds it to the token. The role of a PS/worker will be reassigned if the requirement of the server group cannot be fulfilled after the token is updated. By adjusting the token information dynamically, the distributed training is guaranteed to be congruent with the latest participants in appropriate management.

Fig. 8. Dynamic self-organizing approach with acentric token delivery.

Algorithm 2 Token delivery procedures
Task Scheduler:
1: function InitialToken
2:   collect device_info_all for all devices in cluster;
3:   rank device_info_all and assign groups;
4:   constitute initial token token(0);
5: end function
6: function DeliverToken(t)
7:   detect next available device TokenHolder(t + 1);
8:   update token(t + 1) if any found offline or added;
9:   deliver token(t + 1) to TokenHolder(t + 1);
10: end function
11: function UpdateToken(t)
12:   constitute token(t + 1);
13: end function
14: function DeleteDevice(offline_device_info)
15:   stop all device processes;
16:   delete offline_device_info from device_info_all;
17:   restart remaining devices when offline_device_info in ServerGroup;
18: end function
19: function AddDevice(new_device_info)
20:   stop all device processes;
21:   add new_device_info to device_info_all;
22:   restart devices from the latest checkpoint;
23: end function

In order to enhance the reliability and resilience of the system, this work proposes the self-healing method to reduce the risk of malfunction. As illustrated in our DDL framework, workers are responsible for training while PSs are mainly responsible for model updating. Considering the nature of DDL, any worker failure will hardly influence the whole training process whereas the breakdown of any PS can make the cluster out of order and lead to system malfunction. Only when any PS fails should the whole distributed training process be restarted. The self-healing design with token delivery is highly reliable in the DDL system since all devices are united as the foundation of fault detection, as illustrated in Fig. 9.

Fig. 9. Flow diagram of the self-healing method with acentric token delivery.

The device that stores the token, for

Table 3
Experimental setups.
Device | Processing core | Memory | Operating system | Instruction set
Personal Computer (PC) (Lenovo Legion Y7000) | Intel i5-9300 (CPU) | 16 GB LPDDR4 | Windows x86 | CISC
Raspberry Pi 4B (Rpi) | 1.5 GHz Quad-Core Broadcom BCM2711 (CPU) | 4 GB LPDDR4 | Linux Armv7l | RISC
Nvidia Jetson Nano (Jts) | NVIDIA Maxwell with 128 NVIDIA CUDA cores (GPU) | 4 GB LPDDR4 | Linux AArch64 | RISC

example device_1, starts to send the token to the next device according to the order in the token, which is device_2. Before sending the token, device_1 tries to communicate with device_2 to determine whether it is still online. If so, device_1 sends the token; if not, device_1 tries to communicate with the next device (device_3) and continues in this manner until an online device is found. When any device is found offline during the training process, the detection process determines which group it belongs to according to its IP address. If it is in the worker group, the device is deleted from the token and the system keeps running. If it is in the server group, the global notification is triggered and broadcasted to restart all devices in the cluster from the most recent checkpoint. Before token delivery, device_1 deletes all offline device information from the token. device_2 waits to receive the token, and then performs the same procedures to check if the next device in the token is online. Therefore, the system should always keep proper functioning as long as the token holder works, no matter which group it belongs to.

Assume the error rate of any single device is e_i ∈ (0, 1) in a cluster that contains N devices (i ∈ [1, n]). The failure rate of the cluster with and without the proposed self-healing method is measured as P_w/t and P_wo/t, respectively. Considering that a system with token delivery should only malfunction when the token holder crashes, Eq. (7) shows the mathematical expectations E(P_wo/t) and E(P_w/t).

\[
\begin{cases}
E(P_{wo/t}) = 1 - \prod_i (1 - e_i) = 1 - \prod_i g_i \\
E(P_{w/t}) = \dfrac{1}{N} \sum_i e_i = \dfrac{1}{N} \sum_i (1 - g_i) \\
g_i = 1 - e_i, \quad e_i \in (0, 1)
\end{cases} \tag{7}
\]

Eq. (8) demonstrates that E(P_w/t) is always smaller than E(P_wo/t) because E(P_w/t) ∈ (0, 1), and the arithmetic mean (1/N) Σ_i g_i is always equal to or greater than the geometrical mean (Π_i g_i)^(1/N) of the same non-negative real numbers.

\[
1 - E(P_{w/t}) = \frac{1}{N}\sum_i g_i > \Big(\frac{1}{N}\sum_i g_i\Big)^{N} \ge \Big(\big(\prod_i g_i\big)^{\frac{1}{N}}\Big)^{N} = \prod_i g_i = 1 - E(P_{wo/t}) \;\Rightarrow\; E(P_{w/t}) < E(P_{wo/t}). \tag{8}
\]

It is proved that with the proposed self-healing method, the system failure possibility can be reduced for any cluster size and error rate case. The failure of nonsignificant parts can be neglected with few unfavorable effects imposed on subsequent training. Furthermore, the customizable token is flexible for dynamic system expansion and complex networking topology. The number of devices employed during each round of communication and the delivery path can be adjusted depending on various applications.

4. Experiments and evaluations

The performance of the proposed design is benchmarked with the widely used and publicly available DNN model ResNet-32, considering its merits of being lightweight and easily implemented on resource-constrained edge devices. It has 32 layers and 0.46 MB parameters [51]. The CIFAR-10 dataset is adopted as the training dataset, consisting of 50k training images and 10k testing images in 10 classes. The benchmark is utilized as a case study for evaluating and validating the effectiveness of the proposed distributed deep learning framework on heterogeneous edge devices. The evaluation is carried out in the aspects of training accuracy, training efficiency, and reliability to demonstrate the strengths of our design. The proposed DDL framework is extensively evaluated on various compositions of distributed clusters with constraints. The proposed framework is applicable for any lightweight DNN training model as long as it can be run on a single edge device.

4.1. Implementation setup

The distributed training is carried out with the Python 3 compiler and TensorFlow version 1.14. Distributed TensorFlow is adapted and modified in this work as the most popular DDL platform that supports the PS architecture. It is also applicable to transfer the proposed DDL framework to other mainstream machine learning platforms such as PyTorch. The system training is carried out based on the data-parallel PS architecture in asynchronous mode. Wi-Fi is utilized in the real-case experiments for cluster communications. The framework is evaluated with different hardware including Personal Computers (PC), Raspberry Pi 4Bs (Rpi), and NVIDIA Jetson Nanos (Jts); they respectively represent the general computing device, the resource-constrained edge device, and the AI-enabled edge device. The configuration details for setup are listed in Table 3. With cross compiling of Docker, the implementation for Rpi and Jts is fulfilled on Ubuntu Arm64, and that for PC is fulfilled on Windows. The cluster is represented as sX_kYtZ with s X as the server, k Y and t Z as workers. The experiments conducted with these three types of typical edge devices verify the effectiveness of our framework and point out the practicality for other edge devices. For devices that do not support TensorFlow, their performance is very likely lower than Rpi. Considering that the contribution that can be offered by a Rpi for training is already limited, it is not suggested to adopt these weaker devices for training since they will encumber the whole process.

Due to limited experimental conditions, the maximum number of nodes verified for one cluster is 11, which is relatively small compared to large-scale applications. It is also possible for our design to be applied to large-scale IoT applications in further studies. The transmission bottleneck is not manifested in our experiments so that each cluster adopts multiple workers and only 1 PS for evaluations.

4.2. Training accuracy

The original test accuracy of ResNet-32 on CIFAR-10 is 92.49% [51]. The training accuracy on CIFAR-10 is tested with different clusters to verify the performance of our design.

Fig. 10 demonstrates the training and testing accuracy of standalone devices. The testing accuracy of PC, Jts, and Rpi standalones can reach 92.30%, 92.27%, and 92.72% after 172 651, 62 660, and 845 347 s (150, 100, and 150 epochs), respectively. For different distributed clusters, the training accuracy fluctuates between 92.45% and 92.69% without any accuracy loss compared to the original work on ResNet-32, as detailed in Table 4.
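As a quick numerical illustration of the failure-rate expectations derived in Eqs. (7) and (8), the short Python sketch below compares E(P_wo/t) and E(P_w/t) for homogeneous clusters; the per-device error rate and the cluster sizes are example values chosen for illustration, not measurements from this work.

import numpy as np

def failure_expectations(error_rates):
    # Eq. (7): without token delivery the cluster fails if any device fails;
    # with token delivery only the current token holder's failure matters,
    # so the expected failure rate is the mean device error rate.
    g = 1.0 - np.asarray(error_rates, dtype=float)   # g_i = 1 - e_i
    p_without_token = 1.0 - np.prod(g)               # E(P_wo/t)
    p_with_token = float(np.mean(1.0 - g))           # E(P_w/t)
    return p_without_token, p_with_token

# Homogeneous clusters with an illustrative per-device error rate of 1%.
for n in (2, 10, 50, 100):
    p_wo, p_w = failure_expectations([0.01] * n)
    # Eq. (8) guarantees p_w < p_wo for any cluster size and error rates.
    print(f"N={n:3d}: E(P_wo/t)={p_wo:.3f}  E(P_w/t)={p_w:.3f}")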

Fig. 10. Training accuracy of different standalone devices. (a) Jts, (b) Rpi, and (c) PC.

Table 4
Training accuracy for different cases with ResNet-32 on CIFAR-10.
Group | Cluster composition | Test accuracy (%) | Training time (s)
Reference [51] | – | 92.49 | –
Standalone | PC | 92.30 | 172 651
Standalone | Jts | 92.27 | 62 660
Standalone | Rpi | 92.72 | 845 347
Distributed | 1PC_3Jts | 92.55 | 39 391
Distributed | 1Rpi_7Rpi | 92.69 | 129 055
Distributed | 1PC_3Jts7Rpi | 92.45 | 37 420
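For reference, the parameter-server layout described in Section 4.1 (distributed TensorFlow 1.x, one PS plus several workers, asynchronous data-parallel updates) can be expressed roughly as in the sketch below. The IP addresses, ports, dummy loss, and stopping step are placeholders assumed for illustration; the actual training scripts of this work are not reproduced here.

import tensorflow as tf  # TensorFlow 1.x, as used in the paper (version 1.14)

# Hypothetical addresses for one PS (e.g. a PC) and three workers (e.g. Jts/Rpi).
cluster = tf.train.ClusterSpec({
    "ps":     ["192.168.1.10:2222"],
    "worker": ["192.168.1.11:2222", "192.168.1.12:2222", "192.168.1.13:2222"],
})

def build_loss():
    # Stand-in for ResNet-32 on the local CIFAR-10 shard: a dummy quadratic loss.
    w = tf.get_variable("w", shape=[10], initializer=tf.zeros_initializer())
    return tf.reduce_mean(tf.square(w - 1.0))

def start_node(job_name, task_index):
    server = tf.train.Server(cluster, job_name=job_name, task_index=task_index)
    if job_name == "ps":
        server.join()  # the PS only stores, updates, and serves parameters
        return
    # Variables are placed on the PS; computation stays on this worker.
    with tf.device(tf.train.replica_device_setter(
            worker_device="/job:worker/task:%d" % task_index, cluster=cluster)):
        loss = build_loss()
        global_step = tf.train.get_or_create_global_step()
        # A plain (non-synchronized) optimizer gives asynchronous updates:
        # each worker applies its gradients without waiting for the others.
        train_op = tf.train.GradientDescentOptimizer(0.1).minimize(
            loss, global_step=global_step)
    hooks = [tf.train.StopAtStepHook(last_step=1000)]
    with tf.train.MonitoredTrainingSession(master=server.target,
                                           is_chief=(task_index == 0),
                                           hooks=hooks) as sess:
        while not sess.should_stop():
            sess.run(train_op)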

Table 5
Training performance for different cases with ResNet-32 on CIFAR-10.
Group | Cluster composition | Epoch time (s) | Efficiency (images/s)
Standalone | Jts | 417.7 | 119.7
Standalone | Rpi | 5636.0 | 8.9
Standalone | PC | 1151.0 | 43.4
Distributed | 1Rpi_7Rpi | 860.4 | 58.1
Distributed | 1PC_2Jts | 355.0 | 140.9
Distributed | 1PC_2Jts7Rpi | 280.9 | 178.0
Distributed | 1PC_3Jts7Rpi | 249.5 | 200.4

Fig. 11. The speed-up performance for different distributed clusters compared with the standalone Rpi clusters.
The experimental results show that the test accuracy main-


tains the same level between the distributed group and the
standalone group. Theoretically, our design can achieve the same
training results with the standalone group with the synchronous
mode. Although the disparity induced by the asynchroniza-
tion may result in some uncertainties, the asynchronous mode
adopted in the framework can increase training speed when the
hardware differences are non-negligible.
Additionally, the training process of distributed clusters is
more stable than that of standalone devices according to the
accuracy curve, since the volatile gradient updates caused by
distributed training can to a great extent prevent overfitting and
increase system resilience.
Fig. 12. Training efficiency of various clusters with normalized computing capability (taking the standalone Rpi as 1).

4.3. Training efficiency

The effectiveness of the proposed DDL framework with various cases is evaluated as shown in Table 5. For the standalone group, Jts has the best performance (119.7 images/s). Jts indicates the customized or specialized embedded hardware or accelerators toward AI applications. Jts is designed with GPU optimizations while PC and Rpi are based on CPU computations. The training efficiency of Rpi is the worst (8.9 images/s), thus representing the general capability of common resource-constrained IoT devices.

The training performance of homogeneous clusters is evaluated to verify the speed-up efficiency of the proposed DDL framework in Fig. 11. The additional communications that are induced by the distributed interactions and hinder the acceleration of training speed are considered in the evaluation. The communicational time for one epoch is obtained as 80 s by calculating the measured epoch time of the standalone Rpi and the distributed cluster 1Rpi_1Rpi. When the homogeneous cluster scales, the ideal performance of the speed-up ratio is shown by the blue curve. The data markers under the curve are the experimental results of different clusters. Disparities between the results and the ideal expectation are observed because the communicational time increases due to the blocked communications when the cluster expands in actual experiments. The trends for different clusters are the same as the ideal curve, which proves the accelerating performance of the proposed DDL framework along with the expansion of the cluster. The epoch time of cluster 1Rpi_1Rpi is slightly larger than that of the standalone Rpi due to additional communication costs for workload distributions.
Y. Jin, J. Cai, J. Xu et al. Future Generation Computer Systems 125 (2021) 908–920

Table 6
Decision-making for the PS selection.
Server Worker Epoch time (s) Efficiency (images/s)
1 Rpi 1 PC + 1 Jts 608.1 82.22
1 Jts 1 PC + 1 Rpi 453.5 110.25
1 PC 1 Jts + 1 Rpi 338.6 147.67

The mean of the measured training efficiencies of 1Rpi_3Rpi, 1Rpi_5Rpi, 1Rpi_7Rpi, and 1Rpi_9Rpi are 24.6, 41.3, 59.1, and 75.1 images/s. The improvements compared with the standalone Rpi are 2.76×, 4.70×, 6.59×, and 8.44×, respectively. The results of homogeneous clusters prove that the proposed DDL framework can reduce device hours by many orders of magnitude with increasing cluster size.

The comparison between training efficiency and normalized computing capability is illustrated in Fig. 12. The computing capability of each cluster is normalized, taking the standalone Rpi as 1. The circle size represents the cluster size. Due to our limited experimental conditions, the largest cluster 1PC_3Jts7Rpi with the highest computing capability shows the best training efficiency of 200.4 images/s and a total training time of 37 420 s. The results prove that with our DDL framework, the training efficiency of the cluster increases along with the computing capability, with little influence from cluster size. With more edge devices, the training performance can be greatly enhanced where the contribution of each device is related to its computing capability.

Table 6 shows the decision-making involved in the PS selection with given resource constraints, taking the cluster with 1 Rpi, 1 PC, and 1 Jts as an example. The best training efficiency is 1PC_1Rpi1Jts (147.67 images/s) rather than 1Rpi_1PC1Jts because the limited performance of Rpi as a PS obstructs the communications. It is found that the device with relatively low computing capability can be used as a PS to maximize the overall performance only if the components fulfill the communicational requirements declared in Eq. (5). Otherwise, the low-performance device as a PS will hinder the overall training performance. The finding agrees with our assignment principle for the cluster organization.

A comparison in performance and price between distributed edge clusters and high-performance computers is presented in Table 7. These findings demonstrate that even though the edge devices are resource-constrained, they can be combined together as a cluster to provide equivalent or even higher performance compared with normal computers equipped with GPU; this solution also costs less than upgrading hardware. The workstation in our lab equipped with an Nvidia GTX 2080Ti cost about $9000 in total (counting all the necessary modules, such as Intel processors and memories). The cost performance ratio in the table clarifies the reasonability and superiority of using resource-constrained edge devices for training in IoT applications, rather than executing the training process just on a GPU with high speed and efficiency.

Therefore, the proposed DDL framework with heterogeneous edge devices can function soundly without accuracy loss, and largely accelerate the training process with the expansion of cluster computing capability.

4.4. Self-awareness evaluations

The evaluations for the self-awareness in our design mainly focus on two aspects: the self-organizing approach for dynamic system management and the self-healing method for system reliability and resilience enhancement. Recent FL solutions also consider the temporary absence of a training node, but the basic system arrangement of other devices working regularly still remains unchanged. This work can not only deal with the temporary or permanent absence of any node, but also support to add/delete nodes as well as coordinate the existing resources for dynamic changes during the training process.

Fig. 13. Self-organizing performance of 1PC_2Jts7Rpi cluster for cases of two Jts dropped and one Jts added.

Fig. 14. Reliability results for various cases of homogeneous clusters: total error rate versus cluster size & error rate of device.

4.4.1. Self-organizing method

The feasibility of the proposed self-organizing method is verified with the 1PC_2Jts7Rpi cluster by removing or adding random devices during the training process. Fig. 13 demonstrates the system resilience for dynamic device changes. The training accuracy here changing over time shows the actual training process, with each heterochromatic marker representing the intermediate results updated from each worker after its epoch is completed.

The blue line shows the training process of the base case as 1PC_2Jts7Rpi normally functioning for 26 328 s in total (150 epochs). The two Jts workers (marked in red and yellow, respectively) are removed from the cluster at the 10th epoch in Fig. 13(a). The one Jts worker (marked in green) is added at the 10th epoch in Fig. 13(b). The updated clusters as 1PC_7Rpi and 1PC_3Jts7Rpi operate successfully after the adjustment. Their total training times are 61 148 and 15 156 s, respectively, since the changes in cluster influence the total computing capability. The

Table 7
Cost performance comparison between high-performance computers and edge clusters.

Device | Configuration | Performance (training efficiency) | Price | Cost performance^a
Workstation | Nvidia 2080Ti, 24-core i9-9920X, 128G SSD | 5000 images/s | $9000 | 0.56
Laptop | Nvidia 2060, 4-core Ryzen 7 3750H, 16G SSD | 500 images/s | $1175 | 0.42
Edge cluster | 8 Jts (Nvidia 128-core Maxwell, 4-core Cortex-A57, 4 GB LPDDR) | 990 images/s | $99*8 = $792 | 1.25
Edge cluster | 8 Jts + 8 Rpi (Nvidia 128-core Maxwell, 4-core Cortex-A57, 4 GB LPDDR + 4-core Cortex-A72, 4 GB LPDDR) | 1170 images/s | $(55+99)*8 = $1232 | 0.95

^a Cost performance is calculated by performance/price (images/s/$).
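The cost-performance column of Table 7 is a plain throughput-per-dollar ratio, as the footnote states. The short Python sketch below (device names and prices copied from the table) reproduces the listed values up to rounding:

```python
# Cost performance = training throughput / total price (images/s/$),
# as defined in the footnote of Table 7.
configs = {
    "Workstation (2080Ti)": (5000, 9000),          # images/s, USD
    "Laptop (2060)":        (500,  1175),
    "8 Jts":                (990,  99 * 8),        # $99 per Jetson Nano
    "8 Jts + 8 Rpi":        (1170, (55 + 99) * 8),
}

for name, (throughput, price) in configs.items():
    print(f"{name}: {throughput / price:.2f} images/s/$")
```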

4.4.2. Self-healing method

To explicitly evaluate the reliability of our solution, homogeneous clusters are used for the analysis, where all devices have the same error rate e. e varies from 10^-4 to 10^-1 following a Gaussian distribution, and the cluster size N ranges from 2 to 100, depending on the hardware reliability and system scale.

The reliability performance is evaluated for different cluster sizes and error rates, as illustrated in Fig. 14. The system failure rate of the cluster with the proposed self-healing method grows more slowly with increasing cluster size than that of the cluster without any fault tolerance, especially for large cluster scales and high error rates. Our self-healing design can thus greatly increase the reliability of the DDL system, which can be further optimized by improving the stability of each device in the cluster and scaling the cluster size appropriately.

Fig. 14. Reliability results for various cases of homogeneous clusters: total error rate versus cluster size & error rate of device.
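The exact failure model behind Fig. 14 follows our self-healing design; purely as an illustration of why fault tolerance slows the growth of the failure rate, the sketch below assumes independent device failures and a hypothetical mechanism that recovers from any single fault, so the cluster only breaks down when two or more devices fail within the same recovery window:

```python
from math import comb

def p_fail_no_tolerance(e, n):
    """Cluster fails if any of the n devices fails (independent failures)."""
    return 1.0 - (1.0 - e) ** n

def p_fail_single_fault_recovery(e, n):
    """Illustrative self-healing model: the cluster survives as long as at most
    one device fails within a recovery window."""
    p_at_most_one = (1.0 - e) ** n + comb(n, 1) * e * (1.0 - e) ** (n - 1)
    return 1.0 - p_at_most_one

for n in (2, 10, 50, 100):
    for e in (1e-4, 1e-2, 1e-1):
        print(f"N={n:3d}, e={e:g}: "
              f"no tolerance {p_fail_no_tolerance(e, n):.4f}, "
              f"self-healing {p_fail_single_fault_recovery(e, n):.4f}")
```

Even in this simplified model, the failure probability without tolerance grows roughly linearly in N for small e, whereas the single-fault-recovery variant grows much more slowly, matching the qualitative trend observed in Fig. 14.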

5. Conclusion

This work proposes a DDL framework for heterogeneous IoT edge devices with a self-aware design. The system adaptivity is greatly enhanced by the cross-platform Docker deployment, so heterogeneous edge devices can be utilized for distributed learning, which reduces the additional training cost. The self-aware design, including the dynamic self-organizing approach and the self-healing method, is presented to improve the system's reliability and resilience. The experimental results demonstrate that the proposed DDL scheme can greatly improve the training efficiency without accuracy loss on resource-constrained edge devices. The experiments are benchmarked with ResNet-32 on the CIFAR-10 dataset and tested on typical edge devices (PC, Rpi, and Jts). The training efficiency of the 1Rpi_9Rpi cluster is improved by 8.44× compared to that of the standalone Rpi, and the 1PC_3Jts7Rpi cluster reaches a training efficiency of 200.4 images/s with an accuracy of 92.45%. The self-organizing approach verifies the dynamic scalability of the system with devices removed or added during the training process. The reliability of the self-healing method is evaluated for different device stabilities, cluster scales, and breakdown cases, showing that the system malfunction rate can be largely reduced for extensive distributed deployment. For more sophisticated deep learning scenarios in which bandwidth becomes the bottleneck, a distributed system whose computational and communication resources are arranged more automatically, dynamically, and soundly can emerge from our DDL framework in future work. Such approaches will be studied further by optimizing the distributed system under limited bandwidth and by balancing training efficiency against communication costs through appropriate cluster assignment.

CRediT authorship contribution statement

Yi Jin: Conceptualization, Methodology, Writing – original draft. Jiawei Cai: Software, Investigation. Jiawei Xu: Conceptualization. Yuxiang Huan: Methodology. Yulong Yan: Visualization. Bin Huang: Software, Validation. Yongliang Guo: Data curation, Validation. Lirong Zheng: Supervision, Writing – review & editing. Zhuo Zou: Supervision, Writing – review & editing.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Acknowledgments

This work was supported in part by the National Natural Science Foundation of China under Grants 61876039, 62004045, and 62076066, in part by the Shanghai Municipal Science and Technology Major Project No. 2021SHZDZX0103, and in part by the Shanghai Platform for Neuromorphic and AI Chip under Grant 17DZ2260900.


Yi Jin received the B.S. degree in electronic information science and technology from Fudan University, Shanghai, China, in 2016, where she is currently working toward the Ph.D. degree in microelectronics and solid-state electronics. She is conducting research on efficient fault-tolerant design for the Internet of Things, low-power distributed architecture for embedded learning, distributed deep learning system design with edge computing, and Artificial Intelligence-of-Things.

Jiawei Cai received the B.S. degree in electronic engineering from Northeastern University, Shenyang, China, in 2017, and the M.S. degree in electronic engineering from Fudan University, Shanghai, China, in 2020. His current research interests include recommender systems, distributed deep learning system design, and machine learning applications.

Jiawei Xu received the B.S. degree in electronic engineering from Fudan University, Shanghai, China, in 2016, where she is currently pursuing the Ph.D. degree. She has been working in the field of low-power deep learning design since 2016. Currently, her research interests include embedded learning, low-power microprocessor architecture, and efficient hardware design for artificial neural networks.


Yuxiang Huan received his Ph.D. degree in microelectronics from Fudan University, Shanghai, China, in 2018. He is currently working at Fudan University as a research fellow. His research interests include low-power microprocessors, energy-efficient architectures for deep learning accelerators, and neuromorphic hardware for brain-like computing.

Yulong Yan received the B.E. degree in communication engineering from Shandong University, Jinan, China, in 2017. He is currently pursuing the Ph.D. degree in the School of Information Science and Technology, Fudan University, Shanghai, China. He has been working in the field of intelligent electronics and systems since 2016, especially on computer vision, pattern recognition, and machine learning algorithms for Internet of Things systems and applications.

Bin Huang received the B.E. degree in microelectronics science and engineering from Shanghai University, Shanghai, China, in 2019. He is currently pursuing his M.E. degree at Fudan University, Shanghai, China. His research interests include distributed deep learning architecture, embedded microprocessors, and Artificial Intelligence-of-Things.

Yongliang Guo received the B.S. degree from the College of Electronics and Information Engineering, Tongji University, Shanghai, China, in 2019. He is currently pursuing the M.S. degree in the School of Information Science and Engineering at Fudan University. His research interests are the fault-tolerance design of space devices and the system implementation of edge-based Internet of Things.

Lirong Zheng received the Ph.D. degree in electronic system design from the KTH Royal Institute of Technology (KTH), Stockholm, Sweden, in 2001. Afterward he worked at KTH as a research fellow, associate professor, and full professor. He has been the founding director of the iPack VINN Excellence Center of Sweden and the chair professor in media electronics at KTH since 2006. He has also been a guest professor (since 2008) and a distinguished professor (since 2010) at Fudan University, Shanghai, China. Currently, he holds the directorship of the Shanghai Institute of Intelligent Electronics and Systems, Fudan University. His research experience and interests include electronic circuits, wireless sensors, and systems for ambient intelligence and the Internet-of-Things. He has authored more than 500 publications and served as a steering board member of the International Conference on Internet-of-Things.

Zhuo Zou received his Ph.D. degree in Electronic and Computer Systems from the KTH Royal Institute of Technology, Sweden, in 2012. Currently, he is a professor at Fudan University, Shanghai, where he is conducting research on integrated circuits and systems for IoT and ubiquitous intelligence. Prior to joining Fudan, he was the assistant director and a project leader at the VINN iPack Excellence Center, KTH, Sweden, where he coordinated the research project on ultra-low-power embedded electronics for wireless sensing. Dr. Zou has also been an adjunct professor and docent at the University of Turku, Finland.
