
Received 2 July 2022, accepted 20 July 2022, date of publication 25 July 2022, date of current version 28 July 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3193690

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

SAMSON B. AKINTOYE¹, LIANGXIU HAN¹, XIN ZHANG¹, HAOMING CHEN², AND DAOQIANG ZHANG³, (Senior Member, IEEE)
¹Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, U.K.
²Department of Computer Science, The University of Sheffield, Sheffield S10 2TN, U.K.
³College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Corresponding author: Liangxiu Han ([email protected])


This work was supported in part by the Project by Royal Society–Academy of Medical Sciences Newton Advanced Fellowship under Grant NAF\R1\180371.

The associate editor coordinating the review of this manuscript and approving it for publication was Kostas Kolomvatsos.

ABSTRACT Recently, Deep Neural Networks (DNNs) have recorded significant success in handling
medical and other complex classification tasks. However, as the sizes of DNN models and the available
datasets increase, the training process becomes more complex and computationally intensive, usually taking
longer to complete. In this work, we have proposed a generic full end-to-end hybrid parallelization approach combining model and data parallelism for efficient distributed and scalable training of DNN models. We have also proposed a Genetic Algorithm Based Heuristic Resources Allocation (GABRA) mechanism for optimal distribution of partitions on the available GPUs for computing performance optimization. We have applied our proposed approach to a real use case based on a 3D Residual Attention Deep Neural Network (3D-ResAttNet) for efficient Alzheimer's Disease (AD) diagnosis on multiple GPUs, and compared it with the existing state-of-the-art parallel methods. The experimental evaluation shows that our proposed approach is on average 20% better than existing parallel methods in terms of training time, and achieves almost linear speedup with little or no difference in accuracy performance when compared with the existing non-parallel DNN models.

INDEX TERMS Deep learning, genetic algorithm, data parallelization, model parallelization.

I. INTRODUCTION
In recent times, Deep Neural Networks (DNNs) have gained popularity as an important tool for solving complex tasks ranging from image classification [1], speech recognition [2] and medical diagnosis [3], [4], to recommendation systems [5] and complex games [6], [7]. However, training a DNN model requires a large volume of data and is both data and computationally intensive, leading to increased training time. To overcome this challenge, various parallel and distributed computing methods [8] have been proposed to scale up DNN models and provide timely and efficient learning solutions. Broadly, these can be divided into data parallelism, model parallelism, pipeline parallelism and hybrid parallelism (a combination of data and model parallelism).

Data parallelism is a parallelization method that trains replicas of a model on individual devices using different subsets of data, known as mini-batches [9], [10]. In data parallel distributed training, each computing node or worker contains a neural network model replica and a chunk of the dataset, and computes gradients which are shared with other workers and used by the parameter server to update the model parameters [11]. However, as the number of parameters increases, the overhead for parameter synchronisation increases, leading to performance degradation. In addition, when a DNN model is too big, it cannot be executed on a single device, and data parallelization alone is not possible. Model parallelism is a parallelization method where a large model is split, running concurrent operations across multiple devices with the same mini-batch [8]. It can help to speed up DNN training either through its implementation or algorithm. In model parallelism, each node or worker has distinct parameters

and computation of the layers of a model, and also updates the weights of the allocated model layers. Pipelining parallelism splits the DNN model training tasks into a sequence of processing stages [56]. Each stage takes the result from the previous stage as input, with results being passed downstream immediately.

Recently, the combination of model and data parallelization methods, known as hybrid parallelization, has been explored to leverage the benefits of both methods and to minimize communication overhead in the multi-device parallel training of DNN models [15], [18], [19].

Despite the performance of the existing parallelization methods, they are still subject to further improvement by optimally allocating the model computations and data partitions to the available devices for better model training performance. In this paper, we have proposed a generic hybrid parallelization approach for parallel training of DNNs in multiple Graphics Processing Units (GPUs) computing environments, which combines both model and data parallelization methods. Our major contributions are as follows:
• Development of a generic full end-to-end hybrid parallelization approach for the multi-GPU distributed training of a DNN model.
• Model parallelization by splitting a DNN model into independent partitions, formulating the network partitions-to-GPUs allocation problem as a 0-1 multiple knapsack model, and proposing a Genetic Algorithm based heuristic resources allocation (GABRA) approach as an efficient solution to optimize the resources allocation.
• Exploitation of data parallelization based on the All-reduce method and asynchronous stochastic gradient descent across multiple GPUs for further acceleration of the overall training speed.
• Evaluation of the proposed approach through a real use case study – by parallel and distributed training of a 3D Residual Attention Deep Neural Network (3D-ResAttNet) for efficient Alzheimer's disease diagnosis.
The remainder of this paper is organized as follows: Section II reviews the related work of the study. Section III discusses the details of the proposed approach. In Section IV, the experimental evaluation is described. Section V concludes the work.

II. RELATED WORK
This section provides an overview in relation to distributed training of deep neural networks and genetic algorithms for resource optimisation.

A. PARALLEL AND DISTRIBUTED TRAINING OF DEEP NEURAL NETWORKS (DNNs)
As mentioned earlier, existing efforts on parallel and distributed training of DNNs can be broadly divided into four categories: data parallelism, model parallelism, pipeline parallelism and hybrid parallelism.

1) DATA PARALLELISM
In data parallelism, a dataset is broken down into mini-batches and distributed across the multiple GPUs; each GPU contains a complete replica of the model and computes the gradient. The gradient aggregation and updates among the GPUs are usually done either synchronously or asynchronously [11]. In synchronous training, all GPUs wait for each other to complete the gradient computation of their local models; the computed gradients are then aggregated before being used to update the global model. On the other hand, in asynchronous training, the gradient from one GPU is used to update the global model without waiting for other GPUs to finish. The asynchronous training method has higher throughput in that it eliminates the waiting time incurred in the synchronous training method. In both asynchronous and synchronous training, aggregated gradients can be shared between GPUs through the two basic data-parallel training architectures: the parameter server architecture and the AllReduce architecture. The parameter server architecture [14] is a centralized architecture where all GPUs communicate with a dedicated GPU for gradient aggregation and updates. Alternatively, the AllReduce architecture [20] is a decentralized architecture where the GPUs share parameter updates in a ring network topology through the AllReduce operation.
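For illustration, the synchronous AllReduce variant can be sketched in a few lines of PyTorch. The snippet below is a minimal sketch, not the paper's implementation: the linear model, random data and hyper-parameters are placeholders, and it assumes one process per GPU (e.g., launched via torchrun or mp.spawn).

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # One process per GPU; NCCL performs the ring AllReduce between them.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(128, 2).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(10):                         # placeholder mini-batches
        x = torch.randn(6, 128, device=rank)
        y = torch.randint(0, 2, (6,), device=rank)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()         # gradients are all-reduced here
        optimizer.step()                        # every replica applies the same update
    dist.destroy_process_group()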
2) MODEL PARALLELISM
In model parallelization, model layers are divided into partitions and distributed across GPUs for parallel training [21], [22]. In model parallel training, each GPU has distinct parameters and the computation of its layers of the model, and also updates the weights of the allocated model layers. Huo et al. [51] proposed Decoupled Parallel Back-propagation (DDG), which splits the network into partitions and solves the problem of backward locking by storing delayed error gradients and intermediate activations at each partition. Similarly, Zhuang et al. [52] adopted the delayed gradients method to propose a fully decoupled training scheme (FDG). The work breaks a neural network into several modules and trains them concurrently and asynchronously on multiple devices. However, the major challenges are how to break the model layers into partitions, as well as how to allocate partitions to GPUs for efficient training performance [16]. Moreover, model parallelization alone does not scale well to a large number of devices [17], as it involves heavy communication between workers.
3) PIPELINING PARALLELISM
Pipelining parallelism breaks the task (data and model) into a sequence of processing stages. Each stage takes the result from the previous stage as input, with results being passed downstream immediately [53]. Various works have adopted this technique. Lee et al. [54] used the pipeline parallelism approach to overlap computation and communication for CNN training. They implement a thread in each computer server to spawn communication processes after the gradient


is generated. Chen et al. [55] proposed a pipelined model parallel execution method for high GPU utilisation and used a novel weight prediction technique to achieve robust training accuracy. However, one of the significant drawbacks of pipelining parallelism is that it is limited by the slowest stages and has limited scalability.

4) HYBRID PARALLELISM
Several research works have explored both data and model parallelization methods for efficient DNN model training. Yadan et al. [23] achieved a 2.2× speed-up when training a large deep convolutional neural network model with hybridized data and model parallelism. Krizhevsky et al. [9] used model and data parallelization techniques to train a large deep convolutional neural network and classify 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. Shazeer et al. [25] proposed Mesh-TensorFlow, where data parallelism is combined with model parallelism to improve the training performance of transformer models with a huge number of parameters. In Mesh-TensorFlow, users split layers across a multi-dimensional mesh of processors and exploit the data parallelism technique in conjunction with an All-reduce update method. Moreover, Onoufriou et al. [26] proposed Nemesyst, a novel end-to-end hybrid parallelism deep learning-based framework, where model partitions are trained with independent data sets simultaneously. Similarly, Oyama et al. [27] proposed end-to-end hybrid-parallel training algorithms for large-scale 3D convolutional neural networks. The algorithms combine both data and model parallelism to increase throughput and minimize I/O scaling bottlenecks.

The aforementioned approaches adopted data, model and pipeline parallelization separately, or a combination of the methods, to improve the performance of DNN model training. However, none of the existing methods considered the resource utilization and allocation problem in deep learning and provided solutions for efficient distributed training performance.

B. GENETIC ALGORITHMS FOR RESOURCE MANAGEMENT OPTIMIZATION
Resource management optimization is an important research topic in distributed computing systems [28]. Several works have been proposed, with different techniques for addressing resource management problems, such as scheduling [29] and allocation [30]. Genetic Algorithms (GAs) are commonly used to optimize either homogeneous or heterogeneous resources in distributed system environments [31], [57]. For instance, Gai et al. [32] proposed the Cost-Aware Heterogeneous Cloud Memory Model (CAHCM) to provide high performance cloud-based heterogeneous memory service offerings. It proposed the Dynamic Data Allocation Advanced (2DA) algorithm based on genetic programming to determine the data allocations on the cloud-based memories for the model. Mezache et al. [33] proposed a resource allocation method based on GA to minimize the number of hosts required to execute a set of cloudlets associated with the corresponding set of virtual machines, thereby reducing excessive power consumption in the data centre. Furthermore, Jiang et al. [34] proposed a multi-objective model based on the non-dominated sorting genetic algorithm to minimize the expected total makespan and the expected total cost of the disassembly service under the uncertain nature of the disassembly process. Mosa and Sakellariou [35] proposed a dynamic VM placement solution that used a GA to optimize the utilization of both CPU and memory, with the aim of ensuring better overall utilization in the cloud data centre. Devarasetty and Reddy [36] proposed an optimization method for resource allocation in the cloud with the aim of minimizing the deployment cost and improving the QoS performance. They used the GA to find optimal solutions to the allocation problem. In addition to resource allocation in the cloud environment, Mata and Guardieiro [37] investigated resource allocation in the Long-Term Evolution (LTE) uplink and proposed a scheduling algorithm based on GA to find a solution for allocating LTE resources to user requests. Moreover, Li and Zhu [38] adopted a genetic algorithm to develop a joint optimization method for offloading tasks to the mobile edge servers (MESs) in a mobile-edge computing environment under limited wireless transmission resources and MESs' processing resources.

However, none of the work mentioned above considered the resource allocation problem in deep learning and applied GA to solve the problem for efficient training performance of the DNN model.

III. THE PROPOSED APPROACH
In parallel and distributed computing, there are several considerations for efficient training of DNN models, including: 1) how to decompose a model or a dataset into parts/small chunks; 2) how to map and allocate these parts onto distributed resources for efficient computation as well as reduced communication overhead between computing nodes.

This work has proposed a generic full end-to-end hybrid parallelization approach for efficient training of a DNN model, which combines both data and model parallelization. For data parallelization, we have exploited the All-reduce method and asynchronous stochastic gradient descent across multiple GPUs for acceleration of the overall network training speed. For model parallelization, model layers are partitioned individually with the aim of reducing communication overhead during the training process. We have also designed a genetic algorithm-based heuristic resource allocation mechanism to map and allocate partitions to appropriate resources for efficient DNN training.

Figure 1 shows the high-level architecture, including 1) model parallelization, consisting of network partitioning and resource allocation components; and 2) data parallelization. The details of the proposed method are presented in the following sections. The important notations in this paper are detailed in Table 1.

FIGURE 1. The high level architecture of the proposed hybrid parallelization approach.

TABLE 1. Notations.

A. MODEL PARALLELIZATION
Model parallelization includes neural network model partitioning and a genetic algorithm-based heuristic resource allocation mechanism.

1) NETWORK PARTITIONING
The principle of the network partitioning is based on the computation loads of each layer, with the aim of reducing communication overhead during the training process. The highly functional layers are partitioned individually as single partitions for even distribution of the DNN model layers. For instance, a convolution layer of a CNN architecture has a large volume of weights and can be partitioned as a single partition for efficient parallel training performance.

Specifically, let us assume a model network contains a set of layers $\{s_1, s_2, \ldots, s_Q\}$. The model network $P$ is split into partitions $\{p_1, p_2, \ldots, p_n\}$, where $p_i = \{s_i, s_i + 1, \ldots, s_{i+1} - 1\}$ denotes the set of layers in partition $i$ such that $1 \le i \le n$; $s_i + 1$ and $s_{i+1} - 1$ are the second and last layers of each partition. In addition, all partitions are computed simultaneously: the gradient of the partition input is passed back to partition $(i - 1)$, while the partition output is sent to partition $(i + 1)$ as its new input. In the forward pass, the input $a^t_{s_i - 1}$ from partition $(i - 1)$ is sent to partition $i$ and gives the activation $a^t_{s_{i+1} - 1}$ at iteration $t$. Also, in the backward pass, $g^t_{s_{i+1} - 1}$ denotes the gradient at partition $(i + 1)$ at iteration $t$. For each layer $(s_i \le q \le s_{i+1} - 1)$ such that $q \le Q$, the gradient is given as $\hat{g}^t_{w_q} = \frac{\partial a^t_{s_{i+1} - 1}}{\partial w^t_q} g^t_{s_{i+1} - 1}$, which can be used for the update $w^{t+1}_q = w^t_q - \gamma_t \hat{g}^t_{w_q}$, where $\gamma_t$ is the learning rate. Accounting for the delay across partitions, the gradient becomes:

  \hat{g}^{t-i+1}_{w_q} = \frac{\partial a^{t-i+1}_{s_{i+1}-1}}{\partial w^{t-i+1}_q} \, g^{t-i+1}_{s_{i+1}-1}    (1)

which can be used for the update:

  w^{t-i+2}_q = w^{t-i+1}_q - \gamma_{t-i+1} \hat{g}^{t-i+1}_{w_q}    (2)

where $\gamma_{t-i+1}$ is the learning rate.
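As an illustration of this layer-wise partitioning, the following sketch groups a toy stack of layers into contiguous partitions and chains the forward pass across devices. The layer list, split points and three-GPU assumption are ours for illustration, not the 3D-ResAttNet partitioning.

import torch
import torch.nn as nn

layers = [nn.Conv3d(1, 8, 3), nn.ReLU(), nn.Conv3d(8, 16, 3), nn.ReLU(),
          nn.Flatten(), nn.LazyLinear(2)]        # stand-in for {s_1, ..., s_Q}
split_points = [0, 2, 4, len(layers)]            # partition boundaries s_i
devices = ["cuda:0", "cuda:1", "cuda:2"]         # one device per partition

partitions = [nn.Sequential(*layers[a:b]).to(dev)
              for a, b, dev in zip(split_points, split_points[1:], devices)]

def forward(x: torch.Tensor) -> torch.Tensor:
    # Partition i's output becomes partition (i + 1)'s input, as in the text.
    for part, dev in zip(partitions, devices):
        x = part(x.to(dev))
    return x

out = forward(torch.randn(1, 1, 16, 16, 16))     # toy 3D volume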
2) GENETIC ALGORITHM BASED RESOURCE ALLOCATION (GABRA)
To enable efficient DNN model training on multiple GPUs, we have also proposed a Genetic Algorithm-based heuristic resource allocation mechanism. A genetic algorithm (GA) is one of the evolutionary algorithms commonly used to provide efficient solutions to optimisation problems, such as resource allocation problems, based on biologically inspired operators such as mutation, crossover and selection. The proposed genetic-based algorithm aims to find the best network partitions to be allocated to the available GPUs, to maximise resource utilisation and minimise the computation time of network partitions for better overall training performance than the existing methods.

Thus, we formulate the problem of allocating GPUs to network partitions as a 0-1 multiple knapsack problem. As previously illustrated, we consider the computation loads of a set of partitions $p_i$, where $i = \{1, 2, \ldots, n\}$. We also consider the capacities of a set of available GPUs $G$, each denoted by $d_j$, where $j = \{1, 2, \ldots, m\}$ and $d_j \in G$. Furthermore, we assume that the GPUs can be either heterogeneous or homogeneous, with different or the same memory capacities. Each GPU runs at least one partition, and each partition needs to be allocated to only one GPU.

Let $C = (c_{ij}) \in \mathbb{R}^{n \times m}$ be an $n \times m$ matrix in which $c_{ij}$ is the profit of allocating GPU $j$ to partition $i$:

  c_{ij} = \frac{p_i}{d_j}    (3)

Also, let $X = (x_{ij}) \in \mathbb{R}^{n \times m}$, where

  x_{ij} = \begin{cases} 1, & \text{if GPU } j \text{ is allocated to partition } i \\ 0, & \text{otherwise} \end{cases}    (4)

Thus, we formulate the multiple knapsack model in terms of a function $z$ as:

  \max_{p, x, c} z(X) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} c_{ij}    (5)

subject to:

  \sum_{i=1}^{n} p_i x_{ij} \le d_j, \quad \forall j \in M = \{1, 2, \ldots, m\}    (6)

  \sum_{j=1}^{m} x_{ij} = 1, \quad \forall i \in N = \{1, 2, \ldots, n\}    (7)

  x_{ij} \in \{0, 1\}, \quad \text{for } i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, m    (8)

Our goal is to find an allocation satisfying Eq. (8) that guarantees no GPU is overutilized and yields the maximum profit simultaneously. The objective function in Eq. (5) maximizes the sum of the profits of the selected allocations. The constraint in Eq. (6) ensures that the capacity of each available GPU is not exceeded, while the constraint in Eq. (7) ensures that each partition is allocated to exactly one GPU.
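A small numerical check of this formulation can be written as follows. The loads, capacities and the candidate allocation matrix X are made-up values for illustration only.

import numpy as np

p = np.array([4.0, 2.0, 3.0])        # computation loads of n = 3 partitions
d = np.array([6.0, 5.0])             # capacities of m = 2 GPUs
C = p[:, None] / d[None, :]          # profit matrix c_ij of Eq. (3)

X = np.array([[1, 0],                # partition 1 -> GPU 1
              [1, 0],                # partition 2 -> GPU 1
              [0, 1]])               # partition 3 -> GPU 2

profit = (X * C).sum()                        # objective z(X), Eq. (5)
capacity_ok = np.all(p @ X <= d)              # Eq. (6): no GPU over capacity
one_gpu_each = np.all(X.sum(axis=1) == 1)     # Eq. (7): one GPU per partition
print(profit, capacity_ok, one_gpu_each)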
Next, we present the Genetic Algorithm-Based Resources Allocation (GABRA) as an efficient solution to the model. The Genetic Algorithm has been proven to be a stochastic method that produces high-quality solutions for combinatorial optimization problems, particularly NP-hard problems [39].

Algorithm 1 shows the pseudo-code of GABRA for solving the GPUs-to-partitions allocation problem. It consists of four major parts: input, initialization, looping and output. In the initialization part (line 3), unlike in the classical GA, the set of chromosomes, also known as the initial population P(t), for allocating GPUs to partitions is generated as indicated in Algorithm 2, by randomizing the allocation of resources without exceeding their capacities with respect to the computation load of each network partition.

The looping part contains the fitness evaluation, selection, crossover and mutation functions. The objective is to optimize the total profit of allocating GPUs to partitions.

Algorithm 1: Genetic Algorithm Based Resources Allocation (GABRA)
  input : {p1, p2, ..., pn}: computation loads of partitions
          {d1, d2, ..., dm}: capacity values of available GPUs
  output: optimized solution f(Z*)
  1  evaluate c_ij ← p_i / d_j, for i = {1, 2, ..., n} and j = {1, 2, ..., m};
  2  set t ← 0;
  3  initialise P(t) ← {β1, β2, ..., βn};
  4  evaluate P(t): {f(β1), f(β2), ..., f(βn)};
  5  find Z* ∈ P(t) such that f(Z*) ≥ f(Z), ∀Z ∈ P(t);
  6  while (t < t_max) do
  7      select {Y1, Y2} = φ(P(t)); // φ is a selection function
  8      crossover W ← Ψc(Y1, Y2); // Ψc is a crossover function
  9      mutate W ← Ψm(W); // Ψm is a mutation function
  10     if W = any Z ∈ P(t) then
  11         go to 7
  12     end if
  13     evaluate f(W);
  14     find Z' ∈ P(t) such that f(Z') ≤ f(Z), ∀Z ∈ P(t), and replace Z' ← W;
  15     if f(W) > f(Z*) then
  16         Z* ← W; // update best fit Z*
  17     end if
  18     t ← t + 1;
  19 end while
  20 return Z*, f(Z*)

The fitness evaluation validates the optimal solution condition with respect to the optimization objectives. Thus, the fitness value of each chromosome is calculated as:

  f(\beta) = \sum_{i=1}^{n} c_{ij} \beta_i, \quad \text{for } j = 1, 2, \ldots, m    (9)

In the case where the optimal solution condition does not satisfy the optimization objectives, a new population is computed from the current population of solutions using their fitness values and the genetic functions — selection, crossover and mutation — in the looping part (lines 7-18). We use the selection function (φ), which is based on the roulette wheel method [50], to select the best chromosomes. The selection is based on the chromosomes' fitness values, representing the total profit of allocating partitions to the available GPUs. The chromosomes with higher fitness values are selected for the generation of the next population. The midpoint crossover function Ψc, as described in Algorithm 3, works on two parent chromosomes {Y1, Y2} with crossover probability 0.8 and produces a new individual.
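The selection and crossover operators just described can be sketched as follows, with chromosomes represented as lists of GPU indices (an assumed encoding):

import random

def roulette_select(population, fitness_values):
    # phi: roulette wheel -- fitter chromosomes are proportionally likelier.
    return random.choices(population, weights=fitness_values, k=2)

def midpoint_crossover(y1, y2):
    # psi_c (Algorithm 3): swap the two parents' halves at the mid point.
    cp = len(y1) // 2
    return y1[:cp] + y2[cp:], y2[:cp] + y1[cp:]

parents = roulette_select([[0, 1, 0], [1, 1, 0], [0, 0, 1]], [1.2, 0.9, 1.5])
child_a, child_b = midpoint_crossover(*parents)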
Next, the inversion mutation function Ψm is adopted, where a subset of genes in a chromosome is selected and

inverted to form mutated offspring. In line 14, the old chromosomes in the current population are replaced with the new chromosomes to form a new population. Finally, the algorithm terminates when the maximum number of generations is reached, or the optimal total profit of allocating GPUs to partitions is obtained.

Algorithm 2: Initial Population Algorithm
  input : {p1, p2, ..., pn}: computation loads of partitions
          {d1, d2, ..., dm}: capacity values of available GPUs
  output: initial population
  1 for (all partition loads) do
  2     randomize the allocation of partitions to the number of the available GPUs
  3 end for
  4 return initial population

Algorithm 3: Crossover Function (Ψc)
  input : Y1, Y2: two parent chromosomes
  output: Yλ1, Yλ2: two offspring chromosomes
  1 Φ ← length(Y1);
  2 cp ← Φ/2; // mid cross point
  3 Yλ1 ← Y1(1 : cp) ∪ Y2(cp : Φ);
  4 Yλ2 ← Y1(cp : Φ) ∪ Y2(1 : cp);
  5 return Yλ1, Yλ2
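Putting the pieces together, a compact Python transliteration of the GABRA loop might look as follows. The population size, generation count, loads and capacities are made-up values, and infeasible offspring are simply discarded; this is a sketch of Algorithms 1-3, not the authors' code.

import random

p = [4.0, 2.0, 3.0, 1.0]                   # partition loads (made up)
d = [6.0, 5.0]                             # GPU capacities (made up)
C = [[pi / dj for dj in d] for pi in p]    # profit matrix, Eq. (3)

def feasible(beta):
    # Eq. (6): the load placed on each GPU must not exceed its capacity.
    return all(sum(p[i] for i, g in enumerate(beta) if g == j) <= d[j]
               for j in range(len(d)))

def fitness(beta):
    # Eq. (9): total profit of the allocation encoded by chromosome beta.
    return sum(C[i][g] for i, g in enumerate(beta))

def random_chromosome():
    # Algorithm 2: randomize allocations until the capacities are respected.
    while True:
        beta = [random.randrange(len(d)) for _ in p]
        if feasible(beta):
            return beta

population = [random_chromosome() for _ in range(20)]
best = max(population, key=fitness)
for t in range(100):                                          # t_max generations
    weights = [fitness(b) for b in population]
    y1, y2 = random.choices(population, weights=weights, k=2)  # phi: selection
    cp = len(y1) // 2
    child = y1[:cp] + y2[cp:]                                  # psi_c: crossover
    i, j = sorted(random.sample(range(len(child)), 2))
    child[i:j + 1] = child[i:j + 1][::-1]                      # psi_m: inversion
    if feasible(child):
        worst = min(range(len(population)), key=lambda k: fitness(population[k]))
        population[worst] = child                              # line 14 of Algorithm 1
        if fitness(child) > fitness(best):
            best = child                                       # update Z*
print(best, fitness(best))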
B. DATA PARALLELLISATION
To accelerate the training process, each GPU uses a different mini-batch; that is, GPU 1 uses the first mini-batch, GPU 2 uses the second mini-batch, and so on. To reduce computation time, the full DNN is trained by training a partition with its mini-batch on all GPUs concurrently. Furthermore, we adopt Asynchronous Stochastic Gradient Descent (ASGD) [8] as well as the ring All-reduce mechanism [40] for parameter updates to complete an iteration. The process continues until all the iterations are completed. ASGD achieves a faster training speed as there is no need to wait for the slowest GPU in every iteration for the global model updates. The ring All-reduce is an optimal communication algorithm to minimize the communication overhead among the GPUs, where all GPUs are logically arranged in a ring topology. Each GPU sends and receives the information required to update its model parameters from its neighbour GPUs.

In all, the objective is to minimize:

  f(w; V) = \frac{1}{b \times m} \sum_{i=1}^{b \times m} \ell(w, v_i)    (10)

where $f$ is a neural network, $b$ is the batch size, $m$ is the number of GPUs, $\ell$ is a loss function for each data point $v \in V$, and $w$ is the trainable parameter of the neural network. The derivative of this objective, also referred to as the gradient, is given as:

  \frac{\partial f(w; V)}{\partial w} = \frac{1}{b \times m} \sum_{i=1}^{b \times m} \frac{\partial \ell(w, v_i)}{\partial w}    (11)

In data parallelization, the gradient update is calculated as an average of per-GPU terms, each of which is the average of derivatives over $b$ data points, and is given as:

  \frac{\partial f(w; V)}{\partial w} = \frac{1}{m} \left( \frac{1}{b} \sum_{i=1}^{b} \frac{\partial \ell(w, v_i)}{\partial w} + \frac{1}{b} \sum_{i=b+1}^{b \times 2} \frac{\partial \ell(w, v_i)}{\partial w} + \cdots + \frac{1}{b} \sum_{i=b \times (m-1)+1}^{b \times m} \frac{\partial \ell(w, v_i)}{\partial w} \right)    (12)
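Eq. (12) is simply Eq. (11) regrouped GPU by GPU, which the following toy check confirms numerically; the per-point "gradients" are random scalars rather than real model gradients.

import numpy as np

b, m = 4, 3                                    # batch size per GPU, number of GPUs
g = np.random.randn(b * m)                     # toy per-point gradients

global_grad = g.mean()                         # Eq. (11): mean over all b*m points
per_gpu_means = g.reshape(m, b).mean(axis=1)   # each GPU averages its b points
allreduced = per_gpu_means.mean()              # Eq. (12): average of the m averages

assert np.isclose(global_grad, allreduced)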
In addition, the speed of data-parallel training with $m$ GPUs can be expressed as:

  S_{T_m} = \frac{T_1}{T_m} \times \frac{TS_1}{TS_m} \times \frac{E_1}{E_m}    (13)

where $T_1$ is the average training time per step using one GPU, while $T_m$ is the time per step using $m$ GPUs. $E_1$ is the number of epochs required to converge for one GPU, while $E_m$ is the number of epochs required for $m$ GPUs.
IV. EXPERIMENTAL EVALUATION THROUGH A REAL USE CASE STUDY
We have applied our approach to a real case study in neurocomputing to evaluate the effectiveness of the proposed method. Previously, we developed a 3D explainable residual self-attention convolutional neural network (3D-ResAttNet) to automatically classify discriminative atrophy localization on sMRI images for Alzheimer's Disease (AD) diagnosis [42]. It is a non-parallel model and runs only on a single GPU. To evaluate the proposed parallel approach, we have parallelized our previous 3D-ResAttNet model, run it in a homogeneous multiple-GPU setting, and compared the performance with and without parallelization. Moreover, we have compared our approach with the state-of-the-art methods, including Distributed Data Parallel (DDP) and Data Parallel (DP) from the PyTorch framework [43], FDG [51] and DDG [52].

A. EVALUATION METRICS
We have adopted standard metrics for performance evaluation, including Speedup (S), Accuracy (ACC) and Training Time (TT). The Speedup (S) measures the scalability and computing performance. It is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on multiple processors (e.g., GPUs in this case). It can be calculated as:

  S = T_s / T_p    (14)

where $T_s$ represents the computing time on a single machine

or GPU, and $T_p$ refers to the computing time on multiple machines or GPUs. The Accuracy (ACC) measures the classification accuracy and is defined as:

  ACC = (TP + TN) / (TP + TN + FP + FN)    (15)

where TP = true positives, FP = false positives, TN = true negatives and FN = false negatives. Training Time (TT) is the time taken to train 3D-ResAttNet using the proposed approach and the other existing distributed training methods.
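For reference, the two metrics translate directly into code; the example figures below are illustrative, not measurements from the paper.

def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel                   # S = Ts / Tp, Eq. (14)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)         # ACC, Eq. (15)

print(speedup(100.0, 17.7))                        # e.g. roughly 5.6x
print(accuracy(tp=48, tn=49, fp=2, fn=1))          # e.g. 0.97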

B. SYSTEM CONFIGURATION
We have conducted our experiments on Amazon Web Services (AWS) EC2 P3 instances. Specifically, we used a p3.16xlarge instance consisting of 8 homogeneous NVIDIA Tesla V100 GPUs, developed purposely for deep learning and artificial intelligence workloads, which provides ultra-fast GPU-to-GPU communication through NVLink technology. Other hardware configuration of the p3.16xlarge instance includes 128 GB of GPU memory, 64 vCPUs, 488 GB of memory, and 25 Gbps of network bandwidth. Additionally, the software configuration/installation includes: Ubuntu 18.04, Python 3.7.3, PyTorch 1.2.0, Torchvision 0.4.0, NumPy 1.15.4, TensorboardX 1.4, Matplotlib 3.0.1, Tqdm 4.39.0, nibabel, fastai, and the NVIDIA Collective Communications Library (NCCL) with CUDA toolkit 10.2 — a library of multi-GPU collective communication primitives [41].

FIGURE 2. The hybrid parallelisation of 3D-ResAttNet.
TABLE 2. Demographic data for the subjects from ADNI database.

C. A USE CASE - PARALLELIZATION OF 3D-ResAttNet FOR ALZHEIMER'S DISEASE (AD) DIAGNOSIS
As described earlier, we have applied our hybrid parallelization approach to our previous non-parallel 3D-ResAttNet for automatic detection of the progression of AD and its Mild Cognitive Impairments (MCIs), such as Normal Cohort (NC), Progressive MCI (pMCI) and Stable MCI (sMCI), from sMRI scans [42]. It includes two types of classification: NC vs. AD, and pMCI vs. sMCI.

1) THE HIGH-LEVEL PARALLELIZATION OF THE SYSTEM
Fig. 2 shows the high-level parallelization of our previous 3D-ResAttNet model architecture, based on a self-attention residual mechanism and explainable gradient-based localisation class activation mapping (Grad-CAM) to improve AD diagnosis performance. The 3D-ResAttNet model consists of 3D convolutional blocks (Conv blocks), a residual self-attention block, and explainable blocks. Conv blocks use a 3D filter for computation of the low-level feature representations. The residual self-attention block combines two important network layers: a residual network layer and a self-attention layer. The residual network layer comprises two Conv blocks consisting of 3 × 3 × 3 3D convolution layers, 3D batch normalization and rectified-linear-unit nonlinearity layers (ReLU). The explainable block uses 3D Grad-CAM to improve the model decision.

As shown in Fig. 2, the hybrid parallelization approach for 3D-ResAttNet is divided into three phases: the splitting of 3D-ResAttNet into partitions, the allocation of GPUs to partitions, and data partitioning and distribution. For data parallelization, we adopt stochastic gradient descent as well as the ring All-reduce mechanism for parameter updates, and equally distribute data parts to each GPU. For model parallelization, the network model is partitioned based on its computational complexity, which is usually synonymous with the number of basic operations, such as multiplications and summations, that each layer performs. Each Conv block in the network consists of a 3 × 3 × 3 3D convolution layer, 3D batch normalization and a rectified-linear-unit nonlinearity layer (ReLU). Moreover, a convolutional layer has a higher operation count, with complexity $O(C_o \cdot C_1 \cdot T \cdot H \cdot W \cdot K_T \cdot K_H \cdot K_W)$, where $C_o$ and $C_1$ denote the number of output and input channels respectively, $T$, $H$ and $W$ are the image dimensions, and $K_T$, $K_H$ and $K_W$ are the filter dimensions. Consequently, we partitioned each Conv block individually as a single partition, while other layers with fewer computation operations are partitioned as shown in Fig. 2.
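This partitioning heuristic can be sketched as follows: estimate each layer's operation count from the complexity expression above and give heavy layers their own partition. The layer shapes and the threshold are assumptions for illustration.

def conv3d_ops(c_o, c_1, t, h, w, kt, kh, kw):
    # Operation count O(Co * C1 * T * H * W * KT * KH * KW) from the text.
    return c_o * c_1 * t * h * w * kt * kh * kw

layers = [                                   # (name, estimated operations)
    ("conv1", conv3d_ops(32, 1, 96, 96, 96, 3, 3, 3)),
    ("relu1", 10_000),                       # cheap element-wise layer
    ("conv2", conv3d_ops(64, 32, 48, 48, 48, 3, 3, 3)),
    ("pool1", 50_000),
]

THRESHOLD = 1e8                              # assumed cut-off for "heavy"
partitions, current = [], []
for name, ops in layers:
    if ops > THRESHOLD:                      # heavy layer: its own partition
        if current:
            partitions.append(current)
            current = []
        partitions.append([name])
    else:
        current.append(name)                 # light layers share a partition
if current:
    partitions.append(current)
print(partitions)   # [['conv1'], ['relu1'], ['conv2'], ['pool1']]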
2) DATASET DESCRIPTION
The dataset is obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu), which is the same dataset previously used for validation of our 3D-ResAttNet. The dataset contains 1193 MRI scans of four classes: 389 Alzheimer's

Disease (AD), 400 Normal Cohort (NC), 232 stable mild cognitive impairment (sMCI) and 172 progressive mild cognitive impairment (pMCI) patients. The demographic data for this dataset is shown in Table 2.

TABLE 3. Parallel training performance of 3D-ResAttNet using our proposed approach.

FIGURE 3. Speedup of our proposed approach.

D. EXPERIMENTS
We have conducted experiments under different strategies:
1) We have evaluated the model performance in the parallel setting across the number of heterogeneous GPUs.
2) We have compared our proposed approach with four existing data and model parallelism methods for further evaluation, including data parallelism — two PyTorch generic distributed training methods: DistributedDataParallel (DDP) and DataParallel (DP) — and model parallelism — the delayed gradient parallel methods FDG and DDG. DDP is multi-process data parallel training across GPUs either on a single machine or on multiple machines, while DP is single-process multi-parallel

FIGURE 4. Training time of proposed approach, FDG, DDG, DDP and DP.

training using multiple GPUs on a single machine [43]. Both FDG and DDG were implemented by partitioning the data and training parallel models with sub-data across multiple GPUs.

In all experiments, we carried out distributed training of 3D-ResAttNet18 and 3D-ResAttNet34 for two classification tasks, sMCI vs. pMCI and AD vs. NC, with different numbers of GPUs (ranging from 1 to 8). Furthermore, we optimized the model parameters with SGD, a stochastic optimization algorithm. We adopted other training parameters, including a batch size of six samples, cross-entropy as the loss function, and 50 epochs for better convergence. In addition, we set the initial learning rate (LR) to 1×10⁻⁴, then reduced it by 1×10⁻² with increased iterations.

E. EXPERIMENTAL RESULTS AND DISCUSSION
1) PERFORMANCE OF 3D-ResAttNet IN THE PARALLEL SETTING
We conducted the experiments on the 3D-ResAttNet model for two classification tasks: sMCI vs. pMCI and AD vs. NC. Table 3 shows the experiment results of our parallel 3D-ResAttNet (with 18 and 34 layers respectively) in terms of both training time (TT) and accuracy. Based on these results, we calculated the speedup, as shown in Figs. 3a, 3b, 3c, and 3d. In all cases, it is observed that our proposed approach achieves almost linear speedup, which demonstrates the scalability of our approach, in that the training speedup grows in proportion to the number of GPUs. For instance, in the AD vs. NC classification task with 3D-ResAttNet34, the training speedups for 1, 2, 3, 4, 5, 6, 7, and 8 GPUs are 1, 2.38, 2.95, 3.44, 3.65, 4.13, 5.17, and 5.64 respectively. A similar trend is also observed in the sMCI vs. pMCI classification task with 3D-ResAttNet34, where the training speedups for 1, 2, 3, 4, 5, 6, 7, and 8 GPUs are 1, 2.27, 2.62, 3.09, 3.40, 3.78, 4.86, and 5.67 respectively.

2) COMPARISON WITH THE EXISTING PARALLEL WORKS
We compare our proposed approach with existing parallel and non-parallel methods regarding training time and accuracy, respectively, to affirm the robustness of the proposed method.

TABLE 4. Accuracy comparison with the existing works.

(i) Training speed: our proposed approach and the existing parallel approaches.
We have compared our approach with DDP [43], DP [43], DDG [52] and FDG [51]. DDP and DP are two PyTorch generic distributed training methods. FDG and DDG are model parallelism approaches based on the delayed gradient method. The experiment results are shown in Figs. 4a, 4b, 4c, and 4d. It can be seen that our proposed approach outperforms the existing methods in terms of training time. For instance, for 3D-ResAttNet18 on the AD vs. NC and sMCI vs. pMCI classification tasks, the training time incurred by our proposed approach is on average 20% lower than DDP, DP, DDG and FDG. Similar trends are observed when comparing the proposed approach with DDP, DP, DDG and FDG on the distributed training of 3D-ResAttNet34 for the two classification tasks: AD vs. NC and sMCI vs. pMCI.
(ii) Accuracy: comparison with the existing non-parallel works.
Table 4 shows the accuracy comparison results for seven state-of-the-art deep neural networks and our methods. The best testing accuracies obtained by our approach are 97% and 84% for AD vs. NC and sMCI vs. pMCI classification respectively. The results show that our proposed approach performs efficiently when compared with the existing works in terms of accuracy. In addition, our work implements parallel distributed training of networks in a multi-GPU environment, whereas the existing works are non-parallel methods.

V. CONCLUSION AND FUTURE WORK
In this work, we have proposed a hybrid parallelization approach that combines both model and data parallelization for parallel training of a DNN model. The Genetic Algorithm based heuristic resources allocation mechanism (GABRA) has also been developed for optimal distribution of network partitions on the available GPUs, with the same or different capacities, for performance optimization. Our proposed approach has been compared with the existing state-of-the-art parallel methods and evaluated on a real use case based on our previous 3D-ResAttNet model developed for efficient AD diagnosis. The experiment results show that the proposed approach achieves almost linear speedup, which demonstrates its scalability and efficient computing capability, with little or no difference in accuracy performance when compared with the existing non-parallel DNN models.

Future work will be focused on further improvement of the parallelization approach for efficient training performance.

REFERENCES
[1] Y. Li, Y. Zhang, and Z. Zhu, "Error-tolerant deep learning for remote sensing image scene classification," IEEE Trans. Cybern., vol. 51, no. 4, pp. 1756–1768, Apr. 2021, doi: 10.1109/TCYB.2020.2989241.
[2] B. J. Abbaschian, D. Sierra-Sosa, and A. Elmaghraby, "Deep learning techniques for speech emotion recognition, from databases to models," Sensors, vol. 21, no. 4, p. 1249, Feb. 2021.
[3] M. Liu, J. Zhang, C. Lian, and D. Shen, "Weakly supervised deep learning for brain disease prognosis using MRI and incomplete clinical scores," IEEE Trans. Cybern., vol. 50, no. 7, pp. 3381–3392, Jul. 2020, doi: 10.1109/TCYB.2019.2904186.
[4] M. F. J. Acosta, L. Y. C. Tovar, M. B. Garcia-Zapirain, and W. S. Percybrooks, "Melanoma diagnosis using deep learning techniques on dermatoscopic images," BMC Med. Imag., vol. 21, no. 1, p. 6, Dec. 2021.
[5] M. Schedl, "Deep learning in music recommendation systems," Frontiers Appl. Math. Statist., vol. 5, p. 44, Aug. 2019.
[6] D. J. N. J. Soemers, V. Mella, C. Browne, and O. Teytaud, "Deep learning for general game playing with Ludii and Polygames," 2021, arXiv:2101.09562.
[7] H. Tembine, "Deep learning meets game theory: Bregman-based algorithms for interactive deep generative adversarial networks," IEEE Trans. Cybern., vol. 50, no. 3, pp. 1132–1145, Mar. 2020, doi: 10.1109/TCYB.2018.2886238.
[8] M. Diskin, A. Bukhtiyarov, M. Ryabinin, L. Saulnier, Q. Lhoest, A. Sinitsin, D. Popov, D. Pyrkin, M. Kashirin, A. Borzunov, A. V. D. Moral, D. Mazur, I. Kobelev, Y. Jernite, T. Wolf, and G. Pekhimenko, "Distributed deep learning in open collaborations," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 7879–7897.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[10] J. George and P. Gurram, "Distributed deep learning with event-triggered communication," 2019, arXiv:1909.05020.
[11] S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, S. Lee, J. S. Jeong, and B.-G. Chun, "Parallax: Sparsity-aware data parallel training of deep neural networks," in Proc. 14th EuroSys Conf., Dresden, Germany, Mar. 2019, p. 15.
[12] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[13] PyTorch. (2020). PyTorch Deep Learning Framework That Puts Python First. Accessed: Dec. 16, 2020. [Online]. Available: http://pytorch.org/
[14] Z. Song, Y. Gu, Z. Wang, and G. Yu, "DRPS: Efficient disk-resident parameter servers for distributed machine learning," Frontiers Comput. Sci., vol. 16, no. 4, Aug. 2022, Art. no. 164321.
[15] J. Ono, M. Utiyama, and E. Sumita, "Hybrid data-model parallel training for sequence-to-sequence recurrent neural network machine translation," 2019, arXiv:1909.00562.
[16] S. Gandhi and A. P. Iyer, "P3: Distributed deep graph learning at scale," in Proc. 15th USENIX Symp. Oper. Syst. Design Implement. (OSDI), Jul. 2021, pp. 551–568.
[17] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean, "Device placement optimization with reinforcement learning," in Proc. ICML, 2017, pp. 2430–2439.
[18] M. Wang, C.-C. Huang, and J. Li, "Unifying data, model and hybrid parallelism in deep learning via tensor tiling," 2018, arXiv:1805.04170.
[19] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "HyPar: Towards hybrid parallelism for deep learning accelerator array," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2019, pp. 56–68.

[20] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[21] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," 2018, arXiv:1807.05358.
[22] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean, "A hierarchical model for device placement," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–11.
[23] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato, "Multi-GPU training of ConvNets," 2014, arXiv:1312.5853. [Online]. Available: https://arxiv.org/abs/1312.5853
[24] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, "GPipe: Efficient training of giant neural networks using pipeline parallelism," 2018, arXiv:1811.06965.
[25] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. A. Hechtman, "Mesh-TensorFlow: Deep learning for supercomputers," in Proc. NeurIPS, 2018, pp. 1–10.
[26] G. Onoufriou, R. Bickerton, S. Pearson, and G. Leontidis, "Nemesyst: A hybrid parallelism deep learning-based framework applied for Internet of Things enabled food retailing refrigeration systems," Comput. Ind., vol. 113, Dec. 2019, Art. no. 103133.
[27] Y. Oyama, N. Maruyama, N. Dryden, E. Mccarthy, P. Harrington, J. Balewski, S. Matsuoka, P. Nugent, and B. Van Essen, "The case for strong scaling in deep learning: Training large 3D CNNs with hybrid parallelism," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 7, pp. 1641–1652, Jul. 2021.
[28] Z. Wesolowski, "Network resource allocation in distributed systems: A global optimization framework," in Proc. IEEE 2nd Int. Conf. Cybern. (CYBCONF), Jun. 2015, pp. 267–270.
[29] A. V. Martinez, "Scheduling in heterogeneous distributed computing systems based on internal structure of parallel tasks graphs with meta-heuristics," Appl. Sci., vol. 10, no. 18, p. 6611, Sep. 2020.
[30] L. Haji, S. Zeebaree, O. Ahmed, A. Sallow, K. Jacksi, and R. Zebari, "Dynamic resource allocation for distributed systems and cloud computing," Test Eng. Manage., vol. 83, pp. 22417–22426, May/Jun. 2020.
[31] M. Zhang, L. Liu, and S. Liu, "Genetic algorithm based QoS-aware service composition in multi-cloud," in Proc. IEEE Conf. Collaboration Internet Comput. (CIC), Oct. 2015, pp. 113–118.
[32] K. Gai, L. Qiu, H. Zhao, and M. Qiu, "Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing," IEEE Trans. Cloud Comput., vol. 8, no. 4, pp. 1212–1222, Oct. 2020.
[33] C. Mezache, O. Kazar, and S. Bourekkache, "A genetic algorithm for resource allocation with energy constraint in cloud computing," in Proc. Int. Conf. Image Process., Prod. Comput. Sci. (ICIPCS), 2016, pp. 62–69.
[34] H. Jiang, J. Yi, S. Chen, and X. Zhu, "A multi-objective algorithm for task scheduling and resource allocation in cloud-based disassembly," J. Manuf. Syst., vol. 41, pp. 239–255, Oct. 2016.
[35] A. Mosa and R. Sakellariou, "Dynamic virtual machine placement considering CPU and memory resource requirements," in Proc. IEEE 12th Int. Conf. Cloud Comput. (CLOUD), Jul. 2019, pp. 196–198.
[36] P. Devarasetty and S. Reddy, "Genetic algorithm for quality of service based resource allocation in cloud computing," Evol. Intell., vol. 14, no. 2, pp. 381–387, Jun. 2021.
[37] S. H. da Mata and P. R. Guardieiro, "A genetic algorithm based approach for resource allocation in LTE uplink," in Proc. Int. Telecommun. Symp. (ITS), Aug. 2014, pp. 1–5.
[38] Z. Li and Q. Zhu, "Genetic algorithm-based optimization of offloading and resource allocation in mobile-edge computing," Information, vol. 11, no. 2, p. 83, Feb. 2020.
[39] T. Perry, M. Bader-El-Den, and S. Cooper, "Imbalanced classification using genetically optimized cost sensitive classifiers," in Proc. IEEE Congr. Evol. Comput. (CEC), 2015, pp. 680–687.
[40] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[41] (2021). NVIDIA, NCCL. Accessed: Jan. 23, 2021. [Online]. Available: https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
[42] X. Zhang, L. Han, W. Zhu, L. Sun, and D. Zhang, "An explainable 3D residual self-attention deep neural network for joint atrophy localization and Alzheimer's disease diagnosis using structural MRI," 2020, arXiv:2008.04024.
[43] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, "PyTorch distributed: Experiences on accelerating data parallel training," Proc. VLDB Endowment, vol. 13, no. 12, pp. 3005–3018, Aug. 2020.
[44] E. Hosseini-Asl, R. Keynton, and A. El-Baz, "Alzheimer's disease diagnostics by adaptation of 3D convolutional network," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 126–130.
[45] H. Suk and D. Shen, "Deep learning-based feature representation for AD/MCI classification," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2013, pp. 583–590.
[46] S. Sarraf and G. Tofighi, "Classification of Alzheimer's disease using fMRI data and deep learning convolutional neural networks," 2016, arXiv:1603.08631.
[47] C. D. Billones, O. J. L. D. Demetria, D. E. D. Hostallero, and P. C. Naval, "DemNet: A convolutional neural network for the detection of Alzheimer's disease and mild cognitive impairment," in Proc. IEEE Region 10 Conf. (TENCON), Nov. 2016, pp. 3724–3727.
[48] H. Li, M. Habes, and Y. Fan, "Deep ordinal ranking for multi-category diagnosis of Alzheimer's disease using hippocampal MRI data," 2017, arXiv:1709.01599.
[49] J. Shi, X. Zheng, Y. Li, Q. Zhang, and S. Ying, "Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of Alzheimer's disease," IEEE J. Biomed. Health Informat., vol. 22, no. 1, pp. 173–183, Jan. 2018.
[50] F. Yu, X. Fu, H. Li, and G. Dong, "Improved fitness proportionate selection-based genetic algorithm," in Proc. 3rd Int. Conf. Mechatronics Inf. Technol., 2016, pp. 136–140.
[51] Z. Huo, B. Gu, Q. Yang, and H. Huang, "Decoupled parallel backpropagation with convergence guarantee," in Proc. ICML, 2018, pp. 2098–2106.
[52] H. Zhuang, Y. Wang, Q. Liu, and Z. Lin, "Fully decoupled neural network learning using delayed gradients," IEEE Trans. Neural Netw. Learn. Syst., early access, Apr. 9, 2021, doi: 10.1109/TNNLS.2021.3069883.
[53] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. 27th ACM Symp. Oper. Syst. Princ., New York, NY, USA: Association for Computing Machinery, Oct. 2019, pp. 1–15.
[54] S. Lee, D. Jha, A. Agrawal, A. Choudhary, and W.-K. Liao, "Parallel deep convolutional neural network training by exploiting the overlapping of computation and communication," in Proc. IEEE 24th Int. Conf. High Perform. Comput. (HiPC), Jaipur, India, Dec. 2017, pp. 183–192.
[55] C.-C. Chen, C.-L. Yang, and H.-Y. Cheng, "Efficient and robust parallel DNN training through model parallelism on multi-GPU platform," 2018, arXiv:1809.02839.
[56] C. Kim, H. Lee, M. Jeong, W. Baek, B. Yoon, I. Kim, S. Lim, and S. Kim, "torchgpipe: On-the-fly pipeline parallelism for training giant models," 2020, arXiv:2004.09910.
[57] S. Li, Z. Huang, L. Han, and C. Jiang, "A genetic algorithm enhanced automatic data flow management solution for facilitating data intensive applications in the cloud," Concurrency Comput., Pract. Exp., vol. 30, no. 23, Dec. 2018, Art. no. e4844.

SAMSON B. AKINTOYE received the Ph.D. degree in computer science from the University of the Western Cape, South Africa, in 2019. He is currently working as a Research Associate with the Department of Computing and Mathematics, Manchester Metropolitan University, U.K. His current research interests include parallel and distributed computing, deep learning, and cloud computing.

LIANGXIU HAN received the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2002. She is currently a Professor of computer science with the Department of Computing and Mathematics, Manchester Metropolitan University. Her research interests include the development of novel big data analytics and of novel intelligent architectures that facilitate big data analytics (e.g., parallel and distributed computing, cloud/service-oriented computing, and data intensive computing), as well as applications in different domains using various large datasets (biomedical images, environmental sensors, network traffic data, and web documents). She is also a principal investigator or a co-PI on a number of research projects in the research areas mentioned above.

HAOMING CHEN is currently pursuing the master's degree in computer science and artificial intelligence with The University of Sheffield. His current research interests include machine learning and artificial intelligence.

XIN ZHANG received the B.S. degree from The PLA Academy of Communication and Commanding, China, in 2009, and the Ph.D. degree in cartography and geographic information system from Beijing Normal University (BNU), China, in 2014. He is currently an Associate Researcher with Manchester Metropolitan University (MMU). His current research interests include remote sensing image processing and deep learning.

DAOQIANG ZHANG (Senior Member, IEEE) received the B.Sc. and Ph.D. degrees in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1999 and 2004, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics. His current research interests include machine learning, pattern recognition, and biomedical image analysis. In these areas, he has authored or coauthored more than 100 technical papers in refereed international journals and conference proceedings.
