
Received 2 July 2022, accepted 20 July 2022, date of publication 25 July 2022, date of current version 28 July 2022.

Digital Object Identifier 10.1109/ACCESS.2022.3193690

A Hybrid Parallelization Approach for Distributed and Scalable Deep Learning

SAMSON B. AKINTOYE¹, LIANGXIU HAN¹, XIN ZHANG¹, HAOMING CHEN², AND DAOQIANG ZHANG³, (Senior Member, IEEE)
¹Department of Computing and Mathematics, Manchester Metropolitan University, Manchester M15 6BH, U.K.
²Department of Computer Science, The University of Sheffield, Sheffield S10 2TN, U.K.
³College of Computer Science and Technology, Nanjing University of Aeronautics and Astronautics, Nanjing 210016, China

Corresponding author: Liangxiu Han ([email protected])


This work was supported in part by the Project by Royal Society–Academy of Medical Sciences Newton Advanced Fellowship under Grant NAF\R1\180371.

The associate editor coordinating the review of this manuscript and approving it for publication was Kostas Kolomvatsos.

ABSTRACT Recently, Deep Neural Networks (DNNs) have recorded significant success in handling
medical and other complex classification tasks. However, as the sizes of DNN models and the available
datasets increase, the training process becomes more complex and computationally intensive, usually taking
longer to complete. In this work, we have proposed a generic full end-to-end hybrid parallelization approach combining model and data parallelism for efficient distributed and scalable training of DNN models. We have also proposed a Genetic Algorithm Based Heuristic Resources Allocation (GABRA) mechanism for optimal distribution of partitions on the available GPUs for computing performance optimization. We have applied our proposed approach to a real use case based on a 3D Residual Attention Deep Neural Network (3D-ResAttNet) for efficient Alzheimer's Disease (AD) diagnosis on multiple GPUs, and compared it with the existing state-of-the-art parallel methods. The experimental evaluation shows that our proposed approach is on average 20% better than existing parallel methods in terms of training time, and achieves almost linear speedup with little or no difference in accuracy performance when compared with the existing non-parallel DNN models.

INDEX TERMS Deep learning, genetic algorithm, data parallelization, model parallelization.

I. INTRODUCTION
In recent times, Deep Neural Networks (DNNs) have gained popularity as an important tool for solving complex tasks ranging from image classification [1], speech recognition [2] and medical diagnosis [3], [4], to recommendation systems [5] and complex games [6], [7]. However, training a DNN model requires a large volume of data and is both data and computationally intensive, leading to increased training time. To overcome this challenge, various parallel and distributed computing methods [8] have been proposed to scale up DNN models and provide timely and efficient learning solutions. Broadly, these can be divided into data parallelism, model parallelism, pipeline parallelism and hybrid parallelism (a combination of data and model parallelism).

Data parallelism is a parallelization method that trains replicas of a model on individual devices using different subsets of data, known as mini-batches [9], [10]. In data parallel distributed training, each computing node or worker contains a neural network model replica and a chunk of the dataset, and computes gradients which are shared with other workers and used by the parameter server to update the model parameters [11]. However, as the number of parameters increases, the overhead for parameter synchronisation increases, leading to performance degradation. In addition, when a DNN model is too big, it cannot be executed on a single device, and data parallelization alone is not possible. Model parallelism is a parallelization method where a large model is split, running concurrent operations across multiple devices with the same mini-batch [8]. It can help to speed up DNN training either through its implementation or algorithm. In model parallelism, each node or worker has distinct parameters

and computation of the layers of a model, and also updates the weights of the allocated model layers. Pipelining parallelism splits the DNN model training tasks into a sequence of processing stages [56]. Each stage takes the result from the previous stage as input, with results being passed downstream immediately.

Recently, the combination of model and data parallelization methods, known as hybrid parallelization, has been explored to leverage the benefits of both methods and to minimize communication overhead in the multi-device parallel training of DNN models [15], [18], [19].

Despite the performance of the existing parallelization methods, they are still subject to further improvement by optimally allocating the model computations and data partitions to the available devices for better model training performance. In this paper, we have proposed a generic hybrid parallelization approach for parallel training of DNNs in multiple Graphics Processing Units (GPUs) computing environments, which combines both model and data parallelization methods. Our major contributions are as follows:
• Development of a generic full end-to-end hybrid parallelization approach for the multi-GPU distributed training of a DNN model.
• Model parallelization by splitting a DNN model into independent partitions, formulating the network partitions-to-GPUs allocation problem as a 0-1 multiple knapsack model, and proposing a Genetic Algorithm based heuristic resources allocation (GABRA) approach as an efficient solution to optimize the resources allocation.
• Exploitation of data parallelization based on the All-reduce method and asynchronous stochastic gradient descent across multiple GPUs for further acceleration of the overall training speed.
• Evaluation of the proposed approach through a real use case study – by parallel and distributed training of a 3D Residual Attention Deep Neural Network (3D-ResAttNet) for efficient Alzheimer's disease diagnosis.
The remainder of this paper is organized as follows: Section II reviews the related work of the study. Section III discusses the details of the proposed approach. In Section IV, the experimental evaluation is described. Section V concludes the work.

II. RELATED WORK
This section provides an overview in relation to distributed training of deep neural networks and genetic algorithms for resource optimisation.

A. PARALLEL AND DISTRIBUTED TRAINING OF DEEP NEURAL NETWORKS (DNNs)
As mentioned earlier, existing efforts on parallel and distributed training of DNNs can be broadly divided into four categories: data parallelism, model parallelism, pipeline parallelism and hybrid parallelism.

1) DATA PARALLELISM
In data parallelism, a dataset is broken down into mini-batches and distributed across the multiple GPUs; each GPU contains a complete replica of the model and computes the gradient. The gradient aggregation and updates among the GPUs are usually done either synchronously or asynchronously [11]. In synchronous training, all GPUs wait for each other to complete the gradient computation of their local models; the computed gradients are then aggregated before being used to update the global model. On the other hand, in asynchronous training, the gradient from one GPU is used to update the global model without waiting for other GPUs to finish. The asynchronous training method has higher throughput in that it eliminates the waiting time incurred in the synchronous training method. In both asynchronous and synchronous training, aggregated gradients can be shared between GPUs through the two basic data-parallel training architectures: the parameter server architecture and the AllReduce architecture. The parameter server architecture [14] is a centralized architecture where all GPUs communicate with a dedicated GPU for gradient aggregation and updates. Alternatively, the AllReduce architecture [20] is a decentralized architecture where the GPUs share parameter updates in a ring network topology through the AllReduce operation.
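For illustration, the synchronous AllReduce variant can be sketched in a few lines of PyTorch. The snippet below is a minimal sketch, not the paper's implementation: the linear model, random data and hyper-parameters are placeholders, and it assumes one process per GPU (e.g., launched via torchrun or mp.spawn).

import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank: int, world_size: int):
    # One process per GPU; NCCL performs the ring AllReduce between them.
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)
    model = DDP(torch.nn.Linear(128, 2).cuda(rank), device_ids=[rank])
    optimizer = torch.optim.SGD(model.parameters(), lr=1e-4)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(10):                         # placeholder mini-batches
        x = torch.randn(6, 128, device=rank)
        y = torch.randint(0, 2, (6,), device=rank)
        optimizer.zero_grad()
        loss_fn(model(x), y).backward()         # gradients are all-reduced here
        optimizer.step()                        # every replica applies the same update
    dist.destroy_process_group()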
2) MODEL PARALLELISM
In model parallelization, model layers are divided into partitions and distributed across GPUs for parallel training [21], [22]. In model parallel training, each GPU has distinct parameters and the computation of its layers of the model, and also updates the weights of the allocated model layers. Huo et al. [51] proposed Decoupled Parallel Back-propagation (DDG), which splits the network into partitions and solves the problem of backward locking by storing delayed error gradients and intermediate activations at each partition. Similarly, Zhuang et al. [52] adopted the delayed gradients method to propose a fully decoupled training scheme (FDG). The work breaks a neural network into several modules and trains them concurrently and asynchronously on multiple devices. However, the major challenges are how to break the model layers into partitions, as well as how to allocate partitions to GPUs for efficient training performance [16]. Moreover, model parallelization alone does not scale well to a large number of devices [17], as it involves heavy communication between workers.
3) PIPELINING PARALLELISM
Pipelining parallelism breaks the task (data and model) into a sequence of processing stages. Each stage takes the result from the previous stage as input, with results being passed downstream immediately [53]. Various works have adopted this technique. Lee et al. [54] used the pipeline parallelism approach to overlap computation and communication for CNN training. They implement a thread in each computer server to spawn communication processes after the gradient


is generated. Chen et al. [55] proposed a pipelined model parallel execution method for high GPU utilisation and used a novel weight prediction technique to achieve robust training accuracy. However, one of the significant drawbacks of pipelining parallelism is that it is limited by the slowest stages and has limited scalability.

4) HYBRID PARALLELISM
Several research works have explored both data and model parallelization methods for efficient DNN model training. Yadan et al. [23] achieved a 2.2× speed-up when training a large deep convolutional neural network model with hybridized data and model parallelism. Krizhevsky et al. [9] used model and data parallelization techniques to train a large deep convolutional neural network and classify 1.2 million high-resolution images in the ImageNet LSVRC-2010 contest into the 1000 different classes. Shazeer et al. [25] proposed Mesh-TensorFlow, where data parallelism is combined with model parallelism to improve the training performance of transformer models with a huge number of parameters. In Mesh-TensorFlow, users split layers across a multi-dimensional mesh of processors and exploit the data parallelism technique in conjunction with an All-reduce update method. Moreover, Onoufriou et al. [26] proposed Nemesyst, a novel end-to-end hybrid parallelism deep learning-based framework, where model partitions are trained with independent data sets simultaneously. Similarly, Oyama et al. [27] proposed end-to-end hybrid-parallel training algorithms for large-scale 3D convolutional neural networks. The algorithms combine both data and model parallelism to increase throughput and minimize I/O scaling bottlenecks.

The aforementioned approaches adopted data, model and pipeline parallelization separately, or a combination of the methods, to improve the performance of DNN model training. However, none of the existing methods considered the resource utilization and allocation problem in deep learning and provided solutions for efficient distributed training performance.

B. GENETIC ALGORITHMS FOR RESOURCE MANAGEMENT OPTIMIZATION
Resource management optimization is an important research topic in distributed computing systems [28]. Several works have been proposed, with different techniques for addressing resource management problems, such as scheduling [29] and allocation [30]. Genetic Algorithms (GAs) are commonly used to optimize either homogeneous or heterogeneous resources in distributed system environments [31], [57]. For instance, Gai et al. [32] proposed the Cost-Aware Heterogeneous Cloud Memory Model (CAHCM) to provide high performance cloud-based heterogeneous memory service offerings. It proposed the Dynamic Data Allocation Advanced (2DA) algorithm based on genetic programming to determine the data allocations on the cloud-based memories for the model. Mezache et al. [33] proposed a resource allocation method based on GA to minimize the number of hosts required to execute a set of cloudlets associated with the corresponding set of virtual machines, thereby reducing excessive power consumption in the data centre. Furthermore, Jiang et al. [34] proposed a multi-objective model based on the non-dominated sorting genetic algorithm to minimize the expected total makespan and the expected total cost of the disassembly service under the uncertain nature of the disassembly process. Mosa and Sakellariou [35] proposed a dynamic VM placement solution that used a GA to optimize the utilization of both CPU and memory, with the aim of ensuring better overall utilization in the cloud data centre. Devarasetty and Reddy [36] proposed an optimization method for resource allocation in the cloud with the aim of minimizing the deployment cost and improving the QoS performance. They used the GA to find optimal solutions to the allocation problem. In addition to resource allocation in the cloud environment, Mata and Guardieiro [37] investigated resource allocation in the Long-Term Evolution (LTE) uplink and proposed a scheduling algorithm based on GA to find a solution for allocating LTE resources to user requests. Moreover, Li and Zhu [38] adopted a genetic algorithm to develop a joint optimization method for offloading tasks to the mobile edge servers (MESs) in a mobile-edge computing environment under limited wireless transmission resources and MESs' processing resources.

However, none of the work mentioned above considered the resource allocation problem in deep learning and applied GA to solve the problem for efficient training performance of the DNN model.

III. THE PROPOSED APPROACH
In parallel and distributed computing, there are several considerations for efficient training of DNN models, including: 1) how to decompose a model or a dataset into parts/small chunks; 2) how to map and allocate these parts onto distributed resources for efficient computation as well as reduced communication overhead between computing nodes.

This work has proposed a generic full end-to-end hybrid parallelization approach for efficient training of a DNN model, which combines both data and model parallelization. For data parallelization, we have exploited the All-reduce method and asynchronous stochastic gradient descent across multiple GPUs for acceleration of the overall network training speed. For model parallelization, model layers are partitioned individually with the aim of reducing communication overhead during the training process. We have also designed a genetic algorithm-based heuristic resource allocation mechanism to map and allocate partitions to appropriate resources for efficient DNN training.

Figure 1 shows the high-level architecture, including 1) model parallelization, consisting of network partitioning and resource allocation components; and 2) data parallelization. The details of the proposed method are presented in the following sections. The important notations in this paper are detailed in Table 1.

FIGURE 1. The high level architecture of the proposed hybrid parallelization approach.

TABLE 1. Notations.

A. MODEL PARALLELIZATION
Model parallelization includes neural network model partitioning and a genetic algorithm-based heuristic resource allocation mechanism.

1) NETWORK PARTITIONING
The principle of the network partitioning is based on the computation loads of each layer, with the aim of reducing communication overhead during the training process. The highly functional layers are partitioned individually as single partitions for even distribution of the DNN model layers. For instance, a convolution layer of a CNN architecture has a large volume of weights and can be partitioned as a single partition for efficient parallel training performance.

Specifically, let us assume a model network contains a set of layers $\{s_1, s_2, \ldots, s_Q\}$. The model network $P$ is split into partitions $\{p_1, p_2, \ldots, p_n\}$, where $p_i = \{s_i, s_i + 1, \ldots, s_{i+1} - 1\}$ denotes the set of layers in partition $i$ such that $1 \le i \le n$; $s_i + 1$ and $s_{i+1} - 1$ are the second and last layers of each partition. In addition, all partitions are computed simultaneously: the gradient of the partition input is passed back to partition $(i - 1)$, while the partition output is sent to partition $(i + 1)$ as its new input. In the forward pass, the input $a^t_{s_i - 1}$ from partition $(i - 1)$ is sent to partition $i$ and gives the activation $a^t_{s_{i+1} - 1}$ at iteration $t$. Also, in the backward pass, $g^t_{s_{i+1} - 1}$ denotes the gradient at partition $(i + 1)$ at iteration $t$. For each layer $(s_i \le q \le s_{i+1} - 1)$ such that $q \le Q$, the gradient is given as $\hat{g}^t_{w_q} = \frac{\partial a^t_{s_{i+1} - 1}}{\partial w^t_q} g^t_{s_{i+1} - 1}$, which can be used for the update $w^{t+1}_q = w^t_q - \gamma_t \hat{g}^t_{w_q}$, where $\gamma_t$ is the learning rate. Accounting for the delay across partitions, the gradient becomes:

  \hat{g}^{t-i+1}_{w_q} = \frac{\partial a^{t-i+1}_{s_{i+1}-1}}{\partial w^{t-i+1}_q} \, g^{t-i+1}_{s_{i+1}-1}    (1)

which can be used for the update:

  w^{t-i+2}_q = w^{t-i+1}_q - \gamma_{t-i+1} \hat{g}^{t-i+1}_{w_q}    (2)

where $\gamma_{t-i+1}$ is the learning rate.
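As an illustration of this layer-wise partitioning, the following sketch groups a toy stack of layers into contiguous partitions and chains the forward pass across devices. The layer list, split points and three-GPU assumption are ours for illustration, not the 3D-ResAttNet partitioning.

import torch
import torch.nn as nn

layers = [nn.Conv3d(1, 8, 3), nn.ReLU(), nn.Conv3d(8, 16, 3), nn.ReLU(),
          nn.Flatten(), nn.LazyLinear(2)]        # stand-in for {s_1, ..., s_Q}
split_points = [0, 2, 4, len(layers)]            # partition boundaries s_i
devices = ["cuda:0", "cuda:1", "cuda:2"]         # one device per partition

partitions = [nn.Sequential(*layers[a:b]).to(dev)
              for a, b, dev in zip(split_points, split_points[1:], devices)]

def forward(x: torch.Tensor) -> torch.Tensor:
    # Partition i's output becomes partition (i + 1)'s input, as in the text.
    for part, dev in zip(partitions, devices):
        x = part(x.to(dev))
    return x

out = forward(torch.randn(1, 1, 16, 16, 16))     # toy 3D volume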
2) GENETIC ALGORITHM BASED RESOURCE ALLOCATION (GABRA)
To enable efficient DNN model training on multiple GPUs, we have also proposed a Genetic Algorithm-based heuristic resource allocation mechanism. A genetic algorithm (GA) is one of the evolutionary algorithms commonly used to provide efficient solutions to optimisation problems, such as resource allocation problems, based on biologically inspired operators such as mutation, crossover and selection. The proposed genetic-based algorithm aims to find the best network partitions to be allocated to the available GPUs, to maximise resource utilisation and minimise the computation time of network partitions for better overall training performance than the existing methods.

Thus, we formulate the problem of allocating GPUs to network partitions as a 0-1 multiple knapsack problem. As previously illustrated, we consider the computation loads of a set of partitions $p_i$, where $i = \{1, 2, \ldots, n\}$. We also consider the capacities of a set of available GPUs $G$, each denoted by $d_j$, where $j = \{1, 2, \ldots, m\}$ and $d_j \in G$. Furthermore, we assume that the GPUs can be either heterogeneous or homogeneous, with different or the same memory capacities. Each GPU runs at least one partition, and each partition needs to be allocated to only one GPU.

Let $C = (c_{ij}) \in \mathbb{R}^{n \times m}$ be an $n \times m$ matrix in which $c_{ij}$ is the profit of allocating GPU $j$ to partition $i$:

  c_{ij} = \frac{p_i}{d_j}    (3)

Also, let $X = (x_{ij}) \in \mathbb{R}^{n \times m}$, where

  x_{ij} = \begin{cases} 1, & \text{if GPU } j \text{ is allocated to partition } i \\ 0, & \text{otherwise} \end{cases}    (4)

Thus, we formulate the multiple knapsack model in terms of a function $z$ as:

  \max_{p, x, c} z(X) = \sum_{i=1}^{n} \sum_{j=1}^{m} x_{ij} c_{ij}    (5)

subject to:

  \sum_{i=1}^{n} p_i x_{ij} \le d_j, \quad \forall j \in M = \{1, 2, \ldots, m\}    (6)

  \sum_{j=1}^{m} x_{ij} = 1, \quad \forall i \in N = \{1, 2, \ldots, n\}    (7)

  x_{ij} \in \{0, 1\}, \quad \text{for } i = 1, 2, \ldots, n \text{ and } j = 1, 2, \ldots, m    (8)

Our goal is to find an allocation satisfying Eq. (8) that guarantees no GPU is overutilized and yields the maximum profit simultaneously. The objective function in Eq. (5) maximizes the sum of the profits of the selected allocations. The constraint in Eq. (6) ensures that the capacity of each available GPU is not exceeded, while the constraint in Eq. (7) ensures that each partition is allocated to exactly one GPU.
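A small numerical check of this formulation can be written as follows. The loads, capacities and the candidate allocation matrix X are made-up values for illustration only.

import numpy as np

p = np.array([4.0, 2.0, 3.0])        # computation loads of n = 3 partitions
d = np.array([6.0, 5.0])             # capacities of m = 2 GPUs
C = p[:, None] / d[None, :]          # profit matrix c_ij of Eq. (3)

X = np.array([[1, 0],                # partition 1 -> GPU 1
              [1, 0],                # partition 2 -> GPU 1
              [0, 1]])               # partition 3 -> GPU 2

profit = (X * C).sum()                        # objective z(X), Eq. (5)
capacity_ok = np.all(p @ X <= d)              # Eq. (6): no GPU over capacity
one_gpu_each = np.all(X.sum(axis=1) == 1)     # Eq. (7): one GPU per partition
print(profit, capacity_ok, one_gpu_each)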
Next, we present the Genetic Algorithm-Based Resources Allocation (GABRA) as an efficient solution to the model. The Genetic Algorithm has been proven to be a stochastic method that produces high-quality solutions for combinatorial optimization problems, particularly NP-hard problems [39].

Algorithm 1 shows the pseudo-code of GABRA for solving the GPUs-to-partitions allocation problem. It consists of four major parts: input, initialization, looping and output. In the initialization part (line 3), unlike in the classical GA, the set of chromosomes, also known as the initial population P(t), for allocating GPUs to partitions is generated as indicated in Algorithm 2, by randomizing the allocation of resources without exceeding their capacities with respect to the computation load of each network partition.

The looping part contains the fitness evaluation, selection, crossover and mutation functions. The objective is to optimize the total profit of allocating GPUs to partitions.

Algorithm 1: Genetic Algorithm Based Resources Allocation (GABRA)
  input : {p1, p2, ..., pn}: computation loads of partitions
          {d1, d2, ..., dm}: capacity values of available GPUs
  output: optimized solution f(Z*)
  1  evaluate c_ij ← p_i / d_j, for i = {1, 2, ..., n} and j = {1, 2, ..., m};
  2  set t ← 0;
  3  initialise P(t) ← {β1, β2, ..., βn};
  4  evaluate P(t): {f(β1), f(β2), ..., f(βn)};
  5  find Z* ∈ P(t) such that f(Z*) ≥ f(Z), ∀Z ∈ P(t);
  6  while (t < t_max) do
  7      select {Y1, Y2} = φ(P(t)); // φ is a selection function
  8      crossover W ← Ψc(Y1, Y2); // Ψc is a crossover function
  9      mutate W ← Ψm(W); // Ψm is a mutation function
  10     if W = any Z ∈ P(t) then
  11         go to 7
  12     end if
  13     evaluate f(W);
  14     find Z' ∈ P(t) such that f(Z') ≤ f(Z), ∀Z ∈ P(t), and replace Z' ← W;
  15     if f(W) > f(Z*) then
  16         Z* ← W; // update best fit Z*
  17     end if
  18     t ← t + 1;
  19 end while
  20 return Z*, f(Z*)

The fitness evaluation validates the optimal solution condition with respect to the optimization objectives. Thus, the fitness value of each chromosome is calculated as:

  f(\beta) = \sum_{i=1}^{n} c_{ij} \beta_i, \quad \text{for } j = 1, 2, \ldots, m    (9)

In the case where the optimal solution condition does not satisfy the optimization objectives, a new population is computed from the current population of solutions using their fitness values and the genetic functions — selection, crossover and mutation — in the looping part (lines 7-18). We use the selection function (φ), which is based on the roulette wheel method [50], to select the best chromosomes. The selection is based on the chromosomes' fitness values, representing the total profit of allocating partitions to the available GPUs. The chromosomes with higher fitness values are selected for the generation of the next population. The midpoint crossover function Ψc, as described in Algorithm 3, works on two parent chromosomes {Y1, Y2} with crossover probability 0.8 and produces a new individual.
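The selection and crossover operators just described can be sketched as follows, with chromosomes represented as lists of GPU indices (an assumed encoding):

import random

def roulette_select(population, fitness_values):
    # phi: roulette wheel -- fitter chromosomes are proportionally likelier.
    return random.choices(population, weights=fitness_values, k=2)

def midpoint_crossover(y1, y2):
    # psi_c (Algorithm 3): swap the two parents' halves at the mid point.
    cp = len(y1) // 2
    return y1[:cp] + y2[cp:], y2[:cp] + y1[cp:]

parents = roulette_select([[0, 1, 0], [1, 1, 0], [0, 0, 1]], [1.2, 0.9, 1.5])
child_a, child_b = midpoint_crossover(*parents)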
Next, the inversion mutation function Ψm is adopted, where a subset of genes in a chromosome is selected and

inverted to form mutated offspring. In line 14, the old chromosomes in the current population are replaced with the new chromosomes to form a new population. Finally, the algorithm terminates when the maximum number of generations is reached, or the optimal total profit of allocating GPUs to partitions is obtained.

Algorithm 2: Initial Population Algorithm
  input : {p1, p2, ..., pn}: computation loads of partitions
          {d1, d2, ..., dm}: capacity values of available GPUs
  output: initial population
  1 for (all partition loads) do
  2     randomize the allocation of partitions to the number of the available GPUs
  3 end for
  4 return initial population

Algorithm 3: Crossover Function (Ψc)
  input : Y1, Y2: two parent chromosomes
  output: Yλ1, Yλ2: two offspring chromosomes
  1 Φ ← length(Y1);
  2 cp ← Φ/2; // mid cross point
  3 Yλ1 ← Y1(1 : cp) ∪ Y2(cp : Φ);
  4 Yλ2 ← Y1(cp : Φ) ∪ Y2(1 : cp);
  5 return Yλ1, Yλ2
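Putting the pieces together, a compact Python transliteration of the GABRA loop might look as follows. The population size, generation count, loads and capacities are made-up values, and infeasible offspring are simply discarded; this is a sketch of Algorithms 1-3, not the authors' code.

import random

p = [4.0, 2.0, 3.0, 1.0]                   # partition loads (made up)
d = [6.0, 5.0]                             # GPU capacities (made up)
C = [[pi / dj for dj in d] for pi in p]    # profit matrix, Eq. (3)

def feasible(beta):
    # Eq. (6): the load placed on each GPU must not exceed its capacity.
    return all(sum(p[i] for i, g in enumerate(beta) if g == j) <= d[j]
               for j in range(len(d)))

def fitness(beta):
    # Eq. (9): total profit of the allocation encoded by chromosome beta.
    return sum(C[i][g] for i, g in enumerate(beta))

def random_chromosome():
    # Algorithm 2: randomize allocations until the capacities are respected.
    while True:
        beta = [random.randrange(len(d)) for _ in p]
        if feasible(beta):
            return beta

population = [random_chromosome() for _ in range(20)]
best = max(population, key=fitness)
for t in range(100):                                          # t_max generations
    weights = [fitness(b) for b in population]
    y1, y2 = random.choices(population, weights=weights, k=2)  # phi: selection
    cp = len(y1) // 2
    child = y1[:cp] + y2[cp:]                                  # psi_c: crossover
    i, j = sorted(random.sample(range(len(child)), 2))
    child[i:j + 1] = child[i:j + 1][::-1]                      # psi_m: inversion
    if feasible(child):
        worst = min(range(len(population)), key=lambda k: fitness(population[k]))
        population[worst] = child                              # line 14 of Algorithm 1
        if fitness(child) > fitness(best):
            best = child                                       # update Z*
print(best, fitness(best))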
B. DATA PARALLELLISATION
To accelerate the training process, each GPU uses a different mini-batch; that is, GPU 1 uses the first mini-batch, GPU 2 uses the second mini-batch, and so on. To reduce computation time, the full DNN is trained by training a partition with its mini-batch on all GPUs concurrently. Furthermore, we adopt Asynchronous Stochastic Gradient Descent (ASGD) [8] as well as the ring All-reduce mechanism [40] for parameter updates to complete an iteration. The process continues until all the iterations are completed. ASGD achieves a faster training speed as there is no need to wait for the slowest GPU in every iteration for the global model updates. The ring All-reduce is an optimal communication algorithm to minimize the communication overhead among the GPUs, where all GPUs are logically arranged in a ring topology. Each GPU sends and receives the information required to update its model parameters from its neighbour GPUs.

In all, the objective is to minimize:

  f(w; V) = \frac{1}{b \times m} \sum_{i=1}^{b \times m} \ell(w, v_i)    (10)

where $f$ is a neural network, $b$ is the batch size, $m$ is the number of GPUs, $\ell$ is a loss function for each data point $v \in V$, and $w$ is the trainable parameter of the neural network. The derivative of this objective, also referred to as the gradient, is given as:

  \frac{\partial f(w; V)}{\partial w} = \frac{1}{b \times m} \sum_{i=1}^{b \times m} \frac{\partial \ell(w, v_i)}{\partial w}    (11)

In data parallelization, the gradient update is calculated as an average of per-GPU terms, each of which is the average of derivatives over $b$ data points, and is given as:

  \frac{\partial f(w; V)}{\partial w} = \frac{1}{m} \left( \frac{1}{b} \sum_{i=1}^{b} \frac{\partial \ell(w, v_i)}{\partial w} + \frac{1}{b} \sum_{i=b+1}^{b \times 2} \frac{\partial \ell(w, v_i)}{\partial w} + \cdots + \frac{1}{b} \sum_{i=b \times (m-1)+1}^{b \times m} \frac{\partial \ell(w, v_i)}{\partial w} \right)    (12)
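Eq. (12) is simply Eq. (11) regrouped GPU by GPU, which the following toy check confirms numerically; the per-point "gradients" are random scalars rather than real model gradients.

import numpy as np

b, m = 4, 3                                    # batch size per GPU, number of GPUs
g = np.random.randn(b * m)                     # toy per-point gradients

global_grad = g.mean()                         # Eq. (11): mean over all b*m points
per_gpu_means = g.reshape(m, b).mean(axis=1)   # each GPU averages its b points
allreduced = per_gpu_means.mean()              # Eq. (12): average of the m averages

assert np.isclose(global_grad, allreduced)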
In addition, the speed of data-parallel training with $m$ GPUs can be expressed as:

  S_{T_m} = \frac{T_1}{T_m} \times \frac{TS_1}{TS_m} \times \frac{E_1}{E_m}    (13)

where $T_1$ is the average training time per step using one GPU, while $T_m$ is the time per step using $m$ GPUs. $E_1$ is the number of epochs required to converge for one GPU, while $E_m$ is the number of epochs required for $m$ GPUs.
IV. EXPERIMENTAL EVALUATION THROUGH A REAL USE CASE STUDY
We have applied our approach to a real case study in neurocomputing to evaluate the effectiveness of the proposed method. Previously, we developed a 3D explainable residual self-attention convolutional neural network (3D-ResAttNet) to automatically classify discriminative atrophy localization on sMRI images for Alzheimer's Disease (AD) diagnosis [42]. It is a non-parallel model and runs only on a single GPU. To evaluate the proposed parallel approach, we have parallelized our previous 3D-ResAttNet model, run it in a homogeneous multiple-GPU setting, and compared the performance with and without parallelization. Moreover, we have compared our approach with the state-of-the-art methods, including Distributed Data Parallel (DDP) and Data Parallel (DP) from the PyTorch framework [43], FDG [51] and DDG [52].

A. EVALUATION METRICS
We have adopted standard metrics for performance evaluation, including Speedup (S), Accuracy (ACC) and Training Time (TT). The Speedup (S) measures the scalability and computing performance. It is defined as the ratio of the serial runtime of the best sequential algorithm for solving a problem to the time taken by the parallel algorithm to solve the same problem on multiple processors (e.g., GPUs in this case). It can be calculated as:

  S = T_s / T_p    (14)

where $T_s$ represents the computing time on a single machine

or GPU, and $T_p$ refers to the computing time on multiple machines or GPUs. The Accuracy (ACC) measures the classification accuracy and is defined as:

  ACC = (TP + TN) / (TP + TN + FP + FN)    (15)

where TP = true positives, FP = false positives, TN = true negatives and FN = false negatives. Training Time (TT) is the time taken to train 3D-ResAttNet using the proposed approach and the other existing distributed training methods.
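For reference, the two metrics translate directly into code; the example figures below are illustrative, not measurements from the paper.

def speedup(t_serial: float, t_parallel: float) -> float:
    return t_serial / t_parallel                   # S = Ts / Tp, Eq. (14)

def accuracy(tp: int, tn: int, fp: int, fn: int) -> float:
    return (tp + tn) / (tp + tn + fp + fn)         # ACC, Eq. (15)

print(speedup(100.0, 17.7))                        # e.g. roughly 5.6x
print(accuracy(tp=48, tn=49, fp=2, fn=1))          # e.g. 0.97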

B. SYSTEM CONFIGURATION
We have conducted our experiments on Amazon Web Services (AWS) EC2 P3 instances. Specifically, we used a p3.16xlarge instance consisting of 8 homogeneous NVIDIA Tesla V100 GPUs, developed purposely for deep learning and artificial intelligence workloads, which provides ultra-fast GPU-to-GPU communication through NVLink technology. Other hardware configuration of the p3.16xlarge instance includes 128 GB of GPU memory, 64 vCPUs, 488 GB of memory, and 25 Gbps of network bandwidth. Additionally, the software configuration/installation includes: Ubuntu 18.04, Python 3.7.3, PyTorch 1.2.0, Torchvision 0.4.0, NumPy 1.15.4, TensorboardX 1.4, Matplotlib 3.0.1, Tqdm 4.39.0, nibabel, fastai, and the NVIDIA Collective Communications Library (NCCL) with CUDA toolkit 10.2 — a library of multi-GPU collective communication primitives [41].

FIGURE 2. The hybrid parallelisation of 3D-ResAttNet.
TABLE 2. Demographic data for the subjects from ADNI database.

C. A USE CASE - PARALLELIZATION OF 3D-ResAttNet FOR ALZHEIMER'S DISEASE (AD) DIAGNOSIS
As described earlier, we have applied our hybrid parallelization approach to our previous non-parallel 3D-ResAttNet for automatic detection of the progression of AD and its Mild Cognitive Impairments (MCIs), such as Normal Cohort (NC), Progressive MCI (pMCI) and Stable MCI (sMCI), from sMRI scans [42]. It includes two types of classification: NC vs. AD, and pMCI vs. sMCI.

1) THE HIGH-LEVEL PARALLELIZATION OF THE SYSTEM
Fig. 2 shows the high-level parallelization of our previous 3D-ResAttNet model architecture, based on a self-attention residual mechanism and explainable gradient-based localisation class activation mapping (Grad-CAM) to improve AD diagnosis performance. The 3D-ResAttNet model consists of 3D convolutional blocks (Conv blocks), a residual self-attention block, and explainable blocks. Conv blocks use a 3D filter for computation of the low-level feature representations. The residual self-attention block combines two important network layers: a residual network layer and a self-attention layer. The residual network layer comprises two Conv blocks consisting of 3 × 3 × 3 3D convolution layers, 3D batch normalization and rectified-linear-unit nonlinearity layers (ReLU). The explainable block uses 3D Grad-CAM to improve the model decision.

As shown in Fig. 2, the hybrid parallelization approach for 3D-ResAttNet is divided into three phases: the splitting of 3D-ResAttNet into partitions, the allocation of GPUs to partitions, and data partitioning and distribution. For data parallelization, we adopt stochastic gradient descent as well as the ring All-reduce mechanism for parameter updates, and equally distribute data parts to each GPU. For model parallelization, the network model is partitioned based on its computational complexity, which is usually synonymous with the number of basic operations, such as multiplications and summations, that each layer performs. Each Conv block in the network consists of a 3 × 3 × 3 3D convolution layer, 3D batch normalization and a rectified-linear-unit nonlinearity layer (ReLU). Moreover, a convolutional layer has a higher operation count, with complexity $O(C_o \cdot C_1 \cdot T \cdot H \cdot W \cdot K_T \cdot K_H \cdot K_W)$, where $C_o$ and $C_1$ denote the number of output and input channels respectively, $T$, $H$ and $W$ are the image dimensions, and $K_T$, $K_H$ and $K_W$ are the filter dimensions. Consequently, we partitioned each Conv block individually as a single partition, while other layers with fewer computation operations are partitioned as shown in Fig. 2.
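This partitioning heuristic can be sketched as follows: estimate each layer's operation count from the complexity expression above and give heavy layers their own partition. The layer shapes and the threshold are assumptions for illustration.

def conv3d_ops(c_o, c_1, t, h, w, kt, kh, kw):
    # Operation count O(Co * C1 * T * H * W * KT * KH * KW) from the text.
    return c_o * c_1 * t * h * w * kt * kh * kw

layers = [                                   # (name, estimated operations)
    ("conv1", conv3d_ops(32, 1, 96, 96, 96, 3, 3, 3)),
    ("relu1", 10_000),                       # cheap element-wise layer
    ("conv2", conv3d_ops(64, 32, 48, 48, 48, 3, 3, 3)),
    ("pool1", 50_000),
]

THRESHOLD = 1e8                              # assumed cut-off for "heavy"
partitions, current = [], []
for name, ops in layers:
    if ops > THRESHOLD:                      # heavy layer: its own partition
        if current:
            partitions.append(current)
            current = []
        partitions.append([name])
    else:
        current.append(name)                 # light layers share a partition
if current:
    partitions.append(current)
print(partitions)   # [['conv1'], ['relu1'], ['conv2'], ['pool1']]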
2) DATASET DESCRIPTION
The dataset is obtained from the Alzheimer's Disease Neuroimaging Initiative (ADNI) database (http://adni.loni.usc.edu), which is the same dataset previously used for validation of our 3D-ResAttNet. The dataset contains 1193 MRI scans of four classes: 389 Alzheimer's

Disease (AD), 400 Normal Cohort (NC), 232 stable mild cognitive impairment (sMCI) and 172 progressive mild cognitive impairment (pMCI) patients. The demographic data for this dataset is shown in Table 2.

TABLE 3. Parallel training performance of 3D-ResAttNet using our proposed approach.

FIGURE 3. Speedup of our proposed approach.

D. EXPERIMENTS
We have conducted experiments under different strategies:
1) We have evaluated the model performance in the parallel setting across the number of heterogeneous GPUs.
2) We have compared our proposed approach with four existing data and model parallelism methods for further evaluation, including data parallelism — two PyTorch generic distributed training methods: DistributedDataParallel (DDP) and DataParallel (DP) — and model parallelism — the delayed gradient parallel methods FDG and DDG. DDP is multi-process data parallel training across GPUs either on a single machine or on multiple machines, while DP is single-process multi-parallel

FIGURE 4. Training time of proposed approach, FDG, DDG, DDP and DP.

training using multiple GPUs on a single machine [43]. Both FDG and DDG were implemented by partitioning the data and training parallel models with sub-data across multiple GPUs.

In all experiments, we carried out distributed training of 3D-ResAttNet18 and 3D-ResAttNet34 for two classification tasks, sMCI vs. pMCI and AD vs. NC, with different numbers of GPUs (ranging from 1 to 8). Furthermore, we optimized the model parameters with SGD, a stochastic optimization algorithm. We adopted other training parameters, including a batch size of six samples, cross-entropy as the loss function, and 50 epochs for better convergence. In addition, we set the initial learning rate (LR) to 1×10⁻⁴, then reduced it by 1×10⁻² with increased iterations.

E. EXPERIMENTAL RESULTS AND DISCUSSION
1) PERFORMANCE OF 3D-ResAttNet IN THE PARALLEL SETTING
We conducted the experiments on the 3D-ResAttNet model for two classification tasks: sMCI vs. pMCI and AD vs. NC. Table 3 shows the experiment results of our parallel 3D-ResAttNet (with 18 and 34 layers respectively) in terms of both training time (TT) and accuracy. Based on these results, we calculated the speedup, as shown in Figs. 3a, 3b, 3c, and 3d. In all cases, it is observed that our proposed approach achieves almost linear speedup, which demonstrates the scalability of our approach, in that the training speedup grows in proportion to the number of GPUs. For instance, in the AD vs. NC classification task with 3D-ResAttNet34, the training speedups for 1, 2, 3, 4, 5, 6, 7, and 8 GPUs are 1, 2.38, 2.95, 3.44, 3.65, 4.13, 5.17, and 5.64 respectively. A similar trend is also observed in the sMCI vs. pMCI classification task with 3D-ResAttNet34, where the training speedups for 1, 2, 3, 4, 5, 6, 7, and 8 GPUs are 1, 2.27, 2.62, 3.09, 3.40, 3.78, 4.86, and 5.67 respectively.

2) COMPARISON WITH THE EXISTING PARALLEL WORKS
We compare our proposed approach with existing parallel and non-parallel methods regarding training time and accuracy, respectively, to affirm the robustness of the proposed method.

TABLE 4. Accuracy comparison with the existing works.

(i) Training speed: our proposed approach and the existing parallel approaches.
We have compared our approach with DDP [43], DP [43], DDG [52] and FDG [51]. DDP and DP are two PyTorch generic distributed training methods. FDG and DDG are model parallelism approaches based on the delayed gradient method. The experiment results are shown in Figs. 4a, 4b, 4c, and 4d. It can be seen that our proposed approach outperforms the existing methods in terms of training time. For instance, for 3D-ResAttNet18 on the AD vs. NC and sMCI vs. pMCI classification tasks, the training time incurred by our proposed approach is on average 20% lower than DDP, DP, DDG and FDG. Similar trends are observed when comparing the proposed approach with DDP, DP, DDG and FDG on the distributed training of 3D-ResAttNet34 for the two classification tasks: AD vs. NC and sMCI vs. pMCI.
(ii) Accuracy: comparison with the existing non-parallel works.
Table 4 shows the accuracy comparison results for seven state-of-the-art deep neural networks and our methods. The best testing accuracies obtained by our approach are 97% and 84% for AD vs. NC and sMCI vs. pMCI classification respectively. The results show that our proposed approach performs efficiently when compared with the existing works in terms of accuracy. In addition, our work implements parallel distributed training of networks in a multi-GPU environment, whereas the existing works are non-parallel methods.

V. CONCLUSION AND FUTURE WORK
In this work, we have proposed a hybrid parallelization approach that combines both model and data parallelization for parallel training of a DNN model. The Genetic Algorithm based heuristic resources allocation mechanism (GABRA) has also been developed for optimal distribution of network partitions on the available GPUs, with the same or different capacities, for performance optimization. Our proposed approach has been compared with the existing state-of-the-art parallel methods and evaluated on a real use case based on our previous 3D-ResAttNet model developed for efficient AD diagnosis. The experiment results show that the proposed approach achieves almost linear speedup, which demonstrates its scalability and efficient computing capability, with little or no difference in accuracy performance when compared with the existing non-parallel DNN models.

Future work will be focused on further improvement of the parallelization approach for efficient training performance.

REFERENCES
[1] Y. Li, Y. Zhang, and Z. Zhu, "Error-tolerant deep learning for remote sensing image scene classification," IEEE Trans. Cybern., vol. 51, no. 4, pp. 1756–1768, Apr. 2021, doi: 10.1109/TCYB.2020.2989241.
[2] B. J. Abbaschian, D. Sierra-Sosa, and A. Elmaghraby, "Deep learning techniques for speech emotion recognition, from databases to models," Sensors, vol. 21, no. 4, p. 1249, Feb. 2021.
[3] M. Liu, J. Zhang, C. Lian, and D. Shen, "Weakly supervised deep learning for brain disease prognosis using MRI and incomplete clinical scores," IEEE Trans. Cybern., vol. 50, no. 7, pp. 3381–3392, Jul. 2020, doi: 10.1109/TCYB.2019.2904186.
[4] M. F. J. Acosta, L. Y. C. Tovar, M. B. Garcia-Zapirain, and W. S. Percybrooks, "Melanoma diagnosis using deep learning techniques on dermatoscopic images," BMC Med. Imag., vol. 21, no. 1, p. 6, Dec. 2021.
[5] M. Schedl, "Deep learning in music recommendation systems," Frontiers Appl. Math. Statist., vol. 5, p. 44, Aug. 2019.
[6] D. J. N. J. Soemers, V. Mella, C. Browne, and O. Teytaud, "Deep learning for general game playing with Ludii and Polygames," 2021, arXiv:2101.09562.
[7] H. Tembine, "Deep learning meets game theory: Bregman-based algorithms for interactive deep generative adversarial networks," IEEE Trans. Cybern., vol. 50, no. 3, pp. 1132–1145, Mar. 2020, doi: 10.1109/TCYB.2018.2886238.
[8] M. Diskin, A. Bukhtiyarov, M. Ryabinin, L. Saulnier, Q. Lhoest, A. Sinitsin, D. Popov, D. Pyrkin, M. Kashirin, A. Borzunov, A. V. D. Moral, D. Mazur, I. Kobelev, Y. Jernite, T. Wolf, and G. Pekhimenko, "Distributed deep learning in open collaborations," in Proc. Adv. Neural Inf. Process. Syst., 2021, pp. 7879–7897.
[9] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "ImageNet classification with deep convolutional neural networks," Commun. ACM, vol. 60, no. 6, pp. 84–90, May 2017.
[10] J. George and P. Gurram, "Distributed deep learning with event-triggered communication," 2019, arXiv:1909.05020.
[11] S. Kim, G.-I. Yu, H. Park, S. Cho, E. Jeong, H. Ha, S. Lee, J. S. Jeong, and B.-G. Chun, "Parallax: Sparsity-aware data parallel training of deep neural networks," in Proc. 14th EuroSys Conf., Dresden, Germany, Mar. 2019, p. 15.
[12] T. Chen, M. Li, Y. Li, M. Lin, N. Wang, M. Wang, T. Xiao, B. Xu, C. Zhang, and Z. Zhang, "MXNet: A flexible and efficient machine learning library for heterogeneous distributed systems," 2015, arXiv:1512.01274.
[13] PyTorch. (2020). PyTorch Deep Learning Framework That Puts Python First. Accessed: Dec. 16, 2020. [Online]. Available: http://pytorch.org/
[14] Z. Song, Y. Gu, Z. Wang, and G. Yu, "DRPS: Efficient disk-resident parameter servers for distributed machine learning," Frontiers Comput. Sci., vol. 16, no. 4, Aug. 2022, Art. no. 164321.
[15] J. Ono, M. Utiyama, and E. Sumita, "Hybrid data-model parallel training for sequence-to-sequence recurrent neural network machine translation," 2019, arXiv:1909.00562.
[16] S. Gandhi and A. P. Iyer, "P3: Distributed deep graph learning at scale," in Proc. 15th USENIX Symp. Oper. Syst. Design Implement. (OSDI), Jul. 2021, pp. 551–568.
[17] A. Mirhoseini, H. Pham, Q. V. Le, B. Steiner, R. Larsen, Y. Zhou, N. Kumar, M. Norouzi, S. Bengio, and J. Dean, "Device placement optimization with reinforcement learning," in Proc. ICML, 2017, pp. 2430–2439.
[18] M. Wang, C.-C. Huang, and J. Li, "Unifying data, model and hybrid parallelism in deep learning via tensor tiling," 2018, arXiv:1805.04170.
[19] L. Song, J. Mao, Y. Zhuo, X. Qian, H. Li, and Y. Chen, "HyPar: Towards hybrid parallelism for deep learning accelerator array," in Proc. IEEE Int. Symp. High Perform. Comput. Archit. (HPCA), Feb. 2019, pp. 56–68.

[20] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[21] Z. Jia, M. Zaharia, and A. Aiken, "Beyond data and model parallelism for deep neural networks," 2018, arXiv:1807.05358.
[22] A. Mirhoseini, A. Goldie, H. Pham, B. Steiner, Q. V. Le, and J. Dean, "A hierarchical model for device placement," in Proc. Int. Conf. Learn. Represent., 2018, pp. 1–11.
[23] O. Yadan, K. Adams, Y. Taigman, and M. Ranzato, "Multi-GPU training of ConvNets," 2014, arXiv:1312.5853. [Online]. Available: https://arxiv.org/abs/1312.5853
[24] Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen, D. Chen, H. Lee, J. Ngiam, Q. V. Le, Y. Wu, and Z. Chen, "GPipe: Efficient training of giant neural networks using pipeline parallelism," 2018, arXiv:1811.06965.
[25] N. Shazeer, Y. Cheng, N. Parmar, D. Tran, A. Vaswani, P. Koanantakool, P. Hawkins, H. Lee, M. Hong, C. Young, R. Sepassi, and B. A. Hechtman, "Mesh-TensorFlow: Deep learning for supercomputers," in Proc. NeurIPS, 2018, pp. 1–10.
[26] G. Onoufriou, R. Bickerton, S. Pearson, and G. Leontidis, "Nemesyst: A hybrid parallelism deep learning-based framework applied for Internet of Things enabled food retailing refrigeration systems," Comput. Ind., vol. 113, Dec. 2019, Art. no. 103133.
[27] Y. Oyama, N. Maruyama, N. Dryden, E. Mccarthy, P. Harrington, J. Balewski, S. Matsuoka, P. Nugent, and B. Van Essen, "The case for strong scaling in deep learning: Training large 3D CNNs with hybrid parallelism," IEEE Trans. Parallel Distrib. Syst., vol. 32, no. 7, pp. 1641–1652, Jul. 2021.
[28] Z. Wesolowski, "Network resource allocation in distributed systems: A global optimization framework," in Proc. IEEE 2nd Int. Conf. Cybern. (CYBCONF), Jun. 2015, pp. 267–270.
[29] A. V. Martinez, "Scheduling in heterogeneous distributed computing systems based on internal structure of parallel tasks graphs with meta-heuristics," Appl. Sci., vol. 10, no. 18, p. 6611, Sep. 2020.
[30] L. Haji, S. Zeebaree, O. Ahmed, A. Sallow, K. Jacksi, and R. Zebari, "Dynamic resource allocation for distributed systems and cloud computing," Test Eng. Manage., vol. 83, pp. 22417–22426, May/Jun. 2020.
[31] M. Zhang, L. Liu, and S. Liu, "Genetic algorithm based QoS-aware service composition in multi-cloud," in Proc. IEEE Conf. Collaboration Internet Comput. (CIC), Oct. 2015, pp. 113–118.
[32] K. Gai, L. Qiu, H. Zhao, and M. Qiu, "Cost-aware multimedia data allocation for heterogeneous memory using genetic algorithm in cloud computing," IEEE Trans. Cloud Comput., vol. 8, no. 4, pp. 1212–1222, Oct. 2020.
[33] C. Mezache, O. Kazar, and S. Bourekkache, "A genetic algorithm for resource allocation with energy constraint in cloud computing," in Proc. Int. Conf. Image Process., Prod. Comput. Sci. (ICIPCS), 2016, pp. 62–69.
[34] H. Jiang, J. Yi, S. Chen, and X. Zhu, "A multi-objective algorithm for task scheduling and resource allocation in cloud-based disassembly," J. Manuf. Syst., vol. 41, pp. 239–255, Oct. 2016.
[35] A. Mosa and R. Sakellariou, "Dynamic virtual machine placement considering CPU and memory resource requirements," in Proc. IEEE 12th Int. Conf. Cloud Comput. (CLOUD), Jul. 2019, pp. 196–198.
[36] P. Devarasetty and S. Reddy, "Genetic algorithm for quality of service based resource allocation in cloud computing," Evol. Intell., vol. 14, no. 2, pp. 381–387, Jun. 2021.
[37] S. H. da Mata and P. R. Guardieiro, "A genetic algorithm based approach for resource allocation in LTE uplink," in Proc. Int. Telecommun. Symp. (ITS), Aug. 2014, pp. 1–5.
[38] Z. Li and Q. Zhu, "Genetic algorithm-based optimization of offloading and resource allocation in mobile-edge computing," Information, vol. 11, no. 2, p. 83, Feb. 2020.
[39] T. Perry, M. Bader-El-Den, and S. Cooper, "Imbalanced classification using genetically optimized cost sensitive classifiers," in Proc. IEEE Congr. Evol. Comput. (CEC), 2015, pp. 680–687.
[40] A. Sergeev and M. D. Balso, "Horovod: Fast and easy distributed deep learning in TensorFlow," 2018, arXiv:1802.05799.
[41] (2021). NVIDIA, NCCL. Accessed: Jan. 23, 2021. [Online]. Available: https://docs.nvidia.com/deeplearning/nccl/install-guide/index.html
[42] X. Zhang, L. Han, W. Zhu, L. Sun, and D. Zhang, "An explainable 3D residual self-attention deep neural network for joint atrophy localization and Alzheimer's disease diagnosis using structural MRI," 2020, arXiv:2008.04024.
[43] S. Li, Y. Zhao, R. Varma, O. Salpekar, P. Noordhuis, T. Li, A. Paszke, J. Smith, B. Vaughan, P. Damania, and S. Chintala, "PyTorch distributed: Experiences on accelerating data parallel training," Proc. VLDB Endowment, vol. 13, no. 12, pp. 3005–3018, Aug. 2020.
[44] E. Hosseini-Asl, R. Keynton, and A. El-Baz, "Alzheimer's disease diagnostics by adaptation of 3D convolutional network," in Proc. IEEE Int. Conf. Image Process. (ICIP), Sep. 2016, pp. 126–130.
[45] H. Suk and D. Shen, "Deep learning-based feature representation for AD/MCI classification," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervent., 2013, pp. 583–590.
[46] S. Sarraf and G. Tofighi, "Classification of Alzheimer's disease using fMRI data and deep learning convolutional neural networks," 2016, arXiv:1603.08631.
[47] C. D. Billones, O. J. L. D. Demetria, D. E. D. Hostallero, and P. C. Naval, "DemNet: A convolutional neural network for the detection of Alzheimer's disease and mild cognitive impairment," in Proc. IEEE Region 10 Conf. (TENCON), Nov. 2016, pp. 3724–3727.
[48] H. Li, M. Habes, and Y. Fan, "Deep ordinal ranking for multi-category diagnosis of Alzheimer's disease using hippocampal MRI data," 2017, arXiv:1709.01599.
[49] J. Shi, X. Zheng, Y. Li, Q. Zhang, and S. Ying, "Multimodal neuroimaging feature learning with multimodal stacked deep polynomial networks for diagnosis of Alzheimer's disease," IEEE J. Biomed. Health Informat., vol. 22, no. 1, pp. 173–183, Jan. 2018.
[50] F. Yu, X. Fu, H. Li, and G. Dong, "Improved fitness proportionate selection-based genetic algorithm," in Proc. 3rd Int. Conf. Mechatronics Inf. Technol., 2016, pp. 136–140.
[51] Z. Huo, B. Gu, Q. Yang, and H. Huang, "Decoupled parallel backpropagation with convergence guarantee," in Proc. ICML, 2018, pp. 2098–2106.
[52] H. Zhuang, Y. Wang, Q. Liu, and Z. Lin, "Fully decoupled neural network learning using delayed gradients," IEEE Trans. Neural Netw. Learn. Syst., early access, Apr. 9, 2021, doi: 10.1109/TNNLS.2021.3069883.
[53] D. Narayanan, A. Harlap, A. Phanishayee, V. Seshadri, N. R. Devanur, G. R. Ganger, P. B. Gibbons, and M. Zaharia, "PipeDream: Generalized pipeline parallelism for DNN training," in Proc. 27th ACM Symp. Oper. Syst. Princ., New York, NY, USA: Association for Computing Machinery, Oct. 2019, pp. 1–15.
[54] S. Lee, D. Jha, A. Agrawal, A. Choudhary, and W.-K. Liao, "Parallel deep convolutional neural network training by exploiting the overlapping of computation and communication," in Proc. IEEE 24th Int. Conf. High Perform. Comput. (HiPC), Jaipur, India, Dec. 2017, pp. 183–192.
[55] C.-C. Chen, C.-L. Yang, and H.-Y. Cheng, "Efficient and robust parallel DNN training through model parallelism on multi-GPU platform," 2018, arXiv:1809.02839.
[56] C. Kim, H. Lee, M. Jeong, W. Baek, B. Yoon, I. Kim, S. Lim, and S. Kim, "torchgpipe: On-the-fly pipeline parallelism for training giant models," 2020, arXiv:2004.09910.
[57] S. Li, Z. Huang, L. Han, and C. Jiang, "A genetic algorithm enhanced automatic data flow management solution for facilitating data intensive applications in the cloud," Concurrency Comput., Pract. Exp., vol. 30, no. 23, Dec. 2018, Art. no. e4844.

SAMSON B. AKINTOYE received the Ph.D. degree in computer science from the University of the Western Cape, South Africa, in 2019. He is currently working as a Research Associate with the Department of Computing and Mathematics, Manchester Metropolitan University, U.K. His current research interests include parallel and distributed computing, deep learning, and cloud computing.

LIANGXIU HAN received the Ph.D. degree in computer science from Fudan University, Shanghai, China, in 2002. She is currently a Professor of computer science with the Department of Computing and Mathematics, Manchester Metropolitan University. Her research interests include the development of novel big data analytics and of novel intelligent architectures that facilitate big data analytics (e.g., parallel and distributed computing, cloud/service-oriented computing, and data intensive computing), as well as applications in different domains using various large datasets (biomedical images, environmental sensors, network traffic data, and web documents). She is also a principal investigator or a co-PI on a number of research projects in the research areas mentioned above.

HAOMING CHEN is currently pursuing the master's degree in computer science and artificial intelligence with The University of Sheffield. His current research interests include machine learning and artificial intelligence.

XIN ZHANG received the B.S. degree from The PLA Academy of Communication and Commanding, China, in 2009, and the Ph.D. degree in cartography and geographic information system from Beijing Normal University (BNU), China, in 2014. He is currently an Associate Researcher with Manchester Metropolitan University (MMU). His current research interests include remote sensing image processing and deep learning.

DAOQIANG ZHANG (Senior Member, IEEE) received the B.Sc. and Ph.D. degrees in computer science from the Nanjing University of Aeronautics and Astronautics, Nanjing, China, in 1999 and 2004, respectively. He is currently a Professor with the Department of Computer Science and Engineering, Nanjing University of Aeronautics and Astronautics. His current research interests include machine learning, pattern recognition, and biomedical image analysis. In these areas, he has authored or coauthored more than 100 technical papers in refereed international journals and conference proceedings.
