Ch12 - Distributed Deep Learning


Distributed Deep Learning Systems

Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Technology
HCMC University of Technology
HPC Lab-CSE-HCMUT 1
Parameter Server

HPC Lab - CSE-HCMUT 2


Parameter Server (PS)
[Figure: parameters p1 … p9 sharded across several parameter server machines, with a row of workers below]
§ Model parameters are stored on PS machines and accessed via a key-value interface (distributed shared memory)
§ Extensions
o Multiple keys (for a matrix); multiple “channels” (for multiple sparse vectors, multiple clients for the same servers, …)
o Push/pull interface to send/receive the most recent copy of (a subset of) the parameters; blocking is optional
o Can block until pushes/pulls with clock < (t − τ) complete
[Smola et al 2010, Ho et al 2013, Li et al 2014]
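A minimal single-process sketch of the key-value push/pull interface described above; the class ToyParameterServer and its methods are invented for illustration, whereas real systems shard the keys across machines and exchange them over the network.

import numpy as np

class ToyParameterServer:
    """Single-process sketch of a key-value parameter store with push/pull."""
    def __init__(self):
        self.store = {}   # key -> parameter vector
        self.clock = 0    # number of applied updates

    def pull(self, keys):
        # Return the most recent copy of the requested parameters.
        return {k: self.store[k].copy() for k in keys}

    def push(self, grads, lr=0.1):
        # Apply gradient updates to the stored parameters.
        for k, g in grads.items():
            self.store[k] -= lr * g
        self.clock += 1

ps = ToyParameterServer()
ps.store["w"] = np.zeros(4)
params = ps.pull(["w"])                  # a worker downloads parameters
ps.push({"w": np.ones(4)})               # a worker uploads a gradient
print(ps.clock, ps.pull(["w"])["w"])     # 1 [-0.1 -0.1 -0.1 -0.1]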

HPC Lab - CSE-HCMUT 3


Machine Learning (ML)
Wide array of problems and algorithms
§ Classification
o Given labeled data points, predict label of new data point
§ Regression
o Learn a function from some (x, y) pairs
§ Clustering
o Group data points into “similar” clusters
§ Segmentation
o Partition image into meaningful segments
§ Outlier detection

HPC Lab - CSE-HCMUT 4


Abstracting ML algorithms
§ Can we find commonalities among ML algorithms?
§ This would allow finding
o Common abstractions
o Systems solutions to efficiently implement these abstractions
§ Some common aspects
o We have a prediction model A
o A should optimize some complex objective function L
o ML algorithm does this by iteratively refining A

HPC Lab - CSE-HCMUT 5


High level view
§ Notation
o D: data
o A: model parameters
o L: function to optimize (e.g., minimize loss)
§ Goal: Update A based on D to optimize L
§ Typical approach: iterative convergence
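A minimal sketch of this iterative-convergence template in Python; the update and loss callables are placeholders standing in for a concrete ML algorithm, not anything prescribed by the slides.

def iterative_convergence(A, D, loss, update, max_iters=1000, tol=1e-6):
    # Repeatedly refine the model A on data D until the objective L stops improving.
    prev = float("inf")
    for _ in range(max_iters):
        A = update(A, D)              # one refinement step of the ML algorithm
        cur = loss(A, D)              # evaluate the objective function L
        if abs(prev - cur) < tol:     # converged: L no longer changes appreciably
            break
        prev = cur
    return A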

HPC Lab - CSE-HCMUT 6


Distributed Deep Learning Systems (DDLS)
DDLSs train deep neural network models by utilizing the distributed
resources of a cluster

§ The massive parallel processing power of graphics processing units (GPUs) has been largely responsible for the recent successes in training deep learning models
§ Increasingly larger and more complex deep learning models are necessary
§ The disruptive trend towards big data has led to an explosion in the size and availability of training datasets for machine learning tasks
o Training such models on large datasets to convergence can easily take weeks or even months on a single GPU
§ An effective remedy to this problem is to utilize multiple GPUs to speed up training
§ Scale-up approaches rely on tight hardware integration to improve the data throughput
o These solutions are effective, but costly
o Furthermore, technological and economic constraints impose tight limitations on scaling up
§ DDLSs aim at scaling out to train large models using the combined resources of clusters of independent machines

HPC Lab - CSE-HCMUT 7


Distributed SGD algorithm: all-reduce
§ SGD (Stochastic Gradient Descent) on mini-batches of size B

§ M machines/mini-batches: B = M · B′, i.e. each machine processes a local mini-batch of size B′ and the per-machine gradients are combined with an all-reduce
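A minimal data-parallel SGD step along these lines, sketched with PyTorch's torch.distributed all_reduce; it assumes the process group has already been initialized (one process per machine) and that each rank holds its own local mini-batch of size B′.

import torch
import torch.distributed as dist

def distributed_sgd_step(model, local_batch, loss_fn, lr, world_size):
    # Each of the M workers computes gradients on its local mini-batch of size B';
    # summing and averaging them across workers reproduces a step on B = M * B'.
    inputs, targets = local_batch
    model.zero_grad()
    loss_fn(model(inputs), targets).backward()
    for p in model.parameters():
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)   # sum gradients across workers
        p.grad /= world_size                            # average to keep the step size comparable
    with torch.no_grad():
        for p in model.parameters():
            p -= lr * p.grad                            # identical SGD update on every worker
    return model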

HPC Lab - CSE-HCMUT 8


Parameter server (PS)
[Figure: parameters p1 … p9 sharded across parameter server machines; workers send gradients to the parameter servers, and the parameter servers send new parameters back to the workers; each worker holds a partition of the training data]

HPC Lab - CSE-HCMUT 9


HPC Lab - CSE-HCMUT 10
Distributed Deep Learning Systems (DDLS)

Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue, Distributed Training of Deep Learning Models:
A Taxonomic Perspective, IEEE Transactions on Parallel and Distributed Systems, 2020, Volume: 31,
Issue: 12, Pages: 2802-2818, 10.1109/TPDS.2020.3003307

HPC Lab - CSE-HCMUT 11


How to parallelize?
§ How to execute the algorithm over a set of workers?
§ Data-parallel approach
o Partition data D
o All workers share the model parameters A
§ Model-parallel approach
o Partition model parameters A
o All workers process the same data D

HPC Lab - CSE-HCMUT 12


Model parallelism
§ The model is split into partitions, which
are then processed in separate
machines
§ Model partitioning can be conducted
either by applying splits between
neural network layers (=vertical
partitioning) or by splitting the layers
(=horizontal partitioning)
§ Vertical partitioning can be applied to any deep learning model because the
layers themselves are unaffected
§ Horizontal partitioning: the layers themselves are partitioned
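A minimal sketch of vertical partitioning in PyTorch, assuming two GPUs ("cuda:0" and "cuda:1") are available; the layer sizes and the split point are illustrative only.

import torch
import torch.nn as nn

class VerticallyPartitionedMLP(nn.Module):
    """Vertical model parallelism: consecutive layer blocks live on different devices."""
    def __init__(self):
        super().__init__()
        self.part1 = nn.Sequential(nn.Linear(784, 512), nn.ReLU()).to("cuda:0")
        self.part2 = nn.Sequential(nn.Linear(512, 10)).to("cuda:1")

    def forward(self, x):
        h = self.part1(x.to("cuda:0"))
        # Activations cross the device/machine boundary between the two partitions.
        return self.part2(h.to("cuda:1"))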

HPC Lab - CSE-HCMUT 13


Data parallelism
① Each worker downloads the current model
② Each worker performs backpropagation using its assignment of the data in parallel
③ The respective results are aggregated and integrated by the parameter server in order to form a new model
[Figure: distinct data flow cycles in deep learning models during training: the gradient computation cycle and the model update / optimization cycle]
§ Data parallelism increases the overall sample throughput rate by replicating the model onto multiple machines, where backpropagation can be performed in parallel, to gather more information about the loss function faster
§ Most transformations applied to a specific training sample in deep neural networks do not involve data from other samples
§ The sum of per-parameter gradients computed using subsets (x0, …, xn) of a mini-batch (x) matches the per-parameter gradients for the entire input batch:
∂L(x; w)/∂w = ∂L(x0; w)/∂w + … + ∂L(xn; w)/∂w
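A minimal numerical check of the gradient identity above, using a linear least-squares loss as a stand-in model (an assumption for illustration, with a summed rather than averaged loss so the identity holds exactly).

import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(size=3)
X, y = rng.normal(size=(8, 3)), rng.normal(size=8)

def grad(Xb, yb, w):
    # Gradient of the summed squared-error loss L(x; w) = sum((Xb @ w - yb)**2)
    return 2 * Xb.T @ (Xb @ w - yb)

full = grad(X, y, w)
# Split the mini-batch into two subsets and sum their gradients.
parts = grad(X[:4], y[:4], w) + grad(X[4:], y[4:], w)
assert np.allclose(full, parts)   # per-subset gradients sum to the full-batch gradient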
HPC Lab - CSE-HCMUT 14
Centralized & Decentralized optimization
§ Centralized optimization
o The optimization cycle is executed in a central machine,
while the gradient computation code is replicated onto the remaining
cluster nodes
§ Decentralized optimization
o Both cycles are replicated in each cluster node and
some form of synchronization is realized that allows the distinct
optimizers to act cooperatively

HPC Lab - CSE-HCMUT 15


Centralized optimization
§ A single optimizer instance (often called the parameter server) is responsible for updating a specific model parameter
§ Parameter servers depend on the gradients computed by the workers, which perform backpropagation
§ Depending on whether computations across workers are scheduled synchronously or asynchronously, this can have different effects on the optimization
Ø The blue process in the figure computes per-parameter gradients based on the current model parameters by applying backpropagation on mini-batches drawn from the training data
Ø The optimization cycle consumes these gradients to determine model parameter updates
HPC Lab - CSE-HCMUT 16
Decentralized optimization (1)
§ Each worker independently probes the loss function to find gradient descent trajectories to minima that have good generalization properties
§ To arrive at a better joint model, some form of arbitration is necessary to bring the different views into alignment
Ø In this figure, we assume the existence of a dedicated master node, which processes the individual parameter adjustments suggested by the workers and comes up with a new global model state that is then shared with them
Ø The blue process in the figure computes per-parameter gradients based on the current model parameters by applying backpropagation on mini-batches drawn from the training data
Ø The optimization cycle consumes these gradients to determine model parameter updates
HPC Lab - CSE-HCMUT 17
Decentralized optimization (2)
§ Multiple independent entities concurrently try to
solve a similar but not exactly the same problem
o The loss function in deep learning is usually non-
trivial
o Different workers find different descent trails more appealing and converge towards different local minima
§ Over time, the workers diverge and eventually
arrive at incompatible models
o models that cannot be merged without destroying
the accumulated information
o DDLS that rely on decentralized optimization have
to take measures to limit divergence
Ø The master and all workers start from the same model state w0 on the yellow plateau (=high loss)
Ø Exploration phase
o The workers iteratively evaluate the loss function using different mini-batches and independently update their local models
Ø Exploitation phase
o Each worker shares its model updates with the master node, which in turn merges the updates to distill latent parameter adjustments that have worked better on average across the investigated portion of the training dataset
Ø A revised new global state w1 is then shared with the workers, which use it as the starting point of the next exploration phase
HPC Lab - CSE-HCMUT 18
Synchronous & Asynchronous scheduling
§ Synchronous systems
o In bulk synchronous (or simply synchronous) systems, computations across all
workers occur simultaneously
o Global synchronization barriers ensure that individual worker nodes do not progress
until the remaining workers have reached the same state
§ Asynchronous systems
o Asynchronous systems take a more relaxed approach to organizing collaborative
training and avoid delaying the execution of a worker to accommodate other workers
(i.e. the workers are allowed to operate at their own pace)
§ Bounded asynchronous systems
o A hybrid approach between the two archetypes above
o They operate akin to centralized asynchronous systems, but enforce rules to
accommodate workers progressing at different paces
o The workers operate asynchronously with respect to each other, but only within
certain bounds

HPC Lab - CSE-HCMUT 19


Centralized Synchronous systems (1)
§ Centralized systems
o Model training is split between the workers (=gradient computation) and
the parameter servers (=model update)
§ Synchronization:
o Training cannot progress without a full parameter exchange between the
parameter server and its workers
o The parameter server is dependent on the gradient input to update the
model
o The workers are dependent on the updated model in order to further
investigate the loss function
o The cluster as a whole cyclically transitions between phases, during which
all workers perform the same operation

HPC Lab - CSE-HCMUT 20


Centralized Synchronous
systems (2)
§ Each training cycle begins with the
workers downloading new model
parameters (w) from the parameter
server
§ Workers locally sample a training
mini-batch (x ∼ Di) and compute
per-parameter gradients (gi)
§ Workers share their gradients with
the parameter server
§ The parameter server aggregates the
gradients from all workers and injects
the aggregate into an optimization
algorithm to update the model.
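A single-process sketch of this four-step cycle; grad_fn and the per-worker data shards are placeholders, and the per-worker loop would of course run in parallel in a real DDLS.

import numpy as np

def synchronous_training_step(ps_weights, worker_shards, grad_fn, lr=0.1):
    """One bulk-synchronous cycle: every worker pulls w, computes gradients on its
    local mini-batch, and the parameter server aggregates them before updating."""
    grads = []
    for shard in worker_shards:                 # runs in parallel in a real cluster
        w = ps_weights.copy()                   # 1) worker downloads current parameters
        grads.append(grad_fn(w, shard))         # 2) worker computes per-parameter gradients
    aggregate = np.mean(grads, axis=0)          # 3) parameter server aggregates gradients
    return ps_weights - lr * aggregate          # 4) optimizer step produces the new model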

HPC Lab - CSE-HCMUT 21


Centralized Synchronous systems (3)
§ Assuming appropriate and sufficient random sampling, larger mini-batches may represent the
training distribution better [31]
§ However, optimizing a model using mini-batches with a large coverage of the training distribution
tends to get trapped in sharp minima basins of the loss function
§ The next training step can only be conducted once all workers have completed their assigned
task and submitted gradients
o A majority of cluster machines always has to wait for stragglers [34]
§ Relaxed solutions
o If the training set is (1) large enough, (2) reasonably well-balanced, and (3) sufficiently randomly
distributed among the workers, it matters little if minor portions of the training data are absent,
so the full-participation requirement can be relaxed
Ø Ending training epochs once 75% of all training samples have been processed [23]
Ø Over-provisioning by allocating more workers and ending each gradient aggregation phase once
a quorum has been reached [17]
HPC Lab - CSE-HCMUT 22
Decentralized Synchronous systems (1)
§ Rely on decentralized optimization to independently conduct model training in each worker
§ Workers do not exchange parameters to further model training, but rather to share the independent findings of each worker with the rest of the cluster and to determine descent trajectories with good generalization properties
§ Workers operate in phases separated by global synchronization barriers
[Figure: the decentralized synchronous system SparkNet [10]]
HPC Lab - CSE-HCMUT 23
Decentralized Synchronous systems (2)
Exploration phase
§ The initial model parameters are distributed among the workers to initialize the local models
§ Each worker randomly samples mini-batches from its locally available partition of the training dataset, determines per-parameter gradients, and adjusts its model to minimize the loss function (L)
§ This process is repeated τ times, during which each worker independently trains its local model in isolation
§ Due to the different properties of the mini-batches, each worker eventually arrives at a slightly better (w.r.t. L), but different model
Exploitation phase
§ The master node acts as a synchronization conduit
§ The worker models are merged to form a new joint model
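A minimal sketch of the exploration/exploitation cycle just described (SparkNet-style periodic model averaging); the plain-SGD inner loop, the averaging merge rule and the data layout are simplifying assumptions, not details taken from [10].

import numpy as np

def local_sgd(w, data_partition, grad_fn, tau, lr=0.1):
    """Exploration: a worker refines its local model for tau isolated steps."""
    for batch in data_partition[:tau]:
        w = w - lr * grad_fn(w, batch)
    return w

def training_round(global_w, partitions, grad_fn, tau):
    """Exploitation: the master merges the worker models into a new joint model."""
    worker_models = [local_sgd(global_w.copy(), part, grad_fn, tau) for part in partitions]
    return np.mean(worker_models, axis=0)   # simple averaging as the merge rule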

HPC Lab - CSE-HCMUT 24


Decentralized Synchronous systems (3)
§ Keeping the local optimizers running for too long will result in reduced convergence performance, or even setbacks if the worker models diverge too far
o Limit the number of independent exploration steps (τ) [10, 25]
§ The best rate of convergence for a given model can typically be achieved if τ is rather small (τ ≤ 10) [8, 10]
§ τ determines how much time should be spent on improving the local models versus synchronizing the states across machines
o Making the best use of the cluster GPUs => large τ
o Large τ often leads to sub-optimal convergence rates [11] => small τ
Ø Any choice of τ represents the dilemma of finding a balance between harnessing the benefits of having more computational resources and the need to limit divergence among workers
Ø Practically motivated suggestions, such as aiming for a 1:5 communication-to-computation ratio (≈83.3% GPU utilization; [10]), may serve as a starting point for hyper-parameter search and to determine whether efficient decentralized optimization is possible at all using a certain configuration
HPC Lab - CSE-HCMUT 25
Centralized Asynchronous systems (1)
§ Asynchronous systems
o Each worker acts alone
§ Centralized systems
o Each worker shares its gradients with the parameter server once a mini-batch has been
processed
§ Centralized Asynchronous DDLS [14], [17], [19], [21], [35]
§ Instead of waiting for other workers to reach the same state, the parameter
server eagerly injects received gradients into the optimization algorithm to
train the model
o Each update of the global model is only based on the gradient input from a single worker
o Similar to the eager aggregation mechanisms
§ Instead of discarding the results from all remaining workers and losing the
invested computational resources, each worker is allowed to simply continue
using its locally cached stale version of the model parameters

HPC Lab - CSE-HCMUT 26


Centralized Asynchronous systems (2)
§ Each worker approaches the parameter server at its own pace to offer gradients, after which the global model is updated immediately, and to request updated model parameters
§ Each worker maintains a separate parameter exchange cycle with the parameter server
§ Since there is no interdependence between workers, situations where straggler nodes delay the execution of other workers cannot happen
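A single-process sketch of this asynchronous exchange cycle, using threads as stand-ins for worker machines; the class, the lock-based update and the toy gradient function are invented for illustration.

import numpy as np
import threading

class AsyncParameterServer:
    """Sketch: gradients are applied eagerly, one worker at a time, with no global barrier."""
    def __init__(self, dim, lr=0.05):
        self.w, self.lr = np.zeros(dim), lr
        self.lock = threading.Lock()          # serializes individual updates

    def exchange(self, grad):
        with self.lock:
            self.w -= self.lr * grad          # inject this worker's gradient immediately
            return self.w.copy()              # hand back the (now newer) parameters

def worker(ps, grad_fn, steps):
    w = ps.exchange(np.zeros_like(ps.w))      # initial pull (a zero gradient is a no-op update)
    for _ in range(steps):
        w = ps.exchange(grad_fn(w))           # push a gradient computed on a possibly stale w

ps = AsyncParameterServer(dim=4)
workers = [threading.Thread(target=worker, args=(ps, lambda w: w - 1.0, 20)) for _ in range(3)]
for t in workers: t.start()
for t in workers: t.join()
print(ps.w)   # approaches the minimizer at 1.0 despite some staleness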
HPC Lab - CSE-HCMUT 27
Centralized Asynchronous systems (3)
§ For this system to work, choosing the results from one worker over another must not introduce a bias that significantly changes the shape of the loss function
§ Thus, on average, the mini-batches sampled by each worker have to mimic the properties of the training distribution reasonably well
§ At any point in time, only a single worker is in possession of the most recent version of the model
o Other workers only possess stale variants that represent the state of the parameter server during their last interaction with it
o Any gradients that they produce are relevant to the shape of the loss function around that stale model representation
² In the figure, both workers start from the same model state (w0), but draw different mini-batches from the same distribution
² Worker 1: send gradients(w0), PS(w0->w1), receive w1
² Worker 2: send gradients(w0), PS(w1->w2), receive w2
² Worker 1: send gradients(w1), PS(w2->w3), receive w3
² Worker 2: send gradients(w2), PS(w3->w4), receive w4
Ø Staleness has serious implications on model training [37]
o In a cluster with multiple asynchronous workers, the next worker to exchange parameters is usually stale by some amount of update steps
HPC Lab - CSE-HCMUT 28
Centralized Asynchronous systems (4)
§ Fair scheduling is undesirable in practice
o The slowest machine would hold back faster machines, which is exactly the situation that asynchronous systems try to avoid
§ Gradients from severely eclipsed workers can confuse the parameter server's optimizer
o They can set back training or even destroy the model
§ To avoid compounding delays
o The parameter server typically places workers that indicated their readiness to upload gradients in
a priority queue based on their staleness [17], [19], [42]
§ To protect against adverse influences from severe stragglers, some systems allow
defining conditions that must be fulfilled before queued requests can be processed by
the parameter server
§ These conditions typically take the form of either a value or delay bound.

HPC Lab - CSE-HCMUT 29


Bounded Asynchronous systems
§ Value bounds
o The parameter server maintains a copy of all versions of the model currently in use across the cluster
² wt−δ: the version currently known by the slowest worker
² wt: the most recent model
o wt − wt−δ is the amount of change in transit that is currently not known by the slowest worker
o If a worker triggers an update that leads to a violation of some value bound (i.e. ∥wt − wt−δ∥∞ ≥ ∆max), it is delayed until the value bound condition holds again
o Choosing a reliable metric and limit for a value bound can be difficult [38]
² The magnitude of future model updates is largely unknown
² It may require adjustment during training
§ Delay bounds (e.g. the Stale Synchronous Parallel approach [35])
o Each worker (i) maintains a separate clock (ti)
o Whenever a worker submits gradients to the parameter server, ti is increased
o If the clock of a worker differs from that of the slowest worker by more than s steps, it is delayed until the slow worker has caught up
o If a worker downloads the current global model, it is ensured that this model includes all of its local updates and may also contain updates from other workers within a range of [ti − s, ti + s − 1] update steps
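A minimal sketch of the delay-bound (Stale Synchronous Parallel) rule described above; the class and method names are invented for illustration.

class SSPClockTracker:
    """Tracks per-worker clocks and blocks workers that run more than s steps
    ahead of the slowest worker (the Stale Synchronous Parallel delay bound)."""
    def __init__(self, num_workers, s):
        self.clocks = [0] * num_workers
        self.s = s

    def may_proceed(self, worker_id):
        # A worker may only start its next step if it is at most s steps ahead
        # of the slowest worker in the cluster.
        return self.clocks[worker_id] - min(self.clocks) <= self.s

    def submit_gradients(self, worker_id):
        # Submitting gradients advances this worker's clock by one step.
        self.clocks[worker_id] += 1

tracker = SSPClockTracker(num_workers=3, s=2)
for _ in range(3):
    tracker.submit_gradients(0)      # worker 0 races ahead by three steps
print(tracker.may_proceed(0))        # False: worker 0 is more than s steps ahead
print(tracker.may_proceed(1))        # True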
HPC Lab - CSE-HCMUT 30
Decentralized Asynchronous systems (1)
§ The workers act independently and continue to explore the loss function based on a model that is detached from the master's current state
§ The workers cannot simply replace their model parameters upon completing a parameter exchange with the master node
§ Instead, workers have to merge the respective, asynchronously gathered information
§ A common way of combining master and worker models in such a setting is to apply a linear interpolation: wi ← wi − α(wi − wm) and wm ← wm + β(wi − wm), where wi denotes the worker model and wm the master model ([11], [25], [29], [30], [42], [43])
o The worker model is displaced towards the master model's state at a rate of α times their relative distance
o The master model is displaced in the opposite direction at a rate of β
o This operation is equivalent to temporarily extending the loss function with the squared l2-norm of the difference between both models
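A minimal sketch of the elastic, interpolation-based merge just described (EASGD-style); the values of α and β and the surrounding training loop are illustrative placeholders.

import numpy as np

def elastic_exchange(w_worker, w_master, alpha=0.1, beta=0.1):
    """Pull the worker and master models towards each other by a fraction of
    their relative distance (the linear-interpolation merge described above)."""
    delta = w_worker - w_master
    w_worker = w_worker - alpha * delta   # worker moves towards the master
    w_master = w_master + beta * delta    # master moves towards the worker
    return w_worker, w_master

w_i, w_m = np.array([1.0, 2.0]), np.array([0.0, 0.0])
w_i, w_m = elastic_exchange(w_i, w_m)
print(w_i, w_m)   # [0.9 1.8] [0.1 0.2]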

HPC Lab - CSE-HCMUT 31


Decentralized Asynchronous systems (2)
§ The decentralized asynchronous system Elastic Averaging SGD (EASGD) [30]
§ Once τ iterations have been completed by a worker
o The master node's current model is downloaded, and the penalization term δi is computed and applied to the local model
o Then δi is transferred to the master node, which applies the inverse operation to its own model (with α = β)
§ A symmetric elastic force between each worker and the master node equally attracts both models
[Figure: the decentralized asynchronous system EASGD [30]]
HPC Lab - CSE-HCMUT 32
Decentralized Asynchronous systems (3)
§ The individual models (workers & master) evolve side-by-side in parallel
§ Note that there is no direct interaction between workers
§ Stability is maintained by the penalization coefficients (α and β) in combination with the length of the isolated learning phases (τ)
§ The optimizer hyper-parameters α, β and τ are inter-dependent and must be weighted carefully for each training task and cluster setup to constrain how far individual workers can diverge from the master and from one another
§ The communication demand on the master node scales roughly linearly with the number of workers
o To avoid congestion-induced delays due to network I/O bandwidth limitations at the master, τ must be scaled accordingly [11]
o However, long phases of isolated training can severely hamper convergence due to the increasing incompatibilities between the models
HPC Lab - CSE-HCMUT 33
Communication Patterns (CP)

HPC Lab - CSE-HCMUT 34


CP in centralized systems
§ The parameter server can become a bottleneck
o 1 parameter server & n workers: sending & receiving n · ∥w∥ parameters
§ In bulk-synchronous systems
o Parameter up- and downloads occur sequentially, implying a communication delay of at least
2nTw + (n − 1)Rw + Uw in theory, where Tw, Rw and Uw respectively denote the time
required to transmit, reduce and update ∥w∥ parameters (i.e. the model)
§ Efficient collective communication lowers this delay
o Binomial tree [15]
o Scatter-reduce/broadcast algorithm [44]
§ Asynchronous communication
o The minimum communication delay per worker and the duration of a full parameter exchange with
all workers are correspondingly lower, since parameter exchange requests from individual workers
can be overlapped if n > 1
HPC Lab - CSE-HCMUT 35
Parameter server (1)
§ If the parameter server is a bottleneck, it is highly desirable to distribute this role [21]
§ Most gradient-descent-based optimization algorithms can be executed independently
for each model parameter, which permits almost arbitrary slicing
o In practice this freedom is limited by the overheads incurred from peering with each additional
endpoint to complete a parameter exchange [23]
o In asynchronous systems, congestion-free k : n communication (i.e. between k parameter
servers and n workers) is more difficult to realize because the workers operate largely at their own
pace [25]
§ Additional limitations apply if training depends on hyper-parameter schedules that
must be coordinated across parameter servers, or if reproducibility is desired [21]

HPC Lab - CSE-HCMUT 36


Parameter server (2)
§ A popular variant of the multi-parameter-
server-approach is to migrate the parameter
server role into the worker nodes [17], [19],
[24], [27]
§ Such that all nodes are workers, but also act
as parameter servers (i. e. k = n)
§ Each worker is responsible for maintaining and updating 1/n of the global model parameters
§ The external communication demand of each node is reduced, since the locally maintained model partition does not have to be exchanged via the network
§ This approach can be beneficial in homogeneous cluster setups
§ However, any node failure requires a complete reorganization of the cluster [12]
HPC Lab - CSE-HCMUT 37
Parameter server (3)
§ The entire parameter server function is implemented
in each worker
§ The workers synchronously compute gradients, which are shared between machines using either
o a collective all-reduce operation, or
o a ring algorithm

§ Each machine uses the thereby locally accumulated identical gradients to step an
equally parameterized optimizer copy, which in turn applies exactly the same update
§ This approach is not only robust to node failures, but also makes adding and removing nodes trivial.

HPC Lab - CSE-HCMUT 38


[Table: lower bound communication delays when training various models in different cluster setups in Ethernet and InfiniBand environments, assuming Rw ≈ 10 GiB/s and Uw ≈ 2 GiB/s in an ideal scenario with no latency or competing I/O requests that need arbitration]

HPC Lab - CSE-HCMUT 39


CP in decentralized systems
§ Assuming isolated training phases of τ cycles, the communication demand of each decentralized worker per local compute step is comparatively small
§ Decentralized systems typically maintain a higher computational hardware utilization, even with limited network bandwidth, which can make training large models possible in spite of bandwidth constraints
§ Scaling out to larger cluster sizes may still result in the master node becoming a bottleneck
o It is possible to split the master's role in decentralized systems to reduce communication costs, like in centralized systems
§ Each machine is a self-contained independent trainer
=> decentralized DDLS have many options for organizing parameter exchanges

HPC Lab - CSE-HCMUT 40


CP in decentralized systems: D-PSGD [26]
§ A ring-like structure
§ After each training cycle, each worker sends its model parameters to its neighbors and also integrates the models it receives from them
§ The more hops two workers are away from each other, the further they may diverge
§ Because the ring is closed, all workers project the same distance-attenuated force on each other, which is crucial for stability
[Figure: the ring topology of D-PSGD [26]]
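A minimal sketch of one D-PSGD-style communication round on a ring; the equal-weight averaging with the two neighbours is a simplification of the mixing matrix used in [26].

import numpy as np

def ring_mixing_step(worker_models):
    """Each worker averages its model with those of its left and right ring neighbours."""
    n = len(worker_models)
    mixed = []
    for i in range(n):
        left, right = worker_models[(i - 1) % n], worker_models[(i + 1) % n]
        mixed.append((worker_models[i] + left + right) / 3.0)
    return mixed

models = [np.full(2, float(i)) for i in range(4)]   # 4 workers with diverged models
print(ring_mixing_step(models))                      # each model moves towards its neighbours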
HPC Lab - CSE-HCMUT 41
CP in decentralized systems: TreeEASGD [25]
§ A hierarchical tree-based communication pattern
o Avoids bandwidth-related limitations
§ Each worker implements two parameter exchange intervals
o At a fixed interval, it shares its current model parameters with the respective upstream node
o At another fixed interval, it shares its current model parameters with all adjacent downstream nodes
§ The degree of exploration is controlled by separately adjusting the up- and downstream parameter exchange frequencies for each worker based on its depth in the hierarchy
[Figure: the tree topology of TreeEASGD [25]]
HPC Lab - CSE-HCMUT 42
CP in decentralized systems: GoSGD [45]
§ Workers can peer with each other to exchange parameters by implementing a sum-weighted gossip protocol
§ Each worker (i) maintains a weight αi that is initialized equally (so that the weights sum to 1)
§ After each local update, a Bernoulli random variable is sampled to decide whether a parameter exchange should be done
o The probability (p) determines the average communication interval
§ When an exchange is triggered, αi is halved and sent along with the current model parameters to the destination worker, which in turn replaces its local model with the weighted average based on its own and the received α value
§ Workers that shared their state recently are weighted down in relevance, because the information they collected about the loss function has become more common knowledge (=gossip) among other workers
§ The variance of staleness within the cluster is minimized
§ By setting the learning rate (η) to zero, all workers asymptotically approximate a consensus model
[Figure: the gossip-based topology of GoSGD [45]]
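A minimal sketch of the sum-weighted gossip exchange described above; the merge rule shown (weight-proportional averaging after halving the sender's weight) follows the slide's description and may differ in detail from [45].

import numpy as np

def gossip_push(sender, receiver):
    """Sender halves its weight and ships (weight, model); the receiver merges by a
    weighted average, so recently-shared information is weighted down in relevance."""
    sender["alpha"] /= 2.0
    a_s, a_r = sender["alpha"], receiver["alpha"]
    receiver["w"] = (a_r * receiver["w"] + a_s * sender["w"]) / (a_r + a_s)
    receiver["alpha"] = a_r + a_s
    return sender, receiver

n = 4
workers = [{"w": np.full(2, float(i)), "alpha": 1.0 / n} for i in range(n)]
gossip_push(workers[0], workers[3])
print(workers[3]["w"], workers[3]["alpha"])   # [2. 2.] 0.375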

HPC Lab - CSE-HCMUT 43


DDLS tools (1)
§ DistBelief (Google; [21])
o A centralized asynchronous DDLS with support for multiple parameter servers
§ Project Adam (Microsoft; [23])
o Took a similar approach by moving gradient computation steps into the parameter servers for some neural network layers
o Organizes the parameter servers in a Paxos cluster to establish high availability
§ Petuum [16]
o Imposing delay bounds to control the staleness of asynchronous workers can improve the rate of convergence
§ Parameter Server [14]
o Formalizes the processing of deep learning workloads to establish hybrid parallelism and integrate with a general machine learning architecture
§ TensorFlow (Google; [17]) and MXNet (Apache Foundation; [19])
o Modern descendants of DistBelief that improve upon previous approaches by introducing new concepts, such as defining backup workers
o Optimize model partitioning using self-tuning heuristic models, and improve scalability by allowing hierarchical parameter servers to be configured
§ CaffeOnSpark (Yahoo; [24]) and BigDL (Intel; [27])
o Take the opposite approach and focus on easy integration with existing data analytics systems and commodity hardware environments by implementing centralized synchronous data-parallel model training on top of Apache Spark
o To accommodate the frequent communication needs of such systems, they use sophisticated communication patterns to implement a distributed parameter server
HPC Lab - CSE-HCMUT 44
DDLS tools (2)
§ SparkNet [10]
o A decentralized synchronous DDLS that replicates Caffe solvers using Apache Spark's map-reduce API to realize training in commodity cluster environments
o A similar approach is available as part of the popular Java DDLS deeplearning4j
o The restriction to synchronous execution is often considered a major downside of this approach
§ MPCA-SGD [11]
o Improves upon SparkNet
o Extends the basic Spark-based approach by overlapping computation and communication to realize quasi-asynchronous training, and extrapolates the recently observed descent trajectory to cope with staleness
§ The data-parallel optimizer of PyTorch (Facebook; [47])
o Implements a custom interface to realize synchronous model training using collective communication primitives
o Either one or all workers act as the parameter server (all-reduce approach)
§ EASGD [30]
o Retains the idea of limited isolated training phases but imposes fully asynchronous scheduling
o Having a single master node can become a bottleneck as the cluster grows larger
§ D-PSGD [26], TreeEASGD [25] and GoSGD [45]
o Approaches to further scale out decentralized optimization by distributing the master function
§ COTS HPC [9] and FireCaffe [15]
o Optimized for HPC and GPU supercomputer environments, where they have been shown to achieve unparalleled performance for certain applications

HPC Lab - CSE-HCMUT 45


HPC Lab - CSE-HCMUT 46
Parallelism
§ Data parallelism (DP) is more frequently supported than model parallelism (MP)
o Decentralized optimization is based on the concept of sparse communication between independent trainers; realizing cross-machine MP in such systems is counter-intuitive
o New modeling and training techniques [2][4] allow utilizing the available parameter space more efficiently, while technological improvements in hardware allow processing increasingly larger models
o Not every model can be partitioned evenly across a given number of machines, which leads to the under-utilization of workers [20]
§ If a model fits well into GPU memory, the resource requirements of the backpropagation algorithm can often be regulated reasonably well by adjusting the mini-batch size in DP systems
o Hence, some DDLS discourage using cross-machine MP in favor of DP, which is less susceptible to processing-time variations

HPC Lab - CSE-HCMUT 47


Optimization
§ A trend towards decentralized systems in research
§ Centralized DDLS
o Minor improvements, such as tailored optimization techniques [2], [36], [39]
o The development of domain-specific compression methods [13]
§ Centralized DDLS dominate industry usage and application research, although centralized
and decentralized DDLS offer similar convergence guarantees [26]
o Centralized approaches are generally better understood and easier to use
o Most popular and industry-backed deep learning frameworks (PyTorch, TensorFlow, MXNet, etc.)
contain centralized DDLS implementations that are mature, highly optimized and work tremendously
well as long as parameter exchanges do not dominate the overall execution [11], [48]

HPC Lab - CSE-HCMUT 48


Scheduling
§ Centralized asynchronous methods
o Cope better with performance deviations and have the potential to yield a higher hardware
utilization
o But introduce new challenges such as concurrent updates and staleness => some DDLS support
synchronous and asynchronous modes of operation
§ Centralized bounded asynchronous DDLS can always simulate synchronous and
asynchronous scheduling
o If a delay bound is used, s = 0 is identical to synchronous, while s = ∞ results in fully
asynchronous behavior
§ Decentralized DDLS
o Some decentralized DDLS define a simple threshold (τ) to limit the amount of exploration per
training phase
o Others take a more dynamic approach to cope better with bandwidth limitations, which is indicated
using the term soft-bounded

HPC Lab - CSE-HCMUT 49


Parameter Exchange mechanism
§ Binomial tree methods
o Scale worse than scattering operations, but are preferable in high-latency environments because
fewer individual connections between nodes are required [44]
§ Collective operation
o Common, but not necessarily the only parameter exchange method available
o Some synchronous DDLS implement several collective operations and switch between them to
maximize efficiency

HPC Lab - CSE-HCMUT 50


Topology
§ Centralized DDLS
o The current state of the art in centralized DDLS for small clusters is the synchronous all-reduce-based approach
o Large and heterogeneous setups can be utilized efficiently using hierarchically structured
asynchronous communication patterns [19]
§ Decentralized DDLS
o Heavily structured communication protocols [25], [26], boosting techniques [11], as well as
relatively unstructured methods [45] have been reported to offer better convergence rates than
naïve implementations

HPC Lab - CSE-HCMUT 51


Right technique
§ The non-linear non-convex nature of deep learning models in combination with the
abundance of distributed methods opens up a large solution space [2]
§ Although frequently done, comparing DDLS based on processing metrics such as GPU
utilization or training sample throughput is not useful in practice
o Such performance indicators can easily be maximized by increasing the batch-size, allowing more
staleness or extending exploration phases, which does not necessarily equate to faster training or
yield a better model
§ Benchmarks alone are not enough
o Well-established deep learning benchmarks like DAWNBench [48] propose comparing the end-to-end training performance by measuring quality metrics (e.g. time to accuracy x%)
o However, optimal configurations w.r.t. quality metrics are usually highly task-dependent and may
vary as the training progresses [42]

HPC Lab - CSE-HCMUT 52


Benchmark tools
§ The collection and quantitative study of the performance of DDLS using standardized AI
benchmarks is becoming increasingly important and can provide guidance regarding
what configurations work well in practice
§ DAWNBench [48]
o Strong emphasis on distributed implementations, but focuses only on a few workloads
§ MLPerf [49]
o Expands the scope and defines stricter test protocols to establish better comparability
§ Deep500 [50]: a new benchmark tool
o Focuses on gathering more information by defining metrics and measurement points along the
training pipeline
§ AIBench [51]
o Aims at covering many machine learning applications like recommendation systems, speech
recognition, image generation, image compression, text-to-text translation, etc.

HPC Lab - CSE-HCMUT 53


Criteria of DDLS

HPC Lab - CSE-HCMUT 54


Future research directions
§ Using decentralized optimization techniques in conjunction with P2P model sharing [45]
o An interesting area of research for certain IoT or automotive applications
§ A comprehensive analysis of different distributed approaches in real-life scenarios would
be helpful to many practitioners
o In actual cluster setups, the situation is usually more complex due to competing workloads
o Most works in distributed deep learning restrict themselves to ideal test scenarios
§ Efficiently realizing distributed training in heterogeneous setups is a largely un-tackled
engineering problem
o Being an investment commodity, clusters are often not replaced, but rather extended
§ A structured quantitative analysis of the results from DDLS benchmarks could be
interesting for many practitioners

HPC Lab - CSE-HCMUT 55


Reference (1)
1. Y. LeCun, Y. Bengio, and G. E. Hinton, “Deep Learning,” Nature, vol. 521, no. 7553, pp. 436–444, 2015.
2. J. Schmidhuber, “Deep Learning in Neural Networks: An Overview,” Neural Networks, vol. 61, pp. 85–117, 2015.
3. A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet Classification with Deep Convolutional Neural Networks,” Adv. in Neural Information Processing Systems, vol. 25, pp. 1097–1105, 2012.
4. K. He, X. Zhang, S. Ren, and J. Sun, “Deep Residual Learning for Image Recognition,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, pp. 770–778, 2016.
5. S. Shi, Q. Wang, P. Xu et al., “Benchmarking State-of-the-Art Deep Learning Software Tools,” arXiv CoRR, vol. abs/1608.07249, 2016.
6. N. Shazeer, A. Mirhoseini, K. Maziarz et al., “Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer,” Proc. 5th Intl. Conf. on Learning Representations, 2017.
7. K. Simonyan and A. Zisserman, “Very Deep Convolutional Networks for Large-Scale Image Recognition,” Proc. 3rd Intl. Conf. on Learning Representations, 2015.
8. M. Langer, “Distributed Deep Learning in Bandwidth-Constrained Environments,” Ph.D. dissertation, La Trobe University, 2018.
9. A. Coates, B. Huval, T. Wang, D. J. Wu, B. C. Catanzaro, and A. Y. Ng, “Deep Learning with COTS HPC Systems,” Proc. 30th Intl. Conf. on Machine Learning, pp. 1337–1345, 2013.
10. P. Moritz, R. Nishihara et al., “SparkNet: Training Deep Networks in Spark,” Proc. 4th Intl. Conf. on Learning Representations, 2016.
11. M. Langer, A. Hall et al., “MPCA SGD - A Method for Distributed Training of Deep Learning Models on Spark,” IEEE Trans. on Parallel and Distributed Systems, vol. 29, no. 11, pp. 2540–2556, 2018.
12. K. Zhang, S. Alqahtani, and M. Demirbas, “A Comparison of Distributed Machine Learning Platforms,” Proc. 26th Intl. Conf. on Computer Communications and Networks, 2017.
13. T. Ben-Nun and T. Hoefler, “Demystifying Parallel and Distributed Deep Learning: An In-Depth Concurrency Analysis,” ACM Comput. Surv., vol. 52, no. 4, 2019.
HPC Lab - CSE-HCMUT 56
Reference (2)
14. M. Li, D. G. Andersen, A. J. Smola et al., “Communication Efficient Distributed Machine Learning with the Parameter Server,” Adv. in Neural Information Processing Systems, vol. 27, pp. 19–27, 2014.
15. F. N. Iandola, K. Ashraf et al., “FireCaffe: Near-Linear Acceleration of Deep Neural Network Training on Compute Clusters,” Proc. IEEE Conf. on Computer Vision and Pattern Recognition, 2015.
16. E. P. Xing, Q. Ho, W. Dai, J. K. Kim, J. Wei, S. Lee et al., “Petuum - A New Platform for Distributed Machine Learning on Big Data,” IEEE Trans. on Big Data, vol. 1, no. 2, pp. 49–67, 2015.
17. M. Abadi, P. Barham, J. Chen et al., “TensorFlow: A System for Large-Scale Machine Learning,” Proc. 12th USENIX Symp. on Operating Systems Design and Implementation, pp. 265–283, 2016.
18. Q. V. Le, “Building High-Level Features Using Large Scale Unsupervised Learning,” Proc. IEEE Intl. Conf. on Acoustics, Speech and Signal Processing, pp. 8595–8598, 2013.
19. T. Chen, M. Li, Y. Li et al., “MXNet: A Flexible and Efficient Machine Learning Library for Heterogeneous Distributed Systems,” Proc. 29th Conf. on Neural Information Processing Systems, 2015.
20. Y. Huang, Y. Cheng, A. Bapna, O. Firat, M. X. Chen et al., “GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism,” arXiv CoRR, vol. abs/1811.06965, 2018.
21. J. Dean, G. S. Corrado, R. Monga, K. Chen, M. Devin et al., “Large Scale Distributed Deep Networks,” Adv. in Neural Information Processing Systems, pp. 1223–1231, 2012.
22. S. Ioffe and C. Szegedy, “Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift,” Proc. 32nd Intl. Conf. on Machine Learning, vol. 37, pp. 448–456, 2015.
23. T. Chilimbi, Y. Suzue et al., “Project Adam: Building an Efficient and Scalable Deep Learning Training System,” Proc. 11th USENIX Symp. on OS Design and Implementation, pp. 571–582, 2014.

HPC Lab - CSE-HCMUT 57


Reference (3)
24. A. Feng, J. Shi, and M. Jain, “CaffeOnSpark Open Sourced for Distributed Deep Learning on Big Data Clusters,” 2016. [Online]. Available: http://yahoohadoop.tumblr.com/post/139916563586/caffeonspark-open-sourced-for-distributed-deep.
25. S. Zhang, “Distributed Stochastic Optimization for Deep Learning,” Ph.D. dissertation, New York University, 2016.
26. X. Lian, C. Zhang, H. Zhang, C.-J. Hsieh et al., “Can Decentralized Algorithms Outperform Centralized Algorithms? A Case Study for Decentralized Parallel Stochastic Gradient Descent,” Adv. in Neural Information Processing Systems, vol. 30, pp. 5330–5340, 2017.
27. J. Dai, Y. Wang, X. Qiu, D. Ding, Y. Zhang et al., “BigDL: A Distributed Deep Learning Framework for Big Data,” Proc. ACM Symposium on Cloud Computing, pp. 50–60, 2019.
28. I. J. Goodfellow, O. Vinyals, and A. M. Saxe, “Qualitatively Characterizing Neural Network Optimization Problems,” Proc. 3rd Intl. Conf. on Learning Representations, 2015.
29. H. R. Feyzmahdavian, A. Aytekin et al., “An Asynchronous Mini-Batch Algorithm for Regularized Stochastic Optimization,” Proc. 54th IEEE Conf. on Decision and Control, pp. 1384–1389, 2015.
30. S. Zhang, A. Choromanska, and Y. LeCun, “Deep Learning with Elastic Averaging SGD,” Adv. in Neural Information Processing Systems, vol. 28, pp. 685–693, 2015.
31. N. S. Keskar, D. Mudigere, J. Nocedal et al., “On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima,” Proc. 5th Intl. Conf. on Learning Representations, 2017.
32. I. Sutskever, J. Martens, G. Dahl et al., “On the Importance of Initialization and Momentum in Deep Learning,” Proc. 30th Intl. Conf. on Machine Learning, vol. 28, no. 3, pp. 1139–1147, 2013.
33. D. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” Proc. 3rd Intl. Conf. on Learning Representations, 2015.
34. J. Chen, X. Pan et al., “Revisiting Distributed Synchronous SGD,” Proc. 5th Intl. Conf. on Learning Representations, 2017.
35. Q. Ho, J. Cipar, H. Cui, J. K. Kim et al., “More Effective Distributed ML via a Stale Synchronous Parallel Parameter Server,” Adv. in Neural Information Processing Systems, vol. 26, pp. 1223–1231, 2013.
HPC Lab - CSE-HCMUT 58
Reference (4)
36. R. Tandon, Q. Lei, A. Dimakis, and N. Karampatziakis, “Gradient Coding: Avoiding Stragglers in Distributed Learning,” Proc. 34th Intl. Conf. on Machine Learning, vol. 70, pp. 3368–3376, 2017.
37. A. Agarwal and J. C. Duchi, “Distributed Delayed Stochastic Optimization,” Adv. in Neural Information Processing Systems, vol. 24, pp. 873–881, 2011.
38. W. Dai, A. Kumar, J. Wei, Q. Ho et al., “High-Performance Distributed ML at Scale Through Parameter Server Consistency Models,” Proc. 29th Conf. on Artificial Intelligence, pp. 79–87, 2015.
39. I. Mitliagkas, C. Zhang et al., “Asynchrony Begets Momentum, With an Application to Deep Learning,” Proc. 54th Allerton Conf. on Communication, Control, and Computing, pp. 997–1004, 2017.
40. S. Zheng, Q. Meng, T. Wang, W. Chen, N. Yu et al., “Asynchronous Stochastic Gradient Descent with Delay Compensation,” Proc. 34th Intl. Conf. on Machine Learning, pp. 4120–4129, 2017.
41. F. Niu, B. Recht, C. Ré, and S. J. Wright, “Hogwild!: A Lock-Free Approach to Parallelizing Stochastic Gradient Descent,” Adv. in Neural Information Processing Systems, vol. 24, pp. 693–701, 2011.
42. H. Kim, J. Park, J. Jang, and S. Yoon, “DeepSpark: Spark-Based Deep Learning Supporting Asynchronous Updates and Caffe Compatibility,” arXiv CoRR, vol. abs/1602.08191, 2016.
43. X. Lian, Y. Huang, Y. Li, and J. Liu, “Asynchronous Parallel Stochastic Gradient for Nonconvex Optimization,” Adv. in Neural Information Processing Systems, vol. 28, pp. 2737–2745, 2015.
44. R. Thakur, R. Rabenseifner, and W. Gropp, “Optimization of Collective Communication Operations in MPICH,” Intl. Jrnl. of High Performance Computing Applications, vol. 19, pp. 49–66, 2005.
45. M. Blot, D. Picard, M. Cord, and N. Thome, “Gossip Training for Deep Learning,” arXiv CoRR, vol. abs/1611.09726, 2016.
46. S. Boyd, A. Ghosh et al., “Randomized Gossip Algorithms,” IEEE Trans. on Information Theory, vol. 52, no. 6, pp. 2508–2530, 2006.
47. N. Ketkar, “Deep Learning with Python: A Hands-on Introduction,” ISBN: 978-1-4842-2766-4, pp. 195–208, 2017.
HPC Lab - CSE-HCMUT 59
Reference (5)
48. C. Coleman, D. Narayanan, D. Kang, T. Zhao et al., “DAWNBench: An End-to-End Deep Learning Benchmark and Competition,” NIPS ML Systems Workshop, 2017.
49. P. Mattson, C. Cheng, C. Coleman et al., “MLPerf Training Benchmark,” arXiv CoRR, vol. abs/1910.01500, 2019.
50. T. Ben-Nun, M. Besta, S. Huber, A. N. Ziogas, D. Peter, and T. Hoefler, “A Modular Benchmarking Infrastructure for High-Performance and Reproducible Deep Learning,” Proc. 33rd Intl. Parallel and Distributed Processing Symposium, pp. 66–77, 2019.
51. W. Gao, F. Tang, L. Wang, J. Zhan, C. Lan, C. Luo, Y. Huang, C. Zheng et al., “AIBench: An Industry Standard Internet Service AI Benchmark Suite,” arXiv CoRR, vol. abs/1908.08998, 2019.

HPC Lab - CSE-HCMUT 60
