Ch12 - Distributed Deep Learning
Thoai Nam
High Performance Computing Lab (HPC Lab)
Faculty of Computer Science and Technology
HCMC University of Technology
Parameter Server
§ The massive parallel processing power of graphics processing units (GPUs) has been largely responsible for the recent successes in training deep learning models
§ Increasingly larger and more complex deep learning models are necessary
§ The disruptive trend towards big data has led to an explosion in the size and availability of training datasets for machine learning tasks
  o Training such models on large datasets to convergence can easily take weeks or even months on a single GPU
§ An effective remedy to this problem is to utilize multiple GPUs to speed up training
§ Scale-up approaches rely on tight hardware integration to improve the data throughput
  o These solutions are effective, but costly
  o Furthermore, technological and economic constraints impose tight limitations on scaling up
§ DDLS (distributed deep learning systems) aim at scaling out to train large models using the combined resources of clusters of independent machines
§ With M machines, each processing a local mini-batch of size B', the effective global mini-batch size is B = M · B'
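As a small worked example (the concrete numbers below are illustrative assumptions, not from the slides), the effective mini-batch size grows linearly with the number of machines:

# Illustrative numbers only: M machines, each with a local mini-batch of
# size B', jointly process B = M * B' samples per synchronous step.
M = 8          # number of machines
B_prime = 32   # local mini-batch size per machine (B')
B = M * B_prime
print(f"Effective mini-batch size: B = {M} * {B_prime} = {B}")  # -> 256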
[Figure: Training data]
Matthias Langer, Zhen He, Wenny Rahayu, Yanbo Xue, "Distributed Training of Deep Learning Models: A Taxonomic Perspective," IEEE Transactions on Parallel and Distributed Systems, vol. 31, no. 12, pp. 2802-2818, 2020, doi: 10.1109/TPDS.2020.3003307
§ Each machine uses the identical, locally accumulated gradients to step its equally parameterized optimizer copy, which in turn applies exactly the same update
§ This approach is not only robust to node failures, but also makes adding and removing nodes trivial
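The sketch below illustrates this behavior under the assumption of synchronous data-parallel SGD with gradient averaging; the toy linear model, worker count, learning rate, and names such as local_grad are hypothetical and only serve to show that all optimizer copies receive the same gradient and therefore stay synchronized:

import numpy as np

# Minimal sketch (assumptions: synchronous data parallelism, plain SGD,
# a toy least-squares model). Not taken from any specific framework.
rng = np.random.default_rng(0)
M, B_prime, D = 4, 32, 10          # machines, local mini-batch size B', feature dim
lr = 0.1

# Every machine starts from an identically parameterized model copy.
w = [np.zeros(D) for _ in range(M)]

def local_grad(w_m, X, y):
    # Gradient of the mean squared error on this machine's local mini-batch.
    return X.T @ (X @ w_m - y) / len(y)

for step in range(100):
    # Each machine computes a gradient on its own mini-batch of size B'.
    grads = []
    for m in range(M):
        X = rng.normal(size=(B_prime, D))
        y = X @ np.ones(D) + 0.01 * rng.normal(size=B_prime)
        grads.append(local_grad(w[m], X, y))

    # Gradient averaging: afterwards every machine holds the identical
    # gradient over the combined mini-batch B = M * B'.
    g = np.mean(grads, axis=0)

    # Each machine steps its own optimizer copy with the same gradient,
    # so all model replicas apply exactly the same update.
    for m in range(M):
        w[m] = w[m] - lr * g

# All replicas remain identical after every step.
assert all(np.allclose(w[0], w[m]) for m in range(M))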