Parallelism Strategies for Distributed Training
Introduction
Training models in a distributed fashion is worthwhile for several reasons, including:
Faster Experimentation: multiple experiments or model configurations can be run simultaneously.
Large Models: models that are too large to fit on a single device can still be used for training.

Data Parallelism
In data parallelism, the full model is replicated on every device and each copy is trained on a different shard of the data; gradient communication keeps the model 'replicas' synchronized.
Implementations:
TensorFlow MirroredStrategy: MirroredStrategy is a TensorFlow API that supports data parallelism on a single machine with multiple GPUs. It replicates the model on each GPU, performs parallel computations, and keeps the model replicas synchronized.
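For illustration, here is a minimal Keras sketch of MirroredStrategy; the toy model and the randomly generated x_train/y_train arrays are placeholders rather than part of the original cheat sheet.

import numpy as np
import tensorflow as tf

# Stand-in data; replace with your own dataset.
x_train = np.random.rand(1024, 784).astype("float32")
y_train = np.random.randint(0, 10, size=(1024,))

# Replicates the model on all visible GPUs and all-reduces gradients
# so the replicas stay synchronized.
strategy = tf.distribute.MirroredStrategy()
print("Replicas in sync:", strategy.num_replicas_in_sync)

# Variables (model weights, optimizer state) must be created inside the
# strategy scope so that they are mirrored on every replica.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
        metrics=["accuracy"],
    )

# Each global batch is split across the replicas automatically.
model.fit(x_train, y_train, epochs=2, batch_size=256)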
TensorFlow MultiWorkerMirroredStrategy: This TensorFlow strategy extends MirroredStrategy to distribute training across multiple machines. It allows for synchronous training across multiple workers, where each worker has access to one or more GPUs.
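A minimal sketch of the multi-machine setup follows; the worker addresses and task index in TF_CONFIG are placeholders that each machine must set for itself.

import json, os
import tensorflow as tf

# Every worker runs the same script; only the task index differs per machine.
# The host:port addresses below are placeholders for your own cluster.
os.environ["TF_CONFIG"] = json.dumps({
    "cluster": {"worker": ["10.0.0.1:12345", "10.0.0.2:12345"]},
    "task": {"type": "worker", "index": 0},  # use index 1 on the second worker
})

# TF_CONFIG must be set before the strategy is created.
strategy = tf.distribute.MultiWorkerMirroredStrategy()

# Build and compile the model inside strategy.scope(), exactly as in the
# single-machine MirroredStrategy example above.
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )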
TensorFlow TPUStrategy: TPUStrategy is designed specifically for training models on Google's Tensor Processing Units (TPUs). It replicates the model across TPU cores and enables efficient parallel computations for accelerated training.
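A minimal sketch of connecting to a TPU and creating the strategy; the empty tpu="" argument works on Colab and Cloud TPU VMs, and may need to be replaced with your TPU's name elsewhere.

import tensorflow as tf

# Locate and initialize the TPU system.
resolver = tf.distribute.cluster_resolver.TPUClusterResolver(tpu="")
tf.config.experimental_connect_to_cluster(resolver)
tf.tpu.experimental.initialize_tpu_system(resolver)

# Replicates the model across all TPU cores, analogous to MirroredStrategy.
strategy = tf.distribute.TPUStrategy(resolver)

with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.Input(shape=(784,)),
        tf.keras.layers.Dense(128, activation="relu"),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer="adam",
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True),
    )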
Pipeline Parallelism
Scaling up the capacity of deep neural networks has proven to be an effective method for improving model quality in different machine learning tasks. However, when we want to go beyond the memory limitations of a single accelerator, it becomes necessary to develop specialized algorithms or infrastructure. This is where pipeline parallelism comes in. Pipeline parallelism is a method where each layer (or group of layers) is placed on a different GPU, i.e. the model is partitioned vertically at the layer level. If it is applied naively, the training process suffers from severely low GPU utilization, as shown in Figure 1(b). The figure shows a model consisting of 4 layers spread across 4 different GPUs (represented vertically). The horizontal axis represents the training process over time, and it demonstrates that only one GPU is used at a time. For more information about pipeline parallelism, refer to this paper.
Figure 1: (a) An example neural network with sequential layers is partitioned across four accelerators. Fk is the composite forward computation function of the k-th cell. Bk is the back-propagation function, which depends on both Bk+1 from the upper layer and Fk. (b) The naive model parallelism strategy leads to severe under-utilization due to the sequential dependency of the network. (c) Pipeline parallelism divides the input mini-batch into smaller micro-batches, enabling different accelerators to work on different micro-batches simultaneously. Gradients are applied synchronously at the end.
Implementations
An effective implementation of pipeline parallelism is available in Fairscale.
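Here is a minimal sketch of wrapping a toy sequential model with Fairscale's Pipe; the layer sizes, balance, and chunks values are illustrative, it assumes two visible GPUs, and the exact arguments may differ slightly across Fairscale versions.

import torch
import torch.nn as nn
import fairscale

# A toy 4-module model; `balance=[2, 2]` places the first two modules on
# cuda:0 and the last two on cuda:1.
model = nn.Sequential(
    nn.Linear(1024, 1024), nn.ReLU(),
    nn.Linear(1024, 1024), nn.ReLU(),
)

# `chunks` splits each mini-batch into micro-batches so both GPUs can work
# on different micro-batches at the same time, as in Figure 1(c).
model = fairscale.nn.Pipe(model, balance=[2, 2], chunks=8)

x = torch.randn(64, 1024).to(0)  # the input lives on the first device
output = model(x)                # the output is produced on the last device
output.sum().backward()          # gradients are applied synchronously at the end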
Tensor Parallelism
Implementations: Megatron-LM is a widely used tensor (model) parallelism implementation; its reference training configurations typically assume at least 8 GPUs.
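To convey the idea, here is a toy, single-process sketch of column-parallel tensor parallelism: one layer's weight matrix is split across two GPUs and the partial outputs are concatenated. It is not Megatron-LM's actual implementation, which runs one process per GPU and uses collective communication.

import torch
import torch.nn as nn

class ColumnParallelLinear(nn.Module):
    """Splits a linear layer's output columns across several devices."""

    def __init__(self, in_features, out_features, devices=(0, 1)):
        super().__init__()
        assert out_features % len(devices) == 0
        shard = out_features // len(devices)
        self.devices = devices
        # One weight shard per device.
        self.shards = nn.ModuleList(
            nn.Linear(in_features, shard).to(d) for d in devices
        )

    def forward(self, x):
        # Each device multiplies the same input by its own weight shard.
        partial = [m(x.to(d)) for m, d in zip(self.shards, self.devices)]
        # Gather the partial outputs back onto the first device.
        return torch.cat([p.to(self.devices[0]) for p in partial], dim=-1)

layer = ColumnParallelLinear(1024, 4096)
out = layer(torch.randn(8, 1024))
print(out.shape)  # torch.Size([8, 4096])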
Combination of Parallelism Techniques
Sometimes, a data science task may require a combination of different training paradigms to achieve optimal performance. For instance, you might want to leverage two or more of the methods covered earlier to take advantage of their respective strengths. There are many possible combinations of these techniques; however, we will cover only two state-of-the-art approaches in this section. If you want to train a gigantic model with billions of parameters, you should consider one of these techniques:
The Zero Redundancy Optimizer (ZeRO)
ZeRO, implemented in the DeepSpeed library, combines data parallelism with aggressive memory reduction. Instead of every data-parallel worker keeping a full copy of the optimizer states, gradients, and parameters, ZeRO partitions them across the workers (stages 1, 2, and 3, respectively), so each GPU stores only a shard and much larger models fit in memory.
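A minimal sketch of enabling ZeRO stage 2 via DeepSpeed follows; the toy model, batch size, learning rate, and the train.py launch command are illustrative placeholders.

import torch
import torch.nn as nn
import deepspeed

# Toy model; with ZeRO, its optimizer states and gradients are sharded
# across all data-parallel GPUs instead of being replicated on each one.
model = nn.Sequential(nn.Linear(1024, 4096), nn.ReLU(), nn.Linear(4096, 1024))

ds_config = {
    "train_micro_batch_size_per_gpu": 8,
    "optimizer": {"type": "Adam", "params": {"lr": 1e-4}},
    # stage 1: shard optimizer states, 2: + gradients, 3: + parameters
    "zero_optimization": {"stage": 2},
}

# Launch with `deepspeed train.py` so that one process is started per GPU.
model_engine, optimizer, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)

x = torch.randn(8, 1024).to(model_engine.device)
loss = model_engine(x).sum()
model_engine.backward(loss)  # ZeRO-aware backward pass
model_engine.step()          # optimizer step on the sharded states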
Alpa
Alpa is a framework that automates the complex process of parallelizing deep learning models for distributed training. It focuses on two types of parallelism: inter-operator parallelism (e.g. device placement, pipeline parallelism and their variants) and intra-operator parallelism (e.g. data parallelism, Megatron-LM's tensor model parallelism). Inter-operator parallelism assigns different operators in the model to different devices, reducing communication bandwidth requirements but suffering from device underutilization. Intra-operator parallelism partitions individual operators and executes them on multiple devices, requiring heavier communication but avoiding data dependency issues. Alpa uses a compiler-based approach to automatically analyze the computational graph and device cluster, finding optimal parallelization strategies for both inter- and intra-operator parallelism. It generates a static plan for execution, allowing the distributed model to be efficiently trained on a user-provided device cluster.
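As an illustration, here is a minimal sketch using Alpa's @alpa.parallelize decorator on a JAX training step; the toy linear model and plain SGD update are placeholders, and multi-node clusters additionally require Alpa's Ray-based initialization, which is omitted here.

import alpa
import jax
import jax.numpy as jnp

# Alpa's compiler analyzes this step and picks inter- and intra-operator
# parallelization strategies for the available device cluster.
@alpa.parallelize
def train_step(params, batch):
    def loss_fn(p):
        pred = batch["x"] @ p["w"] + p["b"]
        return jnp.mean((pred - batch["y"]) ** 2)

    grads = jax.grad(loss_fn)(params)
    # Plain SGD update; a real model would use an optimizer library.
    return jax.tree_util.tree_map(lambda p, g: p - 1e-3 * g, params, grads)

params = {"w": jnp.zeros((128, 1)), "b": jnp.zeros((1,))}
batch = {"x": jnp.ones((64, 128)), "y": jnp.ones((64, 1))}
params = train_step(params, batch)  # executed with Alpa's generated static plan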