DC Unit IV
Distributed parallel training involves two high-level concepts: parallelism and distribution.
Parallelism is a framework-level strategy for tackling the size of large models or improving
training efficiency, while distribution is an infrastructure-level architecture for scaling out.
In addition to the two basic types of parallelism, data parallelism and model parallelism, there
are many more variants, such as expert parallelism. Furthermore, two or more of these can be
combined, as in mixed data and model parallelism. For large-scale models, a combination of data
and model parallelism is frequently used.
Data parallelism shards the data across all cores while running the same model on each. A data
parallelism framework such as PyTorch Distributed Data Parallel, SageMaker Distributed, or
Horovod mainly accomplishes the following three tasks:
1. It creates and dispatches copies of the model, one copy per accelerator.
2. It shards the data and distributes it to the corresponding devices.
3. It aggregates all results together in the backpropagation step.
Note that the first task happens once per training run, while the last two occur in every iteration.
Data parallelism across multiple machines is implemented at the module level by PyTorch
Distributed Data Parallel (DDP), which can also be combined with model parallelism in PyTorch.
DDP applications should spawn multiple processes and create a single DDP instance per process.
DDP uses collective communications from the torch.distributed package to synchronize gradients
and buffers. In addition, DDP registers an autograd hook for each parameter returned by
model.parameters(); the hook fires in the backward pass when the corresponding gradient has been
computed, and DDP uses that signal to trigger gradient synchronization across processes.
So there are three main steps to set up and run DDP in PyTorch:
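In outline, these steps are: (1) set up the process group so the processes can communicate,
(2) create the model and wrap it in a DDP instance inside each process, and (3) launch one
process per model replica. A minimal sketch of these steps on CPU with the gloo backend might
look like the following; the toy model, port number, and world size are chosen only for
illustration:

import torch
import torch.distributed as dist
import torch.multiprocessing as mp
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def train(rank, world_size):
    # Step 1: set up the process group so all processes can communicate.
    dist.init_process_group("gloo", init_method="tcp://127.0.0.1:29500",
                            rank=rank, world_size=world_size)
    # Step 2: create the model and wrap it in DDP (one DDP instance per process).
    model = nn.Linear(10, 1)
    ddp_model = DDP(model)
    optimizer = torch.optim.SGD(ddp_model.parameters(), lr=0.01)
    # One toy iteration: the backward pass triggers gradient synchronization across processes.
    loss = ddp_model(torch.randn(20, 10)).sum()
    loss.backward()
    optimizer.step()
    dist.destroy_process_group()

if __name__ == "__main__":
    # Step 3: spawn one process per replica (two processes here).
    world_size = 2
    mp.spawn(train, args=(world_size,), nprocs=world_size)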
Distributed model parallel training has two main parts. To realize it, the model itself must be
designed for parallelism across multiple GPUs. PyTorch wraps this up and eases the
implementation; only three small changes are needed:
1. Use “to(device)” to place particular layers (or sub-networks) of the model on a specific
device (or GPU).
2. Adapt the “forward” method accordingly to move intermediate outputs across devices.
3. Place the labels on the same device as the outputs when calling the loss function.
“backward()” and “torch.optim” will then handle gradients automatically, as if running on
a single GPU.
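A small sketch of these three changes, assuming a machine with two GPUs (cuda:0 and cuda:1) and
a toy two-layer network, might look like this:

import torch
import torch.nn as nn

class ToyModelParallel(nn.Module):
    def __init__(self):
        super().__init__()
        # Change 1: place different layers on specific devices with to(device).
        self.net1 = nn.Linear(10, 10).to('cuda:0')
        self.relu = nn.ReLU()
        self.net2 = nn.Linear(10, 5).to('cuda:1')

    def forward(self, x):
        # Change 2: move intermediate outputs across devices inside forward().
        x = self.relu(self.net1(x.to('cuda:0')))
        return self.net2(x.to('cuda:1'))

model = ToyModelParallel()
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.001)

outputs = model(torch.randn(20, 10))
# Change 3: put the labels on the same device as the outputs before the loss call.
labels = torch.randn(20, 5).to('cuda:1')
loss_fn(outputs, labels).backward()   # backward() and torch.optim work as on a single GPU
optimizer.step()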
Minimizing the loss function leads to a model that better fits the data. Gradient descent achieves
this by iteratively updating the model's parameters to move toward the minimum of the loss
function.
Imagine a ball rolling down a hill: the steeper the slope, the faster the ball rolls. Similarly,
gradient descent optimizes faster by updating the model parameters in the direction of the loss
function's steepest descent.
The mean squared error (MSE) is a typical loss function for linear regression. It is the average,
over the training instances, of the squared difference between the predicted and actual values.
MSE Formula:
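With n training instances, actual values y_i, and predicted values ŷ_i:

MSE = (1/n) * Σ (y_i − ŷ_i)², where the sum runs over i = 1, …, n.

Gradient descent then proceeds as follows: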
1. Calculate the gradient of the loss function with respect to the model parameters.
2. Update the parameters by subtracting the learning rate multiplied by the gradient.
3. Repeat steps 1 and 2 until convergence (i.e., the loss function stops decreasing
significantly).
Learning Rate:
The learning rate determines the step size taken during each update. A larger learning rate leads
to faster convergence but can also cause instability and overshooting of the minimum, while a
smaller learning rate converges more slowly but is more stable.
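A minimal sketch of these steps for linear regression with the MSE loss, in plain NumPy with the
learning rate and iteration count chosen arbitrarily, might look like this:

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, n_iters=1000):
    """Batch gradient descent for linear regression with the MSE loss."""
    n, d = X.shape
    w = np.zeros(d)
    b = 0.0
    for _ in range(n_iters):
        error = X @ w + b - y                # predictions minus actual values
        grad_w = (2.0 / n) * (X.T @ error)   # step 1: gradient of MSE w.r.t. w
        grad_b = (2.0 / n) * error.sum()     #         gradient of MSE w.r.t. b
        w -= learning_rate * grad_w          # step 2: parameter -= learning_rate * gradient
        b -= learning_rate * grad_b          # step 3: repeat until convergence
    return w, b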
Fig. An illustration of federated learning for the mobile word prediction job.
A multitude of data is produced every day by modern distributed networks of devices such as
wearables, mobile phones, and self-driving cars. Because of these devices' growing computational
power and concerns about transmitting sensitive data, storing data locally and pushing network
computation to the edge devices is becoming increasingly appealing. In such settings, federated
learning has emerged as a popular training paradigm. Federated learning creates new challenges at
the intersection of machine learning and systems, and it requires fundamental advances in areas
such as distributed optimization, large-scale machine learning, and privacy.
Federated learning can be used to learn from mobile phone users' behaviors, guide autonomous
vehicles to accommodate pedestrian behavior, or forecast health events like the likelihood of a
heart attack from wearable technology.
Fig. Federated learning is also being used for individual healthcare through learning over diverse
electronic health records that are dispersed among several hospitals.
Reduce is one of the fundamental ideas of functional programming. Data reduction is the process
of using a function to condense a set of numbers into a smaller set (often a single value). For
example, suppose we have the list of numbers [1, 2, 3, 4, 5]. Reducing this list with the sum
function gives sum([1, 2, 3, 4, 5]) = 15, and reducing it with multiplication gives
multiply([1, 2, 3, 4, 5]) = 120.
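In Python, for instance, these two reductions can be written with functools.reduce:

from functools import reduce

numbers = [1, 2, 3, 4, 5]
total = reduce(lambda a, b: a + b, numbers)    # sum reduction: 15
product = reduce(lambda a, b: a * b, numbers)  # multiplication reduction: 120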
Applying reduction functions across a set of numbers distributed over many processes can be
laborious. In addition, programming non-commutative reductions, that is, reductions that must
happen in a specific order, is difficult. Fortunately, MPI includes a handy function named
MPI_Reduce that handles nearly all of the common reductions a programmer needs in a parallel
application.
MPI_Reduce
Like MPI_Gather, MPI_Reduce takes an array of input elements from each process and returns an
array of output elements to the root process. The output elements contain the reduced result. The
prototype for MPI_Reduce looks like this:
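MPI_Reduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    int root,
    MPI_Comm communicator)

Here send_data is the array of elements each process contributes, recv_data (significant only on
the root) receives the reduced result, count and datatype describe the elements, op is the
reduction operation (e.g., MPI_SUM), and root is the rank that receives the result.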
It is now time to discuss MPI_Reduce's sibling, MPI_Allreduce.
MPI_Allreduce
Many parallel applications need the reduced results on all processes rather than only on the root
process. In the same way that MPI_Allgather complements MPI_Gather, MPI_Allreduce reduces the
values and distributes the results to every process.
This is a function prototype:
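MPI_Allreduce(
    void* send_data,
    void* recv_data,
    int count,
    MPI_Datatype datatype,
    MPI_Op op,
    MPI_Comm communicator)

It is identical to MPI_Reduce except that no root rank is needed, because every process receives
the reduced result. As a small illustration, a sum across all processes using the mpi4py Python
bindings (assuming mpi4py is installed) looks like this:

# Run with, for example: mpiexec -n 4 python allreduce_example.py
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

local_value = rank + 1                           # each process contributes its own number
total = comm.allreduce(local_value, op=MPI.SUM)  # every process receives the full sum
print(f"rank {rank}: total = {total}")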
Elastic Averaging SGD (EASGD)
Initialization: Each worker (e.g., a GPU) starts with a copy of the initial global model parameters.
Local Updates: Each worker independently performs SGD updates on its local copy of the
parameters using its own mini-batch of data.
Elastic Averaging: Instead of directly averaging the local parameters, EASGD applies an "elastic
averaging" step. Each worker computes an "elastic force" that pulls its local parameters closer to
the global parameters stored on the central server. This force is proportional to the difference
between the local and global parameters.
Parameter Update: Each worker updates its local parameters by incorporating both the local SGD
update and the elastic force.
Global Model Update: The central server periodically updates the global parameters by
averaging the local parameters from all workers.
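A schematic single-machine simulation of one synchronous EASGD round is sketched below; the
hyperparameter names (lr for the learning rate, rho for the elastic coefficient) and the NumPy
formulation are illustrative assumptions, not code from any particular framework:

import numpy as np

def easgd_round(local_params, global_params, grads, lr=0.01, rho=0.1):
    """One synchronous EASGD round over a list of workers (schematic)."""
    new_locals = []
    drift_sum = np.zeros_like(global_params)
    for x_i, g_i in zip(local_params, grads):
        drift = x_i - global_params                        # how far this worker has wandered
        new_locals.append(x_i - lr * (g_i + rho * drift))  # local SGD step plus elastic pull
        drift_sum += drift
    # The central (global) variable is pulled toward the average of the workers.
    new_global = global_params + lr * rho * drift_sum
    return new_locals, new_global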
Benefits of EASGD:
Faster convergence: EASGD can achieve faster convergence compared to traditional SGD,
especially for large datasets and distributed training environments.
Improved communication efficiency: By avoiding frequent communication of entire model
parameters, EASGD reduces communication overhead in distributed settings.
Better exploration: EASGD allows local workers to explore different parameter values
independently, potentially leading to better solutions.
Resilience to staleness: EASGD is less sensitive to stale gradients compared to traditional SGD,
making it more robust in asynchronous training environments.
Limitations of EASGD:
Tuning complexity: Choosing the appropriate values for the hyperparameters (e.g., learning rate,
elastic averaging coefficient) can be challenging and may require experimentation.
Implementation complexity: Implementing EASGD can be more complex than traditional SGD
due to the need for additional communication and coordination between workers and the central
server.
Applications of EASGD:
Large-scale image classification: EASGD is well-suited for training deep convolutional neural
networks on large image datasets.
Natural language processing: EASGD can be used to train recurrent neural networks for tasks
such as machine translation and text summarization.
Speech recognition: EASGD can be applied to train deep learning models for speech recognition
tasks.
Apache Spark has a hierarchical master/slave architecture. The Spark Driver is the master node;
it controls the cluster manager, which in turn manages the worker (slave) nodes and delivers data
results to the application client.
Based on the application code, the Spark Driver creates the Spark Context, which works with the
cluster manager (Spark's Standalone Cluster Manager or an alternative such as Hadoop YARN,
Kubernetes, or Mesos) to allocate and track execution across the nodes. It also creates resilient
distributed datasets (RDDs), which are key to Spark's remarkable processing speed.
Resilient distributed datasets (RDDs) are fault-tolerant collections of elements that can be
distributed across several cluster nodes and operated on in parallel. RDDs are the core building
block of Apache Spark.
Spark loads data into an RDD for processing either by using the Spark Context parallelize method
on an existing collection or by referencing an external data source. The key to Spark's
performance is that, once the data is loaded, transformations and actions are applied to RDDs
while they are in memory. Spark keeps the data in memory until the system runs out of memory or
the user chooses to write the data to disk for persistence.
Each dataset in an RDD is divided into logical partitions, which can be computed on different
cluster nodes. Users can perform two kinds of operations on an RDD: transformations, which
produce a new RDD, and actions, which tell Apache Spark to perform a computation and return the
result to the driver. Spark supports numerous transformations and actions on RDDs, and it handles
their distribution across the cluster, so users do not need to manage the distribution themselves.
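As a small illustration, a PySpark session that parallelizes a collection, applies a
transformation, and then runs an action might look like this (local mode, values chosen only for
illustration):

from pyspark import SparkContext

sc = SparkContext("local[*]", "rdd-example")

rdd = sc.parallelize([1, 2, 3, 4, 5])        # load a collection into an RDD
squared = rdd.map(lambda x: x * x)           # transformation: lazily builds a new RDD
total = squared.reduce(lambda a, b: a + b)   # action: triggers computation and returns 55

print(total)
sc.stop()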
Spark Streaming
Spark Streaming, an extension of the core Spark API, enables scalable and fault-tolerant
processing of real-time data streams. The results of streaming analytics can be pushed to file
systems, databases, and live dashboards, and Spark's machine learning and graph-processing
algorithms can be applied to the live data. Spark Streaming processes streaming data through
incremental (micro-) batch processing, powered by the Spark SQL engine.
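A minimal sketch of the classic DStream word count, assuming a text source is listening on
localhost port 9999, looks like this:

from pyspark import SparkContext
from pyspark.streaming import StreamingContext

sc = SparkContext("local[2]", "streaming-example")
ssc = StreamingContext(sc, batchDuration=1)          # 1-second micro-batches

lines = ssc.socketTextStream("localhost", 9999)
counts = (lines.flatMap(lambda line: line.split())
               .map(lambda word: (word, 1))
               .reduceByKey(lambda a, b: a + b))
counts.pprint()                                      # print each batch's word counts

ssc.start()
ssc.awaitTermination()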
4.2.2 GraphLab
GraphLab is a C++-based parallel machine learning framework. The project is open source and was
created with the volume, diversity, and complexity of real-world data in mind. To deliver high
performance, it combines a number of sophisticated techniques, including gradient descent,
stochastic gradient descent (SGD), and locking. It helps data scientists and developers create
and deploy applications at large scale.
But what makes it so attractive? It is the availability of practical libraries for working with,
transforming, and visualizing models. It also provides scalable machine learning toolkits with
almost all the parts required to improve machine learning models. The toolkits contain
implementations of several approaches, such as nearest neighbors, topic modeling, deep learning,
factorization machines, and clustering.
Here is the complete architecture of GraphLab:
Fig. Architecture of GraphLab
How to Install GraphLab?
GraphLab requires a license: you can start with a free trial or use the academic edition, which
comes with a one-year subscription. Before installing GraphLab, make sure your computer meets the
system requirements.
Handles Large Data: GraphLab's data structures can handle large data sets, enabling scalable
machine learning. Let's look at the data structures of GraphLab:
SFrame: This efficient disk-based tabular data format is not constrained by RAM, which helps
scale analysis and data processing to very large (terabyte-scale) data sets, even on a laptop.
Its syntax is comparable to that of pandas or R data frames. Each column is an SArray, a
collection of elements stored on disk, so SFrames are disk-based. In the sections that follow,
I've covered several approaches to working with “SFrames.”
SGraph: A graph helps us understand networks by examining the relationships between pairs of
objects. Every object in the graph is represented by a vertex, and the edges indicate the
relationships between objects. GraphLab uses the SGraph object to process data in a
graph-oriented manner; it is a scalable graph data structure that stores vertices and edges in
SFrames.
Integration with various data sources: Numerous data sources, including S3, ODBC, JSON,
CSV, HDFS, and many more, are supported by GraphLab.
Data exploration and visualization with GraphLab Canvas. With the browser-based
interactive GraphLab Canvas, you may examine bi-variate graphs, summary statistics, and
tabular data. You save time coding for data exploration by using this functionality. This will
enable you to concentrate more on comprehending the distribution and relationship between the
variables. I've talked about this in the parts that follow.
Feature Engineering: A built-in feature of GraphLab allows you to add new, practical elements
to improve model performance. It has a number of choices, including tf-idf, imputation, binning,
transformation, and one hot encoding.
Modeling: With its many toolkits, GraphLab can quickly and easily solve machine learning
problems. It lets you carry out several modeling exercises (such as regression, classification,
and clustering) with fewer lines of code; a short usage sketch appears at the end of this
section. You can work on problems in image analysis, sentiment analysis, churn prediction,
recommendation systems, and many more areas.
Production automation: Reusable code tasks can be assembled into jobs using data pipelines and
then run automatically on shared execution environments (e.g., Amazon Web Services, Hadoop).
GraphLab Create SDK: Advanced users can extend the capabilities of GraphLab Create using the
GraphLab Create SDK. New machine learning models and algorithms can be defined and integrated
with the rest of the package.
License: Its use is limited by licensing. You can choose between a one-year academic edition
license and a 30-day free trial.
Steps for Installation:
● Register for the free trial. After registration, you will receive a product key.
● Select your operating system (auto-selection is on) and follow the given instructions.
● Follow the command-line installation instructions for your environment (for example, the
“Anaconda Python Environment”).
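Once installed, a minimal GraphLab Create session might look like the following sketch. The names
follow the historical graphlab Python package (SFrame, linear_regression.create), and the file
name and column names are placeholders:

import graphlab as gl

# Load a CSV file into an SFrame (disk-backed, so it scales beyond RAM).
sf = gl.SFrame.read_csv('data.csv')
sf.show()                                   # explore the data in GraphLab Canvas

# Train a simple regression model from the toolkits.
model = gl.linear_regression.create(sf, target='price',
                                    features=['sqft_living', 'bedrooms'])
predictions = model.predict(sf)
print(model.evaluate(sf))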
4.2.3 TensorFlow
TensorFlow is an end-to-end open-source machine learning platform. Although TensorFlow is a
robust system for managing every aspect of a machine learning system, this course focuses on
using a specific TensorFlow API to create and train machine learning models. For comprehensive
details about the TensorFlow system as a whole, refer to the TensorFlow documentation.
The TensorFlow APIs are arranged hierarchically, with the high-level APIs built on top of the
low-level APIs. Machine learning researchers use the low-level APIs to create and explore new
algorithms. Throughout this course, you will define, train, and make predictions with machine
learning models using a high-level API called tf.keras, which is the TensorFlow variant of the
open-source Keras API.
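For example, a minimal tf.keras model that learns a simple linear relationship (the toy data here
is generated only for illustration) can be defined, trained, and used for prediction as follows:

import numpy as np
import tensorflow as tf

# Toy data: y = 2x + 1 plus a little noise.
x = np.random.rand(256, 1).astype("float32")
y = 2.0 * x + 1.0 + 0.05 * np.random.randn(256, 1).astype("float32")

model = tf.keras.Sequential([
    tf.keras.layers.Dense(1, input_shape=(1,))
])
model.compile(optimizer="sgd", loss="mse")              # define how to train
model.fit(x, y, epochs=20, batch_size=32, verbose=0)    # train

print(model.predict(np.array([[3.0]], dtype="float32")))  # predict; roughly 7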
A parallel machine learning (ML) system refers to a system or framework designed to perform
machine learning tasks in parallel. Parallel computing involves the simultaneous execution of
multiple calculations or processes to solve a problem more quickly. In the context of machine
learning, parallelism can be employed to accelerate the training and inference processes of ML
models, especially when dealing with large datasets and complex models.
There are various ways to achieve parallelism in machine learning, and several frameworks and
systems have been developed to support parallel and distributed ML computations.
4.2.4 Petuum Design
Petuum has three primary parts: the scheduler, the workers, and the parameter server.
Scheduler: The scheduler system enables model parallelism by letting users decide which model
parameters are updated by which worker machines. This is accomplished through a user-defined
scheduling function, schedule(), which yields a set of parameters for each worker. A simple
schedule might pick a parameter at random for each worker, whereas a more sophisticated scheduler
might choose parameters based on criteria such as pairwise independence or distance from
convergence. The scheduler uses the scheduling control channel (see the figure above) to tell
workers which parameters these are, while the actual parameter values are delivered through a
parameter server system.
Workers: Upon receiving the parameters to update from schedule(), each worker p runs push() on
its data D in parallel. Workers can read data from any kind of data storage system, including
disk, distributed file systems such as HDFS, and databases, because Petuum deliberately does not
define a data abstraction. Furthermore, workers can process the data in whatever order the
programmer specifies: in batch algorithms, workers might process every data point in a single
iteration, while in data-parallel stochastic algorithms, workers may sample a single data point
at a time.
Parameter Server: The parameter servers (PS) provide global access to the model parameters A,
which are spread over many machines, through a simple distributed shared-memory API comparable to
table-based or key-value stores. The PS uses Stale Synchronous Parallel (SSP) consistency to
exploit ML-algorithmic properties: it lowers network synchronization costs while upholding the
bounded-staleness convergence guarantees offered by SSP.
One of Petuum's main objectives is to make it simple to construct data- and model-parallel
machine learning algorithms. To facilitate this, Petuum offers APIs to two key systems:
1. A parameter server system, which uses a bounded-asynchronous consistency model to maintain
data-parallel convergence guarantees, so programmers no longer need to perform explicit network
synchronization. It also provides an intuitive distributed shared-memory interface that emulates
single-machine programming, allowing programmers to access the global model state A from any
machine.
2. A scheduler, which lets users specify their own consistency criteria for ML applications,
giving them fine-grained control over the ordering of model-parallel updates.
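The division of labor among these parts can be sketched as a single-machine simulation. The class
and function names below are illustrative only and do not reflect Petuum's actual APIs; they just
mirror the schedule()/push()/parameter-server pattern described above:

import random

class ToyParameterServer:
    """Stand-in for Petuum's distributed key-value parameter store."""
    def __init__(self, num_params):
        self.table = {j: 0.0 for j in range(num_params)}

    def get(self, keys):
        return {j: self.table[j] for j in keys}        # (SSP staleness is omitted here)

    def inc(self, deltas):
        for j, delta in deltas.items():
            self.table[j] += delta                     # apply additive updates

def schedule(num_workers, num_params):
    # Simplest scheduler: assign each worker one randomly chosen parameter.
    return {w: [random.randrange(num_params)] for w in range(num_workers)}

def push(worker_id, keys, data_shard, ps):
    # Worker: read parameters, compute a toy update from its data shard, push deltas back.
    params = ps.get(keys)
    deltas = {j: 0.1 * sum(data_shard) - 0.05 * value for j, value in params.items()}
    ps.inc(deltas)

ps = ToyParameterServer(num_params=4)
shards = {0: [1.0, 2.0], 1: [3.0, 4.0]}                # each worker owns a data shard
for iteration in range(3):
    for worker_id, keys in schedule(num_workers=2, num_params=4).items():
        push(worker_id, keys, shards[worker_id], ps)   # these calls run in parallel in Petuum
print(ps.table)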
4.2.5 Systems and Architectures for Distributed Machine Learning
Distributed machine learning involves training machine learning models across multiple
machines or nodes in a network. This approach is crucial for handling large datasets, complex
models, and reducing training times. Several systems and architectures have been developed to
support distributed machine learning. Here are some key concepts and frameworks:
Anomaly detection and fault tolerance are critical components of reliable and resilient distributed
systems. Integrating artificial intelligence (AI) into these aspects enhances the system's ability to
identify abnormal behavior, detect faults, and respond proactively to maintain operational
continuity.
Anomaly Detection:
1. Machine Learning Models:
● Utilize machine learning algorithms, such as clustering, classification, or time-series
analysis, to build models based on historical data.
● Train models to recognize normal patterns and behaviors within the distributed system.
● Detect anomalies by comparing real-time or near-real-time data against the learned
patterns.
2. Unsupervised Learning:
● Unsupervised learning techniques can be particularly useful for anomaly detection, as
they don't require labeled data for training.
● Algorithms like isolation forests, one-class SVM, or autoencoders can identify deviations
from the norm.
3. Continuous Monitoring:
● Implement continuous monitoring of system metrics, including performance, resource
utilization, and network traffic.
● Trigger alerts or responses when deviations from normal behavior are detected, indicating
potential anomalies.
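As one possible concrete implementation of these ideas, the sketch below trains scikit-learn's
IsolationForest on synthetic "historical" system metrics and then flags new observations; the
library choice, metric names, and thresholds are assumptions for illustration:

import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic historical metrics: columns are (cpu_util, mem_util, network_mbps).
rng = np.random.default_rng(0)
normal_metrics = rng.normal(loc=[0.4, 0.5, 100.0],
                            scale=[0.05, 0.05, 10.0],
                            size=(1000, 3))

# Learn what "normal" looks like from unlabeled historical data.
detector = IsolationForest(contamination=0.01, random_state=0).fit(normal_metrics)

# Score near-real-time observations: +1 means normal, -1 means anomaly.
new_metrics = np.array([[0.41, 0.52, 105.0],    # looks normal
                        [0.95, 0.97, 900.0]])   # spike in every metric
print(detector.predict(new_metrics))            # e.g. [ 1 -1 ]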
Fault Tolerance:
Resource Management: Dynamically monitors and manages the available resources across
various computing units, including edge devices, cloud servers, and fog nodes.
Task Characterization: Analyzes each task's characteristics, such as its computational
requirements, resource demands, and latency constraints.
Offloading Decision Engine: Leverages machine learning and other AI techniques to make
intelligent decisions about which tasks are offloaded, where they should be offloaded, and when
they should be offloaded.
Communication Management: Optimizes communication protocols and data transfer between
different computing units to minimize latency and network congestion.
Performance Monitoring: Continuously monitors the performance of the offloading process and
makes adjustments to the decision engine as needed.
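A purely illustrative sketch of such an offloading decision engine follows; the scoring rule,
field names, and numbers are assumptions chosen to show the idea of weighing a task's latency
budget against transfer and compute time, not a description of any specific system:

def choose_offload_target(task, targets):
    """Pick the target with the lowest estimated completion time within the latency budget."""
    best = None
    for t in targets:
        transfer_s = task["input_size_mb"] * 8 / t["bandwidth_mbps"]   # upload time (s)
        compute_s = task["cpu_cycles"] / (t["cpu_ghz"] * 1e9)          # execution time (s)
        total = transfer_s + compute_s
        if total <= task["latency_budget_s"] and (best is None or total < best[1]):
            best = (t["name"], total)
    return best   # None means no target meets the budget, so run the task locally

task = {"input_size_mb": 5.0, "cpu_cycles": 1e9, "latency_budget_s": 1.0}
targets = [
    {"name": "edge-node", "bandwidth_mbps": 100.0, "cpu_ghz": 2.0},
    {"name": "cloud",     "bandwidth_mbps": 20.0,  "cpu_ghz": 3.5},
]
print(choose_offload_target(task, targets))   # -> ('edge-node', 0.9)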