Parallel and Distributed Computing
Composed by: Danish Khan
Table of Contents
Parallel Computing and Distributed Computing .................................................................................... 5
What is Parallel Computing? ................................................................................................................ 6
Advantages and Disadvantages of Parallel Computing ........................................................ 6
What is Distributed Computing? .................................................................................................... 7
Advantages and Disadvantages of Distributed Computing ................................................. 7
Key differences between the Parallel Computing and Distributed Computing ................... 8
Various Failures in Distributed System ................................................................................................. 16
GPU architecture and programming: .................................................................................................... 19
Difference between CPU and GPU: ...................................................................................................... 19
Introduction to CUDA Programming ...................................................................................................... 20
Why do we need CUDA? ............................................................................................................. 20
How CUDA works? ........................................................................................................................ 21
Architecture of CUDA ................................................................................................................... 21
How work is distributed? ............................................................................................................ 22
CUDA Applications ....................................................................................................................... 22
Benefits of CUDA ........................................................................................................................... 23
Limitations of CUDA ..................................................................................................................... 23
What is GPU Programming? ............................................................................................................ 23
Heterogeneous computing ...................................................................................................................... 24
Heterogeneous and other DSM systems | Distributed systems........................................................ 24
Need for Heterogeneous DSM (HDSM): .................................................................................. 24
Heterogeneous DSM:.................................................................................................................... 25
Data compatibility & conversion: ............................................................................................. 25
Block size selection : ................................................................................................................... 26
Advantages of DSM: ..................................................................................................................... 27
Difference between a Homogeneous DSM & Heterogeneous DSM: .............................. 27
Interconnection Network/topologies: ........................................................................................... 28
Evaluating Design Trade-offs in Network Topology ................................................................. 30
Routing ................................................................................................................................................. 30
Routing Mechanisms .................................................................................................................... 30
Deterministic Routing ................................................................................................................... 30
Deadlock Freedom ......................................................................................................................... 30
Theta Notation................................................................................................................................. 68
Speedup of an Algorithm ................................................................................................................. 69
Number of Processors Used ........................................................................................................... 69
Total Cost ............................................................................................................................................. 69
Parallel Algorithm - Models ..................................................................................................................... 69
Data Parallel......................................................................................................................................... 70
Task Graph Model .............................................................................................................................. 71
Work Pool Model ................................................................................................................................ 72
Master-Slave Model ........................................................................................................................... 73
Precautions in using the master-slave model ........................................................................ 74
Pipeline Model..................................................................................................................................... 74
Hybrid Models ..................................................................................................................................... 75
Parallel Random Access Machines ....................................................................................................... 75
Shared Memory Model ...................................................................................................................... 77
Merits of Shared Memory Programming .................................................................................. 78
Demerits of Shared Memory Programming............................................................................. 78
Message Passing Model................................................................................................................... 78
Multithreaded programming: ...................................................................................................... 79
Multithreading on a Single Processor ...................................................................................... 80
Multithreaded Programming on Multiple Processors .......................................................... 80
Why Is Multithreading Important? ................................................................................................. 80
Processors Are at Maximum Clock Speed .............................................................................. 80
Parallelism Is Important for AI .................................................................................................... 80
What Are Common Multithreaded Programming Issues? ...................................................... 81
Race Conditions (Including Data Race) ................................................................................... 81
Deadlock ........................................................................................................................................... 82
parallel I/O: ................................................................................................................................................. 83
Performance Optimization of Distributed System ............................................................................... 84
Performance Optimization of Distributed Systems: ........................................................... 84
Performance analysis of parallel processing systems........................................................................ 87
Classification of parallel programming models ................................................................. 90
Process interaction................................................................................................................... 90
There are two main types of computation: parallel computing and distributed
computing. A computer system performs tasks according to the instructions it is given.
A single processor can execute only one task at a time, which is not efficient. Parallel
computing solves this problem by allowing numerous processors to work on tasks
simultaneously, and modern computers support parallel processing to improve system
performance. In contrast, distributed computing enables several computers to
communicate with one another and achieve a common goal. All of these computers
communicate and collaborate over a network. Distributed computing is commonly
used by organizations such as Facebook and Google that allow people to share
resources.
Advantages
1. It saves time and money, because many resources working together cut down on both.
2. Larger problems that are difficult to solve with serial computing can be handled.
3. Many things can be done at once using multiple computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and
comprehending complicated real-world events.
Disadvantages
There are various benefits of using distributed computing. It enables scalability and
makes it simpler to share resources. It also aids in the efficiency of computation
processes.
Advantages
Disadvantages
1. Data security and sharing are the main issues in distributed systems due to the features
of open systems.
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are
more challenging.
3. The main disadvantage of distributed computer systems is the lack of software support.
Here, you will learn the various key differences between parallel computing and
distributed computing. Some of the key differences are as follows:
1. Parallel computing is a sort of computation in which various tasks or processes are run at
the same time. In contrast, distributed computing is that type of computing in which the
components are located on various networked systems that interact and coordinate their
actions by passing messages to one another.
2. In parallel computing, processors communicate with one another via a bus. On the
other hand, computer systems in distributed computing connect with one another via a
network.
3. Parallel computing takes place on a single computer. In contrast, distributed computing
takes place on several computers.
Types of Parallelism:
2. Instruction-level parallelism: Without instruction-level parallelism, a processor can
issue at most one instruction per clock cycle. Instructions can be re-ordered and grouped
so that they are later executed concurrently without affecting the result of the program.
This is called instruction-level parallelism.
3. Task Parallelism: Task parallelism employs the decomposition of a task into subtasks,
each of which is then allocated for execution. The processors execute the subtasks
concurrently, as in the sketch below.
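As a hedged illustration of task parallelism (this example is not from the original notes; the data and task names are invented), two threads perform different subtasks, a sum and a maximum, over the same data set at the same time:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    long long sum = 0;
    int maximum = 0;

    // Each thread runs a different subtask; both only read the shared data.
    std::thread sum_task([&] { sum = std::accumulate(data.begin(), data.end(), 0LL); });
    std::thread max_task([&] { maximum = *std::max_element(data.begin(), data.end()); });

    sum_task.join();
    max_task.join();

    std::cout << "sum = " << sum << ", max = " << maximum << "\n";
    return 0;
}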
Synchronous Transmission vs. Asynchronous Transmission
1. In synchronous transmission, data is sent in the form of blocks or frames; in
asynchronous transmission, data is sent in the form of bytes or characters.
7. Transmission lines are used efficiently in synchronous transmission; in asynchronous
transmission, the transmission line remains empty during gaps between character
transmissions.
Concurrency:
Concurrency relates to an application that is processing more than one task at
the same time. It is an approach used to decrease the response time of the
system by using a single processing unit. Concurrency creates the illusion of
parallelism: the chunks of a task are not actually processed in parallel, but more
than one task is in progress inside the application at a time. One task does not
have to finish completely before the next one begins.
Concurrency is achieved by interleaving the operation of processes on the
central processing unit (CPU), in other words by context switching. That is why
it looks like parallel processing. It increases the amount of work finished at a
time.
In the figure above, we can see that there are multiple tasks making progress
at the same time. The figure illustrates concurrency, the technique that deals
with a lot of things at a time.
Parallelism:
Parallelism is related to an application where tasks are divided into smaller sub-
tasks that are processed simultaneously, i.e. in parallel. It is used to increase the
throughput and computational speed of the system by using multiple
processors, so that many things really do happen at the same time rather than
merely appearing to.
Parallelism leads to overlapping of the central processing unit and input-output
tasks in one process with the central processing unit and input-output tasks of
another process, whereas in concurrency the speed is increased by
overlapping the input-output activities of one process with the CPU activity of
another process.
In the figure above, we can see that the tasks are divided into smaller sub-
tasks that are processed simultaneously, in parallel. The figure illustrates
parallelism, the technique that runs threads simultaneously.
Concurrency control:
Two important issues in concurrency control are known as deadlocks and race
conditions. A deadlock occurs when a resource held indefinitely by one process is
requested by two or more other processes simultaneously.
Fault-tolerance:
Fault tolerance is the ability of a system to keep working properly in spite of the
occurrence of failures in the system. Even after performing many testing processes
there is still a possibility of failure; in practice a system cannot be made entirely error
free. Hence, systems are designed in such a way that in case of an error or failure the
system still works properly and gives correct results.
Any system has two major components – hardware and software – and a fault may
occur in either of them. So there are separate techniques for fault-tolerance in both
hardware and software.
Hardware Fault-tolerance Techniques:
Making hardware fault-tolerant is simple compared to software. Fault-tolerance
techniques make the hardware work properly and give correct results even when some
fault occurs in the hardware part of the system. There are basically two techniques used
for hardware fault-tolerance:
1. BIST –
BIST stands for Built-In Self-Test. The system carries out a test of itself after a certain
period of time, again and again; that is the BIST technique for hardware fault-tolerance.
When the system detects a fault, it switches out the faulty component and switches in a
redundant copy of it. The system basically reconfigures itself in case of fault occurrence.
2. TMR –
TMR is Triple Modular Redundancy. Three redundant copies of a critical component
are generated and all three copies are run concurrently. The results of all redundant
copies are voted on and the majority result is selected. TMR can tolerate the occurrence
of a single fault at a time.
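As a rough illustration only (not part of the original notes, and with invented values), a software analogue of the TMR voting step can be sketched as a majority function over three redundant results:

#include <iostream>

// Returns the value agreed on by at least two of the three redundant copies,
// masking a single faulty result.
int majority(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b;  // here b == c, or all three disagree (which TMR cannot mask)
}

int main() {
    // Suppose the third redundant copy produced a faulty result.
    std::cout << "voted result = " << majority(42, 42, 17) << "\n";  // prints 42
    return 0;
}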
2. Recovery Blocks –
The recovery blocks technique is also like N-version programming, but in the recovery
blocks technique the redundant copies are generated using different algorithms. In
recovery blocks, the redundant copies are not all run concurrently; they are run one by
one. The recovery block technique can only be used where the task deadlines are longer
than the task computation time.
2. System failure :
In a system failure, the processor associated with the distributed system fails to
perform the execution. This is caused by software errors and hardware
issues. Hardware issues may involve CPU, memory, or bus failure. It is assumed
that whenever the system stops its execution due to some fault, the internal
state is lost.
Behavior –
It concerns the physical and logical units of the processor. The system
may freeze or reboot, or it may stop functioning altogether and go into an
idle state.
Recovery –
This can be cured by rebooting the system as soon as possible and
reconfiguring the failure point and the wrong state.
3. Secondary storage device failure :
A storage device failure is said to have occurred when the stored information
cannot be accessed. This failure is usually caused by a parity error, a head crash,
or dirt particles settled on the medium.
Behavior –
Stored information cannot be accessed.
Errors causing failure –
Parity error, head crash, etc.
Recovery/Design strategies –
Reconstruct the content from the archive and the log of activities, and design a
mirrored disk system. A system failure can additionally be classified as
follows:
An amnesia failure
A partial amnesia failure
A pause failure
A halting failure
4. Communication medium failure :
A communication medium failure happens when one site cannot communicate
with another operational site in the network. It is typically caused by the
failure of the switching nodes and/or the links of the communication system.
Behavior –
A site cannot communicate with another operational site.
Errors/Faults –
Failure of switching nodes or communication links.
Recovery/Design strategies –
Rerouting, error-resistant communication protocols.
Failure Models:
1. Timing failure:
A timing failure, also known as a performance failure, occurs when a node in a
system correctly sends a response, but the response arrives earlier or later than
anticipated.
2. Response failure:
A response failure occurs when a server’s response is flawed: the value of the
response may be wrong, or the response may be delivered through the wrong
control flow.
3. Omission failure:
An omission failure, sometimes described as an “infinitely late” timing failure,
occurs when the node’s response never appears to have been sent.
4. Crash failure:
If a node encounters an omission failure once and then totally stops responding
and goes unresponsive, this is known as a crash failure.
5. Arbitrary failure :
A server may produce arbitrary response at arbitrary times.
Architecture of CUDA
Each thread “knows” the x and y coordinates of the block it is in, and its
coordinates within that block.
These positions can be used to calculate a unique thread ID for each thread.
The computational work done depends on the value of the thread ID; for
example, the thread ID may determine which matrix element (or group of
elements) the thread operates on, as in the sketch below.
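A minimal CUDA sketch of this idea (an invented example, not code from the original notes; the kernel name, matrix layout, and launch parameters are assumptions):

// Each thread computes a unique global ID from its block coordinates and its
// position within the block, and uses that ID to pick the matrix element it scales.
__global__ void scaleMatrix(float *m, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the grid
    if (col < width && row < height) {
        int tid = row * width + col;                   // unique thread/element ID
        m[tid] *= factor;                              // work depends on the thread ID
    }
}

// Host-side launch, assuming d_m already holds the matrix on the device:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);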
CUDA Applications
10. Research
11. Safety and security
12. Tools and management
Benefits of CUDA
There are several advantages that give CUDA an edge over traditional general-
purpose GPU (GPGPU) computing with graphics APIs:
Unified memory (CUDA 6.0 or later) and unified virtual memory (CUDA 4.0 or
later).
Shared memory – CUDA exposes a fast region of on-chip memory that can be
shared among the threads of a block. It can be used as a user-managed cache
and provides more bandwidth than texture lookups (see the sketch after this
list).
Scattered reads – code can read from arbitrary addresses in memory.
Improved performance on downloads and readbacks, both to and from the
GPU.
Full support for bitwise and integer operations.
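A hedged sketch of the shared-memory benefit mentioned above (illustrative only; the kernel assumes it is launched with 256 threads per block, and all names are invented). Each block stages its slice of the input in fast __shared__ memory and reduces it there, touching global memory only once per element:

__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                        // fast per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // one read from global memory
    __syncthreads();
    // Tree reduction inside shared memory (blockDim.x must be a power of two, here 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = buf[0];   // one partial sum per block
}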
Limitations of CUDA
CUDA source code is provided on the host machine or GPU, as defined by the
C++ syntax rules. Older versions of CUDA used C syntax rules, which means
that up-to-date CUDA source code may or may not work with them as required.
CUDA has one-way interoperability (the ability of computer systems or
software to exchange and make use of information) with rendering languages
such as OpenGL: OpenGL can access CUDA-registered memory, but CUDA
cannot access OpenGL memory.
Later versions of CUDA do not provide emulators or fallback support for
older versions.
CUDA supports only NVIDIA hardware.
What is GPU Programming?
While the past GPUs were designed exclusively for computer graphics, today they are
being used extensively for general-purpose computing (GPGPU computing) as well. In
addition to graphical rendering, GPU-driven parallel computing is used for scientific
modelling, machine learning, and other parallelization-prone jobs today.
Heterogeneous computing
Heterogeneous computing refers to systems that use more than one kind of
processor or core. These systems gain performance or energy efficiency not just by
adding the same type of processors, but by adding dissimilar coprocessors, usually
incorporating specialized processing capabilities to handle particular tasks.
Heterogeneous DSM:
In a heterogeneous computing environment, applications can take advantage of
the best of several computing architectures. Heterogeneity is typically desired in
distributed systems. With such a heterogeneous DSM system, memory sharing
between machines with different architectures will be conceivable. The two
major issues in building heterogeneous DSM are :
(i) Data Compatibility and conversion
(ii) Block Size Selection
Data compatibility & conversion:
Data compatibility and conversion is the first design concern in a heterogeneous
DSM system. Machines with different architectures may use different byte
orderings and floating-point representations. Data that is sent from one machine
to another must be converted to the destination machine’s format, and the data
transmission unit (block) must be transformed according to the data type of its
contents. As a result, application programmers must be involved, because they
are familiar with the memory layout. In heterogeneous DSM systems, data
conversion can be accomplished by organizing the system as a collection of
source language objects or by allowing only one type of data block.
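As a hedged illustration of the byte-ordering problem just described (not from the original text; the function names are invented), a 32-bit value written by a little-endian machine must have its bytes swapped before a big-endian machine can interpret it, and a DSM block known to contain 32-bit integers could be converted word by word when it migrates:

#include <cstdint>

// Reverse the byte order of a 32-bit word (little-endian <-> big-endian).
uint32_t swap_bytes(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// Convert a migrating block whose contents are known to be 32-bit integers.
void convert_block(uint32_t *block, int n_words) {
    for (int i = 0; i < n_words; ++i)
        block[i] = swap_bytes(block[i]);
}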
DSM as a collection of source language objects:
The DSM is structured as a collection of source language objects, according
to the first technique of data conversion. The unit of data migration in this
situation is either a shared variable or an object. Conversion procedures can
be used directly by the compiler to translate between different machine
architectures. The DSM system checks whether the requesting node and the
node that has the object are compatible before accessing remote objects or
variables. If the nodes are incompatible, it invokes a conversion routine,
translates, and migrates the shared variable or object.
This approach is employed in the Agora shared memory system, and while it is
handy for data conversion, it has low performance. Scalars, arrays, and
structures are the objects of programming languages. Each of them requires
access rights, and migration involves communication overhead. Due to the
limited packet size of transport protocols, access to big arrays may result in
false sharing and thrashing, while migration would entail fragmentation and
reassembly.
DSM as one type of data block:
Only one type of data block is allowed in the second data conversion
procedure. Mermaid DSM uses this approach, with a page size equal to the
block size. Additional information is kept in the page table entry, such as the
type of data preserved in the page and the amount of data allocated to the
page. The system converts the page to an appropriate format whenever the
page moves between machines with different architectures.
Interconnection Network/topologies:
o Bus networks − A bus network is composed of a number of bit lines onto
which a number of resources are attached. When busses use the same
physical lines for data and addresses, the data and the address lines are
time multiplexed. When there are multiple bus-masters attached to the
bus, an arbiter is required.
o Multistage networks − A multistage network consists of multiple stages
of switches. It is composed of a×b switches which are connected using a
particular interstage connection (ISC) pattern. Small 2×2 switch elements
are a common choice for many multistage networks. The number of
stages determines the delay of the network. By choosing different
interstage connection patterns, various types of multistage networks can
be created.
o Crossbar switches − A crossbar switch contains a matrix of simple
switch elements that can be switched on and off to create or break a
connection. By turning on a switch element in the matrix, a connection
between a processor and a memory can be made. Crossbar switches are
non-blocking; that is, all communication permutations can be performed
without blocking.
If the main concern is the routing distance, then the dimension has to be maximized and
a hypercube made; this assumes, as in store-and-forward routing, that the degree of the
switch and the number of links are not a significant cost factor. If the number of links or
the switch degree is the main cost, the dimension has to be minimized and a mesh built.
In the worst-case traffic pattern for each network, it is preferable to have high-
dimensional networks where all the paths are short. In patterns where each node is
communicating with only one or two nearby neighbors, it is preferable to have low-
dimensional networks, since only a few of the dimensions are actually used.
Routing
The routing algorithm of a network determines which of the possible paths from source
to destination are used as routes and how the route followed by each particular packet is
determined. Dimension-order routing limits the set of legal paths so that there is exactly
one route from each source to each destination: the one obtained by first traveling the
correct distance in the high-order dimension, then the next dimension, and so on.
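A hedged sketch of dimension-order routing on a 2-D mesh (illustrative only; the coordinate representation and function name are assumptions, not from the text). The packet first travels the full distance in the higher-order dimension, then in the next one, so exactly one path exists between each source and destination:

#include <utility>
#include <vector>

// Returns the sequence of nodes visited when routing from src to dst,
// correcting the x (high-order) dimension first, then the y dimension.
std::vector<std::pair<int, int>> dimensionOrderRoute(std::pair<int, int> src,
                                                     std::pair<int, int> dst) {
    std::vector<std::pair<int, int>> path{src};
    std::pair<int, int> cur = src;
    while (cur.first != dst.first) {            // travel in dimension x first
        cur.first += (dst.first > cur.first) ? 1 : -1;
        path.push_back(cur);
    }
    while (cur.second != dst.second) {          // then travel in dimension y
        cur.second += (dst.second > cur.second) ? 1 : -1;
        path.push_back(cur);
    }
    return path;
}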
Routing Mechanisms
Arithmetic, source-based port select, and table look-up are three mechanisms that high-
speed switches use to determine the output channel from information in the packet
header. All of these mechanisms are simpler than the kind of general routing
computations implemented in traditional LAN and WAN routers. In parallel computer
networks, the switch needs to make the routing decision for all its inputs in every cycle,
so the mechanism needs to be simple and fast.
Deterministic Routing
A routing algorithm is deterministic if the route taken by a message is determined
exclusively by its source and destination, and not by other traffic in the network. If a
routing algorithm only selects shortest paths toward the destination, it is minimal,
otherwise it is non-minimal.
Deadlock Freedom
Deadlock can occur in various situations. When two nodes attempt to send data to
each other and each begins sending before either receives, a ‘head-on’ deadlock may
occur. Another case of deadlock occurs when there are multiple messages competing
for resources within the network.
The basic technique for proving that a network is deadlock-free is to identify the
dependencies that can occur between channels as a result of messages moving through
the network, and to show that there are no cycles in the overall channel dependency
graph; hence there is no traffic pattern that can lead to a deadlock. The common way of
doing this is to number the channel resources such that all routes follow a particular
increasing or decreasing sequence, so that no dependency cycles arise.
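The cycle-freedom argument above can be checked mechanically. The following hedged sketch (an invented helper, not from the text) represents channels as numbered nodes of a dependency graph, where deps[c] lists the channels a packet may wait for while holding channel c, and reports whether any dependency cycle exists:

#include <vector>

// Depth-first search over the channel dependency graph.
// state: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
bool hasCycleFrom(int c, const std::vector<std::vector<int>> &deps,
                  std::vector<int> &state) {
    state[c] = 1;
    for (int next : deps[c]) {
        if (state[next] == 1) return true;      // back edge: a dependency cycle
        if (state[next] == 0 && hasCycleFrom(next, deps, state)) return true;
    }
    state[c] = 2;
    return false;
}

bool deadlockFree(const std::vector<std::vector<int>> &deps) {
    std::vector<int> state(deps.size(), 0);
    for (int c = 0; c < static_cast<int>(deps.size()); ++c)
        if (state[c] == 0 && hasCycleFrom(c, deps, state)) return false;
    return true;   // no cycles in the channel dependency graph
}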
Switch Design
The design of a network depends on the design of the switch and how the switches are
wired together. The degree of the switch, its internal routing mechanisms, and its
internal buffering determine what topologies can be supported and what routing
algorithms can be implemented. Like any other hardware component of a computer
system, a network switch contains a data path, control, and storage.
Ports
The total number of pins is actually the total number of input and output ports times the
channel width. As the perimeter of the chip grows slowly compared to the area,
switches tend to be pin limited.
Channel Buffers
The organization of the buffer storage within the switch has an important impact on the
switch performance. Traditional routers and switches tend to have large SRAM or
DRAM buffers external to the switch fabric, while in VLSI switches the buffering is
internal to the switch and comes out of the same silicon budget as the data path and the
control section. As chip size and density increase, more buffering is available and
the network designer has more options, but the buffer real estate still comes at a
premium and its organization is important.
Flow Control
When multiple data flows in the network attempt to use the same shared network
resources at the same time, some action must be taken to control these flows. If we
don’t want to lose any data, some of the flows must be blocked while others proceed.
The problem of flow control arises in all networks and at many levels, but it is
qualitatively different in parallel computer networks than in local and wide area
networks. In parallel computers, the network traffic needs to be delivered about as
accurately as traffic across a bus, and there is a very large number of parallel flows on
a very small time scale.
Protect against DDoS attacks: The load balancer can detect and drop
distributed denial-of-service (DDoS) traffic before it reaches your site.
Performance: Load balancers can reduce the load on your web servers
and optimize traffic for a better client experience.
SSL Offload: Terminating SSL (Secure Sockets Layer) traffic on the load
balancer removes that overhead from the web servers, making additional
resources available for your web application.
Traffic Compression: A load balancer can compress site traffic, giving your
clients a much better experience with your site.
Round Robin
Least Connections
Least Time
Hash
IP Hash
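As an illustration only (not part of the original notes; the data structure and names are invented), here is a minimal sketch of two of the strategies listed above, round robin and least connections, choosing a backend server:

#include <cstddef>
#include <string>
#include <vector>

struct Server {
    std::string name;
    int active_connections;
};

// Round robin: hand out servers in a fixed rotating order.
const Server &roundRobin(const std::vector<Server> &servers, std::size_t &next) {
    const Server &chosen = servers[next % servers.size()];
    ++next;
    return chosen;
}

// Least connections: pick the server currently handling the fewest requests.
const Server &leastConnections(const std::vector<Server> &servers) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < servers.size(); ++i)
        if (servers[i].active_connections < servers[best].active_connections)
            best = i;
    return servers[best];
}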
Classes of Load Balancing Algorithms:
Following are some of the different classes of load balancing algorithms.
Static: In this model, if any node is found to have a heavy load, a task can be
picked at random and moved to some other randomly chosen node.
Dynamic: This class uses the current state information for load balancing.
These algorithms perform better than static algorithms.
Deterministic: These algorithms use processor and process characteristics to
allocate processes to nodes.
Centralized: The system state information is collected by a single node.
Migration Models:
Code section
Resource section
Execution section
The techniques that are used for scheduling the processes in distributed
systems are as follows:
1. Task Assignment Approach: In the Task Assignment Approach, the user-
submitted process is composed of multiple related tasks which are
then selects the one with the least load, then it is not considered a good
approach because it leads to poor scalability, as it will not work well for a
system having many nodes. The reason is that the inquirer receives a great
many replies almost simultaneously, and the processing time spent on reply
messages becomes too long for node selection as the number of nodes (N)
increases. A straightforward alternative is to examine only m of the N nodes.
A good scheduling algorithm must have fairness of service, because in an
attempt to balance the workload on all nodes of the system there might be a
possibility that nodes with more load get more benefit as compared to nodes
with less load, which suffer from poorer response times than stand-alone
systems. Hence, the solution lies in the concept of load sharing, in which a
node shares some of its resources as long as its users are not affected.
The Load Balancing approach refers to the division of load among the
processing elements of a distributed system. The excess load of one
processing element is distributed to other processing elements that have less
load according to the defined limits. In other words, the load is maintained at
each processing element in such a manner that neither it gets overloaded nor
idle during the execution of a program to maximize the system throughput
which is the ultimate goal of distributed systems. This approach makes all
processing elements equally busy, thus speeding up the entire task and leading
to the completion of the task by all processors at approximately the same time.
Migration limit policy: Determines the limit value for the migration of
processes.
Issues Related to Load Balancing in
Distributed System
A distributed system is a set of computers joined by some sort of
communication network, each of which has its database system and users may
access data from any spot on the network, necessitating the availability of data
at each site. For example, if you want to withdraw money from an ATM, you
can go to any ATM (even an ATM of another bank) and swipe your card. The money
will be debited from your account and it will be reflected in your account. It
doesn’t matter whether you take money from an ATM or transfer it to someone by
net banking: internally all of these things are connected to each other and work
as a single unit, although in real life we see them as distributed.
Load Balancers:
1. Performance Degradation:
Load balancing may lead to performance degradation when load balancers assign
equivalent or predetermined weights to diverse resources, which can result in poor
performance in terms of speed and cost. Therefore, there is a need for effective load
balancers which balance load depending upon the type of resources.
2. Job Selection:
It deals with the issue of job selection: whenever jobs are assigned to resources
through load balancers, there should be an optimal algorithm to decide the order of
the jobs and which jobs should be given to which servers for the system to work
efficiently.
Processor consistency
In order for consistency in data to be maintained and to attain
scalable processor systems where every processor has its own memory,
the processor consistency model was derived. All processors need to be
consistent in the order in which they see writes done by one processor and
in the way they see writes by different processors to the same location
(coherence is maintained). However, they do not need to be consistent
when the writes are by different processors to different locations.
Every write operation can be divided into several sub-writes to all
memories. A read from one such memory can happen before the write to
this memory completes. Therefore, the data read can be stale. Thus, a
processor under PC can execute a younger load when an older store
needs to be stalled. Read before write, read after read and write before
write ordering is still preserved in this model.
The processor consistency model is similar to PRAM consistency model
with a stronger condition that defines all writes to the same memory
location must be seen in the same sequential order by all other processes.
Processor consistency is weaker than sequential consistency but stronger
than PRAM consistency model.
Cache consistency
Cache consistency requires that all write operations to the same memory
location are performed in some sequential order. Cache consistency is
weaker than processor consistency and incomparable with PRAM
consistency.
Release consistency
The release consistency model relaxes the weak consistency model by
distinguishing the entrance synchronization operation from the exit
synchronization operation. Under weak ordering, when a synchronization
operation is to be seen, all operations in all processors need to be visible
before the synchronization operation is done and the processor proceeds.
However, under release consistency model, during the entry to a critical
section, termed as "acquire", all operations with respect to the local
memory variables need to be completed. During the exit, termed as
"release", all changes made by the local processor should be propagated
to all other processors. Coherence is still maintained.
The acquire operation is a load/read that is performed to access the critical
section. A release operation is a store/write performed to allow other
processors to use the shared variables.
Among synchronization variables, sequential consistency or processor
consistency can be maintained. Using SC, all competing synchronization
variables should be processed in order. However, with PC, a pair of
competing variables need to only follow this order. Younger acquires can
be allowed to happen before older releases.
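A hedged C++ illustration of the acquire/release idea described above (not from the original text; names are invented). All writes made before the "release" store become visible to the other thread once its "acquire" load observes that store, much like the release at the exit of a critical section publishing the changes made inside it:

#include <atomic>
#include <cassert>
#include <thread>

int shared_data = 0;
std::atomic<bool> ready{false};

void producer() {
    shared_data = 42;                                 // ordinary write before the release
    ready.store(true, std::memory_order_release);     // "release": publish all earlier writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // "acquire": wait for the release
    assert(shared_data == 42);                        // guaranteed to see the published value
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}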
Entry consistency
This is a variant of the release consistency model. It also requires the use
of acquire and release instructions to explicitly state an entry or exit to a
critical section. However, under entry consistency, every shared variable is
assigned a synchronization variable specific to it. This way, only when the
acquire is to variable x, all operations related to x need to be completed
with respect to that processor. This allows concurrent operations of
different critical sections of different shared variables to occur. Concurrency
cannot be seen for critical operations on the same shared variable. Such a
consistency model will be useful when different matrix elements can be
processed at the same time.
Local consistency
In local consistency, each process performs its own operations in the order
defined by its program. There is no constraint on the ordering in which the
write operations of other processes appear to be performed. Local
consistency is the weakest consistency model in shared memory systems.
General consistency
In general consistency, all the copies of a memory location are eventually
identical after all processes' writes are completed.
Eventual consistency
An eventual consistency is a weak consistency model in the system with
the lack of simultaneous updates. It defines that if no update takes a very
long time, all replicas eventually become consistent.
Most shared decentralized databases have an eventual consistency model,
either BASE (basically available; soft state; eventually consistent) or a
combination of ACID and BASE sometimes called SALT (sequential; agreed;
ledgered; tamper-resistant, and also symmetric; admin-free; ledgered; and
time-consensual).
4. Cost per bit: As we move from bottom to top in the Hierarchy, the cost per
bit increases i.e. Internal Memory is costlier than External Memory.
According to the memory Hierarchy, the system supported memory
standards are defined below:
Level 1: Register – implemented as multi-port registers; bandwidth 20,000 to 100,000
MB/s; managed by the compiler.
Level 2: Cache – implemented on-chip (SRAM); bandwidth 5,000 to 15,000 MB/s;
managed by hardware.
Level 3: Main Memory – implemented as DRAM (capacitor memory); bandwidth 1,000
to 5,000 MB/s; managed by the operating system.
Level 4: Secondary Memory – magnetic storage; bandwidth 20 to 150 MB/s; managed
by the operating system.
MPI defines useful syntax for routines and libraries in programming languages
including Fortran, C, C++ and Java.
Some organizations are also able to offload MPI to make their programming
models and libraries faster.
Color. This assigns a color to a process, and all processes with the same color
are located in the same communicator. A command related to color
includes MPE_Make_color_array, which changes the available colors.
Derived data types. MPI functions need a specification of the type of data that is
sent between processes. Predefined constants such as MPI_INT, MPI_CHAR and
MPI_DOUBLE describe the common data types.
Collective basics. These are collective functions that need communication among
all processes in a process group. MPI_Bcast is an example of such, which sends
data from one node to all processes in a process group.
One-sided. This term typically refers to a form of communication operations,
including MPI_Put, MPI_Get and MPI_Accumulate. They refer, respectively, to
writing to memory, reading from memory, and performing a reduction operation
on the same memory across tasks.
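A minimal MPI sketch (an illustrative example, not taken from the notes; compile with an MPI compiler wrapper such as mpicxx and run with mpirun) showing the collective MPI_Bcast described above, which sends data from one root process to every process in the communicator:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) value = 123;          // only the root holds the data initially

    // Every process calls MPI_Bcast; afterwards all of them hold the root's value.
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::printf("process %d sees value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}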
Differences
The major differences between the shared memory and message passing models are:
Shared memory: It provides a region of memory for data communication. The code
that reads or writes the shared data must be written explicitly by the application
programmer, who must also make sure that processes are not writing to the same
location simultaneously.
Message passing: The message passing facility itself is used for communication, so
no such code is required; it provides a mechanism for communication and
synchronization of the actions performed by the communicating processes. Message
passing is useful for sharing small amounts of data, so that conflicts need not occur.
Concurrent Processing
The easy availability of computers along with the growth of Internet has
changed the way we store and process data. We are living in a day and
age where data is available in abundance. Every day we deal with huge
volumes of data that require complex computing and that too, in quick time.
Sometimes, we need to fetch data from similar or interrelated events that
occur simultaneously. This is where we require concurrent
processing, which can divide a complex task and process it on multiple systems
to produce the output in quick time.
Concurrent processing is essential where the task involves processing a
huge bulk of complex data. Examples include − accessing large databases,
aircraft testing, astronomical calculations, atomic and nuclear physics,
biomedical analysis, economic planning, image processing, robotics,
weather forecasting, web-based services, etc.
What is Parallelism?
Parallelism is the process of processing several sets of instructions
simultaneously. It reduces the total computational time. Parallelism can be
implemented by using parallel computers, i.e. computers with many
processors. Parallel computers require parallel algorithms, programming
languages, compilers and operating systems that support multitasking.
In this tutorial, we will discuss only about parallel algorithms. Before
moving further, let us first discuss about algorithms and their types.
What is an Algorithm?
An algorithm is a sequence of instructions followed to solve a problem.
While designing an algorithm, we should consider the architecture of the
computer on which the algorithm will be executed. As per the architecture,
there are two types of computers −
Sequential Computer
Parallel Computer
Depending on the architecture of computers, we have two types of
algorithms −
Sequential Algorithm − An algorithm in which some consecutive
steps of instructions are executed in a chronological order to solve a
problem.
Flynn’s taxonomy:
MIMD/SIMD (models of computing)
Parallel computing is a computing where the jobs are broken into
discrete parts that can be executed concurrently. Each part is further
broken down to a series of instructions. Instructions from each part
execute simultaneously on different CPUs. Parallel systems deal with the
simultaneous use of multiple computer resources that can include a single
computer with multiple processors, a number of computers connected by
a network to form a parallel processing cluster or a combination of both.
Parallel systems are more difficult to program than computers with a
single processor because the architecture of parallel computers varies
accordingly and the processes of multiple CPUs must be coordinated and
synchronized.
The crux of parallel processing is the CPU. Based on the number
of instruction streams and data streams that can be processed simultaneously,
computing systems are classified into four major categories:
Flynn’s classification –
1. Single-instruction, single-data (SISD) systems –
An SISD computing system is a uniprocessor machine which is
capable of executing a single instruction, operating on a single data
stream. In SISD, machine instructions are processed in a sequential
manner and computers adopting this model are popularly called
sequential computers. Most conventional computers have SISD
architecture. All the instructions and data to be processed have to be
stored in primary memory.
Example Z = sin(x)+cos(x)+tan(x)
The system performs different operations on the same data set.
Machines built using the MISD model are not useful in most
applications; a few machines have been built, but none of them are
available commercially.
4. Multiple-instruction, multiple-data (MIMD) systems –
An MIMD system is a multiprocessor machine which is capable of
executing multiple instructions on multiple data sets. Each PE in the
MIMD model has separate instruction and data streams; therefore,
machines built using this model are capable of handling any kind of
application. Unlike SIMD and MISD machines, PEs in MIMD machines
work asynchronously.
In the shared-memory MIMD model (tightly coupled multiprocessor
systems), all the PEs are connected to a single global memory and
they all have access to it. The communication between PEs in this
model takes place through the shared memory; modification of the data
stored in the global memory by one PE is visible to all other PEs.
Dominant representative shared memory MIMD systems are Silicon
Graphics machines and Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled
multiprocessor systems) all PEs have a local memory. The
communication between PEs in this model takes place through the
interconnection network (the inter process communication channel, or
IPC). The network connecting PEs can be configured to tree, mesh or
in accordance with the requirement.
The shared-memory MIMD architecture is easier to program but is less
tolerant to failures and harder to extend with respect to the distributed
memory MIMD model. Failures in a shared-memory MIMD affect the
entire system, whereas this is not the case of the distributed model, in
which each of the PEs can be easily isolated. Moreover, shared
memory MIMD architectures are less likely to scale because the
addition of more PEs leads to memory contention. This is a situation
that does not happen in the case of distributed memory, in which each
PE has its own memory. As a result of practical outcomes and user’s
requirement, distributed memory MIMD architecture is superior to the
other existing models.
Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to
reduce the computation time of an algorithm. Thus, evaluating the
execution time of an algorithm is extremely important in analyzing
its efficiency.
Execution time is measured on the basis of the time taken by the
algorithm to solve a problem. The total execution time is
calculated from the moment when the algorithm starts executing
to the moment it stops. If all the processors do not start or end
execution at the same time, then the total execution time of the
algorithm spans from the moment when the first processor starts its
execution to the moment when the last processor stops its
execution.
Time complexity of an algorithm can be classified into three
categories−
Worst-case complexity − When the amount of time
required by an algorithm for a given input is maximum.
Average-case complexity − When the amount of time
required by an algorithm for a given input is average.
Best-case complexity − When the amount of time required
by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of
steps executed by the algorithm to get the desired output.
Asymptotic analysis is done to calculate the complexity of an
algorithm in its theoretical analysis. In asymptotic analysis, a large
length of input is used to calculate the complexity function of the
algorithm.
Note − Asymptotic describes a condition in which a line tends to meet a
curve, but they do not intersect. Here the line and the curve are
asymptotic to each other.
Speedup of an Algorithm
The performance of a parallel algorithm is determined by
calculating its speedup. Speedup is defined as the ratio of the
worst-case execution time of the fastest known sequential
algorithm for a particular problem to the worst-case execution
time of the parallel algorithm.
speedup = (worst-case execution time of the fastest known sequential algorithm
for a particular problem) / (worst-case execution time of the parallel algorithm)
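As a hedged worked example (the numbers are invented for illustration): if the fastest known sequential algorithm for a problem takes 80 seconds in the worst case and the parallel algorithm takes 20 seconds in the worst case, the speedup is 80 / 20 = 4.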
Here, problems are divided into atomic tasks and implemented as a graph.
Each task is an independent unit of work that has dependencies on one or
more antecedent tasks. After the completion of a task, the output of an
antecedent task is passed to the dependent task. A task with antecedent
tasks starts execution only when all of its antecedent tasks are completed.
The final output of the graph is received when the last dependent task is
completed (Task 6 in the above figure).
Master-Slave Model
In the master-slave model, one or more master processes generate tasks and
allocate them to slave processes. The tasks may be allocated beforehand if −
the master can estimate the volume of the tasks, or
a random assignment can do a satisfactory job of balancing the load, or
slaves are assigned smaller pieces of the task at different times.
This model is generally equally suitable to shared-address-
space or message-passing paradigms, since the interaction is naturally
two-way.
In some cases, a task may need to be completed in phases, and the task in
each phase must be completed before the tasks in the next phase can be
generated. The master-slave model can be generalized
to a hierarchical or multi-level master-slave model in which the top-level
master feeds a large portion of the tasks to the second-level masters, who
further subdivide the tasks among their own slaves and may perform a part of
the task themselves.
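A hedged sketch of the master-slave (master-worker) model using C++ threads (illustrative only; the names and task contents are invented). The master pushes tasks into a shared queue, and the worker ("slave") threads repeatedly take the next task and execute it:

#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<int> tasks;     // tasks generated by the master
std::mutex queue_mutex;    // protects the shared task queue

void worker(int id) {
    while (true) {
        int task;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (tasks.empty()) return;   // no work left
            task = tasks.front();
            tasks.pop();
        }
        // A real slave would do the actual computation for the task here.
        std::cout << "worker " << id << " processed task " << task << "\n";
    }
}

int main() {
    for (int t = 0; t < 20; ++t) tasks.push(t);   // the master generates tasks up front

    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id) workers.emplace_back(worker, id);
    for (std::thread &w : workers) w.join();
    return 0;
}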
Hybrid Models
A hybrid algorithm model is required when more than one model may be
needed to solve a problem.
A hybrid model may be composed of either multiple models applied
hierarchically or multiple models applied sequentially to different phases of
a parallel algorithm.
Example − Parallel quick sort
and Windows 2000, and JavaTM threads as part of the standard JavaTM
Development Kit (JDK).
Distributed Shared Memory (DSM) Systems − DSM systems create
an abstraction of shared memory on a loosely coupled architecture in
order to implement shared memory programming without hardware
support. They implement standard libraries and use the advanced user-
level memory management features present in modern operating
systems. Examples include the TreadMarks system, Munin, IVY, Shasta,
Brazos, and Cashmere.
Program Annotation Packages − This is implemented on the
architectures having uniform memory access characteristics. The most
notable example of program annotation packages is OpenMP.
OpenMP implements functional parallelism. It mainly focuses on
parallelization of loops.
The concept of shared memory provides low-level control of a shared
memory system, but it tends to be tedious and error-prone. It is more
applicable to system programming than application programming.
Merits of Shared Memory Programming
Global address space gives a user-friendly programming approach to
memory.
Due to the closeness of memory to CPU, data sharing among
processes is fast and uniform.
There is no need to specify distinctly the communication of data among
processes.
Process-communication overhead is negligible.
It is very easy to learn.
Demerits of Shared Memory Programming
It is not portable.
Managing data locality is very difficult.
Message Passing Model
Message passing is the most commonly used parallel programming
approach in distributed memory systems. Here, the programmer has to
determine the parallelism. In this model, all the processors have their own
local memory unit and they exchange data through a communication
network.
Multithreaded programming:
Here’s why:
Processors have reached maximum clock speed. The only way to get more
out of CPUs is with parallelism.
Using multiple threads helps you get more out of a single processor. But
then these threads need to sync their work in a shared memory. This can
be difficult to get right — and even more difficult to do without concurrency
issues.
Here are two common types of multithreading issues that can be difficult to
find with testing and debugging alone.
A data race is a type of race condition. A data race occurs when two or
more threads access shared data and attempt to modify it at the same time
— without proper synchronization.
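As a hedged example of the data race described above and the usual fix (invented code, not from the original text): two threads increment a shared counter; without the lock, increments can be lost, and taking the mutex around the update makes the accesses properly synchronized:

#include <iostream>
#include <mutex>
#include <thread>

long counter = 0;
std::mutex counter_mutex;

void increment_many() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // remove this line to reintroduce the race
        ++counter;
    }
}

int main() {
    std::thread t1(increment_many), t2(increment_many);
    t1.join();
    t2.join();
    std::cout << counter << "\n";   // with the lock, always prints 200000
    return 0;
}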
Deadlock
Deadlock occurs when multiple threads are blocked while competing for
resources. One thread is stuck waiting for a second thread, which is stuck
waiting for the first.
parallel I/O:
Parallel I/O is a subset of parallel computing that performs
multiple input/output operations simultaneously. Rather than process I/O
requests serially, one at a time, parallel I/O accesses data on disk
simultaneously. This allows a system to achieve higher write speeds and
maximizes bandwidth.
Multicore chips help give parallel computing its processing power, and
make it compatible with most currently deployed servers. In a multicore
processor, each physical core enables efficient use of resources by
managing multiple requests by one user with Multithreading.
With parallel I/O, a portion of the logical cores on the multicore chip are
dedicated to processing I/O from the virtual machines and any applications
the remaining cores service. This allows the processor to handle multiple
read and write operations concurrently. Parallel I/O helps eliminate
I/O bottlenecks, which can stop or impair the flow of data.
Currently, many applications don't utilize parallel I/O, having been designed
to use Unicore sequential processing rather than multicore. However, the
recent rise in popularity of big data analytics may signal a place for parallel
computing in business applications, which face significant I/O performance
issues.
transferring data over the network and also the rate (frequency) with which it
is sent.
Using LRPC (Lightweight Remote Procedure Call) for Cross-Domain
Messaging: LRPC (Lightweight Remote Procedure Call) facility is used in
microkernel operating systems for providing cross-domain (calling and called
processes are both on the same machine) communication. It employs
following the approaches for enhancing the performance of old systems
employing Remote Procedure Call:
Simple Control Transfer: In this approach, a control transfer procedure is
used that refers to the execution of the requested procedure by the client’s
thread in the server’s domain. It employs hand-off scheduling in which direct
context switching takes place from the client thread to the server thread.
Before the first call is made to the server, the client binds to its interface, and
afterward, it provides the server with the argument stack and its execution
thread for trapping the kernel. Now, the kernel checks the caller and creates
a call linkage, and sends off the client’s thread directly to the server which in
turn activates the server for execution. After completion of the called
procedure, control and results return through the kernel from where it is called.
Simple Data Transfer: In this approach, a shared argument stack is
employed to avoid duplicate data copying. Shared simply refers to the usage
by both the client and the server. So, in LRPC the same arguments are copied
only once from the client’s stack to the shared argument stack. It leads to cost-
effectiveness as data transfer creates few copies of data when moving from
one domain to another.
Simple Stub: Because of the above mechanisms, the generation of highly
optimized stubs is possible using LRPC. The call stub is associated with the
client’s domain, and every procedure in the server’s domain has an entry stub.
The LRPC interface for every
procedure follows a three-layered communication protocol:
From end to end: communication is carried out as defined by
calling conventions
stub to stub: requires the usage of stubs
domain-to-domain: requires kernel implementation
The benefit of using LRPC stubs is that the cost of crossing layers is reduced, as
the boundaries between them are blurred. The only requirement in a simple LRPC is
that one formal procedure call be made to the client stub and one return be made
from the server procedure and the client stub.
Design for Concurrency: For achieving high performance in terms of high
call throughput and low call latency, multiple processors are used with shared
memory. Further, throughput can be increased by getting rid of unnecessary
lock contention and reducing the utilization of shared-data structures, while
latency is lowered by decreasing the overhead of context switching.
To compare this system with other parallel processing systems, the following four
models are considered: Distributed/Splitting (D/S), Distributed/No Splitting
(D/NS), Centralized/Splitting (C/S), and Centralized/No Splitting (C/NS). In each
of these systems there are c processors; jobs are assumed to consist of a set of
independent tasks with exponentially distributed service requirements, and job
arrivals are assumed to come from a Poisson point
source. The systems differ in the way jobs queue for the processors and in the
way jobs are scheduled on the processors. The queueing of jobs for processors
is distributed if each processor has its own queue, and is centralized if there is a
common queue for all the processors. The scheduling of jobs on the processors
is 'no splitting' if the entire set of tasks composing a job is scheduled to run
sequentially on the same processor once the job is scheduled. On the other
hand, the scheduling is splitting if the tasks of a job are scheduled so that they
can be run independently and potentially in parallel on different processors. In
the splitting case a job is completed only when all of its tasks have finished
execution.
In our study we compare the mean response time of jobs in each of the systems
for differing values of the number of processors, number of tasks per job, server
utilization, and certain overheads associated with splitting up a job.
The M^X/M/c system studied in the first part of the paper corresponds to the C/S
system. In this system, as processors become free they serve the first task in the
queue. The D/S and D/NS systems are studied in another paper; we use the
approximate analysis of the D/S system and the exact analysis of the D/NS
system given in that paper. For systems with 32 processors or fewer, the relative
error in the approximation for the D/S system was found to be less than 5
percent. In the D/NS system, jobs are assigned to processors with equal
probabilities. The approximation we use for the mean job response time of the
C/NS system is taken from the literature; although an extensive error analysis for
this system over all parameter ranges has not been carried out, the largest
relative error reported for the M/E2/10 system is about 0.1 percent.
For all values of utilization, ρ, our results show that the splitting systems yield
lower mean job response time than the no splitting systems. This follows from the
fact that, in the splitting case, work is distributed over all the processors. For
any ρ, the lowest (highest) mean job response time is achieved by the C/S
system (the D/NS system). The relative performance of the D/S system and the
C/NS system depends on the value of ρ. For small ρ, the parallelism achieved by
splitting jobs into parallel tasks in the D/S system reduces its mean job response
time as compared to the C/NS system, where tasks of the same job are executed
sequentially. However, for high ρ, the C/NS system has lower mean job response
time than the D/S system. This is due to the long synchronization delay incurred
in the D/S system at high utilizations.
We also consider problems associated with partitioning the processors into two
sets, each dedicated to one of two classes of jobs: edit jobs and batch
jobs. Edit jobs are assumed to consist of simple operations that have no inherent
parallelism and thus consist of only one task. Batch jobs, on the other hand, are
assumed to be inherently parallel and can be broken up into tasks. All tasks from
either class are assumed to have the same service requirements. A number of
interesting phenomena are observed. For example, when half the jobs are edit
jobs, the mean job response time for both classes of jobs increases if one
processor is allocated to edit jobs. Improvement to edit jobs, at a cost of
increasing the mean job response time of batch jobs, results only when the
number of processors allocated to edit jobs is increased to two. This, and other
results, suggest that it is desirable for parallel processing systems to have a
controllable boundary for processor partitioning.
Process interaction
Process interaction relates to the mechanisms by which parallel processes
are able to communicate with each other. The most common forms of
interaction are shared memory and message passing, but interaction can
also be implicit (invisible to the programmer).
Shared memory
Shared memory is an efficient means of passing data between processes.
In a shared-memory model, parallel processes share a global address
space that they read and write to asynchronously. Asynchronous
concurrent access can lead to race conditions, and mechanisms such
as locks, semaphores and monitors can be used to avoid these.
Conventional multi-core processors directly support shared memory, which
many parallel programming languages and libraries, such as Cilk, OpenMP
and Threading Building Blocks, are designed to exploit.
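As a minimal sketch of the shared-memory model (assuming C with OpenMP, e.g. compiled with gcc -fopenmp; the variable names and the toy problem are illustrative only), the loop below lets several threads update a shared accumulator safely by using a reduction clause instead of an explicit lock:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;            /* shared variable in the global address space */

    /* Each thread accumulates a private partial sum; OpenMP combines
       the partial sums at the end, avoiding a race condition on 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1);
    }

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

The same effect could be obtained with a lock or an atomic update, at a higher synchronization cost.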
Message passing
In a message-passing model, parallel processes exchange data through
passing messages to one another. These communications can be
asynchronous, where a message can be sent before the receiver is ready,
or synchronous, where the receiver must be ready. The Communicating
sequential processes (CSP) formalization of message passing uses
synchronous communication channels to connect processes, and led to
important languages such as Occam, Limbo and Go. In contrast, the actor
model uses asynchronous message passing and has been employed in the
design of languages such as D, Scala and SALSA.
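As an illustrative message-passing sketch in C using MPI (the library discussed later in this document; compile with mpicc and run with, for example, mpirun -np 2), rank 0 sends an integer to rank 1 with a blocking point-to-point exchange; the value and the message tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the send buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Non-blocking variants such as MPI_Isend/MPI_Irecv correspond to the asynchronous style, in which a message can be posted before the receiver is ready.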
Problem decomposition
A parallel program is composed of simultaneously executing processes.
Problem decomposition relates to the way in which the constituent
processes are formulated.
Task parallelism
A task-parallel model focuses on processes, or threads of execution. These
processes will often be behaviorally distinct, which emphasizes the need
for communication. Task parallelism is a natural way to express message-
passing communication. In Flynn's taxonomy, task parallelism is usually
classified as MIMD/MPMD or MISD.
Data parallelism
A data-parallel model focuses on performing operations on a data set,
typically a regularly structured array. A set of tasks will operate on this data,
but independently on disjoint partitions. In Flynn's taxonomy, data
parallelism is usually classified as MIMD/SPMD or SIMD.
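A minimal data-parallel sketch in C with OpenMP (the framework choice is an assumption; any data-parallel model would do): every iteration applies the same operation to a different element of the arrays, so the iterations are independent and can be divided among threads:

#include <stdio.h>

#define N 8

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data parallelism: the same operation is applied independently
       to disjoint partitions of the data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    for (int i = 0; i < N; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}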
Implicit parallelism
As with implicit process interaction, an implicit model of parallelism reveals
nothing to the programmer as the compiler, the runtime or the hardware is
responsible. For example, in compilers, automatic parallelization is the
process of transforming sequential code into parallel code.
Terminology
Parallel programming models are closely related to models of computation.
A model of parallel computation is an abstraction used to analyze the cost
of computational processes, but it does not necessarily need to be practical,
in the sense that it need not be efficiently implementable in hardware and/or
software. A programming model, in contrast, does specifically imply the
practical considerations of hardware and software implementation.
A parallel programming language may be based on one or a combination of
programming models. For example, High Performance Fortran is based on
shared-memory interactions and data-parallel problem decomposition,
and Go provides mechanisms for shared-memory and message-passing
interaction.
Scalability is the capacity of a system to adapt its performance and cost to
changes in application and system processing demands. The architecture used to
build services, networks, and processes is scalable when it continues to meet
these demands as they change.
Scalability is basically a measure of how well the system responds to the
addition or removal of resources to meet our requirements. That is why we do a
requirement analysis of the system in the first phase of the SDLC, to make sure
the system is adaptable and scalable.
Measures of Scalability:
Size Scalability
Geographical Scalability
Administrative Scalability
1. Size Scalability: The size of the system will increase as users and resources
grow, but this growth should not come at the cost of the performance and
efficiency of the system. The system must respond to the user in the same
manner as it did before scaling.
2. Geographical Scalability: Geographical scalability means that adding nodes
that are physically far apart should not significantly affect the communication
time between the nodes.
3. Administrative Scalability: In administrative scalability, adding new nodes to
the system should not require significant additional management effort.
Types of Scalability: systems are commonly scaled vertically (scaling up, by
adding resources to an existing node) or horizontally (scaling out, by adding
more nodes).
Parallel storage systems:
A parallel storage file system is a sort of clustered file system. A clustered file
system is a storage system shared by multiple devices simultaneously.
In a parallel file system, data is spread amongst several storage nodes for
redundancy and performance. The file system's storage is built from the storage
devices of multiple servers: when the file system receives data, it breaks it into
blocks and distributes the blocks across several storage nodes.
Parallel file systems also replicate data on physically distinct nodes, which
provides redundancy and makes the system fault-tolerant, while the distribution
of data improves the system's performance.
In other words, the parallel file system breaks data into blocks and distributes
the blocks to multiple storage servers. It uses a global namespace to enable data
access, and data is written and read over multiple input/output (I/O) paths.
Common parallel file systems include:
BeeGFS
Lustre
PanFS (Panasas)
OrangeFS
Distributed storage systems are also called network file systems. These systems
share access to the same storage using network protocols, and they control
access to the file system based on access lists and the capabilities of the server
and client systems. They allow files to be accessed using the same interfaces as
local files.
A parallel file system is a kind of distributed file system; both share data
amongst multiple servers. Common distributed file systems include:
Windows DFS
Infinit
Alluxio
ObjectiveFS
JuiceFS
MapR FS
A distributed file system should satisfy the following requirements:
The clients should access distributed files as they would access local
files, and they should not be aware of the file distribution.
The client system and program should function correctly even when a
server failure occurs.
The file should be compatible across various hardware and operating
systems.
All the clients should get the same view of the file in the system. For
instance, if a file is being modified, all the clients accessing the file should
see the changes.
The clients should not need to be aware of data duplication (replication).
The systems should be scalable. This means that if a system works in a
small environment, it should work for a larger environment.
Race Condition:
A race condition occurs when more than one process executes the same code or
accesses the same memory or shared variable at the same time; because the
result depends on which process "wins the race", the output or the value of the
shared variable may be wrong. When several processes access and manipulate
the same data concurrently, the outcome depends on the particular order in
which the accesses take place. A race condition is a situation that may occur
inside a critical section: it happens when the result of multiple threads executing
in the critical section differs according to the order in which the threads execute.
Race conditions in critical sections can be avoided if the critical section is
treated as an atomic instruction. Also, proper thread synchronization using locks
or atomic variables can prevent race conditions, as illustrated in the sketch
below.
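A minimal sketch in C with POSIX threads (the library choice is an assumption; the text does not prescribe one): two threads increment a shared counter, and the mutex makes each read-modify-write atomic. Removing the lock/unlock calls reintroduces the race condition, and the final value then depends on how the threads interleave:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;                          /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                /* enter critical section */
        counter++;                                /* read-modify-write on shared data */
        pthread_mutex_unlock(&lock);              /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With the mutex the result is always 2000000; without it,
       lost updates make the result unpredictable. */
    printf("counter = %ld\n", counter);
    return 0;
}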
A critical section is a code segment that can be accessed by only one process
at a time. The critical section contains shared variables that need to be
synchronized to maintain the consistency of data variables. So the critical
section problem means designing a way for cooperative processes to access
shared resources without creating data inconsistencies.
In the entry section, the process requests entry into the critical section.
Any solution to the critical section problem must satisfy three requirements:
mutual exclusion (only one process may execute inside the critical section at a
time), progress (if no process is in the critical section, a process wishing to
enter must eventually be allowed to do so), and bounded waiting (there is a
bound on how long a process must wait before it is allowed to enter).
Peterson’s Solution:
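A minimal sketch of Peterson's solution for two processes (numbered 0 and 1) in C. On modern hardware, where compilers and CPUs may reorder memory accesses, the shared variables would additionally need atomic types or memory fences, so this illustrates the algorithm rather than production code:

#include <stdbool.h>

/* Shared variables for the two processes i = 0 and i = 1. */
volatile bool flag[2] = { false, false };   /* flag[i]: process i wants to enter */
volatile int  turn    = 0;                  /* which process must wait if both want in */

void enter_critical_section(int i) {
    int other = 1 - i;
    flag[i] = true;        /* declare the intention to enter */
    turn = other;          /* politely give priority to the other process */
    /* Busy-wait while the other process also wants in and it is its turn. */
    while (flag[other] && turn == other) {
        /* spin */
    }
}

void exit_critical_section(int i) {
    flag[i] = false;       /* allow the other process to proceed */
}

For exactly two processes, this scheme satisfies mutual exclusion, progress, and bounded waiting.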
Semaphores:
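As one possible illustration in C using POSIX unnamed semaphores (sem_init/sem_wait/sem_post; the API choice is an assumption), a semaphore initialized to 1 acts as a binary semaphore guarding the critical section:

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t mutex;          /* binary semaphore guarding the critical section */
static int shared = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);    /* P operation: decrement, block if the value is zero */
        shared++;            /* critical section */
        sem_post(&mutex);    /* V operation: increment, wake a waiting process */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);  /* initial value 1 => binary semaphore */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);
    sem_destroy(&mutex);
    return 0;
}

A counting semaphore initialized to n would instead allow up to n processes into the guarded region, which is useful for managing a pool of n identical resources.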
1. Introduction
We often encounter problems that require heavy computation or data-intensive
processing. Hence, on the one hand, we try to develop efficient algorithms for these
problems. On the other hand, with advances in hardware and in parallel and distributed
computing technology, we are interested in exploiting high-performance computing
resources to handle them.
Parallel and distributed computing technology has been focused on how to maximize
inherent parallelism using multicore/many-core processors and networked computing
resources. Various computing architectures and hardware techniques have been developed
such as symmetric multiprocessor (SMP) architecture, non-uniform memory access
(NUMA) architecture, simultaneous multithreading (SMT) architecture, single instruction
multiple data (SIMD) architecture, graphics processing unit (GPU), general purpose
graphics processing unit (GPGPU), and superscalar processor.
A variety of software technology has been developed to take advantage of hardware
capability and to effectively develop parallel and distributed applications. With the
plentiful frameworks of parallel and distributed computing, it would be of great help to
have performance comparison studies for the frameworks we may consider.
This paper is concerned with performance studies of three parallel programming
frameworks: OpenMP, MPI, and MapReduce. The comparative studies have been
conducted for two problem sets: the all-pairs-shortest-path problem and a join problem for
large data sets. OpenMP is the de facto standard model for shared memory systems, MPI
is the de facto standard for distributed memory systems, and MapReduce is recognized as
the de facto standard framework intended for big data processing. For each problem, the
parallel programs have been developed in terms of the three models, and their performance
has been observed.
The remainder of the paper is organized as follows: Section 2 briefly reviews the parallel
computing models and Section 3 presents the selected programming frameworks in more
detail. Section 4 explains the developed parallel programs for the problems with the three
frameworks. Section 5 shows the experiment results and finally Section 6 draws
conclusions.
In shared memory architectures, all processors access a common global address space.
Uniform memory access (UMA) machines are commonly represented by SMPs and assume
all processors to be identical. NUMA machines are often organized by physically linking
two or more SMPs, in which case not all processors have equal access time to all memories.
In distributed memory architectures, processors have their own memory, but there is no
global address space across all processors. They have a communication network to connect
processors’ memories.
Hybrid shared-distributed memory employs both shared and distributed memory
architectures. In clusters of multicore or many-core processors, cores in a processor share
their memory and multiple shared memory machines are networked to move data from one
machine to another.
There are several parallel programming models which allow users to specify concurrency
and locality at a high level: thread, message passing, data parallel, and single program
multiple data (SPMD) and multiple program multiple data (MPMD) models.
The thread model organizes a heavyweight process into multiple lightweight threads that
are executed concurrently. The POSIX threads library (a.k.a. pthreads) and OpenMP are
typical implementations of this model; a minimal pthreads sketch follows.
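In this sketch in C (the names and the work done by each thread are illustrative), a single heavyweight process creates several lightweight threads that run concurrently and share its address space:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static int shared_data[NTHREADS];   /* lives in the single process's address space */

static void *worker(void *arg) {
    int id = *(int *)arg;
    shared_data[id] = id * id;      /* each lightweight thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    /* One heavyweight process spawns several lightweight threads
       that run concurrently and share global memory. */
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        printf("shared_data[%d] = %d\n", i, shared_data[i]);
    return 0;
}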
In the message passing model, an application consists of a set of tasks that use their own
local memory and may be located on the same machine or across a number of machines.
Tasks exchange data by sending and receiving messages. MPI is the de facto industry
standard for message passing.
The data parallel model, also referred to as the partitioned global address space (PGAS)
model, provides each process with a view of the global memory even though memory is
distributed across the machines. It distinguishes between local and global memory
references under the control of the programmer, while the compiler and runtime take care
of converting remote memory accesses into message passing operations between processes.
There are several implementations of the data parallel model: Coarray Fortran, Unified
Parallel C, X10, and Chapel.
SPMD model is a high level programming paradigm that executes the same program with
different data multiple times. It is probably the most commonly used parallel programming
model for clusters of nodes. MPMD model is a high level programming paradigm that
allows multiple programs to run on different data. With the advent of the general purpose
graphics processing unit (GPGPU), hybrid parallel computing models have been
developed that utilize the many-core GPU to perform heavy computation under the control
of a host thread running on the host CPU.
When data volume is large, the demands on memory capacity may hinder its manipulation
and processing. To deal with such situations, big data processing frameworks such as
Hadoop and Dryad have been developed, which exploit multiple distributed machines.
Hadoop MapReduce is a programming model that abstracts an application into two phases
of Map and
Reduce. Dryad structures the computation as a directed graph in which vertices correspond
to tasks and edges are the channels of data transmission.
3.1. OpenMP
3.2. MPI
MPI is a message passing library specification which defines an extended message passing
model for parallel, distributed programming in a distributed computing environment. It is
not itself a specific implementation of a parallel programming environment; several
implementations of it exist, such as OpenMPI, MPICH, and GridMPI. In the MPI model,
each process has its own address space and communicates with other processes to access
data in their address spaces. Programmers are responsible for partitioning the workload
and mapping the tasks, that is, for deciding which tasks are to be computed by each process.
MPI provides point-to-point, collective, one-sided, and parallel I/O communication
models. Point-to-point communication exchanges data between two matched processes.
Collective communication involves a group of processes; a typical example is the broadcast
of a message from one process to all the others. One-sided communication facilitates
remote memory access without a matched process on the remote node; three one-sided
operations are available, for remote read, remote write, and remote update. MPI provides
various library functions to coordinate message passing in modes such as blocking and
non-blocking message passing, and it can send messages of gigabyte size between
processes. A minimal collective-communication sketch is given below.
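In this sketch in C (assuming an MPI implementation such as OpenMPI or MPICH; compile with mpicc and launch with mpirun), the root process broadcasts a value and, after the collective call, every rank holds a copy:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        data = 123;                 /* the root process owns the value to distribute */

    /* Collective communication: every process in the communicator
       participates, and after the call all ranks hold 'data'. */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d has data = %d\n", rank, size, data);

    MPI_Finalize();
    return 0;
}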
MPI has been implemented on various platforms like Linux, OS X, Solaris, and Windows.
Most MPI implementations use some kind of network file storage. As network file storage,
network file system (NFS) and Hadoop HDFS can be used. Because MPI is a high-level
abstraction for parallel programming, programmers can construct parallel and distributed
processing applications without a deep understanding of the underlying mechanisms of
process creation and synchronization. In order to exploit multicore processors, MPI
processes can be organized to run multiple threads within themselves. MPI-based programs
can be executed on a single computer or a cluster of computers.