Parallel and Distributed Computing
Composed by: Danish Khan
Table of Contents
Parallel Computing and Distributed Computing .................................................................................... 5
What is Parallel Computing? ................................................................................................................ 6
Advantages and Disadvantages of Parallel Computing ........................................................ 6
What is Distributed Computing? .................................................................................................... 7
Advantages and Disadvantages of Distributed Computing ................................................. 7
Key differences between the Parallel Computing and Distributed Computing ................... 8
Various Failures in Distributed System ................................................................................................. 16
GPU architecture and programming: .................................................................................................... 19
Difference between CPU and GPU: ...................................................................................................... 19
Introduction to CUDA Programming ...................................................................................................... 20
Why do we need CUDA? ............................................................................................................. 20
How CUDA works? ........................................................................................................................ 21
Architecture of CUDA ................................................................................................................... 21
How work is distributed? ............................................................................................................ 22
CUDA Applications ....................................................................................................................... 22
Benefits of CUDA ........................................................................................................................... 23
Limitations of CUDA ..................................................................................................................... 23
What is GPU Programming? ............................................................................................................ 23
Heterogeneous computing ...................................................................................................................... 24
Heterogeneous and other DSM systems | Distributed systems........................................................ 24
Need for Heterogeneous DSM (HDSM): .................................................................................. 24
Heterogeneous DSM:.................................................................................................................... 25
Data compatibility & conversion: ............................................................................................. 25
Block size selection : ................................................................................................................... 26
Advantages of DSM: ..................................................................................................................... 27
Difference between a Homogeneous DSM & Heterogeneous DSM: .............................. 27
Interconnection Network/topologies: ........................................................................................... 28
Evaluating Design Trade-offs in Network Topology ................................................................. 30
Routing ................................................................................................................................................. 30
Routing Mechanisms .................................................................................................................... 30
Deterministic Routing ................................................................................................................... 30
Deadlock Freedom ......................................................................................................................... 30
Theta Notation................................................................................................................................. 68
Speedup of an Algorithm ................................................................................................................. 69
Number of Processors Used ........................................................................................................... 69
Total Cost ............................................................................................................................................. 69
Parallel Algorithm - Models ..................................................................................................................... 69
Data Parallel......................................................................................................................................... 70
Task Graph Model .............................................................................................................................. 71
Work Pool Model ................................................................................................................................ 72
Master-Slave Model ........................................................................................................................... 73
Precautions in using the master-slave model ........................................................................ 74
Pipeline Model..................................................................................................................................... 74
Hybrid Models ..................................................................................................................................... 75
Parallel Random Access Machines ....................................................................................................... 75
Shared Memory Model ...................................................................................................................... 77
Merits of Shared Memory Programming .................................................................................. 78
Demerits of Shared Memory Programming............................................................................. 78
Message Passing Model................................................................................................................... 78
Multithreaded programming: ...................................................................................................... 79
Multithreading on a Single Processor ...................................................................................... 80
Multithreaded Programming on Multiple Processors .......................................................... 80
Why Is Multithreading Important? ................................................................................................. 80
Processors Are at Maximum Clock Speed .............................................................................. 80
Parallelism Is Important for AI .................................................................................................... 80
What Are Common Multithreaded Programming Issues? ...................................................... 81
Race Conditions (Including Data Race) ................................................................................... 81
Deadlock ........................................................................................................................................... 82
parallel I/O: ................................................................................................................................................. 83
Performance Optimization of Distributed System ............................................................................... 84
Performance Optimization of Distributed Systems: ........................................................... 84
Performance analysis of parallel processing systems........................................................................ 87
Classification of parallel programming models ................................................................. 90
Process interaction................................................................................................................... 90
There are two main types of computation: parallel computing and distributed
computing. A computer system performs tasks according to the instructions it is given.
A single processor can execute only one task at a time, which is not efficient. Parallel
computing solves this problem by allowing numerous processors to work on tasks
simultaneously, and modern computers support parallel processing to improve system
performance. In contrast, distributed computing enables several computers to
communicate with one another and achieve a common goal. All of these computers
communicate and collaborate over a network. Distributed computing is commonly
used by organizations such as Facebook and Google that allow people to share
resources.
Advantages
1. It saves time and money, because many resources working together cut down on both.
2. Larger problems that are difficult to solve with serial computing can be handled.
3. Many things can be done at once using multiple computing resources.
4. Parallel computing is much better than serial computing for modeling, simulating, and
comprehending complicated real-world events.
Disadvantages
There are various benefits of using distributed computing. It enables scalability and
makes it simpler to share resources. It also aids in the efficiency of computation
processes.
Advantages
Disadvantages
1. Data security and sharing are the main issues in distributed systems due to the features
of open systems.
2. Because of the distribution across multiple servers, troubleshooting and diagnostics are
more challenging.
3. The main disadvantage of distributed computer systems is the lack of software support.
Here, you will learn the various key differences between parallel computing and
distributed computing. Some of the key differences are as follows:
1. Parallel computing is a sort of computation in which various tasks or processes are run at
the same time. In contrast, distributed computing is that type of computing in which the
components are located on various networked systems that interact and coordinate their
actions by passing messages to one another.
2. In parallel computing, processors communicate with one another via a bus. On the
other hand, computer systems in distributed computing connect with one another via a
network.
3. Parallel computing takes place on a single computer. In contrast, distributed computing
takes place on several computers.
Types of Parallelism:
2. Instruction-level parallelism: Without instruction-level parallelism, a processor can
issue at most one instruction per clock cycle. Instructions can be re-ordered and grouped
so that they are later executed concurrently without affecting the result of the program.
This is called instruction-level parallelism.
3. Task Parallelism: Task parallelism employs the decomposition of a task into subtasks,
each of which is then allocated for execution. The processors execute the subtasks
concurrently, as in the sketch below.
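As a hedged illustration of task parallelism (this example is not from the original notes; the data and task names are invented), two threads perform different subtasks, a sum and a maximum, over the same data set at the same time:

#include <algorithm>
#include <iostream>
#include <numeric>
#include <thread>
#include <vector>

int main() {
    std::vector<int> data(1000000, 1);
    long long sum = 0;
    int maximum = 0;

    // Each thread runs a different subtask; both only read the shared data.
    std::thread sum_task([&] { sum = std::accumulate(data.begin(), data.end(), 0LL); });
    std::thread max_task([&] { maximum = *std::max_element(data.begin(), data.end()); });

    sum_task.join();
    max_task.join();

    std::cout << "sum = " << sum << ", max = " << maximum << "\n";
    return 0;
}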
Synchronous Transmission vs. Asynchronous Transmission
1. In synchronous transmission, data is sent in the form of blocks or frames; in
asynchronous transmission, data is sent in the form of bytes or characters.
7. Transmission lines are used efficiently in synchronous transmission; in asynchronous
transmission, the transmission line remains empty during gaps between character
transmissions.
Concurrency:
Concurrency relates to an application that is processing more than one task at
the same time. It is an approach used to decrease the response time of the
system by using a single processing unit. Concurrency creates the illusion of
parallelism: the chunks of a task are not actually processed in parallel, but more
than one task is in progress inside the application at a time. One task does not
have to finish completely before the next one begins.
Concurrency is achieved by interleaving the operation of processes on the
central processing unit (CPU), in other words by context switching. That is why
it looks like parallel processing. It increases the amount of work finished at a
time.
In the figure above, we can see that there are multiple tasks making progress
at the same time. The figure illustrates concurrency, the technique that deals
with a lot of things at a time.
Parallelism:
Parallelism is related to an application where tasks are divided into smaller sub-
tasks that are processed simultaneously, i.e. in parallel. It is used to increase the
throughput and computational speed of the system by using multiple
processors, so that many things really do happen at the same time rather than
merely appearing to.
Parallelism leads to overlapping of the central processing unit and input-output
tasks in one process with the central processing unit and input-output tasks of
another process, whereas in concurrency the speed is increased by
overlapping the input-output activities of one process with the CPU activity of
another process.
In the figure above, we can see that the tasks are divided into smaller sub-
tasks that are processed simultaneously, in parallel. The figure illustrates
parallelism, the technique that runs threads simultaneously.
Concurrency control:
Two important issues in concurrency control are known as deadlocks and race
conditions. A deadlock occurs when a resource held indefinitely by one process is
requested by two or more other processes simultaneously.
Fault-tolerance:
Fault tolerance is the ability of a system to keep working properly in spite of the
occurrence of failures in the system. Even after performing many testing processes
there is still a possibility of failure; in practice a system cannot be made entirely error
free. Hence, systems are designed in such a way that in case of an error or failure the
system still works properly and gives correct results.
Any system has two major components – hardware and software – and a fault may
occur in either of them. So there are separate techniques for fault-tolerance in both
hardware and software.
Hardware Fault-tolerance Techniques:
Making hardware fault-tolerant is simple compared to software. Fault-tolerance
techniques make the hardware work properly and give correct results even when some
fault occurs in the hardware part of the system. There are basically two techniques used
for hardware fault-tolerance:
1. BIST –
BIST stands for Built-In Self-Test. The system carries out a test of itself after a certain
period of time, again and again; that is the BIST technique for hardware fault-tolerance.
When the system detects a fault, it switches out the faulty component and switches in a
redundant copy of it. The system basically reconfigures itself in case of fault occurrence.
2. TMR –
TMR is Triple Modular Redundancy. Three redundant copies of a critical component
are generated and all three copies are run concurrently. The results of all redundant
copies are voted on and the majority result is selected. TMR can tolerate the occurrence
of a single fault at a time.
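As a rough illustration only (not part of the original notes, and with invented values), a software analogue of the TMR voting step can be sketched as a majority function over three redundant results:

#include <iostream>

// Returns the value agreed on by at least two of the three redundant copies,
// masking a single faulty result.
int majority(int a, int b, int c) {
    if (a == b || a == c) return a;
    return b;  // here b == c, or all three disagree (which TMR cannot mask)
}

int main() {
    // Suppose the third redundant copy produced a faulty result.
    std::cout << "voted result = " << majority(42, 42, 17) << "\n";  // prints 42
    return 0;
}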
2. Recovery Blocks –
The recovery blocks technique is also like N-version programming, but in the recovery
blocks technique the redundant copies are generated using different algorithms. In
recovery blocks, the redundant copies are not all run concurrently; they are run one by
one. The recovery block technique can only be used where the task deadlines are longer
than the task computation time.
2. System failure :
In a system failure, the processor associated with the distributed system fails to
perform the execution. This is caused by software errors and hardware
issues. Hardware issues may involve CPU, memory, or bus failure. It is assumed
that whenever the system stops its execution due to some fault, the internal
state is lost.
Behavior –
It concerns the physical and logical units of the processor. The system
may freeze or reboot, or it may stop functioning altogether and go into an
idle state.
Recovery –
This can be cured by rebooting the system as soon as possible and
reconfiguring the failure point and the wrong state.
3. Secondary storage device failure :
A storage device failure is said to have occurred when the stored information
cannot be accessed. This failure is usually caused by a parity error, a head crash,
or dirt particles settled on the medium.
Behavior –
Stored information cannot be accessed.
Errors causing failure –
Parity error, head crash, etc.
Recovery/Design strategies –
Reconstruct the content from the archive and the log of activities, and design a
mirrored disk system. A system failure can additionally be classified as
follows:
An amnesia failure
A partial amnesia failure
A pause failure
A halting failure
4. Communication medium failure :
A communication medium failure happens when one site cannot communicate
with another operational site in the network. It is typically caused by the
failure of the switching nodes and/or the links of the communication system.
Behavior –
A site cannot communicate with another operational site.
Errors/Faults –
Failure of switching nodes or communication links.
Recovery/Design strategies –
Rerouting, error-resistant communication protocols.
Failure Models:
1. Timing failure:
A timing failure, also known as a performance failure, occurs when a node in a
system correctly sends a response, but the response arrives earlier or later than
anticipated.
2. Response failure:
A response failure occurs when a server’s response is flawed: the value of the
response may be wrong, or the response may be delivered through the wrong
control flow.
3. Omission failure:
An omission failure, sometimes described as an “infinitely late” timing failure,
occurs when the node’s response never appears to have been sent.
4. Crash failure:
If a node encounters an omission failure once and then totally stops responding
and goes unresponsive, this is known as a crash failure.
5. Arbitrary failure :
A server may produce arbitrary response at arbitrary times.
Architecture of CUDA
Each thread “knows” the x and y coordinates of the block it is in, and its
coordinates within that block.
These positions can be used to calculate a unique thread ID for each thread.
The computational work done depends on the value of the thread ID; for
example, the thread ID may determine which matrix element (or group of
elements) the thread operates on, as in the sketch below.
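A minimal CUDA sketch of this idea (an invented example, not code from the original notes; the kernel name, matrix layout, and launch parameters are assumptions):

// Each thread computes a unique global ID from its block coordinates and its
// position within the block, and uses that ID to pick the matrix element it scales.
__global__ void scaleMatrix(float *m, int width, int height, float factor) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;   // x position in the grid
    int row = blockIdx.y * blockDim.y + threadIdx.y;   // y position in the grid
    if (col < width && row < height) {
        int tid = row * width + col;                   // unique thread/element ID
        m[tid] *= factor;                              // work depends on the thread ID
    }
}

// Host-side launch, assuming d_m already holds the matrix on the device:
// dim3 block(16, 16);
// dim3 grid((width + 15) / 16, (height + 15) / 16);
// scaleMatrix<<<grid, block>>>(d_m, width, height, 2.0f);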
CUDA Applications
10. Research
11. Safety and security
12. Tools and management
Benefits of CUDA
There are several advantages that give CUDA an edge over traditional general-
purpose GPU (GPGPU) computing with graphics APIs:
Unified memory (CUDA 6.0 or later) and unified virtual memory (CUDA 4.0 or
later).
Shared memory – CUDA exposes a fast region of on-chip memory that can be
shared among the threads of a block. It can be used as a user-managed cache
and provides more bandwidth than texture lookups (see the sketch after this
list).
Scattered reads – code can read from arbitrary addresses in memory.
Improved performance on downloads and readbacks, both to and from the
GPU.
Full support for bitwise and integer operations.
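A hedged sketch of the shared-memory benefit mentioned above (illustrative only; the kernel assumes it is launched with 256 threads per block, and all names are invented). Each block stages its slice of the input in fast __shared__ memory and reduces it there, touching global memory only once per element:

__global__ void blockSum(const float *in, float *blockSums, int n) {
    __shared__ float buf[256];                        // fast per-block scratchpad
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    buf[threadIdx.x] = (i < n) ? in[i] : 0.0f;        // one read from global memory
    __syncthreads();
    // Tree reduction inside shared memory (blockDim.x must be a power of two, here 256).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            buf[threadIdx.x] += buf[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) blockSums[blockIdx.x] = buf[0];   // one partial sum per block
}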
Limitations of CUDA
CUDA source code is provided on the host machine or GPU, as defined by the
C++ syntax rules. Older versions of CUDA used C syntax rules, which means
that up-to-date CUDA source code may or may not work with them as required.
CUDA has one-way interoperability (the ability of computer systems or
software to exchange and make use of information) with rendering languages
such as OpenGL: OpenGL can access CUDA-registered memory, but CUDA
cannot access OpenGL memory.
Later versions of CUDA do not provide emulators or fallback support for
older versions.
CUDA supports only NVIDIA hardware.
What is GPU Programming?
While the past GPUs were designed exclusively for computer graphics, today they are
being used extensively for general-purpose computing (GPGPU computing) as well. In
addition to graphical rendering, GPU-driven parallel computing is used for scientific
modelling, machine learning, and other parallelization-prone jobs today.
Heterogeneous computing
Heterogeneous computing refers to systems that use more than one kind of
processor or core. These systems gain performance or energy efficiency not just by
adding the same type of processors, but by adding dissimilar coprocessors, usually
incorporating specialized processing capabilities to handle particular tasks.
Heterogeneous DSM:
In a heterogeneous computing environment, applications can take advantage of
the best of several computing architectures. Heterogeneity is typically desired in
distributed systems. With such a heterogeneous DSM system, memory sharing
between machines with different architectures will be conceivable. The two
major issues in building heterogeneous DSM are :
(i) Data Compatibility and conversion
(ii) Block Size Selection
Data compatibility & conversion:
Data compatibility and conversion is the first design concern in a heterogeneous
DSM system. Machines with different architectures may use different byte
orderings and floating-point representations. Data that is sent from one machine
to another must be converted to the destination machine’s format, and the data
transmission unit (block) must be transformed according to the data type of its
contents. As a result, application programmers must be involved, because they
are familiar with the memory layout. In heterogeneous DSM systems, data
conversion can be accomplished by organizing the system as a collection of
source language objects or by allowing only one type of data block.
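As a hedged illustration of the byte-ordering problem just described (not from the original text; the function names are invented), a 32-bit value written by a little-endian machine must have its bytes swapped before a big-endian machine can interpret it, and a DSM block known to contain 32-bit integers could be converted word by word when it migrates:

#include <cstdint>

// Reverse the byte order of a 32-bit word (little-endian <-> big-endian).
uint32_t swap_bytes(uint32_t v) {
    return ((v & 0x000000FFu) << 24) |
           ((v & 0x0000FF00u) << 8)  |
           ((v & 0x00FF0000u) >> 8)  |
           ((v & 0xFF000000u) >> 24);
}

// Convert a migrating block whose contents are known to be 32-bit integers.
void convert_block(uint32_t *block, int n_words) {
    for (int i = 0; i < n_words; ++i)
        block[i] = swap_bytes(block[i]);
}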
DSM as a collection of source language objects:
The DSM is structured as a collection of source language objects, according
to the first technique of data conversion. The unit of data migration in this
situation is either a shared variable or an object. Conversion procedures can
be used directly by the compiler to translate between different machine
architectures. The DSM system checks whether the requesting node and the
node that has the object are compatible before accessing remote objects or
variables. If the nodes are incompatible, it invokes a conversion routine,
translates, and migrates the shared variable or object.
This approach is employed in the Agora shared memory system, and while it is
handy for data conversion, it has low performance. Scalars, arrays, and
structures are the objects of programming languages. Each of them requires
access rights, and migration involves communication overhead. Due to the
limited packet size of transport protocols, access to big arrays may result in
false sharing and thrashing, while migration would entail fragmentation and
reassembly.
DSM as one type of data block:
Only one type of data block is allowed in the second data conversion
procedure. Mermaid DSM uses this approach, with a page size equal to the
block size. Additional information is kept in the page table entry, such as the
type of data preserved in the page and the amount of data allocated to the
page. The system converts the page to an appropriate format whenever the
page moves between machines with different architectures.
Interconnection Network/topologies:
o Bus networks − A bus network is composed of a number of bit lines onto
which a number of resources are attached. When busses use the same
physical lines for data and addresses, the data and the address lines are
time multiplexed. When there are multiple bus-masters attached to the
bus, an arbiter is required.
o Multistage networks − A multistage network consists of multiple stages
of switches. It is composed of a×b switches which are connected using a
particular interstage connection (ISC) pattern. Small 2×2 switch elements
are a common choice for many multistage networks. The number of
stages determines the delay of the network. By choosing different
interstage connection patterns, various types of multistage networks can
be created.
o Crossbar switches − A crossbar switch contains a matrix of simple
switch elements that can be switched on and off to create or break a
connection. By turning on a switch element in the matrix, a connection
between a processor and a memory can be made. Crossbar switches are
non-blocking; that is, all communication permutations can be performed
without blocking.
If the main concern is the routing distance, then the dimension has to be maximized and
a hypercube made; this assumes, as in store-and-forward routing, that the degree of the
switch and the number of links are not a significant cost factor. If the number of links or
the switch degree is the main cost, the dimension has to be minimized and a mesh built.
In the worst-case traffic pattern for each network, it is preferable to have high-
dimensional networks where all the paths are short. In patterns where each node is
communicating with only one or two nearby neighbors, it is preferable to have low-
dimensional networks, since only a few of the dimensions are actually used.
Routing
The routing algorithm of a network determines which of the possible paths from source
to destination are used as routes and how the route followed by each particular packet is
determined. Dimension-order routing limits the set of legal paths so that there is exactly
one route from each source to each destination: the one obtained by first traveling the
correct distance in the high-order dimension, then the next dimension, and so on.
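A hedged sketch of dimension-order routing on a 2-D mesh (illustrative only; the coordinate representation and function name are assumptions, not from the text). The packet first travels the full distance in the higher-order dimension, then in the next one, so exactly one path exists between each source and destination:

#include <utility>
#include <vector>

// Returns the sequence of nodes visited when routing from src to dst,
// correcting the x (high-order) dimension first, then the y dimension.
std::vector<std::pair<int, int>> dimensionOrderRoute(std::pair<int, int> src,
                                                     std::pair<int, int> dst) {
    std::vector<std::pair<int, int>> path{src};
    std::pair<int, int> cur = src;
    while (cur.first != dst.first) {            // travel in dimension x first
        cur.first += (dst.first > cur.first) ? 1 : -1;
        path.push_back(cur);
    }
    while (cur.second != dst.second) {          // then travel in dimension y
        cur.second += (dst.second > cur.second) ? 1 : -1;
        path.push_back(cur);
    }
    return path;
}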
Routing Mechanisms
Arithmetic, source-based port select, and table look-up are three mechanisms that high-
speed switches use to determine the output channel from information in the packet
header. All of these mechanisms are simpler than the kind of general routing
computations implemented in traditional LAN and WAN routers. In parallel computer
networks, the switch needs to make the routing decision for all its inputs in every cycle,
so the mechanism needs to be simple and fast.
Deterministic Routing
A routing algorithm is deterministic if the route taken by a message is determined
exclusively by its source and destination, and not by other traffic in the network. If a
routing algorithm only selects shortest paths toward the destination, it is minimal,
otherwise it is non-minimal.
Deadlock Freedom
Deadlock can occur in various situations. When two nodes attempt to send data to
each other and each begins sending before either receives, a ‘head-on’ deadlock may
occur. Another case of deadlock occurs when there are multiple messages competing
for resources within the network.
The basic technique for proving that a network is deadlock-free is to identify the
dependencies that can occur between channels as a result of messages moving through
the network, and to show that there are no cycles in the overall channel dependency
graph; hence there is no traffic pattern that can lead to a deadlock. The common way of
doing this is to number the channel resources such that all routes follow a particular
increasing or decreasing sequence, so that no dependency cycles arise.
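The cycle-freedom argument above can be checked mechanically. The following hedged sketch (an invented helper, not from the text) represents channels as numbered nodes of a dependency graph, where deps[c] lists the channels a packet may wait for while holding channel c, and reports whether any dependency cycle exists:

#include <vector>

// Depth-first search over the channel dependency graph.
// state: 0 = unvisited, 1 = on the current DFS path, 2 = finished.
bool hasCycleFrom(int c, const std::vector<std::vector<int>> &deps,
                  std::vector<int> &state) {
    state[c] = 1;
    for (int next : deps[c]) {
        if (state[next] == 1) return true;      // back edge: a dependency cycle
        if (state[next] == 0 && hasCycleFrom(next, deps, state)) return true;
    }
    state[c] = 2;
    return false;
}

bool deadlockFree(const std::vector<std::vector<int>> &deps) {
    std::vector<int> state(deps.size(), 0);
    for (int c = 0; c < static_cast<int>(deps.size()); ++c)
        if (state[c] == 0 && hasCycleFrom(c, deps, state)) return false;
    return true;   // no cycles in the channel dependency graph
}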
Switch Design
The design of a network depends on the design of the switch and how the switches are
wired together. The degree of the switch, its internal routing mechanisms, and its
internal buffering determine what topologies can be supported and what routing
algorithms can be implemented. Like any other hardware component of a computer
system, a network switch contains a data path, control, and storage.
Ports
The total number of pins is actually the total number of input and output ports times the
channel width. As the perimeter of the chip grows slowly compared to the area,
switches tend to be pin limited.
Channel Buffers
The organization of the buffer storage within the switch has an important impact on the
switch performance. Traditional routers and switches tend to have large SRAM or
DRAM buffers external to the switch fabric, while in VLSI switches the buffering is
internal to the switch and comes out of the same silicon budget as the data path and the
control section. As chip size and density increase, more buffering is available and
the network designer has more options, but the buffer real estate still comes at a
premium and its organization is important.
Flow Control
When multiple data flows in the network attempt to use the same shared network
resources at the same time, some action must be taken to control these flows. If we
don’t want to lose any data, some of the flows must be blocked while others proceed.
The problem of flow control arises in all networks and at many levels, but it is
qualitatively different in parallel computer networks than in local and wide area
networks. In parallel computers, the network traffic needs to be delivered about as
accurately as traffic across a bus, and there is a very large number of parallel flows on
a very small time scale.
Protect against DDoS attacks: The load balancer can detect and drop
distributed denial-of-service (DDoS) traffic before it reaches your site.
Performance: Load balancers can reduce the load on your web servers
and optimize traffic for a better client experience.
SSL Offload: Terminating SSL (Secure Sockets Layer) traffic on the load
balancer removes that overhead from the web servers, making additional
resources available for your web application.
Traffic Compression: A load balancer can compress site traffic, giving your
clients a much better experience with your site.
Round Robin
Least Connections
Least Time
Hash
IP Hash
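As an illustration only (not part of the original notes; the data structure and names are invented), here is a minimal sketch of two of the strategies listed above, round robin and least connections, choosing a backend server:

#include <cstddef>
#include <string>
#include <vector>

struct Server {
    std::string name;
    int active_connections;
};

// Round robin: hand out servers in a fixed rotating order.
const Server &roundRobin(const std::vector<Server> &servers, std::size_t &next) {
    const Server &chosen = servers[next % servers.size()];
    ++next;
    return chosen;
}

// Least connections: pick the server currently handling the fewest requests.
const Server &leastConnections(const std::vector<Server> &servers) {
    std::size_t best = 0;
    for (std::size_t i = 1; i < servers.size(); ++i)
        if (servers[i].active_connections < servers[best].active_connections)
            best = i;
    return servers[best];
}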
Classes of Load Balancing Algorithms:
Following are some of the different classes of load balancing algorithms.
Static: In this model, if any node is found to have a heavy load, a task can be
picked at random and moved to some other randomly chosen node.
Dynamic: This class uses the current state information for load balancing.
These algorithms perform better than static algorithms.
Deterministic: These algorithms use processor and process characteristics to
allocate processes to nodes.
Centralized: The system state information is collected by a single node.
Migration Models:
Code section
Resource section
Execution section
The techniques that are used for scheduling the processes in distributed
systems are as follows:
1. Task Assignment Approach: In the Task Assignment Approach, the user-
submitted process is composed of multiple related tasks which are
then selects the one with the least load, then it is not considered a good
approach because it leads to poor scalability, as it will not work well for a
system having many nodes. The reason is that the inquirer receives a great
many replies almost simultaneously, and the processing time spent on reply
messages becomes too long for node selection as the number of nodes (N)
increases. A straightforward alternative is to examine only m of the N nodes.
A good scheduling algorithm must have fairness of service, because in an
attempt to balance the workload on all nodes of the system there might be a
possibility that nodes with more load get more benefit as compared to nodes
with less load, which suffer from poorer response times than stand-alone
systems. Hence, the solution lies in the concept of load sharing, in which a
node shares some of its resources as long as its users are not affected.
The Load Balancing approach refers to the division of load among the
processing elements of a distributed system. The excess load of one
processing element is distributed to other processing elements that have less
load according to the defined limits. In other words, the load is maintained at
each processing element in such a manner that neither it gets overloaded nor
idle during the execution of a program to maximize the system throughput
which is the ultimate goal of distributed systems. This approach makes all
processing elements equally busy, thus speeding up the entire task and leading
to the completion of the task by all processors at approximately the same time.
Migration limit policy: Determines the limit value for the migration of
processes.
Issues Related to Load Balancing in
Distributed System
A distributed system is a set of computers joined by some sort of
communication network, each of which has its database system and users may
access data from any spot on the network, necessitating the availability of data
at each site. For example, if you want to withdraw money from an ATM, you
can go to any ATM (even an ATM of another bank) and swipe your card. The money
will be debited from your account and it will be reflected in your account. It
doesn’t matter whether you take money from an ATM or transfer it to someone by
net banking: internally all of these things are connected to each other and work
as a single unit, although in real life we see them as distributed.
Load Balancers:
1. Performance Degradation:
Load balancing may lead to performance degradation when load balancers assign
equivalent or predetermined weights to diverse resources, which can result in poor
performance in terms of speed and cost. Therefore, there is a need for effective load
balancers which balance load depending upon the type of resources.
2. Job Selection:
It deals with the issue of job selection: whenever jobs are assigned to resources
through load balancers, there should be an optimal algorithm to decide the order of
the jobs and which jobs should be given to which servers for the system to work
efficiently.
Processor consistency
In order for consistency in data to be maintained and to attain
scalable processor systems where every processor has its own memory,
the processor consistency model was derived. All processors need to be
consistent in the order in which they see writes done by one processor and
in the way they see writes by different processors to the same location
(coherence is maintained). However, they do not need to be consistent
when the writes are by different processors to different locations.
Every write operation can be divided into several sub-writes to all
memories. A read from one such memory can happen before the write to
this memory completes. Therefore, the data read can be stale. Thus, a
processor under PC can execute a younger load when an older store
needs to be stalled. Read before write, read after read and write before
write ordering is still preserved in this model.
The processor consistency model is similar to PRAM consistency model
with a stronger condition that defines all writes to the same memory
location must be seen in the same sequential order by all other processes.
Processor consistency is weaker than sequential consistency but stronger
than PRAM consistency model.
Cache consistency
Cache consistency requires that all write operations to the same memory
location are performed in some sequential order. Cache consistency is
weaker than processor consistency and incomparable with PRAM
consistency.
Release consistency
The release consistency model relaxes the weak consistency model by
distinguishing the entrance synchronization operation from the exit
synchronization operation. Under weak ordering, when a synchronization
operation is to be seen, all operations in all processors need to be visible
before the synchronization operation is done and the processor proceeds.
However, under release consistency model, during the entry to a critical
section, termed as "acquire", all operations with respect to the local
memory variables need to be completed. During the exit, termed as
"release", all changes made by the local processor should be propagated
to all other processors. Coherence is still maintained.
The acquire operation is a load/read that is performed to access the critical
section. A release operation is a store/write performed to allow other
processors to use the shared variables.
Among synchronization variables, sequential consistency or processor
consistency can be maintained. Using SC, all competing synchronization
variables should be processed in order. However, with PC, a pair of
competing variables need to only follow this order. Younger acquires can
be allowed to happen before older releases.
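A hedged C++ illustration of the acquire/release idea described above (not from the original text; names are invented). All writes made before the "release" store become visible to the other thread once its "acquire" load observes that store, much like the release at the exit of a critical section publishing the changes made inside it:

#include <atomic>
#include <cassert>
#include <thread>

int shared_data = 0;
std::atomic<bool> ready{false};

void producer() {
    shared_data = 42;                                 // ordinary write before the release
    ready.store(true, std::memory_order_release);     // "release": publish all earlier writes
}

void consumer() {
    while (!ready.load(std::memory_order_acquire)) {} // "acquire": wait for the release
    assert(shared_data == 42);                        // guaranteed to see the published value
}

int main() {
    std::thread t1(producer), t2(consumer);
    t1.join();
    t2.join();
    return 0;
}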
Entry consistency
This is a variant of the release consistency model. It also requires the use
of acquire and release instructions to explicitly state an entry or exit to a
critical section. However, under entry consistency, every shared variable is
assigned a synchronization variable specific to it. This way, only when the
acquire is to variable x, all operations related to x need to be completed
with respect to that processor. This allows concurrent operations of
different critical sections of different shared variables to occur. Concurrency
cannot be seen for critical operations on the same shared variable. Such a
consistency model will be useful when different matrix elements can be
processed at the same time.
Local consistency
In local consistency, each process performs its own operations in the order
defined by its program. There is no constraint on the ordering in which the
write operations of other processes appear to be performed. Local
consistency is the weakest consistency model in shared memory systems.
General consistency
In general consistency, all the copies of a memory location are eventually
identical after all processes' writes are completed.
Eventual consistency
An eventual consistency is a weak consistency model in the system with
the lack of simultaneous updates. It defines that if no update takes a very
long time, all replicas eventually become consistent.
Most shared decentralized databases have an eventual consistency model,
either BASE (basically available; soft state; eventually consistent) or a
combination of ACID and BASE sometimes called SALT (sequential; agreed;
ledgered; tamper-resistant, and also symmetric; admin-free; ledgered; and
time-consensual).
4. Cost per bit: As we move from bottom to top in the Hierarchy, the cost per
bit increases i.e. Internal Memory is costlier than External Memory.
According to the memory Hierarchy, the system supported memory
standards are defined below:
Level 1: Register – implemented as multi-port registers; bandwidth 20,000 to 100,000
MB/s; managed by the compiler.
Level 2: Cache – implemented on-chip (SRAM); bandwidth 5,000 to 15,000 MB/s;
managed by hardware.
Level 3: Main Memory – implemented as DRAM (capacitor memory); bandwidth 1,000
to 5,000 MB/s; managed by the operating system.
Level 4: Secondary Memory – magnetic storage; bandwidth 20 to 150 MB/s; managed
by the operating system.
MPI defines useful syntax for routines and libraries in programming languages
including Fortran, C, C++ and Java.
Some organizations are also able to offload MPI to make their programming
models and libraries faster.
Color. This assigns a color to a process, and all processes with the same color
are located in the same communicator. A command related to color
includes MPE_Make_color_array, which changes the available colors.
Derived data types. MPI functions need a specification of the type of data that is
sent between processes. Predefined constants such as MPI_INT, MPI_CHAR and
MPI_DOUBLE describe the common data types.
Collective basics. These are collective functions that need communication among
all processes in a process group. MPI_Bcast is an example of such, which sends
data from one node to all processes in a process group.
One-sided. This term typically refers to a form of communication operations,
including MPI_Put, MPI_Get and MPI_Accumulate. They refer, respectively, to
writing to memory, reading from memory, and performing a reduction operation
on the same memory across tasks.
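A minimal MPI sketch (an illustrative example, not taken from the notes; compile with an MPI compiler wrapper such as mpicxx and run with mpirun) showing the collective MPI_Bcast described above, which sends data from one root process to every process in the communicator:

#include <mpi.h>
#include <cstdio>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);

    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    int value = 0;
    if (rank == 0) value = 123;          // only the root holds the data initially

    // Every process calls MPI_Bcast; afterwards all of them hold the root's value.
    MPI_Bcast(&value, 1, MPI_INT, 0, MPI_COMM_WORLD);

    std::printf("process %d sees value %d\n", rank, value);

    MPI_Finalize();
    return 0;
}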
Differences
The major differences between the shared memory and message passing models are:
Shared memory: It provides a region of memory for data communication. The code
that reads or writes the shared data must be written explicitly by the application
programmer, who must also make sure that processes are not writing to the same
location simultaneously.
Message passing: The message passing facility itself is used for communication, so
no such code is required; it provides a mechanism for communication and
synchronization of the actions performed by the communicating processes. Message
passing is useful for sharing small amounts of data, so that conflicts need not occur.
Concurrent Processing
The easy availability of computers along with the growth of Internet has
changed the way we store and process data. We are living in a day and
age where data is available in abundance. Every day we deal with huge
volumes of data that require complex computing and that too, in quick time.
Sometimes, we need to fetch data from similar or interrelated events that
occur simultaneously. This is where we require concurrent
processing, which can divide a complex task and process it on multiple systems
to produce the output in quick time.
Concurrent processing is essential where the task involves processing a
huge bulk of complex data. Examples include − accessing large databases,
aircraft testing, astronomical calculations, atomic and nuclear physics,
biomedical analysis, economic planning, image processing, robotics,
weather forecasting, web-based services, etc.
What is Parallelism?
Parallelism is the process of processing several sets of instructions
simultaneously. It reduces the total computational time. Parallelism can be
implemented by using parallel computers, i.e. computers with many
processors. Parallel computers require parallel algorithms, programming
languages, compilers and operating systems that support multitasking.
In this tutorial, we will discuss only about parallel algorithms. Before
moving further, let us first discuss about algorithms and their types.
What is an Algorithm?
An algorithm is a sequence of instructions followed to solve a problem.
While designing an algorithm, we should consider the architecture of the
computer on which the algorithm will be executed. As per the architecture,
there are two types of computers −
Sequential Computer
Parallel Computer
Depending on the architecture of computers, we have two types of
algorithms −
Sequential Algorithm − An algorithm in which some consecutive
steps of instructions are executed in a chronological order to solve a
problem.
Flynn’s taxonomy:
MIMD/SIMD (models of computing)
Parallel computing is a computing where the jobs are broken into
discrete parts that can be executed concurrently. Each part is further
broken down to a series of instructions. Instructions from each part
execute simultaneously on different CPUs. Parallel systems deal with the
simultaneous use of multiple computer resources that can include a single
computer with multiple processors, a number of computers connected by
a network to form a parallel processing cluster or a combination of both.
Parallel systems are more difficult to program than computers with a
single processor because the architecture of parallel computers varies
accordingly and the processes of multiple CPUs must be coordinated and
synchronized.
The crux of parallel processing is the CPU. Based on the number
of instruction streams and data streams that can be processed simultaneously,
computing systems are classified into four major categories:
Flynn’s classification –
1. Single-instruction, single-data (SISD) systems –
An SISD computing system is a uniprocessor machine which is
capable of executing a single instruction, operating on a single data
stream. In SISD, machine instructions are processed in a sequential
manner and computers adopting this model are popularly called
sequential computers. Most conventional computers have SISD
architecture. All the instructions and data to be processed have to be
stored in primary memory.
Example Z = sin(x)+cos(x)+tan(x)
The system performs different operations on the same data set.
Machines built using the MISD model are not useful in most
applications; a few machines have been built, but none of them are
available commercially.
4. Multiple-instruction, multiple-data (MIMD) systems –
An MIMD system is a multiprocessor machine which is capable of
executing multiple instructions on multiple data sets. Each PE in the
MIMD model has separate instruction and data streams; therefore,
machines built using this model are capable of handling any kind of
application. Unlike SIMD and MISD machines, PEs in MIMD machines
work asynchronously.
In the shared-memory MIMD model (tightly coupled multiprocessor
systems), all the PEs are connected to a single global memory and
they all have access to it. The communication between PEs in this
model takes place through the shared memory; modification of the data
stored in the global memory by one PE is visible to all other PEs.
Dominant representative shared memory MIMD systems are Silicon
Graphics machines and Sun/IBM’s SMP (Symmetric Multi-Processing).
In Distributed memory MIMD machines (loosely coupled
multiprocessor systems) all PEs have a local memory. The
communication between PEs in this model takes place through the
interconnection network (the inter process communication channel, or
IPC). The network connecting PEs can be configured to tree, mesh or
in accordance with the requirement.
The shared-memory MIMD architecture is easier to program but is less
tolerant to failures and harder to extend with respect to the distributed
memory MIMD model. Failures in a shared-memory MIMD affect the
entire system, whereas this is not the case of the distributed model, in
which each of the PEs can be easily isolated. Moreover, shared
memory MIMD architectures are less likely to scale because the
addition of more PEs leads to memory contention. This is a situation
that does not happen in the case of distributed memory, in which each
PE has its own memory. As a result of practical outcomes and user’s
requirement, distributed memory MIMD architecture is superior to the
other existing models.
Total cost.
Time Complexity
The main reason behind developing parallel algorithms was to
reduce the computation time of an algorithm. Thus, evaluating the
execution time of an algorithm is extremely important in analyzing
its efficiency.
Execution time is measured on the basis of the time taken by the
algorithm to solve a problem. The total execution time is
calculated from the moment when the algorithm starts executing
to the moment it stops. If all the processors do not start or end
execution at the same time, then the total execution time of the
algorithm spans from the moment when the first processor starts its
execution to the moment when the last processor stops its
execution.
Time complexity of an algorithm can be classified into three
categories−
Worst-case complexity − When the amount of time
required by an algorithm for a given input is maximum.
Average-case complexity − When the amount of time
required by an algorithm for a given input is average.
Best-case complexity − When the amount of time required
by an algorithm for a given input is minimum.
Asymptotic Analysis
The complexity or efficiency of an algorithm is the number of
steps executed by the algorithm to get the desired output.
Asymptotic analysis is done to calculate the complexity of an
algorithm in its theoretical analysis. In asymptotic analysis, a large
length of input is used to calculate the complexity function of the
algorithm.
Note − Asymptotic describes a condition in which a line tends to meet a
curve, but they do not intersect. Here the line and the curve are
asymptotic to each other.
Speedup of an Algorithm
The performance of a parallel algorithm is determined by
calculating its speedup. Speedup is defined as the ratio of the
worst-case execution time of the fastest known sequential
algorithm for a particular problem to the worst-case execution
time of the parallel algorithm.
speedup = (worst-case execution time of the fastest known sequential algorithm
for a particular problem) / (worst-case execution time of the parallel algorithm)
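As a hedged worked example (the numbers are invented for illustration): if the fastest known sequential algorithm for a problem takes 80 seconds in the worst case and the parallel algorithm takes 20 seconds in the worst case, the speedup is 80 / 20 = 4.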
Here, problems are divided into atomic tasks and implemented as a graph.
Each task is an independent unit of work that has dependencies on one or
more antecedent tasks. After the completion of a task, the output of an
antecedent task is passed to the dependent task. A task with antecedent
tasks starts execution only when all of its antecedent tasks are completed.
The final output of the graph is received when the last dependent task is
completed (Task 6 in the above figure).
Master-Slave Model
In the master-slave model, one or more master processes generate tasks and
allocate them to slave processes. The tasks may be allocated beforehand if −
the master can estimate the volume of the tasks, or
a random assignment can do a satisfactory job of balancing the load, or
slaves are assigned smaller pieces of the task at different times.
This model is generally equally suitable to shared-address-
space or message-passing paradigms, since the interaction is naturally
two-way.
In some cases, a task may need to be completed in phases, and the task in
each phase must be completed before the tasks in the next phase can be
generated. The master-slave model can be generalized
to a hierarchical or multi-level master-slave model in which the top-level
master feeds a large portion of the tasks to the second-level masters, who
further subdivide the tasks among their own slaves and may perform a part of
the task themselves.
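A hedged sketch of the master-slave (master-worker) model using C++ threads (illustrative only; the names and task contents are invented). The master pushes tasks into a shared queue, and the worker ("slave") threads repeatedly take the next task and execute it:

#include <iostream>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

std::queue<int> tasks;     // tasks generated by the master
std::mutex queue_mutex;    // protects the shared task queue

void worker(int id) {
    while (true) {
        int task;
        {
            std::lock_guard<std::mutex> lock(queue_mutex);
            if (tasks.empty()) return;   // no work left
            task = tasks.front();
            tasks.pop();
        }
        // A real slave would do the actual computation for the task here.
        std::cout << "worker " << id << " processed task " << task << "\n";
    }
}

int main() {
    for (int t = 0; t < 20; ++t) tasks.push(t);   // the master generates tasks up front

    std::vector<std::thread> workers;
    for (int id = 0; id < 4; ++id) workers.emplace_back(worker, id);
    for (std::thread &w : workers) w.join();
    return 0;
}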
Hybrid Models
A hybrid algorithm model is required when more than one model may be
needed to solve a problem.
A hybrid model may be composed of either multiple models applied
hierarchically or multiple models applied sequentially to different phases of
a parallel algorithm.
Example − Parallel quick sort
and Windows 2000, and JavaTM threads as part of the standard JavaTM
Development Kit (JDK).
Distributed Shared Memory (DSM) Systems − DSM systems create
an abstraction of shared memory on a loosely coupled architecture in
order to implement shared memory programming without hardware
support. They implement standard libraries and use the advanced user-
level memory management features present in modern operating
systems. Examples include the TreadMarks system, Munin, IVY, Shasta,
Brazos, and Cashmere.
Program Annotation Packages − This is implemented on the
architectures having uniform memory access characteristics. The most
notable example of program annotation packages is OpenMP.
OpenMP implements functional parallelism. It mainly focuses on
parallelization of loops.
The concept of shared memory provides low-level control of a shared
memory system, but it tends to be tedious and error-prone. It is more
applicable to system programming than application programming.
Merits of Shared Memory Programming
Global address space gives a user-friendly programming approach to
memory.
Due to the closeness of memory to CPU, data sharing among
processes is fast and uniform.
There is no need to specify distinctly the communication of data among
processes.
Process-communication overhead is negligible.
It is very easy to learn.
Demerits of Shared Memory Programming
It is not portable.
Managing data locality is very difficult.
Message Passing Model
Message passing is the most commonly used parallel programming
approach in distributed memory systems. Here, the programmer has to
determine the parallelism. In this model, all the processors have their own
local memory unit and they exchange data through a communication
network.
Multithreaded programming:
Here’s why:
Processors have reached maximum clock speed. The only way to get more
out of CPUs is with parallelism.
Using multiple threads helps you get more out of a single processor. But
then these threads need to sync their work in a shared memory. This can
be difficult to get right — and even more difficult to do without concurrency
issues.
Here are two common types of multithreading issues that can be difficult to
find with testing and debugging alone.
A data race is a type of race condition. A data race occurs when two or
more threads access shared data and attempt to modify it at the same time
— without proper synchronization.
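As a hedged example of the data race described above and the usual fix (invented code, not from the original text): two threads increment a shared counter; without the lock, increments can be lost, and taking the mutex around the update makes the accesses properly synchronized:

#include <iostream>
#include <mutex>
#include <thread>

long counter = 0;
std::mutex counter_mutex;

void increment_many() {
    for (int i = 0; i < 100000; ++i) {
        std::lock_guard<std::mutex> lock(counter_mutex);  // remove this line to reintroduce the race
        ++counter;
    }
}

int main() {
    std::thread t1(increment_many), t2(increment_many);
    t1.join();
    t2.join();
    std::cout << counter << "\n";   // with the lock, always prints 200000
    return 0;
}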
Deadlock
Deadlock occurs when multiple threads are blocked while competing for
resources. One thread is stuck waiting for a second thread, which is stuck
waiting for the first.
parallel I/O:
Parallel I/O is a subset of parallel computing that performs
multiple input/output operations simultaneously. Rather than process I/O
requests serially, one at a time, parallel I/O accesses data on disk
simultaneously. This allows a system to achieve higher write speeds and
maximizes bandwidth.
Multicore chips help give parallel computing its processing power, and
make it compatible with most currently deployed servers. In a multicore
processor, each physical core enables efficient use of resources by
managing multiple requests by one user with Multithreading.
With parallel I/O, a portion of the logical cores on the multicore chip are
dedicated to processing I/O from the virtual machines and any applications
the remaining cores service. This allows the processor to handle multiple
read and write operations concurrently. Parallel I/O helps eliminate
I/O bottlenecks, which can stop or impair the flow of data.
Currently, many applications don't utilize parallel I/O, having been designed
to use Unicore sequential processing rather than multicore. However, the
recent rise in popularity of big data analytics may signal a place for parallel
computing in business applications, which face significant I/O performance
issues.
transferring data over the network and also the rate (frequency) with which it
is sent.
Using LRPC (Lightweight Remote Procedure Call) for Cross-Domain
Messaging: LRPC (Lightweight Remote Procedure Call) facility is used in
microkernel operating systems for providing cross-domain (calling and called
processes are both on the same machine) communication. It employs
following the approaches for enhancing the performance of old systems
employing Remote Procedure Call:
Simple Control Transfer: In this approach, a control transfer procedure is
used that refers to the execution of the requested procedure by the client’s
thread in the server’s domain. It employs hand-off scheduling in which direct
context switching takes place from the client thread to the server thread.
Before the first call is made to the server, the client binds to its interface, and
afterward, it provides the server with the argument stack and its execution
thread for trapping the kernel. Now, the kernel checks the caller and creates
a call linkage, and sends off the client’s thread directly to the server which in
turn activates the server for execution. After completion of the called
procedure, control and results return through the kernel from where it is called.
Simple Data Transfer: In this approach, a shared argument stack is
employed to avoid duplicate data copying. Shared simply refers to the usage
by both the client and the server. So, in LRPC the same arguments are copied
only once from the client’s stack to the shared argument stack. It leads to cost-
effectiveness as data transfer creates few copies of data when moving from
one domain to another.
Simple Stub: Because of the above mechanisms, the generation of highly
optimized stubs is possible using LRPC. The call stub is associated with the
client’s domain, and every procedure in the server’s domain has an entry stub.
The LRPC interface for every
procedure follows a three-layered communication protocol:
From end to end: communication is carried out as defined by
calling conventions
stub to stub: requires the usage of stubs
domain-to-domain: requires kernel implementation
The benefit of using LRPC stubs is that the cost of crossing layers is reduced, as
the boundaries between them are blurred. The only requirement in a simple LRPC is
that one formal procedure call be made to the client stub and one return be made
from the server procedure and the client stub.
Design for Concurrency: For achieving high performance in terms of high
call throughput and low call latency, multiple processors are used with shared
memory. Further, throughput can be increased by getting rid of unnecessary
lock contention and reducing the utilization of shared-data structures, while
latency is lowered by decreasing the overhead of context switching.
To compare this system with other parallel processing systems, the following four
models are considered: Distributed/Splitting (D/S), Distributed/No Splitting
(D/NS), Centralized/Splitting (C/S), and Centralized/No Splitting (C/NS). In each
of these systems there are c processors; jobs are assumed to consist of a set of
independent tasks with exponentially distributed service requirements, and job
arrivals are assumed to come from a Poisson point
source. The systems differ in the way jobs queue for the processors and in the
way jobs are scheduled on the processors. The queueing of jobs for processors
is distributed if each processor has its own queue, and is centralized if there is a
common queue for all the processors. The scheduling of jobs on the processors
is 'no splitting' if the entire set of tasks composing a job is scheduled to run
sequentially on the same processor once the job is scheduled. On the other
hand, the scheduling is splitting if the tasks of a job are scheduled so that they
can be run independently and potentially in parallel on different processors. In
the splitting case a job is completed only when all of its tasks have finished
execution.
In our study we compare the mean response time of jobs in each of the systems
for differing values of the number of processors, number of tasks per job, server
utilization, and certain overheads associated with splitting up a job.
The M^X/M/c system studied in the first part of the paper corresponds to the C/S
system. In this system, as processors become free they serve the first task in the
queue. The D/S and D/NS systems are studied in another paper; we use the
approximate analysis of the D/S system and the exact analysis of the D/NS
system given in that paper. For systems with 32 processors or fewer, the relative
error in the approximation for the D/S system was found to be less than 5
percent. In the D/NS system, jobs are assigned to processors with equal
probabilities. The approximation we use for the mean job response time of the
C/NS system is taken from the literature; although an extensive error analysis for
this system over all parameter ranges has not been carried out, the largest
relative error reported for the M/E2/10 system is about 0.1 percent.
For all values of utilization, ρ, our results show that the splitting systems yield
lower mean job response time than the no splitting systems. This follows from the
fact that, in the splitting case, work is distributed over all the processors. For
any ρ, the lowest (highest) mean job response time is achieved by the C/S
system (the D/NS system). The relative performance of the D/S system and the
C/NS system depends on the value of ρ. For small ρ, the parallelism achieved by
splitting jobs into parallel tasks in the D/S system reduces its mean job response
time as compared to the C/NS system, where tasks of the same job are executed
sequentially. However, for high ρ, the C/NS system has lower mean job response
time than the D/S system. This is due to the long synchronization delay incurred
in the D/S system at high utilizations.
We also consider problems associated with partitioning the processors into two
sets, each dedicated to one of two classes of jobs: edit jobs and batch
jobs. Edit jobs are assumed to consist of simple operations that have no inherent
parallelism and thus consist of only one task. Batch jobs, on the other hand, are
assumed to be inherently parallel and can be broken up into tasks. All tasks from
either class are assumed to have the same service requirements. A number of
interesting phenomena are observed. For example, when half the jobs are edit
jobs, the mean job response time for both classes of jobs increases if one
processor is allocated to edit jobs. Improvement to edit jobs, at a cost of
increasing the mean job response time of batch jobs, results only when the
number of processors allocated to edit jobs is increased to two. This, and other
results, suggest that it is desirable for parallel processing systems to have a
controllable boundary for processor partitioning.
Process interaction
Process interaction relates to the mechanisms by which parallel processes
are able to communicate with each other. The most common forms of
interaction are shared memory and message passing, but interaction can
also be implicit (invisible to the programmer).
Shared memory
Shared memory is an efficient means of passing data between processes.
In a shared-memory model, parallel processes share a global address
space that they read and write to asynchronously. Asynchronous
concurrent access can lead to race conditions, and mechanisms such
as locks, semaphores and monitors can be used to avoid these.
Conventional multi-core processors directly support shared memory, which
many parallel programming languages and libraries, such as Cilk, OpenMP
and Threading Building Blocks, are designed to exploit.
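As a minimal sketch of the shared-memory model (assuming C with OpenMP, e.g. compiled with gcc -fopenmp; the variable names and the toy problem are illustrative only), the loop below lets several threads update a shared accumulator safely by using a reduction clause instead of an explicit lock:

#include <stdio.h>
#include <omp.h>

int main(void) {
    const int N = 1000000;
    double sum = 0.0;            /* shared variable in the global address space */

    /* Each thread accumulates a private partial sum; OpenMP combines
       the partial sums at the end, avoiding a race condition on 'sum'. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++) {
        sum += 1.0 / (i + 1);
    }

    printf("sum = %f, max threads = %d\n", sum, omp_get_max_threads());
    return 0;
}

The same effect could be obtained with a lock or an atomic update, at a higher synchronization cost.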
Message passing
In a message-passing model, parallel processes exchange data through
passing messages to one another. These communications can be
asynchronous, where a message can be sent before the receiver is ready,
or synchronous, where the receiver must be ready. The Communicating
sequential processes (CSP) formalization of message passing uses
synchronous communication channels to connect processes, and led to
important languages such as Occam, Limbo and Go. In contrast, the actor
model uses asynchronous message passing and has been employed in the
design of languages such as D, Scala and SALSA.
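As an illustrative message-passing sketch in C using MPI (the library discussed later in this document; compile with mpicc and run with, for example, mpirun -np 2), rank 0 sends an integer to rank 1 with a blocking point-to-point exchange; the value and the message tag are arbitrary:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, value = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Blocking send: returns once the send buffer can be reused. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Blocking receive: waits until the message has arrived. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}

Non-blocking variants such as MPI_Isend/MPI_Irecv correspond to the asynchronous style, in which a message can be posted before the receiver is ready.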
Problem decomposition
A parallel program is composed of simultaneously executing processes.
Problem decomposition relates to the way in which the constituent
processes are formulated.
Task parallelism
A task-parallel model focuses on processes, or threads of execution. These
processes will often be behaviorally distinct, which emphasizes the need
for communication. Task parallelism is a natural way to express message-
passing communication. In Flynn's taxonomy, task parallelism is usually
classified as MIMD/MPMD or MISD.
Data parallelism
A data-parallel model focuses on performing operations on a data set,
typically a regularly structured array. A set of tasks will operate on this data,
but independently on disjoint partitions. In Flynn's taxonomy, data
parallelism is usually classified as MIMD/SPMD or SIMD.
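A minimal data-parallel sketch in C with OpenMP (the framework choice is an assumption; any data-parallel model would do): every iteration applies the same operation to a different element of the arrays, so the iterations are independent and can be divided among threads:

#include <stdio.h>

#define N 8

int main(void) {
    double a[N], b[N], c[N];

    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* Data parallelism: the same operation is applied independently
       to disjoint partitions of the data. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        c[i] = a[i] + b[i];
    }

    for (int i = 0; i < N; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}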
Implicit parallelism
As with implicit process interaction, an implicit model of parallelism reveals
nothing to the programmer as the compiler, the runtime or the hardware is
responsible. For example, in compilers, automatic parallelization is the
process of transforming sequential code into parallel code.
Terminology
Parallel programming models are closely related to models of computation.
A model of parallel computation is an abstraction used to analyze the cost
of computational processes, but it does not necessarily need to be practical,
in the sense that it need not be efficiently implementable in hardware and/or
software. A programming model, in contrast, does specifically imply the
practical considerations of hardware and software implementation.
A parallel programming language may be based on one or a combination of
programming models. For example, High Performance Fortran is based on
shared-memory interactions and data-parallel problem decomposition,
and Go provides mechanisms for shared-memory and message-passing
interaction.
Scalability is the capacity of a system to adapt its performance and cost to
changes in application and system processing demands. The architecture used to
build services, networks, and processes is scalable when it continues to meet
these demands as they change.
Scalability is basically a measure of how well the system responds to the
addition or removal of resources to meet our requirements. That is why we do a
requirement analysis of the system in the first phase of the SDLC, to make sure
the system is adaptable and scalable.
Measures of Scalability:
Size Scalability
Geographical Scalability
Administrative Scalability
1. Size Scalability: The size of the system will increase as users and resources
grow, but this growth should not come at the cost of the performance and
efficiency of the system. The system must respond to the user in the same
manner as it did before scaling.
2. Geographical Scalability: Geographical scalability means that adding nodes
that are physically far apart should not significantly affect the communication
time between the nodes.
3. Administrative Scalability: In administrative scalability, adding new nodes to
the system should not require significant additional management effort.
Types of Scalability: systems are commonly scaled vertically (scaling up, by
adding resources to an existing node) or horizontally (scaling out, by adding
more nodes).
Parallel storage systems:
A parallel storage file system is a sort of clustered file system. A clustered file
system is a storage system shared by multiple devices simultaneously.
In a parallel file system, data is spread amongst several storage nodes for
redundancy and performance. The file system's storage is built from the storage
devices of multiple servers: when the file system receives data, it breaks it into
blocks and distributes the blocks across several storage nodes.
Parallel file systems also replicate data on physically distinct nodes, which
provides redundancy and makes the system fault-tolerant, while the distribution
of data improves the system's performance.
In other words, the parallel file system breaks data into blocks and distributes
the blocks to multiple storage servers. It uses a global namespace to enable data
access, and data is written and read over multiple input/output (I/O) paths.
Common parallel file systems include:
BeeGFS
Lustre
PanFS (Panasas)
OrangeFS
Distributed storage systems are also called network file systems. These systems
share access to the same storage using network protocols, and they control
access to the file system based on access lists and the capabilities of the server
and client systems. They allow files to be accessed using the same interfaces as
local files.
A parallel file system is a kind of distributed file system; both share data
amongst multiple servers. Common distributed file systems include:
Windows DFS
Infinit
Alluxio
ObjectiveFS
JuiceFS
MapR FS
A distributed file system should satisfy the following requirements:
The clients should access distributed files as they would access local
files, and they should not be aware of the file distribution.
The client system and program should function correctly even when a
server failure occurs.
The file should be compatible across various hardware and operating
systems.
All the clients should get the same view of the file in the system. For
instance, if a file is being modified, all the clients accessing the file should
see the changes.
The clients should not need to be aware of data duplication (replication).
The systems should be scalable. This means that if a system works in a
small environment, it should work for a larger environment.
Race Condition:
A race condition occurs when more than one process executes the same code or
accesses the same memory or shared variable at the same time; because the
result depends on which process "wins the race", the output or the value of the
shared variable may be wrong. When several processes access and manipulate
the same data concurrently, the outcome depends on the particular order in
which the accesses take place. A race condition is a situation that may occur
inside a critical section: it happens when the result of multiple threads executing
in the critical section differs according to the order in which the threads execute.
Race conditions in critical sections can be avoided if the critical section is
treated as an atomic instruction. Also, proper thread synchronization using locks
or atomic variables can prevent race conditions, as illustrated in the sketch
below.
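A minimal sketch in C with POSIX threads (the library choice is an assumption; the text does not prescribe one): two threads increment a shared counter, and the mutex makes each read-modify-write atomic. Removing the lock/unlock calls reintroduces the race condition, and the final value then depends on how the threads interleave:

#include <stdio.h>
#include <pthread.h>

static long counter = 0;                          /* shared variable */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 1000000; i++) {
        pthread_mutex_lock(&lock);                /* enter critical section */
        counter++;                                /* read-modify-write on shared data */
        pthread_mutex_unlock(&lock);              /* leave critical section */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    /* With the mutex the result is always 2000000; without it,
       lost updates make the result unpredictable. */
    printf("counter = %ld\n", counter);
    return 0;
}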
A critical section is a code segment that can be accessed by only one process
at a time. The critical section contains shared variables that need to be
synchronized to maintain the consistency of data variables. So the critical
section problem means designing a way for cooperative processes to access
shared resources without creating data inconsistencies.
In the entry section, the process requests entry into the critical section.
Any solution to the critical section problem must satisfy three requirements:
mutual exclusion (only one process may execute inside the critical section at a
time), progress (if no process is in the critical section, a process wishing to
enter must eventually be allowed to do so), and bounded waiting (there is a
bound on how long a process must wait before it is allowed to enter).
Peterson’s Solution:
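A minimal sketch of Peterson's solution for two processes (numbered 0 and 1) in C. On modern hardware, where compilers and CPUs may reorder memory accesses, the shared variables would additionally need atomic types or memory fences, so this illustrates the algorithm rather than production code:

#include <stdbool.h>

/* Shared variables for the two processes i = 0 and i = 1. */
volatile bool flag[2] = { false, false };   /* flag[i]: process i wants to enter */
volatile int  turn    = 0;                  /* which process must wait if both want in */

void enter_critical_section(int i) {
    int other = 1 - i;
    flag[i] = true;        /* declare the intention to enter */
    turn = other;          /* politely give priority to the other process */
    /* Busy-wait while the other process also wants in and it is its turn. */
    while (flag[other] && turn == other) {
        /* spin */
    }
}

void exit_critical_section(int i) {
    flag[i] = false;       /* allow the other process to proceed */
}

For exactly two processes, this scheme satisfies mutual exclusion, progress, and bounded waiting.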
Semaphores:
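As one possible illustration in C using POSIX unnamed semaphores (sem_init/sem_wait/sem_post; the API choice is an assumption), a semaphore initialized to 1 acts as a binary semaphore guarding the critical section:

#include <stdio.h>
#include <pthread.h>
#include <semaphore.h>

static sem_t mutex;          /* binary semaphore guarding the critical section */
static int shared = 0;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        sem_wait(&mutex);    /* P operation: decrement, block if the value is zero */
        shared++;            /* critical section */
        sem_post(&mutex);    /* V operation: increment, wake a waiting process */
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    sem_init(&mutex, 0, 1);  /* initial value 1 => binary semaphore */
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("shared = %d\n", shared);
    sem_destroy(&mutex);
    return 0;
}

A counting semaphore initialized to n would instead allow up to n processes into the guarded region, which is useful for managing a pool of n identical resources.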
1. Introduction
We often encounter problems that require heavy computation or data-intensive
processing. Hence, on the one hand, we try to develop efficient algorithms for these
problems. On the other hand, with advances in hardware and in parallel and distributed
computing technology, we are interested in exploiting high-performance computing
resources to handle them.
Parallel and distributed computing technology has been focused on how to maximize
inherent parallelism using multicore/many-core processors and networked computing
resources. Various computing architectures and hardware techniques have been developed
such as symmetric multiprocessor (SMP) architecture, non-uniform memory access
(NUMA) architecture, simultaneous multithreading (SMT) architecture, single instruction
multiple data (SIMD) architecture, graphics processing unit (GPU), general purpose
graphics processing unit (GPGPU), and superscalar processor.
A variety of software technology has been developed to take advantage of hardware
capability and to effectively develop parallel and distributed applications. With the
plentiful frameworks of parallel and distributed computing, it would be of great help to
have performance comparison studies for the frameworks we may consider.
This paper is concerned with performance studies of three parallel programming
frameworks: OpenMP, MPI, and MapReduce. The comparative studies have been
conducted for two problem sets: the all-pairs-shortest-path problem and a join problem for
large data sets. OpenMP is the de facto standard model for shared memory systems, MPI
is the de facto standard for distributed memory systems, and MapReduce is recognized as
the de facto standard framework intended for big data processing. For each problem, the
parallel programs have been developed in terms of the three models, and their performance
has been observed.
The remainder of the paper is organized as follows: Section 2 briefly reviews the parallel
computing models and Section 3 presents the selected programming frameworks in more
detail. Section 4 explains the developed parallel programs for the problems with the three
frameworks. Section 5 shows the experiment results and finally Section 6 draws
conclusions.
In shared memory architectures, all processors access a common global address space.
Uniform memory access (UMA) machines are commonly represented by SMPs and assume
all processors to be identical. NUMA machines are often organized by physically linking
two or more SMPs, in which case not all processors have equal access time to all memories.
In distributed memory architectures, processors have their own memory, but there is no
global address space across all processors. They have a communication network to connect
processors’ memories.
Hybrid shared-distributed memory employs both shared and distributed memory
architectures. In clusters of multicore or many-core processors, cores in a processor share
their memory and multiple shared memory machines are networked to move data from one
machine to another.
There are several parallel programming models which allow users to specify concurrency
and locality at a high level: thread, message passing, data parallel, and single program
multiple data (SPMD) and multiple program multiple data (MPMD) models.
The thread model organizes a heavyweight process into multiple lightweight threads that
are executed concurrently. The POSIX threads library (a.k.a. pthreads) and OpenMP are
typical implementations of this model; a minimal pthreads sketch follows.
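In this sketch in C (the names and the work done by each thread are illustrative), a single heavyweight process creates several lightweight threads that run concurrently and share its address space:

#include <stdio.h>
#include <pthread.h>

#define NTHREADS 4
static int shared_data[NTHREADS];   /* lives in the single process's address space */

static void *worker(void *arg) {
    int id = *(int *)arg;
    shared_data[id] = id * id;      /* each lightweight thread writes its own slot */
    return NULL;
}

int main(void) {
    pthread_t threads[NTHREADS];
    int ids[NTHREADS];

    /* One heavyweight process spawns several lightweight threads
       that run concurrently and share global memory. */
    for (int i = 0; i < NTHREADS; i++) {
        ids[i] = i;
        pthread_create(&threads[i], NULL, worker, &ids[i]);
    }
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(threads[i], NULL);

    for (int i = 0; i < NTHREADS; i++)
        printf("shared_data[%d] = %d\n", i, shared_data[i]);
    return 0;
}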
In the message passing model, an application consists of a set of tasks that use their own
local memory and may be located on the same machine or across a number of machines.
Tasks exchange data by sending and receiving messages. MPI is the de facto industry
standard for message passing.
The data parallel model, also referred to as the partitioned global address space (PGAS)
model, provides each process with a view of the global memory even though memory is
distributed across the machines. It distinguishes between local and global memory
references under the control of the programmer, while the compiler and runtime take care
of converting remote memory accesses into message passing operations between processes.
There are several implementations of the data parallel model: Coarray Fortran, Unified
Parallel C, X10, and Chapel.
SPMD model is a high level programming paradigm that executes the same program with
different data multiple times. It is probably the most commonly used parallel programming
model for clusters of nodes. MPMD model is a high level programming paradigm that
allows multiple programs to run on different data. With the advent of the general purpose
graphics processing unit (GPGPU), hybrid parallel computing models have been
developed that utilize the many-core GPU to perform heavy computation under the control
of a host thread running on the host CPU.
When data volume is large, the demands on memory capacity may hinder its manipulation
and processing. To deal with such situations, big data processing frameworks such as
Hadoop and Dryad have been developed, which exploit multiple distributed machines.
Hadoop MapReduce is a programming model that abstracts an application into two phases
of Map and
Reduce. Dryad structures the computation as a directed graph in which vertices correspond
to tasks and edges are the channels of data transmission.
3.1. OpenMP
3.2. MPI
MPI is a message passing library specification which defines an extended message passing
model for parallel, distributed programming in a distributed computing environment. It is
not itself a specific implementation of a parallel programming environment; several
implementations of it exist, such as OpenMPI, MPICH, and GridMPI. In the MPI model,
each process has its own address space and communicates with other processes to access
data in their address spaces. Programmers are responsible for partitioning the workload
and mapping the tasks, that is, for deciding which tasks are to be computed by each process.
MPI provides point-to-point, collective, one-sided, and parallel I/O communication
models. Point-to-point communication exchanges data between two matched processes.
Collective communication involves a group of processes; a typical example is the broadcast
of a message from one process to all the others. One-sided communication facilitates
remote memory access without a matched process on the remote node; three one-sided
operations are available, for remote read, remote write, and remote update. MPI provides
various library functions to coordinate message passing in modes such as blocking and
non-blocking message passing, and it can send messages of gigabyte size between
processes. A minimal collective-communication sketch is given below.
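In this sketch in C (assuming an MPI implementation such as OpenMPI or MPICH; compile with mpicc and launch with mpirun), the root process broadcasts a value and, after the collective call, every rank holds a copy:

#include <stdio.h>
#include <mpi.h>

int main(int argc, char **argv) {
    int rank, size, data = 0;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0)
        data = 123;                 /* the root process owns the value to distribute */

    /* Collective communication: every process in the communicator
       participates, and after the call all ranks hold 'data'. */
    MPI_Bcast(&data, 1, MPI_INT, 0, MPI_COMM_WORLD);

    printf("rank %d of %d has data = %d\n", rank, size, data);

    MPI_Finalize();
    return 0;
}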
MPI has been implemented on various platforms like Linux, OS X, Solaris, and Windows.
Most MPI implementations use some kind of network file storage. As network file storage,
network file system (NFS) and Hadoop HDFS can be used. Because MPI is a high-level
abstraction for parallel programming, programmers can construct parallel and distributed
processing applications without a deep understanding of the underlying mechanisms of
process creation and synchronization. In order to exploit multicore processors, MPI
processes can be organized to run multiple threads within themselves. MPI-based programs
can be executed on a single computer or a cluster of computers.