Parallel and Distributed Computing
UCS645
Module-2
Saif Nalband
Contents
● Parallel Architecture: Implicit Parallelism, Array Processor, Vector
Processor, Dichotomy of Parallel Computing Platforms (Flynn's
Taxonomy, UMA, NUMA, Cache Coherence), Feng's Classification,
Handler's Classification, Limitations of Memory System Performance,
Interconnection Networks, Communication Costs in Parallel
Machines, Routing Mechanisms for Interconnection Networks,
Impact of Process-Processor Mapping and Mapping Techniques,
GPU.
von Neumann Computer Architecture
Automatic and Manual Parallelization
● Manually parallelizing code is hard, slow, and bug-prone.
● An automatically parallelizing compiler can analyze the source code to
identify parallelism.
○ It uses a cost-benefit framework to decide where parallelism would
improve performance.
○ Loops are a common target for automatic parallelization.
● Programmer directives may be used to guide the compiler (see the OpenMP sketch below).
● BUT,
○ Wrong results may be produced
○ Performance may degrade
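A minimal sketch of directive-guided parallelization, assuming OpenMP in C (the array, its size, and the reduction are hypothetical): the programmer annotates the loop, and the compiler/runtime splits its iterations across threads.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;                       /* serial initialization */

    /* The directive asks the compiler to parallelize this loop;
       reduction(+:sum) avoids a race on the shared accumulator. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}

Compile with an OpenMP-aware compiler (e.g., gcc -fopenmp); without the directive the same code simply runs serially, which is also why a misplaced directive can produce wrong results or slowdowns.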
Communication
[Figure: the two communication models — shared memory (several processors P attached to one common Memory) versus message passing (each processor P with its own private Memory, connected by a network).]
Parallel architecture : Shared Memory
• A shared memory architecture is a type of computer architecture
in which multiple processors or cores share a common,
centralized memory space.
• This allows all processors to access and modify the same
memory locations, enabling communication and data sharing
between them.
• Shared memory architectures are commonly used in parallel
computing systems, such as multi-core CPUs and GPUs.
Parallel architecture : Shared Memory
Key Characteristics of Shared Memory Architecture
1. Single Address Space:
   • All processors or cores share the same memory address space.
   • Any processor can directly access any memory location.
2. Communication via Memory:
   • Processors communicate by reading from and writing to shared memory locations.
   • No explicit message passing is required.
3. Synchronization:
   • Mechanisms like locks, semaphores, or barriers are needed to coordinate access to shared memory and avoid race conditions (see the sketch after this list).
4. Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA):
   • In UMA, all processors have equal access time to memory.
   • In NUMA, memory access time depends on the location of the memory relative to the processor.
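A minimal sketch of the synchronization point above, assuming POSIX threads and a hypothetical shared counter: the mutex serializes the read-modify-write so the two threads do not race.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);          /* coordinate access to shared data */
        counter++;                          /* read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 200000 with the lock in place */
    return 0;
}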
Parallel architecture : Shared Memory
[Figure: UMA vs. NUMA. UMA — all processors P reach a single shared Memory through a common memory controller; NUMA — each processor P has its own local Memory, and remote memory is reached over the interconnect.]
Parallel architecture : Shared Memory
Advantages of Shared Memory Architecture
1.Ease of Programming:
• Developers can use shared variables for communication, which is simpler
than explicit message passing.
2.Efficient Data Sharing:
• Data can be shared between processors without copying.
3.Scalability:
• Suitable for systems with a small to moderate number of processors.
Parallel architecture : Shared Memory
Disadvantages of Shared Memory Architecture
1.Memory Contention:
• Multiple processors accessing the same memory can lead to contention and
performance bottlenecks.
2.Synchronization Overhead:
• Ensuring proper synchronization (e.g., using locks) can be complex and may
reduce performance.
3.Scalability Limits:
• As the number of processors increases, contention and latency issues can
make it difficult to scale.
Distributed Memory Architecture
• Distributed Memory Architecture is a type of parallel computing
architecture where each processor or node has its own private
memory.
• Unlike shared memory architectures, processors in a distributed
memory system cannot directly access the memory of other
processors.
• Instead, they communicate by explicitly sending and receiving
messages over a network. This architecture is widely used in
high-performance computing (HPC), clusters, and distributed systems.
Parallel architecture : Distributed Memory
Key Characteristics of Distributed Memory Architecture
1.Private Memory:
• Each processor has its own local memory, which is not directly accessible by other
processors.
2.Message Passing:
• Communication between processors occurs via explicit message passing (e.g., sending and
receiving data).
3.Scalability:
• Distributed memory systems can scale to a large number of processors or nodes, making
them suitable for large-scale computations.
4.No Cache Coherence:
• Since memory is not shared, there is no need for cache coherence protocols.
5.Network Interconnect:
• Processors are connected via a high-speed network (e.g., Ethernet, InfiniBand) for
communication.
Parallel architecture : Distributed Memory
Advantages of Distributed Memory Architecture
1.Scalability:
• Can scale to thousands of processors or nodes, making it ideal for large-scale
systems.
2.Cost-Effectiveness:
• Easier to build and expand using commodity hardware.
3.Fault Tolerance:
• Failure of one node does not necessarily affect the entire system.
4.Flexibility:
• Suitable for a wide range of applications, from scientific simulations to big data
processing.
Parallel architecture : Distributed Memory
Disadvantages of Distributed Memory Architecture
1.Complex Programming:
• Requires explicit communication between processors, which can be
challenging to implement and debug.
2.Communication Overhead:
• Message passing introduces latency and bandwidth limitations, which can
impact performance.
3.Load Balancing:
• Distributing work evenly across nodes can be difficult, especially for irregular
workloads.
Parallel Programming Models
Shared Memory
• All processors or threads share a common memory space.
• Uses synchronization mechanisms like locks, semaphores, and
barriers to avoid race conditions.
• Suitable for multi-core CPUs and systems with uniform memory
access (UMA).
• Examples: OpenMP: A widely used API for shared memory parallel
programming in C/C++ and Fortran.
• Pthreads: A POSIX standard for thread management in C.
Parallel Programming Models
Message Passing Model
• Each processor has its own private memory, and communication
between processors occurs by explicitly sending and receiving
messages.
• No shared memory is used.
• Requires explicit communication between processes.
• Suitable for distributed memory systems (e.g., clusters) and non-
uniform memory access (NUMA) architectures.
• Examples:
• MPI (Message Passing Interface): A standard for message passing in
distributed systems.
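A minimal message-passing sketch, assuming MPI and two hypothetical processes: rank 0 sends an integer to rank 1, which must receive it explicitly — no memory is shared.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* explicit receive completes the transfer */
    }

    MPI_Finalize();
    return 0;
}

Launched with, e.g., mpirun -np 2 ./a.out (launcher name assumed), each process runs the same program on its own private memory.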
Parallel Programming Models
• Threads
• Data Parallel Model
• The same operation is applied to multiple elements of a dataset in parallel.
• Focuses on parallelizing data processing rather than control flow.
• Key Features:
• Often used in SIMD (Single Instruction, Multiple Data) architectures.
• Well-suited for regular data structures like arrays and matrices.
• Examples:
• CUDA: A parallel computing platform for NVIDIA GPUs.
• OpenCL: A framework for parallel programming across CPUs, GPUs, and other
accelerators.
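The slide names CUDA and OpenCL; as a minimal, language-neutral sketch of the data-parallel idea itself (the same operation applied independently to every element), here is a hypothetical vector addition in C whose loop a SIMD unit or GPU back end could map onto parallel lanes.

#include <stddef.h>

/* Element-wise vector addition: the same operation is applied
   independently at every index, which is what SIMD/GPU hardware exploits. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];            /* no dependence between iterations */
}

In CUDA or OpenCL the loop body would become a kernel executed by one thread (work-item) per element.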
Parallel Architecture Component
• Processor
• Memory
• Shared
• Distributed
• Communication
• Hierarchical, Crossbar, Bus, memory
• Synchronization
• Control
• Centralized
• distributed
Flynn's Classical Taxonomy
• There are a number of different ways to
classify parallel computers. Examples are
available in the references.
• One of the more widely used
classifications, in use since 1966, is
called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-
processor computer architectures
according to how they can be classified
along the two independent dimensions of
Instruction Stream and Data Stream.
Each of these dimensions can have only
one of two possible states: Single or
Multiple.
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• Single Instruction: Only one
instruction stream is being acted
on by the CPU during any one
clock cycle
• Single Data: Only one data stream
is being used as input during any
one clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation
mainframes, minicomputers,
workstations and single
processor/core PCs.
Single Instruction, Multiple Data (SIMD)
• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple Data: Each processing unit can operate on a different data
element
• Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,
ETA10
• Most modern computers, particularly those with graphics processor units
(GPUs) employ SIMD instructions and execution units.
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
• A type of parallel computer
• Multiple Instruction: Each processing unit operates on the
data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple
processing units.
• Few (if any) actual examples of this class of parallel computer
have ever existed.
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single
coded message.
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)
• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different
instruction stream
• Multiple Data: Every processor may be working with a different data
stream
• Execution can be synchronous or asynchronous, deterministic or non-
deterministic
• Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components.
Multiple Instruction, Multiple Data (MIMD)
Feng's Classification
Feng's Classification is a method used to categorize computer architectures
based on the degree of parallelism in processing operations.
It was proposed by Tse-yun Feng in 1972.
The classification divides computer architectures into four categories:
1.Word Serial Bit Serial (WSBS): Processes one bit of one word at a time.
2.Word Serial Bit Parallel (WSBP): Processes all bits of one word at a time.
3.Word Parallel Bit Serial (WPBS): Processes one bit from all words at a time.
4.Word Parallel Bit Parallel (WPBP): Processes all bits from all words
simultaneously
Limitations of Memory System Performance
Memory system performance is often a bottleneck in computer systems, and its limitations are
primarily captured by two parameters: latency and bandwidth.
1.Latency: This is the time delay from when a memory request is made to when the data is
available to the processor. High latency can significantly slow down system performance
because the processor has to wait for data.
2.Bandwidth: This refers to the rate at which data can be transferred from the memory to the
processor. Limited bandwidth can restrict the amount of data that can be processed in a given
time, affecting overall system performance.
Other factors that can impact memory system performance include:
•Cache Misses: When the data needed by the processor is not found in the cache, it has to be
fetched from the slower main memory, increasing latency.
•Memory Access Patterns: Irregular or unpredictable access patterns can lead to inefficient
use of the memory hierarchy.
•Memory Contention: In multi-core systems, multiple processors accessing memory
simultaneously can lead to contention, reducing effective bandwidth
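As an illustrative calculation (hypothetical numbers): a 1 GHz processor completes one operation per 1 ns cycle, but if every operand must come from DRAM with a 100 ns latency and nothing hides that latency, each access stalls the processor for about 100 cycles and the sustained rate drops to roughly 1% of peak. Caches, prefetching, and multithreading are the standard ways of hiding this latency.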
Interconnect and goals
• Scalable performance
• Job of the network: transfer of data
• Small latency
• Allow a large number of transfers simultaneously
• Low cost
Interconnection Networks for Parallel Computers
• Interconnection networks carry data between processors and to
memory.
• Interconnects are made of switches and links (wires, fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links among
processing nodes and are also referred to as direct networks.
• Dynamic networks are built using switches and communication links.
Dynamic networks are also referred to as indirect networks.
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network;
and (b) a dynamic network.
Interconnection Networks
• Switches map a fixed number of inputs to outputs.
• The total number of ports on a switch is the degree of the
switch.
• The cost of a switch grows as the square of the degree of the
switch, the peripheral hardware linearly as the degree, and the
packaging costs linearly as the number of pins.
Interconnection Networks: Network Interfaces
• Processors talk to the network via a network interface.
• The network interface may hang off the I/O bus or the memory
bus.
• In a physical sense, this distinguishes a cluster from a tightly
coupled multicomputer.
• The relative speeds of the I/O and memory buses impact the
performance of the network.
Network Topologies
• A variety of network topologies have been proposed and
implemented.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of
multiple topologies for reasons of packaging, cost, and
available components.
Network Topologies
• Bus
• Crossbar
• Complete Graph
• Subset
• Multistage network
• Evaluation Parameters include
• Diameter, latency
• Contention, bisection width
• Cost
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used buses.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus
also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major
bottleneck.
• Typical bus based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus
multiprocessors are examples of such architectures.
Network Topologies: Buses
Bus-based interconnects (a) with no local caches; (b) with local
memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve
the performance of bus-based machines.
Network Topologies: Crossbars
A crossbar network uses a p × m grid of switches to connect p
inputs to m outputs in a non-blocking manner.
A completely non-blocking crossbar network connecting p
processors to b memory banks.
Network Topologies: Crossbars
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun Ultra
HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks
• Crossbars have excellent performance scalability but poor
cost scalability.
• Buses have excellent cost scalability, but poor
performance scalability.
• Multistage interconnects strike a compromise between
these extremes.
Network Topologies: Multistage Networks
The schematic of a typical multistage interconnection network.
Network Topologies: Multistage Omega Network
• One of the most commonly used multistage
interconnects is the Omega network.
• This network consists of log p stages, where p is the
number of inputs/outputs.
• At each stage, input i is connected to output j by a perfect shuffle (a left rotation of the binary representation of i):
   j = 2i,          for 0 ≤ i ≤ p/2 − 1
   j = 2i + 1 − p,  for p/2 ≤ i ≤ p − 1
Network Topologies: Multistage Omega Network
Each stage of the Omega network implements a perfect shuffle as follows:
[Figure: a perfect shuffle interconnection for eight inputs and outputs.]
Network Topologies: Multistage Omega Network
• The perfect shuffle patterns are connected using 2 × 2
switches.
• The switches operate in two modes: pass-through or cross-over.
[Figure: the two switching configurations of the 2 × 2 switch: (a) pass-through; (b) cross-over.]
Network Topologies: Multistage Omega Network
A complete Omega network with the perfect shuffle interconnects and switches can now
be illustrated:
A complete omega network connecting eight inputs and eight outputs.
An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p).
Butterfly Network
Network Topologies: Multistage Omega
Network – Routing
• Let s be the binary representation of the source node and d that of the destination node.
• The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the switch routes the data in pass-through mode; otherwise it routes it in cross-over mode.
• This process is repeated at each of the log p switching stages, using the next most significant bit at each stage (see the sketch below).
• Note that this is not a non-blocking network.
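A minimal sketch of this routing rule, assuming p is a power of two and s, d are node indices in [0, p); the function prints the switch setting chosen at each of the log p stages (e.g., routing 010 to 111 in an eight-node network).

#include <stdio.h>

/* Omega-network routing: at stage k, compare the k-th most significant
   bit of source and destination. Equal bits -> pass-through,
   different bits -> cross-over. */
void omega_route(unsigned s, unsigned d, unsigned log_p) {
    for (unsigned k = 0; k < log_p; k++) {
        unsigned bit = log_p - 1 - k;                 /* most significant bit first */
        unsigned sb = (s >> bit) & 1u;
        unsigned db = (d >> bit) & 1u;
        printf("stage %u: %s\n", k, (sb == db) ? "pass-through" : "cross-over");
    }
}

int main(void) {
    omega_route(2u, 7u, 3u);   /* route 010 -> 111 in an 8-node (log p = 3) network */
    return 0;
}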
Network Topologies: Multistage Omega Network – Routing
An example of blocking in omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies: Completely Connected
Network
• Each processor is connected to every other processor.
• The number of links in the network scales as O(p²).
• While the performance scales very well, the hardware complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of crossbars.
Network Topologies: Completely Connected and Star
Connected Networks
[Figure: (a) a completely-connected network of eight nodes; (b) a star-connected network of nine nodes.]
Network Topologies: Star Connected Network
• Every node is connected only to a common node at the
center.
• Distance between any pair of nodes is O(1). However, the
central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts
of buses.
Network Topologies: Linear Arrays, Meshes, and k-d
Meshes
•In a linear array, each node has two neighbors, one
to its left and one to its right. If the nodes at either end
are connected, we refer to it as a 1-D torus or a ring.
•A generalization to 2 dimensions has nodes with 4
neighbors, to the north, south, east, and west.
•A further generalization to d dimensions has nodes
with 2d neighbors.
•A special case of a d-dimensional mesh is a hypercube.
Here,
d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Network Topologies: Two- and Three Dimensional
Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link
(2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
Network Topologies: Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit
positions at which the two nodes differ.
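A small sketch of the last property, assuming node labels are the usual hypercube bit strings: the hop distance between two nodes is the Hamming distance of their labels.

#include <stdio.h>

/* Number of bit positions in which the two node labels differ. */
int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

int main(void) {
    /* In a 3-D hypercube (p = 8), nodes 000 and 111 are 3 hops apart. */
    printf("distance(0, 7) = %d\n", hypercube_distance(0u, 7u));
    printf("distance(5, 4) = %d\n", hypercube_distance(5u, 4u));   /* 101 vs 100 -> 1 hop */
    return 0;
}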
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and
(b) a dynamic tree network.
Network Topologies: Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those
at the lower levels.
• For this reason, a variant called a fat-tree, fattens the links as
we go up the tree.
• Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.
Evaluating Static Interconnection Networks
• Number of links: it is fair to assume that each link incurs a cost; fewer links also imply a simpler network.
• Degree: the degree of a node is the number of links (ports) connected to it. A higher degree allows more concurrent communication but raises per-node cost.
• Total bandwidth: the maximum rate at which the entire network can move data. For a network with l links, each of link bandwidth b, the maximum network bandwidth is b·l.
• Minimum throughput: for each pair of nodes there is a maximum rate at which data can be sent between them; the minimum of these rates over all pairs is the minimum network throughput.
• Average path length: path lengths can vary significantly from pair to pair; this metric is the shortest path length between a pair of nodes, averaged over all pairs.
Evaluating Static Interconnection Networks
• Diameter: the distance between the farthest two nodes in the network.
The diameter of a linear array is p − 1, that of a 2-D mesh is 2(√p − 1), that of a
complete binary tree is about 2 log p, that of a hypercube is log p, and that of a
completely connected network is O(1).
• Bisection width: the minimum number of wires that must be cut to divide
the network into two equal halves. The bisection width of a linear array
and of a tree is 1, that of a 2-D mesh is √p, that of a hypercube is p/2, and that of
a completely connected network is p²/4.
• Cost: the number of links or switches (whichever is
asymptotically higher) is a meaningful measure of the cost.
However, a number of other factors, such as the ability to lay out the
network, the length of wires, etc., also factor into the cost.
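A small sketch tabulating these formulas for a hypothetical machine size (it assumes p is both a perfect square and a power of two so every formula quoted above applies).

#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 64;                                   /* hypothetical number of nodes */
    double sq = sqrt((double)p);
    double lg = log2((double)p);

    printf("linear array:    diameter %d, bisection 1\n", p - 1);
    printf("2-D mesh:        diameter %.0f, bisection %.0f\n", 2.0 * (sq - 1.0), sq);
    printf("hypercube:       diameter %.0f, bisection %d\n", lg, p / 2);
    printf("fully connected: diameter 1, bisection %d\n", (p * p) / 4);
    return 0;
}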
Evaluating Static and Dynamic Interconnection Networks
[Tables summarizing the above metrics — diameter, bisection width, number of links/switches, and cost — for the static and dynamic networks discussed in this module.]
Cache Coherence in Multiprocessor Systems
• Interconnects provide basic mechanisms for data transfer.
• In the case of shared address space machines, additional hardware is required
to coordinate access to data that might have multiple copies in the network.
• The underlying technique must provide some guarantees on the semantics.
• This guarantee is generally one of serializability, i.e., there exists some serial order of
instruction execution that corresponds to the parallel schedule.
Cache Coherence in Multiprocessor Systems
When the value of a variable is changed, all its copies must either be invalidated or updated.
[Figure: cache coherence in multiprocessor systems — (a) invalidate protocol; (b) update protocol for shared variables.]
Cache Coherence: Update and Invalidate Protocols
• If a processor reads a value only once and does not
need it again, an update protocol may generate significant
overhead.
• If two processors make interleaved tests and updates to a
variable, an update protocol is better.
• Both protocols suffer from false-sharing overheads (two
different words that are not shared but happen to lie on the same
cache line).
• Most current machines use invalidate protocols.
Maintaining Coherence Using Invalidate Protocols
• Each copy of a data item is associated with a state.
• One example of such a set of states is: shared, invalid, and dirty.
• In shared state, there are multiple valid copies of the data item
(and therefore, an invalidate would have to be generated on
an update).
• In dirty state, only one copy exists and therefore, no invalidates
need to be generated.
• In invalid state, the data copy is invalid, therefore, a read
generates a data request (and associated state changes).
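A minimal software sketch of such a three-state protocol (shared / dirty / invalid), under the simplifying assumptions that events arrive one at a time and that bus messages are abstracted away; it is only meant to make the state transitions concrete, not to model real coherence hardware.

#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

/* Simplified transition function for one cache line in one cache. */
line_state_t next_state(line_state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;   /* read miss: fetch a valid copy      */
        if (e == LOCAL_WRITE) return DIRTY;    /* write miss: fetch, invalidate others */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE)  return DIRTY;   /* send invalidates to other copies   */
        if (e == REMOTE_WRITE) return INVALID; /* another processor wrote: copy dies */
        return SHARED;
    case DIRTY:
        if (e == REMOTE_READ)  return SHARED;  /* write back; others may now share   */
        if (e == REMOTE_WRITE) return INVALID;
        return DIRTY;                          /* local reads/writes stay dirty      */
    }
    return INVALID;
}

int main(void) {
    line_state_t s = INVALID;
    s = next_state(s, LOCAL_READ);   /* INVALID -> SHARED  */
    s = next_state(s, LOCAL_WRITE);  /* SHARED  -> DIRTY   */
    s = next_state(s, REMOTE_READ);  /* DIRTY   -> SHARED  */
    printf("final state = %d\n", s);
    return 0;
}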
Maintaining Coherence Using Invalidate Protocols
State diagram of a simple three-state coherence protocol.
Communication Costs in Parallel Machines
• Along with idling and contention, communication is a major overhead
in parallel programs.
• The cost of communication is dependent on a variety of
features including the programming model semantics, the
network topology, data handling and routing, and associated
software protocols.
Message Passing Costs in Parallel Computers
The total time to transfer a message over a network
comprises the following:
• Startup time (ts): time spent at the sending and
receiving nodes (executing the routing algorithm,
programming routers, etc.).
• Per-hop time (th): a function of the number of
hops; includes factors such as switch latencies,
network delays, etc.
• Per-word transfer time (tw): includes all
overheads that are determined by the length of the
message, such as the bandwidth of links, error
checking and correction, etc.
Store-and-Forward Routing
• A message traversing multiple hops is completely received
at an intermediate hop before being forwarded to the next
hop.
• The total communication cost for a message of size m words
to traverse l communication links is
  tcomm = ts + (m·tw + th)·l
where th is the per-hop time spent at each intermediate node and
m·tw is the time for the message body to traverse each link.
• In most platforms, th is small and the above
expression can be approximated by
  tcomm = ts + m·l·tw
Routing Techniques
Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and
(c) extending the concept to cut-through routing. The shaded regions represent the time that the
message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
• Store-and-forward makes poor use of communication resources.
• Packet routing breaks messages into packets and pipelines
them through the network.
• Since packets may take different paths, each packet must
carry routing information, error checking, sequencing, and
other related header information.
• The total communication time for packet routing is approximated
by:
tcomm = ts + th·l + tw·m
• The factor tw accounts for overheads in packet headers.
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further dividing
messages into basic units called flits.
• Since flits are typically small, the header information must be minimized.
• This is done by forcing all flits to take the same path, in
sequence.
• A tracer message first programs all intermediate routers. All flits then take
the same route.
• Error checks are performed on the entire message, as opposed to flits.
• No sequence numbers are needed.
Cut-Through Routing
• The total communication time for cut-through routing is
approximated by:
  tcomm = ts + th·l + tw·m
• This is identical to packet routing; however, tw is typically
much smaller.
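A small sketch comparing the two cost models with hypothetical parameter values (ts, th, tw, and the message/path sizes are made up for illustration).

#include <stdio.h>

/* Store-and-forward: tcomm = ts + (m*tw + th) * l  */
double cost_sf(double ts, double th, double tw, double m, double l) {
    return ts + (m * tw + th) * l;
}

/* Cut-through:       tcomm = ts + th*l + tw*m      */
double cost_ct(double ts, double th, double tw, double m, double l) {
    return ts + th * l + tw * m;
}

int main(void) {
    double ts = 50.0, th = 2.0, tw = 0.5;   /* hypothetical times (e.g., microseconds) */
    double m = 1000.0, l = 4.0;             /* 1000-word message over 4 hops           */
    printf("store-and-forward: %.1f\n", cost_sf(ts, th, tw, m, l));  /* 50 + (500+2)*4 = 2058 */
    printf("cut-through:       %.1f\n", cost_ct(ts, th, tw, m, l));  /* 50 + 8 + 500   = 558  */
    return 0;
}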
Simplified Cost Model for Communicating Messages
• The cost of communicating a message between two nodes l
hops away using cut-through routing is given by
  tcomm = ts + l·th + tw·m
• In this expression, th is typically smaller than ts and tw. For this
reason, the second term on the RHS does not dominate the total cost,
particularly when m is large.
• Furthermore, it is often not possible to control the routing and
placement of tasks.
• For these reasons, we can approximate the cost of message
transfer by
  tcomm = ts + tw·m
Simplified Cost Model for Communicating Messages
• It is important to note that the above
expressions for communication time are valid only for
uncongested networks.
• If a link carries multiple messages, the corresponding tw
term must be scaled up by the number of messages.
• Different communication patterns congest a given
network to varying extents.
• It is important to understand this and account for it in the
communication time accordingly.
Multi-threading reduces stalls
• Interleave the processing of multiple threads on the same core to hide
stalls.
• If you can't make progress on the current thread, work on
another one.
Reading
• A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel
Computing, Addison-Wesley (2003). Chapter 2: 2.1, 2.2, 2.3, 2.4,
2.5.1, 2.6.
Thank You