Parallel and Distributed Computing
UCS645
Module-2
Saif Nalband
Contents
● Parallel Architecture: Implicit Parallelism, Array Processor, Vector
Processor, Dichotomy of Parallel Computing Platforms (Flynn's
Taxonomy, UMA, NUMA, Cache Coherence), Feng's Classification,
Handler's Classification, Limitations of Memory System Performance,
Interconnection Networks, Communication Costs in Parallel
Machines, Routing Mechanisms for Interconnection Networks,
Impact of Process-Processor Mapping and Mapping Techniques,
GPU.
von Neumann Computer Architecture
Automatic and Manual Parallelization
● Manually parallelizing code is hard, slow, and bug-prone.
● An automatically parallelizing compiler can analyze the source code to
identify parallelism.
○ It uses a cost-benefit framework to decide where parallelism would
improve performance.
○ Loops are a common target for automatic parallelization.
● Programmer directives may be used to guide the compiler (see the OpenMP sketch below).
● BUT,
○ Wrong results may be produced
○ Performance may degrade
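A minimal sketch of directive-guided parallelization, assuming OpenMP in C (the array, its size, and the reduction are hypothetical): the programmer annotates the loop, and the compiler/runtime splits its iterations across threads.

#include <omp.h>
#include <stdio.h>

#define N 1000000

int main(void) {
    static double a[N];
    double sum = 0.0;

    for (int i = 0; i < N; i++)
        a[i] = 0.5 * i;                       /* serial initialization */

    /* The directive asks the compiler to parallelize this loop;
       reduction(+:sum) avoids a race on the shared accumulator. */
    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < N; i++)
        sum += a[i];

    printf("sum = %f\n", sum);
    return 0;
}

Compile with an OpenMP-aware compiler (e.g., gcc -fopenmp); without the directive the same code simply runs serially, which is also why a misplaced directive can produce wrong results or slowdowns.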
Communication
[Figure: the two communication models — shared memory (several processors P attached to one common Memory) versus message passing (each processor P with its own private Memory, connected by a network).]
Parallel architecture : Shared Memory
• A shared memory architecture is a type of computer architecture
in which multiple processors or cores share a common,
centralized memory space.
• This allows all processors to access and modify the same
memory locations, enabling communication and data sharing
between them.
• Shared memory architectures are commonly used in parallel
computing systems, such as multi-core CPUs and GPUs.
Parallel architecture : Shared Memory
Key Characteristics of Shared Memory Architecture
1. Single Address Space:
   • All processors or cores share the same memory address space.
   • Any processor can directly access any memory location.
2. Communication via Memory:
   • Processors communicate by reading from and writing to shared memory locations.
   • No explicit message passing is required.
3. Synchronization:
   • Mechanisms like locks, semaphores, or barriers are needed to coordinate access to shared memory and avoid race conditions (see the sketch after this list).
4. Uniform Memory Access (UMA) vs. Non-Uniform Memory Access (NUMA):
   • In UMA, all processors have equal access time to memory.
   • In NUMA, memory access time depends on the location of the memory relative to the processor.
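A minimal sketch of the synchronization point above, assuming POSIX threads and a hypothetical shared counter: the mutex serializes the read-modify-write so the two threads do not race.

#include <pthread.h>
#include <stdio.h>

static long counter = 0;                    /* shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);          /* coordinate access to shared data */
        counter++;                          /* read-modify-write */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, worker, NULL);
    pthread_create(&t2, NULL, worker, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("counter = %ld\n", counter);     /* 200000 with the lock in place */
    return 0;
}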
Parallel architecture : Shared Memory
[Figure: UMA vs. NUMA. UMA — all processors P reach a single shared Memory through a common memory controller; NUMA — each processor P has its own local Memory, and remote memory is reached over the interconnect.]
Parallel architecture : Shared Memory
Advantages of Shared Memory Architecture
1.Ease of Programming:
• Developers can use shared variables for communication, which is simpler
than explicit message passing.
2.Efficient Data Sharing:
• Data can be shared between processors without copying.
3.Scalability:
• Suitable for systems with a small to moderate number of processors.
Parallel architecture : Shared Memory
Disadvantages of Shared Memory Architecture
1.Memory Contention:
• Multiple processors accessing the same memory can lead to contention and
performance bottlenecks.
2.Synchronization Overhead:
• Ensuring proper synchronization (e.g., using locks) can be complex and may
reduce performance.
3.Scalability Limits:
• As the number of processors increases, contention and latency issues can
make it difficult to scale.
Distributed Memory Architecture
• Distributed Memory Architecture is a type of parallel computing
architecture where each processor or node has its own private
memory.
• Unlike shared memory architectures, processors in a distributed
memory system cannot directly access the memory of other
processors.
• Instead, they communicate by explicitly sending and receiving
messages over a network. This architecture is widely used in
high-performance computing (HPC), clusters, and distributed systems.
Parallel architecture : Distributed Memory
Key Characteristics of Distributed Memory Architecture
1.Private Memory:
• Each processor has its own local memory, which is not directly accessible by other
processors.
2.Message Passing:
• Communication between processors occurs via explicit message passing (e.g., sending and
receiving data).
3.Scalability:
• Distributed memory systems can scale to a large number of processors or nodes, making
them suitable for large-scale computations.
4.No Cache Coherence:
• Since memory is not shared, there is no need for cache coherence protocols.
5.Network Interconnect:
• Processors are connected via a high-speed network (e.g., Ethernet, InfiniBand) for
communication.
Parallel architecture : Distributed Memory
Advantages of Distributed Memory Architecture
1.Scalability:
• Can scale to thousands of processors or nodes, making it ideal for large-scale
systems.
2.Cost-Effectiveness:
• Easier to build and expand using commodity hardware.
3.Fault Tolerance:
• Failure of one node does not necessarily affect the entire system.
4.Flexibility:
• Suitable for a wide range of applications, from scientific simulations to big data
processing.
Parallel architecture : Distributed Memory
Disadvantages of Distributed Memory Architecture
1.Complex Programming:
• Requires explicit communication between processors, which can be
challenging to implement and debug.
2.Communication Overhead:
• Message passing introduces latency and bandwidth limitations, which can
impact performance.
3.Load Balancing:
• Distributing work evenly across nodes can be difficult, especially for irregular
workloads.
Parallel Programming Models
Shared Memory
• All processors or threads share a common memory space.
• Uses synchronization mechanisms like locks, semaphores, and
barriers to avoid race conditions.
• Suitable for multi-core CPUs and systems with uniform memory
access (UMA).
• Examples: OpenMP: A widely used API for shared memory parallel
programming in C/C++ and Fortran.
• Pthreads: A POSIX standard for thread management in C.
Parallel Programming Models
Message Passing Model
• Each processor has its own private memory, and communication
between processors occurs by explicitly sending and receiving
messages.
• No shared memory is used.
• Requires explicit communication between processes.
• Suitable for distributed memory systems (e.g., clusters) and non-
uniform memory access (NUMA) architectures.
• Examples:
• MPI (Message Passing Interface): A standard for message passing in
distributed systems.
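A minimal message-passing sketch, assuming MPI and two hypothetical processes: rank 0 sends an integer to rank 1, which must receive it explicitly — no memory is shared.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                   /* data lives in rank 0's private memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);        /* explicit receive completes the transfer */
    }

    MPI_Finalize();
    return 0;
}

Launched with, e.g., mpirun -np 2 ./a.out (launcher name assumed), each process runs the same program on its own private memory.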
Parallel Programming Models
• Threads
• Data Parallel Model
• The same operation is applied to multiple elements of a dataset in parallel.
• Focuses on parallelizing data processing rather than control flow.
• Key Features:
• Often used in SIMD (Single Instruction, Multiple Data) architectures.
• Well-suited for regular data structures like arrays and matrices.
• Examples:
• CUDA: A parallel computing platform for NVIDIA GPUs.
• OpenCL: A framework for parallel programming across CPUs, GPUs, and other
accelerators.
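The slide names CUDA and OpenCL; as a minimal, language-neutral sketch of the data-parallel idea itself (the same operation applied independently to every element), here is a hypothetical vector addition in C whose loop a SIMD unit or GPU back end could map onto parallel lanes.

#include <stddef.h>

/* Element-wise vector addition: the same operation is applied
   independently at every index, which is what SIMD/GPU hardware exploits. */
void vec_add(const float *a, const float *b, float *c, size_t n) {
    #pragma omp simd
    for (size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];            /* no dependence between iterations */
}

In CUDA or OpenCL the loop body would become a kernel executed by one thread (work-item) per element.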
Parallel Architecture Component
• Processor
• Memory
• Shared
• Distributed
• Communication
• Hierarchical, Crossbar, Bus, memory
• Synchronization
• Control
• Centralized
• distributed
Flynn's Classical Taxonomy
• There are a number of different ways to
classify parallel computers. Examples are
available in the references.
• One of the more widely used
classifications, in use since 1966, is
called Flynn's Taxonomy.
• Flynn's taxonomy distinguishes multi-
processor computer architectures
according to how they can be classified
along the two independent dimensions of
Instruction Stream and Data Stream.
Each of these dimensions can have only
one of two possible states: Single or
Multiple.
Single Instruction, Single Data (SISD)
• A serial (non-parallel) computer
• Single Instruction: Only one
instruction stream is being acted
on by the CPU during any one
clock cycle
• Single Data: Only one data stream
is being used as input during any
one clock cycle
• Deterministic execution
• This is the oldest type of computer
• Examples: older generation
mainframes, minicomputers,
workstations and single
processor/core PCs.
Single Instruction, Multiple Data (SIMD)
• A type of parallel computer
• Single Instruction: All processing units execute the same instruction at any
given clock cycle
• Multiple Data: Each processing unit can operate on a different data
element
• Best suited for specialized problems characterized by a high degree of
regularity, such as graphics/image processing.
• Synchronous (lockstep) and deterministic execution
• Two varieties: Processor Arrays and Vector Pipelines
• Examples:
• Processor Arrays: Thinking Machines CM-2, MasPar MP-1 & MP-2, ILLIAC IV
• Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2, Hitachi S820,
ETA10
• Most modern computers, particularly those with graphics processor units
(GPUs) employ SIMD instructions and execution units.
Single Instruction, Multiple Data (SIMD)
Multiple Instruction, Single Data (MISD)
• A type of parallel computer
• Multiple Instruction: Each processing unit operates on the
data independently via separate instruction streams.
• Single Data: A single data stream is fed into multiple
processing units.
• Few (if any) actual examples of this class of parallel computer
have ever existed.
• Some conceivable uses might be:
• multiple frequency filters operating on a single signal stream
• multiple cryptography algorithms attempting to crack a single
coded message.
Multiple Instruction, Single Data (MISD)
Multiple Instruction, Multiple Data (MIMD)
• A type of parallel computer
• Multiple Instruction: Every processor may be executing a different
instruction stream
• Multiple Data: Every processor may be working with a different data
stream
• Execution can be synchronous or asynchronous, deterministic or non-
deterministic
• Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
• Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
• Note: many MIMD architectures also include SIMD execution sub-components.
Multiple Instruction, Multiple Data (MIMD)
Feng's Classification
Feng's Classification is a method used to categorize computer architectures
based on the degree of parallelism in processing operations.
It was proposed by Tse-yun Feng in 1972.
The classification divides computer architectures into four categories:
1.Word Serial Bit Serial (WSBS): Processes one bit of one word at a time.
2.Word Serial Bit Parallel (WSBP): Processes all bits of one word at a time.
3.Word Parallel Bit Serial (WPBS): Processes one bit from all words at a time.
4.Word Parallel Bit Parallel (WPBP): Processes all bits from all words
simultaneously
Limitations of Memory System Performance
Memory system performance is often a bottleneck in computer systems, and its limitations are
primarily captured by two parameters: latency and bandwidth.
1.Latency: This is the time delay from when a memory request is made to when the data is
available to the processor. High latency can significantly slow down system performance
because the processor has to wait for data.
2.Bandwidth: This refers to the rate at which data can be transferred from the memory to the
processor. Limited bandwidth can restrict the amount of data that can be processed in a given
time, affecting overall system performance.
Other factors that can impact memory system performance include:
•Cache Misses: When the data needed by the processor is not found in the cache, it has to be
fetched from the slower main memory, increasing latency.
•Memory Access Patterns: Irregular or unpredictable access patterns can lead to inefficient
use of the memory hierarchy.
•Memory Contention: In multi-core systems, multiple processors accessing memory
simultaneously can lead to contention, reducing effective bandwidth
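As an illustrative calculation (hypothetical numbers): a 1 GHz processor completes one operation per 1 ns cycle, but if every operand must come from DRAM with a 100 ns latency and nothing hides that latency, each access stalls the processor for about 100 cycles and the sustained rate drops to roughly 1% of peak. Caches, prefetching, and multithreading are the standard ways of hiding this latency.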
Interconnect and goals
• Scalable performance
• Job of the network: transfer of data
• Small latency
• Allow a large number of transfers simultaneously
• Low cost
Interconnection Networks for Parallel Computers
• Interconnection networks carry data between processors and to
memory.
• Interconnects are made of switches and links (wires, fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication links among
processing nodes and are also referred to as direct networks.
• Dynamic networks are built using switches and communication links.
Dynamic networks are also referred to as indirect networks.
Static and Dynamic Interconnection Networks
Classification of interconnection networks: (a) a static network;
and (b) a dynamic network.
Interconnection Networks
• Switches map a fixed number of inputs to outputs.
• The total number of ports on a switch is the degree of the
switch.
• The cost of a switch grows as the square of the degree of the
switch, the peripheral hardware linearly as the degree, and the
packaging costs linearly as the number of pins.
Interconnection Networks: Network Interfaces
• Processors talk to the network via a network interface.
• The network interface may hang off the I/O bus or the memory
bus.
• In a physical sense, this distinguishes a cluster from a tightly
coupled multicomputer.
• The relative speeds of the I/O and memory buses impact the
performance of the network.
Network Topologies
• A variety of network topologies have been proposed and
implemented.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of
multiple topologies for reasons of packaging, cost, and
available components.
Network Topologies
• Bus
• Crossbar
• Complete Graph
• Subset
• Multistage network
• Evaluation Parameters include
• Diameter, latency
• Contention, bisection width
• Cost
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used buses.
• All processors access a common bus for exchanging data.
• The distance between any two nodes is O(1) in a bus. The bus
also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major
bottleneck.
• Typical bus based machines are limited to dozens of nodes.
Sun Enterprise servers and Intel Pentium based shared-bus
multiprocessors are examples of such architectures.
Network Topologies: Buses
Bus-based interconnects (a) with no local caches; (b) with local
memory/caches.
Since much of the data accessed by processors is local to the processor, a local memory can improve
the performance of bus-based machines.
Network Topologies: Crossbars
A crossbar network uses a p × m grid of switches to connect p
inputs to m outputs in a non-blocking manner.
A completely non-blocking crossbar network connecting p
processors to b memory banks.
Network Topologies: Crossbars
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the Sun Ultra
HPC 10000 and the Fujitsu VPP500.
Network Topologies: Multistage Networks
• Crossbars have excellent performance scalability but poor
cost scalability.
• Buses have excellent cost scalability, but poor
performance scalability.
• Multistage interconnects strike a compromise between
these extremes.
Network Topologies: Multistage Networks
The schematic of a typical multistage interconnection network.
Network Topologies: Multistage Omega Network
• One of the most commonly used multistage
interconnects is the Omega network.
• This network consists of log p stages, where p is the
number of inputs/outputs.
• At each stage, input i is connected to output j by a perfect shuffle (a left rotation of the binary representation of i):
   j = 2i,          for 0 ≤ i ≤ p/2 − 1
   j = 2i + 1 − p,  for p/2 ≤ i ≤ p − 1
Network Topologies: Multistage Omega Network
Each stage of the Omega network implements a perfect shuffle as follows:
[Figure: a perfect shuffle interconnection for eight inputs and outputs.]
Network Topologies: Multistage Omega Network
• The perfect shuffle patterns are connected using 2 × 2
switches.
• The switches operate in two modes: pass-through or cross-over.
[Figure: the two switching configurations of the 2 × 2 switch: (a) pass-through; (b) cross-over.]
Network Topologies: Multistage Omega Network
A complete Omega network with the perfect shuffle interconnects and switches can now
be illustrated:
A complete omega network connecting eight inputs and eight outputs.
An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p).
Butterfly Network
Network Topologies: Multistage Omega
Network – Routing
• Let s be the binary representation of the source node and d that of the destination node.
• The data traverses the link to the first switching node. If the most significant bits of s and d are the same, the switch routes the data in pass-through mode; otherwise it routes it in cross-over mode.
• This process is repeated at each of the log p switching stages, using the next most significant bit at each stage (see the sketch below).
• Note that this is not a non-blocking network.
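A minimal sketch of this routing rule, assuming p is a power of two and s, d are node indices in [0, p); the function prints the switch setting chosen at each of the log p stages (e.g., routing 010 to 111 in an eight-node network).

#include <stdio.h>

/* Omega-network routing: at stage k, compare the k-th most significant
   bit of source and destination. Equal bits -> pass-through,
   different bits -> cross-over. */
void omega_route(unsigned s, unsigned d, unsigned log_p) {
    for (unsigned k = 0; k < log_p; k++) {
        unsigned bit = log_p - 1 - k;                 /* most significant bit first */
        unsigned sb = (s >> bit) & 1u;
        unsigned db = (d >> bit) & 1u;
        printf("stage %u: %s\n", k, (sb == db) ? "pass-through" : "cross-over");
    }
}

int main(void) {
    omega_route(2u, 7u, 3u);   /* route 010 -> 111 in an 8-node (log p = 3) network */
    return 0;
}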
Network Topologies: Multistage Omega Network – Routing
An example of blocking in omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
Network Topologies: Completely Connected
Network
• Each processor is connected to every other processor.
• The number of links in the network scales as O(p²).
• While the performance scales very well, the hardware complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of crossbars.
Network Topologies: Completely Connected and Star
Connected Networks
[Figure: (a) a completely-connected network of eight nodes; (b) a star-connected network of nine nodes.]
Network Topologies: Star Connected Network
• Every node is connected only to a common node at the
center.
• Distance between any pair of nodes is O(1). However, the
central node becomes a bottleneck.
• In this sense, star connected networks are static counterparts
of buses.
Network Topologies: Linear Arrays, Meshes, and k-d
Meshes
•In a linear array, each node has two neighbors, one
to its left and one to its right. If the nodes at either end
are connected, we refer to it as a 1-D torus or a ring.
•A generalization to 2 dimensions has nodes with 4
neighbors, to the north, south, east, and west.
•A further generalization to d dimensions has nodes
with 2d neighbors.
•A special case of a d-dimensional mesh is a hypercube.
Here,
d = log p, where p is the total number of nodes.
Network Topologies: Linear Arrays
Linear arrays: (a) with no wraparound links; (b) with wraparound link.
Network Topologies: Two- and Three Dimensional
Meshes
Two and three dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link
(2-D torus); and (c) a 3-D mesh with no wraparound.
Network Topologies: Hypercubes and their Construction
Construction of hypercubes from hypercubes of lower dimension.
Network Topologies: Properties of Hypercubes
• The distance between any two nodes is at most log p.
• Each node has log p neighbors.
• The distance between two nodes is given by the number of bit
positions at which the two nodes differ.
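A small sketch of the last property, assuming node labels are the usual hypercube bit strings: the hop distance between two nodes is the Hamming distance of their labels.

#include <stdio.h>

/* Number of bit positions in which the two node labels differ. */
int hypercube_distance(unsigned a, unsigned b) {
    unsigned x = a ^ b;
    int count = 0;
    while (x) {
        count += x & 1u;
        x >>= 1;
    }
    return count;
}

int main(void) {
    /* In a 3-D hypercube (p = 8), nodes 000 and 111 are 3 hops apart. */
    printf("distance(0, 7) = %d\n", hypercube_distance(0u, 7u));
    printf("distance(5, 4) = %d\n", hypercube_distance(5u, 4u));   /* 101 vs 100 -> 1 hop */
    return 0;
}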
Network Topologies: Tree-Based Networks
Complete binary tree networks: (a) a static tree network; and
(b) a dynamic tree network.
Network Topologies: Tree Properties
• The distance between any two nodes is no more than 2 log p.
• Links higher up the tree potentially carry more traffic than those
at the lower levels.
• For this reason, a variant called a fat-tree, fattens the links as
we go up the tree.
• Trees can be laid out in 2D with no wire crossings. This is an
attractive property of trees.
Network Topologies: Fat Trees
A fat tree network of 16 processing nodes.
Evaluating Static Interconnection Networks
• Number of links: it is fair to assume that each link incurs a cost; fewer links also imply a simpler network.
• Degree: the degree of a node is the number of links (ports) connected to it. A higher degree allows more concurrent communication but raises per-node cost.
• Total bandwidth: the maximum rate at which the entire network can move data. For a network with l links, each of link bandwidth b, the maximum network bandwidth is b·l.
• Minimum throughput: for each pair of nodes there is a maximum rate at which data can be sent between them; the minimum of these rates over all pairs is the minimum network throughput.
• Average path length: path lengths can vary significantly from pair to pair; this metric is the shortest path length between a pair of nodes, averaged over all pairs.
Evaluating Static Interconnection Networks
• Diameter: the distance between the farthest two nodes in the network.
The diameter of a linear array is p − 1, that of a 2-D mesh is 2(√p − 1), that of a
complete binary tree is about 2 log p, that of a hypercube is log p, and that of a
completely connected network is O(1).
• Bisection width: the minimum number of wires that must be cut to divide
the network into two equal halves. The bisection width of a linear array
and of a tree is 1, that of a 2-D mesh is √p, that of a hypercube is p/2, and that of
a completely connected network is p²/4.
• Cost: the number of links or switches (whichever is
asymptotically higher) is a meaningful measure of the cost.
However, a number of other factors, such as the ability to lay out the
network, the length of wires, etc., also factor into the cost.
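A small sketch tabulating these formulas for a hypothetical machine size (it assumes p is both a perfect square and a power of two so every formula quoted above applies).

#include <stdio.h>
#include <math.h>

int main(void) {
    int p = 64;                                   /* hypothetical number of nodes */
    double sq = sqrt((double)p);
    double lg = log2((double)p);

    printf("linear array:    diameter %d, bisection 1\n", p - 1);
    printf("2-D mesh:        diameter %.0f, bisection %.0f\n", 2.0 * (sq - 1.0), sq);
    printf("hypercube:       diameter %.0f, bisection %d\n", lg, p / 2);
    printf("fully connected: diameter 1, bisection %d\n", (p * p) / 4);
    return 0;
}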
Evaluating Static and Dynamic Interconnection Networks
[Tables summarizing the above metrics — diameter, bisection width, number of links/switches, and cost — for the static and dynamic networks discussed in this module.]
Cache Coherence in Multiprocessor Systems
• Interconnects provide basic mechanisms for data transfer.
• In the case of shared address space machines, additional hardware is required
to coordinate access to data that might have multiple copies in the network.
• The underlying technique must provide some guarantees on the semantics.
• This guarantee is generally one of serializability, i.e., there exists some serial order of
instruction execution that corresponds to the parallel schedule.
Cache Coherence in Multiprocessor Systems
When the value of a variable is changed, all its copies must either be invalidated or updated.
[Figure: cache coherence in multiprocessor systems — (a) invalidate protocol; (b) update protocol for shared variables.]
Cache Coherence: Update and Invalidate Protocols
• If a processor reads a value only once and does not
need it again, an update protocol may generate significant
overhead.
• If two processors make interleaved tests and updates to a
variable, an update protocol is better.
• Both protocols suffer from false-sharing overheads (two
different words that are not shared but happen to lie on the same
cache line).
• Most current machines use invalidate protocols.
Maintaining Coherence Using Invalidate Protocols
• Each copy of a data item is associated with a state.
• One example of such a set of states is: shared, invalid, and dirty.
• In shared state, there are multiple valid copies of the data item
(and therefore, an invalidate would have to be generated on
an update).
• In dirty state, only one copy exists and therefore, no invalidates
need to be generated.
• In invalid state, the data copy is invalid, therefore, a read
generates a data request (and associated state changes).
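A minimal software sketch of such a three-state protocol (shared / dirty / invalid), under the simplifying assumptions that events arrive one at a time and that bus messages are abstracted away; it is only meant to make the state transitions concrete, not to model real coherence hardware.

#include <stdio.h>

typedef enum { INVALID, SHARED, DIRTY } line_state_t;
typedef enum { LOCAL_READ, LOCAL_WRITE, REMOTE_READ, REMOTE_WRITE } event_t;

/* Simplified transition function for one cache line in one cache. */
line_state_t next_state(line_state_t s, event_t e) {
    switch (s) {
    case INVALID:
        if (e == LOCAL_READ)  return SHARED;   /* read miss: fetch a valid copy      */
        if (e == LOCAL_WRITE) return DIRTY;    /* write miss: fetch, invalidate others */
        return INVALID;
    case SHARED:
        if (e == LOCAL_WRITE)  return DIRTY;   /* send invalidates to other copies   */
        if (e == REMOTE_WRITE) return INVALID; /* another processor wrote: copy dies */
        return SHARED;
    case DIRTY:
        if (e == REMOTE_READ)  return SHARED;  /* write back; others may now share   */
        if (e == REMOTE_WRITE) return INVALID;
        return DIRTY;                          /* local reads/writes stay dirty      */
    }
    return INVALID;
}

int main(void) {
    line_state_t s = INVALID;
    s = next_state(s, LOCAL_READ);   /* INVALID -> SHARED  */
    s = next_state(s, LOCAL_WRITE);  /* SHARED  -> DIRTY   */
    s = next_state(s, REMOTE_READ);  /* DIRTY   -> SHARED  */
    printf("final state = %d\n", s);
    return 0;
}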
Maintaining Coherence Using Invalidate Protocols
State diagram of a simple three-state coherence protocol.
Communication Costs in Parallel Machines
• Along with idling and contention, communication is a major overhead
in parallel programs.
• The cost of communication is dependent on a variety of
features including the programming model semantics, the
network topology, data handling and routing, and associated
software protocols.
Message Passing Costs in Parallel Computers
The total time to transfer a message over a network
comprises the following:
• Startup time (ts): time spent at the sending and
receiving nodes (executing the routing algorithm,
programming routers, etc.).
• Per-hop time (th): a function of the number of
hops; includes factors such as switch latencies,
network delays, etc.
• Per-word transfer time (tw): includes all
overheads that are determined by the length of the
message, such as the bandwidth of links, error
checking and correction, etc.
Store-and-Forward Routing
• A message traversing multiple hops is completely received
at an intermediate hop before being forwarded to the next
hop.
• The total communication cost for a message of size m words
to traverse l communication links is
  tcomm = ts + (m·tw + th)·l
where th is the per-hop time spent at each intermediate node and
m·tw is the time for the message body to traverse each link.
• In most platforms, th is small and the above
expression can be approximated by
  tcomm = ts + m·l·tw
Routing Techniques
Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and
(c) extending the concept to cut-through routing. The shaded regions represent the time that the
message is in transit. The startup time associated with this message transfer is assumed to be zero.
Packet Routing
• Store-and-forward makes poor use of communication resources.
• Packet routing breaks messages into packets and pipelines
them through the network.
• Since packets may take different paths, each packet must
carry routing information, error checking, sequencing, and
other related header information.
• The total communication time for packet routing is approximated
by:
tcomm = ts + th·l + tw·m
• The factor tw accounts for overheads in packet headers.
Cut-Through Routing
• Takes the concept of packet routing to an extreme by further dividing
messages into basic units called flits.
• Since flits are typically small, the header information must be minimized.
• This is done by forcing all flits to take the same path, in
sequence.
• A tracer message first programs all intermediate routers. All flits then take
the same route.
• Error checks are performed on the entire message, as opposed to flits.
• No sequence numbers are needed.
Cut-Through Routing
• The total communication time for cut-through routing is
approximated by:
  tcomm = ts + th·l + tw·m
• This is identical to packet routing; however, tw is typically
much smaller.
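A small sketch comparing the two cost models with hypothetical parameter values (ts, th, tw, and the message/path sizes are made up for illustration).

#include <stdio.h>

/* Store-and-forward: tcomm = ts + (m*tw + th) * l  */
double cost_sf(double ts, double th, double tw, double m, double l) {
    return ts + (m * tw + th) * l;
}

/* Cut-through:       tcomm = ts + th*l + tw*m      */
double cost_ct(double ts, double th, double tw, double m, double l) {
    return ts + th * l + tw * m;
}

int main(void) {
    double ts = 50.0, th = 2.0, tw = 0.5;   /* hypothetical times (e.g., microseconds) */
    double m = 1000.0, l = 4.0;             /* 1000-word message over 4 hops           */
    printf("store-and-forward: %.1f\n", cost_sf(ts, th, tw, m, l));  /* 50 + (500+2)*4 = 2058 */
    printf("cut-through:       %.1f\n", cost_ct(ts, th, tw, m, l));  /* 50 + 8 + 500   = 558  */
    return 0;
}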
Simplified Cost Model for Communicating Messages
• The cost of communicating a message between two nodes l
hops away using cut-through routing is given by
  tcomm = ts + l·th + tw·m
• In this expression, th is typically smaller than ts and tw. For this
reason, the second term on the RHS does not dominate the total cost,
particularly when m is large.
• Furthermore, it is often not possible to control the routing and
placement of tasks.
• For these reasons, we can approximate the cost of message
transfer by
  tcomm = ts + tw·m
Simplified Cost Model for Communicating Messages
• It is important to note that the above
expressions for communication time are valid only for
uncongested networks.
• If a link carries multiple messages, the corresponding tw
term must be scaled up by the number of messages.
• Different communication patterns congest a given
network to varying extents.
• It is important to understand this and account for it in the
communication time accordingly.
Multi-threading reduces stalls
• Interleave the processing of multiple threads on the same core to hide
stalls.
• If you can't make progress on the current thread, work on
another one.
Reading
• A. Grama, A. Gupta, G. Karypis, and V. Kumar. Introduction to Parallel
Computing, Addison-Wesley (2003). Chapter 2: 2.1, 2.2, 2.3, 2.4,
2.5.1, 2.6.
Thank You