Unit 1 - Part 2

High Performance Computing

Dr. Amit Barve
Associate Professor and Head of the Department,
CSE Department, PIET, PU
Motivating Parallelism
•The role of parallelism in accelerating computing speeds has been recognized for
several decades. Developing parallel hardware and software has traditionally been
time and effort intensive.
•If one is to view this in the context of rapidly improving uniprocessor speeds, one
is tempted to question the need for parallel computing.
•There are some unmistakable trends in hardware design, which indicate that
uniprocessor (or implicitly parallel) architectures may not be able to sustain the
rate of realizable performance increments in the future.
•This is the result of a number of fundamental physical and computational
limitations.
•The emergence of standardized parallel programming environments, libraries,
and hardware has significantly reduced the time to (parallel) solution.
Example of Memory Bandwidth
•Consider the following code fragment:

  for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
      column_sum[i] += b[j][i];
  }

•The code fragment sums columns of the matrix b into a vector column_sum.
•The vector column_sum is small and easily fits into the cache.
•The matrix b is accessed in column order.
•The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
Impact of Memory Bandwidth: Example

•We can fix the above code as follows:

  for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
  for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
      column_sum[i] += b[j][i];

•In this case, the matrix is traversed in row order, and performance can be expected to be significantly better (a small timing sketch follows).
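•A minimal timing sketch (not from the original slides; the matrix size, the use of clock(), and the printed units are arbitrary choices) contrasting the strided and row-order versions of the loop above:

  /* Times the strided (column-order) and the row-order column-sum loops.
   * Compile without aggressive optimization so that neither loop is elided. */
  #include <stdio.h>
  #include <time.h>

  #define N 1000

  static double b[N][N], column_sum[N];

  int main(void) {
      clock_t t0 = clock();
      /* Strided version: b is walked down a column, one element per row. */
      for (int i = 0; i < N; i++) {
          column_sum[i] = 0.0;
          for (int j = 0; j < N; j++)
              column_sum[i] += b[j][i];
      }
      clock_t t1 = clock();

      /* Row-order version: consecutive accesses fall in the same cache line. */
      for (int i = 0; i < N; i++)
          column_sum[i] = 0.0;
      for (int j = 0; j < N; j++)
          for (int i = 0; i < N; i++)
              column_sum[i] += b[j][i];
      clock_t t2 = clock();

      printf("strided: %.3f ms, row-order: %.3f ms\n",
             1000.0 * (double)(t1 - t0) / CLOCKS_PER_SEC,
             1000.0 * (double)(t2 - t1) / CLOCKS_PER_SEC);
      return 0;
  }

•On a typical cache-based machine the row-order version can be expected to run noticeably faster, which is the point made above about strided access.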
Memory System Performance: Summary
Dichotomy of Parallel Computing Platforms
Control Structure of Parallel Programs
•Processing units in parallel computers either operate under
the centralized control of a single control unit or work
independently.
•If there is a single control unit that dispatches the same
instruction to various processors (that work on different data),
the model is referred to as single instruction stream, multiple
data stream (SIMD).
•If each processor has its own control unit, each
processor can execute different instructions on different data
items. This model is called multiple instruction stream,
multiple data stream (MIMD).
SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).


SIMD Processors
•Some of the earliest parallel computers such as the Illiac IV, MPP,
DAP, CM-2, and MasPar MP-1 belonged to this class of machines.
•Variants of this concept have found use in co-processing units such as
the MMX units in Intel processors and DSP chips such as the Sharc.
•SIMD relies on the regular structure of computations (such as those in
image processing).
•It is often necessary to selectively turn off operations on certain data items. For this reason, most SIMD programming paradigms allow for an "activity mask", which determines whether a processor participates in a computation or not (a small sketch follows).
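•A small sequential sketch (illustrative only, not from the slides) of the activity-mask idea: an array of flags stands in for the per-processor mask, and a two-way conditional is executed in two masked steps. The data values and the conditional itself (if (b == 0) c = a; else c = a / b;) are invented for illustration.

  /* Emulates an activity mask over P "processing elements" using plain C. */
  #include <stdio.h>

  #define P 4  /* four processing elements, as in the following figure */

  int main(void) {
      int a[P] = {5, 4, 7, 9};   /* invented example data */
      int b[P] = {0, 2, 0, 3};
      int c[P];
      int active[P];             /* the activity mask */

      /* Step 1: elements with b[i] == 0 are active and execute the "then" branch. */
      for (int i = 0; i < P; i++) active[i] = (b[i] == 0);
      for (int i = 0; i < P; i++)
          if (active[i]) c[i] = a[i];

      /* Step 2: the mask is inverted; the remaining elements execute the "else" branch. */
      for (int i = 0; i < P; i++)
          if (!active[i]) c[i] = a[i] / b[i];

      for (int i = 0; i < P; i++) printf("c[%d] = %d\n", i, c[i]);
      return 0;
  }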
Conditional Execution in SIMD Processors

Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
MIMD Processors
In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
A variant of this, called single program multiple data streams
(SPMD), executes the same program on different processors (a
minimal sketch follows this slide).
It is easy to see that SPMD and MIMD are closely related in
terms of programming flexibility and underlying architectural
support.
Examples of such platforms include current generation Sun
Ultra Servers, SGI Origin Servers, multiprocessor PCs,
workstation clusters, and the IBM SP.
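A minimal SPMD sketch (assuming an MPI installation; the rank-0 "coordinator" role is an invented example): every process runs the same program, and the rank returned by MPI_Comm_rank selects the work each process performs.

  /* SPMD: one program, several processes; behaviour differs only by rank. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, size;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);
      MPI_Comm_size(MPI_COMM_WORLD, &size);

      if (rank == 0)
          printf("Process 0 of %d: coordinating\n", size);     /* e.g., hand out work */
      else
          printf("Process %d of %d: computing\n", rank, size); /* e.g., local work */

      MPI_Finalize();
      return 0;
  }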
SIMD-MIMD Comparison
•SIMD computers require less hardware than MIMD
computers (single control unit).
•However, since SIMD processors are specially designed,
they tend to be expensive and to have long design cycles.
•Not all applications are naturally suited to SIMD processors.
•In contrast, platforms supporting the SPMD paradigm can be
built from inexpensive off-the-shelf components with relatively
little effort in a short amount of time.
Communication Model of Parallel Platforms
•There are two primary forms of data exchange
between parallel tasks - accessing a shared data
space and exchanging messages.
•Platforms that provide a shared data space are
called shared-address-space machines or
multiprocessors.
•Platforms that support messaging are also called
message passing platforms or multicomputers.
Shared-Address-Space Platforms
•Part (or all) of the memory is accessible to all
processors.
•Processors interact by modifying data objects
stored in this shared-address-space.
•If the time taken by a processor to access any memory word in the system (global or local) is identical, the platform is classified as a uniform memory access (UMA) machine; otherwise, it is a non-uniform memory access (NUMA) machine.
NUMA and UMA Shared-Address-Space Platforms

Typical shared-address-space architectures: (a) Uniform-memory-access shared-address-space computer; (b) Uniform-memory-access shared-address-space computer with caches and memories; (c) Non-uniform-memory-access shared-address-space computer with local memory only.
NUMA and UMA Shared-Address-Space Platforms

•The distinction between NUMA and UMA platforms is important from the
point of view of algorithm design. NUMA machines require locality from
underlying algorithms for performance.
•Programming these platforms is easier since reads and writes are
implicitly visible to other processors.
•However, read and write accesses to shared data must be coordinated (this will be discussed in greater detail when we talk about threads programming); a small sketch follows this slide.
•Caches in such machines require coordinated access to multiple copies.
This leads to the cache coherence problem.
•A weaker model of these machines provides an address map, but not
coordinated access. These models are called non-cache-coherent
shared-address-space machines.
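•A small sketch (illustrative, using POSIX threads; the thread and iteration counts are arbitrary) of coordinating read-write access to shared data with a mutex:

  /* Four threads increment a shared counter; the mutex serialises each
   * read-modify-write so that no update is lost. */
  #include <stdio.h>
  #include <pthread.h>

  static long shared_counter = 0;
  static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

  static void *worker(void *arg) {
      (void)arg;
      for (int i = 0; i < 100000; i++) {
          pthread_mutex_lock(&lock);    /* coordinate access to shared data */
          shared_counter++;
          pthread_mutex_unlock(&lock);
      }
      return NULL;
  }

  int main(void) {
      pthread_t t[4];
      for (int i = 0; i < 4; i++) pthread_create(&t[i], NULL, worker, NULL);
      for (int i = 0; i < 4; i++) pthread_join(t[i], NULL);
      printf("shared_counter = %ld\n", shared_counter);  /* 400000 when coordinated */
      return 0;
  }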
Shared-Address-Space vs. Shared Memory machines

•It is important to note the difference between the
terms shared address space and shared memory.
•We refer to the former as a programming
abstraction and to the latter as a physical machine
attribute.
•It is possible to provide a shared address space
using a physically distributed memory.
Message-Passing Platforms
•These platforms comprise a set of processors, each
with its own (exclusive) memory.
•Instances of such a view come naturally from
clustered workstations and non-shared-address-
space multicomputers.
•These platforms are programmed using (variants
of) send and receive primitives.
•Libraries such as MPI and PVM provide such
primitives; a minimal MPI sketch follows.
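•A minimal send/receive sketch using MPI (one of the libraries named above); the payload, tag, and two-process layout are arbitrary illustrative choices:

  /* Process 0 sends one integer to process 1; run with at least two processes. */
  #include <stdio.h>
  #include <mpi.h>

  int main(int argc, char **argv) {
      int rank, value;
      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 0) {
          value = 42;  /* arbitrary payload */
          MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
      } else if (rank == 1) {
          MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
          printf("Process 1 received %d\n", value);
      }

      MPI_Finalize();
      return 0;
  }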
Message Passing vs. Shared Address Space Platforms

•Message passing requires little hardware support,
other than a network.
•Shared address space platforms can easily emulate
message passing. The reverse is more difficult to do
(in an efficient manner).
Physical Organization of Parallel Platforms
Architecture of an Ideal Parallel Computer
•A natural extension of the Random Access Machine
(RAM) serial architecture is the Parallel Random
Access Machine, or PRAM.
•PRAMs consist of p processors and a global
memory of unbounded size that is uniformly
accessible to all processors.
•Processors share a common clock but may execute
different instructions in each cycle.
Architecture of an Ideal Parallel Computer
•Depending on how simultaneous memory accesses are
handled, PRAMs can be divided into four subclasses.
–Exclusive-read, exclusive-write (EREW) PRAM.
–Concurrent-read, exclusive-write (CREW) PRAM.
–Exclusive-read, concurrent-write (ERCW) PRAM.
–Concurrent-read, concurrent-write (CRCW) PRAM.
Architecture of an Ideal Parallel Computer
•What does concurrent write mean, anyway? (The rules below are illustrated in a small sketch after this list.)
–Common: write only if all values are identical.
–Arbitrary: write the data from a randomly selected processor.
–Priority: follow a predetermined priority order.
–Sum: write the sum of all data items.
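•A small sequential sketch (illustrative only; the proposed values are invented) that resolves a set of simultaneous writes to one memory cell under each of the four rules above:

  /* Simulates how a CRCW PRAM could resolve P simultaneous writes to the
   * same memory cell under the common, arbitrary, priority, and sum rules. */
  #include <stdio.h>
  #include <stdlib.h>

  #define P 4

  int main(void) {
      int proposed[P] = {3, 3, 7, 1};  /* value each processor tries to write */

      /* Common: the write succeeds only if all proposed values are identical. */
      int all_same = 1;
      for (int i = 1; i < P; i++)
          if (proposed[i] != proposed[0]) all_same = 0;
      if (all_same) printf("common:    %d\n", proposed[0]);
      else          printf("common:    no write (values differ)\n");

      /* Arbitrary: keep the value of one arbitrarily selected processor. */
      printf("arbitrary: %d\n", proposed[rand() % P]);

      /* Priority: keep the value of the highest-priority (here, lowest-index) processor. */
      printf("priority:  %d\n", proposed[0]);

      /* Sum: write the sum of all proposed values. */
      int sum = 0;
      for (int i = 0; i < P; i++) sum += proposed[i];
      printf("sum:       %d\n", sum);
      return 0;
  }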
Physical Complexity of an Ideal Parallel Computer
•Processors and memories are connected via switches.
•Since these switches must operate in O(1) time at
the level of words, for a system of p processors and m
words, the switch complexity is O(mp).
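•As an illustration (with arbitrarily chosen sizes), even p = 32 processors and m = 2^30 words would call for on the order of 32 × 2^30 = 2^35 switches.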
•Clearly, for meaningful values of p and m, a true
PRAM is not realizable.
Interconnection Networks for Parallel Computers
•Interconnection networks carry data between processors
and to memory.
•Interconnects are made of switches and links (wires, fiber).
•Interconnects are classified as static or dynamic.
•Static networks consist of point-to-point communication links
among processing nodes and are also referred to as direct
networks.
•Dynamic networks are built using switches and
communication links. Dynamic networks are also referred to
as indirect networks.
www.paruluniversity.ac.in
