HPC Unit 2
Principles of Parallel Algorithm Design
Introduction to Parallel Computing
Traditionally, software has been written for serial computation:
• It runs on a single computer with a single Central Processing Unit (CPU).
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.
Fig. - Serial computation.
Parallel Computing
In the simplest sense, parallel computing is the simultaneous use of multiple compute
resources to solve a computational problem.
• It runs on multiple CPUs.
• A problem is broken into discrete parts that can be solved concurrently.
• Each part is further broken down into a series of instructions.
• Instructions from each part execute simultaneously on different CPUs.
Motivating Parallelism
• Development of parallel software has traditionally been thought of as time and effort
intensive.
• This can be largely attributed to the inherent complexity of specifying and coordinating
concurrent tasks, a lack of portable algorithms, standardized environments, and software
development toolkits.
1. The Computational Power Argument – from Transistors to FLOPS
2. The Memory/Disk Speed Argument
3. The Data Communication Argument
The Computational Power Argument – from Transistors to FLOPS
• In 1965, Gordon Moore made the following simple observation: "The complexity for
minimum component costs has increased at a rate of roughly a factor of two per year.
Certainly, over the short term this rate can be expected to continue, if not to increase. Over
the longer term, the rate of increase is a bit more uncertain, although there is no reason to
believe it will not remain nearly constant for at least 10 years. That means by 1975, the
number of components per integrated circuit for minimum cost will be 65,000."
The Memory/Disk Speed Argument
• The overall speed of computation is determined not just by the speed of the processor, but
also by the ability of the memory system to feed data to it.
• While clock rates of high-end processors have increased at roughly 40% per year over the
past decade, DRAM access times have only improved at the rate of roughly 10% per year over
this interval.
• The overall performance of the memory system is determined by the fraction of the total
memory requests that can be satisfied from the cache.
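The two claims above can be put into numbers with a small sketch. The 40% and 10% annual rates come from the text; the cache and DRAM access times (1 ns and 100 ns) and the function names are illustrative assumptions, not measured values.

```python
def relative_gap(years, cpu_rate=0.40, dram_rate=0.10):
    """Factor by which processor speed outgrows DRAM speed after `years`."""
    return ((1 + cpu_rate) / (1 + dram_rate)) ** years


def effective_access_time(hit_ratio, t_cache_ns=1.0, t_dram_ns=100.0):
    """Average access time when a fraction `hit_ratio` of requests hit the cache.

    The 1 ns and 100 ns defaults are assumed values for illustration only.
    """
    return hit_ratio * t_cache_ns + (1 - hit_ratio) * t_dram_ns


print(f"processor/DRAM gap after 10 years: {relative_gap(10):.1f}x")
for h in (0.0, 0.9, 0.99):
    print(f"hit ratio {h:.2f}: {effective_access_time(h):.2f} ns per access")
```

Under these assumptions even a 90% hit ratio leaves the average access roughly ten times slower than a cache hit, which is why the cache hit fraction dominates memory system performance.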
The Data Communication Argument
• In many applications there are constraints on the location of data and/or resources across
the Internet.
• An example of such an application is mining of large commercial datasets distributed over a
relatively low bandwidth network.
• In such applications, even if the computing power is available to accomplish the required
task without resorting to parallel computing, it is infeasible to collect the data at a central
location.
• In these cases, the motivation for parallelism comes not just from the need for computing
resources but also from the infeasibility or undesirability of alternate (centralized) approaches.
Reference Book: Ananth Grama
Modern Processor
1. Stored-program computer architecture: Its defining property, which sets it apart from
earlier designs, is that its instructions are numbers that are stored as data in memory.
Instructions are read and executed by a control unit; a separate arithmetic/logic unit is
responsible for the actual computations and manipulates data stored in memory along with
the instructions. A von Neumann computer uses the stored-program concept. The CPU
executes a stored program that specifies a sequence of read and write operations on the
memory.
Instructions and data must be continuously fed to the control and arithmetic units, so that the speed of the
memory interface poses a limitation on compute performance.
The architecture is inherently sequential, processing a single instruction with (possibly) a single operand
or a group of operands from memory (SISD: single instruction, single data).
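2. General-purpose cache-based microprocessor architecture:
• Microprocessors implement the stored-program concept described above.
• Modern processors have many components, but only a small part does the actual work: the arithmetic
units for floating-point (FP) and integer (INT) operations.
• The remaining components include the CPU registers; modern processors require all operands to reside
in registers.
• LD (load) and ST (store) units handle instructions that transfer data to and from registers.
• Instruction queues hold instructions waiting to be executed.
• Finally, caches hold data and instructions for fast reuse.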
References
Book Title: Introduction to High Performance Computing for Scientists and Engineers
Authors: Georg Hager and Gerhard Wellein
Reference:
http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/intIntroduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Scope of Parallelism
Microprocessor clock speeds have posted impressive gains over the past two decades
(two to three orders of magnitude).
Higher levels of device integration have made available a large number of transistors.
The question of how best to utilize these resources is an important one.
Current processors use these resources in multiple functional units and execute
multiple instructions in the same cycle.
The precise manner in which these instructions are selected and executed provides
impressive diversity in architectures.
Pipelining and Superscalar Execution
The penalty of a misprediction grows with the depth of the pipeline, since a larger
number of instructions will have to be flushed.
One simple way of alleviating these bottlenecks is to use multiple pipelines.
The question then becomes one of selecting these instructions.
In the example below, some resources are wasted because of data dependencies.
Superscalar Execution: An Example
Fig. - Example of a two-way superscalar execution of instructions.
In the above example, there is some wastage of resources due to data dependencies.
The example also illustrates that different instruction mixes with identical semantics
can take significantly different execution time.
Superscalar Execution
• In the simpler model, instructions can be issued only in the order in which they are
encountered. That is, if the second instruction cannot be issued because it has a
data dependency with the first, only one instruction is issued in the cycle. This is
called in-order issue.
• In a more aggressive model, instructions can be issued out of order. In this case,
if the second instruction has data dependencies with the first, but the third
instruction does not, the first and third instructions can be co-scheduled. This is
also called dynamic issue.
• Performance of in-order issue is generally limited.
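The difference between in-order and dynamic issue can be sketched with a toy two-way scheduler. This is a simplified model, not a description of any real processor: every instruction is assumed to finish one cycle after it issues, dependencies must point to earlier instructions, and the function name `schedule` and the four-instruction program are hypothetical.

```python
def schedule(deps, width=2, in_order=True):
    """Return the number of cycles needed to issue all instructions.

    deps[i] lists the earlier instructions whose results instruction i needs.
    Assumption: an instruction's result is available one cycle after it issues.
    """
    n = len(deps)
    issue_cycle = {}              # instruction index -> cycle in which it issued
    cycle = 0
    while len(issue_cycle) < n:
        issued_now = []
        for i in range(n):
            if i in issue_cycle:
                continue
            ready = all(d in issue_cycle and issue_cycle[d] < cycle for d in deps[i])
            if ready and len(issued_now) < width:
                issued_now.append(i)
            elif in_order:
                break             # in-order issue stalls behind the first blocked instruction
        for i in issued_now:
            issue_cycle[i] = cycle
        cycle += 1
    return cycle


# Instruction 1 needs instruction 0; instruction 3 needs instruction 2.
program = [[], [0], [], [2]]
print("in-order issue:", schedule(program, in_order=True), "cycles")
print("dynamic issue: ", schedule(program, in_order=False), "cycles")
```

With dynamic issue the first and third instructions are co-scheduled in the first cycle, exactly the case described in the second bullet, so the same program needs fewer cycles.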
Superscalar Execution: Efficiency Considerations
• The memory system, and not processor speed, is often the bottleneck for many
applications.
• Memory system performance is largely captured by two parameters, latency and
bandwidth.
• Latency is the time from the issue of a memory request to the time the data is
available at the processor.
• Bandwidth is the rate at which data can be pumped to the processor by the
memory system.
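A rough way to see how the two parameters interact is to model the time for one request as latency plus size divided by bandwidth. The model is a common simplification, and the 100 ns latency and 10 bytes/ns (about 10 GB/s) bandwidth below are assumed values chosen only for illustration.

```python
LATENCY_NS = 100.0               # assumed latency, for illustration
BANDWIDTH_BYTES_PER_NS = 10.0    # assumed bandwidth, about 10 GB/s


def transfer_time_ns(num_bytes, latency_ns=LATENCY_NS, bandwidth=BANDWIDTH_BYTES_PER_NS):
    """Time from issuing a request until the last byte arrives, in nanoseconds."""
    return latency_ns + num_bytes / bandwidth


for size in (8, 4096, 1_048_576):
    total = transfer_time_ns(size)
    print(f"{size:>9} bytes: {total:>10.1f} ns, latency share {LATENCY_NS / total:.0%}")
```

In this model small requests are dominated by latency and large transfers by bandwidth, which is why both parameters are needed to characterize a memory system.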
Dichotomy of Parallel Computing Platforms
Processing units in parallel computers either operate under the centralized control
of a single control unit or work independently.
• If there is a single control unit that dispatches the same instruction to various
processors (that work on different data), the model is referred to as single
instruction stream, multiple data stream (SIMD).
• If each processor has its own control unit, each processor can execute
different instructions on different data items. This model is called multiple
instruction stream, multiple data stream (MIMD).
SIMD and MIMD Processors
SIMD Processors
• Some of the earliest parallel computers such as the Illiac IV, MPP, DAP, CM-2,
and MasPar MP-1 belonged to this class of machines.
• Variants of this concept have found use in co-processing units such as the MMX
units in Intel processors and DSP chips such as the Sharc.
• SIMD relies on the regular structure of computations (such as those in image
processing).
• It is often necessary to selectively turn off operations on certain data items. For
this reason, most SIMD programming paradigms allow for an "activity mask",
which determines if a processor should participate in a computation or not.
Conditional Execution in SIMD Processors
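The activity mask can be emulated in a few lines. The sketch below uses plain Python lists as stand-ins for SIMD lanes and executes the conditional `if (B == 0) C = A; else C = A / B` in two steps; the helper name `simd_where` and the data values are illustrative assumptions.

```python
def simd_where(mask, then_op, else_op, a, b):
    """Emulate a conditional on SIMD lanes using an activity mask."""
    lanes = len(a)
    c = [None] * lanes
    # Step 1: lanes whose mask bit is set execute the "then" instruction; the rest idle.
    for i in range(lanes):
        if mask[i]:
            c[i] = then_op(a[i], b[i])
    # Step 2: the mask is inverted and the remaining lanes execute the "else" instruction.
    for i in range(lanes):
        if not mask[i]:
            c[i] = else_op(a[i], b[i])
    return c


a = [5, 4, 1, 0]
b = [0, 2, 1, 0]
mask = [x == 0 for x in b]        # active where B == 0
c = simd_where(mask, lambda x, y: x, lambda x, y: x / y, a, b)
print(c)
```

Each step leaves some lanes idle, so a conditional costs the sum of both branches on a SIMD machine.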
MIMD Processors
• SIMD computers require less hardware than MIMD computers (single control
unit).
• However, since SIMD processors are specially designed, they tend to be
expensive and have long design cycles.
• Not all applications are naturally suited to SIMD processors.
• In contrast, platforms supporting the SPMD paradigm can be built from
inexpensive off-the-shelf components with relatively little effort in a short amount
of time.
Communication Model of Parallel Platforms
• There are two primary forms of data exchange between parallel tasks: accessing
a shared data space and exchanging messages.
• Platforms that provide a shared data space are called shared-address-space
machines or multiprocessors.
• Platforms that support messaging are also called message passing platforms or
multicomputers.
Shared-Address-Space Platforms
• The distinction between NUMA (non-uniform memory access) and UMA (uniform memory
access) platforms is important from the point of view of algorithm design. NUMA machines
require locality from underlying algorithms for performance.
• Programming these platforms is easier since reads and writes are implicitly visible to
other processors.
• However, concurrent reads and writes to shared data must be coordinated (this will be
discussed in greater detail when we talk about threads programming).
• Caches in such machines require coordinated access to multiple copies. This leads to
the cache coherence problem.
• A weaker model of these machines provides an address map, but not coordinated
access. These models are called non-cache-coherent shared-address-space machines.
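The need to coordinate read-write access to shared data can be illustrated with threads, which share one address space. This is a minimal sketch using Python's standard `threading` module; the counter, the thread count, and the function name `add_many` are assumptions made for the example.

```python
import threading

counter = 0                       # lives in the address space shared by all threads
lock = threading.Lock()


def add_many(n):
    """Increment the shared counter n times."""
    global counter
    for _ in range(n):
        with lock:                # coordinates the read-modify-write on shared data
            counter += 1


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

Every thread sees the others' writes without any explicit data transfer, which is what makes this model convenient; the lock is the price paid for that convenience, since unsynchronized updates could be lost.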
Shared-Address-Space vs. Shared Memory Machines
• It is important to note the difference between the terms shared address space and
shared memory.
• We refer to the former as a programming abstraction and to the latter as a
physical machine attribute.
• It is possible to provide a shared address space using a physically distributed
memory.
Message-Passing Platforms
• These platforms consist of a set of processors, each with its own (exclusive)
memory.
• Instances of such a view come naturally from clustered workstations and
non-shared-address-space multicomputers.
• These platforms are programmed using (variants of) send and receive primitives.
• Libraries such as MPI and PVM provide such primitives.
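The send/receive style of programming can be sketched without a real cluster. The example below is only an emulation: it uses Python threads and `queue.Queue` as a stand-in for a network link, and the `Channel` class is hypothetical. It is not the MPI or PVM API, and real message-passing processes would not share memory at all.

```python
import queue
import threading


class Channel:
    """Hypothetical one-way mailbox standing in for a network link."""

    def __init__(self):
        self._q = queue.Queue()

    def send(self, msg):
        self._q.put(msg)

    def recv(self):
        return self._q.get()      # blocks until a message arrives


def worker(inbox, outbox):
    local = inbox.recv()          # the worker only uses data it was sent
    outbox.send(sum(local))       # and returns its answer as a message


to_worker, from_worker = Channel(), Channel()
t = threading.Thread(target=worker, args=(to_worker, from_worker))
t.start()
to_worker.send([1, 2, 3, 4])
result = from_worker.recv()
t.join()
print(result)
```

All interaction goes through explicit `send` and `recv` calls, so the programmer decides when data moves, in contrast to the implicit sharing of a shared address space.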
Message Passing vs. Shared Address Space Platforms
• We begin this discussion with an ideal parallel machine, the Parallel Random-Access Machine (PRAM),
which is a natural extension of the serial Random-Access Machine (RAM) architecture.
• PRAMs consist of p processors and a global memory of unbounded size that is uniformly accessible to
all processors.
• Processors share a common clock but may execute different instructions in each cycle.
Architecture of an Ideal Parallel Computer
Depending on how simultaneous memory accesses are handled, PRAMs can be divided
into four subclasses.
– Exclusive-read, exclusive-write (EREW) PRAM.
– Concurrent-read, exclusive-write (CREW) PRAM.
– Exclusive-read, concurrent-write (ERCW) PRAM.
– Concurrent-read, concurrent-write (CRCW) PRAM.
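The four subclasses can be told apart by looking at the memory accesses issued in a single cycle. The sketch below classifies one cycle's accesses by the most restrictive subclass that still permits them; the function name and the tuple format are assumptions for illustration, and a read and a write to the same address in the same cycle is deliberately not handled.

```python
from collections import Counter


def required_pram_model(accesses):
    """Classify the accesses issued in ONE cycle.

    accesses: list of (processor, op, address) with op equal to "R" or "W".
    Returns "EREW", "CREW", "ERCW" or "CRCW".
    """
    reads = Counter(addr for _, op, addr in accesses if op == "R")
    writes = Counter(addr for _, op, addr in accesses if op == "W")
    concurrent_read = any(count > 1 for count in reads.values())
    concurrent_write = any(count > 1 for count in writes.values())
    return ("C" if concurrent_read else "E") + "R" + ("C" if concurrent_write else "E") + "W"


print(required_pram_model([(0, "R", 5), (1, "R", 6)]))   # distinct addresses
print(required_pram_model([(0, "R", 5), (1, "R", 5)]))   # two reads of address 5
print(required_pram_model([(0, "W", 1), (1, "W", 1)]))   # two writes to address 1
```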
When the value of a shared variable is changed, all its cached copies must either be
invalidated or updated.
Fig. - Cache coherence in multiprocessor systems: (a) Invalidate protocol; (b) Update protocol for shared variables.
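The two protocols named in the caption can be sketched with dictionaries standing in for per-processor caches. This is a simplified illustration, not a model of real hardware: memory is assumed to be written through on every store, and the function name `write_shared` and the variable `x` are hypothetical.

```python
def write_shared(caches, memory, writer, addr, value, protocol):
    """Processor `writer` stores `value` to `addr`; other cached copies are kept coherent."""
    caches[writer][addr] = value
    memory[addr] = value                  # write-through, assumed for simplicity
    for p, cache in enumerate(caches):
        if p == writer or addr not in cache:
            continue
        if protocol == "invalidate":
            del cache[addr]               # stale copy dropped; the next read must refetch
        elif protocol == "update":
            cache[addr] = value           # stale copy overwritten with the new value
        else:
            raise ValueError(f"unknown protocol: {protocol}")


for protocol in ("invalidate", "update"):
    memory = {"x": 1}
    caches = [{"x": 1}, {"x": 1}]         # both processors have cached x = 1
    write_shared(caches, memory, writer=0, addr="x", value=3, protocol=protocol)
    print(protocol, caches, memory)
```

Invalidation sends less data per write but forces the other processor to miss on its next read, while updating keeps all copies usable at the cost of traffic on every write.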
Communication Costs in Parallel Machines
Passing a message from node P0 to P3 (a) through a store-and-forward communication network; (b) and
(c) extending the concept to cut-through routing. The shaded regions represent the time that the message
is in transit. The startup time associated with this message transfer is assumed to be zero.
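The two routing schemes in the figure can be compared with the usual cost model from the parallel computing literature. The symbols are not defined in these slides and are introduced here as assumptions: `t_s` is the startup time, `t_h` the per-hop time, `t_w` the per-word transfer time, `m` the message size in words, and `l` the number of links traversed.

```python
def store_and_forward(m, l, t_s=0.0, t_h=1.0, t_w=1.0):
    """Each intermediate node receives the whole message before forwarding it."""
    return t_s + (m * t_w + t_h) * l


def cut_through(m, l, t_s=0.0, t_h=1.0, t_w=1.0):
    """The message is pipelined through the network, so hops and words overlap."""
    return t_s + l * t_h + m * t_w


# P0 to P3 crosses l = 3 links; startup time is zero, as assumed in the figure.
m, l = 4, 3
print("store-and-forward:", store_and_forward(m, l))
print("cut-through:      ", cut_through(m, l))
```

In this model the store-and-forward cost grows with the product of message size and distance, while the cut-through cost grows with their sum.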
B] Communication Costs in Shared-Address-Space Machines
Levels of parallelism
1. Data Parallelism: Many problems in scientific computing involve processing of large quantities of data stored on a
computer. If this manipulation can be performed in parallel, i.e., by multiple processors working on different parts
of the data, we speak of data parallelism. As a matter of fact, this is the dominant parallelization concept in
scientific computing on MIMD-type computers. It also goes under the name of SPMD (Single Program Multiple
Data), as usually the same code is executed on all processors, with independent instruction pointers.
Ex: Medium-grained loop parallelism, coarse-grained parallelism by domain decomposition
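Data parallelism by domain decomposition can be sketched as follows: the data is cut into contiguous chunks and every worker runs the same function on its own chunk. The example uses the standard `concurrent.futures` module; the function names and the sum-of-squares workload are illustrative assumptions. Note that CPython threads illustrate the structure but usually give no speedup for pure-Python arithmetic.

```python
from concurrent.futures import ThreadPoolExecutor


def partial_sum_of_squares(chunk):
    """The single program that every worker runs on its own part of the data."""
    return sum(x * x for x in chunk)


def parallel_sum_of_squares(data, workers=4):
    """Split `data` into contiguous chunks and combine the partial results."""
    size = max(1, (len(data) + workers - 1) // workers)   # chunk length, rounded up
    chunks = [data[i:i + size] for i in range(0, len(data), size)]
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return sum(pool.map(partial_sum_of_squares, chunks))


print(parallel_sum_of_squares(list(range(10))))
```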
2. Functional Parallelism: Sometimes the solution of a “big” numerical problem can be split into more or less disparate subtasks,
which work together by data exchange and synchronization. In this case, the subtasks execute completely different code on
different data items, which is why functional parallelism is also called MPMD (Multiple Program Multiple Data).
Ex: Master Worker Scheme, Functional decomposition
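A master-worker scheme can be sketched with a task queue: the master hands out work items and the workers execute whatever they receive. The version below is a minimal illustration using threads and `queue.Queue` from the standard library; the function name `master_worker` and the three sample tasks are hypothetical.

```python
import queue
import threading


def master_worker(tasks, num_workers=3):
    """tasks: list of (function, argument) pairs; returns results in task order."""
    todo, done = queue.Queue(), queue.Queue()

    def worker():
        while True:
            item = todo.get()
            if item is None:              # sentinel: the master has no more work
                return
            index, func, arg = item
            done.put((index, func(arg)))

    workers = [threading.Thread(target=worker) for _ in range(num_workers)]
    for w in workers:
        w.start()
    for index, (func, arg) in enumerate(tasks):   # the master distributes the subtasks
        todo.put((index, func, arg))
    for _ in workers:
        todo.put(None)                    # one sentinel per worker
    for w in workers:
        w.join()
    results = [None] * len(tasks)
    while not done.empty():               # safe here: all workers have finished
        index, value = done.get()
        results[index] = value
    return results


print(master_worker([(len, "abc"), (sum, [1, 2, 3]), (abs, -7)]))
```

The master and the workers run different code, and the tasks themselves may be different functions, which is the sense in which this scheme is functional rather than data parallelism.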
Ref: http://prdrklaina.weebly.com/uploads/5/7/7/3/5773421/introduction_to_high_performance_computing_for_scientists_and_engineers.pdf
Models: SIMD, MIMD, SIMT, SPMD
MIMD
SIMT
• In a dataflow computer, the execution of an instruction is driven by data availability instead of being
guided by a program counter. In theory, any instruction should be ready for execution whenever
operands become available.
• The instructions in a data-driven program are not ordered in any way. Instead of being stored
separately in a main memory, data are directly held inside instructions.
• This data-driven scheme requires no program counter and no control sequencer. However, it requires
special mechanisms to detect data availability, to match data tokens with needy instructions, and to
enable the chain reaction of asynchronous instruction executions. No memory sharing between
instructions results in no side effects.
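The data-driven firing rule can be sketched with a tiny interpreter: there is no program counter, and an instruction fires as soon as all of its operand tokens exist. The representation (a dictionary of named instructions) and the function name `run_dataflow` are assumptions chosen for illustration, not how a real dataflow machine stores programs.

```python
import operator


def run_dataflow(instructions, inputs):
    """instructions: name -> (function, operand names). Returns (tokens, firing order)."""
    tokens = dict(inputs)                 # data tokens available so far
    pending = dict(instructions)
    firing_order = []
    while pending:
        # Detect data availability: an instruction is ready when all operands exist.
        ready = [name for name, (_, operands) in pending.items()
                 if all(op in tokens for op in operands)]
        if not ready:
            raise ValueError("no instruction can fire: missing operands")
        for name in ready:
            func, operands = pending.pop(name)
            tokens[name] = func(*(tokens[op] for op in operands))
            firing_order.append(name)
    return tokens, firing_order


program = {
    "d": (operator.mul, ("c", "b")),      # listed first, but must wait for c
    "c": (operator.add, ("a", "b")),
}
tokens, order = run_dataflow(program, {"a": 2, "b": 3})
print(order, tokens["d"])
```

The order in which the instructions are written does not matter; execution order emerges from the availability of operands.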
Demand Driven
N-Wide Superscalar
decouple these stages from the rest of the pipeline and regularize somewhat
breaks in the flow
Multicore Processor
Multithreaded Processor
Modern Processor and Architecture
Book Title: Advanced Computer Architecture
Author: Kai Hwang