
LECTURE 2: Parallel Computing Platforms (Part 1)

Ananth Grama, Anshul Gupta, George Karypis, and Vipin Kumar

To accompany the text ``Introduction to Parallel Computing'',
Addison Wesley, 2003.


Topic Overview
 Implicit Parallelism: Trends in Microprocessor
Architectures
 Limitations of Memory System Performance
 Dichotomy of Parallel Computing Platforms
 Communication Model of Parallel Platforms



Scope of Parallelism
 Conventional architectures broadly comprise a processor, a
memory system, and a datapath.
 Each of these components presents significant performance
bottlenecks.
 Parallelism addresses each of these components in significant ways.
 Different applications utilize different aspects of parallelism, e.g.:
 data-intensive applications utilize high aggregate throughput
 server applications utilize high aggregate network bandwidth
 scientific applications typically utilize high processing and
memory system performance.
 It is important to understand each of these performance
bottlenecks.



Implicit Parallelism: Trends in Microprocessor
Architectures
 Microprocessor clock speeds have posted impressive
gains over the past two decades.

 With current technologies, higher levels of device
integration provide a very large number of transistors
for the design of the microprocessor.

 Current processors therefore use these resources
(transistors, etc.) to execute multiple instructions in the
same cycle.



Machine Cycle
• The steps performed by the processor for each machine
language instruction it receives. The machine cycle is a
four-step cycle that includes reading and interpreting the
machine-language instruction, executing it, and then
storing the result.
• Four steps:
• Fetch - Retrieve an instruction from memory.
• Decode - Translate the retrieved instruction into a series of
computer commands.
• Execute - Carry out the computer commands.
• Store - Send and write the results back to memory.
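Illustration (not part of the original slide): a minimal C sketch of the fetch-decode-execute-store loop for a toy, hypothetical 16-bit instruction format; the opcodes and encoding are invented purely for illustration.

    #include <stdint.h>
    #include <stdio.h>

    /* Toy machine: each 16-bit instruction = 4-bit opcode, two 4-bit source
       registers, and a 4-bit destination register. Opcodes are hypothetical. */
    enum { OP_ADD = 0, OP_SUB = 1, OP_HALT = 15 };

    int main(void) {
        uint16_t mem[4] = { 0x0123, 0x1312, 0xF000 };  /* tiny program         */
        int32_t reg[16] = { 0, 5, 7 };                 /* register file        */
        unsigned pc = 0;

        for (;;) {
            uint16_t instr = mem[pc++];                /* Fetch                */
            unsigned op = instr >> 12;                 /* Decode               */
            unsigned rs = (instr >> 8) & 0xF;
            unsigned rt = (instr >> 4) & 0xF;
            unsigned rd = instr & 0xF;
            if (op == OP_HALT) break;
            int32_t result = (op == OP_ADD) ? reg[rs] + reg[rt]   /* Execute  */
                                            : reg[rs] - reg[rt];
            reg[rd] = result;                          /* Store (write back)   */
        }
        printf("r3 = %d, r2 = %d\n", reg[3], reg[2]);  /* prints r3 = 12, r2 = 7 */
        return 0;
    }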



Pipelining and Superscalar Execution
• Processors have relied on pipelining to improve execution
rate.
• An instruction pipeline is a technique used in the design
of computers to increase their instruction throughput (the
number of instructions that can be executed in a unit of time).
• The basic instruction cycle is broken up into a series of
stages called a pipeline.
• Rather than processing each instruction sequentially
(completing one before starting the next), each instruction is
split into a sequence of steps, so that steps from different
instructions can be executed concurrently, in parallel.
• Pipelining increases instruction throughput by performing
multiple operations at the same time.
• Analogy - a laundry task: 1) washing the clothes, 2) drying,
3) ironing. While one load is being ironed, the next can be
drying and a third can be washing, so all stages are busy
simultaneously.

Pipelining and Superscalar Execution
 Pipelining overlaps various stages of instruction execution to
achieve performance.
 In pipelining, an instruction can be executed while the next
one is being decoded and the next one is being fetched.
 An assembly line for the manufacture of cars is an analogy for
pipelining.
◦ A 100-time-unit task is broken into 10 pipelined stages of 10
units each.
◦ This enables faster execution and increases throughput.
 Limitations of pipelining (see the sketch below):
 The speed of a pipeline is eventually limited by its slowest
stage.
 A slow stage congests the pipeline and becomes a bottleneck.
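A back-of-the-envelope C sketch (assuming an ideal pipeline with no stalls) of the 100-time-unit example above, showing both the near-10x gain and how a single slow stage throttles the whole pipeline. The numbers are the slide's illustrative values, not measurements.

    #include <stdio.h>

    /* Ideal pipeline model: n instructions, k stages, slowest stage takes t_max.
       Unpipelined time = n * total_work
       Pipelined time   = (k + n - 1) * t_max  (fill the pipe, then one result
                                                 per t_max)                     */
    int main(void) {
        double total_work = 100.0;   /* time units per instruction, unpipelined */
        int    k = 10;               /* pipeline stages                          */
        double t_max = 10.0;         /* slowest stage (balanced here)            */
        int    n = 1000;             /* instructions                             */

        double serial    = n * total_work;
        double pipelined = (k + n - 1) * t_max;
        printf("speedup = %.2f\n", serial / pipelined);            /* ~9.9x */

        /* If one stage is unbalanced (say 20 units), it throttles the pipe. */
        t_max = 20.0;
        pipelined = (k + n - 1) * t_max;
        printf("speedup with slow stage = %.2f\n", serial / pipelined); /* ~5x */
        return 0;
    }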



Pipelining and Superscalar Execution

Basic five-stage pipeline in a RISC machine:


 IF = Instruction Fetch
 ID = Instruction Decode
 EX = Execute
 MEM = Memory access
 WB = Register write back
(RISC = Reduced Instruction Set Computing)
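Below is a plain-text reconstruction of the classic five-stage pipeline timing diagram (the original slide shows this as a colored figure):

    Cycle:     1    2    3    4    5    6    7    8    9
    Instr 1:   IF   ID   EX   MEM  WB
    Instr 2:        IF   ID   EX   MEM  WB
    Instr 3:             IF   ID   EX   MEM  WB
    Instr 4:                  IF   ID   EX   MEM  WB
    Instr 5:                       IF   ID   EX   MEM  WB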

In the fourth clock cycle, the earliest instruction is in the MEM
stage, and the latest instruction has not yet entered the pipeline.



Pipelining and Superscalar Execution
 One simple way of alleviating these bottlenecks is to
use multiple pipelines.
 Multiple pipelines:
◦ Improve instruction execution rate.
◦ Multiple instructions are piped into the processor in
parallel.
◦ Multiple instructions are executed on multiple
functional units.
 Superscalar execution: The ability of a processor to
issue multiple instructions in the same cycle.



Superscalar Execution: An Example

Figure: example of a two-way superscalar execution of instructions. By
fetching and dispatching two instructions at a time, a maximum of two
instructions per cycle can be completed.



Superscalar Execution: An Example
 In the above example, there is some
wastage of resources due to data
dependencies.
 The example also illustrates that different
instruction mixes with identical semantics
can take significantly different execution
time.



Superscalar Execution
 Scheduling of instructions is determined by a number of factors:

◦ True Data Dependency: The result of one operation is an
input to the next, e.g., ADD R1 R2 R3 followed by
SUB R6 R1 R4, where the SUB uses the result of the ADD.
◦ Resource Dependency: Two operations require the same
resource.
◦ Branch Dependency: Scheduling instructions across
conditional branch statements cannot be done deterministically
a priori.
◦ The scheduler, a piece of hardware, looks at a large number of
instructions in an instruction queue and selects an appropriate
number of instructions to execute concurrently based on these
factors.
◦ The complexity of this hardware is an important constraint on
superscalar processors.



Superscalar Execution: Issue Mechanisms
 In the simpler model, instructions can be issued only in
the order in which they are encountered. That is, if the
second instruction cannot be issued because it has a
data dependency with the first, only one instruction is
issued in the cycle. This is called in-order issue.
 In a more aggressive model, instructions can be issued
out of order. In this case, if the second instruction has
data dependencies with the first, but the third
instruction does not, the first and third instructions can
be co-scheduled. This is also called dynamic issue.
 Performance of in-order issue is generally limited.



Superscalar Execution: Efficiency Considerations
 Not all functional units can be kept busy at all times.
 Vertical waste: If during a cycle, no functional units
are utilized.
 Horizontal waste: If during a cycle, only some of the
functional units are utilized.
 Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to
extract parallelism, the performance of superscalar
processors is eventually limited.
 Conventional microprocessors typically support four-
way superscalar execution.



Very Long Instruction Word (VLIW) Processors
 The hardware cost and complexity of
the superscalar scheduler is a major
consideration in processor design.
 To address this issue, VLIW
processors rely on compile-time
analysis to identify and bundle
together instructions that can be
executed concurrently.
 These instructions are packed and
dispatched together, and thus the
name very long instruction word.
 This concept was used with some
commercial success in the Multiflow
Trace machine (circa 1984).
 Variants of this concept are employed
in the Intel IA64 processors.



Limitations of
Memory System Performance
 Memory system, and not processor speed, is often the
bottleneck for many applications.
 Memory system performance is largely captured by two
parameters, latency and bandwidth.
 Latency is the time from the issue of a memory
request to the time the data is available at the
processor.
 Bandwidth is the rate at which data can be pumped to
the processor by the memory system.



Memory System Performance: Bandwidth
and Latency
 It is very important to understand the difference
between latency and bandwidth.
 Consider the example of a fire hose (an example of latency): if
the water comes out of the hose two seconds after the hydrant is
turned on, the latency of the system is two seconds.
 Once the water starts flowing (an example of bandwidth), if the
hydrant delivers water at the rate of 5 gallons/second, the
bandwidth of the system is 5 gallons/second.
 If you want an immediate response from the hydrant, it is
important to reduce latency.
 If you want to fight big fires, you want high bandwidth.
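The same distinction in a short C sketch (the latency and bandwidth numbers below are made-up placeholders, not measurements of any real machine): the time to move a block of data is roughly the latency plus the size divided by the bandwidth.

    #include <stdio.h>

    /* Simple model: transfer_time = latency + bytes / bandwidth. */
    int main(void) {
        double latency_s   = 100e-9;              /* 100 ns to first byte  */
        double bandwidth   = 10e9;                /* 10 GB/s sustained     */
        double small_block = 64.0;                /* one cache line        */
        double large_block = 64.0 * 1024 * 1024;  /* 64 MB                 */

        printf("64 B  : %.3g s (latency dominates)\n",
               latency_s + small_block / bandwidth);
        printf("64 MB : %.3g s (bandwidth dominates)\n",
               latency_s + large_block / bandwidth);
        return 0;
    }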



Improving Effective Memory
Latency Using Caches
 Caches are small and fast memory elements between
the processor and DRAM.
 This memory acts as a low-latency high-bandwidth
storage.
 If a piece of data is repeatedly used, the effective latency
of this memory system can be reduced by the cache.
 The fraction of data references satisfied by the cache is
called the cache hit ratio of the computation on the
system.
 The cache hit ratio achieved by a code on a memory system
often determines its performance.
Note: the processor is much faster than the other components of the
system, in particular main memory; caches help bridge this speed gap.
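A minimal C sketch (with illustrative, made-up latencies) of how the cache hit ratio determines the effective memory latency seen by the processor:

    #include <stdio.h>

    /* Effective access time under a simple cache model:
         t_eff = h * t_cache + (1 - h) * t_dram
       The latencies below are illustrative placeholders. */
    int main(void) {
        double t_cache = 1.0;    /* ns */
        double t_dram  = 100.0;  /* ns */
        for (double h = 0.50; h <= 1.0001; h += 0.25) {
            double t_eff = h * t_cache + (1.0 - h) * t_dram;
            printf("hit ratio %.2f -> effective latency %.1f ns\n", h, t_eff);
        }
        return 0;
    }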
Impact of Memory Bandwidth
 Memory bandwidth is determined by the
bandwidth of the memory bus as well as the
memory units.

 Memory bandwidth can be improved by increasing the size of
memory blocks.
 The underlying system takes l time units (where l is the
latency of the system) to deliver b units of data (where b is
the block size), so larger blocks amortize the latency over
more data.
 Analogy: how fast a single car travels is like latency; how
many lanes the road has is like bandwidth.
Alternate Approaches for
Hiding Memory Latency
 Consider the problem of browsing the web on a very slow
network connection. We deal with the problem in one of three
possible ways:

◦ Prefetching: We anticipate which pages we are going to browse
ahead of time and issue requests for them in advance;

◦ Multithreading: We open multiple browsers and access
different pages in each browser, so while we are waiting for
one page to load, we can be reading others; or

◦ Spatial locality: We access a whole bunch of pages in one go,
reducing the latency across the various accesses.
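As a concrete, compiler-specific illustration of the prefetching idea (not from the lecture): GCC and Clang provide a __builtin_prefetch hint, and the C sketch below requests data a few iterations ahead while the current element is being processed. Whether this helps in practice depends on the hardware prefetcher and the access pattern.

    #include <stdio.h>
    #include <stddef.h>

    #define PREFETCH_DISTANCE 16   /* how many iterations ahead to request data */

    /* Sum an array while hinting the memory system to start fetching data
       PREFETCH_DISTANCE iterations ahead (GCC/Clang builtin; omitted elsewhere). */
    static double sum_with_prefetch(const double *a, size_t n) {
        double s = 0.0;
        for (size_t i = 0; i < n; ++i) {
    #if defined(__GNUC__) || defined(__clang__)
            if (i + PREFETCH_DISTANCE < n)
                __builtin_prefetch(&a[i + PREFETCH_DISTANCE], 0 /*read*/, 1 /*locality*/);
    #endif
            s += a[i];
        }
        return s;
    }

    int main(void) {
        static double a[1 << 20];                       /* 8 MB of doubles */
        for (size_t i = 0; i < sizeof a / sizeof a[0]; ++i) a[i] = 1.0;
        printf("sum = %.0f\n", sum_with_prefetch(a, sizeof a / sizeof a[0]));
        return 0;
    }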



Tradeoffs of Multithreading and Prefetching
 Multithreading and prefetching are critically impacted by
the memory bandwidth.
 Multithreaded systems become bandwidth bound
instead of latency bound.
 Multithreading and prefetching only address the latency
problem and may often worsen the bandwidth problem.
 Multithreading and prefetching also require significantly
more hardware resources in the form of storage.



Dichotomy of Parallel Computing Platforms
 Parallel platforms can be viewed in terms of their logical and
physical organization.

◦ Logical organization: the programmer's view of the
platform.

◦ Physical organization: the actual hardware organization
of the platform.



Logical Organization

 An explicitly parallel program must specify:

◦ parallel/concurrent tasks (control structure).

◦ interactions between the concurrent subtasks
(communication model).



Control Structure of Parallel Programs
 Parallelism can be expressed at various
levels of granularity - from instruction level
to processes.

 Between these extremes exists a range of
models, along with corresponding
architectural support.



Control Structure of Parallel Programs
 Processing units in parallel computers either operate under the
centralized control of a single control unit or work independently.

 If there is a single control unit that dispatches the same instruction
to various processors (that work on different data), the model is
referred to as single instruction stream, multiple data
stream (SIMD).

 If each processor has its own control unit, each processor can
execute different instructions on different data items. This model is
called multiple instruction stream, multiple data stream
(MIMD).



SIMD and MIMD Processors

A typical SIMD architecture (a) and a typical MIMD architecture (b).



SIMD Processors
 Some of the earliest parallel computers such as the Illiac IV,
MPP, DAP, CM-2, and MasPar MP-1 belonged to this class of
machines.

 Variants of this concept have found use in co-processing
units such as the MMX units in Intel processors and DSP
chips such as the Sharc.

 SIMD relies on the regular structure of computations (such
as those in image processing).



MIMD Processors
 In contrast to SIMD processors, MIMD processors can
execute different programs on different processors.
 A variant of this, called single program, multiple data
streams (SPMD), executes the same program on
different processors.
 It is easy to see that SPMD and MIMD are closely
related in terms of programming flexibility and
underlying architectural support.
 Examples of such platforms include current generation
Sun Ultra Servers, SGI Origin Servers, multiprocessor
PCs, workstation clusters, and the IBM SP.
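A minimal SPMD sketch in C using MPI (the message-passing library mentioned later in this lecture): every process runs the same program, and the rank returned by MPI_Comm_rank selects the work each process does. This is an illustrative sketch, not part of the original slides.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, size;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Same program everywhere; the rank decides which work to do. */
        if (rank == 0)
            printf("I am the coordinator among %d processes\n", size);
        else
            printf("Worker %d processing its own partition of the data\n", rank);

        MPI_Finalize();
        return 0;
    }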



IMPORTANT

 Flynn's Classical Taxonomy

• There are different ways to classify parallel computers.

• One of the more widely used classifications, in use since
1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor
computer architectures according to how they can be
classified along the two independent dimensions
of Instruction Stream and Data Stream. Each of these
dimensions can have only one of two possible
states: Single or Multiple.



The matrix below defines the 4 possible classifications
according to Flynn:

                          Single Data    Multiple Data
    Single Instruction    SISD           SIMD
    Multiple Instruction  MISD           MIMD



a) Single Instruction, Single Data (SISD)

 A serial (non-parallel) computer


 Single Instruction: Only one instruction stream is being acted on by
the CPU during any one clock cycle
 Single Data: Only one data stream is being used as input during any
one clock cycle
 Deterministic execution
 This is the oldest type of computer
 Examples: older generation mainframes, minicomputers,
workstations and single processor/core PCs.



b) Single Instruction, Multiple Data (SIMD)
 A type of parallel computer
 Single Instruction: All processing units execute the same instruction
at any given clock cycle
 Multiple Data: Each processing unit can operate on a different data
element
 Best suited for specialized problems characterized by a high degree
of regularity, such as graphics/image processing.
 Synchronous (lockstep) and deterministic execution
 Two varieties: Processor Arrays and Vector Pipelines
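To make the lockstep idea concrete, here is a small C sketch using x86 SSE intrinsics (short-vector SIMD units in the same spirit as the MMX units mentioned earlier): a single instruction adds four floats at once. It assumes an x86 processor with SSE and is not part of the original slides.

    #include <xmmintrin.h>   /* SSE intrinsics (x86) */
    #include <stdio.h>

    int main(void) {
        float a[4] = {1, 2, 3, 4};
        float b[4] = {10, 20, 30, 40};
        float c[4];

        __m128 va = _mm_loadu_ps(a);     /* load 4 floats               */
        __m128 vb = _mm_loadu_ps(b);
        __m128 vc = _mm_add_ps(va, vb);  /* ONE instruction, four adds  */
        _mm_storeu_ps(c, vc);

        printf("%g %g %g %g\n", c[0], c[1], c[2], c[3]);  /* 11 22 33 44 */
        return 0;
    }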



c) Multiple Instructions, Single Data (MISD)
 A type of parallel computer
 Multiple Instructions: Each processing unit operates on the data
independently via separate instruction streams.
 Single Data: A single data stream is fed into multiple processing
units.
 Few (if any) actual examples of this class of parallel computer have
ever existed.
 Some conceivable uses might be:
o multiple frequency filters operating on a single signal stream
o multiple cryptography algorithms attempting to crack a single coded
message.



d) Multiple Instructions, Multiple Data (MIMD)
 A type of parallel computer
 Multiple Instruction: Every processor may be executing a different
instruction stream
 Multiple Data: Every processor may be working with a different data
stream
 Execution can be synchronous or asynchronous, deterministic or
non-deterministic
 Currently, the most common type of parallel computer - most modern
supercomputers fall into this category.
 Examples: most current supercomputers, networked parallel computer
clusters and "grids", multi-processor SMP computers, multi-core PCs.
 Note: many MIMD architectures also include SIMD execution sub-
components



SIMD-MIMD Comparison
 SIMD computers require less hardware than MIMD
computers (single control unit).
 However, since SIMD processors are specially designed,
they tend to be expensive and to have long design cycles.
 Not all applications are naturally suited to SIMD
processors.
 In contrast, platforms supporting the SPMD paradigm
can be built from inexpensive off-the-shelf components
with relatively little effort in a short amount of time.



Communication Model of Parallel Platforms
 There are two primary forms of data exchange between
parallel tasks:
◦ accessing a shared data space
◦ exchanging messages

 Platforms that provide a shared data space are called
shared-address-space machines or multiprocessors.

 Platforms that support messaging are also called message-
passing platforms or multicomputers.





Shared-Address-Space Platforms
 Part (or all) of the memory is accessible to all
processors.
 Processors interact by modifying data objects stored in
this shared-address-space.
 The platform is classified as:
◦ Uniform Memory Access (UMA)
 Processors have equal access time
to memory in the system.
 Identical processors
◦ Non-Uniform Memory Access (NUMA) machine.
 Not all processors have equal access time to all memories.
 Memory access across link is slower



NUMA and UMA Shared-Address-Space Platforms

Figure: typical shared-address-space architectures: (a) uniform-
memory-access shared-address-space computer; (b) non-
uniform-memory-access shared-address-space computer
with local memory only.



NUMA and UMA Shared-Address-Space Platforms
 The distinction between NUMA and UMA platforms is
important from the point of view of algorithm design.
NUMA machines require locality from underlying algorithms
for performance.
 Programming these platforms is easier since reads and
writes are implicitly visible to other processors.
 However, read-write accesses to shared data must be
coordinated (this will be discussed in greater detail when we
talk about threads programming).
 Caches in such machines require coordinated access to
multiple copies. This leads to the cache coherence problem.
 A weaker model of these machines provides an address map,
but not coordinated access. These models are called non
cache coherent shared address space machines.
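The need to coordinate read-write access to shared data can be illustrated with a small pthreads sketch in C (a preview of the threads-programming material referenced above); without the mutex, concurrent increments of the shared counter could be lost.

    #include <pthread.h>
    #include <stdio.h>

    /* Shared-address-space programming in miniature: the counter lives in
       memory visible to all threads, so updates must be coordinated. */
    static long counter = 0;
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    static void *worker(void *arg) {
        (void)arg;
        for (int i = 0; i < 100000; ++i) {
            pthread_mutex_lock(&lock);
            counter++;                    /* read-modify-write on shared data */
            pthread_mutex_unlock(&lock);
        }
        return NULL;
    }

    int main(void) {
        pthread_t t[4];
        for (int i = 0; i < 4; ++i) pthread_create(&t[i], NULL, worker, NULL);
        for (int i = 0; i < 4; ++i) pthread_join(t[i], NULL);
        printf("counter = %ld\n", counter);  /* always 400000 with the mutex */
        return 0;
    }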



Message-Passing Platforms
 These platforms comprise a set of processors, each with
its own (exclusive) memory.

 Instances of such a view come naturally from clustered
workstations and non-shared-address-space
multicomputers.



Message-Passing Platforms
 These platforms are programmed using (variants of) send and
receive primitives.
 Libraries such as MPI and PVM provide such primitives.
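A minimal C illustration of such send and receive primitives using MPI (a sketch only, assuming the program is launched with at least two processes): process 0 sends one integer and process 1 receives it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Send(&value, 1, MPI_INT, 1 /*dest*/, 0 /*tag*/, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0 /*source*/, 0 /*tag*/, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);   /* data is copied, not shared */
        }

        MPI_Finalize();
        return 0;
    }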

Beowulf is a multi-computer architecture which can
be used for parallel computations. It is a system
which usually consists of one server node and one
or more client nodes connected via Ethernet or
some other network.



Message Passing
vs.
Shared Address Space Platforms

 Message passing:
◦ requires little hardware support, other than a network.
◦ processors must explicitly communicate with each
other through messages.
◦ data exchanged among processors cannot be shared, it
is copied (using send/receive messages).

 Shared address space platforms:
◦ require more hardware support.
◦ processors access memory through the shared bus.
◦ data sharing between tasks is fast and uniform.



Next Week:

Lecture 3:
Parallel Platforms
(Part 2)

