
Parallel Computing Platforms

Dr. Nausheen Shoaib


Scope of Parallelism

► Conventional architectures coarsely comprise a processor, a memory system, and the datapath.
► Each of these components presents significant performance bottlenecks.
► Parallelism addresses each of these components in significant ways.
► Different applications utilize different aspects of parallelism - e.g., data-intensive applications utilize high aggregate throughput, server applications utilize high aggregate network bandwidth, and scientific applications typically utilize high processing and memory system performance.
► It is important to understand each of these performance bottlenecks.
Implicit Parallelism: Trends in
Microprocessor Architectures
► Microprocessor clock speeds have posted impressive
gains over the past two decades (two to three orders
of magnitude).
► Higher levels of device integration have made
available a large number of transistors.
► The question of how best to utilize these resources is
an important one.
► Current processors use these resources in multiple
functional units and execute multiple instructions in
the same cycle.
► The precise manner in which these instructions are
selected and executed provides impressive diversity
in architectures.
Pipelining and Superscalar
Execution
► Pipelining overlaps various stages of instruction
execution to achieve performance.
► At a high level of abstraction, one instruction can be executed while the next one is being decoded and the one after that is being fetched.
► This is akin to an assembly line for the manufacture of cars.
Pipelining and Superscalar
Execution
► Pipelining, however, has several limitations.
► The speed of a pipeline is eventually limited by the
slowest stage.
► To keep each stage short and the clock rate high, conventional processors therefore rely on very deep pipelines (20-stage pipelines in state-of-the-art Pentium processors).
► However, in typical program traces, every 5-6th
instruction is a conditional jump! This requires very
accurate branch prediction.
► The penalty of a misprediction grows with the depth of
the pipeline, since a larger number of instructions will
have to be flushed.
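A rough arithmetic sketch, with an assumed 95% branch prediction accuracy (not a figure from the slide): if every 5th instruction is a conditional branch and a misprediction flushes roughly 20 in-flight instructions on a 20-stage pipeline, the average cost is about (1/5) x 0.05 x 20 = 0.2 flushed instructions per instruction executed, and this cost scales linearly with pipeline depth.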
Pipelining and Superscalar
Execution
► One simple way of alleviating these bottlenecks is to use multiple pipelines and issue multiple instructions in the same cycle (superscalar execution).
► The question then becomes one of selecting which instructions to issue together.
Superscalar Execution: An Example

Example of a two-way superscalar execution of instructions.


Superscalar Execution: An Example

► In the above example, there is some wastage of resources due to data dependencies.
► The example also illustrates that different instruction mixes with identical semantics can take significantly different execution times.
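A hypothetical instruction mix (register names and addresses are illustrative, not taken from the figure) showing how data dependencies waste issue slots on a two-way superscalar processor:

LD  R1, [R2]     ; cycle 1: the two loads are independent,
LD  R3, [R4]     ;          so both issue slots are used
ADD R5, R1, R3   ; cycle 2: depends on both loads - issues alone
ST  [R6], R5     ; cycle 3: depends on the ADD - issues alone

In cycles 2 and 3 only one of the two issue slots is used (horizontal waste), so the four instructions take three cycles instead of the ideal two.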
Superscalar Execution

► Scheduling of instructions is determined by a number of factors:
► True Data Dependency: The result of one operation is an input to the next.
► Resource Dependency: Two operations require the same resource.
► Branch Dependency: Scheduling instructions across conditional branch statements cannot be done deterministically a priori.
► The scheduler, a piece of hardware, looks at a large number of instructions in an instruction queue and selects an appropriate number of instructions to execute concurrently based on these factors.
► The complexity of this hardware is an important constraint on superscalar processors.
Superscalar Execution:
Issue Mechanisms
► In the simpler model, instructions can be issued only in
the order in which they are encountered. That is, if the
second instruction cannot be issued because it has a
data dependency with the first, only one instruction is
issued in the cycle. This is called in-order issue.
► In a more aggressive model, instructions can be issued
out of order. In this case, if the second instruction has
data dependencies with the first, but the third
instruction does not, the first and third instructions can
be co-scheduled. This is also called dynamic issue.
► Performance of in-order issue is generally limited.
Superscalar Execution:
Efficiency Considerations
► Not all functional units can be kept busy at all times.
► If during a cycle, no functional units are utilized, this is
referred to as vertical waste.
► If during a cycle, only some of the functional units are
utilized, this is referred to as horizontal waste.
► Due to limited parallelism in typical instruction traces,
dependencies, or the inability of the scheduler to
extract parallelism, the performance of superscalar
processors is eventually limited.
► Conventional microprocessors typically support four-way
superscalar execution.
Very Long Instruction Word
(VLIW) Processors
► The hardware cost and complexity of the superscalar
scheduler is a major consideration in processor design.
► To address these issues, VLIW processors rely on compile-time analysis to identify and bundle together instructions that can be executed concurrently.
► These instructions are packed and dispatched together,
and thus the name very long instruction word.
► This concept was used with some commercial success in
the Multiflow Trace machine (circa 1984).
► Variants of this concept are employed in the Intel IA64
processors.
Very Long Instruction Word
(VLIW) Processors: Considerations

► Issue hardware is simpler.


► Compiler has a bigger context from which to select
co-scheduled instructions.
► Compilers, however, do not have runtime information
such as cache misses. Scheduling is, therefore,
inherently conservative.
► Branch and memory prediction is more difficult.
► VLIW performance is highly dependent on the compiler. A number of techniques, such as loop unrolling, speculative execution, and branch prediction, are critical.
► Typical VLIW processors are limited to 4-way to 8-way
parallelism.
Limitations of
Memory System Performance
► The memory system, and not processor speed, is often the bottleneck for many applications.
► Memory system performance is largely captured by two parameters, latency and bandwidth.
► Latency is the time from the issue of a memory request to the time the data is available at the processor.
► Bandwidth is the rate at which data can be pumped to the processor by the memory system.
Memory System Performance:
Bandwidth and Latency
► It is very important to understand the difference
between latency and bandwidth.
► Consider the example of a fire-hose. If the water comes
out of the hose two seconds after the hydrant is turned
on, the latency of the system is two seconds.
► Once the water starts flowing, if the hydrant delivers
water at the rate of 5 gallons/second, the bandwidth of
the system is 5 gallons/second.
► If you want immediate response from the hydrant, it is
important to reduce latency.
► If you want to fight big fires, you want high bandwidth.
Memory Latency: An Example

► Consider a processor operating at 1 GHz (1 ns clock) connected to a DRAM with a latency of 100 ns (no caches). Assume that the processor has two multiply-add units and is capable of executing four instructions in each cycle of 1 ns. The following observations follow:
► The peak processor rating is 4 GFLOPS.
► Since the memory latency is equal to 100 cycles and the block size is one word, every time a memory request is made, the processor must wait 100 cycles before it can process the data.

Kilo 10^3, Mega 10^6, Giga 10^9, Tera 10^12, Peta 10^15, Exa 10^18, Zetta 10^21, Yotta 10^24
Milli 10^-3, Micro 10^-6, Nano 10^-9, Pico 10^-12, Femto 10^-15
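A one-line check of the peak rating (a sketch, counting a multiply-add as two floating point operations): 2 multiply-add units x 2 FLOPs each per 1 ns cycle = 4 FLOPs/cycle, i.e. 4 x 10^9 FLOPs/s = 4 GFLOPS.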
Memory Latency: An Example

► On the above architecture, consider the problem of computing a dot-product of two vectors.
► A dot-product computation performs one multiply-add on a single pair of vector elements, i.e., each floating point operation requires one data fetch.
► It follows that the peak speed of this computation is limited to one floating point operation every 100 ns, or a speed of 10 MFLOPS, a very small fraction of the peak processor rating!
The same dot-product, written as code:

int dotProduct(int vect_A[], int vect_B[], int n) {
    int product = 0;
    // Loop to calculate the dot product
    for (int i = 0; i < n; i++)
        product = product + vect_A[i] * vect_B[i];
    return product;
}

Each multiply-add requires two fetches from memory:

LD   R1, [R2]     ; 100 ns
LD   R3, [R4]     ; 100 ns
MADD R5, R1, R3   ; 1 ns

2 FLOPs / 200 ns = 10 MFLOPS
Improving Effective Memory
Latency Using Caches
► Caches are small and fast memory elements between
the processor and DRAM.
► This memory acts as a low-latency high-bandwidth
storage.
► If a piece of data is repeatedly used, the effective
latency of this memory system can be reduced by the
cache.
► The fraction of data references satisfied by the cache is
called the cache hit ratio of the computation on the
system.
► Cache hit ratio achieved by a code on a memory system
often determines its performance.

What is the average memory access time if the cache hit ratio is 85% and the memory access time is 100 ns?
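A worked answer under one common convention, assuming a 1 ns cache access time (the figure used in the example that follows; the slide does not state it):

average access time = hit ratio x cache time + miss ratio x memory time
                    = 0.85 x 1 ns + 0.15 x 100 ns ≈ 15.85 ns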
Impact of Caches: Example
Consider the architecture from the previous example. In this case, we
introduce a cache of size 32 KB with a latency of 1 ns or one cycle. We
use this setup to multiply two matrices A and B of dimensions 32 × 32.
We have carefully chosen these numbers so that the cache is large enough to
store matrices A and B, as well as the result matrix C.

/* O(n^3) matrix multiplication, n = 32 */
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            mul[i][j] += a[i][k] * b[k][j];

Action items (Homework)

1. Draw the processor, cache, and memory.
2. Draw a cache of size 32 KB.
3. How do three matrices of size 32 x 32 fit into this cache?
4. How is cache data reused during these calculations?
Impact of Caches: Example (continued)
► The following observations can be made about the problem:
► Fetching the two matrices (n x n = 32 x 32 each) into the cache corresponds to fetching 2K words, which takes approximately 200 µs. HOW? (See the arithmetic sketch after this list.)
► Multiplying two n × n matrices takes 2n^3 operations. For our problem, this corresponds to 64K operations, which can be performed in 16K cycles (or 16 µs) at four instructions per cycle.
► The total time for the computation is therefore approximately the sum of the time for load/store operations and the time for the computation itself, i.e., 200 + 16 µs.
► This corresponds to a peak computation rate of 64K operations in 216 µs, or roughly 303 MFLOPS.
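A minimal arithmetic sketch for the numbers above (assuming one word per 100 ns memory access and no overlap of fetch and compute):

2 x 32 x 32 = 2048 words;   2048 x 100 ns ≈ 205 µs ≈ 200 µs to load A and B
2 x 32^3 = 64K operations;  64K / 4 per cycle = 16K cycles = 16 µs to compute
64K FLOPs / 216 µs ≈ 303 MFLOPS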

A thirty-fold improvement over the previous example. However, it is still less than 10% of the peak processor performance. By placing a small cache memory, we are able to improve processor utilization considerably.
Impact of Memory Bandwidth

► Memory bandwidth is determined by the bandwidth of the memory bus as well as the memory units.
► Memory bandwidth can be improved by increasing the size of memory blocks.
► The underlying system takes l time units (where l is the latency of the system) to deliver b units of data (where b is the block size).
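A small illustration with the running example's numbers (l = 100 ns), assuming the whole block arrives after one latency:

effective bandwidth = b / l
b = 1 word:  1 word / 100 ns = 10 million words/s
b = 4 words: 4 words / 100 ns = 40 million words/s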
Impact of Memory Bandwidth:
Example

► Consider the same setup as before, except in this case, the block size is 4 words instead of 1 word. We repeat the dot-product computation in this scenario:
► Assuming that the vectors are laid out linearly in memory, eight FLOPs (four multiply-adds) can be performed in 200 cycles.
► This is because a single memory access fetches four consecutive words in the vector.
► Therefore, two accesses can fetch four elements of each of the vectors. This corresponds to a FLOP every 25 ns, for a peak speed of 40 MFLOPS.
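In the style of the earlier LD/MADD listing, a timing sketch for one block-sized step (operand names are illustrative):

LD   A[i..i+3]   ; 100 ns - one access fetches 4 consecutive words of A
LD   B[i..i+3]   ; 100 ns - one access fetches 4 consecutive words of B
4 x MADD         ; 4 ns, negligible next to the two accesses

8 FLOPs / 200 ns = 40 MFLOPS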
Cache: Pre-fetching data

The CPU in fact stores more than just the demanded data in the cache. It also loads neighbouring (following) data into the cache, because it is very likely that these data will be requested soon. The operation of reading more data than requested is called prefetching. If such prefetched data is in fact requested in the near future, it can be loaded with cache-hit latency instead of an expensive main memory reference.

http://katecpp.github.io/cache-prefetching/
Impact of Memory Bandwidth

► It is important to note that increasing the block size does not change the latency of the system.
► Physically, the scenario illustrated here can be viewed as a wide data bus (4 words or 128 bits) connected to multiple memory banks.
► In practice, such wide buses are expensive to construct.
► In a more practical system, consecutive words are sent on the memory bus on subsequent bus cycles after the first word is retrieved.
Impact of Memory Bandwidth

► The above examples clearly illustrate how increased bandwidth results in higher peak computation rates.
► The data layouts were assumed to be such that consecutive data words in memory were used by successive instructions (spatial locality of reference).
► If we take a data-layout centric view, computations must be reordered to enhance spatial locality of reference.
Impact of Memory Bandwidth:
Example
Consider the following code fragment:

for (i = 0; i < 1000; i++) {
    column_sum[i] = 0.0;
    for (j = 0; j < 1000; j++)
        column_sum[i] += b[j][i];
}

The code fragment sums columns of the matrix b into a vector column_sum.
Impact of Memory Bandwidth: Example

► The vector column_sum is small and easily fits into the cache.
► The matrix b is accessed in column order.
► The strided access results in very poor performance.

Multiplying a matrix with a vector: (a) multiplying column-by-column, keeping a running sum; (b) computing each element of the result as a dot product of a row of the matrix with the vector.
Impact of Memory Bandwidth:
Example

We can fix the above code as follows:

for (i = 0; i < 1000; i++)
    column_sum[i] = 0.0;
for (j = 0; j < 1000; j++)
    for (i = 0; i < 1000; i++)
        column_sum[i] += b[j][i];

In this case, the matrix is traversed in row order and performance can be expected to be significantly better.
Memory System Performance:
Summary
► The series of examples presented in this section illustrate the following concepts:
► Exploiting spatial and temporal locality in applications is critical for amortizing memory latency and increasing effective memory bandwidth.
► Spatial locality: locations near a recently referenced memory location (neighbouring data, or the instructions following a recently executed one) are likely to be referenced soon. Temporal locality: a recently referenced location is likely to be referenced again soon.
► The ratio of the number of operations to the number of memory accesses is a good indicator of anticipated tolerance to memory bandwidth.
► Memory layouts and organizing computation appropriately can make a significant impact on the spatial and temporal locality.
Alternate Approaches for
Hiding Memory Latency
► Consider the problem of browsing the web on a very
slow network connection. We deal with the problem in
one of three possible ways:
► we anticipate which pages we are going to browse ahead
of time and issue requests for them in advance;
► we open multiple browsers and access different pages in
each browser, thus while we are waiting for one page to
load, we could be reading others; or
► we access a whole bunch of pages in one go - amortizing
the latency across various accesses.
► The first approach is called prefetching, the second
multithreading, and the third one corresponds to spatial
locality in accessing memory words.
Multithreading for Latency
Hiding
A thread is a single stream of control in the flow of a program.
We illustrate threads with a simple example:

for (i = 0; i < n; i++)
    c[i] = dot_product(get_row(a, i), b);

Each dot-product is independent of the others, and therefore represents a concurrent unit of execution. We can safely rewrite the above code segment as:

for (i = 0; i < n; i++)
    c[i] = create_thread(dot_product, get_row(a, i), b);
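The create_thread call above is pseudocode. A minimal sketch of the same idea with POSIX threads; the array sizes, the dot_product helper, and the one-thread-per-row scheme are assumptions for illustration, not the slide's exact mechanism:

#include <pthread.h>

#define N 4      /* rows of a, number of threads (illustrative) */
#define M 8      /* vector length (illustrative) */

static double a[N][M], b[M], c[N];

/* Each thread computes one dot product: c[row] = a[row] . b */
static void *dot_product(void *arg) {
    int row = *(int *)arg;
    double sum = 0.0;
    for (int j = 0; j < M; j++)
        sum += a[row][j] * b[j];
    c[row] = sum;
    return NULL;
}

int main(void) {
    pthread_t threads[N];
    int rows[N];
    for (int i = 0; i < N; i++) {
        rows[i] = i;
        pthread_create(&threads[i], NULL, dot_product, &rows[i]);
    }
    for (int i = 0; i < N; i++)
        pthread_join(threads[i], NULL);
    return 0;
}

While one thread waits on memory for its pair of operands, another thread's requests can be outstanding, which is the latency-hiding effect described on the next slides.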
Multithreading for Latency
Hiding: Example
► In the code, the first instance of this function accesses a
pair of vector elements and waits for them.
► In the meantime, the second instance of this function
can access two other vector elements in the next cycle,
and so on.
► After l units of time, where l is the latency of the
memory system, the first function instance gets the
requested data from memory and can perform the
required computation.
► In the next cycle, the data items for the next function
instance arrive, and so on. In this way, in every clock
cycle, we can perform a computation.
Multithreading for Latency
Hiding
► The execution schedule in the previous example is
predicated upon two assumptions: the memory system is
capable of servicing multiple outstanding requests, and
the processor is capable of switching threads at every
cycle.
► It also requires the program to have an explicit
specification of concurrency in the form of threads.
► Machines such as the HEP and Tera rely on
multithreaded processors that can switch the context of
execution in every cycle. Consequently, they are able to
hide latency effectively.
Prefetching for Latency
Hiding
► Misses on loads cause programs to stall.
► Why not advance the loads so that by the time the data
is actually needed, it is already there!
► The only drawback is that you might need more space to
store advanced loads.
► However, if the advanced loads are overwritten, we are
no worse than before!
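One way this shows up in practice is compiler- or programmer-issued prefetch instructions. A minimal sketch using GCC's __builtin_prefetch; the prefetch distance of 16 elements is an arbitrary assumption, and the actual benefit depends on the hardware:

/* Sum an array while prefetching a fixed distance ahead. */
double sum_with_prefetch(const double *x, int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        if (i + 16 < n)
            __builtin_prefetch(&x[i + 16]);  /* request the line early; if unused, we are no worse off */
        sum += x[i];
    }
    return sum;
}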
Tradeoffs of Multithreading
and Prefetching
► Multithreading and prefetching are critically impacted
by the memory bandwidth. Consider the following
example:
► Consider a computation running on a machine with a 1 GHz
clock, 4-word cache line, single cycle access to the cache,
and 100 ns latency to DRAM. The computation has a cache
hit ratio at 1 KB of 25% and at 32 KB of 90%. Consider two
cases: first, a single threaded execution in which the
entire cache is available to the serial context, and second,
a multithreaded execution with 32 threads where each
thread has a cache residency of 1 KB.
► If the computation makes one data request in every cycle of 1 ns, you may notice that the first scenario requires 400 MB/s of memory bandwidth and the second, 3 GB/s.
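The arithmetic behind these two figures, assuming 4-byte words and counting one word of memory traffic per miss:

Single thread, 32 KB cache: hit ratio 90% -> 0.10 words/cycle from memory
    0.10 x 10^9 words/s x 4 bytes/word = 400 MB/s
32 threads, 1 KB each:      hit ratio 25% -> 0.75 words/cycle from memory
    0.75 x 10^9 words/s x 4 bytes/word = 3 GB/s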
Tradeoffs of Multithreading
and Prefetching
► Bandwidth requirements of a multithreaded system may
increase very significantly because of the smaller cache
residency of each thread.
► Multithreaded systems become bandwidth bound instead
of latency bound.
► Multithreading and prefetching only address the latency
problem and may often exacerbate the bandwidth
problem.
► Multithreading and prefetching also require significantly
more hardware resources in the form of storage.
Interconnection Networks
for Parallel Computers
► Interconnection networks carry data between processors
and to memory.
► Interconnects are made of switches and links (wires,
fiber).
► Interconnects are classified as static or dynamic.
► Static networks consist of point-to-point communication
links among processing nodes and are also referred to as
direct networks.
► Dynamic networks are built using switches and
communication links. Dynamic networks are also
referred to as indirect networks.
Static and Dynamic
Interconnection Networks

Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Interconnection Networks

► Switches map a fixed number of inputs to outputs.


► The total number of ports on a switch is the degree of
the switch.
► The cost of a switch grows as the square of the degree
of the switch, the peripheral hardware linearly as the
degree, and the packaging costs linearly as the number
of pins.
Interconnection Networks:
Network Interfaces
► Processors talk to the network via a network interface.
► The network interface may hang off the I/O bus or the
memory bus.
► In a physical sense, this distinguishes a cluster from a
tightly coupled multicomputer.
► The relative speeds of the I/O and memory buses
impact the performance of the network.
Network Topologies

► A variety of network topologies have been proposed and implemented.
► These topologies trade off performance for cost.
► Commercial machines often implement hybrids of multiple topologies for reasons of packaging, cost, and available components.
Network Topologies: Buses

► Some of the simplest and earliest parallel machines used buses.
► All processors access a common bus for exchanging data.
► The distance between any two nodes is O(1) in a bus. The bus also provides a convenient broadcast medium.
► However, the bandwidth of the shared bus is a major bottleneck.
► Typical bus-based machines are limited to dozens of nodes. Sun Enterprise servers and Intel Pentium based shared-bus multiprocessors are examples of such architectures.
Network Topologies: Buses

Bus-based interconnects: (a) with no local caches; (b) with local memory/caches.

Since much of the data accessed by processors is local to the processor, a local memory can improve the performance of bus-based machines.
Network Topologies: Crossbars
A crossbar network uses a p × m grid of switches to connect p inputs to m outputs in a non-blocking manner.

A completely non-blocking crossbar network connecting p processors to b memory banks.
