
High Performance

Computing
(DJ19DSC802)
Basics of Parallelization:
• Data parallelism
• Functional parallelism
• Parallel scalability
• Factors that limit parallel execution
• Scalability metrics
• Refined performance model
• Load imbalance
Technical Challenges:
• Quantum Tunneling:
A transistor smaller than about 5 nm cannot reliably stop the flow of electrons because
electrons tunnel through its depletion region. Due to tunneling, the electrons do not
"see" the depletion region and pass through it as if it did not exist, and a
transistor that cannot stop the flow of electrons is useless as a switch.
• Size of Atom: We are slowly approaching the size of an atom itself, and you
cannot build a transistor smaller than an atom. A silicon atom is only a fraction of a
nanometre across, and today's transistor gates are only a few tens of atoms wide. In a
few years, even ignoring quantum effects, we will not be able to go any smaller,
since we are reaching the physical limit of how small something can be.
• Heating and Current Effects: As we go smaller, transistors tend to get more “leaky”,
meaning that even in their OFF state, they let some current pass through. This is called
the leakage current.

• https://medium.com/@csoham358/beginners-guide-to-moore-s-law-3e00dd8b5057
Dennard Scaling
• Power = alpha * C * F * V²
• alpha = fraction of time the transistors are switching (activity factor)
• C = capacitance
• F = frequency
• V = voltage
• Capacitance is related to area
• So, as the size of the transistors shrank and the voltage was reduced,
circuits could operate at higher frequencies at the same power.
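
A minimal sketch (illustrative numbers, not taken from the slides) of how the
Power = alpha * C * F * V² relation behaves under one ideal Dennard scaling step,
where C and V shrink by a factor k while F grows by k:

/* Hedged sketch: dynamic power under ideal Dennard scaling.
 * P = alpha * C * F * V^2.  alpha, C, F, V and the scaling factor k
 * below are illustrative assumptions, not measured values. */
#include <stdio.h>

static double dynamic_power(double alpha, double C, double F, double V)
{
    return alpha * C * F * V * V;
}

int main(void)
{
    double alpha = 0.1;    /* fraction of time the gates switch          */
    double C = 1e-15;      /* capacitance per transistor (farads)        */
    double F = 2e9;        /* clock frequency (Hz)                       */
    double V = 1.2;        /* supply voltage (volts)                     */
    double k = 1.4;        /* one ideal scaling step (~0.7x dimensions)  */

    double p_old = dynamic_power(alpha, C, F, V);
    /* Ideal Dennard step: C and V shrink by k, F grows by k.            */
    double p_new = dynamic_power(alpha, C / k, F * k, V / k);

    printf("power per transistor: before %.3e W, after %.3e W (ratio %.2f)\n",
           p_old, p_new, p_new / p_old);
    /* Power per transistor falls by 1/k^2 while density grows by k^2,
     * so power per unit area stays constant.  Once V and leakage stop
     * scaling, this balance breaks and the "power wall" appears.        */
    return 0;
}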
End of Dennard Scaling
• Dennard scaling ignored the "leakage current" and "threshold
voltage", which establish a baseline of power per transistor.
• As transistors get smaller, power density increases because these
don't scale with size.
• Together, these effects created a "Power Wall" that has limited practical
processor frequency to around 4 GHz since 2006.
Memory Latency and Bandwidth
Latency refers to the delay between a request for data from
the CPU and when that data is actually available to be used.
It is typically measured in nanoseconds (ns).
Bandwidth refers to the rate at which data can be
transferred between the memory and the CPU, usually
measured in megabytes per second (MB/s) or gigabytes per
second (GB/s).

• Latency: how fast the memory responds to a single request.
• Bandwidth: how much data the memory can transfer over time.
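
A common first-order way to combine the two quantities is
time ≈ latency + bytes / bandwidth. The sketch below, with assumed illustrative
numbers, shows why latency dominates for small transfers:

/* Hedged sketch: first-order memory transfer-time model,
 * time = latency + bytes / bandwidth.  The numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    double latency_ns   = 100.0;   /* time until the first word arrives (ns) */
    double bandwidth_gb = 10.0;    /* sustained bandwidth (GB/s)              */
    double bytes        = 64.0;    /* one cache line                          */

    /* 1 GB/s equals 1 byte/ns, so bytes / bandwidth_gb is already in ns. */
    double transfer_ns = latency_ns + bytes / bandwidth_gb;

    printf("fetching %.0f bytes takes about %.1f ns "
           "(%.1f ns latency + %.1f ns transfer)\n",
           bytes, transfer_ns, latency_ns, bytes / bandwidth_gb);
    return 0;
}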
Potential Benefits, Limits and Costs of Parallel
Programming
Amdahl's Law

• Amdahl's Law states that potential program speedup is determined
by the fraction of code (P) that can be parallelized:

speedup = 1 / (1 - P)

• If none of the code can be parallelized, P = 0 and the speedup
is 1 (no speedup).
• If all of the code is parallelized, P = 1 and the speedup is
infinite (in theory).
• If 50% of the code can be parallelized, the maximum speedup is 2,
meaning the code will run at most twice as fast.
Potential Benefits, Limits and Costs of Parallel Programming
Amdahl's Law
• Introducing the number of processors performing the parallel fraction of work,
the relationship can be modeled by:

speedup = 1 / (P/N + S)

• where P = parallel fraction, N = number of processors and S = serial fraction.

• It soon becomes obvious that there are limits to the scalability of parallelism;
the sketch below tabulates the speedup for a few values of P and N.
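
A minimal sketch of the formula above; the parallel fractions and processor counts
are illustrative choices, not values given on the slides:

/* Hedged sketch: evaluating Amdahl's Law, speedup = 1 / (P/N + S),
 * for a few parallel fractions P and processor counts N. */
#include <stdio.h>

static double amdahl(double P, int N)
{
    double S = 1.0 - P;            /* serial fraction */
    return 1.0 / (P / N + S);
}

int main(void)
{
    double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
    int    procs[]     = { 10, 100, 1000, 10000 };

    printf("      N   P=0.50   P=0.90   P=0.95   P=0.99\n");
    for (int i = 0; i < 4; i++) {
        printf("%7d", procs[i]);
        for (int j = 0; j < 4; j++)
            printf(" %8.2f", amdahl(fractions[j], procs[i]));
        printf("\n");
    }
    /* Even with 99% parallel code the speedup saturates near 100:
     * the serial fraction S dominates as N grows. */
    return 0;
}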
Parallel Programming Platforms
1. Data-level parallelism
- Partition the data used in solving the problem among the cores.
- The same operation is performed on the various pieces of data.
2. Instruction/task-level parallelism
- Partition the various tasks carried out in solving the problem among the cores.
- The data is the same; the operations performed on it are different.
Parallel Programming Platforms
- Ex: Data-level parallelism
Suppose that we need to compute n values and add them
together. We know that this can be done with the following
serial code:

sum = 0;
for (i = 0; i < n; i++) {
   x = compute_next_value(...);
   sum += x;
}

Now suppose we also have p cores and p is much smaller than n
(p << n). Then each core can form a partial sum of approximately
n/p values:

my_sum = 0;
my_first_i = ...;
my_last_i = ...;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
   my_x = compute_next_value(...);
   my_sum += my_x;
}

The my_ prefix indicates that each core is using its own, private
variables, and each core can execute this block of code
independently of the other cores.
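
The "..." placeholders above stand for whatever partitioning scheme is used.
As a hypothetical illustration (the name block_range and the even block split
are assumptions, not from the slides), one common choice gives each core a
contiguous block of roughly n/p indices:

/* Hypothetical sketch of one common block partitioning behind the "..."
 * placeholders above: core my_rank (0..p-1) gets a contiguous block of
 * roughly n/p indices, and the first n % p cores get one extra element. */
#include <stdio.h>

static void block_range(int n, int p, int my_rank, int *first, int *last)
{
    int block = n / p;
    int rem   = n % p;

    if (my_rank < rem) {
        *first = my_rank * (block + 1);
        *last  = *first + block + 1;
    } else {
        *first = rem * (block + 1) + (my_rank - rem) * block;
        *last  = *first + block;
    }
}

int main(void)
{
    int n = 24, p = 8;   /* matches the 8-core, 24-value example that follows */

    for (int my_rank = 0; my_rank < p; my_rank++) {
        int my_first_i, my_last_i;
        block_range(n, p, my_rank, &my_first_i, &my_last_i);
        /* each core would then loop:
         * for (my_i = my_first_i; my_i < my_last_i; my_i++) { ... } */
        printf("core %d handles i = %d .. %d\n",
               my_rank, my_first_i, my_last_i - 1);
    }
    return 0;
}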
Parallel Programming Platforms
- Ex: Data-level parallelism
Compute 'n' values and add them together on p cores (p << n)

After each core completes execution of this code, its variable my_sum will store the sum
of the values computed by its calls to compute_next_value(). For example, if there are
eight cores, n = 24, and the 24 calls to compute_next_value() return the values

1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9

then the values stored in my_sum are:

Core:    0   1   2   3   4   5   6   7
my_sum:  8  19   7  15   7  13  12  14

When the cores are done computing their values of my_sum, the master
core computes the global sum.
Parallel Programming Platforms
- Ex: Data-level parallelism
Compute 'n' values and add them together on p cores (p << n)

if (I'm the master core) {
   sum = my_sum;
   for each core other than myself {
      receive value from core;
      sum += value;
   }
} else {
   send my_sum to the master core;
}

If core 0 is the master core:
sum = 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
Parallel Programming Platforms
- Ex: Instruction/task-level parallelism (when the number of cores is large)

With 1000 cores:
data parallelism requires 999 receives and adds,
while task parallelism requires only 10.
Basic concepts
• Adding 'n' numbers using 'n' processing elements.

If 'n' is a power of 2, these operations can be performed in log2(n) steps,

i.e. if n = 16, log2(16) = 4, since 2^4 = 16.
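
A minimal sketch of this pairwise (tree-structured) summation for n = 16.
The array values are illustrative; the adds within each step are the ones
that could run concurrently on separate processing elements:

/* Hedged sketch: pairwise (tree-structured) summation of n = 2^k values.
 * Each outer iteration halves the number of active partial sums, so the
 * result is ready after log2(n) steps. */
#include <stdio.h>

int main(void)
{
    double a[16] = { 1, 2, 3, 4, 5, 6, 7, 8,
                     9, 10, 11, 12, 13, 14, 15, 16 };
    int n = 16;
    int steps = 0;

    for (int stride = 1; stride < n; stride *= 2) {   /* log2(n) steps        */
        for (int i = 0; i < n; i += 2 * stride)       /* independent adds     */
            a[i] += a[i + stride];
        steps++;
    }
    printf("sum = %.0f computed in %d steps\n", a[0], steps);  /* 136 in 4 */
    return 0;
}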
Basic concepts
• Adding 'n' numbers using 'n' processing elements.

The problem can be solved in

Θ(n) time on a single processor (Ts), and
Θ(log n) time on multiple processors (Tp).

Ts = Θ(n), Tp = Θ(log n)

So what is the speedup?

Speedup = Ts / Tp = Θ(n / log n)
Parallel Programming Platforms

There are two main types of parallel systems:
shared-memory systems and distributed-memory systems.

• In a shared-memory system, the cores can share/access the
computer's memory.

• In a distributed-memory system, each core has its own, private memory,
and the cores must communicate explicitly by doing something like sending
messages across a network.
Parallel Programming Platforms
There are two main types of parallel systems:
shared-memory systems and distributed-memory systems.

• OpenMP was designed for programming shared-memory systems.
It provides mechanisms for accessing shared-memory locations and is
a high-level extension to C. For example, it can "parallelize" our 'for'
loop, as sketched below.

• MPI was designed for programming distributed-memory systems. It
provides mechanisms for sending messages.
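
As a hedged illustration of OpenMP "parallelizing" the earlier for loop:
compute_next_value() is a placeholder defined here only so the sketch compiles,
and the build flag (e.g. gcc -fopenmp) depends on your compiler.

/* Hedged sketch: the earlier serial sum parallelized with an OpenMP
 * reduction.  compute_next_value() is an assumed stand-in workload. */
#include <stdio.h>
#include <omp.h>

static double compute_next_value(int i)
{
    return (double)(i % 10);      /* placeholder workload */
}

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += compute_next_value(i);

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}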
What we will be doing…
Learning to write programs that are explicitly parallel,

• on parallel computers, using the C language and extensions to C:
the Message-Passing Interface (MPI) and OpenMP.

• MPI is a library of type definitions, functions, and macros that can be
used in C programs (a small MPI sketch follows this list).

• OpenMP consists of pragmas and some modifications to the C
compiler.

• CUDA programming on graphics processors/cards.
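
And, as a hedged counterpart for distributed memory, the same global sum written
with MPI. The cyclic partition and the placeholder compute_next_value() are
assumptions for illustration; build and run with your MPI tooling (e.g. mpicc, mpiexec).

/* Hedged sketch: the global sum on a distributed-memory system.  Each
 * process computes a partial sum over its own indices and MPI_Reduce
 * combines the partial sums on rank 0. */
#include <stdio.h>
#include <mpi.h>

static double compute_next_value(int i)
{
    return (double)(i % 10);      /* placeholder workload */
}

int main(int argc, char *argv[])
{
    const int n = 1000000;
    int rank, size;
    double my_sum = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simple cyclic partition: process 'rank' handles i = rank, rank+size, ... */
    for (int i = rank; i < n; i += size)
        my_sum += compute_next_value(i);

    /* The library combines the partial sums (typically tree-structured). */
    MPI_Reduce(&my_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (computed by %d processes)\n", sum, size);

    MPI_Finalize();
    return 0;
}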


Parallel Hardware…

(a) Shared Memory System (b) Distributed Memory System (c) GPU Architecture
Limitation of Memory System Performance
• The performance of a program relies on
• the speed of the processor, and
• the speed of the memory system (which feeds data to the processor).

• A memory system includes the (L1, L2, L3) caches.

• Latency and bandwidth determine memory system performance.
Limitation of Memory System Performance

Effect of memory latency on performance:

• Consider a processor with a 1 GHz (1 ns) clock connected to DRAM with a latency
of 100 ns.

• The processor can execute 4 instructions per 1 ns clock cycle
(assume the processor has 2 multiply-add units).

• The peak processor rate is therefore 4 GFLOPS.

• The memory latency is 100 ns and the block size is 1 word.

• Every time a memory request is made, the processor must wait 100 cycles for the data.
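
A back-of-the-envelope sketch of this example, assuming (as a simplification)
that each floating-point operation needs one word fetched from DRAM and that
requests are not overlapped:

/* Hedged sketch: how a 100 ns memory latency caps the achievable rate
 * when every operation waits for one word from DRAM. */
#include <stdio.h>

int main(void)
{
    double clock_ghz      = 1.0;     /* 1 GHz -> 1 ns cycle             */
    double peak_gflops    = 4.0;     /* 2 multiply-add units, 4 ops/cycle */
    double latency_ns     = 100.0;   /* DRAM latency per 1-word block   */
    double flops_per_word = 1.0;     /* assumed: one flop per fetched word */

    /* One word arrives every 100 ns, so at most 1 flop per 100 ns. */
    double achieved_gflops = flops_per_word / latency_ns;   /* flops per ns */

    printf("peak: %.1f GFLOPS, memory-bound: %.2f GFLOPS (%.0f MFLOPS)\n",
           peak_gflops, achieved_gflops, achieved_gflops * 1000.0);
    printf("the processor waits %.0f cycles per request at %.0f GHz\n",
           latency_ns * clock_ghz, clock_ghz);
    return 0;
}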
Basic concepts

Ts = Θ(n), Tp = Θ(log n)

Speedup = Ts / Tp, and efficiency = speedup / p.

Ideal case: if speedup = p, then efficiency = 1.

Practical case: speedup < p, so efficiency lies between 0 and 1.
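
A minimal sketch putting these together for the n-number addition, using
Ts ≈ n - 1 serial adds and Tp ≈ log2(n) parallel steps as rough cost models
(the specific n values are illustrative):

/* Hedged sketch: speedup and efficiency for adding n numbers with p = n
 * processing elements, under the rough cost models above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (int n = 16; n <= 1024; n *= 4) {
        double ts = n - 1;              /* serial adds            */
        double tp = log2((double)n);    /* parallel steps         */
        double speedup    = ts / tp;
        double efficiency = speedup / n;   /* p = n processing elements */
        printf("n = %4d: speedup = %6.1f, efficiency = %.2f\n",
               n, speedup, efficiency);
    }
    return 0;
}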
Amdahl's Law

se = F, the fraction of the calculation that is serial

pe = 1 - F, the fraction that is parallel

se + pe = F + (1 - F) = 1
1. FLYNN'S CLASSIFICATION

• Flynn's taxonomy distinguishes
multi-processor computer
architectures according to how
they can be classified along the
two independent dimensions of
Instruction Stream and Data
Stream.
• Each of these dimensions can
have only one of two possible
states: Single or Multiple.
• The matrix below defines the 4
possible classifications according
to Flynn:

                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD
Single-instruction, single-data
(SISD) systems:
• An SISD computing system is a uniprocessor
machine which is capable of executing a single
instruction, operating on a single data stream.
• In SISD, machine instructions are processed in a
sequential manner and computers adopting this
model are popularly called sequential computers.
• All the instructions and data to be processed have
to be stored in primary memory.
• The speed of the processing element in the SISD
model is limited by (dependent on) the rate at which
the computer can transfer information internally.
• Dominant representative SISD systems are the IBM PC
and workstations.
Single-instruction,
multiple-data (SIMD)
systems:
• An SIMD system is a multiprocessor
machine capable of executing the same
instruction on all the CPUs but
operating on different data streams.
• Machines based on an SIMD model are
well suited to scientific computing since
they involve lots of vector and matrix
operations.
• So that the information can be passed
to all the processing elements (PEs),
the data elements of a vector can be
divided into multiple sets (N sets for
an N-PE system), and each PE can
process one data set.
Multiple-instruction,
single-data (MISD) systems
• An MISD computing system is a
multiprocessor machine capable of
executing different instructions on
different PEs but all of them operating
on the same dataset.
• Example Z = sin(x)+cos(x)+tan(x)
The system performs different
operations on the same data set.
• Machines built using the MISD model
are not useful in most applications; a
few machines have been built, but none
of them are available commercially.
Multiple-instruction,
multiple-data (MIMD)
systems:
• An MIMD system is a multiprocessor
machine which is capable of executing
multiple instructions on multiple data
sets.
• Each PE in the MIMD model has
separate instruction and data streams;
therefore machines built using this
model are capable of handling any kind
of application.
• Unlike SIMD and MISD machines, PEs in
MIMD machines work asynchronously.
Single Program Multiple Data
(SPMD)
• SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
• SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously.
This program can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• SPMD programs usually have the necessary logic programmed into them to allow
different tasks to branch or conditionally execute only those parts of the program they are
designed to execute. That is, tasks do not necessarily have to execute the entire program -
perhaps only a portion of it.
• The SPMD model, using message passing or hybrid programming, is probably the most
commonly used parallel programming model for multi-node clusters.
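
A minimal SPMD sketch using MPI: every process runs the same program and
branches on its rank, so different tasks execute only the parts they are
designed to execute (the printed messages are illustrative):

/* Hedged sketch of the SPMD idea with MPI: one program, many processes,
 * each branching on its own rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* only the "master" task runs this branch */
        printf("rank 0: coordinating / doing I/O\n");
    } else {
        /* all other tasks run this branch on their own data */
        printf("rank %d: working on my portion of the data\n", rank);
    }

    MPI_Finalize();
    return 0;
}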
Multiple Program Multiple Data
(MPMD)
• Like SPMD, MPMD is actually a "high level" programming model that can be
built upon any combination of the previously mentioned parallel programming
models.
• MULTIPLE PROGRAM: Tasks may execute different programs simultaneously.
The programs can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• MPMD applications are not as common as SPMD applications, but may be
better suited for certain types of problems, particularly those that lend
themselves better to functional decomposition than domain decomposition
(discussed later under Partitioning).
