
High Performance

Computing
(DJ19DSC802)
Basics of Parallelization:
• Data parallelism
• Functional parallelism
• Parallel scalability
• Factors that limit parallel execution
• Scalability metrics
• Refined performance model
• Load imbalance
Technical Challenges:
• Quantum Tunneling:
A transistor smaller than about 5 nm cannot reliably stop the flow of electrons because
electrons tunnel through its depletion region. Due to tunneling, the electrons do not
"see" the depletion region and pass through it as if it did not exist, and a
transistor that cannot stop the flow of electrons is useless as a switch.
• Size of Atom: We are slowly approaching the size of an atom itself, and you
cannot build a transistor smaller than an atom. A silicon atom is only a fraction of a
nanometre across, and today's transistor gates are only a few tens of atoms wide. In a
few years, even ignoring quantum effects, we will not be able to go any smaller,
since we are reaching the physical limit of how small something can be.
• Heating and Current Effects: As we go smaller, transistors tend to get more “leaky”,
meaning that even in their OFF state, they let some current pass through. This is called
the leakage current.

• https://medium.com/@csoham358/beginners-guide-to-moore-s-law-3e00dd8b5057
Dennard Scaling
• Power = alpha * C * F * V²
• alpha = fraction of time the transistors are switching (activity factor)
• C = capacitance
• F = frequency
• V = voltage
• Capacitance is related to area
• So, as the size of the transistors shrank and the voltage was reduced,
circuits could operate at higher frequencies at the same power.
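
A minimal sketch (illustrative numbers, not taken from the slides) of how the
Power = alpha * C * F * V² relation behaves under one ideal Dennard scaling step,
where C and V shrink by a factor k while F grows by k:

/* Hedged sketch: dynamic power under ideal Dennard scaling.
 * P = alpha * C * F * V^2.  alpha, C, F, V and the scaling factor k
 * below are illustrative assumptions, not measured values. */
#include <stdio.h>

static double dynamic_power(double alpha, double C, double F, double V)
{
    return alpha * C * F * V * V;
}

int main(void)
{
    double alpha = 0.1;    /* fraction of time the gates switch          */
    double C = 1e-15;      /* capacitance per transistor (farads)        */
    double F = 2e9;        /* clock frequency (Hz)                       */
    double V = 1.2;        /* supply voltage (volts)                     */
    double k = 1.4;        /* one ideal scaling step (~0.7x dimensions)  */

    double p_old = dynamic_power(alpha, C, F, V);
    /* Ideal Dennard step: C and V shrink by k, F grows by k.            */
    double p_new = dynamic_power(alpha, C / k, F * k, V / k);

    printf("power per transistor: before %.3e W, after %.3e W (ratio %.2f)\n",
           p_old, p_new, p_new / p_old);
    /* Power per transistor falls by 1/k^2 while density grows by k^2,
     * so power per unit area stays constant.  Once V and leakage stop
     * scaling, this balance breaks and the "power wall" appears.        */
    return 0;
}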
End of Dennard Scaling
• Dennard scaling ignored the "leakage current" and "threshold
voltage", which establish a baseline of power per transistor.
• As transistors get smaller, power density increases because these
don't scale with size.
• Together, these effects created a "Power Wall" that has limited practical
processor frequency to around 4 GHz since 2006.
Memory Latency and Bandwidth
Latency refers to the delay between a request for data from
the CPU and when that data is actually available to be used.
It is typically measured in nanoseconds (ns).
Bandwidth refers to the rate at which data can be
transferred between the memory and the CPU, usually
measured in megabytes per second (MB/s) or gigabytes per
second (GB/s).

• Latency: how fast the memory responds to a single request.
• Bandwidth: how much data the memory can transfer over time.
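
A common first-order way to combine the two quantities is
time ≈ latency + bytes / bandwidth. The sketch below, with assumed illustrative
numbers, shows why latency dominates for small transfers:

/* Hedged sketch: first-order memory transfer-time model,
 * time = latency + bytes / bandwidth.  The numbers are illustrative. */
#include <stdio.h>

int main(void)
{
    double latency_ns   = 100.0;   /* time until the first word arrives (ns) */
    double bandwidth_gb = 10.0;    /* sustained bandwidth (GB/s)              */
    double bytes        = 64.0;    /* one cache line                          */

    /* 1 GB/s equals 1 byte/ns, so bytes / bandwidth_gb is already in ns. */
    double transfer_ns = latency_ns + bytes / bandwidth_gb;

    printf("fetching %.0f bytes takes about %.1f ns "
           "(%.1f ns latency + %.1f ns transfer)\n",
           bytes, transfer_ns, latency_ns, bytes / bandwidth_gb);
    return 0;
}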
Potential Benefits, Limits and Costs of Parallel
Programming
Amdahl's Law

• Amdahl's Law states that potential program speedup is determined
by the fraction of code (P) that can be parallelized:

speedup = 1 / (1 - P)

• If none of the code can be parallelized, P = 0 and the speedup
is 1 (no speedup).
• If all of the code is parallelized, P = 1 and the speedup is
infinite (in theory).
• If 50% of the code can be parallelized, the maximum speedup is 2,
meaning the code will run at most twice as fast.
Potential Benefits, Limits and Costs of Parallel Programming
Amdahl's Law
• Introducing the number of processors performing the parallel fraction of work,
the relationship can be modeled by:

speedup = 1 / (P/N + S)

• where P = parallel fraction, N = number of processors and S = serial fraction.

• It soon becomes obvious that there are limits to the scalability of parallelism;
the sketch below tabulates the speedup for a few values of P and N.
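
A minimal sketch of the formula above; the parallel fractions and processor counts
are illustrative choices, not values given on the slides:

/* Hedged sketch: evaluating Amdahl's Law, speedup = 1 / (P/N + S),
 * for a few parallel fractions P and processor counts N. */
#include <stdio.h>

static double amdahl(double P, int N)
{
    double S = 1.0 - P;            /* serial fraction */
    return 1.0 / (P / N + S);
}

int main(void)
{
    double fractions[] = { 0.50, 0.90, 0.95, 0.99 };
    int    procs[]     = { 10, 100, 1000, 10000 };

    printf("      N   P=0.50   P=0.90   P=0.95   P=0.99\n");
    for (int i = 0; i < 4; i++) {
        printf("%7d", procs[i]);
        for (int j = 0; j < 4; j++)
            printf(" %8.2f", amdahl(fractions[j], procs[i]));
        printf("\n");
    }
    /* Even with 99% parallel code the speedup saturates near 100:
     * the serial fraction S dominates as N grows. */
    return 0;
}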
Parallel Programming Platforms
1. Data-level parallelism
- Partition the data used in solving the problem among the cores.
- The same operation is performed on the various pieces of data.
2. Instruction/task-level parallelism
- Partition the various tasks carried out in solving the problem among the cores.
- The data is the same; the operations performed on it are different.
Parallel Programming Platforms
- Ex: Data-level parallelism
Suppose that we need to compute n values and add them
together. We know that this can be done with the following
serial code:

sum = 0;
for (i = 0; i < n; i++) {
   x = compute_next_value(...);
   sum += x;
}

Now suppose we also have p cores and p is much smaller than n
(p << n). Then each core can form a partial sum of approximately
n/p values:

my_sum = 0;
my_first_i = ...;
my_last_i = ...;
for (my_i = my_first_i; my_i < my_last_i; my_i++) {
   my_x = compute_next_value(...);
   my_sum += my_x;
}

The my_ prefix indicates that each core is using its own, private
variables, and each core can execute this block of code
independently of the other cores.
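
The "..." placeholders above stand for whatever partitioning scheme is used.
As a hypothetical illustration (the name block_range and the even block split
are assumptions, not from the slides), one common choice gives each core a
contiguous block of roughly n/p indices:

/* Hypothetical sketch of one common block partitioning behind the "..."
 * placeholders above: core my_rank (0..p-1) gets a contiguous block of
 * roughly n/p indices, and the first n % p cores get one extra element. */
#include <stdio.h>

static void block_range(int n, int p, int my_rank, int *first, int *last)
{
    int block = n / p;
    int rem   = n % p;

    if (my_rank < rem) {
        *first = my_rank * (block + 1);
        *last  = *first + block + 1;
    } else {
        *first = rem * (block + 1) + (my_rank - rem) * block;
        *last  = *first + block;
    }
}

int main(void)
{
    int n = 24, p = 8;   /* matches the 8-core, 24-value example that follows */

    for (int my_rank = 0; my_rank < p; my_rank++) {
        int my_first_i, my_last_i;
        block_range(n, p, my_rank, &my_first_i, &my_last_i);
        /* each core would then loop:
         * for (my_i = my_first_i; my_i < my_last_i; my_i++) { ... } */
        printf("core %d handles i = %d .. %d\n",
               my_rank, my_first_i, my_last_i - 1);
    }
    return 0;
}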
Parallel Programming Platforms
- Ex: Data-level parallelism
Compute 'n' values and add them together on p cores (p << n)

After each core completes execution of this code, its variable my_sum will store the sum
of the values computed by its calls to compute_next_value(). For example, if there are
eight cores, n = 24, and the 24 calls to compute_next_value() return the values

1,4,3, 9,2,8, 5,1,1, 6,2,7, 2,5,0, 4,1,8, 6,5,1, 2,3,9

then the values stored in my_sum are:

Core:    0   1   2   3   4   5   6   7
my_sum:  8  19   7  15   7  13  12  14

When the cores are done computing their values of my_sum, the master
core computes the global sum.
Parallel Programming Platforms
- Ex: Data-level parallelism
Compute 'n' values and add them together on p cores (p << n)

if (I'm the master core) {
   sum = my_sum;
   for each core other than myself {
      receive value from core;
      sum += value;
   }
} else {
   send my_sum to the master core;
}

If core 0 is the master core:
sum = 8 + 19 + 7 + 15 + 7 + 13 + 12 + 14 = 95
Parallel Programming Platforms
- Ex: Instruction/task-level parallelism (when the number of cores is large)

With 1000 cores:
data parallelism requires 999 receives and adds,
while task parallelism requires only 10.
Basic concepts
• Adding 'n' numbers using 'n' processing elements.

If 'n' is a power of 2, these operations can be performed in log2(n) steps,

i.e. if n = 16, log2(16) = 4, since 2^4 = 16.
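
A minimal sketch of this pairwise (tree-structured) summation for n = 16.
The array values are illustrative; the adds within each step are the ones
that could run concurrently on separate processing elements:

/* Hedged sketch: pairwise (tree-structured) summation of n = 2^k values.
 * Each outer iteration halves the number of active partial sums, so the
 * result is ready after log2(n) steps. */
#include <stdio.h>

int main(void)
{
    double a[16] = { 1, 2, 3, 4, 5, 6, 7, 8,
                     9, 10, 11, 12, 13, 14, 15, 16 };
    int n = 16;
    int steps = 0;

    for (int stride = 1; stride < n; stride *= 2) {   /* log2(n) steps        */
        for (int i = 0; i < n; i += 2 * stride)       /* independent adds     */
            a[i] += a[i + stride];
        steps++;
    }
    printf("sum = %.0f computed in %d steps\n", a[0], steps);  /* 136 in 4 */
    return 0;
}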
Basic concepts
• Adding 'n' numbers using 'n' processing elements.

The problem can be solved in

Θ(n) time on a single processor (Ts), and
Θ(log n) time on multiple processors (Tp).

Ts = Θ(n), Tp = Θ(log n)

So what is the speedup?

Speedup = Ts / Tp = Θ(n / log n)
Parallel Programming Platforms

There are two main types of parallel systems:
shared-memory systems and distributed-memory systems.

• In a shared-memory system, the cores can share/access the
computer's memory.

• In a distributed-memory system, each core has its own, private memory,
and the cores must communicate explicitly by doing something like sending
messages across a network.
Parallel Programming Platforms
There are two main types of parallel systems:
shared-memory systems and distributed-memory systems.

• OpenMP was designed for programming shared-memory systems.
It provides mechanisms for accessing shared-memory locations and is
a high-level extension to C. For example, it can "parallelize" our 'for'
loop, as sketched below.

• MPI was designed for programming distributed-memory systems. It
provides mechanisms for sending messages.
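
As a hedged illustration of OpenMP "parallelizing" the earlier for loop:
compute_next_value() is a placeholder defined here only so the sketch compiles,
and the build flag (e.g. gcc -fopenmp) depends on your compiler.

/* Hedged sketch: the earlier serial sum parallelized with an OpenMP
 * reduction.  compute_next_value() is an assumed stand-in workload. */
#include <stdio.h>
#include <omp.h>

static double compute_next_value(int i)
{
    return (double)(i % 10);      /* placeholder workload */
}

int main(void)
{
    const int n = 1000000;
    double sum = 0.0;

    #pragma omp parallel for reduction(+:sum)
    for (int i = 0; i < n; i++)
        sum += compute_next_value(i);

    printf("sum = %f (threads available: %d)\n", sum, omp_get_max_threads());
    return 0;
}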
What we will be doing…
Learning to write programs that are explicitly parallel,

• on parallel computers, using the C language and extensions to C:
the Message-Passing Interface (MPI) and OpenMP.

• MPI is a library of type definitions, functions, and macros that can be
used in C programs (a small MPI sketch follows this list).

• OpenMP consists of pragmas and some modifications to the C
compiler.

• CUDA programming on graphics processors/cards.
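
And, as a hedged counterpart for distributed memory, the same global sum written
with MPI. The cyclic partition and the placeholder compute_next_value() are
assumptions for illustration; build and run with your MPI tooling (e.g. mpicc, mpiexec).

/* Hedged sketch: the global sum on a distributed-memory system.  Each
 * process computes a partial sum over its own indices and MPI_Reduce
 * combines the partial sums on rank 0. */
#include <stdio.h>
#include <mpi.h>

static double compute_next_value(int i)
{
    return (double)(i % 10);      /* placeholder workload */
}

int main(int argc, char *argv[])
{
    const int n = 1000000;
    int rank, size;
    double my_sum = 0.0, sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* Simple cyclic partition: process 'rank' handles i = rank, rank+size, ... */
    for (int i = rank; i < n; i += size)
        my_sum += compute_next_value(i);

    /* The library combines the partial sums (typically tree-structured). */
    MPI_Reduce(&my_sum, &sum, 1, MPI_DOUBLE, MPI_SUM, 0, MPI_COMM_WORLD);

    if (rank == 0)
        printf("sum = %f (computed by %d processes)\n", sum, size);

    MPI_Finalize();
    return 0;
}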


Parallel Hardware…

(a) Shared Memory System (b) Distributed Memory System (c) GPU Architecture
Limitation of Memory System Performance
• The performance of a program relies on
• the speed of the processor, and
• the speed of the memory system (which feeds data to the processor).

• A memory system includes the (L1, L2, L3) caches.

• Latency and bandwidth determine memory system performance.
Limitation of Memory System Performance

Effect of memory latency on performance:

• Consider a processor with a 1 GHz (1 ns) clock connected to DRAM with a latency
of 100 ns.

• The processor can execute 4 instructions per 1 ns clock cycle
(assume the processor has 2 multiply-add units).

• The peak processor rate is therefore 4 GFLOPS.

• The memory latency is 100 ns and the block size is 1 word.

• Every time a memory request is made, the processor must wait 100 cycles for the data.
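
A back-of-the-envelope sketch of this example, assuming (as a simplification)
that each floating-point operation needs one word fetched from DRAM and that
requests are not overlapped:

/* Hedged sketch: how a 100 ns memory latency caps the achievable rate
 * when every operation waits for one word from DRAM. */
#include <stdio.h>

int main(void)
{
    double clock_ghz      = 1.0;     /* 1 GHz -> 1 ns cycle             */
    double peak_gflops    = 4.0;     /* 2 multiply-add units, 4 ops/cycle */
    double latency_ns     = 100.0;   /* DRAM latency per 1-word block   */
    double flops_per_word = 1.0;     /* assumed: one flop per fetched word */

    /* One word arrives every 100 ns, so at most 1 flop per 100 ns. */
    double achieved_gflops = flops_per_word / latency_ns;   /* flops per ns */

    printf("peak: %.1f GFLOPS, memory-bound: %.2f GFLOPS (%.0f MFLOPS)\n",
           peak_gflops, achieved_gflops, achieved_gflops * 1000.0);
    printf("the processor waits %.0f cycles per request at %.0f GHz\n",
           latency_ns * clock_ghz, clock_ghz);
    return 0;
}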
Basic concepts

Ts = Θ(n), Tp = Θ(log n)

Speedup = Ts / Tp, and efficiency = speedup / p.

Ideal case: if speedup = p, then efficiency = 1.

Practical case: speedup < p, so efficiency lies between 0 and 1.
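
A minimal sketch putting these together for the n-number addition, using
Ts ≈ n - 1 serial adds and Tp ≈ log2(n) parallel steps as rough cost models
(the specific n values are illustrative):

/* Hedged sketch: speedup and efficiency for adding n numbers with p = n
 * processing elements, under the rough cost models above. */
#include <stdio.h>
#include <math.h>

int main(void)
{
    for (int n = 16; n <= 1024; n *= 4) {
        double ts = n - 1;              /* serial adds            */
        double tp = log2((double)n);    /* parallel steps         */
        double speedup    = ts / tp;
        double efficiency = speedup / n;   /* p = n processing elements */
        printf("n = %4d: speedup = %6.1f, efficiency = %.2f\n",
               n, speedup, efficiency);
    }
    return 0;
}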
Amdahl's Law

se = F, the fraction of the calculation that is serial

pe = 1 - F, the fraction that is parallel

se + pe = F + (1 - F) = 1
1. FLYNN'S CLASSIFICATION

• Flynn's taxonomy distinguishes
multi-processor computer
architectures according to how
they can be classified along the
two independent dimensions of
Instruction Stream and Data
Stream.
• Each of these dimensions can
have only one of two possible
states: Single or Multiple.
• The matrix below defines the 4
possible classifications according
to Flynn:

                       Single Data    Multiple Data
Single Instruction     SISD           SIMD
Multiple Instruction   MISD           MIMD
Single-instruction, single-data
(SISD) systems:
• An SISD computing system is a uniprocessor
machine which is capable of executing a single
instruction, operating on a single data stream.
• In SISD, machine instructions are processed in a
sequential manner and computers adopting this
model are popularly called sequential computers.
• All the instructions and data to be processed have
to be stored in primary memory.
• The speed of the processing element in the SISD
model is limited by (dependent on) the rate at which
the computer can transfer information internally.
• Dominant representative SISD systems are the IBM PC
and workstations.
Single-instruction,
multiple-data (SIMD)
systems:
• An SIMD system is a multiprocessor
machine capable of executing the same
instruction on all the CPUs but
operating on different data streams.
• Machines based on an SIMD model are
well suited to scientific computing since
they involve lots of vector and matrix
operations.
• So that the information can be passed
to all the processing elements (PEs),
the data elements of a vector can be
divided into multiple sets (N sets for
an N-PE system), and each PE can
process one data set.
Multiple-instruction,
single-data (MISD) systems
• An MISD computing system is a
multiprocessor machine capable of
executing different instructions on
different PEs but all of them operating
on the same dataset.
• Example Z = sin(x)+cos(x)+tan(x)
The system performs different
operations on the same data set.
• Machines built using the MISD model
are not useful in most applications; a
few machines have been built, but none
of them are available commercially.
Multiple-instruction,
multiple-data (MIMD)
systems:
• An MIMD system is a multiprocessor
machine which is capable of executing
multiple instructions on multiple data
sets.
• Each PE in the MIMD model has
separate instruction and data streams;
therefore machines built using this
model are capable of handling any kind
of application.
• Unlike SIMD and MISD machines, PEs in
MIMD machines work asynchronously.
Single Program Multiple Data
(SPMD)
• SPMD is actually a "high level" programming model that can be built upon any
combination of the previously mentioned parallel programming models.
• SINGLE PROGRAM: All tasks execute their copy of the same program simultaneously.
This program can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• SPMD programs usually have the necessary logic programmed into them to allow
different tasks to branch or conditionally execute only those parts of the program they are
designed to execute. That is, tasks do not necessarily have to execute the entire program -
perhaps only a portion of it.
• The SPMD model, using message passing or hybrid programming, is probably the most
commonly used parallel programming model for multi-node clusters.
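
A minimal SPMD sketch using MPI: every process runs the same program and
branches on its rank, so different tasks execute only the parts they are
designed to execute (the printed messages are illustrative):

/* Hedged sketch of the SPMD idea with MPI: one program, many processes,
 * each branching on its own rank. */
#include <stdio.h>
#include <mpi.h>

int main(int argc, char *argv[])
{
    int rank;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* only the "master" task runs this branch */
        printf("rank 0: coordinating / doing I/O\n");
    } else {
        /* all other tasks run this branch on their own data */
        printf("rank %d: working on my portion of the data\n", rank);
    }

    MPI_Finalize();
    return 0;
}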
Multiple Program Multiple Data
(MPMD)
• Like SPMD, MPMD is actually a "high level" programming model that can be
built upon any combination of the previously mentioned parallel programming
models.
• MULTIPLE PROGRAM: Tasks may execute different programs simultaneously.
The programs can be threads, message passing, data parallel or hybrid.
• MULTIPLE DATA: All tasks may use different data
• MPMD applications are not as common as SPMD applications, but may be
better suited for certain types of problems, particularly those that lend
themselves better to functional decomposition than domain decomposition
(discussed later under Partitioning).
