L01 Introduction
Lecture 1
Outline
Parallel Computing
Motivation
Serial vs Parallel Computing
Serial vs Parallel Processing Units
Basic of Parallel Computing
Decompose, Schedule and Map
Performance
Challenges
Landscape of Parallel Computing
Applications
Systems
[ CS3210 AY2425S1 - L1 ]
2
Cheetah vs Hyena
Cheetah:
  Weighs 60 kg
  Fastest land animal: top speed of 120 km/h, 0-100 km/h in 3 seconds
  Hunts alone; the hunt is usually over in 1 minute
  50% chance of losing its catch to other predators
Hyena:
  Weighs 50 kg
  Top speed of 65 km/h
  Hunts in groups of 4-40
  Can take down large animals like buffalo, rhino, lion
Observations
Parallel Computing - Motivation
In the first issue of CACM (1958), Saul Gorn notes that:
“We know that the so-called parallel computers are somewhat faster
than the serial ones, and that no matter how fast we make access and
arithmetic serially, we can do correspondingly better in parallel.
However access and arithmetic speeds seem to be approaching a
definite limit ... Does this mean that digital speeds are reaching a limit,
that digital computation of multi-variate partial differential systems must
accept it, and that predicting tomorrow’s weather must require
more than one day? Not if we have truly-parallel machines, with say,
a thousand instruction registers.”
Hitting the Wall
‘70s: “Supercomputers” for Scientific Computing
Early ‘90s: Databases
Inflection point in 2004: Intel hits the Power Density Wall
How do you make your program run faster?
Answer before 2004:
Just wait 6 months and buy a new machine!
Answer after 2004:
Write parallel software
Parallel Computing - Challenges
Gorn also recognized the challenges that would be posed to
the system designer by the new parallel computers:
Serial Computing
[Figure: a problem is broken into a discrete series of instructions i1, i2, …, in, which a single processing unit executes one at a time. The instruction shown, i3: * t1 t2 a, multiplies t1 and t2 (2) and stores the result in a (10).]
Processing Unit: von Neumann Model
[Figure: Memory (holds instructions and data) connected to a Processor, which consists of a Control Unit (Instruction Counter) and an Arithmetic Logic Unit (Registers); input and output attach to the processor.]
von Neumann Computation Model
Processor:
Performs instructions
Memory:
Stores both the instructions and data in cells with unique addresses
Control Scheme:
Fetches instructions one after another from memory
Shuttles data between memory and processor
Memory wall
disparity between processor speed (< 1 nanosecond, i.e. 10^-9 s) and
memory speed (100-1,000 nanoseconds)
Ways to improve performance
1. Work harder: higher clock frequency
Parallel Computing
Simultaneous use of multiple processing units to solve a
problem fast / solve a larger problem
[Figure: the problem is split into tasks task0, task1, …, taskm, which are assigned to processing units unit0, unit1, …, unit(p-1); each task is itself a stream of instructions i1, i2, …, in.]
1. A problem is divided into m discrete parts (tasks) that can be solved concurrently
2. Each part is further broken down to a series of instructions (i)
3. Instructions from each part execute in parallel on different processing units (p)
Example: Evaluate a = (b+1) * (b-c)
i1: + b 1 t1
i2: - b c t2
i1 and i2 are independent of each other, so they can execute on different processing units at the same time.
Why Parallel Computing
Primary reasons:
1. Overcome limits of serial computing
2. Save (wall-clock) time
3. Solve larger problems
Other reasons:
1. Take advantage of non-local resources
2. Cost saving – use multiple “cheaper” (commoditized) computing
resources
3. Overcome memory constraints
Computational Model Attributes
[Operation Mechanism]
• The primitive units of computation or basic actions of the computer (data types and
operations defined by the instruction set)
[Data Mechanism]
• The definition of address spaces available to the computation (how data are
accessed and stored)
[Control Mechanism]
• The schedulable units of computation (rules for partitioning and scheduling the
problem for computation using the primitive units)
[Communication Mechanism]
• The modes and patterns of communication among processing elements working
in parallel, so that they can exchange needed information
[Synchronization Mechanism ]
• Mechanism to ensure needed information arrives at the right time
Basics of Parallel Computing
Application problem -> decompose -> tasks -> schedule & map
Decomposition
One problem can have many possible decompositions
Tasks
Tasks are coded in a parallel programming language
An illusion?
Concurrency vs. Parallelism
Concurrency:
  Two or more tasks can start, run, and complete in overlapping time periods
  They might not be running (executing on CPU) at the same instant
  Two or more execution flows make progress at the same time by interleaving their executions or by executing instructions (on CPU) at exactly the same time
Parallelism:
  Two or more tasks can run (execute) simultaneously, at the exact same time
  Tasks do not only make progress, they also actually execute simultaneously
Example – Serial Solution
Find the sum for n numbers:
sum = 0;
for (i = 0; i < n; i++) {
x = Compute_next_value(. . .);
sum += x;
}
Serial Code
Example – Parallel Solution (1/3)
Suppose we have p cores ( p < n )
Each core can perform a partial sum of n/p values
my_sum = 0;
my_start = ……;
my_end = ……;
for (my_i = my_start; my_i < my_end; my_i++){
my_x = Compute_next_value(. . .);
my_sum += my_x;
}
Each core uses its own private variables and executes this
block of code independently
Example – Parallel Solution (2/3)
Upon completion, each core contains:
the partial sum in my_sum
Core 0 1 2 3 4 5 6 7
my_sum 8 19 7 15 7 13 12 14
Example – Parallel Solution (3/3)
Next, a core designated as the master core adds up all values
of my_sum to form the global sum (8 + 19 + 7 + 15 + 7 + 13 +
12 + 14 = 95)
if (I'm the master core) {
    sum = my_sum;
    for each core other than myself {
        receive value from that core;
        sum += value;
    }
} else { // not the master core
    send my_sum to the master;
}
Parallel Code
Core    0           1   2   3   4   5   6   7
my_sum  95 (was 8)  19  7   15  7   13  12  14
Better Parallel Algorithm
Share the work of the global summing too:
Cores:   0    1    2    3    4    5    6    7
values:  8   19    7   15    7   13   12   14
step 1:  8+19=27   7+15=22   7+13=20   12+14=26
step 2:  27+22=49            20+26=46
step 3:  49+46=95
(time flows downward)
Comparison: Two Parallel Solutions
              8 cores    1000 cores
1st version   7 steps    999 steps
2nd version   3 steps    10 steps
Parallel Performance
Two perspectives:
Execution time versus Throughput
Parallelization overheads:
distribution of work (tasks) to processors
information exchange or synchronization
idle time
etc
Parallel Computing is HARD
“…when we start talking about parallelism and ease of use of truly parallel
computers, we're talking about a problem that's as hard as any that computer
science has faced. …
I would be panicked if I were in industry.”
John Hennessy, President, Stanford University, 2007
Compiler Challenges
Heterogeneous processors
Increase in the design space for code optimization
Auto-tuners:
Optimizing code at runtime
Parallel Programming Challenges
Finding enough parallelism (Amdahl's Law!)
Granularity
Locality
Load balance
Debugging
Uses of Parallel Computing
Data mining, web search engines, web-based business
services, pharmaceutical design, financial and economic
modeling, AI, etc.
http://computing.llnl.gov/tutorials/parallel_comp
Types of Computer Systems
Fastest Supercomputer
2022-24 Fastest Supercomputer
Supercomputer Frontier
1,206 petaflops
8,699,904 cores
AMD Optimized 3rd Gen EPYC 64C 2GHz
22 Mwatts
No. 7* in GREEN500
2020-21 Fastest Supercomputer
Supercomputer
Fugaku
415 petaflops
~7,300,000 cores (ARM A64FX)
29 Mwatts
No. 33 in GREEN500
Usage of Frontier
Energy:
  Fusion energy studies
  Combustion energy improvements
  Innovative clean energy systems
Materials study
Social and scientific issues:
  Innovative drug discovery
  Personalized and preventive medicine
  Disaster prevention and global climate problems
  Meteorological and global environmental predictions
  Evolution of the universe
Deconstructing Frontier
Node:
1 HPC and AI Optimized AMD EPYC CPU
4 Purpose Built AMD Radeon Instinct GPU
74 cabinets:
Over 9000 nodes
Interconnect:
Multiple Slingshot NICs
Slingshot dragonfly network
Deconstructing Fugaku
Processor (CPU):
• Fujitsu’s 48-core A64FX SoC (optional assistant cores)
TOP500 Performance over Time
source: top500.org
Summary
Serial versus parallel computing
Evolution of Prominent Computers
year location name first
1834 Cambridge Difference engine Programmable computer
1943 Bletchley Colossus Electronic computer
1948 Manchester SSEM (Baby) Stored-program computer
1951 MIT Whirlwind 1 Real-time I/O computer
1953 Manchester Transistor computer Transistorized computer
1971 California Intel 4004 Mass-market CPU & IC
1979 Cambridge Sinclair ZX-79 Mass-market home computer
1981 New York IBM PC Personal computer
1987 Cambridge Acorn A400 series High-street RISC PC sales
1990 New York IBM RS6000 Superscalar RISC processor
1998 California Sun picoJava Computer based on a language
Progression of Computation Speed
year Floating point operations per second (FLOPS)
1941 1
1945 100
1949 1,000 (1 KiloFLOPS, kFLOPS)
1951 10,000
1961 100,000
1964 1,000,000 (1 MegaFLOPS, MFLOPS)
1968 10,000,000
1975 100,000,000
1987 1,000,000,000 (1 GigaFLOPS, GFLOPS)
1992 10,000,000,000
1993 100,000,000,000
1997 1,000,000,000,000 (1 TeraFLOPS, TFLOPS)
2000 10,000,000,000,000
2007 478,000,000,000,000 (478 TFLOPS)
2009 1,100,000,000,000,000 (1.1 PetaFLOPS)
Computing Units
International System of Units (SI):
name   prefix  multiplier
exa    E       10^18
peta   P       10^15
tera   T       10^12
giga   G       10^9
mega   M       10^6
kilo   k       10^3
milli  m       10^-3
micro  µ       10^-6
nano   n       10^-9
pico   p       10^-12

International Electrotechnical Commission (IEC):
name   prefix  multiplier
exbi   Ei      2^60
pebi   Pi      2^50
tebi   Ti      2^40
gibi   Gi      2^30
mebi   Mi      2^20
kibi   Ki      2^10

Example of disk capacity:
20 Mibytes = 20 MiB = 20 mebibytes
= 20 x 2^20 = 20,971,520 bytes
Reference & Readings
Fugaku technical details:
https://www.r-ccs.riken.jp/en/fugaku/
Reading
Robert Robey, Yuliana Zamora, "Parallel and High-Performance Computing", Manning Publications, Chapter 1.
A View of the Parallel Computing Landscape, CACM Vol. 52 No. 10,
pp. 56-67, October 2009. [Technical Report: The Landscape of
Parallel Computing Research: A View from Berkeley, Dec 2006]