CS 408 - Parallel Processing
Professor BenBella S. Tawfik
Lecture 2
Analysis
Useful Diagram
One node per O(1) operation.
Question: why are there no cycles in this graph?
Analysis
Definitions
More Definitions
Amdahl’s Law
Now it’s time for some bad news.
In practice, your program won’t just sum all the elements in an array.
You will have a program with:
  Some parts that parallelize well
    Can turn them into a map or a reduce.
  Some parts that won’t parallelize at all
    Operations on a linked list.
    Reading a text file.
    A computation where each step needs the result of the previous steps.
Amdahl’s Law
This is BAD NEWS.
If 1/3 of our program can’t be parallelized, we can’t get a speedup better than 3, no matter how many processors we throw at our problem.
And while the first few processors make a huge difference, the benefit diminishes quickly (see the sketch below).
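To see the diminishing returns concretely, here is a minimal C sketch (not from the slides) that evaluates the speedup bound s = 1/(f + (1 – f)/p) from Fig. 1.12 later in these notes, for f = 1/3 as in the example above; the function name amdahl_speedup is purely illustrative.

#include <stdio.h>

/* Amdahl's law: s = 1 / (f + (1 - f) / p), where f is the serial
 * (unparallelizable) fraction and p is the number of processors. */
static double amdahl_speedup(double f, double p) {
    return 1.0 / (f + (1.0 - f) / p);
}

int main(void) {
    const double f = 1.0 / 3.0;  /* one third of the program is serial */
    for (int p = 1; p <= 1024; p *= 2)
        printf("p = %4d  speedup = %.3f\n", p, amdahl_speedup(f, p));
    /* The printed values climb quickly at first, then flatten out and
     * never exceed 1/f = 3, no matter how large p gets. */
    return 0;
}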
Amdahl’s Law and Moore’s Law
Amdahl’s Law: Moving Forward
Unparallelized code becomes a bottleneck quickly.
What do we do? Design smarter algorithms!
Consider the following problem:
Given an array of numbers, return an array with the “running sum”.
Example: input 3 7 6 2 4  →  output 3 10 16 18 22
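For reference, a minimal sequential C sketch of the running-sum (prefix-sum) problem that reproduces the example above. The loop looks inherently serial (each step needs the previous one), which is exactly why a smarter parallel-scan algorithm is needed; the parallel version is not shown here, and the function name running_sum is made up for illustration.

#include <stdio.h>

/* Running (prefix) sum: out[i] = in[0] + in[1] + ... + in[i]. */
static void running_sum(const int *in, int *out, int n) {
    int sum = 0;
    for (int i = 0; i < n; i++) {
        sum += in[i];
        out[i] = sum;
    }
}

int main(void) {
    int in[] = {3, 7, 6, 2, 4};
    int out[5];
    running_sum(in, out, 5);
    for (int i = 0; i < 5; i++) printf("%d ", out[i]);  /* 3 10 16 18 22 */
    printf("\n");
    return 0;
}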
1.3 Parallel Processing Ups and Downs
Fig. 1.10 Richardson’s circular theater for weather forecasting calculations: thousands of “computers” (humans with calculators), directed by a conductor, producing a 24-hr weather prediction in a few hours.
1960s: ILLIAC IV (U Illinois) – four 8×8 mesh quadrants, SIMD
1980s: Commercial interest – technology was driven by government grants & contracts. Once funding dried up, many companies went bankrupt.
2000s: Internet revolution – info providers, multimedia, data mining, etc. need lots of power
2020s: Cloud, big-data, AI/ML
Trends in High-Technology Development
[Figure: development of Graphics, Networking, RISC, and Parallelism into $1B businesses over 1960–2000, each passing through government research (GovRes), industrial research (IndRes), and industrial development (IndDev) stages, with transfer of ideas/people between stages.]
Evolution of parallel processing has been quite different from other high-tech fields.
Development of some technical fields into $1B businesses and the roles played by government research and industrial R&D over time (IEEE Computer, early 90s?).
Trends in Hi-Tech Development (2003): the same chart, with the timeline extended to 2010–2020.
Status of Computing Power (circa 2000)
TFLOPS → PFLOPS (Peta = 10^15)
GFLOPS on desktop: Apple Macintosh, with G4 processor
PFLOPS → EFLOPS (Exa = 10^18)
TFLOPS in supercomputer center:
  1152-processor IBM RS/6000 SP (switch-based network)
  Cray T3E, torus-connected
EFLOPS → ZFLOPS (Zeta = 10^21)
PFLOPS on the drawing board:
  1M-processor IBM Blue Gene (2005?)
  32 proc’s/chip, 64 chips/board, 8 boards/tower, 64 towers
  Processor: 8 threads, on-chip memory, no data cache
  Chip: defect-tolerant, row/column rings in a 6×6 array
  Board: 8×8 chip grid organized as a 4×4×4 cube
  Tower: boards linked to 4 neighbors in adjacent towers
  System: 32×32×32 cube of chips, 1.5 MW (water-cooled)
Flynn Taxonomy
Flynn (1966):
                        Single Data   Multiple Data
  Single Instruction    SISD          SIMD
  Multiple Instruction  MISD          MIMD
• MISD
  • Fault tolerance
  • Pipeline processing/streaming or systolic array
• Now extended to SPMD (single program, multiple data)
Memory Taxonomy
For shared memory:
                      Uniform Memory  Nonuniform Memory
  Cache Coherence     CC-UMA          CC-NUMA
  No Cache Coherence  NCC-UMA         NCC-NUMA
• NUMA wins out for practical implementation
• Cache coherence favors programmer
  • Common in general-purpose systems
• NCC widespread in scalable systems
  • CC overhead is too high, not always necessary
Flynn Taxonomy (1966) Recap
                              Data Stream
                     Single                     Multi
Instruction  Single  SISD                       SIMD
Stream               (Single-Core Processors)   (GPUs, Intel SSE/AVX extensions, …)
             Multi   MISD                       MIMD
                     (Systolic Arrays, …)       (VLIW, Parallel Computers)
Flynn Taxonomy Recap
Single-Instruction Single-Data: Single-Core Processors
Single-Instruction Multi-Data: GPUs, Intel SIMD Extensions
Multi-Instruction Single-Data: Systolic Arrays, …
Multi-Instruction Multi-Data: Parallel Computers
Intel SIMD Extensions
New instructions, new registers
Introduced in phases/groups of functionality:
  SSE – SSE4 (1999–2006): 128-bit-wide operations
  AVX, FMA, AVX2, AVX-512 (2008–2015): 256- to 512-bit-wide operations
  AVX-512 chips not available yet (as of Spring 2019), but soon!
  F16C, and more to come?
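As a quick illustration (not part of the original slides), a minimal AVX sketch in C that adds 8 floats in a single 256-bit operation; it assumes an AVX-capable CPU and a compiler flag such as gcc -mavx, and the array values are arbitrary.

#include <immintrin.h>  /* AVX intrinsics */
#include <stdio.h>

int main(void) {
    /* Two vectors of 8 single-precision floats each (one 256-bit YMM register per vector). */
    float a[8] = {1, 2, 3, 4, 5, 6, 7, 8};
    float b[8] = {10, 20, 30, 40, 50, 60, 70, 80};
    float c[8];

    __m256 va = _mm256_loadu_ps(a);     /* load 8 floats (unaligned) */
    __m256 vb = _mm256_loadu_ps(b);
    __m256 vc = _mm256_add_ps(va, vb);  /* 8 additions in one instruction */
    _mm256_storeu_ps(c, vc);

    for (int i = 0; i < 8; i++) printf("%.0f ", c[i]);
    printf("\n");
    return 0;
}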
Intel SIMD Registers (AVX-512)
XMM0 – XMM15: 128-bit registers (SSE)
YMM0 – YMM15: 256-bit registers (AVX, AVX2)
ZMM0 – ZMM31: 512-bit registers (AVX-512)
Each XMMn register is the low 128 bits of YMMn, which is in turn the low 256 bits of ZMMn; AVX-512 also widens the register file from 16 to 32 registers.
SSE/AVX Data Types
A 256-bit YMM register (bits 255–0) can hold:
  8 × float
  4 × double
  8 × int32
  16 × int16
  32 × int8
Operation on 32 8-bit values in one instruction!
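A similar hedged sketch for the narrowest lanes: assuming an AVX2-capable CPU (compile with something like gcc -mavx2), _mm256_add_epi8 performs the 32 byte-wide additions in one instruction, as the slide claims. The array values are arbitrary.

#include <immintrin.h>  /* AVX2 integer intrinsics */
#include <stdio.h>

int main(void) {
    /* 32 unsigned 8-bit values per 256-bit YMM register. */
    unsigned char a[32], b[32], c[32];
    for (int i = 0; i < 32; i++) { a[i] = (unsigned char)i; b[i] = 100; }

    __m256i va = _mm256_loadu_si256((const __m256i *)a);
    __m256i vb = _mm256_loadu_si256((const __m256i *)b);
    __m256i vc = _mm256_add_epi8(va, vb);   /* 32 byte-wide additions at once */
    _mm256_storeu_si256((__m256i *)c, vc);

    for (int i = 0; i < 32; i++) printf("%d ", c[i]);   /* 100 101 ... 131 */
    printf("\n");
    return 0;
}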
Sandy Bridge Microarchitecture
e.g., “Port 5 pressure” when code uses too many shuffle operations
Skylake Die Layout
Aside: Do I Have SIMD Capabilities?
less /proc/cpuinfo
The Global View of Computer Architecture
[Figure: Computer Architecture – instruction set design, organization, hardware/software boundary – at the center, shaped by Applications, Parallelism, History, Technology, Programming Languages, Operating Systems, Compilers, Measurement and Evaluation, and Interface Design (ISA).]
The Task of a Computer Designer
Determine what attributes are important for a new machine.
Design a machine to maximize performance while staying within cost constraints.
Flynn’s Taxonomy
Michael Flynn (from Stanford) made a characterization of computer systems which became known as Flynn’s Taxonomy.
[Diagram: a Computer fed by an Instructions stream and a Data stream.]
Flynn’s Taxonomy
SISD – Single Instruction Single Data Systems
[Diagram: one SISD processing unit with a single instruction stream (SI) and a single data stream (SD).]
Flynn’s Taxonomy
SIMD – Single Instruction Multiple Data Systems (“Array Processors”)
[Diagram: one instruction stream (SI) broadcast to several processing units, each with its own data stream (SD).]
Flynn’s Taxonomy
MIMD – Multiple Instructions Multiple Data Systems: “Multiprocessors”
[Diagram: several processing units, each with its own instruction stream (SI) and its own data stream (SD).]
Flynn’s Taxonomy
MISD – Multiple Instructions / Single Data Systems
Some people say “pipelining” lies here, but this is debatable.
[Diagram: several processing units with distinct instruction streams (SI) operating on a single data stream (SD).]
SISD One-Address Machine
Abbreviations:
  IP: Instruction Pointer
  MAR: Memory Address Register
  MDR: Memory Data Register
  A: Accumulator
  ALU: Arithmetic Logic Unit
  IR: Instruction Register
  OP: Opcode
  ADDR: Address
[Diagram: IP, MAR, MDR, IR (OP | ADDR), A, DECODER, and ALU connected to MEMORY.]
One-Address Format: LOAD X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← MDR
One-Address Format: ADD X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← M[MAR]
  A ← A + MDR
One-Address Format: STORE X
  MAR ← IP
  MDR ← M[MAR] || IP ← IP + 1
  IR ← MDR
  DECODER ← IR.OP
  MAR ← IR.ADDR
  MDR ← A
  M[MAR] ← MDR
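Putting the three register-transfer sequences together, here is a small C sketch (not from the slides) of the fetch-decode-execute cycle of this one-address machine; the opcode values and the separation of program memory from data memory are simplifications made purely for illustration.

#include <stdio.h>

/* Illustrative opcodes; the slides do not fix an encoding. */
enum { OP_HALT = 0, OP_LOAD = 1, OP_ADD = 2, OP_STORE = 3 };

typedef struct { int op; int addr; } Instr;   /* OP | ADDR format */

int main(void) {
    /* Program and data kept in separate arrays for clarity. */
    Instr prog[] = { {OP_LOAD, 0}, {OP_ADD, 1}, {OP_STORE, 2}, {OP_HALT, 0} };
    int data[3] = { 5, 7, 0 };   /* data[2] will receive 5 + 7 = 12 */

    int IP = 0;        /* instruction pointer             */
    int A = 0;         /* accumulator                     */
    int MAR, MDR;      /* memory address / data registers */
    Instr IR;          /* instruction register            */

    for (;;) {
        /* Fetch:  MAR <- IP;  IR <- M[MAR];  IP <- IP + 1 */
        MAR = IP; IR = prog[MAR]; IP = IP + 1;

        /* Decode and execute, following the register transfers on the slides. */
        if (IR.op == OP_HALT) break;
        MAR = IR.addr;
        if (IR.op == OP_LOAD)  { MDR = data[MAR]; A = MDR; }       /* LOAD X  */
        if (IR.op == OP_ADD)   { MDR = data[MAR]; A = A + MDR; }   /* ADD X   */
        if (IR.op == OP_STORE) { MDR = A; data[MAR] = MDR; }       /* STORE X */
    }
    printf("data[2] = %d\n", data[2]);   /* prints 12 */
    return 0;
}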
SISD Stack Machine
First stack machine: the Burroughs B5000.
[Diagram: IP, MAR, MDR, IR (OP | ADDR), DECODER, and ALU connected to MEMORY, with a stack in place of the accumulator.]
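For contrast with the accumulator machine above, a toy C sketch (my own illustration, not the B5000 instruction set) of zero-address, stack-style evaluation of 3 + 4 × 2:

#include <stdio.h>

int main(void) {
    int stack[16], sp = 0;                 /* sp points to the next free slot */

    stack[sp++] = 3;                       /* PUSH 3 */
    stack[sp++] = 4;                       /* PUSH 4 */
    stack[sp++] = 2;                       /* PUSH 2 */

    { int b = stack[--sp], a = stack[--sp]; stack[sp++] = a * b; }  /* MUL */
    { int b = stack[--sp], a = stack[--sp]; stack[sp++] = a + b; }  /* ADD */

    printf("result = %d\n", stack[sp - 1]);   /* prints 11 */
    return 0;
}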
Multiprocessor Machines (MIMD)
[Diagram: several CPUs sharing a single MEMORY.]
Flynn’s Taxonomy (figure 2.20 from Quinn)
  Single instruction stream, single data stream – SISD: uniprocessor
  Single instruction stream, multiple data streams – SIMD: processor arrays, pipelined vector processors
  Multiple instruction streams, single data stream – MISD: systolic array
  Multiple instruction streams, multiple data streams – MIMD: multiprocessors, multicomputers
2-D Mesh Network
Direct topology
Switches arranged into a 2-D lattice
Communication allowed only between neighboring switches
Variants allow wraparound connections between switches on the edge of the mesh
2-D Meshes
Evaluating 2-D Meshes
Diameter: Θ(n^(1/2))
Bisection width: Θ(n^(1/2))
Number of edges per switch: 4
Constant edge length? Yes
Binary Tree Network
Indirect topology
n = 2^d processor nodes, n – 1 switches
Evaluating Binary Tree Network
Diameter: 2 log n
Bisection width: 1
Edges / node: 3
Constant edge length? Yes/No?
Hypertree Network
Indirect topology
Shares low diameter of binary tree
Greatly improves bisection width
From the “front” it looks like a k-ary tree of height d
From the “side” it looks like an upside-down binary tree of height d
Hypertree Network
Evaluating 4-ary Hypertree
Diameter: log n
Bisection width: n / 2
Edges / node: 6
Constant edge length? No
Butterfly Network
Indirect topology
n = 2^d processor nodes connected by n(log n + 1) switching nodes
[Figure: processors 0–7 on top; switching nodes arranged in ranks 0 through 3, labeled (rank, column) from (0,0) to (3,7).]
Butterfly Network Routing
Evaluating Butterfly Network
Diameter: log n
Bisection width: n / 2
Edges per node: 4
Constant edge length? No
Hypercube
Direct topology
2 × 2 × … × 2 mesh
Number of nodes a power of 2
Node addresses 0, 1, …, 2^k – 1
Node i connected to the k nodes whose addresses differ from i in exactly one bit position (see the sketch below)
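A small C sketch (illustrative only) of that addressing rule: the k neighbors of node i are obtained by flipping each of its k address bits, i.e. i XOR (1 << b).

#include <stdio.h>

int main(void) {
    const int k = 4;    /* a 4-cube with 16 nodes, as in the addressing figure below */
    int i = 6;          /* example node, 0110 in binary */

    printf("neighbors of %d:", i);
    for (int b = 0; b < k; b++)
        printf(" %d", i ^ (1 << b));   /* flip bit b of the address */
    printf("\n");                      /* 7 4 2 14, i.e. 0111 0100 0010 1110 */
    return 0;
}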
Hypercube Addressing
1110 1111
0110 0111
1010 1011
0010 0011
1100 1101
0100 0101
1000 1001
0000 0001
Evaluating Hypercube Network
Diameter: log n
Bisection width: n / 2
Edges per node: log n
Constant edge length? No
1.4 Types of Parallelism: A Taxonomy
Flynn’s categories cross single/multiple instruction streams with single/multiple data streams; Johnson’s expansion further splits MIMD by memory organization (global vs. distributed) and communication style (shared variables vs. message passing):
  SISD: uniprocessors
  SIMD: array or vector processors
  MISD: rarely used
  MIMD: multiprocessors or multicomputers, subdivided into
    GMSV (global memory, shared variables): shared-memory multiprocessors
    GMMP (global memory, message passing): rarely used
    DMSV (distributed memory, shared variables): distributed shared memory
    DMMP (distributed memory, message passing): distributed-memory multicomputers
Fig. 1.11 The Flynn-Johnson classification of computer systems.
1.5 Roadblocks to Parallel Processing
Grosch’s law: Economy of scale applies, or power = cost^2
  No longer valid; in fact we can get more bang per buck in micros
Minsky’s conjecture: Speedup tends to be proportional to log p
  Has roots in analysis of memory bank conflicts; can be overcome
Tyranny of IC technology: Uniprocessors suffice (×10 faster / 5 yrs)
  Faster ICs make parallel machines faster too; what about ×1000?
Tyranny of vector supercomputers: Familiar programming model
  Not all computations involve vectors; parallel vector machines
Software inertia: Billions of dollars investment in software
  New programs; even uniprocessors benefit from parallelism spec
Amdahl’s law: Unparallelizable code severely limits the speedup
Amdahl’s Law
  f = fraction unaffected (not sped up)
  p = speedup of the rest
  s = 1 / [f + (1 – f)/p] ≤ min(p, 1/f)
[Plot: speedup s versus enhancement factor p (0–50), with curves for f = 0, 0.01, 0.02, 0.05, and 0.1.]
Fig. 1.12 Limit on speed-up according to Amdahl’s law.
1.6 Effectiveness of Parallel Processing
  p     Number of processors
  W(p)  Work performed by p processors
  T(p)  Execution time with p processors; T(1) = W(1), T(p) ≤ W(p)
  S(p)  Speedup = T(1) / T(p)
  E(p)  Efficiency = T(1) / [p T(p)]
  R(p)  Redundancy = W(p) / W(1)
  U(p)  Utilization = W(p) / [p T(p)]
  Q(p)  Quality = T^3(1) / [p T^2(p) W(p)]
Fig. 1.13 Task graph exhibiting limited inherent parallelism (nodes 1–13): W(1) = 13, T(1) = 13, T(∞) = 8.
Reduction or Fan-in Computation
Example: adding 16 numbers, 8 processors, unit-time additions.
With zero-time communication:
  S(8) = 15 / 4 = 3.75
  E(8) = 15 / (8 × 4) = 47%
  R(8) = 15 / 15 = 1
  Q(8) = 1.76
With unit-time communication:
  S(8) = 15 / 7 = 2.14
  E(8) = 15 / (8 × 7) = 27%
  R(8) = 22 / 15 = 1.47
  Q(8) = 0.39
Fig. 1.14 Computation graph for finding the sum of 16 numbers (a binary tree of 15 additions).
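The numbers in Fig. 1.14 can be reproduced directly from the Section 1.6 definitions; the following C sketch (the helper name report is mine) plugs in W(1) = T(1) = 15, with T(8) = 4 and W(8) = 15 for zero-time communication, and T(8) = 7 and W(8) = 22 for unit-time communication.

#include <stdio.h>

/* Effectiveness measures from Section 1.6, applied to the Fig. 1.14 example. */
static void report(const char *label, double p, double W1, double T1,
                   double Wp, double Tp) {
    double S = T1 / Tp;                              /* speedup     */
    double E = T1 / (p * Tp);                        /* efficiency  */
    double R = Wp / W1;                              /* redundancy  */
    double U = Wp / (p * Tp);                        /* utilization */
    double Q = (T1 * T1 * T1) / (p * Tp * Tp * Wp);  /* quality     */
    printf("%s: S=%.2f E=%.0f%% R=%.2f U=%.2f Q=%.2f\n",
           label, S, 100 * E, R, U, Q);
}

int main(void) {
    report("zero-time comm", 8, 15, 15, 15, 4);  /* S=3.75 E=47% R=1.00 Q=1.76 */
    report("unit-time comm", 8, 15, 15, 22, 7);  /* S=2.14 E=27% R=1.47 Q=0.39 */
    return 0;
}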
ABCs of Parallel Processing in One Slide
A  Amdahl’s Law (Speedup Formula)
   Bad news – Sequential overhead will kill you, because:
     Speedup = T1/Tp ≤ 1/[f + (1 – f)/p] ≤ min(1/f, p)
   Morale: For f = 0.1, speedup is at best 10, regardless of peak OPS.
B  Brent’s Scheduling Theorem
   Good news – Optimal scheduling is very difficult, but even a naive scheduling algorithm can ensure:
     T1/p ≤ Tp ≤ T1/p + T∞ = (T1/p)[1 + p/(T1/T∞)]
   Result: For a reasonably parallel task (large T1/T∞), or for a suitably small p (say, p ≤ T1/T∞), good speedup and efficiency are possible.
C  Cost-Effectiveness Adage
   Real news – The most cost-effective parallel solution may not be the one with the highest peak OPS (communication?), greatest speed-up (at what cost?), or best utilization (hardware busy doing what?).
   Analogy: Mass transit might be more cost-effective than private cars even if it is slower and leads to many empty seats.
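As a final illustration (not from the slides), a short C sketch of Brent’s bound T1/p ≤ Tp ≤ T1/p + T∞, using T1 = 15 and T∞ = 4 from the zero-communication reduction of Fig. 1.14 purely as example values.

#include <stdio.h>

int main(void) {
    /* Brent's scheduling bound: T1/p <= Tp <= T1/p + Tinf. */
    const double T1 = 15.0, Tinf = 4.0;
    for (int p = 1; p <= 16; p *= 2) {
        double lower = T1 / p;           /* best case: perfect load balance   */
        double upper = T1 / p + Tinf;    /* guaranteed even by naive schedule */
        printf("p = %2d:  %.2f <= Tp <= %.2f  (speedup >= %.2f)\n",
               p, lower, upper, T1 / upper);
    }
    return 0;
}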