Parallel Processing
Current single-microprocessor performance is roughly 3-5 GFLOPS.
1 GFLOP = 10^9 FLOPS; 1 TeraFLOP = 1000 GFLOPS = 10^12 FLOPS; 1 PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced
high performance computing (HPC) techniques:
– Market is much smaller relative to commercial (desktop/server)
segment.
– Dominated by costly vector machines starting in the 1970s through
the 1980s.
– Microprocessors have made huge gains in the performance needed
for such applications:
• High clock rates. (Bad: Higher CPI?)
• Multiple pipelined floating point units.
• Instruction-level parallelism.
• Effective use of caches.
• Multiple processor cores per chip (2 cores 2002-2005, 4 by end of 2006, 6-12 cores in 2011), enabled by high transistor density per chip.
However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands (as shown in the previous slide).
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (or have replaced) vector supercomputers that utilize custom processors.
Uniprocessor Performance Evaluation
• CPU Performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
– Total CPU time = T = TC / f = TC x C = I x CPI x C
  = I x (CPI_execution + M x k) x C   (in seconds)
  TC = total program execution clock cycles
  f = clock rate       C = CPU clock cycle time = 1/f       I = instructions executed count
  CPI = cycles per instruction       CPI_execution = CPI with ideal memory
  M = memory stall cycles per memory access
  k = memory accesses per instruction
– MIPS Rating = I / (T x 10^6) = f / (CPI x 10^6) = f x I / (TC x 10^6)
  (in million instructions per second)
– Throughput Rate: Wp = 1/T = f / (I x CPI) = (MIPS) x 10^6 / I
  (in programs per second)
• Performance factors (I, CPI_execution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.
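As a quick numerical check of the formulas above, the short C program below plugs in example values (the instruction count, CPI_execution, miss rate, miss penalty, memory accesses per instruction, and clock rate are all assumed for illustration only) and prints the resulting effective CPI, total CPU time T, MIPS rating, and throughput Wp.

```c
/* Worked example of the CPU time, MIPS and throughput formulas above.
   All input values are assumed for illustration only. */
#include <stdio.h>

int main(void) {
    double I      = 200e6;      /* instructions executed count (assumed) */
    double CPIexe = 1.2;        /* CPI with ideal memory (assumed) */
    double k      = 0.3;        /* memory accesses per instruction (assumed) */
    double M      = 0.05 * 100; /* memory stall cycles per access:
                                   5% miss rate x 100-cycle penalty (assumed) */
    double f      = 2e9;        /* clock rate: 2 GHz (assumed) */
    double C      = 1.0 / f;    /* clock cycle time */

    double CPI  = CPIexe + M * k;  /* effective CPI */
    double T    = I * CPI * C;     /* total CPU time in seconds */
    double MIPS = I / (T * 1e6);   /* = f / (CPI x 10^6) */
    double Wp   = 1.0 / T;         /* throughput in programs per second */

    printf("Effective CPI = %.2f\n", CPI);
    printf("T = %.4f s, MIPS = %.1f, Wp = %.2f programs/s\n", T, MIPS, Wp);
    return 0;
}
```

With these assumed values the effective CPI is 2.7, T = 0.27 s, and the MIPS rating is about 741, which matches f / (CPI x 10^6).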
[Figure: Performance (log scale, 0.1 to 100) of supercomputers (custom processors), mainframes, minicomputers, and microprocessors (commodity processors), 1965-1995.]
Microprocessor Frequency Trend
[Figure: Clock frequency (MHz) and gate delays per clock for Intel, IBM PowerPC, and DEC processors (386, 486, Pentium, Pentium Pro, 601/603, 604, MPC750, 21066, 21164, 21264S, ...), 1987-2005.]
Processor frequency used to scale by 2X per generation while the number of gate delays per clock dropped by about 25%, leading to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages).
Reality check: clock frequency scaling is slowing down; frequency no longer doubles each generation.
Why? 1- Power leakage  2- Clock distribution delays
Result: deeper pipelines, longer stalls, higher CPI (lowers effective performance per cycle).
Solution: exploit Thread-Level Parallelism (TLP) at the chip level with chip-multiprocessors (CMPs).
Moore's Law: 2X transistors/chip every 1.5 years (circa 1970) still holds.
~ 800,000x transistor density increase in the last 38 years; currently > 2 billion transistors per chip.
[Figure: Transistor count per chip vs. year, from the Intel 4004 (2,300 transistors) through the i80286, i80386, R2000, R3000 and later processors.]
• One billion transistors/chip reached in 2005, two billion in 2008-9, now ~ three billion.
• Transistor count grows faster than clock rate: currently ~ 40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count.
Solution: the increased transistor count enables Thread-Level Parallelism (TLP)/parallel processing at the chip level: Chip-Multiprocessors (CMPs) + Simultaneous Multithreaded (SMT) processors.
Multi-core processor organizations (how cores communicate):
1 – Cores communicate using a shared L2 or L3 cache (lowest communication latency).
    Examples: IBM POWER4/5, Intel Pentium Core Duo (Yonah), Conroe (Core 2), i7, Sun UltraSparc T1 (Niagara)
2 – Cores communicate using an on-chip crossbar/switch interconnect (shared system interface).
    Examples: AMD Dual Core Opteron, Athlon 64 X2, Intel Itanium2 (Montecito), AMD Phenom ...
3 – Cores communicate over the external Front Side Bus (FSB) (highest communication latency).
    Examples: Intel Pentium D, Intel Quad core (two dual-core chips)
[Figure: LINPACK performance (MFLOPS, log scale), 1975-2000, comparing CRAY vector processors (n = 100 and n = 1,000: CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) with microprocessors (n = 100 and n = 1,000: Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, MIPS R4400, HP 9000/735, HP 9000/750, DEC Alpha, DEC Alpha AXP, DEC 8200, IBM Power2/990). Vector processors reached about 1 GFLOP (10^9 FLOPS); a single microprocessor core now delivers about 5-20 GFLOPS.]
Parallel Performance: LINPACK
Current top LINPACK performance (since ~ Nov. 2010):
Now about 2,566,000 GFlop/s = 2,566 TeraFlops = 2.566 PetaFlops
Tianhe-1A (@ National Supercomputing Center in Tianjin, China), 186,368 processor cores:
14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 GPUs
[Figure: LINPACK performance (GFLOPS, log scale), 1985-1996, comparing MPP peak (iPSC/860, nCUBE/2(1024), Delta, CM-2, CM-200, CM-5, Paragon XP/S, Paragon XP/S MP (1024), T3D) with CRAY peak (Xmp/416(4), Ymp/832(8), C90(16), T932(32)), shown alongside the uniprocessor LINPACK chart from the previous slide for comparison.]
Computer System Peak FLOP Rating History
Current top peak FP performance (since ~ Nov. 2010):
Now about 4,701,000 GFlop/s = 4,701 TeraFlops = 4.701 PetaFlops
Tianhe-1A (@ National Supercomputing Center in Tianjin, China), 186,368 processor cores:
14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 GPUs
[Figure: Peak FLOP rating history of computer systems, with Tianhe-1A now at the PetaFLOP level (1 PetaFLOP = 10^15 FLOPS = 1000 TeraFLOPS).]
Elements of Parallel Computing
[Diagram: Computing problems (driving force: HPC) lead to parallel algorithms and data structures via dependency analysis (task dependency graphs); parallel computations (tasks) are then assigned (mapped) to processors. High-level languages, programming, binding (compile, load), the operating system, and applications software map the computation onto the parallel hardware architecture (processing nodes/network).]
Elements of Parallel Computing
5 System Software Support
– Needed for the development of efficient programs in high-level
languages (HLLs).
– Assemblers, loaders.
– Portable parallel programming languages/libraries
– User interfaces and tools.
6 Compiler Support (approaches to parallel programming)
(a) – Implicit Parallelism Approach:
• Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
• Source code is written in conventional sequential languages.
(b) – Explicit Parallelism Approach:
• The programmer explicitly specifies parallelism using:
– a sequential compiler (conventional sequential HLL) and a low-level library of the target parallel computer, or
– a concurrent (parallel) HLL.
• Concurrency-preserving compiler: the compiler preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
Illustrated next.
Approaches to Parallel Programming
[Diagram: (a) Implicit Parallelism — the programmer writes sequential source code; a parallelizing compiler automatically detects parallelism in the sequential source code and transforms it into parallel constructs/code, producing parallel object code executed by the runtime system. (b) Explicit Parallelism — the programmer explicitly specifies parallelism using parallel constructs; a concurrency-preserving compiler produces concurrent object code executed by the runtime system.]
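As a minimal sketch of the explicit-parallelism path (b) above, the C fragment below uses an OpenMP directive so that the programmer, not the compiler, states that the loop iterations are independent; the compiler only preserves that parallelism. The array size and loop body are arbitrary choices for illustration.

```c
/* Explicit parallelism: the programmer marks the parallel loop;
   the compiler preserves it (build with OpenMP enabled, e.g. cc -fopenmp). */
#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void) {
    static double a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0 * i; }

    /* The programmer explicitly specifies that iterations are independent. */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];

    printf("c[N-1] = %f (up to %d threads)\n", c[N - 1], omp_get_max_threads());
    return 0;
}
```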
Factors Affecting Parallel System Performance
• Parallel Algorithm Related (i.e. inherent parallelism):
– Available concurrency and profile, grain size, uniformity, patterns.
  • Dependencies between computations, represented by a dependency graph.
– Type of parallelism present: functional and/or data parallelism.
– Required communication/synchronization, uniformity and patterns.
– Data size requirements.
– Communication-to-computation ratio (C-to-C ratio, lower is better).
• Parallel program Related:
– Programming model used.
– Resulting data/code memory requirements, locality and working set
characteristics.
– Parallel task grain size.
– Assignment (mapping) of tasks to processors: Dynamic or static.
– Cost of communication/synchronization primitives.
• Hardware/Architecture related:
– Total CPU computational power available + number of processors (hardware parallelism).
– Types of computation modes supported.
– Shared address space vs. message passing.
– Communication network characteristics (topology, bandwidth, latency)
– Memory hierarchy properties.
Concurrency = Parallelism
Sequential Execution on One Processor vs. Possible Parallel Execution Schedule on Two Processors P0, P1
[Diagram: A task dependency graph for tasks A-G (A first; then B and C; then D, E, F; finally G), the sequential schedule on one processor (time 0-21), and a possible two-processor schedule on P0 and P1 (time 0-16, including communication and idle slots).]
Assume the computation time for each task A-G is 3, the communication time between parallel tasks is 1, and communication can overlap with computation.
Sequential time on one processor: T1 = 21. Parallel time on two processors: T2 = 16.
Speedup on two processors = T1/T2 = 21/16 = 1.3
What would the speedup be with 3 processors? 4 processors? 5 ... ?
A simple parallel execution example.
Evolution of Computer Architecture
[Diagram: Evolution from non-pipelined scalar sequential machines through lookahead, functional parallelism, limited I/E overlap, and pipelining, toward data-parallel architectures (SIMD, systolic arrays) and architectures based on message passing, shared memory, and dataflow; application software and system software sit above the architecture.]
More on this next lecture.
Parallel Programming Models
• Programming methodology used in coding parallel applications
• Specifies: 1- communication and 2- synchronization
• Examples: (however, a good way to utilize multi-core processors for the masses!)
[Diagram: Computer architectures classified according to the number of instruction streams (threads) and the number of data streams. Shown here: SIMD — a control unit (CU) drives an array of synchronized processing elements (PE), each with memory (M). MISD (Multiple Instruction streams and a Single Data stream) — systolic arrays for pipelined execution. MIMD (Multiple Instruction streams over Multiple Data streams) — parallel computers or multiprocessor systems; a distributed-memory multiprocessor system is shown.]
Current Trends In Parallel Architectures
• Defines (in contrast to a conventional or sequential architecture):
1 – Critical abstractions, boundaries, and primitives (interfaces).
2 – Organizational structures that implement the interfaces (hardware or software), i.e. the implementation of the interfaces.
Shared address space (shared memory) programming model:
– Location transparency.
– Similar programming model to time-sharing on uniprocessors, except processes run on different processors.
• Good throughput on multiprogrammed (i.e. multi-tasking) workloads.
[Diagram: The address spaces of processes P0 and P1, each with a private portion and a portion shared with the other process.]
• Writes to shared addresses are visible to other threads (in other processes too).
• A natural extension of the uniprocessor model: communication is implicit via loads/stores.
• Conventional memory operations are used for communication.
• Special atomic operations are needed for synchronization (i.e. for event ordering and mutual exclusion), using locks, semaphores, etc. Thus synchronization is explicit.
• The OS uses shared memory to coordinate processes.
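A minimal sketch of this shared-address-space model in C with POSIX threads (the thread count and iteration count are arbitrary): the threads communicate implicitly through ordinary loads/stores to a shared variable, while a mutex supplies the explicit synchronization (mutual exclusion).

```c
/* Shared-address-space sketch: communication via shared memory (loads/stores),
   explicit synchronization via a lock (pthread mutex). */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define NITERS   100000

static long shared_counter = 0;  /* shared address: visible to all threads */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < NITERS; i++) {
        pthread_mutex_lock(&lock);    /* explicit synchronization (mutual exclusion) */
        shared_counter++;             /* implicit communication: ordinary load/store */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void) {
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("shared_counter = %ld (expected %d)\n",
           shared_counter, NTHREADS * NITERS);
    return 0;
}
```

Without the lock the final count would usually be wrong, which is exactly why the model needs explicit synchronization on top of implicit load/store communication.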
Models of Shared-Memory Multiprocessors
1 • The Uniform Memory Access (UMA) Model:
– All physical memory is shared by all processors.
– All processors have equal access (i.e equal memory
bandwidth and access latency) to all memory addresses.
– Also referred to as Symmetric Multiprocessors (SMPs).
2 • Distributed memory or Non-uniform Memory Access
(NUMA) Model:
– Shared memory is physically distributed locally among
processors. Access latency to remote memory is higher.
3 • The Cache-Only Memory Architecture (COMA) Model:
– Distributed memory is treated entirely as cache, so data migrates to the nodes that use it.
[Diagram: (2) Distributed memory or Non-uniform Memory Access (NUMA) model — processors (P) with caches ($) and local memories (M) connected by a network. (3) Cache-Only Memory Architecture (COMA) — processors (P) with caches (C) and directories (D), with no conventional home main memory, connected by a network.]
Uniform Memory Access (UMA) Example: Intel Pentium Pro Quad (circa 1997)
[Diagram: Four Pentium Pro processors share a common bus with a memory interface unit (MIU) driving 1-, 2-, or 4-way interleaved DRAM, plus PCI bridges and PCI buses with I/O cards.]
Non-Uniform Memory Access (NUMA) Example: AMD Opteron-based multiprocessor server
Dedicated point-to-point interconnects (HyperTransport links) are used to connect the processors, alleviating the traditional limitations of FSB-based SMP systems. Each processor has two integrated DDR memory channel controllers, so memory bandwidth scales up with the number of processors. It is a NUMA architecture since a processor can access its own memory at a lower latency than remote memory directly connected to other processors in the system. (Total of 16 processor cores when dual-core Opteron processors are used; 32 cores with quad-core processors.)
Distributed Shared-Memory Multiprocessor System Example:
NUMA MPP Example: Cray T3E (circa 1995-1999)
MPP = Massively Parallel Processor system
[Diagram: Each node contains a processor (P) with cache ($), local memory (Mem), a memory controller with network interface / communication assist (CA), and external I/O, connected by X-Y-Z switches into a 3D torus point-to-point network. A more recent Cray MPP example is the Cray X1E supercomputer.]
• Scales up to 2048 processors; DEC Alpha EV6 microprocessor (COTS).
• Custom 3D torus point-to-point network, 480 MB/s links.
• The memory controller generates communication requests for non-local references.
• No hardware mechanism for cache coherence (SGI Origin etc. provide this).
Message-Passing Abstraction:
[Diagram: Sender process P executes Send(X, Q, t), sending the data at address X in its local address space to recipient process Q with tag t. Recipient process Q executes Receive(Y, P, t), naming receive address Y in its local address space, sender P, and matching tag t; the recipient blocks (waits) until the message is received.]
Message-Passing Example: Intel Paragon (Sandia's Intel Paragon XP/S-based supercomputer)
[Diagram: Each node contains 4-way interleaved DRAM, a memory controller, and a DMA-driven network interface (NI) with driver serving as the communication assist (CA). Nodes are connected by a 2D grid point-to-point network (8 bits wide, 175 MHz, bidirectional) with a processing node attached to every switch.]
Message-Passing Example: IBM SP-2 (MPP, circa 1994-1998)
MPP = Massively Parallel Processor system
[Diagram: An IBM SP-2 node: a Power 2 CPU with L2 cache on the memory bus, a memory controller with 4-way interleaved DRAM, and a network interface card on the I/O bus (with an i860 processor, NI, and DRAM) serving as the communication assist (CA). Nodes are connected by a general interconnection network formed from 8-port switches (a multi-stage network).]
• Network interface integrated in the I/O bus (bandwidth limited by the I/O bus).
Message-Passing MPP Example: IBM Blue Gene/L (circa 2005)
(2 processors/chip) x (2 chips/compute card) x (16 compute cards/node board) x (32 node boards/cabinet) x (64 cabinets) = 128K = 131,072 (0.7 GHz PowerPC 440) processors (64K nodes)
2.8 GFLOPS peak per processor core. System location: Lawrence Livermore National Laboratory.
Networks: 3D torus point-to-point network and a global tree network (both proprietary).
Packaging hierarchy (peak FLOP rate, memory):
– Chip (2 processors): 2.8/5.6 GF/s, 4 MB
– Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
– Node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
– Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
– System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR
Design goals: high computational power efficiency and high computational density per volume.
LINPACK performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS
Peak FP performance: about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS
Message-Passing Programming Tools
• Message-passing programming environments include:
– Message Passing Interface (MPI):
  • Provides a standard for writing concurrent message-passing programs.
  • MPI implementations include parallel libraries used by existing programming languages (C, C++).
– Parallel Virtual Machine (PVM):
(Both MPI and PVM are examples of the explicit parallelism approach to parallel programming.)
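A minimal MPI sketch in C of this explicit message-passing style: process 0 sends an integer to process 1 with MPI_Send, and process 1 blocks in MPI_Recv until the tagged message arrives, mirroring the Send/Receive abstraction shown earlier. The tag and data values are arbitrary.

```c
/* Minimal MPI message-passing example: one send, one blocking receive. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data, tag = 99;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {            /* sender process P */
        data = 42;
        MPI_Send(&data, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {     /* recipient process Q: blocks until the message arrives */
        MPI_Recv(&data, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("process 1 received %d from process 0\n", data);
    }

    MPI_Finalize();
    return 0;
}
```

Build with an MPI wrapper compiler (e.g. mpicc) and launch at least two processes (e.g. mpirun -np 2 ./a.out).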
Systolic Architectures
• Replace a single processor with an array of regular processing elements.
• Orchestrate data flow for high throughput with less memory access.
PE = Processing Element, M = Memory
[Diagram: Conventional organization: memory (M) feeds a single PE. Systolic organization: memory feeds a linear array of PEs (PE → PE → PE) through which data flows before returning to memory.]
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
Systolic Array Example: 3x3 Systolic Array Matrix Multiplication
• Processors arranged in a 2-D grid.
• Each processor accumulates one element of the product C = A x B.
• The rows of A and the columns of B are staggered in time ("alignments in time") and flow into the array from the left and the top; each cycle, every processor multiplies the a and b values passing through it, adds the product to its accumulated sum, then forwards a to the right and b downward.
[Diagram sequence, time steps T = 1 through T = 7: snapshots of the staggered a(i,j) and b(i,j) values moving through the array and the partial sums accumulating in each cell; C00 = a0,0*b0,0 + a0,1*b1,0 + a0,2*b2,0 completes first, and C22 completes last at T = 7.]
On one processor the multiplication takes O(n^3) steps: t = 27 for n = 3. The systolic array finishes in 7 steps, so speedup = 27/7 = 3.85.
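The C program below is a sketch that simulates the 3x3 systolic matrix multiplication above cycle by cycle. The data movement is modeled implicitly by arrival times: A[i][k] enters row i at cycle i+k and moves one cell right per cycle, so it meets B[k][j], which is moving down column j, at cell (i,j) at cycle i+j+k. The input matrices are arbitrary.

```c
/* Cycle-by-cycle sketch of the 3x3 systolic matrix multiplication above.
   Each cell of the grid keeps one accumulator and performs one
   multiply-accumulate per cycle on the values flowing through it. */
#include <stdio.h>

#define N 3

int main(void) {
    double A[N][N] = {{1, 2, 3}, {4, 5, 6}, {7, 8, 9}};
    double B[N][N] = {{9, 8, 7}, {6, 5, 4}, {3, 2, 1}};
    double C[N][N] = {{0}};                 /* one accumulator per cell */

    for (int t = 0; t < 3 * N - 2; t++) {   /* 3n - 2 = 7 cycles: T = 1..7 on the slides */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;          /* element index reaching cell (i,j) this cycle */
                if (k >= 0 && k < N)
                    C[i][j] += A[i][k] * B[k][j];  /* multiply-accumulate in the cell */
            }
    }

    for (int i = 0; i < N; i++) {           /* C should equal the ordinary product A x B */
        for (int j = 0; j < N; j++)
            printf("%6.1f ", C[i][j]);
        printf("\n");
    }
    return 0;
}
```

All 27 multiply-accumulate operations still occur, but they are spread over 7 cycles across the 9 cells, which is where the 27/7 = 3.85 speedup comes from.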