Parallel Processing

The document provides an overview of parallel processing, including its architecture, necessity, and performance factors. It discusses the evolution of parallel computing, trends in microprocessor technology, and the significance of parallelism in addressing complex computational problems in various scientific fields. Key topics include shared address space architectures, message-passing multicomputers, and the impact of increasing transistor density on performance.


Introduction to Parallel Processing

• Parallel Computer Architecture: Definition & Broad issues involved


– A Generic Parallel Computer Architecture
• The Need And Feasibility of Parallel Computing Why?
– Scientific Supercomputing Trends
– CPU Performance and Technology Trends, Parallelism in Microprocessor Generations
– Computer System Peak FLOP Rating History/Near Future
• The Goal of Parallel Processing
• Elements of Parallel Computing
• Factors Affecting Parallel System Performance
• Parallel Architectures History
– Parallel Programming Models
– Flynn’s 1972 Classification of Computer Architecture
• Current Trends In Parallel Architectures
– Modern Parallel Architecture Layered Framework
• Shared Address Space Parallel Architectures
• Message-Passing Multicomputers: Message-Passing Programming Tools
• Data Parallel Systems
• Dataflow Architectures
• Systolic Architectures: Matrix Multiplication Systolic Array Example

PCA Chapter 1.1, 1.2 EECC756 - Shaaban


#1 lec # 1 Spring 2011 3-8-2011
Parallel Computer Architecture
A parallel computer (or multiple processor system) is a collection of
communicating processing elements (processors) that cooperate to solve
large computational problems fast by dividing such problems into parallel
tasks, exploiting Thread-Level Parallelism (TLP), i.e. parallel processing. (Task = a computation done on one processor.)
• Broad issues involved:
– The concurrency and communication characteristics of parallel algorithms for a given
computational problem (represented by dependency graphs)
– Computing Resources and Computation Allocation:
• The number of processing elements (PEs), computing power of each element and
amount/organization of physical memory used.
• What portions of the computation and data are allocated or mapped to each PE.
– Data access, Communication and Synchronization
• How the processing elements cooperate and communicate.
• How data is shared/transmitted between processors.
• Abstractions and primitives for cooperation/communication and synchronization.
• The characteristics and performance of parallel system network (System interconnects).
– Parallel Processing Performance and Scalability Goals:
• Maximize the performance enhancement of parallelism: maximize speedup,
– by minimizing parallelization overheads and balancing the workload on processors.
• Scalability of performance to larger systems/problems.
Processor = programmable computing element that runs stored programs written using a pre-defined instruction set.
Processing Elements = PEs = Processors
EECC756 - Shaaban
#2 lec # 1 Spring 2011 3-8-2011
A Generic Parallel Computer Architecture
[Figure: a generic parallel computer. (1) Processing nodes: each node contains one or more processing elements or processors (P) with cache ($) and memory (Mem), plus a communication assist (CA), i.e. the network interface and communication controller; single or multiple processors per chip (e.g. 2-8 cores per chip per node), custom or commercial microprocessors. (2) A parallel machine network (custom or industry standard) connecting the nodes. The operating system and parallel programming environments run on top.]

1 Processing nodes: homogeneous or heterogeneous.
  Each processing node contains one or more processing elements (PEs) or processors, a memory system, plus a communication assist (CA): network interface and communication controller.
2 Parallel machine network (system interconnects).
  The function of the parallel machine network is to transfer information (data, results, ...) efficiently (i.e. at low communication cost) from a source node to a destination node, as needed to allow cooperation among the parallel processing nodes in solving large computational problems divided into a number of parallel computational tasks.
EECC756 - Shaaban
Parallel Computer = Multiple Processor System
#3 lec # 1 Spring 2011 3-8-2011
The Need And Feasibility of Parallel Computing
• Application demands: more computing cycles/memory needed (the driving force)
  – Scientific/engineering computing: CFD, biology, chemistry, physics, ...
  – General-purpose computing: video, graphics, CAD, databases, transaction processing, gaming, ...
  – Mainstream multithreaded programs are similar to parallel programs.
• Technology trends: Moore's Law is still alive.
  – The number of transistors on a chip is growing rapidly, while clock rates are expected to continue to go up only slowly; actual performance returns are diminishing due to deeper pipelines.
  – Increased transistor density allows integrating multiple processor cores per chip, creating Chip-Multiprocessors (CMPs), even for mainstream computing applications (desktop/laptop, ...).
• Architecture trends:
  – Instruction-level parallelism (ILP) is valuable (superscalar, VLIW) but limited.
  – Increased clock rates require deeper pipelines with longer latencies and higher CPIs.
  – Coarser-level parallelism (at the task or thread level, TLP), utilized in multiprocessor systems together with multi-tasking (multiple independent programs), is the most viable approach to further improve performance.
    • This is the main motivation for the development of chip-multiprocessors (CMPs), i.e. multi-core processors.
• Economics:
  – The increased use of commodity off-the-shelf (COTS) components in high-performance parallel computing systems, instead of the costly custom components used in traditional supercomputers, leads to much lower parallel system cost.
    • Today's microprocessors offer high performance and have multiprocessor support, eliminating the need to design expensive custom PEs.
    • Commercial System Area Networks (SANs) offer an alternative to more costly custom networks.
EECC756 - Shaaban
#4 lec # 1 Spring 2011 3-8-2011
Why is Parallel Processing Needed?
Challenging Applications in Applied Science/Engineering
• Astrophysics
• Atmospheric and Ocean Modeling
• Bioinformatics
• Biomolecular simulation: Protein folding
• Computational Chemistry
• Computational Fluid Dynamics (CFD)
• Computational Physics
• Computer vision and image understanding
• Data Mining and Data-intensive Computing
• Engineering analysis (CAD/CAM)
• Global climate modeling and forecasting
• Material Sciences
• Military applications
• Quantum chemistry
• VLSI design
• ...

Such applications are the traditional driving force for HPC/parallel processing and for multiple-processor system development: they have very high (1) computational and (2) memory requirements that cannot be met with single-processor architectures, and many of them contain a large degree of computational parallelism.
EECC756 - Shaaban
#5 lec # 1 Spring 2011 3-8-2011
Why is Parallel Processing Needed?
Scientific Computing Demands
Driving force for HPC and multiple processor system development:
[Chart: computational (FLOPS) and memory requirements of grand-challenge scientific applications, compared with roughly 3-5 GFLOPS for a uniprocessor.]
The computational and memory demands of such applications exceed the capabilities of even the fastest current uniprocessor systems.
GFLOP = 10^9 FLOPS    TeraFLOP = 1000 GFLOPS = 10^12 FLOPS    PetaFLOP = 1000 TeraFLOPS = 10^15 FLOPS
EECC756 - Shaaban
#6 lec # 1 Spring 2011 3-8-2011
Scientific Supercomputing Trends
• Proving ground and driver for innovative architecture and advanced
high performance computing (HPC) techniques:
– Market is much smaller relative to commercial (desktop/server)
segment.
– Dominated by costly vector machines starting in the 1970s through
the 1980s.
– Microprocessors have made huge gains in the performance needed
for such applications:
      • High clock rates. (But: higher CPI?)
      • Multiple pipelined floating-point units.
      • Instruction-level parallelism.
      • Effective use of caches.
      • Multiple processor cores per chip (2 cores 2002-2005, 4 by the end of 2006, 6-12 cores in 2011) — enabled by high transistor density per chip.
   However, even the fastest current single-microprocessor systems still cannot meet the needed computational demands (as shown on the previous slide).
• Currently: large-scale microprocessor-based multiprocessor systems and computer clusters are replacing (have replaced?) vector supercomputers that utilize custom processors.
EECC756 - Shaaban
#7 lec # 1 Spring 2011 3-8-2011
Uniprocessor Performance Evaluation
• CPU Performance benchmarking is heavily program-mix dependent.
• Ideal performance requires a perfect machine/program match.
• Performance measures:
   – Total CPU time:  T = TC / f = TC × C = I × CPI × C = I × (CPIexecution + M × k) × C   (in seconds)
      TC = total program execution clock cycles
      f = clock rate,  C = CPU clock cycle time = 1/f,  I = instructions executed count
      CPI = cycles per instruction,  CPIexecution = CPI with ideal memory
      M = memory stall cycles per memory access
      k = memory accesses per instruction
   – MIPS rating = I / (T × 10^6) = f / (CPI × 10^6) = f × I / (TC × 10^6)   (in million instructions per second)
   – Throughput rate:  Wp = 1/T = f / (I × CPI) = MIPS × 10^6 / I   (in programs per second)
• Performance factors (I, CPIexecution, M, k, C) are influenced by: instruction-set architecture (ISA), compiler design, CPU micro-architecture, implementation and control, cache and memory hierarchy, program access locality, and program instruction mix and instruction dependencies.

T = I x CPI x C EECC756 - Shaaban


#8 lec # 1 Spring 2011 3-8-2011
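A quick numeric sketch of these formulas (the instruction count, CPI components and clock rate below are made-up illustrative values, not figures from the lecture):

/* Plug assumed values into T = I x CPI x C, the MIPS rating and throughput. */
#include <stdio.h>

int main(void) {
    double I    = 1e9;       /* instructions executed                     */
    double CPIe = 1.2;       /* CPIexecution = CPI with ideal memory      */
    double k    = 0.3;       /* memory accesses per instruction           */
    double M    = 5.0;       /* memory stall cycles per memory access     */
    double f    = 2e9;       /* clock rate in Hz, so C = 1/f              */

    double CPI  = CPIe + M * k;          /* effective CPI                 */
    double T    = I * CPI / f;           /* T = I x CPI x C (seconds)     */
    double MIPS = f / (CPI * 1e6);       /* MIPS = f / (CPI x 10^6)       */
    double Wp   = 1.0 / T;               /* throughput in programs/second */

    printf("CPI = %.2f, T = %.2f s, MIPS = %.1f, Wp = %.3f programs/s\n",
           CPI, T, MIPS, Wp);
    return 0;
}

With these values the effective CPI is 1.2 + 5 × 0.3 = 2.7, giving T = 1.35 s and roughly 741 MIPS.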
Single CPU Performance Trends
• The microprocessor is currently the most natural building block for
multiprocessor systems in terms of cost and performance.
• This is even more true with the development of cost-effective multi-core
microprocessors that support TLP at the chip level.

[Chart: relative performance (log scale, 0.1 to 100) vs. year, 1965-1995, for supercomputers (custom processors), mainframes, minicomputers, and microprocessors (commodity processors); microprocessor performance grows fastest, closing the gap with the other classes by the mid-1990s.]

EECC756 - Shaaban
#9 lec # 1 Spring 2011 3-8-2011
Microprocessor Frequency Trend
[Chart: clock frequency (MHz, log scale) and gate delays per clock vs. year, 1987-2005, for Intel, IBM PowerPC and DEC processors (386, 486, Pentium, Pentium Pro, Pentium II, 601/603/604/604+, MPC750, 21064A, 21066, 21164, 21164A, 21264, 21264S).]
• Processor frequency scaled by about 2X per generation, while the number of gate delays per clock dropped by about 25% per generation, leading to deeper pipelines with more stages (e.g. the Intel Pentium 4E has 30+ pipeline stages).
• Reality check: clock frequency scaling is slowing down — frequency no longer doubles each generation. (Did silicon finally hit the wall?)
   Why?  1- Power leakage.  2- Clock distribution delays.
   Result: deeper pipelines, longer stalls, higher CPI (which lowers effective performance per cycle).
   Solution: exploit TLP at the chip level — chip-multiprocessors (CMPs).

T = I x CPI x C EECC756 - Shaaban


#10 lec # 1 Spring 2011 3-8-2011
Transistor Count Growth Rate
Enabling Technology for Chip-Level Thread-Level Parallelism (TLP)

• Roughly 800,000x increase in transistor density over the last 38 years; currently > 2 billion transistors per chip.
• Moore's Law — 2X transistors per chip every ~1.5 years (circa 1970) — still holds.
   [Chart: transistor count per chip vs. year, from the Intel 4004 (2,300 transistors) to today's multi-billion-transistor chips.]
• This enables Thread-Level Parallelism (TLP) at the chip level: Chip-Multiprocessors (CMPs) plus Simultaneous Multithreaded (SMT) processors.

• One billion transistors per chip was reached in 2005, two billion in 2008-09; now ~ three billion.
• Transistor count grows faster than clock rate: currently ~ 40% per year.
• Single-threaded uniprocessors do not efficiently utilize the increased transistor count (limited ILP; the extra transistors largely go into bigger caches).
EECC756 - Shaaban


#11 lec # 1 Spring 2011 3-8-2011
Parallelism in Microprocessor VLSI Generations
[Chart: transistors per chip (log scale) vs. year, 1970-2005, annotated with the dominant form of parallelism in successive microprocessor generations.]
• Bit-level parallelism: single-issue, multi-cycle, non-pipelined designs (i4004, i8008, i8080, i8086, i80286), CPI >> 1.
• Instruction-level parallelism (ILP): pipelined single-issue designs with CPI = 1 (i80386, R2000, R3000), then superscalar/VLIW designs issuing multiple micro-operations per cycle with CPI < 1 (Pentium, R10000, ...).
• Thread-level parallelism (TLP): Simultaneous Multithreading (SMT, e.g. Intel's Hyper-Threading) and Chip-Multiprocessors (CMPs, e.g. IBM POWER4/5, Intel Pentium D and Core Duo, AMD Athlon 64 X2 and Dual-Core Opteron, Sun UltraSPARC T1 "Niagara") instead of a single thread per chip.
• Chip-level TLP/parallel processing is even more important now due to the slowing of clock-rate increases.
ILP = Instruction-Level Parallelism    TLP = Thread-Level Parallelism
Improving microprocessor-generation performance by exploiting more levels of parallelism.
EECC756 - Shaaban
#12 lec # 1 Spring 2011 3-8-2011
Current Dual-Core Chip-Multiprocessor Architectures
Three common organizations:
1. Single die, shared L2 (or L3) cache: cores communicate through the shared on-chip cache (lowest communication latency).
   Examples: IBM POWER4/5, Intel Core Duo (Yonah) and Conroe (Core 2), Intel Core i7, Sun UltraSPARC T1 (Niagara).
2. Single die, private caches, shared system interface: cores communicate over an on-chip crossbar/switch interconnect (shared system interface).
   Examples: AMD Dual-Core Opteron and Athlon 64 X2, Intel Itanium 2 (Montecito), AMD Phenom, ...
3. Two dice in a shared package, private caches, private system interfaces: cores communicate over the external Front Side Bus (FSB) (highest communication latency).
   Examples: Intel Pentium D, Intel quad-core parts built from two dual-core dice.
Source: Real World Technologies, http://www.realworldtech.com/page.cfm?ArticleID=RWT101405234615
EECC756 - Shaaban   #13 lec # 1 Spring 2011 3-8-2011
Microprocessors Vs. Vector Processors
Uniprocessor Performance: LINPACK
[Chart: LINPACK uniprocessor performance (MFLOPS, log scale) vs. year, 1975-2000, for vector processors (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94; n = 100 and n = 1,000 cases) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, MIPS R4400, IBM RS6000/540, HP 9000/750, HP 9000/735, DEC Alpha, DEC Alpha AXP, DEC 8200, IBM Power2/990). Microprocessors close the gap with vector processors, reaching the 1 GFLOP (10^9 FLOPS) level in the 1990s; now about 5-20 GFLOPS per microprocessor core.]

EECC756 - Shaaban
#14 lec # 1 Spring 2011 3-8-2011
Parallel Performance: LINPACK
Current top LINPACK performance (since ~ Nov. 2010): about 2,566,000 GFLOPS = 2,566 TeraFLOPS = 2.566 PetaFLOPS,
achieved by Tianhe-1A (at the National Supercomputing Center in Tianjin, China) with 186,368 processor cores:
14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 GPUs.
[Chart: LINPACK performance of parallel systems (GFLOPS, log scale) vs. year, 1985-1996, with MPP peak and CRAY peak curves: Xmp/416(4), Ymp/832(8), nCUBE/2(1024), iPSC/860, CM-2, CM-200, Delta, C90(16), CM-5, T3D, T932(32), Paragon XP/S, Paragon XP/S MP (1024 and 6768 processors), ASCI Red — crossing the 1 TeraFLOP (10^12 FLOPS = 1000 GFLOPS) level.]
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
EECC756 - Shaaban
#15 lec # 1 Spring 2011 3-8-2011
Why is Parallel Processing Needed?
LINPACK Performance Trends
[Charts: the uniprocessor LINPACK chart (MFLOPS, 1975-2000) and the parallel-system LINPACK chart (GFLOPS, 1985-1996) from the previous two slides, shown side by side. Uniprocessor performance reached about 1 GFLOP (10^9 FLOPS), while parallel systems reached about 1 TeraFLOP (10^12 FLOPS = 1000 GFLOPS) over roughly the same period.]
Uniprocessor Performance                Parallel System Performance

EECC756 - Shaaban
#16 lec # 1 Spring 2011 3-8-2011
Computer System Peak FLOP Rating History
Current top peak FP performance (since ~ Nov. 2010): about 4,701,000 GFLOPS = 4,701 TeraFLOPS = 4.701 PetaFLOPS,
achieved by Tianhe-1A (at the National Supercomputing Center in Tianjin, China) with 186,368 processor cores:
14,336 Intel Xeon X5670 6-core processors @ 2.9 GHz + 7,168 Nvidia Tesla M2050 GPUs.
[Chart: peak FLOP rating of computer systems vs. year, passing the TeraFLOP (10^12 FLOPS = 1000 GFLOPS) level in the late 1990s and reaching the PetaFLOP (10^15 FLOPS = 1000 TeraFLOPS) level with systems such as Tianhe-1A.]
The current ranking of the top 500 parallel supercomputers in the world is found at: www.top500.org
EECC756 - Shaaban
#17 lec # 1 Spring 2011 3-8-2011
TOP500 Supercomputers: The Top 10 (November 2005)
[Table of the November 2005 top-10 list omitted.]
Source (and for current list): www.top500.org
EECC756 - Shaaban   #18 lec # 1 Spring 2011 3-8-2011
TOP500 Supercomputers
32nd List (November 2008): The Top 10
[Table of the top-10 systems, including power in kW, omitted.]
Source (and for current list): www.top500.org
EECC756 - Shaaban   #19 lec # 1 Spring 2011 3-8-2011
TOP500 Supercomputers
34th List (November 2009): The Top 10
[Table of the top-10 systems, including power in kW, omitted.]
Source (and for current list): www.top500.org
EECC756 - Shaaban   #20 lec # 1 Spring 2011 3-8-2011
TOP500 Supercomputers
36th List (November 2010): The Top 10 (current list)
[Table of the top-10 systems, including power in kW, omitted.]
Source (and for current list): www.top500.org
EECC756 - Shaaban   #21 lec # 1 Spring 2011 3-8-2011
The Goal of Parallel Processing
• Goal of applications in using parallel machines: maximize speedup over single-processor performance.

   Speedup (p processors) = Performance (p processors) / Performance (1 processor)

• For a fixed problem size (input data set), performance = 1/time, so the fixed-problem-size parallel speedup, Speedup_p, is:

   Speedup_fixed (p processors) = Time (1 processor) / Time (p processors)

• Ideal speedup = number of processors = p.
   Very hard to achieve, due to parallelization overheads: communication cost, dependencies, load imbalance, ...
EECC756 - Shaaban
#22 lec # 1 Spring 2011 3-8-2011
The Goal of Parallel Processing
• The parallel processing goal is to maximize the parallel speedup. For a fixed problem size (or time):

   Speedup = Time(1) / Time(p)
           = (sequential work on one processor) / Max over processors (Work + Synch Wait Time + Comm Cost + Extra Work)

   i.e. Time(p) is set by the processor with the maximum execution time, and the terms beyond Work are the parallelization overheads.

• Ideal speedup = p = number of processors.
   – Very hard to achieve: it implies no parallelization overheads and perfect load balance among all processors.
• Maximize parallel speedup by:
   1 – Balancing computations (and overheads) on processors: every processor does the same amount of work.
   2 – Minimizing communication cost and the other overheads associated with each step of parallel program creation and execution.
• Performance scalability:
   Achieve a good speedup for the parallel application on the parallel architecture as the problem size and the machine size (number of processors) are increased.
EECC756 - Shaaban
#23 lec # 1 Spring 2011 3-8-2011
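As a concrete (and purely illustrative) instance of this bound, the short C sketch below assumes a 100-unit sequential time and made-up per-processor totals of work plus overheads on p = 4 processors, takes the maximum as Time(p), and reports the resulting speedup and efficiency:

/* Fixed-problem-size speedup: Time(p) is set by the slowest processor. */
#include <stdio.h>

int main(void) {
    double time1 = 100.0;                       /* sequential time on one processor          */
    /* per-processor totals of (Work + Synch Wait + Comm Cost + Extra Work), p = 4 (assumed) */
    double per_proc[4] = {27.0, 30.0, 25.0, 28.0};

    double timep = 0.0;                         /* Time(p) = max over processors */
    for (int i = 0; i < 4; i++)
        if (per_proc[i] > timep) timep = per_proc[i];

    double speedup = time1 / timep;             /* Speedup = Time(1) / Time(p) */
    printf("Time(p) = %.1f, speedup = %.2f (ideal = 4), efficiency = %.0f%%\n",
           timep, speedup, 100.0 * speedup / 4.0);
    return 0;
}

Here Time(p) = 30, so the speedup is 3.33 out of an ideal 4 (about 83% efficiency); the processor carrying the largest combined work and overhead determines the parallel time.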
Elements of Parallel Computing
[Diagram: computing problems (the HPC driving force) → parallel algorithms and data structures (from dependency analysis / task dependency graphs) → assignment (mapping) of parallel computations (tasks) to processors → programming in high-level languages and binding (compile, load) → applications software running on the operating system and the parallel hardware architecture (processing nodes/network) → performance evaluation, e.g. parallel speedup.]
EECC756 - Shaaban
#24 lec # 1 Spring 2011 3-8-2011
Elements of Parallel Computing
1 Computing Problems: (the driving force)
   – Numerical computing: science and engineering numerical problems demand intensive integer and floating-point computations.
   – Logical reasoning: artificial intelligence (AI) demands logic inferences, symbolic manipulations, and large space searches.
2 Parallel Algorithms and Data Structures
– Special algorithms and data structures are needed to specify the
computations and communication present in computing problems
(from dependency analysis).
– Most numerical algorithms are deterministic using regular data
structures.
– Symbolic processing may use heuristics or non-deterministic
searches.
– Parallel algorithm development requires interdisciplinary
interaction.
EECC756 - Shaaban
#25 lec # 1 Spring 2011 3-8-2011
Elements of Parallel Computing
3 Hardware Resources (computing power, communication/connectivity)
   A – Processors, memory, and peripheral devices (processing nodes) form the hardware core of a computer system.
   B – Processor connectivity (system interconnects, network) and memory organization influence the system architecture.
4 Operating Systems
   – Manages the allocation of resources to running processes.
   – Mapping to match algorithmic structures with the hardware architecture and vice versa: processor scheduling, memory mapping, interprocessor communication.
• Parallelism exploitation is possible at: 1- algorithm design, 2- program writing, 3- compilation, and 4- run time.

EECC756 - Shaaban
#26 lec # 1 Spring 2011 3-8-2011
Elements of Parallel Computing
5 System Software Support
– Needed for the development of efficient programs in high-level
languages (HLLs.)
– Assemblers, loaders.
– Portable parallel programming languages/libraries
– User interfaces and tools.
6 Compiler Support (the two approaches to parallel programming)
(a) – Implicit parallelism approach:
   • Parallelizing compiler: can automatically detect parallelism in sequential source code and transform it into parallel constructs/code.
   • Source code is written in conventional sequential languages.
(b) – Explicit parallelism approach:
   • The programmer explicitly specifies parallelism using:
      – a sequential compiler (conventional sequential HLL) and a low-level library of the target parallel computer, or
      – a concurrent (parallel) HLL.
   • Concurrency-preserving compiler: the compiler in this case preserves the parallelism explicitly specified by the programmer. It may perform some program flow analysis, dependence checking, and limited optimizations for parallelism detection.
Illustrated next.  EECC756 - Shaaban
#27 lec # 1 Spring 2011 3-8-2011
Approaches to Parallel Programming
(a) Implicit parallelism: the programmer writes source code in conventional sequential languages (C, C++, FORTRAN, LISP, ...); a parallelizing compiler automatically detects the parallelism in the sequential source code and transforms it into parallel constructs, producing parallel object code that is executed by the runtime system.

(b) Explicit parallelism: the programmer writes source code in concurrent dialects of C, C++, FORTRAN, LISP, ..., explicitly specifying the parallelism using parallel constructs; a concurrency-preserving compiler produces concurrent object code that is executed by the runtime system.

EECC756 - Shaaban
#28 lec # 1 Spring 2011 3-8-2011
Factors Affecting Parallel System Performance
• Parallel algorithm related (i.e. inherent parallelism):
   – Available concurrency and its profile, grain size, uniformity, and patterns.
      • Dependencies between computations, represented by a dependency graph.
   – Type of parallelism present: functional and/or data parallelism.
   – Required communication/synchronization, its uniformity and patterns.
   – Data size requirements.
   – Communication-to-computation ratio (C-to-C ratio; lower is better).
• Parallel program Related:
– Programming model used.
– Resulting data/code memory requirements, locality and working set
characteristics.
– Parallel task grain size.
– Assignment (mapping) of tasks to processors: Dynamic or static.
– Cost of communication/synchronization primitives.
• Hardware/Architecture related:
– Total CPU computational power available. + Number of processors
– Types of computation modes supported. (hardware parallelism)
– Shared address space Vs. message passing.
– Communication network characteristics (topology, bandwidth, latency)
– Memory hierarchy properties.
EECC756 - Shaaban
Concurrency = Parallelism
#29 lec # 1 Spring 2011 3-8-2011
A simple parallel execution example
[Figure: a task dependency graph with tasks A-G, its sequential execution on one processor, and a possible parallel execution schedule on two processors P0 and P1, including idle and communication slots.]
• Task = a computation run on one processor. Assume the computation time of each task A-G is 3 time units, the communication time between parallel tasks on different processors is 1 time unit, and communication can overlap with computation.
• Sequential execution on one processor: T1 = 21.
• Possible parallel execution schedule on two processors P0, P1: T2 = 16.
• Speedup on two processors = T1 / T2 = 21/16 ≈ 1.3.
• What would the speedup be with 3 processors? 4 processors? 5, ...?
EECC756 - Shaaban   #30 lec # 1 Spring 2011 3-8-2011
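The slide's exact task graph and schedule live in the figure; the C sketch below uses an assumed dependency graph and an assumed assignment of tasks A-G to two processors (both marked in the comments) just to show how such a makespan and speedup can be computed, so its numbers need not match the slide's T2 = 16.

/* Compute the makespan of a static two-processor schedule for a small task
 * dependency graph: every task takes 3 time units and a message between
 * tasks on different processors adds a 1-unit delay that overlaps with
 * computation. Graph and assignment below are illustrative assumptions.   */
#include <stdio.h>

#define NTASKS 7               /* tasks 0..6 = A..G                    */
#define COMM   1               /* inter-processor communication time   */
#define COST   3               /* computation time of every task       */

int main(void) {
    /* pred[t] = predecessors of task t, -1 terminated (assumed graph) */
    int pred[NTASKS][NTASKS] = {
        {-1}, {0,-1}, {0,-1}, {1,-1}, {1,-1}, {2,-1}, {3,4,5,-1}
    };
    int proc[NTASKS]  = {0,0,1,0,0,1,0};   /* assumed task-to-processor assignment */
    int order[NTASKS] = {0,1,2,3,4,5,6};   /* a valid topological order            */
    int finish[NTASKS], ready[2] = {0,0};  /* earliest free time of each processor */

    for (int i = 0; i < NTASKS; i++) {
        int t = order[i], start = ready[proc[t]];
        for (int j = 0; pred[t][j] != -1; j++) {
            int p = pred[t][j];
            int avail = finish[p] + (proc[p] != proc[t] ? COMM : 0);
            if (avail > start) start = avail;   /* wait for data from predecessors */
        }
        finish[t] = start + COST;
        ready[proc[t]] = finish[t];
    }
    int T2 = 0;
    for (int i = 0; i < NTASKS; i++) if (finish[i] > T2) T2 = finish[i];
    int T1 = NTASKS * COST;                     /* sequential time = 21 */
    printf("T1 = %d, T2 = %d, speedup = %.2f\n", T1, T2, (double)T1 / T2);
    return 0;
}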
Evolution of Computer Architecture
[Diagram: starting from non-pipelined scalar sequential execution, architectures evolved through lookahead and limited I/E (instruction fetch/execute) overlap to functional parallelism, along two paths: multiple functional units, and pipelining (single or multiple issue). Pipelining leads to implicit and explicit vector / data-parallel machines (memory-to-memory and register-to-register vector architectures), and on to the two main parallel classes: SIMD machines (array processors, associative processors, data-parallel computers) and MIMD machines (shared-memory multiprocessors, and message-passing multicomputers including clusters and Massively Parallel Processors, MPPs).]
I/E: Instruction Fetch and Execute.  SIMD: Single Instruction stream over Multiple Data streams.  MIMD: Multiple Instruction streams over Multiple Data streams.
EECC756 - Shaaban
Parallel Architectures History
Historically, parallel architectures were tied to parallel
programming models:
• Divergent architectures, with no predictable pattern of growth.

[Diagram: divergent architecture families, each with its own application software and system software stack: Systolic Arrays, SIMD, Data Parallel Architectures, Dataflow, Message Passing, and Shared Memory.]
More on this next lecture
EECC756 - Shaaban
#32 lec # 1 Spring 2011 3-8-2011
Parallel Programming Models
• Programming methodology used in coding parallel applications
• Specifies: 1- communication and 2- synchronization
• Examples:
   – Multiprogramming, or multi-tasking (not true parallel processing!):
      No communication or synchronization at the program level; a number of independent programs run on different processors in the system. (However, it is a good way to utilize multi-core processors for the masses!)
– Shared memory address space (SAS):
Parallel program threads or tasks communicate implicitly using
a shared memory address space (shared data in memory).
– Message passing:
Explicit point to point communication (via send/receive pairs) is used
between parallel program tasks using messages.
– Data parallel:
More regimented, global actions on data (i.e the same operations over
all elements on an array or vector)
– Can be implemented with shared address space or message
passing.
EECC756 - Shaaban
#33 lec # 1 Spring 2011 3-8-2011
Flynn’s 1972 Classification of Computer Architecture
(Taxonomy) Instruction Stream = Thread of Control or Hardware Context

(a)• Single Instruction stream over a Single Data stream (SISD):


Conventional sequential machines or uniprocessors.
(b)• Single Instruction stream over Multiple Data streams
(SIMD): Vector computers, array of synchronized
processing elements. Data parallel systems
(c)• Multiple Instruction streams and a Single Data stream
(MISD): Systolic arrays for pipelined execution.
(d)• Multiple Instruction streams over Multiple Data streams (MIMD): Parallel computers:
   • Shared-memory multiprocessors (tightly coupled processors).
   • Multicomputers (loosely coupled processors): unshared distributed memory, with message passing used instead (e.g. clusters).
Classified according to the number of instruction streams (threads) and the number of data streams in the architecture.  EECC756 - Shaaban
#34 lec # 1 Spring 2011 3-8-2011
Flynn’s Classification of Computer Architecture
(Taxonomy)
[Diagrams of the four classes. SISD (Single Instruction stream over a Single Data stream): conventional sequential machines or uniprocessors. SIMD (Single Instruction stream over Multiple Data streams): vector computers and arrays of synchronized processing elements; shown here as a control unit (CU) driving an array of processing elements (PEs) with memories (M). MISD (Multiple Instruction streams and a Single Data stream): systolic arrays for pipelined execution. MIMD (Multiple Instruction streams over Multiple Data streams): parallel computers / multiprocessor systems; a distributed-memory multiprocessor system is shown.]
Classified according to the number of instruction streams (threads) and the number of data streams in the architecture.
EECC756 - Shaaban
#35 lec # 1 Spring 2011 3-8-2011
Current Trends In Parallel Architectures
• The extension of "computer architecture" to support communication and cooperation:
   – OLD: Instruction Set Architecture (ISA) — conventional or sequential.
   – NEW: Communication Architecture.

• Defines:
1 – Critical abstractions, boundaries, and primitives
(interfaces).
2 – Organizational structures that implement interfaces
(hardware or software) Implementation of Interfaces

• Compilers, libraries and OS are important bridges today


i.e. software abstraction layers
More on this next lecture EECC756 - Shaaban
#36 lec # 1 Spring 2011 3-8-2011
Modern Parallel Architecture
Layered Framework
[Layered diagram:
   Parallel applications (CAD, database, scientific modeling)
   Programming models (multiprogramming, shared address space, message passing, data parallel)   — user space / software
   Compilation or library
   Communication abstraction   — user/system boundary
   Operating systems support   — system space
   Communication hardware (ISA)   — hardware/software boundary
   Physical communication medium   — hardware: processing nodes & interconnects]

More on this next lecture EECC756 - Shaaban


#37 lec # 1 Spring 2011 3-8-2011
Shared Address Space (SAS) Parallel Architectures
• Any processor can directly reference any memory location (in the shared address space):
   – Communication occurs implicitly as a result of loads and stores.
• Convenient: communication is implicit via loads/stores.

– Location transparency
– Similar programming model to time-sharing in uniprocessors
• Except processes run on different processors
• Good throughput on multiprogrammed workloads i.e multi-tasking

• Naturally provided on a wide range of platforms


– Wide range of scale: few to hundreds of processors
• Popularly known as shared memory machines or model
– Ambiguous: Memory may be physically distributed among
processing nodes. i.e Distributed shared memory multiprocessors

Sometimes called Tightly-Coupled Parallel Computers EECC756 - Shaaban


#38 lec # 1 Spring 2011 3-8-2011
Shared Address Space (SAS) Parallel Programming Model
• Process: virtual address space plus one or more threads of control.
• Portions of the address spaces of processes are shared:
[Figure: the virtual address spaces of processes P0 ... Pn each contain a private portion and a shared portion; the shared portions all map to a common range of the machine physical address space, so a store by one process and a load by another to the same shared address implement communication.]
In SAS: communication is implicit via loads/stores; ordering/synchronization is explicit, using synchronization primitives.

• Writes to a shared address are visible to other threads (in other processes too).
• Natural extension of the uniprocessor model: thus communication is implicit via loads/stores.
• Conventional memory operations are used for communication.
• Special atomic operations (locks, semaphores, etc.) are needed for synchronization, i.e. for event ordering and mutual exclusion; thus synchronization is explicit.
• The OS uses shared memory to coordinate processes.
EECC756 - Shaaban
#39 lec # 1 Spring 2011 3-8-2011
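A minimal shared-address-space sketch using POSIX threads (written for this summary, not taken from the lecture): two threads communicate implicitly by storing to and loading from a shared variable, and synchronize explicitly with a mutex — exactly the SAS division of labor described above. Compile with cc -pthread.

/* SAS model in miniature: implicit communication via loads/stores to a
 * shared location, explicit synchronization via a lock.                 */
#include <pthread.h>
#include <stdio.h>

static long shared_sum = 0;               /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    long my_part = (long)arg;             /* each thread's partial result       */
    pthread_mutex_lock(&lock);            /* explicit synchronization           */
    shared_sum += my_part;                /* implicit communication via a store */
    pthread_mutex_unlock(&lock);
    return NULL;
}

int main(void) {
    pthread_t t[2];
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, worker, (void *)(i + 1));
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);
    printf("shared_sum = %ld\n", shared_sum);   /* a load of the shared location: 3 */
    return 0;
}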
Models of Shared-Memory Multiprocessors
1 • The Uniform Memory Access (UMA) Model:
– All physical memory is shared by all processors.
– All processors have equal access (i.e equal memory
bandwidth and access latency) to all memory addresses.
– Also referred to as Symmetric Memory Processors (SMPs).
2 • Distributed memory or Non-uniform Memory Access
(NUMA) Model:
– Shared memory is physically distributed locally among
processors. Access latency to remote memory is higher.

3 • The Cache-Only Memory Architecture (COMA) Model:


– A special case of a NUMA machine where all distributed
main memory is converted to caches.
– No memory hierarchy at each processor.
EECC756 - Shaaban
#40 lec # 1 Spring 2011 3-8-2011
Models of Shared-Memory Multiprocessors
[Diagrams. 1- Uniform Memory Access (UMA) model, or Symmetric Memory Processors (SMPs): processors (P) with caches ($) and I/O controllers share memory modules (Mem) over an interconnect (bus, crossbar, or multistage network). 2- Distributed memory / Non-Uniform Memory Access (NUMA) model: each node has a processor (P), cache (C) and local memory (M), connected to the other nodes by a network. 3- Cache-Only Memory Architecture (COMA) model: like NUMA, but each node's memory acts as a cache, with a cache directory (D) per node.]
#41 lec # 1 Spring 2011 3-8-2011
Uniform Memory Access (UMA)
Example: Intel Pentium Pro Quad (circa 1997) — a 4-way SMP.
[Diagram: four P-Pro processor modules, each with a CPU, interrupt controller, 256-KB L2 cache and bus interface, share a single P-Pro front-side bus (64-bit data, 36-bit address, 66 MHz) with a memory controller (MIU) driving 1-, 2-, or 4-way interleaved DRAM, plus two PCI bridges with PCI buses and I/O cards.]
• All coherence and multiprocessing glue is in the processor module.
• Highly integrated, targeted at high volume.
• Used as the computing node in Intel's ASCI Red MPP.
Bus-based Symmetric Memory Processors (SMPs): a single Front Side Bus (FSB) is shared among the processors, which severely limits scalability to only ~ 2-4 processors.
EECC756 - Shaaban
#42 lec # 1 Spring 2011 3-8-2011
Non-Uniform Memory Access (NUMA)
Example: AMD 8-way Opteron Server Node
Circa 2003

Dedicated point-to-point interconnects (HyperTransport links) connect the processors, alleviating the traditional limitations of FSB-based SMP systems. Each processor has two integrated DDR memory channel controllers, so memory bandwidth scales up with the number of processors. This is a NUMA architecture, since a processor can access its own memory at a lower latency than remote memory directly connected to the other processors in the system. (Total of 16 processor cores when dual-core Opteron processors are used; 32 cores with quad-core processors.)  EECC756 - Shaaban
#43 lec # 1 Spring 2011 3-8-2011
Distributed Shared-Memory
Multiprocessor System Example:
NUMA MPP Example Cray T3E Circa 1995-1999

MPP = Massively Parallel Processor system.
[Diagram: each Cray T3E node contains a DEC Alpha EV6 microprocessor (COTS) with cache ($), local memory (Mem), and a memory controller with network interface — the communication assist (CA) — attached to an X-Y-Z switch of the 3D torus point-to-point network; external I/O attaches to the network.]
• Scales up to 2048 processors; custom 3D torus point-to-point network with 480 MB/s links.
• The memory controller generates communication requests for non-local references.
• No hardware mechanism for cache coherence (SGI Origin etc. provide this).
• A more recent Cray MPP example: the Cray X1E supercomputer.
Example of Non-Uniform Memory Access (NUMA).  EECC756 - Shaaban


#44 lec # 1 Spring 2011 3-8-2011
Message-Passing Multicomputers
• Comprised of multiple autonomous computers (computing nodes) connected via a suitable network: an industry-standard System Area Network (SAN) or a proprietary network.
• Each node consists of one or more processors, local memory, attached storage and I/O peripherals, and a Communication Assist (CA).
• Local memory is only accessible by the local processors in a node (no shared memory among nodes).
• Inter-node communication is carried out explicitly by message passing through the connection network via send/receive operations; thus communication is explicit.
• Process communication is achieved using a message-passing programming environment (e.g. PVM, MPI), which is portable and platform-independent.
   – The programming model is more removed (abstracted) from basic hardware operations.
• Message-passing multicomputers include:
   1 – A number of commercial Massively Parallel Processor systems (MPPs).
   2 – Computer clusters that utilize commodity off-the-shelf (COTS) components.
Also called Loosely-Coupled Parallel Computers.  EECC756 - Shaaban


#45 lec # 1 Spring 2011 3-8-2011
Message-Passing Abstraction
[Figure: process P, with data X at address X in its local address space, executes Send(X, Q, t): X is the local buffer to transmit, Q the recipient process, t a tag. Process Q, with buffer Y at address Y in its local address space, executes Receive(Y, P, t): Y is the local application storage to receive into, P the sending process, t the matching tag. The recipient blocks (waits) until a matching message is received.]
• Send specifies the buffer to be transmitted and the receiving process; Receive specifies the sending process and the application storage to receive into. Communication is explicit via sends/receives.
• Memory-to-memory copy is possible, but processes must be named.
• Optional tag on the send and matching rule on the receive; the user process names local data and entities in the process/tag space too.
• In the simplest form, the send/receive match achieves implicit pairwise synchronization (an event): ordering of computations according to data dependencies (blocking receive). Thus synchronization is implicit.
• Many possible overheads: copying, buffer management, protection, ...
EECC756 - Shaaban
#46 lec # 1 Spring 2011 3-8-2011
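A minimal sketch of this Send/Receive abstraction using MPI in C (assumed environment: any standard MPI implementation such as MPICH or Open MPI; compile with mpicc, run with mpirun -np 2; the value sent and the tag are arbitrary):

/* Rank 0 plays sender P, rank 1 plays recipient Q. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, tag = 99;                    /* tag t used to match send and receive */
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                       /* sender P: Send(X, Q, t)        */
        int X = 42;
        MPI_Send(&X, 1, MPI_INT, 1, tag, MPI_COMM_WORLD);
    } else if (rank == 1) {                /* recipient Q: Receive(Y, P, t)  */
        int Y;
        MPI_Recv(&Y, 1, MPI_INT, 0, tag, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("Q received Y = %d from P\n", Y);  /* blocks until the message arrives */
    }
    MPI_Finalize();
    return 0;
}

MPI_Recv blocks until a message with the matching source and tag arrives, which provides the implicit pairwise synchronization described above.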
Message-Passing Example:
Intel Paragon (circa early 1990s)
[Diagram: each Intel Paragon node is a 2-way SMP with two i860 processors (each with an L1 cache) on a 64-bit, 50 MHz memory bus, a memory controller with 4-way interleaved DRAM, and a DMA engine, driver and network interface (NI) forming the communication assist (CA). Nodes attach, one per switch, to a 2D grid point-to-point network with 8-bit, 175 MHz, bidirectional links. Sandia's Intel Paragon XP/S-based supercomputer is shown.]
EECC756 - Shaaban
#47 lec # 1 Spring 2011 3-8-2011
Message-Passing Example: IBM SP-2
MPP Circa 1994-1998

[Diagram: each IBM SP-2 node is essentially a complete RS6000 workstation: a POWER2 CPU with L2 cache on a memory bus, a memory controller with 4-way interleaved DRAM, and a MicroChannel I/O bus. The network interface card (NIC), containing an i860 processor, DRAM, DMA and the network interface (NI), sits on the I/O bus and acts as the communication assist (CA), so network bandwidth is limited by the I/O bus. Nodes are connected by a multi-stage network built from 8-port switches.]
MPP = Massively Parallel Processor System.  EECC756 - Shaaban   #48 lec # 1 Spring 2011 3-8-2011
Message-Passing MPP Example:
IBM Blue Gene/L (circa 2005)
(2 processors/chip) × (2 chips/compute card) × (16 compute cards/node board) × (32 node boards/cabinet) × (64 cabinets) = 128K = 131,072 (0.7 GHz PowerPC 440) processors (64K nodes); 2.8 GFLOPS peak per processor core.
System location: Lawrence Livermore National Laboratory.
Networks (both proprietary): a 3D torus point-to-point network and a global tree network.
[Diagram of the packaging hierarchy:
   Chip (2 processors): 2.8/5.6 GF/s, 4 MB
   Compute card (2 chips, 2x1x1): 5.6/11.2 GF/s, 0.5 GB DDR
   Node board (32 chips, 4x4x2; 16 compute cards): 90/180 GF/s, 8 GB DDR
   Cabinet (32 node boards, 8x8x16): 2.9/5.7 TF/s, 256 GB DDR
   System (64 cabinets, 64x32x32): 180/360 TF/s, 16 TB DDR]
Design goals: high computational power efficiency and high computational density per volume.
LINPACK performance: 280,600 GFLOPS = 280.6 TeraFLOPS = 0.2806 PetaFLOPS.
Top peak FP performance: about 367,000 GFLOPS = 367 TeraFLOPS = 0.367 PetaFLOPS.
EECC756 - Shaaban
#49 lec # 1 Spring 2011 3-8-2011
Message-Passing Programming Tools
• Message-passing programming environments include:
   – Message Passing Interface (MPI):
      • Provides a standard for writing concurrent message-passing programs.
      • MPI implementations include parallel libraries used by existing programming languages (C, C++).
   – Parallel Virtual Machine (PVM):
      • Enables a collection of heterogeneous computers to be used as a coherent and flexible concurrent computational resource.
      • PVM support software executes on each machine in a user-configurable pool and provides a computational environment for concurrent applications.
      • User programs written, for example, in C, Fortran or Java are given access to PVM through calls to PVM library routines.
   Both MPI and PVM are examples of the explicit-parallelism approach to parallel programming: they are portable (platform-independent) and allow the user to explicitly specify the parallelism.
EECC756 - Shaaban
#50 lec # 1 Spring 2011 3-8-2011
Data Parallel Systems (SIMD in Flynn's taxonomy)
• Programming model (data parallel):
   – Similar operations are performed in parallel on each element of a data structure.
   – Logically a single thread of control performs sequential or parallel steps.
   – Conceptually, a processor is associated with each data element.
• Architectural model:
   – An array of many simple processors, each with little memory; the processors do not sequence through instructions themselves.
   – Attached to a control processor that issues the instructions; all PEs are synchronized (executing the same instruction or operation in a given cycle).
   – Specialized and general communication, plus global synchronization.
   [Diagram: a control processor broadcasting instructions to a 2D grid of PEs.]
• Example machines:
   – Thinking Machines CM-1, CM-2 (and CM-5)
   – MasPar MP-1 and MP-2
Other data-parallel architectures: vector machines.
PE = Processing Element.  EECC756 - Shaaban
#51 lec # 1 Spring 2011 3-8-2011
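A tiny data-parallel sketch in C (illustrative only, written for this summary): the same operation is applied to every element of an array, which is what a SIMD machine would do in lockstep with one element per PE; here a plain loop expresses that single logical step on a conventional processor.

/* Data-parallel step: c = a + b, one global operation over the whole array. */
#include <stdio.h>
#define N 8

int main(void) {
    float a[N], b[N], c[N];
    for (int i = 0; i < N; i++) { a[i] = (float)i; b[i] = 2.0f * i; }

    for (int i = 0; i < N; i++)     /* conceptually, PE i computes c[i]     */
        c[i] = a[i] + b[i];         /* same operation on every data element */

    for (int i = 0; i < N; i++) printf("%.1f ", c[i]);
    printf("\n");
    return 0;
}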
Dataflow Architectures
• Represent computation as a graph of essential data dependencies — a non-von Neumann architecture (not a program-counter-based architecture).
   – A logical processor at each node, activated by the availability of its operands (data or results).
   – Messages (tokens) carrying the tag of the next instruction are sent to the next processor.
   – The tag is compared with others in a matching store; a match fires execution.
[Figure: the dataflow graph for the example program a = (b + 1) × (b − c); d = c × e; f = a × d — the dependency graph for the entire computation (program) — and the structure of one node: token store and program store feeding waiting-matching, instruction-fetch, execute and form-token stages, with a token queue and network connections for token distribution and token matching. Tokens = copies of computation results.]
• Research dataflow machine prototypes include: the MIT Tagged-Token Dataflow Architecture and the Manchester Dataflow Machine.
• The Tomasulo approach to dynamic instruction execution utilizes a dataflow-driven execution engine:
   – The data dependency graph for a small window of instructions is constructed dynamically as instructions are issued in program order.
   – The execution of an issued instruction is triggered by the availability of its operands (the data it needs) over the CDB.
EECC756 - Shaaban
#52 lec # 1 Spring 2011 3-8-2011
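The slide's example program can be written out directly; the comments in this small C sketch (illustrative, not from the lecture) mark which operations have all their operand tokens available and could therefore fire concurrently on a dataflow machine:

/* Dataflow example: a = (b+1)*(b-c), d = c*e, f = a*d. The C statements are
 * just a sequential rendering of the same dependency graph.               */
#include <stdio.h>

int main(void) {
    double b = 4.0, c = 1.0, e = 3.0;      /* initial tokens (assumed values) */

    /* Level 1: (b+1), (b-c) and (c*e) depend only on inputs -> all can fire at once */
    double t1 = b + 1.0;
    double t2 = b - c;
    double d  = c * e;

    /* Level 2: a fires once tokens t1 and t2 arrive */
    double a = t1 * t2;

    /* Level 3: f fires once tokens a and d arrive */
    double f = a * d;

    printf("a = %g, d = %g, f = %g\n", a, d, f);
    return 0;
}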
Example of Flynn's taxonomy's MISD (Multiple Instruction streams, Single Data stream):
Systolic Architectures
• Replace a single processor with an array of regular processing elements.
• Orchestrate the data flow for high throughput with less memory access.
   [Figure: memory (M) feeding a linear chain of processing elements (PEs) and receiving results back. PE = Processing Element, M = Memory.]
• Different from linear pipelining: nonlinear array structure, multidirectional data flow, and each PE may have (small) local instruction and data memory.
• Different from SIMD: each PE may do something different.
• Initial motivation: VLSI Application-Specific Integrated Circuits (ASICs) — represent algorithms directly by chips connected in a regular pattern.
A possible example of MISD in Flynn's classification of computer architecture.  EECC756 - Shaaban
#53 lec # 1 Spring 2011 3-8-2011
Systolic Array Example: 3x3 Systolic Array Matrix Multiplication (C = A × B)
• Processors are arranged in a 2-D grid; each processor accumulates one element of the product.
• Rows of A enter from the left and columns of B enter from the top, staggered ("aligned") in time so that matching operands meet at the right PE.
• At each time step every PE multiplies the A and B values arriving at its inputs, adds the product to its running sum, and passes the A value to the right and the B value downward.
• [Figures: snapshots at T = 0 through T = 7 trace the wavefront of partial products. C00 = a0,0·b0,0 + a0,1·b1,0 + a0,2·b2,0 is complete at T = 3; C01 and C10 at T = 4; C02, C11 and C20 at T = 5; C12 and C21 at T = 6; and the last element C22 at T = 7, when the whole product is done.]
• On one processor the same computation takes O(n^3) = 27 multiply-accumulate steps; the systolic array finishes in 7 steps, for a speedup of 27/7 ≈ 3.85.
Example source: http://www.cs.hmc.edu/courses/2001/spring/cs156/
EECC756 - Shaaban   #54-61 lec # 1 Spring 2011 3-8-2011
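The following C sketch (written for this summary, with arbitrary matrix values) simulates the same 3x3 systolic schedule: at time step t, PE(i,j) consumes A[i][k] and B[k][j] with k = t − i − j, which reproduces the skewed alignment-in-time of the rows of A and columns of B and finishes all nine accumulations in 7 steps.

/* Time-stepped simulation of the 3x3 systolic matrix multiplication. */
#include <stdio.h>
#define N 3

int main(void) {
    int A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};   /* assumed example inputs */
    int B[N][N] = {{9,8,7},{6,5,4},{3,2,1}};
    int C[N][N] = {{0}};
    int steps = 0;

    for (int t = 0; t <= 3 * (N - 1); t++) {   /* t = 0 .. 6  ->  7 steps */
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++) {
                int k = t - i - j;             /* operand pair reaching PE(i,j) at step t */
                if (k >= 0 && k < N)
                    C[i][j] += A[i][k] * B[k][j];
            }
        steps++;
    }

    printf("Completed in %d steps (vs. %d sequential MACs), speedup = %.2f\n",
           steps, N * N * N, (double)(N * N * N) / steps);
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++) printf("%4d", C[i][j]);
        printf("\n");
    }
    return 0;
}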
