CS6461 Computer Architecture
Fall 2016
Morris Lancaster - Lecturer
Adapted from Professor Stephen Kaisler's Notes
Lecture 10
High Performance Computing:
Multiprocessors
Introduction
So far we have studied uniprocessors (one processor,
possibly pipelined, and one memory) and vector processors
(still one computer, with special functional units).
Performance can be improved further through the use of multiple
processors. Multiprocessors allow true concurrent or parallel
programming, not just multiprogramming.
Idea: create powerful computers by connecting many
smaller ones
good news: works for timesharing (better than supercomputer)
bad news: it's really hard to write good concurrent programs;
many commercial failures
Ref: Introduction to Parallel Processing: Algorithms and
Architectures, by Behrooz Parhami
Paradigms
Parallel Computing
Simultaneous use of multiple processors - all components of a
single architecture - to solve a task. Typically the processors are
identical and serve a single user (even if the machine is multi-user).
Distributed Computing
Use of a network of processors, each capable of being viewed
as a computer in its own right, to solve a problem. Processors
may be heterogeneous and multi-user, and usually individual
tasks are assigned to individual processors.
Concurrent Computing
Both of the above
Types of Parallelism
For A Given Problem
Speedup
If we can do some computations in parallel, then we
can attain a speedup over sequential execution:
Speedup = Tsequential/Tparallel
Amdahl (Gene) defined the speedup for a parallel
processor as:
S = 1/(f + (1-f)/p)
where: p = # processors; f = fraction of
unparallelizable code
So, if f = 10%, the speedup can be no greater than
10!
With p = 10, S = 1/(0.1 + 0.9/10) ~= 5.3
With p = infinity, S = 1/(0.1 + 0.9/infinity) = 10
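A minimal Python sketch (not from the original slides) of Amdahl's formula, reproducing the f = 10% example above:

  # Amdahl's law: S = 1 / (f + (1 - f) / p), where f is the serial
  # (unparallelizable) fraction and p is the number of processors.
  def amdahl_speedup(f, p):
      return 1.0 / (f + (1.0 - f) / p)

  print(amdahl_speedup(0.1, 10))      # ~5.26 with 10 processors
  print(amdahl_speedup(0.1, 10**9))   # approaches the 1/f = 10 limit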
Efficiency
Efficiency is the ratio of speedup to p: 5.3/10 ~= 53%
Ignores the possibility of new algorithms with a much smaller f
Ignores the possibility that more of the program is run from higher-
speed memory, such as registers, cache, or main memory
Often, the problem is scaled with the number of processors
f is a function of program size and may decrease as the problem grows
serial code may take constant time, independent of size
(George Michael 80/20 vs 20/80 rule)
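A small self-contained sketch of the efficiency ratio for the same f = 0.1, p = 10 example (illustrative only):

  # Efficiency E = S / p, with Amdahl's speedup S = 1 / (f + (1 - f) / p).
  def efficiency(f, p):
      s = 1.0 / (f + (1.0 - f) / p)
      return s / p

  print(efficiency(0.1, 10))   # ~0.53, i.e., about 53%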
Multiprocessor Example
Compare performance of a single Motorola 68020 with ten
68020s coupled as a multiprocessor
Consider the task of adding 100 numbers in memory
The ADD.W <ea>,Dn op takes 4 clock cycles. Thus, a single
68020 would take 400 clock cycles to add 100 numbers one at
a time
Multiprocessor works as follows:
a. All 10 uPs add their first 10 numbers each = 40 cycles
b. 5 uPs add the 10 partial sums pairwise = 4 cycles
c. 2 uPs add 4 of the 5 partial sums = 4 cycles
d. One uP adds the three remaining partial sums (two adds) = 8 cycles
Total: 56 cycles!
Performance Improvement: 7.14 (not 10 due to overhead)
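A hedged Python sketch of the cycle counting above (the 4-cycles-per-add figure and the reduction schedule come from the slide; the helper itself is illustrative only):

  # Reproduce the slide's cycle count for adding 100 numbers on ten 68020s.
  CYCLES_PER_ADD = 4

  sequential = 100 * CYCLES_PER_ADD   # 400 cycles on one CPU

  parallel = 0
  parallel += 10 * CYCLES_PER_ADD     # step a: 10 CPUs x 10 adds each, in parallel
  parallel += 1 * CYCLES_PER_ADD      # step b: 5 CPUs combine the 10 partial sums
  parallel += 1 * CYCLES_PER_ADD      # step c: 2 CPUs combine 4 of the 5 sums
  parallel += 2 * CYCLES_PER_ADD      # step d: 1 CPU adds the last 3 sums (2 adds)

  print(parallel)                     # 56 cycles
  print(sequential / parallel)        # speedup ~7.14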
Limitations
Software Inertia: Billions of dollars' worth of
FORTRAN, C/C++, and COBOL software exists.
Who will rewrite it?
Into what programming language?
Many programmers use multicore computers, but few
have direct experience writing parallel programs.
Who will retrain them?
What about languages which are becoming
obsolete: Tcl, other scripting languages, etc.?
The Path to PetaFLOPS
Michael Flynn's Hardware Taxonomy
I: Instruction Stream D: Data Stream
SI: Single Instruction Stream (a)
All processors execute the same instruction in the same cycle
Instruction may be conditional
in multiprocessors, a control processor issues the instruction
MI: Multiple Instruction Stream (c)
Different processors may be simultaneously executing different instructions
SD: Single Data Stream (d)
All processors operate on the same data item (e.g., copies of it) at the
same time
MD: Multiple Data Stream (b)
Different processors may be simultaneously operating on different data
items
Example: multiplying a coefficient vector by a data vector (e.g., in filtering):
y[i] := c[i] * x[i], 0 <= i < n
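A small illustrative sketch of this data-parallel operation expressed as a single whole-array operation, using NumPy as an assumed stand-in for hardware data parallelism (not something named in the slides):

  import numpy as np

  c = np.array([2.0, 0.5, 1.5, 3.0])   # coefficient vector
  x = np.array([1.0, 4.0, 2.0, 1.0])   # data vector

  # One logical operation applied to every element pair at once,
  # which is the essence of the SIMD / data-parallel model.
  y = c * x
  print(y)                              # [2. 2. 3. 3.]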
Taxonomy Visual
SISD
This model has been the subject of previous lectures.
SIMD
Cray-1
This was the topic of lecture 9.
SIMD
Execute one operation on multiple data streams
concurrency in time: vector processing
concurrency in space: array processing
Thinking Machines, Inc. Connection Machines: CM-1 (SIMD), CM-5 (MIMD)
Thinking Machines CM-1
MISD
CMU/GE WARP
Example: WARP Systolic Array
Designed by H.T. Kung at CMU (now at Harvard)
Built by General Electric for me for the DARPA
Strategic Computing Program (ca. 1986-1987)
1987: 100 MFLOPS for $300,000; about 30 times
cheaper than a Cray-1 (also 100 MFLOPS) at $10M
Limited programming models, however.
MIMD
Processor Coupling
Tightly Coupled System
Tasks and/or processors communicate in a highly
synchronized fashion
Communication is through a common shared memory
Shared memory system
Loosely Coupled System
Tasks or processors do not communicate in a
synchronized fashion
Communication is by message-passing packets
Overhead for data exchange is high
Distributed memory system
Granularity of Parallelism
Coarse-grain
A task is broken into a handful of pieces, each of which is executed by
a powerful processor
Processors may be heterogeneous
Computation/communication ratio is very high
Example: BBN Butterfly
Medium-grain
Tens to a few thousand processors, typically running the same code
Computation/communication ratio is often hundreds or more
Intel Paragon XP, Touchstone Series
Fine-grain
Thousands to perhaps millions of small pieces, executed by very small,
simple processors or through pipelines
Processors typically have instructions broadcast to them
Compute/communicate ratio often near unity
Example: Thinking Machines CM-1, CM-2, CM-200
Intel Paragon XP/S 140 Supercomputer
Memory Architectures
Shared (Global) Memory
A Global Memory Space accessible by all processors
Processors may also have some local memory
Distributed (Local, Message-Passing) Memory
All memory units are associated with processors
To retrieve information from another processor's memory, a
message must be sent there
Uniform Memory: all processors take the same time
to reach all memory locations
Non-Uniform Memory (NUMA): access time varies with which
shared memory location a processor references
Shared Memory Multiprocessors
Characteristics
All processors have equally direct access to one large memory address
space
Example systems
Bus and cache-based systems: Sequent Balance, Encore Multimax
Multistage interconnection network (IN)-based systems: Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem
Centralized Shared vs. Distributed Memory
Large caches single memory serves Pro: reduces latency of local memory
small number of processes (up to 16 or accesses
so) Con: communicating data between
More processors introduces more processors becomes more complex
contention
Message Passing Multiprocessors
Characteristics
Interconnected computers
Each processor has its own memory, and communicates via message passing
Messages are variable-length data containers
Example systems
Tree structure: Teradata, DADO
Mesh-connected: Rediflow, Series 2010, J-Machine
Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
Message Passing Multiprocessors
One-to-one communication: one source, one destination
Collective communication
One-to-many: multicast, broadcast (one-to-all), scatter
Many-to-one: combine (fan-in), global combine, gather
Many-to-many: all-to-all broadcast (gossiping), scatter-gather
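For illustration only, a hedged Python sketch of these communication patterns using mpi4py (an assumed library choice, not one named in the slides):

  # Run with e.g.: mpiexec -n 4 python collectives.py
  from mpi4py import MPI

  comm = MPI.COMM_WORLD
  rank = comm.Get_rank()
  size = comm.Get_size()

  # One-to-all (broadcast): the root sends the same value to every process.
  value = comm.bcast(42 if rank == 0 else None, root=0)

  # One-to-many (scatter): the root hands each process its own chunk.
  chunks = [[i, i + 1] for i in range(size)] if rank == 0 else None
  my_chunk = comm.scatter(chunks, root=0)

  # Many-to-one (gather / combine): the root collects one result per process.
  partial = sum(my_chunk)
  totals = comm.gather(partial, root=0)

  if rank == 0:
      print(value, totals)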
Message Routing
Non-Uniform Memory Access
All memories can be addressed by all processors, but access to a processor's own
local memory is faster than access to another processor's remote memory.
Looks like a distributed-memory machine, but the interconnection network is usually
custom-designed switches and/or buses.
CC-NUMA: Cache Coherent NUMA
Exemplified by Kendall Square Research KSR1
Operating System Support
Individual OS
Each CPU has its own OS
Statically allocate physical memory to each CPU
Each CPU runs its own independent OS
Share peripherals
Each CPU handles its processes' system calls
Used in early multiprocessor systems
Simple to implement
Avoids concurrency issues by not sharing
Operating System Support
Individual OS Issues:
Each processor has its own scheduling queue.
Each processor has its own memory partition.
Consistency is an issue with independent disk buffer caches and
potentially shared files.
Operating System Support
Master-Slave Multiprocessors
OS mostly runs on a single fixed CPU.
User-level applications run on the other CPUs.
All system calls are passed to the Master CPU for
processing
Very little synchronization required
Simple to implement
Single centralized scheduler to keep all processors busy
Memory can be allocated as needed to all CPUs.
Issues: Master CPU becomes the bottleneck.
Operating System Support
Master-Slave Multiprocessors Issues:
Master CPU becomes the bottleneck.
Operating System Support
OS kernel runs on all processors, while load and resources are
balanced between all processors.
One alternative: a single mutex (mutual exclusion object) that makes
the entire kernel one large critical section; only one CPU can be in the
kernel at a time; only slightly better than master-slave
Better alternative: identify independent parts of the kernel and make
each of them its own critical section, which allows parallelism in the
kernel (see the sketch below)
Issues: a difficult task; the code is mostly similar to uniprocessor code;
the hard part is identifying independent parts that don't interfere with
each other
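A user-level analogy in Python (an illustrative sketch, not actual kernel code): a single "big lock" serializes everything, while per-resource locks let independent work proceed in parallel.

  import threading

  # One coarse-grained lock: only one thread can be "in the kernel" at a time.
  big_kernel_lock = threading.Lock()

  def syscall_coarse(work):
      with big_kernel_lock:
          work()

  # Finer granularity: independent subsystems get their own critical sections.
  scheduler_lock = threading.Lock()
  filesystem_lock = threading.Lock()

  def schedule(work):
      with scheduler_lock:      # does not block file-system callers
          work()

  def write_file(work):
      with filesystem_lock:     # does not block scheduler callers
          work()

  schedule(lambda: print("scheduling"))
  write_file(lambda: print("writing"))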
Interconnection Topologies
Shared Bus:
M3 wishes to communicate with S5
[1] M3 sends signals (the address) on the bus that cause S5 to
respond
[2] M3 sends data to S5 or S5 sends data to M3 (determined by
the command line)
Master Device: Device that initiates and controls the
communication
Slave Device: Responding device
Multiple-master buses: Bus conflict requires bus arbitration
Interconnection Topologies
Shared Bus:
All processors (and memory) are connected to a
common bus or busses
Memory access is fairly uniform, but not very scalable
A collection of signal lines that carry module-to-module
communication
Data highways connecting several digital system
elements
Can handle only one data transmission at a time
Can be easily expanded by connecting additional
processors to the shared bus, along with the
necessary bus arbitration circuitry
Interconnection Topologies
Mesh Architecture:
Diameter of an m x m mesh = 2(m - 1)
In general, an n-dimensional mesh
with p nodes has
diameter = n(p^(1/n) - 1)
Diameter can be halved by
having wrap-around
connections (=> Torus)
Ring is a 1-dimensional
mesh with wrap-around
connections
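A small Python sketch of these diameter formulas, assuming a symmetric mesh with p = m^n nodes (illustrative only):

  # Diameter of an n-dimensional mesh with p nodes (m = p**(1/n) per side),
  # and of the corresponding torus (wrap-around roughly halves each dimension).
  def mesh_diameter(p, n):
      m = round(p ** (1.0 / n))
      return n * (m - 1)

  def torus_diameter(p, n):
      m = round(p ** (1.0 / n))
      return n * (m // 2)

  print(mesh_diameter(64, 2))   # 8x8 mesh  -> 14
  print(torus_diameter(64, 2))  # 8x8 torus -> 8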
Mesh-Type Interconnects
Tree-Type Interconnects
Example: Mesh Matrix Multiplication
Interconnection Topologies
Multiport Memory:
Has multiple sets of address, data, and control pins to allow
simultaneous data transfers to occur
CPU and DMA controller can transfer data concurrently
A system with more than one CPU could handle simultaneous
requests from two different processors
Does not scale well because of the explosion in the number of busses
Memory Module Control Logic
Each memory module has control logic
Resolves memory module conflicts via fixed priority among CPUs
Handles requests to read from and write to the same memory
location simultaneously
Advantages
Multiple paths -> high transfer rate
Disadvantages
Multiple copies of memory control logic
Large number of connections
Interconnection Topologies
Crossbar Switch:
Processors (p) and memory banks (b) are connected to routing switches,
as in a telephone system
Switches might have queues (combining logic), which improve
functionality but increase latency
Switch settings may be determined by message headers or preset
by a controller
Connections can be packet-switched or circuit-switched (remain
connected as long as needed)
Nonblocking switch: the connection of a processing node to a memory
bank does not block the connection of any other processing node to
other memory banks
Older versions used circuit switching, where a dedicated path was
created for the duration of the communication
More recently, packet switching has been used across the interstitial
nodes
Requires p*b switches
Examples of machines that employ crossbars include the Sun Ultra
HPC 10000 and the Fujitsu VPP500
Butterfly Network
An example of blocking in an omega network: one of the messages
(010 to 111 or 110 to 100) is blocked at link AB.
Butterfly Network
Butterfly Routing
Interconnection Topologies
Omega Network (Butterfly Network)
Consists of log2(p) stages, where p is the number of inputs (processing
nodes) and also the number of outputs (memory banks)
Each stage consists of an interconnection pattern that connects
p inputs to p outputs:
Perfect shuffle (left rotation of the address bits):
j = 2i, for 0 <= i <= p/2 - 1
j = 2i + 1 - p, for p/2 <= i <= p - 1
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the
outputs
Cross-over connection: the inputs to the switching node are crossed
over and then sent out
Has p*(log2 p)/2 switching nodes: if p = 8, nodes = 12
Much better than a crossbar using p*p = 64 switches
The cost of such a network grows as Theta(p log p)
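A hedged Python sketch of the perfect-shuffle pattern and switch count described above (helper names are illustrative):

  import math

  # Perfect shuffle: j = 2i for 0 <= i <= p/2 - 1, j = 2i + 1 - p otherwise,
  # which is a left rotation of the binary input label.
  def perfect_shuffle(i, p):
      return 2 * i if i < p // 2 else 2 * i + 1 - p

  p = 8
  print([perfect_shuffle(i, p) for i in range(p)])   # [0, 2, 4, 6, 1, 3, 5, 7]

  # Switch count for an omega network: p/2 switches per stage, log2(p) stages.
  stages = int(math.log2(p))
  print(stages * p // 2)                              # 12 switches for p = 8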
Interconnection Topologies
Omega Network (Butterfly Network)
Omega network has self-routing property
The path for a cell to take to reach its destination can be
determined directly from its routing tag (i.e., destination port id)
Stage k of the network looks at bit k of the tag
If bit k is 0, then send cell out upper port
If bit k is 1, then send cell out lower port
Works for every possible input port (really!)
Route from any input x to output y by selecting links determined by
successive d-ary digits of y's label.
This process is reversible; we can route from output y back to x by
following the links determined by successive digits of x's label.
This self-routing property allows for simple hardware-based routing
of cells.
[Figure: self-routing example tracing the switch-by-switch label substitution from input x = x(k-1) ... x0 to output y = y(k-1) ... y0, one destination digit per stage]
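A sketch of the self-routing rule in Python (illustrative only; the bit ordering examined at each stage depends on the particular network, here taken most-significant bit first):

  # Self-routing in a log2(p)-stage network: each stage inspects one bit of
  # the destination tag; 0 -> upper output port, 1 -> lower output port.
  def route(dest, num_stages):
      ports = []
      for k in range(num_stages - 1, -1, -1):   # examine bits MSB to LSB
          bit = (dest >> k) & 1
          ports.append("upper" if bit == 0 else "lower")
      return ports

  print(route(0b101, 3))   # ['lower', 'upper', 'lower'] for destination 101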
Interconnection Topologies
Hypercube:
Processors are directly connected to only certain other
processors and data must traverse multiple hops to get to
additional processors
Usually distributed memory
Hardware may handle only single hops, or multiple hops
Software may mask hardware limitations
Latency is related to graph diameter, among many other
factors
Usually NUMA, nonblocking, scalable, upgradeable
Examples: Ring, Mesh, Torus, Hypercube, Binary Tree
p = 2^n processors, n >= 0
Processors are conceptually at the corners of an n-
dimensional hypercube, and each is directly connected to
its n neighboring nodes
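A small Python sketch of hypercube addressing, assuming nodes labeled 0 .. 2^n - 1 (helper names are illustrative): neighbors differ in exactly one bit, and a message travels one hop per differing bit.

  # n-dimensional hypercube: node labels are n-bit numbers; flipping one bit
  # gives a directly connected neighbor, so the diameter is n hops.
  def neighbors(node, n):
      return [node ^ (1 << d) for d in range(n)]

  def hypercube_route(src, dst, n):
      path, cur = [src], src
      for d in range(n):                 # correct one differing bit per hop
          if (cur ^ dst) & (1 << d):
              cur ^= (1 << d)
              path.append(cur)
      return path

  print(neighbors(0b000, 3))             # [1, 2, 4]
  print(hypercube_route(0b000, 0b101, 3))  # [0, 1, 5]: two hops for two differing bits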
Interconnection Topologies
Interconnection Topologies
Red Storm - Cray Inc. (2003)
Red Storm Overview
Red Storm Network
Red Storm Processor Board
Growth Over Thirty Years
What Have We Learned Over 30 Years?
Building general-purpose parallel machines is a very
difficult task.
Proof by contradiction:
Many companies have gone bankrupt or left the parallel machine
market
Even harder is developing general parallel programming
schemes
Still an art rather than a science
Additional Material
SIMD => Data Level Parallelism