CS6461 - Computer Architecture Fall 2016: Morris Lancaster - Lecturer

This document discusses high performance computing using multiprocessors. It begins by introducing multiprocessors as systems with multiple processors that allow for true concurrent programming. It then covers different paradigms of parallel computing including parallel, distributed, and concurrent computing. The document discusses different types of parallelism that can be exploited in problems including data, task, and pipelined parallelism. It also covers concepts like speedup, efficiency, and Amdahl's law. Examples are provided of multiprocessor architectures like shared memory and distributed memory systems.


CS6461 Computer Architecture

Fall 2016
Morris Lancaster - Lecturer
Adapted from Professor Stephen Kaisler's Notes

Lecture 10
High Performance Computing:
Multiprocessors
Introduction

So far, we have studied uniprocessors - one processor, possibly pipelined, and one memory - and vector processors, still one computer with special functional units.
Performance can be improved through the use of multiple processors. Multiprocessors go beyond multiprogramming: they allow true concurrent or parallel programming.

Idea: create powerful computers by connecting many smaller ones
Good news: it works for timesharing (better than a supercomputer)
Bad news: it's really hard to write good concurrent programs; many commercial failures
Ref: Introduction to Parallel Processing: Algorithms and Architectures, by Behrooz Parhami



Paradigms

Parallel Computing
Simultaneous use of multiple processors - all components of a single architecture - to solve a task. Typically the processors are identical and serve a single user (even if the machine is multi-user).

Distributed Computing
Use of a network of processors, each capable of being viewed
as a computer in its own right, to solve a problem. Processors
may be heterogeneous, multi-user, and usually individual tasks
are assigned to individual processors.

Concurrent Computing
Both of the above



Types of Parallelism



For A Given Problem



Speedup

If we can do some computations in parallel, then we can attain a speedup over sequential execution:
Speedup = Tsequential / Tparallel
Amdahl (Gene) defined the speedup for a parallel processor as:
S = 1 / (f + (1 - f)/p)
where p = number of processors and f = fraction of unparallelizable code.
So, if f = 10%, the speedup can be no greater than 10!
With p = 10: S = 1/(0.1 + 0.9/10) ~= 5.3
With p = infinity: S = 1/(0.1 + 0.9/infinity) = 10
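A minimal sketch of Amdahl's law exactly as defined above (plain Python, no library assumptions), reproducing the p = 10 case and the limiting 1/f ceiling:

```python
def amdahl_speedup(f, p):
    """Amdahl's law: f = serial (unparallelizable) fraction, p = processors."""
    return 1.0 / (f + (1.0 - f) / p)

f = 0.10
print(amdahl_speedup(f, 10))         # ~5.26 with 10 processors
print(amdahl_speedup(f, 1_000_000))  # approaches the 1/f = 10 ceiling
print(amdahl_speedup(f, 10) / 10)    # efficiency with p = 10: ~0.53
```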



Efficiency

Efficiency is the ratio of speedup to p: 5.3/10 ~= 53%
This analysis:
Ignores the possibility of new algorithms with a much smaller f
Ignores the possibility that more of a program is run from higher-speed memory, such as registers, cache, or main memory
Often the problem is scaled with the number of processors
f is a function of the size of the program, and may decrease as the problem grows
Serial code may take constant time, independent of size
(George Michael: the 80/20 vs. 20/80 rule)



Multiprocessor Example

Compare the performance of a single Motorola 68020 with ten 68020s coupled as a multiprocessor.
Consider the task of adding 100 numbers in memory.
The ADD.W <ea>,Dn op takes 4 clock cycles. Thus, a single 68020 would take 400 clock cycles to add the 100 numbers one at a time.
The multiprocessor works as follows:
a. All 10 uPs add 10 numbers each = 40 cycles
b. 5 uPs add pairs of the 10 partial sums = 4 cycles
c. 2 uPs add 4 of the 5 partial sums = 4 cycles
d. One uP adds the remaining 3 partial sums (two additions) = 8 cycles
Total: 56 cycles!
Performance improvement: 400/56 ~= 7.14 (not 10, due to overhead)
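A minimal sketch that reproduces the 56-cycle count above, assuming the slide's cost model of 4 cycles per ADD and a pairwise reduction of the partial sums (which gives the same total as the schedule listed):

```python
CYCLES_PER_ADD = 4  # ADD.W <ea>,Dn on the 68020, per the slide's cost model

def sequential_cycles(n=100):
    return n * CYCLES_PER_ADD                       # 400 cycles

def parallel_cycles(n=100, procs=10):
    # Stage a: each processor accumulates n/procs numbers in parallel.
    cycles = (n // procs) * CYCLES_PER_ADD          # 10 adds -> 40 cycles
    # Reduction: pair up partial sums until one remains; each round
    # costs one ADD (4 cycles) on the critical path.
    sums = procs
    while sums > 1:
        sums = (sums + 1) // 2
        cycles += CYCLES_PER_ADD
    return cycles                                   # 56 cycles

seq, par = sequential_cycles(), parallel_cycles()
print(seq, par, seq / par)   # 400 56 ~7.14
```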



Limitations

Software Inertia: billions of dollars' worth of FORTRAN, C/C++, and COBOL software exists.
Who will rewrite it?
Into what programming language?
Many programmers have experience using multicore computers, but to the extent that they program, they have no direct experience with parallel programming.
Who will retrain them?
What about languages which are becoming obsolete: Tcl, other scripting languages, etc.?



The Path to PetaFLOPS



Michael Flynn's Hardware Taxonomy

I: Instruction Stream; D: Data Stream

SI: Single Instruction Stream (a)
All processors execute the same instruction in the same cycle
The instruction may be conditional
In multiprocessors, a control processor issues the instruction
MI: Multiple Instruction Streams (c)
Different processors may be simultaneously executing different instructions
SD: Single Data Stream (d)
All processors operate on the same data item (e.g., copies of it) at the same time
MD: Multiple Data Streams (b)
Different processors may be simultaneously operating on different data items
Example: multiplying a coefficient vector by a data vector (e.g., in filtering):
y[i] := c[i] * x[i], for 0 <= i < n
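A small sketch of the data-parallel operation above, written here with NumPy (an assumption for illustration; any array language or SIMD hardware expresses the same single-operation, multiple-data idea):

```python
import numpy as np

n = 8
c = np.arange(1, n + 1, dtype=np.float64)   # coefficient vector
x = np.ones(n, dtype=np.float64)            # data vector

# One logical operation applied to all n elements at once:
# y[i] = c[i] * x[i] for 0 <= i < n
y = c * x
print(y)
```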



Taxonomy Visual



SISD

This model has been the subject of previous lectures.



SIMD

Cray-1: this was the topic of Lecture 9.


SIMD

Execute one operation on multiple data streams
Concurrency in time: vector processing
Concurrency in space: array processing
(Pictured: Thinking Machines, Inc. Connection Machines - CM-1 (SIMD) and CM-5 (MIMD))

Thinking Machines CM-1



MISD



CMU/GE WARP

1987: 100 MFLOPS for $300,000; about 30 times cheaper than a Cray-1 (also 100 MFLOPS) at $10M.

Limited programming models, however.

Example: the WARP systolic array
Designed by H.T. Kung at CMU (now at Harvard)
Built by General Electric for me for the DARPA Strategic Computing Program (ca. 1986-1987)
MIMD



Processor Coupling

Tightly Coupled System
Tasks and/or processors communicate in a highly synchronized fashion
They communicate through a common shared memory
Shared memory system

Loosely Coupled System
Tasks or processors do not communicate in a synchronized fashion
They communicate by passing message packets
Overhead for data exchange is high
Distributed memory system



Granularity of Parallelism

Coarse-grain
A task is broken into a handful of pieces, each of which is executed by
a powerful processor
Processors may be heterogeneous
Computation/communication ratio is very high
Example: BBN Butterfly
Medium-grain
Tens to a few thousand processors, typically running the same code
Computation/communication ratio is often hundreds or more
Intel Paragon XP, Touchstone Series
Fine-grain
Thousands to perhaps millions of small pieces, executed by very small,
simple processors or through pipelines
Processors typically have instructions broadcast to them
Compute/communicate ratio often near unity
Example: Thinking Machines CM-1, CM-2, CM-200



Intel Paragon XP/S 140 Supercomputer



Memory Architectures

Shared (Global) Memory
A global memory space accessible by all processors
Processors may also have some local memory
Distributed (Local, Message-Passing) Memory
All memory units are associated with processors
To retrieve information from another processor's memory, a message must be sent to it
Uniform Memory Access: all processors take the same time to reach all memory locations
Non-Uniform Memory Access (NUMA): memory access time varies with which processor accesses which part of shared memory



Shared Memory Multiprocessors

Characteristics
All processors have equally direct access to one large memory address
space
Example systems
Bus and cache-based systems: Sequent Balance, Encore Multimax
Multistage interconnection network (IN)-based systems: Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem



Centralized Shared vs. Distributed Memory

Centralized shared memory: large caches; a single memory serves a small number of processors (up to 16 or so); more processors introduce more contention.
Distributed memory: Pro: reduces latency of local memory accesses. Con: communicating data between processors becomes more complex.



Message Passing Multiprocessors

Characteristics
Interconnected computers
Each processor has its own memory, and communicates via message passing
Messages are variable-length data containers
Example systems
Tree structure: Teradata, DADO
Mesh-connected: Rediflow, Series 2010, J-Machine
Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III



Message Passing Multiprocessors

One-to-one communication: one source, one destination


Collective communication
One-to-many: multicast, broadcast (one-to-all), scatter
Many-to-one: combine (fan-in), global combine, gather
Many-to-many: all-to-all broadcast (gossiping), scatter-gather
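A sketch of several of these collective patterns using mpi4py (an assumption for illustration; the slides do not prescribe a library, but MPI is the usual vehicle for message-passing collectives):

```python
# Run with, e.g.: mpiexec -n 4 python collectives.py  (hypothetical file name)
from mpi4py import MPI

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# One-to-all: broadcast a value from rank 0 to every rank.
value = comm.bcast(42 if rank == 0 else None, root=0)

# One-to-many: scatter one chunk of a list to each rank.
chunks = [[i, i * i] for i in range(size)] if rank == 0 else None
mine = comm.scatter(chunks, root=0)

# Many-to-one: gather each rank's partial result back to rank 0.
partial = sum(mine)
totals = comm.gather(partial, root=0)

# Many-to-one combine (fan-in): a global reduction, e.g. a sum.
grand_total = comm.reduce(partial, op=MPI.SUM, root=0)

if rank == 0:
    print(value, totals, grand_total)
```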



Message Routing



Non-Uniform Memory Access

All memories can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory.
Looks like a distributed-memory machine, but the interconnection network is usually custom-designed switches and/or buses.
CC-NUMA: Cache-Coherent NUMA
Exemplified by the Kendall Square Research KSR1



Operating System Support

Individual OS
Each CPU has its own OS
Physical memory is statically allocated to each CPU
Each CPU runs its own independent OS
Peripherals are shared
Each CPU handles its own processes' system calls
Used in early multiprocessor systems
Simple to implement
Avoids concurrency issues by not sharing



Operating System Support

Individual OS Issues:
Each processor has its own scheduling queue.
Each processor has its own memory partition.
Consistency is an issue with independent disk buffer caches and
potentially shared files.



Operating System Support

Master-Slave Multiprocessors
The OS mostly runs on a single fixed CPU (the master).
User-level applications run on the other CPUs.
All system calls are passed to the master CPU for processing.
Very little synchronization is required.
Simple to implement.
A single centralized scheduler keeps all processors busy.
Memory can be allocated as needed to all CPUs.
Issue: the master CPU becomes the bottleneck.





Operating System Support

Symmetric Multiprocessing (SMP)
The OS kernel runs on all processors, and load and resources are balanced across all processors.
One alternative: a single mutex (mutual exclusion object) that makes the entire kernel one large critical section; only one CPU can be in the kernel at a time; only slightly better than master-slave.
Better alternative: identify independent parts of the kernel and make each of them its own critical section, which allows parallelism in the kernel.
Issues: this is a difficult task; the code is mostly similar to uniprocessor code; the hard part is identifying independent parts that don't interfere with each other.
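A toy illustration of the two alternatives using Python threads (not kernel code, and not any particular OS's locking scheme): one big lock serializes everything, while per-subsystem locks let independent work proceed in parallel.

```python
import threading

# Alternative 1: one "big kernel lock" -- every kernel-ish operation
# serializes on the same mutex, even if the operations are independent.
big_kernel_lock = threading.Lock()

def syscall_big_lock(resource, work):
    with big_kernel_lock:
        work(resource)

# Alternative 2: fine-grained locking -- each independent subsystem
# (scheduler, file system, network, ...) is its own critical section.
subsystem_locks = {
    "scheduler": threading.Lock(),
    "filesystem": threading.Lock(),
    "network": threading.Lock(),
}

def syscall_fine_grained(subsystem, resource, work):
    with subsystem_locks[subsystem]:
        work(resource)

# With fine-grained locking, a "filesystem" call and a "network" call
# issued from different CPUs can execute concurrently; with the big
# kernel lock they cannot.
```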
Interconnection Topologies

Shared Bus:
M3 wishes to communicate with S5
[1] M3 sends signals (an address) on the bus that cause S5 to respond
[2] M3 sends data to S5, or S5 sends data to M3 (determined by the command line)
Master device: the device that initiates and controls the communication
Slave device: the responding device
Multiple-master buses: bus conflicts require bus arbitration
Interconnection Topologies

Shared Bus:
All processors (and memory) are connected to a
common bus or busses
Memory access is fairly uniform, but not very scalable
A collection of signal lines that carry module-to-module
communication
Data highways connecting several digital system
elements
Can handle only one data transmission at a time
Can be easily expanded by connecting additional
processors to the shared bus, along with the
necessary bus arbitration circuitry
Interconnection Topologies

Mesh Architecture:
Diameter of an m x m mesh = 2(m - 1)
In general, an n-dimensional mesh with p processors has diameter = n(p^(1/n) - 1)
Diameter can be halved by adding wrap-around connections (=> Torus)
A ring is a 1-dimensional mesh with wrap-around connections
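A small sketch that just evaluates the diameter formulas above (assuming p = m^n processors with m per dimension, and that wrap-around roughly halves each dimension's worst-case distance):

```python
def mesh_diameter(p, n):
    """Diameter of an n-dimensional mesh with p = m**n processors."""
    m = round(p ** (1.0 / n))      # processors per dimension
    return n * (m - 1)

def torus_diameter(p, n):
    """Wrap-around links halve each dimension's maximum distance."""
    m = round(p ** (1.0 / n))
    return n * (m // 2)

print(mesh_diameter(64, 2))   # 8x8 mesh  -> 14
print(torus_diameter(64, 2))  # 8x8 torus -> 8
```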



Mesh-Type Interconnects



Tree-Type Interconnects



Example: Mesh Matrix Multiplication



Interconnection Topologies
Multiport Memory:
Has multiple sets of address, data, and control pins to allow simultaneous data transfers to occur
A CPU and a DMA controller can transfer data concurrently
A system with more than one CPU can handle simultaneous requests from two different processors
Does not scale well because of the explosion in the number of buses
Memory module control logic:
Each memory module has its own control logic
Memory module conflicts are resolved via a fixed priority among the CPUs
This includes simultaneous requests to read from and write to the same memory location
Advantages: multiple paths -> high transfer rate
Disadvantages: multiple copies of memory control logic; large number of connections
(Figure: memory modules MM1-MM4, each connected to CPUs 1-4.)



Interconnection Topologies
Crossbar Switch:
p processors and b memory banks are connected through routing switches, as in a telephone system
Switches might have queues (combining logic), which improve functionality but increase latency
Switch settings may be determined by message headers or preset by a controller
Connections can be packet-switched or circuit-switched (a circuit remains connected as long as it is needed)
Nonblocking switch: the connection of a processing node to a memory bank does not block the connection of any other processing node to other memory banks
Older versions used circuit switching, where a dedicated path was created for the duration of the communication
More recently, packet switching has been used across the interstitial nodes
Requires p*b switches
Examples of machines that employ crossbars include the Sun Ultra HPC 10000 and the Fujitsu VPP500
(Figure: CPUs 1-4 cross-connected to memory modules MM1-MM4.)



Butterfly Network

An example of blocking in an omega network: one of the messages (010 to 111, or 110 to 100) is blocked at link AB.
Butterfly Network



Butterfly Routing



Interconnection Topologies

Omega Network (Butterfly Network)
Consists of log p stages, where p is the number of inputs (processing nodes) and also the number of outputs (memory banks)
Each stage consists of an interconnection pattern that connects its p inputs to its p outputs via the perfect shuffle (a left rotation of the index bits):
j = 2i, for 0 <= i <= p/2 - 1
j = 2i + 1 - p, for p/2 <= i <= p - 1
Each switch has two connection modes:
Pass-through connection: the inputs are sent straight through to the outputs
Cross-over connection: the inputs to the switching node are crossed over and then sent out
Has p*(log p)/2 switching nodes: if p = 8, nodes = 12
Much better than a crossbar, which would use p*p = 64 switches
The cost of such a network grows as Theta(p log p)
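A small sketch of the perfect-shuffle connection and the resulting switch count, just to make the formula concrete (nothing here is specific to any particular machine):

```python
import math

def perfect_shuffle(i, p):
    """Output port j that input port i is wired to in one omega stage
    (p a power of two); equivalent to rotating i's log2(p)-bit label left by one."""
    return 2 * i if i < p // 2 else 2 * i + 1 - p

p = 8
print([perfect_shuffle(i, p) for i in range(p)])   # [0, 2, 4, 6, 1, 3, 5, 7]

# p/2 two-by-two switches per stage, log2(p) stages:
print(int(math.log2(p)) * p // 2)   # 12 switching nodes for p = 8, vs. 64 for a crossbar
```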



Interconnection Topologies

Omega Network (Butterfly Network)
The omega network has the self-routing property:
The path a cell takes to reach its destination can be determined directly from its routing tag (i.e., the destination port id)
Stage k of the network looks at bit k of the tag:
If bit k is 0, send the cell out the upper port
If bit k is 1, send the cell out the lower port
This works for every possible input port (really!)
Route from any input x to output y by selecting links determined by successive d-ary digits of y's label.
The process is reversible; we can route from output y back to x by following the links determined by successive digits of x's label.
This self-routing property allows for simple hardware-based routing of cells.
(Figure: routing from input x = x_{k-1} ... x_0 to output y = y_{k-1} ... y_0 by following successive bits of y's label.)
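A sketch of the destination-tag (self-routing) rule described above, for a p-input network with log2(p) stages; port-numbering and bit-order conventions vary between texts, so treat this as illustrative rather than as any machine's exact routing hardware:

```python
import math

def omega_route(dest, p):
    """Switch outputs ('upper'/'lower') taken at each of the log2(p) stages
    when self-routing toward destination port `dest` (destination-tag routing)."""
    stages = int(math.log2(p))
    route = []
    # Examine the destination's address bits, one per stage:
    # a 0 bit selects the switch's upper output, a 1 bit the lower output.
    for k in range(stages - 1, -1, -1):   # most-significant bit first (one common convention)
        bit = (dest >> k) & 1
        route.append("lower" if bit else "upper")
    return route

print(omega_route(0b101, 8))   # ['lower', 'upper', 'lower']
```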



Interconnection Topologies

Hypercube:
Processors are directly connected to only certain other processors, and data must traverse multiple hops to reach additional processors
Usually distributed memory
Hardware may handle only single hops, or multiple hops
Software may mask hardware limitations
Latency is related to the graph diameter, among many other factors
Usually NUMA, nonblocking, scalable, upgradeable
Examples of such direct networks: Ring, Mesh, Torus, Hypercube, Binary Tree
For a hypercube, p = 2^n, n >= 0
Processors are conceptually at the corners of an n-dimensional hypercube, and each is directly connected to the n neighboring nodes
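A small sketch of the hypercube's neighbor structure: each node's n neighbors are the labels that differ from it in exactly one bit (a standard property of the topology, not tied to any specific machine):

```python
def hypercube_neighbors(node, n):
    """Neighbors of `node` in an n-dimensional hypercube (p = 2**n nodes):
    flip each of the n address bits in turn."""
    return [node ^ (1 << d) for d in range(n)]

n = 3                      # 3-dimensional hypercube, p = 8 processors
for node in range(2 ** n):
    print(node, hypercube_neighbors(node, n))
# The diameter (worst-case hop count) equals n: e.g. 000 -> 111 takes 3 hops.
```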





Red Storm - Cray Research (2003)
(Figure slides: Red Storm Overview; Red Storm Network; Red Storm Processor Board; Growth Over Thirty Years)



What Have We Learned Over 30 Years?

Building general-purpose parallel machines is a very difficult task.

Proof by contradiction: many companies have gone bankrupt or left the parallel machine market.

Even harder is developing general parallel programming schemes.
Still an art rather than a science.



Additional Material
SIMD => Data Level Parallelism
