CS6461 - Computer Architecture
Fall 2016
Morris Lancaster - Lecturer
Adapted from Professor Stephen Kaisler's Notes
Lecture 10
High Performance Computing:
Multiprocessors
Introduction
Parallel Computing
Simultaneous use of multiple processors - all components of a
single architecture - to solve a task. Typically the processors are
identical and work for a single user (even if the machine is multi-user).
Distributed Computing
Use of a network of processors, each capable of being viewed
as a computer in its own right, to solve a problem. Processors
may be heterogeneous, multi-user, and usually individual tasks
are assigned to individual processors.
Concurrent Computing
Both of the above
Cray-1: limited programming models, however.
Coarse-grain
A task is broken into a handful of pieces, each of which is executed by
a powerful processor
Processors may be heterogeneous
Computation/communication ratio is very high
Example: BBN Butterfly
Medium-grain
Tens to a few thousand processors, typically running the same code
Computation/communication ratio is often hundreds or more
Examples: Intel Paragon XP, Touchstone Series
Fine-grain
Thousands to perhaps millions of small pieces, executed by very small,
simple processors or through pipelines
Processors typically have instructions broadcast to them
Compute/communicate ratio often near unity
Example: Thinking Machines CM-1, CM-2, CM-200
Shared-Memory Multiprocessors
Characteristics
All processors have equally direct access to one large memory address
space
Example systems
Bus and cache-based systems: Sequent Balance, Encore Multimax
Multistage interconnection network (IN) based systems: Ultracomputer, Butterfly, RP3, HEP
Crossbar switch-based systems: C.mmp, Alliant FX/8
Limitations
Memory access latency; Hot spot problem
A single memory serves a small number of processors (up to 16 or so); more processors introduce more contention
Large caches: Pro - reduce the latency of local memory accesses; Con - communicating data between processors becomes more complex
(a minimal threaded sketch of the shared-address-space model follows below)
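As a concrete illustration of these points (not from the original slides), here is a minimal shared-memory sketch in C using POSIX threads as stand-ins for processors; the thread count and loop bound are arbitrary choices. The single shared counter acts as the hot spot, and the mutex models the contention that grows as processors are added.

```c
/* Minimal shared-memory sketch: several POSIX threads all see one
 * address space. The shared counter is the "hot spot"; the mutex
 * models the contention that grows with the processor count. */
#include <pthread.h>
#include <stdio.h>

#define NPROC 4                        /* assumed processor count */

static long counter = 0;               /* one shared memory location */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg)
{
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);     /* serializes the hot spot */
        counter++;
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)
{
    pthread_t t[NPROC];

    for (int i = 0; i < NPROC; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NPROC; i++)
        pthread_join(t[i], NULL);

    printf("counter = %ld\n", counter);    /* NPROC * 100000 */
    return 0;
}
```

Compile with -pthread; with more threads, more time is spent waiting on the lock rather than computing.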
Message-Passing (Distributed-Memory) Multiprocessors
Characteristics
Interconnected computers
Each processor has its own memory, and communicates via message-
passing
Messages are variable-length data containers
Example systems
Tree structure: Teradata, DADO
Mesh-connected: Rediflow, Series 2010, J-Machine
Hypercube: Cosmic Cube, iPSC, NCUBE, FPS T Series, Mark III
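None of the machines listed above used MPI, but a short MPI sketch captures the same programming model: each rank's memory is private, and data moves only in explicit, variable-length messages. The rank numbers, tag, and payload below are arbitrary illustration choices; run with at least two ranks.

```c
/* Message-passing sketch (a modern MPI stand-in, not the vendor
 * libraries the listed machines actually used): each rank owns its
 * memory privately and data moves only in explicit messages. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (size < 2) {                     /* needs at least two ranks */
        MPI_Finalize();
        return 1;
    }

    if (rank == 0) {
        value = 42;                     /* data lives only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d from rank 0\n", value);
    }

    MPI_Finalize();
    return 0;
}
```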
NUMA: Non-Uniform Memory Access
All memories can be addressed by all processors, but access to a processor's own local memory is faster than access to another processor's remote memory.
Looks like a distributed-memory machine, but the interconnection network is usually built from custom-designed switches and/or buses.
CC-NUMA: Cache Coherent NUMA
Exemplified by Kendall Square Research KSR1
Individual OS
Each CPU has its own OS
Statically allocate physical memory to each CPU
Each CPU runs its own independent OS
Share peripherals
Each CPU handles its processes' system calls
Used in early multiprocessor systems
Simple to implement
Avoids concurrency issues by not sharing
Individual OS Issues:
Each processor has its own scheduling queue.
Each processor has its own memory partition.
Consistency is an issue with independent disk buffer caches and
potentially shared files.
Master-Slave Multiprocessors
OS mostly runs on a single fixed CPU.
User-level applications run on the other CPUs.
All system calls are passed to the Master CPU for
processing
Very little synchronization required
Simple to implement
Single centralized scheduler to keep all processors busy
Memory can be allocated as needed to all CPUs.
Issues: Master CPU becomes the bottleneck.
Shared Bus:
M3 wishes to communicate with S5
[1] M3 places the address of S5 on the bus, which causes S5 to
respond
[2] M3 sends data to S5, or S5 sends data to M3 (the direction is
determined by the bus command lines)
Master Device: Device that initiates and controls the
communication
Slave Device: Responding device
Multiple-master buses: Bus conflict requires bus arbitration
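A toy software model of this two-phase transaction may help fix the roles. It is purely illustrative: real buses implement this in hardware, and the structure and field names below are invented for the sketch.

```c
/* Toy model of the two-phase bus transaction described above.
 * Phase 1: the master drives an address that selects a slave.
 * Phase 2: data moves in the direction given by the command line. */
#include <stdio.h>

enum bus_cmd { BUS_READ, BUS_WRITE };    /* the "command line" */

struct bus {
    unsigned addr;      /* address lines: which slave responds */
    enum bus_cmd cmd;   /* direction of the data transfer      */
    int data;           /* data lines                          */
};

static int slave_mem[1];                 /* S5's local storage (toy) */

/* Slave side: respond only when addressed, then sink or source data. */
static void slave_respond(struct bus *b, unsigned my_addr)
{
    if (b->addr != my_addr)
        return;                          /* not selected, stay quiet */
    if (b->cmd == BUS_WRITE)
        slave_mem[0] = b->data;          /* master -> slave */
    else
        b->data = slave_mem[0];          /* slave -> master */
}

int main(void)
{
    struct bus b;

    /* M3 writes 7 to S5 (address 5), then reads it back. */
    b = (struct bus){ .addr = 5, .cmd = BUS_WRITE, .data = 7 };
    slave_respond(&b, 5);

    b = (struct bus){ .addr = 5, .cmd = BUS_READ };
    slave_respond(&b, 5);
    printf("M3 read %d from S5\n", b.data);
    return 0;
}
```

With multiple masters, an arbiter would have to decide which master gets to drive the address and command lines in any given cycle.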
Interconnection Topologies
Shared Bus:
All processors (and memory) are connected to a
common bus or buses
Memory access is fairly uniform, but not very scalable
A collection of signal lines that carry module-to-module
communication
Data highways connecting several digital system
elements
Can handle only one data transmission at a time
Can be easily expanded by connecting additional
processors to the shared bus, along with the
necessary bus arbitration circuitry
Interconnection Topologies
Mesh Architecture:
For a 2-dimensional m x m mesh, diameter = 2(m - 1)
In general, an n-dimensional mesh with p processors has
diameter = n(p^(1/n) - 1)
Diameter can be halved by having wrap-around
connections (=> Torus)
Ring is a 1-dimensional mesh with wrap-around
connections
(a quick numeric check of these formulas follows below)
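The short check below assumes a full k x k x ... x k mesh, so that p^(1/n) is an integer; the mesh sizes are arbitrary examples.

```c
/* Quick numeric check of the mesh diameter formula above:
 * diameter of an n-dimensional mesh with p processors = n * (p^(1/n) - 1). */
#include <math.h>
#include <stdio.h>

static int mesh_diameter(int p, int n)
{
    int k = (int)lround(pow((double)p, 1.0 / n));  /* nodes per dimension */
    return n * (k - 1);
}

int main(void)
{
    /* 2-D 8x8 mesh: 2 * (8 - 1) = 14, matching the 2(m - 1) case. */
    printf("8x8 mesh diameter   = %d\n", mesh_diameter(64, 2));
    /* 3-D 4x4x4 mesh: 3 * (4 - 1) = 9. */
    printf("4x4x4 mesh diameter = %d\n", mesh_diameter(64, 3));
    return 0;
}
```

Compile with -lm.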
Hypercube:
Processors are directly connected to only certain other
processors and data must traverse multiple hops to get to
additional processors
Usually distributed memory
Hardware may handle only single hops, or multiple hops
Software may mask hardware limitations
Latency is related to graph diameter, among many other
factors
Usually NUMA, nonblocking, scalable, upgradeable
Examples: Ring, Mesh, Torus, Hypercube, Binary Tree
p = 2^n processors, n >= 0
Processors are conceptually on the corners of an n-dimensional
hypercube, and each is directly connected to its n neighboring
nodes (see the addressing sketch below)
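A small sketch of hypercube addressing, assuming the standard binary labeling of nodes 0 .. 2^n - 1: two nodes are directly connected exactly when their labels differ in one bit, so each node has n neighbors and the minimum hop count between two nodes is the Hamming distance of their labels (hence the diameter is n).

```c
/* Hypercube addressing sketch (assumed node labeling 0 .. 2^n - 1):
 * neighbors differ in exactly one bit; minimum hops between two
 * nodes = Hamming distance of their labels. */
#include <stdio.h>

static int hops(unsigned a, unsigned b)
{
    unsigned diff = a ^ b;   /* bits where the labels disagree */
    int h = 0;
    while (diff) {           /* count the differing bits */
        h += diff & 1u;
        diff >>= 1;
    }
    return h;
}

int main(void)
{
    int n = 4;               /* 4-cube: p = 2^4 = 16 processors */
    unsigned node = 5;       /* binary 0101 */

    printf("neighbors of node %u:", node);
    for (int d = 0; d < n; d++)
        printf(" %u", node ^ (1u << d));    /* flip one dimension bit */
    printf("\n");

    printf("hops from 5 to 10 = %d\n", hops(5, 10));  /* 0101 vs 1010 -> 4 */
    return 0;
}
```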
Proof by contradiction:
Many companies have gone bankrupt or left the parallel machine
market