CICS 504 Computer Organization
Computer Organization
Lecture 11:
Multiprocessor 1: Reasons, Classifications,
Performance Metrics, Applications
1
Adapted from D.A. Patterson, Copyright 2003 UCB
Review: Networking
• Clusters +: fault isolation and repair,
scaling, cost
• Clusters -: maintenance, network interface
performance, memory efficiency
• Google as cluster example:
– scaling (6000 PCs, 1 petabyte storage)
– fault isolation (2 failures per day yet available)
– repair (replace failures weekly/repair offline)
– Maintenance: 8 people for 6000 PCs
• Cell phone as portable network device
• Will future services built on Google-like
clusters be delivered to gadgets like the cell
phone handset?
2
Parallel Computers
• Definition: “A parallel computer is a collection
of processing elements that cooperate and
communicate to solve large problems fast.”
Almasi and Gottlieb, Highly Parallel Computing, 1989
• Questions about parallel computers:
– How large a collection?
– How powerful are processing elements?
– How do they cooperate and communicate?
– How are data transmitted?
– What type of interconnection?
– Does it translate into performance?
3
Parallel Processors “Religion”
4
What Level of Parallelism?
• Bit level parallelism: 1970 to ~1985
– 4-bit, 8-bit, 16-bit, 32-bit microprocessors
• Instruction level parallelism (ILP):
~1985 through today
– Pipelining
– Superscalar
– VLIW
– Out-of-Order execution
– Limits to benefits of ILP?
• Process-level or thread-level parallelism;
mainstream for general-purpose computing?
– Servers are parallel
– High-end desktop dual-processor PC soon??
(or just sell the socket?)
5
Why Multiprocessors?
1. Microprocessors as the fastest CPUs
• Collecting several is much easier than redesigning one
2. Complexity of current microprocessors
• Do we have enough ideas to sustain 1.5X/yr?
• Can we deliver such complexity on schedule?
3. Slow (but steady) improvement in parallel software
(scientific apps, databases, OS)
4. Emergence of embedded and server markets driving
microprocessors in addition to desktops
• Embedded functional parallelism, producer/consumer model
• Server figure of merit is tasks per hour vs. latency
6
Popular Flynn Categories
(e.g., ~RAID level for MPPs)
7
Major MIMD Styles
1. Centralized shared memory ("Uniform
Memory Access" time or "Shared Memory
Processor")
2. Decentralized memory (memory module
with CPU)
• get more memory bandwidth, lower memory
latency
• Drawback: Longer communication latency
• Drawback: Software model more complex
8
Decentralized Memory versions
1. Shared Memory with "Non Uniform Memory
Access" time (NUMA)
2. Message passing "multicomputer" with separate
address space per processor
– Can invoke software with Remote Procedure Call (RPC)
– Often via a library, such as MPI: Message Passing
Interface (see the sketch after this slide)
– Also called "synchronous communication" since
communication causes synchronization between 2
processes
9
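To make the message-passing "multicomputer" model above concrete, here is a minimal MPI sketch (not from the original slides). It assumes a launch with at least two processes, e.g. mpirun -np 2 after compiling with mpicc; the variable names are illustrative.

/* Each MPI process has its own address space; data moves only through
   explicit send/receive calls into the other process's buffer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                     /* exists only in rank 0's memory */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);  /* copy now in rank 1's memory */
    }
    MPI_Finalize();
    return 0;
}

The matching send and receive also synchronize the two processes, which is why the slide calls this style "synchronous communication".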
Performance Metrics:
Latency and Bandwidth
1. Bandwidth
– Need high bandwidth in communication
– Match limits in network, memory, and processor
– Challenge is link speed of network interface vs. bisection
bandwidth of network
2. Latency
– Affects performance, since processor may have to wait
– Affects ease of programming, since requires more thought to
overlap communication and computation
– Overhead to communicate is a problem in many machines
3. Latency Hiding
– How can a mechanism help hide latency?
– Increases programming system burden
– Examples: overlap message send with computation, prefetch
data, switch to other tasks (see the sketch below)
10
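A minimal sketch of the latency-hiding idea above, assuming MPI and at least two ranks: the send is started with a non-blocking call, independent work proceeds while the message is in flight, and the wait happens only when the buffer must be reused. The busy-work loop is a stand-in for real computation.

/* Overlap communication with computation: latency is hidden, not removed. */
#include <mpi.h>
#include <stdio.h>

#define N 1024

int main(int argc, char **argv) {
    double buf[N];
    double local = 0.0;
    int rank;
    MPI_Request req;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        for (int i = 0; i < N; i++) buf[i] = i;
        MPI_Isend(buf, N, MPI_DOUBLE, 1, 0, MPI_COMM_WORLD, &req);  /* start send */
        for (int i = 0; i < 1000000; i++)        /* work that does not touch buf */
            local += 1e-6;
        MPI_Wait(&req, MPI_STATUS_IGNORE);       /* now safe to reuse buf */
        printf("overlapped work result: %f\n", local);
    } else if (rank == 1) {
        MPI_Recv(buf, N, MPI_DOUBLE, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
    }
    MPI_Finalize();
    return 0;
}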
Parallel Framework
• Layers:
– Programming Model:
• Multiprogramming: lots of jobs, no communication
• Shared address space: communicate via memory
• Message passing: send and receive messages
• Data Parallel: several agents operate on several data
sets simultaneously and then exchange information
globally and simultaneously (shared or message
passing)
– Communication Abstraction:
• Shared address space: e.g., load, store, atomic swap (sketch below)
• Message passing: e.g., send, receive library calls
• Debate over this topic (ease of programming, scaling)
11
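A minimal sketch, assuming C11 atomics, of the shared-address-space primitives named above: ordinary loads and stores move the data, and an atomic swap (exchange) builds a simple spin lock. The names lock and shared_counter are illustrative; thread creation is omitted.

#include <stdatomic.h>

static atomic_int lock = 0;        /* shared location: 0 = free, 1 = held      */
static int shared_counter = 0;     /* data communicated via plain load/store   */

void acquire(void) {
    /* atomic swap: write 1 and get the old value back in one indivisible step */
    while (atomic_exchange(&lock, 1) == 1)
        ;                          /* spin until the previous holder stores 0  */
}

void release(void) {
    atomic_store(&lock, 0);
}

void increment(void) {
    acquire();
    shared_counter++;              /* plain load + store, protected by the lock */
    release();
}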
Shared Address Model Summary
• Each processor can name every physical location in
the machine
• Each process can name all data it shares with other
processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual to local or remote
physical
• Memory hierarchy model applies: now communication
moves data to local processor cache (as load moves
data from memory to cache)
– Latency, BW, scalability when communicate?
12
Shared Address/Memory
Multiprocessor Model
• Communicate via Load and Store
– Oldest and most popular model
• Based on timesharing: processes on multiple
processors vs. sharing single processor
• process: a virtual address space
and ~ 1 thread of control
– Multiple processes can overlap (share), but ALL
threads share a process address space
• Writes to shared address space by one thread
are visible to reads of other threads (see the sketch below)
– Usual model: share code, private stack, some shared
heap, some private heap
13
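A minimal POSIX-threads sketch of the shared address/memory model above (names illustrative; link with -pthread): both threads live in one process, the worker writes a shared variable with an ordinary store, and after the join the main thread's ordinary load sees that write.

#include <pthread.h>
#include <stdio.h>

static int shared_data = 0;            /* shared static/heap data              */

static void *worker(void *arg) {
    (void)arg;
    int local = 21;                    /* private to this thread's stack       */
    shared_data = 2 * local;           /* write into the shared address space  */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, worker, NULL);
    pthread_join(t, NULL);             /* join also orders the write before... */
    printf("shared_data = %d\n", shared_data);   /* ...this read               */
    return 0;
}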
SMP Interconnect
• Processors to Memory AND to I/O
• Bus based: all memory locations equal
access time so SMP = “Symmetric MP”
– Shared bus limits BW as processors and I/O are added
14
Message Passing Model
• Whole computers (CPU, memory, I/O devices)
communicate as explicit I/O operations
• Send specifies local buffer + receiving process
on remote computer
• Receive specifies sending process on remote
computer + local buffer to place data
– Usually send includes a process tag
and receive has a rule on the tag: match one, match any (sketch below)
– Synch: when send completes, when buffer is free, when
request accepted, receive waits for send
• Send+receive => memory-memory copy, where
each supplies local address,
AND does pairwise synchronization!
15
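A small sketch of the receive-side matching rules above, using MPI wildcards (launch with two or more ranks; names illustrative): each sender attaches a tag, rank 0 accepts from any source with any tag, and every matched send/receive pair acts as a memory-to-memory copy plus pairwise synchronization.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size, msg;
    MPI_Status status;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank != 0) {
        msg = rank * 100;
        MPI_Send(&msg, 1, MPI_INT, 0, rank, MPI_COMM_WORLD);  /* tag = rank */
    } else {
        for (int i = 1; i < size; i++) {
            /* "match any": accept a message from any sender with any tag */
            MPI_Recv(&msg, 1, MPI_INT, MPI_ANY_SOURCE, MPI_ANY_TAG,
                     MPI_COMM_WORLD, &status);
            printf("got %d from rank %d, tag %d\n",
                   msg, status.MPI_SOURCE, status.MPI_TAG);
        }
    }
    MPI_Finalize();
    return 0;
}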
Data Parallel Model
• Operations can be performed in parallel
on each element of a large regular data
structure, such as an array
• 1 Control Processor broadcasts to many
PEs
• Condition flag per PE so that it can skip (see the sketch below)
• Data distributed in each memory
• Data parallel programming languages lay
out data to processors
16
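A small data-parallel sketch; OpenMP is used here only as a stand-in for a SIMD/data-parallel machine, and the arrays are illustrative. The same operation is applied to every element of a regular array, and a per-element mask plays the role of the condition flag that lets a PE skip its item. Compile with -fopenmp (it also runs serially without it).

#include <stdio.h>

#define N 8

int main(void) {
    double a[N];
    int mask[N] = {1, 0, 1, 1, 0, 1, 0, 1};    /* condition flag per element */
    for (int i = 0; i < N; i++) a[i] = i;

    #pragma omp parallel for                   /* same operation on every element */
    for (int i = 0; i < N; i++) {
        if (mask[i])                           /* inactive elements are skipped   */
            a[i] = a[i] * a[i];
    }

    for (int i = 0; i < N; i++) printf("%g ", a[i]);
    printf("\n");
    return 0;
}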
Data Parallel Model
• Vector processors have similar ISAs,
but no data placement restriction
• SIMD led to Data Parallel Programming languages
• Advancing VLSI led to single chip FPUs and whole fast
µProcs (SIMD less attractive)
• SIMD programming model led to
Single Program Multiple Data (SPMD) model
– All processors execute identical program
• Data parallel programming languages are still useful; they do
communication all at once:
"Bulk Synchronous" phases in which all communicate
after a global barrier (see the sketch below)
17
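A minimal SPMD / bulk-synchronous sketch in MPI, assuming the usual mpirun launch: every rank executes this identical program on its own slice of the iteration space, then all ranks communicate at once through a global reduction, which also serves as the barrier between phases.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, size;
    double local_sum = 0.0, global_sum = 0.0;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    /* computation phase: each rank works only on its own data */
    for (int i = rank; i < 1000; i += size)
        local_sum += i;

    /* bulk-synchronous communication phase: everyone exchanges at once */
    MPI_Allreduce(&local_sum, &global_sum, 1, MPI_DOUBLE, MPI_SUM,
                  MPI_COMM_WORLD);

    if (rank == 0)
        printf("global sum = %g\n", global_sum);
    MPI_Finalize();
    return 0;
}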
Advantages of the shared-memory
communication model
• Compatibility with SMP hardware
• Ease of programming when communication patterns are
complex or vary dynamically during execution
• Ability to develop apps using familiar SMP model, attention
only on performance critical accesses
• Lower communication overhead, better use of BW for small
items, due to implicit communication and memory mapping to
implement protection in hardware, rather than through I/O
system
• HW-controlled caching to reduce remote comm. by caching of
all data, both shared and private.
18
Advantages of the message-passing
communication model
• The hardware can be simpler (esp. vs. NUMA)
• Communication explicit => simpler to understand; in shared
memory it can be hard to know when communicating and when not,
and how costly it is
• Explicit communication focuses attention on costly aspect of parallel
computation, sometimes leading to improved structure in
multiprocessor program
• Synchronization is naturally associated with sending messages,
reducing the possibility for errors introduced by incorrect
synchronization
• Easier to use sender-initiated communication, which may have
some advantages in performance
19
Communication Models
• Shared Memory
– Processors communicate with shared address space
– Easy on small-scale machines
– Advantages:
• Model of choice for uniprocessors, small-scale MPs
• Ease of programming
• Lower latency
• Easier to use hardware controlled caching
• Message passing
– Processors have private memories,
communicate via messages
– Advantages:
• Less hardware, easier to design
• Focuses attention on costly non-local operations
• Can support either SW model on either HW base
20
2 Parallel Applications
• Commercial Workload
• Multiprogramming and OS Workload
21
Parallel App: Commercial Workload
• Online transaction processing workload (OLTP)
• Decision support system (DSS)
• Web index search (Altavista)
Benchmark   % Time User Mode   % Time Kernel   % Time I/O (CPU Idle)
OLTP        71%                18%             11%
26
Basic Snoopy Protocols
• Write Invalidate Protocol:
– Multiple readers, single writer
– Write to shared data: an invalidate is sent to all caches, which
snoop and invalidate any copies (see the sketch after this slide)
– Read Miss:
• Write-through: memory is always up-to-date
• Write-back: snoop in caches to find most recent copy
• Write Broadcast Protocol (typically write through):
– Write to shared data: broadcast on bus, processors snoop, and
update any copies
– Read miss: memory is always up-to-date
• Write serialization: bus serializes requests!
– Bus is single point of arbitration
27
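A software sketch of the write-invalidate behavior above for a single cache block, using MSI-style states. This models protocol behavior only, not real snooping hardware, and the function names are invented for illustration.

#include <stdio.h>

typedef enum { INVALID, SHARED, MODIFIED } BlockState;

/* the local processor writes the block */
static BlockState on_cpu_write(BlockState s) {
    if (s != MODIFIED)
        printf("bus: send invalidate for this block to all other caches\n");
    return MODIFIED;                   /* single writer after the invalidate */
}

/* another processor's write to the same block is snooped on the bus */
static BlockState on_snoop_remote_write(BlockState s) {
    if (s == MODIFIED)
        printf("bus: write back the most recent (dirty) copy\n");
    return INVALID;                    /* our copy may no longer be used */
}

int main(void) {
    BlockState cache0 = SHARED, cache1 = SHARED;   /* multiple readers */
    cache0 = on_cpu_write(cache0);                 /* cache 0 becomes the single writer */
    cache1 = on_snoop_remote_write(cache1);        /* cache 1 drops its copy */
    printf("cache0=%d cache1=%d\n", cache0, cache1);
    return 0;
}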
Basic Snoopy Protocols
• Write Invalidate versus Broadcast:
– Invalidate requires one transaction per write run
– Invalidate uses spatial locality: one
transaction per block
– Broadcast has lower latency between write
and read
28
Implementation Complications
• Write Races:
– Cannot update cache until bus is obtained
• Otherwise, another processor may get bus first,
and then write the same cache block!
– Two step process:
• Arbitrate for bus
• Place miss on bus and complete operation
– If miss occurs to block while waiting for bus,
handle miss (invalidate may be needed) and then restart.
– Split transaction bus:
• Bus transaction is not atomic:
can have multiple outstanding transactions for a block
• Multiple misses can interleave,
allowing two caches to grab block in the Exclusive state
• Must track and prevent multiple misses for one block
• Must support interventions and invalidations
29
Implementing Snooping Caches
• Multiple processors must be on bus, access to both
addresses and data
• Add a few new commands to perform coherency,
in addition to read and write
• Processors continuously snoop on address bus
– If address matches tag, either invalidate or update
• Since every bus transaction checks cache tags,
could interfere with CPU just to check:
– solution 1: duplicate set of tags for L1 caches just to allow
checks in parallel with CPU
– solution 2: the L2 cache is already a duplicate,
provided L2 obeys inclusion with the L1 cache
• block size, associativity of L2 affects L1
30
Implementing Snooping Caches
• Bus serializes writes, getting bus ensures no
one else can perform memory operation
• On a miss in a write-back cache, may have the
desired copy and it's dirty, so must reply
• Add extra state bit to cache to determine
shared or not
• Add a 4th state (MESI) (see the sketch below)
31
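A minimal sketch of what the 4th (Exclusive) state buys, again as a behavioral model rather than hardware: a block read by only one cache is installed Exclusive, so a later write upgrades to Modified silently, while a Shared copy still has to broadcast an invalidate first.

#include <stdio.h>

typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } MesiState;

static void bus_invalidate(void) { printf("bus: invalidate other copies\n"); }

/* fill on a read miss: the "shared" bit comes from other caches' snoop response */
static MesiState on_read_fill(int other_caches_have_copy) {
    return other_caches_have_copy ? SHARED : EXCLUSIVE;
}

/* local write: only a Shared (or missing) copy needs a bus transaction first */
static MesiState on_cpu_write(MesiState s) {
    if (s == SHARED || s == INVALID)
        bus_invalidate();
    return MODIFIED;                   /* Exclusive -> Modified is silent */
}

int main(void) {
    MesiState b = on_read_fill(0);     /* no other copies: installed Exclusive */
    b = on_cpu_write(b);               /* no bus invalidate needed             */
    printf("final state = %d (MODIFIED)\n", b);
    return 0;
}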
Fundamental Issues
• 3 Issues to characterize parallel machines
1) Naming
2) Synchronization
3) Performance: Latency and Bandwidth
(covered earlier)
32
Fundamental Issue #1: Naming
• Naming: how to solve large problem fast
– what data is shared
– how it is addressed
– what operations can access data
– how processes refer to each other
• Choice of naming affects the code produced by a
compiler: via a load, where it just remembers the
address, or by keeping track of processor number
and local virtual address for msg. passing
• Choice of naming affects replication of data;
via load in cache memory hierarchy or via SW
replication and consistency
33
Fundamental Issue #1: Naming
• Global physical address space:
any processor can generate the address and access it in a
single operation
– memory can be anywhere:
virtual addr. translation handles it
• Global virtual address space: if the address space of
each process can be configured to contain all shared
data of the parallel program
• Segmented shared address space:
locations are named
<process number, address>
uniformly for all processes of the parallel program
34
Fundamental Issue #2:
Synchronization
• To cooperate, processes must coordinate
• Message passing is implicit coordination
with transmission or arrival of data
• Shared address
=> additional operations to explicitly
coordinate:
e.g., write a flag, awaken a thread, interrupt
a processor (see the sketch below)
35
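A minimal sketch of the "write a flag" style of explicit coordination above, assuming C11 atomics and POSIX threads (link with -pthread; names illustrative): the producer writes the data and then sets the flag with release ordering; the consumer spins on the flag with acquire ordering, which guarantees the data is visible before it is read.

#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload = 0;                /* ordinary shared data           */
static atomic_int ready = 0;           /* the coordination flag          */

static void *producer(void *arg) {
    (void)arg;
    payload = 99;                                            /* write the data  */
    atomic_store_explicit(&ready, 1, memory_order_release);  /* then the flag   */
    return NULL;
}

int main(void) {
    pthread_t t;
    pthread_create(&t, NULL, producer, NULL);

    /* consumer: wait for the flag before touching the data */
    while (!atomic_load_explicit(&ready, memory_order_acquire))
        ;                              /* spin */
    printf("payload = %d\n", payload);

    pthread_join(t, NULL);
    return 0;
}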