MultiProcessors Tanenbaum BP
Multiple CPU systems
Multiprocessor Multicomputer Distributed
Multiprocessors
• Each processor can address every word of
memory
• Two paradigms: UMA and NUMA
UMA cache coherency
• Multicore
– single shared cache (e.g. Intel dual core)
– or separate per-core caches (e.g. AMD Opteron)
UMA architectures
• single bus
• crossbar switch
• multistage switch
(crossbar switching demo: http://people.seas.harvard.edu/~jones/cscie129/nu_lectures/lecture11/switching/xbar/xbar.html)
Single bus UMA
• Primary problem: bus contention
• Ways to remedy
– local, per-CPU cache
– local private memory for unshared process data (loader needs hints)
Crossbar switch
Tanenbaum p. 528
Crossbar and multistage
• Crossbars
– crosspoint count grows as n² (quadratically), which gets expensive
– are nonblocking (no CPU is ever denied a path to the memory module it wants; several CPUs can reach different modules at once)
• Multistage switches provide a cheaper alternative to this quadratic growth
• Accessing memory through a multistage switch
– treat each read/write as a message routed toward the memory module
Omega Network
(an inexpensive multistage switch)
• Blocking network
(other multistages may differ)
• Interleaving memory
– consecutive words in different RAM modules
– prevents tying up any single path
– can permit parallel access in some architectures (see the sketch below)
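A minimal C sketch of the usual interleaved mapping, assuming 8 modules and 4-byte words (both illustrative values, not from the slides):

```c
#include <stdint.h>

/* Toy interleaved-memory mapping: consecutive words land in
   consecutive RAM modules instead of all hitting the same one. */
enum { NUM_MODULES = 8, WORD_SIZE = 4 };

static inline unsigned module_of(uint32_t addr) {
    return (addr / WORD_SIZE) % NUM_MODULES;   /* which RAM module      */
}

static inline uint32_t offset_in_module(uint32_t addr) {
    return (addr / WORD_SIZE) / NUM_MODULES;   /* word index inside it  */
}
```

With this mapping, words at addresses 0x00, 0x04, 0x08, … fall in modules 0, 1, 2, …, so a sequential sweep keeps several modules and switch paths busy at once rather than tying up a single path.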
NUMA – non-uniform mem. access
• Interconnect networks become unwieldy beyond ~100 CPUs
• Two classes of memory
– fast local memory: accessed as usual
– slower remote memory: still accessed with ordinary LOAD/STORE instructions
• Otherwise transparent to user
NUMA flavors
• No Cache (NC-NUMA)
– Remote memory is not cached
• Cache-coherent (CC-NUMA)
– Allows cached remote memory
– Frequently uses directory database for each
cache line
• status (clean/dirty)
• which cache
CC-NUMA example
• Cache line size: 64 (2^6) bytes
• 32-bit address space
• 2^32 / 2^6 = 2^26 cache lines in total
• 256 nodes
– 16 MB local RAM (2^24 bytes = 2^18 cache lines) per node
– 1 CPU per node
• Addressing scheme:
Tanenbaum p. 532
CC-NUMA example
Example address 0xFF0AB004 splits into: node 0xFF (top 8 bits), cache line/block 0x02AC0 (next 18 bits), byte offset 0x04 (low 6 bits).
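A small C check of that breakdown (field widths follow the 256-node, 64-byte-line example above; variable names are mine):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit address for the 256-node CC-NUMA example:
   top 8 bits = node, next 18 bits = cache line, low 6 bits = offset. */
int main(void) {
    uint32_t addr   = 0xFF0AB004u;
    uint32_t offset = addr & 0x3Fu;            /* low 6 bits   */
    uint32_t line   = (addr >> 6) & 0x3FFFFu;  /* next 18 bits */
    uint32_t node   = addr >> 24;              /* top 8 bits   */
    printf("node 0x%02X, line 0x%05X, offset 0x%02X\n", node, line, offset);
    /* prints: node 0xFF, line 0x02AC0, offset 0x04 */
    return 0;
}
```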
CC-NUMA example
Node 0xFF's directory has 2^18 entries (0x00000 through 0x3FFFF), one per local cache line; each entry holds a VALID bit and the NODE currently caching that line.
Case 1: the entry for line 0x02AC0 is invalid, i.e. the line is not cached anywhere.
CC-NUMA example
Node 0x00 requests line 0x02AC0; node 0xFF's directory shows it invalid (Case 1), so node 0xFF:
1. Fetches cache line 0x02AC0 from its local RAM
2. Sends the cache line to node 0x00
3. Updates the directory entry for 0x02AC0 to valid, node 0x00
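A rough C sketch of that directory lookup, assuming the 2^18-entry per-node directory described above; the names dir_entry and handle_remote_read, and the bit-field layout, are illustrative rather than from the text:

```c
#include <stdint.h>

#define LINES_PER_NODE (1u << 18)   /* 2^18 local cache lines per node */

struct dir_entry {
    uint16_t valid : 1;   /* 1 if some remote node caches this line */
    uint16_t node  : 8;   /* which node (0..255) holds it           */
};

static struct dir_entry directory[LINES_PER_NODE];

/* Home node handles a read request for local line `line` from `requester`. */
void handle_remote_read(uint32_t line, uint8_t requester) {
    struct dir_entry *e = &directory[line];
    if (!e->valid) {
        /* Case 1 (as on the slide): line not cached anywhere. A real
           node would fetch it from local RAM and send it to the
           requester; here we just record the new holder. */
        e->valid = 1;
        e->node  = requester;
    } else {
        /* Line already cached elsewhere: the home node would first
           recall/invalidate it from e->node (not shown on the slide). */
    }
}
```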
CC-NUMA
• Acceptable overhead
– 2^18 high-speed ($$$) 9-bit directory entries per node (1 valid bit + 8-bit node number)
– 2^18 × 9 bits ≈ 288 KB, about 1.76% of the 16 MB local RAM
Multicore chips
• Common RAM for all cores (UMA)
• Common or separate cache
Tanenbaum p. 23
Multicore chips
• Snooping logic – ensures cache
coherency
Multiprocessor OS
• Separate OS
• Master-slave
• Symmetric multiprocessor
Separate OS
• CPUs function as separate computers
• Resources partitioned
(some sharing possible, e.g. OS code)
Master-Slave
• Asymmetric
• OS runs on a specific CPU
Tanenbaum p. 536
Symmetric multiprocessor (SMP)
• OS can be executed by any CPU
• Concurrency issues
Note: race conditions can occur on asymmetric
OS as well…
• Deadlocks…
Multiprocessor synchronization
• Mutual exclusion protocol
– needs an atomic instruction, e.g. TSL or SWAP
– the atomic instruction must be able to lock the bus (or at least the cache line)
– what happens if the bus is not locked? Another CPU can slip in between TSL's read and its write, both see the lock as free, and both enter the critical region (see the spinlock sketch below)
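A minimal spin lock built on the C11 analogue of TSL (atomic_flag_test_and_set); the hardware makes the read-modify-write atomic by locking the bus or, on modern CPUs, the cache line. This naive version is exactly the one that causes the cache ping-pong discussed next:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    /* TSL-style: atomically set the flag and return its old value.
       Every failed attempt is an atomic RMW on the lock word. */
    while (atomic_flag_test_and_set(&lock))
        ;   /* spin */
}

void release(void) {
    atomic_flag_clear(&lock);
}
```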
Multiprocessor synchronization
• Playing ping-pong with the cache: the word holding the lock (0x8A3 in the figure) is cached by the CPU that touched it last; each time the other CPU's TSL modifies it, or modifies shared variables on the same line, the cache line is invalidated and moves over, bouncing back and forth while both spin.
Multiprocessor synchronization
Strategies to prevent cache invalidation
1. Poll w/ read, use TSL once free
2. Exponential backoff (developed for
Ethernet)
3. Grant private lock
Tanenbaum p. 541
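A sketch combining strategies 1 and 2 above: spin with ordinary reads (which stay in the local cache), attempt the atomic exchange only when the lock looks free, and back off exponentially after a failed attempt. The constants and delay loop are illustrative tuning choices, not values from the text:

```c
#include <stdatomic.h>

static atomic_int lock = 0;   /* 0 = free, 1 = held */

void tts_acquire(void) {
    unsigned backoff = 1;
    for (;;) {
        /* Strategy 1: poll with plain reads; spinning stays in the
           local cache and does not invalidate anyone else's copy.  */
        while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
            ;
        /* Lock looks free: do the atomic TSL/exchange just once.   */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
            return;                            /* acquired          */
        /* Strategy 2: lost the race, back off exponentially.       */
        for (volatile unsigned i = 0; i < backoff; i++)
            ;                                  /* crude delay loop  */
        if (backoff < (1u << 16))
            backoff <<= 1;
    }
}

void tts_release(void) {
    atomic_store_explicit(&lock, 0, memory_order_release);
}
```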
When to block
• Spin locks waste CPU cycles, but so do context
switches
– sample context switch: 1 ms
– sample mutual exclusion: 50 μs
• Mutual exclusion time is unknown…
• Alternatives
– always spin
– always switch
– predict based on history or static threshold
• Does it make sense to spin on a uniprocessor? (No: the lock holder cannot run to release the lock while we spin)
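One common compromise, sketched here with POSIX threads: spin briefly (a bounded number of trylock attempts, standing in for "less than a context switch"), then block. SPIN_TRIES is an assumed tuning knob, not a value from the slides:

```c
#include <pthread.h>

/* Assumed tuning knob: roughly "spin for less than a context switch". */
enum { SPIN_TRIES = 1000 };

void hybrid_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return;               /* got the lock while spinning   */
    pthread_mutex_lock(m);        /* still held: give up and block */
}

void hybrid_unlock(pthread_mutex_t *m) {
    pthread_mutex_unlock(m);
}
```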
Multiprocessor scheduling
• Kernel-level threads
• Timesharing vs. space sharing
[Figure, Tanenbaum p. 525: multiprocessor with shared memory, 2-10 ns access]
Common queue for
independent threads
Tanenbaum p. 544
Alternatives/Enhancements
• Smart scheduling
– critical section flag
– extend time quantum when flag set
• Affinity scheduling
– when a process finishes its CPU burst, it leaves lots of warm cache entries behind on that CPU
– if it is rescheduled there soon enough, it may run faster
– can assign each process to a CPU, then schedule within that CPU (two-level scheduling; see the affinity sketch below)
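The slides describe affinity scheduling done inside the OS; as a user-visible analogue, here is a sketch that pins the calling thread to one CPU with the Linux-specific pthread_setaffinity_np call so it keeps reusing that CPU's warm cache. CPU number 2 is an arbitrary choice:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single CPU so the scheduler keeps it on
   the same warm cache. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_to_cpu(2);
    if (err != 0)
        fprintf(stderr, "pin failed: %s\n", strerror(err));
    /* ... this thread now stays on CPU 2, reusing its cached data ... */
    return 0;
}
```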
Space sharing
• Some processes may benefit from being
scheduled simultaneously.
• Typically scheduled FCFS
[Figure, Tanenbaum p. 546: CPUs dedicated to specific processes sit idle whenever their process blocks]
Gang scheduling
I am a fugitive from a chain gang (Warner Bros, 1932)
Multicomputers
aka: cluster computers, cluster of workstations
• Recall:
– Tightly coupled
– No shared memory
• Nodes
– CPU (possibly multicore)
– high speed network
– RAM
– perhaps secondary storage
Interconnect
• Various network topologies
• Samples:
Tanenbaum p. 550
Routing
• Packet switched
– messages packetized
– “store and forward:” each switch point
• receives packet
• forwards to next switch point
• latency increases with # switch points
• Circuit switched
– Establish path
– All bits sent along path
Network interfaces
• Copying buffers increases delay
User-level communication
• Message passing (CS570)
– send/receive
– ports/mailboxes
– addressing
• unlike Internet, fixed network
• typically CPU# & port/mailbox or process#
– blocking/nonblocking
Implementing non-blocking messages
• send
– user cannot modify buffer until message
actually sent
– three possibilities
• block until kernel can copy to an internal buffer*
• generate interrupt once buffer is sent
• mark page as copy-on-write until sent
*From a network perspective, this call is still non-blocking, but not from
an OS one. It is the easiest and most common option implemented.
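MPI is not mentioned in the slides, but it is a convenient message-passing library to illustrate the buffer-reuse hazard above: MPI_Isend returns immediately, and the sender must not touch the buffer until MPI_Wait reports completion (run with at least 2 ranks, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int buf[4] = {1, 2, 3, 4};
        MPI_Request req;
        MPI_Isend(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... overlap other work here, but do NOT modify buf ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now buf may be reused */
    } else if (rank == 1) {
        int buf[4];
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("got %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```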
Receive
• Will blocking cause problems?
• Non-blocking
– polling
– message arrival by interrupt
• inform calling thread (traditional)
• pop-up threads
• active messages (pop-up variant)
– call from user-level interrupt handler
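A matching sketch of non-blocking receive by polling (again using MPI purely as an illustration): post the receive, keep doing other work, and test periodically for message arrival instead of blocking:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int msg;
        MPI_Request req;
        MPI_Irecv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        int done = 0;
        while (!done) {
            /* ... do other useful work ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll for arrival */
        }
        printf("received %d\n", msg);
    } else if (rank == 0) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```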
RPC Gotchas
• Pointers to data structures that are not
well contained (e.g. graph)
• Weak types (e.g. int x[] in C/C++)
• Types can be difficult to deduce (e.g.
printf)
• References to globals
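A tiny illustration of the "weak types" gotcha above: inside the callee, int x[] is just a pointer, so a generated RPC stub has no way to know how many elements to marshal. The function names are made up:

```c
#include <stddef.h>

/* Looks like an array parameter, but x decays to int*: the declaration
   carries no length, so an RPC stub cannot tell how much data to copy
   into the request message. */
void f(int x[]) {
    /* sizeof(x) == sizeof(int *) here, not the array size */
    (void)x;
}

/* An RPC-friendly signature makes the size explicit. */
void f_rpc(const int *x, size_t n);
```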
Distributed shared memory (DSM)
• Transparent to user
• Modifications to page table
– Invalid pages may be on another processor
– Page fault results in fetching page from other
CPU’s memory
• Read-only pages can be shared
• Extensions possible (e.g. share until write)
DSM Issues
• Network startup is expensive
• Small difference between sending 1 page
vs. 4 pages
• Too large a transfer is more likely to result
in false sharing
Tanenbaum p. 564
back to playing ping-pong…
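A toy pthreads program showing false sharing at cache-line granularity (under DSM the same effect happens at page granularity): the two threads never touch each other's counter, yet the shared line or page still bounces between them. Sizes and names are illustrative:

```c
#include <pthread.h>
#include <stdio.h>

/* a and b live in the same unit of sharing (cache line here, a whole
   page under DSM), so updates from different threads ping-pong it. */
static struct { long a; long b; } shared;

static void *bump_a(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.a++;
    return arg;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.b++;
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.a, shared.b);
    /* Padding a and b onto separate lines (or pages) removes the
       ping-pong and typically speeds this up noticeably. */
    return 0;
}
```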
Sequential consistency
• Suppose we let writable pages be shared:
What if threads 0 and 1 both
write to the same location...
[Figure: page P32 of the shared logical memory is replicated on machine 0 and machine 1; thread 0 and thread 1 both write location 0x37F of that page.]
Easy sequential consistency
• Mark shared pages read only
• Writing causes a page fault
• Page fault handler
– send message to other processors to
invalidate shared page
– mark page read/write
– instruction restart
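A user-level sketch of the same write-fault mechanism on Linux, where mmap/mprotect plus a SIGSEGV handler stand in for the DSM's page-fault handler; send_invalidate() is a stub for the "send message to other processors" step, and the names are mine:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096

static char *region;

/* Stub for "message the other processors to invalidate the shared page". */
static void send_invalidate(void *page) { (void)page; }

static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
    send_invalidate(page);                          /* invalidate remote copies */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);   /* mark page read/write     */
    (void)sig; (void)ctx;                           /* return: faulting write restarts */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* "Shared" page starts read-only, like a replicated DSM page. */
    region = mmap(NULL, PAGE, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    region[0] = 42;                 /* write-faults once, then succeeds */
    printf("%d\n", region[0]);      /* prints 42 */
    return 0;
}
```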
Multicomputer scheduling
• Admission scheduler is important but easily
managed
• Short-term scheduler
– Any appropriate scheduler for local processes
– Even multiprocessor algorithms can be considered
within node
– Globally
• more difficult
• one possibility: gang scheduling
• Load balancing
– Plays role of memory scheduler
– Referred to as a processor allocation algorithm
– Migrating is expensive
[Figure, Tanenbaum p. 525: multicomputer, tightly coupled, 20-50 × 10^3 ns access]
• Distributed heuristics
– distribute work from overloaded nodes
– solicit work on underloaded nodes
Using graph theory to load balance
[Tanenbaum, p. 567]
Sender-Initiated Distributed
Heuristic Algorithm
When above threshold:
• Probe a peer chosen at random, asking it to take a process (1. "Offload?")
• Peer accepts or rejects based on its own load (2. "Too busy" if it is above its own threshold)
• Up to N probes before giving up and running the work locally anyway
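A sketch of the sender-initiated probing loop in C; probe_peer() and migrate_to() are hypothetical stand-ins for the real messaging layer, and N_PROBES, NUM_NODES, and THRESHOLD are illustrative tuning values:

```c
#include <stdbool.h>
#include <stdlib.h>

enum { N_PROBES = 3, NUM_NODES = 16, THRESHOLD = 8 };

static bool probe_peer(int node) { (void)node; return false; } /* "too busy"     */
static void migrate_to(int node) { (void)node; }               /* ship a process */

void maybe_offload(int my_load) {
    if (my_load <= THRESHOLD)
        return;                        /* not overloaded: do nothing   */
    for (int i = 0; i < N_PROBES; i++) {
        int peer = rand() % NUM_NODES; /* pick a peer at random        */
        if (probe_peer(peer)) {        /* peer has spare capacity      */
            migrate_to(peer);
            return;
        }
    }
    /* all N probes rejected: run the new work locally anyway */
}
```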
Receiver-Initiated Distributed
Heuristic Algorithm
When below threshold:
• Solicit a peer at random for work to take over (1. "I'm underloaded, send me a process")
• Peer accepts or rejects based on its own load (2. migrates a process, e.g. P32, if it has work to spare)
• After N unsuccessful probes, waits a while before probing again