Multiprocessors
Classification
❑ Multiprocessors
▪ Multiple CPUs with shared memory
▪ Memory access delays of about 10–50 nsec
❑ Multicomputers
▪ Multiple computers, each with own CPU and memory, connected by a high-
speed interconnect
▪ Tightly coupled, with delays in microseconds
❑ Distributed Systems
▪ Loosely coupled systems connected over a local area network (LAN), or even
long-haul networks such as the Internet
▪ Delays can be seconds, and are unpredictable
Multiprocessors
Multiprocessor Systems
❑ Multiple CPUs with a shared memory
❑ From an application’s perspective, the difference from a
single-processor system need not be visible
▪ Virtual memory where pages may reside in memories
associated with other CPUs
▪ Applications can exploit parallelism for speed-up
❑ Topics to cover
1. Multiprocessor architectures
2. Cache coherence
3. OS organization
4. Synchronization
5. Scheduling
Multiprocessor Architecture
Bus-based UMA
❑ All CPUs and memory modules are connected over a
shared bus
❑ To reduce traffic, each CPU also has a cache
❑ Key design issue: how to maintain coherency of data
that appears in multiple places?
❑ Each CPU can also have a private local memory module
that is not shared with the others
❑ Compilers can be designed to exploit the memory
structure
❑ Typically, such an architecture can support 16 or 32
CPUs, since the shared bus is a bottleneck (memory
accesses are not parallelized)
Switched UMA
❑ Goal: To reduce traffic on bus, provide multiple
connections between CPUs and memory units so that
many accesses can be concurrent
❑ Crossbar Switch: Grid with horizontal lines from CPUs
and vertical lines from memory modules
❑ The crosspoint at (i, j) can connect the i-th CPU to the
j-th memory module
❑ As long as different processors are accessing
different modules, all requests can be in parallel
❑ Non-blocking: waiting is caused only by contention for
memory, never by the bus
❑ Disadvantage: the number of crosspoints grows quadratically
with the number of CPUs and memory modules
❑ Many other networks: omega, counting, …
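A small C sketch of one crossbar arbitration cycle (entirely illustrative code; the names are hypothetical) makes the non-blocking property concrete: requests to distinct modules all proceed in parallel, and waiting happens only when two CPUs want the same module.

#include <stdio.h>

#define NCPUS 4
#define NMODS 4

/* req[i] = memory module requested by CPU i, or -1 if idle.
   grant[i] becomes 1 if CPU i's request proceeds this cycle.
   A module serves at most one CPU per cycle; distinct modules
   are accessed fully in parallel (the non-blocking property). */
void crossbar_arbitrate(const int req[NCPUS], int grant[NCPUS]) {
    int busy[NMODS] = {0};
    for (int i = 0; i < NCPUS; i++) {
        grant[i] = 0;
        if (req[i] >= 0 && !busy[req[i]]) {
            busy[req[i]] = 1;          /* close crosspoint (i, req[i]) */
            grant[i] = 1;
        }
    }
}

int main(void) {
    int req[NCPUS] = {0, 1, 2, 2};     /* CPUs 2 and 3 both want module 2 */
    int grant[NCPUS];
    crossbar_arbitrate(req, grant);
    for (int i = 0; i < NCPUS; i++)
        printf("CPU %d -> module %d: %s\n", i, req[i],
               grant[i] ? "granted" : "waits (memory contention)");
    return 0;
}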
Crossbar Switch
Cache Coherence
❑ Many processors can have locally cached copies of
the same object
▪ Level of granularity can be an object or a block of 64 bytes
❑ We want to maximize concurrency
▪ If many processors just want to read, then each one can have a
local copy, and reads won’t generate any bus traffic
❑ We want to ensure coherence
▪ If a processor writes a value, then all subsequent reads by
other processors should return the latest value
❑ Coherence refers to a logically consistent global
ordering of reads and writes of multiple processors
❑ Modern multiprocessors support intricate schemes
Consistency and replication
❑ Need to replicate (cache) data to improve performance
▪ How are updates propagated between cached replicas?
▪ How are the replicas kept consistent?
❑ Keeping replicas consistent is much harder than on a single processor
▪ When a processor changes the value of its copy of a variable,
• the other copies are invalidated (invalidate protocol), or
• the other copies are updated (update protocol)
Example
[Figure: processors with private caches on a shared bus; memory holds X = 1]
Invalidate vs. update protocols
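Since the slide’s figure is omitted, here is a minimal C sketch of the two choices (names hypothetical): a write either marks every other replica invalid, or pushes the new value into every replica.

#define NCACHES 3

enum cstate { ST_INVALID, ST_VALID };

struct replica { enum cstate st; int val; };

/* Invalidate protocol: after the write, only the writer's copy is usable. */
void write_invalidate(struct replica c[NCACHES], int writer, int v) {
    for (int i = 0; i < NCACHES; i++)
        if (i != writer)
            c[i].st = ST_INVALID;      /* others must re-fetch later */
    c[writer].st = ST_VALID;
    c[writer].val = v;
}

/* Update protocol: the new value is propagated to every copy. */
void write_update(struct replica c[NCACHES], int writer, int v) {
    for (int i = 0; i < NCACHES; i++) {
        c[i].st = ST_VALID;
        c[i].val = v;                  /* all replicas get v immediately */
    }
}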
Snoopy Protocol
❑ Each processor, for every cached object, keeps a state that can be
Invalid, Exclusive or Read-only
❑ Goal: If one has Exclusive copy then all others must be Invalid
❑ Each processor issues three types of messages on bus
▪ Read-request (RR), Write-request (WR), and Value-response (VR)
▪ Each message identifies object, and VR has a tagged value
❑ Assumption:
▪ If there is contention for the bus, then only one sender succeeds
▪ No split transactions (an RR will be answered by a VR)
❑ The protocol is called Snoopy because every controller listens to the bus
all the time, and updates its state in response to RR and WR messages
❑ Each cache controller responds to 4 types of events
▪ Read or write operation issued by its processor
▪ Messages (RR, WR, or VR) observed on the bus
❑ Caution: This is a simplified version
Snoopy Cache Coherence
[Figure: Processors 1…N each have a cache controller that snoops on the bus; a cache entry holds (ID, value, state), e.g. (x, v, Exclusive); the local processor issues Read(x) and Write(x,u) operations]
Snoopy Protocol
❑ If state is Read-only
▪ Read operation: return local value
▪ Write operation: Broadcast WR message on bus, update state to Exclusive,
and update local value
▪ WR message on bus: update state to Invalid
▪ RR message on bus: broadcast VR(v) on bus
❑ If state is Exclusive
▪ Read operation: return local value
▪ Write operation: update local value
▪ RR message on bus: Broadcast VR(v), and change state to Read-only
▪ WR message on bus: update state to Invalid
❑ If state is Invalid
▪ Read operation: Broadcast RR, Receive VR(v), update state to Read-only,
and local value to v
▪ Write operation: As in first case
▪ VR(v) message on bus: Update state to Read-only, and local copy to v
▪ WR message on the bus: do nothing
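The transition rules above form a small state machine. Below is a minimal, single-object C simulation of it (entirely illustrative code: the bus is modeled as a function call that every other controller observes); it can be used to trace the scenarios on the next two slides.

#include <stdio.h>

#define NCPUS 3

enum state { INVALID, READ_ONLY, EXCLUSIVE };
enum msg   { RR, WR, VR };

struct cache { enum state st; int val; };
static struct cache caches[NCPUS];

/* Deliver a bus message to every controller except the sender.
   Simplification: if several Read-only holders answer an RR, they
   all broadcast the same VR; a real bus would let only one succeed. */
static void bus(int sender, enum msg m, int v) {
    for (int i = 0; i < NCPUS; i++) {
        if (i == sender) continue;
        struct cache *c = &caches[i];
        switch (m) {
        case WR:                         /* another CPU writes: invalidate */
            c->st = INVALID;
            break;
        case RR:                         /* another CPU wants the value */
            if (c->st != INVALID) {
                bus(i, VR, c->val);      /* answer with VR(v) */
                if (c->st == EXCLUSIVE)
                    c->st = READ_ONLY;   /* demote: value is now shared */
            }
            break;
        case VR:                         /* a value goes by: pick it up */
            if (c->st == INVALID) {
                c->st = READ_ONLY;
                c->val = v;
            }
            break;
        }
    }
}

static int cache_read(int i) {
    if (caches[i].st == INVALID)
        bus(i, RR, 0);                   /* broadcast RR; the VR answer
                                            fills our entry via bus() */
    return caches[i].val;
}

static void cache_write(int i, int v) {
    if (caches[i].st != EXCLUSIVE)
        bus(i, WR, 0);                   /* invalidate all other copies */
    caches[i].st = EXCLUSIVE;
    caches[i].val = v;
}

int main(void) {                         /* scenario from the next slide */
    caches[1] = (struct cache){ EXCLUSIVE, 3 };  /* P2 holds (x, 3) */
    printf("P3 reads x: %d\n", cache_read(2));   /* P2 answers VR(x,3) */
    cache_write(0, 0);                           /* P1 writes x = 0 */
    printf("P1 reads x: %d\n", cache_read(0));   /* local hit: 0 */
    return 0;
}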
Sample Scenario for Snoopy
❑ Assume 3 processors P1, P2, P3. One object x : int
❑ Initially, P1’s entry for x is invalid, P2’s entry is Exclusive with value 3, and
P3’s entry is invalid
❑ A process running on P3 issues Read(x)
❑ P3 sends the message RR(x) on the bus
❑ P2 updates its entry to Read-only, and sends the message VR(x,3) on the bus
❑ P3 updates its entry to Read-only, records the value 3 in the cache, and
returns the value 3 to Read(x)
❑ P1 also updates the x-entry to (Read-Only, 3)
❑ Now, if Read(x) is issued on any of the processors, no messages will be
exchanged, and the corresponding processor will just return value 3 by a
local look-up
Snoopy Scenario (Continued)
❑ Suppose a process running on P1 issues Write(x,0)
❑ At the same time, a process running on P2 issues Write(x,2)
❑ P1 will try to send WR on the bus, as well as P2 will try to send WR on
the bus
❑ Only one of them succeeds; say it is P1 that succeeds
❑ P1 will update cache-entry to (Exclusive,0)
❑ P3 will update cache-entry to Invalid
❑ P2 will update cache-entry to Invalid
❑ Now, Read / Write operations by processes on P1 will use local copy,
and won’t generate any messages
Notions of consistency
❑ Strict consistency: any read on a data item x returns a
value corresponding to the result of the most recent write
on x (need absolute global time)
▪ Strictly consistent: P1: w(x)a, then P2: r(x)a
▪ Not strictly consistent: P1: w(x)a, but P2: r(x)NIL, then later r(x)a
Multiprocessor OS
❑ How should OS software be organized?
❑ OS should handle allocation of processes to processors;
this is challenging because of shared data structures
such as process tables and ready queues
❑ OS should handle disk I/O for the system as a whole
❑ Two standard architectures
▪ Master-slave
▪ Symmetric multiprocessors (SMP)
Master-Slave Organization
Symmetric Multiprocessing (SMP)
❑ Only one kernel space, but OS can run on any CPU
❑ Whenever a user process makes a system call, the same CPU runs
OS to process it
❑ Key issue: Multiple system calls can run in parallel on different
CPUs
▪ Need locks on all OS data structures to ensure mutual exclusion for critical
updates
❑ Design issue: OS routines should be independent enough that
the granularity of locking gives good performance
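As an illustration of locking granularity (a sketch with hypothetical names, not how any particular OS does it): separate locks for the process table and the ready queue let unrelated system calls proceed in parallel on different CPUs.

#include <pthread.h>

/* One lock per major data structure instead of a single kernel lock:
   a system call updating the process table on one CPU and another
   updating the ready queue on a second CPU then run in parallel. */
static pthread_mutex_t proc_table_lock  = PTHREAD_MUTEX_INITIALIZER;
static pthread_mutex_t ready_queue_lock = PTHREAD_MUTEX_INITIALIZER;

void add_process(void) {
    pthread_mutex_lock(&proc_table_lock);
    /* ... insert a new entry in the process table ... */
    pthread_mutex_unlock(&proc_table_lock);
}

void enqueue_ready(void) {
    pthread_mutex_lock(&ready_queue_lock);
    /* ... append a process to the ready queue ... */
    pthread_mutex_unlock(&ready_queue_lock);
}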
Synchronization
❑ Recall: Mutual exclusion solutions to protect critical
regions involving updates to shared data structures
❑ Classical single-processor solutions
▪ Disable interrupts
▪ Powerful instructions such as Test&Set (TSL)
▪ Software solution such as Peterson’s algorithm
❑ In multiprocessor setting, competing processes can
all be OS routines (e.g., to update process table)
❑ Disabling interrupts is not sufficient, because it only affects
the CPU doing the disabling; the other CPUs keep running
❑ TSL can be used, but requires modification
Original Solution using TSL
Shared variable: lock :{0,1}
lock==1 means some process is in CS
Initially lock is 0
Code for process P0 as well as P1:
while (TRUE) {
try:  TSL X, lock       /* atomically copy lock into X and set lock to 1 */
      if (X != 0) goto try;   /* retry if lock was already set */
      CS();
      lock = 0;               /* reset the lock */
      Non_CS();
}
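On real multiprocessors, TSL is exposed through atomic instructions; here is a minimal runnable C11 sketch of the same loop, with atomic_flag_test_and_set playing the role of TSL (CS and Non_CS left as stubs):

#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;   /* clear means lock == 0 */

void CS(void)     { /* critical section */ }
void Non_CS(void) { /* non-critical section */ }

void process(void) {
    for (;;) {
        /* test-and-set: atomically sets the flag and returns its
           previous value, exactly the TSL X, lock instruction */
        while (atomic_flag_test_and_set(&lock))
            ;                          /* retry if lock set */
        CS();
        atomic_flag_clear(&lock);      /* reset the lock */
        Non_CS();
    }
}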
TSL solution for multi-processors
❑ TSL tests and sets a memory word, which can
require two memory accesses
▪ Not a problem to implement this in single-processor system
❑ Now the bus must be locked to avoid a split transaction
▪ Bus provides a special line for locking
❑ A process that fails to acquire the lock keeps checking,
issuing more TSL instructions
▪ Requires Exclusive access to memory block
▪ Cache coherence protocol would generate lots of traffic
❑ Goal: reduce the number of checks
1. Exponential back-off: instead of constant polling, check only
after a delay (1, 2, 4, 8, … instructions); see the sketch below
2. Maintain a list of processes waiting to acquire the lock
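A sketch of option 1 in C11 (the delay loop and the cap are illustrative choices, not prescribed values):

#include <stdatomic.h>

static atomic_flag lk = ATOMIC_FLAG_INIT;

static void delay(int n) {             /* burn roughly n instructions */
    for (volatile int i = 0; i < n; i++)
        ;
}

void acquire(void) {
    int backoff = 1;
    while (atomic_flag_test_and_set(&lk)) {
        delay(backoff);                /* poll after 1, 2, 4, 8, ... */
        if (backoff < 1024)            /* arbitrary cap (assumption) */
            backoff *= 2;
    }
}

void release(void) {
    atomic_flag_clear(&lk);            /* reset the lock */
}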
Busy-Waiting vs Process switch
❑ In single-processors, if a process is waiting to
acquire lock, OS schedules another ready process
❑ This may not be optimal for multiprocessor systems
▪ If the OS itself is waiting to acquire the lock on the ready list,
then switching is impossible
▪ Switching may be possible, but involves acquiring locks, and
thus, is expensive
❑ OS must decide whether to switch (choice between
spinning and switching)
▪ spinning wastes CPU cycles
▪ switching uses up CPU cycles also
▪ possible to make a separate decision each time a locked mutex is
encountered
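One well-known heuristic, sketched below (SPIN_LIMIT is an assumed constant): spin for about the cost of one context switch, then switch; the total cost is then at most about twice that of the better choice made in hindsight.

#include <sched.h>
#include <stdatomic.h>

static atomic_flag mtx = ATOMIC_FLag_INIT;

#define SPIN_LIMIT 200     /* ~cost of a context switch, in iterations */

void lock_spin_then_switch(void) {
    for (;;) {
        for (int i = 0; i < SPIN_LIMIT; i++)
            if (!atomic_flag_test_and_set(&mtx))
                return;                /* acquired while spinning */
        sched_yield();                 /* give the CPU to another process */
    }
}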
Multiprocessors: Summary
❑ Set of processors connected over a bus with shared
memory modules
❑ Architecture of bus and switches important for efficient
memory access
❑ Caching essential; to manage multiple caches, cache
coherence protocol necessary (e.g. Snoopy)
❑ Symmetric Multiprocessing (SMP) allows OS to run on
different CPUs concurrently
❑ Synchronization issues: OS components work on shared
data structures
▪ TSL based solution to ensure mutual exclusion
▪ Spin locks (i.e. busy waiting) with exponential backoff to reduce
bus traffic
Scheduling
❑ Recall: Standard scheme for single-processor scheduling
▪ Make a scheduling decision when a process blocks/exits, or when
a clock interrupt indicates the end of a time quantum
▪ Scheduling policy needed to pick among ready processes, e.g.
multi-level priority (queues for each priority level)
❑ In multiprocessor system, scheduler must pick among
ready processes and also a CPU
❑ Natural scheme: when a process executing on CPU k
finishes, blocks, or exceeds its time quantum, pick a
ready process according to the scheduling policy and
assign it to CPU k. But this ignores many issues…
Issues for Multiprocessor Scheduling
❑ If a process is holding a lock, it is unwise to switch it out
even if its time quantum expires
❑ Locality issues
▪ If a process p is assigned to CPU k, then CPU k may hold
memory blocks relevant to p in its cache, so p should be assigned
to CPU k whenever possible
▪ If a set of threads/processes communicate with one another then
it is advantageous to schedule them together
❑ Solutions
▪ Space sharing by allocating CPUs in partitions
▪ Gang scheduling: scheduling related threads in same time slots
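A sketch of affinity scheduling with per-CPU ready queues (hypothetical structure; real schedulers add locking and load metrics): each CPU prefers processes from its own queue, preserving cache locality, and steals from other queues only when idle.

#define NCPUS 4
#define QLEN  64

/* Per-CPU ready queues of pids (no locking shown; a real kernel
   would protect each queue with its own lock). */
static int queue[NCPUS][QLEN];
static int head[NCPUS], tail[NCPUS];

static int pop(int cpu) {
    if (head[cpu] == tail[cpu])
        return -1;                     /* queue empty */
    return queue[cpu][head[cpu]++];
}

int pick_next(int cpu) {
    int pid = pop(cpu);                /* affinity: local queue first */
    if (pid >= 0)
        return pid;
    for (int other = 0; other < NCPUS; other++)
        if (other != cpu && (pid = pop(other)) >= 0)
            return pid;                /* steal from a remote queue */
    return -1;                         /* nothing ready: go idle */
}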
Multicomputers
Multicomputers
❑ Definition:
Tightly-coupled CPUs that do not share memory
❑ Also known as
▪ cluster computers
▪ clusters of workstations (COWs)
Clusters
❑ Interconnection topologies
(a) single switch
(b) ring
(c) grid
(d) double torus
(e) cube
(f) hypercube (2^d nodes, where d is the dimension)
Switching Schemes
❑ Messages are transferred in chunks called packets
❑ Store and forward packet switching
▪ Each switch collects bits on input line, assembles the packet, and
forwards it towards destination
▪ Each switch has a buffer to store packets
▪ Delays can be long
❑ Hot-potato routing: No buffering
▪ Necessary for optical communication links
❑ Circuit switching
▪ First establish a path from source to destination
▪ Pump bits on the reserved path at a high rate
❑ Wormhole routing
▪ Splits each packet into small subpackets (flits) that are pipelined
through the switches, approximating circuit switching
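A rough latency model (illustrative; not from the slides) shows the difference. For a packet of L bits crossing h switches over links of bandwidth B bits/sec, with a header of L_h bits:

Store-and-forward:      T ≈ h · (L / B)             (the whole packet is retransmitted at every hop)
Wormhole / cut-through: T ≈ h · (L_h / B) + L / B   (only the header pays per hop; the payload streams behind it)

For example, with L = 1 Kbit, L_h = 32 bits, B = 1 Gbit/sec, and h = 10: roughly 10 μs store-and-forward versus roughly 1.3 μs wormhole.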
Interprocess Communication
Message-based Communication
❑ Library Routines
▪ Send (destination address, buffer containing message)
▪ Receive (optional source address, buffer to store message)
❑ Design issues
▪ Blocking vs non-blocking calls
▪ Should buffers be copied into kernel space?
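In C, such routines might look like the following (signatures hypothetical, loosely in the spirit of typical message-passing libraries):

#include <stddef.h>

typedef int addr_t;                    /* node / process address */

/* Blocking send: returns once buf may be reused; whether that means
   "copied into kernel space" or "delivered" is the design choice above. */
int msg_send(addr_t dest, const void *buf, size_t len);

/* Blocking receive: *src may name a specific sender or a wildcard;
   returns the number of bytes stored into buf. */
int msg_recv(addr_t *src, void *buf, size_t maxlen);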
Blocking vs Non-blocking
❑ A blocking call suspends the caller until the operation completes; a
non-blocking call returns immediately, and completion is signaled later
Buffers and Copying
❑ Copying a message into kernel space lets the sender reuse its buffer
immediately, at the cost of an extra copy
Remote Procedure Call
❑ A procedure call is a more natural way to
communicate
▪ every language supports it
▪ semantics are well defined and understood
▪ natural for programmers to use
❑ Basic idea of RPC (Remote Procedure Call)
▪ define a server as a module that exports a set of
procedures that can be called by client programs.
[Figure: the client issues a call to the server; the server returns the result]
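A sketch of a client stub in C (all names hypothetical; msg_send/msg_recv as sketched earlier): the stub gives the caller an ordinary function while hiding the messaging.

#define SERVER_ADDR 1                  /* assumed address of the server */

/* Client stub for a remote  int add(int a, int b)  procedure. */
int add(int a, int b) {
    int req[2] = { a, b };             /* marshal the parameters */
    int reply;
    addr_t src = SERVER_ADDR;
    msg_send(SERVER_ADDR, req, sizeof req);
    msg_recv(&src, &reply, sizeof reply);  /* block until the result */
    return reply;                      /* unmarshal and return */
}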
A brief history of RPC
❑ Popularized by Birrell and Nelson’s “Implementing Remote Procedure
Calls” (1984)
Remote Procedure Call
❑ Use procedure call as a model for distributed
communication
❑ RPCs can offer a good programming abstraction to hide
low-level communication details
❑ Goal: make RPC look as much like a local procedure call as possible
❑ Many issues:
▪ how do we make this invisible to the programmer?
▪ what are the semantics of parameter passing?
▪ how is binding done (locating the server)?
▪ how do we support heterogeneity (OS, arch., language)?
▪ how to deal with failures?
▪ etc.
Steps in Remote Procedure Calls
1. Client calls the client stub as a normal local procedure call
2. Client stub marshals the parameters into a message and traps to the kernel
3. Kernel sends the message to the remote kernel
4. Remote kernel hands the message to the server stub
5. Server stub unmarshals the parameters and calls the server procedure
6. The reply retraces the same path in reverse
RPC Call Structure
RPC Return Structure
[Figure: the server procedure returns to the server stub; the server’s RPC runtime responds to the original message; the client’s RPC runtime receives the reply, calls the client stub, and the call returns to the original caller]
RPC Stubs
❑ The client stub stands in for the server on the client machine, and the
server stub stands in for the client; together they hide all messaging
from application code
RPC Parameter Marshalling
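As the slide’s figure is omitted, a minimal C sketch of the idea (wire format hypothetical): parameters are flattened into a byte buffer in an agreed order, with byte order fixed so that heterogeneous machines interpret it identically.

#include <stdint.h>
#include <string.h>
#include <arpa/inet.h>                 /* htonl: network byte order */

/* Marshal (proc_id, a, b) into buf; returns the number of bytes used. */
size_t marshal_call(uint8_t *buf, uint32_t proc_id,
                    uint32_t a, uint32_t b) {
    uint32_t net[3] = { htonl(proc_id), htonl(a), htonl(b) };
    memcpy(buf, net, sizeof net);      /* flat, architecture-neutral */
    return sizeof net;
}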
RPC failure semantics
Types of failure
❑ Client cannot locate the server; the request message is lost; the server
crashes after receiving a request; the reply message is lost; the client
crashes after sending a request
Handling message failure
Possible semantics to deal with crashes
❑ At-least-once, at-most-once, or (ideally, but hard to guarantee)
exactly-once execution
Shared memory vs. message passing
❑ Message passing
▪ better performance
▪ know when and what msgs sent: control, knowledge
❑ Shared memory
▪ familiar
▪ hides details of communication
▪ no need to name receivers or senders, just write to
specific memory address and read later
▪ caching for “free”
▪ porting from centralized system (the original “write
once run anywhere”)
▪ no need to rewrite when adding processes; scales because each
node adds memory
▪ an initial implementation is already correct (agreement is reached
at the memory-system level), so all later changes are just
optimizations
Distributed Shared Memory (DSM)
❑ Replication
[Figure: (a) pages of the shared address space distributed across 4 machines]
Distributed Shared Memory (DSM)
❑ Data in the shared address space is accessed as in traditional
virtual memory
❑ A mapping manager maps the shared address space onto each
machine’s physical memory
❑ Advantage of DSM: the familiar shared-memory programming model
without physically shared memory
DSM Implementation Issues
❑ Recall: In virtual memory, OS hides the fact that pages may reside
in main memory or on disk
❑ Recall: In multiprocessors, there is a single shared memory
(possibly virtual) accessed by multiple CPUs. There may be
multiple caches, but cache coherency protocols hide this from
applications
▪ how to make shared data concurrently accessible
❑ DSM: Each machine has its own physical memory, but virtual
memory is shared, so pages can reside in any memory or on disk
▪ how to keep track of the location of shared data
❑ On a page fault, the OS can fetch the page from a remote memory
▪ how to overcome communication delays and protocol overhead when
accessing remote data
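User-level DSM prototypes often implement this with mprotect plus a SIGSEGV handler; a heavily simplified sketch of the mechanism (dsm_base, REGION_PAGES, and fetch_page are placeholders):

#include <signal.h>
#include <stdint.h>
#include <sys/mman.h>

#define PAGE_SIZE    4096
#define REGION_PAGES 1024

extern char *dsm_base;                 /* start of the shared region */
void fetch_page(void *page);           /* copy the page from its owner */

static void dsm_fault(int sig, siginfo_t *si, void *ctx) {
    /* round the faulting address down to its page boundary */
    void *page = (void *)((uintptr_t)si->si_addr &
                          ~(uintptr_t)(PAGE_SIZE - 1));
    fetch_page(page);                  /* bring contents over the network */
    mprotect(page, PAGE_SIZE, PROT_READ | PROT_WRITE);  /* now local */
}

void dsm_init(void) {
    struct sigaction sa = { 0 };
    sa.sa_sigaction = dsm_fault;
    sa.sa_flags = SA_SIGINFO;
    sigaction(SIGSEGV, &sa, NULL);
    /* region starts inaccessible, so every first access faults */
    mprotect(dsm_base, (size_t)PAGE_SIZE * REGION_PAGES, PROT_NONE);
}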
Distributed Shared Memory
❑ False sharing: logically unrelated data items that happen to lie on
the same page make the page bounce between machines
❑ Must also achieve sequential consistency
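A concrete instance of false sharing (illustrative): two counters that are never shared logically, but happen to lie on the same page, will force that page back and forth between machines.

/* hits_a is written only by machine A, hits_b only by machine B,
   yet both lie on the same page of the shared address space, so the
   page ping-pongs between A and B although nothing is truly shared. */
struct stats {
    long hits_a;                       /* machine A's counter */
    long hits_b;                       /* machine B's counter */
};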
Load Balancing
Algorithms for Load Balancing