
Multiple processor systems

8.1, 8.2 – Tanenbaum


Marie Roch
contains slides from:
Tanenbaum 3rd ed. © 2008

Multiple CPU systems


• Why multiple CPUs?
– Some programs are inherently (or embarrassingly) parallel
– Approaching the size/temperature tradeoff limit for current technologies
Multiple CPU systems
Multiprocessor          Multicomputer            Distributed
Shared memory           Tightly coupled          Loosely coupled
2-10 ns access          20-50×10³ ns access      10-100×10⁶ ns access

Fig. 8-1. Types of multiple processor systems and typical times required to access shared/remote memory [Tanenbaum, p. 525]

Multiprocessors
• Each processor can address every word of
memory
• Two paradigms

– Uniform memory access (UMA)
– Nonuniform memory access (NUMA)
UMA cache coherency
• Multicore
– single cache (e.g. Intel dual core)
– or multiple cache (e.g. AMD Opteron)

• Multiple caches need a cache coherence protocol (overview)
– Mark blocks
• read-only – may be in multiple caches
• read/write – only in one cache
– When accessing a word held read/write in another CPU’s cache, that cache must write the block back before the access completes.
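
As a rough illustration of the marking rule above, here is a minimal C sketch (the types and the write-back step are hypothetical simplifications, not a real protocol such as MESI):

    /* Minimal sketch of the read-only/read-write marking rule.
       Hypothetical types; real protocols (e.g. MESI) have more states. */
    enum block_mode { INVALID, READ_ONLY, READ_WRITE };

    struct cache_block {
        enum block_mode mode;  /* READ_ONLY: may be in many caches      */
        int owner_cpu;         /* meaningful only when mode==READ_WRITE */
    };

    /* A CPU touching a word held read/write elsewhere must first force
       the owning cache to write the block back, then demote it. */
    void before_access(struct cache_block *b, int cpu) {
        if (b->mode == READ_WRITE && b->owner_cpu != cpu) {
            /* owner writes the block back to memory here (omitted) */
            b->mode = READ_ONLY;
        }
    }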

UMA architectures
• single bus (ouch!)
• crossbar switch (figure: early telephone crossbar, from https://people.seas.harvard.edu/~jones/cscie129/nu_lectures/lecture11/switching/xbar/xbar.html)
• multistage switch (figure: Quadrics QSNet)
Single bus UMA
• Primary problem: bus contention
• Ways to remedy
– local cache
– local private memory for unshared process data (loader needs hints)

Crossbar switch
Figure 8-3. (a) An 8 × 8 crossbar switch. (b) An open crosspoint. (c) A closed crosspoint. [Tanenbaum p. 528]
Crossbar and multistage
• Crossbars
– grow quadratically (an n × n crossbar needs n² crosspoints)
– are nonblocking (no CPU is ever denied a path to memory because a crosspoint is already in use)
• Multistage switches provide an alternative to this quadratic growth
• Accessing memory through a multistage switch
– treat each read/write as a message on the memory bus

Omega Network
(an inexpensive multistage switch)

• Each stage is a series of 2x2 switches

• Interconnected in a perfect shuffle…

(perfect shuffle demo: YouTube video I0fjh86UihA)
Omega network

Figure 8-5. An omega switching network.
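
Routing through the omega network is self-steering: each 2×2 stage examines one bit of the destination module number (0 = upper output, 1 = lower output). A minimal C sketch for the 8-module, 3-stage network of Fig. 8-5:

    /* Omega-network routing sketch: stage i looks at bit i (MSB first)
       of the destination module number; 0 = upper output, 1 = lower. */
    #include <stdio.h>

    static void route(unsigned dest, unsigned stages) {
        for (unsigned i = 0; i < stages; i++) {
            unsigned bit = (dest >> (stages - 1 - i)) & 1u;
            printf("stage %u: %s output\n", i + 1, bit ? "lower" : "upper");
        }
    }

    int main(void) {
        route(6, 3);   /* module 110: lower, lower, upper */
        return 0;
    }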

Omega network
• Blocking network
(other multistages may differ)
• Interleaving memory
– consecutive words in different RAM modules
– prevents tying up any single path
– can permit parallel access in some architectures
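
A one-function sketch of the usual interleaving scheme (assuming 4-byte words and k modules):

    /* Interleaved memory sketch: consecutive 4-byte words land in
       consecutive modules, so sequential accesses use different paths. */
    unsigned module_of(unsigned addr, unsigned k) {
        return (addr / 4) % k;   /* word number modulo module count */
    }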

NUMA – non-uniform mem. access
• Interconnect networks become unwieldy beyond ~100 CPUs
• Dual class memory
– fast local: as usual
– slow remote: load/store operations only
• Otherwise transparent to user

NUMA flavors
• No Cache (NC-NUMA)
– Remote memory is not cached
• Cache-coherent (CC-NUMA)
– Allows cached remote memory
– Frequently uses directory database for each
cache line
• status (clean/dirty)
• which cache

CC-NUMA example
• Cache line size: 64 (2^6) bytes
• 32-bit address space
• 2^32 / 2^6 = 2^26 cache lines total
• 256 nodes, each with
– 16 MB local RAM (2^24 bytes)
– 1 CPU
• Addressing scheme [Tanenbaum p. 532]

CC-NUMA example

Tanenbaum p. 532

Suppose node 0 fetches 0xFF0AB004

CC-NUMA example
0xFF0AB004 = 1111 1111 | 0000 1010 1011 0000 00 | 00 0100
node = 0xFF (8 bits)   cache line/block = 0x02AC0 (18 bits)   offset = 0x04 (6 bits)

Node 0 sends a message to node 0xFF requesting block 0x02AC0
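
The same split can be computed with shifts and masks; a small C sketch of the 8/18/6-bit layout above:

    /* Decompose a 32-bit CC-NUMA address into node (8 bits),
       cache line (18 bits), and offset (6 bits). */
    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint32_t addr   = 0xFF0AB004u;
        uint32_t node   = addr >> 24;             /* top 8 bits     */
        uint32_t line   = (addr >> 6) & 0x3FFFFu; /* middle 18 bits */
        uint32_t offset = addr & 0x3Fu;           /* bottom 6 bits  */
        printf("node 0x%02X, line 0x%05X, offset 0x%02X\n",
               (unsigned)node, (unsigned)line, (unsigned)offset);
        return 0;   /* prints: node 0xFF, line 0x02AC0, offset 0x04 */
    }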

CC-NUMA example
Node FF’s directory: 2^18 entries (0x00000 – 0x3FFFF), one per local cache line, each holding a VALID bit and a NODE field.

Entry 0x02AC0:
Case 1: invalid – VALID = 0 (the line is not cached anywhere)
Case 2: valid – VALID = 1, NODE = 0xEE (the line is cached at node 0xEE)
CC-NUMA example
Case 1 (line not cached) – node FF:
1. Fetches cache line 0x02AC0 from its local RAM
2. Sends the cache line to node 0x00
3. Updates its directory to indicate that cache line 0x02AC0 is now cached at node 0x00 (VALID = 1, NODE = 0x00)
CC-NUMA
• Acceptable overhead
– 2^18 high-speed ($$$) 9-bit directory entries per node
– ~1.76% of the 16 MB RAM
• More sophisticated (expensive) designs let a cache line be held in multiple caches.
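
The 1.76% figure checks out (a worked calculation; 9 bits = 1 VALID bit + 8 bits to name one of 256 nodes):

    \[
      \frac{2^{18}\,\text{entries} \times 9\,\text{bits}}
           {2^{24}\,\text{bytes} \times 8\,\text{bits/byte}}
      = \frac{2{,}359{,}296}{134{,}217{,}728} \approx 1.76\%
    \]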

Multicore chips
• Common RAM for all cores (UMA)
• Common or separate cache

Tanenbaum p. 23

Multicore chips
• Snooping logic – ensures cache
coherency

• What type of core?


– homogeneous: same processor
– heterogeneous: typically system on a chip

Multiprocessor OS
• Separate OS

• Master-slave

• Symmetric multiprocessor

Separate OS
• CPUs function as separate computers

• Resources partitioned
(some sharing possible, e.g. OS code)

• Many details to consider, e.g. …


– dirty disk pages
– no easy way to load balance

Master-Slave
• Asymmetric
• OS runs on a specific CPU

Tanenbaum p. 536

Symmetric multiprocessor (SMP)
• OS can be executed by any CPU

• Concurrency issues
Note: race conditions can occur on asymmetric
OS as well…

Symmetric multiprocessor (SMP)


• One critical region vs. multiple…

• Deadlocks…

• Remember: these issues are also concerns for a multi-threaded kernel on an asymmetric multiprocessor
Multiprocessor synchronization
• Mutual exclusion protocol
– needs an atomic instruction, e.g. TSL/SWAP
– any atomic instruction must be able to lock the bus
– what happens if the bus is not locked?
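
A minimal spin lock built on an atomic exchange; GCC’s __atomic_exchange_n stands in for a TSL instruction here (a sketch, not kernel code):

    /* Spin lock sketch: __atomic_exchange_n atomically stores 1 and
       returns the old value, playing the role of TSL. */
    static int lock = 0;

    void acquire(void) {
        /* old value 1 => someone else holds the lock: keep trying */
        while (__atomic_exchange_n(&lock, 1, __ATOMIC_ACQUIRE) == 1)
            ;
    }

    void release(void) {
        __atomic_store_n(&lock, 0, __ATOMIC_RELEASE);
    }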

Multiprocessor synchronization
• Playing ping-pong with the cache (two CPUs contending for lock word 0x8A3):
1. 0x8A3 cached at the lock holder’s CPU
2. The other CPU’s TSL writes – 0x8A3 moves there
3. Holder modifies shared vars – cache line moves back
4. Other CPU still polling – cache line moves again…
Multiprocessor synchronization
Strategies to prevent cache invalidation
1. Poll w/ read, use TSL once free
2. Exponential backoff (developed for
Ethernet)
3. Grant private lock

Tanenbaum p. 541
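
Strategies 1 and 2 combine naturally: poll with plain reads (satisfied from the local cache, generating no invalidations) and attempt the TSL only when the lock looks free, backing off after each failed attempt. A hedged C sketch:

    /* Test-and-test-and-set with exponential backoff (sketch).
       Plain reads poll the locally cached copy; the exchange is tried
       only when the lock appears free, limiting cache ping-pong. */
    #include <unistd.h>

    static int lk = 0;

    void acquire_backoff(void) {
        unsigned delay_us = 1;
        for (;;) {
            while (__atomic_load_n(&lk, __ATOMIC_RELAXED))
                ;                                  /* read-only polling   */
            if (__atomic_exchange_n(&lk, 1, __ATOMIC_ACQUIRE) == 0)
                return;                            /* lock acquired       */
            usleep(delay_us);                      /* lost the race       */
            if (delay_us < 1024) delay_us *= 2;    /* exponential backoff */
        }
    }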

When to block
• Spin locks waste CPU cycles, but so do context
switches
– sample context switch: 1 ms
– sample mutual exclusion: 50 μs
• Mutual exclusion time is unknown…
• Alternatives
– always spin
– always switch
– predict based on history or static threshold
• Does it make sense to spin on a uniprocessor?
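
One static-threshold policy: spin for at most the cost of a context switch, then block; with the sample numbers above (1 ms switch, 50 μs typical hold time), spinning usually wins. A sketch, where block_on() and the iteration cost are hypothetical:

    /* Spin-then-block sketch with a static threshold: never spin longer
       than one context switch would cost. */
    #define SWITCH_COST_ITERS 1000      /* ~1 ms of spinning (assumed) */

    extern void block_on(int *lk);      /* hypothetical kernel primitive */

    void lock_adaptive(int *lk) {
        for (int i = 0; i < SWITCH_COST_ITERS; i++)
            if (__atomic_exchange_n(lk, 1, __ATOMIC_ACQUIRE) == 0)
                return;                 /* acquired while spinning */
        block_on(lk);                   /* give up the CPU instead */
    }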

Multiprocessor scheduling
• Kernel-level threads
– Which thread to run?
– What might influence the decision?
• Which CPU to schedule?
• Timesharing vs. spacesharing

Independent vs. dependent threads


• Independent – unrelated
• Dependent
– Could be related through a graph
– May not make as much sense to schedule independently
Common queue for
independent threads

Tanenbaum p. 544

Any potential pitfalls here?

Alternatives/Enhancements
• Smart scheduling
– critical section flag
– extend time quantum when flag set
• Affinity scheduling
– when a process finishes a CPU burst, it leaves lots of cache entries on that CPU
– if we reschedule it there soon enough, it may run faster
– can assign processes to CPUs, then schedule (two-level scheduling)
Space sharing
• Some processes may benefit from being
scheduled simultaneously.
• Typically scheduled FCFS

CPUs are dedicated to specific processes; CPUs sit idle when their process blocks [Tanenbaum p. 546]

Gang scheduling
I am a fugitive from a chain gang (Warner Bros, 1932)

• Space sharing eliminates context switches
• Discretize scheduling time
• Spaceshare a “gang” of related processes at each interval
• CPUs remain idle until the next quantum if a CPU burst completes
Multicomputers
aka: cluster computers, cluster of workstations

• Recall:
– Tightly coupled
– No shared memory
• Nodes
– CPU (possibly multicore)
– high speed network
– RAM
– perhaps secondary storage

Interconnect
• Various network topologies
• Samples: double torus, cube, 4D hypercube [Tanenbaum p. 550]
• Others: star, ring, grid
Routing
• Packet switched
– messages packetized
– “store and forward:” each switch point
• receives packet
• forwards to next switch point
• latency increases with # switch points
• Circuit switched
– Establish path
– All bits sent along path

Network interfaces
• Copying buffers increases delay

• Map hardware buffer into user space?


– problems for multiple users
– problems for kernel processes

– partial solution: Use two network interfaces

User-level communication
• Message passing (CS570)
– send/receive
– ports/mailboxes
– addressing
• unlike Internet, fixed network
• typically CPU# & port/mailbox or process#
– blocking/nonblocking

Implementing non-blocking messages
• send
– user cannot modify buffer until message
actually sent
– three possibilities
• block until kernel can copy to an internal buffer*
• generate interrupt once buffer is sent
• mark page as copy-on-write until sent

*From a network perspective, this call is still non-blocking, but not from
an OS one. It is the easiest and most common option implemented.
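
The starred option might look like the following sketch (all names hypothetical; enqueue_for_transmit stands for whatever hands the buffer to the NIC):

    /* Non-blocking send, option 1 (sketch): block only long enough to
       copy the user buffer into a kernel buffer, then return; the user
       may reuse the buffer at once. */
    #include <stdlib.h>
    #include <string.h>

    extern void enqueue_for_transmit(int dest, void *kbuf, size_t len);

    int send_nb(int dest, const void *buf, size_t len) {
        void *kbuf = malloc(len);          /* kernel-side copy          */
        if (!kbuf)
            return -1;
        memcpy(kbuf, buf, len);            /* the only delay caller sees */
        enqueue_for_transmit(dest, kbuf, len);  /* actually sent later  */
        return 0;
    }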

Receive
• Will blocking cause problems?

• Non-blocking
– polling
– message arrival by interrupt
• inform calling thread (traditional)
• pop-up threads
• active messages (pop-up variant)
– call from user-level interrupt handler

Remote procedure calls (RPCs)


Marshalling – packing arguments into a message. [Tanenbaum p. 559]
Stub functions hide message transmission
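
For a trivial remote procedure, the client stub might marshal like this (a sketch with hypothetical transport calls; real RPC systems generate such stubs):

    /* Client-stub sketch: marshal arguments into a message, send it,
       await the reply. The caller sees an ordinary function call. */
    #include <stddef.h>
    #include <stdint.h>

    struct rpc_msg { uint32_t proc_id; int32_t args[2]; };

    extern void send_to_server(const void *msg, size_t len);   /* hypothetical */
    extern void recv_from_server(void *reply, size_t len);     /* hypothetical */

    int32_t remote_add(int32_t a, int32_t b) {
        struct rpc_msg m = { 1 /* "add" */, { a, b } };
        send_to_server(&m, sizeof m);
        int32_t result;
        recv_from_server(&result, sizeof result);
        return result;
    }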

RPC Gotchas
• Pointers to data structures that are not
well contained (e.g. graph)
• Weak types (e.g. int x[] in C/C++)
• Types can be difficult to deduce (e.g.
printf)
• References to globals

Distributed shared memory (DSM)

• Transparent to user
• Modifications to page table
– Invalid pages may be on another processor
– Page fault results in fetching page from other
CPU’s memory
• Read-only pages can be shared
• Extensions possible (e.g. share until write)

DSM Issues
• Network startup is expensive
• Small difference between sending 1 page
vs. 4 pages
• Too large a transfer is more likely to result
in false sharing

back to playing ping-pong… [Tanenbaum p. 564]

Sequential consistency
• Suppose we let writable pages be shared: what if threads 0 and 1 both write to the same location (e.g. 0x37F)?

(figure: thread 0 on machine 0 and thread 1 on machine 1 each hold a copy of the same logical page, P32, in their local physical memory)
Easy sequential consistency
• Mark shared pages read only
• Writing causes a page fault
• Page fault handler
– send message to other processors to
invalidate shared page
– mark page read/write
– restart the faulting instruction
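
The handler itself is short; a sketch in C (helper names hypothetical):

    /* DSM write-fault handler sketch: shared pages start read-only;
       the first write invalidates remote copies, upgrades the page,
       and lets the hardware restart the faulting instruction. */
    enum page_prot { PAGE_RO, PAGE_RW };                    /* hypothetical */

    extern void broadcast_invalidate(unsigned page);        /* hypothetical */
    extern void set_protection(unsigned page, enum page_prot p);

    void dsm_write_fault(unsigned page) {
        broadcast_invalidate(page);     /* other nodes drop their copies */
        set_protection(page, PAGE_RW);  /* mark the page read/write      */
        /* returning from the fault restarts the write */
    }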

A little bit trickier…


• Shared pages
– read/write
– writes
• obtain mutex from OS covering region of page
• write
• upon release, OS propagates region to other
processors
• Other techniques possible…

Multicomputer scheduling
• Admission scheduler is important but easily managed
• Short-term scheduler
– Any appropriate scheduler for local processes
– Even multiprocessor algorithms can be considered within a node
– Globally
• more difficult
• one possibility: gang scheduling
• Load balancing
– Plays the role of a memory scheduler
– Referred to as a processor allocation algorithm
– Migrating is expensive

Processor allocation algorithms


• Graph theoretic

• Distributed heuristics
– distribute work from overloaded nodes
– solicit work on underloaded nodes

Using graph theory to load balance

[Tanenbaum, p. 567]

Partition graph to minimize internode traffic


How: Beyond the scope of lecture, but it might
make a good presentation…

Sender-Initiated Distributed
Heuristic Algorithm
When above threshold:
• Solicit a peer at random to take processes (1. “Offload?”)
• Peer accepts/rejects based upon its acceptance threshold (2. “Too busy”)
• Up to N probes before running the process anyway
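
A sketch of the sender-initiated probe loop (all helper names hypothetical):

    /* Sender-initiated load balancing (sketch): an overloaded node probes
       up to N random peers; a peer accepts only if below its own
       acceptance threshold. */
    #define N_PROBES 3

    struct process;                             /* opaque        */
    extern int  random_node(void);              /* hypothetical  */
    extern int  ask_to_accept(int peer);        /* peer checks its load */
    extern void migrate(struct process *p, int peer);
    extern void run_locally(struct process *p);

    void offload_or_run(struct process *p) {
        for (int i = 0; i < N_PROBES; i++) {
            int peer = random_node();
            if (ask_to_accept(peer)) {          /* peer: below threshold? */
                migrate(p, peer);
                return;
            }
        }
        run_locally(p);                         /* after N refusals, run anyway */
    }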
Receiver-Initiated Distributed
Heuristic Algorithm
When below threshold:
• Solicit a peer at random for work to take on (1. “Will work for electrons”)
• Peer accepts/rejects based upon its own threshold (2. Migrate P32)
• After N probes, waits a while before probing again

Question: Both algorithms can result in lots of messages being sent. Which one might perform better?
