MultiProcessors Tanenbaum BP
Multiple CPU systems
Multiprocessor Multicomputer Distributed
Multiprocessors
• Each processor can address every word of
memory
• Two paradigms: UMA and NUMA
UMA cache coherency
• Multicore
– single shared cache (e.g. Intel dual core)
– or separate per-core caches (e.g. AMD Opteron)
UMA architectures
• single bus
• crossbar switch
• multistage switch
(crossbar switching demo: http://people.seas.harvard.edu/~jones/cscie129/nu_lectures/lecture11/switching/xbar/xbar.html)
Single bus UMA
• Primary problem: bus contention
• Ways to remedy
– local, per-CPU cache
– local private memory for unshared process data (loader needs hints)
Crossbar switch
Tanenbaum p. 528
Crossbar and multistage
• Crossbars
– crosspoint count grows as n² (quadratically), which gets expensive
– are nonblocking (no CPU is ever denied a path to the memory module it wants; several CPUs can reach different modules at once)
• Multistage switches provide a cheaper alternative to this quadratic growth
• Accessing memory through a multistage switch
– treat each read/write as a message routed toward the memory module
Omega Network
(an inexpensive multistage switch)
• Blocking network
(other multistages may differ)
• Interleaving memory
– consecutive words in different RAM modules
– prevents tying up any single path
– can permit parallel access in some architectures (see the sketch below)
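A minimal C sketch of the usual interleaved mapping, assuming 8 modules and 4-byte words (both illustrative values, not from the slides):

```c
#include <stdint.h>

/* Toy interleaved-memory mapping: consecutive words land in
   consecutive RAM modules instead of all hitting the same one. */
enum { NUM_MODULES = 8, WORD_SIZE = 4 };

static inline unsigned module_of(uint32_t addr) {
    return (addr / WORD_SIZE) % NUM_MODULES;   /* which RAM module      */
}

static inline uint32_t offset_in_module(uint32_t addr) {
    return (addr / WORD_SIZE) / NUM_MODULES;   /* word index inside it  */
}
```

With this mapping, words at addresses 0x00, 0x04, 0x08, … fall in modules 0, 1, 2, …, so a sequential sweep keeps several modules and switch paths busy at once rather than tying up a single path.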
NUMA – non-uniform mem. access
• Interconnect networks become unwieldy beyond ~100 CPUs
• Two classes of memory
– fast local memory: accessed as usual
– slower remote memory: still accessed with ordinary LOAD/STORE instructions
• Otherwise transparent to user
NUMA flavors
• No Cache (NC-NUMA)
– Remote memory is not cached
• Cache-coherent (CC-NUMA)
– Allows cached remote memory
– Frequently uses directory database for each
cache line
• status (clean/dirty)
• which cache
CC-NUMA example
• Cache line size: 64 (2^6) bytes
• 32-bit address space
• 2^32 / 2^6 = 2^26 cache lines in total
• 256 nodes
– 16 MB local RAM (2^24 bytes = 2^18 cache lines) per node
– 1 CPU per node
• Addressing scheme:
Tanenbaum p. 532
CC-NUMA example
Example address 0xFF0AB004 splits into: node 0xFF (top 8 bits), cache line/block 0x02AC0 (next 18 bits), byte offset 0x04 (low 6 bits).
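A small C check of that breakdown (field widths follow the 256-node, 64-byte-line example above; variable names are mine):

```c
#include <stdint.h>
#include <stdio.h>

/* Split a 32-bit address for the 256-node CC-NUMA example:
   top 8 bits = node, next 18 bits = cache line, low 6 bits = offset. */
int main(void) {
    uint32_t addr   = 0xFF0AB004u;
    uint32_t offset = addr & 0x3Fu;            /* low 6 bits   */
    uint32_t line   = (addr >> 6) & 0x3FFFFu;  /* next 18 bits */
    uint32_t node   = addr >> 24;              /* top 8 bits   */
    printf("node 0x%02X, line 0x%05X, offset 0x%02X\n", node, line, offset);
    /* prints: node 0xFF, line 0x02AC0, offset 0x04 */
    return 0;
}
```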
CC-NUMA example
Node 0xFF's directory has 2^18 entries (0x00000 through 0x3FFFF), one per local cache line; each entry holds a VALID bit and the NODE currently caching that line.
Case 1: the entry for line 0x02AC0 is invalid, i.e. the line is not cached anywhere.
CC-NUMA example
Node 0x00 requests line 0x02AC0; node 0xFF's directory shows it invalid (Case 1), so node 0xFF:
1. Fetches cache line 0x02AC0 from its local RAM
2. Sends the cache line to node 0x00
3. Updates the directory entry for 0x02AC0 to valid, node 0x00
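A rough C sketch of that directory lookup, assuming the 2^18-entry per-node directory described above; the names dir_entry and handle_remote_read, and the bit-field layout, are illustrative rather than from the text:

```c
#include <stdint.h>

#define LINES_PER_NODE (1u << 18)   /* 2^18 local cache lines per node */

struct dir_entry {
    uint16_t valid : 1;   /* 1 if some remote node caches this line */
    uint16_t node  : 8;   /* which node (0..255) holds it           */
};

static struct dir_entry directory[LINES_PER_NODE];

/* Home node handles a read request for local line `line` from `requester`. */
void handle_remote_read(uint32_t line, uint8_t requester) {
    struct dir_entry *e = &directory[line];
    if (!e->valid) {
        /* Case 1 (as on the slide): line not cached anywhere. A real
           node would fetch it from local RAM and send it to the
           requester; here we just record the new holder. */
        e->valid = 1;
        e->node  = requester;
    } else {
        /* Line already cached elsewhere: the home node would first
           recall/invalidate it from e->node (not shown on the slide). */
    }
}
```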
CC-NUMA
• Acceptable overhead
– 2^18 high-speed ($$$) 9-bit directory entries per node (1 valid bit + 8-bit node number)
– 2^18 × 9 bits ≈ 288 KB, about 1.76% of the 16 MB local RAM
Multicore chips
• Common RAM for all cores (UMA)
• Common or separate cache
Tanenbaum p. 23
Multicore chips
• Snooping logic – ensures cache
coherency
Multiprocessor OS
• Separate OS
• Master-slave
• Symmetric multiprocessor
Separate OS
• CPUs function as separate computers
• Resources partitioned
(some sharing possible, e.g. OS code)
Master-Slave
• Asymmetric
• OS runs on a specific CPU
Tanenbaum p. 536
Symmetric multiprocessor (SMP)
• OS can be executed by any CPU
• Concurrency issues
Note: race conditions can occur on asymmetric
OS as well…
• Deadlocks…
Multiprocessor synchronization
• Mutual exclusion protocol
– needs an atomic instruction, e.g. TSL or SWAP
– the atomic instruction must be able to lock the bus (or at least the cache line)
– what happens if the bus is not locked? Another CPU can slip in between TSL's read and its write, both see the lock as free, and both enter the critical region (see the spinlock sketch below)
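A minimal spin lock built on the C11 analogue of TSL (atomic_flag_test_and_set); the hardware makes the read-modify-write atomic by locking the bus or, on modern CPUs, the cache line. This naive version is exactly the one that causes the cache ping-pong discussed next:

```c
#include <stdatomic.h>

static atomic_flag lock = ATOMIC_FLAG_INIT;

void acquire(void) {
    /* TSL-style: atomically set the flag and return its old value.
       Every failed attempt is an atomic RMW on the lock word. */
    while (atomic_flag_test_and_set(&lock))
        ;   /* spin */
}

void release(void) {
    atomic_flag_clear(&lock);
}
```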
Multiprocessor synchronization
• Playing ping-pong with the cache: the word holding the lock (0x8A3 in the figure) is cached by the CPU that touched it last; each time the other CPU's TSL modifies it, or modifies shared variables on the same line, the cache line is invalidated and moves over, bouncing back and forth while both spin.
Multiprocessor synchronization
Strategies to prevent cache invalidation
1. Poll w/ read, use TSL once free
2. Exponential backoff (developed for
Ethernet)
3. Grant private lock
Tanenbaum p. 541
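A sketch combining strategies 1 and 2 above: spin with ordinary reads (which stay in the local cache), attempt the atomic exchange only when the lock looks free, and back off exponentially after a failed attempt. The constants and delay loop are illustrative tuning choices, not values from the text:

```c
#include <stdatomic.h>

static atomic_int lock = 0;   /* 0 = free, 1 = held */

void tts_acquire(void) {
    unsigned backoff = 1;
    for (;;) {
        /* Strategy 1: poll with plain reads; spinning stays in the
           local cache and does not invalidate anyone else's copy.  */
        while (atomic_load_explicit(&lock, memory_order_relaxed) != 0)
            ;
        /* Lock looks free: do the atomic TSL/exchange just once.   */
        if (atomic_exchange_explicit(&lock, 1, memory_order_acquire) == 0)
            return;                            /* acquired          */
        /* Strategy 2: lost the race, back off exponentially.       */
        for (volatile unsigned i = 0; i < backoff; i++)
            ;                                  /* crude delay loop  */
        if (backoff < (1u << 16))
            backoff <<= 1;
    }
}

void tts_release(void) {
    atomic_store_explicit(&lock, 0, memory_order_release);
}
```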
When to block
• Spin locks waste CPU cycles, but so do context
switches
– sample context switch: 1 ms
– sample mutual exclusion: 50 μs
• Mutual exclusion time is unknown…
• Alternatives
– always spin
– always switch
– predict based on history or static threshold
• Does it make sense to spin on a uniprocessor? (No: the lock holder cannot run to release the lock while we spin)
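One common compromise, sketched here with POSIX threads: spin briefly (a bounded number of trylock attempts, standing in for "less than a context switch"), then block. SPIN_TRIES is an assumed tuning knob, not a value from the slides:

```c
#include <pthread.h>

/* Assumed tuning knob: roughly "spin for less than a context switch". */
enum { SPIN_TRIES = 1000 };

void hybrid_lock(pthread_mutex_t *m) {
    for (int i = 0; i < SPIN_TRIES; i++)
        if (pthread_mutex_trylock(m) == 0)
            return;               /* got the lock while spinning   */
    pthread_mutex_lock(m);        /* still held: give up and block */
}

void hybrid_unlock(pthread_mutex_t *m) {
    pthread_mutex_unlock(m);
}
```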
Multiprocessor scheduling
• Kernel-level threads
• Timesharing vs. space sharing
[Figure, Tanenbaum p. 525: multiprocessor with shared memory, 2-10 ns access]
Common queue for
independent threads
Tanenbaum p. 544
Alternatives/Enhancements
• Smart scheduling
– critical section flag
– extend time quantum when flag set
• Affinity scheduling
– when a process finishes its CPU burst, it leaves lots of warm cache entries behind on that CPU
– if it is rescheduled there soon enough, it may run faster
– can assign each process to a CPU, then schedule within that CPU (two-level scheduling; see the affinity sketch below)
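The slides describe affinity scheduling done inside the OS; as a user-visible analogue, here is a sketch that pins the calling thread to one CPU with the Linux-specific pthread_setaffinity_np call so it keeps reusing that CPU's warm cache. CPU number 2 is an arbitrary choice:

```c
#define _GNU_SOURCE
#include <pthread.h>
#include <sched.h>
#include <stdio.h>
#include <string.h>

/* Pin the calling thread to a single CPU so the scheduler keeps it on
   the same warm cache. */
static int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
}

int main(void) {
    int err = pin_to_cpu(2);
    if (err != 0)
        fprintf(stderr, "pin failed: %s\n", strerror(err));
    /* ... this thread now stays on CPU 2, reusing its cached data ... */
    return 0;
}
```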
Space sharing
• Some processes may benefit from being
scheduled simultaneously.
• Typically scheduled FCFS
[Figure, Tanenbaum p. 546: CPUs dedicated to specific processes sit idle whenever their process blocks]
Gang scheduling
I am a fugitive from a chain gang (Warner Bros, 1932)
Multicomputers
aka: cluster computers, cluster of workstations
• Recall:
– Tightly coupled
– No shared memory
• Nodes
– CPU (possibly multicore)
– high speed network
– RAM
– perhaps secondary storage
Interconnect
• Various network topologies
• Samples:
Tanenbaum p. 550
Routing
• Packet switched
– messages packetized
– “store and forward:” each switch point
• receives packet
• forwards to next switch point
• latency increases with # switch points
• Circuit switched
– Establish path
– All bits sent along path
Network interfaces
• Copying buffers increases delay
User-level communication
• Message passing (CS570)
– send/receive
– ports/mailboxes
– addressing
• unlike Internet, fixed network
• typically CPU# & port/mailbox or process#
– blocking/nonblocking
Implementing non-blocking messages
• send
– user cannot modify buffer until message
actually sent
– three possibilities
• block until kernel can copy to an internal buffer*
• generate interrupt once buffer is sent
• mark page as copy-on-write until sent
*From a network perspective, this call is still non-blocking, but not from
an OS one. It is the easiest and most common option implemented.
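MPI is not mentioned in the slides, but it is a convenient message-passing library to illustrate the buffer-reuse hazard above: MPI_Isend returns immediately, and the sender must not touch the buffer until MPI_Wait reports completion (run with at least 2 ranks, e.g. mpirun -np 2):

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        int buf[4] = {1, 2, 3, 4};
        MPI_Request req;
        MPI_Isend(buf, 4, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
        /* ... overlap other work here, but do NOT modify buf ... */
        MPI_Wait(&req, MPI_STATUS_IGNORE);   /* now buf may be reused */
    } else if (rank == 1) {
        int buf[4];
        MPI_Recv(buf, 4, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("got %d %d %d %d\n", buf[0], buf[1], buf[2], buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```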
Receive
• Will blocking cause problems?
• Non-blocking
– polling
– message arrival by interrupt
• inform calling thread (traditional)
• pop-up threads
• active messages (pop-up variant)
– call from user-level interrupt handler
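A matching sketch of non-blocking receive by polling (again using MPI purely as an illustration): post the receive, keep doing other work, and test periodically for message arrival instead of blocking:

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    MPI_Init(&argc, &argv);
    int rank;
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 1) {
        int msg;
        MPI_Request req;
        MPI_Irecv(&msg, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
        int done = 0;
        while (!done) {
            /* ... do other useful work ... */
            MPI_Test(&req, &done, MPI_STATUS_IGNORE);   /* poll for arrival */
        }
        printf("received %d\n", msg);
    } else if (rank == 0) {
        int msg = 42;
        MPI_Send(&msg, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    }
    MPI_Finalize();
    return 0;
}
```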
RPC Gotchas
• Pointers to data structures that are not
well contained (e.g. graph)
• Weak types (e.g. int x[] in C/C++)
• Types can be difficult to deduce (e.g.
printf)
• References to globals
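A tiny illustration of the "weak types" gotcha above: inside the callee, int x[] is just a pointer, so a generated RPC stub has no way to know how many elements to marshal. The function names are made up:

```c
#include <stddef.h>

/* Looks like an array parameter, but x decays to int*: the declaration
   carries no length, so an RPC stub cannot tell how much data to copy
   into the request message. */
void f(int x[]) {
    /* sizeof(x) == sizeof(int *) here, not the array size */
    (void)x;
}

/* An RPC-friendly signature makes the size explicit. */
void f_rpc(const int *x, size_t n);
```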
Distributed shared memory (DSM)
• Transparent to user
• Modifications to page table
– Invalid pages may be on another processor
– Page fault results in fetching page from other
CPU’s memory
• Read-only pages can be shared
• Extensions possible (e.g. share until write)
DSM Issues
• Network startup is expensive
• Small difference between sending 1 page
vs. 4 pages
• Too large a transfer is more likely to result
in false sharing
Tanenbaum p. 564
back to playing ping-pong…
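A toy pthreads program showing false sharing at cache-line granularity (under DSM the same effect happens at page granularity): the two threads never touch each other's counter, yet the shared line or page still bounces between them. Sizes and names are illustrative:

```c
#include <pthread.h>
#include <stdio.h>

/* a and b live in the same unit of sharing (cache line here, a whole
   page under DSM), so updates from different threads ping-pong it. */
static struct { long a; long b; } shared;

static void *bump_a(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.a++;
    return arg;
}
static void *bump_b(void *arg) {
    for (long i = 0; i < 10000000; i++) shared.b++;
    return arg;
}

int main(void) {
    pthread_t t1, t2;
    pthread_create(&t1, NULL, bump_a, NULL);
    pthread_create(&t2, NULL, bump_b, NULL);
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("%ld %ld\n", shared.a, shared.b);
    /* Padding a and b onto separate lines (or pages) removes the
       ping-pong and typically speeds this up noticeably. */
    return 0;
}
```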
Sequential consistency
• Suppose we let writable pages be shared:
What if threads 0 and 1 both
write to the same location...
[Figure: page P32 of the shared logical memory is replicated on machine 0 and machine 1; thread 0 and thread 1 both write location 0x37F of that page.]
Easy sequential consistency
• Mark shared pages read only
• Writing causes a page fault
• Page fault handler
– send message to other processors to
invalidate shared page
– mark page read/write
– instruction restart
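A user-level sketch of the same write-fault mechanism on Linux, where mmap/mprotect plus a SIGSEGV handler stand in for the DSM's page-fault handler; send_invalidate() is a stub for the "send message to other processors" step, and the names are mine:

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>
#include <sys/mman.h>

#define PAGE 4096

static char *region;

/* Stub for "message the other processors to invalidate the shared page". */
static void send_invalidate(void *page) { (void)page; }

static void on_write_fault(int sig, siginfo_t *si, void *ctx) {
    void *page = (void *)((uintptr_t)si->si_addr & ~(uintptr_t)(PAGE - 1));
    send_invalidate(page);                          /* invalidate remote copies */
    mprotect(page, PAGE, PROT_READ | PROT_WRITE);   /* mark page read/write     */
    (void)sig; (void)ctx;                           /* return: faulting write restarts */
}

int main(void) {
    struct sigaction sa;
    memset(&sa, 0, sizeof sa);
    sa.sa_sigaction = on_write_fault;
    sa.sa_flags = SA_SIGINFO;
    sigemptyset(&sa.sa_mask);
    sigaction(SIGSEGV, &sa, NULL);

    /* "Shared" page starts read-only, like a replicated DSM page. */
    region = mmap(NULL, PAGE, PROT_READ, MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    region[0] = 42;                 /* write-faults once, then succeeds */
    printf("%d\n", region[0]);      /* prints 42 */
    return 0;
}
```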
Multicomputer scheduling
• Admission scheduler is important but easily
managed
• Short-term scheduler
– Any appropriate scheduler for local processes
– Even multiprocessor algorithms can be considered
within node
– Globally
• more difficult
• one possibility: gang scheduling
• Load balancing
– Plays role of memory scheduler
– Referred to as a processor allocation algorithm
– Migrating is expensive
[Figure, Tanenbaum p. 525: multicomputer, tightly coupled, 20-50 × 10^3 ns access]
• Distributed heuristics
– distribute work from overloaded nodes
– solicit work on underloaded nodes
Using graph theory to load balance
[Tanenbaum, p. 567]
Sender-Initiated Distributed
Heuristic Algorithm
When above threshold:
• Probe a peer chosen at random, asking it to take a process (1. "Offload?")
• Peer accepts or rejects based on its own load (2. "Too busy" if it is above its own threshold)
• Up to N probes before giving up and running the work locally anyway
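A sketch of the sender-initiated probing loop in C; probe_peer() and migrate_to() are hypothetical stand-ins for the real messaging layer, and N_PROBES, NUM_NODES, and THRESHOLD are illustrative tuning values:

```c
#include <stdbool.h>
#include <stdlib.h>

enum { N_PROBES = 3, NUM_NODES = 16, THRESHOLD = 8 };

static bool probe_peer(int node) { (void)node; return false; } /* "too busy"     */
static void migrate_to(int node) { (void)node; }               /* ship a process */

void maybe_offload(int my_load) {
    if (my_load <= THRESHOLD)
        return;                        /* not overloaded: do nothing   */
    for (int i = 0; i < N_PROBES; i++) {
        int peer = rand() % NUM_NODES; /* pick a peer at random        */
        if (probe_peer(peer)) {        /* peer has spare capacity      */
            migrate_to(peer);
            return;
        }
    }
    /* all N probes rejected: run the new work locally anyway */
}
```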
Receiver-Initiated Distributed
Heuristic Algorithm
When below threshold:
• Solicit a peer at random for work to take over (1. "I'm underloaded, send me a process")
• Peer accepts or rejects based on its own load (2. migrates a process, e.g. P32, if it has work to spare)
• After N unsuccessful probes, waits a while before probing again