
CHAPTER 7: DISTRIBUTED SHARED MEMORY
DSM simulates a logical shared memory address space over a set of physically
distributed local memory systems.

Why DSM?

• direct information sharing programming paradigm (transparency)

• multilevel memory access (locality)

• wealth of existing programs (portability)

• large physical memory

• scalable multiprocessor system

Chapter outline

• NUMA architectures: similarity between multiprocessor cache and DSM systems

• Memory consistency models: why is memory consistency a more critical problem in multiprocessor and DSM systems? how is memory consistency defined?

• Cache coherency protocols: implementation of the consistency models

• DSM implementation: applying the consistency models and coherency protocols to a DSM system

Nonuniform Memory Access (NUMA) architectures

Generic NUMA architecture

[Figure: processor-memory pairs, each with a memory coherence controller, connected by buses, an interconnection network, or a communication network.]
Multiprocessor Cache and DSM architectures

[Figure (a): Multiprocessor cache architecture: processors with local caches share a global memory over a common bus.]

[Figure (b): Distributed shared memory architecture: processors with local memories, connected by a communication network, present a single virtual memory space.]

Common issues
Data consistency and coherency problems arise from data placement, migration, and replication.

• Data Sharing Granularity

• Cache Miss Granularity

• Tradeoffs:

– Transfer time
– Administrative overhead
– Hit rate
– Replacement rate
– False Sharing

What to do on cache miss?

• Locating block - owner/directory

• Block Migration - block bouncing

• Block Replication

• Push vs. Pull

Memory consistency models
These models apply consistency constraints to all memory accesses.
A single access may require multiple messages and take significant time to perform.

Atomic consistency
All processors see the same (global) order
The order respects real-time order (access intervals)

Sequential consistency
All processors see the same (global) order, and
the order respects all internal (program) orders (not necessarily real time)

P1 : W(X)1
P2 : W(Y)2
P3 : R(Y)2 R(X)0 R(X)1

[Figure: the history above drawn as two timelines.
Atomic consistency: a global total order respecting access intervals ("global time").
Sequential consistency: a global total order that need not respect access intervals ("global order").]

Causal consistency
Processors may see different orders, but
every order respects the causal order (internal program order and write-read causality)

P1 : W(X)1 W(X)3
P2 : R(X)1 W(X)2
P3 : R(X)1 R(X)3 R(X)2
P4 : R(X)1 R(X)2 R(X)3

[Figure: the history above with accesses labeled A1-A10 and each processor's view shown; e.g. P3's view is A1 A3 A5 A7 A6 A8 while P4's view is A1 A4 A6 A9 A5 A10, and the causal links must be respected in any order.
Causal consistency: no global total order; only the causal partial order.
Each processor's order respects internal order and write-read causality.]

Processor consistency
Writes from the same processor are seen in order
Writes from different processors are not constrained

P1 : W(X)1
P2 : R(X)1 W(X)2
P3 : R(X)1 R(X)2
P4 : R(X)2 R(X)1

[Figure: the history above with accesses labeled A1-A7 and each processor's view shown; P3 and P4 see the two writes to X in opposite orders, since the cross-processor causal link need not be respected while internal links are.
Processor consistency: no global total order; only a partial order on writes by the same processor.
Each processor's order respects internal order and the order of writes issued by the same processor.]

Slow memory consistency
Writes from the same processor to the same location are seen in order
Writes from different processors, or to different locations, are not constrained

P1 : W(X)1 W(X)2 W(Y)3
P2 : R(Y)3 R(X)1 R(X)2

[Figure: the history above with accesses labeled A1-A6 and P2's view shown; P2 may see W(Y)3 before either write to X, but must see W(X)1 before W(X)2.
Slow memory consistency: no global total order and no constraints across memory locations.
Each processor's order respects its internal order and the order of writes to the same location by the same processor.]

Synchronization Access Consistency Models
Accesses to synchronization variables are distinguished from accesses to ordinary shared variables.

Weak consistency
• Accesses to synchronization variables are sequentially consistent

• No access to a synchronization variable is issued by a processor before all previous read/write data accesses have been performed
(i.e., a synch waits until all ongoing accesses complete)

• No read/write data access is issued by a processor before a previous access to a synchronization variable has been performed
(i.e., all new accesses must wait until the synch is performed)

• In effect, the system "settles" at each synch.

Release consistency
The synchronization access (synch(S)) of the weak consistency model is refined into a pair of acquire(S) and release(S) accesses. Shared variables in the critical section are made consistent when the release operation is performed
(i.e., S "locks" access to the shared variables it protects, and the release does not complete until all accesses to them have also completed).

Entry consistency
Acquire and release are applied to general shared variables: each variable has an implicit synchronization variable that may be acquired to prevent concurrent access to it.

[Figure: delay structure of the three models (delay = the time at which shared variables become consistent).
(a) Weak consistency: a synch(S) is issued only after all previous accesses are done, and future accesses are delayed until the synch is performed; like a barrier sync, but local to the process, so it synchronizes only when necessary, over all variables in the DSM system.
(b) Release consistency: acquire(S) is issued only after previous accesses are performed and is exclusive; release(S) is delayed until all previous accesses are done; consistency is w.r.t. S.
(c) Entry consistency: acquire(X)/release(X) delay only the accesses to X; consistency is w.r.t. memory object X, across all processors.]

[Figure: the same three-processor access sequence (R(X) W(Y) / R(Y) W(Y) / W(Z) W(Z)) shown three ways: with no synchronization, with a Synch(S) inserted under weak consistency, and with each processor's accesses bracketed by Acq(S)/Rel(S) or Acq(R)/Rel(R) pairs under release consistency.]

Taxonomy

atomic consistency
↓ real-time order weakening
sequential consistency

From sequential consistency, the models weaken along two branches:

• processor-relative weakening: causal consistency, then processor consistency, then slow memory (location-relative weakening)

• access-type weakening: weak consistency, then release consistency, then entry consistency (location-relative weakening)

Below all of these: no system coherence support

Multiprocessor Cache Systems

Cache directory

[Figure: a directory entry holds the master copy with an E bit and P presence bits, plus a V bit and an E bit for each replicated block.]

P : number of processors (presence bits)
V : valid or invalid
E : exclusive or shared-read-only

V bit for validity (in replicas), E bit for exclusive access (in all).
May also include a private (= not shared) bit and/or a dirty (= modified) bit.

Cache coherency protocols

write-invalidate and write-update

Write-invalidate
• Read hit: use the local copy

• Read miss: transfer the block; set the P-, V-, and E-bits

• Write hit: invalidate other cache copies, write, and set the E-bit

• Write miss: handled like a read miss followed by a write hit

Hardware mechanisms

• Directory-based

• Snooping cache

DSM implementation

Memory management algorithms

Starting from an exclusive copy, a READ or WRITE may be served remotely, by migration, or by replication:

    Algorithm                                READ        WRITE
    1 : Central-server algorithm (SRSW)      remote      remote
    2 : Migration algorithm (SRSW)           migrate     migrate
    3 : Read-replication algorithm (MRSW)    replicate   migrate
    4 : Full-replication algorithm (MRMW)    replicate   replicate

• Read-remote-write-remote: long network delay, trivial consistency (a sketch follows after this list)

• Read-migrate-write-migrate: thrashing and false sharing

• Read-replicate-write-migrate: write-invalidate

• Read-replicate-write-replicate: full concurrency, atomic update

Considerations:

• Block granularity

• Block transfer communication overhead

• Read/write ratio

• Locality of reference

• Number of nodes and type of interaction

Distributed implementation of Directory

Locating Block Owner:

[Figure: a request chases "probable block owner" pointers through previous owners until it reaches the current owner; the request may be forwarded several times, and it changes each probable-owner pointer along the way.]

Maintaining Copy List:

[Figure (a): spanning-tree representation of the copy set; each node records where its copy came from ("From") and the nodes it supplied ("To's"), with Nil at the leaves.]

[Figure (b): linked-list representation of the copy set; the master node heads a chain of (master, next) records ending in Nil. A request is forwarded along the list, the requestor is appended and acknowledged, and an invalidate or update traverses the list from head to end.]
