
HIGH PERFORMANCE COMPUTING
LECTURE 6

Dr. Mohamed Ghetas


MIMD Systems Interconnection

MIMD
 Shared Memory
   ◼ Bus
   ◼ Crossbar
 Distributed Memory
   ◼ Direct: Ring, Toroidal mesh
   ◼ Indirect: Crossbar, Omega
Cache coherence
 Programmers have no control over caches and when they get updated.

Example
A shared memory system with two cores and two caches:
◼ y0 privately owned by Core 0
◼ y1 and z1 privately owned by Core 1
Cache coherence
 y0 privately owned by Core 0
 y1 and z1 privately owned by Core 1
 x = 2; /* shared variable */

 y0 eventually ends up = 2
 y1 eventually ends up = 6
 z1 = ???
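The value of z1 depends on whether Core 1 sees Core 0's update to x. A minimal sketch of the timeline, following the well-known textbook version of this example (the statements for y0 and y1 are inferred from the stated results, and the later assignment x = 7 is taken from the textbook, not the slide):

    int x = 2;    /* shared variable, cached by both cores */
    int y0;       /* private to Core 0 */
    int y1, z1;   /* private to Core 1 */

    y0 = x;       /* Core 0: y0 = 2 */
    y1 = 3 * x;   /* Core 1: y1 = 6 */
    x  = 7;       /* Core 0: updates its cached copy of x */
    z1 = 4 * x;   /* Core 1: 28 if its cache sees the new x = 7,
                     but 8 if it still holds the stale x = 2 */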

Problem with Write-Through Policy
 With a write-through policy, every write goes straight to main memory, but copies of the same cache line in other cores' caches are not updated, so those cores can keep reading stale values.

Problem with Write-Back Policy
 With a write-back policy, an update stays in the writing core's cache until the line is evicted, so in the meantime both main memory and the other caches hold stale values.
Cache coherence
 Programmers have no control over caches and when they get updated.
 Copies of the data stored in the shared memory must match the copies stored in the local caches. This is referred to as cache coherence.
 The copies of a shared variable are coherent if they are all equal.
 Cache coherence is important to guarantee correct program execution and to ensure high system performance.
Cache Coherence Protocols
 A cache coherence protocol must be used to ensure that the contents of the cache memories are consistent with the contents of the shared memory.
 Two main cache coherence protocols:
   1. Snooping Cache Coherence
   2. Directory Based Cache Coherence
Snooping Cache Coherence
 The cores share a bus.
 Any signal transmitted on the bus can be “seen” by all cores connected to the bus.
 When core 0 updates the copy of x stored in its cache, it also broadcasts this information across the bus.
 If core 1 is “snooping” the bus, it will see that x has been updated and it can mark its copy of x as invalid.
Snooping Cache Coherence
 Snooping works with both write-through and write-back caches.
 It requires a broadcast every time a variable is updated.
 In large networks, broadcasts are expensive.
 Snooping cache coherence isn’t scalable: as systems grow, the broadcast traffic causes performance to degrade.
Directory Based Cache Coherence
 Uses a data structure called a directory that stores the status of each cache line.
 When a variable is updated, the directory is consulted, and the cache controllers of the cores holding that variable’s cache line invalidate their copies.
Directory Based Cache Coherence
 The local caches associated with the processors have local cache controllers that coordinate updates to the copies of the shared variables stored in those caches.
 The central controller is responsible for cache coherence across the system.
 The directory requires additional storage.
 When a cached variable is updated, only the cores storing that variable need to be contacted.
False Sharing
 CPU caches are implemented in hardware, so they operate on cache lines, not individual variables.
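The code the following slides refer to was shown as an image on the original slide and did not survive extraction; following the textbook example, it is roughly a doubly nested loop that accumulates into a shared array y (f(i,j) stands for some computation on i and j):

    double y[m];                    /* shared array of m doubles */

    for (int i = 0; i < m; i++)     /* assign y = 0 */
        y[i] = 0.0;

    for (int i = 0; i < m; i++)
        for (int j = 0; j < n; j++)
            y[i] += f(i, j);        /* each iteration updates one element of y */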

False Sharing
 We can parallelize the previous code by dividing the iterations of the outer loop among the cores.
 If we have core_count cores, we might assign the first m/core_count iterations to the first core, the next m/core_count iterations to the second core, and so on, as sketched below.
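A minimal sketch of this block partition, assuming each thread knows its rank my_rank and that core_count evenly divides m (both names are illustrative, not from the slide):

    int iters_per_core = m / core_count;
    int my_first = my_rank * iters_per_core;     /* first iteration of this thread's block */
    int my_last  = my_first + iters_per_core;    /* one past the last */

    for (int i = my_first; i < my_last; i++)     /* this thread's slice of the outer loop */
        for (int j = 0; j < n; j++)
            y[i] += f(i, j);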

False Sharing
 Suppose our shared-memory system has two cores, m = 8, doubles are eight bytes, cache lines are 64 bytes, and y[0] is stored at the beginning of a cache line.
 A cache line can then hold eight doubles (64 / 8 = 8), so y occupies exactly one cache line.
 What happens when core 0 and core 1 simultaneously execute their code?
False Sharing
 Since all of y is stored in a single cache line, each time one of the cores executes the statement y[i] += f(i,j), the line is invalidated, and the next time the other core tries to execute this statement it has to fetch the updated line from memory!

False Sharing
 This is called false sharing, because the system behaves as if the elements of y were being shared by the cores.
 False sharing does not cause incorrect results.
 It can, however, degrade the performance of a program by causing many more memory accesses than necessary.

False Sharing
 How can we solve this problem?
 To reduce its effect, use temporary storage that is local to the thread or process, then copy the temporary storage to the shared storage, as in the sketch below.
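A sketch of that fix, reusing the names from the earlier partition sketch: each thread accumulates into private storage and writes the shared array y only once at the end.

    double my_y[ITERS_PER_CORE];                 /* thread-private scratch array */

    for (int i = my_first; i < my_last; i++) {
        my_y[i - my_first] = 0.0;
        for (int j = 0; j < n; j++)
            my_y[i - my_first] += f(i, j);       /* no writes to the shared cache line */
    }

    for (int i = my_first; i < my_last; i++)
        y[i] = my_y[i - my_first];               /* one final burst of shared writes */

The shared line holding y is now written only once per element instead of once per inner-loop iteration, so the invalidation ping-pong all but disappears.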

Parallel software
 The burden is on software: hardware and compilers can keep up the pace needed, but software must be written to exploit the parallelism.
 From now on…
   In shared memory programs:
     ◼ Start a single process and fork threads.
     ◼ Threads carry out tasks.
   In distributed memory programs:
     ◼ Start multiple processes.
     ◼ Processes carry out tasks.
SPMD – single program multiple data
 An SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches.

    if (I’m thread/process i)
        do this;
    else
        do that;
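A runnable illustration of this pattern, here using MPI as the process-spawning library (the slide itself does not name one):

    /* spmd.c – every process runs this same executable */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which process am I? */

        if (rank == 0)
            printf("process 0: doing this\n");  /* one branch of the program */
        else
            printf("process %d: doing that\n", rank);

        MPI_Finalize();
        return 0;
    }

Launched with, e.g., mpiexec -n 4 ./spmd, the single executable behaves as four cooperating programs.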
Writing Parallel Programs

    double x[n], y[n];
    …
    for (i = 0; i < n; i++)
        x[i] += y[i];

1. Divide the work among the processes/threads
   (a) so each process/thread gets roughly the same amount of work,
   (b) and communication is minimized.
2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among processes/threads.
Shared Memory
 Dynamic threads
   ◼ Master thread waits for work, forks new threads, and when the threads are done, they terminate.
   ◼ Efficient use of resources, but thread creation and termination is time consuming.
 Static threads
   ◼ Pool of threads created and allocated work, but they do not terminate until cleanup (see the sketch below).
   ◼ Better performance, but potential waste of system resources.
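A minimal sketch of static threads with Pthreads (one possible library choice; the slide does not prescribe one): the pool is created once, and the threads are only joined at cleanup.

    #include <pthread.h>
    #include <stdio.h>

    #define THREAD_COUNT 4

    void *worker(void *arg) {
        long rank = (long) arg;
        printf("thread %ld carrying out its tasks\n", rank);  /* real work goes here */
        return NULL;
    }

    int main(void) {
        pthread_t pool[THREAD_COUNT];
        for (long t = 0; t < THREAD_COUNT; t++)               /* create the pool once */
            pthread_create(&pool[t], NULL, worker, (void *) t);
        for (long t = 0; t < THREAD_COUNT; t++)               /* join only at cleanup */
            pthread_join(pool[t], NULL);
        return 0;
    }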
