Parallel & Distributed Computing
Fall 2023
CS4172
Chapter 2, Lecture 1
Muhammad Asim Butt
[email protected]
An Introduction to Parallel Programming
Peter Pacheco
Chapter 2
Parallel Hardware and Parallel Software
Figure 2.1: The von Neumann architecture
Main memory
• This is a collection of locations, each of which is capable of
storing both instructions and data.
Figure: the CPU fetches/reads instructions and data from main memory and writes/stores data back to main memory.
• In practice, a memory access will effectively operate on blocks (cache lines) of data and instructions instead of individual instructions and individual data items.
Levels of Cache
Figure: the memory hierarchy; as variables (x, y, z, sum, total, r1, radius, center, and the array A[]) are fetched, they move between main memory and the L1, L2, and L3 caches, with the smallest, fastest L1 cache closest to the CPU.
▪Write-back caches mark data in the cache as dirty. When the cache
line is replaced by a new cache line from memory, the dirty line is
written to memory.
• However, knowing the principle of spatial and temporal locality allows us to have some
indirect control over caching.
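First loop: An Example
• This slide and the next walk through the textbook's matrix–vector multiply example. The sketch below is a minimal, runnable reconstruction of that code; the 4×4 size, the wrapping main, and the initial values are assumptions chosen to match the hypothetical cache of two four-element lines discussed next:

#include <stdio.h>

#define MAX 4   /* 4x4 matrix; one cache line is assumed to hold 4 doubles */

int main(void) {
    double A[MAX][MAX], x[MAX], y[MAX];
    int i, j;

    /* Initialize A and x; zero y. */
    for (i = 0; i < MAX; i++) {
        x[i] = 1.0;
        y[i] = 0.0;
        for (j = 0; j < MAX; j++)
            A[i][j] = i + j;
    }

    /* First pair of loops: the inner loop sweeps along a row of A,
       so successive accesses fall in the cache line just fetched. */
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];

    /* Reset y before the second version. */
    for (i = 0; i < MAX; i++)
        y[i] = 0.0;

    /* Second pair of loops: the inner loop sweeps down a column of A,
       so each successive access touches a different cache line. */
    for (j = 0; j < MAX; j++)
        for (i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];

    printf("y[0] = %.1f\n", y[0]);
    return 0;
}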
• When the first pair of loops accesses A[0][0], it is not in the cache, so the access results in a cache miss, and the system will read the line consisting of the first row of A (A[0][0], A[0][1], A[0][2], A[0][3]) into the cache.
▪The first pair of loops then accesses A[0][1], A[0][2], A[0][3], all of which are in the cache, and the next miss in the first pair of loops will occur when the code accesses A[1][0].
▪Continuing in this fashion, we see that the first pair of loops will result in a total of four misses
when it accesses elements of A, one for each row.
• Note that since our hypothetical cache can only store two lines or eight elements of A,
▪when we read the first element of row two and the first element of row three, one of the lines that’s
already in the cache will have to be evicted from the cache,
▪but once a line is evicted, the first pair of loops won’t need to access the elements of that line again.
Second loop: An Example ..
• After reading the first row into the cache, the second pair of loops needs to then access A[1][0], A[2][0], A[3][0], none of which are in the cache.
• So the next three accesses of A will also result in misses.
• Furthermore, because the cache is small, the reads of A[2][0] and A[3][0] will require that lines already in the cache be evicted.
• Since A[2][0] is stored in cache line 2, reading its line will evict line 0, and reading A[3][0] will evict line 1.
• After finishing the first pass through the outer loop, we’ll next need to access A[0][1], which was evicted with the rest of the first row.
• So we see that every time we read an element of A, we’ll have a miss, and the second pair of loops results in 16 misses.
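• In general, still assuming four doubles per cache line and a matrix too large for the cache, the first (row-wise) pair of loops misses once per line, MAX²/4 misses in all, while the second (column-wise) pair misses on every access, MAX² misses in all: four times the memory traffic.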
• In fact, if we run the code on one of our systems with MAX = 1000, the first pair of nested loops is approximately three times faster than the second pair.
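• A minimal timing harness in this spirit could look like the sketch below; the three-times figure above is the textbook's measurement, while the now helper, CLOCK_MONOTONIC, and the initialization values here are assumptions of this sketch, and the measured ratio will vary by machine:

#include <stdio.h>
#include <time.h>

#define MAX 1000

static double A[MAX][MAX], x[MAX], y[MAX]; /* globals are zero-initialized */

static double now(void) {
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return ts.tv_sec + ts.tv_nsec * 1e-9;
}

int main(void) {
    int i, j;
    for (i = 0; i < MAX; i++) {
        x[i] = 1.0;
        for (j = 0; j < MAX; j++)
            A[i][j] = i + j;
    }

    double t0 = now();
    for (i = 0; i < MAX; i++)          /* first pair: row-wise */
        for (j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];
    double t1 = now();

    for (i = 0; i < MAX; i++)          /* untimed reset of y */
        y[i] = 0.0;

    double t2 = now();
    for (j = 0; j < MAX; j++)          /* second pair: column-wise */
        for (i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];
    double t3 = now();

    /* Print y[0] so the compiler cannot discard the loops. */
    printf("y[0] = %.1f\n", y[0]);
    printf("row-wise: %g s, column-wise: %g s\n", t1 - t0, t3 - t2);
    return 0;
}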
2.2.4: Virtual memory
• Caches make it possible for the CPU to quickly access instructions and data that
are in main memory.
• However, if we run a very large program or a program that accesses very large data sets, all of the instructions and data may not fit into main memory.
• This is especially true with multitasking operating systems; to switch between
programs and create the illusion that multiple programs are running
simultaneously, the instructions and data that will be used during the next time
slice should be in main memory.
• Thus in a multitasking system, even if the main memory is very large, many running
programs must share the available main memory.
• Furthermore, this sharing must be done in such a way that each program’s data and
instructions are protected from corruption by other programs.
Table 2.2: Virtual Address Divided into Virtual Page Number and Byte Offset
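• For example, with 32-bit virtual addresses and a 4096-byte page size (assumptions chosen here for illustration), the low 12 bits of an address are the byte offset within the page and the high 20 bits are the virtual page number; a minimal sketch of the split:

#include <stdio.h>
#include <stdint.h>

#define PAGE_SIZE   4096u   /* 2^12 bytes per page (assumed) */
#define OFFSET_BITS 12      /* log2(PAGE_SIZE) */

int main(void) {
    uint32_t va = 0x12345678u;                 /* arbitrary 32-bit virtual address */
    uint32_t page   = va >> OFFSET_BITS;       /* high 20 bits: virtual page number */
    uint32_t offset = va & (PAGE_SIZE - 1u);   /* low 12 bits: byte offset */
    printf("page = 0x%05x, offset = 0x%03x\n", page, offset); /* 0x12345, 0x678 */
    return 0;
}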
Figure: superscalar execution; two functional units, adder #1 and adder #2, compute z[1] and z[2] simultaneously.
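• The figure depicts a loop along the lines of the sketch below; the array size, initial values, and wrapping main are assumptions for illustration:

#include <stdio.h>

int main(void) {
    double x[1000], y[1000], z[1000];
    int i;
    for (i = 0; i < 1000; i++) { x[i] = i; y[i] = 2.0 * i; }

    /* A superscalar CPU with two adders can execute two iterations
       at once, e.g. z[1] on adder #1 while z[2] runs on adder #2. */
    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];

    printf("z[999] = %.1f\n", z[999]);
    return 0;
}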
▪With speculation, the compiler or processor guesses the outcome of a branch and begins executing the corresponding instructions ahead of time, discarding the results if the guess turns out to be wrong. For example:

z = x + y;
if (z > 0)
    w = x;
else
    w = y;

▪Here the system might speculate that z will be positive and start executing w = x before z has been computed.
• Of course, for hardware multithreading to be useful, the system must support very rapid switching between threads.
▪For example, in some older systems, threads were simply implemented as
processes, and in the time it took to switch between processes, thousands of
instructions could be executed.
Hardware multithreading …..