
Parallel and Distributed Computing (PDC)

Fall 2023
CS4172
Chapter 2, Lecture 1
Muhammad Asim Butt
[email protected]
Surah Al-Hujurat

An Introduction to Parallel Programming
Peter Pacheco

Chapter 2
Parallel Hardware and Parallel Software

Copyright © 2010, Elsevier Inc. All rights Reserved


Roadmap
• Some background
• Modifications to the von Neumann model
• Parallel hardware
• Parallel software
• Input and output
• Performance
• Parallel program design
• Writing and running parallel programs
• Assumptions

Copyright © 2010, Elsevier Inc. All rights Reserved


Some background

Copyright © 2010, Elsevier Inc. All rights Reserved


Serial hardware and software

[Figure: programs, input, and output all pass through a single computer]

The computer runs one program at a time.

Copyright © 2010, Elsevier Inc. All rights Reserved


The von Neumann Architecture

Figure 2.1
Copyright © 2010, Elsevier Inc. All rights Reserved
Main memory
• This is a collection of locations, each of which is capable of
storing both instructions and data.

• Every location consists of
  ▪ an address, which is used to access the location, and
  ▪ the contents of the location.

Copyright © 2010, Elsevier Inc. All rights Reserved


Central processing unit (CPU)
• Divided into two parts.

• Control unit - responsible for deciding which instruction in a program
  should be executed. (the boss)

• Arithmetic and logic unit (ALU) - responsible for executing the actual
  instructions, e.g., "add 2 + 2". (the worker)

• Data path

Copyright © 2010, Elsevier Inc. All rights Reserved


Key terms
• Register – very fast storage, part of the CPU.

• Program counter – stores the address of the next instruction to be
  executed.

• Bus – wires and hardware that connect the CPU and memory.
  ▪ Bus hardware controls access to memory and the other attached devices.

Copyright © 2010, Elsevier Inc. All rights Reserved


[Figure: fetch/read - the CPU reads data and instructions from memory]
Copyright © 2010, Elsevier Inc. All rights Reserved


[Figure: write/store - the CPU writes data back to memory]
Copyright © 2010, Elsevier Inc. All rights Reserved


von Neumann bottleneck

The separation of CPU and memory: the interconnection between them limits
the rate at which the CPU can access instructions and data.

Copyright © 2010, Elsevier Inc. All rights Reserved


An operating system “process”
•An instance of a computer program that is being
executed.
•Components of a process:
▪The executable machine language program.
▪A block of memory.
▪Descriptors of resources the OS has allocated to the process.
▪Security information.
▪Information about the state of the process.

Copyright © 2010, Elsevier Inc. All rights Reserved


Multitasking
•Gives the illusion that a single processor system is running multiple
 programs simultaneously.

•Each process takes its turn running for a short interval. (a time slice)

•After its time is up, the process waits until it gets another turn. (it
 blocks)

Copyright © 2010, Elsevier Inc. All rights Reserved


Threading
•Threads are contained within processes.

•They allow programmers to divide their programs into (more or less)
 independent tasks.

•The hope is that when one thread blocks because it is waiting on a
 resource, another will have work to do and can run.

Copyright © 2010, Elsevier Inc. All rights Reserved


A process and two threads

Starting a thread is called forking; terminating a thread is called joining.

[Figure 2.2: a process and two threads, both started by the "master" thread]

Copyright © 2010, Elsevier Inc. All rights Reserved
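The fork/join pattern above can be made concrete with a short sketch. This
is not from the slides - just a minimal illustration using POSIX threads,
where pthread_create forks a thread and pthread_join waits for it to
terminate (compile with -lpthread):

    #include <stdio.h>
    #include <pthread.h>

    /* Work carried out by the forked thread. */
    void* hello(void* arg) {
        printf("Hello from the forked thread\n");
        return NULL;
    }

    int main(void) {
        pthread_t thread;
        /* Fork: the master thread starts a new thread running hello(). */
        pthread_create(&thread, NULL, hello, NULL);
        /* The master thread could do other useful work here. */
        /* Join: wait for the forked thread to terminate. */
        pthread_join(thread, NULL);
        return 0;
    }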


Modifications to the von Neumann model

Copyright © 2010, Elsevier Inc. All rights Reserved


Basics of caching
•A cache is a collection of memory locations that can be accessed in less
 time than some other memory locations.

•A CPU cache is typically located on the same chip as the CPU, or on a
 separate chip that can be accessed much faster than ordinary memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality

•Accessing one location tends to be followed by an access of a nearby
 location, or of the same location again.

• Spatial locality – accessing a nearby location.

• Temporal locality – accessing the same location again in the near future.

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality
• Arrays are allocated as blocks of contiguous memory locations.
  ▪ So, for example, the location storing z[1] immediately follows the
    location storing z[0].
  ▪ Thus, as long as i < 999, the read of z[i] is immediately followed by a
    read of z[i+1].

    float z[1000];
    . . .
    sum = 0.0;
    for (i = 0; i < 1000; i++)
        sum += z[i];

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality …
• To exploit the principle of locality, the system uses an effectively wider
interconnect to access data and instructions.

• That is, a memory access will effectively operate on blocks of data and
instructions instead of individual instructions and individual data items.

• These blocks are called cache blocks or cache lines.

• A typical cache line stores 8 to 16 times as much information as a single
  memory location.

This lecture has been prepared from different web resources. Muhammad Asim Butt
Levels of Cache

L1 - smallest & fastest
L2
L3 - largest & slowest


Copyright © 2010, Elsevier Inc. All rights Reserved
Cache hit

[Figure: the CPU fetches x and finds it in L1 - a cache hit]

Copyright © 2010, Elsevier Inc. All rights Reserved


Cache miss

[Figure: the CPU fetches x, it is not in L1, L2, or L3, so it must be read
from main memory - a cache miss]

Copyright © 2010, Elsevier Inc. All rights Reserved


Issues with cache
• When a CPU writes data to cache, the value in cache may be
inconsistent with the value in main memory.

▪Write-through caches handle this by updating the data in main memory at
 the time it is written to cache.

▪Write-back caches mark data in the cache as dirty. When the cache line is
 replaced by a new cache line from memory, the dirty line is written to
 memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


Cache mappings
• Fully associative – a new line can be placed at any location in the cache.

• Direct mapped – each cache line has a unique location in the cache to
  which it will be assigned.

• n-way set associative – each cache line can be placed in one of n
  different locations in the cache.
  ▪ "n-way" indicates how many cache lines are in each set.
  ▪ For example, in a two-way set associative cache, each line (block) can
    be mapped to one of two locations.
Copyright © 2010, Elsevier Inc. All rights Reserved
n-way set associative
• When more than one line in memory can be mapped to
several different locations in cache we also need to be able to
decide which line should be replaced or evicted.

Copyright © 2010, Elsevier Inc. All rights Reserved


Example

Table 2.1: Assignments of a 16-line main memory to a 4-line cache


Copyright © 2010, Elsevier Inc. All rights Reserved
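The body of Table 2.1 did not survive extraction, but its assignments follow
directly from the mapping rules above. A small sketch (not from the slides)
that prints them for a 16-line main memory and a 4-line cache, assuming two
ways per set for the set-associative column:

    #include <stdio.h>

    /* For each of 16 memory lines, print where it may go in a 4-line cache:
       fully associative (anywhere), direct mapped (line mod 4), and
       2-way set associative (one of the two lines in set (line mod 2)). */
    int main(void) {
        const int cache_lines = 4, ways = 2;
        const int sets = cache_lines / ways;       /* 2 sets of 2 lines each */
        for (int mem = 0; mem < 16; mem++) {
            int direct = mem % cache_lines;         /* direct mapped slot */
            int set    = mem % sets;                /* 2-way set index    */
            printf("memory line %2d: fully assoc: any of 0-3, direct: %d, "
                   "2-way: set %d (cache lines %d or %d)\n",
                   mem, direct, set, set * ways, set * ways + 1);
        }
        return 0;
    }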
Caches and programs: an example
• It’s important to remember that the workings of the CPU cache are controlled by the
system hardware, and we, the programmers, don’t directly determine which data and
which instructions are in the cache.

• However, knowing the principle of spatial and temporal locality allows us to have some
indirect control over caching.

• As an example, C stores two-dimensional arrays in “row-major” order.

• That is, although we think of a two-dimensional array as a rectangular
  block, memory is effectively a huge one-dimensional array.

• So in row-major storage, we store row 0 first, then row 1, and so on.


This lecture has been prepared from different web resources. Muhammad Asim Butt
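The loop nests that the next few slides analyze are not reproduced in this
extract. A minimal sketch of what they look like - assuming the usual
matrix-vector multiply y = A*x from Pacheco's text; the operation inside the
loops is an assumption, while the names A, x, y, and MAX come from the
surrounding slides:

    #define MAX 4
    double A[MAX][MAX], x[MAX], y[MAX];
    int i, j;
    /* ... initialize A and x, set y to 0 ... */

    /* First pair of loops: sweep each row of A in order
       (follows the row-major layout in memory). */
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];

    /* (y would be reset to 0 before running the second version.) */

    /* Second pair of loops: sweep each column of A
       (each step jumps a whole row ahead in memory). */
    for (j = 0; j < MAX; j++)
        for (i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];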
Caches and programs

• Suppose MAX is four, a cache line stores four doubles, and the elements
  of A are stored in memory as above (in row-major order).

• Let's also suppose that the cache is direct mapped, and it can only store
  eight elements of A, or two cache lines. (We won't worry about x and y.)

Copyright © 2010, Elsevier Inc. All rights Reserved


First loop: an example ..
• Both pairs of loops attempt to first access A[0][0].

• Since it's not in the cache, this will result in a cache miss, and the
  system will read the line consisting of the first row of A - A[0][0],
  A[0][1], A[0][2], A[0][3] - into the cache.
  ▪ The first pair of loops then accesses A[0][1], A[0][2], A[0][3], all of
    which are in the cache, and the next miss in the first pair of loops
    will occur when the code accesses A[1][0].
  ▪ Continuing in this fashion, we see that the first pair of loops will
    result in a total of four misses when it accesses elements of A, one
    for each row.

• Note that since our hypothetical cache can only store two lines or eight
  elements of A,
  ▪ when we read the first element of row two and the first element of row
    three, one of the lines that's already in the cache will have to be
    evicted from the cache,
  ▪ but once a line is evicted, the first pair of loops won't need to access
    the elements of that line again.

This lecture has been prepared from different web resources. Muhammad Asim Butt
Second loop: an example ..
• After reading the first row into the cache, the second pair of loops needs
  to then access A[1][0], A[2][0], A[3][0], none of which are in the cache.
• So the next three accesses of A will also result in misses.
• Furthermore, because the cache is small, the reads of A[2][0] and A[3][0]
  will require that lines already in the cache be evicted.
  ▪ Since A[2][0] is stored in line 2 of memory, reading its line will evict
    the cached line holding row 0, and reading A[3][0] will evict the line
    holding row 1.
• After finishing the first pass through the outer loop, we'll next need to
  access A[0][1], which was evicted with the rest of the first row.
• So we see that every time we read an element of A, we'll have a miss, and
  the second pair of loops results in 16 misses.
This lecture has been prepared from different web resources. Muhammad Asim Butt
• In fact, if we run the code on one of our systems with MAX = 1000, the
  first pair of nested loops is approximately three times faster than the
  second pair.

This lecture has been prepared from different web resources. Muhammad Asim Butt
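One way to reproduce this comparison on your own system - not from the
slides, just a minimal sketch assuming the matrix-vector loop pairs above
and the POSIX clock_gettime clock for timing:

    #include <stdio.h>
    #include <time.h>

    #define MAX 1000
    static double A[MAX][MAX], x[MAX], y[MAX];

    static double now(void) {                   /* wall-clock seconds */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        int i, j;
        for (i = 0; i < MAX; i++) {
            x[i] = 1.0;
            for (j = 0; j < MAX; j++) A[i][j] = 1.0;
        }

        double t0 = now();
        for (i = 0; i < MAX; i++)               /* first pair: row by row */
            for (j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];
        double t1 = now();
        for (j = 0; j < MAX; j++)               /* second pair: column by column */
            for (i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];
        double t2 = now();

        printf("first pair (row order):     %.4f s\n", t1 - t0);
        printf("second pair (column order): %.4f s\n", t2 - t1);
        return 0;
    }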
2.2.4: Virtual memory
• Caches make it possible for the CPU to quickly access instructions and data that
are in main memory.

• However, if we run a very large program or a program that accesses very
  large data sets, all of the instructions and data may not fit into main
  memory.
• This is especially true with multitasking operating systems; to switch between
programs and create the illusion that multiple programs are running
simultaneously, the instructions and data that will be used during the next time
slice should be in main memory.
• Thus in a multitasking system, even if the main memory is very large, many running
programs must share the available main memory.
• Furthermore, this sharing must be done in such a way that each program’s data and
instructions are protected from corruption by other programs.

Copyright © 2010, Elsevier Inc. All rights Reserved


Virtual memory
• Virtual memory was developed so that main memory can function as a
cache for secondary storage.
• It exploits the principle of spatial and temporal locality by keeping in main
memory only the active parts of the many running programs; those parts
that are idle can be kept in a block of secondary storage, called swap space.
• Like CPU caches, virtual memory operates on blocks of data and
instructions.
• These blocks are commonly called pages, and since secondary storage
  access can be hundreds of thousands of times slower than main memory
  access, pages are relatively large.
• Most systems have a fixed page size that currently ranges from 4 to 16
kilobytes.
This lecture has been prepared from different web resources. Muhammad Asim Butt
Virtual memory ….

[Figure: programs A, B, and C sharing main memory]

• We may run into trouble if we try to assign physical memory addresses to
  pages when we compile a program.

• If we do this, then each page of the program can only be assigned to one
  block of memory, and with a multitasking operating system, we're likely to
  have many programs wanting to use the same block of memory.

• To avoid this problem, when a program is compiled, its pages are assigned
  virtual page numbers.

Copyright © 2010, Elsevier Inc. All rights Reserved


Virtual page numbers
• When a program is compiled, its pages are assigned virtual page numbers.

• When the program is run, a table is created that maps the virtual page
  numbers to physical addresses.

• A page table is used to translate the virtual address into a physical
  address.

Copyright © 2010, Elsevier Inc. All rights Reserved


Page table

Table 2.2: Virtual Address Divided into Virtual Page Number and Byte Offset

Copyright © 2010, Elsevier Inc. All rights Reserved
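As a rough illustration of the split the table describes (not from the
slides): with 4 KiB pages, the low 12 bits of a virtual address form the
byte offset and the remaining high-order bits form the virtual page number.
A minimal sketch, assuming 4096-byte pages and an arbitrary example address:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        const uint64_t page_size = 4096;        /* assume 4 KiB pages       */
        uint64_t vaddr  = 0x12345;              /* example virtual address  */
        uint64_t vpn    = vaddr / page_size;    /* virtual page number      */
        uint64_t offset = vaddr % page_size;    /* byte offset in the page  */
        printf("virtual address 0x%llx -> page %llu, offset %llu\n",
               (unsigned long long)vaddr, (unsigned long long)vpn,
               (unsigned long long)offset);
        return 0;
    }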


Translation-lookaside buffer (TLB)

• Using a page table has the potential to significantly increase each
  program's overall run-time, because each memory reference now requires two
  accesses:
  ▪ First access: when you access a virtual address, you first need to read
    the page table in memory to find the physical page number.
  ▪ Second access: after obtaining the physical page number, you then access
    the actual data (or instruction) in memory.

• The translation-lookaside buffer (TLB) is a special address translation
  cache in the processor that addresses this problem.

Copyright © 2010, Elsevier Inc. All rights Reserved


Translation-lookaside buffer (2)

• It caches a small number of entries (typically 16–512) from the page
  table in very fast memory.

• Page fault – attempting to access a page that, according to the page
  table, is only stored on disk and is not currently resident in main
  memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


2.2.5: Instruction Level Parallelism (ILP)

• Attempts to improve processor performance by having multiple processor
  components or functional units simultaneously executing instructions.

Copyright © 2010, Elsevier Inc. All rights Reserved


Instruction Level Parallelism (2)

•Pipelining - functional units are arranged in stages.

•Multiple issue - multiple instructions can be simultaneously initiated.

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining example (1)

Add the floating point numbers 9.87 × 10^4 and 6.54 × 10^3.


Copyright © 2010, Elsevier Inc. All rights Reserved
Pipelining example (2)

• Assume each of the operations in the example above takes one nanosecond
  (10^-9 seconds), so one floating point addition takes about 7 nanoseconds.

• A for loop that performs 1000 such floating point additions then takes
  about 7000 nanoseconds.
Copyright © 2010, Elsevier Inc. All rights Reserved
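The for loop referred to on this slide is not reproduced in this extract. A
minimal sketch of the kind of loop being timed - the same element-wise array
addition that reappears on the Multiple Issue slide below; the array names
are assumptions:

    float x[1000], y[1000], z[1000];
    int i;
    /* ... initialize x and y ... */

    /* 1000 floating point additions; at roughly 7 ns each this is
       roughly 7000 ns without pipelining. */
    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];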
Pipelining (3)

• Divide the floating point adder into 7 separate pieces of hardware or
  functional units.

• The first unit fetches two operands, the second unit compares exponents,
  etc.

• The output of one functional unit is the input to the next.

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining (4)

Table 2.3: Pipelined Addition.


Numbers in the table are subscripts of operands/results.
Copyright © 2010, Elsevier Inc. All rights Reserved
Pipelining (5)

• One floating point addition still takes 7 nanoseconds.

• But 1000 floating point additions now take only 1006 nanoseconds: the
  first result leaves the pipeline after 7 ns, and one new result is
  completed every nanosecond after that (7 + 999 = 1006).

Copyright © 2010, Elsevier Inc. All rights Reserved


Multiple Issue (1)
• Multiple issue processors replicate functional units and try to
  simultaneously execute different instructions in a program.

    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];

[Figure: with two floating point adders, adder #1 can compute z[1], z[3], ...
while adder #2 computes z[2], z[4], ...]
Copyright © 2010, Elsevier Inc. All rights Reserved


Multiple Issue (2)
• Static multiple issue - functional units are scheduled at compile time.

• Dynamic multiple issue – functional units are scheduled at run-time.
  ▪ Dynamic multiple issue processors are also known as superscalar
    processors.

Copyright © 2010, Elsevier Inc. All rights Reserved


Speculation (1)
• In order to make use of multiple issue, the system must find
instructions that can be executed simultaneously.

■ In speculation, the compiler or the processor makes a guess about an
  instruction, and then executes the instruction on the basis of the guess.

Copyright © 2010, Elsevier Inc. All rights Reserved


Speculation ….

    z = x + y;
    if (z > 0)
        w = x;
    else
        w = y;

• Here the compiler or processor might speculate that z will be positive and
  go ahead with w = x.

• If the system speculates incorrectly, it must go back and recalculate
  w = y.

Copyright © 2010, Elsevier Inc. All rights Reserved
2.2.6: Hardware multithreading
• ILP can be very difficult to exploit: a program with a long sequence of
  dependent statements offers few opportunities for the simultaneous
  execution of different instructions.

• Thread-level parallelism, or TLP, attempts to provide parallelism through
  the simultaneous execution of different threads.

• TLP therefore provides a coarser-grained parallelism than ILP: the program
  units that are being simultaneously executed - threads - are larger than
  the finer-grained units, i.e., individual instructions.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading …..
• Hardware multithreading provides a means for systems to continue doing
useful work when the task being currently executed has stalled.
▪E.g., the current task has to wait for data to be loaded from memory.

• Instead of looking for parallelism in the currently executing thread, it
  may make sense to simply run another thread.

• Of course, for this to be useful, the system must support very rapid
switching between threads.
▪For example, in some older systems, threads were simply implemented as
processes, and in the time it took to switch between processes, thousands of
instructions could be executed.
Copyright © 2010, Elsevier Inc. All rights Reserved
Hardware multithreading …..

• Fine-grained multithreading - the processor switches between threads
  after each instruction, skipping threads that are stalled.

• Pros: potential to avoid wasted machine time due to stalls.


• Cons: a thread that’s ready to execute a long sequence of
instructions may have to wait to execute every instruction.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading …
• Coarse-grained multithreading: only switches threads when the running
  thread has stalled waiting for a time-consuming operation to complete.

• Pros: switching threads doesn’t need to be nearly instantaneous.


• Cons: the processor can be idled on shorter stalls, and thread
switching will also cause delays.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading (3)

• Simultaneous multithreading (SMT): a variation on fine-grained
  multithreading.

• Allows multiple threads to make use of the multiple functional units.

Copyright © 2010, Elsevier Inc. All rights Reserved
