
Parallel and Distributed Computing (PDC)

Fall 2023
CS4172
Chapter 2, Lecture 1
Muhammad Asim Butt
[email protected]
Surah Al-Hujurat

An Introduction to Parallel Programming
Peter Pacheco

Chapter 2
Parallel Hardware and Parallel Software

Copyright © 2010, Elsevier Inc. All rights Reserved


Roadmap
• Some background
• Modifications to the von Neumann model
• Parallel hardware
• Parallel software
• Input and output
• Performance
• Parallel program design
• Writing and running parallel programs
• Assumptions

Copyright © 2010, Elsevier Inc. All rights Reserved


Some background

Copyright © 2010, Elsevier Inc. All rights Reserved


Serial hardware and software

[Figure: programs, input, and output all pass through a single computer]

The computer runs one program at a time.

Copyright © 2010, Elsevier Inc. All rights Reserved


The von Neumann Architecture

Figure 2.1
Copyright © 2010, Elsevier Inc. All rights Reserved
Main memory
• This is a collection of locations, each of which is capable of
storing both instructions and data.

• Every location consists of
  ▪ an address, which is used to access the location, and
  ▪ the contents of the location.

Copyright © 2010, Elsevier Inc. All rights Reserved


Central processing unit (CPU)
• Divided into two parts.

• Control unit - responsible for deciding which instruction in a program
  should be executed. (the boss)

• Arithmetic and logic unit (ALU) - responsible for executing the actual
  instructions, e.g., "add 2 + 2". (the worker)

• Data path

Copyright © 2010, Elsevier Inc. All rights Reserved


Key terms
• Register – very fast storage, part of the CPU.

• Program counter – stores the address of the next instruction to be
  executed.

• Bus – wires and hardware that connect the CPU and memory.
  ▪ Bus hardware controls access to memory and the other attached devices.

Copyright © 2010, Elsevier Inc. All rights Reserved


[Figure: fetch/read - the CPU reads data and instructions from memory]
Copyright © 2010, Elsevier Inc. All rights Reserved


[Figure: write/store - the CPU writes data back to memory]
Copyright © 2010, Elsevier Inc. All rights Reserved


von Neumann bottleneck

The separation of CPU and memory: the interconnection between them limits
the rate at which the CPU can access instructions and data.

Copyright © 2010, Elsevier Inc. All rights Reserved


An operating system “process”
•An instance of a computer program that is being
executed.
•Components of a process:
▪The executable machine language program.
▪A block of memory.
▪Descriptors of resources the OS has allocated to the process.
▪Security information.
▪Information about the state of the process.

Copyright © 2010, Elsevier Inc. All rights Reserved


Multitasking
•Gives the illusion that a single processor system is running multiple
 programs simultaneously.

•Each process takes its turn running for a short interval. (a time slice)

•After its time is up, the process waits until it gets another turn. (it
 blocks)

Copyright © 2010, Elsevier Inc. All rights Reserved


Threading
•Threads are contained within processes.

•They allow programmers to divide their programs into (more or less)
 independent tasks.

•The hope is that when one thread blocks because it is waiting on a
 resource, another will have work to do and can run.

Copyright © 2010, Elsevier Inc. All rights Reserved


A process and two threads

Starting a thread is called forking; terminating a thread is called joining.

[Figure 2.2: a process and two threads, both started by the "master" thread]

Copyright © 2010, Elsevier Inc. All rights Reserved
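The fork/join pattern above can be made concrete with a short sketch. This
is not from the slides - just a minimal illustration using POSIX threads,
where pthread_create forks a thread and pthread_join waits for it to
terminate (compile with -lpthread):

    #include <stdio.h>
    #include <pthread.h>

    /* Work carried out by the forked thread. */
    void* hello(void* arg) {
        printf("Hello from the forked thread\n");
        return NULL;
    }

    int main(void) {
        pthread_t thread;
        /* Fork: the master thread starts a new thread running hello(). */
        pthread_create(&thread, NULL, hello, NULL);
        /* The master thread could do other useful work here. */
        /* Join: wait for the forked thread to terminate. */
        pthread_join(thread, NULL);
        return 0;
    }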


Modifications to the von Neumann model

Copyright © 2010, Elsevier Inc. All rights Reserved


Basics of caching
•A cache is a collection of memory locations that can be accessed in less
 time than some other memory locations.

•A CPU cache is typically located on the same chip as the CPU, or on a
 separate chip that can be accessed much faster than ordinary memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality

•Accessing one location tends to be followed by an access of a nearby
 location, or of the same location again.

• Spatial locality – accessing a nearby location.

• Temporal locality – accessing the same location again in the near future.

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality
• Arrays are allocated as blocks of contiguous memory locations.
  ▪ So, for example, the location storing z[1] immediately follows the
    location storing z[0].
  ▪ Thus, as long as i < 999, the read of z[i] is immediately followed by a
    read of z[i+1].

    float z[1000];
    . . .
    sum = 0.0;
    for (i = 0; i < 1000; i++)
        sum += z[i];

Copyright © 2010, Elsevier Inc. All rights Reserved


Principle of locality …
• To exploit the principle of locality, the system uses an effectively wider
interconnect to access data and instructions.

• That is, a memory access will effectively operate on blocks of data and
instructions instead of individual instructions and individual data items.

• These blocks are called cache blocks or cache lines.

• A typical cache line stores 8 to 16 times as much information as a single
  memory location.

This lecture has been prepared from different web resources. Muhammad Asim Butt
Levels of Cache

L1 - smallest & fastest
L2
L3 - largest & slowest


Copyright © 2010, Elsevier Inc. All rights Reserved
Cache hit

[Figure: the CPU fetches x and finds it in L1 - a cache hit]

Copyright © 2010, Elsevier Inc. All rights Reserved


Cache miss

[Figure: the CPU fetches x, it is not in L1, L2, or L3, so it must be read
from main memory - a cache miss]

Copyright © 2010, Elsevier Inc. All rights Reserved


Issues with cache
• When a CPU writes data to cache, the value in cache may be
inconsistent with the value in main memory.

▪Write-through caches handle this by updating the data in main memory at
 the time it is written to cache.

▪Write-back caches mark data in the cache as dirty. When the cache line is
 replaced by a new cache line from memory, the dirty line is written to
 memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


Cache mappings
• Fully associative – a new line can be placed at any location in the cache.

• Direct mapped – each cache line has a unique location in the cache to
  which it will be assigned.

• n-way set associative – each cache line can be placed in one of n
  different locations in the cache.
  ▪ "n-way" indicates how many cache lines are in each set.
  ▪ For example, in a two-way set associative cache, each line (block) can
    be mapped to one of two locations.
Copyright © 2010, Elsevier Inc. All rights Reserved
n-way set associative
• When more than one line in memory can be mapped to
several different locations in cache we also need to be able to
decide which line should be replaced or evicted.

Copyright © 2010, Elsevier Inc. All rights Reserved


Example

Table 2.1: Assignments of a 16-line main memory to a 4-line cache


Copyright © 2010, Elsevier Inc. All rights Reserved
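The body of Table 2.1 did not survive extraction, but its assignments follow
directly from the mapping rules above. A small sketch (not from the slides)
that prints them for a 16-line main memory and a 4-line cache, assuming two
ways per set for the set-associative column:

    #include <stdio.h>

    /* For each of 16 memory lines, print where it may go in a 4-line cache:
       fully associative (anywhere), direct mapped (line mod 4), and
       2-way set associative (one of the two lines in set (line mod 2)). */
    int main(void) {
        const int cache_lines = 4, ways = 2;
        const int sets = cache_lines / ways;       /* 2 sets of 2 lines each */
        for (int mem = 0; mem < 16; mem++) {
            int direct = mem % cache_lines;         /* direct mapped slot */
            int set    = mem % sets;                /* 2-way set index    */
            printf("memory line %2d: fully assoc: any of 0-3, direct: %d, "
                   "2-way: set %d (cache lines %d or %d)\n",
                   mem, direct, set, set * ways, set * ways + 1);
        }
        return 0;
    }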
Caches and programs: an example
• It’s important to remember that the workings of the CPU cache are controlled by the
system hardware, and we, the programmers, don’t directly determine which data and
which instructions are in the cache.

• However, knowing the principle of spatial and temporal locality allows us to have some
indirect control over caching.

• As an example, C stores two-dimensional arrays in “row-major” order.

• That is, although we think of a two-dimensional array as a rectangular
  block, memory is effectively a huge one-dimensional array.

• So in row-major storage, we store row 0 first, then row 1, and so on.


This lecture has been prepared from different web resources. Muhammad Asim Butt
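The loop nests that the next few slides analyze are not reproduced in this
extract. A minimal sketch of what they look like - assuming the usual
matrix-vector multiply y = A*x from Pacheco's text; the operation inside the
loops is an assumption, while the names A, x, y, and MAX come from the
surrounding slides:

    #define MAX 4
    double A[MAX][MAX], x[MAX], y[MAX];
    int i, j;
    /* ... initialize A and x, set y to 0 ... */

    /* First pair of loops: sweep each row of A in order
       (follows the row-major layout in memory). */
    for (i = 0; i < MAX; i++)
        for (j = 0; j < MAX; j++)
            y[i] += A[i][j] * x[j];

    /* (y would be reset to 0 before running the second version.) */

    /* Second pair of loops: sweep each column of A
       (each step jumps a whole row ahead in memory). */
    for (j = 0; j < MAX; j++)
        for (i = 0; i < MAX; i++)
            y[i] += A[i][j] * x[j];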
Caches and programs

• Suppose MAX is four, a cache line stores four doubles, and the elements
  of A are stored in memory as above (in row-major order).

• Let's also suppose that the cache is direct mapped, and it can only store
  eight elements of A, or two cache lines. (We won't worry about x and y.)

Copyright © 2010, Elsevier Inc. All rights Reserved


First loop: an example ..
• Both pairs of loops attempt to first access A[0][0].

• Since it's not in the cache, this will result in a cache miss, and the
  system will read the line consisting of the first row of A - A[0][0],
  A[0][1], A[0][2], A[0][3] - into the cache.
  ▪ The first pair of loops then accesses A[0][1], A[0][2], A[0][3], all of
    which are in the cache, and the next miss in the first pair of loops
    will occur when the code accesses A[1][0].
  ▪ Continuing in this fashion, we see that the first pair of loops will
    result in a total of four misses when it accesses elements of A, one
    for each row.

• Note that since our hypothetical cache can only store two lines or eight
  elements of A,
  ▪ when we read the first element of row two and the first element of row
    three, one of the lines that's already in the cache will have to be
    evicted from the cache,
  ▪ but once a line is evicted, the first pair of loops won't need to access
    the elements of that line again.

This lecture has been prepared from different web resources. Muhammad Asim Butt
Second loop: an example ..
• After reading the first row into the cache, the second pair of loops needs
  to then access A[1][0], A[2][0], A[3][0], none of which are in the cache.
• So the next three accesses of A will also result in misses.
• Furthermore, because the cache is small, the reads of A[2][0] and A[3][0]
  will require that lines already in the cache be evicted.
  ▪ Since A[2][0] is stored in line 2 of memory, reading its line will evict
    the cached line holding row 0, and reading A[3][0] will evict the line
    holding row 1.
• After finishing the first pass through the outer loop, we'll next need to
  access A[0][1], which was evicted with the rest of the first row.
• So we see that every time we read an element of A, we'll have a miss, and
  the second pair of loops results in 16 misses.
This lecture has been prepared from different web resources. Muhammad Asim Butt
• In fact, if we run the code on one of our systems with MAX = 1000, the
  first pair of nested loops is approximately three times faster than the
  second pair.

This lecture has been prepared from different web resources. Muhammad Asim Butt
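One way to reproduce this comparison on your own system - not from the
slides, just a minimal sketch assuming the matrix-vector loop pairs above
and the POSIX clock_gettime clock for timing:

    #include <stdio.h>
    #include <time.h>

    #define MAX 1000
    static double A[MAX][MAX], x[MAX], y[MAX];

    static double now(void) {                   /* wall-clock seconds */
        struct timespec ts;
        clock_gettime(CLOCK_MONOTONIC, &ts);
        return ts.tv_sec + ts.tv_nsec * 1e-9;
    }

    int main(void) {
        int i, j;
        for (i = 0; i < MAX; i++) {
            x[i] = 1.0;
            for (j = 0; j < MAX; j++) A[i][j] = 1.0;
        }

        double t0 = now();
        for (i = 0; i < MAX; i++)               /* first pair: row by row */
            for (j = 0; j < MAX; j++)
                y[i] += A[i][j] * x[j];
        double t1 = now();
        for (j = 0; j < MAX; j++)               /* second pair: column by column */
            for (i = 0; i < MAX; i++)
                y[i] += A[i][j] * x[j];
        double t2 = now();

        printf("first pair (row order):     %.4f s\n", t1 - t0);
        printf("second pair (column order): %.4f s\n", t2 - t1);
        return 0;
    }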
2.2.4: Virtual memory
• Caches make it possible for the CPU to quickly access instructions and data that
are in main memory.

• However, if we run a very large program or a program that accesses very
  large data sets, all of the instructions and data may not fit into main
  memory.
• This is especially true with multitasking operating systems; to switch between
programs and create the illusion that multiple programs are running
simultaneously, the instructions and data that will be used during the next time
slice should be in main memory.
• Thus in a multitasking system, even if the main memory is very large, many running
programs must share the available main memory.
• Furthermore, this sharing must be done in such a way that each program’s data and
instructions are protected from corruption by other programs.

Copyright © 2010, Elsevier Inc. All rights Reserved


Virtual memory
• Virtual memory was developed so that main memory can function as a
cache for secondary storage.
• It exploits the principle of spatial and temporal locality by keeping in main
memory only the active parts of the many running programs; those parts
that are idle can be kept in a block of secondary storage, called swap space.
• Like CPU caches, virtual memory operates on blocks of data and
instructions.
• These blocks are commonly called pages, and since secondary storage
  access can be hundreds of thousands of times slower than main memory
  access, pages are relatively large.
• Most systems have a fixed page size that currently ranges from 4 to 16
kilobytes.
This lecture has been prepared from different web resources. Muhammad Asim Butt
Virtual memory ….

[Figure: programs A, B, and C sharing main memory]

• We may run into trouble if we try to assign physical memory addresses to
  pages when we compile a program.

• If we do this, then each page of the program can only be assigned to one
  block of memory, and with a multitasking operating system, we're likely to
  have many programs wanting to use the same block of memory.

• To avoid this problem, when a program is compiled, its pages are assigned
  virtual page numbers.

Copyright © 2010, Elsevier Inc. All rights Reserved


Virtual page numbers
• When a program is compiled, its pages are assigned virtual page numbers.

• When the program is run, a table is created that maps the virtual page
  numbers to physical addresses.

• A page table is used to translate the virtual address into a physical
  address.

Copyright © 2010, Elsevier Inc. All rights Reserved


Page table

Table 2.2: Virtual Address Divided into Virtual Page Number and Byte Offset

Copyright © 2010, Elsevier Inc. All rights Reserved
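As a rough illustration of the split the table describes (not from the
slides): with 4 KiB pages, the low 12 bits of a virtual address form the
byte offset and the remaining high-order bits form the virtual page number.
A minimal sketch, assuming 4096-byte pages and an arbitrary example address:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        const uint64_t page_size = 4096;        /* assume 4 KiB pages       */
        uint64_t vaddr  = 0x12345;              /* example virtual address  */
        uint64_t vpn    = vaddr / page_size;    /* virtual page number      */
        uint64_t offset = vaddr % page_size;    /* byte offset in the page  */
        printf("virtual address 0x%llx -> page %llu, offset %llu\n",
               (unsigned long long)vaddr, (unsigned long long)vpn,
               (unsigned long long)offset);
        return 0;
    }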


Translation-lookaside buffer (TLB)

• Using a page table has the potential to significantly increase each
  program's overall run-time, because each memory reference now requires two
  accesses:
  ▪ First access: when you access a virtual address, you first need to read
    the page table in memory to find the physical page number.
  ▪ Second access: after obtaining the physical page number, you then access
    the actual data (or instruction) in memory.

• The translation-lookaside buffer (TLB) is a special address translation
  cache in the processor that addresses this problem.

Copyright © 2010, Elsevier Inc. All rights Reserved


Translation-lookaside buffer (2)

• It caches a small number of entries (typically 16–512) from the page
  table in very fast memory.

• Page fault – attempting to access a page that, according to the page
  table, is only stored on disk and is not currently resident in main
  memory.

Copyright © 2010, Elsevier Inc. All rights Reserved


2.2.5: Instruction Level Parallelism (ILP)

• Attempts to improve processor performance by having multiple processor
  components or functional units simultaneously executing instructions.

Copyright © 2010, Elsevier Inc. All rights Reserved


Instruction Level Parallelism (2)

•Pipelining - functional units are arranged in stages.

•Multiple issue - multiple instructions can be simultaneously initiated.

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining example (1)

Add the floating point numbers 9.87 × 10^4 and 6.54 × 10^3.


Copyright © 2010, Elsevier Inc. All rights Reserved
Pipelining example (2)

• Assume each of the operations in the example above takes one nanosecond
  (10^-9 seconds), so one floating point addition takes about 7 nanoseconds.

• A for loop that performs 1000 such floating point additions then takes
  about 7000 nanoseconds.
Copyright © 2010, Elsevier Inc. All rights Reserved
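The for loop referred to on this slide is not reproduced in this extract. A
minimal sketch of the kind of loop being timed - the same element-wise array
addition that reappears on the Multiple Issue slide below; the array names
are assumptions:

    float x[1000], y[1000], z[1000];
    int i;
    /* ... initialize x and y ... */

    /* 1000 floating point additions; at roughly 7 ns each this is
       roughly 7000 ns without pipelining. */
    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];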
Pipelining (3)

• Divide the floating point adder into 7 separate pieces of hardware or
  functional units.

• The first unit fetches two operands, the second unit compares exponents,
  etc.

• The output of one functional unit is the input to the next.

Copyright © 2010, Elsevier Inc. All rights Reserved


Pipelining (4)

Table 2.3: Pipelined Addition.


Numbers in the table are subscripts of operands/results.
Copyright © 2010, Elsevier Inc. All rights Reserved
Pipelining (5)

• One floating point addition still takes 7 nanoseconds.

• But 1000 floating point additions now take only 1006 nanoseconds: the
  first result leaves the pipeline after 7 ns, and one new result is
  completed every nanosecond after that (7 + 999 = 1006).

Copyright © 2010, Elsevier Inc. All rights Reserved


Multiple Issue (1)
• Multiple issue processors replicate functional units and try to
  simultaneously execute different instructions in a program.

    for (i = 0; i < 1000; i++)
        z[i] = x[i] + y[i];

[Figure: with two floating point adders, adder #1 can compute z[1], z[3], ...
while adder #2 computes z[2], z[4], ...]
Copyright © 2010, Elsevier Inc. All rights Reserved


Multiple Issue (2)
• Static multiple issue - functional units are scheduled at compile time.

• Dynamic multiple issue – functional units are scheduled at run-time.
  ▪ Dynamic multiple issue processors are also known as superscalar
    processors.

Copyright © 2010, Elsevier Inc. All rights Reserved


Speculation (1)
• In order to make use of multiple issue, the system must find
instructions that can be executed simultaneously.

■ In speculation, the compiler or the processor makes a guess about an
  instruction, and then executes the instruction on the basis of the guess.

Copyright © 2010, Elsevier Inc. All rights Reserved


Speculation ….

    z = x + y;
    if (z > 0)
        w = x;
    else
        w = y;

• Here the compiler or processor might speculate that z will be positive and
  go ahead with w = x.

• If the system speculates incorrectly, it must go back and recalculate
  w = y.

Copyright © 2010, Elsevier Inc. All rights Reserved
2.2.6: Hardware multithreading
• ILP can be very difficult to exploit: a program with a long sequence of
  dependent statements offers few opportunities for the simultaneous
  execution of different instructions.

• Thread-level parallelism, or TLP, attempts to provide parallelism through
  the simultaneous execution of different threads.

• TLP therefore provides a coarser-grained parallelism than ILP: the program
  units that are being simultaneously executed - threads - are larger than
  the finer-grained units, i.e., individual instructions.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading …..
• Hardware multithreading provides a means for systems to continue doing
useful work when the task being currently executed has stalled.
▪E.g., the current task has to wait for data to be loaded from memory.

• Instead of looking for parallelism in the currently executing thread, it
  may make sense to simply run another thread.

• Of course, for this to be useful, the system must support very rapid
switching between threads.
▪For example, in some older systems, threads were simply implemented as
processes, and in the time it took to switch between processes, thousands of
instructions could be executed.
Copyright © 2010, Elsevier Inc. All rights Reserved
Hardware multithreading …..

• Fine-grained multithreading - the processor switches between threads
  after each instruction, skipping threads that are stalled.

• Pros: potential to avoid wasted machine time due to stalls.


• Cons: a thread that’s ready to execute a long sequence of
instructions may have to wait to execute every instruction.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading …
• Coarse-grained multithreading: only switches threads when the running
  thread has stalled waiting for a time-consuming operation to complete.

• Pros: switching threads doesn’t need to be nearly instantaneous.


• Cons: the processor can be idled on shorter stalls, and thread
switching will also cause delays.

Copyright © 2010, Elsevier Inc. All rights Reserved


Hardware multithreading (3)

• Simultaneous multithreading (SMT): a variation on fine-grained
  multithreading.

• Allows multiple threads to make use of the multiple functional units.

Copyright © 2010, Elsevier Inc. All rights Reserved
