An Introduction to Parallel Programming
Peter Pacheco
Chapter 2
Parallel Hardware and Parallel
Software
Copyright © 2010, Elsevier Inc. All rights reserved.
Roadmap
Some background
Modifications to the von Neumann model
Parallel hardware
Parallel software
Input and output
Performance
Parallel program design
Writing and running parallel programs
Assumptions
SOME BACKGROUND
Serial hardware and software
A conventional computer runs one program at a time: it reads input, executes the program, and produces output.
The von Neumann Architecture
Figure 2.1
Main memory
This is a collection of locations, each of
which is capable of storing both
instructions and data.
Every location consists of an address,
which is used to access the location, and
the contents of the location.
Central processing unit (CPU)
Divided into two parts.
Control unit – responsible for deciding which instruction in a program should be executed. (the boss)
Arithmetic and logic unit (ALU) – responsible for executing the actual instructions. (the worker)
Key terms
Register – very fast storage, part of the
CPU.
Program counter – stores address of the
next instruction to be executed.
Bus – wires and hardware that connect the CPU and memory.
Fetching/reading: the CPU reads data or instructions from memory.
Writing/storing: the CPU writes data back to memory.
von Neumann bottleneck
The separation of CPU and memory means that instructions and data must travel back and forth over the interconnect, and this transfer rate, not the CPU's speed, limits performance.
An operating system “process”
An instance of a computer program that is
being executed.
Components of a process:
The executable machine language program.
A block of memory.
Descriptors of resources the OS has allocated
to the process.
Security information.
Information about the state of the process.
Multitasking
Gives the illusion that a single processor
system is running multiple programs
simultaneously.
Processes take turns running for a short interval. (a time slice)
After its time is up, it waits until it has a
turn again. (blocks)
Threading
Threads are contained within processes.
They allow programmers to divide their
programs into (more or less) independent
tasks.
The hope is that when one thread blocks
because it is waiting on a resource,
another will have work to do and can run.
A process and two threads
the “master” thread
Starting a thread is called forking.
Terminating a thread is called joining.
Figure 2.2
MODIFICATIONS TO THE VON
NEUMANN MODEL
Basics of caching
A collection of memory locations that can
be accessed in less time than some other
memory locations.
A CPU cache is typically located on the same chip as the CPU, or in a memory that can be accessed much faster than ordinary memory.
Principle of locality
Accessing one location tends to be followed by an access of a nearby location.
Spatial locality – accessing a nearby location.
Temporal locality – accessing the same location again in the near future.
Principle of locality
float z[1000];
…
sum = 0.0;
for (i = 0; i < 1000; i++)
    sum += z[i];   /* elements of z are accessed in order (spatial locality);
                      sum is reused on every iteration (temporal locality) */
Levels of Cache
L1 (smallest & fastest)
L2
L3 (largest & slowest)
Cache hit
The CPU fetches x and finds it already in the L1 cache – a cache hit.
(Figure: x and sum in L1; y, z, total in L2; A[], radius, r1, center in L3.)
Cache miss
The CPU fetches x, but it is in none of the cache levels, so it must be loaded from main memory – a cache miss.
(Figure: y, sum in L1; r1, z, total in L2; A[], radius, center in L3; x only in main memory.)
Issues with cache
When a CPU writes data to cache, the
value in cache may be inconsistent with
the value in main memory.
Write-through caches handle this by
updating the data in main memory at the
time it is written to cache.
Write-back caches mark data in the cache
as dirty. When the cache line is replaced
by a new cache line from memory, the dirty
line is written to memory.
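As a hedged sketch (not from the text; the field names and the 64-byte line size are illustrative), the per-line bookkeeping that makes write-back possible looks like this:

#include <stdbool.h>

#define LINE_SIZE 64                  /* assumed bytes per cache line */

struct cache_line {
    bool valid;                       /* does this line hold real data?  */
    bool dirty;                       /* written since it was loaded?    */
    unsigned tag;                     /* which memory block is stored    */
    unsigned char data[LINE_SIZE];    /* the cached bytes                */
};

/* On eviction, a write-back cache writes the line to main memory only
   if dirty is set; a write-through cache never needs to, because main
   memory was already updated at the time of each store. */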
Cache mappings
Fully associative – a new line can be placed at any location in the cache.
Direct mapped – each cache line has a unique location in the cache to which it will be assigned.
n-way set associative – each cache line can be placed in one of n different locations in the cache.
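A hedged sketch of how a direct-mapped cache picks that unique location (the 64-byte line size is an assumption; the 4-line cache matches Table 2.1 below):

#include <stdio.h>

int main(void) {
    unsigned line_size = 64, num_lines = 4;   /* illustrative sizes        */
    unsigned addr  = 0x12345;                 /* some byte address         */
    unsigned block = addr / line_size;        /* which memory line         */
    unsigned index = block % num_lines;       /* its unique cache slot     */
    unsigned tag   = block / num_lines;       /* identifies the line later */
    printf("memory line %u -> cache line %u (tag %u)\n", block, index, tag);
    return 0;
}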
n-way set associative
When a line in memory can be mapped to several different locations in the cache, we also need a way to decide which cached line should be replaced or evicted.
Example
Table 2.1: Assignments of a 16-line
main memory to a 4-line cache
Caches and programs
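The table on this slide (not reproduced here) compares the run-times of two loop orders over a two-dimensional array; a minimal sketch of the two orders (array size and names are illustrative):

#define MAX 1000
double A[MAX][MAX], x[MAX], y[MAX];
int i, j;

/* C stores A row by row, so this order walks memory sequentially and
   uses every element of each cache line it loads: */
for (i = 0; i < MAX; i++)
    for (j = 0; j < MAX; j++)
        y[i] += A[i][j] * x[j];

/* This order jumps a whole row (MAX doubles) between consecutive
   accesses to A, so it misses in the cache far more often: */
for (j = 0; j < MAX; j++)
    for (i = 0; i < MAX; i++)
        y[i] += A[i][j] * x[j];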
Virtual memory (1)
If we run a very large program or a
program that accesses very large data
sets, all of the instructions and data may
not fit into main memory.
Virtual memory functions as a cache for
secondary storage.
Virtual memory (2)
It exploits the principles of spatial and temporal locality.
It only keeps the active parts of running
programs in main memory.
Virtual memory (3)
Swap space - those parts that are idle are
kept in a block of secondary storage.
Pages – blocks of data and instructions.
Usually these are relatively large.
Most systems have a fixed page
size that currently ranges from
4 to 16 kilobytes.
Virtual memory (4)
(Figure: pages of programs A, B, and C, with only the active pages resident in main memory.)
Virtual page numbers
When a program is compiled its pages are
assigned virtual page numbers.
When the program is run, a table is
created that maps the virtual page
numbers to physical addresses.
A page table is used to translate the
virtual address into a physical address.
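A hedged sketch of the split between page number and offset (the 12-bit offset corresponds to an assumed 4-kilobyte page, one of the sizes mentioned above; the address is illustrative):

#include <stdio.h>
#include <stdint.h>

int main(void) {
    uint32_t vaddr  = 0x00403a17;     /* illustrative virtual address     */
    uint32_t vpn    = vaddr >> 12;    /* virtual page number (upper bits) */
    uint32_t offset = vaddr & 0xfff;  /* byte offset within the page      */
    /* A real system looks vpn up in the page table to get a physical
       page number, then recombines it with the unchanged offset. */
    printf("vpn = 0x%x, offset = 0x%x\n", vpn, offset);
    return 0;
}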
Page table
Table 2.2: Virtual Address Divided into
Virtual Page Number and Byte Offset
Translation-lookaside buffer (TLB)
Using a page table has the potential to
significantly increase each program’s
overall run-time.
The TLB is a special address translation cache in the processor that avoids this overhead.
Translation-lookaside buffer (2)
It caches a small number of entries
(typically 16–512) from the page table in
very fast memory.
Page fault – attempting to access a page that has a valid entry in the page table, but is stored only on disk rather than in main memory.
Instruction Level Parallelism (ILP)
Attempts to improve processor
performance by having multiple processor
components or functional units
simultaneously executing instructions.
Instruction Level Parallelism (2)
Pipelining - functional units are arranged
in stages.
Multiple issue - multiple instructions can
be simultaneously initiated.
Pipelining
Pipelining example (1)
Add the floating point numbers 9.87×10⁴ and 6.54×10³.
Pipelining example (2)
Assume each operation takes one nanosecond (10⁻⁹ seconds).
The addition requires 7 such operations, so a loop performing 1000 of these additions takes about 7000 nanoseconds.
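The loop in question (reconstructed here as a sketch; the array names are illustrative) adds the elements pairwise:

float x[1000], y[1000], z[1000];
int i;
/* Without pipelining, each iteration is one 7-stage floating point
   addition: 1000 additions × 7 ns ≈ 7000 ns in total. */
for (i = 0; i < 1000; i++)
    z[i] = x[i] + y[i];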
Pipelining (3)
Divide the floating point adder into 7
separate pieces of hardware or functional
units.
First unit fetches two operands, second
unit compares exponents, etc.
Output of one functional unit is input to the
next.
Pipelining (4)
Table 2.3: Pipelined Addition.
Numbers in the table are subscripts of operands/results.
Pipelining (5)
One floating point addition still takes 7 nanoseconds.
But 1000 floating point additions now take only 1006 nanoseconds: the first result appears after 7 ns, and once the pipeline is full, one result completes every nanosecond thereafter (7 + 999 = 1006).
Multiple Issue (1)
Multiple issue processors replicate
functional units and try to simultaneously
execute different instructions in a
program.
for (i = 0; i < 1000; i++)
z[i] = x[i] + y[i];
(Figure: adder #1 computes z[1], z[3], …, while adder #2 simultaneously computes z[2], z[4], ….)
Multiple Issue (2)
static multiple issue - functional units are
scheduled at compile time.
dynamic multiple issue – functional units are scheduled at run-time.
Processors that support dynamic multiple issue are called superscalar.
Speculation (1)
In order to make use of multiple issue, the
system must find instructions that can be
executed simultaneously.
In speculation, the compiler or
the processor makes a guess
about an instruction, and then
executes the instruction on the
basis of the guess.
Speculation (2)
z = x + y;
if (z > 0)     /* speculate that z will be positive */
    w = x;
else
    w = y;

If the system speculates incorrectly, it must go back and recalculate w = y.
Hardware multithreading (1)
There aren’t always good opportunities for
simultaneous execution of different
threads.
Hardware multithreading provides a means
for systems to continue doing useful work
when the task being currently executed
has stalled.
Ex., the current task has to wait for data to be
loaded from memory.
Hardware multithreading (2)
Fine-grained - the processor switches
between threads after each instruction,
skipping threads that are stalled.
Pros: potential to avoid wasted machine time
due to stalls.
Cons: a thread that’s ready to execute a long
sequence of instructions may have to wait to
execute every instruction.
Hardware multithreading (3)
Coarse-grained - only switches threads
that are stalled waiting for a time-
consuming operation to complete.
Pros: switching threads doesn’t need to be
nearly instantaneous.
Cons: the processor can be idled on shorter
stalls, and thread switching will also cause
delays.
Hardware multithreading (4)
Simultaneous multithreading (SMT) - a
variation on fine-grained multithreading.
Allows multiple threads to make use of the
multiple functional units.
PARALLEL HARDWARE
Parallelism that a programmer can write code to exploit.
Flynn’s Taxonomy
SISD – single instruction stream, single data stream (the classic von Neumann architecture).
SIMD – single instruction stream, multiple data streams.
MISD – multiple instruction streams, single data stream (not covered).
MIMD – multiple instruction streams, multiple data streams.
SIMD
Parallelism achieved by dividing data
among the processors.
Applies the same instruction to multiple
data items.
Called data parallelism.
SIMD example
n data items, n ALUs: a single control unit broadcasts each instruction, and ALUi applies it to data item x[i].
(Figure: control unit driving ALU1, ALU2, …, ALUn on x[1], x[2], …, x[n].)
for (i = 0; i < n; i++)
    x[i] += y[i];
SIMD
What if we don’t have as many ALUs as
data items?
Divide the work and process iteratively.
Ex. m = 4 ALUs and n = 15 data items.
Round   ALU1    ALU2    ALU3    ALU4
1       X[0]    X[1]    X[2]    X[3]
2       X[4]    X[5]    X[6]    X[7]
3       X[8]    X[9]    X[10]   X[11]
4       X[12]   X[13]   X[14]   (idle)
SIMD drawbacks
All ALUs are required to execute the same
instruction, or remain idle.
In classic design, they must also operate
synchronously.
The ALUs have no instruction storage.
Efficient for large data parallel problems,
but not other types of more complex
parallel problems.
Vector processors (1)
Operate on arrays or vectors of data, while conventional CPUs operate on individual data elements or scalars.
Vector registers.
Capable of storing a vector of operands and
operating simultaneously on their contents.
Vector processors (2)
Vectorized and pipelined functional units.
The same operation is applied to each
element in the vector (or pairs of elements).
Vector instructions.
Operate on vectors rather than scalars.
Vector processors (3)
Interleaved memory.
Multiple “banks” of memory, which can be
accessed more or less independently.
Distributing the elements of a vector across multiple banks reduces or eliminates the delay in loading/storing successive elements.
Strided memory access and hardware
scatter/gather.
The program accesses elements of a vector
located at fixed intervals.
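A hedged illustration of strided access (the stride value is arbitrary): the loop touches every fourth element, exactly the pattern interleaved banks and hardware gather/scatter are built to serve.

double x[1000], sum = 0.0;
int i, stride = 4;                    /* illustrative stride        */
for (i = 0; i < 1000; i += stride)    /* visits x[0], x[4], x[8], … */
    sum += x[i];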
Vector processors - Pros
Fast.
Easy to use.
Vectorizing compilers are good at
identifying code to exploit.
Compilers also can provide information
about code that cannot be vectorized.
Helps the programmer re-evaluate code.
High memory bandwidth.
Uses every item in a cache line.
Vector processors - Cons
They don’t handle irregular
data structures as well as other
parallel architectures.
There is a hard limit to their ability to handle ever larger problems. (scalability)
Graphics Processing Units (GPU)
Real-time graphics application programming interfaces (APIs) use points, lines, and triangles to internally represent the surface of an object.
GPUs
A graphics processing pipeline converts
the internal representation into an array of
pixels that can be sent to a computer
screen.
Several stages of this pipeline
(called shader functions) are
programmable.
Typically just a few lines of C code.
GPUs
Shader functions are also implicitly
parallel, since they can be applied to
multiple elements in the graphics stream.
GPUs can often optimize performance by using SIMD parallelism.
The current generation of GPUs use SIMD parallelism, although they are not pure SIMD systems.
MIMD
Supports multiple simultaneous instruction
streams operating on multiple data
streams.
Typically consist of a collection of fully
independent processing units or cores,
each of which has its own control unit and
its own ALU.
Shared Memory System (1)
A collection of autonomous processors is
connected to a memory system via an
interconnection network.
Each processor can access each memory
location.
The processors usually communicate
implicitly by accessing shared data
structures.
Shared Memory System (2)
Most widely available shared memory
systems use one or more multicore
processors.
(multiple CPUs or cores on a single chip)
Shared Memory System
Figure 2.3
UMA multicore system
The time to access all the memory locations is the same for all the cores.
Figure 2.5
NUMA multicore system
A memory location that a core is directly connected to can be accessed faster than a memory location that must be accessed through another chip.
Figure 2.6
Distributed Memory System
Clusters (most popular)
A collection of commodity systems.
Connected by a commodity interconnection
network.
The nodes of a cluster are individual computation units joined by a communication network.
Clusters whose nodes have shared memory are also known as hybrid systems.
Distributed Memory System
Figure 2.4
Interconnection networks
Affects performance of both distributed
and shared memory systems.
Two categories:
Shared memory interconnects
Distributed memory interconnects
Shared memory interconnects
Bus interconnect
A collection of parallel communication wires
together with some hardware that controls
access to the bus.
The communication wires are shared by the devices connected to the bus.
As the number of devices connected to the bus increases, contention for use of the bus increases, and performance decreases.
Shared memory interconnects
Switched interconnect
Uses switches to control the routing of data
among the connected devices.
Crossbar –
Allows simultaneous communication among
different devices.
Faster than buses.
But the cost of the switches and links is relatively
high.
Figure 2.7
(a)
A crossbar switch connecting 4 processors
(Pi) and 4 memory modules (Mj)
(b)
Configuration of internal switches in
a crossbar
(c) Simultaneous memory accesses
by the processors
Distributed memory interconnects
Two groups
Direct interconnect
Each switch is directly connected to a processor
memory pair, and the switches are connected to
each other.
Indirect interconnect
Switches may not be directly connected to a
processor.
Direct interconnect
Figure 2.8
(ring and toroidal mesh)
Bisection width
A measure of “number of simultaneous
communications” or “connectivity”.
How many simultaneous communications
can take place “across the divide” between
the halves?
Two bisections of a ring
Figure 2.9
A bisection of a toroidal mesh
Figure 2.10
Definitions
Bandwidth
The rate at which a link can transmit data.
Usually given in megabits or megabytes per
second.
Bisection bandwidth
A measure of network quality.
Instead of counting the number of links joining
the halves, it sums the bandwidth of the links.
Fully connected network
Each switch is directly connected to every
other switch.
(impractical to build for large p)
bisection width = p²/4
Figure 2.11
Hypercube
Highly connected direct interconnect.
Built inductively:
A one-dimensional hypercube is a fully-
connected system with two processors.
A two-dimensional hypercube is built from two
one-dimensional hypercubes by joining
“corresponding” switches.
Similarly a three-dimensional hypercube is
built from two two-dimensional hypercubes.
Hypercubes
Figure 2.12
one-, two-, and three-dimensional
Indirect interconnects
Simple examples of indirect networks:
Crossbar
Omega network
Often shown with unidirectional links and a
collection of processors, each of which has
an outgoing and an incoming link, and a
switching network.
A generic indirect network
Figure 2.13
Crossbar interconnect for
distributed memory
Figure 2.14
An omega network
Figure 2.15
A switch in an omega network
Figure 2.16
More definitions
Any time data is transmitted, we’re
interested in how long it will take for the
data to reach its destination.
Latency
The time that elapses between the source’s
beginning to transmit the data and the
destination’s starting to receive the first byte.
Bandwidth
The rate at which the destination receives data
after it has started to receive the first byte.
Message transmission time = l + n / b,
where l = latency (seconds), n = length of the message (bytes), and b = bandwidth (bytes per second).
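A hedged sketch with made-up numbers (not from the text): a link with 2 µs of latency and 1 GB/s of bandwidth delivers a 1 MB message in about a millisecond.

#include <stdio.h>

/* Transmission-time model from the slide: time = latency + bytes/bandwidth */
double transmission_time(double latency_s, double bytes, double bandwidth_Bps) {
    return latency_s + bytes / bandwidth_Bps;
}

int main(void) {
    /* Illustrative numbers: l = 2e-6 s, n = 1e6 bytes, b = 1e9 bytes/s */
    printf("1 MB message: %g seconds\n", transmission_time(2e-6, 1e6, 1e9));
    return 0;   /* prints ~0.001002 */
}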
Cache coherence
Programmers have no
control over caches
and when they get
updated.
Figure 2.17
A shared memory system with two cores
and two caches
Cache coherence
y0 privately owned by Core 0
y1 and z1 privately owned by Core 1
x = 2; /* shared variable */
y0 eventually ends up = 2
y1 eventually ends up = 6
z1 = ??? (it depends on whether Core 1's cached copy of x has been updated)
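A hedged reconstruction of the sequence behind those values (the interleaving is illustrative):

int x = 2;                             /* shared variable, cached by both cores */
/* time 0, core 0 */ int y0 = x;       /* y0 = 2                                */
/* time 0, core 1 */ int y1 = 3 * x;   /* y1 = 6                                */
/* time 1, core 0 */ x = 7;            /* updates core 0's cached copy of x     */
/* time 2, core 1 */ int z1 = 4 * x;   /* 28 if core 1 sees the new x,
                                          8 if its cache still holds x = 2      */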
Snooping Cache Coherence
The cores share a bus.
Any signal transmitted on the bus can be
“seen” by all cores connected to the bus.
When core 0 updates the copy of x stored
in its cache it also broadcasts this
information across the bus.
If core 1 is “snooping” the bus, it will see
that x has been updated and it can mark
its copy of x as invalid.
Directory Based Cache Coherence
Uses a data structure called a directory
that stores the status of each cache line.
When a variable is updated, the directory is consulted, and the cache lines containing that variable in other cores' caches are invalidated.
PARALLEL SOFTWARE
The burden is on software
Hardware and compilers cannot keep up the pace needed on their own; the burden of exploiting parallelism falls on software.
From now on…
In shared memory programs:
Start a single process and fork threads.
Threads carry out tasks.
In distributed memory programs:
Start multiple processes.
Processes carry out tasks.
SPMD – single program multiple data
An SPMD program consists of a single executable that can behave as if it were multiple different programs through the use of conditional branches.

if (I'm thread/process i)
    do this;
else
    do that;
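A hedged sketch of the pattern in C (get_my_rank and the worker functions are hypothetical stand-ins for whatever rank query and tasks the parallel API provides):

int my_rank = get_my_rank();       /* hypothetical rank query   */
if (my_rank == 0)
    collect_and_print_results();   /* hypothetical "boss" work  */
else
    compute_my_share(my_rank);     /* hypothetical worker task  */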
Writing Parallel Programs
1. Divide the work among the processes/threads
   (a) so each process/thread gets roughly the same amount of work,
   (b) and communication is minimized.
2. Arrange for the processes/threads to synchronize.
3. Arrange for communication among the processes/threads.

Example of work to be divided:
double x[n], y[n];
…
for (i = 0; i < n; i++)
    x[i] += y[i];
Shared Memory
Dynamic threads
The master thread waits for work, forks new threads, and when the threads are done, they terminate.
Efficient use of resources, but thread creation
and termination is time consuming.
Static threads
Pool of threads created and are allocated work,
but do not terminate until cleanup.
Better performance, but potential waste of
system resources.
Nondeterminism
...
printf("Thread %d > my_val = %d\n", my_rank, my_x);
...

One possible output:
Thread 0 > my_val = 7
Thread 1 > my_val = 19

Another possible output:
Thread 1 > my_val = 19
Thread 0 > my_val = 7
Nondeterminism
my_val = Compute_val(my_rank);
x += my_val;
Nondeterminism
Race condition
Critical section
Mutually exclusive
Mutual exclusion lock (mutex, or simply
lock)
my_val = Compute_val(my_rank);
Lock(&add_my_val_lock);
x += my_val;
Unlock(&add_my_val_lock);
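A minimal runnable sketch of the same idea using Pthreads (the shared variable, thread count, and stand-in for Compute_val are illustrative):

#include <pthread.h>
#include <stdio.h>

int x = 0;                                  /* shared variable */
pthread_mutex_t add_my_val_lock = PTHREAD_MUTEX_INITIALIZER;

void* add_my_val(void* rank) {
    int my_val = (int)(long)rank + 1;       /* stand-in for Compute_val    */
    pthread_mutex_lock(&add_my_val_lock);   /* enter the critical section  */
    x += my_val;                            /* safe: one thread at a time  */
    pthread_mutex_unlock(&add_my_val_lock);
    return NULL;
}

int main(void) {
    pthread_t t[4];
    for (long i = 0; i < 4; i++)
        pthread_create(&t[i], NULL, add_my_val, (void*)i);
    for (int i = 0; i < 4; i++)
        pthread_join(t[i], NULL);
    printf("x = %d\n", x);                  /* always 1+2+3+4 = 10 */
    return 0;
}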
busy-waiting
my_val = Compute_val(my_rank);
if (my_rank == 1)
    while (!ok_for_1);      /* Busy-wait loop */
x += my_val;                /* Critical section */
if (my_rank == 0)
    ok_for_1 = true;        /* Let thread 1 update x */
message-passing
char message[100];
...
my_rank = Get_rank();
if (my_rank == 1) {
    sprintf(message, "Greetings from process 1");
    Send(message, MSG_CHAR, 100, 0);
} else if (my_rank == 0) {
    Receive(message, MSG_CHAR, 100, 1);
    printf("Process 0 > Received: %s\n", message);
}
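Send, Receive, and Get_rank above are generic placeholders rather than a real API. As a hedged sketch, the same exchange in MPI (the most common message-passing library) would look roughly like this:

#include <mpi.h>
#include <stdio.h>

int main(int argc, char* argv[]) {
    char message[100];
    int my_rank;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &my_rank);
    if (my_rank == 1) {
        sprintf(message, "Greetings from process 1");
        MPI_Send(message, 100, MPI_CHAR, 0, 0, MPI_COMM_WORLD);
    } else if (my_rank == 0) {
        MPI_Recv(message, 100, MPI_CHAR, 1, 0, MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
        printf("Process 0 > Received: %s\n", message);
    }
    MPI_Finalize();
    return 0;
}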
Partitioned Global Address
Space Languages
shared int n = ...;
shared double x[n], y[n];
private int i, my_first_element, my_last_element;

my_first_element = ...;
my_last_element = ...;

/* Initialize x and y */
...

for (i = my_first_element; i <= my_last_element; i++)
    x[i] += y[i];
Input and Output
In distributed memory programs, only
process 0 will access stdin. In shared
memory programs, only the master thread
or thread 0 will access stdin.
In both distributed memory and shared
memory programs all the
processes/threads can access stdout and
stderr.
Input and Output
However, because of the indeterminacy of
the order of output to stdout, in most cases
only a single process/thread will be used
for all output to stdout other than
debugging output.
Debug output should always include the
rank or id of the process/thread that’s
generating the output.
Input and Output
Only a single process/thread will attempt to
access any single file other than stdin,
stdout, or stderr. So, for example, each
process/thread can open its own, private
file for reading or writing, but no two
processes/threads will open the same file.
PERFORMANCE
Speedup
Number of cores = p
Serial run-time = Tserial
Parallel run-time = Tparallel
Linear speedup: Tparallel = Tserial / p
Speedup of a parallel program
S = Tserial / Tparallel
Efficiency of a parallel program
E = S / p = (Tserial / Tparallel) / p = Tserial / (p · Tparallel)
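A worked example with made-up numbers (not from the text): if Tserial = 24 seconds and Tparallel = 4 seconds on p = 8 cores, then S = 24/4 = 6 and E = 6/8 = 0.75.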
Speedups and efficiencies of a
parallel program
Speedups and efficiencies of
parallel program on different
problem sizes
Speedup
Efficiency
Effect of overhead
Tparallel = Tserial / p + Toverhead
Amdahl’s Law
Unless virtually all of a serial program is
parallelized, the possible speedup is going
to be very limited — regardless of the
number of cores available.
Example
We can parallelize 90% of a serial
program.
Parallelization is “perfect” regardless of the
number of cores p we use.
Tserial = 20 seconds
Runtime of parallelizable part is
0.9 x Tserial / p = 18 / p
Example (cont.)
Runtime of “unparallelizable” part is
0.1 x Tserial = 2
Overall parallel run-time is
Tparallel = 0.9 x Tserial / p + 0.1 x Tserial = 18 / p + 2
Example (cont.)
Speedup:
S = Tserial / (0.9 × Tserial / p + 0.1 × Tserial) = 20 / (18 / p + 2)
As p grows, S approaches 20 / 2 = 10, so no number of cores can speed this program up by more than a factor of 10.
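A quick sketch tabulating the formula above (pure arithmetic from the example):

#include <stdio.h>

int main(void) {
    /* Amdahl's law for this example: S(p) = 20 / (18/p + 2) */
    int ps[] = {1, 2, 4, 8, 16, 1024};
    for (int i = 0; i < 6; i++) {
        double p = ps[i];
        printf("p = %4d   S = %5.2f\n", ps[i], 20.0 / (18.0 / p + 2.0));
    }
    return 0;   /* S climbs toward, but never reaches, 10 */
}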
Scalability
In general, a technology is scalable if it can handle ever-increasing problem sizes.
If we can increase the number of processes/threads and keep the efficiency fixed without increasing the problem size, the program is strongly scalable.
If we can keep the efficiency fixed by increasing the problem size at the same rate as we increase the number of processes/threads, the program is weakly scalable.
Taking Timings
What is time?
Start to finish?
A program segment of interest?
CPU time?
Wall clock time?
Taking Timings
We use wall clock time for the program segment of interest, as reported by library functions such as MPI_Wtime (MPI) or omp_get_wtime (OpenMP).
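A minimal sketch of the usual start/finish timing pattern, here using OpenMP's omp_get_wtime (any wall clock function slots in the same way):

#include <stdio.h>
#include <omp.h>

int main(void) {
    double start, finish;
    start = omp_get_wtime();     /* wall clock time before the segment */
    /* ... code of interest goes here ... */
    finish = omp_get_wtime();    /* wall clock time after the segment  */
    printf("Elapsed time = %e seconds\n", finish - start);
    return 0;
}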
PARALLEL PROGRAM
DESIGN
Foster’s methodology
1. Partitioning: divide the computation to be
performed and the data operated on by
the computation into small tasks.
The focus here should be on identifying
tasks that can be executed in parallel.
Foster’s methodology
2. Communication: determine what
communication needs to be carried out
among the tasks identified in the previous
step.
Foster’s methodology
3. Agglomeration or aggregation: combine
tasks and communications identified in
the first step into larger tasks.
For example, if task A must be executed
before task B can be executed, it may
make sense to aggregate them into a
single composite task.
Foster’s methodology
4. Mapping: assign the composite tasks
identified in the previous step to
processes/threads.
This should be done so that
communication is minimized, and each
process/thread gets roughly the same
amount of work.
Example - histogram
1.3, 2.9, 0.4, 0.3, 1.3, 4.4, 1.7, 0.4, 3.2, 0.3, 4.9, 2.4, 3.1, 4.4, 3.9, 0.4, 4.2, 4.5, 4.9, 0.9
Serial program - input
1. The number of measurements:
data_count
2. An array of data_count floats: data
3. The minimum value for the bin containing
the smallest values: min_meas
4. The maximum value for the bin containing
the largest values: max_meas
5. The number of bins: bin_count
Serial program - output
1. bin_maxes : an array of bin_count floats
2. bin_counts : an array of bin_count ints
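A hedged sketch of the serial histogram's core loops (assuming equal-width bins, consistent with the inputs and outputs listed above):

float bin_width = (max_meas - min_meas) / bin_count;   /* equal-width bins */
for (int b = 0; b < bin_count; b++) {
    bin_maxes[b]  = min_meas + bin_width * (b + 1);
    bin_counts[b] = 0;
}
for (int i = 0; i < data_count; i++) {
    int b = 0;
    while (b < bin_count - 1 && data[i] >= bin_maxes[b])
        b++;                                /* find the bin holding data[i] */
    bin_counts[b]++;
}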
First two stages of Foster’s
Methodology
Alternative definition of tasks
and communication
Adding the local arrays
Concluding Remarks (1)
Serial systems
The standard model of computer hardware
has been the von Neumann architecture.
Parallel hardware
Flynn’s taxonomy.
Parallel software
We focus on software for homogeneous MIMD
systems, consisting of a single program that
obtains parallelism by branching.
SPMD programs.
Concluding Remarks (2)
Input and Output
We’ll write programs in which one process or
thread can access stdin, and all processes
can access stdout and stderr.
However, because of nondeterminism, except
for debug output we’ll usually have a single
process or thread accessing stdout.
Concluding Remarks (3)
Performance
Speedup
Efficiency
Amdahl’s law
Scalability
Parallel Program Design
Foster’s methodology