CS526 3 Design of Parallel Programs
Performance Aspects
(CS 526)
• Parallel Task
– A task that can be executed by multiple processors safely
(producing correct results)
• Serial Execution
– Execution of a program sequentially, one statement at a
time.
– In the simplest sense, this is what happens on a one
processor machine.
Some General Parallel Terminologies
• Parallel Execution
– Execution of a program by more than one task (threads)
– Each task being able to execute the same or different
statement at the same moment in time.
• Shared Memory
– where all processors have direct (usually bus based) access
to common physical memory
– In a programming sense, it describes a model where parallel
tasks all have the same "picture" of memory
• Distributed Memory
– Network based memory access for physical memory that is
not common.
– Tasks can only logically "see" local machine memory and
must use communications to access memory on other
nodes.
Some General Parallel Terminologies
• Communications
– Parallel tasks typically need to exchange data. This can be
accomplished through shared memory or over a network;
however, the actual event of data exchange is commonly
referred to as communications (regardless of the method
employed).
• Synchronization
– The coordination of parallel tasks in real time, very often
associated with communications
– Often implemented by establishing a synchronization point
within an application where a task may not proceed further
until another task(s) reaches the same or logically
equivalent point.
Some General Parallel Terminologies
• Granularity
– In parallel computing, granularity is a measure of the ratio
of computation to communication.
– Coarse: relatively large amounts of computational work are
done between communication events
– Fine: relatively small amounts of computational work are
done between communication events
• Observed Speedup
– Observed speedup of a code which has been parallelized:
speedup = wall-clock time of serial execution /
          wall-clock time of parallel execution
• Massively Parallel
– Refers to the hardware that comprises a given parallel
system: one having many processors (hundreds of processors or more)
Some General Parallel Terminologies
• Scalability
– Refers to a parallel system's (hardware and/or software)
ability to demonstrate a proportionate increase in
parallel speedup with the addition of more processors.
Types of Synchronization
1. Barrier
– Usually involves all tasks; each task performs its work until it
reaches the barrier and then blocks
– When the last task reaches the barrier, all tasks are
synchronized
2. Lock / semaphore
– Can involve any number of tasks
– The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or code.
(Both mechanisms are sketched in the pthreads example below.)
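A minimal pthreads sketch of both mechanisms (the slide names no particular library; pthreads, the thread count, and the shared counter are illustrative assumptions): each thread updates a mutex-protected counter, then waits at a barrier until every thread has arrived.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_barrier_t barrier;
static int counter = 0;                 /* shared data protected by the lock */

static void *worker(void *arg)
{
    (void)arg;

    /* Lock/semaphore: only the task holding the mutex touches the counter. */
    pthread_mutex_lock(&lock);
    counter++;
    pthread_mutex_unlock(&lock);

    /* Barrier: no task proceeds until the last task reaches this point. */
    pthread_barrier_wait(&barrier);
    return NULL;
}

int main(void)
{
    pthread_t t[NTHREADS];

    pthread_barrier_init(&barrier, NULL, NTHREADS);
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, worker, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    pthread_barrier_destroy(&barrier);

    printf("counter = %d\n", counter);  /* always NTHREADS: increments were serialized */
    return 0;
}

(Compile with -pthread; pthread_barrier_* requires a POSIX system that provides it, e.g. Linux.)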
Two types of granularity:
1. Fine-grain parallelism
2. Coarse-grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work are done between
communication events
• Low computation to communication ratio
• Implies high communication overhead and less opportunity for
performance enhancement
• If granularity is too fine it is possible that the overhead required for
communications and synchronization between tasks takes longer
than the computation.
Coarse-grain Parallelism
• Relatively large amounts of computational
work are done between
communication/synchronization events
Max. speedup = 1 / (1 - P),   where P = parallel fraction of the code

                 speedup
     N      P = .50    P = .90    P = .99
    10        1.82       5.26       9.17
   100        1.98       9.17      50.25
  1000        1.99       9.91      90.99
 10000        1.99       9.99      99.02
Amdahl's Law
F = serial fraction of the code; maximum speedup = 1 / F
E.g., 5% serial code: maximum speedup = 1 / 0.05 = 20
Maximum Speedup (Amdahl's Law)
Maximum speedup is usually p with p processors
(linear speedup).
E.g., if F == 1 (all the code is serial), then the speedup will be 1
no matter how many processors are used.
Speedup (with N CPUs or Machines)
• Introducing the number of processors performing the
parallel fraction of work, the relationship can be
modelled by:
speedup = 1 / (fS + fP / Proc)
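A short C sketch that evaluates this formula, keeping the slide's names fS (serial fraction), fP (parallel fraction), and Proc (number of processors); it reproduces the speedup table shown earlier.

#include <stdio.h>

/* speedup = 1 / (fS + fP / Proc), with fS = 1 - fP */
static double speedup(double fP, int Proc)
{
    double fS = 1.0 - fP;
    return 1.0 / (fS + fP / Proc);
}

int main(void)
{
    const double fP[]   = { 0.50, 0.90, 0.99 };
    const int    Proc[] = { 10, 100, 1000, 10000 };

    printf("     N    P=.50    P=.90    P=.99\n");
    for (int i = 0; i < 4; i++) {
        printf("%6d", Proc[i]);
        for (int j = 0; j < 3; j++)
            printf("  %7.2f", speedup(fP[j], Proc[i]));
        printf("\n");
    }
    return 0;
}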
• Number of operations
• Volume of data manipulated
• Type of data: temporary, read only, etc.
• Volume of data communicated between nodes
Regularity versus Irregularity
• Data Structures: dense vectors/matrices versus sparse
(stored as such) matrices
– Message Passing
• Explicit exchange of messages
• Implicit synchronization (a minimal MPI sketch follows this list)
• Communication hardware:
– Shared memory: bus-based shared memory systems,
Symmetric Multiprocessors (SMPs)
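As an illustration of explicit message exchange (MPI is an assumption here; the slide does not name a library): rank 0 sends an integer that rank 1 receives, and the blocking receive is what provides the implicit synchronization between the two tasks.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;
        /* Explicit exchange: the data is packed into a message and sent. */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* Implicit synchronization: the receive blocks until the message arrives. */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}

(Run with at least two ranks, e.g. mpirun -np 2 ./a.out.)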
• Two categories:
1. Temporal Locality: a memory location that is referenced once is
likely to be referenced multiple times in the near future:

for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        a[j] = b[i] * PI;   /* b[i] is reused on every inner-loop iteration */
Principle of Locality
2. Spatial Locality: if a memory location is referenced once, the
program is likely to reference a nearby memory location soon:

for (int i = 0; i < 1000; i++)
    for (int j = 0; j < 1000; j++)
        a[j] = b[i] * PI;   /* successive a[j] writes touch adjacent memory */
Vector Product Example

/* Sketch of the dot-product loop analyzed below; the 8-element float
   vectors are assumed from the cache analysis that follows. */
float dotprod(float x[8], float y[8])
{
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += x[i] * y[i];
    return sum;
}
Vector Product Example
Assumptions: a cache line holds four floats, and x[i] and y[i] map to
different cache lines.
• Access Sequence
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded
– Read x[1]: hit
– Read y[1]: hit
– ...
– 2 misses / 8 reads
• Analysis
– x[i] and y[i] map to different cache lines
– Two memory accesses per iteration
– After every 4th iteration we have two misses
– Cache miss rate = 25% (2 misses / 8 loads)
Thrashing Example: Bad Case
x[0]..x[3] and y[0]..y[3] are loaded into the same cache lines.
• Access Pattern
– Read x[0]: miss; x[0], x[1], x[2], x[3] loaded
– Read y[0]: miss; y[0], y[1], y[2], y[3] loaded (evicting x[0]..x[3])
– Read x[1]: miss; x[0], x[1], x[2], x[3] loaded again
– Read y[1]: miss; y[0], y[1], y[2], y[3] loaded again
– ...
– 8 misses / 8 reads (thrashing)
• Analysis
– x[i] and y[i] map to the same cache lines
– Two memory accesses per iteration
– On every iteration we have two misses
– Miss rate = 100%
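A common remedy, not shown on the slide, is to pad one of the arrays so that x[i] and y[i] no longer map to the same cache lines. The sketch below assumes the usual setup for this example: x and y laid out back to back, a small direct-mapped cache, and 16-byte (4-float) lines.

/* One cache line (4 floats) of padding shifts y so that x[i] and y[i]
   land in different cache lines instead of evicting each other. */
float x[8 + 4];   /* padded */
float y[8];

float dotprod_padded(void)
{
    float sum = 0.0f;
    for (int i = 0; i < 8; i++)
        sum += x[i] * y[i];   /* back to ~2 misses per 8 reads, as in the good case */
    return sum;
}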
Matrix Sum Example-1
// <Get START time here>
/* Example 1: row-major traversal, A[i][j] follows the memory layout
   (good spatial locality). */
for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[i][j];
}

/* Example 2: column-major traversal, A[j][i] strides through memory
   (poor spatial locality). */
for (kk = 0; kk < 1000; kk++) {
    sum = 0;
    for (i = 0; i < M; i++)
        for (j = 0; j < N; j++)
            sum += A[j][i];
}
• OR:
– Use the past to predict the future
– Use the compiler
– Bet on several horses
Programming for Parallel Architectures (Trick-4)