
Performance and Tuning of OpenMP Programs
Performance Measurement
• Why do we measure performance?
• Basics to measure performance
• How to get a good speedup?
• What affects performance?
Measures of Performance
• To computer scientists: speedup,
execution time
• To applications people: megaflops, size of problem, accuracy of solution, etc.
Tuning
• Performance tuning is the improvement of application
performance, usually for a specific computer system.
• Most applications will respond to increased system work load with
some degree of decreasing performance.
• Tuning follows these steps:
1. Assess the problem
2. Measure the performance of the application before modification.
3. Identify the parts of the application that are critical for performance, the so-called bottlenecks (use profiling tools).
4. Modify these parts of the application to remove or alleviate
bottlenecks.
5. Measure the performance of the application after modification.
6. Repeat steps 3 through 5 as needed
Some performance Lessons
1- Optimize when you are done with coding
a- It's almost impossible to identify performance bottlenecks before the program
is completely working.
b- When you remove one bottleneck, another one often becomes critical
c- Focusing on optimization during initial development detracts from achieving
other program characteristics

2- Reducing lines of code does not mean your code is faster


for (i = 1; i < 6; i++) a[i] = i;

a[1] = 1;
a[2] = 2;
a[3] = 3;
a[4] = 4;
a[5] = 5;
(Over 80% faster than the first.)
Lessons …. Cont
3- Performance depends on language,
machine and compiler.
4- Correctness is more important than speed
(though we don't want to wait forever)
5- Measurement is your proof
6- Measurements need to be precise (Use
profiling tools)
Optimization
Instrumentation
Instrumentation levels
• Manual: performed by the programmer (e.g., printf() statements)
• Automatic source level: instrumentation added to the source code by an automatic tool according to an instrumentation policy (Ex: TAU)
• Compiler assisted: Ex: "gcc -pg ..." for gprof
• Binary translation: the tool adds instrumentation to a compiled binary (Ex: ATOM)
• Runtime instrumentation: the code is instrumented directly before execution; may result in high overhead (Ex: Valgrind)
Measurement with Profiling
Timing the OpenMP Performance
• A standard practice is to use a standard operating
system command.
• The /bin/time command is available on standard UNIX
systems. For example
/bin/time ./a.out
– The “real”, “user”, and “system” times are then printed after
the program has finished execution.
– For example
$ /bin/time ./program.exe
real 5.4
user 3.2
sys 1.0
– These three numbers can be used to get initial information
about the performance.
– For deeper analysis, a more sophisticated performance tool is
needed.
Timing the OpenMP Performance (cont)
– Generally, the sum of the user and system time is referred to
as the CPU time.
– The number following “real” tells us that the program took 5.4
seconds from the beginning to the end. The real time is also
called the wall-clock time or elapsed time
– The user time of 3.2 seconds is the time spent outside any
operating system service.
– The sys time is the time spent on operating system services
such as input/output routines
– A common cause for a difference between the elapsed (real) time and the CPU time is sharing the processors with other jobs, i.e., a high load on the system.
– The omp_get_wtime() function provided by OpenMP is
useful for measuring the elapsed time of blocks of source
code.
– If your timings vary from run to run, the reason may be that you are sharing the processors with other users.
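As a minimal sketch of using omp_get_wtime() for this purpose (the array size N and the loop body are placeholders, not taken from the slides):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    double t_start = omp_get_wtime();   /* wall-clock time before the region */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
    double t_end = omp_get_wtime();     /* wall-clock time after the region */

    printf("Elapsed time: %f seconds\n", t_end - t_start);
    return 0;
}

Unlike /bin/time, this measures only the enclosed block, so different regions of the program can be timed separately.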
Speedup of Algorithm
• Speedup of algorithm = sequential execution
time / execution time on p processors (with
the same data set).
[Figure: speedup plotted against the number of processors p]
Speedup on Problem
• Speedup on problem = sequential
execution time of best known
sequential algorithm / execution time
on p processors.
• A more honest measure of
performance.
• Avoids picking an easily parallelizable
algorithm with poor sequential execution
time.
What Speedups Can You
Get?
• Linear speedup
– Confusing term: implicitly means a 1-to-1 speedup
per processor / core.
– (almost always) as good as you can do.
• Sub-linear speedup: common, due to
overhead of startup, synchronization,
communication, etc.
• Super-linear speedup: Due to cache/memory
effects
[Figure: actual speedup versus the number of processors p, falling below the linear speedup line]
Efficiency
• Fix problem size.
• Let T(1) be time to execute program on
sequential machine. Many researchers take
time of best sequential algorithm
• T(p) is time to execute program on p
processors
Speedup(p) = T(1) / T(p)
Efficiency(p) = Speedup(p) / p
• Linear speedup if Efficiency(p) = 1
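• For example (illustrative numbers): if T(1) = 100 s and T(4) = 40 s, then Speedup(4) = 100/40 = 2.5 and Efficiency(4) = 2.5/4 = 0.625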
Scalability
• No really precise definition
• Roughly speaking, a program is said to
scale to a certain number of processors
p, if going from p-1 to p processors
results in some acceptable
improvement in speedup (for instance,
an increase of 0.5)
Where is there a need for
parallelization?
• Profiling the application
– Understanding which regions of the code consume most of the time.
– Focus your effort on the important regions that would benefit
from parallelization.
– Amdahl's law (overall speedup):
  Overall speedup = 1 / ((1 - P) + P/s)
  where s is the speedup of the part of the task that benefits from improved system resources,
  and P is the proportion of execution time that this part originally occupied.

– Work vs. Parallel overhead.


Amdahl’s Law Example
• If 40% of the execution time is subject to a speedup, and the improvement makes the affected part twice as fast, find the overall speedup.

P = 0.4, s = 2
Speedup = 1 / ((1 - 0.4) + 0.4/2) = 1 / 0.8 = 1.25
Why keep something
sequential?
• Some parts of the program are not
parallelizable (because of
dependences)
• Some parts may be parallelizable, but
the overhead dwarfs the increased
speedup
Returning to Sequential vs.
Parallel
• Sequential execution time: t seconds
• Overheads of parallel execution: t_st
seconds (depends on architecture)
• (Ideal) parallel execution time: t/p + t_st, where p is the number of processors
• If t/p + t_st >= t, no gain
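• For example (illustrative numbers): with t = 10 s, p = 4 and t_st = 1 s, the ideal parallel time is 10/4 + 1 = 3.5 s, a clear gain; with t = 1 s it is 1/4 + 1 = 1.25 s, which is worse than running sequentially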
Techniques to improve performance
1- Code optimization: code transformations that can help to improve performance (Ex: fewer computations inside loops).
2- Caching strategy: helps to remove performance bottlenecks resulting from slow access to data.
3- Load balancing: your application should exploit the available resources in the best way; threads should have equal loads in order to achieve optimal performance.
4- Bottlenecks: can also be defined as hot spots, where the operation count and the execution time are highest.
Loop Optimization
• By making minor changes in loops, the programmer or
compiler can improve the use of memory and remove
dependencies.
• For example, an i-loop that has an embedded j-loop may calculate the same results more efficiently if the positions of the two loops are switched.
– This may result in accessing data in C by rows instead of by columns.
– Since much of the computational time is spent in loops, and since most array accesses occur there, a suitable reorganization to exploit cache can significantly improve a program's performance.
• Rule for Interchangeability: If any memory location is
referenced more than once in the loop nest and if at
least one of those references modify its value, then their
relative order must not be changed by the transformation
Loop Optimization (cont)
• Check a program’s code to see if a reordering of
statements is allowed and whether it is desirable:
– This is often done better by programmers than by
compilers.
– Should be considered if array accesses in the loop nest do
not occur in the order they are stored in memory.
– Also consider loop transforming if loop has a large body
– A simple reordering within the loop may make a difference.
– Reordering the loop may also allow better exploitation of parallelism or better utilization of the instruction pipeline.
– They can also be used to increase the size of the parallel
area.
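As a sketch of the interchange idea (the array dimensions and the computation are placeholders): in C, making the innermost loop run over the last array index matches the row-major storage order and improves cache-line reuse.

#define N 1024
static double a[N][N], b[N][N];

void scale_before(void)
{
    /* Inner i-loop jumps N elements between accesses: poor locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0 * b[i][j];
}

void scale_after(void)
{
    /* After interchange, the inner j-loop touches consecutive elements */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}

The interchange is legal here because each a[i][j] is written exactly once, so the interchangeability rule above is satisfied.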
Loop Unrolling
• A loop unrolling transformation packs all of the work
of several loop iterations into a single pass through
the loop.
– A powerful technique to reduce the overheads of loops
– May not reduce to one single pass, but reduce the number
of passes through the loop (say by a factor of two by
performing two loop calculations on each pass through
“reduced loop”).
• In example, two is called the “unroll factor”.
– Reduces the number of increments of the loop variable, tests for completion, and branches to the start of the loop code.
– Helps improve cache line utilization by improving data
reuse.
– Can also increase the instruction level parallelism (ILP)
Loop Unrolling (cont)
• Compilers are good at doing loop unrolling.
• One problem is that if the unroll factor does not divide
the iteration count, the remaining iterations have to be
performed outside of the loop nest
– Implemented through a second “cleanup” loop.
• If the loop already contains a lot of computation, loop unrolling may make the use of the cache less efficient.
• If loop contains a procedure call, unrolling the loop
results in new overheads that may outweigh the benefits.
• If loop contains branches, the benefits may also be low.
• Loop jamming is similar to loop unrolling and consists of
“jamming the body of two inner loops” into a single inner
loop that performs the work of both.
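A sketch of manual unrolling with an unroll factor of two, including the cleanup loop mentioned above (the arrays and the loop body are illustrative):

void add_unrolled(const double *a, const double *b, double *c, int n)
{
    int i;

    /* Main loop: two original iterations per pass (unroll factor 2),
       halving the number of increments, tests and branches */
    for (i = 0; i < n - 1; i += 2) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }

    /* Cleanup loop: handles the leftover iteration when n is odd */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}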
Loop Fusion
• Loop Fusion merges two or more loops to create
a bigger loop.
– May enable data in cache to be reused more
frequently
– Might increase the amount of computation per
iteration in order to improve the instruction level
parallelism.
– Would probably also reduce loop overhead because
more work is done per loop.
Loop Fission
• Loop fission is a transformation that breaks up a
loop into several loops.
– May be useful in improving the use of cache or to
isolate a part that inhibits the full optimization of the
loop.
– Most useful when loop nest is large and its data does
not fit well into cache or if we can optimize different
parts of the loop in different ways.
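A small sketch of loop fission (arrays and statements are illustrative, not from the slides): one loop whose body updates two unrelated arrays is split into two loops, each with a smaller working set that fits in cache more easily and can be optimized independently.

void fission_before(double *a, const double *b, double *c, const double *d, int n)
{
    /* One loop touches a, b, c and d in every iteration */
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        c[i] = c[i] * d[i];
    }
}

void fission_after(double *a, const double *b, double *c, const double *d, int n)
{
    /* Two independent loops, each streaming through only two arrays */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
    for (int i = 0; i < n; i++)
        c[i] = c[i] * d[i];
}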
Loop Tiling or Blocking
• Transformation designed to tailor the number of
memory references inside a loop so they can fit
within cache.
– If data sizes are large and memory access is bad or if
there is little data reuse in the loop, then chopping the
loop into chunks (tiles) may be helpful.
– Loop tiling replaces an original loop with a pair of loops: an outer loop over tiles (chunks) and an inner loop over the elements of a tile.
– This can be done for as many loops inside the loop
nest as needed.
– May have to experiment with the size of the tiles
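A sketch of tiling applied to a matrix transpose (the tile size and array dimensions are assumptions; as noted above, the tile size usually has to be tuned experimentally):

#define N    2048
#define TILE 64        /* chosen so one tile of each array fits in cache; assumes N is a multiple of TILE */

void transpose_tiled(double dst[N][N], const double src[N][N])
{
    /* Outer loops step from tile to tile, inner loops stay inside one tile */
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}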
Parallel Overheads
• Recall, a parallel program has additional overheads,
collectively called the parallel overhead. This includes the
time to create, start, and stop threads, the extra work to
figure out what each task is to perform, the time spent
waiting for barriers and at critical sections and locks, and
the time spent computing the same operations redundantly.
Overview of Overheads
• Memory accessing overheads: The manner in which the
memory is accessed by individual threads has a major influence on
performance (good use of thread-local cache)
• Replication overheads: Replicated work refers to computations
that occur once in the sequential program but are performed by
each thread in the parallel version.
• (OpenMP) parallelization overheads: The cost for time
spent handling OpenMP constructs.
– Each of the directives and routines in OpenMP comes with some
overhead.
– For example, when a parallel section is created, threads may
have to be created or woken up
– The work performed by each thread is determined at run-time.
Overview of Overheads
• Load imbalance overheads: If the threads perform differing
amounts of work in the work-shared region, the faster threads have
to wait at the barrier for the slower ones to reach that point.
• Synchronization overheads: Threads typically waste time
waiting for access to critical regions or a variable involved in an
atomic update, or to acquire a lock, or at barriers.
Synchronization
• It is expensive because:
– It is inherently time-consuming
– if the load is not well balanced it can lead to a lot
of thread idle time.
• Minimize synchronization when possible
– Use nowait when possible
– Maximize parallel region
– Load balance (use dynamic scheduling)
– Avoid large critical regions
Cost of synch
Version 1 (implicit barrier after every work-sharing loop):

#pragma omp parallel private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}

Version 2 (nowait removes the implicit barriers; one explicit barrier remains before the loop that reads a[] and c[]):

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
Maximize parallel region
Version 1 (separate parallel regions with sequential code in between):

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}

opt = opt + N;    /* sequential */

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    ...
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

Version 2 (one large parallel region):

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }

    #pragma omp single nowait
    opt = opt + N;    /* sequential part, executed by one thread */

    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    ...
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}
Cost of serialization
• Find the most important loops and try to
parallelize them.
– Look for dependences across iterations.
– Fine grain loops don’t benefit too much from
parallelization. Parallel overhead becomes an
issue.
– Increase the granularity as much as possible
(not too fine, not too coarse)
Example 1

for (i = 0; i < n; i++) {
    tmp  = a[i];
    a[i] = b[i];
    b[i] = tmp;
}
• Dependence on tmp
Scalar Privatization

#pragma omp parallel for private(tmp)
for (i = 0; i < n; i++) {
    tmp  = a[i];
    a[i] = b[i];
    b[i] = tmp;
}
• Dependence on tmp is removed (shared
memory case!!)
Example 2

for (i = 0, sum = 0; i < n; i++)
    sum += a[i];

• Dependence on sum
Reduction

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a[i];

• Dependence on sum is removed by using a special construct in the language
Example 3

for (i = 0, index = 0; i < n; i++) {
    index += i;
    a[i] = b[index];
}

• Dependence on index
• Induction variable: can be computed
from loop variable
Induction Variable Elimination

#pragma omp parallel for
for (i = 0; i < n; i++) {
    a[i] = b[i*(i+1)/2];
}

• Dependence removed by computing the induction variable directly from the loop variable
Example 4

for (i = 0, index = 0; i < n; i++) {
    index += f(i);
    b[i] = g(a[index]);
}

• Dependence on induction variable index, but no closed formula for its value
Loop Splitting

/* Sequential loop: precompute the index values (this loop carries the dependence) */
index[0] = f(0);
for (i = 1; i < n; i++) {
    index[i] = index[i-1] + f(i);
}
/* Parallel loop: iterations are now independent */
#pragma omp parallel for
for (i = 0; i < n; i++) {
    b[i] = g(a[index[i]]);
}
• Loop splitting has removed the dependence from the second loop (the first loop remains sequential)
Example 5
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] += b[i][k] + c[k][j];

• Dependence on a[i][j] prevents k-loop parallelization
• No dependences carried by the i- and j-loops
Parallelization
for (k = 0; k < n; k++) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] += b[i][k] + c[k][j];
}

• We can do better by reordering the loops
Loop Reordering
#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            a[i][j] += b[i][k] + c[k][j];

• Larger parallel pieces of work


Example 6
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i];
#pragma omp parallel for
for (i = 0; i < n; i++)
    c[i] = b[i] * b[i];    /* square of b[i] (in C, ^ is bitwise XOR, not a power operator) */

• Merge the two parallel loops into one


Loop Fusion
#pragma omp parallel for
for (i = 0; i < n; i++) {
    a[i] = b[i];
    c[i] = b[i] * b[i];
}

• Reduces loop startup overhead and improves memory usage
Example 7

for (i = 0, wrap = n; i < n; i++) {
    b[i] = a[i] + a[wrap];
    wrap = i;
}

• Dependence on wrap
• Only the first iteration causes a dependence
Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for (i = 1; i < n; i++) {
    b[i] = a[i] + a[i-1];
}
Privatization
• Privatize variables where possible to gain
performance!
• Private variables are stored as close to the executing thread as possible; they do not all have to live on the thread's stack (some may be kept in registers).
• The larger the portion of a procedure you can make parallel, the more performance you will get.
• Make sure that variables used across procedure boundaries (interprocedural variables) stay shared!
Load Balance: Finding the best schedule
#pragma omp for schedule(dynamic, 10) nowait
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            a[i][j][k] += b[i][k] + c[k][j];
Schedulers and Chunk Sizes
• Sometimes applications don't scale because of load imbalances.
• Try different schedulers:
  – Static
  – Dynamic
  – Guided
  and different chunk sizes.
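One convenient way to experiment, as a sketch (the loop body and array are placeholders), is the runtime schedule: the scheduler and chunk size are then taken from the OMP_SCHEDULE environment variable, so they can be changed without recompiling.

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    result[i] = heavy_work(i);    /* heavy_work() and result[] are assumed to exist */

For example:

$ export OMP_SCHEDULE="dynamic,100"    # also try static or guided, with various chunk sizes
$ ./application_binary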


Effects of Load Imbalances

[Figure: speedup versus number of threads for several configurations (10, 100, 600, 1000), compared with the ideal speedup]
Pros/Cons Schedulers
• STATIC (default)
– Least overhead; good for well-balanced loops.
• DYNAMIC
– More overhead and synchronization
– Idle threads can work instead of waiting on
barriers
– Good for load imbalanced loops
CHUNK Sizes
• Small chunk sizes require more calls to the scheduler.
  – Good for balancing loops with large amounts of work per iteration.
• Larger chunk sizes require less overhead.
  – Better for loops with little work per iteration.
Tasking performance Issues
• Too Finely Grained Task Parallelism
– Creating tasks and managing tasks is overhead
– For small tasks the overhead might exceed the tasks
runtime
– Overhead depends on hardware, compiler/runtime, task
attributes, …
– Use the if clause to avoid creating separate tasks for very small amounts of work (see the sketch after this list).
• Too Coarsely Grained Task Parallelism
– A few large tasks can lead to load imbalance
– Untied tasks can achieve an easier load balance.
• Task creation Bottleneck
– When a lot of threads execute tasks and only one/few
create them, threads will idle waiting for tasks
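A hedged sketch of using the if clause to keep task granularity reasonable (CUTOFF, the chunked loop and the array are illustrative, not from the slides): chunks below the cutoff are executed immediately by the creating thread instead of becoming deferred tasks.

#define CUTOFF 1000          /* illustrative threshold; tune for your hardware */

void scale_in_chunks(double *a, int n, int chunk)
{
    #pragma omp parallel
    #pragma omp single
    for (int start = 0; start < n; start += chunk) {
        int len = (start + chunk < n) ? chunk : n - start;
        /* Only defer a task if the chunk is large enough to pay for
           the task-creation overhead */
        #pragma omp task firstprivate(start, len) shared(a) if (len > CUTOFF)
        for (int i = start; i < start + len; i++)
            a[i] *= 2.0;
    }
}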
Tied vs Untied
• Tasks are tied by default
– Tied tasks are always executed by the same thread
– Tied tasks have scheduling restrictions
– Deterministic scheduling points (creation, synchronization, ...)
– These constraints also help to avoid deadlock problems
– Tied tasks may run into performance problems

• The programmer can use the untied clause to lift these restrictions
  – Untied tasks can migrate between threads.
  – Note: mix very carefully with threadprivate, critical and thread-ids
  – Generally, tied tasks give better performance
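A minimal sketch of the untied clause (do_item() and the loop bound are assumptions): a single generator task produces many small tasks; making the generator untied allows it to be resumed by a different thread if its original thread gets busy, which can ease the task-creation bottleneck and load imbalance mentioned earlier.

void do_item(int i);             /* assumed work routine */

void generate_work(int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* untied: if the generator is suspended at a task scheduling point,
           any idle thread may pick it up and continue creating tasks */
        #pragma omp task untied
        for (int i = 0; i < n; i++) {
            #pragma omp task firstprivate(i)
            do_item(i);
        }
    }
}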
Example: Parallelizing a
Recursive Algorithm
int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);

        #pragma omp task shared(y)
        y = fib(n-2);

        #pragma omp taskwait
        return x + y;
    }
}

[Figure: task tree for fib(4) across threads 0 to 3: the initial call 0: fib(4) creates Task 1 (fib(3)) and Task 2 (fib(2)); fib(3) in turn creates Task 3 (fib(2)) and Task 4 (fib(1)), with the tasks executed by different threads]
Profiling Vs Tracing
• Similarities:
1- Help to understand the behavior of an application at different
levels of detail and abstraction.
2- Can be used to measure a variety of different kinds of events
• Profiling aggregates the results of event tracking for phases in the
code or for the entire execution of the program. TAU supports both.
• Tracing can keep track of every single instance of an event at run
time, and captures patterns over time.
KOJAK uses tracing to find performance patterns
• Obstacles:
1- Overheads become significant while working on a shared file
system
2- Tracing can generate very large trace files that degrade the
overall performance.
Shared Memory Machines
• Small number of processors: shared
memory with coherent caches (SMP)
• Larger number of processors:
distributed shared memory with
coherent caches (CC-NUMA)
Shared Memory: Logical View

[Figure: logical view: processors proc1 … procN all access a single shared memory space]


Physical Implementation

[Figure: SMP physical implementation: processors proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]


CC-NUMA: Physical Implementation

[Figure: CC-NUMA physical implementation: processors proc1 … procN, each with its own cache and local memory (mem1 … memN), connected through an interconnect]


NUMA
NUMA is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
Drawback: contention arises when multiple processors access the same memory location.
First Touch Policy: the first time a process (or thread) touches some data, that data is allocated in the memory of the node where this process is running.

numactl command (Linux)
• numactl --help: display all the available options
• numactl --hardware: display the NUMA nodes, the memory available per NUMA node, and the cores associated with each NUMA node
• numactl [options] ./application [arguments]
Locally VS Remotely
• Data need to be initialized in parallel, otherwise most
threads will access data remotely
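A sketch of first-touch-aware initialization (the array size and the static schedule are illustrative choices): each thread initializes the part of the array it will later compute on, so the first touch policy places those pages in that thread's local memory.

#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* Parallel initialization: each thread touches "its" pages first,
       so they are allocated on the NUMA node where that thread runs */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loops should use the same static distribution so that
       threads keep accessing mostly local memory */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}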
Binding on OpenMP applications
• Imagine we want to spawn 8 threads:
  export OMP_NUM_THREADS=8
• and we want to bind them to specific core ids (here 0, 2, 4, 8).
• For the Open64 compiler:
  export O64_OMP_AFFINITY_MAP=0,2,4,8
• For the GNU compiler:
  export GOMP_CPU_AFFINITY="0 2 4 8"
• For the PGI compiler:
  export MP_BLIST=0,2,4,8
• Then we run the application binary:
  ./application_binary
Profiling tools … PGPROF
% pgcc -mp -g -Mprof=lines|func test.c -o test
% pgcollect ./test [args]
% pgprof -exe ./test

Attributes:
1- No need to recompile or relink
2- Data allocation overhead is low
3- Simple to use
4- Supports multi-threaded programs
5- Supports shared objects and dynamic libraries
6- Good for showing load balance among threads
GPROF
% gcc -pg -g test.c -o test
% ./test
% gprof test (add -l for line profiling)

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
44.12      0.07      0.07                              zazLoop
35.29      0.14      0.06                              main
20.59      0.17      0.04                              bazMillion
ompP
% (kinst-ompp | kinst-ompp-papi) pgcc|icc|etc. test.c -o test
% ./test
% output appears in test.n-m.ompp.txt (n: number of threads, m: consecutive number)

ompP is a profiling tool for OpenMP applications written in C/C++ or Fortran.

• Attributes:
1- ompP’s profiling report becomes available immediately after
program termination in a human-readable format.
2- ompP supports the measurement of hardware performance
counters using PAPI
3- Supports several advanced productivity features such as
overhead analysis and detection of common inefficiency situations
(performance properties).
4- Supports the analysis of tied tasks through instrumentation calls implemented in OPARI following the POMP specifications.
ompP .. cont
• Four overhead categories are distinguished
1- Synchronization: overhead that arises due to threads having
to synchronize their activity, e.g. barrier call
2- Load imbalance: waiting time incurred due to an imbalanced amount of work in a work-sharing or parallel region
3- Limited parallelism: idle threads due to not enough parallelism being exposed by the program
4- Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks
Overheads with respect to each individual parallel region:

Total Ovhds (%) = Synch (%) + Imbal (%) + Limpar (%) + Mgmt (%)
R00002 6.38 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00)
R00001 0.05 0.00 ( 3.41) 0.00 ( 0.00) 0.00 ( 2.87) 0.00 ( 0.00) 0.00 ( 0.54)
TAU
• TAU is a performance evaluation tool
• It supports parallel profiling and tracing
• Profiling shows you how much (total) time was spent in each routine
• Tracing shows you when the events take place in each process along a
timeline
• TAU uses a package called PDT for automatic instrumentation of the source
code
• Profiling and tracing can measure time as well as hardware performance
counters from your CPU
• TAU can automatically instrument your source code (routines, loops, I/O,
memory, phases, etc.)
• TAU runs on all HPC platforms and it is free
• TAU has instrumentation, measurement and analysis tools
– paraprof is TAU’s 3D profile browser
• To use TAU’s automatic source instrumentation, you need to set a couple
of environment variables and substitute the name of your compiler with a
TAU shell script
Automatic Source-Level Instrumentation in
TAU using Program Database Toolkit (PDT)
Using TAU with source instrumentation

• TAU supports several measurement options (profiling, tracing, profiling with hardware counters, etc.)
• Each measurement configuration of TAU corresponds to a unique stub
makefile and library that is generated when you configure it
• To instrument source code using PDT
– Choose an appropriate TAU stub makefile in <arch>/lib:
% module load tau
% export TAU_MAKEFILE=$TAULIBDIR/Makefile.tau-papi-openmp-pdt-pgi
% export TAU_OPTIONS=‘-optVerbose …’ (see tau_compiler.sh -help)
And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
% pgcc test.c
changes to
% tau_cc.sh test.c
• Execute application and analyze performance data:
% pprof (for text based profile display)
% paraprof (for GUI)
