Performance and Tuning of OpenMP Programs
Performance Measurement
• Why do we measure performance?
• Basics of measuring performance
• How to get a good speedup?
• What affects performance?
Measures of Performance
• To computer scientists: speedup,
execution time
• To applications people: megaflops, size
of problem, accuracy of solution, etc
Tuning
• Performance tuning is the improvement of application
performance, usually for a specific computer system.
• Most applications will respond to an increased system workload with
some degree of decreasing performance.
• Tuning follows these steps:
1. Assess the problem
2. Measure the performance of the application before modification.
3. Identify the parts of the application that are critical to
performance: the bottlenecks (use profiling tools to find them)
4. Modify these parts of the application to remove or alleviate
bottlenecks.
5. Measure the performance of the application after modification.
6. Repeat steps 3 through 5 as needed
Some Performance Lessons
1- Optimize when you are done with coding
a- It's almost impossible to identify performance bottlenecks before the program
is completely working.
b- Removing one bottleneck often allows other ones to become critical
c- Focusing on optimization during initial development detracts from achieving
other program characteristics
a[1] = 1;  a[2] = 2;  a[3] = 3;  a[4] = 4;  a[5] = 5
(Over 80% faster than the first version)
Lessons … cont.
3- Performance depends on language,
machine and compiler.
4- Correctness is more important than speed
(We don’t need to wait forever, though)
5- Measurement is your proof
6- Measurements need to be precise (Use
profiling tools)
Optimization
Instrumentation
Instrumentation levels
• Manual: Performed by the programmer (printf() statements)
Speedup on Problem
• Speedup on problem = sequential
execution time of best known
sequential algorithm / execution time
on p processors.
• A more honest measure of
performance.
• Avoids picking an easily parallelizable
algorithm with poor sequential execution
time.
What Speedups Can You
Get?
• Linear speedup
– Confusing term: implicitly means a 1-to-1 speedup
per processor / core.
– (almost always) as good as you can do.
• Sub-linear speedup: common, due to
overhead of startup, synchronization,
communication, etc.
• Super-linear speedup: Due to cache/memory
effects
Speedup
[Figure: speedup vs. number of processors p; the actual speedup curve falls below the linear one]
Efficiency
• Fix problem size.
• Let T(1) be the time to execute the program on a
sequential machine (many researchers take the
time of the best known sequential algorithm)
• T(p) is time to execute program on p
processors
Speedup(p) = T(1) / T(p)
Efficiency(p) = Speedup(p) / p
• Linear speedup if Efficiency(p) = 1
Scalability
• No really precise definition
• Roughly speaking, a program is said to
scale to a certain number of processors
p, if going from p-1 to p processors
results in some acceptable
improvement in speedup (for instance,
an increase of 0.5)
Where is there a need for
parallelization?
• Profiling the application
– Understanding which regions of the code consume most of the
time.
– Focus your effort on the important regions that would benefit
from parallelization.
– Amdahl’s law. (Overall speedup)
Speedup = 1 / ((1 - P) + P/s)

where s is the speedup of the part of the task that benefits from improved
system resources, and P is the proportion of execution time that this part
originally occupied.

Example: P = 0.4, s = 2
Speedup = 1 / ((1 - 0.4) + 0.4/2) = 1 / 0.8 = 1.25
Why keep something
sequential?
• Some parts of the program are not
parallelizable (because of
dependences)
• Some parts may be parallelizable, but
the overhead dwarfs the increased
speedup
Returning to Sequential vs.
Parallel
• Sequential execution time: t seconds
• Overheads of parallel execution: t_st
seconds (depends on architecture)
• (Ideal) parallel execution time: t/p + t_st,
where p is the number of processors
• If t/p + t_st >= t, there is no gain
Techniques to improve performance
1- Code Optimization: code transformations that can help
improve performance (e.g., fewer computations inside loops)
• Dependence on sum
Reduction
• Dependence on index
• Induction variable: can be computed
from loop variable
Induction Variable Elimination
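A sketch of induction variable elimination on a hypothetical loop (assumes -fopenmp): the running variable k is replaced by an expression in the loop index, removing the cross-iteration dependence.

```c
/* Before: k carries a dependence from one iteration to the next. */
void fill_before(int *a, int n) {
    int k = 0;
    for (int i = 0; i < n; i++) {
        k = k + 2;
        a[i] = k;
    }
}

/* After: k is computed from the loop variable, so iterations
   are independent and the loop can be parallelized. */
void fill_after(int *a, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++)
        a[i] = 2 * (i + 1);
}
```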
• Dependence on wrap
• Only first iteration causes dependence
Loop Peeling
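A sketch of loop peeling on a hypothetical "wrap" loop (assumes -fopenmp): only the first iteration reads the wrapped last element, so it is peeled off and the remaining loop is dependence-free.

```c
/* Before: 'wrap' makes iteration 0 read a[n-1]; every later
   iteration just reads a[i-1]. */
void smooth_before(double *b, const double *a, int n) {
    int wrap = n - 1;
    for (int i = 0; i < n; i++) {
        b[i] = (a[i] + a[wrap]) / 2.0;
        wrap = i;
    }
}

/* After: the first iteration is peeled off;
   the remaining loop can run in parallel. */
void smooth_peeled(double *b, const double *a, int n) {
    b[0] = (a[0] + a[n - 1]) / 2.0;
    #pragma omp parallel for
    for (int i = 1; i < n; i++)
        b[i] = (a[i] + a[i - 1]) / 2.0;
}
```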
[Figure: measured speedup vs. number of threads for problem sizes 100, 600, and 1000, compared against the ideal speedup]
Pros/Cons Schedulers
• STATIC (default)
– Least overhead, good for well-balanced loops.
• DYNAMIC
– More overhead and synchronization
– Idle threads can work instead of waiting on
barriers
– Good for load imbalanced loops
CHUNK Sizes
• Small chunk sizes require more calls to
the scheduler.
– Good for balancing loops with large amounts of
work per iteration
• Larger chunk sizes incur less overhead.
– Good for loops with less work per iteration.
Tasking performance Issues
• Too Finely Grained Task Parallelism
– Creating tasks and managing tasks is overhead
– For small tasks the overhead might exceed the tasks
runtime
– Overhead depends on hardware, compiler/runtime, task
attributes, …
– Use the if clause to avoid creating very small tasks.
• Too Coarsely Grained Task Parallelism
– A few large tasks can lead to load imbalance
– Untied tasks can achieve an easier load balance.
• Task creation Bottleneck
– When a lot of threads execute tasks and only one/few
create them, threads will idle waiting for tasks
Tied vs Untied
• Tasks are tied by default
– Tied tasks are executed always by the same thread
– Tied tasks have scheduling restrictions
– Deterministic scheduling points (creation, synchronization, ... )
– Another constraint to avoid deadlock problems
– Tied tasks may run into performance problems
[Figure: shared-memory architecture, processors connected via a bus]
Attributes:
1- No need to recompile or relink
2- Data allocation overhead is low
3- Simple to use
4- Supports multi-threaded code
5- Supports shared objects and dynamic libraries
6- Good for showing load balance among threads
GPROF
%gcc -pg -g test.c -o test
%./test
%gprof test (add -l for line-level profiling)
• Flat profile:
ompP
• Attributes:
1- ompP’s profiling report becomes available immediately after
program termination in a human-readable format.
2- ompP supports the measurement of hardware performance
counters using PAPI
3- Supports several advanced productivity features such as
overhead analysis and detection of common inefficiency situations
(performance properties).
4- Support the analysis of tied tasks by implementing some
instrumentation calls implemented in OPARI following the POMP
specifications.
ompP .. cont
• Four overhead categories are distinguished
1- Synchronization: overhead that arises due to threads having
to synchronize their activity, e.g. barrier call
2- Load imbalance: waiting time incurred due to an
imbalanced amount of work in a work-sharing or parallel region
3- Limited parallelism: idle threads due to not enough parallelism
being exposed by the program
4- Thread management: overhead for the creation and
destruction of threads, and for signaling critical sections, locks
Overheads wrt. each individual parallel region:
Total Ovhds (%) = Synch (%) + Imbal (%) + Limpar (%) + Mgmt (%)
R00002 6.38 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00)
R00001 0.05 0.00 ( 3.41) 0.00 ( 0.00) 0.00 ( 2.87) 0.00 ( 0.00) 0.00 ( 0.54)
TAU
• TAU is a performance evaluation tool
• It supports parallel profiling and tracing
• Profiling shows you how much (total) time was spent in each routine
• Tracing shows you when the events take place in each process along a
timeline
• TAU uses a package called PDT for automatic instrumentation of the source
code
• Profiling and tracing can measure time as well as hardware performance
counters from your CPU
• TAU can automatically instrument your source code (routines, loops, I/O,
memory, phases, etc.)
• TAU runs on all HPC platforms and it is free
• TAU has instrumentation, measurement and analysis tools
– paraprof is TAU’s 3D profile browser
• To use TAU’s automatic source instrumentation, you need to set a couple
of environment variables and substitute the name of your compiler with a
TAU shell script
Automatic Source-Level Instrumentation in
TAU using Program Database Toolkit (PDT)
Using TAU with source instrumentation