
Performance and Tuning of OpenMP Programs
Performance Measurement
• Why do we measure performance?
• Basics to measure performance
• How to get a good speedup?
• What affects performance?
Measures of Performance
• To computer scientists: speedup,
execution time
• To applications people: megaflops, size of problem, accuracy of solution, etc.
Tuning
• Performance tuning is the improvement of application
performance, usually for a specific computer system.
• Most applications will respond to increased system work load with
some degree of decreasing performance.
• Tuning follows these steps:
1. Assess the problem
2. Measure the performance of the application before modification.
3. Identify the parts of the application that are critical for performance, the so-called bottlenecks (use profiling tools).
4. Modify these parts of the application to remove or alleviate
bottlenecks.
5. Measure the performance of the application after modification.
6. Repeat steps 3 through 5 as needed
Some performance Lessons
1- Optimize when you are done with coding
a- It's almost impossible to identify performance bottlenecks before the program
is completely working.
b- When you remove one bottleneck, another one often becomes critical
c- Focusing on optimization during initial development detracts from achieving
other program characteristics

2- Reducing lines of code does not mean your code is faster


for (i = 1; i < 6; i++) a[i] = i;

a[1] = 1;
a[2] = 2;
a[3] = 3;
a[4] = 4;
a[5] = 5;
(Over 80% faster than the first.)
Lessons …. Cont
3- Performance depends on language,
machine and compiler.
4- Correctness is more important than speed
(though we don't want to wait forever)
5- Measurement is your proof
6- Measurements need to be precise (Use
profiling tools)
Optimization
Instrumentation
Instrumentation levels
• Manual: performed by the programmer (e.g., printf() statements)
• Automatic source level: instrumentation added to the source code by an automatic tool according to an instrumentation policy (Ex: TAU)
• Compiler assisted: Ex: "gcc -pg ..." for gprof
• Binary translation: the tool adds instrumentation to a compiled binary (Ex: ATOM)
• Runtime instrumentation: the code is instrumented directly before execution; may result in high overhead (Ex: Valgrind)
Measurement with Profiling
Timing the OpenMP Performance
• A standard practice is to use a standard operating
system command.
• The /bin/time command is available on standard UNIX
systems. For example
/bin/time ./a.out
– The “real”, “user”, and “system” times are then printed after
the program has finished execution.
– For example
$ /bin/time ./program.exe
real 5.4
user 3.2
sys 1.0
– These three numbers can be used to get initial information
about the performance.
– For deeper analysis, a more sophisticated performance tool is
needed.
Timing the OpenMP Performance (cont)
– Generally, the sum of the user and system time is referred to
as the CPU time.
– The number following “real” tells us that the program took 5.4
seconds from the beginning to the end. The real time is also
called the wall-clock time or elapsed time
– The user time of 3.2 seconds is the time spent outside any
operating system service.
– The sys time is the time spent on operating system services
such as input/output routines
– A common cause for a difference between the elapsed (real) time and the CPU time is sharing the processors with other jobs, i.e., a high load on the system.
– The omp_get_wtime() function provided by OpenMP is
useful for measuring the elapsed time of blocks of source
code.
– If your timings vary from run to run, the reason may be that you are sharing the processors with other users.
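As a minimal sketch of using omp_get_wtime() for this purpose (the array size N and the loop body are placeholders, not taken from the slides):

#include <stdio.h>
#include <omp.h>

#define N 1000000

int main(void)
{
    static double a[N], b[N];

    double t_start = omp_get_wtime();   /* wall-clock time before the region */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = a[i] + b[i];
    double t_end = omp_get_wtime();     /* wall-clock time after the region */

    printf("Elapsed time: %f seconds\n", t_end - t_start);
    return 0;
}

Unlike /bin/time, this measures only the enclosed block, so different regions of the program can be timed separately.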
Speedup of Algorithm
• Speedup of algorithm = sequential execution
time / execution time on p processors (with
the same data set).
[Figure: speedup plotted against the number of processors p]
Speedup on Problem
• Speedup on problem = sequential
execution time of best known
sequential algorithm / execution time
on p processors.
• A more honest measure of
performance.
• Avoids picking an easily parallelizable
algorithm with poor sequential execution
time.
What Speedups Can You
Get?
• Linear speedup
– Confusing term: implicitly means a 1-to-1 speedup
per processor / core.
– (almost always) as good as you can do.
• Sub-linear speedup: common, due to
overhead of startup, synchronization,
communication, etc.
• Super-linear speedup: Due to cache/memory
effects
[Figure: actual speedup versus the number of processors p, falling below the linear speedup line]
Efficiency
• Fix problem size.
• Let T(1) be time to execute program on
sequential machine. Many researchers take
time of best sequential algorithm
• T(p) is time to execute program on p
processors
Speedup(p) = T(1) / T(p)
Efficiency(p) = Speedup(p) / p
• Linear speedup if Efficiency(p) = 1
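• For example (illustrative numbers): if T(1) = 100 s and T(4) = 40 s, then Speedup(4) = 100/40 = 2.5 and Efficiency(4) = 2.5/4 = 0.625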
Scalability
• No really precise definition
• Roughly speaking, a program is said to
scale to a certain number of processors
p, if going from p-1 to p processors
results in some acceptable
improvement in speedup (for instance,
an increase of 0.5)
Where is there a need for
parallelization?
• Profiling the application
– Understanding which regions of the code consume most of the time.
– Focus your effort on the important regions that would benefit
from parallelization.
– Amdahl's law (overall speedup):
  Overall speedup = 1 / ((1 - P) + P/s)
  where s is the speedup of the part of the task that benefits from improved system resources,
  and P is the proportion of execution time that this part originally occupied.

– Work vs. Parallel overhead.


Amdahl’s Law Example
• If 40% of the execution time is subject to a speedup, and the improvement makes the affected part twice as fast, find the overall speedup.

P = 0.4, s = 2
Speedup = 1 / ((1 - 0.4) + 0.4/2) = 1 / 0.8 = 1.25
Why keep something
sequential?
• Some parts of the program are not
parallelizable (because of
dependences)
• Some parts may be parallelizable, but
the overhead dwarfs the increased
speedup
Returning to Sequential vs.
Parallel
• Sequential execution time: t seconds
• Overheads of parallel execution: t_st
seconds (depends on architecture)
• (Ideal) parallel execution time: t/p + t_st, where p is the number of processors
• If t/p + t_st >= t, no gain
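• For example (illustrative numbers): with t = 10 s, p = 4 and t_st = 1 s, the ideal parallel time is 10/4 + 1 = 3.5 s, a clear gain; with t = 1 s it is 1/4 + 1 = 1.25 s, which is worse than running sequentially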
Techniques to improve performance
1- Code optimization: code transformations that can help to improve performance (Ex: fewer computations inside loops).
2- Caching strategy: helps to remove performance bottlenecks resulting from slow access to data.
3- Load balancing: your application should exploit the available resources in the best way; threads should have equal loads in order to achieve optimal performance.
4- Bottlenecks: can also be defined as hot spots, where the operation count and the execution time are highest.
Loop Optimization
• By making minor changes in loops, the programmer or
compiler can improve the use of memory and remove
dependencies.
• For example, an i-loop that has an embedded j-loop may calculate the same results more efficiently if the positions of the two loops are switched.
– This may result in accessing data in C by rows instead of by columns.
– Since much of the computational time is spent in loops, and since most array accesses occur there, a suitable reorganization to exploit cache can significantly improve a program's performance.
• Rule for Interchangeability: If any memory location is
referenced more than once in the loop nest and if at
least one of those references modify its value, then their
relative order must not be changed by the transformation
Loop Optimization (cont)
• Check a program’s code to see if a reordering of
statements is allowed and whether it is desirable:
– This is often done better by programmers than by
compilers.
– Should be considered if array accesses in the loop nest do
not occur in the order they are stored in memory.
– Also consider loop transforming if loop has a large body
– A simple reordering within the loop may make a difference.
– Reordering the loop may also allow better exploitation of parallelism or better utilization of the instruction pipeline.
– They can also be used to increase the size of the parallel
area.
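As a sketch of the interchange idea (the array dimensions and the computation are placeholders): in C, making the innermost loop run over the last array index matches the row-major storage order and improves cache-line reuse.

#define N 1024
static double a[N][N], b[N][N];

void scale_before(void)
{
    /* Inner i-loop jumps N elements between accesses: poor locality */
    for (int j = 0; j < N; j++)
        for (int i = 0; i < N; i++)
            a[i][j] = 2.0 * b[i][j];
}

void scale_after(void)
{
    /* After interchange, the inner j-loop touches consecutive elements */
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++)
            a[i][j] = 2.0 * b[i][j];
}

The interchange is legal here because each a[i][j] is written exactly once, so the interchangeability rule above is satisfied.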
Loop Unrolling
• A loop unrolling transformation packs all of the work
of several loop iterations into a single pass through
the loop.
– A powerful technique to reduce the overheads of loops
– May not reduce to one single pass, but reduce the number
of passes through the loop (say by a factor of two by
performing two loop calculations on each pass through
“reduced loop”).
• In example, two is called the “unroll factor”.
– Reduces the number of increments of the loop variable, tests for completion, and branches to the start of the loop code.
– Helps improve cache line utilization by improving data
reuse.
– Can also increase the instruction level parallelism (ILP)
Loop Unrolling (cont)
• Compilers are good at doing loop unrolling.
• One problem is that if the unroll factor does not divide
the iteration count, the remaining iterations have to be
performed outside of the loop nest
– Implemented through a second “cleanup” loop.
• If the loop already contains a lot of computation, loop unrolling may make the use of the cache less efficient.
• If loop contains a procedure call, unrolling the loop
results in new overheads that may outweigh the benefits.
• If loop contains branches, the benefits may also be low.
• Loop jamming is similar to loop unrolling and consists of
“jamming the body of two inner loops” into a single inner
loop that performs the work of both.
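A sketch of manual unrolling with an unroll factor of two, including the cleanup loop mentioned above (the arrays and the loop body are illustrative):

void add_unrolled(const double *a, const double *b, double *c, int n)
{
    int i;

    /* Main loop: two original iterations per pass (unroll factor 2),
       halving the number of increments, tests and branches */
    for (i = 0; i < n - 1; i += 2) {
        c[i]     = a[i]     + b[i];
        c[i + 1] = a[i + 1] + b[i + 1];
    }

    /* Cleanup loop: handles the leftover iteration when n is odd */
    for (; i < n; i++)
        c[i] = a[i] + b[i];
}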
Loop Fusion
• Loop Fusion merges two or more loops to create
a bigger loop.
– May enable data in cache to be reused more
frequently
– Might increase the amount of computation per
iteration in order to improve the instruction level
parallelism.
– Would probably also reduce loop overhead because
more work is done per loop.
Loop Fission
• Loop fission is a transformation that breaks up a
loop into several loops.
– May be useful in improving the use of cache or to
isolate a part that inhibits the full optimization of the
loop.
– Most useful when loop nest is large and its data does
not fit well into cache or if we can optimize different
parts of the loop in different ways.
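A small sketch of loop fission (arrays and statements are illustrative, not from the slides): one loop whose body updates two unrelated arrays is split into two loops, each with a smaller working set that fits in cache more easily and can be optimized independently.

void fission_before(double *a, const double *b, double *c, const double *d, int n)
{
    /* One loop touches a, b, c and d in every iteration */
    for (int i = 0; i < n; i++) {
        a[i] = a[i] + b[i];
        c[i] = c[i] * d[i];
    }
}

void fission_after(double *a, const double *b, double *c, const double *d, int n)
{
    /* Two independent loops, each streaming through only two arrays */
    for (int i = 0; i < n; i++)
        a[i] = a[i] + b[i];
    for (int i = 0; i < n; i++)
        c[i] = c[i] * d[i];
}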
Loop Tiling or Blocking
• Transformation designed to tailor the number of
memory references inside a loop so they can fit
within cache.
– If data sizes are large and memory access is bad or if
there is little data reuse in the loop, then chopping the
loop into chunks (tiles) may be helpful.
– Loop tiling replaces an original loop with a pair of loops: an outer loop over tiles (chunks) and an inner loop over the elements of a tile.
– This can be done for as many loops inside the loop
nest as needed.
– May have to experiment with the size of the tiles
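A sketch of tiling applied to a matrix transpose (the tile size and array dimensions are assumptions; as noted above, the tile size usually has to be tuned experimentally):

#define N    2048
#define TILE 64        /* chosen so one tile of each array fits in cache; assumes N is a multiple of TILE */

void transpose_tiled(double dst[N][N], const double src[N][N])
{
    /* Outer loops step from tile to tile, inner loops stay inside one tile */
    for (int ii = 0; ii < N; ii += TILE)
        for (int jj = 0; jj < N; jj += TILE)
            for (int i = ii; i < ii + TILE; i++)
                for (int j = jj; j < jj + TILE; j++)
                    dst[j][i] = src[i][j];
}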
Parallel Overheads
• Recall, a parallel program has additional overheads,
collectively called the parallel overhead. This includes the
time to create, start, and stop threads, the extra work to
figure out what each task is to perform, the time spent
waiting for barriers and at critical sections and locks, and
the time spent computing the same operations redundantly.
Overview of Overheads
• Memory accessing overheads: The manner in which the
memory is accessed by individual threads has a major influence on
performance (good use of thread-local cache)
• Replication overheads: Replicated work refers to computations
that occur once in the sequential program but are performed by
each thread in the parallel version.
• (OpenMP) parallelization overheads: The cost for time
spent handling OpenMP constructs.
– Each of the directives and routines in OpenMP comes with some
overhead.
– For example, when a parallel section is created, threads may
have to be created or woken up
– The work performed by each thread is determined at run-time.
Overview of Overheads
• Load imbalance overheads: If the threads perform differing
amounts of work in the work-shared region, the faster threads have
to wait at the barrier for the slower ones to reach that point.
• Synchronization overheads: Threads typically waste time
waiting for access to critical regions or a variable involved in an
atomic update, or to acquire a lock, or at barriers.
Synchronization
• It is expensive because:
– It is inherently time-consuming
– if the load is not well balanced it can lead to a lot
of thread idle time.
• Minimize synchronization when possible
– Use nowait when possible
– Maximize parallel region
– Load balance (use dynamic scheduling)
– Avoid large critical regions
Cost of synch
Version 1 (implicit barrier after every work-sharing loop):

#pragma omp parallel private(i)
{
    #pragma omp for
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp for reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}

Version 2 (nowait removes the implicit barriers; one explicit barrier remains before the loop that reads a[] and c[]):

#pragma omp parallel private(i)
{
    #pragma omp for nowait
    for (i = 0; i < n; i++)
        a[i] += b[i];

    #pragma omp for nowait
    for (i = 0; i < n; i++)
        c[i] += d[i];

    #pragma omp barrier
    #pragma omp for nowait reduction(+:sum)
    for (i = 0; i < n; i++)
        sum += a[i] + c[i];
}
Maximize parallel region
Version 1 (separate parallel regions with sequential code in between):

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }
}

opt = opt + N;    /* sequential */

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    ...
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}

Version 2 (one large parallel region):

#pragma omp parallel
{
    #pragma omp for
    for (…) { /* Work-sharing loop 1 */ }

    #pragma omp single nowait
    opt = opt + N;    /* sequential part, executed by one thread */

    #pragma omp for
    for (…) { /* Work-sharing loop 2 */ }
    ...
    #pragma omp for
    for (…) { /* Work-sharing loop N */ }
}
Cost of serialization
• Find the most important loops and try to
parallelize them.
– Look for dependences across iterations.
– Fine grain loops don’t benefit too much from
parallelization. Parallel overhead becomes an
issue.
– Increase the granularity as much as possible
(not too fine, not too coarse)
Example 1

for (i = 0; i < n; i++) {
    tmp  = a[i];
    a[i] = b[i];
    b[i] = tmp;
}
• Dependence on tmp
Scalar Privatization

#pragma omp parallel for private(tmp)
for (i = 0; i < n; i++) {
    tmp  = a[i];
    a[i] = b[i];
    b[i] = tmp;
}
• Dependence on tmp is removed (shared
memory case!!)
Example 2

for (i = 0, sum = 0; i < n; i++)
    sum += a[i];

• Dependence on sum
Reduction

sum = 0;
#pragma omp parallel for reduction(+:sum)
for (i = 0; i < n; i++)
    sum += a[i];

• Dependence on sum is removed by using a special construct in the language
Example 3

for (i = 0, index = 0; i < n; i++) {
    index += i;
    a[i] = b[index];
}

• Dependence on index
• Induction variable: can be computed
from loop variable
Induction Variable Elimination

#pragma omp parallel for
for (i = 0; i < n; i++) {
    a[i] = b[i*(i+1)/2];
}

• Dependence removed by computing the induction variable directly from the loop variable
Example 4

for (i = 0, index = 0; i < n; i++) {
    index += f(i);
    b[i] = g(a[index]);
}

• Dependence on induction variable index, but no closed formula for its value
Loop Splitting

/* Sequential loop: precompute the index values (this loop carries the dependence) */
index[0] = f(0);
for (i = 1; i < n; i++) {
    index[i] = index[i-1] + f(i);
}
/* Parallel loop: iterations are now independent */
#pragma omp parallel for
for (i = 0; i < n; i++) {
    b[i] = g(a[index[i]]);
}
• Loop splitting has removed the dependence from the second loop (the first loop remains sequential)
Example 5
for (k = 0; k < n; k++)
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] += b[i][k] + c[k][j];

• Dependence on a[i][j] prevents k-loop parallelization
• No dependences carried by the i- and j-loops
Parallelization
for (k = 0; k < n; k++) {
    #pragma omp parallel for
    for (i = 0; i < n; i++)
        for (j = 0; j < n; j++)
            a[i][j] += b[i][k] + c[k][j];
}

• We can do better by reordering the loops
Loop Reordering
#pragma omp parallel for
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            a[i][j] += b[i][k] + c[k][j];

• Larger parallel pieces of work


Example 6
#pragma omp parallel for
for (i = 0; i < n; i++)
    a[i] = b[i];
#pragma omp parallel for
for (i = 0; i < n; i++)
    c[i] = b[i] * b[i];    /* square of b[i] (in C, ^ is bitwise XOR, not a power operator) */

• Merge the two parallel loops into one


Loop Fusion
#pragma omp parallel for
for (i = 0; i < n; i++) {
    a[i] = b[i];
    c[i] = b[i] * b[i];
}

• Reduces loop startup overhead and improves memory usage
Example 7

for (i = 0, wrap = n; i < n; i++) {
    b[i] = a[i] + a[wrap];
    wrap = i;
}

• Dependence on wrap
• Only the first iteration causes a dependence
Loop Peeling

b[0] = a[0] + a[n];
#pragma omp parallel for
for (i = 1; i < n; i++) {
    b[i] = a[i] + a[i-1];
}
Privatization
• Privatize variables where possible to gain
performance!
• Private variables are stored as close to the executing thread as possible; they do not all have to live on the thread's stack (some may be kept in registers).
• The larger the portion of a procedure you can make parallel, the more performance you will get.
• Make sure that variables used across procedure boundaries (interprocedural variables) stay shared!
Load Balance: Finding the best schedule
#pragma omp for schedule(dynamic, 10) nowait
for (i = 0; i < n; i++)
    for (j = 0; j < n; j++)
        for (k = 0; k < n; k++)
            a[i][j][k] += b[i][k] + c[k][j];
Schedulers and Chunk Sizes
• Sometimes applications don't scale because of load imbalances.
• Try different schedulers:
  – Static
  – Dynamic
  – Guided
  and different chunk sizes.
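One convenient way to experiment, as a sketch (the loop body and array are placeholders), is the runtime schedule: the scheduler and chunk size are then taken from the OMP_SCHEDULE environment variable, so they can be changed without recompiling.

#pragma omp parallel for schedule(runtime)
for (int i = 0; i < n; i++)
    result[i] = heavy_work(i);    /* heavy_work() and result[] are assumed to exist */

For example:

$ export OMP_SCHEDULE="dynamic,100"    # also try static or guided, with various chunk sizes
$ ./application_binary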


Effects of Load Imbalances

[Figure: speedup versus number of threads for several configurations (10, 100, 600, 1000), compared with the ideal speedup]
Pros/Cons Schedulers
• STATIC (default)
– Least overhead; good for well-balanced loops.
• DYNAMIC
– More overhead and synchronization
– Idle threads can work instead of waiting on
barriers
– Good for load imbalanced loops
CHUNK Sizes
• Small chunk sizes require more calls to the scheduler.
  – Good for balancing loops with large amounts of work per iteration.
• Larger chunk sizes require less overhead.
  – Better for loops with little work per iteration.
Tasking performance Issues
• Too Finely Grained Task Parallelism
– Creating tasks and managing tasks is overhead
– For small tasks the overhead might exceed the tasks
runtime
– Overhead depends on hardware, compiler/runtime, task
attributes, …
– Use the if clause to avoid creating separate tasks for very small amounts of work (see the sketch after this list).
• Too Coarsely Grained Task Parallelism
– A few large tasks can lead to load imbalance
– Untied tasks can achieve an easier load balance.
• Task creation Bottleneck
– When a lot of threads execute tasks and only one/few
create them, threads will idle waiting for tasks
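A hedged sketch of using the if clause to keep task granularity reasonable (CUTOFF, the chunked loop and the array are illustrative, not from the slides): chunks below the cutoff are executed immediately by the creating thread instead of becoming deferred tasks.

#define CUTOFF 1000          /* illustrative threshold; tune for your hardware */

void scale_in_chunks(double *a, int n, int chunk)
{
    #pragma omp parallel
    #pragma omp single
    for (int start = 0; start < n; start += chunk) {
        int len = (start + chunk < n) ? chunk : n - start;
        /* Only defer a task if the chunk is large enough to pay for
           the task-creation overhead */
        #pragma omp task firstprivate(start, len) shared(a) if (len > CUTOFF)
        for (int i = start; i < start + len; i++)
            a[i] *= 2.0;
    }
}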
Tied vs Untied
• Tasks are tied by default
– Tied tasks are always executed by the same thread
– Tied tasks have scheduling restrictions
– Deterministic scheduling points (creation, synchronization, ...)
– These constraints also help to avoid deadlock problems
– Tied tasks may run into performance problems

• The programmer can use the untied clause to lift these restrictions
  – Untied tasks can migrate between threads.
  – Note: mix very carefully with threadprivate, critical and thread-ids
  – Generally, tied tasks give better performance
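A minimal sketch of the untied clause (do_item() and the loop bound are assumptions): a single generator task produces many small tasks; making the generator untied allows it to be resumed by a different thread if its original thread gets busy, which can ease the task-creation bottleneck and load imbalance mentioned earlier.

void do_item(int i);             /* assumed work routine */

void generate_work(int n)
{
    #pragma omp parallel
    #pragma omp single
    {
        /* untied: if the generator is suspended at a task scheduling point,
           any idle thread may pick it up and continue creating tasks */
        #pragma omp task untied
        for (int i = 0; i < n; i++) {
            #pragma omp task firstprivate(i)
            do_item(i);
        }
    }
}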
Example: Parallelizing a
Recursive Algorithm
int fib(int n)
{
    int x, y;
    if (n < 2)
        return n;
    else {
        #pragma omp task shared(x)
        x = fib(n-1);

        #pragma omp task shared(y)
        y = fib(n-2);

        #pragma omp taskwait
        return x + y;
    }
}

[Figure: task tree for fib(4) across threads 0 to 3: the initial call 0: fib(4) creates Task 1 (fib(3)) and Task 2 (fib(2)); fib(3) in turn creates Task 3 (fib(2)) and Task 4 (fib(1)), with the tasks executed by different threads]
Profiling Vs Tracing
• Similarities:
1- Help to understand the behavior of an application at different
levels of detail and abstraction.
2- Can be used to measure a variety of different kinds of events
• Profiling aggregates the results of event tracking for phases in the
code or for the entire execution of the program. TAU supports both.
• Tracing can keep track of every single instance of an event at run
time, and captures patterns over time.
KOJAK uses tracing to find performance patterns
• Obstacles:
1- Overheads become significant while working on a shared file
system
2- Tracing can generate very large trace files that degrade the
overall performance.
Shared Memory Machines
• Small number of processors: shared
memory with coherent caches (SMP)
• Larger number of processors:
distributed shared memory with
coherent caches (CC-NUMA)
Shared Memory: Logical View

[Figure: logical view: processors proc1 … procN all access a single shared memory space]


Physical Implementation

[Figure: SMP physical implementation: processors proc1 … procN, each with its own cache (cache1 … cacheN), connected by a bus to a single shared memory]


CC-NUMA: Physical Implementation

[Figure: CC-NUMA physical implementation: processors proc1 … procN, each with its own cache and local memory (mem1 … memN), connected through an interconnect]


NUMA
NUMA is a computer memory design used in multiprocessing, where the memory access time depends on the memory location relative to a processor. Under NUMA, a processor can access its own local memory faster than non-local memory, that is, memory local to another processor or memory shared between processors.
Drawback: contention arises when multiple processors access the same memory location.
First Touch Policy: the first time a process (or thread) touches some data, that data is allocated in the memory of the node where this process is running.

numactl command (Linux)
• numactl --help: display all the available options
• numactl --hardware: display the NUMA nodes, the memory available per NUMA node, and the cores associated with each NUMA node
• numactl [options] ./application [arguments]
Locally VS Remotely
• Data need to be initialized in parallel, otherwise most
threads will access data remotely
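A sketch of first-touch-aware initialization (the array size and the static schedule are illustrative choices): each thread initializes the part of the array it will later compute on, so the first touch policy places those pages in that thread's local memory.

#include <stdlib.h>

#define N 10000000

int main(void)
{
    double *a = malloc(N * sizeof(double));

    /* Parallel initialization: each thread touches "its" pages first,
       so they are allocated on the NUMA node where that thread runs */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = 0.0;

    /* Compute loops should use the same static distribution so that
       threads keep accessing mostly local memory */
    #pragma omp parallel for schedule(static)
    for (int i = 0; i < N; i++)
        a[i] = a[i] + 1.0;

    free(a);
    return 0;
}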
Binding on OpenMP applications
• Imagine we want to spawn 8 threads:
  export OMP_NUM_THREADS=8
• and we want to bind them to specific core ids (here 0, 2, 4, 8).
• For the Open64 compiler:
  export O64_OMP_AFFINITY_MAP=0,2,4,8
• For the GNU compiler:
  export GOMP_CPU_AFFINITY="0 2 4 8"
• For the PGI compiler:
  export MP_BLIST=0,2,4,8
• Then we run the application binary:
  ./application_binary
Profiling tools … PGPROF
% pgcc -mp -g -Mprof=lines|func test.c -o test
% pgcollect ./test [args]
% pgprof -exe ./test

Attributes:
1- No need to recompile or relink
2- Data allocation overhead is low
3- Simple to use
4- Supports multi-threaded programs
5- Supports shared objects and dynamic libraries
6- Good for showing load balance among threads
GPROF
% gcc -pg -g test.c -o test
% ./test
% gprof test (add -l for line profiling)

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total
 time   seconds   seconds    calls  Ts/call  Ts/call  name
44.12      0.07      0.07                              zazLoop
35.29      0.14      0.06                              main
20.59      0.17      0.04                              bazMillion
ompP
% (kinst-ompp | kinst-ompp-papi) pgcc|icc|etc. test.c -o test
% ./test
% output appears in test.n-m.ompp.txt (n: number of threads, m: consecutive number)

ompP is a profiling tool for OpenMP applications written in C/C++ or Fortran.

• Attributes:
1- ompP’s profiling report becomes available immediately after
program termination in a human-readable format.
2- ompP supports the measurement of hardware performance
counters using PAPI
3- Supports several advanced productivity features such as
overhead analysis and detection of common inefficiency situations
(performance properties).
4- Supports the analysis of tied tasks through instrumentation calls implemented in OPARI following the POMP specifications.
ompP .. cont
• Four overhead categories are distinguished
1- Synchronization: overhead that arises due to threads having
to synchronize their activity, e.g. barrier call
2- Load imbalance: waiting time incurred due to an imbalanced amount of work in a work-sharing or parallel region
3- Limited parallelism: idle threads due to not enough parallelism being exposed by the program
4- Thread management: overhead for the creation and destruction of threads, and for signaling critical sections and locks
Overheads with respect to each individual parallel region:

Total Ovhds (%) = Synch (%) + Imbal (%) + Limpar (%) + Mgmt (%)
R00002 6.38 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00) 0.00 ( 0.00)
R00001 0.05 0.00 ( 3.41) 0.00 ( 0.00) 0.00 ( 2.87) 0.00 ( 0.00) 0.00 ( 0.54)
TAU
• TAU is a performance evaluation tool
• It supports parallel profiling and tracing
• Profiling shows you how much (total) time was spent in each routine
• Tracing shows you when the events take place in each process along a
timeline
• TAU uses a package called PDT for automatic instrumentation of the source
code
• Profiling and tracing can measure time as well as hardware performance
counters from your CPU
• TAU can automatically instrument your source code (routines, loops, I/O,
memory, phases, etc.)
• TAU runs on all HPC platforms and it is free
• TAU has instrumentation, measurement and analysis tools
– paraprof is TAU’s 3D profile browser
• To use TAU’s automatic source instrumentation, you need to set a couple
of environment variables and substitute the name of your compiler with a
TAU shell script
Automatic Source-Level Instrumentation in
TAU using Program Database Toolkit (PDT)
Using TAU with source instrumentation

• TAU supports several measurement options (profiling, tracing, profiling with hardware counters, etc.)
• Each measurement configuration of TAU corresponds to a unique stub
makefile and library that is generated when you configure it
• To instrument source code using PDT
– Choose an appropriate TAU stub makefile in <arch>/lib:
% module load tau
% export TAU_MAKEFILE=$TAULIBDIR/Makefile.tau-papi-openmp-pdt-pgi
% export TAU_OPTIONS=‘-optVerbose …’ (see tau_compiler.sh -help)
And use tau_f90.sh, tau_cxx.sh or tau_cc.sh as Fortran, C++ or C compilers:
% pgcc test.c
changes to
% tau_cc.sh test.c
• Execute application and analyze performance data:
% pprof (for text based profile display)
% paraprof (for GUI)
