Unit III

The document provides an overview of OpenMP, an API for writing portable, multithreaded applications, detailing its components, challenges in threading, and performance optimization techniques. It discusses issues like data races, deadlocks, and memory management, along with practical examples of loop threading and scheduling strategies. Additionally, it highlights the importance of proper synchronization and the use of OpenMP library functions to enhance parallel programming efficiency.


OpenMP PROGRAMMING

Syllabus Content
• OpenMP
• Threading a loop
• Thread overheads
• Performance issues
• Library functions
• Solutions to parallel programming problems
‒ Data races, deadlocks and live locks
‒ Non-blocking algorithms
‒ Memory and cache related issues
OpenMP (Open Multi-Processing)
• Formulated in 1997 as an API for writing portable, multithreaded applications.
• Began as a Fortran-based standard, but later grew to include C and C++.
• The current version is OpenMP Version 2.5, which supports FORTRAN, C, and C++.
• Intel C++ and Fortran compilers support the OpenMP Version 2.5 standard.
• Provides a platform-independent set of compiler pragmas, directives, function calls, and environment
variables that explicitly instruct the compiler how and where to use parallelism in the application.
• Many loops can be threaded by inserting only one pragma right before the loop, as demonstrated by
examples in this chapter.
• By leaving the nitty-gritty details to the compiler and OpenMP runtime library, you can spend more time
determining which loops should be threaded and how to best restructure the algorithms for performance on
multi-core processors.
• The full potential of OpenMP is realized when it is used to thread the most time-consuming loops, that is, the hot spots.
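For example, a minimal sketch of threading a hot-spot loop with a single pragma (the function, array names, and loop body are illustrative assumptions, not from the slides):

#include <omp.h>

void scale(double *a, const double *b, int n)
{
    int k;
    /* One pragma before the loop asks the compiler and OpenMP runtime
       to split the iterations across the available threads. */
    #pragma omp parallel for
    for ( k = 0; k < n; k++ )
        a[k] = 2.0 * b[k];
}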
OpenMP Components
• Compiler Directives and Clauses
• Runtime Libraries
• Environment Variables
Compiler Directives and Clauses
Runtime Libraries
Environment Variables
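A small sketch showing one example of each component (the thread count and printed output are illustrative assumptions):

#include <stdio.h>
#include <omp.h>                         /* header for the runtime library */

int main(void)
{
    /* Compiler directive with a clause: fork a team of four threads. */
    #pragma omp parallel num_threads(4)
    {
        int tid = omp_get_thread_num();  /* runtime library call */
        printf("Hello from thread %d\n", tid);
    }
    return 0;
}

/* Environment variable, set before running the program:
       export OMP_NUM_THREADS=4 */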
Challenges in Threading a Loop
• Loop-carried Dependence
• Data Race Condition
• Managing Shared and Private Data
• Loop Scheduling and Partitioning
1. Loop-carried Dependence
• True (flow) dependence ‒ RAW
• Anti-dependence ‒ WAR
• Output dependence ‒ WAW
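Minimal sketches of the three dependence types (the arrays, bounds, and constants are invented for illustration):

/* True (flow) dependence - RAW: iteration i reads a[i-1], written by iteration i-1. */
for ( i = 1; i < n; i++ )
    a[i] = a[i-1] + 1;

/* Anti-dependence - WAR: iteration i reads a[i+1] before iteration i+1 overwrites it. */
for ( i = 0; i < n-1; i++ )
    a[i] = a[i+1] + b[i];

/* Output dependence - WAW: every iteration writes the same location x. */
for ( i = 0; i < n; i++ )
    x = a[i] * 2.0;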
Example 1:
for( i=1; i<100; i++ ) {
a[i] = …;
…;
… = a[i-1];
}

Loop-carried dependence, not parallelizable


for (i=0; i<n; i++){
a[i] = b[i] + 1; //S1
c[i] = a[i] + 2; //S2
}

Unrolling the first three iterations shows that S2 only reads the a[i] written by S1 in the same iteration, so there is no loop-carried dependence and the loop can be parallelized:

i = 0:  S1: a[0] = b[0] + 1;   S2: c[0] = a[0] + 2;
i = 1:  S1: a[1] = b[1] + 1;   S2: c[1] = a[1] + 2;
i = 2:  S1: a[2] = b[2] + 1;   S2: c[2] = a[2] + 2;
Loop-carried Dependence
Example
x[0] = 0;
y[0] = 1;
#pragma omp parallel for private(k)
for ( k = 1; k < 100; k++ ) {
    x[k] = y[k-1] + 1;    // loop-carried dependence on y[k-1]
    y[k] = x[k-1] + 2;    // loop-carried dependence on x[k-1]
}

This loop has loop-carried dependences, so the pragma above produces wrong results. A workaround is to predetermine the initial values of x[49] and y[49] and split the iteration space in half.
Loop strip mining technique
x[0] = 0;
y[0] = 1;
x[49] = 74;   // derived from the recurrence x(k) = x(k-2) + 3
y[49] = 74;   // derived from the recurrence y(k) = y(k-2) + 3
#pragma omp parallel for private(m, k)
for ( m = 0; m < 2; m++ ) {
    for ( k = m*49 + 1; k < m*50 + 50; k++ ) {
        x[k] = y[k-1] + 1;   // S1
        y[k] = x[k-1] + 2;   // S2
    }
}
2. Data Race Condition
• Data-race conditions are due to
‒ output dependences, in which multiple threads attempt to update
the same memory location, or variable, after threading.
• OpenMP does not detect data-race conditions.
• The code must be modified via privatization or synchronized using mechanisms such as mutexes.
When multiple threads compete to update the same resource, it causes
inconsistent data (called a race condition).
Data Race Condition (cont.)
• Data-race conditions are difficult to spot.
• Using the full thread-synchronization tools in the Windows API or in Pthreads, developers are more likely to avoid these issues.
• Using OpenMP, it is easier to overlook data-race conditions.
• A helpful tool is the Intel Thread Checker, an add-on to the Intel VTune™ Performance Analyzer.
3. Managing Shared and Private Data
Understanding which data is shared and which is private becomes extremely important.
• OpenMP makes this distinction
‒ Shared - all threads access the exact same memory
location
‒ Private - a separate copy of the variable is made for
each thread to access in private
• By default, all variables in a parallel region are shared, with three exceptions:
‒ in parallel for loops, the loop index is private;
‒ variables that are local to the block of the parallel region are private;
‒ variables listed in the private, firstprivate, lastprivate, or reduction clauses are private.
• It is the developer's responsibility to indicate to the compiler which pieces of memory should be shared and which should be private.
Memory can be declared as private in the following three ways:
• Use the private, firstprivate, or lastprivate clause.
‒ A firstprivate(temp) clause copies the global value of temp into each thread's private (stack) copy on entry.
‒ A lastprivate(temp) clause copies the last loop (stack) value of temp back to the (global) temp storage when the parallel loop is complete.
• Use the threadprivate pragma to specify the global variables that need to be private for each thread.
• Declare the variable inside the loop, without the static keyword.
The following loop fails to function correctly because the variable x is shared; every thread reads and writes the same x:

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}
Corrected version, where x is specified as private so each thread works on its own copy:

#pragma omp parallel for private(x)
for ( k = 0; k < 100; k++ ) {
    x = array[k];
    array[k] = do_work(x);
}
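A small sketch of firstprivate and lastprivate together (the variable name, initial value, and loop body are invented for illustration):

int temp = 10;                        /* shared initial value */
#pragma omp parallel for firstprivate(temp) lastprivate(temp)
for ( k = 0; k < 100; k++ ) {
    temp = temp + k;                  /* each thread starts from temp == 10 */
}
/* After the loop, the shared temp holds the value computed in the
   sequentially last iteration (k == 99), as required by lastprivate. */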
4. Loop Scheduling and Partitioning
• Good load balancing
‒ achieve optimal performance
‒ have effective loop scheduling and partitioning
‒ to ensure that the execution cores are kept busy most of the time,
‒ with minimum overhead from scheduling, context switching, and synchronization.
• With a poorly balanced workload, some threads may finish significantly earlier than others and then sit idle.
static-even scheduling.
• A parallel for or work-sharing for loop uses static-even scheduling by default.
• Each thread is given an equal number of iterations:
‒ 100 iterations and 10 threads: 100/10 = 10 iterations per thread
‒ m iterations and N threads: m/N iterations per thread
• Static-even scheduling helps minimize the chances of memory conflicts.
static-even scheduling.
#pragma omp parallel for
for ( k = 0; k < 1000; k++ ) do_work(k);
• Assume 1000 iteration and 2 threads
• iterations 0 to 499 on one thread and 500 to 999 on the other thread
• Loop-scheduling and partitioning information is conveyed to the compiler with the schedule clause:
#pragma omp for schedule(kind [, chunk-size])
#pragma omp for schedule(dynamic, 16)
The Four Schedule Schemes in OpenMP
• static (the default when no chunk size is given) ‒ partitions the loop iterations into equal-sized chunks, or as nearly equal as possible.
• dynamic ‒ uses an internal work queue to give a chunk-sized block of loop iterations to each thread as it becomes available.
• guided ‒ similar to dynamic scheduling, but the chunk size starts off large and shrinks as the work is consumed.
• runtime ‒ uses the OMP_SCHEDULE environment variable at run time to specify which one of the three loop-scheduling types should be used.
Dynamic Scheduling
• For dynamic scheduling:
‒ chunks are handed out on a first-come, first-served basis;
‒ the default chunk size is 1;
‒ each chunk contains the number of iterations specified in the schedule clause, except possibly the last chunk;
‒ the last set of iterations may be smaller than the chunk size.
• For example, with schedule(dynamic, 16) and a total of 100 iterations, the chunk sizes are 16, 16, 16, 16, 16, 16, 4.
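A minimal sketch of that schedule clause in use (the iteration count matches the example above; the loop body is illustrative):

#pragma omp parallel for schedule(dynamic, 16)
for ( k = 0; k < 100; k++ )
    do_work(k);      /* threads grab 16-iteration chunks from the work queue */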
Guided scheduling
• For guided scheduling, the size of each chunk is computed as
πk = βk / (2N)
where
‒ N is the number of threads,
‒ πk denotes the size of the k-th chunk,
‒ βk denotes the number of remaining unscheduled loop iterations.
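As a worked illustration (assuming a 100-iteration loop, N = 2 threads, and rounding up): the first chunk is 100/(2·2) = 25 iterations, leaving 75; the next chunk is 75/4 ≈ 19, leaving 56; then 56/4 = 14, and so on, so the chunk size shrinks as the remaining work shrinks.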
Runtime Scheduling Scheme
• Not really a separate scheduling scheme; it gives the end user the flexibility to select the scheduling type at run time through the OMP_SCHEDULE environment variable, for example:
export OMP_SCHEDULE=dynamic,16
Effective Use of Reductions
Example: a serial accumulation loop,

for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);
}

can be threaded with the reduction clause; each thread accumulates into its own private copy of sum, and OpenMP combines the private copies into the shared sum when the loop completes:

sum = 0;
#pragma omp parallel for reduction(+:sum)
for ( k = 0; k < 100; k++ ) {
    sum = sum + func(k);
}
Reduction Operators and Reduction Variable's Initial Value in OpenMP

Operator                       Initialization Value
+  (addition)                  0
-  (subtraction)               0
*  (multiplication)            1
&  (bitwise and)               ~0
|  (bitwise or)                0
^  (bitwise exclusive or)      0
&& (conditional and)           1
Minimizing Threading Overhead
• OpenMP uses a simple fork-join execution model; a compiler and runtime library with lower threading overhead can improve your application's performance.
Measured Cost of OpenMP Constructs and Clauses
Constructs            Cost (in microseconds)    Scalability
parallel              1.5                       Linear
barrier               1.0                       Linear or O(log(n))
schedule(static)      1.0                       Linear
schedule(guided)      6.0                       Depends on contention
schedule(dynamic)     50                        Depends on contention
ordered               0.5                       Depends on contention
single                1.0                       Depends on contention
reduction             2.5                       Linear or O(log(n))
atomic                0.5                       Depends on data type and hardware
critical              0.5                       Depends on contention
lock/unlock           0.5                       Depends on contention
To reduce this overhead, related independent work can be grouped into a single parallel region, for example with work-sharing sections.
Work-sharing Sections
• Directs the OpenMP compiler and runtime to distribute the identified sections among the threads in the team created for the parallel region.
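A minimal sketch of work-sharing sections (the three function names are invented for illustration):

#pragma omp parallel sections
{
    #pragma omp section
    phase1();     /* runs on one thread of the team */
    #pragma omp section
    phase2();     /* may run concurrently on another thread */
    #pragma omp section
    phase3();
}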
Performance-oriented Programming
• OpenMP provides
‒ a set of important pragmas, and
‒ runtime functions
that enable thread synchronization and related actions to facilitate correct parallel programming.
1. Using Barrier and Nowait
2. Interleaving Single-thread and Multi-thread Execution
3. Data Copy-in and Copy-out
4. Protecting Updates of Shared Variables
Using Barrier and Nowait
• Barriers are a synchronization mechanism: threads wait at a barrier until all the threads in the parallel region have reached the same point.
• The nowait clause removes the implicit barrier at the end of a work-sharing construct when that barrier is not needed.
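Minimal sketches of an explicit barrier and of nowait (the function names, arrays, and bounds are invented for illustration; the second loop does not depend on the first, so skipping the barrier is safe):

#pragma omp parallel
{
    phase_one();             /* every thread runs phase one */
    #pragma omp barrier      /* wait until all threads finish phase one */
    phase_two();             /* no thread starts phase two early */
}

#pragma omp parallel
{
    #pragma omp for nowait   /* skip the implicit barrier after this loop */
    for ( k = 0; k < n; k++ )
        a[k] = work_a(k);

    #pragma omp for          /* threads may arrive here without waiting */
    for ( k = 0; k < n; k++ )
        b[k] = work_b(k);
}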
Interleaving Single-thread and Multi-thread Execution
• A program may consist of both serial and parallel code segments.
• OpenMP provides a way to specify that a segment of code inside a parallel region be executed by a single thread instead of by every thread.
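A sketch of single-thread execution inside a parallel region using the single and master constructs (the function names are invented for illustration):

#pragma omp parallel
{
    do_parallel_work();       /* executed by every thread */

    #pragma omp single        /* executed by exactly one thread; */
    print_progress();         /* the others wait at the implicit barrier */

    #pragma omp master        /* executed only by the master thread; */
    update_globals();         /* no implied barrier for the other threads */

    do_more_parallel_work();
}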
Data Copy-in and Copy-out
• how to copy in the initial value of a private variable to initialize its private copy
• copy out the value of the private variable computed in the last iteration
• OpenMP standard provides four clauses
• firstprivate,
• lastprivate,
• copyin,
• copyprivate
• firstprivate provides a way to initialize the private copy of a variable with the value it had before the parallel region.
• lastprivate provides a way to copy out the value computed in the (sequentially) last iteration.
• copyin provides a way to copy the master thread's value of a threadprivate variable to the other team members.
• copyprivate may only be associated with the single construct; it provides a broadcast action, copying the value computed by the one executing thread to the private copies of the other threads so they can proceed sooner.
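A sketch of copyprivate on the single construct (the variable and function names are invented for illustration):

int config;
#pragma omp parallel private(config)
{
    #pragma omp single copyprivate(config)
    config = read_config_file();   /* one thread reads; the value is */
                                   /* broadcast to every thread's copy */
    use_config(config);            /* all threads now see the same value */
}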
Protecting Updates of Shared Variables

• Critical sections and the atomic pragma are used for avoiding data-race conditions when threads update shared variables.

The following update is unprotected; several threads may test and write max at the same time:

#pragma omp parallel
{
    if ( max < new_value )
        max = new_value;
}

With a named critical section, only one thread at a time performs the test and update:

#pragma omp parallel
{
    #pragma omp critical (max)
    {
        if ( max < new_value )
            max = new_value;
    }
}
Atomic pragma
• Directs the compiler to generate code to ensure that the specific memory storage
is updated atomically
• x++
• ++x
• x --
• --x
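A minimal sketch of the atomic pragma protecting one of these simple updates (the shared counter is invented for illustration):

#pragma omp parallel for
for ( k = 0; k < 100; k++ ) {
    #pragma omp atomic
    counter++;       /* the increment of the shared counter is performed atomically */
}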
Intel Taskqueuing Extension to OpenMP
• The taskqueuing extension to OpenMP allows a programmer to parallelize control structures such as
‒ recursive functions,
‒ dynamic-tree searches,
‒ pointer-chasing loops.
Taskqueuing Execution Model
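A sketch of the taskqueuing model for a pointer-chasing loop, based on the Intel-specific taskq pragmas (the list pointer p and do_work are assumptions for illustration, and the exact pragma spelling depends on the Intel compiler version):

/* One thread walks the list and enqueues tasks; the other threads in the
   team dequeue and execute them. */
#pragma intel omp parallel taskq shared(p)
{
    while ( p != NULL ) {
        #pragma intel omp task captureprivate(p)
        do_work(p);
        p = p->next;
    }
}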
OpenMP Library Functions
• OpenMP provides
‒ pragmas, which offer the highest degree of simplicity and portability and can easily be switched off;
‒ a set of function calls, which can be combined with conditional compilation in your programs;
‒ environment variables.
The Most Heavily Used OpenMP Library Functions
Most Commonly Used Environment Variables for OpenMP
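A sketch using some of the commonly listed library functions, guarded by conditional compilation (the requested thread count and printed strings are illustrative assumptions):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void)
{
#ifdef _OPENMP
    omp_set_num_threads(4);               /* request a team of 4 threads */
#endif
    #pragma omp parallel
    {
#ifdef _OPENMP
        printf("thread %d of %d\n",
               omp_get_thread_num(),      /* this thread's id */
               omp_get_num_threads());    /* size of the current team */
#else
        printf("compiled without OpenMP\n");
#endif
    }
    return 0;
}

/* Environment variables such as OMP_NUM_THREADS and OMP_SCHEDULE can
   supply or override these settings at run time. */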
Data Races, Deadlocks, and Live Locks
• Unsynchronized access to shared memory can introduce race conditions.
• The results then depend non-deterministically on the relative timing (interleaving) of the threads.
Example: two threads update the same shared variable x without synchronization.

Thread 1: t = x;        Thread 2: u = x;
          x = t + 1;              x = u + 2;

Depending on how the reads and writes interleave, x ends up larger by 1, by 2, or by 3, so the result is non-deterministic.
Tools
• Intel Thread Checker is a powerful tool for detecting potential race conditions.
• Race conditions can be masked by timing, which makes them hard to reproduce.
• Sometimes race conditions are intentional and useful, for example when a reader only needs the "latest current value" of a variable.
• Even accesses that are individually synchronized can combine into a higher-level data race, as in the following example.
if ( !list.contains(key) )     // another thread might insert key between the
    list.insert(key);          // contains() and insert() calls
Deadlock
• Example: Thread 1 acquires lock A and then requests lock B, while Thread 2 acquires lock B and then requests lock A; each waits forever for the lock the other holds.

Deadlock can occur only if all four of the following conditions hold:
1. Access to each resource is exclusive.
2. A thread is allowed to hold one resource while requesting
another.
3. No thread is willing to relinquish a resource that it has
acquired.
4. There is a cycle of threads trying to acquire resources, where
each resource is held by one thread and requested by another.
Deadlock can be avoided by breaking any one of these conditions
Ordering the locks
void AcquireTwoLocksViaOrdering( Lock& x, Lock& y ) {
    assert( &x != &y );
    if( &x < &y ) {        // always acquire the lock at the lower address first,
        acquire x          // so no cycle of waiting threads can form
        acquire y
    } else {
        acquire y
        acquire x
    }
}
