
Parallel Programming

Fundamentals

R. Govindarajan
CSA/SERC, IISc
[email protected]

1
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA
Acknowledgments:

Slides for this tutorial are taken from the presentation materials available with
the book "Parallel Computer Architecture: A Hardware/Software Approach"
(Culler, Singh and Gupta, Morgan Kaufmann Pub.) and the associated course
material. They have been suitably adapted.
2
Space of Parallel Computing

Parallel Architecture
 Shared Memory
 Centralized shared memory (UMA)
 Distributed Shared Memory (NUMA)
 Distributed Memory
 A.k.a. Message passing
 E.g., Clusters
 Data Parallel
 SIMD, Vector Machine
 GPUs

3
Space of Parallel Computing

Programming Models
 What programmer uses in coding
applications
 Specifies synchronization and
communication
 Programming Models:
 Shared address space, e.g., OpenMP
 Message passing, e.g., MPI
 Single Instrn. Multi-Threaded (SIMT) , e.g.,
CUDA or OpenCL, specifically for GPUs

4
Definitions

• Speedup = T(1) / T(p)
  (execution time on one processor / execution time on p processors)

• Efficiency = Speedup / p

• Amdahl's Law:
  For a program that spends a fraction s of its execution time in
  sequential (non-parallelizable) code, the speedup is limited by 1/s.
• If 20% of the time is spent in the sequential part,
  the maximum speedup is 5!
5
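As a quick check of the bound, a minimal sketch in C (the helper name
amdahl_speedup and the sample values are ours, not from the slides):

  #include <stdio.h>

  /* Upper bound on speedup when a fraction s of the execution is
     sequential and the remaining (1 - s) runs perfectly on p processors. */
  double amdahl_speedup(double s, int p) {
      return 1.0 / (s + (1.0 - s) / p);
  }

  int main(void) {
      /* 20% sequential: the bound approaches 1/0.2 = 5 as p grows */
      printf("p = 4:    %.2f\n", amdahl_speedup(0.2, 4));     /* 2.50  */
      printf("p = 1000: %.2f\n", amdahl_speedup(0.2, 1000));  /* ~4.98 */
      return 0;
  }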
Understanding Amdahl’s Law

Example: 2-phase calculation
• sweep over an n x n grid and do some independent computation
• sweep again and add each value to a global sum

[Figure: concurrency vs. time, (a) Serial — both phases run at concurrency 1,
each taking n^2 time]

• Serial Execution Time = n^2 + n^2 = 2n^2


6
Understanding Amdahl’s Law

Naïve Parallel Execution
• Phase 1 is parallel: time for Phase 1 = n^2/p
• Phase 2 is serialized at the global variable: time for Phase 2 = n^2
• Speedup = 2n^2 / (n^2 + n^2/p), i.e., at most 2!

Improved Parallel Execution
• Localize the sum in each of the p processes: time for Phase 2 = n^2/p
• Accumulate the p local sums in a short serial step: time = p
• Speedup = 2n^2 / (2n^2/p + p) ≈ p

[Figure: concurrency vs. time profiles for the naïve and improved versions]

7
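A sketch of the improved scheme in C with OpenMP (introduced later in this
tutorial); the grid size N and the variable names are illustrative assumptions:

  #include <omp.h>

  #define N 1024

  double grid[N][N];

  double improved_sum(void) {
      double sum = 0.0;
      #pragma omp parallel
      {
          double local_sum = 0.0;           /* private partial sum per thread */
          #pragma omp for
          for (int i = 0; i < N; i++)
              for (int j = 0; j < N; j++)
                  local_sum += grid[i][j];  /* this phase runs fully in parallel */
          #pragma omp critical              /* short serial accumulation, O(p) */
          sum += local_sum;
      }
      return sum;
  }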


Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

8
Definitions
 Task
Arbitrary piece of work in parallel computation
Executed sequentially; concurrency is only across
tasks
Fine-grained vs. coarse-grained tasks

 Process
Abstract entity that performs the tasks
Communicate and synchronize to perform the
tasks
 Process vs. Thread
Coarse vs. Fine grain
Threads typically share the address space
9
Steps involved in Parallelization

 Identify work that can be done in parallel


 work includes computation, data access and I/O
 Partition work and perhaps data among
processes
 Manage data access, communication and
synchronization

10
Steps in Creating a Parallel
Program
[Figure: Sequential computation → (Decomposition) → Tasks → (Assignment) →
Processes → (Orchestration) → Parallel program → (Mapping) → Processors.
Decomposition and Assignment together constitute the Partitioning step.]
11
Task vs. Data Decomposition

• Computation is decomposed and assigned (partitioned) – task decomposition
  – task graphs
• Synchronization among tasks
  – fork-join
  – barrier

[Figure: a task graph with tasks A–G and the dependences among them]
12
Task vs. Domain Decomposition

• Partitioning data is often a natural view too – data or domain decomposition
• Grid example; computation follows the data: owner computes

  for i = 1 to m
    for j = 1 to n
      a[i,j] = a[i,j] + v[i]

[Figure: domain decomposition of the grid across processes]

13
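A hedged sketch of the owner-computes loop in C with OpenMP, where each thread
updates the rows (its part of the domain) assigned to it; the array sizes are
illustrative:

  #include <omp.h>

  #define M 512
  #define N 512

  double a[M][N];
  double v[M];

  void update(void) {
      /* rows are distributed across threads; each thread updates only
         the rows it owns */
      #pragma omp parallel for
      for (int i = 0; i < M; i++)
          for (int j = 0; j < N; j++)
              a[i][j] = a[i][j] + v[i];
  }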
Assignment

• Specifies how to group tasks together for a process
  – balance the workload
  – reduce communication and management cost

[Figure: compute vs. wait time on processes P1–P4 under an imbalanced assignment]

• Static versus dynamic assignment
• Both decomposition and assignment are usually independent of the
  architecture or programming model
  – but the cost and complexity of using the primitives may affect these decisions
14
Assignment (contd.)
[Figure: example assignments of a 2D grid – Block, Cyclic, Block-Cyclic, and Tiled]
15
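As a rough illustration (ours, not from the slides), the owner of iteration i
of an n-iteration loop under block and cyclic assignment over p processes can
be computed as:

  /* Owner of iteration i (0 <= i < n) under the two common assignments. */
  int block_owner(int i, int n, int p)  { return i / ((n + p - 1) / p); }
  int cyclic_owner(int i, int p)        { return i % p; }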
Orchestration

• Different for different programming models/architectures
• Shared address space
  – naming: global address space
  – synchronization through barriers and locks
• Distributed memory / message passing
  – non-shared address space
  – send-receive messages + barrier for synchronization

16
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

17
What is OpenMP?

• What does OpenMP stand for?
  – Open specifications for Multi Processing, developed via collaborative
    work between interested parties from the hardware and software industry,
    government and academia.
• OpenMP is an Application Program Interface (API) that may be used to
  explicitly direct multi-threaded, shared-memory parallelism.
 API components:
 Compiler Directives
 Runtime Library Routines
 Environment Variables
 OpenMP primitives can be included incrementally,
one function or even one loop at a time.

18
OpenMP Parallel Computing Solution Stack

[Figure: the OpenMP solution stack]
• User layer: end user and application
• Programming layer (OpenMP): compiler directives, OpenMP library,
  environment variables
• System layer: OpenMP runtime library; OS/system support for shared memory
• Hardware: processors 1..n connected to a shared memory

19
OpenMP execution model

• Fork and Join: the master thread spawns a team of worker threads as needed

[Figure: the master thread forks a team of worker threads at the start of each
parallel region and joins them at its end]
20
OpenMP Memory Model
• Shared memory model
  – Shared variable: a single copy of the variable is shared by many threads
  – Private variable: a variable seen by only one thread; each thread has its
    own copy of the variable

• Unintended sharing of data causes race conditions or incorrect behavior

  Thread 1 (shared int x)    Thread 2 (shared int x)
  x = x + 10;                x = x + 15;
  print(x);                  print(x);

  Values printed by Threads 1 and 2: (10, 25)? (25, 15)? (10, 15)? ...

• Use synchronization to protect from conflicts
• Synchronization is expensive: change how data is accessed to minimize the
  need for synchronization
21
OpenMP: Contents

 OpenMP’s constructs fall in 5


categories:
Parallel Regions
Worksharing
Data Environment
Synchronization
Runtime functions/environment variables
 OpenMP is basically the same between
Fortran and C/C++

22
OpenMP: Parallel Regions

• The omp parallel pragma lets all threads execute the enclosed section in
  parallel, i.e., the same code is executed several times (once per thread)

  double D[1000];
  #pragma omp parallel
  {
      int i; double sum = 0;
      for (i = 0; i < 1000; i++)
          sum += D[i];
      printf("Thread %d computes %f\n", omp_get_thread_num(), sum);
  }

• How many threads do we have? Set with omp_set_num_threads(n)
• What is the use of repeating the same work several times in parallel?
• D is shared between the threads; i and sum are private to each thread
23
Parallel Regions – Another Example

• You create threads in OpenMP with the parallel pragma; the runtime function
  omp_set_num_threads() requests a certain number of threads.
• For example, to create a 4-thread parallel region:

  double A[1000];
  omp_set_num_threads(4);            // request a certain number of threads
  #pragma omp parallel
  {
      int ID = omp_get_thread_num(); // runtime function returning a thread ID
      pooh(ID, A);
  }

• Each thread executes a copy of the code within the structured block
• Each thread calls pooh(ID, A), for ID = 0 to 3

24
Parallel Regions – Another Example

  double A[1000];
  omp_set_num_threads(4);
  #pragma omp parallel
  {
      int ID = omp_get_thread_num();
      pooh(ID, A);
  }
  printf("all done\n");

• Each thread executes the same code redundantly
• A single copy of A is shared between all threads:
  pooh(0,A)   pooh(1,A)   pooh(2,A)   pooh(3,A)
• Threads wait at the end of the parallel region for all threads to finish
  before proceeding (i.e., a barrier); only then is "all done" printed

25
OpenMP: Contents

 OpenMP’s constructs fall into 5


categories:
 Parallel Regions
 Work-sharing
 The “for” Work-Sharing construct splits up loop
iterations among the threads in a team
 By default, there is a barrier at the end of the “omp
for”. Use the “nowait” clause to turn off the barrier.
 Data Environment
 Synchronization
 Runtime functions/environment variables

26
OpenMP: Work Sharing Constructs

Sequential code:
  for (int i = 0; i < N; i++)
      a[i] = b[i] + c[i];

Parallelization of the for loop using OpenMP:
  #pragma omp parallel
  {
      #pragma omp for schedule(static)
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];
  }
27
OpenMP For construct:
The Schedule Clause
 The schedule clause affects how loop
iterations are mapped onto threads
 schedule(static [,csize])
 Deal-out blocks of iterations of size “csize” to each thread.
 Default: chunks of approximately equal size, one to each thread
 If more chunks than threads: assign in round-robin to the
threads
 Why might we want to use chunks of different size?
 schedule(dynamic[,csize])
 Each thread grabs “csize” iterations off a queue until all
iterations have been handled.
 Threads receive chunk assignments dynamically
 Default csize = 1

28
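A hedged illustration of the two schedule kinds on loops like the one above;
the chunk size of 8 is an arbitrary choice:

  #include <omp.h>
  #define N 1000

  void schedule_examples(double *a, double *b, double *c) {
      /* static: iterations are divided into chunks of 8 and dealt out
         round-robin to the threads at loop entry */
      #pragma omp parallel for schedule(static, 8)
      for (int i = 0; i < N; i++)
          a[i] = b[i] + c[i];

      /* dynamic: each thread grabs the next chunk of 8 iterations off a
         queue, which helps when iteration costs are uneven */
      #pragma omp parallel for schedule(dynamic, 8)
      for (int i = 0; i < N; i++)
          a[i] = a[i] * b[i] + c[i];
  }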
OpenMP Section :
Work Sharing Construct
 The Sections work-sharing construct gives
a different structured block to each
thread.
  #pragma omp parallel
  #pragma omp sections
  {
      #pragma omp section
      X_calculation();
      #pragma omp section
      y_calculation();
      #pragma omp section
      z_calculation();
  }

• By default, there is a barrier at the end of the "omp sections".
  Use the "nowait" clause to turn off the barrier.

29
OpenMP: Data Environment

 Shared Variables
Most variables (including locals) are shared
by default
Global variables are shared
 File scope variables, static variables
 Some variables can be private
Variables can be explicitly declared as
private:
A local copy is created for each thread
Automatic variables inside the statement
block
Automatic variables in the called functions
30
Data Environment: Changing Storage Attributes
• One can selectively change the storage attributes of variables in a
  construct using the following clauses:
 SHARED
 PRIVATE
 FIRSTPRIVATE
 LASTPRIVATE
 THREADPRIVATE
 The default status can be modified
with:
 DEFAULT (PRIVATE | SHARED | NONE)
31
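A small hedged example of the private and firstprivate clauses (the variable
names are ours, not from the slides):

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      int x = 10, y = 10;

      /* private(x): each thread gets its own uninitialized copy of x.
         firstprivate(y): each thread's copy of y is initialized to 10. */
      #pragma omp parallel private(x) firstprivate(y) num_threads(4)
      {
          x = omp_get_thread_num();  /* write before read: the copy starts uninitialized */
          y += x;
          printf("thread %d: y = %d\n", x, y);
      }
      return 0;
  }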
OpenMP Synchronization

  X = 0;
  #pragma omp parallel
      X = X + 1;

What should be the result (assume 2 threads)? It could be 1 or 2!

 OpenMP assumes that the programmer


knows what (s)he is doing
 Regions of code that are marked to run in parallel
are independent
 If access collisions are possible, it is the
programmer’s responsibility to insert protection

32
Synchronization Mechanisms

• Many of the usual mechanisms from shared-memory programming:
  – critical sections, atomic updates
  – barriers
  – ...

33
Critical Sections & Atomic

 #pragma omp critical [name]


 Standard critical section functionality
 Critical sections are global in the program
 Can be used to protect a single resource in
different functions
 Critical sections are identified by the name
 All the critical sections having the same name
are mutually exclusive between themselves
#pragma omp atomic
 Protects a single variable update

34
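A hedged sketch showing both directives protecting shared updates (the
variable and critical-section names are ours):

  #include <omp.h>

  int shared_count = 0;
  double balance = 0.0;

  void updates(void) {
      #pragma omp parallel num_threads(4)
      {
          /* atomic: protects a single variable update */
          #pragma omp atomic
          shared_count += 1;

          /* critical: protects an arbitrary block; all critical sections
             with the same name ("bank") are mutually exclusive */
          #pragma omp critical (bank)
          {
              balance += 10.0;
          }
      }
  }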
Barrier synchronization
 #pragma omp barrier
 Performs a barrier synchronization between
all the threads in a team at the given point.
 Example:
#pragma omp parallel
{
int result = heavy_computation_part1();
#pragma omp atomic
sum += result;
#pragma omp barrier
heavy_computation_part2(sum);
}
35
Reduction Motivation

 How to parallelize this code?


for (i = 0; i < N; i++) {
    sum += a[i] * b[i];
}
 sum is not private; need synchronization to ensure
correct reduction!
 accessing it atomically (or with synchronization)
serializes the execution
 Have a private copy of sum in each thread, then
add the private copies serially to get overall sum.

36
OpenMP: Reduction Example

#include <omp.h>
#define NUM_THREADS 4

int main()
{
    int i;
    int A[1000], B[1000];
    int sum = 0;
    omp_set_num_threads(NUM_THREADS);
    #pragma omp parallel for reduction(+:sum)
    for (i = 0; i < 1000; i++) {
        sum += A[i] * B[i];
    }
}

• A private copy of sum is created for each thread; each thread computes the
  partial sum for the chunk (of the arrays) assigned to it.
• The code for combining the private copies into the global sum (using
  synchronization) is added automatically by OpenMP.
37
Controlling OpenMP behavior

 omp_set_num_threads(int)
 Control the number of threads used for parallelization
 Must be called from sequential code
 Also can be set by OMP_NUM_THREADS environment
variable
 omp_get_num_threads()
 How many threads are currently available?
 omp_get_thread_num()
 omp_in_parallel()
 Am I currently running in parallel mode?
 omp_get_wtime()
 A portable way to compute wall clock time

38
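For example, a minimal hedged sketch of timing a parallel region with
omp_get_wtime():

  #include <stdio.h>
  #include <omp.h>

  int main(void) {
      double t0 = omp_get_wtime();

      #pragma omp parallel
      {
          printf("hello from thread %d of %d\n",
                 omp_get_thread_num(), omp_get_num_threads());
      }

      double t1 = omp_get_wtime();
      printf("parallel region took %f seconds\n", t1 - t0);
      return 0;
  }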
Overview
 Introduction
 Parallelization Essentials
 Programming Models
 Task creation
 Synchronization / Communication
 Introduction to OpenMP
 Introduction to CUDA

39
Intel SIMD Extensions
 New SIMD instructions, new registers
 Introduced in phases/groups of functionality
 MMX (1993 – 1999)
 64 bit width operations
 SSE – SSE4 (1999 –2006)
 128 bit width operations
 AVX, FMA, AVX2, AVX-512 (2008 – …)
 256 – 512 bit width operations

40
SIMD Operations

• SIMD execution performs an operation in parallel on an array of
  2, 4, 8, 16 or 32 values

[Figure: element-wise operation on 4-element vectors, Zi = Xi (op) Yi]

• The operation (op) can be a
  – data movement instruction
  – arithmetic instruction
  – logical instruction
  – comparison instruction
  – conversion instruction
  – shuffle instruction
41
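As a hedged illustration (not from the slides), the same kind of element-wise
multiply can be written explicitly with SSE intrinsics in C:

  #include <immintrin.h>

  /* Multiply 256 floats element-wise, 4 at a time, using 128-bit SSE
     registers. The arrays are assumed to hold at least 256 elements. */
  void vec_mul(const float *x, const float *y, float *z) {
      for (int i = 0; i < 256; i += 4) {
          __m128 a = _mm_loadu_ps(&x[i]);   /* load 4 floats */
          __m128 b = _mm_loadu_ps(&y[i]);
          __m128 c = _mm_mul_ps(a, b);      /* 4 multiplications in parallel */
          _mm_storeu_ps(&z[i], c);          /* store 4 results */
      }
  }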
Automatic Vectorization

• In gcc, the flags "-O3 -mavx -mavx2" attempt automatic vectorization
• Works pretty well for simple loops

  float X[256], Y[256], Z[256];
  {
      int i;
      for (i = 0; i < 256; i++)
          Z[i] = X[i] * Y[i];
  }

  Generated inner loop (schematically):
  .L1: vmovdqa  xmm1, X[rax]
       vmovdqa  xmm2, Y[rax]
       vpmulld  xmm0, xmm1, xmm2
       vmovaps  Z[rax], xmm0
       add      rax, 4
       cmp      rax, 256
       jne      .L1

• But not for anything complex
  – e.g., naïve bubble sort code is not vectorized at all
42
CUDA Programming

[Slides 43–44: CUDA programming figures and code, shown as images in the
original slides]
CUDA Programming

[Figure: a grid of 64 thread blocks (0 .. 63), each containing 256 threads
(0 .. 255)]
Processing Flow

[Slides 46–48: "Processing Flow" figures, shown as images in the original
slides]
CUDA Programming

  cudaMalloc((void **) &d_x, size);
  cudaMalloc((void **) &d_y, size);
  cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, h_y, size, cudaMemcpyHostToDevice);
  saxpy_parallel<<<nblocks, 256>>>(n, 2.0, d_x, d_y);
  cudaMemcpy(h_y, d_y, size, cudaMemcpyDeviceToHost);
  cudaFree(d_x); cudaFree(d_y);

49
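The kernel definition is not visible in this export; a minimal sketch of what
a saxpy_parallel kernel conventionally looks like in CUDA (the signature
follows the host call above, the body is our assumption):

  __global__ void saxpy_parallel(int n, float a, float *x, float *y)
  {
      /* one thread per element: blocks 0 .. nblocks-1, 256 threads per block */
      int i = blockIdx.x * blockDim.x + threadIdx.x;
      if (i < n)                      /* guard against the last, partial block */
          y[i] = a * x[i] + y[i];
  }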
Summary
 Introduction
 Parallelization Essentials
 Introduction to OpenMP
 Covered only the basics --- there is more!
 Current version is OpenMP 5.x
 Support for accelerators
 SIMD/Vectorization support
 Nested Parallelism
 Introduction to CUDA
 there is more to learn about CUDA!
 Current version is CUDA 12.x
 Dynamic kernel invocation, CUDA Graphs, unified
memory, …
50
