
SC’21 Booth Talk Series

PyOMP: Parallel Multithreading that is fast AND Pythonic
Tim Mattson, Senior Principal Engineer, Intel Corp.

Multithreaded parallel Python through OpenMP support in Numba, Todd Anderson and Tim Mattson, SciPy'2021, http://conference.scipy.org/proceedings/scipy2021/tim_mattson.html
Legal Disclaimer & Optimization Notice
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to
any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more complete information visit
www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright © 2021, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation
or its subsidiaries in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Third party names are the property of their owners


Disclaimer
● The views expressed in this talk are those of the speakers and not their employer.
● If we say something "smart" or worthwhile:
  ○ Credit goes to the many smart people we work with.
● If we say something stupid…
  ○ It's our own fault.

We work in Intel's research labs. We don't build products. Instead, we get to poke into dark corners and think silly thoughts… just to make sure we don't miss any great ideas.

Hence, our views are by design far "off the roadmap".


Acknowledgments

• Michel Pelletier (Graphegon):
  – His GraphBLAS binding to Python was the inspiration for the design of PyOMP
• Todd Anderson (Intel):
  – A Numba wizard who did the HARD implementation work that made PyOMP possible
• Giorgis Georgakoudis (LLNL) and Johannes Doerfert (ANL):
  – They are working with us to port PyOMP to an OpenMP-enabled open-source version of LLVM

Multithreaded parallel Python through OpenMP support in Numba, Todd Anderson and Tim Mattson, SciPy'2021, http://conference.scipy.org/proceedings/scipy2021/tim_mattson.html
https://github.com/Python-for-HPC/pyomp

LLNL: Lawrence Livermore National Laboratory; ANL: Argonne National Laboratory
Software vs. Hardware and the nature of Performance

Up until ~2005, performance came from semiconductor technology. Since ~2005*, performance comes from "the top":
• Better software tech.
• Better algorithms
• Better HW architecture#

*It's because of the end of Dennard Scaling … Moore's law has nothing to do with it.
#HW architecture matters, but dramatically LESS than software and algorithms.
The view of Python from an HPC perspective
(from the "Room at the top" paper)

A proxy for computing over nested loops … yes, they know you should use optimized library code for DGEMM:

    for i in range(4096):
        for j in range(4096):
            for k in range(4096):
                C[i][j] += A[i][k]*B[k][j]

This demonstrates a common attitude in the HPC community: Python is great for productivity, algorithm development, and combining functions from high-level modules in new ways to solve problems. If getting a high fraction of peak performance is a goal … recode in C.

Amazon AWS c4.8xlarge spot instance, Intel® Xeon® E5-2666 v3 CPU, 2.9 GHz, 18 cores, 60 GB RAM
Our goal … to help programmers “keep it in Python”

• Modern technology should be able to map Python onto low-level code (such as
C or LLVM) and avoid the “Python performance tax”.

• We’ve* worked on …
– Numba (2012): JIT Python code into LLVM

– Parallel accelerator (2017): Find and exploit parallel patterns in Python code.

– Intel High-Performance Analytics Toolkit and Scalable Dataframe Compiler (2019): Parallel
performance from data frames.

– Intel numba-dppy (2020): Numba ParallelAccelerator regions that run on GPUs via SYCL.

*OK, not “we” … it was mostly Todd


Third party names are the property of their owners
How do you get high performance for a modern CPU?

Three simple principles:
• Lots of threads … at least one per hardware thread (often two hardware threads per core)
• Exploit SIMD lanes from each thread
• Maximize cache utilization

[Diagram: four cores (Core 0 to Core 3), each with its own SIMD lanes, L1D$/L1I$, and L2$, sharing an L3$.]

Why not embed parallelism inside NumPy? This works, but it suffers from two problems:
1. Overhead of creating/destroying threads at each operation … increases parallel overhead and limits scalability (due to Amdahl's law)
2. Lost opportunity for parallelism from running multiple NumPy operations in parallel

… We want threads, but the GIL (Global Interpreter Lock) prevents multiple threads from making forward progress in parallel. The GIL is great for supporting thread safety and making it hard to write code that contains data races, but it prevents parallel multithreading in Python (see the sketch below).

What is the most common way in HPC to create multithreaded code? Something called OpenMP.
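To make the GIL point concrete, here is a minimal illustration (our own hypothetical example, not from the talk): a CPU-bound loop run on two standard-library threads takes about as long as running it twice serially, because only one thread can execute Python bytecode at a time.

    # Hypothetical sketch: CPU-bound work on plain Python threads does not speed up,
    # because the GIL lets only one thread execute Python bytecode at a time.
    import threading
    import time

    def count(n):
        total = 0
        for i in range(n):
            total += i
        return total

    N = 10_000_000

    t0 = time.perf_counter()
    count(N)
    count(N)                                  # two calls, one after the other
    serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                              # the same two calls, on two threads
    threaded = time.perf_counter() - t0

    print(serial, threaded)                   # roughly the same wall-clock time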
PyOMP Implementation in Numba: Overview

Numba front end:
  Python code with OpenMP context managers
    → Numba IR with OpenMP IR nodes (parse and convert the OpenMP directive strings)
    → LLVM IR with pseudo-OpenMP calls and tags

Intel LLVM compiler with middle-end OpenMP support:
  LLVM IR → machine code, linked to the OpenMP runtime through a CFFI interface.

Components: standard, openly available Python components (Numba); PyOMP-specific components; existing components from Intel's software development tools (the LLVM compiler and the OpenMP runtime).

Numba compilation phases that we did not modify are not shown.
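As a concrete illustration of the front end's input (a minimal sketch in the style of the examples that follow, not a slide from the talk): an @njit function containing a with openmp(...) block. Numba parses the directive string, inserts the OpenMP IR nodes, and the LLVM back end emits code that calls into the OpenMP runtime.

    # Minimal "hello world" sketch of what the PyOMP front end consumes.
    from numba import njit
    from numba.openmp import openmp_context as openmp
    from numba.openmp import omp_get_thread_num

    @njit
    def hello():
        # Every thread in the team executes the body of the parallel region.
        with openmp("parallel"):
            print("hello from thread", omp_get_thread_num())

    hello()   # JIT-compiles on the first call, then runs multithreaded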
Understanding OpenMP

We will explain the key elements of OpenMP as we explore the three fundamental design patterns of OpenMP (loop parallelism, SPMD, and divide and conquer), applied to the following problem:

def piFunc(NumSteps):
    step = 1.0/NumSteps
    sum = 0.0
    for i in range(NumSteps):
        x = (i+0.5)*step
        sum += 4.0/(1.0+x*x)
    pi = step*sum
    return pi
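For context (not stated on the slide): the loop is a midpoint-rule approximation of the integral of 4/(1+x*x) over [0, 1], which equals π, so the result converges to π as NumSteps grows.

    # Serial usage: the midpoint rule for ∫ 4/(1+x²) dx on [0, 1] converges to π.
    pi = piFunc(100000000)
    print(pi)    # ≈ 3.14159265...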
Loop Parallelism code

from numba import njit
from numba.openmp import openmp_context as openmp

@njit
def piFunc(NumSteps):
    step = 1.0/NumSteps
    sum = 0.0
    with openmp("parallel for private(x) reduction(+:sum)"):
        for i in range(NumSteps):
            x = (i+0.5)*step
            sum += 4.0/(1.0 + x*x)
    pi = step*sum
    return pi

pi = piFunc(100000000)

OpenMP constructs are managed through the with context manager; the OpenMP directive is passed into the context manager as a string.
• parallel: create a team of threads
• for: map loop iterations onto threads
• private(x): each thread gets its own x
• reduction(+:sum): combine the sum from each thread using +
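A usage note (a minimal sketch, not from the slides; the script name is hypothetical): the team size can be requested with omp_set_num_threads(), which is part of the PyOMP subset listed later in the talk, or with the OMP_NUM_THREADS environment variable.

    # Request 8 threads before entering parallel regions.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(8)
    pi = piFunc(100000000)

    # Alternatively, set the default from the shell:
    #   OMP_NUM_THREADS=8 python pi_loop.py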
Numerical Integration results in seconds … lower is better

           |           PyOMP             |              C
  Threads  |  Loop     SPMD     Task     |  Loop     SPMD     Task
     1     |  0.447    0.450    0.453    |  0.444    0.448    0.445
     2     |  0.252    0.255    0.245    |  0.245    0.242    0.222
     4     |  0.160    0.164    0.146    |  0.149    0.149    0.131
     8     |  0.0890   0.0890   0.0898   |  0.0827   0.0826   0.0720
    16     |  0.0520   0.0503   0.0517   |  0.0451   0.0451   0.0431

Intel® Xeon® E5-2699 v3 CPU with 18 cores running at 2.30 GHz.
For the C programs we used the Intel® icc compiler version 19.1.3.304 as: icc -qnextgen -O3 -fiopenmp
Ran each case 5 times and kept the minimum time. JIT time is not included for PyOMP (it was about 1.5 seconds).
SPMD (Single Program Multiple Data) design pattern

● Run the same program on P processing elements where P can be arbitrarily large.
● Use the rank … an ID ranging from 0 to (P-1) … to select between a set of tasks and to manage any shared data structures.

Replicate the program. Add glue code. Break up the data.

This pattern is very general and has been used to support most (if not all) the algorithm strategy patterns.
MPI programs almost always use this pattern … it is probably the most commonly used pattern in the history of parallel programming.

Third party names are the property of their owners


Single Program Multiple Data (SPMD)

from numba import njit
import numpy as np
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_thread_num, omp_get_num_threads

MaxTHREADS = 32

@njit
def piFunc(NumSteps):
    step = 1.0/NumSteps
    partialSums = np.zeros(MaxTHREADS)
    with openmp("parallel shared(partialSums,numThrds) private(threadID,i,x,localSum)"):
        threadID = omp_get_thread_num()
        with openmp("single"):
            numThrds = omp_get_num_threads()
        localSum = 0.0
        for i in range(threadID, NumSteps, numThrds):
            x = (i+0.5)*step
            localSum = localSum + 4.0/(1.0 + x*x)
        partialSums[threadID] = localSum
    return step*np.sum(partialSums)

pi = piFunc(100000000)

• omp_get_num_threads(): get N = number of threads
• omp_get_thread_num(): thread rank = 0…(N-1)
• single: one thread does the work, others wait
• private(x): each thread gets its own x
• shared(x): all threads see the same x

Deal out loop iterations as if from a deck of cards (a cyclic distribution) … each thread starts with iteration = ID and increments by the number of threads, until the whole "deck" is dealt out (see the sketch below).
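To make the cyclic distribution concrete (an illustrative sketch, not from the slides): with 4 threads and 12 iterations, the strided range deals out iterations like cards.

    # Which iterations each of 4 threads takes with the strided loop above.
    numThrds = 4
    NumSteps = 12
    for threadID in range(numThrds):
        print(threadID, list(range(threadID, NumSteps, numThrds)))
    # 0 [0, 4, 8]
    # 1 [1, 5, 9]
    # 2 [2, 6, 10]
    # 3 [3, 7, 11]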
Divide and conquer design pattern

● Split the problem into smaller sub-problems; continue until the sub-problems can be solved directly.

● Three options for parallelism:
  – Do work as you split into sub-problems
  – Do work at the leaves
  – Do work as you recombine
Divide and conquer (with explicit tasks)

from numba import njit
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_wtime

MIN_BLK = 1024*256

@njit
def piComp(Nstart, Nfinish, step):
    iblk = Nfinish - Nstart
    if iblk < MIN_BLK:                    # Solve: block is small enough, integrate directly
        sum = 0.0
        for i in range(Nstart, Nfinish):
            x = (i+0.5)*step
            sum += 4.0/(1.0 + x*x)
    else:                                 # Split: recurse on two halves as explicit tasks
        sum1 = 0.0
        sum2 = 0.0
        with openmp("task shared(sum1)"):
            sum1 = piComp(Nstart, Nfinish-iblk//2, step)
        with openmp("task shared(sum2)"):
            sum2 = piComp(Nfinish-iblk//2, Nfinish, step)
        with openmp("taskwait"):          # Merge: wait for both tasks, then combine
            sum = sum1 + sum2
    return sum

@njit
def piFunc(NumSteps):                     # Fork threads and launch the computation
    step = 1.0/NumSteps
    sum = 0.0
    startTime = omp_get_wtime()           # timer start; elapsed-time reporting not shown on the slide
    with openmp("parallel"):
        with openmp("single"):
            sum = piComp(0, NumSteps, step)
    pi = step*sum
    return pi

pi = piFunc(100000000)

• single: one thread does the work, others wait
• task: code block enqueued for execution
• taskwait: wait until the tasks created in the code block finish
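A note on the design (our reading, stated as an assumption): MIN_BLK stops the recursion once a block is small enough that task-creation overhead would outweigh the work, so each leaf task integrates at least 256K steps serially. A hypothetical run:

    # Hypothetical usage: the single construct means one thread starts the recursion;
    # the tasks it spawns are picked up by the rest of the team.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(16)
    pi = piFunc(100000000)
    print(pi)    # ≈ 3.14159265...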
OpenMP subset supported in PyOMP

Fork threads:
  with openmp("parallel"):       Create a team of threads. Execute a parallel region.

Work sharing:
  with openmp("for"):            Use inside a parallel region. Split up a loop across the team.
  with openmp("parallel for"):   A combined construct. Same as a parallel followed by a for.
  with openmp("single"):         One thread does the work. Others wait for it to finish.
  with openmp("task"):           Create an explicit task for the work within the construct.

Synchronization:
  with openmp("taskwait"):       Wait for all tasks created in the current task to complete.
  with openmp("barrier"):        All threads arrive at the barrier before any proceed.
  with openmp("critical"):       Mutual exclusion. One thread at a time executes the code.

Parallel loop support:
  schedule(static [,chunk])      Map blocks of loop iterations across the team. Use with for.
  reduction(op:list)             Combine values with op across the team. Use with for.

Data environment:
  private(list)                  Make a local copy of variables for each thread. Use with parallel, for or task.
  firstprivate(list)             Like private, but initialized with the original value. Use with parallel, for or task.
  shared(list)                   Variables shared between threads. Use with parallel, for or task.
  default(none)                  Force definition of variables as private or shared.

Runtime library:
  omp_get_num_threads()          Return the number of threads in a team.
  omp_get_thread_num()           Return an ID from 0 to the number of threads minus one.
  omp_set_num_threads(int)       Set the number of threads to request for parallel regions.
  omp_get_wtime()                Return a snapshot of the wall clock time.

Environment:
  OMP_NUM_THREADS=N              Environment variable to set the default number of threads.
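A hypothetical sketch (not from the talk; the function name array_sum is ours) exercising a few listed constructs that the pi examples do not use: a for with a static schedule inside a parallel region, and a critical section to combine per-thread partial sums.

    from numba import njit
    import numpy as np
    from numba.openmp import openmp_context as openmp

    @njit
    def array_sum(a):
        total = 0.0
        with openmp("parallel shared(total) private(localSum,i)"):
            localSum = 0.0
            with openmp("for schedule(static, 1024)"):   # deal out chunks of 1024 iterations
                for i in range(a.shape[0]):
                    localSum += a[i]
            with openmp("critical"):                     # one thread at a time updates the shared total
                total += localSum
        return total

    print(array_sum(np.ones(1000000)))    # expect 1000000.0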
The view of Python from an HPC perspective

We know better than the earlier ijk, order-4096 version … the ikj loop order is more cache friendly (the innermost loop streams through contiguous rows of B and C), and we picked a smaller problem:

    for i in range(1000):
        for k in range(1000):
            for j in range(1000):
                C[i][j] += A[i][k]*B[k][j]

Amazon AWS c4.8xlarge spot instance, Intel® Xeon® E5-2666 v3 CPU, 2.9 GHz, 18 cores, 60 GB RAM
PyOMP DGEMM (Mat-Mul with double precision numbers)

from numba import njit
import numpy as np
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_wtime

@njit(fastmath=True)
def dgemm(iterations, order):

    # allocate and initialize arrays
    A = np.zeros((order,order))
    B = np.zeros((order,order))
    C = np.zeros((order,order))

    # Assign values to A and B such that
    # the product matrix has a known value.
    for i in range(order):
        A[:,i] = float(i)
        B[:,i] = float(i)

    tInit = omp_get_wtime()
    with openmp("parallel for private(j,k)"):
        for i in range(order):
            for k in range(order):
                for j in range(order):
                    C[i][j] += A[i][k] * B[k][j]
    dgemmTime = omp_get_wtime() - tInit

    # Check result
    checksum = 0.0
    for i in range(order):
        for j in range(order):
            checksum += C[i][j]
    ref_checksum = order*order*order
    ref_checksum *= 0.25*(order-1.0)*(order-1.0)
    eps = 1.e-8
    if abs((checksum - ref_checksum)/ref_checksum) < eps:
        print('Solution validates')
        nflops = 2.0*order*order*order
        print('Rate (MF/s): ', 1.e-6*nflops/dgemmTime)
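A hypothetical driver (the argument values are ours, not from the slide), matching the order-1000 runs summarized on the next slide:

    # Hypothetical invocation: JIT-compile on the first call, then run an order-1000 DGEMM.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(16)
    dgemm(1, 1000)       # prints 'Solution validates' and the achieved MF/s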
DGEMM PyOMP vs C-OpenMP

[Chart: average GFLOPS (billions of floating point operations per second) versus number of threads (1, 2, 4, 8, 16) for PyOMP and for C with OpenMP; matrix multiplication, double precision, order = 1000, with error bars (std dev) over 250 runs.]

PyOMP times DO NOT include the one-time JIT cost of ~2 seconds.

Intel® Xeon® E5-2699 v3 CPU, 18 cores, 2.30 GHz, threads mapped to a single CPU, one thread per core, first 16 physical cores.
Intel® icc compiler ver 19.1.3.304 (icc -std=c11 -pthread -O3 -xHOST -qopenmp)
Summary

• We've created a research prototype OpenMP interface in Python called PyOMP.
• It is based on Numba and an OpenMP-enabled LLVM.

• Next steps:
  • We need to carry out detailed benchmarking (Dask, Ray, mpi4py).
  • We need to map PyOMP onto an open-source, publicly available LLVM.
    • Work ongoing in partnership between Intel, ANL, and LLNL.

• Track our progress at: https://github.com/Python-for-HPC/pyomp

[Photo: My Greenlandic skin-on-frame kayak in the middle of Budd Inlet during a negative tide]
