OpenMPBoothTalk PyOMP
We work in Intel’s research labs. We don’t build products. Instead, we get to poke into dark corners and
think silly thoughts… just to make sure we don’t miss any great ideas.
Multithreaded parallel Python through OpenMP support in Numba, Todd Anderson and Tim
Mattson, SciPy’2021 http://conference.scipy.org/proceedings/scipy2021/tim_mattson.html
https://github.com/Python-for-HPC/pyomp
LLNL: Lawrence Livermore National Laboratory. ANL: Argonne National Laboratory.
Software vs. Hardware and the nature of Performance
Up until ~2005, performance came from "the top": semiconductor technology. Since ~2005, performance comes from better software technology, better algorithms, and better hardware architecture.
Amazon AWS c4.8xlarge spot instance, Intel® Xeon® E5-2666 v3 CPU, 2.9 Ghz, 18 core, 60 GB RAM
The view of Python from an HPC perspective (from the "Room at the top" paper).
A proxy for computing over nested loops (yes, they know you should use optimized library code for DGEMM):

    for i in range(4096):
        for j in range(4096):
            for k in range(4096):
                C[i][j] += A[i][k]*B[k][j]
Python is great for productivity, algorithm development, and combining functions from high-level modules in
new ways to solve problems. If getting a high fraction of peak performance is a goal … recode in C.
Our goal … to help programmers “keep it in Python”
• Modern technology should be able to map Python onto low-level code (such as
C or LLVM) and avoid the “Python performance tax”.
• We’ve worked on …
– Numba (2012): JIT Python code into LLVM
– Parallel accelerator (2017): Find and exploit parallel patterns in Python code.
– Intel High-Performance Analytics Toolkit and Scalable Dataframe Compiler (2019): Parallel
performance from data frames.
– Intel numba-dppy (2020): Numba ParallelAccelerator regions that run on GPUs via SYCL.
… We want threads, but the GIL (Global Interpreter Lock) prevents multiple threads from making forward progress in parallel. The GIL is great for supporting thread safety and making it hard to write code that contains data races, but it prevents parallel multithreading in Python.
What is the most common way in HPC to create multithreaded code? Something called OpenMP.
PyOMP Implementation in Numba: Overview
Serial pi program (midpoint-rule integration of 4/(1+x*x) from 0 to 1):

    def piFunc(NumSteps):
        step = 1.0/NumSteps
        sum = 0.0
        for i in range(NumSteps):
            x = (i+0.5)*step
            sum += 4.0/(1.0+x*x)
        pi = step*sum
        return pi
Loop Parallelism code
OpenMP constructs are managed through the with context manager; the OpenMP directive is passed into the context manager as a string.

    from numba import njit
    from numba.openmp import openmp_context as openmp

    @njit
    def piFunc(NumSteps):
        step = 1.0/NumSteps
        sum = 0.0
        with openmp("parallel for private(x) reduction(+:sum)"):
            for i in range(NumSteps):
                x = (i+0.5)*step
                sum += 4.0/(1.0 + x*x)
        pi = step*sum
        return pi

    pi = piFunc(100000000)

• parallel: create a team of threads
• for: map loop iterations onto threads
• private(x): each thread gets its own x
• reduction(+:sum): combine each thread's partial sum using +
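What reduction(+:sum) buys can be sketched in plain Python (threading module only, not PyOMP; the names pi_reduction and worker are illustrative): each thread accumulates a private partial sum, then adds it into the shared total under a lock, so the combine step has no data race.

```python
import threading

def pi_reduction(num_steps, nthreads=4):
    """Midpoint-rule pi with an explicit, lock-protected reduction."""
    step = 1.0 / num_steps
    total = 0.0
    lock = threading.Lock()

    def worker(rank):
        nonlocal total
        local = 0.0                                 # the "private" partial sum
        for i in range(rank, num_steps, nthreads):  # cyclic split of iterations
            x = (i + 0.5) * step
            local += 4.0 / (1.0 + x * x)
        with lock:                                  # combine once per thread, race-free
            total += local

    threads = [threading.Thread(target=worker, args=(r,)) for r in range(nthreads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return step * total
```

Because of the GIL this pure-Python version gains correctness, not speed; PyOMP's compiled threads are what remove that limit.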
[Table: Numerical integration runtimes in seconds (lower is better) for PyOMP vs. C, comparing Loop, SPMD, and Task versions across thread counts.]
● Run the same program on P processing elements where P can be arbitrarily large.
● Use the rank … an ID ranging from 0 to (P-1) … to select between a set of tasks and to manage any
shared data structures.
This pattern is very general and has been used to support most (if not all) of the algorithm strategy patterns.
MPI programs almost always use this pattern … it is probably the most commonly used pattern in the history of parallel programming.
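The structure of the SPMD pattern can be sketched in plain Python (threading only; this shows the shape of the pattern, not PyOMP's performance, and spmd_pi is an illustrative name): every worker runs the same function and uses its rank to pick its block of iterations and its own slot in a results array.

```python
import threading

def spmd_pi(num_steps, P=4):
    """Same program on P workers; the rank selects both work and storage."""
    step = 1.0 / num_steps
    partial = [0.0] * P                      # one slot per rank: no write conflicts

    def body(rank):
        lo = rank * num_steps // P           # block decomposition by rank
        hi = (rank + 1) * num_steps // P
        s = 0.0
        for i in range(lo, hi):
            x = (i + 0.5) * step
            s += 4.0 / (1.0 + x * x)
        partial[rank] = s

    workers = [threading.Thread(target=body, args=(r,)) for r in range(P)]
    for t in workers:
        t.start()
    for t in workers:
        t.join()
    return step * sum(partial)
```

The per-rank result slots replace the reduction clause: each worker owns its slot, and the combine happens once, after the join.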
Data Environment clauses:
• private(list): make a local copy of each variable in list for each thread. Use with parallel, for, or task.
• firstprivate(list): like private, but initialize each copy with the original value. Use with parallel, for, or task.
• shared(list): variables shared between threads. Use with parallel, for, or task.
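The semantics of the three clauses can be mimicked in plain Python (an illustration with hypothetical names, not PyOMP code): a private variable is a fresh per-thread local, a firstprivate one starts from the original value, and a shared structure is visible to every thread.

```python
import threading

def data_env_demo(nthreads=4):
    seed = 10                        # original value a firstprivate copy starts from
    shared = [0] * nthreads          # shared: one structure seen by all threads

    def worker(rank):
        fp = seed                    # firstprivate-style: private copy, initialized
        x = rank + 1                 # private-style: fresh local, set per thread
        shared[rank] = fp + x        # each rank writes its own slot (no race)

    ts = [threading.Thread(target=worker, args=(r,)) for r in range(nthreads)]
    for t in ts:
        t.start()
    for t in ts:
        t.join()
    return shared
```

With 4 threads this returns [11, 12, 13, 14]: every worker saw the original seed (firstprivate), kept its own x (private), and wrote into the one shared list.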
PyOMP DGEMM (Mat-Mul with double precision numbers)

    from numba import njit
    import numpy as np
    from numba.openmp import openmp_context as openmp
    from numba.openmp import omp_get_wtime

    @njit(fastmath=True)
    def dgemm(iterations, order):
        # ... A, B, C set up as order x order arrays (elided on the slide) ...
        tInit = omp_get_wtime()
        with openmp("parallel for private(j,k)"):
            for i in range(order):
                for k in range(order):
                    for j in range(order):
                        C[i][j] += A[i][k] * B[k][j]
        dgemmTime = omp_get_wtime() - tInit
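The serial arithmetic of that i-k-j triple loop can be checked on a small case in plain Python; the parallel for directive changes which thread runs each i iteration, not the result. matmul_ikj is an illustrative name, not part of the benchmark.

```python
def matmul_ikj(A, B, order):
    """Triple-loop matrix multiply in the same i-k-j order as the kernel above."""
    C = [[0.0] * order for _ in range(order)]
    for i in range(order):
        for k in range(order):
            for j in range(order):
                C[i][j] += A[i][k] * B[k][j]
    return C
```

For example, matmul_ikj([[1, 2], [3, 4]], [[5, 6], [7, 8]], 2) returns [[19.0, 22.0], [43.0, 50.0]]. The i-k-j ordering keeps the innermost loop streaming over rows of B and C, which is friendlier to caches than the textbook i-j-k order.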
[Plot: DGEMM runtime in seconds vs. number of threads (1, 2, 4, 8, 16). PyOMP times DO NOT include the one-time JIT cost of ~2 seconds.]
Intel® Xeon® E5-2699 v3 CPU, 18 cores, 2.30 GHz; threads mapped to a single CPU, one thread per core, first 16 physical cores.
Intel® icc compiler ver. 19.1.3.304 (icc -std=c11 -pthread -O3 -xHOST -qopenmp)
Summary
• We've created a research prototype OpenMP interface in Python called PyOMP.
• It is based on Numba and an OpenMP-enabled LLVM.
• Next steps: we need to carry out detailed benchmarking (Dask, Ray, mpi4py).
[Photo: My Greenlandic skin-on-frame kayak in the middle of Budd Inlet during a negative tide]