
SC’21 Booth Talk Series

PyOMP: Parallel Multithreading that is fast AND Pythonic
Tim Mattson, Senior Principal Engineer, Intel Corp.

Multithreaded parallel Python through OpenMP support in Numba, Todd Anderson and Tim Mattson, SciPy'2021, http://conference.scipy.org/proceedings/scipy2021/tim_mattson.html
Legal Disclaimer & Optimization Notice
This document contains information on products, services and/or processes in development. All information provided here is subject to change
without notice.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors. Performance tests,
such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to
any of those factors may cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your
contemplated purchases, including the performance of that product when combined with other products. For more complete information visit
www.intel.com/benchmarks.
INFORMATION IN THIS DOCUMENT IS PROVIDED “AS IS”. NO LICENSE, EXPRESS OR IMPLIED, BY ESTOPPEL OR OTHERWISE, TO ANY
INTELLECTUAL PROPERTY RIGHTS IS GRANTED BY THIS DOCUMENT. INTEL ASSUMES NO LIABILITY WHATSOEVER AND INTEL
DISCLAIMS ANY EXPRESS OR IMPLIED WARRANTY, RELATING TO THIS INFORMATION INCLUDING LIABILITY OR WARRANTIES
RELATING TO FITNESS FOR A PARTICULAR PURPOSE, MERCHANTABILITY, OR INFRINGEMENT OF ANY PATENT, COPYRIGHT OR
OTHER INTELLECTUAL PROPERTY RIGHT.
Copyright © 2021, Intel Corporation. All rights reserved. Intel, the Intel logo, Xeon, Core, VTune, and OpenVINO are trademarks of Intel Corporation
or its subsidiaries in the U.S. and other countries.

Optimization Notice
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors.
These optimizations include SSE2, SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or
effectiveness of any optimization on microprocessors not manufactured by Intel. Microprocessor-dependent optimizations in this product are intended for use
with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel microprocessors. Please refer to the applicable
product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
Notice revision #20110804

Third party names are the property of their owners


Disclaimer
● The views expressed in this talk are those of the speakers and not their employer.
● If we say something "smart" or worthwhile:
  ○ Credit goes to the many smart people we work with.
● If we say something stupid…
  ○ It's our own fault.

We work in Intel's research labs. We don't build products. Instead, we get to poke into dark corners and think silly thoughts… just to make sure we don't miss any great ideas.

Hence, our views are by design far "off the roadmap".


Acknowledgments

• Michel Pelletier (Graphegon):
  – His GraphBLAS binding to Python was the inspiration for the design of PyOMP
• Todd Anderson (Intel):
  – A Numba wizard who did the HARD implementation work that made PyOMP possible
• Giorgis Georgakoudis (LLNL) and Johannes Doerfert (ANL):
  – They are working with us to port PyOMP to an OpenMP-enabled open-source version of LLVM

Multithreaded parallel Python through OpenMP support in Numba, Todd Anderson and Tim Mattson, SciPy'2021, http://conference.scipy.org/proceedings/scipy2021/tim_mattson.html
https://github.com/Python-for-HPC/pyomp

LLNL: Lawrence Livermore National Laboratory; ANL: Argonne National Laboratory
Software vs. Hardware and the nature of Performance

Up until ~2005, performance came from semiconductor technology. Since ~2005*, performance comes from "the top":
• Better software tech.
• Better algorithms
• Better HW architecture#

*It's because of the end of Dennard Scaling … Moore's law has nothing to do with it.
#HW architecture matters, but dramatically LESS than software and algorithms.
The view of Python from an HPC perspective
(from the "Room at the top" paper)

A proxy for computing over nested loops … yes, they know you should use optimized library code for DGEMM:

    for i in range(4096):
        for j in range(4096):
            for k in range(4096):
                C[i][j] += A[i][k]*B[k][j]

This demonstrates a common attitude in the HPC community: Python is great for productivity, algorithm development, and combining functions from high-level modules in new ways to solve problems. If getting a high fraction of peak performance is a goal … recode in C.

Amazon AWS c4.8xlarge spot instance, Intel® Xeon® E5-2666 v3 CPU, 2.9 GHz, 18 cores, 60 GB RAM
Our goal … to help programmers “keep it in Python”

• Modern technology should be able to map Python onto low-level code (such as
C or LLVM) and avoid the “Python performance tax”.

• We’ve* worked on …
– Numba (2012): JIT Python code into LLVM

– Parallel accelerator (2017): Find and exploit parallel patterns in Python code.

– Intel High-Performance Analytics Toolkit and Scalable Dataframe Compiler (2019): Parallel
performance from data frames.

– Intel numba-dppy (2020): Numba ParallelAccelerator regions that run on GPUs via SYCL.

*OK, not “we” … it was mostly Todd


Third party names are the property of their owners
How do you get high performance for a modern CPU?

Three simple principles:
• Lots of threads … at least one per hardware thread (often two hardware threads per core)
• Exploit SIMD lanes from each thread
• Maximize cache utilization

[Diagram: four cores (Core 0 to Core 3), each with its own SIMD lanes, L1D$/L1I$, and L2$, sharing an L3$.]

Why not embed parallelism inside NumPy? This works, but it suffers from two problems:
1. Overhead of creating/destroying threads at each operation … increases parallel overhead and limits scalability (due to Amdahl's law)
2. Lost opportunity for parallelism from running multiple NumPy operations in parallel

… We want threads, but the GIL (Global Interpreter Lock) prevents multiple threads from making forward progress in parallel. The GIL is great for supporting thread safety and making it hard to write code that contains data races, but it prevents parallel multithreading in Python (see the sketch below).

What is the most common way in HPC to create multithreaded code? Something called OpenMP.
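To make the GIL point concrete, here is a minimal illustration (our own hypothetical example, not from the talk): a CPU-bound loop run on two standard-library threads takes about as long as running it twice serially, because only one thread can execute Python bytecode at a time.

    # Hypothetical sketch: CPU-bound work on plain Python threads does not speed up,
    # because the GIL lets only one thread execute Python bytecode at a time.
    import threading
    import time

    def count(n):
        total = 0
        for i in range(n):
            total += i
        return total

    N = 10_000_000

    t0 = time.perf_counter()
    count(N)
    count(N)                                  # two calls, one after the other
    serial = time.perf_counter() - t0

    t0 = time.perf_counter()
    threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()                              # the same two calls, on two threads
    threaded = time.perf_counter() - t0

    print(serial, threaded)                   # roughly the same wall-clock time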
PyOMP Implementation in Numba: Overview

Numba front end:
  Python code with OpenMP context managers
    → Numba IR with OpenMP IR nodes (parse and convert the OpenMP directive strings)
    → LLVM IR with pseudo-OpenMP calls and tags

Intel LLVM compiler with middle-end OpenMP support:
  LLVM IR → machine code, linked to the OpenMP runtime through a CFFI interface.

Components: standard, openly available Python components (Numba); PyOMP-specific components; existing components from Intel's software development tools (the LLVM compiler and the OpenMP runtime).

Numba compilation phases that we did not modify are not shown.
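As a concrete illustration of the front end's input (a minimal sketch in the style of the examples that follow, not a slide from the talk): an @njit function containing a with openmp(...) block. Numba parses the directive string, inserts the OpenMP IR nodes, and the LLVM back end emits code that calls into the OpenMP runtime.

    # Minimal "hello world" sketch of what the PyOMP front end consumes.
    from numba import njit
    from numba.openmp import openmp_context as openmp
    from numba.openmp import omp_get_thread_num

    @njit
    def hello():
        # Every thread in the team executes the body of the parallel region.
        with openmp("parallel"):
            print("hello from thread", omp_get_thread_num())

    hello()   # JIT-compiles on the first call, then runs multithreaded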
Understanding OpenMP

We will explain the key elements of OpenMP as we explore the three fundamental design patterns of OpenMP (loop parallelism, SPMD, and divide and conquer), applied to the following problem:

def piFunc(NumSteps):
    step = 1.0/NumSteps
    sum = 0.0
    for i in range(NumSteps):
        x = (i+0.5)*step
        sum += 4.0/(1.0+x*x)
    pi = step*sum
    return pi
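For context (not stated on the slide): the loop is a midpoint-rule approximation of the integral of 4/(1+x*x) over [0, 1], which equals π, so the result converges to π as NumSteps grows.

    # Serial usage: the midpoint rule for ∫ 4/(1+x²) dx on [0, 1] converges to π.
    pi = piFunc(100000000)
    print(pi)    # ≈ 3.14159265...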
Loop Parallelism code

from numba import njit
from numba.openmp import openmp_context as openmp

@njit
def piFunc(NumSteps):
    step = 1.0/NumSteps
    sum = 0.0
    with openmp("parallel for private(x) reduction(+:sum)"):
        for i in range(NumSteps):
            x = (i+0.5)*step
            sum += 4.0/(1.0 + x*x)
    pi = step*sum
    return pi

pi = piFunc(100000000)

OpenMP constructs are managed through the with context manager; the OpenMP directive is passed into the context manager as a string.
• parallel: create a team of threads
• for: map loop iterations onto threads
• private(x): each thread gets its own x
• reduction(+:sum): combine the sum from each thread using +
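A usage note (a minimal sketch, not from the slides; the script name is hypothetical): the team size can be requested with omp_set_num_threads(), which is part of the PyOMP subset listed later in the talk, or with the OMP_NUM_THREADS environment variable.

    # Request 8 threads before entering parallel regions.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(8)
    pi = piFunc(100000000)

    # Alternatively, set the default from the shell:
    #   OMP_NUM_THREADS=8 python pi_loop.py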
Numerical Integration results in seconds … lower is better

           |           PyOMP             |              C
  Threads  |  Loop     SPMD     Task     |  Loop     SPMD     Task
     1     |  0.447    0.450    0.453    |  0.444    0.448    0.445
     2     |  0.252    0.255    0.245    |  0.245    0.242    0.222
     4     |  0.160    0.164    0.146    |  0.149    0.149    0.131
     8     |  0.0890   0.0890   0.0898   |  0.0827   0.0826   0.0720
    16     |  0.0520   0.0503   0.0517   |  0.0451   0.0451   0.0431

Intel® Xeon® E5-2699 v3 CPU with 18 cores running at 2.30 GHz.
For the C programs we used the Intel® icc compiler version 19.1.3.304 as: icc -qnextgen -O3 -fiopenmp
Ran each case 5 times and kept the minimum time. JIT time is not included for PyOMP (it was about 1.5 seconds).
SPMD (Single Program Multiple Data) design pattern

● Run the same program on P processing elements where P can be arbitrarily large.
● Use the rank … an ID ranging from 0 to (P-1) … to select between a set of tasks and to manage any shared data structures.

Replicate the program. Add glue code. Break up the data.

This pattern is very general and has been used to support most (if not all) the algorithm strategy patterns.
MPI programs almost always use this pattern … it is probably the most commonly used pattern in the history of parallel programming.

Third party names are the property of their owners


Single Program Multiple Data (SPMD)

from numba import njit
import numpy as np
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_thread_num, omp_get_num_threads

MaxTHREADS = 32

@njit
def piFunc(NumSteps):
    step = 1.0/NumSteps
    partialSums = np.zeros(MaxTHREADS)
    with openmp("parallel shared(partialSums,numThrds) private(threadID,i,x,localSum)"):
        threadID = omp_get_thread_num()
        with openmp("single"):
            numThrds = omp_get_num_threads()
        localSum = 0.0
        for i in range(threadID, NumSteps, numThrds):
            x = (i+0.5)*step
            localSum = localSum + 4.0/(1.0 + x*x)
        partialSums[threadID] = localSum
    return step*np.sum(partialSums)

pi = piFunc(100000000)

• omp_get_num_threads(): get N = number of threads
• omp_get_thread_num(): thread rank = 0…(N-1)
• single: one thread does the work, others wait
• private(x): each thread gets its own x
• shared(x): all threads see the same x

Deal out loop iterations as if from a deck of cards (a cyclic distribution) … each thread starts with iteration = ID and increments by the number of threads, until the whole "deck" is dealt out (see the sketch below).
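To make the cyclic distribution concrete (an illustrative sketch, not from the slides): with 4 threads and 12 iterations, the strided range deals out iterations like cards.

    # Which iterations each of 4 threads takes with the strided loop above.
    numThrds = 4
    NumSteps = 12
    for threadID in range(numThrds):
        print(threadID, list(range(threadID, NumSteps, numThrds)))
    # 0 [0, 4, 8]
    # 1 [1, 5, 9]
    # 2 [2, 6, 10]
    # 3 [3, 7, 11]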
Divide and conquer design pattern

● Split the problem into smaller sub-problems; continue until the sub-problems can be solved directly.

● Three options for parallelism:
  – Do work as you split into sub-problems
  – Do work at the leaves
  – Do work as you recombine
Divide and conquer (with explicit tasks)

from numba import njit
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_wtime

MIN_BLK = 1024*256

@njit
def piComp(Nstart, Nfinish, step):
    iblk = Nfinish - Nstart
    if iblk < MIN_BLK:                    # Solve: block is small enough, integrate directly
        sum = 0.0
        for i in range(Nstart, Nfinish):
            x = (i+0.5)*step
            sum += 4.0/(1.0 + x*x)
    else:                                 # Split: recurse on two halves as explicit tasks
        sum1 = 0.0
        sum2 = 0.0
        with openmp("task shared(sum1)"):
            sum1 = piComp(Nstart, Nfinish-iblk//2, step)
        with openmp("task shared(sum2)"):
            sum2 = piComp(Nfinish-iblk//2, Nfinish, step)
        with openmp("taskwait"):          # Merge: wait for both tasks, then combine
            sum = sum1 + sum2
    return sum

@njit
def piFunc(NumSteps):                     # Fork threads and launch the computation
    step = 1.0/NumSteps
    sum = 0.0
    startTime = omp_get_wtime()           # timer start; elapsed-time reporting not shown on the slide
    with openmp("parallel"):
        with openmp("single"):
            sum = piComp(0, NumSteps, step)
    pi = step*sum
    return pi

pi = piFunc(100000000)

• single: one thread does the work, others wait
• task: code block enqueued for execution
• taskwait: wait until the tasks created in the code block finish
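A note on the design (our reading, stated as an assumption): MIN_BLK stops the recursion once a block is small enough that task-creation overhead would outweigh the work, so each leaf task integrates at least 256K steps serially. A hypothetical run:

    # Hypothetical usage: the single construct means one thread starts the recursion;
    # the tasks it spawns are picked up by the rest of the team.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(16)
    pi = piFunc(100000000)
    print(pi)    # ≈ 3.14159265...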
OpenMP subset supported in PyOMP

Fork threads:
  with openmp("parallel"):       Create a team of threads. Execute a parallel region.

Work sharing:
  with openmp("for"):            Use inside a parallel region. Split up a loop across the team.
  with openmp("parallel for"):   A combined construct. Same as a parallel followed by a for.
  with openmp("single"):         One thread does the work. Others wait for it to finish.
  with openmp("task"):           Create an explicit task for the work within the construct.

Synchronization:
  with openmp("taskwait"):       Wait for all tasks created in the current task to complete.
  with openmp("barrier"):        All threads arrive at the barrier before any proceed.
  with openmp("critical"):       Mutual exclusion. One thread at a time executes the code.

Parallel loop support:
  schedule(static [,chunk])      Map blocks of loop iterations across the team. Use with for.
  reduction(op:list)             Combine values with op across the team. Use with for.

Data environment:
  private(list)                  Make a local copy of variables for each thread. Use with parallel, for or task.
  firstprivate(list)             Like private, but initialized with the original value. Use with parallel, for or task.
  shared(list)                   Variables shared between threads. Use with parallel, for or task.
  default(none)                  Force definition of variables as private or shared.

Runtime library:
  omp_get_num_threads()          Return the number of threads in a team.
  omp_get_thread_num()           Return an ID from 0 to the number of threads minus one.
  omp_set_num_threads(int)       Set the number of threads to request for parallel regions.
  omp_get_wtime()                Return a snapshot of the wall clock time.

Environment:
  OMP_NUM_THREADS=N              Environment variable to set the default number of threads.
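A hypothetical sketch (not from the talk; the function name array_sum is ours) exercising a few listed constructs that the pi examples do not use: a for with a static schedule inside a parallel region, and a critical section to combine per-thread partial sums.

    from numba import njit
    import numpy as np
    from numba.openmp import openmp_context as openmp

    @njit
    def array_sum(a):
        total = 0.0
        with openmp("parallel shared(total) private(localSum,i)"):
            localSum = 0.0
            with openmp("for schedule(static, 1024)"):   # deal out chunks of 1024 iterations
                for i in range(a.shape[0]):
                    localSum += a[i]
            with openmp("critical"):                     # one thread at a time updates the shared total
                total += localSum
        return total

    print(array_sum(np.ones(1000000)))    # expect 1000000.0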
The view of Python from an HPC perspective

We know better than the earlier ijk, order-4096 version … the ikj loop order is more cache friendly (the innermost loop streams through contiguous rows of B and C), and we picked a smaller problem:

    for i in range(1000):
        for k in range(1000):
            for j in range(1000):
                C[i][j] += A[i][k]*B[k][j]

Amazon AWS c4.8xlarge spot instance, Intel® Xeon® E5-2666 v3 CPU, 2.9 GHz, 18 cores, 60 GB RAM
PyOMP DGEMM (Mat-Mul with double precision numbers)

from numba import njit
import numpy as np
from numba.openmp import openmp_context as openmp
from numba.openmp import omp_get_wtime

@njit(fastmath=True)
def dgemm(iterations, order):

    # allocate and initialize arrays
    A = np.zeros((order,order))
    B = np.zeros((order,order))
    C = np.zeros((order,order))

    # Assign values to A and B such that
    # the product matrix has a known value.
    for i in range(order):
        A[:,i] = float(i)
        B[:,i] = float(i)

    tInit = omp_get_wtime()
    with openmp("parallel for private(j,k)"):
        for i in range(order):
            for k in range(order):
                for j in range(order):
                    C[i][j] += A[i][k] * B[k][j]
    dgemmTime = omp_get_wtime() - tInit

    # Check result
    checksum = 0.0
    for i in range(order):
        for j in range(order):
            checksum += C[i][j]
    ref_checksum = order*order*order
    ref_checksum *= 0.25*(order-1.0)*(order-1.0)
    eps = 1.e-8
    if abs((checksum - ref_checksum)/ref_checksum) < eps:
        print('Solution validates')
        nflops = 2.0*order*order*order
        print('Rate (MF/s): ', 1.e-6*nflops/dgemmTime)
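A hypothetical driver (the argument values are ours, not from the slide), matching the order-1000 runs summarized on the next slide:

    # Hypothetical invocation: JIT-compile on the first call, then run an order-1000 DGEMM.
    from numba.openmp import omp_set_num_threads

    omp_set_num_threads(16)
    dgemm(1, 1000)       # prints 'Solution validates' and the achieved MF/s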
DGEMM PyOMP vs C-OpenMP

[Chart: average GFLOPS (billions of floating point operations per second) versus number of threads (1, 2, 4, 8, 16) for PyOMP and for C with OpenMP; matrix multiplication, double precision, order = 1000, with error bars (std dev) over 250 runs.]

PyOMP times DO NOT include the one-time JIT cost of ~2 seconds.

Intel® Xeon® E5-2699 v3 CPU, 18 cores, 2.30 GHz, threads mapped to a single CPU, one thread per core, first 16 physical cores.
Intel® icc compiler ver 19.1.3.304 (icc -std=c11 -pthread -O3 -xHOST -qopenmp)
Summary

• We've created a research prototype OpenMP interface in Python called PyOMP.
• It is based on Numba and an OpenMP-enabled LLVM.

• Next steps:
  • We need to carry out detailed benchmarking (Dask, Ray, mpi4py).
  • We need to map PyOMP onto an open-source, publicly available LLVM.
    • Work ongoing in partnership between Intel, ANL, and LLNL.

• Track our progress at: https://github.com/Python-for-HPC/pyomp

[Photo: My Greenlandic skin-on-frame kayak in the middle of Budd Inlet during a negative tide]
