Python, Performance, and GPUs - Towards Data Science
Python, Performance, and GPUs - Towards Data Science
Matthew Rocklin
Jun 28 · 7 min read
This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that
talk are here.
Executive Summary
We’re improving the state of scalable GPU computing in Python.
This post lays out the current status, and describes future work. It also summarizes and
links to several other more blogposts from recent months that drill down into different
topics for the interested reader.
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 1/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
Many users know libraries for deep learning like PyTorch and TensorFlow, but there are
several other for more general purpose computing. These tend to copy the APIs of
popular Python projects:
These libraries build GPU accelerated variants of popular Python libraries like NumPy,
Pandas, and Scikit-Learn. In order to better understand the relative performance
differences Peter Entschev recently put together a benchmark suite to help with
comparisons. He has produced the following image showing the relative speedup
between GPU and CPU:
There are lots of interesting results there. Peter goes into more depth in this in his
blogpost.
More broadly though, we see that there is variability in performance. Our mental
model for what is fast and slow on the CPU doesn’t neccessarily carry over to the GPU.
Fortunately though, due consistent APIs, users that are familiar with Python can easily
experiment with GPU acceleration without learning CUDA.
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 2/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
The built-in operations in GPU libraries like CuPy and RAPIDS cover most common
operations. However, in real-world settings we often find messy situations that require
writing a little bit of custom code. Switching down to C/C++/CUDA in these cases can
be challenging, especially for users that are primarily Python developers. This is where
Numba can come in.
Python has this same problem on the CPU as well. Users often couldn’t be bothered to
learn C/C++ to write fast custom code. To address this there are tools like Cython or
Numba, which let Python programmers write fast numeric code without learning much
beyond the Python language.
For example, Numba accelerates the for-loop style code below about 500x on the CPU,
from slow Python speeds up to fast C/Fortran speeds.
The ability to drop down to low-level performant code without context switching out of
Python is useful, particularly if you don’t already know C/C++ or have a compiler
chain set up for you (which is the case for most Python users today).
This benefit is even more pronounced on the GPU. While many Python programmers
know a little bit of C, very few of them know CUDA. Even if they did, they would
probably have difficulty in setting up the compiler tools and development environment.
Enter numba.cuda.jit Numba’s backend for CUDA. Numba.cuda.jit allows Python users
to author, compile, and run CUDA code, written in Python, interactively without
leaving a Python session. Here is an image of writing a stencil computation that
smoothes a 2d-image all from within a Jupyter Notebook:
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 3/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
CPU — 600 ms
@numba.jit
def _smooth(x):
out = np.empty_like(x)
for i in range(1, x.shape[0] - 1):
for j in range(1, x.shape[1] - 1):
out[i,j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1])//9
return out
@numba.stencil
def _smooth(x):
return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 4/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
GPU — 3 ms
@numba.cuda.jit
def smooth_gpu(x, out):
i, j = cuda.grid(2)
n, m = x.shape
if 1 <= i < n - 1 and 1 <= j < m - 1:
out[i, j] = (x[i-1, j-1] + x[i-1, j] + x[i-1, j+1] +
x[i , j-1] + x[i , j] + x[i , j+1] +
x[i+1, j-1] + x[i+1, j] + x[i+1, j+1]) // 9
Numba.cuda.jit has been out in the wild for quite a while now. It’s accessible, mature,
and fun to play with. If you have a machine with a GPU in it and some curiosity then we
strongly recommend that you try it out.
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 5/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
We see that there is about a 10x speed improvement on the computation. Most
importantly, we were able to switch between a CPU implementation and a GPU
implementation with a small one-line change, but continue using the sophisticated
algorithms with Dask Array, like it’s parallel SVD implementation.
We’ve been integrating the OpenUCX library into Python with UCX-Py. UCX provides
uniform access to transports like TCP, InfiniBand, shared memory, and NVLink. UCX-Py
is the first time that access to many of these transports has been easily accessible from
the Python language.
Using UCX and Dask together we’re able to get significant speedups. Here is a trace of
the SVD computation from before both before and after adding UCX:
Before UCX:
After UCX:
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 6/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
There is still a great deal to do here though (the blogpost linked above has several items
in the Future Work section).
People can try out UCX and UCX-Py with highly experimental conda packages:
We hope that this work will also affect non-GPU users on HPC systems with Infiniband,
or even users on consumer hardware due to the easy access to shared memory
communication.
Packaging
In an earlier blogpost we discussed the challenges around installing the wrong versions
of CUDA enabled packages that don’t match the CUDA driver installed on the system.
Fortunately due to recent work from Stan Seibert and Michael Sarahan at Anaconda,
Conda 4.7 now has a special cuda meta-package that is set to the version of the
installed driver. This should make it much easier for users in the future to install the
correct package.
Conda 4.7 was just releasead, and comes with many new features other than the cuda
There is still plenty of work to do in the packaging space today. Everyone who builds
conda packages does it their own way, resulting in headache and heterogeneity. This is
largely due to not having centralized infrastructure to build and test CUDA enabled
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 7/8
15/08/2019 Python, Performance, and GPUs - Towards Data Science
packages, like we have in Conda Forge. Fortunately, the Conda Forge community is
working together with Anaconda and NVIDIA to help resolve this, though that will
likely take some time.
Summary
This post gave an update of the status of some of the efforts behind GPU computing in
Python. It also provided a variety of links for future reading. We include them below if
you would like to learn more:
Slides
Benchmark suite
A talk on UCX
Conda 4.7
https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d 8/8