Python, Performance, and GPUs - Towards Data Science

This document summarizes the current state of using GPUs to accelerate Python performance. It discusses several Python libraries like CuPy, RAPIDS, and Numba that provide GPU acceleration. It also discusses how Dask can be used to scale these libraries across multiple GPUs. Improving communication between GPUs using libraries like UCX is an important area of ongoing work to reduce bottlenecks. The document provides an overview of the options available today for Python developers to leverage GPUs and highlights directions of future improvement.

15/08/2019 Python, Performance, and GPUs - Towards Data Science

Python, Performance, and GPUs

A status update for using GPU accelerators from Python

Matthew Rocklin
Jun 28 · 7 min read

This blogpost was delivered in talk form at the recent PASC 2019 conference. Slides for that
talk are here.

Executive Summary
We’re improving the state of scalable GPU computing in Python.

This post lays out the current status and describes future work. It also summarizes and
links to several other blogposts from recent months that drill down into different
topics for the interested reader.

Broadly, we briefly cover the following categories:

Python libraries written in CUDA like CuPy and RAPIDS

Python-CUDA compilers, specifically Numba

Scaling these libraries out with Dask

Network communication with UCX

Packaging with Conda

Performance of GPU accelerated Python Libraries

Probably the easiest way for a Python programmer to get access to GPU performance is
to use a GPU-accelerated Python library. These provide a set of common operations
that are well tuned and integrate well together.

https://towardsdatascience.com/python-performance-and-gpus-1be860ffd58d

Many users know libraries for deep learning like PyTorch and TensorFlow, but there are
several others for more general-purpose computing. These tend to copy the APIs of
popular Python projects:

Numpy on the GPU: CuPy

Numpy on the GPU (again): Jax

Pandas on the GPU: RAPIDS cuDF

Scikit-Learn on the GPU: RAPIDS cuML

These libraries build GPU-accelerated variants of popular Python libraries like NumPy,
Pandas, and Scikit-Learn. To better understand the relative performance
differences, Peter Entschev recently put together a benchmark suite to help with
comparisons. He has produced the following image showing the relative speedup
between GPU and CPU:

There are lots of interesting results there. Peter goes into more depth on this in his
blogpost.

More broadly though, we see that there is variability in performance. Our mental
model for what is fast and slow on the CPU doesn’t necessarily carry over to the GPU.
Fortunately though, due to consistent APIs, users who are familiar with Python can easily
experiment with GPU acceleration without learning CUDA.
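As a minimal sketch of what "consistent APIs" means in practice, the snippet below writes one function against an array module `xp` and runs it with either NumPy or CuPy. The helper names `pick_array_module` and `normalize_rows` are hypothetical and not from the original post; the sketch assumes NumPy is installed and falls back to it whenever CuPy or a CUDA device is unavailable.

```python
import numpy as np


def pick_array_module():
    """Return CuPy if it imports and a CUDA device responds, otherwise NumPy."""
    try:
        import cupy  # optional dependency; may be missing or have no usable GPU
        if cupy.cuda.runtime.getDeviceCount() > 0:
            return cupy
    except Exception:
        pass
    return np


def normalize_rows(xp, n_rows=4, n_cols=3):
    """Scale each row to sum to 1, using only calls shared by NumPy and CuPy."""
    x = xp.arange(1, n_rows * n_cols + 1, dtype=xp.float64).reshape(n_rows, n_cols)
    return x / x.sum(axis=1, keepdims=True)


xp = pick_array_module()
result = normalize_rows(xp)
row_sums = result.sum(axis=1)

# CuPy arrays live on the GPU; .get() copies them back to a NumPy array.
row_sums_host = row_sums.get() if hasattr(row_sums, "get") else row_sums
print(xp.__name__, row_sums_host)
```

The body of `normalize_rows` is the same for both backends; only the module passed in changes, which is what makes this kind of experiment cheap.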


Numba: Compiling Python to CUDA

See also this recent blogpost about Numba stencils and the attached GPU notebook

The built-in operations in GPU libraries like CuPy and RAPIDS cover most common
operations. However, in real-world settings we often find messy situations that require
writing a little bit of custom code. Switching down to C/C++/CUDA in these cases can
be challenging, especially for users that are primarily Python developers. This is where
Numba can come in.

Python has this same problem on the CPU as well. Users often couldn’t be bothered to
learn C/C++ to write fast custom code. To address this there are tools like Cython or
Numba, which let Python programmers write fast numeric code without learning much
beyond the Python language.

For example, Numba accelerates the for-loop style code below about 500x on the CPU,
from slow Python speeds up to fast C/Fortran speeds.

    import numba  # We added these two lines for a 500x speedup

    @numba.jit    # We added these two lines for a 500x speedup
    def sum(x):
        total = 0
        for i in range(x.shape[0]):
            total += x[i]
        return total

The ability to drop down to low-level performant code without context switching out of
Python is useful, particularly if you don’t already know C/C++ or have a compiler
chain set up for you (which is the case for most Python users today).

This benefit is even more pronounced on the GPU. While many Python programmers
know a little bit of C, very few of them know CUDA. Even if they did, they would
probably have difficulty in setting up the compiler tools and development environment.

Enter numba.cuda.jit, Numba’s backend for CUDA. numba.cuda.jit allows Python users
to author, compile, and run CUDA code, written in Python, interactively without
leaving a Python session. Here is an image of writing a stencil computation that
smooths a 2D image, all from within a Jupyter Notebook:


Here is a simplified comparison of Numba CPU/GPU code to compare programming
style. The GPU code gets a 200x speed improvement over a single CPU core.

CPU — 600 ms

    import numba
    import numpy as np

    @numba.jit
    def _smooth(x):
        out = np.empty_like(x)
        for i in range(1, x.shape[0] - 1):
            for j in range(1, x.shape[1] - 1):
                out[i, j] = (x[i-1, j-1] + x[i-1, j+0] + x[i-1, j+1] +
                             x[i+0, j-1] + x[i+0, j+0] + x[i+0, j+1] +
                             x[i+1, j-1] + x[i+1, j+0] + x[i+1, j+1]) // 9
        return out

or if we use the fancy numba.stencil decorator …

    @numba.stencil
    def _smooth(x):
        return (x[-1, -1] + x[-1, 0] + x[-1, 1] +
                x[ 0, -1] + x[ 0, 0] + x[ 0, 1] +
                x[ 1, -1] + x[ 1, 0] + x[ 1, 1]) // 9


GPU — 3 ms

    from numba import cuda

    @cuda.jit
    def smooth_gpu(x, out):
        i, j = cuda.grid(2)  # absolute position of this thread in the 2D grid
        n, m = x.shape
        if 1 <= i < n - 1 and 1 <= j < m - 1:  # skip the one-pixel border
            out[i, j] = (x[i-1, j-1] + x[i-1, j] + x[i-1, j+1] +
                         x[i  , j-1] + x[i  , j] + x[i  , j+1] +
                         x[i+1, j-1] + x[i+1, j] + x[i+1, j+1]) // 9
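A CUDA kernel such as smooth_gpu has to be launched over a grid of thread blocks, a step the listing does not show. The following is a minimal sketch of that step, not the post's own code: the helper `launch_config`, the 16x16 block size, and the 1000x1000 test image are all assumptions chosen for illustration, and the kernel is a compact loop-based variant of the same 3x3 smoothing. The GPU part only runs when Numba is installed and reports a usable CUDA device.

```python
import math


def launch_config(shape, threads_per_block=(16, 16)):
    """Return (blocks_per_grid, threads_per_block) covering every element of shape."""
    blocks_per_grid = tuple(
        math.ceil(n / t) for n, t in zip(shape, threads_per_block)
    )
    return blocks_per_grid, threads_per_block


# Hypothetical image size: 1000 / 16 = 62.5, so 63 blocks are needed per axis.
blocks, threads = launch_config((1000, 1000))
print(blocks, threads)  # (63, 63) (16, 16)

try:
    import numpy as np
    from numba import cuda
    gpu_ready = cuda.is_available()
except Exception:  # Numba or NumPy missing, or the CUDA driver failed to load
    gpu_ready = False

if gpu_ready:
    @cuda.jit
    def smooth_gpu(x, out):
        i, j = cuda.grid(2)
        n, m = x.shape
        if 1 <= i < n - 1 and 1 <= j < m - 1:
            total = 0
            for di in range(-1, 2):      # 3x3 neighbourhood, written as loops
                for dj in range(-1, 2):
                    total += x[i + di, j + dj]
            out[i, j] = total // 9

    x = np.arange(1000 * 1000, dtype=np.int64).reshape(1000, 1000)
    d_x = cuda.to_device(x)              # copy the input to GPU memory
    d_out = cuda.device_array_like(d_x)  # uninitialised output buffer on the GPU
    smooth_gpu[blocks, threads](d_x, d_out)
    out = d_out.copy_to_host()           # border pixels are left unset by the kernel
```

Rounding the block count up means some threads fall outside the image, which is why the kernel keeps its bounds check.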

Numba.cuda.jit has been out in the wild for quite a while now. It’s accessible, mature,
and fun to play with. If you have a machine with a GPU in it and some curiosity then we
strongly recommend that you try it out.

conda install numba

# or
pip install numba

>>> import numba.cuda

Scaling with Dask

As mentioned in previous blogposts (1, 2, 3, 4), we’ve been generalizing Dask to
operate not just with Numpy arrays and Pandas dataframes, but with anything that
looks enough like Numpy (like CuPy or Sparse or Jax) or enough like Pandas (like
RAPIDS cuDF), so that those libraries can scale out too. This is working out well. Here is a brief
video showing Dask Array computing an SVD in parallel, and what happens
when we swap out the Numpy library for CuPy.

Integrating GPUs into Anaconda OSS Ecosystem


We see that there is about a 10x speed improvement on the computation. Most
importantly, we were able to switch between a CPU implementation and a GPU
implementation with a small one-line change, while continuing to use the sophisticated
algorithms in Dask Array, like its parallel SVD implementation.

We also saw a relative slowdown in communication. In general, almost all non-trivial
Dask + GPU work today is becoming communication-bound. We’ve gotten fast enough
at computation that the relative importance of communication has grown significantly.
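A little arithmetic shows why faster computation makes a workload communication-bound. In the sketch below, the 10x speedup is the figure quoted above, while the 10 s of CPU computation and 1 s of communication are made-up numbers for illustration only, not measurements from the post.

```python
def communication_share(compute_s, communicate_s):
    """Fraction of total runtime spent communicating."""
    return communicate_s / (compute_s + communicate_s)


cpu_compute_s = 10.0   # hypothetical time spent computing on the CPU
communicate_s = 1.0    # hypothetical time spent moving data, unchanged by the GPU
gpu_speedup = 10.0     # the roughly 10x computation speedup quoted in the text

before = communication_share(cpu_compute_s, communicate_s)
after = communication_share(cpu_compute_s / gpu_speedup, communicate_s)
print(f"{before:.0%} -> {after:.0%}")  # 9% -> 50%
```

Under these assumed numbers, communication grows from about a tenth of the runtime to half of it without getting any slower in absolute terms.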

Communication with UCX

See this talk by Akshay Venkatesh or view the slides

Also see this recent blogpost about UCX and Dask

We’ve been integrating the OpenUCX library into Python with UCX-Py. UCX provides
uniform access to transports like TCP, InfiniBand, shared memory, and NVLink. With
UCX-Py, many of these transports are easily accessible from the Python language
for the first time.

Using UCX and Dask together we’re able to get significant speedups. Here is a trace of
the earlier SVD computation, both before and after adding UCX:

Before UCX:

After UCX:


There is still a great deal to do here though (the blogpost linked above has several items
in the Future Work section).

People can try out UCX and UCX-Py with highly experimental conda packages:

    conda create -n ucx -c conda-forge -c jakirkham/label/ucx \
        cudatoolkit=9.2 ucx-proc=*=gpu ucx ucx-py python=3.7

We hope that this work will also benefit non-GPU users on HPC systems with InfiniBand,
or even users on consumer hardware, thanks to the easy access to shared memory
communication.

Packaging
In an earlier blogpost we discussed the challenges around installing the wrong versions
of CUDA-enabled packages that don’t match the CUDA driver installed on the system.
Fortunately, due to recent work from Stan Seibert and Michael Sarahan at Anaconda,
Conda 4.7 now has a special cuda meta-package that is set to the version of the
installed driver. This should make it much easier for users in the future to install the
correct package.

Conda 4.7 was just released, and comes with many new features other than the cuda
meta-package. You can read more about it here.

conda update conda

There is still plenty of work to do in the packaging space today. Everyone who builds
conda packages does it their own way, resulting in headache and heterogeneity. This is
largely due to not having centralized infrastructure to build and test CUDA-enabled
packages, like we have in Conda Forge. Fortunately, the Conda Forge community is
working together with Anaconda and NVIDIA to help resolve this, though that will
likely take some time.

Summary
This post gave an update on the status of some of the efforts behind GPU computing in
Python. It also provided a variety of links for further reading. We include them below if
you would like to learn more:

Slides

Numpy on the GPU: CuPy

Numpy on the GPU (again): Jax

Pandas on the GPU: RAPIDS cuDF

Scikit-Learn on the GPU: RAPIDS cuML

Benchmark suite

Numba CUDA JIT notebook

A talk on UCX

A blogpost on UCX and Dask

Conda 4.7
