A wonderful presentation on CUDA programming with the Python programming language, in PDF format. This is not my original work; it is freely available elsewhere. Don't pay anyone but the author! And since he's so generous, you don't even have to pay him. ;-)
Some slides are from Mark Harris (NVIDIA) and Andreas Klöckner (NYU). © 2013 NVIDIA Corporation.

Why Python? Rapid development, powerful libraries, commercial support, a large community.

Is Python fast enough? Python apps often implement performance-critical functions in C/C++.

Three Python projects:
PyCUDA/PyOpenCL (Andreas Klöckner): bindings for the GPU runtimes, intended to be used with runtime code generation.
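To make "runtime code generation" concrete, here is a minimal sketch of the idea in plain Python: because the kernel source is just a string, it can be specialized at run time before being handed to the GPU compiler (e.g. pycuda.compiler.SourceModule on a CUDA machine). The `scale` kernel and `make_kernel_source` helper are illustrative assumptions, not code from the slides.

```python
# Hypothetical runtime-code-generation sketch: the CUDA C kernel source is
# an ordinary Python string with %-substitution placeholders, specialized
# just before compilation.

KERNEL_TEMPLATE = """
__global__ void scale(%(dtype)s *x, %(dtype)s a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = a * x[i];
}
"""

def make_kernel_source(dtype="float"):
    """Specialize the kernel source for a concrete element type."""
    return KERNEL_TEMPLATE % {"dtype": dtype}

# Generate a double-precision variant of the kernel at run time:
src = make_kernel_source("double")
```

On a machine with PyCUDA installed, `src` would then be passed to the compiler; the generation step itself needs no GPU at all, which is what makes this style of metaprogramming so convenient from Python.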
NumbaPro (Continuum Analytics): write CUDA code in Python; GPU bindings.
Copperhead (Bryan Catanzaro): a data-parallel Python dialect, runtime-compiled to GPUs and CPUs.

PyCUDA: Programming Approaches (Andreas Klöckner, "GPU Programming in Python")
Decisions that determine your approach to throughput computing: ahead-of-time (AOT) vs. just-in-time (JIT) compilation; metaprogramming vs. not; in-language vs. hybrid. And if hybrid, why not use a scripting language?

Why do scripting for GPUs?
GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum FP/memory throughput. The two complement each other: the CPU is largely restricted to control tasks (on the order of 1000/sec), for which scripting is fast enough. Python + OpenCL = PyOpenCL; Python + CUDA = PyCUDA.

Dive into PyCUDA
(The code on this slide, a numpy-interop, kernel-launch "hello world", was garbled in extraction; the recoverable lines are:)

import pycuda.autoinit
import pycuda.driver as drv
import numpy

...

print dest-a*b

PyCUDA/PyOpenCL Philosophy
Provide complete access. Automatically manage resources. Provide abstractions. Allow interactive use. Check for and report errors automatically. Integrate tightly with numpy.

PyCUDA/PyOpenCL: Completeness
PyCUDA exposes all of the CUDA driver API, for example streams/events, surfaces/textures, peer-to-peer access, pinned memory, and profiling. PyOpenCL exposes all of OpenCL.

PyOpenCL, PyCUDA: Workflow
Edit -> Run -> Program("...") -> in cache? if not, Compiler -> Binary -> Upload to GPU -> Run on GPU.

Metaprogramming
In GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.) Good for code generation: Python code -> GPU code -> GPU compiler -> GPU binary -> GPU result.

How to metaprogram in PyCUDA/PyOpenCL
Three (main) ways of generating code: simple %-operator substitution (combined with the C preprocessor: simple, often sufficient); a templating engine (Mako works very well); or codepy, which builds C syntax trees from Python and generates readable, indented C. There are many ways of evaluating the generated code; the most important one is exact device timing via events.

Other nice things: elementwise functions very similar to numpy ufuncs; reductions and scans; gpuarray with overloaded arithmetic operators; random number generators.

PyOpenCL, PyCUDA: Vital Information
http://mathema.tician.de/software/pyopencl (or /pycuda)
Downloads: direct: PyOpenCL 60k, PyCUDA 30k; binaries for Windows, Debian, Arch, Fedora, Gentoo, ...
MIT license. Compiler cache, RAII, error checking. Requires numpy, Python 2.4+ (Windows/OS X/Linux).
Community: mailing list, wiki, add-on packages (PyFFT, scikits.cuda, Sailfish, PyWENO, Copperhead, ...).

NumbaPro (part of Anaconda Accelerate, from Continuum Analytics)
NumbaPro is an array-oriented compiler for Python and NumPy: compile Python for GPUs or CPUs. Automatically compile Python functions on NumPy arrays, or write CUDA Python kernels for maximum performance. Fast development + fast execution: the ideal combination. http://continuum.io (free academic license).

1024x1024 Mandelbrot    Time      Speedup vs. pure Python
Pure Python             4.85 s    --
NumbaPro (CPU)          0.11 s    44x
CUDA Python (K20)       0.004 s   1221x

@cuda.jit(restype=uint32, argtypes=[f8, f8, uint32], device=True)
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return max_iters
@cuda.jit(argtypes=[uint8[:,:], f8, f8, f8, f8, uint32])
def mandel_kernel(img, min_x, max_x, min_y, max_y, iters):
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:
        img[y, x] = mandel(min_x + x*((max_x - min_x)/img.shape[0]),
                           min_y + y*((max_y - min_y)/img.shape[1]),
                           iters)
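A nice property of this style, worth making explicit (this check is an addition, not from the slides): the escape-time logic in mandel() is ordinary Python once the @cuda.jit decorator is removed, so it can be validated on the CPU with no GPU present.

```python
# Plain-Python version of the mandel() device function above; the body is
# identical, only the @cuda.jit decorator is gone. Useful as a CPU
# reference when validating the GPU kernel's output.
def mandel_cpu(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i          # escaped after i iterations
    return max_iters          # never escaped: treated as inside the set

# A point inside the Mandelbrot set exhausts the iteration budget:
inside = mandel_cpu(0.0, 0.0, 20)    # -> 20
# A point far outside escapes on the first iteration:
outside = mandel_cpu(2.0, 2.0, 20)   # -> 0
```

Comparing such a CPU reference against the kernel's output array is a cheap way to catch indexing or coordinate-mapping bugs in mandel_kernel before chasing performance numbers.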
Note: Copperhead is a research project, not a product.

Copperhead: Python Data Parallelism
Born of a need for productivity: Copperhead code is just Python code. No C-isms, no annotations. http://copperhead.github.io

The "hello world" of data parallelism
Consider this intrinsically parallel procedure:

def axpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

or, for the lambda-averse:

def axpy(a, x, y):
    return [a*xi + yi for xi, yi in zip(x, y)]

This procedure is both completely valid Python code and compilable to data-parallel substrates (CUDA, OpenCL, OpenMP + AVX intrinsics, etc.).

Support for Heterogeneity
The programmer specifies the execution place.
Currently supported places: CUDA, OpenMP, TBB, sequential C++. For example:

with places.gpu0:
    gpu_result = axpy(...)

with places.openmp:
    cpu_result = axpy(...)

Runtime Data Management
The Copperhead runtime manages all data; data is lazily transferred to and from memory spaces.
Memory is garbage-collected via Python's garbage collector. Data interoperates with numpy, matplotlib, etc.

(Diagram: values a, b, c, d migrating between the CPU and GPU memory spaces, driven by code along these lines; the initial literal did not survive extraction:)

a = ...
b = foo(a)
c = foo(b)
d = foo(c)
print(d)

Runtime Code Generation
The Copperhead compiler produces C++ code, which is compiled to a dynamic library using codepy. Compilation artifacts are persistently stored in __pycache__. Runtime overhead: ~10-100 microseconds (from Python, per function call).
(Charts: compile time in seconds (0-12), and per-call runtime overhead in seconds (up to ~1.4e-4), for the "Minimal" and "Black Scholes" examples.)
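Since Copperhead's axpy hello world earlier in the deck is plain Python, its semantics can be exercised with no GPU and no Copperhead installation at all; this small demo (an addition, not from the slides) shows the behavior any data-parallel backend must reproduce.

```python
# The data-parallel "hello world" in its lambda-averse spelling; ordinary
# Python that runs as-is on the CPU.
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# Elementwise a*x + y:
result = axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# result == [12.0, 24.0, 36.0]
```

This is exactly the contract the Copperhead compiler preserves when it retargets the same function to CUDA, OpenMP, or TBB via the places mechanism shown above.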