A wonderful presentation on CUDA programming with the Python programming language, in PDF format. This is not my original work; it is freely available elsewhere. Don't pay anyone but the author! And since he's so generous, you don't even have to pay him. ;-)
Some slides are from Mark Harris (NVIDIA) and Andreas Klöckner (NYU). © 2013 NVIDIA Corporation.

Why Python? Rapid development, powerful libraries, commercial support, a large community.

Is Python fast enough? Python apps often implement performance-critical functions in C/C++.

Three Python projects:
PyCUDA/PyOpenCL (Andreas Klöckner): bindings for the GPU runtimes, intended to be used with runtime code generation.
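To make "runtime code generation" concrete, here is a minimal sketch of the idea in plain Python: because the kernel source is just a string, it can be specialized at run time before being handed to the GPU compiler (e.g. pycuda.compiler.SourceModule on a CUDA machine). The `scale` kernel and `make_kernel_source` helper are illustrative assumptions, not code from the slides.

```python
# Hypothetical runtime-code-generation sketch: the CUDA C kernel source is
# an ordinary Python string with %-substitution placeholders, specialized
# just before compilation.

KERNEL_TEMPLATE = """
__global__ void scale(%(dtype)s *x, %(dtype)s a, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        x[i] = a * x[i];
}
"""

def make_kernel_source(dtype="float"):
    """Specialize the kernel source for a concrete element type."""
    return KERNEL_TEMPLATE % {"dtype": dtype}

# Generate a double-precision variant of the kernel at run time:
src = make_kernel_source("double")
```

On a machine with PyCUDA installed, `src` would then be passed to the compiler; the generation step itself needs no GPU at all, which is what makes this style of metaprogramming so convenient from Python.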
NumbaPro (Continuum Analytics): write CUDA code in Python; GPU bindings.
Copperhead (Bryan Catanzaro): a data-parallel Python dialect, runtime-compiled to GPUs and CPUs.

PyCUDA: Programming Approaches (Andreas Klöckner, "GPU Programming in Python")
Decisions that determine your approach to throughput computing: ahead-of-time (AOT) vs. just-in-time (JIT) compilation; metaprogramming vs. not; in-language vs. hybrid. And if hybrid, why not use a scripting language?

Why do scripting for GPUs?
GPUs are everything that scripting languages are not: highly parallel, very architecture-sensitive, built for maximum FP/memory throughput. The two complement each other: the CPU is largely restricted to control tasks (on the order of 1000/sec), for which scripting is fast enough. Python + OpenCL = PyOpenCL; Python + CUDA = PyCUDA.

Dive into PyCUDA
(The code on this slide, a numpy-interop, kernel-launch "hello world", was garbled in extraction; the recoverable lines are:)

import pycuda.autoinit
import pycuda.driver as drv
import numpy

...

print dest-a*b

PyCUDA/PyOpenCL Philosophy
Provide complete access. Automatically manage resources. Provide abstractions. Allow interactive use. Check for and report errors automatically. Integrate tightly with numpy.

PyCUDA/PyOpenCL: Completeness
PyCUDA exposes all of the CUDA driver API, for example streams/events, surfaces/textures, peer-to-peer access, pinned memory, and profiling. PyOpenCL exposes all of OpenCL.

PyOpenCL, PyCUDA: Workflow
Edit -> Run -> Program("...") -> in cache? if not, Compiler -> Binary -> Upload to GPU -> Run on GPU.

Metaprogramming
In GPU scripting, GPU code does not need to be a compile-time constant. (Key: code is data; it wants to be reasoned about at run time.) Good for code generation: Python code -> GPU code -> GPU compiler -> GPU binary -> GPU result.

How to metaprogram in PyCUDA/PyOpenCL
Three (main) ways of generating code: simple %-operator substitution (combined with the C preprocessor: simple, often sufficient); a templating engine (Mako works very well); or codepy, which builds C syntax trees from Python and generates readable, indented C. There are many ways of evaluating the generated code; the most important one is exact device timing via events.

Other nice things: elementwise functions very similar to numpy ufuncs; reductions and scans; gpuarray with overloaded arithmetic operators; random number generators.

PyOpenCL, PyCUDA: Vital Information
http://mathema.tician.de/software/pyopencl (or /pycuda)
Downloads: direct: PyOpenCL 60k, PyCUDA 30k; binaries for Windows, Debian, Arch, Fedora, Gentoo, ...
MIT license. Compiler cache, RAII, error checking. Requires numpy, Python 2.4+ (Windows/OS X/Linux).
Community: mailing list, wiki, add-on packages (PyFFT, scikits.cuda, Sailfish, PyWENO, Copperhead, ...).

NumbaPro (part of Anaconda Accelerate, from Continuum Analytics)
NumbaPro is an array-oriented compiler for Python and NumPy: compile Python for GPUs or CPUs. Automatically compile Python functions on NumPy arrays, or write CUDA Python kernels for maximum performance. Fast development + fast execution: the ideal combination. http://continuum.io (free academic license).

1024x1024 Mandelbrot    Time      Speedup vs. pure Python
Pure Python             4.85 s    --
NumbaPro (CPU)          0.11 s    44x
CUDA Python (K20)       0.004 s   1221x

@cuda.jit(restype=uint32, argtypes=[f8, f8, uint32], device=True)
def mandel(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z*z + c
        if (z.real*z.real + z.imag*z.imag) >= 4:
            return i
    return max_iters
@cuda.jit(argtypes=[uint8[:,:], f8, f8, f8, f8, uint32])
def mandel_kernel(img, min_x, max_x, min_y, max_y, iters):
    x, y = cuda.grid(2)
    if x < img.shape[0] and y < img.shape[1]:
        img[y, x] = mandel(min_x + x*((max_x - min_x)/img.shape[0]),
                           min_y + y*((max_y - min_y)/img.shape[1]),
                           iters)
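A nice property of this style, worth making explicit (this check is an addition, not from the slides): the escape-time logic in mandel() is ordinary Python once the @cuda.jit decorator is removed, so it can be validated on the CPU with no GPU present.

```python
# Plain-Python version of the mandel() device function above; the body is
# identical, only the @cuda.jit decorator is gone. Useful as a CPU
# reference when validating the GPU kernel's output.
def mandel_cpu(x, y, max_iters):
    c = complex(x, y)
    z = 0.0j
    for i in range(max_iters):
        z = z * z + c
        if (z.real * z.real + z.imag * z.imag) >= 4:
            return i          # escaped after i iterations
    return max_iters          # never escaped: treated as inside the set

# A point inside the Mandelbrot set exhausts the iteration budget:
inside = mandel_cpu(0.0, 0.0, 20)    # -> 20
# A point far outside escapes on the first iteration:
outside = mandel_cpu(2.0, 2.0, 20)   # -> 0
```

Comparing such a CPU reference against the kernel's output array is a cheap way to catch indexing or coordinate-mapping bugs in mandel_kernel before chasing performance numbers.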
Note: Copperhead is a research project, not a product.

Copperhead: Python Data Parallelism
Born of a need for productivity: Copperhead code is just Python code. No C-isms, no annotations. http://copperhead.github.io

The "hello world" of data parallelism
Consider this intrinsically parallel procedure:

def axpy(a, x, y):
    return map(lambda xi, yi: a*xi + yi, x, y)

or, for the lambda-averse:

def axpy(a, x, y):
    return [a*xi + yi for xi, yi in zip(x, y)]

This procedure is both completely valid Python code and compilable to data-parallel substrates (CUDA, OpenCL, OpenMP + AVX intrinsics, etc.).

Support for Heterogeneity
The programmer specifies the execution place.
Currently supported places: CUDA, OpenMP, TBB, sequential C++. For example:

with places.gpu0:
    gpu_result = axpy(...)

with places.openmp:
    cpu_result = axpy(...)

Runtime Data Management
The Copperhead runtime manages all data; data is lazily transferred to and from memory spaces.
Memory is garbage-collected via Python's garbage collector. Data interoperates with numpy, matplotlib, etc.

(Diagram: values a, b, c, d migrating between the CPU and GPU memory spaces, driven by code along these lines; the initial literal did not survive extraction:)

a = ...
b = foo(a)
c = foo(b)
d = foo(c)
print(d)

Runtime Code Generation
The Copperhead compiler produces C++ code, which is compiled to a dynamic library using codepy. Compilation artifacts are persistently stored in __pycache__. Runtime overhead: ~10-100 microseconds (from Python, per function call).
(Charts: compile time in seconds (0-12), and per-call runtime overhead in seconds (up to ~1.4e-4), for the "Minimal" and "Black Scholes" examples.)
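Since Copperhead's axpy hello world earlier in the deck is plain Python, its semantics can be exercised with no GPU and no Copperhead installation at all; this small demo (an addition, not from the slides) shows the behavior any data-parallel backend must reproduce.

```python
# The data-parallel "hello world" in its lambda-averse spelling; ordinary
# Python that runs as-is on the CPU.
def axpy(a, x, y):
    return [a * xi + yi for xi, yi in zip(x, y)]

# Elementwise a*x + y:
result = axpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0])
# result == [12.0, 24.0, 36.0]
```

This is exactly the contract the Copperhead compiler preserves when it retargets the same function to CUDA, OpenMP, or TBB via the places mechanism shown above.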