
Accelerating Python on GPUs

NVIDIA Webinar
9th October 2024
Paul Graham, Senior Solutions Architect, NVIDIA
[email protected]
NVIDIA: GPU Computing | Computer Graphics | Artificial Intelligence


GPUs: The basics
Million-X Speedup for Innovation and Discovery
Simulation + AI

[Chart: "Million-X Speedup for Innovation and Discovery" – single-threaded CPU performance growth has slowed from ~1.5X to ~1.1X per year (1980–2020), while accelerated computing, scale up & out, and machine learning combine to deliver million-X speedups (y-axis to 10^9). Example applications: climate change (FourCastNet), digital biology (OrbNet), renewable energy (SGTC), industrial HPC (multi-disciplinary physics).]


Small Changes, Big Speed-up

[Diagram: the application's compute-intensive functions are moved to the GPU, while the rest of the sequential code remains on the CPU.]
GH100 GPU Architecture
https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

• 132 SMs (Streaming Multiprocessors) in the H100 SXM5 GPU

• > 16,000 FP32 cores in total

• We usually want the number of threads >> the number of cores

• So we need a lot of threads! A rough launch-configuration calculation is sketched below.
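As a rough, illustrative calculation (the array size below is hypothetical, not from the slides), a sketch of how a one-thread-per-element launch easily produces far more threads than cores:

import math

num_sms = 132                      # SMs in the H100 SXM5
fp32_cores = num_sms * 128         # 128 FP32 cores per SM -> 16,896 in total

n = 64 * 1024 * 1024               # hypothetical 64M-element array, one thread per element
threads_per_block = 256
blocks = math.ceil(n / threads_per_block)

print(f"{blocks} blocks x {threads_per_block} threads = {blocks * threads_per_block} threads")
print(f"~{(blocks * threads_per_block) // fp32_cores} threads per FP32 core")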


Multi-die Multi-chip Multi-node
The CUDA Platform
Target the abstraction layer that works best for your application

• Application ecosystem – frameworks and developer SDKs: PyTorch, TensorFlow, JAX, Modulus, Triton, ...; SDKs for medical devices, energy, autonomous vehicles, ...

• NVIDIA accelerated libraries: cuBLAS, cuSPARSE, cuTENSOR, cuSOLVER, cuRAND, cuFFT, Math API, NPP (example below)

• Languages: standard parallel ISO C++, OpenACC, OpenMP, Python, Julia, MATLAB, CUDA Fortran, CUDA C++, OpenCL, Ada, Haskell, R, ...

• Compilation stack: NVVM / LLVM IR -> PTX assembly ISA
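As a small illustration of targeting the library layer from Python (a sketch, assuming CuPy is installed; CuPy's FFT routines are backed by cuFFT):

import cupy as cp

x = cp.random.rand(1 << 20).astype(cp.complex64)   # data already on the GPU
X = cp.fft.fft(x)                                  # forward FFT, executed by cuFFT
x_back = cp.fft.ifft(X)                            # inverse FFT, also on the GPU

print(cp.allclose(x, x_back, atol=1e-3))           # round-trip check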
Tensor Cores
Hardware for Matrix Multiply and Accumulate (MMA) operations

• Perform several MMA calculations per clock cycle

• Introduced in the V100 (Volta)
  • FP16 multiply, FP32 accumulate (FP32 in, FP32 out)
• Turing added INT8, INT4 and INT1 calculations
• Ampere (A100)
  • Full FP64 MMA
  • BFloat16, TensorFloat-32 (TF32)
• Hopper (H100)
  • FP8
  • Transformer Engine
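As an illustration of how frameworks exploit these units, a minimal PyTorch sketch (PyTorch is an assumed choice here, not prescribed by the slide) of a mixed-precision matrix multiply of the kind that is dispatched to Tensor Core GEMM kernels:

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

c = a @ b   # FP16 multiply-accumulate, mapped onto Tensor Cores on Volta and newer GPUs

torch.backends.cuda.matmul.allow_tf32 = True   # on Ampere+, FP32 matmuls may use TF32 Tensor Cores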
cuDNN Library, CUTLASS
Exploiting Tensor Cores

cuDNN – accelerating deep learning primitives
Key features:
• Tensor Core acceleration for all popular convolutions
• Supports FP32, FP16, BF16 and TF32 floating point formats and INT8 and UINT8 integer formats
• Arbitrary dimension ordering, striding, and sub-regions for 4D tensors means easy integration into any neural net implementation

CUTLASS – Tensor Core programming model
• Warp-level GEMM and reusable components for linear algebra kernels in CUDA
• Has Python interfaces
Frameworks and Libraries
Many Frameworks … All With Python Support
NVIDIA LaunchPad — free hands-on labs

• cuOpt – accelerated optimisation engine, e.g. for logistics and route optimisation
• Isaac Sim – robotics simulation toolkit for building virtual worlds for training robots
• Riva – speech AI services: transcription, translation, text to voice, …
• Clara – AI-powered solutions for healthcare and life sciences, e.g. genomics, medical instruments
• Holoscan – acceleration of sensor data processing pipelines
• Merlin – end-to-end system for recommender frameworks
• NeMo – framework for building and deploying generative AI models
• TAO Toolkit – for transfer learning
• DeepStream SDK – for streaming IVA applications
• Modulus – PyTorch-based framework for physics-informed neural networks (PINNs)
• ...

[Images: Isaac Sim; NVIDIA Modulus and Omniverse]


RAPIDS
GPU-accelerated data science workflow (RAPIDS.ai)

[Pipeline: DATA -> DATA PREPARATION (ETL) -> MODEL TRAINING -> VISUALIZATION -> PREDICTIONS]

• cuDF: Python drop-in pandas replacement built on CUDA; GPU-accelerated Spark
• cuML: GPU acceleration of popular ML algorithms, e.g. XGBoost, with an easy-to-adopt, scikit-learn-like interface (see the sketch below)
• Visualization: effortless exploration of datasets, billions of records in milliseconds; dynamic interaction with data = faster ML model development
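A minimal cuML sketch of the scikit-learn-like interface (the data and parameters here are illustrative):

import cupy as cp
from cuml.cluster import KMeans

X = cp.random.rand(1_000_000, 16, dtype=cp.float32)   # illustrative data, already on the GPU

km = KMeans(n_clusters=8)     # same look and feel as sklearn.cluster.KMeans
km.fit(X)
labels = km.predict(X)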
Announcing: RAPIDS cuDF Accelerates pandas with Zero Code Change
World's fastest data analytics with pandas

• 150x faster than CPU-only pandas with zero code change (DuckDB Data Benchmark, 5 GB; NVIDIA Grace Hopper vs. Intel Xeon Platinum 8480CL CPU)
• Unified workflow on CPUs and GPUs across laptops, workstations and the datacenter
• Compatible with third-party libraries built on pandas
• Available today in open beta; NVIDIA AI Enterprise support coming soon

[Chart: DuckDB Data Benchmark timings, pandas on CPU vs. pandas with RAPIDS cuDF on Grace Hopper – Join: 5 min 30 s vs. 1 s; Advanced Groupby: 4 min 45 s vs. 2 s]

• See also cuGraph – focussed on GPU-accelerated graph analytics, including GNNs and NetworkX: blog
• Has a zero-code-change backend for NetworkX, nx-cugraph

A sketch of enabling the zero-code-change accelerator follows below.
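A minimal sketch of the zero-code-change path (assuming a recent cuDF build with the cudf.pandas accelerator installed); the pandas code itself is unchanged:

# In a Jupyter notebook:  %load_ext cudf.pandas   (run before importing pandas)
# In a plain Python script:
import cudf.pandas
cudf.pandas.install()

import pandas as pd   # now GPU-accelerated where possible, with CPU fallback otherwise

df = pd.DataFrame({"key": [1, 2, 1, 2], "val": [10.0, 20.0, 30.0, 40.0]})
print(df.groupby("key")["val"].mean())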


** NEW **

nvmath-python (open beta)
• Bringing NVIDIA maths libraries to the Python ecosystem: performance, productivity, interoperability
  o cuBLAS, cuFFT, … without the need for C/C++ bindings
  o Kernel fusion for efficiency
  o Kernel autotuning
  o Interoperable – e.g. can pass PyTorch data objects directly to the maths libraries
  o Supports Python logging
• Intro notebooks: Colab | GitHub
• Demo from GTC session "Deep Dive into Math Libraries"

Polars now GPU accelerated
• Python DataFrame library
  o Aimed at 10s–100s of GB workloads on a single machine
  o Now accelerated on NVIDIA GPUs (see the sketch below)
  o Up to 13x speed-up over CPUs
  o Makes use of cuDF from RAPIDS
• Technical blog
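A minimal Polars sketch of opting into the GPU engine (assuming Polars is installed with its GPU support, which pulls in cuDF); the query itself is ordinary lazy Polars:

import polars as pl

lf = pl.LazyFrame({"key": [1, 2, 1, 2], "val": [10.0, 20.0, 30.0, 40.0]})

result = (
    lf.group_by("key")
      .agg(pl.col("val").mean())
      .collect(engine="gpu")   # execute on the GPU; unsupported queries fall back to the CPU engine
)
print(result)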
Programming directly for GPUs

Programming the NVIDIA Platform
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES (language features and drop-in libraries)
ISO C++, ISO Fortran, CuPy, cuNumeric

std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });

do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
enddo

import cunumeric as np
...
def saxpy(a, x, y):
    y[:] += a*x

INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP, Numba

#pragma acc data copy(x,y)
{
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){ return y + a*x; });
    ...
}

@vectorize(['float64(float64, float64, float64)'], target='cuda')
def saxpy_ufunc(a, x, y):
    return a*x + y

PLATFORM SPECIALIZATION
CUDA, Numba, PyCUDA

@cuda.jit(void(float32, float32[:], float32[:], float32[:]))
def saxpy(a, x, y, out):
    idx = cuda.grid(1)
    out[idx] = a * x[idx] + y[idx]

mod = SourceModule("""
__global__
void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}
""")

ACCELERATION LIBRARIES
Core | Math | Communication | Data Analytics | AI | Quantum
CuPy – NumPy-Compatible Library for GPUs

Key Features
• Supports a subset of the numpy.ndarray interface
• Also makes use of NVIDIA libraries: cuBLAS, cuRAND, cuSOLVER, …
• Can make use of Unified Memory

CPU (NumPy)

import numpy as np

def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

GPU (CuPy)

import cupy as cp

def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = cp.random.rand(1024, 2048)
y = cp.random.rand(1024, 2048)

result = saxpy(a, x, y)
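One practical point not shown above: CuPy arrays live in GPU memory, so an explicit copy is needed when a NumPy array is required on the host. A short sketch:

import numpy as np
import cupy as cp

x_host = np.random.rand(1024, 2048)

x_gpu = cp.asarray(x_host)     # host -> device copy
y_gpu = cp.sin(x_gpu) * 2.0    # computed on the GPU
y_host = cp.asnumpy(y_gpu)     # device -> host copy, returns a numpy.ndarray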


cuNumeric – Implicitly Parallel Implementations of NumPy APIs
Developer blog: Accelerating Python Applications with cuNumeric and Legate

[Chart: stencil benchmark – a NumPy application needs no modifications to scale to a thousand GPUs.]

Software stack:
• NumPy application
• cuNumeric Python library – productivity / composability layer
• Legate – common runtime system; accelerates library development, scalable extraction of implicit parallelism
• Accelerated domain libraries – maximise single-accelerator performance
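A minimal sketch of the drop-in usage (the stencil body below is illustrative, not the benchmark code from the blog):

import cunumeric as np    # the only change from a pure NumPy version is this import

grid = np.zeros((4096, 4096))
grid[0, :] = 1.0          # boundary condition

# Simple 4-point stencil sweep; cuNumeric/Legate extract the parallelism
# in these array operations implicitly and distribute it across the available GPUs.
for _ in range(100):
    grid[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
    )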
Numba – Function Annotation and/or CUDA C-like Programming
ufunc example

Key Features
• Just-In-Time (JIT) compilation – makes use of type specialisation
• Can accelerate CPU code as well as GPU code
• Works very well with NumPy ufuncs – element-wise operations …

CPU

import numpy as np
from numba import vectorize

@vectorize
def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

GPU

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64, float64)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

Numba – Function Annotation and/or CUDA C-like Programming
kernel example

Key Features
• … also allows CUDA-style kernels for more complex algorithms

import numpy as np
from numba import cuda, void, float32

@cuda.jit(void(float32, float32[:], float32[:], float32[:]))
def saxpy(a, x, y, out):
    i = cuda.grid(1)     # Shorthand for cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i < x.shape[0]:   # Guard against threads beyond the end of the array
        out[i] = a * x[i] + y[i]

a = 3.141
x = np.random.rand(1024*2048).astype(np.float32)  # Signature above expects float32 arrays
y = np.random.rand(1024*2048).astype(np.float32)

d_x = cuda.to_device(x)                # Make a copy of x on the GPU
d_y = cuda.to_device(y)                # Make a copy of y on the GPU
d_out = cuda.device_array_like(d_y)    # Create an array shaped like y on the GPU

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block  # integer (ceiling) division

# Launch a GPU kernel with an appropriate execution configuration
saxpy[blocks, threads_per_block](a, d_x, d_y, d_out)
cuda.synchronize()
PyCUDA – Kernel Programming

Key Features
• Python interface to CUDA
• Low-level access and fine-grained control
• Can write custom kernels in C/C++ directly within Python

import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule

# Compile the CUDA kernel code
mod = SourceModule("""
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}
""")
saxpy_cuda = mod.get_function("saxpy")  # Get a handle to the compiled kernel

n = 1024*2048
a = 3.141
x = np.random.rand(n).astype(np.float32)  # np.random.rand takes no dtype argument, so cast
y = np.random.rand(n).astype(np.float32)

d_x = cuda.mem_alloc(x.nbytes)  # Allocate memory for x on the GPU
d_y = cuda.mem_alloc(y.nbytes)  # Allocate memory for y on the GPU
cuda.memcpy_htod(d_x, x)        # Copy data from CPU to GPU
cuda.memcpy_htod(d_y, y)        # Copy data from CPU to GPU

block_dim = (256, 1, 1)
grid_dim = ((n - 1) // block_dim[0] + 1, 1)

# Launch the GPU kernel (argument order matches the kernel signature)
saxpy_cuda(np.int32(n), np.float32(a), d_x, d_y, block=block_dim, grid=grid_dim)

cuda.memcpy_dtoh(y, d_y)  # Copy the results back to the CPU

d_x.free()  # Free GPU memory
d_y.free()
Useful Links

• Numba programming course
  • Fundamentals of Accelerated Computing with CUDA Python
  • Claim a free DLI course here

• cuNumeric: https://developer.nvidia.com/cunumeric
• Numba for CUDA GPUs: https://numba.pydata.org/numba-doc/latest/cuda/index.html
• CuPy: https://cupy.dev/
• PyCUDA: https://pypi.org/project/pycuda/
Resources

Developer Tools
• Debuggers: cuda-gdb, Nsight Visual Studio Edition
• Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX) – see the sketch below
• Correctness checker: Compute Sanitizer
• IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
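A minimal sketch of annotating Python code for the profilers with the nvtx package (the function below is hypothetical); the named ranges then appear on the Nsight Systems timeline:

import nvtx

@nvtx.annotate("saxpy", color="green")   # decorator form
def saxpy(a, x, y):
    return a * x + y

with nvtx.annotate("data preparation", color="blue"):   # context-manager form
    data = list(range(1000))                            # ... placeholder work ...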
NGC: GPU-Optimized Software Hub
Simplifying DL, ML and HPC workflows

• 50+ containers: DL, ML, HPC
• 60 pre-trained models: NLP, image classification, object detection and more
• Model training scripts: NLP, image classification, object detection and more
• Workflows: medical imaging, intelligent video analytics

DEEP LEARNING: TensorFlow | PyTorch | more
MACHINE LEARNING: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
VISUALIZATION: ParaView | IndeX | more
Deep Learning Institute (DLI)

Hands-on, self-paced and instructor-led training in deep learning and accelerated computing:
https://www.nvidia.com/en-gb/training/

Numba course:
• Fundamentals of Accelerated Computing with CUDA Python

Lots of Python-based material:
• Accelerating End-to-End Data Science Workflows
• Get Started with Highly Accurate Custom ASR for Speech AI
• Introduction to Transformer-Based Natural Language Processing
• Introduction to Physics-Informed Machine Learning with Modulus
• …

Course areas: Accelerated Computing Fundamentals, Autonomous Vehicles, Medical Image Analysis, Genomics, Finance, Digital Content Creation, Game Development, Deep Learning Fundamentals – more industry-specific training coming soon.
Claim your Free Self–Paced Course
Access essential technical training

Sharpen your skills or learn a new technology. In partnership with the NVIDIA Deep Learning Institute, we are offering a free self-paced course (worth up to $90).

Courses on offer include:
• Fundamentals of Accelerated Computing with CUDA Python
• Getting Started with Deep Learning
• Getting Started with Accelerated Computing in CUDA C/C++
• Essentials of USD in Omniverse
• Synthetic Data Generation for Training Computer Vision Models
• Get Started with Highly Accurate Custom ASR for Speech AI

Scan the QR code to access the full course list and redeem your free training.
Thank you!
Accelerating Python on GPUs
Paul Graham, Senior Solutions Architect
[email protected]
