
Accelerating Python on GPUs

NVIDIA Webinar
9th October 2024
Paul Graham, Senior Solutions Architect, NVIDIA
[email protected]
NVIDIA: GPU Computing | Computer Graphics | Artificial Intelligence


GPUs: The basics
Million-X Speedup for Innovation and Discovery
Simulation + AI

[Chart: "Million-X Speedup for Innovation and Discovery" – single-threaded CPU performance growth has slowed from ~1.5X to ~1.1X per year (1980–2020), while accelerated computing, scale up & out, and machine learning combine to deliver million-X speedups (y-axis to 10^9). Example applications: climate change (FourCastNet), digital biology (OrbNet), renewable energy (SGTC), industrial HPC (multi-disciplinary physics).]


Small Changes, Big Speed-up

[Diagram: the application's compute-intensive functions are moved to the GPU, while the rest of the sequential code remains on the CPU.]
GH100 GPU Architecture
https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper

• 132 SMs (Streaming Multiprocessors) in the H100 SXM5 GPU

• > 16,000 FP32 cores in total

• We usually want the number of threads >> the number of cores

• So we need a lot of threads! A rough launch-configuration calculation is sketched below.
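As a rough, illustrative calculation (the array size below is hypothetical, not from the slides), a sketch of how a one-thread-per-element launch easily produces far more threads than cores:

import math

num_sms = 132                      # SMs in the H100 SXM5
fp32_cores = num_sms * 128         # 128 FP32 cores per SM -> 16,896 in total

n = 64 * 1024 * 1024               # hypothetical 64M-element array, one thread per element
threads_per_block = 256
blocks = math.ceil(n / threads_per_block)

print(f"{blocks} blocks x {threads_per_block} threads = {blocks * threads_per_block} threads")
print(f"~{(blocks * threads_per_block) // fp32_cores} threads per FP32 core")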


Multi-die Multi-chip Multi-node
The CUDA Platform
Target the abstraction layer that works best for your application

• Application ecosystem – frameworks and developer SDKs: PyTorch, TensorFlow, JAX, Modulus, Triton, ...; SDKs for medical devices, energy, autonomous vehicles, ...

• NVIDIA accelerated libraries: cuBLAS, cuSPARSE, cuTENSOR, cuSOLVER, cuRAND, cuFFT, Math API, NPP (example below)

• Languages: standard parallel ISO C++, OpenACC, OpenMP, Python, Julia, MATLAB, CUDA Fortran, CUDA C++, OpenCL, Ada, Haskell, R, ...

• Compilation stack: NVVM / LLVM IR -> PTX assembly ISA
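As a small illustration of targeting the library layer from Python (a sketch, assuming CuPy is installed; CuPy's FFT routines are backed by cuFFT):

import cupy as cp

x = cp.random.rand(1 << 20).astype(cp.complex64)   # data already on the GPU
X = cp.fft.fft(x)                                  # forward FFT, executed by cuFFT
x_back = cp.fft.ifft(X)                            # inverse FFT, also on the GPU

print(cp.allclose(x, x_back, atol=1e-3))           # round-trip check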
Tensor Cores
Hardware for Matrix Multiply and Accumulate (MMA) operations

• Perform several MMA calculations per clock cycle

• Introduced in the V100 (Volta)
  • FP16 multiply, FP32 accumulate (FP32 in, FP32 out)
• Turing added INT8, INT4 and INT1 calculations
• Ampere (A100)
  • Full FP64 MMA
  • BFloat16, TensorFloat-32 (TF32)
• Hopper (H100)
  • FP8
  • Transformer Engine
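As an illustration of how frameworks exploit these units, a minimal PyTorch sketch (PyTorch is an assumed choice here, not prescribed by the slide) of a mixed-precision matrix multiply of the kind that is dispatched to Tensor Core GEMM kernels:

import torch

a = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)
b = torch.randn(4096, 4096, device="cuda", dtype=torch.float16)

c = a @ b   # FP16 multiply-accumulate, mapped onto Tensor Cores on Volta and newer GPUs

torch.backends.cuda.matmul.allow_tf32 = True   # on Ampere+, FP32 matmuls may use TF32 Tensor Cores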
cuDNN Library, CUTLASS
Exploiting Tensor Cores

cuDNN – accelerating deep learning primitives
Key features:
• Tensor Core acceleration for all popular convolutions
• Supports FP32, FP16, BF16 and TF32 floating point formats and INT8 and UINT8 integer formats
• Arbitrary dimension ordering, striding, and sub-regions for 4D tensors means easy integration into any neural net implementation

CUTLASS – Tensor Core programming model
• Warp-level GEMM and reusable components for linear algebra kernels in CUDA
• Has Python interfaces
Frameworks and Libraries
Many Frameworks … All With Python Support
NVIDIA LaunchPad — free hands-on labs

• cuOpt – accelerated optimisation engine, e.g. for logistics and route optimisation
• Isaac Sim – robotics simulation toolkit for building virtual worlds for training robots
• Riva – speech AI services: transcription, translation, text to voice, …
• Clara – AI-powered solutions for healthcare and life sciences, e.g. genomics, medical instruments
• Holoscan – acceleration of sensor data processing pipelines
• Merlin – end-to-end system for recommender frameworks
• NeMo – framework for building and deploying generative AI models
• TAO Toolkit – for transfer learning
• DeepStream SDK – for streaming IVA applications
• Modulus – PyTorch-based framework for physics-informed neural networks (PINNs)
• ...

[Images: Isaac Sim; NVIDIA Modulus and Omniverse]


RAPIDS
GPU-accelerated data science workflow (RAPIDS.ai)

[Pipeline: DATA -> DATA PREPARATION (ETL) -> MODEL TRAINING -> VISUALIZATION -> PREDICTIONS]

• cuDF: Python drop-in pandas replacement built on CUDA; GPU-accelerated Spark
• cuML: GPU acceleration of popular ML algorithms, e.g. XGBoost, with an easy-to-adopt, scikit-learn-like interface (see the sketch below)
• Visualization: effortless exploration of datasets, billions of records in milliseconds; dynamic interaction with data = faster ML model development
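A minimal cuML sketch of the scikit-learn-like interface (the data and parameters here are illustrative):

import cupy as cp
from cuml.cluster import KMeans

X = cp.random.rand(1_000_000, 16, dtype=cp.float32)   # illustrative data, already on the GPU

km = KMeans(n_clusters=8)     # same look and feel as sklearn.cluster.KMeans
km.fit(X)
labels = km.predict(X)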
Announcing: RAPIDS cuDF Accelerates pandas with Zero Code Change
World's fastest data analytics with pandas

• 150x faster than CPU-only pandas with zero code change (DuckDB Data Benchmark, 5 GB; NVIDIA Grace Hopper vs. Intel Xeon Platinum 8480CL CPU)
• Unified workflow on CPUs and GPUs across laptops, workstations and the datacenter
• Compatible with third-party libraries built on pandas
• Available today in open beta; NVIDIA AI Enterprise support coming soon

[Chart: DuckDB Data Benchmark timings, pandas on CPU vs. pandas with RAPIDS cuDF on Grace Hopper – Join: 5 min 30 s vs. 1 s; Advanced Groupby: 4 min 45 s vs. 2 s]

• See also cuGraph – focussed on GPU-accelerated graph analytics, including GNNs and NetworkX: blog
• Has a zero-code-change backend for NetworkX, nx-cugraph

A sketch of enabling the zero-code-change accelerator follows below.
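A minimal sketch of the zero-code-change path (assuming a recent cuDF build with the cudf.pandas accelerator installed); the pandas code itself is unchanged:

# In a Jupyter notebook:  %load_ext cudf.pandas   (run before importing pandas)
# In a plain Python script:
import cudf.pandas
cudf.pandas.install()

import pandas as pd   # now GPU-accelerated where possible, with CPU fallback otherwise

df = pd.DataFrame({"key": [1, 2, 1, 2], "val": [10.0, 20.0, 30.0, 40.0]})
print(df.groupby("key")["val"].mean())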


** NEW **

nvmath-python (open beta)
• Bringing NVIDIA maths libraries to the Python ecosystem: performance, productivity, interoperability
  o cuBLAS, cuFFT, … without the need for C/C++ bindings
  o Kernel fusion for efficiency
  o Kernel autotuning
  o Interoperable – e.g. can pass PyTorch data objects directly to the maths libraries
  o Supports Python logging
• Intro notebooks: Colab | GitHub
• Demo from GTC session "Deep Dive into Math Libraries"

Polars now GPU accelerated
• Python DataFrame library
  o Aimed at 10s–100s of GB workloads on a single machine
  o Now accelerated on NVIDIA GPUs (see the sketch below)
  o Up to 13x speed-up over CPUs
  o Makes use of cuDF from RAPIDS
• Technical blog
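A minimal Polars sketch of opting into the GPU engine (assuming Polars is installed with its GPU support, which pulls in cuDF); the query itself is ordinary lazy Polars:

import polars as pl

lf = pl.LazyFrame({"key": [1, 2, 1, 2], "val": [10.0, 20.0, 30.0, 40.0]})

result = (
    lf.group_by("key")
      .agg(pl.col("val").mean())
      .collect(engine="gpu")   # execute on the GPU; unsupported queries fall back to the CPU engine
)
print(result)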
Programming directly for GPUs

Programming the NVIDIA Platform
CPU, GPU, and Network

ACCELERATED STANDARD LANGUAGES (language features and drop-in libraries)
ISO C++, ISO Fortran, CuPy, cuNumeric

std::transform(par, x, x+n, y, y,
    [=](float x, float y){ return y + a*x; });

do concurrent (i = 1:n)
    y(i) = y(i) + a*x(i)
enddo

import cunumeric as np
...
def saxpy(a, x, y):
    y[:] += a*x

INCREMENTAL PORTABLE OPTIMIZATION
OpenACC, OpenMP, Numba

#pragma acc data copy(x,y)
{
    ...
    std::transform(par, x, x+n, y, y,
        [=](float x, float y){ return y + a*x; });
    ...
}

@vectorize(['float64(float64, float64, float64)'], target='cuda')
def saxpy_ufunc(a, x, y):
    return a*x + y

PLATFORM SPECIALIZATION
CUDA, Numba, PyCUDA

@cuda.jit(void(float32, float32[:], float32[:], float32[:]))
def saxpy(a, x, y, out):
    idx = cuda.grid(1)
    out[idx] = a * x[idx] + y[idx]

mod = SourceModule("""
__global__
void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}
""")

ACCELERATION LIBRARIES
Core | Math | Communication | Data Analytics | AI | Quantum
CuPy – NumPy-Compatible Library for GPUs

Key Features
• Supports a subset of the numpy.ndarray interface
• Also makes use of NVIDIA libraries: cuBLAS, cuRAND, cuSOLVER, …
• Can make use of Unified Memory

CPU (NumPy)

import numpy as np

def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

GPU (CuPy)

import cupy as cp

def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = cp.random.rand(1024, 2048)
y = cp.random.rand(1024, 2048)

result = saxpy(a, x, y)
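One practical point not shown above: CuPy arrays live in GPU memory, so an explicit copy is needed when a NumPy array is required on the host. A short sketch:

import numpy as np
import cupy as cp

x_host = np.random.rand(1024, 2048)

x_gpu = cp.asarray(x_host)     # host -> device copy
y_gpu = cp.sin(x_gpu) * 2.0    # computed on the GPU
y_host = cp.asnumpy(y_gpu)     # device -> host copy, returns a numpy.ndarray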


cuNumeric – Implicitly Parallel Implementations of NumPy APIs
Developer blog: Accelerating Python Applications with cuNumeric and Legate

[Chart: stencil benchmark – a NumPy application needs no modifications to scale to a thousand GPUs.]

Software stack:
• NumPy application
• cuNumeric Python library – productivity / composability layer
• Legate – common runtime system; accelerates library development, scalable extraction of implicit parallelism
• Accelerated domain libraries – maximise single-accelerator performance
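A minimal sketch of the drop-in usage (the stencil body below is illustrative, not the benchmark code from the blog):

import cunumeric as np    # the only change from a pure NumPy version is this import

grid = np.zeros((4096, 4096))
grid[0, :] = 1.0          # boundary condition

# Simple 4-point stencil sweep; cuNumeric/Legate extract the parallelism
# in these array operations implicitly and distribute it across the available GPUs.
for _ in range(100):
    grid[1:-1, 1:-1] = 0.25 * (
        grid[:-2, 1:-1] + grid[2:, 1:-1] + grid[1:-1, :-2] + grid[1:-1, 2:]
    )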
Numba – Function Annotation and/or CUDA C-like Programming
ufunc example

Key Features
• Just-In-Time (JIT) compilation – makes use of type specialisation
• Can accelerate CPU code as well as GPU code
• Works very well with NumPy ufuncs – element-wise operations …

CPU

import numpy as np
from numba import vectorize

@vectorize
def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

GPU

import numpy as np
from numba import vectorize

@vectorize(['float64(float64, float64, float64)'], target='cuda')
def saxpy(a, x, y):
    return a * x + y

a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

result = saxpy(a, x, y)

Numba – Function Annotation and/or CUDA C-like Programming
kernel example

Key Features
• … also allows CUDA-style kernels for more complex algorithms

import numpy as np
from numba import cuda, void, float32

@cuda.jit(void(float32, float32[:], float32[:], float32[:]))
def saxpy(a, x, y, out):
    i = cuda.grid(1)     # Shorthand for cuda.threadIdx.x + cuda.blockIdx.x * cuda.blockDim.x
    if i < x.shape[0]:   # Guard against threads beyond the end of the array
        out[i] = a * x[i] + y[i]

a = 3.141
x = np.random.rand(1024*2048).astype(np.float32)  # Signature above expects float32 arrays
y = np.random.rand(1024*2048).astype(np.float32)

d_x = cuda.to_device(x)                # Make a copy of x on the GPU
d_y = cuda.to_device(y)                # Make a copy of y on the GPU
d_out = cuda.device_array_like(d_y)    # Create an array shaped like y on the GPU

threads_per_block = 256
blocks = (x.size + threads_per_block - 1) // threads_per_block  # integer (ceiling) division

# Launch a GPU kernel with an appropriate execution configuration
saxpy[blocks, threads_per_block](a, d_x, d_y, d_out)
cuda.synchronize()
PyCUDA – Kernel Programming

Key Features
• Python interface to CUDA
• Low-level access and fine-grained control
• Can write custom kernels in C/C++ directly within Python

import pycuda.autoinit
import pycuda.driver as cuda
import numpy as np
from pycuda.compiler import SourceModule

# Compile the CUDA kernel code
mod = SourceModule("""
__global__ void saxpy(int n, float a, float *x, float *y) {
    int i = blockIdx.x*blockDim.x + threadIdx.x;
    if (i < n) y[i] += a*x[i];
}
""")
saxpy_cuda = mod.get_function("saxpy")  # Get a handle to the compiled kernel

n = 1024*2048
a = 3.141
x = np.random.rand(n).astype(np.float32)  # np.random.rand takes no dtype argument, so cast
y = np.random.rand(n).astype(np.float32)

d_x = cuda.mem_alloc(x.nbytes)  # Allocate memory for x on the GPU
d_y = cuda.mem_alloc(y.nbytes)  # Allocate memory for y on the GPU
cuda.memcpy_htod(d_x, x)        # Copy data from CPU to GPU
cuda.memcpy_htod(d_y, y)        # Copy data from CPU to GPU

block_dim = (256, 1, 1)
grid_dim = ((n - 1) // block_dim[0] + 1, 1)

# Launch the GPU kernel (argument order matches the kernel signature)
saxpy_cuda(np.int32(n), np.float32(a), d_x, d_y, block=block_dim, grid=grid_dim)

cuda.memcpy_dtoh(y, d_y)  # Copy the results back to the CPU

d_x.free()  # Free GPU memory
d_y.free()
Useful Links

• Numba programming course
  • Fundamentals of Accelerated Computing with CUDA Python
  • Claim a free DLI course here

• cuNumeric: https://developer.nvidia.com/cunumeric
• Numba for CUDA GPUs: https://numba.pydata.org/numba-doc/latest/cuda/index.html
• CuPy: https://cupy.dev/
• PyCUDA: https://pypi.org/project/pycuda/
Resources

Developer Tools
• Debuggers: cuda-gdb, Nsight Visual Studio Edition
• Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX) – see the sketch below
• Correctness checker: Compute Sanitizer
• IDE integrations: Nsight Eclipse Edition, Nsight Visual Studio Edition, Nsight Visual Studio Code Edition
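A minimal sketch of annotating Python code for the profilers with the nvtx package (the function below is hypothetical); the named ranges then appear on the Nsight Systems timeline:

import nvtx

@nvtx.annotate("saxpy", color="green")   # decorator form
def saxpy(a, x, y):
    return a * x + y

with nvtx.annotate("data preparation", color="blue"):   # context-manager form
    data = list(range(1000))                            # ... placeholder work ...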
NGC: GPU-Optimized Software Hub
Simplifying DL, ML and HPC workflows

• 50+ containers: DL, ML, HPC
• 60 pre-trained models: NLP, image classification, object detection and more
• Model training scripts: NLP, image classification, object detection and more
• Workflows: medical imaging, intelligent video analytics

DEEP LEARNING: TensorFlow | PyTorch | more
MACHINE LEARNING: RAPIDS | H2O | more
HPC: NAMD | GROMACS | more
VISUALIZATION: ParaView | IndeX | more
Deep Learning Institute (DLI)

Hands-on, self-paced and instructor-led training in deep learning and accelerated computing:
https://www.nvidia.com/en-gb/training/

Numba course:
• Fundamentals of Accelerated Computing with CUDA Python

Lots of Python-based material:
• Accelerating End-to-End Data Science Workflows
• Get Started with Highly Accurate Custom ASR for Speech AI
• Introduction to Transformer-Based Natural Language Processing
• Introduction to Physics-Informed Machine Learning with Modulus
• …

Course areas: Accelerated Computing Fundamentals, Autonomous Vehicles, Medical Image Analysis, Genomics, Finance, Digital Content Creation, Game Development, Deep Learning Fundamentals – more industry-specific training coming soon.
Claim your Free Self–Paced Course
Access essential technical training

Sharpen your skills or learn a new technology. In partnership with the NVIDIA Deep Learning Institute, we are offering a free self-paced course (worth up to $90).

Courses on offer include:
• Fundamentals of Accelerated Computing with CUDA Python
• Getting Started with Deep Learning
• Getting Started with Accelerated Computing in CUDA C/C++
• Essentials of USD in Omniverse
• Synthetic Data Generation for Training Computer Vision Models
• Get Started with Highly Accurate Custom ASR for Speech AI

Scan the QR code to access the full course list and redeem your free training.
Thank you!
Accelerating Python on GPUs
Paul Graham, Senior Solutions Architect
[email protected]
