Accelerating Python on GPUs
NVIDIA Webinar
9th October 2024
Paul Graham, Senior Solutions Architect, NVIDIA
[Figure: log-scale performance chart, axis from 10^1 to 10^9. Single-threaded perf grows 1.1X per year. Labelled regions: ACCELERATED COMPUTING, SCALE UP & OUT, MACHINE LEARNING. Labelled workloads: RENEWABLE ENERGY, INDUSTRIAL HPC, SGTC Multi-disciplinary Physics.]
[Figure: Application Code split into Compute-Intensive Functions, which run on the GPU, and the Rest of Sequential CPU Code, which stays on the CPU.]
GH100 GPU Architecture
https://resources.nvidia.com/en-us-tensor-core/gtc22-whitepaper-hopper
NVIDIA Accelerated Libraries: cuBLAS, cuSPARSE, cuTENSOR, cuSOLVER, cuRAND, cuFFT, Math API, NPP
Languages: Standard Parallel C++, OpenACC, OpenMP, Python, Julia, MATLAB, ..., CUDA FORTRAN, CUDA C++, OpenCL, Ada, Haskell, R, ...
Compilation Stack: NVVM / LLVM IR, PTX Assembly ISA
Tensor Cores
Hardware for Matrix Multiply and Accumulate operations
cuDNN - Accelerating deep learning primitives
Key Features
• Tensor Core acceleration for all popular convolutions
• Supports FP32, FP16, BF16 and TF32 floating point formats and INT8 and UINT8 integer formats
• Arbitrary dimension ordering, striding, and sub-regions for 4d tensors means easy integration into any neural net implementation
CUTLASS – Tensor Core Programming Model
• Warp-Level GEMM and Reusable Components for Linear Algebra Kernels in CUDA
• Has Python interfaces
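The "Matrix Multiply and Accumulate" operation named on this slide can be written as D = A × B + C. The sketch below reproduces only the arithmetic in NumPy on the CPU, assuming FP16 inputs and an FP32 accumulator (one of several precision combinations Tensor Cores support). The 16x16 tile size and variable names are illustrative assumptions, and nothing here runs on Tensor Cores.

```python
import numpy as np

rng = np.random.default_rng(0)

# FP16 input tiles (assumed precision; other formats are listed on the slide)
A = rng.standard_normal((16, 16)).astype(np.float16)
B = rng.standard_normal((16, 16)).astype(np.float16)

# FP32 accumulator tile
C = rng.standard_normal((16, 16)).astype(np.float32)

# Multiply the FP16 tiles, accumulating the products in FP32, then add C
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.dtype, D.shape)
```

Libraries such as cuDNN and CUTLASS issue this operation in hardware on small tiles. The NumPy version is only a reference for what is being computed.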
Frameworks and Libraries
Many Frameworks … All With Python Support
NVIDIA Launchpad — free hands-on labs
• cuOpt – accelerated optimisation engine e.g. for Logistics and Route Optimisation
• Isaac Sim – robotics simulation toolkit – building virtual worlds for training robots
• Riva – speech AI services: transcription, translation, text to voice …
• Clara – AI powered solutions for healthcare and life sciences e.g. Genomics, Medical Instruments
• Holoscan – acceleration of sensor data processing pipelines
• Merlin – end to end system for recommender frameworks
• NeMo – framework for building and deploying generative AI models
• TAO toolkit – for transfer learning
• DeepStream SDK – for streaming IVA applications
• Modulus – PyTorch-based framework for Physics-informed Neural Networks (PINNs)
• ...
RAPIDS.ai
[Figure: pipeline from DATA to PREDICTIONS, with a time axis in Seconds on a log scale (1, 10, 100).]
• See also cuGraph – focussed on GPU-accelerated graph analytics including GNNs and NetworkX: blog
ACCELERATION LIBRARIES
Core, Math, Communication, Data Analytics, AI, Quantum
CuPy – NumPy Compatible Library for GPU
Key Features
• Supports a subset of the numpy.ndarray interface
• Also makes use of NVIDIA libraries: cuBLAS, cuRAND, cuSOLVER …
• Can make use of Unified Memory
CPU
import numpy as np
a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

GPU
import cupy as cp
a = 3.141
x = cp.random.rand(1024, 2048)
y = cp.random.rand(1024, 2048)
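The CPU and GPU listings differ only in the array module used. A minimal sketch of how the same arithmetic could then run on either device follows, assuming the operation of interest is a SAXPY-style update `a * x + y` (the slide shows only `a`, `x` and `y`, so the operation itself is an assumption). The `xp` alias and the `saxpy` helper are hypothetical names. The sketch falls back to NumPy when CuPy or a usable CUDA device is not found.

```python
import numpy as np

try:
    import cupy as cp
    # Raises (or reports zero devices) when no CUDA device or driver is usable
    if cp.cuda.runtime.getDeviceCount() < 1:
        raise RuntimeError("no CUDA device found")
    xp = cp
except Exception:
    xp = np  # fallback so the sketch still runs on a CPU-only machine


def saxpy(a, x, y):
    """Return a * x + y using whichever array module the inputs belong to."""
    return a * x + y


a = 3.141
x = xp.random.rand(1024, 2048)
y = xp.random.rand(1024, 2048)
result = saxpy(a, x, y)

# cp.asnumpy copies a device array back to host memory; NumPy arrays pass through
result_host = cp.asnumpy(result) if xp is not np else result
print(type(result).__module__, result_host.shape)
```

Because the function body never names NumPy or CuPy directly, the only device-specific lines are the ones that create the arrays and the one that copies the result back.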
Legate
Common Runtime System
Scalable extraction of implicit parallelism
CPU
import numpy as np
from numba import vectorize
a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)

GPU
import numpy as np
from numba import vectorize
a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)
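The Numba listing imports `vectorize` but the decorated function did not survive extraction. The sketch below shows one way such a function might look, assuming a SAXPY-style operation; the `saxpy` and `make_ufunc` names and the type signatures are assumptions, not taken from the slide. It uses `target="cpu"` so it runs without a GPU, and substitutes `np.vectorize` if Numba is not installed.

```python
import numpy as np

try:
    from numba import vectorize
    # target="cpu" compiles for the host; target="cuda" would build a GPU ufunc
    make_ufunc = vectorize(
        ["float32(float32, float32, float32)",
         "float64(float64, float64, float64)"],
        target="cpu",
    )
except ImportError:
    # Assumption: plain NumPy stand-in (slow, not compiled) when Numba is missing
    make_ufunc = np.vectorize


@make_ufunc
def saxpy(a, x, y):
    # Written for scalars; the decorator applies it element-wise over arrays
    return a * x + y


a = 3.141
x = np.random.rand(1024, 2048)
y = np.random.rand(1024, 2048)
result = saxpy(a, x, y)
print(result.shape)
```

With Numba, switching between CPU and GPU is then a matter of changing the `target` argument, which is presumably why the two columns on the slide look alike.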
import numpy as np
from numba import cuda
a = 3.141
x = np.random.rand(1024*2048)
y = np.random.rand(1024*2048)
threads_per_block = 256
# Integer ceiling division: the block count must be an int, not a float
blocks = (1024*2048 + threads_per_block - 1) // threads_per_block
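The listing above prepares inputs and a launch configuration but not the kernel itself. A minimal sketch of a matching Numba CUDA kernel follows, assuming again a SAXPY-style operation; `saxpy_kernel`, `gpu_ok` and the device array names are hypothetical. The kernel only runs when Numba reports a CUDA device. Otherwise the sketch computes a CPU reference so it still executes.

```python
import numpy as np

try:
    from numba import cuda
    gpu_ok = cuda.is_available()
except ImportError:
    gpu_ok = False

n = 1024 * 2048
a = 3.141
x = np.random.rand(n)
y = np.random.rand(n)

threads_per_block = 256
# Round up so every element is covered even when n is not a multiple of 256
blocks = (n + threads_per_block - 1) // threads_per_block

if gpu_ok:
    @cuda.jit
    def saxpy_kernel(a, x, y, out):
        i = cuda.grid(1)       # global 1D thread index
        if i < out.size:       # guard threads past the end of the array
            out[i] = a * x[i] + y[i]

    d_x = cuda.to_device(x)                 # copy inputs to the GPU
    d_y = cuda.to_device(y)
    d_out = cuda.device_array_like(d_x)     # uninitialised output on the GPU
    saxpy_kernel[blocks, threads_per_block](a, d_x, d_y, d_out)
    result = d_out.copy_to_host()           # copy the result back
else:
    result = a * x + y  # CPU reference, used only when no GPU is present

print(blocks, result.shape)
```

Explicit `to_device` and `copy_to_host` calls keep the transfers visible. Passing NumPy arrays straight to the kernel also works in Numba, at the cost of implicit copies on every launch.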
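import numpy as np
a = 3.141
# np.random.rand takes no dtype argument, so cast after sampling
x = np.random.rand(1024*2048).astype(np.float32)
y = np.random.rand(1024*2048).astype(np.float32)
block_dim = (256, 1, 1)
# Ceiling division: enough blocks to cover every element
grid_dim = ((1024*2048-1) // block_dim[0] + 1, 1)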
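The `grid_dim` expression is a ceiling division: it picks the smallest number of blocks whose threads cover every element. The pure-Python sketch below emulates that index arithmetic on the CPU to show why a bounds check is needed in the kernel. `launch_geometry` and `covered_indices` are hypothetical helper names, not part of any CUDA library.

```python
def launch_geometry(n, threads_per_block=256):
    """Return (grid_dim, block_dim) covering n elements with a 1D launch."""
    blocks = (n - 1) // threads_per_block + 1   # ceiling division for n >= 1
    return (blocks, 1), (threads_per_block, 1, 1)


def covered_indices(n, grid_dim, block_dim):
    """Emulate the global index of each thread and count guarded-out threads."""
    touched = 0
    skipped = 0
    for block_idx in range(grid_dim[0]):
        for thread_idx in range(block_dim[0]):
            i = block_idx * block_dim[0] + thread_idx   # global thread index
            if i < n:
                touched += 1
            else:
                skipped += 1   # surplus threads in the last block do nothing
    return touched, skipped


grid_dim, block_dim = launch_geometry(1024 * 2048)
print(grid_dim, block_dim)   # (8192, 1) (256, 1, 1)

# A size that is not a multiple of the block size leaves part of a block idle
grid_odd, block_odd = launch_geometry(1000)
print(grid_odd, covered_indices(1000, grid_odd, block_odd))   # (4, 1) (1000, 24)
```

For the slide's array of 1024*2048 elements the division is exact, so no threads are idle. The guard matters as soon as the size stops being a multiple of the block size.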
• cuNumeric: https://developer.nvidia.com/cunumeric
• Numba for CUDA GPUs: https://numba.pydata.org/numba-doc/latest/cuda/index.html
• CuPy: https://cupy.dev/
• PyCUDA: https://pypi.org/project/pycuda/
Resources
Developer Tools
Debuggers: cuda-gdb, Nsight Visual Studio Edition
Profilers: Nsight Systems, Nsight Compute, CUPTI, NVIDIA Tools eXtension (NVTX)
60 Pre-trained Models: NLP, Image Classification, Object Detection and more
Workflows: Medical Imaging, Intelligent Video Analytics
HPC: NAMD | GROMACS | more
VISUALIZATION: ParaView | IndeX | more
Deep Learning Institute (DLI)