25-04 GPU Programming Without CUDA
CS6023: GPU Programming
• But CUDA was the best solution then… and it still remains the best now!
• Most GPU vendors have their own low-level language similar to CUDA: HIP, DPC++, etc.
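As an illustration (my sketch, not from the slides), a minimal HIP program is almost line-for-line identical to its CUDA counterpart; hipcc even accepts the same triple-chevron launch syntax:

#include <hip/hip_runtime.h>
#include <vector>

// SAXPY kernel: the body is identical to the CUDA version
__global__ void saxpy(float a, const float* x, float* y, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) y[i] = a * x[i] + y[i];
}

int main() {
    const int n = 1 << 20;
    std::vector<float> hx(n, 1.0f), hy(n, 2.0f);

    // hipMalloc/hipMemcpy mirror cudaMalloc/cudaMemcpy
    float *dx, *dy;
    hipMalloc((void**)&dx, n * sizeof(float));
    hipMalloc((void**)&dy, n * sizeof(float));
    hipMemcpy(dx, hx.data(), n * sizeof(float), hipMemcpyHostToDevice);
    hipMemcpy(dy, hy.data(), n * sizeof(float), hipMemcpyHostToDevice);

    // Same <<<grid, block>>> launch as CUDA
    saxpy<<<(n + 255) / 256, 256>>>(2.0f, dx, dy, n);

    hipMemcpy(hy.data(), dy, n * sizeof(float), hipMemcpyDeviceToHost);
    hipFree(dx);
    hipFree(dy);
}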
CUDA: A Maintainer’s Nightmare!
• Most real-world code is memory-bound; kernel fusion is needed to hide memory latency
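To see why fusion matters, here is a minimal C++ sketch (my illustration, not from the slides): two separate element-wise passes write and then re-read an intermediate array, while the fused version touches memory only once per element.

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    std::vector<float> x(1 << 20, 1.0f), y(x.size()), tmp(x.size());

    // Unfused: y = x * 2 + 1 in two passes; the intermediate 'tmp'
    // doubles the memory traffic of this memory-bound computation
    std::transform(std::execution::par_unseq, x.begin(), x.end(), tmp.begin(),
                   [](float v) { return v * 2.0f; });
    std::transform(std::execution::par_unseq, tmp.begin(), tmp.end(), y.begin(),
                   [](float v) { return v + 1.0f; });

    // Fused: same result in a single pass, no intermediate array
    std::transform(std::execution::par_unseq, x.begin(), x.end(), y.begin(),
                   [](float v) { return v * 2.0f + 1.0f; });
}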
What Are Our Alternatives?
• Use compilers that generate performant code for accelerators
• And they provide quicker turn-around time & more maintainable code than CUDA
[2] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
What Are Our Alternatives?
• Have bindings for both C++ & Python; extensive documentation and support
[3] https://developer.nvidia.com/gpu-accelerated-libraries
[4] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
What Are Our Alternatives?
• Queries: value (count, accumulate), property (all_of, any_of, none_of), search, etc. (see the sketch after the sort example below)
• for_each, transform
// Sequential sort:
std::sort(employees.begin(), employees.end(),
          CompareByLastName());

// Parallel sort with stdpar (only the execution policy changes):
std::sort(std::execution::par,
          employees.begin(), employees.end(),
          CompareByLastName());
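A minimal sketch of the query algorithms from the list above (my example, assuming the file is compiled with nvc++ -stdpar=gpu so the parallel policies offload to the GPU):

#include <algorithm>
#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<int> v(1'000'000, 1);

    // Value query: parallel sum (std::reduce is the parallel analogue
    // of std::accumulate)
    long sum = std::reduce(std::execution::par, v.begin(), v.end(), 0L);

    // Property query: does every element satisfy the predicate?
    bool all_pos = std::all_of(std::execution::par, v.begin(), v.end(),
                               [](int x) { return x > 0; });

    // Count query: how many elements equal 1?
    auto ones = std::count(std::execution::par, v.begin(), v.end(), 1);

    (void)sum; (void)all_pos; (void)ones;
}

The same source also targets CPU threads when built with nvc++ -stdpar=multicore; the algorithm calls themselves are unchanged.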
[8] “Accelerating Standard C++ with GPUs Using stdpar” by D. Olsen, G. Lopez, and B. A. Lelbach: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
The C++ Execution Policies
• Avoid function pointers as parameters to parallel algorithms; use a lambda or function object instead
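A minimal sketch of this point (my example, not from the talk): a function pointer has no device-callable body, so an offloaded parallel algorithm cannot invoke it, whereas a lambda is compiled inline for both host and device.

#include <algorithm>
#include <execution>
#include <vector>

bool is_even(int x) { return x % 2 == 0; }  // host-only function

int main() {
    std::vector<int> v{3, 1, 4, 1, 5, 9};

    // Problematic: &is_even is a host function pointer and may not be
    // callable from device code when the policy offloads to the GPU
    // std::count_if(std::execution::par, v.begin(), v.end(), &is_even);

    // Preferred: a lambda, which the compiler can inline for the device
    auto n = std::count_if(std::execution::par, v.begin(), v.end(),
                           [](int x) { return x % 2 == 0; });
    (void)n;
}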
[9] https://www.nvidia.com/en-us/on-demand/session/gtc25-s72572/
Performance on Real Applications
[Figure: stdpar performance on real applications — Lulesh (hydrodynamics mini-app), STLBM (lattice-Boltzmann simulations), and M-AIA (multi-physics aerodynamics)]
• Triton is cross-platform with public backends for NVIDIA and AMD GPUs
First version: a single program instance adds exactly 1024 elements (grid = (1,)).

import torch
import triton
import triton.language as tl

@triton.jit
def _add(z_ptr, x_ptr, y_ptr, N):
    # same as torch.arange
    offsets = tl.arange(0, 1024)
    # create 1024 pointers to X, Y, Z
    x_ptrs = x_ptr + offsets
    y_ptrs = y_ptr + offsets
    z_ptrs = z_ptr + offsets
    # load 1024 elements of X and Y
    x = tl.load(x_ptrs)
    y = tl.load(y_ptrs)
    # do computation
    z = x + y
    # store the result back to Z
    tl.store(z_ptrs, z)

N = 1024
x = torch.randn(N, device='cuda')
y = torch.randn(N, device='cuda')
z = torch.randn(N, device='cuda')
grid = (1,)
_add[grid](z, x, y, N)

Second version: generalized to arbitrary N — each program instance handles one 1024-element block, and masked loads/stores guard the tail.

@triton.jit
def _add(z_ptr, x_ptr, y_ptr, N):
    # same as torch.arange, shifted to this program's block
    offsets = tl.arange(0, 1024)
    offsets += tl.program_id(0) * 1024
    # create 1024 pointers to X, Y, Z
    x_ptrs = x_ptr + offsets
    y_ptrs = y_ptr + offsets
    z_ptrs = z_ptr + offsets
    # load 1024 elements of X and Y, masking lanes past the end
    x = tl.load(x_ptrs, mask=offsets < N)
    y = tl.load(y_ptrs, mask=offsets < N)
    # do computation
    z = x + y
    # store the result under the same mask
    tl.store(z_ptrs, z, mask=offsets < N)

grid = (triton.cdiv(N, 1024),)   # ceil(N / 1024) program instances
_add[grid](z, x, y, N)
[12] The Triton Language by Philippe Tillet: https://www.youtube.com/watch?v=G951lCm_qnk
[13] Triton Tutorials: https://triton-lang.org/main/getting-started/tutorials/index.html
Performance of Triton Kernels Close to Native Kernels for Some Operators
Softmax: 10 lines of Triton Code [12] Matmul: ~25 lines of Triton [12]
[14] How to Write a CUDA Program: The Parallel Programming Edition: https://www.nvidia.com/gtc/session-catalog/?search=S72897&tab.catalogallsessionstab=16566177511100015Kus#/session/1727909865019001Xmyv
Examples with cuTile
Llama-3.1: Implemented in cuTile
cuTile → TileIR → GPU
Conclusion
• Programming GPUs with CUDA enabled the AI revolution!
[2] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
[3] https://developer.nvidia.com/gpu-accelerated-libraries
[4] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
[5] “NVIDIA HPC Standard Language Parallelism C++” by Robert Searles: https://vimeo.com/697495123
[8] “Accelerating Standard C++ with GPUs Using stdpar” by D. Olsen, G. Lopez, and B. A. Lelbach: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
[9] https://www.nvidia.com/en-us/on-demand/session/gtc25-s72572/
[10] “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”: http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf
[14] How to Write a CUDA Program: The Parallel Programming Edition: https://www.nvidia.com/gtc/session-catalog/?search=S72897&tab.catalogallsessionstab=16566177511100015Kus#/session/1727909865019001Xmyv