
Programming GPUs

Without CUDA
CS6023: GPU Programming

23rd & 24th April, 2025


CUDA: A Programmer’s Dream

• Introduced in 2006, when compilers weren’t mature enough to automatically target accelerator HW

• CUDA put the power in the hands of the programmer

• Resulted in the growth of the “ninja programmer”!


CUDA: A Programmer’s Nightmare!

• Broke abstractions; SW engineers needed to know HW details

• Not portable to other GPUs


• Performance on subsequent versions of HW / CUDA wasn’t always guaranteed

• But it was the best solution then… and still remains the best now!

• Most GPUs have their own low-level language similar to CUDA: HIP, DPC++, etc.
CUDA: A Maintainer’s Nightmare!
• Most real-world code is memory bound; kernel fusion is needed to hide latency
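A minimal sketch of the fusion point in plain C++ (illustrative host code, not from the slides): unfused, the intermediate array makes a full round trip through memory; fused, each element is touched once and the intermediate stays in a register.

#include <vector>

int main() {
    const int n = 1 << 20;
    std::vector<float> x(n, 1.0f), t(n), y(n);

    // unfused: two passes; t is written out to memory and read back
    for (int i = 0; i < n; ++i) t[i] = x[i] * 2.0f;
    for (int i = 0; i < n; ++i) y[i] = t[i] + 1.0f;

    // fused: one pass; the intermediate never touches memory
    for (int i = 0; i < n; ++i) y[i] = x[i] * 2.0f + 1.0f;
}
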
What Are Our Alternatives?
• Use compilers that generate performant code for accelerators

• Invoke high-performance library functions from C++ & Python code

• Leverage advancements in high-level languages like C++ & Python to express parallelism better

Focus of this & next lecture: a flavor of active work in these fast-moving fields
What Are Our Alternatives?

• Use compilers that generate performant code for accelerators

• Invoke high-performance library functions from C++ & Python code

• Leverage advancements in high-level languages like C++ & Python to express parallelism better
Improving Code-Gen for GPUs
• Traditional compilers’ code-gen focused on scalars; hard to generalize for SIMD & SIMT

• IR design and optimizations focused on extracting ILP

• Multi-Level IR (MLIR) compilers are attempting to address this limitation [1]

• High-level code —> machine code through a series of dialects; the concept of “progressive lowering”

• Optimizations enabled at each stage of lowering

• Common IR blocks that can be reused across front-ends


[1] https://mlir.llvm.org/
Build a compiler for parallel languages!
Auto-generated Code Less Performant than Hand-tuned Code for GPUs!
• MLIR compilers get good gains over existing compilers from fusion

• And provide quicker turn-around time & maintainable code over CUDA

• But still immature to replace hand-tuned code! [2]


[2] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
What Are Our Alternatives?

• Use compilers that generate performant code for accelerators

• Invoke high-performance library functions from C++ & Python code

• Leverage advancements in high-level languages like C++ & Python to express parallelism better
Libraries for Popular Functionalities

• Accelerator vendors provide libraries with high-performance versions of popularly-used functions for images, math, etc.

• These in turn use hand-written kernels in low-level abstractions (CUDA and others)

• Have bindings for both C++ & Python; extensive documentation and support

• In fact, modern AI compilers (XLA, etc.) match operations against these libraries instead of generating code!
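As an illustration, a minimal sketch of calling one such library (cuBLAS, assuming the CUDA toolkit is installed; error checking omitted) from C++. The hand-tuned GEMM kernel lives inside the library, and the caller writes no device code.

// Compile with: nvcc gemm.cpp -lcublas
#include <cublas_v2.h>
#include <cuda_runtime.h>
#include <cstdio>
#include <vector>

int main() {
    const int n = 512;  // square matrices for simplicity
    std::vector<float> hA(n * n, 1.0f), hB(n * n, 2.0f), hC(n * n);

    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));
    cudaMemcpy(dA, hA.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(dB, hB.data(), n * n * sizeof(float), cudaMemcpyHostToDevice);

    cublasHandle_t handle;
    cublasCreate(&handle);
    const float alpha = 1.0f, beta = 0.0f;
    // C = alpha*A*B + beta*C, with the vendor's hand-tuned kernel underneath
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N,
                n, n, n, &alpha, dA, n, dB, n, &beta, dC, n);

    cudaMemcpy(hC.data(), dC, n * n * sizeof(float), cudaMemcpyDeviceToHost);
    std::printf("C[0] = %f\n", hC[0]);  // expect 1024.0 (512 * 1 * 2)

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}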
Extensive Libs with Hand-tuned Kernels
For NVIDIA GPUs [3]; for AMD GPUs [4]

[3] https://developer.nvidia.com/gpu-accelerated-libraries
[4] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin
What Are Our Alternatives?

• Use compilers that generate performant code for accelerators

• Invoke high-performance library functions from C++ & Python code

• Leverage advancements in high-level languages like C++ & Python to express parallelism better
Improving Parallelism Expression in C++
The Impact of These Advancements
• Specifying what you want to do, not how, makes code faster to write & simpler! [5,6]

[5] NVIDIA HPC Standard Language Parallelism C++ by Robert Searles


[6] “Going Native 2013 C++ Seasoning” by Sean Parent
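A minimal sketch of the idea in standard C++ (nothing vendor-specific assumed): the raw loop spells out how to traverse the data, while the algorithm states what is wanted and leaves the how to the implementation.

#include <algorithm>
#include <vector>

int main() {
    std::vector<int> v = {3, 1, 4, 1, 5};

    // "how": explicit index management, one fixed traversal order
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] = v[i] * 2;

    // "what": the same doubling, expressed as intent
    std::transform(v.begin(), v.end(), v.begin(),
                   [](int x) { return x * 2; });
}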
The C++ Algorithm Library
• Available via #include <algorithm> since C++98; many algorithms (all_of, any_of, etc.) were added in C++11

• But compilers also implement common algorithms for older versions

• Built-in implementations of common functions

• Search, sort, count, etc.

• Works with iterators over pre-allocated STL containers

• STL algorithms cannot change the size of a collection!


[7] “105 STL algos in an hour”, by Jonathan Boccara
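A minimal sketch of the library in use (standard C++ only): the algorithms operate through iterators on an already-sized container.

#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    std::vector<int> v = {5, 3, 8, 3, 1};
    std::sort(v.begin(), v.end());                           // sort in place
    auto threes = std::count(v.begin(), v.end(), 3);         // query: count
    bool found = std::binary_search(v.begin(), v.end(), 8);  // search a sorted range
    std::printf("threes=%ld found=%d\n", (long)threes, (int)found);
}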
Examples in Each Category
• Permutations: heaps, sorting, partitioning, permutations

• Queries: value (count, accumulate), property (all_of, any_of, none_of), search, etc.

• Algos on Sets: set_difference, set_union, set_intersection, etc.

• Movers: copy, move, swap_ranges, etc.

• Value Modifiers: iota, fill, generate, replace

• Structure Changes: remove, unique

• Raw Memory: uninitialized_fill, uninitialized_copy, uninitialized_move, destroy, etc.

• for_each, transform

• Runes: *_n, *_copy, stable_*, is_*, *_if
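A minimal sketch touching a few of these categories (standard C++ only). Note that remove() only shuffles kept elements forward; the container's erase() does the actual shrinking, since algorithms cannot resize collections.

#include <algorithm>
#include <iterator>
#include <numeric>  // iota lives here rather than in <algorithm>
#include <vector>

int main() {
    std::vector<int> a(5);
    std::iota(a.begin(), a.end(), 1);                      // value modifier: a = {1,2,3,4,5}

    // structure change: remove() returns the new logical end; erase() shrinks
    a.erase(std::remove(a.begin(), a.end(), 3), a.end());  // a = {1,2,4,5}

    std::vector<int> b = {2, 4, 6}, out;
    std::set_intersection(a.begin(), a.end(), b.begin(), b.end(),
                          std::back_inserter(out));        // set algo: out = {2,4}
}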


C++ Execution Policies
Enable automatic device selection for standard algorithms
C++ Execution Policy Overview
• Specifies whether an algorithm may be vectorized or parallelized

• What is the difference between vector and parallel execution?

• Specified as the first parameter to an algorithm

• GPU allocation, data movement handled by compiler + runtime


Original:

std::sort(employees.begin(),
          employees.end(),
          CompareByLastName());

Parallel version [8]:

std::sort(std::execution::par,
          employees.begin(),
          employees.end(),
          CompareByLastName());

[8] “Accelerating Standard C++ with GPUs using Stdpar”, by D. Olsen, G. Lopez, and B. A. Lelbach: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
The C++ Execution Policies

• std::execution::seq - sequential execution; no parallelism allowed

• std::execution::unseq - vectorized execution on the calling thread (C++20)

• std::execution::par - parallel execution on one or more threads

• std::execution::par_unseq - parallel execution on one or more threads, with each thread possibly vectorized
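A minimal sketch showing all four policies on the same reduction (seq/par/par_unseq are C++17, unseq is C++20; whether a policy actually parallelizes is up to the implementation).

#include <execution>
#include <numeric>
#include <vector>

int main() {
    std::vector<double> v(1000000, 1.0);
    double s;
    s = std::reduce(std::execution::seq,       v.begin(), v.end());  // sequential
    s = std::reduce(std::execution::unseq,     v.begin(), v.end());  // vectorized, one thread
    s = std::reduce(std::execution::par,       v.begin(), v.end());  // multiple threads
    s = std::reduce(std::execution::par_unseq, v.begin(), v.end());  // threads + vectors
    (void)s;
}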
Stdpar on NVIDIA GPUs : Guidelines [8]

• Add the nvc++ flag -stdpar to enable automatic parallelisation on NVIDIA GPUs

• No __device__ annotation, but definition & call should be in the same file

• Data used in a parallel algorithm should be heap-allocated to be accessible from the GPU

• Avoid function pointers as params to parallel functions; use a lambda or function object

• Random access iterators required for C++ parallel algorithms

• C++ parallel Algorithms on GPUs don’t throw exceptions!


[8] “Accelerating Standard C++ with GPUs using Stdpar”, by D. Olsen, G. Lopez, and B. A. Lelbach: https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/
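A minimal SAXPY sketch following these guidelines (assuming NVIDIA's HPC compiler; compile with nvc++ -stdpar=gpu saxpy.cpp): heap-allocated vectors, a lambda rather than a function pointer, random-access iterators.

#include <algorithm>
#include <execution>
#include <vector>

int main() {
    const int n = 1 << 20;
    const float a = 2.0f;
    // std::vector data lives on the heap, so -stdpar can make it GPU-visible
    std::vector<float> x(n, 1.0f), y(n, 2.0f);

    // y = a*x + y; with -stdpar this transform may be offloaded to the GPU
    std::transform(std::execution::par_unseq,
                   x.begin(), x.end(), y.begin(), y.begin(),
                   [a](float xi, float yi) { return a * xi + yi; });
}
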
GPUs Pushing Parallelism in C++
NVIDIA Pushing Parallelism in C++ [9]

[9] https://www.nvidia.com/en-us/on-demand/session/gtc25-s72572/
Performance on Real Applications
• LULESH (Hydrodynamics Mini-App)
• STLBM (Lattice-Boltzmann Simulations)
• M-AIA (Multi-Physics Aerodynamics)

• Performance may be close to hand-tuned CUDA for some workloads! [2]


But What about Python?
Must we restrict “performance programming” to the select few?
Parallelism in Python
• Python has become the de facto standard for programming

• But Python code executes on a single thread!


• The language supports parallelism, but makes code look like C++!

• Several active efforts to improve parallelism expression in Python


• Domain-Specific Language (DSL) extensions like TensorFlow, Torch, JAX

• JIT compilers for Python, like Numba

• PyCUDA and OpenAI Triton, which enable writing GPU kernels in Python

• Languages like Mojo that attempt to improve parallelism expression


Triton: A GPU-friendly Programming Language
• Triton kernels are single-threaded & auto-parallelized [10]

• Triton doesn’t need “GPU Ninjas” [11]

• Triton is cross-platform with public backends for NVIDIA and AMD GPUs

• Becoming increasingly popular in DL frameworks like PyTorch


• The torch.compile() backend generates Triton code, which is then compiled for the GPU
[10] “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”, http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

[11] “Triton Compiler”, https://www.youtube.com/watch?v=AtbnRIzpwho


Triton Overview [12]

• Triton is both the language and the compiler


• Tensors defined in SRAM, modified using torch-like operators

• Triton compiler uses MLIR for back-end code generation

• Key benefits of the language over CUDA, or DSLs like PyTorch / TF


• Embedded in Python: kernels are written in Python with the triton.jit decorator

• DRAM accesses handled via pointers; still have low-level control

• Blocked program representation enables efficient code-gen


[12] The Triton Language by Philippe Tillet: https://www.youtube.com/watch?v=G951lCm_qnk
Triton Examples : Vector Add [13]

Version 1: a single program handles exactly N = 1024 elements:

import torch
import triton
import triton.language as tl

@triton.jit
def _add(z_ptr, x_ptr, y_ptr, N):
    # same as torch.arange
    offsets = tl.arange(0, 1024)
    # create 1024 pointers to X, Y, Z
    x_ptrs = x_ptr + offsets
    y_ptrs = y_ptr + offsets
    z_ptrs = z_ptr + offsets
    # load 1024 elements of X, Y, Z
    x = tl.load(x_ptrs)
    y = tl.load(y_ptrs)
    # do computation
    z = x + y
    # store 1024 elements of Z
    tl.store(z_ptrs, z)

N = 1024
x = torch.randn(N, device='cuda')
y = torch.randn(N, device='cuda')
z = torch.randn(N, device='cuda')
grid = (1,)
_add[grid](z, x, y, N)

Version 2: general N, with one program per 1024-element block and masks guarding the boundary [12]:

import torch
import triton
import triton.language as tl

@triton.jit
def _add(z_ptr, x_ptr, y_ptr, N):
    # same as torch.arange, shifted by this program's block index
    offsets = tl.arange(0, 1024)
    offsets += tl.program_id(0) * 1024
    # create 1024 pointers to X, Y, Z
    x_ptrs = x_ptr + offsets
    y_ptrs = y_ptr + offsets
    z_ptrs = z_ptr + offsets
    # load 1024 elements of X, Y, Z, masking out-of-bounds lanes
    x = tl.load(x_ptrs, mask=offsets < N)
    y = tl.load(y_ptrs, mask=offsets < N)
    # do computation
    z = x + y
    # store 1024 elements of Z
    tl.store(z_ptrs, z, mask=offsets < N)

N = 1024
x = torch.randn(N, device='cuda')
y = torch.randn(N, device='cuda')
z = torch.randn(N, device='cuda')
grid = (triton.cdiv(N, 1024),)
_add[grid](z, x, y, N)
[12] The Triton Language by Philippe Tillet: https://www.youtube.com/watch?v=G951lCm_qnk
[13] Triton Tutorials: https://triton-lang.org/main/getting-started/tutorials/index.html
Performance of Triton kernels close to
Native Kernels for Some Operators
Softmax: 10 lines of Triton code [12]; Matmul: ~25 lines of Triton [12]

[12] The Triton Language by Philippe Tillet: https://www.youtube.com/watch?v=G951lCm_qnk


NVIDIA’s cuTile: Block-Based Programming in Python & C++

[14] How to write a CUDA Program: The Parallel Programming Edition: https://www.nvidia.com/gtc/session-catalog/?search=S72897&tab.catalogallsessionstab=16566177511100015Kus#/session/1727909865019001Xmyv
Examples with cuTile: Llama-3.1 implemented in cuTile. Compilation flow: cuTile —> TileIR —> GPU.
Conclusion
• Programming GPUs with CUDA enabled the AI revolution!

• Alternate methods to program GPUs exist & are evolving

• Significant advancements in MLIR compilers to improve auto-generated code

• Libraries package hand-tuned implementations to simplify life for app developers

• C++ & Python evolving to improve expression of parallelism

• Block-based programming with Triton / cuTile evolving; future of kernel programming?


References
[1] https://mlir.llvm.org/

[2] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin

[3] https://developer.nvidia.com/gpu-accelerated-libraries

[4] https://www.modular.com/blog/democratizing-ai-compute-part-6-what-about-ai-compilers?utm_source=linkedin

[5] “NVIDIA HPC Standard Language Parallelism C++” by Robert Searles - https://vimeo.com/697495123

[6] “Going Native 2013 C++ Seasoning” by Sean Parent - https://www.youtube.com/watch?v=W2tWOdzgXHA

[7] “105 STL algos in an hour”, by Jonathan Boccara - https://www.youtube.com/watch?v=2olsGf6JIkU

[8] “Accelerating Standard C++ with GPUs using Stdpar”, by D. Olsen, G. Lopez, and B. A. Lelbach https://developer.nvidia.com/blog/accelerating-standard-c-with-gpus-using-stdpar/

[9] https://www.nvidia.com/en-us/on-demand/session/gtc25-s72572/

[10] “Triton: An Intermediate Language and Compiler for Tiled Neural Network Computations”, http://www.eecs.harvard.edu/~htk/publication/2019-mapl-tillet-kung-cox.pdf

[11] “Triton Compiler”, https://www.youtube.com/watch?v=AtbnRIzpwho

[12] The Triton Language by Philippe Tillet: https://www.youtube.com/watch?v=G951lCm_qnk

[13] Triton Tutorials: https://triton-lang.org/main/getting-started/tutorials/index.html

[14] How to write a CUDA Program: The Parallel Programming Edition: https://www.nvidia.com/gtc/session-catalog/?search=S72897&tab.catalogallsessionstab=16566177511100015Kus#/session/1727909865019001Xmyv
