Lecture - 01 - CUDA Programming
An introduction to CUDA
Motivation for GPU computing
Moore’s law (roughly) states that
the number of transistors in an
integrated circuit will double every
two years. Moore made this
prediction in 1965!
High level hardware overview
Heterogeneous computing makes use of more than one type of
computer processor (for example a CPU and a GPU).
Heterogeneous computing is sometimes called hybrid computing
or accelerated computing.

Some of the motivations for employing heterogeneous
technologies are:
• Significant reductions in floor space needed.
• Higher throughput.

[Diagram: a server with 2x CPUs and a bank of GPUs; a server is often called a node when a collection of servers forms a cluster.]
Generations of GPUs

Several generations of GPUs have been released now, starting in 2007 with the first
General-Purpose Graphics Processing Unit (GPGPU), called Tesla. The "Tesla" name has
subsequently been used as the identifier for all NVIDIA HPC cards.

Currently, 5 generations of hardware cards are in use, although the Kepler and Maxwell
generations are becoming more scarce.

Kepler (compute capability 3.x):
• first released in 2012, including HPC cards.
• excellent double precision arithmetic (DP or fp64).
• our practicals will use K40s and K80s.

Maxwell (compute capability 5.x):
• first released in 2014.
• an architecture for gaming, so poor DP.

Pascal (compute capability 6.x):
• first released in 2016.
• many gaming cards and several HPC cards.

Volta (compute capability 7.x):
• first released end 2017 / start 2018.
• only HPC cards, excellent DP.

Turing (compute capability 7.5):
• first released Q3 2018.
• this is the gaming version of Volta.
• some HPC cards (specifically for AI inference). Poor DP.
Current state of the art

Due to the tailoring of GPUs for AI, we now have two generations of GPU meeting the needs of:

Gamers: RTX cards based on the Turing architecture.
AI/AR/VR researchers: Titan cards and the T4, based on both the Volta and Turing architectures.
Scientific computing researchers: Titan V and V100 cards with good DP.

Currently the biggest differentiation factor between the generations is the double precision performance.

The Volta generation has only HPC (Tesla) and "prosumer" (Titan) cards:
• Titan V: 5120 cores, 12GB (£2900)
• Tesla V100: 5120 cores, 16/32GB, PCIe or NVLink (£380)

The Turing generation has cards for gaming (GeForce) and AI cards (Titan and T4):
• GeForce RTX 2060: 1920 cores, 6GB (£300)
• GeForce RTX 2080Ti: 4352 cores, 11GB (£1100)
• Titan RTX: 4608 cores, 24GB (£2400)
• T4: 2560 cores, 16GB, HHHW, low power (£2300)
[Figure: the GPU chip (sometimes called the die); a block diagram of the V100 GPU.]
Basic building blocks
The basic building block of a GPU is the "Streaming Multiprocessor" or SM; what an SM contains is shown on the next slide.
Looking into the SM
A V100 has 80 SMs (see right).
• 16 FP32 Cores;
• 8 FP64 Cores;
• 16 INT32 Cores;
• 128KB L1 / Shared memory;
• 64 KB Register File.
Looking into the SM
Let's look at this in more detail.
Different numbers of SMs
Different products have different numbers of SMs; although the number of SMs across
the product range varies, the SM itself is exactly the same within a given generation.
Different numbers of SMs
Typically each GPU generation brings improvements in the number of SMs, the bandwidth
to device (GPU) memory and the amount of memory on each GPU.
Sometimes NVIDIA use rather confusing naming schemes….
Multiple GPU Chips
Some "GPUs" (the actual device that plugs into a PCIe slot) have multiple GPU chips on them (the K80 used in our practicals is one example).
Multithreading
A key hardware feature is that the cores in an SM are SIMT (Single Instruction Multiple
Threads) cores:
• Groups of 32 cores execute the same instructions simultaneously, but with different data.
• 32 threads all doing the same thing at the same time (threads are in lock-step).
• SIMT is also a natural choice for many-core chips to simplify each core.
A consequence of this lock-step execution is sketched below.
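Because the 32 threads of a warp move in lock-step, a branch whose condition differs across the warp means both paths are executed one after the other, with the threads on the inactive path masked off. A minimal sketch of such a divergent branch (the kernel name and data are illustrative, not from the slides):

__global__ void divergent_kernel(float *x)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    // The condition differs within each warp, so the warp executes
    // both branches in turn, masking off the threads on the other path.
    if (threadIdx.x % 32 < 16)
        x[tid] = 2.0f * x[tid];   // taken by the first 16 threads of the warp
    else
        x[tid] = 0.5f * x[tid];   // taken by the other 16 threads
}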
Multithreading
• GPUs do not exploit "context switching"; each thread has its own
registers, which limits the number of active threads.
• Threads on each SM execute in groups of 32 called "warps".
[Diagram: a warp of 32 threads shown side by side.]
Multithreading
Originally, each thread completed one operation before the next one
started, to avoid the complexity of pipeline overlaps; some examples are:
• Structural – two instructions need to use the same physical
hardware component.
[Diagram: instruction pipelines Pipe 0, Pipe 1 and Pipe 2.]
Working with GPUs
Recall from earlier in these slides that a GPU (the "device") is attached to the
host computer by a PCIe bus.
Software view
In a bit more detail, at the top level, we have a master process which runs
on the CPU and performs the following steps:
1. Initialises the card.
Software view
In further detail, within the GPU:
• All threads within one instance (block) can access local shared memory but
can't see what the other instances are doing (even if they are on the
same SM). A small sketch of this is given below.
Software view - CUDA
CUDA (Compute Unified Device Architecture) provides a
set of programming extensions based on the C/C++
family of languages.
Installing CUDA
CUDA is supported on Windows, Linux and MacOSX.
https://developer.nvidia.com/cuda-downloads
Driver
• Low-level software that controls the graphics
card.
Toolkit
• nvcc CUDA compiler (an example compile command is shown at the end of this slide).
• Nsight IDE plugin for Eclipse or Visual Studio.
• Profiling and debugging tools.
• Several libraries (more to come on this).
SDK
• Lots of demonstration examples.
• Some error-checking utilities.
• Not officially supported by NVIDIA.
• Sparse documentation.
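As a rough usage sketch (the file and binary names are illustrative; the -arch flag should match your card's compute capability, for example sm_35 for a Kepler K40):

nvcc -arch=sm_35 -o hello hello.cu
./hello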
CUDA Programming
A CUDA program uses one of two APIs:
1. the runtime API
• simpler, more convenient
2. the driver API
• much more verbose
• more flexible (e.g. allows run-time compilation)
• closer to OpenCL
CUDA Programming – host code

The host code handles:
• memory allocation on graphics card
• data transfer to/from device memory, including
  • constants
  • ordinary data
• error-checking
• timing
There is also a special syntax for launching multiple
instances of the kernel process on the GPU.

// malloc() host memory (this is in your RAM)
h_input  = (float*) malloc(mem_size);
h_output = (float*) malloc(mem_size);

// allocate device memory input and output arrays
cudaMalloc((void**)&d_input, mem_size);
cudaMalloc((void**)&d_output, mem_size);

// Do something here!

// cleanup memory
free(h_input);
free(h_output);
cudaFree(d_input);
cudaFree(d_output);
CUDA Programming – host code
• memory allocation on graphics card
• data transfer to/from device memory, including
  • constants
  • ordinary data
• error-checking
• timing

// Copy host memory to device input array
cudaMemcpy(d_input, h_input, mem_size, cudaMemcpyHostToDevice);

// Do something on the GPU

// copy result from device to host
cudaMemcpy(h_output, d_output, mem_size, cudaMemcpyDeviceToHost);
CUDA Programming – host code
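The error-checking and timing items from the list above are not shown in the surviving code; a minimal sketch of how they are commonly handled, using the checkCudaErrors macro from helper_cuda.h and CUDA events (the kernel and variable names here are placeholders, not from the original slides):

#include <helper_cuda.h>                       // provides checkCudaErrors()

// error-checking: wrap runtime calls, and check kernel launches explicitly
checkCudaErrors( cudaMalloc((void**)&d_input, mem_size) );
my_kernel<<<nblocks, nthreads>>>(d_input);
checkCudaErrors( cudaGetLastError() );         // picks up kernel launch errors

// timing with CUDA events
cudaEvent_t start, stop;
float milli;
checkCudaErrors( cudaEventCreate(&start) );
checkCudaErrors( cudaEventCreate(&stop) );

checkCudaErrors( cudaEventRecord(start) );
my_kernel<<<nblocks, nthreads>>>(d_input);
checkCudaErrors( cudaEventRecord(stop) );
checkCudaErrors( cudaEventSynchronize(stop) );               // wait for the kernel to finish
checkCudaErrors( cudaEventElapsedTime(&milli, start, stop) ); // elapsed time in ms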
CUDA Programming – host code
In its simplest form the special syntax looks like:

kernel_routine<<<gridDim, blockDim>>>(args);

gridDim is the number of instances of the kernel
(the "grid" size).

blockDim is the number of threads within each instance (the
"block" size).

#include <stdio.h>

__global__ void helloworld_GPU(void){
  printf("Hello world!\n");
}

int main(void) {
  // run CUDA kernel
  helloworld_GPU<<<1,1>>>();

  // wait for the kernel to finish so its printf output is flushed
  cudaDeviceSynchronize();
  return 0;
}
CUDA Programming
At the lower level, when one instance of the kernel is started on an SM it
is executed by a number of threads, each of which knows about:
CUDA Programming
The CUDA language uses some reserved or special variables.
These are: gridDim, blockIdx, blockDim and threadIdx.
CUDA Programming
Below is a conceptual example of a 1D grid composed of 4 blocks, each having 64 threads per block (a corresponding kernel sketch follows the list):
• gridDim = 4
• blockDim = 64
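A sketch of how this configuration might look in code (the kernel name and array are illustrative, not from the slides); with 4 blocks of 64 threads, tid runs from 0 to 255 across the grid:

__global__ void scale(float *x)
{
    // unique 1D identifier across the whole grid: 0 ... gridDim.x*blockDim.x - 1
    int tid = threadIdx.x + blockDim.x * blockIdx.x;
    x[tid] = 2.0f * x[tid];
}

// launched from the host with gridDim = 4 and blockDim = 64, i.e. 256 threads in total:
// scale<<<4, 64>>>(d_x);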
CUDA Programming
The kernel code looks fairly normal once you get used to two things:

It's important to think about where each variable lives (more on this in the next lecture):
• Any operation involving data in the device memory forces its transfer to/from registers in the GPU.
• It's often better to copy the value into a local register variable first (a sketch of this pattern follows).
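A sketch of that pattern (the kernel and variable names are illustrative): a[0] and this thread's element of x are read from device memory once, the loop then works purely on registers, and there is a single write back at the end.

__global__ void iterate(float *x, float *a, int m)
{
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    float coeff = a[0];      // copy from device memory into a register once
    float val   = x[tid];    // likewise for this thread's own element

    for (int i = 0; i < m; i++)
        val = coeff * val + 1.0f;   // loop body touches registers only

    x[tid] = val;            // single write back to device memory
}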
Our first host code
int main() {
  // ... (allocation, kernel launch and copy back to the host omitted in this excerpt) ...

  for (int n=0; n<nsize; n++) printf(" n, x = %d %f \n",n,h_x[n]); // Print the results.
}
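Only a fragment of the first host code survives above; a minimal sketch of what the complete program might look like (the array names, sizes and the kernel name my_first_kernel are assumptions, and the kernel itself is sketched on the next slide):

#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

__global__ void my_first_kernel(float *x);                // defined on the next slide

int main() {
  int nblocks = 2, nthreads = 8, nsize = nblocks * nthreads;
  float *h_x, *d_x;

  h_x = (float *) malloc(nsize * sizeof(float));           // host array
  cudaMalloc((void **)&d_x, nsize * sizeof(float));        // device array

  my_first_kernel<<<nblocks, nthreads>>>(d_x);             // launch the kernel

  cudaMemcpy(h_x, d_x, nsize * sizeof(float), cudaMemcpyDeviceToHost);

  for (int n=0; n<nsize; n++) printf(" n, x = %d %f \n", n, h_x[n]); // Print the results.

  cudaFree(d_x);
  free(h_x);
  return 0;
}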
Our first kernel code
#include <helper_cuda.h>

• Within each block of threads, threadIdx.x ranges from 0 to blockDim.x-1, so each thread
has a unique value for tid (a sketch of such a kernel is given below).
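The kernel itself is not reproduced above; a minimal sketch consistent with the description (the kernel name and the value written are assumptions):

__global__ void my_first_kernel(float *x)
{
    // block offset plus position within the block gives a grid-wide unique tid
    int tid = threadIdx.x + blockDim.x * blockIdx.x;

    x[tid] = (float) threadIdx.x;     // each thread writes its own element
}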
Quick recap
Software vs hardware: a Thread (software) executes on an FP32 Core (hardware).
Scaling things up
Suppose we have 1000 blocks, and each one has 128 threads – how would this get executed?

On the Kepler cards that we will use in our practicals, we would probably get 8-12 blocks running on each SM (a Kepler SM has 192 cores), and each block has 4 warps, so 32-48 warps running on each SM.

Each clock tick, the SM warp scheduler decides which warps to execute next, choosing from those not waiting for anything (for example, for data to arrive from device memory, or for earlier instructions to complete).

As a programmer, we don't need to worry about this level of detail; we just need to ensure there are lots of threads / warps. (The launch-configuration arithmetic is sketched below.)
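To make the arithmetic concrete: 1000 blocks of 128 threads is 128,000 threads in total, and each block of 128 threads is 128/32 = 4 warps. A common way of choosing the number of blocks for a problem of size N is sketched below (variable names are illustrative):

int N        = 128000;                          // problem size
int nthreads = 128;                             // threads per block = 4 warps
int nblocks  = (N + nthreads - 1) / nthreads;   // round up: 1000 blocks here

// my_kernel<<<nblocks, nthreads>>>(d_x, N);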
Quick recap
Thread blocks are formed from warps.

By this we mean that everything that
happens within a warp is lock-step:
the same operation (instruction) occurs in
threads 2, 3, 8, 7, 1, 4, 11, … (all of threads 0 to 31 of the warp) at the same time.

[Diagram: a thread block split into warps; warp 1 holds threads 32, 33, 34, 35, …; warp 2 holds threads 64, 65, 66, 67, …; warp 3 holds threads 96, 97, 98, 99, …]

(How a thread finds its own warp, and its position within it, is sketched below.)
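Inside a kernel, a thread can work out which warp of its block it belongs to, and its position (lane) within that warp, from threadIdx.x; a small sketch (names are illustrative):

__global__ void warp_info(int *warp_of, int *lane_of)
{
    int tid  = threadIdx.x + blockDim.x * blockIdx.x;
    int warp = threadIdx.x / warpSize;    // warpSize is a built-in, 32 on current hardware
    int lane = threadIdx.x % warpSize;    // 0 ... 31 within the warp
    warp_of[tid] = warp;
    lane_of[tid] = lane;
}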
Higher dimensions
So far, our simple example considers the case of a 1D grid of blocks, and within each block a 1D set of threads.

Many applications (Finite Element, CFD, MD, …) might need to use 2D or even 3D sets of threads.

For this, dim3 is a special CUDA datatype with 3 components .x, .y, .z, each initialised to 1 (a usage sketch follows).
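A sketch of how dim3 is typically used to set up a 2D launch (the sizes and names are illustrative); any component not specified defaults to 1:

dim3 blocks(32, 16);     // 2D grid: 32 x 16 = 512 blocks   (.z defaults to 1)
dim3 threads(16, 8);     // 2D blocks: 16 x 8 = 128 threads per block

// my_2d_kernel<<<blocks, threads>>>(d_data);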
Indexing in higher dimensions
To calculate a unique (or 1D) thread identifier (previously we called this tid) when working in 2D or 3D we
simply combine the thread and block indices in each dimension:
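One common way of doing this in 2D, as an illustrative sketch rather than necessarily the exact formula from the original slide:

__global__ void index_2d(float *x, int nx)
{
    int ix  = threadIdx.x + blockDim.x * blockIdx.x;   // global column index
    int iy  = threadIdx.y + blockDim.y * blockIdx.y;   // global row index

    int tid = ix + nx * iy;    // flattened 1D identifier; nx = total width in x

    x[tid] = (float) tid;
}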
Mike's notes on Practical 1
Arcus htc
arcus-htc
• The GPU compute nodes have two K80 cards with a total of 4 GPUs, numbered 0 – 3.
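Since each of these nodes exposes four GPUs, a host program can enumerate them and pick one through the runtime API; a small sketch:

#include <stdio.h>
#include <cuda_runtime.h>

int main() {
  int count;
  cudaGetDeviceCount(&count);      // 4 on a node with two K80 cards
  printf("found %d GPUs\n", count);

  cudaSetDevice(2);                // select GPU number 2 (they are numbered 0-3)
  // ... allocate memory and launch kernels on that device ...
  return 0;
}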
Nsight
Key reading
CUDA Programming Guide, version 10.1:
• Chapter 1: Introduction
• Chapter 2: Programming Model
• Chapter 5: Performance of different GPUs
• Appendix A: CUDA-enabled GPUs
• Appendix B, sections B.1 – B.4: C language extensions
• Appendix B, section B.20: Formatted (printf) output
• Appendix H: Compute capabilities (features of different GPUs)

[Image credit: https://www.flickr.com/photos/abee5/8314929977]
What have we learnt?