CSE 461, Lecture 4: CUDA
Spring ’24
• Recently, CPUs are also adopting the concept of high bandwidth memory and 3D stacked memory to
improve memory access performance.
• L2 cache
– Shared across all SMs => every thread in every CUDA block can access it
• Global Memory
– Global memory can be thought of as the physical memory on your graphics card.
– All threads can read and write to Global memory.
– The host (CPU) can also read and write Global memory, e.g. through cudaMemcpy() (or directly, with mapped/unified memory).
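A minimal host-side sketch of working with global memory through the CUDA runtime API (the buffer size and variable names are illustrative, not from the slides):

```cuda
#include <cuda_runtime.h>
#include <stdio.h>

int main(void) {
    const int N = 256;
    int h_data[N];                         // host buffer
    for (int i = 0; i < N; i++) h_data[i] = i;

    int *d_data;                           // pointer into GPU global memory
    cudaMalloc(&d_data, N * sizeof(int));

    // The host reads and writes global memory through the runtime API:
    cudaMemcpy(d_data, h_data, N * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_data, N * sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(d_data);
    printf("round trip ok: h_data[10] = %d\n", h_data[10]);
    return 0;
}
```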
• Each BLOCK is 16×16×1 threads
• The GRID has 5 blocks on the x-axis and 4 blocks on the y-axis
• We use these dimensions when launching the kernel
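The block and grid dimensions above can be expressed with `dim3` when launching; a minimal sketch (`mykernel` is a placeholder for the actual kernel):

```cuda
#include <cuda_runtime.h>

__global__ void mykernel(void) {
    // Each of the 20 * 256 threads executes this body.
}

int main(void) {
    dim3 threadsPerBlock(16, 16, 1);  // each block is 16x16x1 = 256 threads
    dim3 numBlocks(5, 4);             // 5 blocks on x, 4 on y = 20 blocks
    mykernel<<<numBlocks, threadsPerBlock>>>();
    cudaDeviceSynchronize();          // wait for the kernel to finish
    return 0;
}
```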
(Figure: host CPU and GPU device connected over the PCI Bus)
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler
• gcc, cl.exe
• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment
(Figure: four blocks of eight threads each; threadIdx.x runs 0–7 within every block, while the global index runs 0–31 across the whole grid.)
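The mapping in that figure is computed inside the kernel from the built-in index variables; a minimal sketch assuming four blocks of eight threads (`indexDemo` and `out` are illustrative names):

```cuda
__global__ void indexDemo(int *out) {
    // threadIdx.x runs 0..7 inside each block; blockDim.x == 8 here,
    // so the global index runs 0..31 across the four blocks.
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    out[gindex] = gindex;
}

// Launch with four blocks of eight threads:
//   indexDemo<<<4, 8>>>(d_out);
```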
• In machine code:
– I0: LD R0, a[idx];
– I1: LD R1, b[idx];
– I2: MPY R2, R0, R1;
…
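One CUDA source line that could compile to an instruction sequence like I0–I2 above might look as follows (`mul` and the pointer parameters are illustrative names, not from the slides):

```cuda
__global__ void mul(const float *a, const float *b, float *c) {
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    // I0 loads a[idx], I1 loads b[idx], I2 multiplies them:
    c[idx] = a[idx] * b[idx];
}
```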
SVNIT, Surat CSE 461 46
GPU Latency Hiding in the SM
I0: LD R0, a[idx];
…
Why Do You Need Threads?
• Key to understanding:
– Instructions are issued in order
– A thread stalls when one of the operands isn’t ready:
• Memory read by itself doesn’t stall execution
– Latency is hidden by switching threads
• GMEM latency: >100 cycles (varies by architecture/design)
• Arithmetic latency: <100 cycles (varies by architecture/design)
(Figure: 1D stencil — omitted; each of Threads 0–8 reads a window of “in” and writes one element of “out”.)
▪ Suppose thread 15 reads the halo before thread 0 has fetched it…
...
temp[lindex] = in[gindex];          // store at temp[18]
if (threadIdx.x < RADIUS) {
    ...
}
int result = 0;
for (int offset = -RADIUS; offset <= RADIUS; offset++)
    ...
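The fix for that race is a barrier: `__syncthreads()` ensures no thread reads the shared array until every thread in the block has written its part. A sketch of the complete stencil kernel, assuming RADIUS = 3 and BLOCK_SIZE = 16 (so thread 15 stores at temp[15 + 3] = temp[18], matching the slide) and assuming the input is padded so the halo reads stay in bounds:

```cuda
#define RADIUS     3
#define BLOCK_SIZE 16

__global__ void stencil_1d(const int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Each thread copies its own element; the first RADIUS
    // threads also copy the halo cells on both sides.
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS]     = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Barrier: no thread reads temp[] until all threads have written it.
    __syncthreads();

    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}
```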
• Logically
– a[row][column] == a[offset]
– offset = column + row * N
• In CUDA:
– int col = blockIdx.x*blockDim.x+threadIdx.x;
– int row = blockIdx.y*blockDim.y+threadIdx.y;
– int index = col + row * N;
– A[index] = …
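The 2D indexing above can be sketched as a complete kernel; a minimal example, assuming an N×N row-major matrix (`fill` and `value` are illustrative names):

```cuda
__global__ void fill(float *A, int N, float value) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < N && col < N) {         // guard: grid may overhang the matrix
        int index = col + row * N;    // a[row][column] == a[offset]
        A[index] = value;
    }
}
```

The bounds check matters because the grid is sized in whole blocks, so some threads in the last row/column of blocks may fall outside the matrix.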