CUDA Talk
NVIDIA Corporation
[Diagram: serial code runs on the host (CPU); parallel code runs on the device (GPU)]
Modify the serial code into parallel CUDA C code.
[Diagram: the application is split into CUDA C functions and the rest of the C code; NVCC (Open64/LLVM) compiles the CUDA C into CUDA object files, the CPU compiler produces CPU object files, and the linker combines them into a single CPU-GPU executable]
CUDA Kernels
Parallel portions of an application are executed as kernels
The entire GPU executes a kernel, using many threads
CUDA threads:
Lightweight
Fast switching
1000s execute simultaneously
CPU = Host: executes functions
GPU = Device: executes kernels
float x = input[threadIdx.x];
float y = func(x);
output[threadIdx.x] = y;
Kernel Execution
[Diagram: each CUDA thread executes the code above on a CUDA core; each thread block runs on a CUDA Streaming Multiprocessor; the whole CUDA kernel grid runs on the CUDA-enabled GPU]
[Diagram of one SM: instruction cache, two schedulers each with a dispatch unit, register file, CUDA cores, 64 KB of configurable L1 cache / shared memory, and a uniform cache]
[Diagram: the eight blocks of a kernel grid (Block 0 through Block 7) are distributed across the available SMs, for example all queued on a single SM or spread across SM 1, SM 2, and SM 3; each SM executes its assigned blocks independently]
What is CUDA?
CUDA Architecture
Expose GPU parallelism for general-purpose computing
Retain performance
CUDA C/C++
Based on industry-standard C/C++
Small set of extensions to enable heterogeneous programming
Straightforward APIs to manage devices, memory etc.
Prerequisites
You (probably) need experience with C or C++
You don't need GPU experience
You don't need parallel programming experience
CONCEPTS
Heterogeneous Computing
Blocks
Threads
Indexing
Shared memory
__syncthreads()
Asynchronous operation
Handling errors
Managing devices
HELLO WORLD!
Heterogeneous Computing
Terminology:
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
[Diagram: stencil_1d() is the parallel function that runs on the device; the surrounding serial code runs on the host, with data crossing the PCI Bus between them]
Hello World!
#include <stdio.h>
int main(void) {
printf("Hello World!\n");
return 0;
}
Standard C that runs on the host
The NVIDIA compiler (nvcc) can be used to compile programs with no device code
Output:
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
#include <stdio.h>
// mykernel() definition assumed (not shown on this extracted slide); a device-side
// printf is why main() calls cudaDeviceSynchronize() before returning
__global__ void mykernel(void) {
    printf("Hello World from device!\n");
}
int main(void) {
mykernel<<<1,1>>>();
cudaDeviceSynchronize();
printf("Hello World from host!\n");
return 0;
}
Triple angle brackets mark a call from host code to device code
Also called a kernel launch
We'll return to the parameters (1,1) in a moment
Access to BigRed2
ssh <username>@bigred2.uits.iu.edu
cp -r /N/u/jbentz/BigRed2/oct2/cuda .
cd cuda
module load cudatoolkit
Use batch system for job submission
qsub: submit a job to the queue
qstat: show all jobs in the queue
qdel: delete a job from the queue
Hello world
Login to BigRed 2
Each coding project is in a separate folder in the following dir:
~/cuda/exercises
cd cuda/exercises/hello_world
All dirs have Makefiles for you
Try building/running the code
make
Make sure you've loaded the cudatoolkit module!
qsub runit
Output:
$ nvcc hello.cu
$ ./a.out
Hello from device!
Hello from Host!
$
Memory Management
We need to allocate memory on the GPU
Host and device memory are separate entities
Device pointers point to GPU memory
May be passed to/from host code
May not be dereferenced in host code
[Code slide: the add() kernel and main(), with host copies and device copies of a, b, c; see the sketch below]
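That slide's listing is not recoverable from the extraction; here is a minimal sketch of the pattern it illustrates, assuming the classic single-element add():

__global__ void add(int *a, int *b, int *c) {
    *c = *a + *b;                        // device code: add two values
}

int main(void) {
    int a = 2, b = 7, c;                 // host copies of a, b, c
    int *d_a, *d_b, *d_c;                // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host, then clean up
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}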
RUNNING IN PARALLEL
Moving to Parallel
GPU computing is about massive parallelism
So how do we run code in parallel on the device?
Instead of executing add() once:  add<<< 1, 1 >>>();
Execute it N times in parallel:   add<<< N, 1 >>>();
On the device, each block executes in parallel:
Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];
add() and main() for the block-parallel version (the kernel itself is sketched below):
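The add() kernel did not survive extraction; given that block i computes c[i] (see above) and that blockIdx.x gives the block index, it presumably looks like this sketch:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];   // one block per element
}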
#define N 512
int main(void) {
    int *a, *b, *c;              // host copies of a, b, c
    int *d_a, *d_b, *d_c;        // device copies of a, b, c
    int size = N * sizeof(int);

    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Alloc space for host copies of a, b, c
    a = (int *)malloc(size);
    b = (int *)malloc(size);
    c = (int *)malloc(size);
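The rest of main() was not captured in the extraction; it presumably continues along these lines (a sketch using the calls reviewed below):

    // ... fill a and b with input values ...

    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on the GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}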
Review (1 of 2)
Difference between host and device
Host = CPU
Device = GPU
Review (2 of 2)
Basic device memory management
cudaMalloc()
cudaMemcpy()
cudaFree()
INTRODUCING THREADS
CUDA Threads
Terminology: a block can be split into parallel threads
Let's change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}
[main() for the threads version: the surviving fragments (host and device copies of a, b, c, the three allocations) match the previous listing; presumably only the kernel launch changes, as sketched below]
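A hedged sketch of the changed launch, using one block of N parallel threads to match the threadIdx.x-based kernel above:

    // Launch add() kernel on the GPU with 1 block of N threads
    add<<<1,N>>>(d_a, d_b, d_c);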
COMBINING THREADS AND BLOCKS
With M = 8 threads per block, threadIdx.x runs 0-7 within each block while blockIdx.x runs 0-3 across blocks, together covering global indices 0-31.
A thread's global index is threadIdx.x + blockIdx.x * M.
Example: the thread with threadIdx.x = 5 in blockIdx.x = 2 handles element 5 + 2 * 8 = 21.
[main() for the combined version: the surviving fragments again match the previous listing; presumably only the launch changes, to add<<<N/M,M>>>(d_a, d_b, d_c); as reviewed below]
Use the built-in variable blockDim.x (threads per block) instead of a hard-coded M: int index = threadIdx.x + blockIdx.x * blockDim.x;
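Putting blocks and threads together, a sketch of the combined add() (the exact listing is not in the extraction):

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // global index across all blocks
    c[index] = a[index] + b[index];
}

// launched with N/M blocks of M threads each:
// add<<<N/M, M>>>(d_a, d_b, d_c);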
Review
Launching parallel kernels
Launch N copies of add() with add<<<N/M,M>>>();
Use blockIdx.x to access block index
Use threadIdx.x to access thread index within block
COOPERATING THREADS
1D Stencil
Consider applying a 1D stencil to a 1D array of elements
Each output element is the sum of input elements within a radius
Simple Stencil in 1D
Open exercises/simple_stencil/kernel.cu
Finish the kernel and the kernel launch (a simple version is sketched after this list)
Each thread calculates one stencil value
Reads 2*RADIUS + 1 values
dim3 type: CUDA's 3-dimensional struct, used for grid and block sizes
GPU timers have been inserted into the code to time the kernel's execution
Try various sizes of N, RADIUS, BLOCK
Time a large (over a million) value of N with a RADIUS of 7
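A straightforward global-memory version of the kernel might look like the sketch below (names are hypothetical; it assumes the input array is padded by RADIUS on each side so the reads stay in bounds):

__global__ void stencil_1d_naive(int *in, int *out) {
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int result = 0;
    // each thread reads its 2*RADIUS + 1 inputs straight from global memory
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += in[gindex + offset];
    out[gindex] = result;
}

// hypothetical launch using dim3 for the grid/block sizes:
// dim3 grid(N / BLOCK_SIZE), block(BLOCK_SIZE);
// stencil_1d_naive<<<grid, block>>>(d_in, d_out);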
Can we do better?
Input elements are read multiple times
With RADIUS=3, each input element is read seven times!
Neighbouring threads read most of the same elements.
Thread 7 reads elements 4 through 10
Thread 8 reads elements 5 through 11
[Diagram: a block's tile of input elements in shared memory, with a halo of RADIUS elements on the left and on the right of the block's own elements]
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
Data Race!
The stencil example will not work as written
Suppose thread 15 reads the halo before thread 0 has fetched it
temp[lindex] = in[gindex];                        // thread 15 stores at temp[18]
if (threadIdx.x < RADIUS) {                       // skipped: for thread 15, threadIdx.x >= RADIUS
    temp[lindex - RADIUS] = in[gindex - RADIUS];
    temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
int result = 0;
result += temp[lindex + 1];                       // thread 15 loads temp[19], possibly before thread 0 has stored it
__syncthreads()
void __syncthreads();
Synchronizes all threads within a block: every thread must reach the barrier before any thread continues, so all shared-memory writes above are visible before the stencil is applied.
Stencil Kernel
__global__ void stencil_1d(int *in, int *out) {
__shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
int gindex = threadIdx.x + blockIdx.x * blockDim.x;
int lindex = threadIdx.x + RADIUS;
// Read input elements into shared memory
temp[lindex] = in[gindex];
if (threadIdx.x < RADIUS) {
temp[lindex - RADIUS] = in[gindex - RADIUS];
temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
}
// Synchronize (ensure all the data is available)
__syncthreads();
Stencil Kernel
// Apply the stencil
int result = 0;
for (int offset = -RADIUS ; offset <= RADIUS ; offset++)
result += temp[lindex + offset];
// Store the result
out[gindex] = result;
}
Review (1 of 2)
Launching parallel threads
Launch N blocks with M threads per block with kernel<<<N,M>>>();
Use blockIdx.x to access block index within grid
Use threadIdx.x to access thread index within block
Review (2 of 2)
Use __shared__ to declare a variable/array in shared memory
Data is shared between threads in a block
Not visible to threads in other blocks
Using a large amount of shared memory per block reduces the number of blocks that can be scheduled on an SM (48 KB of shared memory per SM in total); you can query this limit on your device, as sketched below.
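Not from the slides, but a small device-management sketch (see the "Managing devices" concept) that queries the relevant limits:

#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);                       // properties of device 0
    printf("Device: %s\n", prop.name);
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    printf("Multiprocessors (SMs):   %d\n", prop.multiProcessorCount);
    return 0;
}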
Thank you!