CUDA Programming
NVIDIA Corporation
© NVIDIA 2013
Outline
• What will you learn in this session?
– What is heterogeneous computing?
– “Hello World!” for CUDA C
– Write and launch CUDA C kernels
– Manage GPU memory and communication between CPU and GPU
– Hands-on session on CUDA using Google Colab
Heterogeneous Computing
▪ Terminology:
▪ Host The CPU and its memory (host memory)
▪ Device The GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>
using namespace std;

#define N          1024
#define RADIUS     3
#define BLOCK_SIZE 16

// parallel fn (device code)
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) { fill_n(x, n, 1); }

// serial code
int main(void) {
    int *in, *out;        // host copies of a, b, c
    int *d_in, *d_out;    // device copies of a, b, c
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow
[figure: CPU/host memory and GPU/device memory connected by the PCI Bus, shown in three steps]
1. Copy input data from CPU memory to GPU memory
2. Load the GPU program and execute, caching data on chip for performance
3. Copy the results from GPU memory back to CPU memory
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}
$ nvcc hello_world.cu
$ ./a.out
Hello World!
$
Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();   // wait for the kernel so its output is flushed
    return 0;
}
Hello World! with Device Code
__global__ void mykernel(void) {
}
• CUDA C/C++ keyword __global__ indicates a
function that:
– Runs on the device
– Is called from host code
Hello World! with Device Code
mykernel<<<1,1>>>();
• Triple angle brackets mark a call from host code to device code
– Also called a "kernel launch"
– We'll return to the parameters (1,1) shortly
Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();   // ensure the device printf completes before exit
    return 0;
}
$ nvcc hello.cu
$ ./a.out
Hello World!
$
Parallel Programming in CUDA C
• But wait… GPU computing is about massive parallelism!
• We need a more interesting example…
– We'll start by adding two integers and build up to vector addition
[figure: a + b = c]
Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Memory Management
• Host and device memory are separate entities
– Device pointers point to GPU memory; they may be passed to/from host code but not dereferenced there
– Host pointers point to CPU memory; they may be passed to/from device code but not dereferenced there
• Simple CUDA API for handling device memory
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy() (see the sketch below)
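• A minimal sketch (not verbatim from the slides) of round-tripping a single int through device memory with these calls:

// Allocate device storage, copy a host value over, copy it back, then free it
int h_in = 42, h_out = 0;   // host copies
int *d_val;                 // device pointer
cudaMalloc((void **)&d_val, sizeof(int));                        // allocate on the GPU
cudaMemcpy(d_val, &h_in, sizeof(int), cudaMemcpyHostToDevice);   // host to device
cudaMemcpy(&h_out, d_val, sizeof(int), cudaMemcpyDeviceToHost);  // device to host
cudaFree(d_val);                                                 // release device memory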
Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}
Addition on the Device: main()
int main(void) {
    int a, b, c;            // host copies of a, b, c
    int *d_a, *d_b, *d_c;   // device copies of a, b, c
    int size = sizeof(int);
    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size);
    // Setup input values
    a = 2; b = 7;
Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
RUNNING IN PARALLEL
CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?
add<<< 1, 1 >>>();
add<<< N, 1 >>>();
– Instead of executing add() once, N copies of add() execute in parallel
Vector Addition on the Device
• With add() running in parallel we can do vector addition
– Each parallel invocation of add() is referred to as a block
– Each block can refer to its own index with the built-in variable blockIdx.x
Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}
Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
    int size = N * sizeof(int);
    // Alloc space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size); cudaMalloc((void **)&d_b, size); cudaMalloc((void **)&d_c, size);
    // Alloc space for host copies and fill with input values (random_ints() assumed as a helper)
    a = (int *)malloc(size); random_ints(a, N);
    b = (int *)malloc(size); random_ints(b, N);
    c = (int *)malloc(size);
Vector Addition on the Device: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N blocks
    add<<<N,1>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
CUDA Threads
• Terminology: a block can be split into parallel threads
– Let's change add() to use parallel threads instead of parallel blocks (sketched below)
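• A sketch of that change, mirroring the blockIdx.x version above: indexing by thread only swaps the built-in variable, and the launch moves the parallelism from blocks to threads

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];   // one thread per element
}
// Launched as: add<<<1,N>>>(d_a, d_b, d_c);   one block of N threads
// (previously:  add<<<N,1>>>(d_a, d_b, d_c);   N blocks of one thread each)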
Vector Addition Using Threads: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with N threads
    add<<<1,N>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x or threadIdx.x alone
– Consider indexing an array with one element per thread (8 threads/block)
[figure: 32 array elements; threadIdx.x runs 0–7 within each block, blockIdx.x runs 0–3 across blocks]
• With M threads per block, each thread gets a unique index: threadIdx.x + blockIdx.x * M (see the sketch below)
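• A sketch of the kernel that combines both indices; blockDim.x is the built-in giving the number of threads per block, so no hard-coded M is needed

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;   // unique index across all blocks
    c[index] = a[index] + b[index];
}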
Indexing Arrays: Example
• Which thread will operate on the red element?
[figure: a 32-element array split into four blocks of 8 threads; the red element is at position 21]
M = 8 (threads per block), threadIdx.x = 5, blockIdx.x = 2
int index = threadIdx.x + blockIdx.x * M
          = 5 + 2 * 8 = 21;
Addition with Blocks and Threads: main()
    // Copy inputs to device
    cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);
    // Launch add() kernel on GPU with blocks of THREADS_PER_BLOCK threads
    // (THREADS_PER_BLOCK, e.g. 512, and N are defined earlier in the full example)
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);
    // Copy result back to host
    cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);
    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x
– Avoid accessing beyond the end of the arrays (see the sketch below)
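• One common pattern, sketched here with the element count passed in as an extra argument n: round the block count up and guard the access

__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                        // avoid accessing beyond the end of the arrays
        c[index] = a[index] + b[index];
}
// Launch enough blocks to cover N elements with M threads per block:
//   add<<<(N + M - 1) / M, M>>>(d_a, d_b, d_c, N);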
Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?
• Unlike parallel blocks, threads within a block have mechanisms to communicate and synchronize
– e.g., through shared memory and the __syncthreads() barrier (see the sketch below)
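• A minimal sketch of that cooperation (not from the original deck): a single block of 64 threads reverses an array segment through shared memory; the block size here is an illustrative assumption

#define BLOCK 64
__global__ void reverse_in_block(int *d) {
    __shared__ int s[BLOCK];       // visible to every thread in the block
    int t = threadIdx.x;
    s[t] = d[t];                   // each thread stages one element
    __syncthreads();               // wait until all elements are staged
    d[t] = s[BLOCK - 1 - t];       // read an element written by a different thread
}
// Launched as: reverse_in_block<<<1, BLOCK>>>(d_array);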
Thank You