CUDA C/C++ Basics
NVIDIA Corporation
Ways to Parallelize Applications:
- Libraries
- OpenACC Directives
- Programming Languages
What is CUDA?
- CUDA Architecture
  - Expose GPU parallelism for general-purpose computing
  - Retain performance
- CUDA C/C++
  - Based on industry-standard C/C++
  - Small set of extensions to enable heterogeneous programming
  - Straightforward APIs to manage devices, memory, etc.
CONCEPTS
- Heterogeneous Computing
- Blocks
- Threads
- Indexing
- Shared memory
- __syncthreads()
- Asynchronous operation
- Handling errors
- Managing devices

HELLO WORLD!
Heterogeneous Computing

Terminology:
- Host: the CPU and its memory (host memory)
- Device: the GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: runs on the device
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    __syncthreads();
    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) { fill_n(x, n, 1); }

int main(void) {                  // serial code: runs on the host
    int *in, *out;                // host copies
    int *d_in, *d_out;            // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in,  size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in,  in,  size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // Launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup                    // serial code
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
Simple Processing Flow
1. Copy input data from CPU memory to GPU memory over the PCI bus
2. Load GPU program and execute
3. Copy results from GPU memory back to CPU memory
Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

- Standard C that runs on the host
- The NVIDIA compiler (nvcc) can be used to compile programs with no device code

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$
Hello World! with Device Code

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

- The __global__ keyword indicates a function that runs on the device and is called from host code
- Triple angle brackets mark a call from host code to device code, also called a "kernel launch"
- We'll return to the parameters (1,1) in a moment

Output:
$ nvcc hello.cu
$ a.out
Hello World!
$
Addition on the Device

__global__ void add(int *a, int *b, int *c) { *c = *a + *b; }

int main(void) {
    int a = 2, b = 7, c;          // host copies of a, b, c
    int *d_a, *d_b, *d_c;         // device copies of a, b, c
    int size = sizeof(int);

    // Allocate space for device copies of a, b, c
    cudaMalloc((void **)&d_a, size);
    cudaMalloc((void **)&d_b, size);
    cudaMalloc((void **)&d_c, size);

    // Copy inputs to device
    cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

    // Launch add() kernel on GPU
    add<<<1,1>>>(d_a, d_b, d_c);

    // Copy result back to host
    cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

    // Cleanup
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
RUNNING IN PARALLEL
Moving to Parallel

GPU computing is about massive parallelism. Instead of executing add() once:
    add<<< 1, 1 >>>();
execute it N times in parallel, once per block:
    add<<< N, 1 >>>();

Each parallel invocation refers to its block index with blockIdx.x:

__global__ void add(int *a, int *b, int *c) {
    c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Review (1 of 2)
- Difference between host and device: host = CPU, device = GPU
- Using __global__ to declare a function as device code: executes on the device, called from host code
- Passing parameters from host code to a device function
INTRODUCING THREADS
CUDA Threads

A block can be split into parallel threads. Using threadIdx.x in place of blockIdx.x:

__global__ void add(int *a, int *b, int *c) {
    c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

In main(), launch N threads in a single block:
    add<<<1,N>>>(d_a, d_b, d_c);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
COMBINING THREADS AND BLOCKS
CUDA Execution Model
(Figure: 32 array elements split into 4 blocks of M = 8 threads each)

With M = 8 threads per block, which thread operates on element 21?
blockIdx.x = 2, threadIdx.x = 5

int index = threadIdx.x + blockIdx.x * M;
          = 5 + 2 * 8;
          = 21;
Vector Addition with Blocks and Threads

Use the built-in variable blockDim.x (threads per block) to build a combined index:

__global__ void add(int *a, int *b, int *c) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    c[index] = a[index] + b[index];
}

In main(), allocation and copies are unchanged; launch with both blocks and threads:
    add<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

    // Cleanup
    free(a); free(b); free(c);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return 0;
}
Handling Arbitrary Vector Sizes
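Typical problem sizes are not exact multiples of the block size. A sketch of the standard fix, using the deck's naming (kernel add, device arrays d_a, d_b, d_c): pass the element count n, guard the kernel with a bounds check, and round the block count up.

```cuda
__global__ void add(int *a, int *b, int *c, int n) {
    int index = threadIdx.x + blockIdx.x * blockDim.x;
    if (index < n)                      // avoid accessing beyond the end of the arrays
        c[index] = a[index] + b[index];
}

// Launch enough blocks to cover n elements, rounding up
#define THREADS_PER_BLOCK 512
add<<<(N + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(d_a, d_b, d_c, N);
```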
COOPERATING THREADS
1D Stencil

Consider applying a 1D stencil to a 1D array of elements: each output element is the sum of all input elements within a radius. If the radius is 3, each output element is the sum of 7 input elements.
Implementing Within a Block
- Each thread processes one output element; input elements are read several times, so cache them in shared memory
- void __syncthreads(); synchronizes all threads within a block: all threads must reach the barrier, preventing data hazards when threads share data
MANAGING THE DEVICE
Coordinating Host & Device
- Kernel launches are asynchronous: control returns to the CPU immediately
- cudaMemcpy() blocks the CPU until the copy is complete; cudaDeviceSynchronize() blocks until all preceding CUDA calls have completed

Reporting Errors
- All CUDA API calls return an error code (cudaError_t); get a readable message for the last error with:
    printf("%s\n", cudaGetErrorString(cudaGetLastError()));
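This check is often wrapped in a macro so every API call is verified; CUDA_CHECK below is a hypothetical helper name, not part of the CUDA API:

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical convenience macro: abort with file/line context if a call fails
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error at %s:%d: %s\n",            \
                    __FILE__, __LINE__, cudaGetErrorString(err));   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage:
// CUDA_CHECK(cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice));
```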
Device Management
- Applications can query and select GPUs: cudaGetDeviceCount(), cudaSetDevice(), cudaGetDevice(), cudaGetDeviceProperties()
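A minimal sketch of querying and selecting devices with the runtime API:

```cuda
#include <stdio.h>

int main(void) {
    int count = 0;
    cudaGetDeviceCount(&count);          // how many CUDA-capable GPUs?

    for (int i = 0; i < count; i++) {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("Device %d: %s\n", i, prop.name);
    }

    if (count > 0)
        cudaSetDevice(0);                // direct subsequent CUDA calls to device 0
    return 0;
}
```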