The document provides an introduction to CUDA programming, covering concepts such as heterogeneous computing, memory management, and parallel programming. It includes examples of writing and launching CUDA C kernels, managing GPU memory, and performing operations like vector addition using threads and blocks. The document also outlines the steps for compiling and executing CUDA programs using the NVIDIA compiler.

CUDA Programming

NVIDIA Corporation

Note: Some contents have been modified for teaching purposes.

© NVIDIA 2013
Outline
• What will you learn in this session?
– What is heterogeneous computing?
– “Hello World!” for CUDA C
– Write and launch CUDA C kernel
– Manage GPU memory and communication between CPU and GPU
– Hands-on session on CUDA using Google Colab

Heterogeneous Computing
▪ Terminology:
▪ Host: the CPU and its memory (host memory)
▪ Device: the GPU and its memory (device memory)

[Figure: the host (CPU and its memory) alongside the device (GPU and its memory)]

Heterogeneous Computing
#include <iostream>
#include <algorithm>
#include <cstdlib>

using namespace std;

#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16

// parallel fn: runs on the device
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;

    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }

    // Synchronize (ensure all the data is available)
    __syncthreads();

    // Apply the stencil
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];

    // Store the result
    out[gindex] = result;
}

void fill_ints(int *x, int n) {
    fill_n(x, n, 1);
}

int main(void) {
    // serial code: runs on the host
    int *in, *out;       // host copies
    int *d_in, *d_out;   // device copies
    int size = (N + 2*RADIUS) * sizeof(int);

    // Alloc space for host copies and setup values
    in  = (int *)malloc(size); fill_ints(in,  N + 2*RADIUS);
    out = (int *)malloc(size); fill_ints(out, N + 2*RADIUS);

    // Alloc space for device copies
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);

    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);

    // parallel code: launch stencil_1d() kernel on GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);

    // serial code: copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);

    // Cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
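
Because fill_ints() sets every input element to 1, each computed output element should equal 2*RADIUS + 1 = 7. A minimal host-side check (not on the original slide; it uses printf, so <cstdio> or <stdio.h> would also be needed) could be added just before the cleanup:

// Verify: each computed element is the sum of 7 ones
for (int i = RADIUS; i < N + RADIUS; i++)
    if (out[i] != 1 + 2 * RADIUS)
        printf("Mismatch at %d: got %d\n", i, out[i]);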
Simple Processing Flow

[Figure: CPU memory and GPU memory connected by the PCI Bus]

1. Copy input data from CPU memory to GPU memory
2. Load GPU program and execute, caching data on chip for performance
3. Copy results from GPU memory to CPU memory
Hello World!
#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

• NVIDIA compiler (nvcc) can be used to compile programs with no device code

$ nvcc hello_world.cu
$ ./a.out
Hello World!
$

Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();  // make sure the kernel has finished before the program exits
    return 0;
}

▪ Two new syntactic elements…

Hello World! with Device Code
__global__ void mykernel(void) {
}
• CUDA C/C++ keyword __global__ indicates a
function that:
– Runs on the device
– Is called from host code

• nvcc separates source code into host and device components
– Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
– Host functions (e.g. main()) are processed by the standard host compiler (e.g., gcc)

Hello World! with Device Code
mykernel<<<1,1>>>();

• Triple angle brackets mark a call from host code to device code
– Also called a “kernel launch”
– We’ll return to the parameters (1,1) in a moment

• That’s all that is required to execute a function on the GPU!

Hello World! with Device Code
#include <stdio.h>

__global__ void mykernel(void) {
    printf("Hello World!\n");
}

int main(void) {
    mykernel<<<1,1>>>();
    cudaDeviceSynchronize();  // make sure the kernel has finished before the program exits
    return 0;
}

$ nvcc hello.cu
$ ./a.out
Hello World!
$
Parallel Programming in CUDA C
• But wait… GPU computing is about massive parallelism!

• We need a more interesting example…

• We’ll start by adding two integers and build up to vector addition

[Figure: two inputs a and b combined into a result c]

Addition on the Device
• Note that we use pointers for the variables
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• add() runs on the device, so a, b and c must point to device memory

• We need to allocate memory on the GPU

Memory Management
• Host and device memory are separate entities

– Device pointers point to GPU memory

– Host pointers point to CPU memory

• Simple CUDA API for handling device memory (see the sketch below)
– cudaMalloc(), cudaFree(), cudaMemcpy()
– Similar to the C equivalents malloc(), free(), memcpy()
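
As an illustration (a minimal sketch, not part of the original deck; it assumes <stdio.h> is included), an allocate/copy/free round trip with a basic error check:

int h_in = 42, h_out = 0;
int *d_x = NULL;

cudaError_t err = cudaMalloc((void **)&d_x, sizeof(int));      // device allocation
if (err != cudaSuccess)
    printf("cudaMalloc failed: %s\n", cudaGetErrorString(err));

cudaMemcpy(d_x, &h_in, sizeof(int), cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(&h_out, d_x, sizeof(int), cudaMemcpyDeviceToHost);  // device -> host
cudaFree(d_x);                                                 // release device memory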

Addition on the Device: add()
• Returning to our add() kernel
__global__ void add(int *a, int *b, int *c) {
*c = *a + *b;
}

• Let’s take a look at main()…

Addition on the Device: main()
int main(void) {
int a, b, c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = sizeof(int);

// Allocate space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Setup input values
a = 2;
b = 7;

Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, &a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, &b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<1,1>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(&c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
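
The slide never inspects the result; to confirm that c ends up as 9, a line such as the following (assuming <stdio.h> is included) could be added just before the cleanup:

printf("%d + %d = %d\n", a, b, c);  // expected: 2 + 7 = 9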

CONCEPTS: Heterogeneous Computing, Blocks, Threads, Indexing, Shared memory, __syncthreads(), Asynchronous operation, Handling errors, Managing devices

RUNNING IN PARALLEL
Moving to Parallel
• GPU computing is about massive parallelism
– So how do we run code in parallel on the device?

add<<< 1, 1 >>>();

add<<< N, 1 >>>();

• Instead of executing add() once, execute N times in parallel
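
As a quick illustration (a hypothetical kernel, not from the deck), each of the N parallel invocations can report which block it is by reading blockIdx.x:

#include <stdio.h>

__global__ void whoami(void) {
    printf("Hello from block %d\n", blockIdx.x);  // each block prints its own index
}

int main(void) {
    whoami<<<4, 1>>>();       // 4 blocks, 1 thread each
    cudaDeviceSynchronize();  // wait so the device-side output appears before exit
    return 0;
}

The four lines may appear in any order, since the blocks execute independently.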

Vector Addition on the Device
• With add() running in parallel we can do vector addition

• Terminology: each parallel invocation of add() is referred to as a block
– The set of blocks is referred to as a grid
– Each invocation can refer to its block index using blockIdx.x
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• By using blockIdx.x to index into the array, each block handles a different index

Vector Addition on the Device
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• On the device, each block can execute in parallel:

Block 0: c[0] = a[0] + b[0];
Block 1: c[1] = a[1] + b[1];
Block 2: c[2] = a[2] + b[2];
Block 3: c[3] = a[3] + b[3];

Vector Addition on the Device: add()
• Returning to our parallelized add() kernel
__global__ void add(int *a, int *b, int *c) {
c[blockIdx.x] = a[blockIdx.x] + b[blockIdx.x];
}

• Let’s take a look at main()…

Vector Addition on the Device: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Vector Addition on the Device: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N blocks
add<<<N,1>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
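
random_ints() is used here but never defined in the deck; a plausible minimal version (an assumption, not NVIDIA's code) is:

#include <stdlib.h>

// Fill x[0..n-1] with pseudo-random integers
void random_ints(int *x, int n) {
    for (int i = 0; i < n; i++)
        x[i] = rand() % 100;  // value range chosen arbitrarily
}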

CUDA Threads
• Terminology: a block can be split into parallel threads

• Let’s change add() to use parallel threads instead of parallel blocks
__global__ void add(int *a, int *b, int *c) {
c[threadIdx.x] = a[threadIdx.x] + b[threadIdx.x];
}

• We use threadIdx.x instead of blockIdx.x

• Need to make one change in main().


Vector Addition Using Threads: main()
#define N 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Vector Addition Using Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU with N threads
add<<<1,N>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
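
A caveat not on the slide: a single block is limited to 1024 threads on current NVIDIA GPUs, so <<<1,N>>> works for N = 512 here but cannot cover large vectors on its own; combining blocks and threads, shown next, removes that limit.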

Indexing Arrays with Blocks and Threads
• No longer as simple as using blockIdx.x and threadIdx.x
– Consider indexing an array with one element per thread (8 threads/block)

[Figure: 32 array elements in 4 blocks of 8 threads; threadIdx.x runs 0-7 within each block, and blockIdx.x is 0, 1, 2, 3]

• With M threads/block, a unique index for each thread is given by:

int index = threadIdx.x + blockIdx.x * M;

Indexing Arrays: Example
• Which thread will operate on the highlighted element?

[Figure: a 32-element array (indices 0-31) split into 4 blocks of M = 8 threads; the highlighted element falls in the block with blockIdx.x = 2, at threadIdx.x = 5]

int index = threadIdx.x + blockIdx.x * M
          = 5 + 2 * 8
          = 21;
Vector Addition with Blocks and Threads

• Use the built-in variable blockDim.x for threads per block
int index = threadIdx.x + blockIdx.x * blockDim.x;

• Combined version of add() to use parallel threads and parallel blocks:
__global__ void add(int *a, int *b, int *c) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
c[index] = a[index] + b[index];
}

• What changes need to be made in main()?


Addition with Blocks and Threads: main()
#define N (2048*2048)
#define THREADS_PER_BLOCK 512
int main(void) {
int *a, *b, *c; // host copies of a, b, c
int *d_a, *d_b, *d_c; // device copies of a, b, c
int size = N * sizeof(int);

// Alloc space for device copies of a, b, c
cudaMalloc((void **)&d_a, size);
cudaMalloc((void **)&d_b, size);
cudaMalloc((void **)&d_c, size);

// Alloc space for host copies of a, b, c and setup input values
a = (int *)malloc(size); random_ints(a, N);
b = (int *)malloc(size); random_ints(b, N);
c = (int *)malloc(size);

Addition with Blocks and Threads: main()
// Copy inputs to device
cudaMemcpy(d_a, a, size, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, b, size, cudaMemcpyHostToDevice);

// Launch add() kernel on GPU
add<<<N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>(d_a, d_b, d_c);

// Copy result back to host
cudaMemcpy(c, d_c, size, cudaMemcpyDeviceToHost);

// Cleanup
free(a); free(b); free(c);
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
return 0;
}
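
A quick sanity check of this launch (not on the slide): N = 2048*2048 = 4,194,304 elements at 512 threads per block gives exactly 4,194,304 / 512 = 8,192 blocks, so every element gets one thread and no bounds check is needed yet. The next slide handles sizes that do not divide evenly.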

Handling Arbitrary Vector Sizes
• Typical problems are not friendly multiples of blockDim.x

• Avoid accessing beyond the end of the arrays:

__global__ void add(int *a, int *b, int *c, int n) {
int index = threadIdx.x + blockIdx.x * blockDim.x;
if (index < n)
c[index] = a[index] + b[index];
}

• Update the kernel launch:

add<<<(N + M-1) / M,M>>>(d_a, d_b, d_c, N);
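
Here M stands for the threads per block (e.g. THREADS_PER_BLOCK), and (N + M-1) / M is integer division that rounds up. For example, N = 1000 and M = 256 give (1000 + 255) / 256 = 4 blocks, i.e. 1024 threads; the if (index < n) test keeps the extra 24 threads from writing past the end of the arrays.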

Why Bother with Threads?
• Threads seem unnecessary
– They add a level of complexity
– What do we gain?

• Unlike parallel blocks, threads within a block have mechanisms to (see the sketch below):
– Communicate
– Synchronize
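
As an illustration (a minimal sketch, not from the original deck; it assumes blockDim.x is a power of two and equals BLOCK_SIZE), threads within one block can cooperate through shared memory and __syncthreads() to sum their elements:

#define BLOCK_SIZE 256

__global__ void block_sum(int *in, int *out) {
    __shared__ int partial[BLOCK_SIZE];            // one slot per thread in the block
    int tid = threadIdx.x;
    partial[tid] = in[threadIdx.x + blockIdx.x * blockDim.x];
    __syncthreads();                               // make all loads visible to the whole block

    // Tree reduction: halve the number of active threads each round
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            partial[tid] += partial[tid + stride];
        __syncthreads();                           // wait before the next round
    }
    if (tid == 0)
        out[blockIdx.x] = partial[0];              // one partial sum per block
}

Launched as block_sum<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in, d_out), each block writes one partial sum that the host (or a follow-up kernel) can finish adding.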

Thank You

© NVIDIA 2013
