
Introduction to CUDA Programming

CPUs are designed for sequential processing while GPUs are designed for parallel processing of thousands of threads. CUDA is a parallel computing platform that allows developers to use C++ for GPU programming on NVIDIA GPUs. A CUDA program involves setting up inputs on the CPU, allocating memory on the GPU, copying data to the GPU, running a kernel function on the GPU, and copying results back to the CPU.

Uploaded by

vibhuti rajpal

CDS Introduction to CUDA

Introduction to the
CUDA (GPU) Programming

Module 1 Sashikumaar Ganesan, CDS, IISc Bangalore

CPU vs GPU

• The CPU is designed to excel at executing a sequence of operations (a thread) as fast as possible (low latency).
• A CPU can execute a few tens of these threads in parallel.
• The GPU is designed to excel at executing thousands of threads in parallel, amortizing its slower single-thread performance to achieve greater throughput.


GPU programming

• CUDA
  • general-purpose parallel computing platform and programming model for NVIDIA GPUs
  • allows developers to use C++ as a high-level programming language

• OpenCL
  • general heterogeneous computing framework
  • https://www.khronos.org/opencl/

• OpenACC
  • user-driven, directive-based, performance-portable parallel programming model
  • supports the C, C++, and Fortran programming languages

Note: For Python usage, see Theano and PyCUDA.


CUDA (Compute Unified Device Architecture)
A general-purpose parallel computing platform and programming model for NVIDIA GPUs


Scalable Programming Model

GPU computing: steps involved

• Set up inputs on the host (CPU-accessible memory)
• Allocate memory for outputs on the host (CPU)
• Allocate memory for inputs on the GPU (device)
• Allocate memory for outputs on the GPU
• Copy inputs from host to GPU (slow)
• Start the GPU kernel (a function that executes on the GPU)
• Copy outputs from GPU to host (slow)
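
The steps above can be sketched in CUDA C++ roughly as follows. This is a minimal illustration rather than code from the slides: the kernel `square`, the helper `square_on_gpu`, and the block size of 256 are assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: each thread squares one element.
__global__ void square(const float* in, float* out, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        out[i] = in[i] * in[i];
}

// Hypothetical helper that walks through the steps listed above.
std::vector<float> square_on_gpu(const std::vector<float>& h_in)
{
    int n = static_cast<int>(h_in.size());
    size_t bytes = n * sizeof(float);
    std::vector<float> h_out(n);                  // output buffer on the host

    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_in, bytes);             // input buffer on the GPU
    cudaMalloc((void**)&d_out, bytes);            // output buffer on the GPU

    // Copy inputs from host to GPU.
    cudaMemcpy(d_in, h_in.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                            // assumed block size
    int blocks = (n + threads - 1) / threads;     // enough blocks to cover n
    square<<<blocks, threads>>>(d_in, d_out, n);  // start the kernel

    // Copy outputs from GPU to host; this copy waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, bytes, cudaMemcpyDeviceToHost);

    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<float> in = {1.0f, 2.0f, 3.0f, 4.0f};  // set up inputs on the host
    std::vector<float> out = square_on_gpu(in);
    for (float v : out) printf("%g ", v);
    printf("\n");
    return 0;
}
```

The two copies are the steps marked "slow" in the list, so in practice one tries to keep data on the GPU across several kernel launches.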


GPU: Hello World!

$> nvcc hello.cu -o hello

Source: https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

• The __global__ specifier indicates a function (a kernel) that runs on the device (GPU).
• The CUDA kernel cuda_hello() can be called from the host.
• The kernel execution configuration is provided through the <<<...>>> syntax; such a call is known as a kernel launch.
• The launch <<<B,M>>> starts "B" thread blocks with "M" GPU threads in each block.
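
The listing for hello.cu did not survive in this transcript. A minimal sketch of what such a file might contain is given below, assuming only the kernel name `cuda_hello` mentioned in the bullets; the message text and the launch sizes (2 blocks of 4 threads) are choices made for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel: runs on the GPU and is launched from the host.
__global__ void cuda_hello()
{
    printf("Hello World from GPU! (block %u, thread %u)\n",
           blockIdx.x, threadIdx.x);
}

int main()
{
    cuda_hello<<<2, 4>>>();     // B = 2 blocks, M = 4 threads per block
    cudaDeviceSynchronize();    // wait for the kernel so its output is flushed
    return 0;
}
```

Compiled with `nvcc hello.cu -o hello`, this should print one line per launched thread (eight lines here); the order of the lines across blocks is not guaranteed.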

GPU Computing

• Thread - distributed by the CUDA runtime (threadIdx)
• Block - user-defined group of 1 to ~T threads (blockIdx)
• Grid - a group of one or more blocks. A grid is created for each CUDA kernel function call.


GPU: Thread Hierarchy

NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU

Source: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf
GPC - GPU Processing Cluster, TPC - Texture Processing Cluster, SM - Streaming Multiprocessor


GPU Computing

• Thread - distributed by the CUDA runtime (threadIdx)
• Block - user-defined group of 1 to ~512 threads (blockIdx); GPUs of compute capability 2.0 and later allow up to 1024 threads per block
• Grid - a group of one or more blocks. A grid is created for each CUDA kernel function call.


GPU Computing

Indexing Arrays with Blocks and Threads

• consider indexing an array with one element per thread (8 threads/block)


• With M threads per block, a unique index for each thread is given by:
    index = M * blockIdx.x + threadIdx.x
• Use the built-in variable blockDim.x for the number of threads per block (M):
    index = blockDim.x * blockIdx.x + threadIdx.x

Example: with M = 8, blockIdx.x = 2 and threadIdx.x = 1, index = 8 * 2 + 1 = 17
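
One way to check the index formula is to let every thread write its own index into the array element it owns. The sketch below does this for 3 blocks of 8 threads; the kernel `write_index` and the helper `collect_indices` are hypothetical names introduced for this illustration.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Each thread stores its own global index in the array element it owns.
__global__ void write_index(int* out)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    out[index] = index;
}

// Hypothetical helper: launch `blocks` blocks of `threads` threads each
// and return what the threads wrote (one element per thread).
std::vector<int> collect_indices(int blocks, int threads)
{
    int n = blocks * threads;
    std::vector<int> h_out(n);
    int* d_out = nullptr;
    cudaMalloc((void**)&d_out, n * sizeof(int));
    write_index<<<blocks, threads>>>(d_out);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return h_out;
}

int main()
{
    std::vector<int> idx = collect_indices(3, 8);   // 8 threads per block
    for (int v : idx) printf("%d ", v);
    printf("\n");
    return 0;
}
```

If every thread receives a unique index, element i of the result holds the value i; for instance, thread 1 of block 2 writes element 8 * 2 + 1 = 17.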

• The built-in variables blockIdx.x, blockIdx.y, blockIdx.z give the index of the block along the x-, y-, and z-axes of the grid.
• The built-in variables threadIdx.x, threadIdx.y, threadIdx.z give the index of the thread along the x-, y-, and z-axes within its block.
• The built-in variables blockDim.x, blockDim.y, blockDim.z give the "block dimension" (the number of threads in a block along the x-, y-, and z-axes).

GPU Computing: Vector addition
#include <cuda_runtime.h>

#define N 1618   // number of elements
#define T 1024   // threads per block

// Device code: each thread adds one pair of elements
__global__ void VecAdd(const int* A, const int* B, int* C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)   // the last block may contain threads past the end of the arrays
        C[i] = A[i] + B[i];
}

// main code
int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // initialize a and b with int values
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    size_t size = N * sizeof(int);
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // round the block count up: (1618 + 1023) / 1024 = 2 blocks.
    // The integer division N/T alone would give 1 block and leave 594 elements unprocessed.
    VecAdd<<<(N + T - 1) / T, T>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    return 0;
}

GPU Computing: Matrix multiplication


void matrixMult (int a[N][N], int b[N][N], int c[N][N], int width)
{
    for (int i = 0; i < width; i++)
        for (int j = 0; j < width; j++) {
            int sum = 0;
            for (int k = 0; k < width; k++) {
                int m = a[i][k];
                int n = b[k][j];
                sum += m * n;
            }
            c[i][j] = sum;
        }
}

Can it be parallelized?

// Device code
__global__ void matrixMult (int *a, int *b, int *c, int width) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    // each thread computes one element c[row][col];
    // the guard skips threads that fall outside the matrix
    if (col < width && row < width) {
        for (k = 0; k < width; k++)
            sum += a[row * width + k] * b[k * width + col];
        c[row * width + col] = sum;
    }
}
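
The slide shows only the device code. A possible host side for launching it with a two-dimensional grid is sketched below, under the assumption that the matrices are stored row-major in flat arrays. The wrapper `matmul_on_gpu` and the 16 x 16 block shape are illustrative choices, not part of the original material; the kernel is repeated so that the example is self-contained.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Kernel as in the slide: one thread per element of c.
__global__ void matrixMult(const int* a, const int* b, int* c, int width)
{
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < width && row < width) {
        int sum = 0;
        for (int k = 0; k < width; k++)
            sum += a[row * width + k] * b[k * width + col];
        c[row * width + col] = sum;
    }
}

// Hypothetical host wrapper; a and b hold width * width ints in row-major order.
std::vector<int> matmul_on_gpu(const std::vector<int>& a,
                               const std::vector<int>& b, int width)
{
    size_t bytes = static_cast<size_t>(width) * width * sizeof(int);
    std::vector<int> c(static_cast<size_t>(width) * width);

    int *d_a = nullptr, *d_b = nullptr, *d_c = nullptr;
    cudaMalloc((void**)&d_a, bytes);
    cudaMalloc((void**)&d_b, bytes);
    cudaMalloc((void**)&d_c, bytes);
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);                              // assumed shape: 256 threads per block
    dim3 grid((width + block.x - 1) / block.x,       // blocks along x (columns)
              (width + block.y - 1) / block.y);      // blocks along y (rows)
    matrixMult<<<grid, block>>>(d_a, d_b, d_c, width);

    cudaMemcpy(c.data(), d_c, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
    return c;
}

int main()
{
    // 2 x 2 example: [1 2; 3 4] * [5 6; 7 8]
    std::vector<int> a = {1, 2, 3, 4};
    std::vector<int> b = {5, 6, 7, 8};
    std::vector<int> c = matmul_on_gpu(a, b, 2);
    printf("%d %d\n%d %d\n", c[0], c[1], c[2], c[3]);
    return 0;
}
```

Rounding the grid dimensions up means some threads fall outside the matrix whenever width is not a multiple of 16, which is why the kernel needs its bounds check.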


GPU Computing: Asynchronous Concurrent Execution

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:
• Computation on the host
• Computation on the device
• Memory transfers from the host to the device
• Memory transfers from the device to the host
• Memory transfers within the memory of a given device
• Memory transfers among devices
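
A common way to express this concurrency is with CUDA streams. The sketch below splits an array into two halves and issues the copy, kernel, and copy-back for each half in its own stream. The kernel `add_one`, the array size, and the use of two streams are assumptions made for illustration; whether the operations actually overlap depends on the hardware.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: add 1 to every element.
__global__ void add_one(float* data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] += 1.0f;
}

int main()
{
    const int n = 1 << 20;
    const int half = n / 2;
    const size_t half_bytes = half * sizeof(float);

    // Pinned (page-locked) host memory lets the copies run asynchronously.
    float* h_data = nullptr;
    cudaMallocHost((void**)&h_data, n * sizeof(float));
    for (int i = 0; i < n; i++) h_data[i] = static_cast<float>(i);

    float* d_data = nullptr;
    cudaMalloc((void**)&d_data, n * sizeof(float));

    cudaStream_t stream[2];
    for (int s = 0; s < 2; s++) cudaStreamCreate(&stream[s]);

    const int threads = 256;
    const int blocks = (half + threads - 1) / threads;
    for (int s = 0; s < 2; s++) {
        int offset = s * half;
        // Work in different streams may overlap; work within one stream runs in order.
        cudaMemcpyAsync(d_data + offset, h_data + offset, half_bytes,
                        cudaMemcpyHostToDevice, stream[s]);
        add_one<<<blocks, threads, 0, stream[s]>>>(d_data + offset, half);
        cudaMemcpyAsync(h_data + offset, d_data + offset, half_bytes,
                        cudaMemcpyDeviceToHost, stream[s]);
    }

    // The host is free to do other work here while the GPU is busy.

    for (int s = 0; s < 2; s++) cudaStreamSynchronize(stream[s]);
    printf("first = %g, last = %g\n", h_data[0], h_data[n - 1]);

    for (int s = 0; s < 2; s++) cudaStreamDestroy(stream[s]);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```

All calls inside the loop return to the host immediately, so host computation, transfers, and kernels from the two streams can proceed side by side until the synchronization point.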

More details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/


GPU Computing

Exercise: Write CUDA parallel code for

1. Matrix addition
2. Matrix-vector multiplication
3. Matrix multiplication


GPU Computing (cuDNN) in Data Science


Source: https://developer.nvidia.com/blog/category/artificial-intelligence/


Source: https://developer.nvidia.com/deep-learning-frameworks


Numba for CUDA GPUs

https://numba.pydata.org/numba-doc/latest/cuda/index.html
https://nyu-cds.github.io/python-numba/05-cuda/
