Introduction to CUDA Programming

CPUs are designed for sequential processing while GPUs are designed for parallel processing of thousands of threads. CUDA is a parallel computing platform that allows developers to use C++ for GPU programming on NVIDIA GPUs. A CUDA program involves setting up inputs on the CPU, allocating memory on the GPU, copying data to the GPU, running a kernel function on the GPU, and copying results back to the CPU.

CDS Introduction to CUDA

Introduction to CUDA (GPU) Programming

Module 1 Sashikumaar Ganesan, CDS, IISc Bangalore

CPU vs GPU

• A CPU is designed to excel at executing a sequence of operations (a thread) as fast as possible (low latency)
• A CPU can execute a few tens of these threads in parallel
• A GPU is designed to excel at executing thousands of threads in parallel (amortizing the slower single-thread performance to achieve greater throughput)


GPU programming

• CUDA
  • general-purpose parallel computing platform and programming model for NVIDIA GPUs
  • allows developers to use C++ as a high-level programming language

• OpenCL
  • general heterogeneous computing framework
  • https://www.khronos.org/opencl/

• OpenACC
  • user-driven, directive-based, performance-portable parallel programming model
  • supports the C, C++, and Fortran programming languages

Note: for Python usage, see Theano and PyCUDA

CUDA (Compute Unified Device Architecture)
A general-purpose parallel computing platform and programming model for NVIDIA GPUs


Scalable Programming Model

GPU computing: steps involved

• Set up inputs on the host (CPU-accessible memory)
• Allocate memory for outputs on the host
• Allocate memory for inputs on the GPU (device)
• Allocate memory for outputs on the GPU
• Copy inputs from host to GPU (slow)
• Launch the GPU kernel (a function that executes on the GPU)
• Copy outputs from GPU to host (slow)
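
The steps listed above can be sketched as one small program. This is a minimal illustration, not code from the lecture: the kernel name scale, the array length and the variable names are all assumptions made for the example, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, used only for illustration: doubles each element.
__global__ void scale(const float* in, float* out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = 2.0f * in[i];
}

int main()
{
    const int n = 256;                       // assumed problem size
    const size_t bytes = n * sizeof(float);

    // Step 1: set up inputs on the host
    float h_in[n];
    for (int i = 0; i < n; i++)
        h_in[i] = (float)i;

    // Step 2: allocate memory for outputs on the host
    float h_out[n];

    // Steps 3 and 4: allocate memory for inputs and outputs on the GPU
    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc((void**)&d_in, bytes);
    cudaMalloc((void**)&d_out, bytes);

    // Step 5: copy inputs from host to GPU
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    // Step 6: launch the kernel (1 block of n threads)
    scale<<<1, n>>>(d_in, d_out, n);

    // Step 7: copy outputs from GPU to host; this blocking copy
    // waits for the kernel to finish before it starts
    cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);

    printf("h_out[10] = %.1f\n", h_out[10]);

    cudaFree(d_in);
    cudaFree(d_out);
    return 0;
}
```

The two cudaMemcpy calls are the steps marked "slow", which is why real programs try to keep data on the GPU across several kernel launches.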


GPU: Hello World!


$> nvcc hello.cu -o hello

Source: https://cuda-tutorial.readthedocs.io/en/latest/tutorials/tutorial01/

• the __global__ specifier marks a function (a kernel) that runs on the device (GPU)
• the CUDA kernel cuda_hello() can be called from the host
• the kernel execution configuration is provided through the <<<...>>> syntax; such a call is known as a kernel launch
• the launch <<<B,M>>> starts "B" thread blocks with "M" GPU threads in each block
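
The listing for hello.cu did not survive on these slides, so the following is only a sketch of what such a program could look like. The kernel name cuda_hello comes from the text; the message and the call to cudaDeviceSynchronize() are assumptions added for the example.

```cuda
#include <cstdio>

// Kernel: __global__ means it runs on the device and is launched from the host.
__global__ void cuda_hello()
{
    printf("Hello World from GPU!\n");
}

int main()
{
    // Kernel launch with B = 1 thread block and M = 1 thread per block.
    cuda_hello<<<1, 1>>>();

    // Kernel launches are asynchronous: wait for the kernel to finish
    // so that its output is flushed before the program exits.
    cudaDeviceSynchronize();
    return 0;
}
```

Changing the launch to <<<2, 4>>> would run the kernel in 8 threads, so the message would be printed 8 times.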

GPU Computing

• Thread - distributed by the CUDA runtime (identified by threadIdx)
• Block - user-defined group of 1 to ~T threads (identified by blockIdx); T is at most 1024 on GPUs of compute capability 2.0 and later, and 512 on older devices
• Grid - a group of one or more blocks. A grid is created for each CUDA kernel function call.
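
As a small illustration of this hierarchy, the sketch below launches one grid of 2 blocks with 4 threads each and lets every thread report where it sits. The kernel name who_am_i and the launch sizes are assumptions for the example; gridDim is the CUDA built-in variable holding the number of blocks in the grid.

```cuda
#include <cstdio>

// Every thread prints its block and its position inside that block.
__global__ void who_am_i()
{
    printf("block %u of %u, thread %u of %u\n",
           blockIdx.x, gridDim.x, threadIdx.x, blockDim.x);
}

int main()
{
    // One kernel call creates one grid: here 2 blocks of 4 threads.
    who_am_i<<<2, 4>>>();
    cudaDeviceSynchronize();   // wait so that all device output is flushed
    return 0;
}
```

The order in which the lines appear is not fixed, because blocks may be scheduled in any order.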


GPU: Thread Hierarchy

NVIDIA Tesla V100 SXM2 Module with Volta GV100 GPU

Source: https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf

GPC - GPU Processing Cluster, TPC - Texture Processing Cluster, SM - Streaming Multiprocessor

GPU Computing

Indexing Arrays with Blocks and Threads

• consider indexing an array with one element per thread (8 threads per block)
• with M threads per block, a unique index for each thread is given by:
  index = M * blockIdx.x + threadIdx.x
• use the built-in variable blockDim.x for the number of threads per block (M)
• example: with M = 8, thread 1 of block 2 gets index = 8 * 2 + 1 = 17
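
The index formula can be tried out with a kernel in which every thread stores its own global index. This is a sketch under assumptions: the kernel name fill_index and the choice of 4 blocks are made up for the example, while the 8 threads per block follow the slide.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread writes its own global index into the array.
__global__ void fill_index(int* out, int n)
{
    // blockDim.x plays the role of M (threads per block)
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    if (index < n)
        out[index] = index;
}

int main()
{
    const int M = 8;       // threads per block, as on the slide
    const int B = 4;       // number of blocks (assumed for this sketch)
    const int n = M * B;   // one array element per thread

    int h[n];
    int* d = nullptr;
    cudaMalloc((void**)&d, n * sizeof(int));

    fill_index<<<B, M>>>(d, n);
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);

    // Element 17 was written by thread 1 of block 2: 8 * 2 + 1 = 17
    printf("h[17] = %d\n", h[17]);

    cudaFree(d);
    return 0;
}
```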

• blockIdx.x, blockIdx.y, blockIdx.z: built-in variables that return the block ID along the x-axis, y-axis, and z-axis of the grid
• threadIdx.x, threadIdx.y, threadIdx.z: built-in variables that return the thread ID along the x-axis, y-axis, and z-axis within a particular block
• blockDim.x, blockDim.y, blockDim.z: built-in variables that return the "block dimension" (the number of threads in a block along the x-axis, y-axis, and z-axis)
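
A two-dimensional launch shows how the x and y components combine. The sketch below is an assumption-laden illustration: the kernel name fill_2d, the 20 x 10 array and the 8 x 8 block shape are invented for the example, and dim3 is the CUDA type used to pass multi-dimensional grid and block sizes.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Each thread computes its (row, col) position and stores the matching
// row-major offset, row * width + col, at that position.
__global__ void fill_2d(int* out, int width, int height)
{
    int col = blockDim.x * blockIdx.x + threadIdx.x;
    int row = blockDim.y * blockIdx.y + threadIdx.y;
    if (col < width && row < height)
        out[row * width + col] = row * width + col;
}

int main()
{
    const int width = 20, height = 10;   // assumed array shape
    int h[width * height];
    int* d = nullptr;
    cudaMalloc((void**)&d, sizeof(h));

    dim3 threads(8, 8);                  // blockDim.x = 8, blockDim.y = 8
    dim3 blocks((width + threads.x - 1) / threads.x,    // round up in x
                (height + threads.y - 1) / threads.y);  // round up in y
    fill_2d<<<blocks, threads>>>(d, width, height);

    cudaMemcpy(h, d, sizeof(h), cudaMemcpyDeviceToHost);
    printf("h[3 * width + 5] = %d\n", h[3 * width + 5]);

    cudaFree(d);
    return 0;
}
```

The bounds check in the kernel matters here because 3 x 2 blocks of 8 x 8 threads cover 24 x 16 positions, more than the 20 x 10 array holds.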

GPU Computing: Vector addition
#include <cuda_runtime.h>

#define N 1618   // number of elements
#define T 1024   // threads per block

// Device code: each thread adds one pair of elements
__global__ void vecAdd(const int* A, const int* B, int* C, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)   // the last block has more threads than remaining elements
        C[i] = A[i] + B[i];
}

// main code
int main() {
    int a[N], b[N], c[N];
    int *dev_a, *dev_b, *dev_c;

    // initialize a and b with int values
    for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2 * i; }

    size_t size = N * sizeof(int);
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // round the number of blocks up so that all N elements are covered
    vecAdd<<<(N + T - 1) / T, T>>>(dev_a, dev_b, dev_c, N);

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
    return 0;
}
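
The number of blocks in the launch deserves care. With integer N and T, an expression such as ceil(N/T) divides first and truncates, so N = 1618 and T = 1024 would give a single block and leave the last 594 elements untouched. The sketch below makes this visible by counting the elements each launch reaches; the kernel mark and the helper count_covered are hypothetical names introduced for the example.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#define N 1618   // number of elements, as in the vector addition example
#define T 1024   // threads per block

// Marks every element that some thread reaches.
__global__ void mark(int* covered, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        covered[i] = 1;
}

// Launches `blocks` blocks of T threads and counts the elements touched.
int count_covered(int blocks)
{
    int h[N];
    int* d = nullptr;
    cudaMalloc((void**)&d, N * sizeof(int));
    cudaMemset(d, 0, N * sizeof(int));

    mark<<<blocks, T>>>(d, N);
    cudaMemcpy(h, d, N * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d);

    int count = 0;
    for (int i = 0; i < N; i++)
        count += h[i];
    return count;
}

int main()
{
    int truncated  = N / T;             // integer division: 1 block
    int rounded_up = (N + T - 1) / T;   // ceiling division: 2 blocks

    printf("%d block(s): %d of %d elements covered\n",
           truncated, count_covered(truncated), N);
    printf("%d block(s): %d of %d elements covered\n",
           rounded_up, count_covered(rounded_up), N);
    return 0;
}
```

Rounding up launches more threads than elements, which is why the kernel needs the i < n guard.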

GPU Computing: Matrix multiplication

void matrixMult(int a[N][N], int b[N][N], int c[N][N], int width)
{
    for (int i = 0; i < width; i++)
        for (int j = 0; j < width; j++) {
            int sum = 0;
            for (int k = 0; k < width; k++) {
                int m = a[i][k];
                int n = b[k][j];
                sum += m * n;
            }
            c[i][j] = sum;
        }
}

Can it be parallelized?


GPU Computing: Matrix multiplication

// Device code
__global__ void matrixMult(int *a, int *b, int *c, int width) {
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < width && row < width) {
        for (k = 0; k < width; k++)
            sum += a[row * width + k] * b[k * width + col];
        c[row * width + col] = sum;
    }
}
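
The slide gives only the device code. A host program that launches this kernel could look like the sketch below, which repeats the kernel so that it compiles on its own. The matrix size WIDTH, the block shape TILE and the test data are assumptions chosen for the example, and one entry is checked against the sequential loop.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

#define WIDTH 64   // assumed matrix size for this sketch
#define TILE  16   // assumed threads per block in each dimension

// Kernel from the slide: one thread per output element c[row][col].
__global__ void matrixMult(int* a, int* b, int* c, int width)
{
    int k, sum = 0;
    int col = threadIdx.x + blockDim.x * blockIdx.x;
    int row = threadIdx.y + blockDim.y * blockIdx.y;
    if (col < width && row < width) {
        for (k = 0; k < width; k++)
            sum += a[row * width + k] * b[k * width + col];
        c[row * width + col] = sum;
    }
}

int main()
{
    const size_t size = WIDTH * WIDTH * sizeof(int);
    int* a = (int*)malloc(size);
    int* b = (int*)malloc(size);
    int* c = (int*)malloc(size);
    for (int i = 0; i < WIDTH * WIDTH; i++) { a[i] = i % 7; b[i] = i % 5; }

    int *dev_a, *dev_b, *dev_c;
    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);
    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    // 2D launch: enough TILE x TILE blocks to cover the whole matrix
    dim3 threads(TILE, TILE);
    dim3 blocks((WIDTH + TILE - 1) / TILE, (WIDTH + TILE - 1) / TILE);
    matrixMult<<<blocks, threads>>>(dev_a, dev_b, dev_c, WIDTH);
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    // Check one entry against the sequential inner loop
    int row = 3, col = 5, ref = 0;
    for (int k = 0; k < WIDTH; k++)
        ref += a[row * WIDTH + k] * b[k * WIDTH + col];
    printf("c[3][5] = %d, reference = %d\n", c[row * WIDTH + col], ref);

    cudaFree(dev_a); cudaFree(dev_b); cudaFree(dev_c);
    free(a); free(b); free(c);
    return 0;
}
```

Each output element is computed independently of the others, which is what makes the two outer loops of the sequential version parallelizable.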


GPU Computing: Asynchronous Concurrent Execution

CUDA exposes the following operations as independent tasks that can operate concurrently with one another:
• Computation on the host
• Computation on the device
• Memory transfers from the host to the device
• Memory transfers from the device to the host
• Memory transfers within the memory of a given device
• Memory transfers among devices
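
One way to use this concurrency is a CUDA stream, sketched below. Transfers and a kernel are queued on the stream with asynchronous calls, and the host keeps computing until it synchronizes. The kernel add_one, the array size and the host-side loop are assumptions for the example; whether work actually overlaps depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel for illustration: increments each element.
__global__ void add_one(int* data, int n)
{
    int i = blockDim.x * blockIdx.x + threadIdx.x;
    if (i < n)
        data[i] += 1;
}

int main()
{
    const int n = 1 << 20;
    const size_t bytes = n * sizeof(int);

    // Pinned (page-locked) host memory lets the copies run asynchronously
    int* h = nullptr;
    cudaMallocHost((void**)&h, bytes);
    for (int i = 0; i < n; i++)
        h[i] = i;

    int* d = nullptr;
    cudaMalloc((void**)&d, bytes);

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    // These calls return immediately; the work runs in order on the stream
    cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, stream);
    add_one<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
    cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, stream);

    // Computation on the host while the device is busy
    // (it must not touch h until the stream has finished)
    long long host_sum = 0;
    for (int i = 0; i < 1000; i++)
        host_sum += i;

    cudaStreamSynchronize(stream);   // wait for the queued device work
    printf("host_sum = %lld, h[10] = %d\n", host_sum, h[10]);

    cudaStreamDestroy(stream);
    cudaFree(d);
    cudaFreeHost(h);
    return 0;
}
```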

More details: https://docs.nvidia.com/cuda/cuda-c-programming-guide/


GPU Computing

Exercise: write CUDA parallel code for
1. Matrix addition
2. Matrix-vector multiplication
3. Matrix multiplication


GPU Computing (cuDNN) in Data Science


Source: https://developer.nvidia.com/blog/category/artificial-intelligence/


Source: https://developer.nvidia.com/deep-learning-frameworks


Numba for CUDA GPUs

https://numba.pydata.org/numba-doc/latest/cuda/index.html
https://nyu-cds.github.io/python-numba/05-cuda/
