CUDA Tutorial

Graphic Processing Units – GPU (Section 7.

History of GPUs
• VGA in early 90’s -- A memory controller and display generator
connected to some (video) RAM
• By 1997, VGA controllers were incorporating some acceleration functions
• In 2000, a single chip graphics processor incorporated almost every detail
of the traditional high-end workstation graphics pipeline
- Processors oriented to 3D graphics tasks
- Vertex/pixel processing, shading, texture mapping, rasterization

• More recently, processor instructions and memory hardware were added
to support general-purpose programming languages
• OpenGL: A standard specification defining an API for writing applications
that produce 2D and 3D computer graphics
• CUDA (compute unified device architecture): A scalable parallel
programming model and language for GPUs based on C/C++
[Figure: Historical PC architecture (label: Video Graphics Array)]
[Figure: Contemporary PC architecture]
[Figure: Basic unified GPU architecture (labels: streaming multiprocessor, special function unit)]
ROP = Raster Operations Pipeline
TPC = Texture Processing Cluster

Tutorial CUDA
Cyril Zeller
NVIDIA Developer Technology
Note: These slides are truncated
from a longer version which is
publicly available on the web
Enter the GPU

GPU = Graphics Processing Unit

Chip in computer video cards, PlayStation 3, Xbox, etc.
Two major vendors: NVIDIA and ATI (now AMD)

© NVIDIA Corporation 2008

Enter the GPU

GPUs are massively multithreaded manycore chips

NVIDIA Tesla products have up to 128 scalar processors
Over 12,000 concurrent threads in flight
Over 470 GFLOPS sustained performance

Users across science & engineering disciplines are achieving 100x or better speedups on GPUs

CS researchers can use GPUs as a research platform for manycore computing: arch, PL, numeric, …

Enter CUDA

CUDA is a scalable parallel programming model and a software environment for parallel computing
Minimal extensions to familiar C/C++ environment
Heterogeneous serial-parallel programming model

NVIDIA’s TESLA GPU architecture accelerates CUDA

Expose the computational horsepower of NVIDIA GPUs
Enable general-purpose GPU computing

CUDA also maps well to multicore CPUs!

CUDA
Programming Model

Heterogeneous Programming

CUDA = serial program with parallel kernels, all in C

Serial C code executes in a host thread (i.e. CPU thread)
Parallel kernel C code executes in many device threads
across multiple processing elements (i.e. GPU threads)

Serial Code (Host)
Parallel Kernel (Device): KernelA (args);
...
Serial Code (Host)
Parallel Kernel (Device): KernelB (args);
...

Kernel = Many Concurrent Threads

One kernel is executed at a time on the device

Many threads execute each kernel
Each thread executes the same code…
… on different data based on its threadID

threadID: 0 1 2 3 4 5 6 7

…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

CUDA threads might be
Physical threads
As on NVIDIA GPUs
GPU thread creation and context switching are essentially free
Or virtual threads
E.g. 1 CPU core might execute multiple CUDA threads

Hierarchy of Concurrent Threads

Threads are grouped into thread blocks

Kernel = grid of thread blocks
Thread Block 0, Thread Block 1, …, Thread Block N - 1
threadID within each block: 0 1 2 3 4 5 6 7

Every thread of every block executes:
…
float x = input[threadID];
float y = func(x);
output[threadID] = y;
…

By definition, threads in the same block may synchronize with barriers

scratch[threadID] = begin[threadID];
__syncthreads();
int left = scratch[threadID - 1];

Threads wait at the barrier until all threads in the same block reach the barrier

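The barrier fragment above can be turned into a small complete program. This is a minimal sketch, not code from the slides: the kernel name `left_neighbor`, the helper `run_left_neighbor`, the `BLOCK_SIZE` constant and the choice to let thread 0 keep its own value are all assumptions made for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

#define BLOCK_SIZE 8  // assumed block size, matching the 8 threads per block drawn above

// Each thread stages one element in shared memory, waits at the barrier,
// then reads the element staged by its left neighbour in the same block.
__global__ void left_neighbor(const int *begin, int *out)
{
    __shared__ int scratch[BLOCK_SIZE];
    int threadID = threadIdx.x;
    int global = blockIdx.x * blockDim.x + threadID;

    scratch[threadID] = begin[global];
    __syncthreads();  // every write to scratch in this block is done past this point

    // Thread 0 has no left neighbour inside its block, so it keeps its own value.
    out[global] = (threadID > 0) ? scratch[threadID - 1] : scratch[threadID];
}

// Hypothetical host helper: runs the kernel on the input 0, 1, ..., n-1.
// n must be a positive multiple of BLOCK_SIZE.
std::vector<int> run_left_neighbor(int n)
{
    assert(n > 0 && n % BLOCK_SIZE == 0);
    std::vector<int> h(n);
    for (int i = 0; i < n; ++i) h[i] = i;

    int *in_d = nullptr, *out_d = nullptr;
    cudaMalloc((void **)&in_d, n * sizeof(int));
    cudaMalloc((void **)&out_d, n * sizeof(int));
    cudaMemcpy(in_d, h.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    left_neighbor<<<n / BLOCK_SIZE, BLOCK_SIZE>>>(in_d, out_d);

    cudaMemcpy(h.data(), out_d, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(out_d);
    return h;
}
```

Without the `__syncthreads()` call a thread could read `scratch[threadID - 1]` before its neighbour has written it. Note that the barrier only covers one block: the first thread of block 1 cannot see the data staged by block 0.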
Transparent Scalability
Thread blocks cannot synchronize
So they can run in any order, concurrently or sequentially
This independence gives scalability:
A kernel scales across any number of parallel cores

[Figure: a kernel grid of Blocks 0 to 7 scheduled on two devices. The 2-core device runs two blocks at a time (Blocks 0-1, then 2-3, 4-5, 6-7); the 4-core device runs four blocks at a time (Blocks 0-3, then 4-7).]

Implicit barrier between dependent kernels

vec_minus<<<nblocks, blksize>>>(a, b, c);
vec_dot<<<nblocks, blksize>>>(c, c);
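The two-launch pattern above can be sketched as a complete example. The slides do not show the kernel bodies, so everything here is an assumption: `vec_minus` is written as an element-wise subtraction, and a simpler `vec_square` kernel plus a host-side sum stands in for `vec_dot`. The point is only that two kernels launched one after the other into the default stream run in order, so the second sees every result of the first.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// c[i] = a[i] - b[i]
__global__ void vec_minus(const float *a, const float *b, float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = a[i] - b[i];
}

// c[i] = c[i] * c[i]; reads the values written by vec_minus
__global__ void vec_square(float *c, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) c[i] = c[i] * c[i];
}

// Hypothetical host helper: squared distance between a[i] = i + 3 and b[i] = i.
float squared_distance(int n)
{
    assert(n > 0);
    std::vector<float> a(n), b(n), c(n);
    for (int i = 0; i < n; ++i) { a[i] = i + 3.0f; b[i] = (float)i; }

    size_t nbytes = n * sizeof(float);
    float *a_d = nullptr, *b_d = nullptr, *c_d = nullptr;
    cudaMalloc((void **)&a_d, nbytes);
    cudaMalloc((void **)&b_d, nbytes);
    cudaMalloc((void **)&c_d, nbytes);
    cudaMemcpy(a_d, a.data(), nbytes, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b.data(), nbytes, cudaMemcpyHostToDevice);

    int blksize = 256;
    int nblocks = (n + blksize - 1) / blksize;
    vec_minus<<<nblocks, blksize>>>(a_d, b_d, c_d, n);
    vec_square<<<nblocks, blksize>>>(c_d, n);  // starts only after vec_minus has finished

    cudaMemcpy(c.data(), c_d, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d); cudaFree(b_d); cudaFree(c_d);

    float sum = 0.0f;  // final reduction done on the host to keep the sketch short
    for (int i = 0; i < n; ++i) sum += c[i];
    return sum;
}
```

Because blocks of one kernel cannot synchronize with each other, splitting the work into two launches is the simple way to get a grid-wide synchronization point.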
Heterogeneous Memory Model

[Figure: host memory connected to Device 0 memory and Device 1 memory; data moves between host and device memory with cudaMemcpy()]

Kernel Memory Access
Per-thread
Registers: on-chip
Local memory: off-chip, uncached

Per-block
Shared memory: on-chip, small, fast

Per-device
Global memory: off-chip, large, uncached
Persistent across kernel launches
Kernel I/O

[Figure: Kernel 0 and Kernel 1 run one after the other in time, both using the same global memory]

© NVIDIA Corporation 2009

Physical Memory Layout
“Local” memory resides in device DRAM
Use registers and shared memory to minimize local memory use
Host can read and write global memory but not shared memory

[Figure: the host (CPU, chipset, DRAM) connected to the device. The device holds DRAM (local memory and global memory) and a GPU with several multiprocessors, each with registers and shared memory.]

10-Series Architecture
240 thread processors execute kernel threads
30 multiprocessors, each contains
8 thread processors
One double-precision unit
Shared memory enables thread cooperation

[Figure: one multiprocessor, containing thread processors, a double-precision unit, and shared memory]

Execution Model
Software: Thread. Hardware: Thread Processor.
Threads are executed by thread processors

Software: Thread Block. Hardware: Multiprocessor.
Thread blocks are executed on multiprocessors
Thread blocks do not migrate
Several concurrent thread blocks can reside on one multiprocessor, limited by multiprocessor resources (shared memory and register file)

Software: Grid. Hardware: Device.
A kernel is launched as a grid of thread blocks
Only one kernel can execute on a device at one time

CUDA Programming Basics

Part I - Software Stack and Memory Management

Compiler

Any source file containing language extensions, like “<<< >>>”, must be compiled with nvcc
nvcc is a compiler driver
Invokes all the necessary tools and compilers like cudacc, g++, cl, ...
nvcc can output either:
C code (CPU code)
That must then be compiled with the rest of the application using another tool
PTX or object code directly
An executable requires linking to:
Runtime library (cudart)
Core library (cuda)

Compiling
[Figure: NVCC splits combined CPU/GPU source into CPU source and PTX code (virtual). A PTX-to-target compiler then produces target code for a physical GPU such as G80.]

GPU Memory Allocation / Release

Host (CPU) manages device (GPU) memory

cudaMalloc(void **pointer, size_t nbytes)
cudaMemset(void *pointer, int value, size_t count)
cudaFree(void *pointer)

int n = 1024;
int nbytes = n*sizeof(int);
int *a_d = 0;
cudaMalloc( (void**)&a_d, nbytes );
cudaMemset( a_d, 0, nbytes);
cudaFree(a_d);
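Each of the three calls above returns a `cudaError_t`, which the slide code ignores. The following is a minimal sketch of the same allocate, clear, release sequence with the return values checked. The helper name `alloc_zero_roundtrip` is hypothetical, and the copy back to the host is added only so the result can be verified.

```cuda
#include <cassert>
#include <cstdio>
#include <cuda_runtime.h>
#include <vector>

// Hypothetical helper: allocate n ints on the device, zero them, copy them back,
// and report whether every step succeeded and the data reads back as zero.
bool alloc_zero_roundtrip(int n)
{
    assert(n > 0);
    size_t nbytes = n * sizeof(int);
    int *a_d = 0;

    cudaError_t err = cudaMalloc((void **)&a_d, nbytes);
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return false;
    }

    std::vector<int> a_h(n, -1);  // sentinel values that the copy should overwrite
    bool ok = true;

    err = cudaMemset(a_d, 0, nbytes);  // sets every byte of the buffer to 0
    if (err != cudaSuccess) ok = false;

    if (ok) {
        err = cudaMemcpy(a_h.data(), a_d, nbytes, cudaMemcpyDeviceToHost);
        if (err != cudaSuccess) ok = false;
    }
    if (!ok) fprintf(stderr, "CUDA call failed: %s\n", cudaGetErrorString(err));

    cudaFree(a_d);  // release the device buffer on every path

    for (int i = 0; ok && i < n; ++i)
        if (a_h[i] != 0) ok = false;
    return ok;
}
```

Note that `cudaMemset` writes a byte value, so it is suited to clearing a buffer to zero rather than filling an `int` array with an arbitrary number.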

Data Copies

cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction);

direction specifies locations (host or device) of src and dst
Blocks CPU thread: returns after the copy is complete
Doesn’t start copying until previous CUDA calls complete
enum cudaMemcpyKind
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

Data Movement Example
#include <assert.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void)
{
    float *a_h, *b_h;   // host data
    float *a_d, *b_d;   // device data
    int N = 14, nBytes, i;

    nBytes = N*sizeof(float);
    a_h = (float *)malloc(nBytes);
    b_h = (float *)malloc(nBytes);
    cudaMalloc((void **) &a_d, nBytes);
    cudaMalloc((void **) &b_d, nBytes);

    for (i=0; i<N; i++) a_h[i] = 100.f + i;   // fill the host source array

    cudaMemcpy(a_d, a_h, nBytes, cudaMemcpyHostToDevice);     // host a_h -> device a_d
    cudaMemcpy(b_d, a_d, nBytes, cudaMemcpyDeviceToDevice);   // device a_d -> device b_d
    cudaMemcpy(b_h, b_d, nBytes, cudaMemcpyDeviceToHost);     // device b_d -> host b_h

    for (i=0; i< N; i++) assert( a_h[i] == b_h[i] );   // the round trip preserves the data

    free(a_h); free(b_h); cudaFree(a_d); cudaFree(b_d);
    return 0;
}

CUDA Programming Basics

Part II - Kernels
Thread Hierarchy

Threads launched for a parallel section are partitioned into thread blocks
Grid = all blocks for a given launch
Thread block is a group of threads that can:
Synchronize their execution
Communicate via shared memory

Executing Code on the GPU

Kernels are C functions with some restrictions

Cannot access host memory

Must have void return type
No variable number of arguments (“varargs”)
Not recursive
No static variables

Function arguments automatically copied from host to device

Function Qualifiers
Kernels designated by function qualifier:
__global__

Function called from host and executed on device

Must return void

Other CUDA function qualifiers

__device__

Function called from device and run on device

Cannot be called from host code

__host__

Function called from host and executed on host (default)

__host__ and __device__ qualifiers can be combined to
generate both CPU and GPU code
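The qualifiers above can be seen together in one short example. This is an illustrative sketch only: the function names `scale_and_shift`, `square`, `apply` and `host_and_device_agree` are made up for this example and do not come from the slides.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Compiled for both CPU and GPU because both qualifiers are present.
__host__ __device__ float scale_and_shift(float x)
{
    return 2.0f * x + 1.0f;
}

// Device-only helper: callable from kernels, not from host code.
__device__ float square(float x)
{
    return x * x;
}

// Kernel: called from the host, executed on the device, returns void.
__global__ void apply(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = square(scale_and_shift(in[i]));
}

// Hypothetical host helper: checks that the GPU result matches the value
// computed on the CPU with the same __host__ __device__ function.
bool host_and_device_agree(int n)
{
    assert(n > 0);
    std::vector<float> in(n), out(n);
    for (int i = 0; i < n; ++i) in[i] = (float)i;  // small integers, exact in float

    size_t nbytes = n * sizeof(float);
    float *in_d = nullptr, *out_d = nullptr;
    cudaMalloc((void **)&in_d, nbytes);
    cudaMalloc((void **)&out_d, nbytes);
    cudaMemcpy(in_d, in.data(), nbytes, cudaMemcpyHostToDevice);

    apply<<<(n + 255) / 256, 256>>>(in_d, out_d, n);

    cudaMemcpy(out.data(), out_d, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(out_d);

    for (int i = 0; i < n; ++i) {
        float s = scale_and_shift(in[i]);  // same source, CPU version
        if (out[i] != s * s) return false;
    }
    return true;
}
```

Sharing one `__host__ __device__` function is a convenient way to keep a CPU reference implementation and the GPU code from drifting apart.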

Launching Kernels
Modified C function call syntax:

kernel<<<dim3 dG, dim3 dB>>>(…)

Execution Configuration (“<<< >>>”)

dG - dimension and size of grid in blocks
Two-dimensional: x and y
Blocks launched in the grid: dG.x*dG.y

dB - dimension and size of blocks in threads:

Three-dimensional: x, y, and z
Threads per block: dB.x*dB.y*dB.z

Unspecified dim3 fields initialize to 1

Threads and blocks have IDs
So each thread can decide what data to work on
Block ID: 1D or 2D
Thread ID: 1D, 2D, or 3D

Simplifies memory addressing when processing multidimensional data
Image processing
Solving PDEs on volumes

[Figure: the host launches Kernel 1 on Grid 1 and Kernel 2 on Grid 2 on the device. Grid 1 is a 3 x 2 arrangement of blocks, Block (0, 0) to Block (2, 1). Block (1, 1) is expanded to show its rows of threads, starting with Thread (0, 0) to Thread (4, 0) and Thread (0, 1) to Thread (4, 1).]


dim3 grid, block;
grid.x = 2; grid.y = 4;
block.x = 8; block.y = 16;
kernel<<<grid, block>>>(...);

Equivalent assignment using constructor functions:

dim3 grid(2, 4), block(8,16);
kernel<<<grid, block>>>(...);

kernel<<<32,512>>>(...);
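As a sketch of how a two-dimensional execution configuration is usually sized for real data, the example below covers a `width` x `height` array with 16 x 16 blocks. The kernel `fill_coords`, the helper `run_fill_coords` and the block shape are assumptions chosen for illustration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

// Each thread handles one (x, y) cell and records its own coordinates.
__global__ void fill_coords(int *out, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)          // the grid may overhang the array edges
        out[y * width + x] = 1000 * y + x;
}

// Hypothetical host helper: returns the filled width*height array.
std::vector<int> run_fill_coords(int width, int height)
{
    assert(width > 0 && height > 0);
    std::vector<int> h((size_t)width * height, -1);
    size_t nbytes = h.size() * sizeof(int);

    int *out_d = nullptr;
    cudaMalloc((void **)&out_d, nbytes);

    dim3 block(16, 16);                               // 256 threads; z defaults to 1
    dim3 grid((width + block.x - 1) / block.x,        // round up so every column
              (height + block.y - 1) / block.y);      // and every row is covered
    fill_coords<<<grid, block>>>(out_d, width, height);

    cudaMemcpy(h.data(), out_d, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(out_d);
    return h;
}
```

For a 100 x 37 array this launches a 7 x 3 grid, so some threads in the last column and last row of blocks fall outside the array and are masked off by the bounds check.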

CUDA Built-in Device Variables

All __global__ and __device__ functions have access to these automatically defined variables

dim3 gridDim;
Dimensions of the grid in blocks (at most 2D)
dim3 blockDim;
Dimensions of the block in threads
dim3 blockIdx;
Block index within the grid
dim3 threadIdx;
Thread index within the block

Unique Thread IDs
Built-in variables are used to determine unique thread IDs
Map from local thread ID (threadIdx) to a global ID which can be used as array indices

Grid with blockDim.x = 5:

blockIdx.x                            |     0     |     1     |       2        |
threadIdx.x                           | 0 1 2 3 4 | 0 1 2 3 4 | 0  1  2  3  4  |
blockIdx.x*blockDim.x + threadIdx.x   | 0 1 2 3 4 | 5 6 7 8 9 | 10 11 12 13 14 |

Minimal Kernels

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = 7;
}
Output: 7 7 7 7 7 7 7 7 7 7 7 7 7 7 7

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = blockIdx.x;
}
Output: 0 0 0 0 0 1 1 1 1 1 2 2 2 2 2

__global__ void kernel( int *a )
{
    int idx = blockIdx.x*blockDim.x + threadIdx.x;
    a[idx] = threadIdx.x;
}
Output: 0 1 2 3 4 0 1 2 3 4 0 1 2 3 4

Increment Array Example

CPU program

void inc_cpu(int *a, int N)
{
    int idx;
    for (idx = 0; idx<N; idx++)
        a[idx] = a[idx] + 1;
}

int main(void)
{
    …
    inc_cpu(a, N);
    …
}

CUDA program

__global__ void inc_gpu(int *a_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)
        a_d[idx] = a_d[idx] + 1;
}

int main(void)
{
    …
    dim3 dimBlock (blocksize);
    dim3 dimGrid(ceil(N/(float)blocksize));
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);
    …
}
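The slide elides the host code around the launch. A minimal sketch of one way to fill in those steps follows. The helper `run_inc_gpu` is hypothetical, and integer ceiling division is used here in place of the floating-point `ceil` on the slide, which is a substitution rather than the slide's own code.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

__global__ void inc_gpu(int *a_d, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N)                 // the last block may have threads past the end
        a_d[idx] = a_d[idx] + 1;
}

// Hypothetical host helper: increments the array 0, 1, ..., N-1 on the GPU
// and returns the result.
std::vector<int> run_inc_gpu(int N, int blocksize)
{
    assert(N > 0 && blocksize > 0);
    std::vector<int> a_h(N);
    for (int i = 0; i < N; ++i) a_h[i] = i;

    size_t numBytes = N * sizeof(int);
    int *a_d = nullptr;
    cudaMalloc((void **)&a_d, numBytes);
    cudaMemcpy(a_d, a_h.data(), numBytes, cudaMemcpyHostToDevice);

    dim3 dimBlock(blocksize);
    dim3 dimGrid((N + blocksize - 1) / blocksize);  // integer ceiling of N / blocksize
    inc_gpu<<<dimGrid, dimBlock>>>(a_d, N);

    // cudaMemcpy waits for the kernel to finish before copying.
    cudaMemcpy(a_h.data(), a_d, numBytes, cudaMemcpyDeviceToHost);
    cudaFree(a_d);
    return a_h;
}
```

With N = 1000 and a block size of 256 the grid has 4 blocks, so 24 threads in the last block fail the `idx < N` test and do nothing.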
Host Synchronization

All kernel launches are asynchronous

control returns to CPU immediately
kernel executes after all previous CUDA calls have
completed
cudaMemcpy() is synchronous
control returns to CPU after copy completes
copy starts after all previous CUDA calls have completed
cudaThreadSynchronize()
blocks until all previous CUDA calls complete

Host Synchronization Example

…
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);

// execute the kernel

inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);

// run independent CPU code

run_cpu_stuff();

// copy data from device back to host

cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
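The sequence above can be sketched with explicit synchronization and error checks. Two points are assumptions rather than slide content: current CUDA toolkits deprecate `cudaThreadSynchronize()` in favour of `cudaDeviceSynchronize()`, which is used here, and the names `scale`, `run_cpu_stuff` and `launch_then_sync` are stand-ins chosen for this example.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

__global__ void scale(float *x, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= factor;
}

// Hypothetical stand-in for the independent CPU work in the example above.
static long run_cpu_stuff(int n)
{
    long s = 0;
    for (int i = 0; i < n; ++i) s += i;
    return s;
}

// Hypothetical host helper: returns true when the launch, the overlapped CPU
// work and the final copy all behave as expected.
bool launch_then_sync(int n)
{
    assert(n > 0);
    std::vector<float> h(n, 1.0f);
    size_t nbytes = n * sizeof(float);

    float *d = nullptr;
    if (cudaMalloc((void **)&d, nbytes) != cudaSuccess) return false;
    cudaMemcpy(d, h.data(), nbytes, cudaMemcpyHostToDevice);

    scale<<<(n + 255) / 256, 256>>>(d, 3.0f, n);      // control returns immediately
    cudaError_t launch_err = cudaGetLastError();      // e.g. invalid configuration

    long cpu_sum = run_cpu_stuff(n);                  // may overlap with the kernel

    cudaError_t sync_err = cudaDeviceSynchronize();   // blocks until the kernel is done
    cudaMemcpy(h.data(), d, nbytes, cudaMemcpyDeviceToHost);
    cudaFree(d);

    bool ok = launch_err == cudaSuccess && sync_err == cudaSuccess
           && cpu_sum == (long)n * (n - 1) / 2;
    for (int i = 0; ok && i < n; ++i)
        if (h[i] != 3.0f) ok = false;
    return ok;
}
```

The explicit synchronize is not needed for correctness here, because the following `cudaMemcpy` already waits for the kernel. It is included to give a single place where errors raised during kernel execution are reported.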

Variable Qualifiers (GPU code)

__device__
Stored in global memory (large, high latency, no cache)
Allocated with cudaMalloc (__device__ qualifier implied)
Accessible by all threads
Lifetime: application
__shared__
Stored in on-chip shared memory (very low latency)
Specified by execution configuration or at compile time
Accessible by all threads in the same thread block
Lifetime: thread block

Unqualified variables:
Scalars and built-in vector types are stored in registers
Arrays may be in registers or local memory

GPU Thread Synchronization
void __syncthreads();
Synchronizes all threads in a block
Generates barrier synchronization instruction
No thread can pass this barrier until all threads in the
block reach it
Used to avoid RAW / WAR / WAW hazards when accessing
shared memory
Allowed in conditional code only if the conditional
is uniform across the entire thread block
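A per-block sum is a common case where these rules matter. The sketch below is an assumed example, not slide code: `block_sum`, `gpu_sum` and the fixed `THREADS` value are illustrative choices. It places `__syncthreads()` outside the `if`, so every thread in the block reaches the barrier on every loop iteration.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

#define THREADS 256  // assumed block size; must be a power of two for this loop

// Each block adds up its slice of the input and writes one partial sum.
__global__ void block_sum(const int *in, int *partial, int n)
{
    __shared__ int s[THREADS];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;

    s[t] = (i < n) ? in[i] : 0;   // pad the last block with zeros
    __syncthreads();

    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();          // outside the if: reached by all threads
    }
    if (t == 0) partial[blockIdx.x] = s[0];
}

// Hypothetical host helper: sums the values on the GPU, then adds the
// per-block partial sums on the host.
int gpu_sum(const std::vector<int> &values)
{
    assert(!values.empty());
    int n = (int)values.size();
    int blocks = (n + THREADS - 1) / THREADS;

    int *in_d = nullptr, *partial_d = nullptr;
    cudaMalloc((void **)&in_d, n * sizeof(int));
    cudaMalloc((void **)&partial_d, blocks * sizeof(int));
    cudaMemcpy(in_d, values.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    block_sum<<<blocks, THREADS>>>(in_d, partial_d, n);

    std::vector<int> partial(blocks);
    cudaMemcpy(partial.data(), partial_d, blocks * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(in_d);
    cudaFree(partial_d);

    int total = 0;
    for (int b = 0; b < blocks; ++b) total += partial[b];
    return total;
}
```

Each barrier in the loop separates the reads of `s[t + stride]` in one step from the writes to those same elements in the next, which is the read-after-write hazard the slide mentions.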

GPU Atomic Integer Operations

Requires hardware with compute capability >= 1.1

G80 = Compute capability 1.0
G84/G86/G92 = Compute capability 1.1
GT200 = Compute capability 1.3
Atomic operations on integers in global memory:
Associative operations on signed/unsigned ints
add, sub, min, max, ...
and, or, xor
Increment, decrement
Exchange, compare and swap
Atomic operations on integers in shared memory
Requires compute capability >= 1.2
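A small histogram shows why atomic operations are needed when many threads update the same global-memory location. This is a minimal sketch with assumed names (`histogram`, `histogram_mod4`, `NUM_BINS`); it uses `atomicAdd` on `unsigned int` counters in global memory.

```cuda
#include <cassert>
#include <cuda_runtime.h>
#include <vector>

#define NUM_BINS 4  // assumed bin count for this illustration

// Many threads may hit the same bin, so the increment must be atomic.
// The input values are assumed to be non-negative.
__global__ void histogram(const int *data, int n, unsigned int *bins)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(&bins[data[i] % NUM_BINS], 1u);
}

// Hypothetical host helper: counts the values 0, 1, ..., n-1 by value mod NUM_BINS.
std::vector<unsigned int> histogram_mod4(int n)
{
    assert(n > 0);
    std::vector<int> data(n);
    for (int i = 0; i < n; ++i) data[i] = i;

    int *data_d = nullptr;
    unsigned int *bins_d = nullptr;
    cudaMalloc((void **)&data_d, n * sizeof(int));
    cudaMalloc((void **)&bins_d, NUM_BINS * sizeof(unsigned int));
    cudaMemcpy(data_d, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(bins_d, 0, NUM_BINS * sizeof(unsigned int));  // counters start at zero

    histogram<<<(n + 255) / 256, 256>>>(data_d, n, bins_d);

    std::vector<unsigned int> bins(NUM_BINS);
    cudaMemcpy(bins.data(), bins_d, NUM_BINS * sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(data_d);
    cudaFree(bins_d);
    return bins;
}
```

Replacing the `atomicAdd` with a plain `bins[...]++` would let concurrent threads overwrite each other's updates and lose counts.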
