CUDA C


CUDA C

CUDA
A general-purpose parallel computing platform and programming model
Introduced in 2006 by NVIDIA
Enhances the compute engine in GPUs to solve complex computational problems in an efficient way
Comes with a software environment that allows developers to use C as a high-level programming language
Other languages, application programming interfaces, and directive-based approaches are also supported, such as FORTRAN, DirectCompute, and OpenACC


CUDA C - A Scalable Programming Model

Mainstream processors are parallel
Advent of many-core and multi-core chips
3D graphics applications transparently scale their parallelism to GPUs with varying numbers of cores
Challenge: to develop application software that scales transparently with the number of cores
Let programmers focus on parallel algorithms
Not on the mechanics of a parallel programming language

CUDA C - A Scalable Programming Model

Facing the challenge with a minimal set of language extensions
Hierarchy of threads
Shared memory
Barrier synchronization
Partition the problem into coarse sub-problems
Solved independently in parallel by blocks of threads
Partition sub-problems into finer pieces
Solved cooperatively in parallel by all threads within the block

CUDA C - A Scalable Programming Model

Each block of threads can be scheduled on any of the available multiprocessors within a GPU, in any order, concurrently or sequentially
A compiled CUDA program can therefore execute on any number of multiprocessors
Only the runtime system needs to know the physical multiprocessor count


Automatic Scalability

A multithreaded program is partitioned into blocks of threads that execute independently from each other, so a GPU with more cores will automatically execute the program in less time than a GPU with fewer cores.

Heterogeneous Computing
Host: CPU and its memory (host memory)
Device: GPU and its memory (device memory)


GPU programming model

Serial code executes in a HOST (CPU) thread
Parallel code executes in many concurrent DEVICE (GPU) threads across multiple parallel processing elements

Compiling CUDA C Programs

Refer: http://docs.nvidia.com/cuda/cuda-compiler-driver-nvcc/

Compiling CUDA C Programs

Source files for CUDA applications
A mixture of conventional C++ host code plus GPU device (i.e., GPU-side) functions
CUDA compilation trajectory
Separates the device functions from the host code,
Compiles the device functions using proprietary NVIDIA compilers/assemblers,
Compiles the host code using a general-purpose C/C++ compiler that is available on the host platform, and
Embeds the compiled GPU functions in the host object file.
In the linking stage, specific CUDA runtime libraries are added to support remote SIMD procedure calling and to provide explicit GPU manipulation such as allocation of GPU memory buffers and host-GPU data transfer.


Purpose of NVCC
The compilation trajectory involves splitting, compilation, preprocessing, and merging steps for each CUDA source file
The CUDA compiler driver nvcc hides the intricate details of CUDA compilation from developers
nvcc mimics the behavior of the GNU compiler gcc: it accepts a range of conventional compiler options, such as for defining macros and include/library paths, and for steering the compilation process
All non-CUDA compilation steps are forwarded to a general-purpose C compiler that is supported by nvcc
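For example, a single source file might be compiled with familiar gcc-style options (file, macro and path names here are hypothetical):

nvcc -O2 -Iinclude -DUSE_DOUBLE -o vecadd vecadd.cu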



Integrated C programs with CUDA extensions

The NVCC compiler splits the source:
Host code goes to the host C preprocessor, compiler, and linker
Device code (PTX) goes to the device just-in-time compiler
The result runs on a heterogeneous computing platform with CPUs and GPUs
Source: Kirk and Hwu

nvcc's basic workflow separates device code from host code and then:

Compiles the device code into an assembly form (PTX code) or binary form (cubin object)
Modifies the host code by replacing the kernel-launch syntax with the necessary CUDA C runtime function calls to load and launch each compiled kernel from the PTX code / cubin object
The modified host code is output as object code by letting nvcc invoke the host compiler during the last compilation stage

Anatomy of a CUDA C Program


Step 1: Copy input data from host memory to device memory


Anatomy of a CUDA C Program


Step 2: Launch a kernel on the device


Anatomy of a CUDA C Program


Step 3: Execute the kernel on the device, caching data on chip for performance


Anatomy of a CUDA C Program


Step 4: Copy results from device memory to host memory
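Put together, a minimal host-side sketch of these four steps might look like the following (names such as d_A, h_A and myKernel are illustrative, and error checking is omitted):

cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);     // Step 1: host -> device
myKernel<<<numBlocks, threadsPerBlock>>>(d_A, d_C, n);  // Step 2: launch the kernel
// Step 3: the kernel runs on the device, keeping working data on chip where possible
cudaMemcpy(h_C, d_C, size, cudaMemcpyDeviceToHost);     // Step 4: device -> host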


Thread

A simplified view of how a processor executes a program

Consists of
The code of the program
The particular point in the code that is being executed
The values of its variables and data structures

In CUDA, execution of each thread is sequential
But a CUDA program initiates parallel execution by launching kernel functions
This causes the underlying mechanisms to create many threads that process different parts of the data in parallel

CUDA Kernels

A kernel is a function that executes the parallel portions of an application on the device and can be called from the host
One kernel is executed at a time, by many threads
Threads execute as an array of threads in parallel
All threads run the same code
Each thread has an ID that is used to compute memory addresses and make control decisions
Kernels can only access device memory

Thread Hierarchies
Grid
One or more thread blocks
Organized as a 3D array of blocks
Block
3D array of threads
Each block in a grid has the same number of threads (dimension)
Each thread in a block can
Synchronize
Access shared memory

Thread Hierarchies
A kernel is executed as a grid of thread blocks
All threads share the data memory space
A thread block is a batch of threads that can cooperate with each other by:
Synchronizing their execution, for hazard-free shared memory accesses
Efficiently sharing data through a low-latency shared memory
Two threads from two different blocks cannot cooperate


Thread Hierarchies
Thread Block
A group of threads
G80 and GT200: up to 512 threads
Fermi: up to 1024 threads
All reside on the same processor core
Share the memory of that core


Threads: Representation
Initialize many threads: hundreds or thousands, to wake up a GPU from its bed!

CUDA Thread Organization

Threads are grouped into blocks; blocks are grouped into grids
Kernels are executed as a grid of blocks of threads


CUDA Thread Organization

Only one kernel can execute on a device at one time

CUDA Thread Organization

All threads in a grid execute the same kernel function
They rely on unique coordinates to distinguish themselves
Two-level hierarchy of unique coordinates
blockIdx (block index): shared by all threads in a block
threadIdx (thread index): unique within a block
Used together they form a unique ID for each thread per kernel
These are built-in, pre-initialized variables accessed within kernels
References to them return the coordinates of the thread when executed
The kernel launch specifies the dimensions of the grid and of each block
gridDim and blockDim
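For example, inside a kernel these built-ins are typically combined to form a unique global index (1D case, a minimal sketch):

int i = blockIdx.x * blockDim.x + threadIdx.x;   // unique index for this thread across the grid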


Your First CUDA Program!


Program 1: Hello World!

#include <stdio.h>

int main(void) {
    printf("Hello World!\n");
    return 0;
}

Output:
$ nvcc hello_world.cu
$ a.out
Hello World!
$

Standard C that runs on the host
The NVIDIA compiler (nvcc) can be used to compile programs with no device code


Program 1: Hello World! with Device Code

#include <stdio.h>

__global__ void mykernel(void) {
}

int main(void) {
    mykernel<<<1,1>>>();
    printf("Hello World!\n");
    return 0;
}

Two new syntactic elements

Source: J. Sanders et al., CUDA by Example

Program 1: Hello World! with Device Code

__global__ void mykernel(void) {
}

The CUDA C/C++ keyword __global__ indicates a function that:
Runs on the device
Is called from host code

nvcc separates source code into host and device components
Device functions (e.g. mykernel()) are processed by the NVIDIA compiler
Host functions (e.g. main()) are processed by the standard host compiler, e.g. gcc, cl.exe

Program 1: Hello World! with Device Code

mykernel<<<1,1>>>();

Triple angle brackets mark a call from host code to device code
Also called a kernel launch
We'll return to the parameters (1,1) in a moment
That's all that is required to execute a function on the GPU!


Program 2: Summing two vectors

Source: J. Sanders et al., CUDA by Example
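A minimal serial CPU version of the vector sum, in the spirit of the CUDA by Example listing (N is the number of elements):

#define N 10

void add(int *a, int *b, int *c) {
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];    // one element per loop iteration
}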

The loop can also be written with an explicit index variable tid that is advanced each iteration; that form is easy to parallelize, as in the sketch below.


Suggest a potential way to parallelize the code on a system with multiple CPUs or CPU cores.
For example, with a dual-core processor:
Change the increment to 2
Have one core initialize the loop with tid = 0 and the other with tid = 1
The first core would add the even-indexed elements, and the second core would add the odd-indexed elements

Multi-core CPU vector sum
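A sketch of the two-core version just described, where each core starts at a different index and strides by 2 (illustrative only; assumes N and the arrays are defined as above):

// Core 0 runs add_loop(0, ...) and core 1 runs add_loop(1, ...), e.g. in two host threads.
void add_loop(int tid, int *a, int *b, int *c) {
    while (tid < N) {
        c[tid] = a[tid] + b[tid];
        tid += 2;              // two cores, so each strides by 2
    }
}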


GPU vector sum

#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n){
    // 1. Allocate device memory for A, B and C
    //    Copy A and B to device memory
    // 2. Kernel launch code: have the device perform the actual vector addition
    // 3. Copy C from the device memory to the host memory
    //    Free memory allocated on the device
}

N is the number of parallel blocks specified while launching the kernel
For example, with four blocks, each runs through the same copy of the device code but with a different value for the block id (see the sketch below)
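A sketch of such a kernel, in the style of the CUDA by Example listing, where each block handles the element given by its block id (assumes N and the device arrays are already set up):

__global__ void add(int *a, int *b, int *c) {
    int tid = blockIdx.x;        // each block handles one element
    if (tid < N)
        c[tid] = a[tid] + b[tid];
}
// launched from the host as: add<<<N, 1>>>(dev_a, dev_b, dev_c);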



CUDA Device Memory Model

Global and constant memory
Written and read by the host by calling API functions
Constant memory: short-latency, high-bandwidth, read-only access by the device

Registers and shared memory (on-chip memories)
Accessed at high speed in a parallel manner
Registers
Allocated to individual threads
Each thread can only access its own registers

CUDA Device Memory Model

A kernel function uses registers to hold frequently accessed variables that are private to each thread

Shared memory is allocated to thread blocks
All threads in a block can access its variables
An efficient means for threads to cooperate by sharing input data and intermediate results


CUDA API function for managing device global memory


cudaMalloc()
First parameter: address of a pointer variable that will point to the allocated object after allocation
The address of the pointer variable should be cast to (void **)
The function expects a generic pointer value
This allows cudaMalloc() to write the address of the allocated object into the pointer variable
Second parameter: size of the object to be allocated, in bytes

float *d_A;
int size = n * sizeof(float);
cudaMalloc((void**)&d_A, size);
...
cudaFree(d_A);



cudaMemcpy()

Cannot be used to copy between different GPUs
Can be used to transfer data in both directions by
Proper ordering of the source and destination pointers and
Using the appropriate constant for the transfer type
Four symbolic predefined constants
cudaMemcpyHostToHost
cudaMemcpyHostToDevice
cudaMemcpyDeviceToHost
cudaMemcpyDeviceToDevice

cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

GPU vector sum

#include <cuda.h>
...
void vecAdd(float* A, float* B, float* C, int n){
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    //Kernel invocation code

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    //Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}


CUDA Extensions to function declaration

                                    Executed on the:    Only callable from the:
__device__ float DeviceFunc()       Device              Device
__global__ void  KernelFunc()       Device              Host
__host__   float HostFunc()         Host                Host
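A sketch of how these qualifiers appear in source code (function names are illustrative):

__device__ float DeviceFunc(float x) { return x * x; }   // runs on the device, callable from device code
__global__ void KernelFunc(float *d_out, float x) {      // runs on the device, launched from the host
    d_out[threadIdx.x] = DeviceFunc(x);
}
__host__ float HostFunc(float x) { return x * x; }        // ordinary host function (the default)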


CUDA Memory Hierarchy

Thread: registers and local memory
Blocks: shared memory


CUDA Memory Hierarchy

Grid (all thread blocks): global memory

Registers
Per thread; data lifetime = thread lifetime
Local memory
Per-thread off-chip memory
Data lifetime = thread lifetime
Shared memory
Per thread block, on-chip
Data lifetime = block lifetime
Global memory
Accessible by all threads as well as the host
Data lifetime = from allocation to deallocation
Host memory
Not directly accessible by CUDA threads
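A minimal kernel sketch showing where some of these memory spaces appear in code (illustrative only; assumes blocks of 256 threads):

__global__ void scaleKernel(float *g_data) {         // g_data points to global memory
    __shared__ float tile[256];                       // shared memory, one copy per block
    int i = blockIdx.x * blockDim.x + threadIdx.x;    // i lives in a register, private to the thread
    tile[threadIdx.x] = g_data[i];
    __syncthreads();                                  // barrier: the whole tile is now visible to the block
    g_data[i] = tile[threadIdx.x] * 2.0f;
}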


Thread and Memory Organization

Blocks execute on an SM (streaming multiprocessor)


Thread and Memory Organization

Grid of blocks across the GPU


Thread and Block Indexing

Each thread has an ID
Predefined variables allow a thread to access, at runtime, the hardware registers that provide its identifying coordinates
Threads have 3D IDs, unique within a block
Built-in variables: threadIdx, blockIdx, blockDim, gridDim


Thread Indexing

A grid consists of N thread blocks
Each with a blockIdx.x value that ranges from 0 to N-1
Each block consists of M threads
Each with a threadIdx.x value that ranges from 0 to M-1
All blocks at the grid level are organized as a one-dimensional (1D) array
All threads within each block are also organized as a 1D array
Each grid has a total of N*M threads

Thread Indexing

Thread ID = blockIdx.x * blockDim.x + threadIdx.x
Thread 3 of Block 0 has a thread ID value of 0*M + 3
Thread 3 of Block 5 has a thread ID value of 5*M + 3

E.g.: a grid with 128 (N) blocks and 32 (M) threads in each block
Total of 128*32 = 4096 threads in the grid
Each thread should have a unique ID
Thread 3 of Block 0 has a thread ID value of 0*32 + 3 = 3
Thread 3 of Block 5 has a thread ID value of 5*32 + 3 = 163
Thread 15 of Block 102, thread ID = ?


Thread and Block Indexing

Grid dimension: <3, 2>
gridDim.x = 3
gridDim.y = 2
blockIdx.x = 1
blockIdx.y = 1
Block dimension: <5, 3>
blockDim.x = 5
blockDim.y = 3
threadIdx.x = 4
threadIdx.y = 2
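With these built-in values, a thread's global 2D coordinates can be computed as follows (a sketch; the comments use the numbers above):

int col = blockIdx.x * blockDim.x + threadIdx.x;   // 1*5 + 4 = 9
int row = blockIdx.y * blockDim.y + threadIdx.y;   // 1*3 + 2 = 5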


Vector Addition Kernel Function

__global__ void vecAddKernel(float* A, float* B, float* C, int n)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;
    if(i < n)
        C[i] = A[i] + B[i];
}

The if(i < n) statement allows the kernel to process vectors of arbitrary lengths.
Example: if the vector length is 100 and the thread block dimension is 32, then four thread blocks need to be launched (128 threads). The last 28 threads in block 3 need to be disabled.

Thread Indexing and Organization

In general
A grid is organized as a 3D array of blocks
A block is organized as a 3D array of threads
The exact organization of a grid is determined by the execution configuration provided at kernel launch
When the host code invokes a kernel, it sets the grid and thread block dimensions via execution configuration parameters
2 parameters
The first describes the configuration of the grid: the number of blocks
The second describes the configuration of the blocks: groups of threads
Each parameter is of type dim3, a C struct with three unsigned integer fields: x, y and z (three dimensions)


Thread Indexing and Organization

For 1D and 2D grids/blocks, the unused fields should be set to 1
Example: a 1D grid with 128 blocks, each of which consists of 32 threads
Total number of threads = 128 * 32 = 4096

dim3 dimGrid(128, 1, 1);
dim3 dimBlock(32, 1, 1);
vecAddKernel<<<dimGrid, dimBlock>>>();

dimGrid and dimBlock are programmer-defined variables
These variables can have any names as long as they are of type dim3 and the kernel launch uses the appropriate names


Thread Indexing and Organization

If a grid/block has only one dimension, arithmetic expressions can be used instead of dim3 to specify the configuration
The compiler takes each expression as the x dimension and assumes y and z are 1

vecAddKernel<<<ceil(n/256.0), 256>>>();

gridDim and blockDim are part of the CUDA C specification and cannot be changed
The x fields of the predefined variables gridDim and blockDim are preinitialized based on the execution configuration parameters
If n is 4000, then gridDim.x = 16 and blockDim.x = 256


Thread Indexing and Organization

Allowed values of gridDim.x, gridDim.y and gridDim.z range from 1 to 65536
All threads in a block share the same blockIdx.x, blockIdx.y and blockIdx.z
In a grid
blockIdx.x ranges between 0 and gridDim.x - 1
blockIdx.y ranges between 0 and gridDim.y - 1
blockIdx.z ranges between 0 and gridDim.z - 1

The total size of a block is limited to 1024 threads
Flexibility to divide them among the 3 dimensions
Example valid blockDim values: (512, 1, 1), (8, 16, 4) and (32, 16, 2)
(32, 32, 32) is not allowed, since 32*32*32 = 32768 exceeds 1024


2D grid (2, 2, 1) consisting of 3D blocks (4, 2, 2)

Host code:
dim3 dimGrid(2, 2, 1);
dim3 dimBlock(4, 2, 2);
Kernel<<<dimGrid, dimBlock>>>( ... );


CUDA execution configuration parameters

E.g.: dim3 dimBlock(5, 3);
      dim3 dimGrid(3, 2);
Kernel call:
E.g.: gauss<<<dimGrid, dimBlock>>>( )


Kernel Launch Statement

vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);
The first execution configuration parameter is the number of thread blocks; the second is the number of threads in each block

If there are 1000 elements, we launch ceil(1000/256.0) = 4 thread blocks
This launches 4 * 256 = 1024 threads
The number of thread blocks depends on the length of the vectors (n)
If n = 750, 3 thread blocks
If n = 4000, 16 thread blocks

Vector Addition Kernel Launch

#include <cuda.h>
void vecAdd(float* A, float* B, float* C, int n){
    int size = n * sizeof(float);
    float *d_A, *d_B, *d_C;

    cudaMalloc((void**) &d_A, size);
    cudaMemcpy(d_A, A, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_B, size);
    cudaMemcpy(d_B, B, size, cudaMemcpyHostToDevice);
    cudaMalloc((void**) &d_C, size);

    vecAddKernel<<<ceil(n/256.0), 256>>>(d_A, d_B, d_C, n);

    cudaMemcpy(C, d_C, size, cudaMemcpyDeviceToHost);

    //Free device memory
    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
}
