Programming CUDA
(Compute Unified Device Architecture)
Tesla (G80)
[Diagram: Tesla (G80) SMs, each containing SPs and 16KB scratch memory, sharing a 128KB L2 cache and GB-scale DRAM]
Fermi
[Diagram: Fermi SMs, each containing SPs, 64K L1/scratch, and texture and constant caches, sharing a 768KB L2 cache and GB-scale DRAM]
CPU vs GPU architecture
Memory latency needs to be hidden
Run many threads
Possible because of the GPU's high compute density
[Figure: CPU vs GPU chip layout; on-chip cache on the order of ~8 MB (CPU) vs ~64 KB (GPU). Source: NVIDIA]
CUDA Architecture
[Figure: CUDA architecture overview. Courtesy NVIDIA]
SM
!"#
Dispatch Unit
Warp ScheduIer
Instruction Cache
Dispatch Unit
Warp ScheduIer
Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
Core Core
SFU
SFU
SFU
SFU
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
LD/ST
Interconnect Network
64 KB Shared Memory / L1 Cache
Uniform Cache
Core
Register FiIe (32,768 x 32-bit)
CUDA Core
Operand CoIIector
Dispatch Port
ResuIt Queue
FP Unit INT Unit
#
!"#$%&'()'*+,-'!&%."'/0'"1,2$3&4'56',7%&48'9('27+3:4;7%&'$1";48'<7$%'4=&,"+2><$1,;"71'
$1";48'+'56?>@7%3'%&#"4;&%'<"2&8'(A?'7<',71<"#$%+B2&'CD08'+13';-%&+3',71;%72'27#",)'*+,-',7%&'
-+4'B7;-'<27+;"1#>=7"1;'+13'"1;&#&%'&E&,$;"71'$1";4)'F/7$%,&G'HIJKJDL'
$%&'()*+,-&)*(#&-./'()&*0#1&%%&2#(3.#4555#678,!""9#1%&'()*+,-&)*(#0('*:'/:;#
5'<3#<&/.#<'*#-./1&/=#&*.#0)*+%.,-/.<)0)&*#1>0.:#=>%()-%?,'::#&-./'()&*#)*#.'<3#
<%&<@#-./)&:#'*:#&*.#:&>A%.,-/.<)0)&*#$BC#)*#(2&#<%&<@#-./)&:0;#C(#(3.#<3)-#%.D.%E#
$./=)#-./1&/=0#=&/.#(3'*#9#'0#='*?#:&>A%.,-/.<)0)&*#&-./'()&*0#-./#<%&<@#(3'*#
(3.#-/.D)&>0#FG!""#+.*./'()&*E#23./.#:&>A%.,-/.<)0)&*#-/&<.00)*+#2'0#3'*:%.:#A?#
'#:.:)<'(.:#>*)(#-./#HB#2)(3#=><3#%&2./#(3/&>+3->(;#
4555#1%&'()*+,-&)*(#<&=-%)'*<.#)*<%>:.0#'%%#1&>/#/&>*:)*+#=&:.0E#'*:#
0>A*&/='%#*>=A./0#I*>=A./0#<%&0./#(&#J./&#(3'*#'#*&/='%)J.:#1&/='(#<'*#
GPU Performance
Massively parallel: 512 cores
Low power
Massively threaded: 1000s of threads
Hardware-supported threads
Courtesy NVIDIA
What is CUDA?
Compute Unified Device Architecture
General purpose programming model
User kicks off batches of threads on the GPU
GPU = dedicated super-threaded, massively data-parallel co-processor
Driver for loading computation programs into GPU
Standalone Driver - Optimized for computation
Interface designed for compute: graphics-free API
Data sharing with OpenGL buffer objects
Guaranteed maximum download & readback speeds
Explicit GPU memory management
CUDA is C-like
Integrated host+device app C program
Serial or modestly parallel parts in host C code
Highly parallel parts in device SPMD kernel C code
Serial Code (host)
. . .
Parallel Kernel (device): KernelA<<< nBlk, nTid >>>(args);
Serial Code (host)
. . .
Parallel Kernel (device): KernelB<<< nBlk, nTid >>>(args);
Courtesy Kirk & Hwu
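A minimal end-to-end sketch of such an integrated program (the kernel name scale, the sizes, and the data are illustrative, not from the slides):

#include <cuda_runtime.h>
#include <stdio.h>

// Device code: each thread scales one element (SPMD)
__global__ void scale(float *v, float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) v[i] *= s;
}

int main(void) {
    const int n = 1024;
    float h[1024];
    for (int i = 0; i < n; ++i) h[i] = (float)i;

    float *d;
    cudaMalloc((void**)&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Serial host code ... then a parallel kernel launch
    scale<<<n / 256, 256>>>(d, 2.0f, n);

    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);
    printf("h[10] = %f\n", h[10]);   // expect 20.0
    return 0;
}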
CUDA Devices and Threads
A compute device
Is a coprocessor to the CPU or host
Has its own DRAM (device memory)
Runs many threads in parallel
Is typically a GPU but can also be another type of parallel
processing device
Data-parallel portions of an application are expressed as
device kernels which run on many threads
Differences between GPU and CPU threads
GPU threads are extremely lightweight
Very little creation overhead
GPU needs 1000s of threads for full efficiency
Multi-core CPU needs only a few
Courtesy Kirk & Hwu
Extended C
Declspecs: global, device, shared, local, constant
Built-in variables: threadIdx, blockIdx
Intrinsics: __syncthreads
Runtime API: memory, symbol, execution management
Function launch
__device__ float filter[N];

__global__ void convolve (float *image) {
  __shared__ float region[M];
  ...
  region[threadIdx.x] = image[i];   // i: per-thread input index
  __syncthreads();
  ...
  image[j] = result;                // j: per-thread output index
}

// Allocate GPU memory
float *myimage;
cudaMalloc((void**)&myimage, bytes);

// 100 blocks, 10 threads per block
convolve<<<100, 10>>>(myimage);
Courtesy Kirk & Hwu
Extended-C SW stack
[Diagram: integrated source (foo.cu) → nvcc C/C++ frontend → CPU host code (foo.cpp), compiled with gcc / cl, plus GPU assembly (foo.s) → OCG → architecture SASS (foo.sass)]
[Same toolchain diagram, with developer tools attached: cuda-gdb, CUDA Visual Profiler, Parallel Nsight]
CUDA software pipeline
Source files have a mix of host and device code
nvcc separates device code from host code
and compiles device code into PTX/cubin
Host code is output as C code
nvcc can invoke the host compiler
or, it can be compiled later
Applications can link to the generated host code
host code includes PTX/cubin code as a global initialized data array
and calls into cudart (the CUDA C runtime) to load and launch kernels
Alternatively, one may load and execute the PTX/cubin using the CUDA driver API (see the sketch below)
host code is then ignored
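A hedged sketch of this driver-API path, assuming a kernel.ptx produced by nvcc that defines a kernel named myKernel, and using the CUDA 4.0-style cuLaunchKernel entry point:

#include <cuda.h>   // CUDA driver API

// Assumption: kernel.ptx defines an extern "C" kernel myKernel(float*, int)
void launch_from_ptx(CUdeviceptr dptr, int n) {
    CUdevice   dev;
    CUcontext  ctx;
    CUmodule   mod;
    CUfunction fn;

    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);
    cuModuleLoad(&mod, "kernel.ptx");          // load PTX/cubin at run time
    cuModuleGetFunction(&fn, mod, "myKernel");

    void *args[] = { &dptr, &n };
    cuLaunchKernel(fn, (n + 255) / 256, 1, 1,  // grid dimensions
                       256, 1, 1,              // block dimensions
                       0, NULL, args, NULL);   // shared mem, stream, params, extra
}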
CUDA software architecture
[Figure: CUDA software architecture. Source: NVIDIA]
Provides library functions for host as well as device
Implements a subset of stdlib
System Requirements
CUDA GPU
With CUDA device driver
CUDA software
CUDA Toolkit
Tools to build a CUDA application
Libraries, header files, and other resources
CUDA SDK
Sample projects (with configurations) including utility functions
C/C++ compiler
Needs to be a compatible version
Arrays of Parallel Threads
A CUDA kernel is executed many times
By a block of threads running concurrently
Once per thread, each running the same kernel (SPMD)
Threads have access to their ID
and may compute different memory addresses or take different control paths
[Diagram: threads with IDs 0..7 all run the same kernel body, each with its own thread ID]
float x = input[tID];
float y = func(x);
output[tID] = y;
__syncthreads();
Different blocks only loosely tied
Must be able to execute independently (concurrently)
Do share global memory
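A sketch of how a kernel typically derives the per-thread ID (tID above); func here stands in for any __device__ function and is not from the slides:

__device__ float func(float x) { return 2.0f * x; }   // illustrative

__global__ void apply(const float *input, float *output, int n) {
    int tID = blockIdx.x * blockDim.x + threadIdx.x;   // unique per thread
    if (tID < n) {
        float x = input[tID];
        float y = func(x);
        output[tID] = y;
    }
}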
Thread Execution
A block does not execute in a SIMD fashion
There are only 8 SPs
Executed in groups of 32 parallel threads
called warps
Divided into two half-warps
There need not be 32 or even 16 SPs
Logical separation; instructions may be double-pumped
All threads of a warp start together
But may diverge by branching
Branch paths are serialized until they converge back
Important efficiency consideration
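A small illustrative kernel showing warp divergence (the branch condition is made up for the example):

__global__ void divergent(float *data) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Odd and even lanes of the same warp take different paths:
    // the hardware runs each path in turn, masking off the other threads.
    if (threadIdx.x % 2 == 0)
        data[i] *= 2.0f;
    else
        data[i] += 1.0f;
}

// Branching on a per-warp quantity (e.g. threadIdx.x / 32) avoids divergence,
// because all 32 threads of a warp then take the same path.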
Grid/Block Dimension
In a launch kernel<<<A, B>>>(...), A and B need not be ints
A is an (up to) two-dimensional vector
dim3 A(m, n)
B is an (up to) three-dimensional vector
dim3 B(a, b, c); a, b, c are ints
a x b x c <= 512 on Tesla (1024 on Fermi)
Resource sharing further limits the count
Up to 8 blocks may co-exist on SM; at least 1 must fit
c is the most signicant dimension
a is the least signicant dimension
Dereference components as B.x, B.y and B.z
Thread ID = (x + y * a + z * a*b) for the thread at coordinates (x, y, z) within the block
Thread ID
[Diagram: a block of dimensions a x b x c; the thread at coordinates (x, y, z) has ID (x + y * a + z * a*b)]
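A sketch tying the launch configuration to the ID formula above; the dimensions and the d_out buffer are illustrative:

__global__ void kernel3d(int *out) {
    // Linear thread ID within the block, per the formula above
    int tid = threadIdx.x
            + threadIdx.y * blockDim.x
            + threadIdx.z * blockDim.x * blockDim.y;
    int threadsPerBlock = blockDim.x * blockDim.y * blockDim.z;
    int block = blockIdx.x + blockIdx.y * gridDim.x;   // linear block ID
    out[block * threadsPerBlock + tid] = tid;
}

dim3 A(4, 2);               // grid: up to 2D (m, n)
dim3 B(8, 4, 2);            // block: up to 3D; 8*4*2 = 64 <= 512 threads
kernel3d<<<A, B>>>(d_out);  // d_out: device array of at least 4*2*64 ints, allocated elsewhere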
CUDA Memory Model Overview
Global memory
Main host-device data communication path
Visible to all threads
Long latency
Shared Memory
Fast memory
Use as scratch
Shared across block
More memory segments
Constant and texture
Read-only, cached
[Diagram: a Grid contains Blocks; each Block has Shared Memory and per-thread Registers; all Blocks and the Host access Global Memory. Courtesy Kirk & Hwu]
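A sketch of these memory spaces in a kernel (names and sizes are illustrative; assumes blockDim.x <= 256):

__constant__ float coeff[16];            // constant memory: read-only, cached,
                                         // filled from the host via cudaMemcpyToSymbol

__global__ void smooth(const float *in, float *out) {   // in/out: global memory
    __shared__ float tile[256];          // shared memory: per-block scratch
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float v = in[i];                     // v lives in a register
    tile[threadIdx.x] = v;
    __syncthreads();                     // all threads of the block now see the tile
    out[i] = tile[threadIdx.x] * coeff[0];
}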
Memory Model Details
Shared memory is tied to a block
Lifetime ends with the block
global, constant, and texture memories are
persistent across kernels (within application)
These are recognized as device memory
Separate from host memory
App must explicitly allocate/de-allocate device
memory
And manage data transfer between host and device
memory
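A sketch of the persistence rule: a __device__ global array keeps its contents between kernel launches within one application, unlike __shared__ data (names are illustrative):

__device__ float accum[1024];            // global device memory: persists across kernels

__global__ void produce(void) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    accum[i] = (float)i;                 // written by the first kernel
}

__global__ void consume(float *out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = accum[i] + 1.0f;            // still there for the second kernel
}

// Host side: the two launches may be separated by arbitrary host code.
// produce<<<4, 256>>>();
// consume<<<4, 256>>>(d_out);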
CUDA Device Memory Allocation
cudaMalloc()
Allocates in global memory
Requires parameters:
Address of a pointer to the
allocated object
Size of allocated object
Beware of display mode
change
cudaFree()
Frees object from global memory
Takes pointer to object to free
Called on the host!
Feels like host pointers
[Diagram: CUDA memory model (grid, blocks, shared memory, registers, global memory, host), as before. Thanks Kirk & Hwu]
CUDA Device Memory Allocation (cont.)
Code example:
Allocate a 64 * 64 single-precision float array
Attach the allocated storage to Md
Suffix d often used for device data
int TILE_WIDTH = 64;
float *Md, *M;
int size = TILE_WIDTH * TILE_WIDTH * sizeof(float);
cudaMalloc((void**)&Md, size);
cudaFree(Md);
Courtesy Kirk & Hwu
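The return value is worth checking; a minimal error-handling sketch, not part of the original example:

// Requires <stdio.h> on the host side
cudaError_t err = cudaMalloc((void**)&Md, size);
if (err != cudaSuccess)
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));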
Example Memory Copy
size_t size = N * sizeof(float);
// Allocate vectors in host memory
float* h_A = (float*)malloc(size);
float* h_B = (float*)malloc(size);
// Make sure to initialize input vectors
float *d_A, *d_B;
// Allocate vectors in device global memory
cudaMalloc((void**)&d_A, size);
cudaMalloc((void**)&d_B, size);
// Copy host->device
cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
// Invoke kernel on GPU
ProcessDo<<<blocksPerGrid, threadsPerBlock>>>(d_A, d_B, N);
// Copy result from device memory into h_B on the host
cudaMemcpy(h_B, d_B, size, cudaMemcpyDeviceToHost);
// Free device memory
cudaFree(d_A);
cudaFree(d_B);
See cudaMallocPitch() and cudaMalloc3D() for allocating 2D/3D arrays. These pad each row to meet alignment requirements for efficient access (also see cudaMemcpy2D() and cudaMemcpy3D()).
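A hedged sketch of the pitched path for a 2D array (width, height, and h_img are illustrative; h_img is assumed to be an already-allocated, tightly packed host buffer):

size_t pitch;            // bytes per padded row, returned by the runtime
float *d_img;
int width = 640, height = 480;

// Allocate a padded 2D array; each row starts on an aligned boundary
cudaMallocPitch((void**)&d_img, &pitch, width * sizeof(float), height);

// Copy the tightly packed host image into the pitched device array
cudaMemcpy2D(d_img, pitch,
             h_img, width * sizeof(float),
             width * sizeof(float), height,
             cudaMemcpyHostToDevice);

// In a kernel, row y starts at (float*)((char*)d_img + y * pitch)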
CUDA Host-Device Data Transfer
cudaMemcpy()
Requires four parameters
Pointer to destination
Pointer to source
Number of bytes copied
Type of transfer
Host to Host
Host to Device
Device to Host
Device to Device
Asynchronous transfer also possible (see the sketch below)
[Diagram: CUDA memory model (grid, blocks, shared memory, registers, global memory, host), as before. Courtesy Kirk & Hwu]
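A sketch of the asynchronous path, reusing the earlier ProcessDo example; it assumes h_pinned is page-locked host memory (see a later slide) and the stream name is illustrative:

cudaStream_t stream;
cudaStreamCreate(&stream);

// h_pinned must be page-locked (cudaHostAlloc) for the copy to be truly asynchronous
cudaMemcpyAsync(d_A, h_pinned, size, cudaMemcpyHostToDevice, stream);
ProcessDo<<<blocksPerGrid, threadsPerBlock, 0, stream>>>(d_A, d_B, N);  // can overlap with other streams
cudaStreamSynchronize(stream);   // wait for the copy and kernel to finish
cudaStreamDestroy(stream);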
CUDA Host-Device Data Transfer
Code example:
Recall allocation earlier
Transfer a 64 * 64 float array
M is in host memory and Md is in device
memory
cudaMemcpyHostToDevice and
cudaMemcpyDeviceToHost are symbolic
constants
cudaMemcpy(Md, M, size, cudaMemcpyHostToDevice);
cudaMemcpy(M, Md, size, cudaMemcpyDeviceToHost);
Courtesy Kirk & Hwu
More Ways to Initialize
There is also page-locked (i.e., pinned) host memory
cudaHostAlloc() and cudaFreeHost()
Copies between page-locked host memory and device
memory can be performed concurrently with kernel execution
Page-locked host memory can be directly mapped into the
address space of the device
Bandwidth between host memory and device memory is
generally higher
__constant__ float constData[256];
float data[256];
cudaMemcpyToSymbol(constData, data, sizeof(data));
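A sketch of the page-locked allocation mentioned above (buffer names are illustrative):

float *h_pinned;
// Allocate page-locked (pinned) host memory instead of malloc()
cudaHostAlloc((void**)&h_pinned, size, cudaHostAllocDefault);

// ... fill h_pinned ...
// Copies from pinned memory reach higher bandwidth and may overlap with kernels
cudaMemcpy(d_A, h_pinned, size, cudaMemcpyHostToDevice);

cudaFreeHost(h_pinned);   // must be paired with cudaHostAlloc, not free()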