4. CUDA Programming
Outline
GPU
CUDA Introduction
What is CUDA
CUDA Programming Model
Advantages & Limitations
CUDA Programming
Future Work
GPU
GPUs are massively multithreaded, many-core chips
Originally designed to handle computation only for computer graphics
Hundreds of processors
Tens of thousands of concurrent threads
TFLOPs peak performance
Fine-grained data-parallel computation
GPGPU
• What is GPGPU?
– General-purpose computing on GPUs
– GPGPU is the use of a GPU, which typically handles computation only for computer graphics, to perform computation in applications traditionally handled by the central processing unit (CPU).
• Why GPGPU?
– Massively parallel computing power
– Inexpensive
GPGPU
• How?
– CUDA
– OpenCL
– DirectCompute
What is CUDA?
CUDA is the acronym for Compute Unified Device Architecture.
A parallel computing architecture developed by NVIDIA.
A heterogeneous serial-parallel computing model.
The computing engine in NVIDIA GPUs.
CUDA is accessible to software developers through industry-standard programming languages.
CUDA gives developers access to the instruction set and memory of the parallel computation elements in GPUs.
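As a first illustration of this model, here is a minimal sketch of what CUDA C code looks like; the kernel name vecAdd and the launch configuration are illustrative, not taken from these slides:

// Device code: a kernel that adds two vectors, one element per thread
__global__ void vecAdd(const float *a, const float *b, float *c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
    if (i < n)
        c[i] = a[i] + b[i];
}

// Host code: launch enough 256-thread blocks to cover n elements
// vecAdd<<<(n + 255) / 256, 256>>>(d_a, d_b, d_c, n);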
Heterogeneous Computing
Terminology:
Host: the CPU and its memory (host memory)
Device: the GPU and its memory (device memory)
Heterogeneous Computing
#include <iostream>
#include <algorithm>
#include <cstdlib>
#define N 1024
#define RADIUS 3
#define BLOCK_SIZE 16
// parallel fn: the stencil kernel runs on the device
__global__ void stencil_1d(int *in, int *out) {
    __shared__ int temp[BLOCK_SIZE + 2 * RADIUS];
    int gindex = threadIdx.x + blockIdx.x * blockDim.x;
    int lindex = threadIdx.x + RADIUS;
    // Read input elements into shared memory
    temp[lindex] = in[gindex];
    if (threadIdx.x < RADIUS) {
        temp[lindex - RADIUS] = in[gindex - RADIUS];
        temp[lindex + BLOCK_SIZE] = in[gindex + BLOCK_SIZE];
    }
    // Synchronize (ensure all the data is available)
    __syncthreads();
    // Apply the stencil and store the result
    int result = 0;
    for (int offset = -RADIUS; offset <= RADIUS; offset++)
        result += temp[lindex + offset];
    out[gindex] = result;
}
int main(void) {
    int *in, *out;      // host copies
    int *d_in, *d_out;  // device copies
    int size = (N + 2*RADIUS) * sizeof(int);
    // serial code: allocate and initialize host memory
    in  = (int *)malloc(size); std::fill_n(in,  N + 2*RADIUS, 1);
    out = (int *)malloc(size); std::fill_n(out, N + 2*RADIUS, 1);
    // Allocate device memory
    cudaMalloc((void **)&d_in, size);
    cudaMalloc((void **)&d_out, size);
    // Copy to device
    cudaMemcpy(d_in, in, size, cudaMemcpyHostToDevice);
    cudaMemcpy(d_out, out, size, cudaMemcpyHostToDevice);
    // parallel code: launch stencil_1d() kernel on the GPU
    stencil_1d<<<N/BLOCK_SIZE, BLOCK_SIZE>>>(d_in + RADIUS, d_out + RADIUS);
    // Copy result back to host
    cudaMemcpy(out, d_out, size, cudaMemcpyDeviceToHost);
    // serial code: cleanup
    free(in); free(out);
    cudaFree(d_in); cudaFree(d_out);
    return 0;
}
CUDA Kernels and Threads
Parallel portions of an application are executed on
the device as kernels
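A short sketch (illustrative names) of how a kernel is defined and how each of its many threads identifies itself through the built-in variables threadIdx, blockIdx, blockDim and gridDim:

__global__ void whoAmI(int *out, int width, int height) {
    // Each thread computes its own (x, y) position in the grid
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    if (x < width && y < height)
        out[y * width + x] = y * width + x;   // store the global linear index
}

// Host side: a 2D launch configuration expressed with dim3
// dim3 threads(16, 16);
// dim3 blocks((width + 15) / 16, (height + 15) / 16);
// whoAmI<<<blocks, threads>>>(d_out, width, height);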
CUDA Programming Model
Single Instruction Multiple Thread (SIMT) Execution: threads are grouped into warps, and every thread in a warp executes the same instruction at the same time, each on its own data.
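Because a warp issues one instruction for all of its threads, branches that diverge within a warp are serialized; a small illustrative sketch (hypothetical kernel):

__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    // Even and odd threads of the same warp take different paths here,
    // so the warp executes both branches one after the other.
    if (i % 2 == 0)
        data[i] = data[i] * 2.0f;
    else
        data[i] = data[i] + 1.0f;
}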
Simple Processing Flow
[Figure: the CPU (host) and GPU (device) communicate over the PCI bus; input data is copied from host memory to device memory, the kernel executes on the GPU, and results are copied back to host memory.]
Memory Model
• Types of device memory
– Registers – read/write per-thread
– Local Memory – read/write per-thread
– Shared Memory – read/write per-block
– Global Memory – read/write per-grid
– Constant Memory – read-only per-grid
– Texture Memory – read-only per-grid (see the declaration sketch after this list)
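A sketch (variable names are illustrative) of how these memory spaces appear in CUDA C source; texture memory, accessed through texture objects, is omitted here:

__constant__ float coeffs[16];        // constant memory: cached, read-only from kernels, per grid
__device__   float table[1024];       // global memory declared at file scope, per grid

__global__ void memoryDemo(const float *in, float *out) {   // assumes blockDim.x <= 256
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    float x = in[i];                  // scalar locals normally live in registers (per thread)
    float scratch[32];                // large per-thread arrays may be placed in local memory
    scratch[i % 32] = x;
    __shared__ float tile[256];       // shared memory: on chip, visible to the whole block
    tile[threadIdx.x] = x;
    __syncthreads();                  // make the block's shared-memory writes visible to all threads
    out[i] = tile[threadIdx.x] * coeffs[0] + table[i % 1024] + scratch[i % 32];
}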
Memory Model
There are 6 memory types:
• Registers
o on chip
o fast access
o per thread
o limited amount
o 32 bit
• Local Memory
o in DRAM
o slow, non-cached
o per thread
o relatively large
• Shared Memory
o on chip
o fast access
o per block
o 16 KByte
o used to synchronize between threads
• Global Memory
o in DRAM
o slow, non-cached
o per grid
o used to communicate between grids
• Constant Memory
o in DRAM
o cached
o per grid
o read-only
• Texture Memory
o in DRAM
o cached
o per grid
o read-only
Memory Model
On chip:
• Registers
• Shared Memory
In device memory (DRAM):
• Local Memory
• Global Memory
• Constant Memory
• Texture Memory
Memory Model
• Global Memory
• Constant Memory
• Texture Memory
o managed by host code (see the sketch below)
o persistent across kernel launches
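A minimal sketch of host code managing global and constant memory; the kernel launches are commented placeholders, and all names are illustrative:

__constant__ float coeffs[16];                          // constant memory symbol, written by the host

void setupAndRun(int n) {
    float h_coeffs[16] = {0};
    float *d_data;
    cudaMalloc((void **)&d_data, n * sizeof(float));            // global memory, host-managed
    cudaMemcpyToSymbol(coeffs, h_coeffs, sizeof(h_coeffs));     // fill constant memory from the host
    // firstKernel<<<(n + 255) / 256, 256>>>(d_data, n);        // both allocations are visible here...
    // secondKernel<<<(n + 255) / 256, 256>>>(d_data, n);       // ...and remain valid across launches
    cudaFree(d_data);                                           // freed only when the host releases it
}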
Advantages of CUDA
CUDA has several advantages over traditional general-purpose computation on GPUs using graphics APIs:
Scattered reads – code can read from arbitrary addresses in memory.
Shared memory – CUDA exposes a fast shared memory region (16 KB in size) that can be shared among the threads of a block.
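A small sketch of a scattered (gather-style) read, where each thread loads from an arbitrary, data-dependent address; the kernel name is illustrative:

__global__ void gather(const float *in, const int *idx, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[idx[i]];     // read from an address chosen at run time
}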
Limitations of CUDA
CUDA also has several limitations:
A single process must run spread across multiple
disjoint memory spaces, unlike other C language
runtime environments.
The bus bandwidth and latency between the CPU
and the GPU may be a bottleneck.
CUDA-enabled GPUs are only available from
NVIDIA.