HIGH PERFORMANCE COMPUTING ON GPU
Graphics Processing Units
•A graphics processing unit (GPU) is a specialized microprocessor that offloads
and accelerates 3D or 2D graphics rendering.
•A modern GPU's highly parallel structure makes it more effective than a general-purpose
CPU for compute-intensive, data-parallel workloads.
•NVIDIA's Tesla architecture exposes the computational horsepower of NVIDIA GPUs.
The GPU is specialized for compute-intensive, highly parallel computation and is
designed such that more transistors are devoted to data processing rather than to
data caching and flow control.
Physical Memory Layout of NVIDIA GPUs
The device has its own global memory, which all the
cores (thread processors) can access. N multiprocessors
have M cores each. Cores share an instruction unit with the
other cores in their multiprocessor. Each core has its own
local memory (residing in DRAM) and a separate register set, and
all M cores share an on-chip memory called shared
memory. The host can write to the global memory but not
to the shared memory.
TESLA C1060
• The NVIDIA® Tesla™ C1060 is based on the NVIDIA
10-series architecture and has 30 multiprocessors, each
with 8 cores, a double-precision unit, and on-chip
shared memory.
What is CUDA?
CUDA is a scalable parallel programming
model and a software environment for
parallel computing
• Minimal extensions to familiar C/C++ environment
• Heterogeneous serial-parallel programming model
Kernels and Threads
Parallel portions of an application are executed on
the device as kernels
• One kernel is executed at a time
• All the parallel threads execute the same kernel.
• Some higher-end devices can execute more than one
kernel concurrently.
More about threads
• A CUDA kernel is executed by an array of
threads.
• All threads run the same code.
• Each thread has an ID that it uses to compute
memory addresses and make control decisions.
Computation of memory addresses and control decisions will be discussed later; a minimal indexing sketch is shown below.
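As an illustration (not from the original slides; the kernel name and arguments are hypothetical), a kernel in which each thread uses its ID to pick one array element:

__global__ void scale(float *data, float factor, int n)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;  // unique global thread ID
    if (idx < n)                                      // control decision based on the ID
        data[idx] *= factor;                          // memory address computed from the ID
}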
THREAD BATCHING
A kernel launch creates a grid of thread blocks.
•Threads within a block cooperate via shared memory
•Threads within a block can synchronize (thread
cooperation)
•Threads in different blocks cannot cooperate
(Figure slides: memory access and execution model diagrams.)
CUDA C and Compilation
• CUDA C provides a simple path for users familiar with the C programming language
to easily write programs for execution by the device. It consists of a minimal set of
extensions to the C language and a runtime library.
• CUDA provides the nvcc compiler, which splits CUDA code into PTX code for the device (used at
runtime) and standard C code for the host (passed to the standard C compiler at compile time).
(Figure: the standard C code runs on the CPU; the PTX device code runs on the GPU.)
Managing memory
The GPU's memory can only be managed by the CPU, and the CPU has access only
to the global memory.
The following memory operations apply only to global memory
(not to local or shared memory).
• Allocate/Free memory
– cudaMalloc(void ** pointer, size_t nbytes) // allocates nbytes of device memory
– cudaMemset(void * pointer, int value, size_t count) // sets "count" bytes to
"value"
– cudaFree(void* pointer) // frees memory allocated by cudaMalloc
• HOST <-> DEVICE data transfer
– cudaMemcpy(void *dst, void *src, size_t nbytes, enum cudaMemcpyKind direction)
// transfers "nbytes" of data from "src" to "dst"; "direction" specifies the source and destination memory types
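As a minimal sketch (the array size and names are our own), putting these calls together to allocate, initialize, transfer, and free global memory:

int n = 1024;
size_t nbytes = n * sizeof(float);
float *a_h = (float *)malloc(nbytes);                   // host array
float *a_d = NULL;                                      // device pointer
cudaMalloc((void **)&a_d, nbytes);                      // allocate global memory on the device
cudaMemset(a_d, 0, nbytes);                             // set every byte of the allocation to 0
cudaMemcpy(a_d, a_h, nbytes, cudaMemcpyHostToDevice);   // host -> device
cudaMemcpy(a_h, a_d, nbytes, cudaMemcpyDeviceToHost);   // device -> host
cudaFree(a_d);                                          // free the device memory
free(a_h);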
CUDA Function Qualifiers
__global__
• Function called from host and executed on device
• Must return void
E.g., kernels
__device__
• Function called from device and run on device.
• Cannot be called from host code
__host__
• Function called from host and executed on host (default)
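A minimal sketch (function names are hypothetical) showing the three qualifiers together:

__device__ float square(float x)               // callable only from device code
{
    return x * x;
}

__global__ void square_all(float *v, int n)    // kernel: called from the host, runs on the device, returns void
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        v[i] = square(v[i]);
}

__host__ void launch(float *v_d, int n)        // ordinary host function (the default qualifier)
{
    square_all<<<(n + 255) / 256, 256>>>(v_d, n);
}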
Kernel Calls and Unique Thread Index
Kernels are called by the modified syntax:
kernel<<<dim3 dG, dim3 dB>>>(…)
Here dim3 is a vector type with x, y, z as its members.
We can initialize dim3 objects with the constructor:
For 1D grid: dim3 dG(var_x,1,1) or dim3 dG(var)
For 2D grid: dim3 dG(var_x,var_y,1) or dim3 dG(var_x,var_y)
Similarly for blocks:
For 1D block: dim3 dB(var_x,1,1) or dim3 dB(var)
For 2D block: dim3 dB(var_x,var_y,1) or dim3 dB(var_x,var_y)
For 3D block: dim3 dB(var_x,var_y,var_z)
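As an illustrative sketch (the kernel name and image size are hypothetical), launching a 2D grid of 2D blocks:

dim3 dG(1024 / 16, 768 / 16);        // 64 x 48 thread blocks for a 1024 x 768 image
dim3 dB(16, 16);                     // 256 threads per block
process_image<<<dG, dB>>>(img_d);    // img_d is a device pointer obtained from cudaMalloc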
Thread Synchronization
Host synchronization:
cudaThreadSynchronize();
Blocks until all preceding CUDA calls have completed.
Device synchronization:
void __syncthreads();
Synchronizes all threads within a block.
There is no way to synchronize threads across different blocks.
The programmer should be careful to avoid RAW/WAW/WAR
hazards.
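A minimal sketch (ours) of how __syncthreads() avoids a read-after-write hazard on shared memory within a block:

__global__ void reverse_in_block(int *data)
{
    __shared__ int tile[256];                      // assumes 256 threads per block
    int t = threadIdx.x;
    tile[t] = data[blockIdx.x * blockDim.x + t];   // every thread writes one shared element
    __syncthreads();                               // wait until all writes in the block are done
    data[blockIdx.x * blockDim.x + t] = tile[blockDim.x - 1 - t];  // now reading other threads' writes is safe
}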
Heterogeneous Programming and Synchronization
// copy data from host to device
cudaMemcpy(a_d, a_h, numBytes, cudaMemcpyHostToDevice);
// execute the kernel
inc_gpu<<<ceil(N/(float)blocksize), blocksize>>>(a_d, N);
// run independent CPU code
run_cpu_stuff();
// copy data from device back to host
cudaMemcpy(a_h, a_d, numBytes, cudaMemcpyDeviceToHost);
Error Reporting
All CUDA calls return an error code, but some calls are
asynchronous, so the program should synchronize before
checking for errors.
Example:
cudaThreadSynchronize();
Kernel_Launch<<<config_arguments>>>(arg_list);
cudaThreadSynchronize();
printf("%s\n", cudaGetErrorString( cudaGetLastError() ) );
Hardware Implementation
• The CUDA architecture is built around a scalable array of multithreaded
multiprocessors. When a CUDA program on the host CPU invokes a kernel grid, the
blocks of the grid are enumerated and distributed to multiprocessors with
available execution capacity. The threads of a thread block execute concurrently on
one multiprocessor, and multiple thread blocks can execute concurrently on one
multiprocessor. As thread blocks terminate, new blocks are launched on the
vacated multiprocessors. This makes the framework scalable.
• A multiprocessor is designed to execute hundreds of threads concurrently. To
manage such a large number of threads, it employs a unique architecture called
SIMT (Single-Instruction, Multiple-Thread). When a multiprocessor is given one or
more thread blocks to execute, it partitions them into warps of 32 threads. A warp executes one
common instruction at a time, so full efficiency is realized when all 32 threads of a
warp agree on their execution path.
PERFORMANCE OPTIMIZATION
Performance optimization revolves around three
basic strategies:
• Maximizing parallel execution
• Optimizing memory usage
• Optimizing instruction usage to achieve
maximum instruction throughput
Maximizing parallel execution
• Amdahl's law states that the maximum speed-up (S) of a program is
S = 1 / ((1 − P) + P/N)
where P is the fraction of the total serial execution time taken by the portion of code that can be parallelized and N is the number of
processors over which the parallel portion of the code runs. The larger N is (that is, the greater the number of processors), the smaller
the P/N fraction.
It can be simpler to view N as a very large number, which essentially transforms the
equation into
S = 1 / (1 − P)
Now, if ¾ of a program is parallelized, the maximum speed-up over serial code is 1/ (1
– ¾) = 4. So our aim is to increase P by increasing the fraction of parallel code.
Optimizing memory transfers
• To run kernels, data values must be transferred from the host
to the device along the PCI Express (PCIe) bus. It is important
to minimize data transfer between the host and the device,
even if that means running kernels on the GPU that do not
demonstrate any speed-up.
Device<->Device transfer
CUDA provides a function for device-to-device data transfer, which
can only be called from the host code.
The call to cudaMemcpy() is asynchronous, but
• the next kernel won't start until the memory transfer is complete,
• and what if there is a large amount of memory to transfer?
The GPU cores will be idle.
To increase performance we can instead assign the job of copying N
bytes of data to B blocks, each running k threads in parallel. For
best performance N = k * B. (E.g., it takes about 4.5 times less time if we
assign the job of copying 1 MB of data to around 1k threads.)
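A minimal sketch (our own, not from the slides) of such a copy kernel, in which each of the k * B threads copies one element:

__global__ void copy_kernel(const float *src, float *dst, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one element per thread
    if (i < n)
        dst[i] = src[i];
}

// e.g. k = 256 threads per block, B = (n + 255) / 256 blocks:
// copy_kernel<<<(n + 255) / 256, 256>>>(src_d, dst_d, n);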
Shared Memory
Each multiprocessor has 16 KB of shared memory
associated with it.
• Provides thread cooperation within a block of
threads.
• Allows sharing of memory accesses (see the sketch below)
• Avoids redundant computations
• Because it is on-chip, shared memory is much
faster than local and global memory.
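As an illustrative sketch (ours), a 1D averaging kernel that stages a block's data in shared memory so neighbouring threads reuse each other's loads instead of re-reading global memory:

__global__ void smooth(const float *in, float *out, int n)
{
    __shared__ float s[256 + 2];                        // assumes blockDim.x == 256, plus a 1-element halo on each side
    int g = blockIdx.x * blockDim.x + threadIdx.x;      // global index
    int l = threadIdx.x + 1;                            // index inside the shared tile
    if (g < n) {
        s[l] = in[g];                                   // each thread loads its own element once
        if (threadIdx.x == 0)                           // left halo
            s[0] = (g > 0) ? in[g - 1] : in[g];
        if (threadIdx.x == blockDim.x - 1 || g == n - 1)  // right halo
            s[l + 1] = (g + 1 < n) ? in[g + 1] : in[g];
    }
    __syncthreads();                                    // make all shared-memory writes visible to the block
    if (g < n)
        out[g] = (s[l - 1] + s[l] + s[l + 1]) / 3.0f;   // three reads served from fast on-chip memory
}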
Coalesced Access to Global Memory
• Global memory can be viewed in terms of aligned segments
of 16 and 32 words.
(Figures: coalesced access in which all threads but one access the corresponding
word in a segment; misaligned sequential addresses that fall within two
128-byte segments.)
Choosing thread block sizes as multiples of 16 facilitates
memory accesses by half-warps that are aligned to segments.
But the warp size is 32, so there should be a minimum of 32
threads per block.
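As an illustrative sketch (ours), a coalesced access pattern next to a strided one; in the first kernel the 16 threads of a half-warp read 16 consecutive words from one segment, while the second spreads them over many segments:

__global__ void coalesced_read(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        out[i] = in[i];             // consecutive threads touch consecutive words: coalesced
}

__global__ void strided_read(const float *in, float *out, int n, int stride)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i * stride < n)
        out[i] = in[i * stride];    // consecutive threads touch words far apart: many more memory transactions
}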
Optimizing Instruction Usage
A warp executes one common instruction at a time, so
full efficiency is realized when all 32 threads of a warp
agree on their execution path. Any flow control
instruction (if, switch, do, for, while) can significantly
affect the instruction throughput by causing threads of
the same warp to diverge onto different execution paths. If this
happens, the different execution paths must be
serialized, increasing the total number of instructions
executed for this warp.
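A small sketch (ours) of a branch that diverges within a warp versus one that branches at whole-warp granularity and therefore does not:

__global__ void divergent(float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (threadIdx.x % 2 == 0)          // even and odd lanes of the same warp take different paths: serialized
        v[i] *= 2.0f;
    else
        v[i] *= 0.5f;
}

__global__ void uniform(float *v)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if ((threadIdx.x / 32) % 2 == 0)   // all 32 threads of a warp take the same path: no divergence
        v[i] *= 2.0f;
    else
        v[i] *= 0.5f;
}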
Parallelizing w.r.t. pixels
This applies when the processing of the pixels is independent,
e.g., conversion from RGB to grey, or conversion from one format to
another.
char *host_img_rgb = (char *)malloc(3*height*width*sizeof(char));
char *host_img_grey = (char *)malloc(height*width*sizeof(char)); //allocating in Host
char *dev_img_rgb, *dev_img_grey; //device pointers
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char));
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char)); //allocating in Device
//read image into the HOST memory
//copy that rgb image into the Device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
//copy back to host memory
cudaMemcpy(host_img_grey, dev_img_grey, width*height*sizeof(char), cudaMemcpyDeviceToHost);
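The slides do not show the kernel body; a minimal sketch of what such an RGB-to-grey kernel could look like (the packed-RGB layout and the simple averaging formula are our assumptions):

__global__ void Kernel(const char *rgb, char *grey)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;          // one thread per pixel
    unsigned char r = rgb[3*i], g = rgb[3*i + 1], b = rgb[3*i + 2];
    grey[i] = (char)((r + g + b) / 3);                      // simple average as the grey value
}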
Visualizing the kernel execution
(Figure: each block of 256 threads reads from its part of the RGB image in device
memory and writes to the corresponding part of the grey image.)
• Parallel Execution of Each Block
• Choosing 256 threads per block encourages
coalesced memory access
Improvements
char *host_img_rgb = (char *)malloc(3*height*width*sizeof(char));
char *host_img_grey = (char *)malloc(height*width*sizeof(char)); //allocating in Host
char *dev_img_rgb, *dev_img_grey; //device pointers
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char)); //allocating in Device
//read image into the HOST memory
//copy that rgb image into the Device memory
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char));
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
cudaMemcpy(host_img_grey, dev_img_grey, width*height*sizeof(char), cudaMemcpyDeviceToHost);
IMPROVEMENT IN ALLOCATION
cudaMalloc((void **) &(dev_img_rgb), 3*width*height*sizeof(char));
cudaMalloc((void **) &(dev_img_grey), width*height*sizeof(char));
Better way:
char *temp_dev_point;
cudaMalloc((void **)&temp_dev_point, 4*width*height*sizeof(char));
dev_img_rgb = temp_dev_point;
dev_img_grey = temp_dev_point + (3*width*height);
For example, it takes 12 times less time to allocate 6000 bytes at once than to
allocate 4 arrays of 1500 bytes each.
Problems in data transfer and execution
cudaMemcpy(dev_img_rgb, host_img_rgb, 3*width*height*sizeof(char), cudaMemcpyHostToDevice);
Kernel<<<(height*width)/256, 256>>>(dev_img_rgb, dev_img_grey);
• The kernel has to wait for the data transfer,
so the cores are idle.
• Moreover, the Host<->Device transfer is slow.
Page Locked Memory
• CUDA allows the programmer to allocate page-
locked host memory.
• The data transfer rate between page-locked
host memory and the device memory is higher.
• It allows asynchronous data transfer.
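A minimal sketch (ours) of allocating the host image buffers as page-locked memory with cudaMallocHost, so that the cudaMemcpyAsync calls on the next slide can overlap copies with kernel execution:

char *host_img_rgb, *host_img_grey;
cudaMallocHost((void **)&host_img_rgb, 3*width*height*sizeof(char));   // page-locked host allocation
cudaMallocHost((void **)&host_img_grey, width*height*sizeof(char));
// ... fill host_img_rgb, launch the asynchronous transfers and kernels ...
cudaFreeHost(host_img_rgb);                                            // page-locked memory is freed with cudaFreeHost
cudaFreeHost(host_img_grey);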
Concurrency
//Use of Streams
//Creating Streams
cudaStream_t stream[height];
for (int i = 0; i < height; ++i)
    cudaStreamCreate(&stream[i]);
//Specifying sequence of host to device transfers (one image row per stream).
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(dev_img_rgb + (i * 3 * width), host_img_rgb + (i * 3 * width),
                    3 * width * sizeof(char), cudaMemcpyHostToDevice, stream[i]);
//Specifying sequence of kernel launches
for (int i = 0; i < height; ++i)
    Kernel<<<width/256, 256, 0, stream[i]>>>(dev_img_rgb + i * 3 * width, dev_img_grey + i * width);
//Specifying sequence of device to host transfers.
for (int i = 0; i < height; ++i)
    cudaMemcpyAsync(host_img_grey + (i * width), dev_img_grey + (i * width),
                    width * sizeof(char), cudaMemcpyDeviceToHost, stream[i]);
Comparison of Timelines for non-concurrent
and concurrent execution
(Figure: timelines of Host->Device transfer, kernel execution, and Device->Host transfer
for the non-concurrent and the concurrent case.)
Parallelizing nested loops
E.g., parallelizing w.r.t. the pixels in a patch:
for (int i = 0; i < width/width_of_patch; i++)
    for (int j = 0; j < height/height_of_patch; j++)
        for (int k = 0; k < width_of_patch; k++)
            for (int l = 0; l < height_of_patch; l++)
            { ... }
//We launch a 2-D grid
dim3 grid(width/patch_width,
height/patch_height);
//and 2-D blocks within the grid
dim3 block(patch_width,patch_height);
//launch kernel
Kernel_name<<<grid,block>>>
(arg list…..);
How to find the index of the patch inside the grid?
blockIdx.y * gridDim.x + blockIdx.x
How to find the index of the pixel inside the block?
threadIdx.y * blockDim.x + threadIdx.x
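Putting the two formulas together, a hypothetical kernel (ours, not from the slides) that rearranges the image so each patch's pixels become contiguous in the output:

__global__ void gather_patches(const char *img, char *out, int width)
{
    int patch_idx = blockIdx.y * gridDim.x + blockIdx.x;        // index of the patch within the grid
    int pixel_idx = threadIdx.y * blockDim.x + threadIdx.x;     // index of the pixel within the patch
    int x = blockIdx.x * blockDim.x + threadIdx.x;              // source pixel coordinates, assuming one
    int y = blockIdx.y * blockDim.y + threadIdx.y;              // block per patch and one thread per pixel
    out[patch_idx * blockDim.x * blockDim.y + pixel_idx] = img[y * width + x];
}

// launched as: gather_patches<<<grid, block>>>(dev_img, dev_patches, width);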
How to choose the best configuration arguments?
• CUDA provides an Occupancy Calculator as
an Excel spreadsheet.
• Occupancy is the ratio of the number of
active warps per multiprocessor to the
maximum number of possible active warps.