GPGPU Programming With CUDA: Leandro Avila - University of Northern Iowa
Mentor: Dr. Paul Gray Computer Science Department University of Northern Iowa
Outline
Introduction
A shift away from the traditional paradigm of sequential programming, towards parallel processing. Scientific computing needs to change in order to deal with vast amounts of data. Hardware changes have contributed to the move towards parallel processing.
Memory Wall
Discrepancy between memory and CPU performance. Effort put into instruction-level parallelism (ILP) yields diminishing returns. Clock frequency vs. heat dissipation.
Power Wall
Manferdelli, J. (2007) - The Many-Core Inflection Point for Mass Market Computer Systems
Accelerators
In HPC, an accelerator is a hardware component whose role is to speed up some aspect of the computing workload. In the old days (1980s), supercomputers had array processors for vector operations on arrays, as well as floating point accelerators. More recently, Field Programmable Gate Arrays (FPGAs) allow reprogramming deep into the hardware.
Accelerators
Advantages
They make your code run faster.
Disadvantages
More expensive. Harder to program. Code is not portable from one accelerator to another (OpenCL attempts to change this).
Introducing GPGPU
General Purpose Computing on Graphics Processing Units. A great example of the trend of moving away from the traditional model.
Why GPUs?
Graphics Processing Units (GPUs) were originally designed to accelerate graphics tasks like image rendering. They became very popular with videogamers, because they have produced better and better images, lightning fast. And prices have been extremely good, ranging from three figures at the low end to four figures at the high end. Most of what GPUs do is render images.
This is done mostly through floating point arithmetic, the same stuff people use supercomputing for!
Architecture
Architecture Comparison
Nvidia Tesla C1060: 240 processing cores, 4 GB memory, 1.3 GHz clock speed, 102 GB/sec memory bandwidth, 933 GFLOPS single precision, 78 GFLOPS double precision.
Intel i7 975 Extreme: 4 cores, 32 KB/core L1 cache, 256 KB/core L2 cache, 8 MB shared L3 cache, 3.33 GHz clock speed, 25 GB/sec memory bandwidth, 70 GFLOPS double precision.
Components
From http://www.tomshardware.com/reviews/nvidia-cuda-gpu,1954-7.html
Streaming Multiprocessors
Blocks of threads are assigned to SMs. On the Tesla C1060, an SM contains 8 Scalar Processors.
Hardware Hierarchy
The GPU is organized hierarchically: Streaming Multiprocessors, each containing Scalar Processors.
Great! We see that the GPU architecture is different from what we see in the traditional CPU. So... now what? What does this all mean? How do we use it?
Glossary
The HOST is the machine executing the main program. The DEVICE is the card with the GPU. The KERNEL is the routine that runs on the GPU.
A THREAD is the basic execution unit on the GPU. A BLOCK is a group of threads. A GRID is a group of blocks. A WARP is a group of 32 threads.
Recall that threads are organized in BLOCKS, and BLOCKS in turn are organized in a GRID. The GRID can have two dimensions: x and y.
Maximum GRID dimensions: 65535 x 65535 x 1
Maximum BLOCK dimensions: 512 x 512 x 64
(These limits can also be queried at run time; see the sketch below.)
Prior to kernel execution, we need to set it up by specifying the dimensions of the GRID and the dimensions of the BLOCKS.
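Before choosing those dimensions it can help to know the actual limits of the installed card. As a hedged sketch (device 0 and the printed labels are assumptions, not from the original slides), the limits can be queried at run time with cudaGetDeviceProperties:

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0

    printf("Max threads per block: %d\n", prop.maxThreadsPerBlock);
    printf("Max block dims: %d x %d x %d\n",
           prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
    printf("Max grid dims:  %d x %d x %d\n",
           prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
    return 0;
}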
Scheduling in Hardware
The host launches kernels onto the device: Kernel 1 executes on Grid 1 and Kernel 2 on Grid 2. Each grid consists of blocks (Block (0,0) through Block (2,1) in Grid 1), and each block consists of threads (Thread (0,0) through Thread (4,2) in Block (1,1)).
The grid is launched. Blocks are distributed to the SMs. Each SM initiates processing of warps and schedules the warps that are ready. As warps finish and resources are freed, new warps are scheduled. An SM can take up to 1024 threads.
Memory Layout
Registers and shared memory are the fastest. Local memory is virtual memory. Global memory is the slowest.
Threads access memory as follows:
Registers: read & write
Local Memory: read & write
Shared Memory: read & write (block level)
Global Memory: read & write (grid level)
Constant Memory: read only (grid level)
Remember that Local Memory is implemented as virtual memory from a region that resides in Global Memory.
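As a minimal sketch of these access rules (the kernel and variable names are hypothetical, not from the slides): a thread reads and writes its registers and global memory, reads and writes the shared memory of its block, and only reads constant memory.

__constant__ float scale;                 // constant memory: read-only inside the kernel (grid level)

__global__ void accessDemo(float *g_out, const float *g_in)   // global memory: read & write (grid level)
{
    __shared__ float s_buf[256];          // shared memory: read & write within the block
                                          // (assumes blocks of 256 threads and a grid that exactly covers the arrays)
    int   i = blockIdx.x * blockDim.x + threadIdx.x;
    float r = g_in[i];                    // register: read & write, private to this thread
    s_buf[threadIdx.x] = r * scale;       // write shared memory, read constant memory
    __syncthreads();                      // make the shared write visible to the whole block
    g_out[i] = s_buf[threadIdx.x];        // write global memory
}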
CUDA API
Programming Pattern
The host reads the input and allocates memory on the device. The host copies data to the device. The host invokes a kernel that executes in parallel, using the data and hardware on the device, to do some useful work. The host copies the results back from the device for post processing.
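A minimal, self-contained sketch of this pattern (the kernel, sizes, and array names here are illustrative assumptions, not from the slides):

#include <cuda_runtime.h>

__global__ void addOne(float *d_a, int n)               // trivial kernel used for illustration
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_a[i] += 1.0f;
}

int main()
{
    const int n = 1024;
    const size_t bytes = n * sizeof(float);
    float h_a[n];                                        // host buffer: input and results
    for (int i = 0; i < n; ++i) h_a[i] = (float)i;       // host prepares the input

    float *d_a;
    cudaMalloc((void**)&d_a, bytes);                     // allocate memory on the device
    cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice); // copy data to the device

    addOne<<< n / 256, 256 >>>(d_a, n);                  // invoke the kernel (4 blocks of 256 threads)

    cudaMemcpy(h_a, d_a, bytes, cudaMemcpyDeviceToHost); // copy results back for post processing
    cudaFree(d_a);
    return 0;
}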
Kernel Setup
__global__ void myKernel(float *d_b, float *d_a);   // declaration (parameter types shown for illustration)
dim3 dimGrid(2, 2, 1);                              // 2 x 2 grid of blocks
dim3 dimBlock(4, 8, 8);                             // 4 x 8 x 8 threads per block
myKernel<<< dimGrid, dimBlock >>>( d_b, d_a );      // launch
Function Declaration
__device__ float myDeviceFunc()   executes on the Device, callable from the Device
__global__ void myKernel()        executes on the Device, callable from the Host
__host__ float myHostFunc()       executes on the Host, callable from the Host
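A short sketch of how these combine (names are hypothetical): a __global__ kernel launched from the host calls a __device__ helper on the GPU, while a __host__ function is ordinary CPU code.

#include <cstdio>

__device__ float square(float x)              // executes on the device, callable from device code
{
    return x * x;
}

__global__ void squareAll(float *d_v, int n)  // executes on the device, launched from the host
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) d_v[i] = square(d_v[i]);
}

__host__ void report(float result)            // executes on the host, callable from the host
{
    printf("result = %f\n", result);
}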
Useful Variables
gridDim.(x|y): grid dimensions in x and y
blockDim.(x|y|z): number of threads in a block, per dimension
blockIdx.(x|y): block index within the grid
threadIdx.(x|y|z): thread index within the block
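The most common use of these variables is giving each thread a unique global index; a minimal sketch (the kernel name and arguments are hypothetical):

__global__ void scaleVector(float *d_v, float alpha, int n)
{
    // block offset plus position within the block gives a unique index per thread
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)           // guard: the grid may contain more threads than elements
        d_v[i] *= alpha;
}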
Variable type qualifiers specify the memory location of a variable in the device's memory:
__device__ declares a variable in device (global) memory
__constant__ declares a constant in constant memory
__shared__ declares a variable in shared memory, shared by the threads of a block
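A minimal sketch of the three qualifiers in use (all names are hypothetical); the host can initialize __device__ and __constant__ variables with cudaMemcpyToSymbol:

__device__   float d_scale;                   // variable in device (global) memory
__constant__ float c_table[16];               // constant in constant memory

__global__ void useQualifiers(float *g_out)
{
    __shared__ float s_tile[64];              // shared memory: one copy per block (assumes 64 threads per block)
    s_tile[threadIdx.x] = c_table[threadIdx.x % 16] * d_scale;
    __syncthreads();
    g_out[blockIdx.x * blockDim.x + threadIdx.x] = s_tile[threadIdx.x];
}

// Host-side initialization (sketch):
//   float h_table[16] = { /* coefficients */ };
//   cudaMemcpyToSymbol(c_table, h_table, sizeof(h_table));
//   float h_scale = 2.0f;
//   cudaMemcpyToSymbol(d_scale, &h_scale, sizeof(float));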
Note: all shared memory variables declared as external arrays start at the same address. You must use offsets if multiple variables are to be placed in that shared memory.
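A sketch of managing those offsets for dynamically sized shared memory (the sizes and names are assumptions, not from the slides):

extern __shared__ char s_mem[];               // single dynamic allocation, sized at kernel launch

__global__ void splitShared(int nFloats, int nInts)
{
    // every extern __shared__ variable starts at the same base address,
    // so distinct regions must be carved out with explicit offsets
    float *s_floats = (float*)s_mem;                    // first region: nFloats floats
    int   *s_ints   = (int*)&s_floats[nFloats];         // second region: nInts ints

    if (threadIdx.x < nFloats) s_floats[threadIdx.x] = 0.0f;
    if (threadIdx.x < nInts)   s_ints[threadIdx.x]   = 0;
}

// Launch with enough dynamic shared memory for both regions (third launch parameter):
//   splitShared<<< grid, block, nFloats * sizeof(float) + nInts * sizeof(int) >>>(nFloats, nInts);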