GPGPU (General-Purpose Computing on GPUs)
Components of a GPU
• Three key concepts behind how modern GPU processing cores run code
[Figure: a shader program is compiled and run once per input fragment, producing one shaded fragment output record]
Ideas to make GPU processing cores run fast
Idea # 1: Add CPU Style Cores to run Multiple Fragments
Execution Model    | Multiple independent threads | One thread with wide execution datapath                   | Multiple lockstep threads
Example            | Multicore CPUs               | x86 SSE/AVX                                               | GPUs
Architecture Pros  | More general: supports TLP   | Can mix sequential & parallel code; gather/scatter operations | Easier to program
Like a wide CPU pipeline – except one fetch for entire width
• 16-wide physical ALU: in each cycle it can issue 16 work-items (one quarter of a wavefront)
• Executes a 64-work-item wavefront over 4 cycles. Why? The unit is pipelined, so in 4 cycles it executes 16x4=64 work-items.
• 64KB register state per SIMT Unit
• Compare to x86 (Bulldozer): ~1KB of physical register file state (~1/64 the size)
• Address Coalescing Unit
• A key to good memory performance
• Here, coalescing means merging the memory requests of work-items in the same wavefront that fall in the same cache block into a single request (not the allocator sense of merging two adjacent free blocks of memory).
Address Coalescing
• Wavefront: Issue 64 memory requests
• Common case:
• work-items in same wavefront touch same cache block
• Coalescing:
• Merge many work-items requests into single cache block request
• Important for performance:
• Reduces bandwidth to DRAM
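To make the idea concrete, here is a minimal CUDA sketch (not from the original slides; kernel names are illustrative) contrasting an access pattern that coalesces with one that does not:

// Sketch: adjacent work-items touch adjacent addresses, so the per-wavefront
// requests merge into a few cache-block requests.
__global__ void copyCoalesced(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // work-item k reads element k
    if (i < n)
        out[i] = in[i];
}

// Sketch: adjacent work-items touch addresses 'stride' elements apart; with a
// large stride every work-item hits a different cache block, nothing merges,
// and DRAM traffic multiplies.
__global__ void copyStrided(const float *in, float *out, int n, int stride)
{
    int i = (blockIdx.x * blockDim.x + threadIdx.x) * stride;
    if (i < n)
        out[i] = in[i];
}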
GPU Memory
Bulldozer – FX-8170 vs. GCN – Radeon HD 7970
• GPUs have caches.
• Not Your CPU's Cache

                                                  CPU (Bulldozer)   GPU (GCN)
L1 data cache capacity                            16 KB             16 KB
Active threads (work-items) sharing L1 D cache    1                 2,560
L1 D cache capacity / thread                      16 KB             6.4 bytes
Last-level cache (LLC) capacity                   8 MB              768 KB
Active threads (work-items) sharing LLC           8                 81,920
LLC capacity / thread                             1 MB              9.6 bytes
• GPU Caches
• Maximize throughput, not hide latency
• Not there for either spatial or temporal locality
• L1 Cache: Coalesce requests to same cache block by different work-items
• i.e., streaming locality across work-items, not reuse
• Keep block around just long enough for each work-item to hit once
• Ultimate goal: Reduce bandwidth to DRAM
• L2 Cache: DRAM staging buffer + some instruction reuse
• Ultimate goal: Tolerate spikes in DRAM bandwidth
• If there is any spatial/temporal locality:
• Use local memory (scratchpad)
GPU Memory
Scratchpad Memory
• GPUs have scratchpads (Local Memory)
• Separate address space
• Managed by software:
• Rename address
• Manage capacity – manual fill/eviction
• Allocated to a workgroup
• i.e., shared by wavefronts in workgroup
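As a concrete illustration (a sketch, not from the slides): CUDA exposes the scratchpad as __shared__ memory and OpenCL as __local memory; the workgroup fills it manually, synchronizes, and then reuses the staged data instead of re-reading DRAM.

// Sketch: per-block scratchpad staging in CUDA. Assumes a launch with 256
// threads per block; 'reverseTile' is an illustrative kernel name.
__global__ void reverseTile(const float *in, float *out, int n)
{
    __shared__ float tile[256];                    // scratchpad, allocated per block (workgroup)
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        tile[threadIdx.x] = in[i];                 // manual fill
    __syncthreads();                               // whole workgroup sees the staged tile
    int r = blockDim.x - 1 - threadIdx.x;          // reuse data loaded by another work-item
    if (i < n && blockIdx.x * blockDim.x + r < n)
        out[i] = tile[r];
    // no eviction needed: the scratchpad contents vanish when the block finishes
}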
Example System – AMD Radeon HD 7970 (GCN)
• 32 Compute Units:
• 81,920 Active work-items
• 32 CUs * 4 SIMT Units * 16 ALUs = 2048 Max FP ops/cycle
• 925 MHz engine clock
• 264 GB/s Max memory bandwidth
• 3.79 TFLOPS single precision (accounting trickery: FMA)
• 210W Max Power (Chip)
• >350W Max Power (card)
• 100W idle power (card)
• Two 7970s on one card:
• 375W (AMD Official) – 450W (OEM)
Example System – Nvidia K80
• GPU: 2
• CUDA Cores : 4992
• 2× 240 GB/s Max memory bandwidth
• 5591–8736 GFLOPS Single Precision (MAD or FMA)
GPUs: Great for data parallelism. Bad for everything else.
• Data Parallelism: Identical, Independent work over multiple data inputs
• GPU version: Add streaming access pattern
• Data Parallel Execution Models: MIMD, SIMD, SIMT
• GPU Execution Model: Multicore Multithreaded SIMT
• OpenCL Programming Model
• NDRange over workgroup/wavefront
• Modern GPU Microarchitecture: AMD Graphics Core Next (GCN)
• Compute Unit (“GPU Core”): 4 SIMT Units
• SIMT Unit (“GPU Pipeline”): 16-wide ALU pipe (16x4 execution)
• Memory: designed to stream
CPU vs. GPU Architecture
• The Graphics Processing Unit (GPU) provides much higher instruction throughput and
memory bandwidth than the CPU within a similar price and power envelope. Many
applications leverage these higher capabilities to run faster on the GPU than on the
CPU as described in the GPU Applications Catalog. Other computing devices, like
FPGAs, are also very energy efficient, but offer much less programming flexibility
than GPUs.
• The difference in capabilities between the GPU and the CPU exists because they are
designed with different goals in mind. While the CPU is designed to excel at executing
a sequence of operations, called a thread, as fast as possible and can execute a few
tens of these threads in parallel, the GPU is designed to excel at executing thousands
of them in parallel (amortizing the slower single-thread performance to achieve
greater throughput).
• The GPU is specialized for highly parallel computations and therefore designed such
that more transistors are devoted to data processing rather than data caching and
flow control. The schematic figure on the next slide shows an example distribution of chip
resources for a CPU versus a GPU.
CPU vs. GPU Architecture
➢GPU devotes more transistors to data processing rather than data caching & flow control.
CUDA (Compute Unified Device Architecture)
[Figure: host CPU and GPU device connected via DMA]
GPGPU Programming with CUDA
[Figure: threads of execution indexed by threadIdx.x = 0, 1, 2, 3, …, 4094, 4095 within blockIdx.x = 0; thread blocks arranged in a 2-D grid]
CUDA C Vectors Add on Grid

#include <stdio.h>
#include <stdlib.h>
#include <math.h>

// CUDA kernel. Each thread takes care of one element of c
__global__ void vecAdd(double *a, double *b, double *c, int n)
{
    // Get our global thread ID
    int id = blockIdx.x*blockDim.x + threadIdx.x;
    // Make sure we do not go out of bounds
    if (id < n)
        c[id] = a[id] + b[id];
}

int main( int argc, char* argv[] ) {
    // Size of vectors
    int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input vectors
    double *d_a;
    double *d_b;
    // Device output vector
    double *d_c;
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);
    // Allocate memory for each vector on host
    h_a = (double*) malloc(bytes);
    h_b = (double*) malloc(bytes);
    h_c = (double*) malloc(bytes);
    // Allocate memory for each vector on GPU
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_c, bytes);
    // Initialize vectors on host
    for(int i = 0; i < n; i++ ) {
        h_a[i] = sin(i)*sin(i);
        h_b[i] = cos(i)*cos(i);
    }
CUDA C Vectors Add …
// Copy host vectors to device
cudaMemcpy( d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy( d_b, h_b, bytes, cudaMemcpyHostToDevice);
// Number of threads in each thread block
int NumThreadsPerBlock = 1024;                           // blockDim.x
// Number of thread blocks in the grid
int NumBlocks = (int)ceil((float)n/NumThreadsPerBlock);  // gridDim.x
// Execute the kernel
vecAdd<<<NumBlocks, NumThreadsPerBlock>>>(d_a, d_b, d_c, n);
// Copy array back to host
cudaMemcpy( h_c, d_c, bytes, cudaMemcpyDeviceToHost );
// Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for( int i = 0; i < n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);
// Release device memory
cudaFree(d_a); cudaFree(d_b); cudaFree(d_c);
// Release host memory
free(h_a); free(h_b); free(h_c);
return 0; }
CUDA – Thread Management API (Thread Index & ID)
• For example, the following code adds two matrices A and B of size NxN and stores the
result into matrix C, using a single block of NxN threads indexed in two dimensions.
// Kernel definition
__global__ void MatAdd(float A[N][N], float B[N][N], float C[N][N]) {
int i = threadIdx.x;
int j = threadIdx.y;
C[i][j] = A[i][j] + B[i][j];
}
int main()
{
...
// Kernel invocation with one block of N * N * 1 threads
int numBlocks = 1;
dim3 threadsPerBlock(N, N);
MatAdd<<<numBlocks, threadsPerBlock>>>(A, B, C);
...
}
• The vector types are derived from the basic integer and floating-point types.
They are structures, and the 1st, 2nd, 3rd, and 4th components are
accessible through the fields x, y, z, and w, respectively. They all come
with a constructor function of the form make_<type name>;
• for example, int2 make_int2(int x, int y); creates a vector of type int2
with value (x, y).
• The alignment requirements of the vector types are detailed in the table below.
CUDA Built-in Vector Types

Type                      Alignment
char1, uchar1             1
char2, uchar2             2
char3, uchar3             1
char4, uchar4             4
short1, ushort1           2
short2, ushort2           4
short3, ushort3           2
short4, ushort4           8
int1, uint1               4
int2, uint2               8
int3, uint3               4
int4, uint4               16
long1, ulong1             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long2, ulong2             8 if sizeof(long) is equal to sizeof(int), 16 otherwise
long3, ulong3             4 if sizeof(long) is equal to sizeof(int), 8 otherwise
long4, ulong4             16
longlong1, ulonglong1     8
longlong2, ulonglong2     16
longlong3, ulonglong3     8
longlong4, ulonglong4     16
float1                    4
float2                    8
float3                    4
float4                    16
double1                   8
double2                   16
double3                   8
double4                   16
Built-in variables
• dim3: is an integer vector type based on uint3 that is used to specify
dimensions. When defining a variable of type dim3, any component left
unspecified is initialized to 1.
• Built-in variables specify the grid and block dimensions and the block and
thread indices. They are only valid within functions that are executed on
the device.
1. gridDim variable is of type dim3 and contains the dimensions of the grid.
2. blockIdx variable is of type uint3 and contains the block index within the grid.
3. blockDim variable is of type dim3 and contains the dimensions of the block.
4. threadIdx variable is of type uint3 and contains the thread index within the
block.
5. warpSize variable is of type int and contains the warp size in threads
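For illustration, a short sketch (not from the slides) showing how these built-ins are typically combined into a global 2-D index inside a kernel:

// Sketch: deriving a unique (x, y) coordinate from the built-in variables.
__global__ void scaleMatrix(float *data, int width, int height)
{
    int x = blockIdx.x * blockDim.x + threadIdx.x;   // column across the whole grid
    int y = blockIdx.y * blockDim.y + threadIdx.y;   // row across the whole grid
    if (x < width && y < height)
        data[y * width + x] *= 2.0f;
    // gridDim.{x,y} = number of blocks per dimension, blockDim.{x,y} = threads per block,
    // warpSize = 32 on current NVIDIA hardware
}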
Vector Add Example
• We just need to modify the loop to stride through the array with parallel threads.
• The kernel code will need to know its block and thread index to find its offset into the
passed arrays. The parallelized kernel often uses a grid-stride loop, such as the
following:
// Original kernel: one thread iterates over all elements
__global__
void add(int n, float *x, float *y)
{
    for (int i = 0; i < n; i++)
        y[i] = x[i] + y[i];
}

// Parallelized kernel: grid-stride loop
__global__
void add(int n, float *x, float *y) {
    int index = blockIdx.x * blockDim.x + threadIdx.x;
    int stride = blockDim.x * gridDim.x;
    for (int i = index; i < n; i += stride)
        y[i] = x[i] + y[i];
}
dim3 blockDim(16, 16, 1);
dim3 gridDim((width + blockDim.x - 1)/ blockDim.x, (height + blockDim.y - 1) / blockDim.y, 1);
kernel<<< gridDim, blockDim, 0 >>>(d_data, height, width);
• A GPU context represents all the state (data, variables, conditions, etc.) that
is collectively required and instantiated to perform certain tasks (e.g., CUDA
compute, graphics, H.264 encode, etc.). A CUDA context is instantiated to
perform CUDA compute activities on the GPU, either implicitly by the CUDA
runtime API, or explicitly by the CUDA device API.
• A command is simply a set of data, and instructions to be performed on that
data. For example a command could be issued to the GPU to launch a kernel,
or to move a graphical window from one place to the other on the desktop.
• A channel represents a communication path between host (CPU) and the GPU.
In modern GPUs this makes use of PCI Express, and represents state and
buffers in both host and device, that are exchanged over PCI express, to issue
commands to, and provide other data to, the GPU, as well as to inform the
CPU of GPU activity.
• For the most part, using the CUDA runtime API, it's not necessary to be
familiar with these concepts, as they are all abstracted (hidden) underneath
the CUDA runtime API.
CUDA Threads, Blocks & Grids
Open Computing Language (OpenCL)
Intel Skylake GPU (GT2)
• Intel HD Graphics 520 (GT2) is an integrated graphics unit, which can be found in
various ULV (Ultra Low Voltage) processors of the Skylake generation. The "GT2"
version of the Skylake GPU offers 24 Execution Units (EUs) clocked at up to 1050 MHz
(depending on the CPU model). Due to its lack of dedicated graphics memory or eDRAM
cache, the HD 520 has to access the main memory (2x 64bit DDR3L-1600 / DDR4-2133).
OpenCL
• OpenCL (Open Computing Language)
• Programming framework for CPUs, GPUs, DSPs, FPGAs with
programming language “OpenCL C”
• Started by Apple, subsequent development with AMD, IBM, Intel, and
NVIDIA, meanwhile managed by Khronos Group
• Open and royalty–free standard
• Goal: Programming framework for portable, parallel programming of
devices in heterogeneous environments (CPUs, GPUs, and other
processors; from smartphone to supercomputer)
OpenCL Program Flow
Architecture of OpenCL
• At the conceptual level:
o Platform model
o Execution model
o Memory model
o Programming model
• At the programming level:
• OpenCL Platform API
• OpenCL Runtime API
• OpenCL C (programming language)
Platform Model
• Basic structure: Host which is
connected to several devices
• Host: Computational unit on which
the host program runs.
➢Usually: CPU of the computer system
• Device: Computational unit which
is accessed via OpenCL library.
➢ Examples: CPUs, GPUs, DSPs, FPGAs
• Further subdivision:
➢Device → “Compute Units”
➢Compute Unit → “Processing
Elements”
Platform Model (CPU, GPU and MIC)
1. CPU
• Device: All CPUs on the mainboard of the computer system
• Compute unit (CU): One CU per core (or per hardware thread)
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs
per CU, where n matches the SIMD width.
2. GPU
• Device: Each GPU in the system acts as single device
• Compute unit (CU): One CU per multi–processor (NVIDIA)
• Processing element (PE): 1 PE per CUDA core (NVIDIA) or “SIMD lane” (AMD)
3. MIC
• Device: Each MIC (Many Integrated Cores) in the system acts as single device
• Compute unit (CU): One CU per hardware thread (= 4 x [# of cores - 1])
• Processing element (PE): 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs
per CU, where n matches the SIMD width
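This subdivision is visible through the API: the host can, for instance, query how many compute units a device exposes. A minimal sketch (not from the slides; error handling omitted, first platform and default device assumed):

#include <stdio.h>
#include <CL/opencl.h>

int main(void)
{
    cl_platform_id platform;
    cl_device_id   device;
    cl_uint        num_cus;
    size_t         max_wg;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_DEFAULT, 1, &device, NULL);

    // Number of compute units and maximum work-group size of the device
    clGetDeviceInfo(device, CL_DEVICE_MAX_COMPUTE_UNITS, sizeof(num_cus), &num_cus, NULL);
    clGetDeviceInfo(device, CL_DEVICE_MAX_WORK_GROUP_SIZE, sizeof(max_wg), &max_wg, NULL);

    printf("compute units: %u, max work-group size: %lu\n",
           (unsigned)num_cus, (unsigned long)max_wg);
    return 0;
}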
Fermi: GF100/GF110 - 16 Streaming Multi–Processors(SM)
Die shot
GF100/GF110: Streaming Multi–Processor (SM)
SM properties
• 32 CUDA cores (Streaming processors/SP)
• 6 Load/store units
• 4 Special function units (SFU)
• 2 Warp scheduler
• In total (16 SMs x 32 cores): 512 ALUs/FPUs available
Platform Model (Platform)
• Platform
• Every OpenCL implementation (with underlying OpenCL library) defines a so–
called “platform“.
• Each specific platform enables the host to control the devices belonging to it.
• Platforms of various manufacturers can coexist on one host and may be used
from within a single application (ICD: “installable client driver model“).
Platform Model - Practical Hints
Get OpenCL running under Linux
• Header files: Get from Khronos website (e.g.)
➢ Central file: CL/cl.h
• OpenCL library stub with ICD loader:
• Get from one of the vendors of your OpenCL devices
➢ Central file: libOpenCL.so
• ICD definition files and platform–specific OpenCL libraries:
• Get from all the vendors of your OpenCL devices
➢ ICD files usually located in: /etc/OpenCL/vendors/
• Mechanism at runtime:
➢libOpenCL.so is dynamically linked to your application at runtime
➢ICD loader uses dlopen(..) to open all required platform–specific OpenCL libraries
➢Calls to OpenCL library functions are routed to the correct implementation
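With the header and the ICD loader in place, building an application is then just a matter of linking against the stub library, for example (assuming the source file is called vecadd.c):

gcc -o vecadd vecadd.c -lOpenCL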
Execution Model - Example: 2D–Arrangement of Work–Items
OpenCL Host API
Basic Programming Steps:
• Query platforms → selection
• Query devices of the platform → selection
• Create context for the devices
• Create queue (for context and device)
• Create program object (for context) ← from C string
➢ Compile program
➢ Create kernel (contained in program)
• Create memory objects (within context)
• Kernel execution:
1. Set kernel arguments
2. Put kernel into queue → Execution
• Copy memory objects with results from device to host (invoke via queue)
• Clean up …
Excursus: Thread Management on GPUs
Kernel
• Function for execution on the device (here: GPU)
• Typical scenario: Many kernel instantiations running simultaneously in parallel
threads
Challenge
• Management of many thousands of threads
Solution
• “Coarse Grained Parallelism” → ”Fine Grained Parallelism”
Thread Management (cont.)
→ Functions for all these steps: OpenCL Platform and Runtime API
Basic Programming Steps … in Practice
1. Query platforms : selection
2. Query devices of the platform : selection
3. Create context for the devices
4. Create queue (for context and device)
5. Create program object (for context) from C string
5.1 Compile program
5.2 Create kernel (contained in program)
6. Create memory objects (within context)
7. Kernel execution:
7.1 Set kernel arguments
7.2 Put kernel into queue : Execution
8. Copy memory objects with results from device to host (invoke via queue)
9. Clean up ...
OpenCL - Addition of two vectors using single work-item

#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <CL/opencl.h>

// OpenCL kernel. Each work item takes care of one element of c
const char *kernelSource =                                    "\n"
"#pragma OPENCL EXTENSION cl_khr_fp64 : enable                 \n"
"__kernel void vecAdd(  __global double *a,                    \n"
"                       __global double *b,                    \n"
"                       __global double *c,                    \n"
"                       const unsigned int n)                  \n"
"{                                                             \n"
"    //Get our global thread ID                                \n"
"    int id = get_global_id(0);                                \n"
"                                                              \n"
"    //Make sure we do not go out of bounds                    \n"
"    if (id < n)                                               \n"
"        c[id] = a[id] + b[id];                                \n"
"}                                                             \n"
                                                               "\n";

int main( int argc, char* argv[] ) {
    // Length of vectors
    unsigned int n = 100000;
    // Host input vectors
    double *h_a;
    double *h_b;
    // Host output vector
    double *h_c;
    // Device input buffers
    cl_mem d_a;
    cl_mem d_b;
    // Device output buffer
    cl_mem d_c;

    cl_platform_id cpPlatform;     // OpenCL platform
    cl_device_id device_id;        // device ID
    cl_context context;            // context
    cl_command_queue queue;        // command queue
    cl_program program;            // program
    cl_kernel kernel;              // kernel
    // Size, in bytes, of each vector
    size_t bytes = n*sizeof(double);

    // Allocate memory for each vector on host
    h_a = (double*) malloc(bytes);
    h_b = (double*) malloc(bytes);
    h_c = (double*) malloc(bytes);
    // Initialize vectors on host
    int i;
    for( i = 0; i < n; i++ ) {
        h_a[i] = sinf(i)*sinf(i);
        h_b[i] = cosf(i)*cosf(i);
    }

    size_t globalSize, localSize;
    cl_int err;
    // Number of work items in each local work group
    localSize = 64;
    // Number of total work items - localSize must be a divisor
    globalSize = ceil(n/(float)localSize)*localSize;

    // Bind to platform
    err = clGetPlatformIDs(1, &cpPlatform, NULL);
    // Get ID for the device
    err = clGetDeviceIDs(cpPlatform, CL_DEVICE_TYPE_GPU, 1, &device_id, NULL);
    // Create a context
    context = clCreateContext(0, 1, &device_id, NULL, NULL, &err);
    // Create a command queue
    queue = clCreateCommandQueue(context, device_id, 0, &err);
    // Create the compute program from the source buffer
    program = clCreateProgramWithSource(context, 1,
                                        (const char **) & kernelSource, NULL, &err);
    // Build the program executable
    clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
    // Create the compute kernel in the program we wish to run
    kernel = clCreateKernel(program, "vecAdd", &err);
    // Create the input and output arrays in device memory for our calculation
    d_a = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_b = clCreateBuffer(context, CL_MEM_READ_ONLY, bytes, NULL, NULL);
    d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY, bytes, NULL, NULL);
    // Write our data set into the input array in device memory
    err = clEnqueueWriteBuffer(queue, d_a, CL_TRUE, 0,
                               bytes, h_a, 0, NULL, NULL);
    err |= clEnqueueWriteBuffer(queue, d_b, CL_TRUE, 0,
                                bytes, h_b, 0, NULL, NULL);
    // Set the arguments to our compute kernel
Addition of two Vectors …
err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &n);
// Execute the kernel over the entire range of the data set
err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, &localSize, 0, NULL, NULL);
// Wait for the command queue to get serviced before reading back results
clFinish(queue);
// Read the results from the device
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c, 0, NULL, NULL );
//Sum up vector c and print result divided by n, this should equal 1 within error
double sum = 0;
for(i=0; i<n; i++)
sum += h_c[i];
printf("final result: %f\n", sum/n);
// Release OpenCL resources
clReleaseMemObject(d_a); clReleaseMemObject(d_b); clReleaseMemObject(d_c);
clReleaseProgram(program); clReleaseKernel(kernel); clReleaseCommandQueue(queue); clReleaseContext(context);
// Release host memory
free(h_a); free(h_b); free(h_c);
//
return 0;}
Exercises
• Task – 1:
• Implement the addition of three vectors instead of two!
• Task – 2:
• Implement a second kernel for element–wise vector multiplication!
• Compute with both kernels (multiplication and pair–wise addition) the equation
e = a * b + c * d as element–wise vector operation!
• BONUS: Use an out–of–order queue instead of the default queue …
• … and ensure by using events that all commands are executed in the right order!
OpenCL API
Query Platforms
cl_int clGetPlatformIDs ( cl_uint num_entries ,
cl_platform_id * platforms ,
cl_uint * num_platforms );
Related functions
• clGetDeviceInfo(..)
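A common usage pattern (a sketch, not from the slides) calls the function twice: first to ask how many platforms exist, then to fetch their IDs:

#include <stdlib.h>
#include <CL/opencl.h>

cl_platform_id *query_platforms(cl_uint *count)
{
    cl_uint num_platforms = 0;
    // 1st call: num_entries = 0, platforms = NULL -> only report the count
    clGetPlatformIDs(0, NULL, &num_platforms);

    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    // 2nd call: fill the array with at most num_platforms entries
    clGetPlatformIDs(num_platforms, platforms, NULL);

    *count = num_platforms;
    return platforms;   // caller selects one platform and frees the array
}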
Create Context -- Precondition: Device exists
cl_context clCreateContext ( const cl_context_properties * properties ,
                             cl_uint num_devices ,
                             const cl_device_id * devices ,
                             void (CL_CALLBACK * pfn_notify) (
                                 const char * errinfo ,
                                 const void * private_info , size_t cb ,
                                 void * user_data
                             ),
                             void * user_data ,
                             cl_int * errcode_ret );
Creation of a context
• Return value : The created context
• properties : Bit field for the definition of the desired properties of the context
• num_devices : Number of devices for which the context shall be created
• devices : Array with devices for which the context shall be created
• errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Create Queue -- Precondition: Context & device exist
cl_command_queue clCreateCommandQueue (cl_context context ,
cl_device_id device ,
cl_command_queue_properties properties ,
cl_int * errcode_ret );
Creation of a queue
• Return value : The created queue
• context : Context within which the queue shall be created
• device : Device for which the queue shall be created
• properties : Bit field for the definition of the desired properties of the
queue. The default mode for queues is “in order execution”
(other settings possible via parameter properties).
• errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Create Program Object -- Precondition: Context & source code exist
cl_program clCreateProgramWithSource( cl_context context ,
cl_uint count ,
const char ** strings ,
const size_t * lengths ,
cl_int * errcode_ret );
Memory object flags (cl_mem_flags, as used with clCreateBuffer):
• CL_MEM_READ_WRITE: Memory object will be read and written by a kernel.
• CL_MEM_READ_ONLY: Memory object will only be read by a kernel.
• CL_MEM_WRITE_ONLY: Memory object will only be written by a kernel.
• CL_MEM_USE_HOST_PTR: The buffer shall be located in host memory at address host_ptr (content may
  be cached in device memory). Not combinable with CL_MEM_ALLOC_HOST_PTR or CL_MEM_COPY_HOST_PTR.
• CL_MEM_ALLOC_HOST_PTR: The buffer will be newly allocated in host memory (in some implementations
  page–locked memory!).
• CL_MEM_COPY_HOST_PTR: The buffer will be initialized with the content of the memory region to which
  host_ptr points.
Set Kernel Arguments -- Precondition: Kernel exists
cl_int clSetKernelArg ( cl_kernel kernel ,
cl_uint arg_index ,
size_t arg_size ,
const void * arg_value );
• If you want to pass a global memory buffer as kernel argument, you have to use the corresponding cl_mem object as value.
• In this case, arg_size has to be the size of the cl_mem object (not the length of the buffer)!
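Continuing the vector-add example above (kernel, d_a, bytes, err as declared there), a two-line sketch of the point made in this note:

err = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);   // correct: size of the cl_mem handle
// err = clSetKernelArg(kernel, 0, bytes, &d_a);         // wrong: length of the buffer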
Execution Model - Example: 2D–Arrangement of Work–Items
Kernel Execution Precondition: Queue and kernel exist, kernel arguments already set
cl_int clEnqueueNDRangeKernel ( cl_command_queue command_queue ,
                                cl_kernel kernel ,
                                cl_uint work_dim ,
                                const size_t * global_work_offset ,
                                const size_t * global_work_size ,
                                const size_t * local_work_size ,
                                cl_uint num_events_in_wait_list ,
                                const cl_event * event_wait_list ,
                                cl_event * event );
Place a kernel for execution in a queue
• Return value : Error code (ideally equal to CL_SUCCESS)
• command_queue : Queue which shall be used for execution
• kernel : The kernel to be executed
• work_dim : Number of array dimensions (concerning the following three parameters)
• global_work_offset : Fx (Fy, Fz) (see preceding slide)
• global_work_size : Gx (Gy, Gz) (see preceding slide; overall number of work-items in each dimension
  across all work-groups!)
• local_work_size : Sx (Sy, Sz) (see preceding slide; the ratios Gx/Sx, Gy/Sy, Gz/Sz need to be
  integer numbers!)
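A host-side sketch (not from the slides; queue, kernel, n, err as in the vector-add example) that rounds the global size up to a multiple of the local size before enqueueing, so that the ratio constraint holds; the kernel then guards the padded work-items with its if (id < n) test:

size_t localSize  = 64;
size_t globalSize = ((n + localSize - 1) / localSize) * localSize;  // multiple of localSize

err = clEnqueueNDRangeKernel(queue, kernel,
                             1,             // work_dim
                             NULL,          // global_work_offset
                             &globalSize,   // global_work_size
                             &localSize,    // local_work_size
                             0, NULL, NULL);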
Transfer Data from Device to Host -- Precondition: Queue exists
cl_int clEnqueueReadBuffer ( cl_command_queue command_queue ,
cl_mem buffer , cl_bool blocking_read ,
size_t offset , size_t cb , void *ptr ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Copy buffer content into host memory (e.g., buffer with results after kernel execution)
• Return value : Error code (ideally equal to CL_SUCCESS)
• command_queue : Queue which shall be used for execution
• buffer : Buffer object which serves as source of the copy operation
• blocking_read : If true, the function only returns after the copy operation has been
finished (and therefore also all preceding commands in the queue if it operates in “in–
order mode”)
• offset : Read offset in the buffer (in bytes)
• cb : Number of bytes to copy
• ptr : Pointer to the target region in host memory (sufficient size must be allocated)
Free OpenCL Resources -- (Selection)
cl_int clReleaseContext ( cl_context context );
cl_int clReleaseCommandQueue ( cl_command_queue command_queue );
cl_int clReleaseProgram ( cl_program program );
cl_int clReleaseKernel ( cl_kernel kernel );
cl_int clReleaseMemObject ( cl_mem memobj );
In analogy to the release functions also retain functions exist for many types of OpenCL objects. The retain
functions increase an object–internal counter, the release functions decrease it. Only after all retain calls were
compensated by a release call, the next subsequent release call will ultimately free the resources of the object.
OpenCL for Compute Kernels
Basic Facts about “OpenCL C”
• Derived from ISO C99
• A few restrictions: No recursion, no function pointers, no functions from
the C99 standard headers
• Preprocessing directives defined by C99 are supported (e.g., #include)
• Built–in data types: Scalar and vector data types, pointers, images
• Mandatory built–in functions:
➢ Work–item functions, math.h, reading and writing of images
➢ Relational functions, geometric functions, synchronization functions
➢ printf (v1.2 only)
• Optional built–in functions (called “extensions”)
➢ Support for double precision, atomics to global and local memory
Qualifiers and Functions
Function qualifiers:
➢__kernel qualifier declares a function as a kernel, i.e. makes it visible to host code so that it
can be enqueued
• Address space qualifiers:
➢__global, __local, __constant, __private
➢Pointer kernel arguments must be declared with an address space qualifier (excl. __private)
• Work-item functions:
➢get_work_dim(),
➢get_global_id(),
➢get_local_id(),
➢get_group_id(), etc.
• Synchronization functions:
➢Barriers — all work-items within a work-group must execute the barrier function before
any work-item can continue: barrier(cl_mem_fence_flags flags)
➢Memory fences — provides ordering between memory operations:
mem_fence(cl_mem_fence_flags flags)
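A short OpenCL C sketch (not from the slides; TILE and the kernel name are illustrative) combining the qualifiers, work-item functions, and a barrier: each work-group stages a tile in __local memory, synchronizes, then writes the tile out reversed within the group.

#define TILE 64   // must match the local work size used at enqueue time

__kernel void reverse_in_group(__global const float *in,
                               __global float *out)
{
    __local float tile[TILE];

    int gid = get_global_id(0);
    int lid = get_local_id(0);

    tile[lid] = in[gid];                 // every work-item stages one element
    barrier(CLK_LOCAL_MEM_FENCE);        // wait until the whole tile is loaded

    out[gid] = tile[get_local_size(0) - 1 - lid];   // read an element loaded by a peer
}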
Restrictions
• Recursion is not supported
• Pointers to functions are not allowed
• Pointers to pointers allowed within a kernel, but not as an argument to a kernel invocation
• Bit–fields are not supported
• Variable length arrays are not supported
• Structures and other data types have to be defined in both the host and device code
  (naturally, in exactly the same way; use common header files)
• Double types are optional in OpenCL v1.1, but the key word is reserved (note: most
  implementations support double)
Event Handling
cl_int clWaitForEvents ( cl_uint num_events ,
const cl_event * event_list );
Wait for all events in event_list
Return value : Error code (ideally equal to CL_SUCCESS)
num_events : Number of elements in event_list
event_list : Array of events
cl_int clFlush ( cl_command_queue command_queue );
Issues all previously queued OpenCL commands in command_queue to the device associated with
command_queue
• Return value : Error code (ideally equal to CL_SUCCESS)
cl_int clFinish ( cl_command_queue command_queue );
Blocks until all previously queued OpenCL commands in command_queue are issued to the associated
device and have completed. clFinish is also a synchronization point.
Return value : Error code (ideally equal to CL_SUCCESS)
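For illustration (and as a starting point for the bonus exercise above), a sketch that uses the event returned by clEnqueueNDRangeKernel as a dependency for the read-back instead of calling clFinish; names are as in the vector-add example:

cl_event kernel_done;

err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                             &globalSize, &localSize,
                             0, NULL, &kernel_done);

// The read waits for kernel_done via its event_wait_list, then blocks until complete.
clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0, bytes, h_c,
                    1, &kernel_done, NULL);

clReleaseEvent(kernel_done);   // events are reference-counted like other OpenCL objects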
1-D NDRange
• The figure illustrates an example of 1-D
NDRange with global size = (4096, 1, 1) and
local size = (512, 1, 1). This allows the
computation to be broken down into eight
work-groups, each with 512 work-items.
• Now consider a simple vector adder kernel written with a work size of (1, 1, 1). The length
of the data is 4096, and the function iterates over the data using an explicit loop:

__kernel __attribute__ ((reqd_work_group_size(1, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
    for (int i = 0; i < 4096; i++)
        c[i] = a[i] + b[i];
}

• In OpenCL C, however, it is better to write the kernel as shown below:

__kernel __attribute__ ((reqd_work_group_size(512, 1, 1)))
void vadd(__global const int* a, __global const int* b, __global int* c) {
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}

• This produces the NDRange and work-group sizes shown above. Because this example
allows the OpenCL compiler and runtime to control the iteration over the 4096 data items,
it allows a simpler coding style and enables the compiler to make better optimization
decisions to parallelize the operations. The call to get_global_id(0) provides the current
location in the NDRange and is analogous to the index of a for loop.
2-D NDRange
• A 2-D NDRange works well with 2-D data such as matrices.
• The matrix adder kernel defines a local work size of 2x2, specified as a required
size of (2, 2, 1).
• The calls to get_global_id() provide the index in the global work size, while
get_global_size() provides the total range value (e.g., 64 for a 64x64 matrix).
• Alternatively, the kernel could also index the local work indices and sizes using
get_local_id() and get_local_size().

__kernel __attribute__ ((reqd_work_group_size(2, 2, 1)))
void madd(__global int* a, __global int* b, __global int* output) {
    int index = get_global_id(1)*get_global_size(0) + get_global_id(0);
    output[index] = a[index] + b[index];
}
3-D NDRange
• The concept of work size can also be extended to a 3-D space.
• The figure illustrates this work size as a 3-D cube of size 16x16x16. While the total
number of work-items is again 4096, the work space is now defined across three different
dimensions.
• This works well for applications that can be defined across three dimensions, such as
3D computer graphics and data mining algorithms.
• Similarly to the 1- and 2-dimensional cases, three-dimensional work-items can be
implemented to operate in a concurrent fashion on the FPGA device.
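Analogous to the 1-D and 2-D kernels above, a sketch of an adder indexed over the 16x16x16 work space (the kernel name vadd3d and the 2x2x2 required work-group size are illustrative, not from the slides):

__kernel __attribute__ ((reqd_work_group_size(2, 2, 2)))
void vadd3d(__global const int* a, __global const int* b, __global int* c)
{
    int x = get_global_id(0);
    int y = get_global_id(1);
    int z = get_global_id(2);

    int nx = get_global_size(0);
    int ny = get_global_size(1);

    int index = (z * ny + y) * nx + x;   // flatten the 3-D coordinate
    c[index] = a[index] + b[index];
}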
Nomenclature - AMD vs. NVIDIA
Nomenclature - OpenCL vs. CUDA
References
• https://docs.nvidia.com/cuda/cuda-c-programming-guide/
• https://sites.google.com/site/csc8820/opencl-basics/
• https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/pet1504034296131.html
• “From Shader Code to a Teraflop: How GPU Shader Cores Work”,
Kayvon Fatahalian, Stanford University
[Figure: GPU architecture block diagram showing the host, input assembler, thread execution manager, an array of parallel data caches with texture units, and global memory]
Process Context & Context Switching
• Process context is the current state of the process, i.e., what is in its registers.
The context of the currently running process needs to be saved, so that it can be
resumed after the interrupt is handled.
• Context Switch is the process of storing the context/state of a process or
thread, so that it can be restored and resume execution at a later point.
This allows multiple processes to share a single central processing unit
(CPU) and is an essential feature of a multitasking operating system.
• For example, in the case of x86 processors, the process context is based on the
registers: ESP, SS, EIP, CS and more. We need to save the instruction
pointer (EIP) and the CS (Code Segment) so that after the interrupt is
handled we can continue running from where we were stopped.