Parallel Programming in OpenCL: Advanced Graphics & Image Processing
Rafał Mantiuk
Computer Laboratory, University of Cambridge
Single Program Multiple Data (SPMD)
Consider the following vector addition example.

Serial program: one program completes the entire task.

for( i = 0:11 ) {
    C[ i ] = A[ i ] + B[ i ]
}

[Figure: the whole of A and B added element-wise to produce C]
Multiple copies of the same program execute on different data in parallel.

SPMD program: multiple copies of the same program run on different chunks of the data.

for( i = 0:3 )  { C[ i ] = A[ i ] + B[ i ] }
for( i = 4:7 )  { C[ i ] = A[ i ] + B[ i ] }
for( i = 8:11 ) { C[ i ] = A[ i ] + B[ i ] }

[Figure: each copy adds one chunk of A and B to produce the corresponding chunk of C]
Multi-threaded (CPU)

// tid is the thread id
// P is the number of cores, N is the number of elements
for(i = tid*N/P; i < (tid+1)*N/P; i++)
    C[i] = A[i] + B[i]

[Figure: with P = 4 threads and N = 16, T0 processes elements 0-3, T1 elements 4-7, T2 elements 8-11, T3 elements 12-15]

Massively multi-threaded (GPU)

// tid is the thread id
C[tid] = A[tid] + B[tid]

[Figure: one thread per element, T0..T15 processing elements 0..15]
[Figure: parallel programming frameworks positioned between CPU and GPU: OpenMP, OpenACC, OpenCL, CUDA and Metal; OpenCL targets both CPUs and GPUs]
OpenCL

OpenCL is a framework for writing parallel code for CPUs, GPUs, DSPs, FPGAs and other processors.

Initially developed by Apple, now supported by AMD, IBM, Qualcomm, Intel and Nvidia (reluctantly).

Versions
    Latest: OpenCL 2.2
        OpenCL C++ kernel language
        SPIR-V as intermediate representation for kernels
            Vulkan uses the same Standard Portable Intermediate Representation
        Supported by AMD, Intel
    Mostly supported: OpenCL 1.2
        Nvidia, OSX
OpenCL platforms and drivers

To run OpenCL code you need:
    Generic ICD loader
        Included in the OS
    Installable Client Driver (ICD)
        From Nvidia, Intel, etc.
    This applies to Windows and Linux; there is only one platform on Mac

To develop OpenCL code you need:
    OpenCL headers/libraries
        Included in the SDKs
            Nvidia – CUDA Toolkit
            Intel OpenCL SDK
        But lightweight options are also available
Programming OpenCL

OpenCL natively offers a C99 API
    But there is also a standard OpenCL C++ API wrapper
        Strongly recommended – it reduces the amount of code

Programming OpenCL is similar to programming shaders in OpenGL
    Host code runs on the CPU and invokes kernels
    Kernels are written in a C-like programming language
        In many respects similar to GLSL
    Kernels are passed to the API as strings and compiled at runtime
        Kernels are usually stored in text files
        Kernels can be precompiled into SPIR from OpenCL 2.1
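
For instance, a minimal vector-addition kernel can be embedded in the host program as a plain string (a hypothetical example; the kernel and argument names are illustrative):

// OpenCL C kernel source, held as a C string and
// compiled at runtime by the OpenCL driver
const char* kernel_source =
    "__kernel void vector_add(__global const int* A,  \n"
    "                         __global const int* B,  \n"
    "                         __global int* C) {      \n"
    "    int i = get_global_id(0);                    \n"
    "    C[i] = A[i] + B[i];                          \n"
    "}                                                \n";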
Example: Step 1 - Select device

Get all Platforms → Select Platform → Get all Devices → Select Device
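
A sketch of this step using the C++ wrapper API (assuming the cl.hpp header and a GPU device; error handling omitted):

#include <CL/cl.hpp>
#include <vector>

std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);           // get all platforms
cl::Platform platform = platforms[0];    // select a platform

std::vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_GPU, &devices);  // get all devices
cl::Device device = devices[0];          // select a device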
Example: Step 2 - Build program

Create Context → Load sources (usually from files) → Create Program → Build Program
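
A sketch of this step, continuing the example above (the file name vector_add.cl is hypothetical):

#include <fstream>
#include <string>
#include <iostream>

cl::Context context({ device });

// load kernel sources from a text file
std::ifstream file("vector_add.cl");
std::string src((std::istreambuf_iterator<char>(file)),
                std::istreambuf_iterator<char>());

cl::Program::Sources sources;
sources.push_back({ src.c_str(), src.length() });

cl::Program program(context, sources);
if (program.build({ device }) != CL_SUCCESS)
    std::cerr << program.getBuildInfo<CL_PROGRAM_BUILD_LOG>(device) << "\n";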
Example: Step 3 - Create buffers and copy memory

Create Buffers → Create Queue → Enqueue Memory Copy
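
A sketch of this step (the buffer size of 10 ints is illustrative):

int A[10], B[10];
// ... fill A and B with input data ...

// allocate device memory
cl::Buffer buffer_A(context, CL_MEM_READ_ONLY,  sizeof(int) * 10);
cl::Buffer buffer_B(context, CL_MEM_READ_ONLY,  sizeof(int) * 10);
cl::Buffer buffer_C(context, CL_MEM_WRITE_ONLY, sizeof(int) * 10);

cl::CommandQueue queue(context, device);

// copy host data to the device (CL_TRUE = blocking copy)
queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int) * 10, A);
queue.enqueueWriteBuffer(buffer_B, CL_TRUE, 0, sizeof(int) * 10, B);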
Example: Step 4 - Execute kernel and retrieve the results

Create Kernel → Set Kernel Arguments → Enqueue Kernel → Enqueue Memory Copy
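
A sketch of this step, assuming the vector_add kernel shown earlier:

cl::Kernel kernel(program, "vector_add");
kernel.setArg(0, buffer_A);
kernel.setArg(1, buffer_B);
kernel.setArg(2, buffer_C);

// one work-item per element; let the driver pick the work-group size
queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(10), cl::NullRange);

int C[10];
queue.enqueueReadBuffer(buffer_C, CL_TRUE, 0, sizeof(int) * 10, C);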
Execution model

Each kernel executes over a 1D, 2D or 3D array (NDRange)
The array is split into work-groups
Work-items (threads) in each work-group share some local memory
A kernel can query its position in the NDRange:
    get_global_id(dim)
    get_group_id(dim)
    get_local_id(dim)
Work-items are not bound to any memory entity (unlike GLSL shaders)
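
A small illustrative kernel that records what these queries return for each work-item (the kernel name and encoding are hypothetical):

__kernel void where_am_i(__global int* out) {
    int gid = get_global_id(0);   // index within the whole NDRange
    int grp = get_group_id(0);    // index of this work-item's work-group
    int lid = get_local_id(0);    // index within the work-group
    out[gid] = grp * 1000 + lid;  // encode both values for inspection
}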
Memory model

Host memory
    Usually CPU memory; the device does not have access to that memory
Global memory [__global]
    Device memory, for storing large data
Constant memory [__constant]
Local memory [__local]
    Fast, accessible to all work-items (threads) within a work-group
Private memory [__private]
    Accessible to a single work-item (thread)
Memory objects

[Diagram: class hierarchy – cl::Memory is the base of cl::Buffer and cl::Image; cl::Image1DBuffer is an image backed by a buffer]

Buffer
    Like an ArrayBuffer in OpenGL
    Accessed directly via C pointers
Image
    Like a Texture in OpenGL
    Accessed via texture look-up functions
    Can interpolate values, clamp, etc.
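
On the host side, the two kinds of objects might be created like this (a sketch; the dimensions and pixel format are illustrative):

size_t width = 512, height = 512;

// Buffer: raw linear memory, indexed with C pointers in the kernel
cl::Buffer buf(context, CL_MEM_READ_WRITE,
               sizeof(float) * width * height);

// Image: texture-like object with an explicit pixel format,
// read in the kernel via look-up functions such as read_imagef()
cl::Image2D img(context, CL_MEM_READ_ONLY,
                cl::ImageFormat(CL_R, CL_FLOAT), width, height);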
Programming model

Data-parallel programming
    Each NDRange element is assigned to a work-item (thread)
Task-parallel programming
    Multiple different kernels can be executed in parallel
Each kernel can use the vector types of the device (float4, etc.)
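
For example, a hypothetical kernel operating on float4 vectors performs four additions per work-item:

__kernel void vector_add4(__global const float4* A,
                          __global const float4* B,
                          __global float4* C) {
    int i = get_global_id(0);
    C[i] = A[i] + B[i];   // component-wise float4 addition
}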
Command queue

queue.enqueueWriteBuffer(buffer_A, CL_TRUE, 0, sizeof(int)*10, A);

CL_TRUE – blocking call: it returns only once the copy has completed
CL_FALSE – non-blocking call: it returns immediately and the copy is performed asynchronously
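
With a non-blocking copy, the host must not touch the source array until the queue has caught up; a sketch, reusing the buffers from the earlier steps:

// non-blocking: returns immediately, the copy runs asynchronously
queue.enqueueWriteBuffer(buffer_A, CL_FALSE, 0, sizeof(int) * 10, A);
// ... A must remain valid and unmodified here ...
queue.finish();   // block until all enqueued commands have completed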
Thread Mapping

By using different mappings, the same thread can be assigned to access different data elements. The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element).

Mapping 1:
int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);

Mapping 2:
int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);

Mapping 3 (assuming 2x2 groups):
int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size +
          get_group_id(0) * group_size +
          get_local_id(1) * get_local_size(0) +
          get_local_id(0);

Thread IDs for a 4x4 NDRange:

Mapping 1:        Mapping 2:        Mapping 3:
 0  1  2  3        0  4  8 12        0  1  4  5
 4  5  6  7        1  5  9 13        2  3  6  7
 8  9 10 11        2  6 10 14        8  9 12 13
12 13 14 15        3  7 11 15       10 11 14 15

From: OpenCL 1.2 University Kit - http://developer.amd.com/partners/university-programs/
Thread Mapping

Consider a serial matrix multiplication algorithm. Thread mapping 2: with an NxM index space, each work-item computes one element of C (a kernel sketch follows below).

Mapping for C:
 0  1  2  3
 4  5  6  7
 8  9 10 11
12 13 14 15
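
A sketch of such a kernel, assuming row-major storage, C of size NxM, A of size NxP, B of size PxM, and one work-item per element of C (names and argument order are illustrative):

__kernel void matmul(__global const float* A,
                     __global const float* B,
                     __global float* C,
                     const int M, const int P) {
    int i = get_global_id(0);       // row of C
    int j = get_global_id(1);       // column of C
    float sum = 0.0f;
    for (int k = 0; k < P; k++)
        sum += A[i * P + k] * B[k * M + j];
    C[i * M + j] = sum;
}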
The following slides are based on AMD’s OpenCL™ Optimization Case Study: Simple Reductions
http://developer.amd.com/resources/articles-whitepapers/opencl-optimization-case-study-simple-reductions/
Reduction tree for the min operation
__kernel
void reduce_min(__global float* buffer,
barrier ensures that all threads
__local float* scratch, (work units) in the local group
__const int length,
__global float* result) { reach that point before execution
int global_index = get_global_id(0);
continue
int local_index = get_local_id(0);
// Load data into local memory Each iteration of the for loop
if (global_index < length) { computes next level of the
scratch[local_index] = buffer[global_index];
} else { reduction pyramid
scratch[local_index] = INFINITY;
}
barrier(CLK_LOCAL_MEM_FENCE);
for(int offset = get_local_size(0) / 2;
offset > 0; offset >>= 1) {
if (local_index < offset) {
float other = scratch[local_index + offset];
float mine = scratch[local_index];
scratch[local_index] = (mine < other) ? mine :
other;
}
barrier(CLK_LOCAL_MEM_FENCE);
}
if (local_index == 0) {
result[get_group_id(0)] = scratch[0];
}
}
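
On the host side, the __local scratch array is sized with cl::Local when the kernel arguments are set; a sketch assuming a work-group size of 256, with buffer, result, length and global_size set up as in the earlier steps:

cl::Kernel kernel(program, "reduce_min");
kernel.setArg(0, buffer);                           // __global input data
kernel.setArg(1, cl::Local(sizeof(float) * 256));   // __local scratch, one float per work-item
kernel.setArg(2, length);
kernel.setArg(3, result);                           // one partial minimum per work-group

queue.enqueueNDRangeKernel(kernel, cl::NullRange,
                           cl::NDRange(global_size), cl::NDRange(256));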
Multistage reduction

Local memory is usually limited (e.g. 50 kB), which restricts the maximum size of the array that can be processed. Therefore, large arrays need to be processed in multiple stages: the result of each local-memory reduction is stored in an array, and then this array is reduced in the next stage.
Two-stage reduction

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {
    // ... sequential stage, then tree reduction
    // in local memory (complete sketch below)
}
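
A sketch of the complete kernel, following the AMD case study cited above (details may differ from the original):

__kernel
void reduce(__global float* buffer,
            __local float* scratch,
            const int length,
            __global float* result) {
    // Stage 1: each work-item sequentially reduces a strided chunk
    // of the input, keeping the running minimum in private memory
    int global_index = get_global_id(0);
    float accumulator = INFINITY;
    while (global_index < length) {
        float element = buffer[global_index];
        accumulator = (accumulator < element) ? accumulator : element;
        global_index += get_global_size(0);
    }

    // Stage 2: tree reduction in local memory, as in reduce_min
    int local_index = get_local_id(0);
    scratch[local_index] = accumulator;
    barrier(CLK_LOCAL_MEM_FENCE);
    for (int offset = get_local_size(0) / 2; offset > 0; offset >>= 1) {
        if (local_index < offset) {
            float other = scratch[local_index + offset];
            float mine = scratch[local_index];
            scratch[local_index] = (mine < other) ? mine : other;
        }
        barrier(CLK_LOCAL_MEM_FENCE);
    }
    if (local_index == 0) {
        result[get_group_id(0)] = scratch[0];
    }
}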
Reduction performance CPU/GPU