Introduction to OpenCL
with examples
Piero Lanucara, SCAI
1 July 2015
from http://www.karlrupp.net/
Heterogeneous High Performance
Programming framework
• http://www.hpcwire.com/hpcwire/2012-02-28/opencl_gains_ground_on_cuda.html
“As the two major programming frameworks for GPU computing, OpenCL and
CUDA have been competing for mindshare in the developer community for the
past few years. Until recently, CUDA has attracted most of the attention from
developers, especially in the high performance computing realm. But OpenCL
software has now matured to the point where HPC practitioners are taking a
second look.
Both OpenCL and CUDA provide a general-purpose model for data parallelism
as well as low-level access to hardware, but only OpenCL provides an open,
industry-standard framework. As such, it has garnered support from nearly all
processor manufacturers including AMD, Intel, and NVIDIA, as well as others
that serve the mobile and embedded computing markets. As a result,
applications developed in OpenCL are now portable across a variety of GPUs
and CPUs.”
Heterogeneous High Performance
Programming framework (2)
A modern computing platform includes:
• One or more CPUs
• One or more GPUs
• DSP processors
• Accelerators
• … other?
E.g. Samsung® Exynos 5: dual-core ARM A15 @ 1.7 GHz with Mali T604 GPU
Examples pictured: ATI™ RV770 (10 cores, 16-wide SIMD), Intel® Xeon Phi™ coprocessor (61 cores, 16-wide SIMD), NVIDIA® Tesla® C2090 (16 cores, 32-wide SIMD)
The Heterogeneous many-core challenge:
How are we to build a software ecosystem for the
Heterogeneous many core platform?
Third party names are the property of their owners.
Industry Standards for Programming
Heterogeneous Platforms
(Figure: converging industry standards)
• CPUs: multiple cores driving performance increases; multiprocessor programming, e.g. OpenMP
• GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages
• Emerging intersection: heterogeneous computing
OpenCL timeline:
• Dec08 – OpenCL 1.0 released; conformance tests released
• Jun10 – OpenCL 1.1 released
• OpenCL 1.2 specification and conformance tests released
• Jul13 – OpenCL 2.0 specification finalized and conformance tests released
OpenCL Working Group
within Khronos
• Diverse industry participation
– Processor vendors, system OEMs, middleware vendors,
application developers.
• OpenCL became an important standard upon release by virtue
of the market coverage of the companies behind it.
Building Program Objects
• The program object encapsulates:
  – A context
  – The program kernel source or binary
  – List of target devices and build options
• The C API build process to create a program object:
  – clCreateProgramWithSource()
  – clCreateProgramWithBinary()
OpenCL uses runtime compilation … because in general you don’t know the details of the target device when you ship the program.
The same kernel source is compiled at run time for each target, e.g. compiled for the GPU (GPU code) or compiled for the CPU (CPU code):

__kernel void
horizontal_reflect(read_only  image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0); // x-coord
  int y = get_global_id(1); // y-coord
  int width = get_image_width(src);
  float4 src_val = read_imagef(src, sampler,
                               (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
Example: vector addition

program = clCreateProgramWithSource(context, 1,
              (const char**) &KernelSource, NULL, &err);

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
if (err != CL_SUCCESS) {
  size_t len;
  char buffer[2048];
  clGetProgramBuildInfo(program, device_id,
      CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
  printf("%s\n", buffer);
}
• Create the buffer (d_a), assign sizeof(float)*count bytes from “h_a” to the buffer
and copy it into device memory:
cl_mem d_a = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*count, h_a, NULL);
Creating and manipulating
buffers
• Other common memory flags include:
CL_MEM_WRITE_ONLY, CL_MEM_READ_WRITE
• Write Buffers from host into global memory (as non-blocking operations)
• Read back result (as a blocking operation). We have an in-order queue which
  assures the previous commands are completed before the read can begin.
  (Both calls are sketched below.)
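A minimal sketch of these two calls (assuming a command queue named commands, device buffers d_a and d_c, and host arrays h_a and h_c of count floats; all names other than h_a are illustrative):

// Write host data into a device buffer (non-blocking: CL_FALSE)
err = clEnqueueWriteBuffer(commands, d_a, CL_FALSE, 0,
                           sizeof(float) * count, h_a, 0, NULL, NULL);

// Read the result back to the host (blocking: CL_TRUE); with an in-order
// queue this cannot start until the preceding commands have completed
err = clEnqueueReadBuffer(commands, d_c, CL_TRUE, 0,
                          sizeof(float) * count, h_c, 0, NULL, NULL);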
It’s complicated, but most of this is “boilerplate” and not as bad as it looks .
OpenCL C for
Compute Kernels
• Derived from ISO C99
– A few restrictions: no recursion, function pointers, functions in C99 standard
headers ...
– Preprocessing directives defined by C99 are supported (#include etc.)
• Built-in data types
– Scalar and vector data types, pointers
– Data-type conversion functions:
• convert_type<_sat><_roundingmode>
– Image types:
• image2d_t, image3d_t and sampler_t
OpenCL C for Compute
Kernels
• Built-in functions — mandatory
– Work-Item functions, math.h, read and write image
– Relational, geometric functions, synchronization functions
– printf (v1.2 only, so not currently for NVIDIA GPUs)
• Built-in functions — optional (called “extensions”)
– Double precision, atomics to global and local memory
– Selection of rounding mode, writes to image3d_t surface
OpenCL C Language Highlights
• Function qualifiers
– __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued
– Kernels can call other kernel-side functions
• Address space qualifiers
– __global, __local, __constant, __private
– Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
– get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
– Barriers - all work-items within a work-group must execute the barrier function
before any work-item can continue
– Memory fences - provides ordering between memory operations
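A small illustration of these pieces (a hypothetical scale kernel, not taken from the course exercises):

__kernel void scale(__global float *data,      // address space qualifiers on pointer args
                    __constant float *factor,
                    __local float *scratch)
{
    int gid = get_global_id(0);                // work-item functions
    int lid = get_local_id(0);

    scratch[lid] = data[gid];                  // stage the value in local memory
    barrier(CLK_LOCAL_MEM_FENCE);              // all work-items in the group reach this point

    data[gid] = scratch[lid] * factor[0];
}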
Host programs can
be “ugly”
• OpenCL’s goal is extreme portability,
so it exposes everything
– (i.e. it is quite verbose!).
• But most of the host code is the
same from one application to the
next – the re-use makes the
verbosity a non-issue.
• You can package common API
combinations into functions or even
C++ or Python classes to make the
reuse more convenient.
The C++ Interface
• Khronos has defined a common C++ header file containing a high level interface
to OpenCL, cl.hpp
• This interface is dramatically easier to work with
• Key features:
– Uses common defaults for the platform and command-queue, saving the
programmer from extra coding for the most common use cases
– Simplifies the basic API by bundling key parameters with the objects rather
than requiring verbose and repetitive argument lists
– Ability to “call” a kernel from the host, like a regular function
– Error checking can be performed with C++ exceptions
• Private Memory:
– A very scarce resource, only a few tens of 32-bit words per Work-Item
at most
– If you use too much it spills to global memory or reduces the number of
Work-Items that can be run at the same time, potentially harming
performance*
– Think of these like registers on the CPU
* Occupancy on a GPU
Local Memory*
• Tens of KBytes per Compute Unit
– As multiple Work-Groups will be running on each CU, this means only a
fraction of the total Local Memory size is available to each Work-Group
• Assume O(1-10) KBytes of Local Memory per Work-Group
– Your kernels are responsible for transferring data between Local and
Global/Constant memories … there are optimized library functions to help
• Use Local Memory to hold data that can be reused by all the work-items in a
work-group
• Access patterns to Local Memory affect performance in a similar way to
accessing Global Memory
– Have to think about things like coalescence & bank conflicts
Work-items within a work-group can be synchronized with barriers and memory fences. Cannot synchronize between work-groups within a kernel.
Synchronization: when multiple units of execution (e.g. work-items) are brought to a known point in their execution.
Most common example is a barrier … i.e. all units of execution “in scope” arrive at the barrier before any proceed.
Work-Item Synchronization
Ensure correct order of memory operations to local or global memory (with flushes or by queuing a memory fence).
• Within a work-group:
  void barrier()
  – Takes optional flags CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE
– A work-item that encounters a barrier() will wait until ALL work-items in its work-
group reach the barrier()
– Corollary: If a barrier() is inside a branch, then the branch must be taken by
either:
• ALL work-items in the work-group, OR
• NO work-item in the work-group
• Across work-groups
– No guarantees as to where and when a particular work-group will be executed
relative to another work-group
– Cannot exchange data, or have barrier-like synchronization between two
different work-groups! (Critical issue!)
– Only solution: finish the kernel and start another
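A sketch of the corollary above (hypothetical kernel, not course code): the first barrier is legal because the condition is uniform across the work-group; the commented-out pattern would be undefined because only some work-items would reach the barrier.

__kernel void barrier_example(__global float *x, const int flag)
{
    int lid = get_local_id(0);

    if (flag)                              // same value for every work-item: OK
        barrier(CLK_LOCAL_MEM_FENCE);

    // if (lid < 16)                       // divergent within the work-group:
    //     barrier(CLK_LOCAL_MEM_FENCE);   // undefined behaviour, never do this

    x[get_global_id(0)] = (float)lid;
}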
• Targets a broader range of CPU-like and GPU-like
devices than CUDA
– Targets devices produced by multiple vendors
– Many features of OpenCL are optional and may not
be supported on all devices
Performance????
• OpenCL codes must be prepared to deal with much
greater hardware diversity
• A single OpenCL kernel will likely not achieve peak
performance on all device types
Portable performance in OpenCL
• Portable performance is always a challenge, more so when OpenCL devices can be so varied (CPUs, GPUs, …)
• Tremendous amount of computing power available
• Structure of Arrays (SoA): x x x x … y y y y … z z z z … a a a a …
  – Adjacent work-items like to access adjacent memory
• Array of Structures (AoS): x y z a … x y z a … x y z a … x y z a …
  – May suit cache hierarchies on CPUs, where individual work-items like to access adjacent memory
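In C terms the two layouts look like this (hypothetical point structure; N is an illustrative array size):

#define N 1024

/* Array of Structures (AoS): x y z a  x y z a ...
   each work-item reads one contiguous struct - may suit CPU caches */
typedef struct { float x, y, z, a; } point_aos;
point_aos points_aos[N];

/* Structure of Arrays (SoA): x x x x ...  y y y y ...
   adjacent work-items read adjacent memory - suits GPU coalescing */
typedef struct {
    float x[N];
    float y[N];
    float z[N];
    float a[N];
} points_soa;
points_soa points;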
Advice for performance
portability
• Optimal Work-Group sizes will differ between devices
– E.g. CPUs tend to prefer 1 Work-Item per Work-Group, while GPUs prefer lots of Work-
Items per Work-Group (usually a multiple of the number of PEs per Compute Unit, i.e.
32, 64 etc.)
• From OpenCL v1.1 you can discover the preferred Work-Group size multiple for a kernel
once it’s been built for a specific device
– Important to pad the total number of Work-Items to an exact multiple of this
– Again, will be different per device
• The OpenCL run-time will have a go at choosing good EnqueueNDRangeKernel
dimensions for you
– With very variable results
• Your mileage will vary, the best strategy is to write adaptive code that makes decisions at
run-time
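The preferred work-group size multiple mentioned above can be queried roughly as follows (a sketch assuming a built kernel object named kernel, a device id named device, and a 1D problem of problem_size work-items):

size_t preferred_multiple;
clGetKernelWorkGroupInfo(kernel, device,
                         CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
                         sizeof(size_t), &preferred_multiple, NULL);

/* pad the global size up to an exact multiple of it */
size_t global = ((problem_size + preferred_multiple - 1) / preferred_multiple)
                * preferred_multiple;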
Tuning Knobs
some general issues
• Tiling size (work-group sizes, dimensionality etc.)
– For block-based algorithms (e.g. matrix multiplication)
– Different devices might run faster on different block sizes
• Data layout
– Array of Structures or Structure of Arrays (AoS vs. SoA)
– Column or Row major
• Caching and prefetching
– Use of local memory or not
– Extra loads and stores assist hardware cache?
• Work-item / work-group data mapping
– Related to data layout
– Also how you parallelize the work
• Operation-specific tuning
– Specific hardware differences
– Built-in trig / special function hardware
– Double vs. float (vs. half)
Auto tuning
• Q: How do you know what the best parameter values for your program are?
– What is the best work-group size, for example
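One common answer is to measure: time the kernel for a few candidate work-group sizes at run time and keep the fastest. A minimal sketch (assuming a queue created with CL_QUEUE_PROFILING_ENABLE, a kernel, and a global size already padded to a multiple of each candidate):

size_t candidates[] = {16, 32, 64, 128, 256};
double best_time = 1.0e30;
size_t best_local = 0;

for (int i = 0; i < 5; i++) {
    size_t local = candidates[i];
    cl_event evt;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global, &local, 0, NULL, &evt);
    if (err != CL_SUCCESS) continue;              /* size not supported: skip it */
    clWaitForEvents(1, &evt);

    cl_ulong start, end;
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &start, NULL);
    clGetEventProfilingInfo(evt, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &end, NULL);
    clReleaseEvent(evt);

    double t = (end - start) * 1.0e-9;            /* nanoseconds -> seconds */
    if (t < best_time) { best_time = t; best_local = local; }
}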
How fast? The EuroBen Benchmark
The EuroBen Benchmark Group provides benchmarks for the evaluation of the performance of
scientific and technical computing on single processor cores and on parallel computer systems, using
standard parallel tools (OpenMP, MPI, …) but also emerging standards (OpenCL, Cilk, …)
Programs are available in Fortran and C
The benchmark codes range from measuring the performance of basic operations and
mathematical functions to skeleton applications.
Cineca started a new activity in the official PRACE framework to test and validate EuroBen
benchmarks on Intel MIC architecture (V. Ruggiero-C.Cavazzoni).
MOD2F benchmark
(results figure: 16 OpenMP threads, size = 2^22)
Allocate device memory
CUDA C:
  float* d_x;
  cudaMalloc(&d_x, sizeof(float)*size);
OpenCL C:
  cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE,
                              sizeof(float)*size, NULL, NULL);
Allocating shared (CUDA) / local (OpenCL) memory at kernel launch
CUDA C:
  1. Define an array in the kernel source as extern:
       extern __shared__ int array[];
  2. When executing the kernel, specify the third parameter as the size in bytes of shared memory:
       func<<<num_blocks, num_threads_per_block, shared_mem_size>>>(args);
OpenCL C:
  1. Have the kernel accept a local array as an argument:
       __kernel void func(__local int *array) {}
  2. Specify the size by setting the kernel argument:
       clSetKernelArg(kernel, 0, sizeof(int)*num_elements, NULL);
Dividing up the work
Problem size
CUDA: Thread ↔ OpenCL: Work-item
CUDA C:
  dim3 threads_per_block(30,20);
  dim3 num_blocks(10,10);
  kernel<<<num_blocks, threads_per_block>>>();

OpenCL C:
  const size_t global[2] = {300, 200};
  const size_t local[2]  = {30, 20};
  clEnqueueNDRangeKernel(queue, kernel,
                         2, NULL, global, local,
                         0, NULL, NULL);
Enqueue a kernel (C++)
CUDA C:
  dim3 threads_per_block(30,20);
OpenCL C++:
  const cl::NDRange global(300, 200);
CUDA                       OpenCL
gridDim                    get_num_groups()
blockIdx                   get_group_id()
blockDim                   get_local_size()
threadIdx                  get_local_id()

CUDA                       OpenCL
__syncthreads()            barrier()
__threadfence_block()      mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
No equivalent              read_mem_fence()
No equivalent              write_mem_fence()
__threadfence()            Finish one kernel and start another
Translation from CUDA to OpenCL
CUDA OpenCL
GPU Device (CPU, GPU etc)
Multiprocessor Compute Unit, or CU
Scalar or CUDA core Processing Element, or PE
Global or Device Memory Global Memory
Shared Memory (per block) Local Memory (per workgroup)
Local Memory (registers) Private Memory
Thread Block Work-group
Thread Work-item
Warp No equivalent term (yet)
Grid NDRange
OpenCL live@Eurora
Eurora
• Eurora CINECA-Eurotech
prototype
• 1 rack
• Two Intel SandyBridge CPUs and
  two NVIDIA K20 cards per node, or:
• Two Intel MIC cards per node
• Hot water cooling
• Energy efficiency record
  (up to 3210 MFLOPS/W)
• 100 TFLOPS sustained
Running environment

NVIDIA Tesla K20:
• 13 Multiprocessors
• 2496 CUDA Cores
• 5 GB of global memory
• GPU clock rate 760 MHz

Intel MIC Xeon Phi:
• 236 compute units
• 8 GB of global memory
• CPU clock rate 1052 MHz
Setting up OpenCL on Eurora
• Login on the front-end, then load the Intel OpenCL environment.
It defines:
  INTEL_OPENCL_INCLUDE
  and
  INTEL_OPENCL_LIB
• Goal:
– To inspect and verify that you can run an OpenCL kernel on Eurora machines
• Procedure:
– Take the provided C vadd.c and vadd.cl source programs from VADD
directory
– Compile and link vadd.c
– Run on NVIDIA or Intel platform.
• Expected output:
– A message verifying that the vector addition completed successfully
– Some useful info about OpenCL environment (Intel and NVIDIA)
Matrix-Matrix product: HOST

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  for (int row = 0; row < Width; ++row) {        // loop on rows
    for (int col = 0; col < Width; ++col) {      // loop on columns, P = M * N
      float pval = 0.0f;
      for (int k = 0; k < Width; ++k)            // dot product of a row of M and a column of N
        pval += M[row * Width + k] * N[k * Width + col];
      P[row * Width + col] = pval;
    }
  }
}
In the CUDA version each thread computes one element of P; the global indices are:
  col = blockIdx.x * blockDim.x + threadIdx.x;   // 0 .. gridDim.x * blockDim.x - 1
  row = blockIdx.y * blockDim.y + threadIdx.y;
• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global/Constant
Memory
– Visible to all
work-groups
• Host memory
– On the CPU
• Private Memory
– Fastest & smallest: O(10) words/WI
• Local Memory
– Shared by all WI’s in a work-group
– But not shared between work-groups!
– O(1-10) Kbytes per work-group
• Global/Constant Memory
– O(1-10) Gbytes of Global memory
– O(10-100) Kbytes of Constant
memory
• Host memory
– On the CPU - GBytes
Memory management is explicit:
O(1-10) Gbytes/s bandwidth to discrete GPUs for
Host <-> Global transfers
OpenCL mapping
• In OpenCL, the index space (NDRange) is split into work-groups of work-items:
  – get_global_size(0), get_global_size(1): total number of work-items in each dimension of the index space
  – get_local_size(0), get_local_size(1): number of work-items per work-group in each dimension
  – Work-groups are indexed (0,0), (1,0), (2,0), (0,1), (1,1), (2,1), …; within work-group (0,0) the work-items in the example figure are indexed (0,0) … (4,1)
You should use the OpenCL mapping functions to recover element indices (this is a common source of bugs when writing a kernel).
Matrix multiplication: pseudo OpenCL kernel
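The kernel listing is not reproduced here; a minimal sketch of a naive version (one work-item per element of P, square row-major matrices, same M/N/P/Width naming as the host code) could look like:

__kernel void mat_mul(const int Width,
                      __global const float *M,
                      __global const float *N,
                      __global float *P)
{
    // get_global_id(0) == get_group_id(0) * get_local_size(0) + get_local_id(0)
    int col = get_global_id(0);
    int row = get_global_id(1);

    P[row * Width + col] = 0.0f;
    for (int k = 0; k < Width; k++)                       // accumulates directly in global memory
        P[row * Width + col] += M[row * Width + k] * N[k * Width + col];
}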
• Goal:
– To write your first complete OpenCL kernel “from scratch”
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Rearrange and use local scalars for intermediate C
element values (a common optimization in matrix-
Multiplication functions)
• Expected output:
– A message to standard output verifying that the matrix multiplication produced the correct result
– Report the runtime and the MFLOPS
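One possible rearrangement with a private scalar for the intermediate element of C (same naming assumptions as the sketch above):

__kernel void mat_mul(const int Width,
                      __global const float *M,
                      __global const float *N,
                      __global float *P)
{
    int col = get_global_id(0);
    int row = get_global_id(1);

    float tmp = 0.0f;                       // private (register) accumulator
    for (int k = 0; k < Width; k++)
        tmp += M[row * Width + k] * N[k * Width + col];
    P[row * Width + col] = tmp;             // single store to global memory
}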
Matrix multiplication: OpenCL kernel improved
Which is the best thread block /work-group size to select (i.e. TILE_WIDTH)?
On Fermi architectures: each SM can handle up to 1536 total threads
TILE_WIDTH = 8
8x8 = 64 threads >>> 1536/64 = 24 blocks needed to fully load a SM
… yet there is a limit of maximum 8 resident blocks per SM for cc 2.x
so we end up with just 64x8 = 512 threads per SM on a maximum of
1536 (only 33% occupancy)
TILE_WIDTH = 16
16x16 = 256 threads >>> 1536/256 = 6 blocks to fully load a SM
6x256 = 1536 threads per SM … reaching full occupancy per SM!
TILE_WIDTH = 32
32x32 = 1024 threads >>> 1536/1024 = 1.5 = 1 block fully loads SM
1024 threads per SM (only 66% occupancy)
TILE_WIDTH = 16
Matrix-Matrix product: selecting optimum thread block size
Which is the best thread block size/work-group size to select (i.e. TILE_WIDTH)?
On Kepler architectures: each SM can handle up to 2048 total threads
TILE_WIDTH = 8
8x8 = 64 threads >>> 2048/64 = 32 blocks needed to fully load a SM
… yet there is a limit of maximum 16 resident blocks per SM for cc 3.x
so we end up with just 64x16 = 1024 threads per SM on a maximum of 2048 (only
50% occupancy)
TILE_WIDTH = 16
16x16 = 256 threads >>> 2048/256 = 8 blocks to fully load a SM
8x256 = 2048 threads per SM … reaching full occupancy per SM!
TILE_WIDTH = 32
32x32 = 1024 threads >>> 2048/1024 = 2 blocks fully load a SM
2x1024 = 2048 threads per SM … reaching full occupancy per SM!
TILE_WIDTH = 16 or 32
Exercise 2: Matrix Multiplication
• Goal:
– To test different thread block size
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Test different thread block size in C source code.
Compare results and find the optimum value (on both
OpenCL platforms)
• Expected output:
– A message to standard output verifying that the matrix multiplication produced the correct result
– Report the runtime and the MFLOPS
Matrix-Matrix product: selecting optimum thread block size
Which is the best thread block size/work-group size to select (i.e. TILE_WIDTH)?
On Kepler architectures: each SM can handle up to 2048 total threads
TILE_WIDTH = 8
8x8 = 64 threads >>> 2048/64 = 32 blocks needed to fully load a SM
… yet there is a limit of maximum 16 resident blocks per SM for cc 3.x
so we end up with just 64x16 = 1024 threads per SM on a maximum of 2048 (only
50% occupancy)
TILE_WIDTH = 16
16x16 = 256 threads >>> 2048/256 = 8 blocks to fully load a SM
8x256 = 2048 threads per SM … reaching full occupancy per SM!
TILE_WIDTH = 32
32x32 = 1024 threads >>> 2048/1024 = 2 blocks fully load a SM
2x1024 = 2048 threads per SM … reaching full occupancy per SM!
TILE_WIDTH   Kernel time (sec.)   GFLOP/s (NVIDIA K20)
8            0.33                 52
16           0.20                 82
32           0.16                 104
Exercise 3: Matrix Multiplication
• Goal:
– To check inside matrix borders
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Test the check inside matrix borders kernel and the
original one. Compare results and performances (on both
OpenCL platforms)
• Expected output:
– A message to standard output verifying that the matrix multiplication produced the correct result
– Report the runtime and the MFLOPS
Matrix-Matrix product: check inside matrix borders

Border check   Size   Result                                      Performance
Yes            2048   OK                                          35
No             2047   Failed (different results from reference)   21
No             2048   OK                                          37
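A sketch of what the border check looks like (illustrative, same naming assumptions as the earlier sketches; the guard makes the kernel correct when the matrix order, e.g. 2047, is not a multiple of the work-group size, at a small cost in speed):

__kernel void mat_mul(const int Width,
                      __global const float *M,
                      __global const float *N,
                      __global float *P)
{
    int col = get_global_id(0);
    int row = get_global_id(1);

    if (row < Width && col < Width) {       // check inside matrix borders
        float tmp = 0.0f;
        for (int k = 0; k < Width; k++)
            tmp += M[row * Width + k] * N[k * Width + col];
        P[row * Width + col] = tmp;
    }
}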
Optimizing matrix multiplication
• Private Memory:
– A very scarce resource, only a few tens of 32-bit
words per Work-Item at most
– If you use too much it spills to global memory or
reduces the number of Work-Items that can be run at
the same time, potentially harming performance*
– Think of these like registers on the CPU
* Occupancy on a GPU
Exercise 1-2: using private memory
• Goal:
– Use private memory to minimize memory movement costs
and optimize performance of your matrix multiplication
program
• Procedure:
– Start with previous matrix multiplication kernel
– Modify the kernel so that each work-item copies its own
column of B into private memory
– Test the new kernel. Compare results and performance
  (on both OpenCL platforms)
• Expected output:
– A message to standard output verifying that the matrix
multiplication program is generating the correct results
– Report the runtime and the MFLOPS
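One way to read this exercise (an assumption, not the official solution): give each work-item a whole column of C, so it can keep its column of B in private memory and reuse it for every row. MAX_DIM is an illustrative compile-time bound; a large private array may spill to global memory, which is exactly the effect the exercise asks you to observe.

#define MAX_DIM 2048                        /* assumed upper bound on the matrix order */

__kernel void mmul_private(const int Width,
                           __global const float *A,
                           __global const float *B,
                           __global float *C)
{
    int col = get_global_id(0);             /* one work-item per column of C */
    float Bcol[MAX_DIM];                    /* private copy of column 'col' of B */

    for (int k = 0; k < Width; k++)
        Bcol[k] = B[k * Width + col];

    for (int row = 0; row < Width; row++) {
        float tmp = 0.0f;
        for (int k = 0; k < Width; k++)
            tmp += A[row * Width + k] * Bcol[k];
        C[row * Width + col] = tmp;
    }
}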
Optimizing matrix multiplication
• Threads belonging to the same block can cooperate together, using the shared memory to share data: if a thread needs some data which has already been retrieved by another thread in the same block, it can read it from shared memory instead of global memory.
• Blocked (tiled) algorithm, with NB x NB tiles As and Bs staged in shared memory:
  – load the tiles, thread synchronization
  – Cij = Cij + As(it,k)·Bs(k,jt), thread synchronization
  – C(i,j) = Cij
(Figure: device grid with Block (0,0) and Block (1,0), each with its own shared memory and registers)
Matrix-matrix using Shared Memory: CUDA Kernel

// Matrix multiplication kernel called by MatMul_gpu()
__global__ void MatMul_kernel (float *A, float *B, float *C, int N)
{
  // Shared memory used to store Asub and Bsub respectively
  __shared__ float Asub[NB][NB];
  __shared__ float Bsub[NB][NB];

  // Block row and column
  int ib = blockIdx.y;
  int jb = blockIdx.x;

  // Thread row and column within Csub
  int it = threadIdx.y;
  int jt = threadIdx.x;

  int a_offset, b_offset, c_offset;

  // Each thread computes one element of Csub
  // by accumulating results into Cvalue
  float Cvalue = 0;

  // Loop over all the sub-matrices of A and B that are
  // required to compute Csub; multiply each pair of
  // sub-matrices together and accumulate the results
  for (int kb = 0; kb < (N / NB); ++kb) {
    // Get the starting address of Asub and Bsub
    a_offset = get_offset (ib, kb, N);
    b_offset = get_offset (kb, jb, N);

    // Load Asub and Bsub from device memory to shared memory
    // Each thread loads one element of each sub-matrix
    Asub[it][jt] = A[a_offset + it*N + jt];
    Bsub[it][jt] = B[b_offset + it*N + jt];

    // Synchronize to make sure the sub-matrices are loaded
    // before starting the computation
    __syncthreads();

    // Multiply Asub and Bsub together
    for (int k = 0; k < NB; ++k) {
      Cvalue += Asub[it][k] * Bsub[k][jt];
    }

    // Synchronize to make sure that the preceding
    // computation is done
    __syncthreads();
  }

  // Get the starting address (c_offset) of Csub
  c_offset = get_offset (ib, jb, N);

  // Each thread block computes one sub-matrix Csub of C
  C[c_offset + it*N + jt] = Cvalue;
}
Exercise 4: Matrix Multiplication
• Goal:
– To use shared memory / local memory
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication CUDA kernel
– Modify source in order to generate an OpenCL kernel
– Compare results and performances (on both OpenCL
platforms)
• Expected output:
– A message to standard output verifying that the matrix multiplication produced the correct result
– Report the runtime and the MFLOPS
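A possible OpenCL translation of the CUDA kernel above (a sketch under the same assumptions: NB is the tile size, defined at build time, e.g. with -D NB=16; get_offset() is the same helper used in the CUDA version; square matrices of order N; 2D work-groups of NB x NB work-items):

__kernel void MatMul_kernel(__global float *A, __global float *B,
                            __global float *C, const int N)
{
    __local float Asub[NB][NB];             // local memory replaces __shared__
    __local float Bsub[NB][NB];

    int ib = get_group_id(1);               // work-group (block) row and column
    int jb = get_group_id(0);
    int it = get_local_id(1);               // work-item (thread) row and column
    int jt = get_local_id(0);

    float Cvalue = 0.0f;

    for (int kb = 0; kb < (N / NB); ++kb) {
        int a_offset = get_offset(ib, kb, N);
        int b_offset = get_offset(kb, jb, N);

        // each work-item loads one element of each tile
        Asub[it][jt] = A[a_offset + it * N + jt];
        Bsub[it][jt] = B[b_offset + it * N + jt];

        barrier(CLK_LOCAL_MEM_FENCE);       // replaces __syncthreads()

        for (int k = 0; k < NB; ++k)
            Cvalue += Asub[it][k] * Bsub[k][jt];

        barrier(CLK_LOCAL_MEM_FENCE);
    }

    int c_offset = get_offset(ib, jb, N);
    C[c_offset + it * N + jt] = Cvalue;
}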
Matrix-matrix using Shared Memory: OpenCL Kernel: results
OpenCL on Intel MIC
• Intel MIC combines many cores onto a single chip. Each core runs exactly 4 hardware threads. In particular:
  1. All cores/threads together form a single OpenCL device
  2. Each separate hardware thread is an OpenCL compute unit (CU)
• In the end, you’ll have parallelism at the work-group level (vectorization) and parallelism between work-groups (threading).
OpenCL on Intel MIC
TILE_SIZE_M x TILE_SIZE_N = number of elements of C computed by one work-item (WI)
Matrix-matrix on Intel MIC (results)