Introduction to OpenCL

with examples
Piero Lanucara, SCAI

1 July 2015
[Hardware trend plots from http://www.karlrupp.net/]
Heterogeneous High Performance Programming framework
• http://www.hpcwire.com/hpcwire/2012-02-28/opencl_gains_ground_on_cuda.html

“As the two major programming frameworks for GPU computing, OpenCL and
CUDA have been competing for mindshare in the developer community for the
past few years. Until recently, CUDA has attracted most of the attention from
developers, especially in the high performance computing realm. But OpenCL
software has now matured to the point where HPC practitioners are taking a
second look.
Both OpenCL and CUDA provide a general-purpose model for data parallelism
as well as low-level access to hardware, but only OpenCL provides an open,
industry-standard framework. As such, it has garnered support from nearly all
processor manufacturers including AMD, Intel, and NVIDIA, as well as others
that serve the mobile and embedded computing markets. As a result,
applications developed in OpenCL are now portable across a variety of GPUs
and CPUs .”
Heterogeneous High Performance Programming framework (2)
A modern computing platform includes:
• One or more CPUs
• One or more GPUs
• DSP processors
• Accelerators
• … other?

E.g. Samsung® Exynos 5: dual-core ARM A15 @ 1.7 GHz plus a Mali T604 GPU

OpenCL lets programmers write a single portable program that uses ALL resources in the heterogeneous platform.
Microprocessor trends
Individual processors have many (possibly heterogeneous) cores:
• ATI™ RV770: 10 cores, 16-wide SIMD
• NVIDIA® Tesla® C2090: 16 cores, 32-wide SIMD
• Intel® Xeon Phi™ coprocessor: 61 cores, 16-wide SIMD

The heterogeneous many-core challenge:
How are we to build a software ecosystem for the heterogeneous many-core platform?

Third party names are the property of their owners.
Industry Standards for Programming Heterogeneous Platforms
• CPUs: multiple cores driving performance increases
  – Multi-processor programming, e.g. OpenMP
• GPUs: increasingly general-purpose data-parallel computing
  – Graphics APIs and shading languages
• Emerging intersection: heterogeneous computing

OpenCL – Open Computing Language
Open, royalty-free standard for portable, parallel programming of heterogeneous parallel computing CPUs, GPUs, and other processors
OpenCL Timeline
• Launched Jun’08 … 6 months from “strawman” to OpenCL 1.0
• Rapid innovation to match pace of hardware innovation
  – 18 months from 1.0 to 1.1 and from 1.1 to 1.2
  – Goal: a new OpenCL every 18-24 months
  – Committed to backwards compatibility to protect software investments

• Dec’08: OpenCL 1.0 released; conformance tests released
• Jun’10: OpenCL 1.1 specification and conformance tests released
• Nov’11: OpenCL 1.2 specification and conformance tests released
• Jul’13: OpenCL 2.0 provisional specification released for public review; final specification and conformance tests within 6 months (depending on feedback)
OpenCL Working Group
within Khronos
• Diverse industry participation
– Processor vendors, system OEMs, middleware vendors,
application developers.
• OpenCL became an important standard upon release by virtue
of the market coverage of the companies behind it.

Third party names are the property of their owners.


OpenCL Platform Model

[Diagram: a Host connected to one or more OpenCL Devices; each Device contains Compute Units, each made up of Processing Elements]
• One Host and one or more OpenCL Devices


– Each OpenCL Device is composed of one or more
Compute Units
• Each Compute Unit is divided into one or more Processing Elements
• Memory divided into host memory and device memory
OpenCL Platform Example (one node, two CPU sockets, two GPUs)
CPUs:
• Treated as one OpenCL device
  – One CU per core
  – 1 PE per CU, or if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width
• Remember: the CPU will also have to be its own host!
GPUs:
• Each GPU is a separate OpenCL device
• One CU per Streaming Multiprocessor
• Can use the CPU and all GPU devices concurrently through OpenCL

CU = Compute Unit; PE = Processing Element
The BIG idea behind OpenCL
Replace loops with functions (a kernel) executing at each point in a problem domain
– E.g., process a 1024x1024 image with one kernel invocation per pixel, i.e. 1024x1024 = 1,048,576 kernel executions

Traditional loops:

void mul(const int n,
         const float *a,
         const float *b,
         float *c)
{
  int i;
  for (i = 0; i < n; i++)
    c[i] = a[i] * b[i];
}

Data-parallel OpenCL:

__kernel void mul(__global const float *a,
                  __global const float *b,
                  __global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
}
// many instances of the kernel, called work-items, execute in parallel
An N-dimensional domain of work-items
• Global Dimensions:
  – 1024x1024 (whole problem space)
• Local Dimensions:
  – 128x128 (work-group, executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize between work-groups within a kernel
• Choose the dimensions that are “best” for your algorithm
OpenCL N-Dimensional Range (NDRange)
• The problem we want to compute should have some dimensionality
  – For example, compute a kernel on all points in a cube
• When we execute the kernel we specify up to 3 dimensions
• We also specify the total problem size in each dimension – this is called the global size
• We associate each point in the iteration space with a work-item
OpenCL N-Dimensional Range (NDRange)
• Work-items are grouped into work-groups; work-items within a work-group can share local memory and can synchronize
• We can specify the number of work-items in a work-group – this is called the local (work-group) size
• Or the OpenCL run-time can choose the work-group size for you (usually not optimally)
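To make the global/local size choice concrete, here is a minimal host-side sketch (not from the slides) of how an NDRange is specified; `commands` and `kernel` are assumed to have been created as shown later in this lecture, and the 16x16 local size is only an illustration.

// Launch a kernel over a 1024x1024 global range.
// The local size must divide the global size and respect the device's
// CL_DEVICE_MAX_WORK_GROUP_SIZE limit.
size_t global[2] = {1024, 1024};   // whole problem space
size_t local[2]  = {16, 16};       // one work-group
cl_int err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL,
                                    global, local, 0, NULL, NULL);

// Passing NULL instead of `local` lets the run-time pick the
// work-group size for you (usually not optimally):
err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL,
                             global, NULL, 0, NULL, NULL);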
OpenCL Memory model
• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global Memory
/Constant Memory
– Visible to all
work-groups
• Host memory
  – On the CPU

Memory management is explicit:
You are responsible for moving data from host → global → local and back
Context and Command-Queues
• Context:
  – The environment within which kernels execute and in which synchronization and memory management is defined.
• The context includes:
  – One or more devices
  – Device memory
  – One or more command-queues
• All commands for a device (kernel execution, synchronization, and memory transfer operations) are submitted through a command-queue.
• Each command-queue points to a single device within a context.

[Diagram: a Context containing a Device, its Device Memory, and a Queue feeding the device]
Execution model (kernels)
• OpenCL execution model: define a problem domain and execute an instance of a kernel for each point in the domain

__kernel void times_two(
    __global float* input,
    __global float* output)
{
    int i = get_global_id(0);   // e.g. get_global_id(0) returns 10 for the 11th work-item
    output[i] = 2.0f * input[i];
}

Input:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Output: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
Building Program Objects
The program object encapsulates:
– A context
– The program kernel source or binary
– List of target devices and build options

OpenCL uses runtime compilation … because in general you don’t know the details of the target device when you ship the program.

• The C API build process to create a program object:
– clCreateProgramWithSource()
– clCreateProgramWithBinary()

The same kernel source is compiled for the GPU (GPU code) or for the CPU (CPU code):

__kernel void
horizontal_reflect(read_only  image2d_t src,
                   write_only image2d_t dst)
{
    int x = get_global_id(0); // x-coord
    int y = get_global_id(1); // y-coord
    int width = get_image_width(src);
    float4 src_val = read_imagef(src, sampler,
                                 (int2)(width-1-x, y));
    write_imagef(dst, (int2)(x, y), src_val);
}
Example: vector addition
• The “hello world” program of data-parallel programming is a program to add two vectors:

  C[i] = A[i] + B[i]  for i = 0 to N-1

• For the OpenCL solution, there are two parts
  – Kernel code
  – Host code
Vector Addition – Kernel

__kernel void vadd(__global const float *a,
                   __global const float *b,
                   __global float *c)
{
    int gid = get_global_id(0);
    c[gid] = a[gid] + b[gid];
}
Vector Addition – Host
• The host program is the code that runs on the host to:
– Setup the environment for the OpenCL program
– Create and manage kernels
• 5 simple steps in a basic host program:
1. Define the platform … platform = devices+context+queues
2. Create and Build the program (dynamic library for kernels)
3. Setup memory objects
4. Define the kernel (attach arguments to kernel functions)
5. Submit commands … transfer memory objects and execute kernels
Please refer to the reference card. This will help you get used to the reference card and how to pull information from it and express it in code.
1. Define the platform
• Grab the first available platform:
err = clGetPlatformIDs(1, &firstPlatformId,
&numPlatforms);

• Use the first CPU device the platform provides:


err = clGetDeviceIDs(firstPlatformId,
CL_DEVICE_TYPE_CPU, 1, &device_id, NULL);

• Create a simple context with a single device:


context = clCreateContext(firstPlatformId, 1,
&device_id, NULL, NULL, &err);

• Create a simple command-queue to feed our device:


commands = clCreateCommandQueue(context, device_id,
0, &err);
Command-Queues
• Commands include:
  – Kernel executions
  – Memory object management
  – Synchronization
• The only way to submit commands to a device is through a command-queue.
• Each command-queue points to a single device within a context.
• Multiple command-queues can feed a single device.
  – Used to define independent streams of commands that don’t require synchronization

[Diagram: one Context with a GPU and a CPU device, each fed by its own Queue]
Command-Queue execution details
Command queues can be configured in different ways to control how commands execute
• In-order queues:
  – Commands are enqueued and complete in the order they appear in the program (program-order)
• Out-of-order queues:
  – Commands are enqueued in program-order but can execute (and hence complete) in any order
• Execution of commands in the command-queue is guaranteed to be completed at synchronization points
2. Create and Build the program
• Define source code for the kernel-program as a string literal (great for toy programs) or read it from a file (for real applications).

• Build the program object:

program = clCreateProgramWithSource(context, 1,
              (const char**) &KernelSource, NULL, &err);

• Compile the program to create a “dynamic library” from which specific kernels can be pulled:

err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

Error messages
• Fetch and print error messages:

if (err != CL_SUCCESS) {
    size_t len;
    char buffer[2048];
    clGetProgramBuildInfo(program, device_id,
        CL_PROGRAM_BUILD_LOG, sizeof(buffer), buffer, &len);
    printf("%s\n", buffer);
}

• It is important to check all your OpenCL API error codes!

• Easier in C++ with try/catch
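In C, many programs wrap this pattern in a tiny helper. A minimal sketch (the helper name `checkError` is an assumption, not part of the OpenCL API):

#include <stdio.h>
#include <stdlib.h>
#include <CL/cl.h>

// checkError: abort with a message when an OpenCL call fails.
void checkError(cl_int err, const char *operation)
{
    if (err != CL_SUCCESS) {
        fprintf(stderr, "Error during operation '%s': %d\n", operation, err);
        exit(EXIT_FAILURE);
    }
}

/* usage:
   err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);
   checkError(err, "building program");                        */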


3. Setup Memory Objects
• For vector addition we need 3 memory objects, one each for input vectors A and B, and one for the output vector C.
• Create input vectors and assign values on the host:

float h_a[LENGTH], h_b[LENGTH], h_c[LENGTH];
for (i = 0; i < length; i++) {
    h_a[i] = rand() / (float)RAND_MAX;
    h_b[i] = rand() / (float)RAND_MAX;
}

Memory objects: a handle to a reference-counted region of global memory.

• Define OpenCL memory objects:

d_a = clCreateBuffer(context, CL_MEM_READ_ONLY,
                     sizeof(float)*count, NULL, NULL);
d_b = clCreateBuffer(context, CL_MEM_READ_ONLY,
                     sizeof(float)*count, NULL, NULL);
d_c = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
                     sizeof(float)*count, NULL, NULL);
Creating and
manipulating buffers
• Buffers are declared on the host as type: cl_mem

• Arrays in host memory hold your original host-side data:


float h_a[LENGTH], h_b[LENGTH];

• Create the buffer (d_a), assign sizeof(float)*count bytes from “h_a” to the buffer
and copy it into device memory:
cl_mem d_a = clCreateBuffer(context,
CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
sizeof(float)*count, h_a, NULL);
Creating and manipulating buffers
• Other common memory flags include:
  CL_MEM_WRITE_ONLY, CL_MEM_READ_WRITE
• These are from the point of view of the device

• Submit a command to copy the buffer back to host memory at “h_c”:
  – CL_TRUE = blocking, CL_FALSE = non-blocking

clEnqueueReadBuffer(queue, d_c, CL_TRUE, 0,
                    sizeof(float)*count, h_c,
                    0, NULL, NULL);
4. Define the kernel
• Create a kernel object from the kernel function “vadd”:

kernel = clCreateKernel(program, "vadd", &err);

• Attach arguments of the kernel function “vadd” to memory objects:

err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), &d_a);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), &d_b);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), &d_c);
err |= clSetKernelArg(kernel, 3, sizeof(unsigned int), &count);
5. Enqueue commands

• Write Buffers from host into global memory (as non-blocking operations):

err = clEnqueueWriteBuffer(commands, d_a, CL_FALSE,


0, sizeof(float)*count, h_a, 0, NULL, NULL);
err = clEnqueueWriteBuffer(commands, d_b, CL_FALSE,
0, sizeof(float)*count, h_b, 0, NULL, NULL);

• Enqueue the kernel for execution (note: in-order so OK):

err = clEnqueueNDRangeKernel(commands, kernel, 1,


NULL, &global, &local, 0, NULL, NULL);
5. Enqueue commands
• Read back the result (as a blocking operation). We have an in-order queue, which assures the previous commands are completed before the read can begin.

err = clEnqueueReadBuffer(commands, d_c, CL_TRUE, 0,
                          sizeof(float)*count, h_c, 0, NULL, NULL);
Vector Addition – Host Program

// Define platform and queues
// create the OpenCL context on a GPU device
cl_context context = clCreateContextFromType(0,
    CL_DEVICE_TYPE_GPU, NULL, NULL, NULL);

// get the list of GPU devices associated with context
clGetContextInfo(context, CL_CONTEXT_DEVICES, 0, NULL, &cb);
cl_device_id *devices = malloc(cb);
clGetContextInfo(context, CL_CONTEXT_DEVICES, cb, devices, NULL);

// create a command-queue
cmd_queue = clCreateCommandQueue(context, devices[0], 0, NULL);

// Define memory objects
memobjs[0] = clCreateBuffer(context, CL_MEM_READ_ONLY |
    CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcA, NULL);
memobjs[1] = clCreateBuffer(context, CL_MEM_READ_ONLY |
    CL_MEM_COPY_HOST_PTR, sizeof(cl_float)*n, srcB, NULL);
memobjs[2] = clCreateBuffer(context, CL_MEM_WRITE_ONLY,
    sizeof(cl_float)*n, NULL, NULL);

// Create the program
program = clCreateProgramWithSource(context, 1,
    &program_source, NULL, NULL);

// Build the program
err = clBuildProgram(program, 0, NULL, NULL, NULL, NULL);

// Create and setup the kernel
kernel = clCreateKernel(program, "vec_add", NULL);

// set the args values
err  = clSetKernelArg(kernel, 0, sizeof(cl_mem), (void *) &memobjs[0]);
err |= clSetKernelArg(kernel, 1, sizeof(cl_mem), (void *) &memobjs[1]);
err |= clSetKernelArg(kernel, 2, sizeof(cl_mem), (void *) &memobjs[2]);

// set work-item dimensions
global_work_size[0] = n;

// Execute the kernel
err = clEnqueueNDRangeKernel(cmd_queue, kernel, 1, NULL,
    global_work_size, NULL, 0, NULL, NULL);

// Read results on the host
err = clEnqueueReadBuffer(cmd_queue, memobjs[2], CL_TRUE, 0,
    n*sizeof(cl_float), dst, 0, NULL, NULL);

It’s complicated, but most of this is “boilerplate” and not as bad as it looks.
OpenCL C for
Compute Kernels
• Derived from ISO C99
– A few restrictions: no recursion, function pointers, functions in C99 standard
headers ...
– Preprocessing directives defined by C99 are supported (#include etc.)
• Built-in data types
– Scalar and vector data types, pointers
– Data-type conversion functions:
• convert_type<_sat><_roundingmode>
– Image types:
• image2d_t, image3d_t and sampler_t
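A small kernel sketch (not from the slides) showing the built-in vector types and a convert_ function with saturation and an explicit rounding mode; the kernel name and the 255.0f scale factor are illustrative assumptions.

// Scale normalized floats to the 8-bit integer range, four elements at a time.
__kernel void scale_to_int(__global const float4 *in,
                           __global int4 *out)
{
    int i = get_global_id(0);
    float4 v = in[i] * 255.0f;
    // saturated conversion, round-to-nearest-even
    out[i] = convert_int4_sat_rte(v);
}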
OpenCL C for Compute
Kernels
• Built-in functions — mandatory
– Work-Item functions, math.h, read and write image
– Relational, geometric functions, synchronization functions
– printf (v1.2 only, so not currently for NVIDIA GPUs)
• Built-in functions — optional (called “extensions”)
– Double precision, atomics to global and local memory
– Selection of rounding mode, writes to image3d_t surface
OpenCL C Language

Highlights
Function qualifiers
– __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued
– Kernels can call other kernel-side functions
• Address space qualifiers
– __global, __local, __constant, __private
– Pointer kernel arguments must be declared with an address space qualifier
• Work-item functions
– get_work_dim(), get_global_id(), get_local_id(), get_group_id()
• Synchronization functions
– Barriers - all work-items within a work-group must execute the barrier function
before any work-item can continue
– Memory fences - provides ordering between memory operations
Host programs can
be “ugly”
• OpenCL’s goal is extreme portability,
so it exposes everything
– (i.e. it is quite verbose!).
• But most of the host code is the
same from one application to the
next – the re-use makes the
verbosity a non-issue.
• You can package common API
combinations into functions or even
C++ or Python classes to make the
reuse more convenient.
The C++ Interface
• Khronos has defined a common C++ header file containing a high level interface
to OpenCL, cl.hpp
• This interface is dramatically easier to work with1
• Key features:
– Uses common defaults for the platform and command-queue, saving the
programmer from extra coding for the most common use cases
– Simplifies the basic API by bundling key parameters with the objects rather
than requiring verbose and repetitive argument lists
– Ability to “call” a kernel from the host, like a regular function
– Error checking can be performed with C++ exceptions

1 especially for C++ programmers…


OpenCL Memory model
• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global/Constant
Memory
– Visible to all
work-groups
• Host memory
– On the CPU

Memory management is explicit:


You are responsible for moving data from
host → global → local and back
OpenCL Memory model
• Private Memory
– Fastest & smallest: O(10) words/WI
• Local Memory
– Shared by all WI’s in a work-group
– But not shared between work-groups!
– O(1-10) Kbytes per work-group
• Global/Constant Memory
– O(1-10) Gbytes of Global memory
– O(10-100) Kbytes of Constant
memory
• Host memory
– On the CPU - GBytes
Memory management is explicit:
O(1-10) Gbytes/s bandwidth to discrete GPUs for
Host <-> Global transfers
Private Memory
• Managing the memory hierarchy is one of the most important things to get
right to achieve good performance

• Private Memory:
– A very scarce resource, only a few tens of 32-bit words per Work-Item
at most
– If you use too much it spills to global memory or reduces the number of
Work-Items that can be run at the same time, potentially harming
performance*
– Think of these like registers on the CPU

* Occupancy on a GPU
Local Memory*
• Tens of KBytes per Compute Unit
– As multiple Work-Groups will be running on each CU, this means only a
fraction of the total Local Memory size is available to each Work-Group
• Assume O(1-10) KBytes of Local Memory per Work-Group
– Your kernels are responsible for transferring data between Local and
Global/Constant memories … there are optimized library functions to help
• Use Local Memory to hold data that can be reused by all the work-items in a
work-group
• Access patterns to Local Memory affect performance in a similar way to
accessing Global Memory
– Have to think about things like coalescence & bank conflicts

* Typical figures for a 2013 GPU


Local Memory
• Local Memory doesn’t always help…
– CPUs don’t have special hardware for it
– This can mean excessive use of Local Memory might slow down kernels on
CPUs
– GPUs now have effective on-chip caches which can provide much of the
benefit of Local Memory but without programmer intervention
– So, your mileage may vary!
The Memory Hierarchy

                    Bandwidths                Sizes
Private memory      O(2-3) words/cycle/WI     O(10) words/WI
Local memory        O(10) words/cycle/WG      O(1-10) KBytes/WG
Global memory       O(100-200) GBytes/s       O(1-10) GBytes
Host memory         O(1-100) GBytes/s         O(1-100) GBytes

Speeds and feeds approximate for a high-end discrete GPU, circa 2011
Memory Consistency
• OpenCL uses a relaxed consistency memory model; i.e.
– The state of memory visible to a work-item is not guaranteed to be consistent
across the collection of work-items at all times.
• Within a work-item:
– Memory has load/store consistency to the work-item’s private view of
memory, i.e. it sees its own reads and writes correctly
• Within a work-group:
– Local memory is consistent between work-items at a barrier.
• Global memory is consistent within a work-group at a barrier, but not
guaranteed across different work-groups!!
– This is a common source of bugs!
• Consistency of memory shared between commands (e.g. kernel invocations) is
enforced by synchronization (barriers, events, in-order queue)
Consider an N-dimensional domain of work-items
• Global Dimensions:
  – 1024x1024 (whole problem space)
• Local Dimensions:
  – 128x128 (work-group, executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize between work-groups within a kernel

Synchronization: when multiple units of execution (e.g. work-items) are brought to a known point in their execution. The most common example is a barrier, i.e. all units of execution “in scope” arrive at the barrier before any proceed.
Work-Item Synchronization
• Within a work-group:
  void barrier()
  – Takes optional flags CLK_LOCAL_MEM_FENCE and/or CLK_GLOBAL_MEM_FENCE
  – Ensures correct ordering of memory operations to local or global memory (with flushes or queuing a memory fence)
  – A work-item that encounters a barrier() will wait until ALL work-items in its work-group reach the barrier()
  – Corollary: if a barrier() is inside a branch, then the branch must be taken by either:
    • ALL work-items in the work-group, OR
    • NO work-item in the work-group

• Across work-groups:
  – No guarantees as to where and when a particular work-group will be executed relative to another work-group
  – Cannot exchange data, or have barrier-like synchronization, between two different work-groups! (Critical issue!)
  – Only solution: finish the kernel and start another
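A minimal kernel sketch (not from the slides) showing the work-group-only nature of barrier(): every work-item stages one element into local memory, then reads a neighbour's element only after the barrier guarantees all the stores are visible. Kernel and argument names are assumptions.

__kernel void shift_within_group(__global const float *in,
                                 __global float *out,
                                 __local float *tmp)
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    int lsz = get_local_size(0);

    tmp[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // all work-items in the group reach this point

    // safe to read another work-item's element, but only within this work-group
    out[gid] = tmp[(lid + 1) % lsz];
}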
• OpenCL targets a broader range of CPU-like and GPU-like devices than CUDA
  – It targets devices produced by multiple vendors
  – Many features of OpenCL are optional and may not be supported on all devices

Performance????
• OpenCL codes must be prepared to deal with much greater hardware diversity
• A single OpenCL kernel will likely not achieve peak performance on all device types
Portable performance in OpenCL
• Portable performance is always a challenge, more so when OpenCL devices can be so varied (CPUs, GPUs, …)
• A tremendous amount of computing power is available
  [Figure: two devices, with 1170 GFLOPs peak and 1070 GFLOPs peak respectively]
• But OpenCL provides a powerful framework for writing performance-portable code
• The following slides are general advice on writing code that should work well on most OpenCL devices
Optimization issues
• Efficient access to memory
  – Memory coalescing
    • Ideally get work-item i to access data[i] and work-item j to access data[j] at the same time, etc. (see the sketch after this list)
  – Memory alignment
    • Pad arrays to keep everything aligned to multiples of 16, 32 or 64 bytes
• Number of work-items and work-group sizes
  – Ideally want at least 4 work-items per PE in a Compute Unit on GPUs
  – More is better, but with diminishing returns, and there is an upper limit
    • Each work-item consumes finite PE resources (registers etc.)
• Work-item divergence
  – What happens when work-items branch?
  – It is actually a SIMD data-parallel model
  – Both paths (if-else) may need to be executed (branch divergence); avoid this where possible (non-divergent branches are termed uniform)
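As an illustration of the coalescing point above, a sketch (not from the slides) contrasting contiguous and strided access; the kernel names and the stride parameter are assumptions.

// Coalesced: adjacent work-items touch adjacent elements.
__kernel void copy_coalesced(__global const float *in,
                             __global float *out)
{
    int i = get_global_id(0);
    out[i] = in[i];
}

// Strided: adjacent work-items touch elements far apart,
// which typically wastes memory bandwidth on GPUs.
__kernel void copy_strided(__global const float *in,
                           __global float *out,
                           const int stride, const int n)
{
    int i = get_global_id(0);
    int j = (i * stride) % n;
    out[j] = in[j];
}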
Memory layout is critical to performance
• The “Structure of Arrays vs. Array of Structures” problem:
  struct { float x, y, z, a; } Point;
• Structure of Arrays (SoA) suits memory coalescence on GPUs:
  x x x x … y y y y … z z z z … a a a a …
  (adjacent work-items like to access adjacent memory)
• Array of Structures (AoS) may suit cache hierarchies on CPUs:
  x y z a … x y z a … x y z a … x y z a …
  (individual work-items like to access adjacent memory)
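A sketch in C of the two layouts for the Point data above (the array length N is an assumption):

#define N 1024

/* Array of Structures (AoS): x y z a | x y z a | ...     */
typedef struct { float x, y, z, a; } PointAoS;
PointAoS points_aos[N];

/* Structure of Arrays (SoA): x x x ... | y y y ... | ... */
typedef struct {
    float x[N];
    float y[N];
    float z[N];
    float a[N];
} PointsSoA;
PointsSoA points_soa;

/* With SoA, work-item i reading points_soa.x[i] sits next to
   work-item i+1 reading points_soa.x[i+1]: coalesced on GPUs. */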
Advice for performance
portability
• Optimal Work-Group sizes will differ between devices
– E.g. CPUs tend to prefer 1 Work-Item per Work-Group, while GPUs prefer lots of Work-
Items per Work-Group (usually a multiple of the number of PEs per Compute Unit, i.e.
32, 64 etc.)
• From OpenCL v1.1 you can discover the preferred Work-Group size multiple for a kernel
once it’s been built for a specific device
– Important to pad the total number of Work-Items to an exact multiple of this
– Again, will be different per device
• The OpenCL run-time will have a go at choosing good EnqueueNDRangeKernel dimensions for you
  – With very variable results
• Your mileage will vary; the best strategy is to write adaptive code that makes decisions at run-time (see the sketch below)
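A host-side sketch (assuming a built `kernel`, a `device_id`, a command-queue `commands`, and a 1D problem of `n` work-items) of querying the preferred work-group size multiple and padding the global size, as suggested above:

size_t preferred = 0;
err = clGetKernelWorkGroupInfo(kernel, device_id,
          CL_KERNEL_PREFERRED_WORK_GROUP_SIZE_MULTIPLE,
          sizeof(size_t), &preferred, NULL);

size_t local  = preferred;                          // e.g. 32 or 64 on many GPUs
size_t global = ((n + local - 1) / local) * local;  // pad to an exact multiple

err = clEnqueueNDRangeKernel(commands, kernel, 1, NULL,
                             &global, &local, 0, NULL, NULL);

/* The kernel must then guard against the padded work-items:
       if (get_global_id(0) >= n) return;                     */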
Tuning Knobs
some general issues
• Tiling size (work-group sizes, dimensionality etc.)
– For block-based algorithms (e.g. matrix multiplication)
– Different devices might run faster on different block sizes
• Data layout
– Array of Structures or Structure of Arrays (AoS vs. SoA)
– Column or Row major
• Caching and prefetching
– Use of local memory or not
– Extra loads and stores assist hardware cache?
• Work-item / work-group data mapping
– Related to data layout
– Also how you parallelize the work
• Operation-specific tuning
– Specific hardware differences
– Built-in trig / special function hardware
– Double vs. float (vs. half)
Auto tuning
• Q: How do you know what the best parameter values for your program are?
– What is the best work-group size, for example

• A: Try them all! (Or a well chosen subset)

• This is where auto tuning comes in


– Run through different combinations of parameter values and optimize the
runtime (or another measure) of your program.
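A minimal auto-tuning sketch (not from the slides): time the same 1D kernel with a few candidate work-group sizes via event profiling and keep the fastest. It assumes `context`, `device_id`, `kernel` and a global size `n` that each candidate divides.

cl_int err;
cl_command_queue q = clCreateCommandQueue(context, device_id,
                                          CL_QUEUE_PROFILING_ENABLE, &err);
size_t candidates[] = {16, 32, 64, 128, 256};
size_t best_local = 0;
cl_ulong best_time = (cl_ulong)(-1);

for (int c = 0; c < 5; c++) {
    size_t local = candidates[c], global = n;
    cl_event ev;
    err = clEnqueueNDRangeKernel(q, kernel, 1, NULL, &global, &local,
                                 0, NULL, &ev);
    clWaitForEvents(1, &ev);

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
                            sizeof(cl_ulong), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
                            sizeof(cl_ulong), &t1, NULL);
    clReleaseEvent(ev);

    if (t1 - t0 < best_time) { best_time = t1 - t0; best_local = local; }
}
// best_local now holds the fastest candidate for this kernel on this device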
How fast? The Hydro benchmark
Hydro is a simplified version of RAMSES (CEA, France: an astrophysics code to study large-scale structure and galaxy formation).

Hydro main features:
• regular Cartesian mesh (no AMR)
• solves the compressible Euler equations of hydrodynamics
• finite volume method, second-order Godunov scheme
• uses a Riemann solver for the numerical flux at the interfaces
The Hydro benchmark
Hydro is about 1K lines of code and has been ported to different programming environments and architectures, including accelerators. In particular:
• an initial Fortran branch including OpenMP, MPI, hybrid MPI+OpenMP
• a C branch for CUDA, OpenCL, OpenACC, UPC
Hydro run comparison
The performance of the OpenCL code is very good (better than CUDA!). More than 16 Intel Xeon SandyBridge cores are needed to match one OpenCL K20 device.

Device/version                    Elapsed time (sec.)       Efficiency loss (with respect
                                  without initialization    to the best timing)
CUDA K20C                         52.37                     0.24
OpenCL K20C                       42.09                     0
MPI (1 process)                   780.8                     17.5
MPI+OpenMP (16 OpenMP threads)    109.7                     1.60
MPI+OpenMP MIC (240 threads)      147.5                     2.50
OpenACC (PGI)                     N.A.                      N.A.

Notes: the Intel MIC result is a preliminary run on the CINECA prototype (240 threads, vectorized code, KMP_AFFINITY=balanced). The OpenACC run fails using the PGI compiler.
Hydro OpenCL scaling
Performance is good. Scalability is limited by the domain size.

OpenCL+MPI run, varying the number of NVIDIA Tesla K20 devices; 4091x4091 domain, 100 iterations.

Number of K20 devices    Elapsed time (sec.)       Speed-up
                         without initialization
1                        42.0                      1.0
2                        23.5                      1.7
4                        12.2                      3.4
8                        8.56                      4.9
16                       5.70                      7.3
How fast? The EuroBen Benchmark
The EuroBen Benchmark Group provides benchmarks for evaluating the performance of scientific and technical computing on single processor cores and on parallel computer systems, using standard parallel tools (OpenMP, MPI, …) but also emerging standards (OpenCL, Cilk, …).
• Programs are available in Fortran and C
• The benchmark codes range from measuring the performance of basic operations and mathematical functions to skeleton applications
• CINECA started a new activity in the official PRACE framework to test and validate EuroBen benchmarks on the Intel MIC architecture (V. Ruggiero, C. Cavazzoni)
MOD2F benchmark [performance plots]
• 16 OpenMP threads, size = 2^22 – Host: Intel Xeon SandyBridge cores
• 16 MPI processes, size = 2^19 – Host: Intel Xeon SandyBridge cores
• OpenCL, kernel only, size = 2^23 – Host: Intel Xeon SandyBridge cores
• 240 OpenMP threads, size = 2^22 – Native: Intel MIC, up to 240 hw threads
• 16 MPI processes, size = 2^23 – Native: Intel MIC, up to 240 hw threads
• OpenCL, kernel only, size = 2^23 – Native: Intel MIC, up to 240 hw threads
Porting CUDA to
OpenCL
• If you have CUDA code, you’ve already done the hard
work!
– I.e. working out how to split up the problem to run
effectively on a many-core device

• Switching between CUDA and OpenCL is mainly


changing the host code syntax
– Apart from indexing and naming conventions in the
kernel code (simple to change!)
Allocating and copying memory

Allocate
  CUDA C:
    float* d_x;
    cudaMalloc(&d_x, sizeof(float)*size);
  OpenCL C:
    cl_mem d_x = clCreateBuffer(context, CL_MEM_READ_WRITE,
                                sizeof(float)*size, NULL, NULL);

Host to Device
  CUDA C:
    cudaMemcpy(d_x, h_x, sizeof(float)*size, cudaMemcpyHostToDevice);
  OpenCL C:
    clEnqueueWriteBuffer(queue, d_x, CL_TRUE, 0,
                         sizeof(float)*size, h_x, 0, NULL, NULL);

Device to Host
  CUDA C:
    cudaMemcpy(h_x, d_x, sizeof(float)*size, cudaMemcpyDeviceToHost);
  OpenCL C:
    clEnqueueReadBuffer(queue, d_x, CL_TRUE, 0,
                        sizeof(float)*size, h_x, 0, NULL, NULL);
Allocating and copying memory

Allocate
  CUDA C:
    float* d_x;
    cudaMalloc(&d_x, sizeof(float)*size);
  OpenCL C++:
    cl::Buffer d_x(begin(h_x), end(h_x), true);

Host to Device
  CUDA C:
    cudaMemcpy(d_x, h_x, sizeof(float)*size, cudaMemcpyHostToDevice);
  OpenCL C++:
    cl::copy(begin(h_x), end(h_x), d_x);

Device to Host
  CUDA C:
    cudaMemcpy(h_x, d_x, sizeof(float)*size, cudaMemcpyDeviceToHost);
  OpenCL C++:
    cl::copy(d_x, begin(h_x), end(h_x));
Declaring dynamic local/shared memory

CUDA C:
1. Define an array in the kernel source as extern:
     extern __shared__ int array[];
2. When executing the kernel, specify the third parameter as the size in bytes of shared memory:
     func<<<num_blocks, num_threads_per_block, shared_mem_size>>>(args);

OpenCL C++:
1. Have the kernel accept a local array as an argument:
     __kernel void func(__local int *array) {}
2. Define a local memory kernel argument of the right size:
     cl::LocalSpaceArg localmem = cl::Local(shared_mem_size);
3. Pass the argument to the kernel invocation:
     func(EnqueueArgs(…), localmem);
Declaring dynamic local/shared memory

CUDA C:
1. Define an array in the kernel source as extern:
     extern __shared__ int array[];
2. When executing the kernel, specify the third parameter as the size in bytes of shared memory:
     func<<<num_blocks, num_threads_per_block, shared_mem_size>>>(args);

OpenCL C:
1. Have the kernel accept a local array as an argument:
     __kernel void func(__local int *array) {}
2. Specify the size by setting the kernel argument:
     clSetKernelArg(kernel, 0, sizeof(int)*num_elements, NULL);
Dividing up the work (problem size)

  CUDA            OpenCL
  Thread          Work-item
  Thread block    Work-group

• To enqueue the kernel
  – CUDA: specify the number of thread blocks and threads per block
  – OpenCL: specify the problem size and (optionally) the number of work-items per work-group
Enqueue a kernel (C)

CUDA C:
  dim3 threads_per_block(30, 20);
  dim3 num_blocks(10, 10);
  kernel<<<num_blocks, threads_per_block>>>(…);

OpenCL C:
  const size_t global[2] = {300, 200};
  const size_t local[2]  = {30, 20};
  clEnqueueNDRangeKernel(queue, kernel,
                         2, NULL, global, local,
                         0, NULL, NULL);
Enqueue a kernel (C++)

CUDA C:
  dim3 threads_per_block(30, 20);
  dim3 num_blocks(10, 10);
  kernel<<<num_blocks, threads_per_block>>>(…);

OpenCL C++:
  const cl::NDRange global(300, 200);
  const cl::NDRange local(30, 20);
  kernel(EnqueueArgs(global, local), …);
Indexing work

  CUDA                              OpenCL
  gridDim                           get_num_groups()
  blockIdx                          get_group_id()
  blockDim                          get_local_size()
  gridDim * blockDim                get_global_size()
  threadIdx                         get_local_id()
  blockIdx * blockDim + threadIdx   get_global_id()
Differences in kernels

• Where do you find the kernel?


– OpenCL - either a string (const char *), or read from
a file
– CUDA – a function in the host code
• Denoting a kernel
– OpenCL - __kernel
– CUDA - __global__
• When are my kernels compiled?
– OpenCL – at runtime
– CUDA – with compilation of host code
Host code

• By default, CUDA initializes the GPU automatically


– If you needed anything more complicated (multi-
device etc.) you must do so manually
• OpenCL always requires explicit device initialization
– It runs not just on NVIDIA® GPUs and so you
must tell it which device(s) to use
Thread Synchronization

  CUDA                    OpenCL
  __syncthreads()         barrier()
  __threadfence_block()   mem_fence(CLK_GLOBAL_MEM_FENCE | CLK_LOCAL_MEM_FENCE)
  No equivalent           read_mem_fence()
  No equivalent           write_mem_fence()
  __threadfence()         Finish one kernel and start another
Translation from CUDA to OpenCL

CUDA OpenCL
GPU Device (CPU, GPU etc)
Multiprocessor Compute Unit, or CU
Scalar or CUDA core Processing Element, or PE
Global or Device Memory Global Memory
Shared Memory (per block) Local Memory (per workgroup)
Local Memory (registers) Private Memory
Thread Block Work-group
Thread Work-item
Warp No equivalent term (yet)
Grid NDRange
OpenCL live@Eurora

Eurora
• Eurora CINECA-Eurotech prototype
• 1 rack
• Two Intel SandyBridge and two NVIDIA K20 cards per node, or:
• Two Intel MIC cards per node
• Hot water cooling
• Energy efficiency record (up to 3210 MFLOPs/W)
• 100 TFLOPs sustained
Running environment

NVIDIA Tesla K20:
• 13 Multiprocessors
• 2496 CUDA cores
• 5 GB of global memory
• GPU clock rate 760 MHz

Intel MIC Xeon Phi:
• 236 compute units
• 8 GB of global memory
• CPU clock rate 1052 MHz
Setting up OpenCL on Eurora

• Log in to the front-end, then:

>module load profile/advanced
>module load intel_opencl/none--intel--cs-xe-2013--binary

It defines the INTEL_OPENCL_INCLUDE and INTEL_OPENCL_LIB environment variables, which can be used as follows:

>cc -I$INTEL_OPENCL_INCLUDE -L$INTEL_OPENCL_LIB -lOpenCL vadd.c -o vadd
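The listing on the next slide comes from a small query program; a hedged sketch of that kind of device enumeration (names and buffer sizes here are assumptions, not the exact code used in the course):

#include <stdio.h>
#include <CL/cl.h>

int main(void)
{
    cl_platform_id platform;
    cl_uint ndev = 0;

    clGetPlatformIDs(1, &platform, NULL);
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, 0, NULL, &ndev);

    cl_device_id devices[16];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_ALL, ndev, devices, NULL);

    for (cl_uint i = 0; i < ndev; i++) {
        char name[256];
        cl_uint cus;
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        clGetDeviceInfo(devices[i], CL_DEVICE_MAX_COMPUTE_UNITS,
                        sizeof(cus), &cus, NULL);
        printf("--%u--\nDEVICE NAME=%s\nDEVICE_MAX_COMPUTE_UNITS=%u\n",
               i, name, cus);
    }
    return 0;
}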


Running on Intel

The Intel OpenCL platform is found, with 3 devices (one CPU and two Intel MIC cards). An Intel MIC device was selected. The results are correct regardless of performance.

PROFILE=FULL_PROFILE
VERSION=OpenCL 1.2 LINUX
NAME=Intel(R) OpenCL
VENDOR=Intel(R) Corporation
EXTENSIONS=cl_khr_fp64 cl_khr_global_int32_base_atomics
  cl_khr_global_int32_extended_atomics
  cl_khr_local_int32_base_atomics
  cl_khr_local_int32_extended_atomics
  cl_khr_byte_addressable_store
--0--
DEVICE NAME=Intel(R) Xeon(R) CPU E5-2660 0 @ 2.20GHz
DEVICE VENDOR=Intel(R) Corporation
DEVICE VERSION=OpenCL 1.2 (Build 67279)
DEVICE_MAX_COMPUTE_UNITS=16
DEVICE_MAX_WORK_GROUP_SIZE=1024
DEVICE_MAX_WORK_ITEM_DIMENSIONS=3
DEVICE_MAX_WORK_ITEM_SIZES=1024 1024 1024
DEVICE_GLOBAL_MEM_SIZE=16685436928
--1--
DEVICE NAME=Intel(R) Many Integrated Core Acceleration Card
DEVICE VENDOR=Intel(R) Corporation
DEVICE VERSION=OpenCL 1.2 (Build 67279)
DEVICE_MAX_COMPUTE_UNITS=236
DEVICE_MAX_WORK_GROUP_SIZE=1024
DEVICE_MAX_WORK_ITEM_DIMENSIONS=3
DEVICE_MAX_WORK_ITEM_SIZES=1024 1024 1024
DEVICE_GLOBAL_MEM_SIZE=6053646336
--2--
DEVICE NAME=Intel(R) Many Integrated Core Acceleration Card
DEVICE VENDOR=Intel(R) Corporation
DEVICE VERSION=OpenCL 1.2 (Build 67279)
DEVICE_MAX_COMPUTE_UNITS=236
DEVICE_MAX_WORK_GROUP_SIZE=1024
DEVICE_MAX_WORK_ITEM_DIMENSIONS=3
DEVICE_MAX_WORK_ITEM_SIZES=1024 1024 1024
DEVICE_GLOBAL_MEM_SIZE=6053646336
Computed sum = 549754961920.0.
Check passed.
Exercise

• Goal:
– To inspect and verify that you can run an OpenCL kernel on Eurora machines
• Procedure:
– Take the provided C vadd.c and vadd.cl source programs from VADD
directory
– Compile and link vadd.c
– Run on NVIDIA or Intel platform.
• Expected output:
– A message verifying that the vector addition completed successfully
– Some useful info about OpenCL environment (Intel and NVIDIA)
Matrix-Matrix product: HOST    (P = M * N)

void MatrixMulOnHost(float* M, float* N, float* P, int Width)
{
  // loop on rows
  for (int row = 0; row < Width; ++row) {
    // loop on columns
    for (int col = 0; col < Width; ++col) {

      // accumulate element-wise products
      float pval = 0;
      for (int k = 0; k < Width; ++k) {
        float a = M[row * Width + k];
        float b = N[k * Width + col];
        pval += a * b;
      }

      // store final result
      P[row * Width + col] = pval;
    }
  }
}
Matrix-Matrix product: launch grid

[Diagram: a 2D grid of thread blocks, (0,0) … (2,3), covering a matrix of width MatrixWidth; gridDim.x * blockDim.x spans the matrix columns]

col = blockIdx.x * blockDim.x + threadIdx.x;
row = blockIdx.y * blockDim.y + threadIdx.y;

index = row * MatrixWidth + col;
Matrix-Matrix product: CUDA Kernel

__global__ void MMKernel(float* dM, float *dN, float *dP,
                         int width) {
  // row,col from built-in thread indices (2D block of threads)
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;

  // check if the current CUDA thread is inside matrix borders
  if (row < width && col < width) {

    // accumulate element-wise products
    // NB: pval stores the dP element computed by the thread
    float pval = 0;
    for (int k = 0; k < width; k++)
      pval += dM[row * width + k] * dN[k * width + col];

    // store final result (each thread writes one element)
    dP[row * width + col] = pval;
  }
}
OpenCL Memory model

• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global/Constant
Memory
– Visible to all
work-groups
• Host memory
– On the CPU

Memory management is explicit:


You are responsible for moving data from
host → global → local and back
OpenCL Memory model

• Private Memory
– Fastest & smallest: O(10) words/WI
• Local Memory
– Shared by all WI’s in a work-group
– But not shared between work-groups!
– O(1-10) Kbytes per work-group
• Global/Constant Memory
– O(1-10) Gbytes of Global memory
– O(10-100) Kbytes of Constant
memory
• Host memory
– On the CPU - GBytes
Memory management is explicit:
O(1-10) Gbytes/s bandwidth to discrete GPUs for
Host <-> Global transfers
OpenCL mapping
• In OpenCL:

[Diagram: the index space is divided into work-groups; get_global_size(0) and get_global_size(1) give the total number of work-items in each dimension, get_local_size(0) and get_local_size(1) give the number of work-items per work-group; work-groups are indexed (0,0), (1,0), …, and the work-items inside each group are indexed (0,0) … (4,2)]
OpenCL mapping (again)
You should use the OpenCL mapping functions to recover element indices (this is a common source of bugs when writing a kernel).
Matrix multiplication: pseudo OpenCL kernel

__kernel void mat_mul(
    const int Mdim, const int Ndim, const int Pdim,
    __global float *A, __global float *B, __global float *C)
{
    int i, j, k;
    for (i = 0; i < Ndim; i++) {
        for (j = 0; j < Mdim; j++) {
            for (k = 0; k < Pdim; k++) {
                // C(i, j) = sum(over k) A(i,k) * B(k,j)
                C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
            }
        }
    }
}

Remove the outer loops and set the work-item co-ordinates instead:
Matrix multiplication: OpenCL kernel

__kernel void mat_mul(


const int Mdim, const int Ndim, const int Pdim,
__global float *A, __global float *B, __global float *C)
{
int i, j, k;
j = get_global_id(0);
i = get_global_id(1);
// C(i, j) = sum(over k) A(i,k) * B(k,j)
for (k = 0; k < Pdim; k++) {
C[i*Ndim+j] += A[i*Ndim+k] * B[k*Pdim+j];
}
}
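A host-side sketch (not shown in the slides) for launching this kernel over a 2D NDRange, one work-item per element of C; the buffer names, the 16x16 local size and the existing `commands` queue are assumptions.

cl_int err;
err  = clSetKernelArg(kernel, 0, sizeof(int), &Mdim);
err |= clSetKernelArg(kernel, 1, sizeof(int), &Ndim);
err |= clSetKernelArg(kernel, 2, sizeof(int), &Pdim);
err |= clSetKernelArg(kernel, 3, sizeof(cl_mem), &d_A);
err |= clSetKernelArg(kernel, 4, sizeof(cl_mem), &d_B);
err |= clSetKernelArg(kernel, 5, sizeof(cl_mem), &d_C);

size_t global[2] = {(size_t)Mdim, (size_t)Ndim};  // dimension 0 -> j, dimension 1 -> i
size_t local[2]  = {16, 16};                      // a common starting work-group size
err = clEnqueueNDRangeKernel(commands, kernel, 2, NULL,
                             global, local, 0, NULL, NULL);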
Exercise 1: Matrix Multiplication

• Goal:
– To write your first complete OpenCL kernel “from scratch”
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Rearrange and use local scalars for intermediate C
element values (a common optimization in matrix-
Multiplication functions)
• Expected output:
  – A message to standard output verifying that the matrix multiplication produced the correct result
  – Report the runtime and the MFLOPS
Matrix multiplication: OpenCL kernel improved

Rearrange and use a private scalar for intermediate C element values (a common optimization in matrix multiplication functions); a sketch follows the table below.

Matrix Size   Platform      Kernel time (sec.)   GFLOP/s
2048          NVIDIA K20s   0.24                 71
2048          Intel MIC     0.47                 37
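A sketch of the rearranged kernel the slide above refers to, assuming the same argument list as before; the private accumulator `tmp` avoids the repeated read-modify-write of C in global memory.

__kernel void mat_mul(
    const int Mdim, const int Ndim, const int Pdim,
    __global float *A, __global float *B, __global float *C)
{
    int j = get_global_id(0);
    int i = get_global_id(1);
    float tmp = 0.0f;                     // private accumulator
    for (int k = 0; k < Pdim; k++)
        tmp += A[i*Ndim+k] * B[k*Pdim+j];
    C[i*Ndim+j] = tmp;                    // single store to global memory
}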


Matrix-Matrix product: selecting the optimum thread block size

Which is the best thread block / work-group size to select (i.e. TILE_WIDTH)?
On Fermi architectures each SM can handle up to 1536 total threads.

TILE_WIDTH = 8
  8x8 = 64 threads >>> 1536/64 = 24 blocks needed to fully load a SM
  … yet there is a limit of maximum 8 resident blocks per SM for cc 2.x,
  so we end up with just 64x8 = 512 threads per SM out of a maximum of 1536 (only 33% occupancy)

TILE_WIDTH = 16
  16x16 = 256 threads >>> 1536/256 = 6 blocks to fully load a SM
  6x256 = 1536 threads per SM … reaching full occupancy per SM!

TILE_WIDTH = 32
  32x32 = 1024 threads >>> 1536/1024 = 1.5 → 1 block fully loads a SM
  1024 threads per SM (only 66% occupancy)

Best choice: TILE_WIDTH = 16
Matrix-Matrix product: selecting the optimum thread block size

Which is the best thread block / work-group size to select (i.e. TILE_WIDTH)?
On Kepler architectures each SM can handle up to 2048 total threads.

TILE_WIDTH = 8
  8x8 = 64 threads >>> 2048/64 = 32 blocks needed to fully load a SM
  … yet there is a limit of maximum 16 resident blocks per SM for cc 3.x,
  so we end up with just 64x16 = 1024 threads per SM out of a maximum of 2048 (only 50% occupancy)

TILE_WIDTH = 16
  16x16 = 256 threads >>> 2048/256 = 8 blocks to fully load a SM
  8x256 = 2048 threads per SM … reaching full occupancy per SM!

TILE_WIDTH = 32
  32x32 = 1024 threads >>> 2048/1024 = 2 blocks fully load a SM
  2x1024 = 2048 threads per SM … reaching full occupancy per SM!

Best choice: TILE_WIDTH = 16 or 32
Exercise 2: Matrix Multiplication

• Goal:
– To test different thread block size
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Test different thread block size in C source code.
Compare results and find the optimum value (on both
OpenCL platforms)
• Expected output:
  – A message to standard output verifying that the matrix multiplication produced the correct result
  – Report the runtime and the MFLOPS
Matrix-Matrix product: selecting the optimum thread block size

As reasoned on the previous slide (Kepler: up to 2048 threads per SM, maximum 16 resident blocks per SM for cc 3.x), TILE_WIDTH = 16 or 32 reaches full occupancy. Measured results:

TILE_WIDTH   Kernel time (sec.)   GFLOP/s (NVIDIA K20)
8            0.33                 52
16           0.20                 82
32           0.16                 104
Exercise 3: Matrix Multiplication
• Goal:
– To check inside matrix borders
– To multiply a pair of matrices
• Procedure:
– Start with the previous matrix multiplication OpenCL
kernel
– Test the check inside matrix borders kernel and the
original one. Compare results and performances (on both
OpenCL platforms)
• Expected output:
  – A message to standard output verifying that the matrix multiplication produced the correct result
  – Report the runtime and the MFLOPS
Matrix-Matrix product: check inside matrix borders

__global__ void MMKernel(float* dM, float *dN, float *dP,
                         int width) {
  // row,col from built-in thread indices (2D block of threads)
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int row = blockIdx.y * blockDim.y + threadIdx.y;

  // check if the current CUDA thread is inside matrix borders
  if (row < width && col < width) {
    ...
  }
}

Kernel check   Matrices Size   Kernel Error                                 GFLOP/s (Intel MIC)
(Yes/No)
Yes            2047            /                                            20
Yes            2048            /                                            35
No             2047            Failed (different results from reference)    21
No             2048            /                                            37
Optimizing matrix multiplication
• There may be significant overhead to manage work-items and work-groups.
• So let's have each work-item compute a full row of C.

  C(i,j) = C(i,j) + A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C
Exercise 1-1: Matrix Multiplication
• Goal:
  – Let each work-item compute a full row of C
  – To multiply a pair of matrices
• Procedure:
  – Start with the previous matrix multiplication OpenCL kernel
  – Modify it so that each work-item computes a full row of C
  – Test the new kernel. Compare results and performance (on both OpenCL platforms)
• Expected output:
  – A message to standard output verifying that the matrix multiplication produced the correct result
  – Report the runtime and the MFLOPS
Optimizing matrix multiplication

  C(i,j) = C(i,j) + A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C

Matrix Size   Platform      Kernel time (sec.)   GFLOP/s
2048          NVIDIA K20s   1.17                 15
2048          Intel MIC     0.88                 20

This change doesn’t really help.
Optimizing matrix multiplication
• Notice that, in one row of C, each element reuses the same row of A.
• Let's copy that row of A into the private memory of the work-item that's (exclusively) using it, to avoid the overhead of loading it from global memory for each C(i,j) computation (see the sketch after this slide).

  C(i,j) = C(i,j) + A(i,:) x B(:,j)

A(i,:) goes into the private memory of each work-item
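A hedged sketch of this optimization, assuming square Ndim x Ndim matrices with Ndim <= 1024 so that the row fits in the private array:

__kernel void mat_mul_row_priv(
    const int Ndim,
    __global const float *A, __global const float *B, __global float *C)
{
    int i = get_global_id(0);        // one work-item per row of C
    float Awrk[1024];                // private copy of row i of A

    for (int k = 0; k < Ndim; k++)
        Awrk[k] = A[i*Ndim + k];

    for (int j = 0; j < Ndim; j++) {
        float tmp = 0.0f;
        for (int k = 0; k < Ndim; k++)
            tmp += Awrk[k] * B[k*Ndim + j];
        C[i*Ndim + j] = tmp;
    }
}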
Private Memory

• Managing the memory hierarchy is one of the


most important things to get right to achieve good
performance

• Private Memory:
– A very scarce resource, only a few tens of 32-bit
words per Work-Item at most
– If you use too much it spills to global memory or
reduces the number of Work-Items that can be run at
the same time, potentially harming performance*
– Think of these like registers on the CPU

* Occupancy on a GPU
Exercise 1-2: using private memory
• Goal:
  – Use private memory to minimize memory movement costs and optimize performance of your matrix multiplication program
• Procedure:
  – Start with the previous matrix multiplication kernel
  – Modify the kernel so that each work-item copies its own row of A into private memory
  – Test the new kernel. Compare results and performance (on both OpenCL platforms)
• Expected output:
  – A message to standard output verifying that the matrix multiplication program is generating the correct results
  – Report the runtime and the MFLOPS
Optimizing matrix multiplication

  C(i,j) = C(i,j) + A(i,:) x B(:,j)

A(i,:) goes into the private memory of each work-item

Matrix Size   Platform      Kernel time (sec.)   GFLOP/s
2048          NVIDIA K20s   1.17                 15
2048          Intel MIC     0.21                 80

This has started to help.
Local Memory*

• Tens of KBytes per Compute Unit


– As multiple Work-Groups will be running on each CU, this means only a
fraction of the total Local Memory size is available to each Work-Group
• Assume O(1-10) KBytes of Local Memory per Work-Group
– Your kernels are responsible for transferring data between Local and
Global/Constant memories … there are optimized library functions to help
• Use Local Memory to hold data that can be reused by all the work-items in a
work-group
• Access patterns to Local Memory affect performance in a similar way to
accessing Global Memory
– Have to think about things like coalescence & bank conflicts

* Typical figures for a 2013 GPU


Local Memory

• Local Memory doesn’t always help…


– CPUs, MICs don’t have special hardware for it
– This can mean excessive use of Local Memory might slow down kernels on
CPUs
– GPUs now have effective on-chip caches which can provide much of the
benefit of Local Memory but without programmer intervention
– So, your mileage may vary!
Using Local/Shared Memory for Thread Cooperation

Threads belonging to the same block can cooperate using the shared memory to share data:
if a thread needs some data that has already been retrieved by another thread in the same block, that data can be shared through the shared memory.

Typical shared memory usage:
1. declare a buffer residing in shared memory (this buffer is per block)
2. load data into the shared memory buffer
3. synchronize threads so as to make sure all needed data is present in the buffer
4. perform the operation on the data
5. synchronize threads so all operations have been performed
6. write back results to global memory

[Diagram: a device grid of blocks; each block has its own Shared Memory and Registers for its threads, and all blocks access Global, Constant and Texture Memory]
Matrix-matrix using Shared Memory

it = threadIdx.y, jt = threadIdx.x
ib = blockIdx.y,  jb = blockIdx.x

Cij = 0
Cycle on blocks kb = 0, N/NB:
    As(it,jt) = A(ib*NB + it, kb*NB + jt)
    Bs(it,jt) = B(kb*NB + it, jb*NB + jt)
    Thread synchronization
    Cycle on k = 1, NB:
        Cij = Cij + As(it,k) · Bs(k,jt)
    Thread synchronization
C(i,j) = Cij

[Diagram: NxN matrices A, B, C tiled into NB x NB blocks; each thread block loads one tile of A and one tile of B at a time]
Matrix-matrix using Shared Memory: CUDA Kernel

// Matrix multiplication kernel called by MatMul_gpu()
__global__ void MatMul_kernel(float *A, float *B, float *C, int N)
{
    // Shared memory used to store Asub and Bsub respectively
    __shared__ float Asub[NB][NB];
    __shared__ float Bsub[NB][NB];

    // Block row and column
    int ib = blockIdx.y;
    int jb = blockIdx.x;

    // Thread row and column within Csub
    int it = threadIdx.y;
    int jt = threadIdx.x;

    int a_offset, b_offset, c_offset;

    // Each thread computes one element of Csub
    // by accumulating results into Cvalue
    float Cvalue = 0;

    // Loop over all the sub-matrices of A and B that are required to
    // compute Csub: multiply each pair of sub-matrices together and
    // accumulate the results
    for (int kb = 0; kb < (N / NB); ++kb) {
        // Get the starting address of Asub and Bsub
        a_offset = get_offset(ib, kb, N);
        b_offset = get_offset(kb, jb, N);

        // Load Asub and Bsub from device memory to shared memory
        // Each thread loads one element of each sub-matrix
        Asub[it][jt] = A[a_offset + it*N + jt];
        Bsub[it][jt] = B[b_offset + it*N + jt];

        // Synchronize to make sure the sub-matrices are loaded
        // before starting the computation
        __syncthreads();

        // Multiply Asub and Bsub together
        for (int k = 0; k < NB; ++k) {
            Cvalue += Asub[it][k] * Bsub[k][jt];
        }
        // Synchronize to make sure that the preceding
        // computation is done
        __syncthreads();
    }

    // Get the starting address (c_offset) of Csub
    c_offset = get_offset(ib, jb, N);
    // Each thread block computes one sub-matrix Csub of C
    C[c_offset + it*N + jt] = Cvalue;
}
Exercise 4: Matrix Multiplication
• Goal:
  – To use shared/local memory
  – To multiply a pair of matrices
• Procedure:
  – Start with the previous matrix multiplication CUDA kernel
  – Modify the source in order to generate an OpenCL kernel
  – Compare results and performance (on both OpenCL platforms)
• Expected output:
  – A message to standard output verifying that the matrix multiplication produced the correct result
  – Report the runtime and the MFLOPS
Matrix-matrix using Shared Memory: OpenCL Kernel results

Matrix Size   Platform      Kernel time (sec.)   GFLOP/s
2048          NVIDIA K20s   0.10                 166
2048          Intel MIC     0.15                 115
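For reference, a hedged OpenCL translation sketch of the tiled CUDA kernel (one possible solution to Exercise 4); NB is assumed to equal the work-group size in each dimension, and N is assumed to be a multiple of NB.

#define NB 16

__kernel void MatMul_kernel(__global const float *A,
                            __global const float *B,
                            __global float *C, const int N)
{
    __local float Asub[NB][NB];
    __local float Bsub[NB][NB];

    int ib = get_group_id(1), jb = get_group_id(0);   // block row, column
    int it = get_local_id(1), jt = get_local_id(0);   // row, column within tile

    float Cvalue = 0.0f;

    for (int kb = 0; kb < N / NB; ++kb) {
        // each work-item loads one element of each tile
        Asub[it][jt] = A[(ib*NB + it)*N + kb*NB + jt];
        Bsub[it][jt] = B[(kb*NB + it)*N + jb*NB + jt];
        barrier(CLK_LOCAL_MEM_FENCE);

        for (int k = 0; k < NB; ++k)
            Cvalue += Asub[it][k] * Bsub[k][jt];
        barrier(CLK_LOCAL_MEM_FENCE);
    }

    C[(ib*NB + it)*N + jb*NB + jt] = Cvalue;
}

// Host side: launch with global = (N, N) and local = (NB, NB).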
OpenCL on Intel MIC
• Intel MIC combines many cores onto a single chip. Each core runs exactly 4 hardware threads. In particular:
  1. All cores/threads together form a single OpenCL device
  2. Separate hardware threads are the OpenCL CUs
• In the end, you'll have parallelism at the work-group level (vectorization) and parallelism between work-groups (threading).
OpenCL on Intel MIC
• To reach good performance, the number of work-groups should be no less than the CL_DEVICE_MAX_COMPUTE_UNITS parameter (more is better)
• Again, the automatic vectorization module should be fully utilized. This module:
  – packs adjacent work-items (from dimension 0 of the NDRange)
  – executes them with SIMD instructions
• Use a recommended work-group size that is a multiple of 16 (the SIMD width for float, int, … data types).
Matrix-matrix on Intel MIC (skeleton)

for i from 0 to NUM_OF_TILES_M-1
  for j from 0 to NUM_OF_TILES_N-1
    C_BLOCK = ZERO_MATRIX(TILE_SIZE_M, TILE_SIZE_N)
    for k from 0 to size-1
      for ib from 0 to TILE_SIZE_M-1
        for jb from 0 to TILE_SIZE_N-1
          C_BLOCK(jb, ib) = C_BLOCK(jb, ib) + A(k, i*TILE_SIZE_M + ib) * B(j*TILE_SIZE_N + jb, k)
        end for jb
      end for ib
    end for k
    for ib from 0 to TILE_SIZE_M-1
      for jb from 0 to TILE_SIZE_N-1
        C(j*TILE_SIZE_M + jb, i*TILE_SIZE_N + ib) = C_BLOCK(jb, ib)
      end for jb
    end for ib
  end for j
end for i

TILE_SIZE_K = size of the block for the internal computation of C_BLOCK
TILE_GROUP_M x TILE_GROUP_N = number of work-items within each work-group
TILE_SIZE_M x TILE_SIZE_N = number of elements of C computed by one work-item
Matrix-matrix on Intel MIC (results)

Using the same tiled skeleton shown on the previous slide:

Matrices Size   Kernel time (sec.)   GFLOP/s (Intel MIC)
3968            0.3                  415
Conclusions

The future of Accelerator Programming
Most of the latest supercomputers are based on accelerator platforms. This huge adoption is the result of:
• High (peak) performance
• Good energy efficiency
• Low price

Accelerators should be used everywhere and all the time. So, why aren't they?
Conclusions

The future of Accelerator Programming
There are two main difficulties with accelerators:
• They can only execute certain types of programs efficiently (high parallelism, data reuse, regular control flow and data access)
• Architectural disparity with respect to the CPU (cumbersome programming, portability is an issue)

Accelerators should be used everywhere and all the time. So, why aren't they?
Conclusions

The future of Accelerator Programming
GPUs are now more general-purpose computing devices thanks to CUDA adoption. On the other hand, the fact that CUDA is a proprietary tool, together with its complexity, triggered the creation of other programming approaches:
• OpenCL
• OpenACC
• …

Accelerators should be used everywhere and all the time. So, why aren't they?
Conclusions

The future of Accelerator Programming
• OpenCL is the non-proprietary counterpart of CUDA (it also supports AMD GPUs, CPUs, MIC, FPGAs … really portable!) but, just like CUDA, it is very low level and requires a lot of programming skill to be used.
• OpenACC is a very high-level approach. Similar to OpenMP (they should be merged in a near(?) future), but still in its infancy and currently supported by only a few compilers.
• Other approaches like C++ AMP are tied to an exotic HPC environment (Windows) and impractical for standard HPC applications.

Accelerators should be used everywhere and all the time. So, why aren't they?
Conclusions

The future of Accelerator Programming
• So, how do we (efficiently) program current and future devices?
• A possible answer could be surprisingly simple and similar to how today's multicore CPUs are used (including SIMD extensions, accelerators, …)
• Basically, there are three levels:
  – libraries
  – automated tools
  – do-it-yourself
• Programmers will employ the library approach whenever possible. In the absence of efficient libraries, tools could be used.
• For the remaining cases, the do-it-yourself approach will have to be used (OpenCL or a derivative of it should be preferred to proprietary CUDA)

Accelerators will be used everywhere and all the time. So, start to use them!
Credits

Among the others:
• Simon McIntosh-Smith for OpenCL
• The CUDA Team in CINECA (Luca Ferraro, Sergio Orlandini, Stefano Tagliaventi)
• The MontBlanc project (EU) Team
