
Programming Multi-/Many-cores Using OpenCL
(CS 3006)

Adapted from: see the Acknowledgement slide

Compiled by: Dr. Muhammad Arshad Islam and Dr. Muhammad Aleem
National University of Computer & Emerging Sciences, Islamabad Campus
Lecture Acknowledgements
Simon McIntosh-Smith, University of Bristol
http://people.cs.bris.ac.uk/~simonm/SC13/OpenCL_slides_SC13.pdf

Romain Teyssier, Pierre-François Lavallée, et al.


http://irfu.cea.fr/Phocea/Vie_des_labos/Ast/ast_sstechnique.php?id_ast=904

Optimizing OpenCL applications on Intel Xeon Phi


http://iwocl.org/wp-content/uploads/2013/06/Optimizing-OpenCL-Applications-on-Intel-Xeon-Phi-IWOCL.pdf

OpenCL home page


http://www.khronos.org/opencl/
https://www.khronos.org/assets/uploads/developers/library/2012-pan-pacific-road-show-June/OpenCL-Details-Taiwan_June-2012.pdf

AMD
https://indico.fysik.su.se/event/1621/sessions/70/attachments/600/695/OpenCL_Training.pdf
OpenCL
• Open Computing Language
• For heterogeneous parallel-computing systems
• Cross-platform
  - Implementations for:
    • ATI GPUs
    • NVIDIA GPUs
    • Intel MIC
    • x86 CPUs
    • Many others…
Industry Standards for Programming Heterogeneous Platforms

CPUs: multiple cores driving performance increases; multi-processor programming (e.g. OpenMP)
GPUs: emerging as increasingly general-purpose data-parallel computing devices; graphics APIs and shading languages
Heterogeneous Computing lies at the intersection of the two
OpenCL – Open Computing Language


Open, royalty-free standard for portable, parallel programming of heterogeneous platforms made up of CPUs, GPUs, and other processors
OpenCL is Widely Deployed and Used

http://www.khronos.org/opencl/
OpenCL architecture

https://www.researchgate.net/figure/Schematic-structure-of-OpenCL-framework_fig3_270979904
OpenCL Platform Model

[Figure: Platform model — a Host connected to one or more OpenCL Devices; each Device contains Compute Units, each made up of Processing Elements]

• One Host and one or more OpenCL Devices
  - Each OpenCL Device is composed of one or more Compute Units
  - Each Compute Unit is divided into one or more Processing Elements
• Memory divided into host memory and device memory


Credits: Neil Trevett and Cyril Zeller, NVIDIA
OpenCL Program: Host and Kernel

https://www.researchgate.net/figure/The-GPUs-main-function-named-kernel-is-invoked-from-the-CPU-host-code_fig16_256495766
OpenCL Platform Example
(One node, two CPU sockets, two GPUs)

CPUs:
• Treated as one OpenCL device
  - One CU per core
  - 1 PE per CU, or, if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width
• Remember: the CPU will also have to be its own host!

GPUs:
• Each GPU is a separate OpenCL device
• One CU per Streaming Multiprocessor
• Can use the CPU and all GPU devices concurrently through OpenCL

CU = Compute Unit; PE = Processing Element


• Structure: CPU vs. GPU

http://thebeardsage.com/cuda-streaming-multiprocessors/ Credits: Andreas Moshovos, https://scholar.google.com/citations?user=D2VLt-8AAAAJ&hl=en


OpenCL Memory model Overview

• Private Memory
  - Per work-item
• Local Memory
  - Shared within a work-group
• Global/Constant Memory
  - Visible to all work-groups (constant memory is read-only)
• Host Memory
  - On the CPU

Memory management is explicit:
You are responsible for moving data from
host → global → local and back
Traditional Vs. OpenCL Parallel Programming

https://www.khronos.org/opencl/
UNDERSTANDING THE HOST PROGRAM
Vector Addition – Host
• The host program is the code that runs on the host to:
– Setup the environment for the OpenCL program
– Create and manage kernels

• 5 simple steps in a basic host program:


1. Define the platform … platform = devices+context+queues
2. Create and Build the program (dynamic library for kernels)
3. Setup memory objects
4. Define the kernel (attach arguments to kernel function)
5. Submit commands … transfer memory objects and execute kernels
The C++ Interface
• Khronos has defined a common C++ header file containing a
high level interface to OpenCL, cl.hpp

• Key features:
– Uses common defaults for the platform and command-
queue,
– Simplifies the basic API
– Ability to “call” a kernel from the host, like a regular
function
– Error checking can be performed with C++ exceptions
C++ Interface: setting up the host program

• Enable OpenCL API exceptions (do this before including the header files):

#define __CL_ENABLE_EXCEPTIONS

• Include key header files … both standard and custom


#include <CL/cl.hpp> // Khronos C++ Wrapper API
#include <cstdio> // C style IO (e.g. printf)
#include <iostream> // C++ style IO
#include <vector> // C++ vector types
Context and Command-Queues
• Context:
  – The environment within which kernels execute and in which synchronization and memory management is defined.

• The context includes:
  – One or more devices
  – Device memory
  – One or more command-queues

• All commands for a device (kernel execution, synchronization, and memory operations) are submitted through a command-queue.

• Each command-queue points to a single device within a context.
1. Create a context and queue
• Grab a context using a device type:
cl::Context context(CL_DEVICE_TYPE_DEFAULT);

Or…CL_DEVICE_TYPE_CPU,
CL_DEVICE_TYPE_GPU,
CL_DEVICE_TYPE_ACCELERATOR, etc.

• Create a command queue for the device in the context:
cl::CommandQueue queue(context);
Commands and Command-Queues
• Commands include:
  – Kernel executions
  – Memory object management
  – Synchronization

• The only way to submit commands to a device is through a command-queue.

• Each command-queue points to a single device within a context.
Command-Queue execution details
• Command queues can be configured in different ways to control how commands execute

• In-order queues:
  – Commands are enqueued and complete in the order they appear in the host program (program-order)

• Out-of-order queues:
  – Commands are enqueued in program-order but can execute (and hence complete) in any order.
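As a sketch (assuming a context and a device object already exist), the queue type is chosen when the command-queue is created; the default is in-order:

cl::CommandQueue in_order(context);                 // in-order queue (the default)
cl::CommandQueue out_of_order(context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);        // commands may complete in any order,
                                                    // so use events or queue.finish() to order them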
2. Create and Build the program
• Define source code for the kernel-program either as a
string literal (for small programs) or read it from a file (for
large applications).

• Create the program object and compile it:

cl::Program program(context, KernelSource, true);

The third argument, true, tells OpenCL to build (compile/link) the program object.

KernelSource is a string … either statically set in the host program or returned from a function that loads the kernel code from a file.
Building Program Objects

• The program object encapsulates:
  1. A context
  2. The program source or binary, and
  3. A list of target devices and build options

• OpenCL uses runtime compilation … because in general you don’t know the details of the target device when you ship the program

• The build process to create a program object:
cl::Program program(context, KernelSource);

• Example kernel source (compiled at runtime for the GPU and/or the CPU):
kernel void
horizontal_reflect(read_only  image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0);   // x-coord
  int y = get_global_id(1);   // y-coord
  int width = get_image_width(src);
  // sampler is assumed to be declared elsewhere in the program
  float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
3. Setup Memory Objects
• For vector addition we need 3 memory objects: one each for
input vectors A and B, and one for the output vector C

• Create input vectors and assign values on the host:


std::vector<float> h_a(LENGTH), h_b(LENGTH), h_c(LENGTH);
for (int i = 0; i < LENGTH; i++) {
    h_a[i] = rand() / (float)RAND_MAX;
    h_b[i] = rand() / (float)RAND_MAX;
}

• Define OpenCL device buffers and copy from host buffers:


cl::Buffer d_a(context, h_a.begin(), h_a.end(), true);
cl::Buffer d_b(context, h_b.begin(), h_b.end(), true);
cl::Buffer d_c(context, CL_MEM_WRITE_ONLY, sizeof(float)*LENGTH);
    // or CL_MEM_READ_ONLY or CL_MEM_READ_WRITE
What do we put in device memory?
• Memory Objects:

• There are two kinds of memory object


– Buffer object (Always linear):
• Defines a linear collection of bytes.

– Image object:
• Defines a two- or three-dimensional region of
memory.
• Image data can only be accessed with read and write
functions
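A minimal sketch (assuming an existing context; the 512x512 RGBA float format is just an example) of creating an image object with the C++ wrapper:

cl::ImageFormat fmt(CL_RGBA, CL_FLOAT);                     // 4 float channels per pixel
cl::Image2D img(context, CL_MEM_READ_ONLY, fmt, 512, 512);  // 2D region of memory
// inside a kernel, image data is accessed only via read_imagef()/write_imagef()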
Creating and manipulating buffers
• Buffers are declared on the host as object type:
cl::Buffer
• Arrays in host memory hold your original host-side data:
std::vector<float> h_a, h_b;

• Create the device-side buffer (d_a), assign read-only


memory to hold the host array (h_a) and copy it into device
memory:
cl::Buffer d_a(context, h_a.begin(), h_a.end(), true);

The second and third arguments are the start and end iterators of the container holding the host-side data; the final argument stipulates that this is a read-only buffer.

Alternatively, use the C API function clCreateBuffer() to create a device-side buffer.


Creating and manipulating buffers
• The last argument sets the read/write access to the Buffer
by the device. true means “read only” while false (the
default) means “read/write”.

• Submit command to copy the device buffer back to host


memory in array “h_c”:
cl::copy(queue, d_c, h_c.begin(), h_c.end());

• Can also copy host memory to device buffers:


cl::copy(queue, h_c.begin(), h_c.end(), d_c);
4. Define the kernel
• Create a kernel functor for the kernels you want to be able to call in the program:

cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer>
    vadd(program, "vadd");

The template arguments must match the pattern of arguments to the kernel. The first constructor argument is a previously created program object (serving as a dynamic library of kernels); the second is the name of the function used for the kernel.

• This means you can ‘call’ the kernel as a ‘function’ in your host code to enqueue the kernel.
5. Enqueue commands
• For kernel launches, specify global and local dimensions
– cl::NDRange global(1024)
– cl::NDRange local(64)
If you don’t specify a local dimension, it is assumed as
cl::NullRange, and the runtime picks a size for you

• Enqueue the kernel for execution (note: returns immediately …


i.e. this is a non-blocking command):
vadd(cl::EnqueueArgs(queue, global), d_a, d_b, d_c);

• Read back result (as a blocking operation). We use an in-order


queue to assure the previous commands are completed before the
read can begin

cl::copy(queue, d_c, h_c.begin(), h_c.end());


OpenCL Vector Addition Example
#define __CL_ENABLE_EXCEPTIONS   // enable cl::Error exceptions (needed for the catch block below)
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <CL/cl.hpp>

int main() {
std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
std::vector<float> b = {4.0f, 3.0f, 2.0f, 1.0f};
std::vector<float> c(a.size());

try {
// get available OpenCL platforms
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// choose a platform
cl::Platform platform = platforms[0];

// get available OpenCL devices


std::vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

// choose a device
cl::Device device = devices[0];

// create an OpenCL context for the device


cl::Context context({device});

// create an OpenCL program from source
// (the cl::Program constructor takes source text, not a file name,
//  so read the file "kernel.cl" into a string first)
std::ifstream src_file("kernel.cl");
std::stringstream src_stream;
src_stream << src_file.rdbuf();
cl::Program program(context, src_stream.str());
OpenCL Vector Addition Example
// build the program for the device
program.build({device});

// create OpenCL buffers for the input and output vectors


cl::Buffer bufA(context, CL_MEM_READ_ONLY, a.size() * sizeof(float));
cl::Buffer bufB(context, CL_MEM_READ_ONLY, b.size() * sizeof(float));
cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, c.size() * sizeof(float));

// create a command queue for the device


cl::CommandQueue queue(context, device);

// enqueue data to be transferred to the device


queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, a.size() * sizeof(float), a.data());
queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, b.size() * sizeof(float), b.data());

// create a kernel object for the vector addition kernel


cl::Kernel kernel(program, "vecadd");

// set the arguments of the kernel
kernel.setArg(0, bufA);
kernel.setArg(1, bufB);
kernel.setArg(2, bufC);

// enqueue the kernel for execution
// (arguments: kernel, global offset, global NDRange, local NDRange)
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(a.size()), cl::NullRange);

// enqueue data to be transferred back to the host


queue.enqueueReadBuffer(bufC, CL_TRUE, 0, c.size() * sizeof(float), c.data());
OpenCL Vector Addition Example
// print the result
for (float x : c) {
std::cout << x << " ";
}

std::cout << std::endl;

} catch (cl::Error& e) {
std::cerr << "OpenCL error: " << e.what() << " (" << e.err() << ")" <<
std::endl;
return 1;
}

return 0;
}
An N-dimensional domain of work-items
• Global Dimensions:
– 1024x1024 (whole problem space)
– For example, if we have a 2D problem with dimensions (1024, 1024), then the
NDRange would be defined as (1024, 1024).

• Local Dimensions:
  – 128x128 (work-group, executes together)

• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize among work-groups

• Choose the dimensions that are “best” for your algorithm:
  • 1 (one dimension)
  • 2 (two dimensions)
  • 3 (three dimensions)
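A host-side sketch of launching over a 2D index space (assuming a queue and kernel object already exist; the 16x16 work-group size here is an illustrative assumption, not from the slides):

cl::NDRange global(1024, 1024);   // whole problem space
cl::NDRange local(16, 16);        // one work-group of 16x16 work-items
queue.enqueueNDRangeKernel(kernel, cl::NullRange,   // no global offset
                           global, local);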
Summary
INTRODUCTION TO OPENCL
KERNEL PROGRAMMING
OpenCL kernel
• Derived from ISO C99
– A few restrictions: no recursion, no function pointers, and no functions from the C99 standard headers …
– Preprocessing directives defined by C99 are
supported (#include etc.)

• Built-in data types


– Scalar and vector data types, pointers
– Data-type conversion functions:
• convert_type<_sat><_roundingmode>
– Image types: image2d_t, image3d_t and sampler_t
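A small kernel sketch (not from the slides) using a built-in vector type and a conversion function of the convert_type<_sat><_roundingmode> form:

kernel void to_bytes(global const float4 *in, global uchar4 *out)
{
    int i = get_global_id(0);
    float4 v = in[i] * 255.0f;           // float4 is a built-in vector type
    out[i] = convert_uchar4_sat_rte(v);  // saturating convert, round-to-nearest-even
}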
OpenCL C Language Highlights
• Function qualifiers
– __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued

• Address space qualifiers


– __global,__local,__constant,__private
– Pointer kernel arguments must be declared with an address space
qualifier

• Work-item functions
– uint get_work_dim() … number of dimensions in use (1,2, or 3)
– size_t get_global_id(uint n) … global work-item ID in dim “n”
– size_t get_local_id(uint n) … work-item ID in dim “n” inside work-
group
– size_t get_group_id(uint n) … ID of work-group in dim “n”
– size_t get_global_size(uint n) … num of work-items in dim “n”
– size_t get_local_size(uint n) … num of work-items in work group in
dim “n”
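As a sketch (not from the slides), a 2D kernel that combines several of these work-item functions to compute a row-major index:

kernel void fill_indices(global int *out)
{
    size_t gx  = get_global_id(0);    // position in dim 0
    size_t gy  = get_global_id(1);    // position in dim 1
    size_t gsx = get_global_size(0);  // extent of dim 0
    // each work-item records which work-group it belongs to
    out[gy * gsx + gx] = (int)(get_group_id(1) * get_num_groups(0) + get_group_id(0));
}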
The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in
a problem domain
– E.g., process a 1024x1024 image with one kernel invocation per pixel or
1024x1024=1,048,576 kernel executions

Traditional loops:

void mul(const int n,
         const float *a,
         const float *b,
         float *c)
{
  int i;
  for (i = 0; i < n; i++)
    c[i] = a[i] * b[i];
}

OpenCL:

kernel void mul(global const float *a,
                global const float *b,
                global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
  // execute over n work-items
}
Execution model (kernels)

• OpenCL execution model … define a problem domain and


execute an instance of a kernel for each point in the
domain
kernel void times_two(
    global float* input,
    global float* output)
{
  int i = get_global_id(0);
  output[i] = 2.0f * input[i];
}

Example (get_global_id(0) ranges over the input):
Input:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Output: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
Vector Addition - Kernel

kernel void vadd(global const float *a,
                 global const float *b,
                 global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
OpenCL C Language Highlights

• Synchronization functions
– Barriers - all work-items within a work-group must execute the
barrier function before any work-item can continue

– Memory fences - provide ordering between memory operations
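A sketch (not from the slides) of a kernel where the barrier guarantees every work-item in the group has written local memory before any work-item reads it:

kernel void reverse_within_group(global float *data, local float *scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = data[gid];             // stage the group's elements in local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // wait for the whole work-group

    data[gid] = scratch[lsz - 1 - lid];   // reverse the elements within the group
}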


OpenCL C Language Restrictions
• Pointers to functions are not allowed

• Pointers to pointers allowed within a kernel, but not


as an argument to a kernel invocation

• Variable length arrays and structures are not


supported

• Recursion is not supported (yet!)


Matrix Addition
Parallel Software – SPMD

Single-threaded (CPU):
// there are N elements
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i]
(one thread, T0, walks through elements 0 … 15 in sequence)

Multi-threaded (CPU):
// tid is the thread id, P is the number of cores
for (i = tid*N/P; i < (tid+1)*N/P; i++)
    C[i] = A[i] + B[i]
(each of the P threads handles a contiguous chunk, e.g. T0: 0–3, T1: 4–7, T2: 8–11, T3: 12–15)

Massively Multi-threaded (GPU):
// tid is the thread id
C[tid] = A[tid] + B[tid]
(one thread per element: T0 → 0, T1 → 1, …, T15 → 15)

From: OpenCL 1.2 University Kit - http://developer.amd.com/partners/university-programs/
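Tying this back to the “Matrix Addition” heading above, a sketch (not from the slides) of the same element-per-work-item idea written as an OpenCL kernel over a 2D NDRange of N x M:

kernel void mat_add(const int M,              // number of columns
                    global const float *A,
                    global const float *B,
                    global float *C)
{
    int i = get_global_id(0);                 // row index
    int j = get_global_id(1);                 // column index
    C[i*M + j] = A[i*M + j] + B[i*M + j];     // one work-item per matrix element
}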


NDRange
An NDRange is defined by two parameters:
• The global size in each dimension: G_x, G_y, G_z
• The local size in each dimension: S_x, S_y, S_z

Examples of global dimensions and the resulting number of work-items:
  1024               → 1024 work-items
  1920 * 1080        → ~2M work-items
  256 * 256 * 256    → ~16M work-items
Thread Mapping
By using different mappings, the same thread can be assigned to access different data elements.
The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element).

Mapping 1 (row-major):
int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);
Thread IDs:
   0  1  2  3
   4  5  6  7
   8  9 10 11
  12 13 14 15

Mapping 2 (column-major):
int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
Thread IDs:
   0  4  8 12
   1  5  9 13
   2  6 10 14
   3  7 11 15

Mapping 3 (by work-group, *assuming 2x2 groups):
int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size
        + get_group_id(0) * group_size
        + get_local_id(1) * get_local_size(0)
        + get_local_id(0);
Thread IDs:
   0  1  4  5
   2  3  6  7
   8  9 12 13
  10 11 14 15
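As a sketch of how one of these mappings is used in practice (mapping 1, row-major; not from the slides):

kernel void scale(global float *data, const float factor)
{
    // flatten the 2D global IDs into the tid used to pick a data element
    int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    data[tid] = data[tid] * factor;
}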
Thread Mapping for Nvidia

Credits: Rafał Mantiuk, Computer Laboratory, University of Cambridge


Performance

– Accesses to both of these matrices will be coalesced


• Degree of coalescence depends on the workgroup and data sizes
Credits: Rafał Mantiuk, Computer Laboratory, University of Cambridge
Matrix multiplication: sequential code
We calculate C=AB, where dimA = (N x P), dimB = (P x M), dimC = (N x M)

void mat_mul(int Mdim, int Ndim, int Pdim,
             float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < Ndim; i++) {
    for (j = 0; j < Mdim; j++) {
      C[i*Mdim+j] = 0.0f;
      for (k = 0; k < Pdim; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*Mdim+j] += A[i*Pdim+k] * B[k*Mdim+j];
      }
    }
  }
}

C(i,j) = A(i,:) x B(:,j)
Dot product of a row of A and a column of B for each element of C


Matrix multiplication: sequential code
We calculate C=AB, where all three matrices are NxN
(Let’s make it easier and specialize to square matrices)

void mat_mul(int N, float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}

C(i,j) = A(i,:) x B(:,j)
Dot product of a row of A and a column of B for each element of C


Matrix multiplication performance

• Serial C code on CPU (single core).


Case                        MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)   887.2          N/A

Device is an Intel® Xeon® CPU E5649 @ 2.53GHz, using the gcc compiler.

These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Matrix multiplication: sequential code
void mat_mul(int N, float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}
Matrix multiplication: OpenCL kernel (1/2)

kernel void mat_mul(const int N, global float *A,
                    global float *B, global float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}

Mark as a kernel function and specify memory qualifiers.
Matrix multiplication: OpenCL kernel (2/2)

kernel void mat_mul(const int N, global float *A,
                    global float *B, global float *C)
{
  int i, j, k;
  i = get_global_id(0);
  j = get_global_id(1);
  C[i*N+j] = 0.0f;
  for (k = 0; k < N; k++) {
    // C(i, j) = sum(over k) A(i,k) * B(k,j)
    C[i*N+j] += A[i*N+k] * B[k*N+j];
  }
}

Replace the outer loops with the work-item’s global IDs.
Matrix multiplication host program
#define DEVICE CL_DEVICE_TYPE_DEFAULT

// declarations (not shown)
sz = N * N;
std::vector<float> h_A(sz);
std::vector<float> h_B(sz);
std::vector<float> h_C(sz);

cl::Buffer d_A, d_B, d_C;

// initialize matrices and setup
// the problem (not shown)

cl::Context context(DEVICE);
cl::Program program(context,
    util::loadProgram("matmul1.cl"),
    true);

cl::CommandQueue queue(context);

cl::make_kernel
    <int, cl::Buffer, cl::Buffer, cl::Buffer>
    mmul(program, "mmul");

d_A = cl::Buffer(context, h_A.begin(), h_A.end(), true);
d_B = cl::Buffer(context, h_B.begin(), h_B.end(), true);
d_C = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * sz);

mmul(cl::EnqueueArgs(queue, cl::NDRange(N, N)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)
Matrix multiplication performance
• Matrices are stored in global memory.

Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.

These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
UNDERSTANDING THE OPENCL
MEMORY HIERARCHY
OpenCL Memory model
• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global/Constant
Memory
– Visible to all
work-groups
• Host memory
– On the CPU
Memory management is explicit:
You are responsible for moving data from
host → global → local and back
The Memory Hierarchy
Memory           Bandwidth                 Size
Private memory   O(2-3) words/cycle/WI     O(10) words/WI
Local memory     O(10) words/cycle/WG      O(1-10) KBytes/WG
Global memory    O(100-200) GBytes/s       O(1-10) GBytes
Host memory      O(1-100) GBytes/s         O(1-100) GBytes

Managing the memory hierarchy is one of the most important


things to get right to achieve good performance

*Size and performance numbers are approximate and for a high-end discrete GPU, circa 2011
Optimizing matrix multiplication
• MM cost determined by FLOPS and memory movement:
– 2*n^3 = O(n^3) FLOPS
– Operates on 3*n^2 = O(n^2) numbers
• To optimize matrix multiplication, we must ensure that for
every memory access we execute as many FLOPS as
possible.
• Outer product algorithms are faster, but for pedagogical
reasons, let’s stick to the simple dot-product algorithm.
C(i,j) = A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C

• We will work with work-item/work-group sizes and the


memory model to optimize matrix multiplication
Optimizing matrix multiplication
• There may be significant overhead to manage work-items
and work-groups.
• So let’s have each work-item compute a full row of C

C(i,j) = A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C

• And with an eye towards future optimizations, let’s collect


work-items into work-groups with 64 work-items per work-
group
C= A * B (1 row per work-item)
• Goal:
– To give you experience managing the number of work-
items per work-group.
• Procedure:
– Start from your last matrix multiplication program.
Modify it so each work-item handles an entire row of the
matrix.
• Expected output:
– Test your result and verify that it is correct. Output the
runtime and the MFLOPS.

cl::EnqueueArgs() is used with the kernel functor to control how a kernel is


enqueued. There are many overloaded forms … the one you’ll need is:

cl::EnqueueArgs(NDRange Global, NDRange Local)

Where “global” and “local” are (N), (N,N), or (N,N,N) depending on the
dimensionality of the NDRange index space.
An N-dimension domain of work-items
• Global Dimensions: 1024 (1D)
  Whole problem space (index space)

• Local Dimensions: 64 (work-items per work-group)
  Only 1024/64 = 16 work-groups in total

• Important implication: we will have a lot fewer work-items per work-group (64) and work-groups (16).
Matrix multiplication: One work item per row of C

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C)
{
  int j, k;
  int i = get_global_id(0);
  float tmp;
  for (j = 0; j < N; j++) {
    tmp = 0.0f;
    for (k = 0; k < N; k++)
      tmp += A[i*N+k] * B[k*N+j];
    C[i*N+j] = tmp;
  }
}
Mat. Mul. host program (1 row per work-item)

#define DEVICE CL_DEVICE_TYPE_DEFAULT

// declarations (not shown)
sz = N * N;
std::vector<float> h_A(sz);
std::vector<float> h_B(sz);
std::vector<float> h_C(sz);

cl::Buffer d_A, d_B, d_C;

// initialize matrices and setup
// the problem (not shown)

cl::Context context(DEVICE);
cl::Program program(context,
    util::loadProgram("mmulCrow.cl"),
    true);

cl::CommandQueue queue(context);

cl::make_kernel
    <int, cl::Buffer, cl::Buffer, cl::Buffer>
    mmul(program, "mmul");

d_A = cl::Buffer(context, h_A.begin(), h_A.end(), true);
d_B = cl::Buffer(context, h_B.begin(), h_B.end(), true);
d_C = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * sz);

mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)

Mat. Mul. host program (1 row per work-item)
(Same host program as on the previous slide; the key changes are called out below.)

Changes to the host program:
1. The 1D NDRange is set to the number of rows in the C matrix (N).
2. The local dimension is set to 64, which gives us 16 work-groups — matching the GPU’s number of compute units.

mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)

Matrix multiplication performance

• Matrices are stored in global memory.

Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9
C row per work-item, all global    3,379.5        4,195.8

This has started to help.

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.
These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Optimizing matrix multiplication

• Notice that, in one row of C, each element reuses the same row of A.
• Let’s copy that row of A into the private memory of the work-item that’s (exclusively) using it, to avoid the overhead of loading it from global memory for each C(i,j) computation.

C(i,j) = A(i,:) x B(:,j)
(the row of A is held in the private memory of each work-item)
Private Memory
• A work-item’s private memory:
– A very scarce resource, only a few tens of 32-bit
words per Work-Item at most (on a GPU)
– If you use too much it spills to global memory or
reduces the number of Work-Items that can be
run at the same time, potentially harming
performance*
– Think of these like registers on the CPU
• How do you create and manage private
memory?
– Declare statically inside your kernel
* Occupancy on a GPU
C= A * B (Row of A in private memory)
• Goal:
– To give you experience working with private memory.
• Procedure:
– Start from your last matrix multiplication program (the
row-based method). Modify it so each work-item copies its
row of A from global to private memory, to reduce traffic into
global memory.
• Expected output:
– Test your result and verify that it is correct. Output the
runtime and the MFLOPS.

Private memory can be allocated as an automatic (i.e. not


with malloc) inside a kernel … so just declare any arrays you
need. You can use normal loads and stores inside a kernel to
move data between private and global address spaces.
Matrix multiplication: (Row of A in private memory)

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C)
{
  int j, k;
  int i = get_global_id(0);
  float tmp;
  float Awrk[1024];

  for (k = 0; k < N; k++)
    Awrk[k] = A[i*N+k];

  for (j = 0; j < N; j++) {
    tmp = 0.0f;
    for (k = 0; k < N; k++)
      tmp += Awrk[k] * B[k*N+j];
    C[i*N+j] = tmp;
  }
}
Matrix multiplication:
(Row of A in private memory)

Copy a row of A into private memory from global memory before we start with the matrix multiplications (same kernel as above, annotated):

    for (k = 0; k < N; k++)
        Awrk[k] = A[i*N+k];

Set up a work array for A in private memory*:

    float Awrk[1024];

(*Actually, this is using far more private memory than we’ll have and so Awrk[] will be spilled to global memory)
Mat. Mul. host program (Row of A in private memory)
Host program unchanged from the last exercise: build "mmulCrow.cl", create the buffers, and enqueue with

    mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)), N, d_A, d_B, d_C);
    cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)
Matrix multiplication performance

• Matrices are stored in global memory.

Case                                 MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)            887.2          N/A
C(i,j) per work-item, all global     3,926.1        3,720.9
C row per work-item, all global      3,379.5        4,195.8
C row per work-item, A row private   3,385.8        8,584.3

Big impact (on the GPU)!

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.
These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Optimizing matrix multiplication

• We already noticed that, in one row of C, each element uses the same row of A
• Each work-item in a work-group also uses the same columns of B
• So let’s store the B columns in local memory (which is shared by the work-items in the work-group)

C(i,j) = A(i,:) x B(:,j)
(the row of A stays in the private memory of each work-item; the columns of B go into local memory, shared by the work-group)
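A hedged sketch of that next step (the slides stop before showing the kernel): the work-group cooperatively stages one column of B in local memory before each round of dot products. The host would pass the extra local argument, e.g. cl::Local(sizeof(float)*N) when using the make_kernel functor.

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C,
                 local  float *Bwrk)            // one column of B, N floats, allocated by the host
{
    int i    = get_global_id(0);
    int iloc = get_local_id(0);
    int nloc = get_local_size(0);
    float Awrk[1024];
    float tmp;

    for (int k = 0; k < N; k++)                 // row of A in private memory, as before
        Awrk[k] = A[i*N + k];

    for (int j = 0; j < N; j++) {
        for (int k = iloc; k < N; k += nloc)    // cooperatively load column j of B
            Bwrk[k] = B[k*N + j];
        barrier(CLK_LOCAL_MEM_FENCE);           // column is complete before anyone reads it

        tmp = 0.0f;
        for (int k = 0; k < N; k++)
            tmp += Awrk[k] * Bwrk[k];
        C[i*N + j] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);           // don't overwrite Bwrk while others still read it
    }
}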
OpenCL and CUDA
• Many OpenCL features have a one to one mapping to CUDA features

• OpenCL
- More complex platform and device management
- More complex kernel launch
- Lower level than CUDA
OpenCL and CUDA

• A Compute Unit (CU) corresponds to:
  - a CUDA streaming multiprocessor (SM)
  - a CPU core
  - etc.
• A Processing Element (PE) corresponds to:
  - a CUDA streaming processor (SP)
  - a CPU ALU
OpenCL and CUDA
• Work Item (CUDA thread) – executes kernel code

• Index Space (CUDA grid) – defines work items and how data is mapped to them
• Work Group (CUDA block) – work items in a work group can synchronize
References
Optimizing OpenCL applications on Intel Xeon Phi
http://iwocl.org/wp-content/uploads/2013/06/Optimizing-OpenCL-Applications-on-Intel-Xeon-Phi-IWOCL.pdf

OpenCL home page


http://www.khronos.org/opencl/

One OpenCL to Rule Them All


Romain Dolbeau, François Bodin, Guillaume Colin de Verdière
http://www.caps-entreprise.com/wp-content/uploads/2012/08/One-OpenCL-to-rule-them-all.pdf

Ramses Project
Romain Teyssier, Pierre-François Lavallée, et al.
http://irfu.cea.fr/Phocea/Vie_des_labos/Ast/ast_sstechnique.php?id_ast=904
