
Programming Multi-/Many-cores Using OpenCL
(CS 3006)

Adapted from: see the Acknowledgement slide

Compiled by: Dr. Muhammad Arshad Islam and Dr. Muhammad Aleem
National University of Computer & Emerging Sciences, Islamabad Campus
Lecture Acknowledgements
Simon McIntosh-Smith, University of Bristol
http://people.cs.bris.ac.uk/~simonm/SC13/OpenCL_slides_SC13.pdf

Romain Teyssier, Pierre-François Lavallée, et al.


http://irfu.cea.fr/Phocea/Vie_des_labos/Ast/ast_sstechnique.php?id_ast=904

Optimizing OpenCL applications on Intel Xeon Phi


http://iwocl.org/wp-content/uploads/2013/06/Optimizing-OpenCL-Applications-on-Intel-Xeon-Phi-IWOCL.pdf

OpenCL home page


http://www.khronos.org/opencl/
https://www.khronos.org/assets/uploads/developers/library/2012-pan-pacific-road-show-June/OpenCL-Details-Taiwan_June-2012.pdf

AMD
https://indico.fysik.su.se/event/1621/sessions/70/attachments/600/695/OpenCL_Training.pdf
OpenCL
• Open Computing Language
• For heterogeneous parallel-computing systems
• Cross-platform
  - Implementations for:
    • ATI GPUs
    • NVIDIA GPUs
    • Intel MIC
    • x86 CPUs
    • Many others…
Industry Standards for Programming Heterogeneous Platforms

CPUs: multiple cores driving performance increases; multi-processor programming (e.g. OpenMP)
GPUs: emerging as increasingly general-purpose data-parallel computing devices; graphics APIs and shading languages
Heterogeneous Computing lies at the intersection of the two
OpenCL – Open Computing Language


Open, royalty-free standard for portable, parallel programming of heterogeneous platforms made up of CPUs, GPUs, and other processors
OpenCL is Widely Deployed and Used

http://www.khronos.org/opencl/
OpenCL architecture

https://www.researchgate.net/figure/Schematic-structure-of-OpenCL-framework_fig3_270979904
OpenCL Platform Model

[Figure: Platform model — a Host connected to one or more OpenCL Devices; each Device contains Compute Units, each made up of Processing Elements]

• One Host and one or more OpenCL Devices
  - Each OpenCL Device is composed of one or more Compute Units
  - Each Compute Unit is divided into one or more Processing Elements
• Memory divided into host memory and device memory


Credits: Neil Trevett and Cyril Zeller, NVIDIA
OpenCL Program: Host and Kernel

https://www.researchgate.net/figure/The-GPUs-main-function-named-kernel-is-invoked-from-the-CPU-host-code_fig16_256495766
OpenCL Platform Example
(One node, two CPU sockets, two GPUs)

CPUs:
• Treated as one OpenCL device
  - One CU per core
  - 1 PE per CU, or, if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width
• Remember: the CPU will also have to be its own host!

GPUs:
• Each GPU is a separate OpenCL device
• One CU per Streaming Multiprocessor
• Can use the CPU and all GPU devices concurrently through OpenCL

CU = Compute Unit; PE = Processing Element


• Structure: CPU vs. GPU

http://thebeardsage.com/cuda-streaming-multiprocessors/ Credits: Andreas Moshovos, https://scholar.google.com/citations?user=D2VLt-8AAAAJ&hl=en


OpenCL Memory model Overview

• Private Memory
  - Per work-item
• Local Memory
  - Shared within a work-group
• Global/Constant Memory
  - Visible to all work-groups (constant memory is read-only)
• Host Memory
  - On the CPU

Memory management is explicit:
You are responsible for moving data from
host → global → local and back
Traditional Vs. OpenCL Parallel Programming

https://www.khronos.org/opencl/
UNDERSTANDING THE HOST PROGRAM
Vector Addition – Host
• The host program is the code that runs on the host to:
– Setup the environment for the OpenCL program
– Create and manage kernels

• 5 simple steps in a basic host program:


1. Define the platform … platform = devices+context+queues
2. Create and Build the program (dynamic library for kernels)
3. Setup memory objects
4. Define the kernel (attach arguments to kernel function)
5. Submit commands … transfer memory objects and execute kernels
The C++ Interface
• Khronos has defined a common C++ header file containing a
high level interface to OpenCL, cl.hpp

• Key features:
– Uses common defaults for the platform and command-
queue,
– Simplifies the basic API
– Ability to “call” a kernel from the host, like a regular
function
– Error checking can be performed with C++ exceptions
C++ Interface: setting up the host program

• Enable OpenCL API exceptions (do this before including the header files):

#define __CL_ENABLE_EXCEPTIONS

• Include key header files … both standard and custom


#include <CL/cl.hpp> // Khronos C++ Wrapper API
#include <cstdio> // C style IO (e.g. printf)
#include <iostream> // C++ style IO
#include <vector> // C++ vector types
Context and Command-Queues
• Context:
  – The environment within which kernels execute and in which synchronization and memory management is defined.

• The context includes:
  – One or more devices
  – Device memory
  – One or more command-queues

• All commands for a device (kernel execution, synchronization, and memory operations) are submitted through a command-queue.

• Each command-queue points to a single device within a context.
1. Create a context and queue
• Grab a context using a device type:
cl::Context context(CL_DEVICE_TYPE_DEFAULT);

Or…CL_DEVICE_TYPE_CPU,
CL_DEVICE_TYPE_GPU,
CL_DEVICE_TYPE_ACCELERATOR, etc.

• Create a command queue for the device in the context:
cl::CommandQueue queue(context);
Commands and Command-Queues
• Commands include:
  – Kernel executions
  – Memory object management
  – Synchronization

• The only way to submit commands to a device is through a command-queue.

• Each command-queue points to a single device within a context.
Command-Queue execution details
• Command queues can be configured in different ways to control how commands execute

• In-order queues:
  – Commands are enqueued and complete in the order they appear in the host program (program-order)

• Out-of-order queues:
  – Commands are enqueued in program-order but can execute (and hence complete) in any order.
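As a sketch (assuming a context and a device object already exist), the queue type is chosen when the command-queue is created; the default is in-order:

cl::CommandQueue in_order(context);                 // in-order queue (the default)
cl::CommandQueue out_of_order(context, device,
    CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);        // commands may complete in any order,
                                                    // so use events or queue.finish() to order them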
2. Create and Build the program
• Define source code for the kernel-program either as a
string literal (for small programs) or read it from a file (for
large applications).

• Create the program object and compile it:

cl::Program program(context, KernelSource, true);

The third argument, true, tells OpenCL to build (compile/link) the program object.

KernelSource is a string … either statically set in the host program or returned from a function that loads the kernel code from a file.
Building Program Objects

• The program object encapsulates:
  1. A context
  2. The program source or binary, and
  3. A list of target devices and build options

• OpenCL uses runtime compilation … because in general you don’t know the details of the target device when you ship the program

• The build process to create a program object:
cl::Program program(context, KernelSource);

• Example kernel source (compiled at runtime for the GPU and/or the CPU):
kernel void
horizontal_reflect(read_only  image2d_t src,
                   write_only image2d_t dst)
{
  int x = get_global_id(0);   // x-coord
  int y = get_global_id(1);   // y-coord
  int width = get_image_width(src);
  // sampler is assumed to be declared elsewhere in the program
  float4 src_val = read_imagef(src, sampler, (int2)(width-1-x, y));
  write_imagef(dst, (int2)(x, y), src_val);
}
3. Setup Memory Objects
• For vector addition we need 3 memory objects: one each for
input vectors A and B, and one for the output vector C

• Create input vectors and assign values on the host:


std::vector<float> h_a(LENGTH), h_b(LENGTH), h_c(LENGTH);
for (int i = 0; i < LENGTH; i++) {
    h_a[i] = rand() / (float)RAND_MAX;
    h_b[i] = rand() / (float)RAND_MAX;
}

• Define OpenCL device buffers and copy from host buffers:


cl::Buffer d_a(context, h_a.begin(), h_a.end(), true);
cl::Buffer d_b(context, h_b.begin(), h_b.end(), true);
cl::Buffer d_c(context, CL_MEM_WRITE_ONLY, sizeof(float)*LENGTH);
    // or CL_MEM_READ_ONLY or CL_MEM_READ_WRITE
What do we put in device memory?
• Memory Objects:

• There are two kinds of memory object


– Buffer object (Always linear):
• Defines a linear collection of bytes.

– Image object:
• Defines a two- or three-dimensional region of
memory.
• Image data can only be accessed with read and write
functions
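A minimal sketch (assuming an existing context; the 512x512 RGBA float format is just an example) of creating an image object with the C++ wrapper:

cl::ImageFormat fmt(CL_RGBA, CL_FLOAT);                     // 4 float channels per pixel
cl::Image2D img(context, CL_MEM_READ_ONLY, fmt, 512, 512);  // 2D region of memory
// inside a kernel, image data is accessed only via read_imagef()/write_imagef()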
Creating and manipulating buffers
• Buffers are declared on the host as object type:
cl::Buffer
• Arrays in host memory hold your original host-side data:
std::vector<float> h_a, h_b;

• Create the device-side buffer (d_a), assign read-only


memory to hold the host array (h_a) and copy it into device
memory:
cl::Buffer d_a(context, h_a.begin(), h_a.end(), true);

The second and third arguments are the start and end iterators of the container holding the host-side data; the final argument stipulates that this is a read-only buffer.

Alternatively, use the C API function clCreateBuffer() to create a device-side buffer.


Creating and manipulating buffers
• The last argument sets the read/write access to the Buffer
by the device. true means “read only” while false (the
default) means “read/write”.

• Submit command to copy the device buffer back to host


memory in array “h_c”:
cl::copy(queue, d_c, h_c.begin(), h_c.end());

• Can also copy host memory to device buffers:


cl::copy(queue, h_c.begin(), h_c.end(), d_c);
4. Define the kernel
• Create a kernel functor for the kernels you want to be able to call in the program:

cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer>
    vadd(program, "vadd");

The template arguments must match the pattern of arguments to the kernel. The first constructor argument is a previously created program object (serving as a dynamic library of kernels); the second is the name of the function used for the kernel.

• This means you can ‘call’ the kernel as a ‘function’ in your host code to enqueue the kernel.
5. Enqueue commands
• For kernel launches, specify global and local dimensions
– cl::NDRange global(1024)
– cl::NDRange local(64)
If you don’t specify a local dimension, it is assumed as
cl::NullRange, and the runtime picks a size for you

• Enqueue the kernel for execution (note: returns immediately …


i.e. this is a non-blocking command):
vadd(cl::EnqueueArgs(queue, global), d_a, d_b, d_c);

• Read back result (as a blocking operation). We use an in-order


queue to assure the previous commands are completed before the
read can begin

cl::copy(queue, d_c, h_c.begin(), h_c.end());


OpenCL Vector Addition Example
#define __CL_ENABLE_EXCEPTIONS   // enable cl::Error exceptions (needed for the catch block below)
#include <iostream>
#include <fstream>
#include <sstream>
#include <vector>
#include <CL/cl.hpp>

int main() {
std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
std::vector<float> b = {4.0f, 3.0f, 2.0f, 1.0f};
std::vector<float> c(a.size());

try {
// get available OpenCL platforms
std::vector<cl::Platform> platforms;
cl::Platform::get(&platforms);

// choose a platform
cl::Platform platform = platforms[0];

// get available OpenCL devices


std::vector<cl::Device> devices;
platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);

// choose a device
cl::Device device = devices[0];

// create an OpenCL context for the device


cl::Context context({device});

// create an OpenCL program from source
// (the cl::Program constructor takes source text, not a file name,
//  so read the file "kernel.cl" into a string first)
std::ifstream src_file("kernel.cl");
std::stringstream src_stream;
src_stream << src_file.rdbuf();
cl::Program program(context, src_stream.str());
OpenCL Vector Addition Example
// build the program for the device
program.build({device});

// create OpenCL buffers for the input and output vectors


cl::Buffer bufA(context, CL_MEM_READ_ONLY, a.size() * sizeof(float));
cl::Buffer bufB(context, CL_MEM_READ_ONLY, b.size() * sizeof(float));
cl::Buffer bufC(context, CL_MEM_WRITE_ONLY, c.size() * sizeof(float));

// create a command queue for the device


cl::CommandQueue queue(context, device);

// enqueue data to be transferred to the device


queue.enqueueWriteBuffer(bufA, CL_TRUE, 0, a.size() * sizeof(float), a.data());
queue.enqueueWriteBuffer(bufB, CL_TRUE, 0, b.size() * sizeof(float), b.data());

// create a kernel object for the vector addition kernel


cl::Kernel kernel(program, "vecadd");

// set the arguments of the kernel
kernel.setArg(0, bufA);
kernel.setArg(1, bufB);
kernel.setArg(2, bufC);

// enqueue the kernel for execution
// (arguments: kernel, global offset, global NDRange, local NDRange)
queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(a.size()), cl::NullRange);

// enqueue data to be transferred back to the host


queue.enqueueReadBuffer(bufC, CL_TRUE, 0, c.size() * sizeof(float), c.data());
OpenCL Vector Addition Example
// print the result
for (float x : c) {
std::cout << x << " ";
}

std::cout << std::endl;

} catch (cl::Error& e) {
std::cerr << "OpenCL error: " << e.what() << " (" << e.err() << ")" <<
std::endl;
return 1;
}

return 0;
}
An N-dimensional domain of work-items
• Global Dimensions:
– 1024x1024 (whole problem space)
– For example, if we have a 2D problem with dimensions (1024, 1024), then the
NDRange would be defined as (1024, 1024).

• Local Dimensions:
  – 128x128 (work-group, executes together)

• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Cannot synchronize among work-groups

• Choose the dimensions that are “best” for your algorithm:
  • 1 (one dimension)
  • 2 (two dimensions)
  • 3 (three dimensions)
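A host-side sketch of launching over a 2D index space (assuming a queue and kernel object already exist; the 16x16 work-group size here is an illustrative assumption, not from the slides):

cl::NDRange global(1024, 1024);   // whole problem space
cl::NDRange local(16, 16);        // one work-group of 16x16 work-items
queue.enqueueNDRangeKernel(kernel, cl::NullRange,   // no global offset
                           global, local);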
Summary
INTRODUCTION TO OPENCL
KERNEL PROGRAMMING
OpenCL kernel
• Derived from ISO C99
– A few restrictions: no recursion, no function pointers, and no functions from the C99 standard headers …
– Preprocessing directives defined by C99 are
supported (#include etc.)

• Built-in data types


– Scalar and vector data types, pointers
– Data-type conversion functions:
• convert_type<_sat><_roundingmode>
– Image types: image2d_t, image3d_t and sampler_t
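A small kernel sketch (not from the slides) using a built-in vector type and a conversion function of the convert_type<_sat><_roundingmode> form:

kernel void to_bytes(global const float4 *in, global uchar4 *out)
{
    int i = get_global_id(0);
    float4 v = in[i] * 255.0f;           // float4 is a built-in vector type
    out[i] = convert_uchar4_sat_rte(v);  // saturating convert, round-to-nearest-even
}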
OpenCL C Language Highlights
• Function qualifiers
– __kernel qualifier declares a function as a kernel
• I.e. makes it visible to host code so it can be enqueued

• Address space qualifiers


– __global,__local,__constant,__private
– Pointer kernel arguments must be declared with an address space
qualifier

• Work-item functions
– uint get_work_dim() … number of dimensions in use (1,2, or 3)
– size_t get_global_id(uint n) … global work-item ID in dim “n”
– size_t get_local_id(uint n) … work-item ID in dim “n” inside work-
group
– size_t get_group_id(uint n) … ID of work-group in dim “n”
– size_t get_global_size(uint n) … num of work-items in dim “n”
– size_t get_local_size(uint n) … num of work-items in work group in
dim “n”
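As a sketch (not from the slides), a 2D kernel that combines several of these work-item functions to compute a row-major index:

kernel void fill_indices(global int *out)
{
    size_t gx  = get_global_id(0);    // position in dim 0
    size_t gy  = get_global_id(1);    // position in dim 1
    size_t gsx = get_global_size(0);  // extent of dim 0
    // each work-item records which work-group it belongs to
    out[gy * gsx + gx] = (int)(get_group_id(1) * get_num_groups(0) + get_group_id(0));
}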
The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in
a problem domain
– E.g., process a 1024x1024 image with one kernel invocation per pixel or
1024x1024=1,048,576 kernel executions

Traditional loops:

void mul(const int n,
         const float *a,
         const float *b,
         float *c)
{
  int i;
  for (i = 0; i < n; i++)
    c[i] = a[i] * b[i];
}

OpenCL:

kernel void mul(global const float *a,
                global const float *b,
                global float *c)
{
  int id = get_global_id(0);
  c[id] = a[id] * b[id];
  // execute over n work-items
}
Execution model (kernels)

• OpenCL execution model … define a problem domain and


execute an instance of a kernel for each point in the
domain
kernel void times_two(
    global float* input,
    global float* output)
{
  int i = get_global_id(0);
  output[i] = 2.0f * input[i];
}

Example (get_global_id(0) ranges over the input):
Input:  0 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25
Output: 0 2 4 6 8 10 12 14 16 18 20 22 24 26 28 30 32 34 36 38 40 42 44 46 48 50
Vector Addition - Kernel

kernel void vadd(global const float *a,
                 global const float *b,
                 global float *c)
{
int gid = get_global_id(0);
c[gid] = a[gid] + b[gid];
}
OpenCL C Language Highlights

• Synchronization functions
– Barriers - all work-items within a work-group must execute the
barrier function before any work-item can continue

– Memory fences - provide ordering between memory operations
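A sketch (not from the slides) of a kernel where the barrier guarantees every work-item in the group has written local memory before any work-item reads it:

kernel void reverse_within_group(global float *data, local float *scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);
    int lsz = get_local_size(0);

    scratch[lid] = data[gid];             // stage the group's elements in local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // wait for the whole work-group

    data[gid] = scratch[lsz - 1 - lid];   // reverse the elements within the group
}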


OpenCL C Language Restrictions
• Pointers to functions are not allowed

• Pointers to pointers allowed within a kernel, but not


as an argument to a kernel invocation

• Variable length arrays and structures are not


supported

• Recursion is not supported (yet!)


Matrix Addition
Parallel Software – SPMD

Single-threaded (CPU):
// there are N elements
for (i = 0; i < N; i++)
    C[i] = A[i] + B[i]
(one thread, T0, walks through elements 0 … 15 in sequence)

Multi-threaded (CPU):
// tid is the thread id, P is the number of cores
for (i = tid*N/P; i < (tid+1)*N/P; i++)
    C[i] = A[i] + B[i]
(each of the P threads handles a contiguous chunk, e.g. T0: 0–3, T1: 4–7, T2: 8–11, T3: 12–15)

Massively Multi-threaded (GPU):
// tid is the thread id
C[tid] = A[tid] + B[tid]
(one thread per element: T0 → 0, T1 → 1, …, T15 → 15)

From: OpenCL 1.2 University Kit - http://developer.amd.com/partners/university-programs/
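Tying this back to the “Matrix Addition” heading above, a sketch (not from the slides) of the same element-per-work-item idea written as an OpenCL kernel over a 2D NDRange of N x M:

kernel void mat_add(const int M,              // number of columns
                    global const float *A,
                    global const float *B,
                    global float *C)
{
    int i = get_global_id(0);                 // row index
    int j = get_global_id(1);                 // column index
    C[i*M + j] = A[i*M + j] + B[i*M + j];     // one work-item per matrix element
}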


NDRange
An NDRange is defined by two parameters:
• The global size in each dimension: G_x, G_y, G_z
• The local size in each dimension: S_x, S_y, S_z

Examples of global dimensions and the resulting number of work-items:
  1024               → 1024 work-items
  1920 * 1080        → ~2M work-items
  256 * 256 * 256    → ~16M work-items
Thread Mapping
By using different mappings, the same thread can be assigned to access different data elements.
The examples below show three different possible mappings of threads to data (assuming the thread id is used to access an element).

Mapping 1 (row-major):
int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);
Thread IDs:
   0  1  2  3
   4  5  6  7
   8  9 10 11
  12 13 14 15

Mapping 2 (column-major):
int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
Thread IDs:
   0  4  8 12
   1  5  9 13
   2  6 10 14
   3  7 11 15

Mapping 3 (by work-group, *assuming 2x2 groups):
int group_size = get_local_size(0) * get_local_size(1);
int tid = get_group_id(1) * get_num_groups(0) * group_size
        + get_group_id(0) * group_size
        + get_local_id(1) * get_local_size(0)
        + get_local_id(0);
Thread IDs:
   0  1  4  5
   2  3  6  7
   8  9 12 13
  10 11 14 15
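As a sketch of how one of these mappings is used in practice (mapping 1, row-major; not from the slides):

kernel void scale(global float *data, const float factor)
{
    // flatten the 2D global IDs into the tid used to pick a data element
    int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);
    data[tid] = data[tid] * factor;
}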
Thread Mapping for Nvidia

Credits: Rafał Mantiuk, Computer Laboratory, University of Cambridge


Performance

– Accesses to both of these matrices will be coalesced


• Degree of coalescence depends on the workgroup and data sizes
Credits: Rafał Mantiuk, Computer Laboratory, University of Cambridge
Matrix multiplication: sequential code
We calculate C=AB, where dimA = (N x P), dimB = (P x M), dimC = (N x M)

void mat_mul(int Mdim, int Ndim, int Pdim,
             float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < Ndim; i++) {
    for (j = 0; j < Mdim; j++) {
      C[i*Mdim+j] = 0.0f;
      for (k = 0; k < Pdim; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*Mdim+j] += A[i*Pdim+k] * B[k*Mdim+j];
      }
    }
  }
}

C(i,j) = A(i,:) x B(:,j)
Dot product of a row of A and a column of B for each element of C


Matrix multiplication: sequential code
We calculate C=AB, where all three matrices are NxN
(Let’s make it easier and specialize to square matrices)

void mat_mul(int N, float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}

C(i,j) = A(i,:) x B(:,j)
Dot product of a row of A and a column of B for each element of C


Matrix multiplication performance

• Serial C code on CPU (single core).


Case                        MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)   887.2          N/A

Device is an Intel® Xeon® CPU E5649 @ 2.53GHz, using the gcc compiler.

These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Matrix multiplication: sequential code
void mat_mul(int N, float *A, float *B, float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}
Matrix multiplication: OpenCL kernel (1/2)

kernel void mat_mul(const int N, global float *A,
                    global float *B, global float *C)
{
  int i, j, k;
  for (i = 0; i < N; i++) {
    for (j = 0; j < N; j++) {
      C[i*N+j] = 0.0f;
      for (k = 0; k < N; k++) {
        // C(i, j) = sum(over k) A(i,k) * B(k,j)
        C[i*N+j] += A[i*N+k] * B[k*N+j];
      }
    }
  }
}

Mark as a kernel function and specify memory qualifiers.
Matrix multiplication: OpenCL kernel (2/2)

kernel void mat_mul(const int N, global float *A,
                    global float *B, global float *C)
{
  int i, j, k;
  i = get_global_id(0);
  j = get_global_id(1);
  C[i*N+j] = 0.0f;
  for (k = 0; k < N; k++) {
    // C(i, j) = sum(over k) A(i,k) * B(k,j)
    C[i*N+j] += A[i*N+k] * B[k*N+j];
  }
}

Replace the outer loops with the work-item’s global IDs.
Matrix multiplication host program
#define DEVICE CL_DEVICE_TYPE_DEFAULT

// declarations (not shown)
sz = N * N;
std::vector<float> h_A(sz);
std::vector<float> h_B(sz);
std::vector<float> h_C(sz);

cl::Buffer d_A, d_B, d_C;

// initialize matrices and setup
// the problem (not shown)

cl::Context context(DEVICE);
cl::Program program(context,
    util::loadProgram("matmul1.cl"),
    true);

cl::CommandQueue queue(context);

cl::make_kernel
    <int, cl::Buffer, cl::Buffer, cl::Buffer>
    mmul(program, "mmul");

d_A = cl::Buffer(context, h_A.begin(), h_A.end(), true);
d_B = cl::Buffer(context, h_B.begin(), h_B.end(), true);
d_C = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * sz);

mmul(cl::EnqueueArgs(queue, cl::NDRange(N, N)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)
Matrix multiplication performance
• Matrices are stored in global memory.

Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.

These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
UNDERSTANDING THE OPENCL
MEMORY HIERARCHY
OpenCL Memory model
• Private Memory
– Per work-item
• Local Memory
– Shared within a
work-group
• Global/Constant
Memory
– Visible to all
work-groups
• Host memory
– On the CPU
Memory management is explicit:
You are responsible for moving data from
host → global → local and back
The Memory Hierarchy
Memory           Bandwidth                 Size
Private memory   O(2-3) words/cycle/WI     O(10) words/WI
Local memory     O(10) words/cycle/WG      O(1-10) KBytes/WG
Global memory    O(100-200) GBytes/s       O(1-10) GBytes
Host memory      O(1-100) GBytes/s         O(1-100) GBytes

Managing the memory hierarchy is one of the most important


things to get right to achieve good performance

*Size and performance numbers are approximate and for a high-end discrete GPU, circa 2011
Optimizing matrix multiplication
• MM cost determined by FLOPS and memory movement:
– 2*n^3 = O(n^3) FLOPS
– Operates on 3*n^2 = O(n^2) numbers
• To optimize matrix multiplication, we must ensure that for
every memory access we execute as many FLOPS as
possible.
• Outer product algorithms are faster, but for pedagogical
reasons, let’s stick to the simple dot-product algorithm.
C(i,j) = A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C

• We will work with work-item/work-group sizes and the


memory model to optimize matrix multiplication
Optimizing matrix multiplication
• There may be significant overhead to manage work-items
and work-groups.
• So let’s have each work-item compute a full row of C

C(i,j) = A(i,:) x B(:,j)

Dot product of a row of A and a column of B for each element of C

• And with an eye towards future optimizations, let’s collect


work-items into work-groups with 64 work-items per work-
group
C= A * B (1 row per work-item)
• Goal:
– To give you experience managing the number of work-
items per work-group.
• Procedure:
– Start from your last matrix multiplication program.
Modify it so each work-item handles an entire row of the
matrix.
• Expected output:
– Test your result and verify that it is correct. Output the
runtime and the MFLOPS.

cl::EnqueueArgs() is used with the kernel functor to control how a kernel is


enqueued. There are many overloaded forms … the one you’ll need is:

cl::EnqueueArgs(NDRange Global, NDRange Local)

Where “global” and “local” are (N), (N,N), or (N,N,N) depending on the
dimensionality of the NDRange index space.
An N-dimension domain of work-items
• Global Dimensions: 1024 (1D)
  Whole problem space (index space)

• Local Dimensions: 64 (work-items per work-group)
  Only 1024/64 = 16 work-groups in total

• Important implication: we will have a lot fewer work-items per work-group (64) and work-groups (16).
Matrix multiplication: One work item per row of C

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C)
{
  int j, k;
  int i = get_global_id(0);
  float tmp;
  for (j = 0; j < N; j++) {
    tmp = 0.0f;
    for (k = 0; k < N; k++)
      tmp += A[i*N+k] * B[k*N+j];
    C[i*N+j] = tmp;
  }
}
Mat. Mul. host program (1 row per work-item)

#define DEVICE CL_DEVICE_TYPE_DEFAULT

// declarations (not shown)
sz = N * N;
std::vector<float> h_A(sz);
std::vector<float> h_B(sz);
std::vector<float> h_C(sz);

cl::Buffer d_A, d_B, d_C;

// initialize matrices and setup
// the problem (not shown)

cl::Context context(DEVICE);
cl::Program program(context,
    util::loadProgram("mmulCrow.cl"),
    true);

cl::CommandQueue queue(context);

cl::make_kernel
    <int, cl::Buffer, cl::Buffer, cl::Buffer>
    mmul(program, "mmul");

d_A = cl::Buffer(context, h_A.begin(), h_A.end(), true);
d_B = cl::Buffer(context, h_B.begin(), h_B.end(), true);
d_C = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * sz);

mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)

Mat. Mul. host program (1 row per work-item)
(Same host program as on the previous slide; the key changes are called out below.)

Changes to the host program:
1. The 1D NDRange is set to the number of rows in the C matrix (N).
2. The local dimension is set to 64, which gives us 16 work-groups — matching the GPU’s number of compute units.

mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)

Matrix multiplication performance

• Matrices are stored in global memory.

Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9
C row per work-item, all global    3,379.5        4,195.8

This has started to help.

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.
These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Optimizing matrix multiplication

• Notice that, in one row of C, each element reuses the same row of A.
• Let’s copy that row of A into the private memory of the work-item that’s (exclusively) using it, to avoid the overhead of loading it from global memory for each C(i,j) computation.

C(i,j) = A(i,:) x B(:,j)
(the row of A is held in the private memory of each work-item)
Private Memory
• A work-item’s private memory:
– A very scarce resource, only a few tens of 32-bit
words per Work-Item at most (on a GPU)
– If you use too much it spills to global memory or
reduces the number of Work-Items that can be
run at the same time, potentially harming
performance*
– Think of these like registers on the CPU
• How do you create and manage private
memory?
– Declare statically inside your kernel
* Occupancy on a GPU
C= A * B (Row of A in private memory)
• Goal:
– To give you experience working with private memory.
• Procedure:
– Start from your last matrix multiplication program (the
row-based method). Modify it so each work-item copies its
row of A from global to private memory, to reduce traffic into
global memory.
• Expected output:
– Test your result and verify that it is correct. Output the
runtime and the MFLOPS.

Private memory can be allocated as an automatic (i.e. not


with malloc) inside a kernel … so just declare any arrays you
need. You can use normal loads and stores inside a kernel to
move data between private and global address spaces.
Matrix multiplication: (Row of A in private memory)

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C)
{
  int j, k;
  int i = get_global_id(0);
  float tmp;
  float Awrk[1024];

  for (k = 0; k < N; k++)
    Awrk[k] = A[i*N+k];

  for (j = 0; j < N; j++) {
    tmp = 0.0f;
    for (k = 0; k < N; k++)
      tmp += Awrk[k] * B[k*N+j];
    C[i*N+j] = tmp;
  }
}
Matrix multiplication:
(Row of A in private memory)

Copy a row of A into private memory from global memory before we start with the matrix multiplications (same kernel as above, annotated):

    for (k = 0; k < N; k++)
        Awrk[k] = A[i*N+k];

Set up a work array for A in private memory*:

    float Awrk[1024];

(*Actually, this is using far more private memory than we’ll have and so Awrk[] will be spilled to global memory)
Mat. Mul. host program (Row of A in private memory)
Host program unchanged from the last exercise: build "mmulCrow.cl", create the buffers, and enqueue with

    mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)), N, d_A, d_B, d_C);
    cl::copy(queue, d_C, h_C.begin(), h_C.end());

// Timing and check results (not shown)
Matrix multiplication performance

• Matrices are stored in global memory.

Case                                 MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)            887.2          N/A
C(i,j) per work-item, all global     3,926.1        3,720.9
C row per work-item, all global      3,379.5        4,195.8
C row per work-item, A row private   3,385.8        8,584.3

Big impact (on the GPU)!

GPU device is a Tesla® M2090 from NVIDIA® with a max of 16 compute units, 512 PEs.
CPU device is an Intel® Xeon® CPU E5649 @ 2.53GHz.
These are not official benchmark results. You may observe completely different results should you run these tests on your own system.
Third party names are the property of their owners.
Optimizing matrix multiplication

• We already noticed that, in one row of C, each element uses the same row of A
• Each work-item in a work-group also uses the same columns of B
• So let’s store the B columns in local memory (which is shared by the work-items in the work-group)

C(i,j) = A(i,:) x B(:,j)
(the row of A stays in the private memory of each work-item; the columns of B go into local memory, shared by the work-group)
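A hedged sketch of that next step (the slides stop before showing the kernel): the work-group cooperatively stages one column of B in local memory before each round of dot products. The host would pass the extra local argument, e.g. cl::Local(sizeof(float)*N) when using the make_kernel functor.

kernel void mmul(const int N,
                 global float *A,
                 global float *B,
                 global float *C,
                 local  float *Bwrk)            // one column of B, N floats, allocated by the host
{
    int i    = get_global_id(0);
    int iloc = get_local_id(0);
    int nloc = get_local_size(0);
    float Awrk[1024];
    float tmp;

    for (int k = 0; k < N; k++)                 // row of A in private memory, as before
        Awrk[k] = A[i*N + k];

    for (int j = 0; j < N; j++) {
        for (int k = iloc; k < N; k += nloc)    // cooperatively load column j of B
            Bwrk[k] = B[k*N + j];
        barrier(CLK_LOCAL_MEM_FENCE);           // column is complete before anyone reads it

        tmp = 0.0f;
        for (int k = 0; k < N; k++)
            tmp += Awrk[k] * Bwrk[k];
        C[i*N + j] = tmp;
        barrier(CLK_LOCAL_MEM_FENCE);           // don't overwrite Bwrk while others still read it
    }
}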
OpenCL and CUDA
• Many OpenCL features have a one to one mapping to CUDA features

• OpenCL
- More complex platform and device management
- More complex kernel launch
- Lower level than CUDA
OpenCL and CUDA

• A Compute Unit (CU) corresponds to:
  - a CUDA streaming multiprocessor (SM)
  - a CPU core
  - etc.
• A Processing Element (PE) corresponds to:
  - a CUDA streaming processor (SP)
  - a CPU ALU
OpenCL and CUDA
• Work Item (CUDA thread) – executes kernel code

• Index Space (CUDA grid) – defines work items and how data is mapped to them
• Work Group (CUDA block) – work items in a work group can synchronize
References
Optimizing OpenCL applications on Intel Xeon Phi
http://iwocl.org/wp-content/uploads/2013/06/Optimizing-OpenCL-Applications-on-Intel-Xeon-Phi-IWOCL.pdf

OpenCL home page


http://www.khronos.org/opencl/

One OpenCL to Rule Them All


Romain Dolbeau, François Bodin, Guillaume Colin de Verdière
http://www.caps-entreprise.com/wp-content/uploads/2012/08/One-OpenCL-to-rule-them-all.pdf

Ramses Project
Romain Teyssier, Pierre-François Lavallée, et al.
http://irfu.cea.fr/Phocea/Vie_des_labos/Ast/ast_sstechnique.php?id_ast=904
