CS-3006, Lecture 7: Using OpenCL for Data-Parallel Programming
OpenCL
(CS 3006)
AMD
https://indico.fysik.su.se/event/1621/sessions/70/attachments/600/695/OpenCL_Training.pdf
OpenCL
• Open Computing Language
• For heterogeneous parallel-computing systems
• Cross-platform
- Implementations for
• ATI GPUs
• NVIDIA GPUs
• Intel MIC
• x86 CPUs
• Many others…..
Industry Standards for Programming Heterogeneous Platforms
• CPUs: multiple cores driving performance increases; multi-processor programming, e.g. OpenMP
• GPUs: increasingly general-purpose data-parallel computing; graphics APIs and shading languages
• Their emerging intersection is heterogeneous computing, which is the domain OpenCL targets
http://www.khronos.org/opencl/
OpenCL architecture
https://www.researchgate.net/figure/Schematic-structure-of-OpenCL-framework_fig3_270979904
OpenCL Platform Model
(Figure: a Host connected to one or more Compute Devices, each made up of Compute Units containing Processing Elements.)
https://www.researchgate.net/figure/The-GPUs-main-function-named-kernel-is-invoked-from-the-CPU-host-code_fig16_256495766
OpenCL Platform Example
(One node, two CPU sockets, two GPUs)
• CPUs:
  - Treated as one OpenCL device
  - One CU per core
  - One PE per CU, or, if PEs are mapped to SIMD lanes, n PEs per CU, where n matches the SIMD width
  - Remember: the CPU will also have to be its own host!
• GPUs:
  - Each GPU is a separate OpenCL device
  - One CU per Streaming Multiprocessor
• The CPU and all GPU devices can be used concurrently through OpenCL
(The sketch below queries the available devices and their compute units.)
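As a concrete illustration of this mapping, the following sketch (an illustrative addition, not part of the original slides) lists every platform and prints each device's name and number of compute units through the C++ wrapper API:

#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main() {
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    for (auto& p : platforms) {
        std::vector<cl::Device> devices;
        p.getDevices(CL_DEVICE_TYPE_ALL, &devices);   // CPUs, GPUs, accelerators
        for (auto& d : devices) {
            // One CU per CPU core, or one CU per GPU Streaming Multiprocessor
            std::cout << d.getInfo<CL_DEVICE_NAME>() << ": "
                      << d.getInfo<CL_DEVICE_MAX_COMPUTE_UNITS>()
                      << " compute units" << std::endl;
        }
    }
    return 0;
}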
OpenCL memory model
• Private Memory
  - Per work-item
• Local Memory
  - Shared within a work-group
• Global / Constant Memory
  - Visible to all work-groups (constant memory is read-only)
• Host Memory
  - On the CPU
(The kernel sketch below touches each of these address spaces.)
https://www.khronos.org/opencl/
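To make these regions concrete, here is a small illustrative OpenCL C kernel (assumed for this deck; the name and arguments are made up) that uses each address space:

// Hypothetical kernel illustrating the OpenCL address spaces
kernel void scale_rows(global float* data,       // global: visible to all work-groups
                       constant float* coeffs,   // constant: read-only for all work-items
                       local float* scratch,     // local: shared within one work-group
                       const int width)
{
    int gid = get_global_id(0);
    float x = data[gid];                  // private: per work-item automatic variable
    scratch[get_local_id(0)] = x;         // stage the value in local memory
    barrier(CLK_LOCAL_MEM_FENCE);         // make it visible to the whole work-group
    data[gid] = x * coeffs[gid % width];  // write back through global memory
}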
UNDERSTANDING THE HOST PROGRAM
Vector Addition – Host
• The host program is the code that runs on the host to:
  - Set up the environment for the OpenCL program
  - Create and manage kernels
• Key features:
  - Uses common defaults for the platform and command-queue
  - Simplifies the basic API
  - Ability to "call" a kernel from the host, like a regular function
  - Error checking can be performed with C++ exceptions
C++ Interface: setting up the host program
• Setting up the host program means creating:
  - A context for the chosen devices (CL_DEVICE_TYPE_DEFAULT, or CL_DEVICE_TYPE_CPU, CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR, etc.)
  - Device memory
  - One or more command-queues, on which commands (kernel execution, data movement, synchronization) execute
• In-order queues:
  - Commands are enqueued and complete in the order they appear in the host program (program order)
• Out-of-order queues:
  - Commands are enqueued in program order but can execute (and hence complete) in any order
(A minimal sketch of creating a context and queues follows below.)
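A minimal sketch of this step, assuming the C++ wrapper header and a working default platform (out-of-order queues are not supported by every device):

#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>

// Context over the default device type (could also be CL_DEVICE_TYPE_CPU,
// CL_DEVICE_TYPE_GPU, CL_DEVICE_TYPE_ACCELERATOR, ...)
cl::Context context(CL_DEVICE_TYPE_DEFAULT);
cl::Device device = context.getInfo<CL_CONTEXT_DEVICES>()[0];

// In-order queue: commands complete in the order they were enqueued
cl::CommandQueue in_order(context, device);

// Out-of-order queue: commands may execute and complete in any order
cl::CommandQueue out_of_order(context, device,
                              CL_QUEUE_OUT_OF_ORDER_EXEC_MODE_ENABLE);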
2. Create and Build the program
• Define the source code for the kernel program either as a string literal (for small programs) or read it from a file (for large applications); a small sketch follows below.
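For instance (a sketch, assuming the context created above and a trivial illustrative kernel):

// Kernel source as a string literal (small programs only)
std::string source =
    "kernel void square(global float* x)"
    "{ int i = get_global_id(0); x[i] = x[i] * x[i]; }";

// Build for every device in the context; 'true' requests an immediate build
cl::Program program(context, source, true);

// For large applications, read the source from a file instead, e.g.
// cl::Program program(context, util::loadProgram("vadd.cl"), true);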
3. Set up memory objects
• Image object:
  - Defines a two- or three-dimensional region of memory
  - Image data can only be accessed with read and write functions (illustrated in the sketch below)
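A sketch of creating an image object with the C++ wrapper (the sizes, format, and context here are assumptions for illustration):

// A 2D single-channel float image, 512x512 (assumed dimensions)
const size_t width = 512, height = 512;
cl::ImageFormat fmt(CL_R, CL_FLOAT);
cl::Image2D img(context, CL_MEM_READ_ONLY, fmt, width, height);

// Inside a kernel, image data is accessed only through read/write functions, e.g.
//   float4 v = read_imagef(img, sampler, (int2)(x, y));
//   write_imagef(out, (int2)(x, y), v);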
Creating and manipulating buffers
• Buffers are declared on the host as objects of type cl::Buffer
• Arrays in host memory hold the original host-side data:
  std::vector<float> h_a, h_b;
• The kernel is exposed to the host as a functor so it can be called like a function:
  cl::make_kernel<cl::Buffer, cl::Buffer, cl::Buffer> vadd(program, "vadd");
(A sketch wiring the buffers to this vadd functor follows below.)
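Putting these pieces together, a sketch of creating the buffers and launching the vadd functor (this assumes a context, a command queue named queue, a built program, and a problem size N, none of which are shown on this slide):

const size_t N = 1024;                       // problem size (assumed)
std::vector<float> h_a(N, 1.0f), h_b(N, 2.0f), h_c(N);

// Device buffers: a and b are copied in and treated as read-only
cl::Buffer d_a(context, h_a.begin(), h_a.end(), true);
cl::Buffer d_b(context, h_b.begin(), h_b.end(), true);
cl::Buffer d_c(context, CL_MEM_WRITE_ONLY, sizeof(float) * N);

// Launch one work-item per element with the vadd functor declared above,
// then copy the result back to the host
vadd(cl::EnqueueArgs(queue, cl::NDRange(N)), d_a, d_b, d_c);
cl::copy(queue, d_c, h_c.begin(), h_c.end());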
#define __CL_ENABLE_EXCEPTIONS
#include <CL/cl.hpp>
#include <iostream>
#include <vector>

int main() {
  std::vector<float> a = {1.0f, 2.0f, 3.0f, 4.0f};
  std::vector<float> b = {4.0f, 3.0f, 2.0f, 1.0f};
  std::vector<float> c(a.size());
  try {
    // get available OpenCL platforms
    std::vector<cl::Platform> platforms;
    cl::Platform::get(&platforms);
    // choose a platform
    cl::Platform platform = platforms[0];
    // get the platform's devices and choose one
    std::vector<cl::Device> devices;
    platform.getDevices(CL_DEVICE_TYPE_ALL, &devices);
    cl::Device device = devices[0];
  } catch (cl::Error& e) {
    std::cerr << "OpenCL error: " << e.what() << " (" << e.err() << ")"
              << std::endl;
    return 1;
  }
  return 0;
}
An N-dimensional domain of work-items
• Global Dimensions:
  - 1024 x 1024 (the whole problem space)
  - For example, for a 2D problem with dimensions (1024, 1024), the NDRange is defined as (1024, 1024)
• Local Dimensions:
  - 128 x 128 (a work-group, which executes together)
• Synchronization between work-items is possible only within work-groups: barriers and memory fences
• Work-item functions
  - uint get_work_dim() … number of dimensions in use (1, 2, or 3)
  - size_t get_global_id(uint n) … global work-item ID in dimension n
  - size_t get_local_id(uint n) … work-item ID in dimension n inside its work-group
  - size_t get_group_id(uint n) … ID of the work-group in dimension n
  - size_t get_global_size(uint n) … number of work-items in dimension n
  - size_t get_local_size(uint n) … number of work-items per work-group in dimension n
  (The small kernel sketch below shows how these combine into a flat index.)
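A small, made-up 2D kernel illustrating these functions (the kernel name and output buffer are assumptions for the example):

kernel void fill_index(global int* out)
{
    size_t col   = get_global_id(0);      // work-item ID in dimension 0
    size_t row   = get_global_id(1);      // work-item ID in dimension 1
    size_t width = get_global_size(0);    // number of work-items in dimension 0

    // Row-major flattening of the 2D global ID
    out[row * width + col] = (int)(row * width + col);
}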
The BIG idea behind OpenCL
• Replace loops with functions (a kernel) executing at each point in a problem domain
  – E.g., process a 1024x1024 image with one kernel invocation per pixel, i.e. 1024x1024 = 1,048,576 kernel executions (a loop-versus-kernel sketch follows below)
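For example, a sketch in the spirit of the classic vector-add illustration (not copied from the deck): the serial loop on the left of the idea becomes a kernel in which the loop disappears.

// Traditional serial loop on the host
void vadd_serial(int n, const float* a, const float* b, float* c)
{
    for (int i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

// OpenCL kernel: each work-item handles one element of the problem domain
kernel void vadd(global const float* a,
                 global const float* b,
                 global float* c)
{
    int i = get_global_id(0);
    c[i] = a[i] + b[i];
}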
Vector Addition - Kernel
• Synchronization functions
  - Barriers: all work-items within a work-group must execute the barrier function before any work-item can continue (see the local-sum sketch below)
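As an illustration (assumed code, not from the deck), a work-group barrier guarantees that every work-item has written its element to local memory before any work-item reads its neighbours:

kernel void local_sum(global const float* in,
                      global float* out,
                      local float* scratch)
{
    int lid = get_local_id(0);
    int gid = get_global_id(0);

    scratch[lid] = in[gid];
    barrier(CLK_LOCAL_MEM_FENCE);   // every work-item in the group reaches this point first

    // Work-item 0 of each group sums the group's elements
    if (lid == 0) {
        float sum = 0.0f;
        for (int i = 0; i < (int)get_local_size(0); i++)
            sum += scratch[i];
        out[get_group_id(0)] = sum;
    }
}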
Thread ID mapping
• A 2D work-item can be mapped to a linear thread ID tid in several ways (group_size is the number of work-items per work-group):
  - int tid = get_global_id(1) * get_global_size(0) + get_global_id(0);
  - int tid = get_global_id(0) * get_global_size(1) + get_global_id(1);
  - int tid = get_group_id(1) * get_num_groups(0) * group_size
            + get_group_id(0) * group_size
            + get_local_id(1) * get_local_size(0)
            + get_local_id(0);
(The original figure showed the resulting 4x4 grids of thread IDs for each mapping, assuming 2x2 work-groups.)
Thread Mapping for Nvidia
Matrix multiplication performance
C(i,j) = A(i,:) x B(:,j)

Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9

CPU device: Intel® Xeon® E5649 @ 2.53 GHz
GPU device: NVIDIA® Tesla® M2090, with a max of 16 compute units, 512 PEs
*Sizes and performance numbers are approximate and for a high-end discrete GPU, circa 2011
Optimizing matrix multiplication
• MM cost is determined by FLOPS and memory movement:
  - 2·n³ = O(n³) FLOPS
  - operating on 3·n² = O(n²) numbers
• To optimize matrix multiplication, we must ensure that for every memory access we execute as many FLOPS as possible. For n = 1024 this means roughly 2.1 GFLOPs performed on only about 12 MB of data, so every element has to be reused many times.
• Outer-product algorithms are faster, but for pedagogical reasons let's stick to the simple dot-product algorithm:
  C(i,j) = A(i,:) x B(:,j)
Where "global" and "local" are (N), (N,N), or (N,N,N) depending on the dimensionality of the NDRange index space.

An N-dimensional domain of work-items
• Global dimensions: 1024 (1D), the whole problem space (index space)
• Local dimensions: 64 (one work-group of 64 work-items)
The kernel (one row of C per work-item):

kernel void mmul(
    const int N,
    global float *A,
    global float *B,
    global float *C)
{
    int j, k;
    int i = get_global_id(0);
    float tmp;
    for (j = 0; j < N; j++) {
        tmp = 0.0f;
        for (k = 0; k < N; k++)
            tmp += A[i*N+k] * B[k*N+j];
        C[i*N+j] = tmp;
    }
}
Mat. Mul. host program (1 row per work-item)
#define DEVICE CL_DEVICE_TYPE_DEFAULT

// declarations (not shown)
sz = N * N;
std::vector<float> h_A(sz);
std::vector<float> h_B(sz);
std::vector<float> h_C(sz);

cl::Buffer d_A, d_B, d_C;

// initialize matrices and setup
// the problem (not shown)

cl::Context context(DEVICE);
cl::CommandQueue queue(context);
cl::Program program(context,
    util::loadProgram("mmulCrow.cl"), true);

cl::make_kernel<int, cl::Buffer, cl::Buffer, cl::Buffer>
    mmul(program, "mmul");

d_A = cl::Buffer(context, h_A.begin(), h_A.end(), true);
d_B = cl::Buffer(context, h_B.begin(), h_B.end(), true);
d_C = cl::Buffer(context, CL_MEM_WRITE_ONLY, sizeof(float) * sz);

mmul(cl::EnqueueArgs(queue, cl::NDRange(N), cl::NDRange(64)),
     N, d_A, d_B, d_C);

cl::copy(queue, d_C, h_C.begin(), h_C.end());
Changes to the host program:
1. The 1D NDRange is set to the number of rows in the C matrix.
2. The local dimension is set to 64, which gives us 16 work-groups, matching the GPU's number of compute units.
Case                               MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)          887.2          N/A
C(i,j) per work-item, all global   3,926.1        3,720.9
C row per work-item, all global    3,379.5        4,195.8
C(i,j) = A(i,:) x B(:,j)
(*Actually, this is using far more private memory than we'll have, and so Awrk[] will be spilled to global memory; a sketch of the kernel in question follows below.)
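The kernel this note refers to is not reproduced on the slide; a sketch of the idea (each work-item first copies its row of A into a private array Awrk[], assuming N is at most 1024) is:

kernel void mmul(
    const int N,
    global float *A,
    global float *B,
    global float *C)
{
    int j, k;
    int i = get_global_id(0);
    float Awrk[1024];    // private copy of one row of A (assumed maximum size)
    float tmp;

    // Stage row i of A in private memory once, reuse it for every column of B
    for (k = 0; k < N; k++)
        Awrk[k] = A[i*N+k];

    for (j = 0; j < N; j++) {
        tmp = 0.0f;
        for (k = 0; k < N; k++)
            tmp += Awrk[k] * B[k*N+j];
        C[i*N+j] = tmp;
    }
}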
Mat. Mul. host program (Row of A in private memory)
(The host code is the same as in the previous exercise: the same context, command queue, program build, make_kernel functor, buffers, and kernel launch with global NDRange(N) and local NDRange(64), followed by cl::copy of d_C back to the host. Timing and checking of results not shown.)
Host program unchanged from last exercise
Matrix multiplication performance
Case                                   MFLOPS (CPU)   MFLOPS (GPU)
Sequential C (not OpenCL)              887.2          N/A
C(i,j) per work-item, all global       3,926.1        3,720.9
C row per work-item, all global        3,379.5        4,195.8
C row per work-item, A row private     3,385.8        8,584.3  (big impact!)

CPU device: Intel® Xeon® E5649 @ 2.53 GHz
GPU device: NVIDIA® Tesla® M2090, with a max of 16 compute units, 512 PEs
These are not official benchmark results; you may observe completely different results should you run these tests on your own system. Third party names are the property of their owners.
Optimizing matrix multiplication
C(i,j) = A(i,:) x B(:,j)
OpenCL and CUDA
• Compared with CUDA, OpenCL has:
  - More complex platform and device management
  - A more complex kernel launch
  - A lower-level programming model
• Index Space (CUDA grid): defines the work-items and how data is mapped to them
• Work Group (CUDA block): work-items in a work-group can synchronize
(An illustrative term-by-term mapping follows below.)
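As an illustrative side-by-side of the standard correspondences (added here for reference, not from the original deck):

// CUDA term                              ->  OpenCL equivalent
// grid                                   ->  NDRange (index space)
// thread block                           ->  work-group
// thread                                 ->  work-item
// threadIdx.x                            ->  get_local_id(0)
// blockIdx.x                             ->  get_group_id(0)
// blockDim.x                             ->  get_local_size(0)
// gridDim.x                              ->  get_num_groups(0)
// blockIdx.x * blockDim.x + threadIdx.x  ->  get_global_id(0)
// __shared__ memory                      ->  local memory
// __global__ function                    ->  kernel function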
References
Optimizing OpenCL Applications on Intel Xeon Phi (IWOCL 2013):
http://iwocl.org/wp-content/uploads/2013/06/Optimizing-OpenCL-Applications-on-Intel-Xeon-Phi-IWOCL.pdf
RAMSES Project, Romain Teyssier, Pierre-François Lavallée, et al.:
http://irfu.cea.fr/Phocea/Vie_des_labos/Ast/ast_sstechnique.php?id_ast=904