OpenCL 03: Basics
Wolfram Schenck Faculty of Eng. and Math., Bielefeld University of Applied Sciences
Member of the Helmholtz-Association
1 OpenCL Overview
4 Exercise 1
5 Event Handling
6 Exercise 2
OpenCL Overview
CUDA C
  Pro:    Mature and efficient
          Many tools and extra libraries
  Contra: Only usable for GPUs by NVIDIA

OpenCL
  Pro:    For various processor types (vendor-independent)
  Contra: Not as mature and as widely used as CUDA C
Additional Information
• Khronos website:
http://www.khronos.org/opencl
(News, specifications, man pages)
• AMD OpenCL Zone:
http://developer.amd.com/tools-and-sdks/opencl-zone/
• Intel Developer Zone on OpenCL:
Member of the Helmholtz-Association
https://software.intel.com/de-de/forums/opencl
• E-book on OpenCL:
http://www.fixstars.com/en/opencl/book/. . .
. . . /OpenCLProgrammingBook/contents/
Tutorials
• CUDA and OpenCL architectures
Architecture of OpenCL
Fig.: Wikipedia
Platform Model
Platform Model (cont.)
Fig.: Wikipedia
Platform Model (CPU)
CPU
• Device: All CPUs on the mainboard of the computer
system
• Compute unit (CU): One CU per core (or per hardware
thread)
• Processing element (PE): 1 PE per CU, or if PEs are
mapped to SIMD lanes, n PEs per CU, where n matches
the SIMD width
Fermi: GF100/GF110
16 Streaming Multi–Processors (SM)
Die shot
Fig.: NVIDIA
GF100/GF110: Streaming Multi–Processor (SM)
SM properties
• 32 CUDA cores (streaming processors, SP)
• 16 load/store units
• 4 special function units (SFU)
• 2 warp schedulers
Fig.: NVIDIA
Platform Model (GPU and MIC)
GPU
• Device: Each GPU in the system acts as a single device
• Compute unit (CU): One CU per multi–processor (NVIDIA)
• Processing element (PE): 1 PE per CUDA core (NVIDIA) or
“SIMD lane” (AMD)
MIC
• Device: Each MIC in the system acts as a single device
• Compute unit (CU): One CU per hardware thread
(= 4 × [# of cores − 1])
Platform Model (cont.)
Platform
• Every OpenCL implementation (with its underlying OpenCL library) defines a so-called “platform”.
• Each specific platform enables the host to control the devices belonging to it.
• Platforms of various manufacturers can coexist on one host and may be used from within a single application.
Fig.: Wikipedia
Platform Model
Practical Hints
Sx, Sy: Number of work-items within a single work-group
sx, sy: Indices of a work-item within a work-group
wx, wy: Indices of a work-group within the NDRange
Fx, Fy: Global index offset
Fig.: „The OpenCL Specification 1.1“, Fig. 3.2
Excursus: Thread Management on GPUs
Kernel
• Function for execution on the device (here: GPU)
• Typical scenario: Many kernel instantiations running
simultaneously in parallel threads
Challenge
Management of many thousands of threads
Solution
“Coarse-grained parallelism” combined with “fine-grained parallelism”
Thread Management (cont.)
Latency Hiding
Thread Management (cont.)
(numbers for GF110)
NVIDIA Tesla Graphics Cards in Comparison
# of multi-procs. (SM/SMX)           30        16        14        15
# of CUDA cores (per SM/SMX)          8        32       192       192
# of CUDA cores (overall)           240       512      2688      2880
Clock (core/shader) [MHz]      602/1296  650/1300   735/735   745/875
GFLOPs (SP)                         933      1331      3951      4290
GFLOPs (DP)                          78       665      1317      1430
Memory bandwidth [GB/sec]           102       177       250       288
# of registers (per SM/SMX)       16384     32768     65536     65536
Shared mem. (per SM/SMX) [KB]        32     16-48     16-48     16-48
L1 cache (per SM/SMX) [KB]            0     16-48     16-48     16-48
L2 cache [KB]                         0       768      1536      1536
Max threads per SM/SMX             1024      1536      2048      2048
Max blocks per SM/SMX                 8         8        16        16
Max threads in flight             30720     24576     30720     30720
Execution Model
Components
• Basic distinction:
  - Host: executes the host program
  - Device: executes the device kernels
• Hierarchy on the device:
  - NDRange → work-group → work-item
• Host defines the context:
  - Devices (only from a single platform!)
  - Kernels (OpenCL functions for execution on the device)
  - Program objects (kernel source code and kernels in compiled form)
  - Memory objects
• Host manages queues:
  - Kernel execution
  - Operations on memory objects
  - Synchronization
  - Variants: in-order and out-of-order execution
Fig.: Wikipedia
Memory Model
Memory Model
Allocation and Access
Table: „The OpenCL Specification 1.1“, Tab. 3.1
Programming Model
Supported Approaches
Data Parallel
• Possible mappings between data and NDRange:
  - Strict 1:1 mapping: one work-item for each data element
  - More flexible mappings are also possible
• Favored device class: GPUs
Task Parallel
• Execution of only a single kernel instance
(equivalent to an NDRange with only one work–item)
• Parallelism via:
The „Big Picture“
Fig.: Khronos Group
OpenCL
Basic Programming Steps in Host Code
OpenCL Host API
Basic Programming Steps…
…in Practice
Query Platforms
Query Platforms (cont.)
Related functions
• clGetPlatformInfo(..)
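Querying platforms follows the usual two-call idiom: first ask for the count, then fetch the IDs. A minimal, illustrative sketch (error handling reduced to the bare minimum; requires an OpenCL runtime to run):

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cl_uint num_platforms = 0;
    /* First call: query only the number of available platforms. */
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS || num_platforms == 0) {
        fprintf(stderr, "No OpenCL platforms found (error %d)\n", err);
        return EXIT_FAILURE;
    }

    /* Second call: fetch the platform IDs themselves. */
    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    free(platforms);
    return EXIT_SUCCESS;
}
```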
Query Devices
Precondition: Platform exists
Query Devices (cont.)
Related functions
• clGetDeviceInfo(..)
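The device query works analogously to the platform query. A minimal sketch restricted to GPU devices (the fixed array size of 16 is an assumption for brevity):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: list all GPU devices of a given platform
 * (two-call idiom, analogous to clGetPlatformIDs). */
void list_gpu_devices(cl_platform_id platform) {
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    if (num_devices > 16)
        num_devices = 16;  /* assumption: at most 16 GPUs */

    cl_device_id devices[16];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device %u: %s\n", i, name);
    }
}
```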
Create Context
Precondition: Device exists
cl_context
clCreateContext ( const cl_context_properties * properties ,
                  cl_uint num_devices ,
                  const cl_device_id * devices ,
                  void ( CL_CALLBACK * pfn_notify ) (
                      const char * errinfo ,
                      const void * private_info , size_t cb ,
                      void * user_data
                  ),
                  void * user_data ,
                  cl_int * errcode_ret );
Creation of a context
Return value : The created context
properties : Bit field for the definition of the desired properties of the context
num_devices : Number of devices for which the context shall be created
devices : Array with devices for which the context shall be created
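A minimal sketch for a single device; the properties and the error callback are simply left NULL here:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: create a context for exactly one device.
 * properties == NULL and pfn_notify == NULL keep the example short. */
cl_context create_context_for(cl_device_id device) {
    cl_int err = CL_SUCCESS;
    cl_context context =
        clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateContext failed: %d\n", err);
        return NULL;
    }
    return context;
}
```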
Create Queue
Precondition: Context and device exist
cl_command_queue
clCreateCommandQueue ( cl_context context ,
cl_device_id device ,
cl_command_queue_properties properties ,
cl_int * errcode_ret );
Creation of a queue
Return value : The created queue
context : Context within which the queue shall be created
device : Device for which the queue shall be created
properties : Bit field for the definition of the desired properties of the queue
errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Hint
The default mode for queues is “in order execution” (other settings possible
via parameter properties).
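A minimal sketch; passing 0 as properties selects the default in-order mode mentioned in the hint above:

```c
#include <CL/cl.h>

/* Sketch: create a default (in-order) command queue for one device
 * of the given context. */
cl_command_queue create_queue(cl_context context, cl_device_id device) {
    cl_int err = CL_SUCCESS;
    cl_command_queue queue =
        clCreateCommandQueue(context, device, 0 /* in-order */, &err);
    return (err == CL_SUCCESS) ? queue : NULL;
}
```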
Create Program Object
Precondition: Context and source code exist
cl_program
clCreateProgramWithSource ( cl_context context ,
cl_uint count ,
const char ** strings ,
const size_t * lengths ,
cl_int * errcode_ret );
Creation of a program object
Return value : The created program object
count : Number of char buffers with source code (see strings)
context : Context within which the program object shall be created
strings : Array with pointers to the char buffers containing the source code
lengths : Array specifying the length of each char buffer (in bytes)
errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Hint
Most often the char buffers have been read in before from text files with the
OpenCL source code (file extension: .cl). For small applications, there is
often only one char buffer which contains the complete source code.
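A minimal sketch for the common single-buffer case; passing NULL for lengths treats the source string as null-terminated:

```c
#include <CL/cl.h>

/* Sketch: create a program object from one null-terminated source
 * string (e.g. the complete content of a .cl file read in before). */
cl_program create_program(cl_context context, const char *source) {
    cl_int err = CL_SUCCESS;
    /* lengths == NULL: the string is treated as null-terminated. */
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);
    return (err == CL_SUCCESS) ? program : NULL;
}
```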
Compile Program
Precondition: Program object and device(s) exist
cl_int
clBuildProgram ( cl_program program ,
cl_uint num_devices ,
const cl_device_id * device_list ,
const char * options ,
void ( CL_CALLBACK * pfn_notify )(
cl_program program , void * user_data
),
void * user_data );
device_list : Array with devices for which the program shall be compiled
(these must belong to the same context as the program object!)
options : Char string with compiler options
Compile Program (cont.)
Hints
• For more details : see OpenCL man pages
• The right compiler implementation is used automatically: the one
  from the OpenCL library which implements the platform for the devices
  belonging to the context of the program object.
Related functions
• clGetProgramBuildInfo(..)
  → Query the build status and the compiler log (with error messages)
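A minimal sketch combining clBuildProgram(..) with a build-log query on failure (the fixed log buffer size is an assumption for brevity):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: build a program for one device and print the build log
 * on failure; the log contains the compiler's error messages. */
int build_program(cl_program program, cl_device_id device) {
    cl_int err = clBuildProgram(program, 1, &device,
                                "" /* no compiler options */, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[16384];  /* assumption: log fits into 16 KB */
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "Build failed:\n%s\n", log);
    }
    return err;
}
```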
Create Kernel
Precondition: Program object with compiled code exists
Hint
The kernel is afterwards available for all devices which were contained in the device_list when clBuildProgram(..) was called before.
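The kernel object itself is obtained with clCreateKernel(..). A minimal sketch; the kernel name "vectoradd" is just an example matching the exercise code:

```c
#include <CL/cl.h>

/* Sketch: create a kernel object for the __kernel function named
 * "vectoradd" (example name) inside a compiled program object. */
cl_kernel create_kernel(cl_program program) {
    cl_int err = CL_SUCCESS;
    cl_kernel kernel = clCreateKernel(program, "vectoradd", &err);
    return (err == CL_SUCCESS) ? kernel : NULL;
}
```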
Create Memory Objects
Precondition: Context exists
Explanation of the Parameter flags (Combined by Bitwise OR within the Bit Field)

Flag                     Meaning
CL_MEM_READ_WRITE        Memory object will be read and written by a kernel.
CL_MEM_READ_ONLY         Memory object will only be read by a kernel.
CL_MEM_WRITE_ONLY        Memory object will only be written by a kernel.
CL_MEM_USE_HOST_PTR      The buffer shall be located in host memory at address
                         host_ptr (content may be cached in device memory).
                         Not combinable with CL_MEM_ALLOC_HOST_PTR or
                         CL_MEM_COPY_HOST_PTR.
CL_MEM_ALLOC_HOST_PTR    The buffer will be newly allocated in host memory
                         (→ in some implementations page-locked memory!).
CL_MEM_COPY_HOST_PTR     The buffer will be initialized with the content of
                         the memory region to which host_ptr points.
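A minimal sketch using two of these flags together to create and initialize a read-only input buffer (a buffer of floats is an assumption for illustration):

```c
#include <CL/cl.h>

/* Sketch: create a read-only device buffer for n floats and
 * initialize it with the content of host array a
 * (CL_MEM_COPY_HOST_PTR copies from host_ptr at creation time). */
cl_mem create_input_buffer(cl_context context, const float *a, size_t n) {
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), (void *)a, &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}
```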
Set Kernel Arguments
Precondition: Kernel exists
cl_int clSetKernelArg ( cl_kernel kernel ,
cl_uint arg_index ,
size_t arg_size ,
const void * arg_value );
Hints
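A minimal sketch for a hypothetical three-argument vector-addition kernel; buffer arguments are passed as pointers to their cl_mem handles:

```c
#include <CL/cl.h>

/* Sketch: set the arguments of a kernel with the (example) signature
 *   __kernel void vectoradd(__global const float *a,
 *                           __global const float *b,
 *                           __global float *c)
 * Buffer objects are passed via pointer to the cl_mem handle. */
void set_vectoradd_args(cl_kernel kernel, cl_mem a, cl_mem b, cl_mem c) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
}
```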
Execution Model (repeated)
Example: 2D–Arrangement of Work–Items
Sx, Sy: Number of work-items within a single work-group
sx, sy: Indices of a work-item within a work-group
wx, wy: Indices of a work-group within the NDRange
Fx, Fy: Global index offset
Fig.: „The OpenCL Specification 1.1“, Fig. 3.2
Kernel Execution
Precondition: Queue and kernel exist, kernel arguments already set
cl_int
clEnqueueNDRangeKernel ( cl_command_queue command_queue ,
cl_kernel kernel ,
cl_uint work_dim ,
const size_t * global_work_offset ,
const size_t * global_work_size ,
const size_t * local_work_size ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Kernel Execution (cont.)
Hints
• The size and structure of the NDRange are defined by the parameters
global_work_offset, global_work_size and local_work_size.
• If local_work_size is set to NULL, the size of the work–groups will be
automatically determined.
• For details on event handling : see later slides
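A minimal sketch for a 1D NDRange; the work-group size of 64 is an arbitrary example and assumes n is a multiple of 64:

```c
#include <CL/cl.h>

/* Sketch: enqueue a kernel over a 1D NDRange of n work-items,
 * organized in work-groups of 64 (assumes n % 64 == 0). */
cl_int run_kernel_1d(cl_command_queue queue, cl_kernel kernel, size_t n) {
    size_t global_work_size = n;
    size_t local_work_size  = 64;
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,       /* work_dim */
                                  NULL,    /* global_work_offset */
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL, NULL);  /* no events */
}
```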
Transfer Data from Device to Host
Precondition: Queue exists
cl_int clEnqueueReadBuffer ( cl_command_queue command_queue ,
cl_mem buffer ,
cl_bool blocking_read ,
size_t offset ,
size_t cb ,
void * ptr ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Copy buffer content into host memory (e.g., buffer with results after kernel execution)
Return value : Error code (ideally equal to CL_SUCCESS)
command_queue : Queue which shall be used for execution
buffer : Buffer object which serves as source of the copy operation
blocking_read : If true, the function returns only after the copy operation has
                finished (and therefore, with a queue in “in-order mode”, after
                all preceding commands in the queue have finished as well)
Hint
In analogy to the release functions, retain functions also exist for many
types of OpenCL objects. The retain functions increase an object-internal
counter, the release functions decrease it. Only after all retain calls have
been compensated by a release call will the next release call ultimately
free the resources of the object.
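A minimal sketch of a blocking read of n floats back into host memory, e.g. to fetch results after kernel execution:

```c
#include <CL/cl.h>

/* Sketch: blocking read of n floats from a device buffer into host
 * memory; returns only after the copy operation has finished. */
cl_int read_results(cl_command_queue queue, cl_mem buffer,
                    float *results, size_t n) {
    return clEnqueueReadBuffer(queue, buffer,
                               CL_TRUE,            /* blocking read */
                               0,                  /* offset */
                               n * sizeof(float),  /* bytes to copy */
                               results,
                               0, NULL, NULL);     /* no events */
}
```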
OpenCL for Compute Kernels
Basic Facts about “OpenCL C”
Qualifiers and Functions
• Function qualifiers:
  - The __kernel qualifier declares a function as a kernel, i.e. makes it
    visible to host code so that it can be enqueued
• Address space qualifiers:
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space
    qualifier (excl. __private)
• Work-item functions:
  - get_work_dim(), get_global_id(), get_local_id(),
    get_group_id(), etc.
• Synchronization functions:
  - Barriers: all work-items within a work-group must execute the
    barrier function before any work-item may continue beyond it
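A minimal illustrative kernel in OpenCL C showing the qualifiers and work-item functions above (name and signature are examples only, matching the vector-addition theme of the exercises):

```c
/* OpenCL C device code (e.g. in a .cl file) -- illustrative sketch. */
__kernel void vectoradd(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    /* Global index of this work-item in dimension 0. */
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```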
Restrictions
Exercise 1
Task and Hints
Task
• Implement the addition of three vectors instead of two!
Hints
• Copy project files from
train025@zam1069:OpenCL_Course/OpenCL_Basics/example
• Use host code in VectorAddition.C as starting point
• Use device code in vectoradd.cl as starting point
• Adjust the settings in the Makefile to the computer system which is used
Event Handling
Further Useful Functions…
…for Event Handling and Queues
clFinish(..)
Blocks until all previously queued OpenCL commands in command_queue have been
issued to the associated device and have completed. clFinish(..) is therefore
also a synchronization point.
Return value : Error code (ideally equal to CL_SUCCESS)
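A minimal sketch of event-based ordering: a non-blocking read waits on the kernel's completion event, and clFinish(..) finally blocks the host. This pattern is useful with out-of-order queues (function and parameter names are illustrative):

```c
#include <CL/cl.h>

/* Sketch: chain a kernel and a read via an event, then drain the
 * queue. With an out-of-order queue, the event is what guarantees
 * that the read starts only after the kernel has completed. */
cl_int run_and_wait(cl_command_queue queue, cl_kernel kernel,
                    cl_mem buffer, float *host_result, size_t n) {
    size_t global = n;
    cl_event kernel_done;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global, NULL,
                                        0, NULL, &kernel_done);
    if (err != CL_SUCCESS) return err;

    /* Non-blocking read that waits on the kernel-completion event. */
    err = clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0,
                              n * sizeof(float), host_result,
                              1, &kernel_done, NULL);
    if (err != CL_SUCCESS) return err;

    return clFinish(queue);  /* synchronization point for the host */
}
```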
Exercise 2
Task and Hints
Task
• Implement a second kernel for element–wise vector
multiplication!
• Compute with both kernels (multiplication and pair–wise
addition) the equation e = a ∗ b + c ∗ d as element–wise
vector operation!
• BONUS: Use an out–of–order queue instead of the default
queue. . .
• . . . and ensure by using events that all commands are
executed in the right order!
Hints
• Extend your code from exercise 1
Appendix: Notes on Nomenclature
Nomenclature
AMD vs. NVIDIA
AMD                               NVIDIA
—                                 Texture Processing Cluster (TPC),
                                  Graphics Processing Cluster (GPC)
SIMD core                         Streaming Multi-Processor (SM, SMX)
  (GCN arch.: Compute Unit)
GCN arch.: SIMD                   —
SIMD unit (“SIMD lane”)           Streaming Processor (SP), CUDA Core
Wavefront                         Warp
Local Data Share                  Shared Memory
Nomenclature
OpenCL vs. CUDA
OpenCL                  CUDA
Work-Item               Thread
Work-Group              Block
NDRange (Workspace)     Grid
Local Memory            Shared Memory
Private Memory          Registers/Local Memory
Image                   Texture
Queue                   Stream
Event                   Event