OpenCL 03: Basics
Wolfram Schenck Faculty of Eng. and Math., Bielefeld University of Applied Sciences
Member of the Helmholtz-Association
1 OpenCL Overview
4 Exercise 1
5 Event Handling
6 Exercise 2
OpenCL Overview
CUDA C
  Pro:    Mature and efficient
          Many tools and extra libraries
  Contra: Only usable for GPUs by NVIDIA

OpenCL
  Pro:    For various processor types (vendor-independent)
  Contra: Not as mature and as widely used as CUDA C
Additional Information
• Khronos website:
http://www.khronos.org/opencl
(News, specifications, man pages)
• AMD OpenCL Zone:
http://developer.amd.com/tools-and-sdks/opencl-zone/
• Intel Developer Zone on OpenCL:
Member of the Helmholtz-Association
https://software.intel.com/de-de/forums/opencl
• E-book on OpenCL:
http://www.fixstars.com/en/opencl/book/. . .
. . . /OpenCLProgrammingBook/contents/
Tutorials
• CUDA and OpenCL architectures
Architecture of OpenCL
Fig.: Wikipedia
Platform Model
Platform Model (cont.)
Fig.: Wikipedia
Platform Model (CPU)
CPU
• Device: All CPUs on the mainboard of the computer
system
• Compute unit (CU): One CU per core (or per hardware
thread)
• Processing element (PE): 1 PE per CU, or if PEs are
mapped to SIMD lanes, n PEs per CU, where n matches
the SIMD width
Fermi: GF100/GF110
16 Streaming Multi–Processors (SM)
Die shot
Fig.: NVIDIA
GF100/GF110: Streaming Multi–Processor (SM)
SM properties
• 32 CUDA cores (streaming processors, SP)
• 16 load/store units
• 4 special function units (SFU)
• 2 warp schedulers
Fig.: NVIDIA
Platform Model (GPU and MIC)
GPU
• Device: Each GPU in the system acts as a single device
• Compute unit (CU): One CU per multi–processor (NVIDIA)
• Processing element (PE): 1 PE per CUDA core (NVIDIA) or
“SIMD lane” (AMD)
MIC
• Device: Each MIC in the system acts as a single device
• Compute unit (CU): One CU per hardware thread
(= 4 × [# of cores − 1])
Platform Model (cont.)
Platform
• Every OpenCL implementation (with its underlying OpenCL library) defines a so-called “platform”.
• Each specific platform enables the host to control the devices belonging to it.
• Platforms of various manufacturers can coexist on one host and may be used from within a single application.
Fig.: Wikipedia
Platform Model
Practical Hints
Sx, Sy: Number of work-items within a single work-group
sx, sy: Indices of a work-item within a work-group
wx, wy: Indices of a work-group within the NDRange
Fx, Fy: Global index offset
Fig.: „The OpenCL Specification 1.1“, Fig. 3.2
Excursus: Thread Management on GPUs
Kernel
• Function for execution on the device (here: GPU)
• Typical scenario: Many kernel instantiations running
simultaneously in parallel threads
Challenge
Management of many thousands of threads
Solution
“Coarse-grained parallelism” combined with “fine-grained parallelism”
Thread Management (cont.)
Latency Hiding
Thread Management (cont.)
(numbers for GF110)
NVIDIA Tesla Graphics Cards in Comparison
# of multi-procs. (SM/SMX)           30        16        14        15
# of CUDA cores (per SM/SMX)          8        32       192       192
# of CUDA cores (overall)           240       512      2688      2880
Clock (core/shader) [MHz]      602/1296  650/1300   735/735   745/875
GFLOPs (SP)                         933      1331      3951      4290
GFLOPs (DP)                          78       665      1317      1430
Memory bandwidth [GB/sec]           102       177       250       288
# of registers (per SM/SMX)       16384     32768     65536     65536
Shared mem. (per SM/SMX) [KB]        32     16-48     16-48     16-48
L1 cache (per SM/SMX) [KB]            0     16-48     16-48     16-48
L2 cache [KB]                         0       768      1536      1536
Max threads per SM/SMX             1024      1536      2048      2048
Max blocks per SM/SMX                 8         8        16        16
Max threads in flight             30720     24576     30720     30720
Execution Model
Components
• Basic distinction:
  - Host: executes the host program
  - Device: executes the device kernels
• Hierarchy on the device:
  - NDRange → work-group → work-item
• Host defines the context:
  - Devices (only from a single platform!)
  - Kernels (OpenCL functions for execution on the device)
  - Program objects (kernel source code and kernels in compiled form)
  - Memory objects
• Host manages queues:
  - Kernel execution
  - Operations on memory objects
  - Synchronization
  - Variants: in-order and out-of-order execution
Fig.: Wikipedia
Memory Model
Memory Model
Allocation and Access
Table: „The OpenCL Specification 1.1“, Tab. 3.1
Programming Model
Supported Approaches
Data Parallel
• Possible mappings between data and NDRange:
  - Strict 1:1 mapping: one work-item for each data element
  - More flexible mappings are also possible
• Favored device class: GPUs
Task Parallel
• Execution of only a single kernel instance
(equivalent to an NDRange with only one work–item)
• Parallelism via:
The „Big Picture“
Fig.: Khronos Group
OpenCL
Basic Programming Steps in Host Code
OpenCL Host API
Basic Programming Steps…
…in Practice
Query Platforms
Query Platforms (cont.)
Related functions
• clGetPlatformInfo(..)
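Querying platforms follows the usual two-call idiom: first ask for the count, then fetch the IDs. A minimal, illustrative sketch (error handling reduced to the bare minimum; requires an OpenCL runtime to run):

```c
#include <CL/cl.h>
#include <stdio.h>
#include <stdlib.h>

int main(void) {
    cl_uint num_platforms = 0;
    /* First call: query only the number of available platforms. */
    cl_int err = clGetPlatformIDs(0, NULL, &num_platforms);
    if (err != CL_SUCCESS || num_platforms == 0) {
        fprintf(stderr, "No OpenCL platforms found (error %d)\n", err);
        return EXIT_FAILURE;
    }

    /* Second call: fetch the platform IDs themselves. */
    cl_platform_id *platforms = malloc(num_platforms * sizeof(cl_platform_id));
    clGetPlatformIDs(num_platforms, platforms, NULL);

    for (cl_uint i = 0; i < num_platforms; ++i) {
        char name[256];
        clGetPlatformInfo(platforms[i], CL_PLATFORM_NAME,
                          sizeof(name), name, NULL);
        printf("Platform %u: %s\n", i, name);
    }
    free(platforms);
    return EXIT_SUCCESS;
}
```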
Query Devices
Precondition: Platform exists
Query Devices (cont.)
Related functions
• clGetDeviceInfo(..)
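The device query works analogously to the platform query. A minimal sketch restricted to GPU devices (the fixed array size of 16 is an assumption for brevity):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: list all GPU devices of a given platform
 * (two-call idiom, analogous to clGetPlatformIDs). */
void list_gpu_devices(cl_platform_id platform) {
    cl_uint num_devices = 0;
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, 0, NULL, &num_devices);
    if (num_devices > 16)
        num_devices = 16;  /* assumption: at most 16 GPUs */

    cl_device_id devices[16];
    clGetDeviceIDs(platform, CL_DEVICE_TYPE_GPU, num_devices, devices, NULL);

    for (cl_uint i = 0; i < num_devices; ++i) {
        char name[256];
        clGetDeviceInfo(devices[i], CL_DEVICE_NAME, sizeof(name), name, NULL);
        printf("Device %u: %s\n", i, name);
    }
}
```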
Create Context
Precondition: Device exists
cl_context
clCreateContext ( const cl_context_properties * properties ,
                  cl_uint num_devices ,
                  const cl_device_id * devices ,
                  void ( CL_CALLBACK * pfn_notify ) (
                      const char * errinfo ,
                      const void * private_info , size_t cb ,
                      void * user_data
                  ),
                  void * user_data ,
                  cl_int * errcode_ret );
Creation of a context
Return value : The created context
properties : Bit field for the definition of the desired properties of the context
num_devices : Number of devices for which the context shall be created
devices : Array with devices for which the context shall be created
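A minimal sketch for a single device; the properties and the error callback are simply left NULL here:

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: create a context for exactly one device.
 * properties == NULL and pfn_notify == NULL keep the example short. */
cl_context create_context_for(cl_device_id device) {
    cl_int err = CL_SUCCESS;
    cl_context context =
        clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    if (err != CL_SUCCESS) {
        fprintf(stderr, "clCreateContext failed: %d\n", err);
        return NULL;
    }
    return context;
}
```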
Create Queue
Precondition: Context and device exist
cl_command_queue
clCreateCommandQueue ( cl_context context ,
cl_device_id device ,
cl_command_queue_properties properties ,
cl_int * errcode_ret );
Creation of a queue
Return value : The created queue
context : Context within which the queue shall be created
device : Device for which the queue shall be created
properties : Bit field for the definition of the desired properties of the queue
errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Hint
The default mode for queues is “in order execution” (other settings possible
via parameter properties).
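A minimal sketch; passing 0 as properties selects the default in-order mode mentioned in the hint above:

```c
#include <CL/cl.h>

/* Sketch: create a default (in-order) command queue for one device
 * of the given context. */
cl_command_queue create_queue(cl_context context, cl_device_id device) {
    cl_int err = CL_SUCCESS;
    cl_command_queue queue =
        clCreateCommandQueue(context, device, 0 /* in-order */, &err);
    return (err == CL_SUCCESS) ? queue : NULL;
}
```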
Create Program Object
Precondition: Context and source code exist
cl_program
clCreateProgramWithSource ( cl_context context ,
cl_uint count ,
const char ** strings ,
const size_t * lengths ,
cl_int * errcode_ret );
Creation of a program object
Return value : The created program object
count : Number of char buffers with source code (see strings)
context : Context within which the program object shall be created
strings : Array with pointers to the char buffers containing the source code
lengths : Array specifying the length of each char buffer (in bytes)
errcode_ret : Returns the error code (ideally equal to CL_SUCCESS)
Hint
Most often the char buffers have been read in before from text files with the
OpenCL source code (file extension: .cl). For small applications, there is
often only one char buffer which contains the complete source code.
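A minimal sketch for the common single-buffer case; passing NULL for lengths treats the source string as null-terminated:

```c
#include <CL/cl.h>

/* Sketch: create a program object from one null-terminated source
 * string (e.g. the complete content of a .cl file read in before). */
cl_program create_program(cl_context context, const char *source) {
    cl_int err = CL_SUCCESS;
    /* lengths == NULL: the string is treated as null-terminated. */
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);
    return (err == CL_SUCCESS) ? program : NULL;
}
```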
Compile Program
Precondition: Program object and device(s) exist
cl_int
clBuildProgram ( cl_program program ,
cl_uint num_devices ,
const cl_device_id * device_list ,
const char * options ,
void ( CL_CALLBACK * pfn_notify )(
cl_program program , void * user_data
),
void * user_data );
device_list : Array with devices for which the program shall be compiled
(these must belong to the same context as the program object!)
options : Char string with compiler options
Compile Program (cont.)
Hints
• For more details : see OpenCL man pages
• The right compiler implementation is used automatically: the one
  from the OpenCL library which implements the platform for the devices
  belonging to the context of the program object.
Related functions
• clGetProgramBuildInfo(..)
  → Query the build status and the compiler log (with error messages)
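A minimal sketch combining clBuildProgram(..) with a build-log query on failure (the fixed log buffer size is an assumption for brevity):

```c
#include <CL/cl.h>
#include <stdio.h>

/* Sketch: build a program for one device and print the build log
 * on failure; the log contains the compiler's error messages. */
int build_program(cl_program program, cl_device_id device) {
    cl_int err = clBuildProgram(program, 1, &device,
                                "" /* no compiler options */, NULL, NULL);
    if (err != CL_SUCCESS) {
        char log[16384];  /* assumption: log fits into 16 KB */
        clGetProgramBuildInfo(program, device, CL_PROGRAM_BUILD_LOG,
                              sizeof(log), log, NULL);
        fprintf(stderr, "Build failed:\n%s\n", log);
    }
    return err;
}
```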
Create Kernel
Precondition: Program object with compiled code exists
Hint
The kernel is afterwards available for all devices which were contained in the device_list when clBuildProgram(..) was called before.
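The kernel object itself is obtained with clCreateKernel(..). A minimal sketch; the kernel name "vectoradd" is just an example matching the exercise code:

```c
#include <CL/cl.h>

/* Sketch: create a kernel object for the __kernel function named
 * "vectoradd" (example name) inside a compiled program object. */
cl_kernel create_kernel(cl_program program) {
    cl_int err = CL_SUCCESS;
    cl_kernel kernel = clCreateKernel(program, "vectoradd", &err);
    return (err == CL_SUCCESS) ? kernel : NULL;
}
```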
Create Memory Objects
Precondition: Context exists
Explanation of the Parameter flags (Combined by Bitwise OR within the Bit Field)

Flag                     Meaning
CL_MEM_READ_WRITE        Memory object will be read and written by a kernel.
CL_MEM_READ_ONLY         Memory object will only be read by a kernel.
CL_MEM_WRITE_ONLY        Memory object will only be written by a kernel.
CL_MEM_USE_HOST_PTR      The buffer shall be located in host memory at address
                         host_ptr (content may be cached in device memory).
                         Not combinable with CL_MEM_ALLOC_HOST_PTR or
                         CL_MEM_COPY_HOST_PTR.
CL_MEM_ALLOC_HOST_PTR    The buffer will be newly allocated in host memory
                         (→ in some implementations page-locked memory!).
CL_MEM_COPY_HOST_PTR     The buffer will be initialized with the content of
                         the memory region to which host_ptr points.
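A minimal sketch using two of these flags together to create and initialize a read-only input buffer (a buffer of floats is an assumption for illustration):

```c
#include <CL/cl.h>

/* Sketch: create a read-only device buffer for n floats and
 * initialize it with the content of host array a
 * (CL_MEM_COPY_HOST_PTR copies from host_ptr at creation time). */
cl_mem create_input_buffer(cl_context context, const float *a, size_t n) {
    cl_int err = CL_SUCCESS;
    cl_mem buf = clCreateBuffer(context,
                                CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                n * sizeof(float), (void *)a, &err);
    return (err == CL_SUCCESS) ? buf : NULL;
}
```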
Set Kernel Arguments
Precondition: Kernel exists
cl_int clSetKernelArg ( cl_kernel kernel ,
cl_uint arg_index ,
size_t arg_size ,
const void * arg_value );
Hints
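A minimal sketch for a hypothetical three-argument vector-addition kernel; buffer arguments are passed as pointers to their cl_mem handles:

```c
#include <CL/cl.h>

/* Sketch: set the arguments of a kernel with the (example) signature
 *   __kernel void vectoradd(__global const float *a,
 *                           __global const float *b,
 *                           __global float *c)
 * Buffer objects are passed via pointer to the cl_mem handle. */
void set_vectoradd_args(cl_kernel kernel, cl_mem a, cl_mem b, cl_mem c) {
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &a);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &b);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &c);
}
```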
Execution Model (repeated)
Example: 2D–Arrangement of Work–Items
Sx, Sy: Number of work-items within a single work-group
sx, sy: Indices of a work-item within a work-group
wx, wy: Indices of a work-group within the NDRange
Fx, Fy: Global index offset
Fig.: „The OpenCL Specification 1.1“, Fig. 3.2
Kernel Execution
Precondition: Queue and kernel exist, kernel arguments already set
cl_int
clEnqueueNDRangeKernel ( cl_command_queue command_queue ,
cl_kernel kernel ,
cl_uint work_dim ,
const size_t * global_work_offset ,
const size_t * global_work_size ,
const size_t * local_work_size ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Kernel Execution (cont.)
Hints
• The size and structure of the NDRange are defined by the parameters
global_work_offset, global_work_size and local_work_size.
• If local_work_size is set to NULL, the size of the work–groups will be
automatically determined.
• For details on event handling : see later slides
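A minimal sketch for a 1D NDRange; the work-group size of 64 is an arbitrary example and assumes n is a multiple of 64:

```c
#include <CL/cl.h>

/* Sketch: enqueue a kernel over a 1D NDRange of n work-items,
 * organized in work-groups of 64 (assumes n % 64 == 0). */
cl_int run_kernel_1d(cl_command_queue queue, cl_kernel kernel, size_t n) {
    size_t global_work_size = n;
    size_t local_work_size  = 64;
    return clEnqueueNDRangeKernel(queue, kernel,
                                  1,       /* work_dim */
                                  NULL,    /* global_work_offset */
                                  &global_work_size,
                                  &local_work_size,
                                  0, NULL, NULL);  /* no events */
}
```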
Transfer Data from Device to Host
Precondition: Queue exists
cl_int clEnqueueReadBuffer ( cl_command_queue command_queue ,
cl_mem buffer ,
cl_bool blocking_read ,
size_t offset ,
size_t cb ,
void * ptr ,
cl_uint num_events_in_wait_list ,
const cl_event * event_wait_list ,
cl_event * event );
Copy buffer content into host memory (e.g., buffer with results after kernel execution)
Return value : Error code (ideally equal to CL_SUCCESS)
command_queue : Queue which shall be used for execution
buffer : Buffer object which serves as source of the copy operation
blocking_read : If true, the function returns only after the copy operation has
                finished (and therefore, with a queue in “in-order mode”, after
                all preceding commands in the queue have finished as well)
Hint
In analogy to the release functions, retain functions also exist for many
types of OpenCL objects. The retain functions increase an object-internal
counter, the release functions decrease it. Only after all retain calls have
been compensated by a release call will the next release call ultimately
free the resources of the object.
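A minimal sketch of a blocking read of n floats back into host memory, e.g. to fetch results after kernel execution:

```c
#include <CL/cl.h>

/* Sketch: blocking read of n floats from a device buffer into host
 * memory; returns only after the copy operation has finished. */
cl_int read_results(cl_command_queue queue, cl_mem buffer,
                    float *results, size_t n) {
    return clEnqueueReadBuffer(queue, buffer,
                               CL_TRUE,            /* blocking read */
                               0,                  /* offset */
                               n * sizeof(float),  /* bytes to copy */
                               results,
                               0, NULL, NULL);     /* no events */
}
```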
OpenCL for Compute Kernels
Basic Facts about “OpenCL C”
Qualifiers and Functions
• Function qualifiers:
  - The __kernel qualifier declares a function as a kernel, i.e. makes it
    visible to host code so that it can be enqueued
• Address space qualifiers:
  - __global, __local, __constant, __private
  - Pointer kernel arguments must be declared with an address space
    qualifier (excl. __private)
• Work-item functions:
  - get_work_dim(), get_global_id(), get_local_id(),
    get_group_id(), etc.
• Synchronization functions:
  - Barriers: all work-items within a work-group must execute the
    barrier function before any work-item may continue beyond it
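A minimal illustrative kernel in OpenCL C showing the qualifiers and work-item functions above (name and signature are examples only, matching the vector-addition theme of the exercises):

```c
/* OpenCL C device code (e.g. in a .cl file) -- illustrative sketch. */
__kernel void vectoradd(__global const float *a,
                        __global const float *b,
                        __global float *c)
{
    /* Global index of this work-item in dimension 0. */
    size_t i = get_global_id(0);
    c[i] = a[i] + b[i];
}
```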
Restrictions
Exercise 1
Task and Hints
Task
• Implement the addition of three vectors instead of two!
Hints
• Copy project files from
train025@zam1069:OpenCL_Course/OpenCL_Basics/example
• Use host code in VectorAddition.C as starting point
• Use device code in vectoradd.cl as starting point
• Adjust the settings in the Makefile to the computer system which is used
Event Handling
Further Useful Functions…
…for Event Handling and Queues
clFinish(..)
Blocks until all previously queued OpenCL commands in command_queue have been
issued to the associated device and have completed. clFinish(..) is therefore
also a synchronization point.
Return value : Error code (ideally equal to CL_SUCCESS)
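A minimal sketch of event-based ordering: a non-blocking read waits on the kernel's completion event, and clFinish(..) finally blocks the host. This pattern is useful with out-of-order queues (function and parameter names are illustrative):

```c
#include <CL/cl.h>

/* Sketch: chain a kernel and a read via an event, then drain the
 * queue. With an out-of-order queue, the event is what guarantees
 * that the read starts only after the kernel has completed. */
cl_int run_and_wait(cl_command_queue queue, cl_kernel kernel,
                    cl_mem buffer, float *host_result, size_t n) {
    size_t global = n;
    cl_event kernel_done;
    cl_int err = clEnqueueNDRangeKernel(queue, kernel, 1, NULL,
                                        &global, NULL,
                                        0, NULL, &kernel_done);
    if (err != CL_SUCCESS) return err;

    /* Non-blocking read that waits on the kernel-completion event. */
    err = clEnqueueReadBuffer(queue, buffer, CL_FALSE, 0,
                              n * sizeof(float), host_result,
                              1, &kernel_done, NULL);
    if (err != CL_SUCCESS) return err;

    return clFinish(queue);  /* synchronization point for the host */
}
```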
Exercise 2
Task and Hints
Task
• Implement a second kernel for element–wise vector
multiplication!
• Compute with both kernels (multiplication and pair–wise
addition) the equation e = a ∗ b + c ∗ d as element–wise
vector operation!
• BONUS: Use an out–of–order queue instead of the default
queue. . .
• . . . and ensure by using events that all commands are
executed in the right order!
Hints
• Extend your code from exercise 1
Appendix: Notes on Nomenclature
Nomenclature
AMD vs. NVIDIA
AMD                               NVIDIA
—                                 Texture Processing Cluster (TPC),
                                  Graphics Processing Cluster (GPC)
SIMD core                         Streaming Multi-Processor (SM, SMX)
  (GCN arch.: Compute Unit)
GCN arch.: SIMD                   —
SIMD unit (“SIMD lane”)           Streaming Processor (SP), CUDA Core
Wavefront                         Warp
Local Data Share                  Shared Memory
Nomenclature
OpenCL vs. CUDA
OpenCL                  CUDA
Work-Item               Thread
Work-Group              Block
NDRange (Workspace)     Grid
Local Memory            Shared Memory
Private Memory          Registers/Local Memory
Image                   Texture
Queue                   Stream
Event                   Event