Unit-6 Concurrent and Parallel Programming:: C++ AMP (Accelerated Massive Programming)
Microsoft added the restrict(amp) specifier, which can be applied to any function (including
lambdas) to declare that the function can be executed on a C++ AMP accelerator. The restrict
keyword instructs the compiler to statically check that the function uses only those language
features that are supported by most GPUs, for example: void myFunc() restrict(amp) {…}.
Microsoft or other implementers of the open C++ AMP specification could add other restrict
specifiers for other purposes, including purposes that are unrelated to C++ AMP.
Example:
The following example illustrates the primary components of C++ AMP. Assume that you want
to add the corresponding elements of two one-dimensional arrays. For example, you might want
to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}. Using C++ AMP, you might
write the following code to add the numbers and display the results.
#include <amp.h>
#include <iostream>
using namespace concurrency;

const int size = 5;

void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp) {
            sum[idx] = a[idx] + b[idx];
        });

    // Display the results: 7, 9, 11, 13, 15.
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}
Shaping and Indexing Data: index and extent
You must define the data values and declare the shape of the data before you can run the kernel
code. All data is defined to be an array (rectangular), and you can define the array to have any
rank (number of dimensions). The data can be any size in any of the dimensions.
Index Class
The index class specifies a location in the array or array_view object by encapsulating the offset
from the origin in each dimension into one object. When you access a location in the array, you
pass an index object to the indexing operator, [], instead of a list of integer indexes. You can also
access the elements in each dimension by using the array::operator() operator or the
array_view::operator() operator.
Extent Class
The extent class specifies the length of the data in each dimension of an array or array_view
object. You can also create or retrieve the extent of an existing array or array_view object.
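For instance, a minimal sketch of how extent and index describe two-dimensional data (the function name, sizes, and values below are illustrative, not taken from the original notes):

#include <amp.h>
using namespace concurrency;

void ShapeAndIndexExample() {
    int values[] = {1, 2, 3,
                    4, 5, 6};              // 2 rows x 3 columns, stored row-major

    extent<2> shape(2, 3);                 // length of the data in each dimension
    array_view<int, 2> view(shape, values);

    index<2> cell(1, 2);                   // offsets from the origin: row 1, column 2
    int byIndex = view[cell];              // access through an index object: 6
    int byCoords = view(1, 2);             // equivalent access through operator()
}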
Moving Data to the Accelerator: array and array_view
Two data containers used to move data to the accelerator are defined in the runtime library: the
array class and the array_view class. The array class is a container class that creates a deep copy
of the data when the object is constructed. The array_view class is a wrapper class that copies
the data only when the kernel function accesses it. When the data is needed back on the source
device, it is copied back.
Array Class
When an array object is constructed, a deep copy of the data is created on the accelerator if you
use a constructor that includes a pointer to the data set. The kernel function modifies the copy on
the accelerator. When the execution of the kernel function is finished, you must copy the data
back to the source data structure. The following example multiplies each element in a vector by
10. After the kernel function is finished, the vector conversion operator is used to copy the data
back into the vector object.
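The original listing is not reproduced in these notes; the following is a minimal sketch of what such an example might look like (the function name and data values are illustrative):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

void MultiplyArrayExample() {
    std::vector<int> data = {1, 2, 3, 4, 5};

    // Constructing the array creates a deep copy of the data on the accelerator.
    array<int, 1> a(5, data.begin(), data.end());

    parallel_for_each(a.extent, [&a](index<1> idx) restrict(amp) {
        a[idx] = a[idx] * 10;              // the kernel modifies the accelerator copy
    });

    // Copy the data back into the vector using the vector conversion operator.
    data = a;
    for (int value : data) {
        std::cout << value << "\n";        // 10 20 30 40 50
    }
}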
Array_View Class
The array_view has nearly the same members as the array class, but the underlying behavior is
not the same. Data passed to the array_view constructor is not replicated on the GPU as it is with
an array constructor. Instead, the data is copied to the accelerator when the kernel function is
executed. Therefore, if you create two array_view objects that use the same data, both
array_view objects refer to the same memory space. When you do this, you have to synchronize
any multithreaded access. The main advantage of using the array_view class is that data is
moved only if it is necessary.
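A minimal sketch of this behaviour (names and sizes are illustrative): the wrapper copies nothing up front, the data moves to the accelerator only when the kernel accesses it, and synchronize() brings the results back to the source container.

#include <amp.h>
#include <vector>
using namespace concurrency;

void ArrayViewExample() {
    std::vector<int> data = {1, 2, 3, 4, 5};

    // No copy happens here; the copy to the accelerator is deferred until
    // the kernel below actually accesses the data.
    array_view<int, 1> view(5, data);

    parallel_for_each(view.extent, [=](index<1> idx) restrict(amp) {
        view[idx] += 1;
    });

    // Force the modified data back into the source vector on the CPU.
    view.synchronize();
}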
Table 6.1: Similarities and differences between the array and array_view classes.

  Description                 array class                              array_view class
  When rank is determined     At compile time.                         At compile time.
  When extent is determined   At run time.                             At run time.
  Shape                       Rectangular.                             Rectangular.
  Data storage                Is a data container.                     Is a data wrapper.
  Copy                        Explicit and deep copy at definition.    Implicit copy when it is accessed by the kernel function.
  Data retrieval              By copying the array data back to an     By direct access of the array_view object or by calling the
                              object on the CPU thread.                array_view::synchronize method to continue accessing the
                                                                       data on the original container.
Shared memory with array and array_view
Shared memory is memory that can be accessed by both the CPU and the accelerator. The use of
shared memory eliminates or significantly reduces the overhead of copying data between the
CPU and the accelerator. Although the memory is shared, it cannot be accessed concurrently by
both the CPU and the accelerator; doing so causes undefined behavior. array objects can be used
to specify fine-grained control over the use of shared memory if the associated accelerator
supports it.
An array_view reflects the same cpu_access_type as the array that it is associated with; or, if the
array_view is constructed without a data source, its access_type reflects the environment that
first causes it to allocate storage. That is, if it is first accessed by the host (CPU), then it behaves
as if it were created over a CPU data source and shares the access_type of the accelerator_view
that is associated with it by capture; however, if it is first accessed by an accelerator_view, then
it behaves as if it were created over an array created on that accelerator_view and shares that
array's access_type.
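A hedged sketch of how an application might opt into shared memory when the accelerator supports it, using the accelerator class's supports_cpu_shared_memory property and set_default_cpu_access_type() (the function name and buffer size are illustrative):

#include <amp.h>
using namespace concurrency;

void EnableSharedMemoryIfAvailable() {
    accelerator acc;                                   // default accelerator

    // Only request CPU-accessible (shared) allocations if the hardware supports them.
    if (acc.get_supports_cpu_shared_memory()) {
        acc.set_default_cpu_access_type(access_type_read_write);
    }

    // array objects created on this accelerator's default view can now be touched
    // by both the CPU and the accelerator without explicit copies (though not
    // concurrently, which would be undefined behavior).
    array<int, 1> data(1024, acc.get_default_view());
}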
The parallel_for_each method takes two arguments, a compute domain and a lambda
expression.
The compute domain is an extent object or a tiled_extent object that defines the set of
threads to create for parallel execution. One thread is generated for each element in the
compute domain. In this case, the extent object is one-dimensional and has five elements.
Therefore, five threads are started.
The lambda expression defines the code to run on each thread. The capture clause, [=],
specifies that the body of the lambda expression accesses all captured variables by value,
which in this case are a, b, and sum. In this example, the parameter list creates a one-
dimensional index variable named idx. The value of idx[0] is 0 in the first thread and increases
by one in each subsequent thread. The restrict(amp) clause indicates that only the subset of the
C++ language that C++ AMP can accelerate is used.
the C++ Standard Library with respect to assignment and copy construction.
The template parameters for the texture class are the element type and the rank.
writeonly_texture_view Class: Provides write-only access to any texture.
Short Vector Library: Defines a set of short vector types of length 2, 3, and 4
that are based on int, uint, float, double, norm, or unorm.
Concurrency Visualizer
The Concurrency Visualizer includes support for analyzing the performance of C++ AMP code.
Performance Recommendations
Modulus and division of unsigned integers have significantly better performance than modulus
and division of signed integers. We recommend that you use unsigned integers when possible.
OpenCL:
Introduction:
1. Central Processing Unit (CPU):
• Fundamental component of a modern computer
• Constantly evolving in terms of performance
• Better CPU performance = higher core clock frequency.
• Clock frequencies have reached a limit due to power requirements.
• Solution: increase the number of cores per chip.
CPU vs GPU:
GPGPU:
• GPGPU: General-Purpose computation on Graphics Processing Units.
• Idea: use the GPU for generic computations.
• The GPU acts as an "accelerator" to the CPU (heterogeneous system).
• Most lines of code are executed on the CPU.
• Key computational kernels are executed on the GPU.
• This takes advantage of the large number of cores and the high graphics memory bandwidth.
• Aim: the code performs better than with the CPU alone.
• Nvidia was the pioneer of GPGPU.
• It created the CUDA language (based on C) and the guidelines to follow.
Heterogeneous Computing:
Heterogeneous computing exploits the capabilities of different computing resources in a system,
such as:
• CPU
• GPU
• Multicore Microprocessor
• Digital Signal Processor
• In this kind of application the same operation is applied to multiple data items, and on
vector architectures multiple operations can be executed in parallel.
• Usually used to obtain a high level of parallelization.
• Course focus: GPU – CPU.
• The use of a graphics processing unit (GPU) together with a CPU to accelerate scientific,
analytics, engineering, consumer, and enterprise applications is a simple and common
scenario of heterogeneous programming.
Parallel Programming:
• Parallel programming is the ability to use multiple computing resources to speed up a
computation.
• Two kinds of parallelism:
• Task-based: each unit carries out a different job.
• Data-based: all units do the same work on different subsets of the data
Fig: Parallel Programming
Parallelism Sample
• Classic sample: multiplication of the elements of two arrays A and B (each array has N
elements), storing the result of each multiply in the corresponding element of array C.
• The standard way to develop this sample is to implement a sequential solution.
• Problem: this solution executes the multiplication statement N times (once for each element
in the array) without any parallelism.
• Question: Is the sample parallelizable? And why?
• Answer: Yes!!!!! The sample is parallelizable because the multiplication of each element
in A and B is independent.
• It is therefore possible to create independent sub-problems.
• Solution: generate a separate execution instance to perform the computation of each
element of C (see the sketch below). This code possesses significant data-level parallelism
because it is possible to perform the same operation on each element in parallel.
Fig: Parallelism Sample – Parallel Solution
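A minimal sketch of the two solutions described above, with illustrative names and types; the sequential loop runs the multiplication N times in order, while the parallel idea is that each index can be handled by an independent execution instance:

#include <vector>

// Sequential solution: the multiplication statement is executed N times, one element at a time.
void multiply_sequential(const std::vector<float>& A,
                         const std::vector<float>& B,
                         std::vector<float>& C) {
    for (std::size_t i = 0; i < A.size(); ++i) {
        C[i] = A[i] * B[i];
    }
}

// Parallel idea: every element is independent, so each index i can be computed by a
// separate execution instance (thread, work-item, ...) running this body for its own i.
void multiply_one_element(const std::vector<float>& A,
                          const std::vector<float>& B,
                          std::vector<float>& C,
                          std::size_t i) {
    C[i] = A[i] * B[i];
}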
• Using the core language and correctly following the specification, any program
designed for one vendor can execute on another vendor’s hardware.
• Version 1.0 (2008), first supported in Apple's Mac OS X Snow Leopard.
• AMD announced support for OpenCL in the same timeframe, and in 2009 IBM
announced support for OpenCL in its XL compilers for the Power architecture.
• Version 1.1 was released in 2010.
• Version 1.2 was released in 2011.
• Version 2.0 (the current version at the time of writing) was released in 2013.
• OpenCL supports different levels of parallelism.
• It efficiently maps to
• homogeneous systems
• heterogeneous systems.
• single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices
• limited only by the imagination of vendors.
• OpenCL code is written in OpenCL C, a restricted version of the C99 language with
extensions appropriate for executing data-parallel code on a variety of heterogeneous
devices.
OpenCL vs OpenGL:
• OpenCL is similar to OpenGL, but THEY ARE NOT THE SAME!
• OpenCL is specifically crafted to increase computing efficiency across platforms and is
typically used for image-processing algorithms, physical simulations, and similar
computations. It returns numerical results, not rendered images.
• OpenGL is a graphical API that allows you to send rendering commands to the GPU.
Typically, the goal is to show the rendering on screen.
OpenCL – Specification
The OpenCL specification is defined in 4 parts called models.
1. Platform model:
• Specifies two kinds of processors:
• Host processor: coordinates the execution. Only one.
• Device processors: execute OpenCL C kernels. One or more
• Defines an abstract hardware model for Devices.
2. Execution model: Defines how the OpenCL environment is configured by the host, and
how the host may direct the devices to perform work. This includes the definitions of:
• Host execution environment
• Mechanisms for Host-Device interaction
• Concurrency model used for the configuration of kernels.
• Fundamental elements: OpenCL work-items and work-groups.
3. Kernel programming model: Defines how the concurrency model is mapped to
physical hardware.
4. Memory model: Defines memory object types, and the abstract memory hierarchy that
kernels use regardless of the actual underlying memory architecture. It also contains
requirements for memory ordering and optional shared virtual memory between the host
and devices.
OpenCL – Platform Model API:
OpenCL offers two API functions for discovering platforms and devices:
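The two functions are clGetPlatformIDs() and clGetDeviceIDs(). A minimal sketch of their use (error handling omitted; the choice of the first platform and a GPU device is illustrative):

#include <CL/cl.h>

void discover_platform_and_device(cl_platform_id *platform, cl_device_id *device) {
    // Ask the runtime for the first available platform.
    clGetPlatformIDs(1, platform, NULL);

    // Ask that platform for the first GPU device it exposes.
    clGetDeviceIDs(*platform, CL_DEVICE_TYPE_GPU, 1, device, NULL);
}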
• The number of kernel instances is equal to the size of the index space: one kernel
instance is created at each coordinate of the index space.
• Each instance is called a "work-item" and the index space is called the NDRange.
Work-items are executed by the compute units.
• Work-items can be grouped into smaller, equally sized "work-groups".
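A minimal OpenCL C kernel illustrating the work-item concept (the kernel name and argument types are illustrative): each work-item obtains its own coordinate in the NDRange with get_global_id() and processes exactly one element.

// OpenCL C source; on the host side it is usually kept in a string or a .cl file.
__kernel void vec_mult(__global const float *A,
                       __global const float *B,
                       __global float *C) {
    size_t i = get_global_id(0);   // unique index of this work-item in the 1-D NDRange
    C[i] = A[i] * B[i];
}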
• The API functions used to query information about a platform or device ID are
clGetPlatformInfo() and clGetDeviceInfo().
In order for the host to request that a kernel be executed on a device, a context must be
configured. It enables the host to pass commands and data to the device. The API function to
create a context is clCreateContext().
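A minimal sketch of context creation for a single device (error handling omitted; the function and variable names are illustrative):

#include <CL/cl.h>

cl_context create_context_for_device(cl_device_id device) {
    cl_int err;
    // Associate the context with one device; no special properties and
    // no error-notification callback are supplied.
    cl_context context = clCreateContext(NULL, 1, &device, NULL, NULL, &err);
    return context;
}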
The execution model specifies that devices perform tasks based on commands which are sent
from the host to the device. A command-queue is the communication mechanism that the host
uses to request action by a device. One command-queue needs to be created per device. The API
function clCreateCommandQueue() (deprecated since OpenCL 2.0 and replaced by
clCreateCommandQueueWithProperties()) is used to create a command-queue.
Any API call that submits a command to a command-queue begins with clEnqueue and requires
a command-queue as a parameter. For example:
• clEnqueueReadBuffer() requests that the device send data to the host.
• clEnqueueNDRangeKernel() requests that a kernel be executed on the device.
The commands put in a queue are handled through the use of events. Each clEnqueue command
has three parameters in common:
• a pointer to a list of events that specify dependencies for the current command, called the
wait list
• the number of events in the wait list
• a pointer to an event that will represent the execution of the current command
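A minimal sketch of creating a command-queue and expressing a dependency between two commands through the event parameters described above (the kernel, output buffer, and element count are assumed to exist already, and the buffer is assumed to hold floats):

#include <CL/cl.h>

void enqueue_with_dependency(cl_context context, cl_device_id device,
                             cl_kernel kernel, cl_mem output_buffer,
                             size_t n, void *host_result) {
    cl_int err;
    // One command-queue per device (clCreateCommandQueue is the pre-2.0 equivalent).
    cl_command_queue queue =
        clCreateCommandQueueWithProperties(context, device, NULL, &err);

    // Launch the kernel over a 1-D NDRange of n work-items; the runtime fills
    // 'kernel_done' with an event that represents this command.
    cl_event kernel_done;
    size_t global_size = n;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_size, NULL,
                           0, NULL, &kernel_done);

    // Read the result back only after the kernel has completed:
    // the wait list contains the single event 'kernel_done'.
    clEnqueueReadBuffer(queue, output_buffer, CL_TRUE, 0, n * sizeof(float),
                        host_result, 1, &kernel_done, NULL);

    clFinish(queue);   // block until everything in the queue has completed
}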
• clFlush() issues all previously queued OpenCL commands in a command-queue
to the device associated with the command-queue
• clFinish() blocks until all previously queued OpenCL commands in a command-
queue are issued to the associated device and have completed
• The OpenCL API also includes the function clWaitForEvents(), which causes the host to
wait for all events specified in the wait list to complete execution.
OpenCL source code is compiled at run time through a series of API calls.
The process of creating a kernel from source code is as follows:
1. Store the OpenCL C source code in a character array.
• If the source code is stored in a file, it must be read into memory and stored as
a character array.
• Each kernel in a program source string or file is identified by the __kernel
qualifier.
2. Turn the source code into a program object, cl_program, by calling
clCreateProgramWithSource().
• It is also possible to create a program from a precompiled binary with
clCreateProgramWithBinary().
3. Compile the program object, for one or more OpenCL devices, with
clBuildProgram().
• Any compilation errors are reported at this step.
4. Create a kernel object of type cl_kernel by calling clCreateKernel() and specifying
the program object and the kernel name.
• This final step of obtaining a cl_kernel object is similar to obtaining an
exported function from a dynamic library.
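A minimal sketch of the four steps (the kernel source string and names are illustrative; error handling is omitted):

#include <CL/cl.h>

cl_kernel build_kernel(cl_context context, cl_device_id device) {
    // Step 1: the OpenCL C source code stored in a character array.
    const char *source =
        "__kernel void vec_mult(__global const float *A,"
        "                       __global const float *B,"
        "                       __global float *C) {"
        "    size_t i = get_global_id(0);"
        "    C[i] = A[i] * B[i];"
        "}";

    cl_int err;
    // Step 2: turn the source code into a program object.
    cl_program program = clCreateProgramWithSource(context, 1, &source, NULL, &err);

    // Step 3: compile the program for the chosen device; compile errors are reported here.
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);

    // Step 4: extract a kernel object by name, much like looking up an
    // exported function in a dynamic library.
    return clCreateKernel(program, "vec_mult", &err);
}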
Unlike invoking functions in C programs, we cannot simply call a kernel with a list of
arguments. Before enqueueing the kernel, we have to specify each kernel argument individually
using clSetKernelArg().
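A minimal sketch of setting the arguments of the vec_mult kernel used in the earlier sketches (the buffers are assumed to have been created already; see the memory objects below):

#include <CL/cl.h>

void set_vec_mult_args(cl_kernel kernel, cl_mem A, cl_mem B, cl_mem C) {
    // Each argument is specified individually by its position in the kernel's parameter list.
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &A);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &B);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &C);
}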
OpenCL defines three types of memory objects: buffers, images and pipes.
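A minimal sketch of creating a buffer, the most commonly used memory object (the size and flags are illustrative):

#include <CL/cl.h>

cl_mem create_input_buffer(cl_context context, const float *host_data, size_t n) {
    cl_int err;
    // A read-only buffer of n floats, initialized by copying from host memory.
    return clCreateBuffer(context,
                          CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                          n * sizeof(float),
                          (void *)host_data,
                          &err);
}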