
UNIT-6

Concurrent and Parallel Programming: Heterogeneous Computing, C++ AMP, OpenCL

C++ AMP (Accelerated Massive Parallelism):


C++ Accelerated Massive Parallelism is a native programming model that contains elements that
span the C++ programming language and its runtime library. It accelerates the execution of your
C++ code by taking advantage of the data-parallel hardware that's commonly present as a
graphics processing unit (GPU) on a discrete graphics card. This programming model includes
support for multidimensional arrays, indexing, memory transfer, and tiling. It also includes a
mathematical function library. You can use C++ AMP language extensions to control how data
is moved from the CPU to the GPU and back so that you can improve performance. On
November 12, 2013, the HSA Foundation announced a C++ AMP compiler that outputs to
OpenCL, Standard Portable Intermediate Representation (SPIR), and HSA Intermediate
Language (HSAIL), supporting the then-current C++ AMP specification. This support is now
considered obsolete, and the ROCm 1.9 series is the last to support it.

Microsoft added the restrict(amp) feature, which can be applied to any function (including
lambdas) to declare that the function can be executed on a C++ AMP accelerator. The restrict
keyword instructs the compiler to statically check that the function uses only those language
features that are supported by most GPUs; for example: void myFunc() restrict(amp) {…}.
Microsoft, or any other implementer of the open C++ AMP specification, could add other restrict
specifiers for other purposes, including purposes that are unrelated to C++ AMP.
Example:
The following example illustrates the primary components of C++ AMP. Assume that you want
to add the corresponding elements of two one-dimensional arrays. For example, you might want
to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}. Using C++ AMP, you might
write the following code to add the numbers and display the results.
#include <amp.h>
#include <iostream>
using namespace concurrency;

const int size = 5;

void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}
Shaping and Indexing Data: index and extent
You must define the data values and declare the shape of the data before you can run the kernel
code. All data is defined to be an array (rectangular), and you can define the array to have any
rank (number of dimensions). The data can be any size in any of the dimensions.
Index Class
The index Class specifies a location in the array or array_view object by encapsulating the offset
from the origin in each dimension into one object. When you access a location in the array, you
pass an index object to the indexing operator, [], instead of a list of integer indexes. You can
also access the elements in each dimension by using the array::operator() or the
array_view::operator() function call operator.
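
For illustration, here is a small sketch (the helper name IndexExample is only illustrative) of
accessing one element of a two-dimensional array_view through an index<2> object:

#include <amp.h>
using namespace concurrency;

void IndexExample()
{
    int data[] = { 1, 2, 3,
                   4, 5, 6 };
    array_view<int, 2> av(2, 3, data);   // 2 rows, 3 columns
    index<2> idx(1, 2);                  // row 1, column 2 (zero-based)
    int value = av[idx];                 // value == 6
}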

Extent Class
The extent class specifies the length of the data in each dimension of an array or array_view
object. We can also create or retrieve the extent of an existing array or array_view object.
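
As a small sketch (names are illustrative): an extent describes the length in each dimension,
and an array_view exposes the extent of the data it wraps.

#include <amp.h>
using namespace concurrency;

void ExtentExample()
{
    int data[12] = {0};
    extent<2> e(3, 4);                    // 3 rows, 4 columns
    array_view<int, 2> av(e, data);
    int rows  = av.extent[0];             // 3
    int total = av.extent.size();         // 12 elements in all
}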
Moving Data to the Accelerator: array and array_view
Two data containers used to move data to the accelerator are defined in the runtime library. They
are the array Class and the array_view Class. The array class is a container class that creates a
deep copy of the data when the object is constructed. The array_view class is a wrapper class that
copies the data when the kernel function accesses the data. When the data is needed on the
source device, it is copied back.
Array Class
When an array object is constructed, a deep copy of the data is created on the accelerator if you
use a constructor that includes a pointer to the data set. The kernel function modifies the copy on
the accelerator. When the execution of the kernel function is finished, you must copy the data
back to the source data structure. The following example multiplies each element in a vector by
10. After the kernel function is finished, the vector conversion operator is used to copy the data
back into the vector object.
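
The example referred to above is not reproduced in these notes; the following sketch follows
the same lines (the function name MultiplyArrayExample is illustrative):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

void MultiplyArrayExample()
{
    std::vector<int> data(5);
    for (int count = 0; count < 5; count++)
    {
        data[count] = count;
    }

    array<int, 1> a(5, data.begin(), data.end());   // deep copy onto the accelerator

    parallel_for_each(
        a.extent,
        [=, &a](index<1> idx) restrict(amp)
        {
            a[idx] = a[idx] * 10;
        });

    data = a;   // the vector conversion operator copies the data back
    for (int i = 0; i < 5; i++)
    {
        std::cout << data[i] << "\n";
    }
}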
Array_View Class
The array_view has nearly the same members as the array class, but the underlying behavior is
not the same. Data passed to the array_view constructor is not replicated on the GPU as it is with
an array constructor. Instead, the data is copied to the accelerator when the kernel function is
executed. Therefore, if you create two array_view objects that use the same data, both
array_view objects refer to the same memory space. When you do this, you have to synchronize
any multithreaded access. The main advantage of using the array_view class is that data is
moved only if it is necessary.
Table 6.1: Similarities and differences between the array and array_view classes.

Description               | array class                                                     | array_view class
When rank is determined   | At compile time.                                                | At compile time.
When extent is determined | At run time.                                                    | At run time.
Shape                     | Rectangular.                                                    | Rectangular.
Data storage              | Is a data container.                                            | Is a data wrapper.
Copy                      | Explicit and deep copy at definition.                           | Implicit copy when it is accessed by the kernel function.
Data retrieval            | By copying the array data back to an object on the CPU thread. | By direct access of the array_view object or by calling the array_view::synchronize method to continue accessing the data on the original container.

Shared memory with array and array_view
Shared memory is memory that can be accessed by both the CPU and the accelerator. The use of
shared memory eliminates or significantly reduces the overhead of copying data between the
CPU and the accelerator. Although the memory is shared, it cannot be accessed concurrently by
both the CPU and the accelerator, and doing so causes undefined behavior. ‘array’ objects can be
used to specify fine-grained control over the use of shared memory if the associated accelerator
supports it.
An array_view reflects the same ‘cpu_access_type’ as the array that it’s associated with; or, if
the array_view is constructed without a data source, its ‘access_type’ reflects the environment
that first causes it to allocate storage. That is, if it’s first accessed by the host (CPU), then it
behaves as if it were created over a CPU data source and shares the access_type of the
accelerator_view associated by capture; however, if it's first accessed by an accelerator_view,
then it behaves as if it were created over an array created on that accelerator_view and shares the
array’s access_type.
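
A minimal sketch of opting in to shared memory, assuming the default accelerator supports it
(names are illustrative; the default access type must be set before arrays are allocated on that
accelerator):

#include <amp.h>
using namespace concurrency;

void SharedMemoryExample()
{
    accelerator acc = accelerator(accelerator::default_accelerator);
    if (acc.supports_cpu_shared_memory)
    {
        // Let arrays on this accelerator use CPU-accessible (shared) memory by default.
        acc.set_default_cpu_access_type(access_type_read_write);
    }

    // An array whose storage the CPU can read and write without an explicit copy.
    array<int, 1> data(5, acc.default_view, access_type_read_write);
}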

Executing Code over Data: parallel_for_each


The parallel_for_each function defines the code that you want to run on the accelerator against
the data in the array or array_view object. Consider the following code from the introduction of
this topic.

#include <amp.h>
#include <iostream>
using namespace concurrency;

void AddArrays() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[5] = {0, 0, 0, 0, 0};

    array_view<int, 1> a(5, aCPP);
    array_view<int, 1> b(5, bCPP);
    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp)
    {
        sum[idx] = a[idx] + b[idx];
    });

    for (int i = 0; i < 5; i++)
    {
        std::cout << sum[i] << "\n";
    }
}
The parallel_for_each method takes two arguments: a compute domain and a lambda
expression.

 The compute domain is an extent object or a tiled_extent object that defines the set of
threads to create for parallel execution. One thread is generated for each element in the
compute domain. In this case, the extent object is one-dimensional and has five elements.
Therefore, five threads are started.
 The lambda expression defines the code to run on each thread. The capture clause, [=],
specifies that the body of the lambda expression accesses all captured variables by value,
which in this case are a, b, and sum. In this example, the parameter list creates a one-
dimensional index variable named idx. The value of idx[0] is 0 in the first thread and
increases by one in each subsequent thread. The restrict(amp) specifier indicates that only the
subset of the C++ language that C++ AMP can accelerate is used.

Accelerating Code: Tiles and Barriers


You can gain additional acceleration by using tiling. Tiling divides the threads into equal
rectangular subsets or tiles. Determine the appropriate tile size based on your data set and the
algorithm that you are coding. For each thread, you have access to the global location of a data
element relative to the whole array or array_view and access to the local location relative to the
tile. Using the local index value simplifies your code because you don't have to write the code to
translate index values from global to local. To use tiling, call the extent::tile Method on the
compute domain in the parallel_for_each method, and use a tiled_index object in the lambda
expression.
In typical applications, the elements in a tile are related in some way, and the code has to access
and keep track of values across the tile. Use the tile_static keyword and the
tile_barrier::wait method to accomplish this. A variable that has the tile_static keyword has a
scope across an entire tile, and an instance of the variable is created for each tile. You must
handle synchronization of tile-thread access to the variable. The tile_barrier::wait Method stops
execution of the current thread until all the threads in the tile have reached the call to
tile_barrier::wait. So you can accumulate values across the tile by using tile_static variables.
Then you can finish any computations that require access to all the values.
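
A hedged sketch (not an example from these notes) that tiles a 4 x 6 array_view into 2 x 2
tiles, copies each tile into tile_static storage, waits at a barrier, and then writes the tile sum
back:

#include <amp.h>
using namespace concurrency;

void TileExample()
{
    int sampleData[] = { 2, 2, 9, 7, 1, 4,
                         4, 4, 8, 8, 3, 4,
                         1, 5, 1, 2, 5, 2,
                         6, 8, 3, 2, 7, 2 };
    array_view<int, 2> av(4, 6, sampleData);

    parallel_for_each(av.extent.tile<2, 2>(),
        [=](tiled_index<2, 2> idx) restrict(amp)
        {
            tile_static int nums[2][2];                       // one instance per tile
            nums[idx.local[0]][idx.local[1]] = av[idx.global];
            idx.barrier.wait();                               // all four threads in the tile have written
            av[idx.global] = nums[0][0] + nums[0][1]
                           + nums[1][0] + nums[1][1];         // sum of the tile
        });
}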
Graphics Library
C++ AMP includes a graphics library that is designed for accelerated graphics programming.
This library is used only on devices that support native graphics functionality. The methods are
in the Concurrency::graphics Namespace and are contained in the <amp_graphics.h> header file.
The key components of the graphics library are:
 texture Class: You can use the texture class to create textures from memory or from a file.
Textures resemble arrays because they contain data, and they resemble containers in the
C++ Standard Library with respect to assignment and copy construction.
The template parameters for the texture class are the element type and the rank.
 writeonly_texture_view Class: Provides write-only access to any texture.
 Short Vector Library: Defines a set of short vector types of length 2, 3, and 4
that are based on int, uint, float, double, norm, or unorm.

Universal Windows Platform (UWP) Apps


Like other C++ libraries, you can use C++ AMP in your UWP apps.

Concurrency Visualizer
The Concurrency Visualizer includes support for analyzing the performance of C++ AMP
code.
Performance Recommendations
Modulus and division of unsigned integers have significantly better performance than modulus
and division of signed integers. We recommend that you use unsigned integers when possible.

OpenCL:
Introduction:
1. Central Processing Unit (CPU):
• Fundamental component of a modern computer
• Constantly evolving in terms of performance
• Better CPU performance traditionally meant a greater core clock frequency.
• Clock frequencies have reached their limit due to power requirements.
• Solution: increase the number of cores per chip.

2. Graphics Processing Unit (GPU):

• Primarily used to manage and boost the performance of video and graphics
• Main feature: a high number (hundreds) of simple cores
• GPUs work in tandem with a CPU, and are responsible for generating the graphical
output display (computing pixel values)
• Inherently parallel: each core computes a certain set of pixels
• The architecture has evolved for this purpose

CPU vs GPU:

GPGPU:
• GPGPU: General Purpose computation on Graphics Processing Units.
• Idea: using GPU for generic computations
• GPU acts as an “accelerator” to the CPU (Heterogeneous System)
• Most lines of code are executed on the CPU
• Key computational kernels are executed on the GPU
• Taking advantage of the large number of cores and high graphics memory
bandwidth
• AIM: the code performs better than it would on the CPU alone.
• Nvidia was the pioneer of GPGPU.
• It created the CUDA language (based on C) and the guidelines to follow.

Heterogeneous Computing:
Heterogeneous computing exploits the capabilities of different computing resources in a system
like
• CPU
• GPU
• Multicore Microprocessor
• Digital Signal Processor

• Heterogeneous applications commonly include a mix of workload behaviors:


• control intensive (e.g. searching, sorting, and parsing)
• data intensive (e.g. image processing, simulation and modeling, and data mining)
• compute intensive (e.g. iterative methods, numerical methods, and financial
modeling)
• Each of these workload classes executes most efficiently on a specific style of hardware
architecture and no single device is best for running all classes of workloads. For
example:
• Control-intensive applications tend to run faster on superscalar CPUs
• They use branch prediction mechanisms that are very powerful on this
hardware
• Data-intensive applications tend to run faster on vector architectures
• In this kind of application the same operation is applied to multiple data
items, and on vector architectures multiple operations can be executed in
parallel
• Usually used to obtain a high level of parallelization
• Course focus: GPU – CPU
• The use of a graphics processing unit (GPU) together with a CPU to
accelerate scientific, analytics, engineering, consumer, and enterprise
applications is a simple and common scenario of heterogeneous
programming

Fig: Heterogeneous computing

Parallel Programming:
• Parallel programming is the ability to use multiple computing resources to speed up
the computation
• Two kinds of parallelism:
• Task-based: each unit carries out a different job.
• Data-based: all units do the same work on different subsets of the data

Fig: Parallel Programming

• A problem can be parallelized only if it can be divided into independent subproblems


• If the problem can be divided, it’s possible to use a decomposition method
• Two main decomposition methods: task decomposition and data decomposition

Parallelism Sample
• Classic Sample: multiplication of the elements of two arrays A and B (each array has N
elements) storing the result of each multiply in the corresponding element of array C
• The standard way to develop this sample is to implement a sequential solution; a minimal sketch follows
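
A minimal sequential sketch of that solution (the function and parameter names are only
illustrative):

// The multiply statement in the loop body runs once per element, i.e. N times in series.
void multiply_arrays(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
    {
        C[i] = A[i] * B[i];   // executed N times, one element at a time
    }
}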

• Problem: this solution executes the multiply statement N times (once for each element of
the array) without any parallelism
• Question: Is the sample parallelizable? And why?
• Answer: Yes! The sample is parallelizable because the multiplication of each pair of
elements of A and B is independent.
• It is therefore possible to create independent subproblems.
• Solution: Generate a separate execution instance to perform the computation of each
element of C. This code possesses significant data-level parallelism because it’s possible
to perform the same operation in parallel.

Fig: Parallelism Sample – Parallel Solution

Heterogeneous Computing – Problem:


• For a class of algorithms developers write code in C or C++ and run it on a CPU.
• For another class of algorithms developers often write code in CUDA and use a GPU
• Two related approaches, but each works on only one kind of processor
• Developers have to specialize in one and ignore the other.
So how do you program such machines?
OpenCL:
The solution is Open Computing Language or OpenCL, a programming language developed
specifically to support heterogeneous computing environments.

• OpenCL is managed by the non-profit technology consortium Khronos Group (Apple,
IBM, NVIDIA, AMD, Intel, ARM, etc.).
• The aim of OpenCL is to enable the development of applications that can be executed
across a range of different devices made by different vendors.

• Using the core language and correctly following the specification, any program
designed for one vendor can execute on another vendor’s hardware.
• Version 1.0 (2008), first shipped with Apple’s Mac OS X Snow Leopard.
• AMD announced support for OpenCL in the same timeframe, and in 2009 IBM
announced support for OpenCL in its XL compilers for the Power architecture.
• Version 1.1 released in 2010.
• Version 1.2 released in 2011.
• Version 2.0 released in 2013 (the current version at the time of writing).
• OpenCL supports different levels of parallelism.
• It efficiently maps to
• homogeneous systems
• heterogeneous systems.
• single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices
• limited only by the imagination of vendors.
• OpenCL code is written in OpenCL C, a restricted version of the C99 language with
extensions appropriate for executing data-parallel code on a variety of heterogeneous
devices.

OpenCL vs OpenGL:
• OpenCL is similar to OpenGL but THEY ARE NOT THE SAME!!!!!!!
• OpenCL is specifically crafted to increase computing efficiency across platforms and it
is typically used for image processing algorithms, physical simulations. It returns
numerical results (NO IMAGE RESULTS).
• OpenGL is a graphical API that allows you to send rendering commands to the GPU.
Typically, the goal is to show the rendering on screen.

OpenCL – Specification
The OpenCL specification is defined in 4 parts called models.
1. Platform model:
• Specifies two kinds of processors:
• Host processor: coordinates the execution. Only one.
• Device processors: execute OpenCL C kernels. One or more
• Defines an abstract hardware model for Devices.

2. Execution model: Defines how the OpenCL environment is configured by the host, and
how the host may direct the devices to perform work. This includes the definitions of:
• Host execution environment
• Mechanisms for Host-Device interaction
• Concurrency model used for the configuration of kernels.
• Fundamental elements: OpenCL work-items and work-groups.
3. Kernel programming model: Defines how the concurrency model is mapped to
physical hardware.
4. Memory model: Defines memory object types, and the abstract memory hierarchy that
kernels use regardless of the actual underlying memory architecture. It also contains
requirements for memory ordering and optional shared virtual memory between the host
and devices.

OpenCL – Platform Model:


• An OpenCL platform consists of a Host (CPU) connected to one or more OpenCL
Devices (GPU).
• A Device is divided into one or more compute units (functionally independent)
• Each compute unit is divided into one or more processing elements.

Fig: OpenCL – Platform Model


 The AMD Radeon R9 290X graphics card (Device) comprises 44 vector processors
(compute units)
 Each compute unit has four 16-lane SIMD (Single Instruction, Multiple Data) engines,
for a total of 64 lanes (processing elements).
 Each SIMD lane on the Radeon R9 290X executes a scalar instruction.
 This allows the GPU device to execute a total of 44 × 16 × 4 = 2816 instructions at a
time.

OpenCL – Platform Model API:
OpenCL offers two API functions for discovering platforms and devices: clGetPlatformIDs() and
clGetDeviceIDs(). A usage sketch is given after the connection instructions and steps below.

Instructions for connecting to Cometa GPU Server:


Use an SSH client to connect to our server. Linux or macOS: one is installed by default.
Windows: download an SSH client (PuTTY or OpenSSH).
Steps for connection:
1. ssh –l gpu 212.189.144.28
• Enter ‘bnd3espue’ as password
2. ssh –l username gpu
• ‘username’ is the login name you should have received from Cometa
• Enter the password you received
3. Steps to get the platforms/devices:
• STEP 1: discover the number of platforms/devices
• STEP 2: allocate enough space
• STEP 3: retrieve the desired number of platforms/devices
• You can choose which devices to retrieve with the device_type argument:
• CL_DEVICE_TYPE_CPU
• CL_DEVICE_TYPE_GPU
• CL_DEVICE_TYPE_ALL
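
A hedged sketch of the three-step pattern above, using clGetPlatformIDs() and
clGetDeviceIDs() (error checking omitted; names are illustrative):

#include <stdlib.h>
#include <CL/cl.h>

void discover(void)
{
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, NULL, &numPlatforms);                 /* STEP 1: how many platforms? */
    cl_platform_id *platforms =
        (cl_platform_id *)malloc(numPlatforms * sizeof(cl_platform_id));  /* STEP 2: allocate space */
    clGetPlatformIDs(numPlatforms, platforms, NULL);          /* STEP 3: retrieve them */

    cl_uint numDevices = 0;
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
    cl_device_id *devices =
        (cl_device_id *)malloc(numDevices * sizeof(cl_device_id));
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);

    free(devices);
    free(platforms);
}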

OpenCL – Execution and Programming:


The Execution model defines two main components:
• Host program: written in C or C++, it runs on the OpenCL host.
• creates and queries the platform and the device attributes
• defines a context for the kernels
• builds the kernels and manages their executions.
• Kernels: written in OpenCL C, they are the basic units of executable code that run on the
OpenCL device.
• Each instance of an OpenCL kernel is executed by a compute unit.
• On submission of the kernel by the Host to the Device, an N-dimensional index
space is defined (N = 1, 2, or 3).
• The number of kernel instances is equal to the size of the index space, and one
kernel instance is created at each coordinate of this index space.
• Such an instance is called a "work-item" and the index space is called
the NDRange. The work-items are executed by the compute units.
• Work-items can be divided into smaller, equally sized "work-groups".

• So for each work-item we can define two types of identifier:


• global-id: A unique global ID given to each work item in the global NDRange
• local-id: A unique local ID given to each work item within a work group
These IDs are fundamental for the execution of kernels in OpenCL.

• The OpenCL C functions used inside a kernel to query these IDs include
get_global_id(), get_local_id(), get_group_id(), get_global_size(), and get_local_size().

 In order for the host to request that a kernel is executed on a device, a context must be
configured. A context enables the host to pass commands and data to the device.
 The API function to create a context is clCreateContext().
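
A minimal sketch of creating a context for one device (the platform and device handles are
assumed to come from the discovery calls shown earlier):

#include <CL/cl.h>

cl_context create_context(cl_platform_id platform, cl_device_id device)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
    cl_int err;
    return clCreateContext(props, 1, &device, NULL, NULL, &err);
}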

 The Execution model specifies that devices perform tasks based on commands which are
sent from the host to the device.
 A command-queue is the communication mechanism that the host uses to request action
by a device.
 One command-queue needs to be created per device.
 The API function clCreateCommandQueue() (deprecated in OpenCL 2.0 and substituted
by clCreateCommandQueueWithProperties()) is used to create a command-queue.
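
A minimal sketch of creating one command-queue for a device with the non-deprecated call
(handle names are illustrative):

#include <CL/cl.h>

cl_command_queue create_queue(cl_context context, cl_device_id device)
{
    cl_int err;
    return clCreateCommandQueueWithProperties(context, device, NULL, &err);
}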

 Any API call that submits a command to a command-queue will begin with clEnqueue
and require a command-queue as a parameter.
For example:
• A clEnqueueReadBuffer() call requests that the device send data to the host
• A clEnqueueNDRangeKernel() call requests that a kernel be executed on the device
 The commands put in a queue are handled through the use of events. Each clEnqueue
command has three parameters in common:
• a pointer to a list of events, called the wait-list, that specifies the dependencies of
the current command
• the number of events in the wait-list
• a pointer to an event that will represent the execution of the current command

• In addition, OpenCL includes barrier operations that can be used to synchronize
execution of command-queues.
• clFlush() issues all previously queued OpenCL commands in a command-queue
to the device associated with the command-queue
• clFinish() blocks until all previously queued OpenCL commands in a command-
queue are issued to the associated device and have completed
• The OpenCL API also includes the function clWaitForEvents(), which causes the host to
wait for all events specified in the wait list to complete execution.
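
A hedged sketch tying these pieces together: a read that waits on an earlier command's event,
followed by host-side synchronization (all handle names are illustrative):

#include <CL/cl.h>

void read_back(cl_command_queue queue, cl_mem outputBuf,
               void *hostResults, size_t bytes, cl_event kernelDone)
{
    cl_event readDone;
    clEnqueueReadBuffer(queue, outputBuf, CL_FALSE, 0, bytes, hostResults,
                        1, &kernelDone, &readDone);  /* wait-list with one event */
    clWaitForEvents(1, &readDone);                   /* host waits for this command only */
    clFinish(queue);                                 /* or: wait until the whole queue has drained */
}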
OpenCL source code is compiled at runtime through a series of API calls.
The process of creating a kernel from source code is as follows:
1. Store the OpenCL C source code in a character array.
• If the source code is stored in a file, it must be read into memory and stored as
a character array.
• Each kernel in a program source string or file is identified by the __kernel
qualifier.
2. Turn the source code into a program object, cl_program, by calling
clCreateProgramWithSource().
• It is also possible to create a program from binary with
clCreateProgramWithBinary().
3. Compile the program object, for one or more OpenCL devices, with
clBuildProgram().
• In case of compile errors, they will be reported here.
4. Create a kernel object of type cl_kernel by calling clCreateKernel() and specifying
the program object and kernel name.
• The final step of obtaining a cl_kernel object is similar to obtaining an
exported function from a dynamic library.
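
A hedged sketch of steps 1-4; the kernel source and the name "vecmul" are illustrative, not
taken from these notes:

#include <CL/cl.h>

cl_kernel build_vecmul_kernel(cl_context context, cl_device_id device)
{
    const char *source =                                   /* step 1: source in a char array */
        "__kernel void vecmul(__global const float *A,\n"
        "                     __global const float *B,\n"
        "                     __global float *C) {\n"
        "    int i = get_global_id(0);\n"
        "    C[i] = A[i] * B[i];\n"
        "}\n";

    cl_int err;
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);  /* step 2 */
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);           /* step 3: compile errors appear here */
    return clCreateKernel(program, "vecmul", &err);                  /* step 4 */
}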

Unlike invoking functions in C programs, we cannot simply call a kernel with a list of
arguments. Before enqueuing the kernel, we have to specify each kernel argument individually
using clSetKernelArg().

Enqueuing a command to a device to begin kernel execution is done with a call to
clEnqueueNDRangeKernel().
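
A hedged sketch: each argument is set individually by position, then the kernel is enqueued
over a one-dimensional NDRange of n work-items (buffer names are illustrative):

#include <CL/cl.h>

void launch_vecmul(cl_command_queue queue, cl_kernel kernel,
                   cl_mem bufA, cl_mem bufB, cl_mem bufC, size_t n)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

    size_t globalSize = n;   /* one work-item per element; work-group size left to the runtime */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
}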

Fig: OpenCL – Execution and Programming Model

OpenCL – Memory Model:


To support portability, OpenCL defines an abstract memory model that programmers can target
when writing code and vendors can map to their actual memory hardware.

OpenCL defines three types of memory objects: buffers, images and pipes.

• OpenCL classifies memory as either host memory or device memory.


• OpenCL divides device memory into four memory regions.
• These memory regions are relevant within OpenCL kernels.
• Global Memory: visible to all work-items
• Similar to the main memory on a CPU-based system.
• Constant Memory: specifically designed for data where each element is accessed
simultaneously by all work-items.
• Part of global memory.
• Local Memory: memory that is shared between work-items within a work-group.
• Private Memory: memory that is unique to an individual work-item.
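
A hedged sketch of OpenCL C kernel source (held in a host-side string) showing the
address-space qualifier that places each declaration in one of the four regions; the kernel
itself is illustrative:

const char *regionsKernelSrc =
    "__kernel void scale(__global float *data,       /* global memory   */\n"
    "                    __constant float *factor,   /* constant memory */\n"
    "                    __local float *scratch) {   /* local memory    */\n"
    "    int gid = get_global_id(0);                 /* private memory  */\n"
    "    scratch[get_local_id(0)] = data[gid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    data[gid] = scratch[get_local_id(0)] * factor[0];\n"
    "}\n";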
