
UNIT-6

Concurrent and Parallel Programming: Heterogeneous Computing, C++ AMP, OpenCL

C++ AMP (Accelerated Massive Parallelism):


C++ Accelerated Massive Parallelism is a native programming model that contains elements that
span the C++ programming language and its runtime library. It accelerates the execution of your
C++ code by taking advantage of the data-parallel hardware that's commonly present as a
graphics processing unit (GPU) on a discrete graphics card. This programming model includes
support for multidimensional arrays, indexing, memory transfer, and tiling. It also includes a
mathematical function library. You can use C++ AMP language extensions to control how data
is moved from the CPU to the GPU and back so that you can improve performance. On
November 12, 2013, the HSA Foundation announced a C++ AMP compiler that outputs to
OpenCL, Standard Portable Intermediate Representation (SPIR), and HSA Intermediate
Language (HSAIL), supporting the then-current C++ AMP specification. This support is now
considered obsolete, and the ROCm 1.9 series is the last to support it.

Microsoft added the restrict(amp) feature, which can be applied to any function (including
lambdas) to declare that the function can be executed on a C++ AMP accelerator. The restrict
keyword instructs the compiler to statically check that the function uses only those language
features that are supported by most GPUs; for example: void myFunc() restrict(amp) {…}.
Microsoft, or any other implementer of the open C++ AMP specification, could add other restrict
specifiers for other purposes, including purposes that are unrelated to C++ AMP.
Example:
The following example illustrates the primary components of C++ AMP. Assume that you want
to add the corresponding elements of two one-dimensional arrays. For example, you might want
to add {1, 2, 3, 4, 5} and {6, 7, 8, 9, 10} to obtain {7, 9, 11, 13, 15}. Using C++ AMP, you might
write the following code to add the numbers and display the results.
#include <amp.h>
#include <iostream>
using namespace concurrency;

const int size = 5;

void CppAmpMethod() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[size];

    // Create C++ AMP objects.
    array_view<const int, 1> a(size, aCPP);
    array_view<const int, 1> b(size, bCPP);
    array_view<int, 1> sum(size, sumCPP);
    sum.discard_data();

    parallel_for_each(
        // Define the compute domain, which is the set of threads that are created.
        sum.extent,
        // Define the code to run on each thread on the accelerator.
        [=](index<1> idx) restrict(amp)
        {
            sum[idx] = a[idx] + b[idx];
        }
    );

    // Print the results. The expected output is "7, 9, 11, 13, 15".
    for (int i = 0; i < size; i++) {
        std::cout << sum[i] << "\n";
    }
}
Shaping and Indexing Data: index and extent
You must define the data values and declare the shape of the data before you can run the kernel
code. All data is defined to be an array (rectangular), and you can define the array to have any
rank (number of dimensions). The data can be any size in any of the dimensions.
Index Class
The index Class specifies a location in the array or array_view object by encapsulating the offset
from the origin in each dimension into one object. When you access a location in the array, you
pass an index object to the indexing operator, [], instead of a list of integer indexes. You can
also access the elements in each dimension by using the array::operator() or the
array_view::operator() function call operator.
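
For illustration, here is a small sketch (the helper name IndexExample is only illustrative) of
accessing one element of a two-dimensional array_view through an index<2> object:

#include <amp.h>
using namespace concurrency;

void IndexExample()
{
    int data[] = { 1, 2, 3,
                   4, 5, 6 };
    array_view<int, 2> av(2, 3, data);   // 2 rows, 3 columns
    index<2> idx(1, 2);                  // row 1, column 2 (zero-based)
    int value = av[idx];                 // value == 6
}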

Extent Class
The extent class specifies the length of the data in each dimension of an array or array_view
object. We can also create or retrieve the extent of an existing array or array_view object.
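
As a small sketch (names are illustrative): an extent describes the length in each dimension,
and an array_view exposes the extent of the data it wraps.

#include <amp.h>
using namespace concurrency;

void ExtentExample()
{
    int data[12] = {0};
    extent<2> e(3, 4);                    // 3 rows, 4 columns
    array_view<int, 2> av(e, data);
    int rows  = av.extent[0];             // 3
    int total = av.extent.size();         // 12 elements in all
}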
Moving Data to the Accelerator: array and array_view
Two data containers used to move data to the accelerator are defined in the runtime library. They
are the array Class and the array_view Class. The array class is a container class that creates a
deep copy of the data when the object is constructed. The array_view class is a wrapper class that
copies the data when the kernel function accesses the data. When the data is needed on the
source device, it is copied back.
Array Class
When an array object is constructed, a deep copy of the data is created on the accelerator if you
use a constructor that includes a pointer to the data set. The kernel function modifies the copy on
the accelerator. When the execution of the kernel function is finished, you must copy the data
back to the source data structure. The following example multiplies each element in a vector by
10. After the kernel function is finished, the vector conversion operator is used to copy the data
back into the vector object.
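
The example referred to above is not reproduced in these notes; the following sketch follows
the same lines (the function name MultiplyArrayExample is illustrative):

#include <amp.h>
#include <iostream>
#include <vector>
using namespace concurrency;

void MultiplyArrayExample()
{
    std::vector<int> data(5);
    for (int count = 0; count < 5; count++)
    {
        data[count] = count;
    }

    array<int, 1> a(5, data.begin(), data.end());   // deep copy onto the accelerator

    parallel_for_each(
        a.extent,
        [=, &a](index<1> idx) restrict(amp)
        {
            a[idx] = a[idx] * 10;
        });

    data = a;   // the vector conversion operator copies the data back
    for (int i = 0; i < 5; i++)
    {
        std::cout << data[i] << "\n";
    }
}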
Array_View Class
The array_view has nearly the same members as the array class, but the underlying behavior is
not the same. Data passed to the array_view constructor is not replicated on the GPU as it is with
an array constructor. Instead, the data is copied to the accelerator when the kernel function is
executed. Therefore, if you create two array_view objects that use the same data, both
array_view objects refer to the same memory space. When you do this, you have to synchronize
any multithreaded access. The main advantage of using the array_view class is that data is
moved only if it is necessary.
Table 6.1: Similarities and differences between the array and array_view classes.

Description               | array class                                                     | array_view class
When rank is determined   | At compile time.                                                | At compile time.
When extent is determined | At run time.                                                    | At run time.
Shape                     | Rectangular.                                                    | Rectangular.
Data storage              | Is a data container.                                            | Is a data wrapper.
Copy                      | Explicit and deep copy at definition.                           | Implicit copy when it is accessed by the kernel function.
Data retrieval            | By copying the array data back to an object on the CPU thread. | By direct access of the array_view object or by calling the array_view::synchronize method to continue accessing the data on the original container.

Shared memory with array and array_view
Shared memory is memory that can be accessed by both the CPU and the accelerator. The use of
shared memory eliminates or significantly reduces the overhead of copying data between the
CPU and the accelerator. Although the memory is shared, it cannot be accessed concurrently by
both the CPU and the accelerator, and doing so causes undefined behavior. ‘array’ objects can be
used to specify fine-grained control over the use of shared memory if the associated accelerator
supports it.
An array_view reflects the same ‘cpu_access_type’ as the array that it’s associated with; or, if
the array_view is constructed without a data source, its ‘access_type’ reflects the environment
that first causes it to allocate storage. That is, if it’s first accessed by the host (CPU), then it
behaves as if it were created over a CPU data source and shares the access_type of the
accelerator_view associated by capture; however, if it's first accessed by an accelerator_view,
then it behaves as if it were created over an array created on that accelerator_view and shares the
array’s access_type.
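
A minimal sketch of opting in to shared memory, assuming the default accelerator supports it
(names are illustrative; the default access type must be set before arrays are allocated on that
accelerator):

#include <amp.h>
using namespace concurrency;

void SharedMemoryExample()
{
    accelerator acc = accelerator(accelerator::default_accelerator);
    if (acc.supports_cpu_shared_memory)
    {
        // Let arrays on this accelerator use CPU-accessible (shared) memory by default.
        acc.set_default_cpu_access_type(access_type_read_write);
    }

    // An array whose storage the CPU can read and write without an explicit copy.
    array<int, 1> data(5, acc.default_view, access_type_read_write);
}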

Executing Code over Data: parallel_for_each


The parallel_for_each function defines the code that you want to run on the accelerator against
the data in the array or array_view object. Consider the following code from the introduction of
this topic.

#include <amp.h>
#include <iostream>
using namespace concurrency;

void AddArrays() {
    int aCPP[] = {1, 2, 3, 4, 5};
    int bCPP[] = {6, 7, 8, 9, 10};
    int sumCPP[5] = {0, 0, 0, 0, 0};

    array_view<int, 1> a(5, aCPP);
    array_view<int, 1> b(5, bCPP);
    array_view<int, 1> sum(5, sumCPP);

    parallel_for_each(sum.extent, [=](index<1> idx) restrict(amp)
    {
        sum[idx] = a[idx] + b[idx];
    });

    for (int i = 0; i < 5; i++)
    {
        std::cout << sum[i] << "\n";
    }
}
The parallel_for_each method takes two arguments: a compute domain and a lambda
expression.

 The compute domain is an extent object or a tiled_extent object that defines the set of
threads to create for parallel execution. One thread is generated for each element in the
compute domain. In this case, the extent object is one-dimensional and has five elements.
Therefore, five threads are started.
 The lambda expression defines the code to run on each thread. The capture clause, [=],
specifies that the body of the lambda expression accesses all captured variables by value,
which in this case are a, b, and sum. In this example, the parameter list creates a one-
dimensional index variable named idx. The value of idx[0] is 0 in the first thread and
increases by one in each subsequent thread. The restrict(amp) specifier indicates that only the
subset of the C++ language that C++ AMP can accelerate is used.

Accelerating Code: Tiles and Barriers


You can gain additional acceleration by using tiling. Tiling divides the threads into equal
rectangular subsets or tiles. Determine the appropriate tile size based on your data set and the
algorithm that you are coding. For each thread, you have access to the global location of a data
element relative to the whole array or array_view and access to the local location relative to the
tile. Using the local index value simplifies your code because you don't have to write the code to
translate index values from global to local. To use tiling, call the extent::tile Method on the
compute domain in the parallel_for_each method, and use a tiled_index object in the lambda
expression.
In typical applications, the elements in a tile are related in some way, and the code has to access
and keep track of values across the tile. Use the tile_static keyword and the
tile_barrier::wait method to accomplish this. A variable that has the tile_static keyword has a
scope across an entire tile, and an instance of the variable is created for each tile. You must
handle synchronization of tile-thread access to the variable. The tile_barrier::wait Method stops
execution of the current thread until all the threads in the tile have reached the call to
tile_barrier::wait. So you can accumulate values across the tile by using tile_static variables.
Then you can finish any computations that require access to all the values.
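
A hedged sketch (not an example from these notes) that tiles a 4 x 6 array_view into 2 x 2
tiles, copies each tile into tile_static storage, waits at a barrier, and then writes the tile sum
back:

#include <amp.h>
using namespace concurrency;

void TileExample()
{
    int sampleData[] = { 2, 2, 9, 7, 1, 4,
                         4, 4, 8, 8, 3, 4,
                         1, 5, 1, 2, 5, 2,
                         6, 8, 3, 2, 7, 2 };
    array_view<int, 2> av(4, 6, sampleData);

    parallel_for_each(av.extent.tile<2, 2>(),
        [=](tiled_index<2, 2> idx) restrict(amp)
        {
            tile_static int nums[2][2];                       // one instance per tile
            nums[idx.local[0]][idx.local[1]] = av[idx.global];
            idx.barrier.wait();                               // all four threads in the tile have written
            av[idx.global] = nums[0][0] + nums[0][1]
                           + nums[1][0] + nums[1][1];         // sum of the tile
        });
}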
Graphics Library
C++ AMP includes a graphics library that is designed for accelerated graphics programming.
This library is used only on devices that support native graphics functionality. The methods are
in the Concurrency::graphics Namespace and are contained in the <amp_graphics.h> header file.
The key components of the graphics library are:
 texture Class: You can use the texture class to create textures from memory or from a file.
Textures resemble arrays because they contain data, and they resemble containers in the
C++ Standard Library with respect to assignment and copy construction.
The template parameters for the texture class are the element type and the rank.
 writeonly_texture_view Class: Provides write-only access to any texture.
 Short Vector Library: Defines a set of short vector types of length 2, 3, and 4
that are based on int, uint, float, double, norm, or unorm.

Universal Windows Platform (UWP) Apps


Like other C++ libraries, you can use C++ AMP in your UWP apps.

Concurrency Visualizer
The Concurrency Visualizer includes support for analyzing the performance of C++ AMP
code.
Performance Recommendations
Modulus and division of unsigned integers have significantly better performance than modulus
and division of signed integers. We recommend that you use unsigned integers when possible.

OpenCL:
Introduction:
1. Central Processing Unit (CPU):
• Fundamental component of a modern computer
• Constantly evolving in terms of performance
• Better CPU performance traditionally meant a greater core clock frequency.
• Clock frequencies have reached their limit due to power requirements.
• Solution: increase the number of cores per chip.

2. Graphics Processing Unit (GPU):

• Primarily used to manage and boost the performance of video and graphics
• Main feature: a high number (hundreds) of simple cores
• GPUs work in tandem with a CPU, and are responsible for generating the graphical
output display (computing pixel values)
• Inherently parallel: each core computes a certain set of pixels
• The architecture has evolved for this purpose

CPU vs GPU:

GPGPU:
• GPGPU: General Purpose computation on Graphics Processing Units.
• Idea: using GPU for generic computations
• GPU acts as an “accelerator” to the CPU (Heterogeneous System)
• Most lines of code are executed on the CPU
• Key computational kernels are executed on the GPU
• Taking advantage of the large number of cores and high graphics memory
bandwidth
• AIM: the code performs better than it would on the CPU alone.
• Nvidia was the pioneer of GPGPU.
• It created the CUDA language (based on C) and the guidelines to follow.

Heterogeneous Computing:
Heterogeneous computing exploits the capabilities of different computing resources in a system
like
• CPU
• GPU
• Multicore Microprocessor
• Digital Signal Processor

• Heterogeneous applications commonly include a mix of workload behaviors:


• control intensive (e.g. searching, sorting, and parsing)
• data intensive (e.g. image processing, simulation and modeling, and data mining)
• compute intensive (e.g. iterative methods, numerical methods, and financial
modeling)
• Each of these workload classes executes most efficiently on a specific style of hardware
architecture and no single device is best for running all classes of workloads. For
example:
• Control-intensive applications tend to run faster on superscalar CPUs
• They use branch prediction mechanisms that are very powerful on this
hardware
• Data-intensive applications tend to run faster on vector architectures
• In this kind of application the same operation is applied to multiple data
items, and on vector architectures multiple operations can be executed in
parallel
• Usually used to obtain a high level of parallelization
• Course focus: GPU – CPU
• The use of a graphics processing unit (GPU) together with a CPU to
accelerate scientific, analytics, engineering, consumer, and enterprise
applications is a simple and common scenario of heterogeneous
programming

Fig: Heterogeneous computing

Parallel Programming:
• Parallel programming is the ability to use multiple computing resources to speed up
the computation
• Two kinds of parallelism:
• Task-based: each unit carries out a different job.
• Data-based: all units do the same work on different subsets of the data

Fig: Parallel Programming

• A problem can be parallelized only if it can be divided into independent subproblems


• If the problem can be divided, it’s possible to use a decomposition method
• Two main decomposition methods: task decomposition and data decomposition

Parallelism Sample
• Classic Sample: multiplication of the elements of two arrays A and B (each array has N
elements) storing the result of each multiply in the corresponding element of array C
• The standard way to develop this sample is to implement a sequential solution; a minimal sketch follows
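
A minimal sequential sketch of that solution (the function and parameter names are only
illustrative):

// The multiply statement in the loop body runs once per element, i.e. N times in series.
void multiply_arrays(const float *A, const float *B, float *C, int N)
{
    for (int i = 0; i < N; i++)
    {
        C[i] = A[i] * B[i];   // executed N times, one element at a time
    }
}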

• Problem: this solution executes the multiply statement N times (once for each element of
the array) without any parallelism
• Question: Is the sample parallelizable? And why?
• Answer: Yes! The sample is parallelizable because the multiplication of each pair of
elements of A and B is independent.
• It is therefore possible to create independent subproblems.
• Solution: Generate a separate execution instance to perform the computation of each
element of C. This code possesses significant data-level parallelism because it’s possible
to perform the same operation in parallel.

Fig: Parallelism Sample – Parallel Solution

Heterogeneous Computing – Problem:


• For a class of algorithms developers write code in C or C++ and run it on a CPU.
• For another class of algorithms developers often write code in CUDA and use a GPU
• Two related approaches, but each works on only one kind of processor
• Developers have to specialize in one and ignore the other.
So how do you program such machines?
OpenCL:
The solution is Open Computing Language or OpenCL, a programming language developed
specifically to support heterogeneous computing environments.

• OpenCL is managed by the non-profit technology consortium Khronos Group (Apple,
IBM, NVIDIA, AMD, Intel, ARM, etc.).
• The aim of OpenCL is to enable the development of applications that can be executed
across a range of different devices made by different vendors.

• Using the core language and correctly following the specification, any program
designed for one vendor can execute on another vendor’s hardware.
• Version 1.0 (2008), first shipped with Apple’s Mac OS X Snow Leopard.
• AMD announced support for OpenCL in the same timeframe, and in 2009 IBM
announced support for OpenCL in its XL compilers for the Power architecture.
• Version 1.1 released in 2010.
• Version 1.2 released in 2011.
• Version 2.0 released in 2013 (the current version at the time of writing).
• OpenCL supports different levels of parallelism.
• It efficiently maps to
• homogeneous systems
• heterogeneous systems.
• single- or multiple-device systems consisting of CPUs, GPUs, and other types of devices
• limited only by the imagination of vendors.
• OpenCL code is written in OpenCL C, a restricted version of the C99 language with
extensions appropriate for executing data-parallel code on a variety of heterogeneous
devices.

OpenCL vs OpenGL:
• OpenCL is similar to OpenGL but THEY ARE NOT THE SAME!!!!!!!
• OpenCL is specifically crafted to increase computing efficiency across platforms and it
is typically used for image processing algorithms, physical simulations. It returns
numerical results (NO IMAGE RESULTS).
• OpenGL is a graphical API that allows you to send rendering commands to the GPU.
Typically, the goal is to show the rendering on screen.

OpenCL – Specification
The OpenCL specification is defined in 4 parts called models.
1. Platform model:
• Specifies two kinds of processors:
• Host processor: coordinates the execution. Only one.
• Device processors: execute OpenCL C kernels. One or more
• Defines an abstract hardware model for Devices.

2. Execution model: Defines how the OpenCL environment is configured by the host, and
how the host may direct the devices to perform work. This includes the definitions of:
• Host execution environment
• Mechanisms for Host-Device interaction
• Concurrency model used for the configuration of kernels.
• Fundamental elements: OpenCL work-items and work-groups.
3. Kernel programming model: Defines how the concurrency model is mapped to
physical hardware.
4. Memory model: Defines memory object types, and the abstract memory hierarchy that
kernels use regardless of the actual underlying memory architecture. It also contains
requirements for memory ordering and optional shared virtual memory between the host
and devices.

OpenCL – Platform Model:


• An OpenCL platform consists of a Host (CPU) connected to one or more OpenCL
Devices (GPU).
• A Device is divided into one or more compute units (functionally independent)
• Each compute unit is divided into one or more processing elements.

Fig: OpenCL – Platform Model


 The AMD Radeon R9 290X graphics card (Device) comprises 44 vector processors
(compute units)
 Each compute unit has four 16-lane SIMD (Single Instruction, Multiple Data) engines,
for a total of 64 lanes (processing elements).
 Each SIMD lane on the Radeon R9 290X executes a scalar instruction.
 This allows the GPU device to execute a total of 44 × 16 × 4 = 2816 instructions at a
time.

OpenCL – Platform Model API:
OpenCL offers two API functions for discovering platforms and devices: clGetPlatformIDs() and
clGetDeviceIDs(). A usage sketch is given after the connection instructions and steps below.

Instructions for connecting to Cometa GPU Server:


Use an SSH client to connect to our server. Linux or macOS: one is installed by default.
Windows: download an SSH client (PuTTY or OpenSSH).
Steps for connection:
1. ssh –l gpu 212.189.144.28
• Enter ‘bnd3espue’ as password
2. ssh –l username gpu
• ‘username’ is the login name you should have received from Cometa
• Enter the password you received
3. Steps to get the platforms/devices:
• STEP 1: discover the number of platforms/devices
• STEP 2: allocate enough space
• STEP 3: retrieve the desired number of platforms/devices
• You can choose which devices to retrieve with the device_type argument:
• CL_DEVICE_TYPE_CPU
• CL_DEVICE_TYPE_GPU
• CL_DEVICE_TYPE_ALL
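
A hedged sketch of the three-step pattern above, using clGetPlatformIDs() and
clGetDeviceIDs() (error checking omitted; names are illustrative):

#include <stdlib.h>
#include <CL/cl.h>

void discover(void)
{
    cl_uint numPlatforms = 0;
    clGetPlatformIDs(0, NULL, &numPlatforms);                 /* STEP 1: how many platforms? */
    cl_platform_id *platforms =
        (cl_platform_id *)malloc(numPlatforms * sizeof(cl_platform_id));  /* STEP 2: allocate space */
    clGetPlatformIDs(numPlatforms, platforms, NULL);          /* STEP 3: retrieve them */

    cl_uint numDevices = 0;
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, 0, NULL, &numDevices);
    cl_device_id *devices =
        (cl_device_id *)malloc(numDevices * sizeof(cl_device_id));
    clGetDeviceIDs(platforms[0], CL_DEVICE_TYPE_GPU, numDevices, devices, NULL);

    free(devices);
    free(platforms);
}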

OpenCL – Execution and Programming:


The Execution model defines two main components:
• Host program: written in C or C++, it runs on the OpenCL host.
• creates and queries the platform and the device attributes
• defines a context for the kernels
• builds the kernels and manages their executions.
• Kernels: written in OpenCL C, they are the basic units of executable code that run on the
OpenCL device.
• Each instance of an OpenCL kernel is executed by a compute unit.
• On submission of the kernel by the Host to the Device, an N-dimensional index
space is defined (N = 1, 2, or 3).
• The number of kernel instances is equal to the size of the index space, and one
kernel instance is created at each coordinate of this index space.
• Such an instance is called a "work-item" and the index space is called
the NDRange. The work-items are executed by the compute units.
• Work-items can be divided into smaller, equally sized "work-groups".

• So for each work-item we can define two types of identifier:


• global-id: A unique global ID given to each work item in the global NDRange
• local-id: A unique local ID given to each work item within a work group
These IDs are fundamental for the execution of kernels in OpenCL.

• The OpenCL C functions used inside a kernel to query these IDs include
get_global_id(), get_local_id(), get_group_id(), get_global_size(), and get_local_size().

 In order for the host to request that a kernel is executed on a device, a context must be
configured. A context enables the host to pass commands and data to the device.
 The API function to create a context is clCreateContext().
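
A minimal sketch of creating a context for one device (the platform and device handles are
assumed to come from the discovery calls shown earlier):

#include <CL/cl.h>

cl_context create_context(cl_platform_id platform, cl_device_id device)
{
    cl_context_properties props[] = {
        CL_CONTEXT_PLATFORM, (cl_context_properties)platform, 0 };
    cl_int err;
    return clCreateContext(props, 1, &device, NULL, NULL, &err);
}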

 The Execution model specifies that devices perform tasks based on commands which are
sent from the host to the device.
 A command-queue is the communication mechanism that the host uses to request action
by a device.
 One command-queue needs to be created per device.
 The API function clCreateCommandQueue() (deprecated in OpenCL 2.0 and substituted
by clCreateCommandQueueWithProperties()) is used to create a command-queue.
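
A minimal sketch of creating one command-queue for a device with the non-deprecated call
(handle names are illustrative):

#include <CL/cl.h>

cl_command_queue create_queue(cl_context context, cl_device_id device)
{
    cl_int err;
    return clCreateCommandQueueWithProperties(context, device, NULL, &err);
}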

 Any API call that submits a command to a command-queue will begin with clEnqueue
and require a command-queue as a parameter.
For example:
• A clEnqueueReadBuffer() call requests that the device send data to the host
• A clEnqueueNDRangeKernel() call requests that a kernel be executed on the device
 The commands put in a queue are handled through the use of events. Each clEnqueue
command has three parameters in common:
• a pointer to a list of events, called the wait-list, that specifies the dependencies of
the current command
• the number of events in the wait-list
• a pointer to an event that will represent the execution of the current command

• In addition, OpenCL includes barrier operations that can be used to synchronize
execution of command-queues.
• clFlush() issues all previously queued OpenCL commands in a command-queue
to the device associated with the command-queue
• clFinish() blocks until all previously queued OpenCL commands in a command-
queue are issued to the associated device and have completed
• The OpenCL API also includes the function clWaitForEvents(), which causes the host to
wait for all events specified in the wait list to complete execution.
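
A hedged sketch tying these pieces together: a read that waits on an earlier command's event,
followed by host-side synchronization (all handle names are illustrative):

#include <CL/cl.h>

void read_back(cl_command_queue queue, cl_mem outputBuf,
               void *hostResults, size_t bytes, cl_event kernelDone)
{
    cl_event readDone;
    clEnqueueReadBuffer(queue, outputBuf, CL_FALSE, 0, bytes, hostResults,
                        1, &kernelDone, &readDone);  /* wait-list with one event */
    clWaitForEvents(1, &readDone);                   /* host waits for this command only */
    clFinish(queue);                                 /* or: wait until the whole queue has drained */
}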
OpenCL source code is compiled at runtime through a series of API calls.
The process of creating a kernel from source code is as follows:
1. Store the OpenCL C source code in a character array.
• If the source code is stored in a file, it must be read into memory and stored as
a character array.
• Each kernel in a program source string or file is identified by the __kernel
qualifier.
2. Turn the source code into a program object, cl_program, by calling
clCreateProgramWithSource().
• It is also possible to create a program from binary with
clCreateProgramWithBinary().
3. Compile the program object, for one or more OpenCL devices, with
clBuildProgram().
• In case of compile errors, they will be reported here.
4. Create a kernel object of type cl_kernel by calling clCreateKernel() and specifying
the program object and kernel name.
• The final step of obtaining a cl_kernel object is similar to obtaining an
exported function from a dynamic library.
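
A hedged sketch of steps 1-4; the kernel source and the name "vecmul" are illustrative, not
taken from these notes:

#include <CL/cl.h>

cl_kernel build_vecmul_kernel(cl_context context, cl_device_id device)
{
    const char *source =                                   /* step 1: source in a char array */
        "__kernel void vecmul(__global const float *A,\n"
        "                     __global const float *B,\n"
        "                     __global float *C) {\n"
        "    int i = get_global_id(0);\n"
        "    C[i] = A[i] * B[i];\n"
        "}\n";

    cl_int err;
    cl_program program =
        clCreateProgramWithSource(context, 1, &source, NULL, &err);  /* step 2 */
    clBuildProgram(program, 1, &device, NULL, NULL, NULL);           /* step 3: compile errors appear here */
    return clCreateKernel(program, "vecmul", &err);                  /* step 4 */
}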

Unlike invoking functions in C programs, we cannot simply call a kernel with a list of
arguments. Before enqueuing the kernel, we have to specify each kernel argument individually
using clSetKernelArg().

Enqueuing a command to a device to begin kernel execution is done with a call to
clEnqueueNDRangeKernel().
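
A hedged sketch: each argument is set individually by position, then the kernel is enqueued
over a one-dimensional NDRange of n work-items (buffer names are illustrative):

#include <CL/cl.h>

void launch_vecmul(cl_command_queue queue, cl_kernel kernel,
                   cl_mem bufA, cl_mem bufB, cl_mem bufC, size_t n)
{
    clSetKernelArg(kernel, 0, sizeof(cl_mem), &bufA);
    clSetKernelArg(kernel, 1, sizeof(cl_mem), &bufB);
    clSetKernelArg(kernel, 2, sizeof(cl_mem), &bufC);

    size_t globalSize = n;   /* one work-item per element; work-group size left to the runtime */
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &globalSize, NULL, 0, NULL, NULL);
}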

Fig: OpenCL – Execution and Programming Model

OpenCL – Memory Model:


To support portability, OpenCL defines an abstract memory model that programmers can target
when writing code and vendors can map to their actual memory hardware.

OpenCL defines three types of memory objects: buffers, images and pipes.

• OpenCL classifies memory as either host memory or device memory.


• OpenCL divides device memory into four memory regions.
• These memory regions are relevant within OpenCL kernels.
• Global Memory: visible to all work-items
• Similar to the main memory on a CPU-based system.
• Constant Memory: specifically designed for data where each element is accessed
simultaneously by all work-items.
• Part of global memory.
• Local Memory: memory that is shared between work-items within a work-group.
• Private Memory: memory that is unique to an individual work-item.
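
A hedged sketch of OpenCL C kernel source (held in a host-side string) showing the
address-space qualifier that places each declaration in one of the four regions; the kernel
itself is illustrative:

const char *regionsKernelSrc =
    "__kernel void scale(__global float *data,       /* global memory   */\n"
    "                    __constant float *factor,   /* constant memory */\n"
    "                    __local float *scratch) {   /* local memory    */\n"
    "    int gid = get_global_id(0);                 /* private memory  */\n"
    "    scratch[get_local_id(0)] = data[gid];\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);\n"
    "    data[gid] = scratch[get_local_id(0)] * factor[0];\n"
    "}\n";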
