Snapdragon Hetcompute SDK 1.0.0 Refman en
May 2, 2018
Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries.
All Qualcomm Incorporated trademarks are used with permission. Other product and brand names may be
trademarks or registered trademarks of their respective owners.
Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Conventions
1.4 References
1.5 Technical Assistance
1.6 Acronyms
4 Getting Started
4.1 Writing your first HetCompute program
4.1.1 Building a HetCompute program using ndk-build
5 User Guide
5.1 Overview
5.1.1 Writing a HetCompute Application
5.1.1.1 Parallel vector addition
5.1.1.2 Parallel sort
5.1.1.3 Parallelism using tasks
5.1.2 Executing a HetCompute Application
5.2 Parallel Programming Patterns
5.2.1 Overview of HetCompute Patterns
5.2.2 Parallel Iteration
5.2.2.1 Parallel For Loop
5.2.2.2 Parallel Transformation
5.2.3 Parallel Reduction
5.2.4 Parallel Scan
5.2.5 Parallel Divide-and-Conquer
5.2.6 Parallel Sorting
5.2.7 Advanced Topics for Patterns
5.2.7.1 Pattern Object
5.2.7.2 Tuner
5.2.8 Pipeline
5.2.8.1 Overview
5.2.8.2 HetCompute Pipeline Example
5.2.8.3 HetCompute Pipeline Details
5.2.8.4 Launch the HetCompute pipeline
17 Miscellaneous
17.1 Interoperability
17.2 Legacy
17.2.1 Function Documentation
17.2.1.1 init
17.2.1.2 shutdown
Bibliography
List of Tables
1-1 Reference documents and standards
1-2 Acronyms
1.1 Purpose
This document describes the Qualcomm® Snapdragon™ Heterogeneous Compute SDK programming
model and API.
1.2 Scope
This document is for system developers using the Qualcomm Heterogeneous Compute SDK to develop
domain-specific libraries for high-performance applications. Qualcomm Heterogeneous Compute SDK
handles core management, providing the ability to port an application across multiple cores. Speed is
determined by the number of processors on the device.
This document provides the public interfaces necessary to use the features provided by the Qualcomm
Heterogeneous Compute SDK. A functional overview and information on leveraging the interface
functionality are also provided.
1.3 Conventions
Function declarations, function names, type declarations, and code samples appear in a different font. For
example, #include.
Code variables appear in angle brackets. For example, <number>.
Commands and command variables appear in a different font. For example, {copy a:*.* b:}.
1.4 References
The following table lists reference documents, which may include Qualcomm documents and
non-Qualcomm standards and resources. Reference documents that are no longer applicable are deleted
from this table; therefore, reference numbers might not be sequential. This document also includes a
Bibliography at the end, with linkable citations throughout.
1.6 Acronyms
For definitions of commonly used terms and abbreviations, refer to Q1. The following terms are specific to
this document.
Table 1-2 Acronyms
Acronym              Definition
API                  application programming interface
DAG                  directed acyclic graph
GPGPU                general purpose GPU
Qualcomm HetCompute  Qualcomm® Snapdragon™ Heterogeneous Compute SDK
MIMD                 multiple instruction, multiple data
MPI                  message passing interface
NDEBUG               C/C++ preprocessor macro for NO DEBUG
NDK                  Native Development Kit
SAXPY                scalar vector multiply
SIMD                 single instruction, multiple data
SMP                  symmetric multiprocessing
SoC                  system-on-a-chip
TLS                  thread local storage
This chapter explains how to configure an application to use HetCompute given the binary distribution. The
installer package available from the Qualcomm Developer Network contains precompiled dynamic
libraries for Android (32-bit and 64-bit ARM). Install the distribution on your system following the installer
prompts, and then see the appropriate section below on how to verify the installation and integrate it with
your application.
HetCompute assumes the existence of a working Android NDK and SDK. We recommend using NDK
r13b or later.
Note: The Qualcomm Hexagon SDK (available on Qualcomm Developer Network) is needed
to enable support for the Hexagon DSP in the Qualcomm HetCompute library. The recommended version of
the Qualcomm Hexagon SDK for use with HetCompute is 3.3.0 or later.
Before compiling the samples, specify the path to the root of the OpenCL directory containing the headers and
the library by initializing QSHETCOMPUTE_OPENCL_PATH in
$HETCOMPUTE_DIR/samples/build/android/jni/Android.mk.
To verify the installation, perform the following:
Using ndk-build:
cd $HETCOMPUTE_DIR/samples/build/android/jni
$ANDROID_NDK/ndk-build
The above ndk-build command builds both the 32-bit and 64-bit variants of the samples.
Note that some of the HetCompute GPU samples require image files; the samples assume that the image
files are under /mnt/sdcard on the device. Sample image files can be found in the
HETCOMPUTE_DIR/samples/src directory.
Next, edit your project’s jni/Android.mk to define the location of the HetCompute libraries and headers
and to generate a prebuilt shared library.
# Heterogeneous Compute SDK prebuilt
include $(CLEAR_VARS)
LOCAL_MODULE := qshetcompute
LOCAL_SRC_FILES := $(HETCOMPUTE_DIR)/$(TARGET_ARCH_ABI)/libhetCompute-$(QSHETCOMPUTE_VERSION).so
LOCAL_EXPORT_C_INCLUDES := $(HETCOMPUTE_DIR)/include
include $(PREBUILT_SHARED_LIBRARY)
If an application needs to disable exceptions, make the following changes in the Android.mk and Application.mk
files.
# Add the following CFLAGS in Android.mk
LOCAL_CFLAGS := -DHETCOMPUTE_DISABLE_EXCEPTIONS
Here is a sample Android.mk file that is used to build the shipped samples:
define hetcompute_add_sample
include $(CLEAR_VARS)
LOCAL_MODULE := hetcompute_sample_$1
LOCAL_C_INCLUDES := $(QSHETCOMPUTE_CORE_INCLUDE_PATH) \
$(QSHETCOMPUTE_OPENCL_INC_PATH) \
$(QSHETCOMPUTE_DSP_STUB_PATH)
HetCompute SDK supports heterogeneous compute with offload to the GPU and DSP. GPU offload is
supported using OpenCL and OpenGL kernels. The offload mechanism is either OpenCL 1.2 or later, or the
native Qualcomm GPU driver. The community has been collecting a list of Android devices that support
OpenCL. In addition, a Qualcomm-based platform that provides OpenCL support, such as the Qualcomm
DragonBoard, can be used.
Using HetCompute with OpenCL as the GPU backend requires the OpenCL C++ header file cl.hpp, which
needs to be patched. For details on how to patch cl.hpp, see OpenCL C++ Support. In the Android.mk
file, set LOCAL_C_INCLUDES to include the path to the OpenCL headers. In the above Android.mk, this path is
referred to as QSHETCOMPUTE_OPENCL_INC_PATH. libOpenCL-prebuit refers to the corresponding
OpenCL library for the 32-bit or 64-bit variant.
To build an application with the Hexagon-enabled library, set LOCAL_C_INCLUDES to include the DSP stub
headers generated using the Hexagon SDK. This path is referred to as QSHETCOMPUTE_DSP_STUB_PATH in the
samples Android.mk file.
Note
If issues are encountered, verify that the calculator example shipped with the Hexagon SDK works
properly on your device and that your device properly supports DSP execution.
These chapters discuss the steps needed if you are transitioning from Symphony System Manager SDK to
Heterogeneous Compute SDK.
• Migration from Symphony System Manager SDK 1.x to Snapdragon™ Heterogeneous Compute
SDK 1.0
The above program does the following: Given an input vector vin containing 1024 elements, all of which
are initialized to 0, every element is updated to store 2∗i, where i is the index of that element.
In line 3, the hetcompute.hh header is included, which is needed for any HetCompute program. All the
HetCompute classes and functions are declared in the hetcompute namespace.
This simple example illustrates the use of the HetCompute pfor_each pattern, which allows the elements
of a collection to be processed in parallel. Because there are no dependencies between iterations (termed
inter-iteration dependencies), the values can be computed and updated in parallel. This pattern can be used
to replace all loops in the user’s program that do not have inter-iteration dependencies. HetCompute
provides a variety of other patterns, which are described in Parallel Programming Patterns.
HetCompute also provides programmers with another layer of abstraction, allowing them to think about
algorithms in terms of concurrent tasks and letting the HetCompute runtime schedule them onto available
resources in the system. Programmers can create dynamic task graphs by setting dependencies between
tasks that the runtime enforces. Another key HetCompute abstraction, not shown in the example, is the
group. Groups allow the programmer to easily manage sets of tasks. Tasks and groups are discussed in
more detail in Introduction to Tasks.
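The idea of tasks ordered by dependencies can be sketched in standard C++ alone. The example below is a minimal illustration and does not use the HetCompute API: `run_two_then_join` and its lambdas are hypothetical names, and `std::async` stands in for a task runtime.

```cpp
#include <cassert>
#include <future>

// Hypothetical example: two independent tasks feed a third task.
// std::async stands in for a task runtime; this is not the HetCompute API.
int run_two_then_join(int x)
{
    // Two tasks with no dependency on each other may run concurrently.
    std::future<int> left  = std::async(std::launch::async, [x] { return x + 1; });
    std::future<int> right = std::async(std::launch::async, [x] { return x * 2; });
    assert(left.valid() && right.valid());

    // The join task depends on both predecessors: get() blocks until each is done.
    std::future<int> join = std::async(std::launch::async,
                                       [&left, &right] { return left.get() + right.get(); });

    return join.get();
}
```

Calling `run_two_then_join(3)` computes 4 and 6 in the two independent tasks and returns their sum. In this sketch the dependency is expressed by blocking inside the join task, which a real task runtime would normally avoid by scheduling the task only once its predecessors have finished.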
The example can then be built by typing the following at the command prompt:
$ ndk-build
5.1 Overview
All current hardware platforms, from desktops to smartphones, are built around multicore and
heterogeneous systems-on-a-chip (SoC). Servers and supercomputers are also using specialized cores, such
as GPUs, to improve performance and power efficiency.
Qualcomm Heterogeneous Compute SDK (HetCompute) enables the full utilization of the hardware at the
user application level, in the following ways:
• By providing a parallel programming model that allows programmers to express the concurrency in
their applications. HetCompute’s powerful abstractions ease the burden of parallel programming
through a design that builds on dynamic concurrency from the ground up. At the high level,
HetCompute provides a set of parallel programming patterns that capture many of the existing
parallel building blocks, and adds dataflow and work cancellation as first-class primitives that
improve programmer productivity.
• By seamlessly integrating heterogeneous execution into a concurrent task graph and removing the
burden of managing data transfers and explicit data copies between kernels executing on different
devices. At the low level, HetCompute provides state-of-the-art algorithms for work stealing and
power optimization that hide hardware idiosyncrasies and so allow the development of
portable applications. In addition, HetCompute is designed to support dynamic mapping to
heterogeneous execution units. Moreover, expert programmers can take charge of the execution
through a carefully designed system of attributes and directives that provide the runtime system with
additional semantic information about the patterns, tasks, and buffers that HetCompute uses as
building blocks.
• By embedding the programming model in C++ and providing a C++ library API. C++ is a familiar
language for a large number of performance-oriented programmers, thus making it easy for
programmers to pick up the abstractions quickly. C++ embedding also allows incremental
development of existing applications, because HetCompute interoperates with existing libraries, such
as pthreads and OpenGL.
HetCompute runs on top of a runtime system that will execute the concurrent applications on all the
available computational resources on the SoC. The HetCompute runtime system is essentially a resource
manager for threads, address spaces, and devices. It builds on a set of state-of-the-art algorithms to free
programmers from the need to manage these resources explicitly and provide the best performance for the
HetCompute execution model.
The remaining sections of this chapter provide a high-level overview of the HetCompute parallel patterns
and concurrent abstractions, and its execution model. The rest of the User’s Guide provides additional
details on the design decisions in HetCompute, which will allow the programmer to choose the right level of
primitives to use in the application. The Reference Manual includes the API details.
HetCompute is a user-level library that integrates with OS services to hide the complexity of hardware as
much as possible, while still providing programmers with control over performance. HetCompute takes
advantage of existing standards to enable execution on the entire SoC: POSIX and C++11 for exploiting
multicore, OpenCL to dispatch onto GPUs, and OpenDSP to dispatch to the Qualcomm Hexagon™ DSP.
The advantage of using HetCompute is that it provides a seamless interface for all these devices, therefore
enabling the programmer to focus on the application being developed, rather than managing hardware,
different execution models, and data transfers.
HetCompute’s execution model is a concurrent task graph, with acyclic control dependencies and/or data
dependencies that define which tasks should execute concurrently. Tasks (defined formally in sections
Introduction to Tasks and Tasks Reference API) are units of independent work. They are an intuitive way of
specifying chunks of computation that can map to different execution units. Dependencies (control and data)
provide the mechanism to dynamically build a concurrent task graph. Tasks execute in parallel on as
many execution units as are available on the platform at that moment. Note that on a mobile device, because of
power and thermal constraints, some execution units may not be available, or may even disappear dynamically.
Therefore, it is best that the programmers focus on expressing the concurrency using HetCompute tasks,
and the runtime will map them to all the available resources. In HetCompute, heterogeneous execution is no
different than multicore execution. However, to provide best performance, HetCompute requires
programmers to write specialized kernels. The current version of HetCompute supports writing GPU
kernels in the OpenCL language and DSP kernels in C99.
The HetCompute runtime manages tasks and maps them to platform resources using a state-of-the-art work
scheduler. The scheduler implements pervasive work stealing and dynamic mapping of tasks to execution
units, based on heuristics that are driven by the programmer using novel high-level APIs. Later in this
guide, several examples will be discussed on how the programmer can control the behavior of the runtime,
such as using pattern tuners and task attributes. These are particularly relevant for mobile devices.
HetCompute provides two levels of APIs:
• A set of high-level APIs that includes parallel programming patterns and basic tasks and group
creation and launch. These APIs are intended for programmers who focus first and foremost on
productivity. Using these APIs provides you with the best performance in most instances, with
relatively little coding effort. The semantics of these APIs is precisely defined, and the HetCompute
type system is designed to catch many concurrent programming errors.
• A set of low-level APIs that allow expert programmers finer control over the parallel execution.
These APIs may offer better performance, at the cost of removing some of the guarantees that the
high-level APIs provide. Direct access to task pointer objects, task attributes and pattern tuners,
specialized allocators, buffer consistency and synchronization, and storage classes are some examples
of these APIs. These are the foundation for the high-level APIs, and thus the two levels work in
concert. However, using the low-level APIs requires a good understanding of parallel programming
and the side-effects that concurrent execution can have on your program; therefore, use with caution!
The target audience for HetCompute is programmers who require performance. HetCompute is envisioned
to be used by application programmers and library programmers to build high-performance applications
and domain-specific libraries. It is designed to make composing libraries easy: HetCompute tasks can be
launched from any application thread (no need to join a particular thread pool), tasks can be launched
hierarchically and synchronized individually or as a group, and patterns and tasks share a unified
representation. These novel characteristics make HetCompute uniquely positioned as a framework for heterogeneous
execution. Many other application programmers can benefit from HetCompute by embedding such
Qualcomm HetCompute-enabled libraries and thus indirectly benefit from parallel and heterogeneous
execution without the burden of parallel programming.
• Identify the algorithm to be parallelized and design a parallel version of the algorithm.
• Encode the algorithm using HetCompute abstractions:
– If the algorithm matches one of the HetCompute patterns, use the pattern directly and enjoy the
speedups.
– More complex applications either require multiple patterns or exhibit
parallelism that does not match one of the existing patterns. In this case, use the HetCompute
building blocks of task and group to partition the algorithm into tasks, set dependencies
between the tasks (building the execution task graph), and launch the tasks for execution. Also
consider how to partition the data for concurrent access.
• Patterns and tasks are interoperable, as the HetCompute library maps patterns to tasks. Thus, a
HetCompute application consists of a forest of DAGs. The runtime system schedules the tasks once
their dependencies are satisfied.
• HetCompute task graphs execute across different devices when the programmer provides device
kernels. To execute on the GPU, write kernels in OpenCL. To run on the DSP, write kernels in C99.
These kernels are integrated into the task graph just like other tasks that are designed for the
CPU. Kernels: The Path to Heterogeneity has details on how to design and build heterogeneous
kernels.
The fastest way to build a HetCompute application is by using the HetCompute patterns. If your parallel
algorithm matches one of the parallel programming patterns in HetCompute (pfor_each, preduce,
ptransform, pscan, psort, pdivide_and_conquer, or pipeline), directly using the
pattern is recommended. The HetCompute runtime understands the semantics of these constructs and
optimizes for their concurrent execution.
Below are examples of how to use several of these patterns.
One of the most common operations for parallel programming is parallel iteration. HetCompute provides
the pfor_each pattern to support parallel iteration. Below is an example:
1 #include <cstdlib>
2 #include <vector>
3
4 #include <hetcompute/hetcompute.hh>
5
6 using namespace std;
7
8 int
9 main()
10 {
11 hetcompute::runtime::init();
12 const size_t N = 100;
13 vector<float> a(N), b(N), c(N);
14
15 // Initialize the source arrays with random numbers
16 for (size_t i = 0; i < N; i++)
17 {
18 a[i] = static_cast<float>(rand()) / ((1ULL << 31) - 1);
19 b[i] = static_cast<float>(rand()) / ((1ULL << 31) - 1);
20 }
21 float alpha = 0.2f;
22
23 // add the two vectors concurrently
24 hetcompute::pfor_each(size_t(0), N, [&](size_t i) { c[i] = alpha * a[i] + b[i]; });
25
26 hetcompute::runtime::shutdown();
27 return 0;
28 }
The use of HetCompute is highlighted in this example. Line 4 includes the HetCompute library headers. The
code up to and including line 20 is standard C++11 that initializes two vectors, a and b, of size N. In line 24 the
hetcompute::pfor_each construct is invoked. It is very similar to a for loop, except that the
iterations will be executed in parallel on as many execution units as are available on the platform. A more
detailed description of these patterns is in Patterns Reference API.
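To make "iterations executed in parallel" concrete, the following is a minimal sketch in standard C++ only. It is not how HetCompute implements pfor_each: `chunked_parallel_for` and `saxpy` are hypothetical helpers, and the range is simply split into one contiguous chunk per hardware thread.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not part of HetCompute: runs fn(i) for every i in
// [first, last), splitting the range into one contiguous chunk per thread.
template <typename Fn>
void chunked_parallel_for(std::size_t first, std::size_t last, Fn fn)
{
    assert(first <= last);
    const std::size_t n = last - first;
    if (n == 0) return;
    const std::size_t hw = std::max<std::size_t>(1, std::thread::hardware_concurrency());
    const std::size_t workers = std::min(hw, n);
    const std::size_t chunk = (n + workers - 1) / workers; // ceiling division

    std::vector<std::thread> pool;
    for (std::size_t begin = first; begin < last; begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, last);
        pool.emplace_back([begin, end, &fn] {
            for (std::size_t i = begin; i < end; ++i) fn(i);
        });
    }
    for (auto& t : pool) t.join();
}

// c[i] = alpha * a[i] + b[i], the same element-wise update as in the listing.
std::vector<float> saxpy(float alpha, const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    chunked_parallel_for(0, a.size(), [&](std::size_t i) { c[i] = alpha * a[i] + b[i]; });
    return c;
}
```

Each iteration writes a distinct element of c and reads only a and b, so no synchronization is needed beyond the final join. This independence between iterations is what makes the loop a candidate for a parallel-for pattern.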
Another example of a common operation that benefits from concurrent execution is sorting of large arrays.
Here is an example of how to sort in parallel in HetCompute:
1 #include <random>
2 #include <sstream>
3 #include <vector>
4
5 #include <hetcompute/hetcompute.hh>
6
12
13 int
14 main(int argc, const char* argv[])
15 {
16 hetcompute::runtime::init();
17 std::vector<long> input;
18 size_t n_def = 20;
19 size_t n = n_def;
20
21 if (argc >= 2)
22 {
23 std::istringstream istr(argv[1]);
24 istr >> n;
25 }
26
27 std::random_device rd;
28 std::mt19937 generator(rd());
29 std::uniform_int_distribution<long> dis;
30 const size_t num_ints = 1ULL << n;
31 // Create a random array of integers
32 for (size_t i = 0; i < num_ints; i++)
33 {
34 input.push_back(dis(generator));
35 }
36
37 hetcompute::psort(input.begin(), input.end());
38
39 if (!std::is_sorted(input.begin(), input.end()))
40 {
41 std::cerr << "psorting failed\n";
42 }
43
44 hetcompute::runtime::shutdown();
45 return 0;
46 }
Most of the code in this example is standard C++ to initialize the data structures. In line 37, the
hetcompute::psort parallel sorting function is invoked. It takes two iterators, the beginning and the
end of the list, and it sorts (in place) the input vector in the interval [begin, end).
Hopefully, you are convinced how easy it is to introduce parallel programming in your application if your
application fits one of the predefined HetCompute patterns.
HetCompute exposes the fundamental building blocks, tasks and groups, to parallelize algorithms that
do not fit into one of the HetCompute patterns. Below is an example of sorting (using merge sort) that is
parallelized using HetCompute tasks:
1 #include <algorithm>
2 #include <functional>
3 #include <iostream>
4 #include <iterator>
5 #include <sstream>
6 #include <vector>
7
8 #include <hetcompute/hetcompute.hh>
9
10 // Parallel mergesort using recursive fork-join parallelism.
11 // hetcompute::task<>::finish_after allows easy expression of the parallelism in the
12 // algorithm in a non-blocking manner, yielding better performance than
13 // blocking parallelization using hetcompute::task<>::wait_for.
14
16 const size_t GRANULARITY = 8192;
17
18 // Asynchronous mergesort, to be invoked in a task
19 template <typename Iterator, typename Compare>
20 void
21 mergesort(Iterator begin, Iterator end, Compare cmp)
22 {
23 size_t n = std::distance(begin, end);
24 if (n <= GRANULARITY)
25 {
26 sort(begin, end, cmp);
27 }
28 else
29 {
30 auto middle = begin;
31 std::advance(middle, n / 2);
32 auto left = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
33 auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
34 auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle, end, cmp); });
35 // The left subtree and right subtree tasks must finish before the merge
36 // task can execute
37 left->then(merge);
38 right->then(merge);
39 merge->launch();
40 // mergesort(begin, end, cmp) logically finishes after the merge task
41 // finishes
42 merge->finish_after();
43 }
44 }
45
46 int
47 main(int argc, const char* argv[])
48 {
49 hetcompute::runtime::init();
50 std::vector<long> input;
51 size_t n_def = 1 << 16;
52 size_t n = n_def;
53
54 if (argc >= 2)
55 {
56 std::istringstream istr(argv[1]);
57 istr >> n;
58 }
59
60 // Create a random array of integers
61 for (size_t i = 0; i < n; i++)
62 {
63 input.push_back(rand());
64 }
65
66 // Launch mergesort inside a task since it has an asynchronous interface (due
67 // to use of hetcompute::task::finish_after)
68 auto t = hetcompute::launch([&] { mergesort(input.begin(), input.end(), std::less<long>()); });
69 t->wait_for();
70
71 if (!std::is_sorted(input.begin(), input.end()))
72 {
73 std::cerr << "parallel mergesorting failed\n";
74 }
75
76 hetcompute::runtime::shutdown();
77 return 0;
78 }
Note how much easier it is to just use the HetCompute patterns. In this example, a dynamic DAG of
tasks is constructed by splitting the array into halves and sorting each half in parallel, and then merging the
results using another task. Lines 32-33 create and launch the recursive sorting tasks. Line 34 creates the merge task.
Lines 37-38 set the dependencies between the tasks, thus building the DAG. Line 39 launches the merge task
into the runtime, and finally the function terminates when the merge task terminates (line 42). In the main
function, a task is created for the sort by passing the entire array (line 68); it has no dependencies, so it
is launched directly, and the program then waits for it to complete (line 69).
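The same fork-join shape can be sketched with the C++ standard library alone. This is an illustration under stated assumptions and not the HetCompute API: `forkjoin_mergesort` and its `depth` limit are hypothetical, `std::async` plays the role of a launched task, and `future::get()` plays the role of the dependency edge into the merge step.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <future>
#include <iterator>
#include <vector>

// Hypothetical sketch, not the HetCompute API: fork-join merge sort.
// depth bounds how many levels of the recursion spawn a new thread.
template <typename Iterator>
void forkjoin_mergesort(Iterator begin, Iterator end, std::size_t granularity, unsigned depth = 3)
{
    assert(granularity > 0);
    const std::size_t n = static_cast<std::size_t>(std::distance(begin, end));
    if (n <= granularity || depth == 0)
    {
        std::sort(begin, end); // small enough or deep enough: sort serially
        return;
    }
    Iterator middle = begin;
    std::advance(middle, n / 2);

    // Fork: sort the left half in another thread, the right half in this one.
    auto left = std::async(std::launch::async, [=] {
        forkjoin_mergesort(begin, middle, granularity, depth - 1);
    });
    forkjoin_mergesort(middle, end, granularity, depth - 1);

    // Join: the merge may only start after both halves are sorted.
    left.get();
    std::inplace_merge(begin, middle, end);
}
```

Unlike the non-blocking finish_after style in the listing, this sketch blocks the calling thread in `get()` until the left half is done, which is simpler to read but ties up a thread while waiting.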
This is a quick illustration of the power of HetCompute’s abstractions. In the rest of this guide a
walkthrough is provided with details on the design to help you use HetCompute to extract the most benefits
The HetCompute runtime fundamentally implements a thread pool over which tasks are scheduled at the
user level. When the application starts running, the thread pool is initialized such that it makes optimal use
of the existing hardware contexts on the device. The scheduler is a throughput-oriented scheduler. Tasks are
scheduled in a non-preemptive manner as they are ready for execution (dependencies are satisfied). They
are mapped to devices based on the kernel type. The runtime performs additional optimizations for
performance and energy efficiency based on pattern semantics, pattern tuning, and task attributes.
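As a rough picture of a user-level thread pool with non-preemptive scheduling, consider the sketch below. It is an assumption-laden illustration in standard C++, not HetCompute's runtime: `tiny_pool` is a hypothetical class with a single shared queue, with no work stealing, device mapping, or dependency tracking.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical sketch of a user-level thread pool; not HetCompute's runtime.
// Each task runs to completion on one worker (non-preemptive scheduling).
class tiny_pool
{
public:
    explicit tiny_pool(std::size_t workers)
    {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }

    // Drains the queue, then stops and joins all workers.
    ~tiny_pool()
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            stopping_ = true;
        }
        cv_.notify_all();
        for (auto& t : threads_) t.join();
    }

    // Enqueue a ready task (one whose dependencies are already satisfied).
    void submit(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(m_);
            ready_.push(std::move(task));
        }
        cv_.notify_one();
    }

private:
    void work()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(m_);
                cv_.wait(lock, [this] { return stopping_ || !ready_.empty(); });
                if (ready_.empty()) return; // stopping and nothing left to run
                task = std::move(ready_.front());
                ready_.pop();
            }
            task(); // runs without the lock held
        }
    }

    std::mutex m_;
    std::condition_variable cv_;
    std::queue<std::function<void()>> ready_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

Tasks submitted to the pool run on whichever worker dequeues them first, and the destructor waits for all queued tasks to finish. A single locked queue is the simplest design; it becomes a contention point with many workers, which is one reason schedulers often prefer per-worker queues with work stealing.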
Parallel Programming Patterns
Introduction to Tasks
Buffers
Textures
Data Structures
Storage
Affinity
Heterogeneous Computing in Action
Interoperability
The parallel for loop pattern, hetcompute::pfor_each, supports concurrent application of a given
function object to each element of the input collection described by the input iterators taken as arguments.
The input range can be expressed as a pair of integers (lower bound, upper bound), or a pair of random
access iterators (begin, end). This pattern is most suitable for replacing a serial loop that has no loop-carried
dependence (that is, no dependence across iterations). The following example illustrates the use
of the parallel iteration pattern for a simple computation.
1 #include <vector>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 // initialize the input vector
10 std::vector<size_t> vin(1024, 0);
11
12 // in-place update of the input vector
13 // equivalent to the following code
14 // for (size_t i = 0; i < vin.size(); ++i) {
15 // vin[i] = 2 * i;
16 // }
17 hetcompute::pfor_each(size_t(0), vin.size(), [&vin](size_t i) { vin[i] = 2 * i; });
18
19 hetcompute::runtime::shutdown();
20 return 0;
21 }
Despite its simple appearance, the underlying implementation is highly efficient in workload parallelization and
load balancing. A lock-free work-stealing algorithm is employed to balance the workload (that is, the iterations to
work on) across multiple computational cores. It attempts to exploit the maximum degree of concurrency
available in the loop computation, and has a very low synchronization overhead.
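Load balancing without locks can be illustrated with a simpler technique than work stealing. The sketch below uses dynamic self-scheduling instead: threads claim blocks of iterations from a shared atomic counter. It is not HetCompute's scheduler, and `self_scheduled_for` and its `block` parameter are hypothetical.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper, not HetCompute's scheduler: threads repeatedly claim the
// next block of iterations from a shared atomic counter, so a thread that
// finishes early simply claims more work.
template <typename Fn>
void self_scheduled_for(std::size_t first, std::size_t last, std::size_t block, Fn fn)
{
    assert(first <= last && block > 0);
    std::atomic<std::size_t> next{first};

    auto worker = [&] {
        for (;;)
        {
            // Atomic read-modify-write; no mutex is involved.
            const std::size_t begin = next.fetch_add(block);
            if (begin >= last) return;
            const std::size_t end = std::min(begin + block, last);
            for (std::size_t i = begin; i < end; ++i) fn(i);
        }
    };

    const unsigned hw = std::max(1u, std::thread::hardware_concurrency());
    std::vector<std::thread> pool;
    for (unsigned t = 0; t < hw; ++t) pool.emplace_back(worker);
    for (auto& t : pool) t.join();
}
```

A smaller block size balances uneven iterations better but increases traffic on the shared counter. Work stealing reduces that contention by giving each worker its own queue and letting idle workers take work from others.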
The API also takes two optional parameters: stride and tuner. Pattern tuners are covered in Tuner. The
stride parameter represents the step size of the incremental iterator, and has a default value of one. For
example, the parallel version of the following code snippet
for(size_t i = 0; i < vin.size(); i += 2)
vin[i] = 2 * i;
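As an illustration of stride semantics only (standard C++, not the HetCompute call), a strided loop can be rewritten over a dense counter k, where each k maps to the index first + k * stride. The helpers `strided_count` and `for_each_strided` below are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helpers illustrating stride semantics; not the HetCompute API.

// Number of indices visited by: for (i = first; i < last; i += stride)
std::size_t strided_count(std::size_t first, std::size_t last, std::size_t stride)
{
    assert(stride > 0);
    return first >= last ? 0 : (last - first + stride - 1) / stride;
}

// Visits the same indices through a dense counter k, so each k is independent
// and could be handed to any parallel loop over [0, strided_count(...)).
template <typename Fn>
void for_each_strided(std::size_t first, std::size_t last, std::size_t stride, Fn fn)
{
    const std::size_t count = strided_count(first, last, stride);
    for (std::size_t k = 0; k < count; ++k)
        fn(first + k * stride);
}
```

With a stride of 2 over an 8-element vector, the function object sees indices 0, 2, 4, and 6, exactly as the serial snippet does, and the odd-indexed elements are left untouched.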
The parallel iteration pattern can be nested. However, it is usually sufficient to only decorate the outermost
loop with hetcompute::pfor_each, provided the outermost loop has sufficient iterations to keep all cores
busy.
The parallel transformation pattern, hetcompute::ptransform, has three versions. The first two
versions apply a given function object to a range and store the result in another range. They are essentially
the parallel versions of std::transform (one applies a unary function and the other applies a binary
function). The third version, similar to hetcompute::pfor_each, performs an in-place transformation.
The major difference between the two patterns is that hetcompute::ptransform passes the
dereferenced input iterator to the function object, whereas hetcompute::pfor_each passes the input
iterator directly to the function object. Therefore, the input iterator passed to
hetcompute::ptransform is restricted to random access iterators. Integral iterators are not allowed
because they cannot be dereferenced.
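The difference between the two calling conventions can be shown with serial stand-ins. The helpers below are hypothetical and are not HetCompute functions: one passes the position (an index or an iterator) to the function object, the other passes the dereferenced element.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical serial stand-ins for the two calling conventions; the names
// are illustrative and are not HetCompute functions.

// pfor_each style: the function object receives the position itself
// (an integer index or an iterator) and decides what to do with it.
template <typename Position, typename Fn>
void visit_positions(Position first, Position last, Fn fn)
{
    for (Position p = first; p != last; ++p) fn(p);
}

// ptransform (in-place) style: the function object receives the dereferenced
// element, so Iterator must be dereferenceable; a plain integer would not compile.
template <typename Iterator, typename Fn>
void visit_elements(Iterator first, Iterator last, Fn fn)
{
    for (Iterator it = first; it != last; ++it) fn(*it);
}
```

`visit_positions` accepts either `std::size_t` bounds or a pair of iterators, whereas `visit_elements(std::size_t(0), n, fn)` fails to compile because an integer cannot be dereferenced. This mirrors the restriction on integral iterators described above.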
The parallel transformation pattern is useful when the programmer wishes to directly manipulate the
dereferenced input iterator in the function object. For example, the following code performs a binary
operation on different segments of the input range, and stores the result in the output range.
1 #include <functional>
2 #include <vector>
3
4 #include <hetcompute/hetcompute.hh>
5
6 int
7 main()
8 {
9 hetcompute::runtime::init();
10 // Initialize input vector: vin[i] = i
11 std::vector<int> vin(1024);
12 int j = 0;
13 for (auto& i : vin)
14 i = j++;
15
16 // vout[i] = vin[i] + vin[i+1]
17 std::vector<int> vout(vin.size() - 1);
18 hetcompute::ptransform(begin(vin),
19 begin(vin) + vout.size(), // first input range
20 begin(vin) + 1, // start of the second input range
21 begin(vout), // start of the output range
22 std::plus<int>());
23
24 hetcompute::runtime::shutdown();
25 return 0;
26 }
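For comparison, the serial computation that this listing parallelizes can be written with the binary form of std::transform. This is a reference sketch in standard C++; `adjacent_sums` is a hypothetical helper name.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Serial reference using std::transform (binary form):
// vout[i] = vin[i] + vin[i + 1]. adjacent_sums is a hypothetical helper name.
std::vector<int> adjacent_sums(const std::vector<int>& vin)
{
    assert(!vin.empty());
    std::vector<int> vout(vin.size() - 1);
    std::transform(vin.begin(),
                   vin.begin() + vout.size(), // end of the first input range
                   vin.begin() + 1,           // start of the second input range
                   vout.begin(),              // start of the output range
                   std::plus<int>());
    return vout;
}
```

The argument order matches the listing: the first input range, the start of the second input range, the start of the output range, and the binary operation. The two input ranges overlap, which is safe here because the output is written to a separate vector.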
If the programmer passes a pair of dereferenceable input iterators, the example becomes:
int parallel_sum = hetcompute::preduce(vin.begin(), vin.end(), identity,
std::plus<int>());
Or
int parallel_sum = hetcompute::preduce(arr, arr + 1024, identity, std::plus<int>());
However, if the programmer passes a pair of integral input iterators, which are not dereferenceable, the
programmer needs another function object to capture the input container and to define the accumulation
operation for a subrange starting with some initial value. This is necessary because the join function cannot
dereference the input iterators. The parallel sum example then becomes the following:
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<int> vin(1024, 0);
    int val = 1;
    for (auto& i : vin)
    {
        i = val++;
    }

    const int identity = 0;
    // parallel_sum = 1 + 2 + 3 + ... + 1024
    int parallel_sum = hetcompute::preduce(size_t(0),
                                           vin.size(),
                                           identity,
                                           // aggregate subrange
                                           [&vin](size_t f, size_t l, int& init) {
                                               for (size_t k = f; k < l; ++k)
                                               {
                                                   init += vin[k];
                                               }
                                           },
                                           // join intermediate results
                                           std::plus<int>());

    // check result
    int serial_sum = std::accumulate(vin.begin(), vin.end(), 0);
    if (parallel_sum != serial_sum)
    {
        std::cout << "Parallel reduction failed!" << std::endl;
    }

    hetcompute::runtime::shutdown();

    return 0;
}
Internally, an efficient work stealing algorithm has been implemented to parallelize the reduction
computation. The algorithm builds a reduction tree and first performs accumulation of subranges in a
top-down manner. The intermediate values are then joined together bottom-up to obtain a final result.
Because of the work stealing implementation, programmers can put some in-place transformation
computation ahead of reduction to build more complex algorithms. In this sense,
hetcompute::preduce can be viewed as the combination of the parallel for loop pattern and the
parallel reduction pattern. The in-place transformations are completed during the top-down accumulation
process, as exhibited by the following code snippet.
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<int> vin(1024, 0);
    int val = 1;
    for (auto& i : vin)
    {
        i = val++;
    }

    const int identity = 0;
    // parallel_sum = 2 + 4 + 6 + ... + 2048
    int parallel_sum = hetcompute::preduce(size_t(0),
                                           vin.size(),
                                           identity,
                                           // aggregate subrange
                                           [&vin](size_t f, size_t l, int& init) {
                                               for (size_t k = f; k < l; ++k)
                                               {
                                                   // some transformation func applied to vin
                                                   vin[k] *= 2;
                                                   init += vin[k];
                                               }
                                           },
                                           // join intermediate results
                                           std::plus<int>());

    // check result (vin has already been doubled in place at this point)
    int serial_sum = std::accumulate(vin.begin(), vin.end(), 0);
    if (parallel_sum != serial_sum)
    {
        std::cout << "Parallel reduction failed!" << std::endl;
    }

    hetcompute::runtime::shutdown();
    return 0;
}
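The shape of the reduction tree described above can be sketched in plain, serial C++. This is only an illustration of the top-down split and bottom-up join; it is not HetCompute's work-stealing implementation, and the names tree_reduce and grain are assumptions made for this sketch. The aggregate and join callables have the same signatures as the two function objects passed to hetcompute::preduce in the listings above.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Serial sketch of a reduction tree over the index range [first, last).
// tree_reduce and grain are illustrative names, not part of HetCompute.
template <typename T, typename Aggregate, typename Join>
T tree_reduce(std::size_t first, std::size_t last, const T& identity,
              Aggregate aggregate, Join join, std::size_t grain = 64)
{
    assert(first <= last && grain > 0);
    if (last - first <= grain)
    {
        // Leaf: accumulate the subrange, starting from the identity.
        T partial = identity;
        aggregate(first, last, partial);
        return partial;
    }
    // Inner node: split top-down, then join the two results bottom-up.
    const std::size_t mid = first + (last - first) / 2;
    T left  = tree_reduce(first, mid, identity, aggregate, join, grain);
    T right = tree_reduce(mid, last, identity, aggregate, join, grain);
    return join(left, right);
}
```

Because the leaves partition the range, each index is visited exactly once by the aggregate function. That is why an in-place transformation such as vin[k] *= 2 can safely be folded into the accumulation step.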
static const size_t GRANULARITY = 20;

static size_t
fibonacci(size_t n)
{
    return hetcompute::pdivide_and_conquer<size_t, size_t>(
        // Problem is to compute the n-th Fibonacci term
        n,
        // When should an arbitrary Fibonacci term, represented by 'm', be
        // computed sequentially?
        // Note that programmer chooses to compute Fibonacci terms 20 and lower
        // sequentially for best performance.
        [](size_t& m) { return m <= GRANULARITY; },
        // How to compute the term sequentially
        [](size_t& m) { return fibonacci_s(m); },
        // Split problem into independent subproblems
        [](size_t& m) {
            return std::vector<size_t>({ m - 1, m - 2 });
        },
        // Merge solutions to subproblems.
        // Note that the first parameter (size_t, corresponding to the split
        // problem) is unused in this case, but may be useful while merging in
        // other cases.
        [](size_t, std::vector<size_t>& sols) { return sols[0] + sols[1]; });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    size_t n_def = 24;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    size_t out = fibonacci(n);

    if (out != fibonacci_s(n))
    {
        std::cerr << "parallel fibonacci failed\n";
    }
    hetcompute::runtime::shutdown();
    return 0;
}
The parallel divide-and-conquer pattern, like other patterns, is built using the basic HetCompute
constructs of tasks and groups. However, the patterns are optimized using knowledge about the pattern
structure and the operations in the runtime to minimize the amount of synchronization, and to avoid other
bookkeeping operations that are needed for more generic use.
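The structure that the pattern parallelizes can be sketched serially in plain C++. The four callables below play the same roles as the four lambdas passed to hetcompute::pdivide_and_conquer in the Fibonacci listing above: a base-case test, a sequential solver, a split function, and a merge function. The names divide_and_conquer_serial, fib_serial, and fib_dc are illustrative assumptions, not HetCompute APIs, and the sketch runs the subproblems one after another rather than concurrently.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial sketch of the divide-and-conquer structure.
// divide_and_conquer_serial is an illustrative name, not a HetCompute API.
template <typename Solution, typename Problem,
          typename IsBase, typename Base, typename Split, typename Merge>
Solution divide_and_conquer_serial(Problem p, IsBase is_base, Base base,
                                   Split split, Merge merge)
{
    if (is_base(p))
    {
        return base(p); // small enough: solve directly
    }
    std::vector<Problem> subproblems = split(p);
    std::vector<Solution> solutions;
    solutions.reserve(subproblems.size());
    for (Problem& sub : subproblems)
    {
        // A parallel implementation could run these calls concurrently.
        solutions.push_back(
            divide_and_conquer_serial<Solution>(sub, is_base, base, split, merge));
    }
    return merge(p, solutions);
}

// Iterative Fibonacci used for the sequential base case.
std::size_t fib_serial(std::size_t n)
{
    std::size_t a = 0, b = 1;
    for (std::size_t i = 0; i < n; ++i)
    {
        std::size_t next = a + b;
        a = b;
        b = next;
    }
    return a;
}

// Fibonacci with a sequential cutoff, mirroring the listing above.
std::size_t fib_dc(std::size_t n, std::size_t cutoff = 20)
{
    assert(cutoff >= 1); // the split produces m - 1 and m - 2
    return divide_and_conquer_serial<std::size_t>(
        n,
        [cutoff](std::size_t& m) { return m <= cutoff; },
        [](std::size_t& m) { return fib_serial(m); },
        [](std::size_t& m) { return std::vector<std::size_t>({ m - 1, m - 2 }); },
        [](std::size_t, std::vector<std::size_t>& sols) { return sols[0] + sols[1]; });
}
```

The cutoff plays the role of GRANULARITY: a larger value means fewer, coarser subproblems, while a smaller value exposes more potential parallelism at the cost of more splitting and merging.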
Programmers can create a pattern object and invoke the pattern by using the run method or the () operator
with arguments, as illustrated in the following code example.
#include <vector>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<size_t> vin(1024, 0);

    // declare function object to be applied
    auto func = [&vin](size_t i) { vin[i] = 2 * i; };

    auto pfor = hetcompute::pattern::create_pfor_each(func);
    pfor.run(size_t(0), vin.size());

    hetcompute::runtime::shutdown();
    return 0;
}
Patterns are blocking by default: execution of the caller stops until the pattern call returns, which
is sometimes undesirable. The programmer might want patterns to run asynchronously, like other
HetCompute tasks. For this purpose, every pattern in HetCompute defines a corresponding asynchronous API
that does not wait for termination. These APIs are named after the original patterns, with the suffix
_async. In HetCompute, the most common way to launch a pattern asynchronously is to (1) create a pattern
object and (2) use hetcompute::create_task and hetcompute::launch to invoke the pattern. Programmers can
then use the rich semantics defined for HetCompute tasks and groups (that is, dependencies,
wait_for, finish_after, etc.) to manipulate patterns.
#include <vector>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<size_t> vin(1024, 0);

    // declare function object to be applied
    auto func = [&vin](size_t i) { vin[i] = 2 * i; };
    auto pfor = hetcompute::pattern::create_pfor_each(func);

    // create a pfor task and launch!
    auto t1 = hetcompute::create_task(pfor, size_t(0), vin.size());
    t1->launch();
    t1->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
5.2.7.2 Tuner
The default pattern implementations should cover the majority of use cases. However, no single
implementation is the best fit for all workload types. For that reason, HetCompute offers programmers a
collection of commonly used algorithm parameters that serve as performance tuning knobs (the tuner). In
particular, the programmer can declare a HetCompute tuner object, set its properties up front, and pass the
tuner object to the pattern API for the purpose of performance tuning. This is illustrated by the following
example:
// declare a hetcompute::pattern::tuner object and use the static chunking algorithm for parallelization.
hetcompute::pattern::tuner t;
t.set_static();
// start pfor
hetcompute::pfor_each(size_t(0), vin.size(), [&vin](size_t i) { vin[i] = 2 * i; }, t);
// start pfor
pfor.run(size_t(0),
         vin.size(),
         hetcompute::pattern::tuner()
             .set_max_doc(8)     // Use 8 tasks for load balancing
             .set_chunk_size(16) // The minimum stealing granularity is 16
         );
Some settings have no effect on a given pattern because nothing in that pattern corresponds to the setting.
The current HetCompute release focuses on performance tuning for hetcompute::pfor_each. The most
useful settings for tuning hetcompute::pfor_each are listed with an explanation of their usage.
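In general parallel-programming usage, static chunking means dividing the iteration range into fixed chunks before execution starts, in contrast to dynamic load balancing such as work stealing. The plain C++ sketch below illustrates that general idea only. It assumes this is what set_static selects and does not reproduce HetCompute's implementation; the name static_chunks is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Split [first, last) into at most n_chunks contiguous chunks whose sizes
// differ by at most one. static_chunks is an illustrative name only.
std::vector<std::pair<std::size_t, std::size_t>>
static_chunks(std::size_t first, std::size_t last, std::size_t n_chunks)
{
    assert(first <= last && n_chunks > 0);
    const std::size_t total = last - first;
    const std::size_t count = std::min(n_chunks, total); // no empty chunks
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    chunks.reserve(count);
    std::size_t begin = first;
    for (std::size_t c = 0; c < count; ++c)
    {
        // The first (total % count) chunks get one extra element.
        const std::size_t size = total / count + (c < total % count ? 1 : 0);
        chunks.emplace_back(begin, begin + size);
        begin += size;
    }
    return chunks;
}
```

A fixed partition like this has very little scheduling overhead, but it balances well only when every iteration costs about the same. For irregular workloads, a dynamic scheme with a tuned chunk size is usually the better fit.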
5.2.8 Pipeline
The HetCompute Pipeline pattern supports the pipeline parallel programming model, which is often used in
streaming applications.
The HetCompute Pipeline API allows the programmer to describe a linear chain of processing stages such
that the output of each stage is the input of the next. The programmer associates a C++ stage function with
each stage, and can specify a basic C++ type or a user-defined data-type for handing over data between
stages. Once launched, the Pipeline stage repeatedly executes the stage function over a data stream. A
successor stage starts executing on one data unit after its predecessor stage finishes processing the same
unit. While the stages in the pipeline execute a given data unit sequentially (from the first stage to the
last), they can execute different data units at the same time.
Note that in contrast to a typical pipeline model, where all the stages always execute exactly the same
number of iterations, HetCompute supports stages executing a different number of total iterations in one
Pipeline (by using hetcompute::iteration_rate). Also in contrast to the standard pipeline model
where the successor stage can start executing immediately after its predecessor finishes one iteration,
HetCompute Pipeline supports iteration delays between stages so that the predecessor can run at least n
iterations ahead of its immediate successor (by using hetcompute::iteration_lag).
HetCompute Pipeline is compatible with the HetCompute asynchronous semantics so that it can be
launched and waited on, just like any other task.
Algorithms for streaming applications can be expected to map to Pipeline in a straightforward manner.
5.2.8.1 Overview
• The Pipeline is created dynamically by a C++ program as a HetCompute pipeline object. Pipeline
stages should be added sequentially to the pipeline prior to launching, that is, once launched, the
Pipeline can no longer be modified.
• The Pipeline allows arbitrary C++ code in a stage function, though the parameter list is dictated by
the pipeline context (mandatory,
hetcompute::pattern::pipeline<UserData...>::context) and the data packet
(hetcompute::stage_input) that is expected to be handed over between stages.
• Stages can hand over data of any type that is copyable (assignable and constructible)
and default constructible.
– One stage iteration can produce at most 1 data unit.
– One stage iteration can consume 0 to n data units, depending on the features of the stage (see
Iteration Rate (hetcompute::iteration_rate)).
• The HetCompute pipeline can control the memory footprint of the stages. By default, it uses a special
execution manner (hetcompute::pattern::pipeline::enable_sliding_window) that favors pipeline
throughput: instead of allowing a stage to proceed freely with many future iterations and storing the
produced data, the Pipeline schedules the successor stage to consume the data as soon as possible,
which saves storage space. A pipeline stage can therefore specify a fixed amount of memory as a
circular buffer to store the produced data. HetCompute calls this circular buffer the Sliding Window.
Note that this execution manner is a property of the pipeline rather than of a stage. A pipeline can
also run in a freer manner (hetcompute::pattern::pipeline::disable_sliding_window) if no sliding
window is used in any of its stages. This execution manner may lead to a higher level of parallelism
at runtime, but gives no control over the memory footprint. It is mostly useful for performance tuning
when the level of parallelism is critical and memory footprint control is not. Moreover, the pipeline
internally uses different buffer data structures for inter-stage data transfers under the two modes:
a static circular buffer when sliding window mode is enabled, and a dynamic pool bucket buffer when
it is disabled. The implementation details of the inter-stage buffer are transparent to the user;
they are mentioned here as a side note for advanced users who need to reason about the performance
and memory usage of their applications under the different modes.
• The user can set the following parameters for each Pipeline stage:
– Stage Type: the execution order of the iterations for a stage
◦ Serial Stage (hetcompute::serial_stage): runs every iteration sequentially.
◦ Parallel Stage (hetcompute::parallel_stage): can run multiple consecutive iterations
concurrently.
· Degree of Concurrency (doc): number of consecutive iterations that can run in parallel.
· A parallel stage with doc = 1 is equivalent to a serial stage.
– Iteration Lag (hetcompute::iteration_lag): minimum number of iterations that a stage
should run ahead of its successor (should be ≥ 0).
– Iteration Rate (hetcompute::iteration_rate): rate of iterations between two consecutive
stages
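The way a sliding window bounds memory can be illustrated with a fixed-capacity circular buffer between a producer stage and a consumer stage. The sketch below is plain, single-threaded C++ with no synchronization. It is not the HetCompute inter-stage buffer, and the name bounded_ring is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Fixed-capacity circular buffer between a producer and a consumer stage.
// bounded_ring is an illustrative name; it is not a HetCompute type.
template <typename T>
class bounded_ring
{
public:
    explicit bounded_ring(std::size_t capacity) : slots_(capacity)
    {
        assert(capacity > 0);
    }

    // Returns false when the window is full: the producer has to wait
    // until the consumer has caught up.
    bool try_push(const T& value)
    {
        if (count_ == slots_.size())
            return false;
        slots_[(head_ + count_) % slots_.size()] = value;
        ++count_;
        return true;
    }

    // Returns false when there is nothing to consume yet.
    bool try_pop(T& out)
    {
        if (count_ == 0)
            return false;
        out = slots_[head_];
        head_ = (head_ + 1) % slots_.size();
        --count_;
        return true;
    }

    std::size_t size() const { return count_; }

private:
    std::vector<T> slots_;  // storage is allocated once, up front
    std::size_t head_ = 0;  // index of the oldest element
    std::size_t count_ = 0; // number of stored elements
};
```

The sketch also shows why a requirement like the one stated above for inter-stage data is natural: preallocating the slots needs a default-constructible type, and storing and retrieving values needs a copyable one.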
The following demonstrates how to express a simple video processing application using the HetCompute
Pipeline pattern.
The video processing application (as shown in fig_HETCOMPUTEPipeline) contains three stages:
typedef struct {
    File* InputFile;      // Input video file
    File* OutputFileOdd;  // Output video file for the odd frames
    File* OutputFileEven; // Output video file for the even frames
    size_t num_frames;    // Number of total frames in the input file
} FileInfo;
// Add the first stage to read the frames from the video stream
p.add_stage(hetcompute::serial_stage(),          // serial stage
            hetcompute::sliding_window_size(16), // use a sliding window for 16 frames
            read_frame_from_stream);
// Add the last serial stage to save every other frame back to a new output video file
p.add_stage(hetcompute::serial_stage(),       // serial stage
            hetcompute::iteration_rate(2, 1), // every 2 iterations in the 2nd stage map to 1 here
            save_every_other_frame);
• The stage function of a HetCompute Pipeline can be any of the following user-defined C++ entities:
– lambda expression
– callable object
– function pointer
• The return type of the stage function:
– void means no data needs to be passed over to the next stage;
– Non-void means one object of the return type will be handed over to the next stage after each
iteration.
– The last stage in the pipeline should always return void
• The parameter list of the stage function:
– The 1st parameter of the stage function is mandatory. It should be a reference to the context type
of the pipeline the stage belongs to (hetcompute::pattern::pipeline::context).
This makes it possible for the user-defined stage function to get access to the pipeline state.
– The 2nd parameter is optional. When needed, the 2nd parameter of a pipeline stage function has to
be a reference to hetcompute::stage_input<type>, where type should match the return
type of the stage function for the predecessor stage.
– The stage function for the first stage should always take one parameter since it does not have a
predecessor stage.
– For any other stages, the stage function takes one parameter if their predecessor stages return void;
otherwise, it takes two parameters.
– Before launching the pipeline, the HetCompute pipeline sanity check verifies that the return type of
the predecessor stage matches the stage_input type of the successor stage. An exception is raised
if the types do not match.
Code snippet for defining stage function in HetCompute:
// alias for the context that belongs to a pipeline without specific data
using context = hetcompute::pattern::pipeline<>::context;
S2 s2;
// add the first stage, where ... are other possible stage features, such as
// lag, rate, sliding window size
p.add_stage(hetcompute::serial_stage(), ..., s0);
// add the second stage, where ... are other possible stage features, such as
// lag, rate, sliding window size
p.add_stage(hetcompute::parallel_stage(8), ..., s1);
// add the third stage, where ... are other possible stage features, such as
// lag, rate, sliding window size
p.add_stage(hetcompute::serial_stage(), ..., s2);
Minimum number of iterations that a stage runs ahead of its successor (lag ≥ 0)
• Iteration Lag is always described in the successor stage for a pair of consecutive stages.
• Iteration Lag is scaled according to the stage rates (s1 is followed by s2), that is, L iterations in s2
correspond to ⌈L ∗ r1 / r2⌉ iterations in s1.
• The default value for iteration lag is 0.
• Execution Semantics:
– At any time, if i1 and i2 are the numbers of iterations that stages s1 and s2 have finished, the
following inequality holds: i1 ∗ r2 ≥ (i2 + L) ∗ r1
Code snippet for defining stage iteration lag in HetCompute:
// stage s1 is followed by s2
// iteration rate between s1 and s2 is (r1, r2)
// the lag between them is L
p.add_stage(..., stage1_body);
p.add_stage(..., hetcompute::iteration_lag(L),
hetcompute::iteration_rate(r1, r2), stage2_body);
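The lag arithmetic can be checked with a few lines of plain C++. The sketch below only restates the two formulas given above: the scaled lag, which is the ceiling of L ∗ r1 / r2, and the relation between finished iteration counts. The function names scaled_lag and lag_relation_holds are hypothetical helpers, not HetCompute APIs.

```cpp
#include <cassert>
#include <cstddef>

// ceil(L * r1 / r2): iterations of s1 that correspond to L iterations of s2.
std::size_t scaled_lag(std::size_t lag, std::size_t r1, std::size_t r2)
{
    assert(r1 > 0 && r2 > 0);
    return (lag * r1 + r2 - 1) / r2;
}

// Checks the relation i1 * r2 >= (i2 + L) * r1 for given iteration counts.
bool lag_relation_holds(std::size_t i1, std::size_t i2,
                        std::size_t r1, std::size_t r2, std::size_t lag)
{
    assert(r1 > 0 && r2 > 0);
    return i1 * r2 >= (i2 + lag) * r1;
}
```

For example, with a rate of (2, 1) and a lag of 3, the scaled lag is 6 iterations of s1, and under the relation as stated, i2 = 2 requires i1 to be at least 10.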
The unit size of the circular buffer between stages in a Pipeline that is needed to control the memory
footprint.
The information that is available to the user-defined stage functions for a specific pipeline
// define the stage function, which returns a size_t value and takes no stage input
auto stage_body = [](finfo_pcontext& ctx) -> size_t {
    foo(ctx.get_data()->infile, ctx.get_iter_id());
    bar(ctx.get_data()->outfile, ctx.get_stage_id());
    // stop the pipeline on the fly
    if (...)
        ctx.stop_pipeline();
    else if (...)
        ctx.cancel_pipeline();
    return 0;
};
// add the stage to the pipeline
p.add_stage(..., stage_body);
// add stages
...
// synchronous launch with 10 iterations for the first stage
// launch without sliding window
p.disable_sliding_window();
p.run(10);
// add the first stage that will stop the pipeline at some point
p.add_stage(hetcompute::serial_stage(), // stage type
            ...,                        // other features of the stage, that is, lag, rate, sliding window size
            [](context& ctx) {          // stage function
                // do something
                // when condition is true, stop the pipeline
                if (condition) {
                    ctx.stop_pipeline();
                }
            });
Launches the pipeline; execution blocks until the pipeline finishes.
Code snippet for synchronously launching a pipeline
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;
// add stages
...
// synchronous launch with known number of iterations
p.run(10);
// use HetCompute free function for synchronous launch with known number of iterations
hetcompute::launch(p, 10);
// pipeline finishes execution at this point
Pipeline execution behaves like a normal HetCompute task and supports all the asynchronous task
semantics, that is, launch, wait_for, dependency, finish_after, etc. Take the same precautions with the
life-cycle of the data used in the pipeline as are needed for any HetCompute task because of these
asynchronous semantics.
Code snippet for pipeline asynchronous semantics
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;
// add stages
...
// launch without memory footprint control
p.disable_sliding_window();
// asynchronous launch with a known number of iterations
auto t1 = p.create_task(10); // t1 is of type hetcompute::task_ptr<>
// use t1 as a regular HetCompute task: set dependencies, launch, wait_for, finish_after, ...
26
27 a_bufs.push_back(buf_a);
28 b_bufs.push_back(buf_b);
29 c_bufs.push_back(buf_c);
30 }
31 }
32
33 void
34 reset_bufs(size_t size)
35 {
36 // Reset the initial value of the buffers.
37 for (size_t j = 0; j < num_iters; j++)
38 {
39 a_bufs[j].acquire_wi();
40 b_bufs[j].acquire_wi();
41 c_bufs[j].acquire_wi();
42
43 for (size_t i = 0; i < size; ++i)
44 {
45 a_bufs[j][i] = i;
46 b_bufs[j][i] = size - i;
47 c_bufs[j][i] = j + 1;
48 }
49
50 a_bufs[j].release();
51 b_bufs[j].release();
52 c_bufs[j].release();
53 }
54 }
55
56 // Release the memory of the buffers.
57 void
58 cleanup_bufs()
59 {
60 a_bufs.clear();
61 b_bufs.clear();
62 c_bufs.clear();
63 }
64
65 // A GPU kernel string which does the vector addition.
66 #define OCL_KERNEL(name, k) std::string const name##_string = #k
67 OCL_KERNEL(vadd_kernel, __kernel void vadd(__global float* a, __global float* b, __global float* c, unsigned int size) {
68 unsigned int i = get_global_id(0);
69 if (i < size)
70 c[i] = a[i] + b[i];
71 });
72
73 int
74 main()
75 {
76 hetcompute::runtime::init();
77 const size_t size = 32;
78
79 // Initialize the buffers.
80 init_bufs(size);
81 // Reset the buffer values.
82 reset_bufs(size);
83
84 // Define a hetcompute heterogeneous pipeline and its context.
85 hetcompute::beta::pattern::pipeline<> p;
86 using context = hetcompute::beta::pattern::pipeline<>::context;
87
88 // S0: A CPU stage.
89 // Add a serial cpu stage.
90 p.add_stage(hetcompute::serial_stage(), [](context& ctx) -> size_t {
91 // Return the iteration id.
92 return ctx.get_iter_id();
93 });
94
161 p.run(num_iters);
162
163 // Clean up the buffers.
164 cleanup_bufs();
165
166 hetcompute::runtime::shutdown();
167 return 0;
168 }
Besides the GPU stage, the heterogeneous pipeline maintains the same features as a regular homogeneous
CPU pipeline (see HetCompute Pipeline Details). A GPU stage is described by the following parts:
• a before lambda
• a GPU kernel
• an after lambda (optional).
The before lambda (which can also be a callable object or a function pointer) connects the stage with its
previous stage and provides the arguments to the GPU kernel for the current stage. The before lambda takes
the same parameters as a CPU stage function, i.e., a reference to
hetcompute::beta::pattern::pipeline::context and, optionally, a reference to
hetcompute::stage_input<type> (see Stage Function). It should return a
std::tuple of the range for the GPU kernel and the variables that will be fed as the arguments to the
GPU kernel. In the simple example above (lines 94 to 129), the GPU kernel is 1-d and takes three buffer
pointers and one unsigned int as its input arguments. Therefore, the before lambda of the GPU stage
returns a std::tuple<hetcompute::range<1>, hetcompute::buffer_ptr<float>,
hetcompute::buffer_ptr<float>, hetcompute::buffer_ptr<float>, unsigned
int>, with the range as the first element in the tuple followed by the kernel arguments in order (see Basic
Usage of Buffers, Kernels: The Path to Heterogeneity).
The GPU kernel defines the main functionality of the stage. It is launched as a GPU task and executes
on the GPU component.
The after lambda (optional; it can also be a callable object or a function pointer) takes the output of the
GPU kernel and performs post-processing (if necessary) to prepare for the following stages. It takes
two parameters:
std::tuple approach to provide the inputs and get the outputs from the GPU kernel instead of
parameter pass and value returning as regular functions.
A GPU pipeline stage has the same stage parameters as a regular CPU stage, such as
hetcompute::serial_stage or hetcompute::parallel_stage,
hetcompute::iteration_rate, hetcompute::iteration_lag, and
hetcompute::sliding_window_size (see HetCompute Pipeline Details).
The example above shows a HetCompute application that writes Hello World! and This is HetCompute!
concurrently. In line 9, hetcompute::launch(Code&&) creates and launches a task t that prints
Hello World!. hetcompute::launch(Code&&) takes as a parameter a lambda expression
(an anonymous function), which defines the work that the task executes. As will be seen in the
remainder of the guide, hetcompute::launch(Code&&, Args&&...) is extremely versatile, and
accepts function pointers, CPU kernels, OpenCL kernels, etc. hetcompute::launch(Code&&)
returns a pointer to the task immediately, and execution proceeds to the next statement,
HETCOMPUTE_ILOG("This is HetCompute!"), while the HetCompute runtime schedules
task t on the first available CPU core. Because t might run concurrently with main, the following two
program outputs are feasible:
Hello World!
This is HETCOMPUTE!
Or
This is HETCOMPUTE!
Hello World!
The example creates two tasks — hello and world — and sets up a dependency between them (line 16)
to ensure that World! is printed after Hello. Without this dependency, the HetCompute runtime could
execute world first, or concurrently. Note that the order in which tasks are launched does not reflect the
order in which the HetCompute runtime executes them. In line 22, the example waits for world to finish.
There is no need to explicitly wait for hello because the HetCompute runtime guarantees that hello
completes before world executes.
It is important to notice the use of two separate API calls to create and launch world (lines 10 and 19,
respectively). This allows the dependency between hello and world to be created before the latter is
launched. It is not possible to add a predecessor to a launched task, because the task might have already
executed by the time the program calls hetcompute::task<>::then(). This is the reason why, in HetCompute,
task creation, launch, and execution are different operations; hetcompute::launch(Code&&,
Args&&...) combines them for the sake of programmatic convenience and efficiency. In HetCompute,
tasks must be launched in order to be executed. Launching a task t means that the programmer has finished
adding predecessors to t and wants the HetCompute runtime to execute t as soon as its
predecessors have completed and execution units are available.
These two simple samples illustrate two basic HetCompute abstractions: tasks and dependencies. In
HetCompute, programmers think about algorithms in terms of concurrent tasks and let the HetCompute
runtime schedule them onto available resources in the system. Programmers can create dynamic task graphs
by setting dependencies between tasks that the runtime enforces.
Most parallel algorithms launch more than one task, unlike the example above. Waiting for each individual
task can get cumbersome: the programmer would need to store the task pointers in some container (e.g.,
a queue or a list), and call hetcompute::task<>::wait_for() on each of them. For programmatic
convenience and efficiency, HetCompute provides another asynchronous abstraction called groups. A group
is a set of tasks that can be waited for as a unit.
1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create group g
9 auto g = hetcompute::create_group();
10
11 // Launch 10 tasks into g
12 for (int i = 0; i < 10; i++)
13 {
14 g->launch([i] { HETCOMPUTE_ILOG("Hello World! I’m task #%d\n", i); });
15 }
16
17 // Wait for tasks to complete and exit group
18 g->wait_for();
19 hetcompute::runtime::shutdown();
20
21 return 0;
22 }
The example above creates a group g in line 9. Then, rather than creating tasks out of lambda functions and
launching them explicitly as in the earlier samples, this example launches the lambda functions directly into
group g (lines 12-15). HetCompute internally creates optimized tasks that execute as part of the group.
Finally, all the tasks are waited for as a unit in line 18.
With this basic introduction to the fundamental units of asynchrony in HetCompute, tasks and groups, read
through the next several chapters to discover various exciting operations and capabilities of these
asynchronous abstractions, including heterogeneous execution, dataflow, non-blocking parallelization,
cancellation, exception handling, and algebraic operations.
Kernels: The Path to Heterogeneity
Creating Tasks
Task Pointers
Life of a HetCompute Task
Launching Tasks
Task Dependencies
Task Groups
Waiting for Tasks
Exceptions and Cancellation
Blocking Tasks
Algebraic Operations on Tasks
Task-Pointer Collapsing
Unleashing Asynchrony
The hello world example in the previous section uses hetcompute::launch to create and launch a
CPU task in one step. However, hetcompute::launch is in fact one of many convenient methods that
combine multiple steps into one method. In particular, hetcompute::launch creates anonymous
kernels and tasks as necessary along the way.
The general steps to write a HetCompute program are as follows:
Some of the steps above can be interleaved with other steps. For example, new tasks can still be
created, and dependencies set up for them, even after some tasks have already been launched. Meanwhile,
many convenience methods exist that combine multiple steps into one; refer to other sections in this
chapter to see some of these in action.
As an example, the following code rewrites the hello world program by explicitly carrying out the steps
listed above:
#include <hetcompute/hetcompute.hh>
#include <stdio.h>

int
main()
{
    hetcompute::runtime::init();
    // Create a cpu_kernel k that prints "Hello World!"
    auto k = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Use k to create a task t
    auto t = hetcompute::create_task(k);

    // Launch the task t
    t->launch();

    // Print another line after t is asynchronously launched
    HETCOMPUTE_ILOG("This is HETCOMPUTE!\n");

    // Wait for t to complete before shutting down the runtime
    t->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
Note
Although the modified hello world example above creates a task using a CPU kernel (which itself is
created from a lambda expression, that is, an anonymous CPU function), hetcompute::create_task
can take other types of kernels created from various device functions: a CPU kernel, a GPU
kernel, or a DSP kernel. The details on how to create these heterogeneous kernels follow.
Currently, there are three types of kernels in HetCompute. Use the following methods to create each of
them:
The three kernel creation methods have different semantics, due to the difference in how CPU, GPU,
and DSP functions are written. Refer to Kernels in the API Reference Manual for details regarding
each method.
static int
f5(int* x, int* y, int l)
{
    int i = 0;
    for (i = 0; i < l; i++)
        y[i] = x[i] * 2;
    return 0;
}

int
main()
{
    hetcompute::runtime::init();

    // Create a cpu_kernel from a function
    auto k1 = hetcompute::create_cpu_kernel(f1);

    // Create a cpu_kernel from a lambda expression
    auto k2 = hetcompute::create_cpu_kernel(f2);

    // Create a cpu_kernel from a functor
    auto k3 = hetcompute::create_cpu_kernel(f3);

    // Create a gpu_kernel from an OpenCL C GPU function
    auto k4 = hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<int>,
                                            hetcompute::buffer_ptr<int>>(f4_string, "f4");

    // Create a hexagon_kernel from a DSP function
    auto k5 = hetcompute::create_dsp_kernel<>(f5);

    hetcompute::runtime::shutdown();
    return 0;
}
Correspondingly when passing OpenCL C functions, the user may optionally pass
hetcompute::beta::cl to make the use of OpenCL explicit, as illustrated below.
auto k4 = hetcompute::beta::create_gpu_kernel<hetcompute::buffer_ptr<int>,
hetcompute::buffer_ptr<int>>
(hetcompute::beta::cl, f4_string, "f4");
When launching a GPU kernel, a global range parameter must always be provided. The global range
identifies the total number and shape of GPU threads that will be executed. Optionally, a local range
parameter may also be provided. When omitted, the local range is assumed to be of size 1 (of the
1D/2D/3D dimensionality corresponding to the global range).
With GPU kernels created from OpenCL C, the programmer typically doesn’t face correctness issues when
omitting the local range parameter. However, OpenGL ES shader program sources specify a local range.
When launching GPU kernels created from OpenGL ES shaders, the programmer must ensure that
the local range is provided and matches the local range used in the compute shader program. For
example, the sample program in GPU kernels for OpenCL and OpenGL ES uses a 1D local range of size 16.
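A common practice in GPU programming, not specific to HetCompute, is to round the global range up to a multiple of the local range and to guard the kernel body with a bounds check, like the if (i < size) test in the vadd kernel shown earlier. Whether HetCompute requires this rounding is not stated here, so the plain C++ sketch below is given under that assumption. The helper names round_up_global and num_work_groups are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Smallest multiple of local that is >= global (1D case).
// round_up_global is an illustrative helper, not a HetCompute API.
std::size_t round_up_global(std::size_t global, std::size_t local)
{
    assert(local > 0);
    return (global + local - 1) / local * local;
}

// Number of work-groups needed to cover the rounded-up range.
std::size_t num_work_groups(std::size_t global, std::size_t local)
{
    return round_up_global(global, local) / local;
}
```

With a 1D local range of size 16, a problem of 1000 elements rounds up to a global range of 1008, that is, 63 work-groups. The 8 surplus threads are the ones that the bounds check in the kernel filters out.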
After a kernel is created—from a CPU, GPU, or DSP function—Qualcomm HetCompute users cannot swap
the underlying function for another one in the kernel. However, users can change the following attributes of
the kernel:
1. Blocking: denotes whether a CPU task made from this kernel is expected to block on external events
such as I/O activities (Blocking Tasks). The blocking attribute can be set and queried using the
set_blocking and is_blocking methods of a kernel object.
2. Big: denotes that the CPU kernel is preferably executed on a big core (i.e. has affinity to the big core)
in a big.LITTLE SoC. Users can override the kernel affinity setting through the HetCompute affinity
APIs (Affinity). The big attribute can be set and queried using the set_big and is_big methods
of a kernel object.
3. Little: denotes that the CPU kernel is preferably executed on a LITTLE core (i.e. has affinity to the
LITTLE core) in a big.LITTLE SoC. Users can override the kernel affinity setting through the
HetCompute affinity APIs (Affinity). The little attribute can be set and queried using the
set_little and is_little methods of a kernel object.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto k1 = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("big task executed"); });
    // inform the HetCompute runtime that the kernel is best executed on a big
    // core in a big.LITTLE SoC
    k1.set_big();

    auto t1 = hetcompute::launch(k1);
    t1->wait_for();

    auto k2 = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("LITTLE task executed"); });
    // inform the HetCompute runtime that the kernel is best executed on a
    // LITTLE core in a big.LITTLE SoC
    k2.set_little();

    auto t2 = hetcompute::launch(k2);
    t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Note
While the HetCompute runtime system will try to enforce the big and little kernel attributes, dynamic
system conditions such as offline cores may prevent it from doing so.
Changing a kernel’s attributes does not affect tasks created from this kernel prior to the change.
Because object lifetime management is generally tricky in asynchronous and parallel programming, at task
creation time, HetCompute copies a kernel into the resulting task. This has a few implications:
1. The task can exist independent of the kernel’s lifetime. This is particularly useful when a kernel is
created inside a scope such as a function, but the tasks created from this kernel can live beyond the
end of this scope and get launched later, even after the original kernel object has already been
destroyed.
2. This also means programmers should be particularly careful with kernel-owned data. As shown in the
example below, the copying of kernel objects during task creation may lead to non-obvious results.
#include <hetcompute/hetcompute.hh>
#include <stdio.h>

// A functor that stores some internal state
struct foo
{
    int n;

    foo() : n(1) {}

    void operator()()
    {
        HETCOMPUTE_ILOG("n = %d\n", n);
        n++;
    }
};

int
main()
{
    hetcompute::runtime::init();

    // Create a cpu_kernel from the functor
    foo bar;
    auto k = hetcompute::create_cpu_kernel(bar);

    // Create two independent tasks from k
    auto t1 = hetcompute::create_task(k);
    auto t2 = hetcompute::create_task(k);

    // Set dependency and launch the tasks
    t1->then(t2);
    t1->launch();
    t2->launch();

    // Expected output: 1 and 1, not 1 and 2
    t2->wait_for();

    hetcompute::runtime::shutdown();

    return 0;
}
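The copy behavior described above can be reproduced without HetCompute. The following is a minimal sketch that uses only the C++ standard library: std::function stands in for task creation because it also stores its own copy of the callable. The names counter, make_deferred, and run_two_copies are hypothetical and are not part of the HetCompute API.

```cpp
#include <cassert>
#include <functional>

// Hypothetical functor with internal state, mirroring `foo` in the listing above.
struct counter
{
    int n = 1;
    int operator()() { return n++; }  // returns the value seen before the increment
};

// Stand-in for "task creation": std::function stores its own copy of the callable.
inline std::function<int()> make_deferred(const counter& c)
{
    return std::function<int()>(c);
}

// Runs two deferred copies of one functor and returns the sum of what they observed.
inline int run_two_copies()
{
    counter bar;
    auto t1 = make_deferred(bar);
    auto t2 = make_deferred(bar);
    int sum = t1() + t2();  // each copy starts from its own n == 1
    assert(bar.n == 1);     // the original functor was never invoked
    return sum;
}
```

Each copy increments its own n, so state changes made by one deferred callable are invisible to the other and to the original object. State that must be shared has to live outside the functor, for example behind a pointer or reference whose lifetime the programmer manages.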
Kernel objects can generally be moved and copied in construction and assignment, similar to regular
objects. However, one exception is that cpu_kernel objects created from lambda functions cannot be
copy-assigned, because lambda functions do not have copy-assignment operators.
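The restriction comes from the C++ language, not from HetCompute, and can be checked with standard type traits. The sketch below uses a capturing lambda, for which copy assignment is deleted in every C++ standard since C++11 (C++20 relaxes the rule only for lambdas without captures). The function name lambda_probe is hypothetical.

```cpp
#include <cassert>
#include <type_traits>

// Returns true when the closure type can be copy-constructed but not copy-assigned.
inline bool lambda_probe()
{
    int x = 42;
    auto l = [x] { return x; };  // capturing lambda
    auto copy = l;               // copy construction is allowed
    // copy = l;                 // would not compile: copy assignment is deleted
    assert(copy() == 42);

    using closure = decltype(l);
    return std::is_copy_constructible<closure>::value &&
           !std::is_copy_assignable<closure>::value;
}
```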
Finally, it is possible to create kernel objects by directly instantiating them via their constructors, instead of
using the factory functions used in the examples in this section (that is, hetcompute::create_cpu_kernel,
hetcompute::create_gpu_kernel, and hetcompute::create_dsp_kernel). This may be useful
for passing kernel pointers around and for extending kernel classes. However, in those situations, one usually
cannot use the auto keyword, and it is important to explicitly and correctly type the kernel objects.
The HetCompute runtime automatically balances load across the various devices in a heterogeneous
system: the CPU (big or LITTLE), GPU, and DSP. Developers can exploit this feature by constructing
"poly-kernels" – kernels having multiple implementations of the same interface – and have the HetCompute
runtime system dynamically pick the most suitable device on which to execute the task constructed from the
poly-kernel.
1 #include <hetcompute/hetcompute.hh>
2
3 // Macro which creates a string containing the OpenCL C kernel.
4 #define OCL_KERNEL(name, k) std::string const name##_string = #k
5
6 OCL_KERNEL(vadd_kernel, __kernel void vadd(__global float* A, __global float* B, __global float* C, unsigned int size) {
7 unsigned int i = get_global_id(0);
8 if (i < size)
9 C[i] = A[i] + B[i];
10 });
11
12 int
13 main(void)
14 {
15 hetcompute::runtime::init();
16
17 {
18 // Create input buffers, automatically host accessible
19 auto buf_a = hetcompute::create_buffer<float>(1024);
20 auto buf_b = hetcompute::create_buffer<float>(buf_a.size());
21
22 buf_a.acquire_wi();
23 buf_b.acquire_wi();
24 // Initialize the input buffers
25 for (size_t i = 0; i < buf_a.size(); ++i)
26 {
27 buf_a[i] = i;
28 buf_b[i] = buf_a.size() - i;
29 }
30 buf_a.release();
31 buf_b.release();
32
33 // Create an output buffer in relaxed mode: not automatically accessible by host
34 auto buf_c = hetcompute::create_buffer<float>(buf_a.size());
35
36 // Name of the OpenCL C kernel.
37 std::string kernel_name("vadd");
38
39 // Create a gpu kernel. Note the optional in/out directions that allow HETCOMPUTE
40 // to perform copy optimizations. By default, the buffers are treated as
41 // inout.
42 auto gpu_vadd = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
43 hetcompute::in<hetcompute::buffer_ptr<float>>,
44 hetcompute::out<hetcompute::buffer_ptr<float>>,
45 unsigned int>(vadd_kernel_string, kernel_name);
46 unsigned int size = buf_a.size();
47
48 // Create a hetcompute::range object, 1D in this case.
49 hetcompute::range<1> range_1d(buf_a.size());
50
51 // Create cpu kernel with the same interface as the gpu kernel
52 auto cpu_vadd = [](hetcompute::range<1> r,
53 hetcompute::buffer_ptr<float> a,
54 hetcompute::buffer_ptr<float> b,
55 hetcompute::buffer_ptr<float> c,
56 unsigned int) {
57 HETCOMPUTE_ILOG("running on the CPU");
58 hetcompute::pfor_each(r, [&](hetcompute::index<1> i) { c[i[0]] = a[i[0]] + b[i[0]]; });
59 };
60
61 // Create a task out of a poly-kernel.
62 // The HetCompute runtime will automatically choose the appropriate gpu or cpu
63 // variant to dispatch based on runtime load on the devices.
64 auto poly_task = hetcompute::beta::create_task(std::make_tuple(gpu_vadd, cpu_vadd),
65 range_1d, // global range
66 buf_a, // rest correspond to gpu_vadd template parameters
67 buf_b,
68 buf_c,
69 size);
70 // Launch the task
71 poly_task->launch();
72
73 // Wait for task completion.
74 poly_task->wait_for();
75
76 buf_a.acquire_ro();
77 buf_b.acquire_ro();
78 buf_c.acquire_ro();
79
80 // Access the results on the host and verify their correctness.
81 for (size_t i = 0; i < buf_a.size(); ++i)
82 {
83 HETCOMPUTE_INTERNAL_ASSERT(buf_a[i] + buf_b[i] == buf_c[i] && buf_c[i] == buf_a.size(),
84 "comparison failed at ix %zu: %f + %f == %f == %zu",
85 i,
86 buf_a[i],
87 buf_b[i],
88 buf_c[i],
89 buf_a.size());
90 }
91
92 buf_a.release();
93 buf_b.release();
94 buf_c.release();
95 }
96
97 hetcompute::runtime::shutdown();
98 }
In the example above, the programmer wants to add two vectors. A GPU kernel that performs the vector
addition is created on line 42, and a CPU kernel with the same interface is implemented (using
hetcompute::pfor_each) on line 52. Both alternatives are exposed to the HetCompute
runtime system by constructing a task out of a poly-kernel, expressed as a std::tuple, on line 64. The
constructed poly_task is like any other task: the programmer can wait for the task, add the task to
groups, set dependencies, etc. Once the task is launched (on line 71), the HetCompute runtime performs
"late binding" of function to device, wherein either the CPU or the GPU may execute the task depending on
their relative load.
Note
The poly-kernel is a beta feature in this HetCompute release, and its API may change in the future.
hetcompute::range<N> is useful to represent an ND-Range (in OpenCL). The table below shows different
ways to create the hetcompute::range<N> object and how it maps to OpenCL constructs.
hetcompute::range                                    OpenCL NDRange
hetcompute::range<1>(w)                              cl::NDRange(w)
hetcompute::range<2>(w, h)                           cl::NDRange(w, h)
hetcompute::range<3>(w, h, d)                        cl::NDRange(w, h, d)
hetcompute::range<1>(off_x, w)                       cl::NDRange(off_x) is used as the offset and
                                                     cl::NDRange(w) as the global size in the GPU kernel launch
hetcompute::range<2>(off_x, w, off_y, h)             cl::NDRange(off_x, off_y) is used as the offset and
                                                     cl::NDRange(w, h) as the global size in the GPU kernel launch
hetcompute::range<3>(off_x, w, off_y, h, off_z, d)   cl::NDRange(off_x, off_y, off_z) is used as the offset and
                                                     cl::NDRange(w, h, d) as the global size in the GPU kernel launch
5.3.1.8 hetcompute::range<1>
Represents a 1D range and is useful if your problem space is linear like vector addition. The example below
shows a simple vector addition where each work item takes two input vectors and produces one element in
the output vector.
// Macro which creates a string containing the OpenCL C kernel.
#define OCL_KERNEL(name, k) std::string const name##_string = #k
// Create a gpu kernel object. Note the optional in/out directions that allow HetCompute to perform
// copy optimizations. By default, the buffers are treated as inout.
auto gpu_vadd = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
hetcompute::in<hetcompute::buffer_ptr<float>>,
hetcompute::out<hetcompute::buffer_ptr<float>>,
unsigned int>(vadd_kernel_string, kernel_name);
hetcompute::range<1> range_1d(buf_a.size());
// Create a task
vadd_kernel_string represents a string containing the OpenCL C code for vector addition.
5.3.1.9 hetcompute::range<2>
Represents a 2D range and is useful if your problem space is two-dimensional, similar to matrix
multiplication or many image processing applications. The example below shows a simple matrix
multiplication, where each work item computes one element in the output matrix.
// Macro which creates a string containing the OpenCL C kernel.
#define OCL_KERNEL(name, k) std::string const name##_string = #k
OCL_KERNEL(matrix_multiply_kernel,
__kernel void matrix_multiply(__global float* a, __global float* b, __global float* c, int M,
int P, int N) {
int i = get_global_id(1);
int j = get_global_id(0);
if (i >= M || j >= N)
return;
c[i * N + j] = 0;
for (int k = 0; k < P; k++)
{
c[i * N + j] += a[i * P + k] * b[k * N + j];
}
});
// Create a gpu kernel object. Note the optional in/out directions that allow HetCompute to perform
// copy optimizations. By default, the buffers are treated as inout.
auto gpu_mm = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
hetcompute::in<hetcompute::buffer_ptr<float>>,
hetcompute::out<hetcompute::buffer_ptr<float>>,
unsigned int,
unsigned int,
unsigned int>(matrix_multiply_kernel_string, kernel_name);
// Create a task
auto gpu_task = hetcompute::create_task(gpu_mm, // gpu kernel
range_2d, // global range
buf_a, // rest correspond to gpu_mm template parameters
buf_b,
buf_c,
m,
p,
n);
5.3.1.10 hetcompute::range<3>
Consider the example of a denoise_kernel described in more detail in Parallelization using patterns.
void denoise_image()
{
    // initialization, etc
    hetcompute::range<2> r(0, w, TILE_SIZE, 0, h, TILE_SIZE);
    // r is a local variable, so the lambda must capture it to call linear_to_index
    hetcompute::pfor_each(r, [input, &output, &r] (size_t index) {
        hetcompute::index<2> idx = r.linear_to_index(index);
        denoise_kernel(input, idx[0], TILE_SIZE, idx[1], TILE_SIZE, output);
    });
}
hetcompute::range<2> above defines a two-dimensional range, from [0,w) x [0, h) with a stride of
TILE_SIZE in each dimension. Each dimension can have a different stride. We use this range as our
iteration space for the parallel loop.
hetcompute::index<2> defines a two-dimensional index. HetCompute ranges know how to iterate to the
appropriate points. hetcompute::pfor_each provides a linear index to the lambda (size_t index) in the code
above; using the linear_to_index call returns a hetcompute::index<2> object that has the appropriate
coordinates in each dimension of the range. We can directly access the dimensions using the [] operator on
the object.
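The mapping from a linear index to strided 2D coordinates can be sketched in plain C++. The struct strided_range_2d below is a hypothetical stand-in, not the hetcompute::range<2> class, and it assumes that the first dimension varies fastest as the linear index grows; the ordering used by HetCompute is not stated here and may differ.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a strided 2D range: [begin, end) with a stride per dimension.
struct strided_range_2d
{
    std::size_t begin0, end0, stride0;
    std::size_t begin1, end1, stride1;

    // Number of points visited along one dimension (ceiling division).
    static std::size_t count(std::size_t b, std::size_t e, std::size_t s)
    {
        return (e - b + s - 1) / s;
    }

    // Total number of points in the iteration space.
    std::size_t size() const
    {
        return count(begin0, end0, stride0) * count(begin1, end1, stride1);
    }

    // Assumption: dimension 0 varies fastest as the linear index grows.
    std::array<std::size_t, 2> linear_to_index(std::size_t linear) const
    {
        assert(linear < size());
        const std::size_t n0 = count(begin0, end0, stride0);
        return {{begin0 + (linear % n0) * stride0, begin1 + (linear / n0) * stride1}};
    }
};
```

With a 64 x 32 image and a tile size of 16, strided_range_2d{0, 64, 16, 0, 32, 16} contains 4 x 2 = 8 points, one per tile, and each linear index maps to the top-left coordinate of one tile.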
Note
hetcompute::task_ptr<ReturnType>
hetcompute::create_value_task(Args&& ...args)
• template<typename Code, typename... Args>
collapsed_task_type<Code> hetcompute::create_task(Code&&,
Args&&...)
• template<typename Code, typename... Args>
non_collapsed_task_type<Code>
hetcompute::create_task(do_not_collapse_t, Code&&, Args&&...)
Creating a task using the create_task methods returns a pointer to the task. The task is not
ready to be launched and the programmer has the opportunity to set up dependencies (that is,
make this task part of a task graph). The runtime maintains the validity of the pointer for the
time when the task is executing. However, because the programmer has a pointer to it, the
lifetime of that pointer will also impact how long the task will be maintained in the system. It is
recommended that the programmer reset the pointer when finished with the task.
There are three essential parts for creating a HetCompute task (non-value task):
Lambda expressions, introduced in C++11, are the preferred argument type for creating HetCompute
tasks. Lambda expressions are unnamed function objects that are able to capture variables from the
enclosing scopes. A description of this C++11 feature is outside the scope of this document. Find detailed
information about lambda expressions in the following links:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create a task that prints Hello World!
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Launch the task.
    t1->launch();
    // Wait for the task to finish.
    t1->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
The lambda expression in the previous example is very simple as it does not capture any variables. Let’s
suppose that you want to capture a string with the user name to do a proper greeting:
#include <hetcompute/hetcompute.hh>
#include <string>

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    std::string name = "HETCOMPUTE";

    // Launching a task in the group. The lambda captures name by value.
    g->launch([name] { HETCOMPUTE_ILOG("Hello World, %s!\n", name.c_str()); });

    // Wait for g to finish.
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
By capturing name in the lambda expression, you make sure that the task can use it when the task
executes, which happens outside the scope where the task is created. If you capture
variables by reference, make sure that the original object still exists when the task executes. For example,
consider the following code:
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto g = hetcompute::create_group();
8
9 {
10 std::string name = "HETCOMPUTE";
11
12 // Launching a task in the group.
13 g->launch([&name] { HETCOMPUTE_ILOG("Hello World, %s!\n", name.c_str()); });
14 } // "name" goes out of scope here.
15
16 // Wait for g to finish.
17 g->wait_for();
18 hetcompute::runtime::shutdown();
19
20 return 0;
21 }
The string name goes out-of-scope in line 14, and its destructor is then called. If the scheduler executes the
task after that happens, the program will most likely crash.
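The lifetime rule for captured variables is ordinary C++ and can be illustrated without the HetCompute runtime. In the minimal sketch below, a callable is built inside one scope and executed after that scope has ended, which is what a scheduler does with a task. The names make_greeting and greet are hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Builds a callable inside a scope and hands it out for later execution.
inline std::function<std::string()> make_greeting()
{
    std::string name = "HETCOMPUTE";
    // Capture by value: the closure owns its own copy of `name`.
    return [name] { return "Hello World, " + name + "!"; };
    // Capturing with [&name] instead would leave a dangling reference once this
    // function returns; calling the closure afterwards is undefined behavior.
}

// Executes the callable after the scope that created it has ended.
inline std::string greet()
{
    auto deferred = make_greeting();  // the local `name` no longer exists here
    std::string text = deferred();    // safe: the closure reads its own copy
    assert(!text.empty());
    return text;
}
```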
Never capture a hetcompute::task_ptr<...> by reference. Refer to Task Pointers for more
information.
Warning
Using default capture by copy ([=]) or by reference ([&]) implicitly captures every variable from the
enclosing scope that the lambda body uses, which makes it easy to capture more (or larger) objects than
intended and may increase the size of your tasks considerably. It is recommended that you explicitly
capture only the variables that your lambda expression needs.
You can use any custom class as <typename Code> by overloading the class’s operator(). The
following code shows how to create a task from a class instance. When the HetCompute scheduler executes
the task, the operator() method is called.
#include <hetcompute/hetcompute.hh>

class user_class
{
public:
    explicit user_class(int value) : x(value) {}

    void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }

    void set_x(int value) { x = value; }

private:
    int x;
};

int
main()
{
    hetcompute::runtime::init();
    // Create a hetcompute group.
    auto g = hetcompute::create_group();

    // Create and launch a task into group g.
    g->launch(user_class(42), 27);

    // Wait for the group to finish.
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
It is also possible to create an object from user_class and then create a task using that object:
1 #include <hetcompute/hetcompute.hh>
2
3 class user_class
4 {
5 public:
6 explicit user_class(int value) : x(value) {}
7
8 void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }
9
10 void set_x(int value) { x = value; }
11
12 private:
13 int x;
14 };
15
16 int
17 main()
18 {
19 hetcompute::runtime::init();
20 // Create a hetcompute group.
21 auto g = hetcompute::create_group();
22
23 // Instantiate an object of user_class.
24 user_class obj(42);
25
26 // Create a hetcompute task.
27 auto t = hetcompute::create_task(obj);
28
29 // Launch the task into group g.
30 g->launch(t, 27);
31
32 // Wait for the group to finish.
33 g->wait_for();
34 hetcompute::runtime::shutdown();
35 return 0;
36 }
The previous example raises an interesting question: What would the task print if obj.set_x(100) was
called between lines 27 and 30?
As always, the answer to the ultimate question of life, the universe, and everything (including HetCompute
task execution) is 42. The reason is that HetCompute makes a copy of obj when it creates the task in line
27. Otherwise, users would need to keep track of the lifetime of the objects used to create tasks. However,
if you were to construct the object in-place, no copies would be made:
#include <hetcompute/hetcompute.hh>

class user_class
{
public:
    explicit user_class(int value) : x(value) {}

    void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }

    void set_x(int value) { x = value; }

private:
    int x;
};

int
main()
{
    hetcompute::runtime::init();
    // Create a hetcompute group.
    auto g = hetcompute::create_group();

    // Create a hetcompute task.
    auto t = hetcompute::create_task(user_class(42));

    // Launch the task into group g.
    g->launch(t, 27);

    // Wait for the group to finish.
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
#include <hetcompute/hetcompute.hh>

void foo();
void
foo()
{
    HETCOMPUTE_ILOG("Hello World!\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create a task that executes foo().
    auto t = hetcompute::create_task(foo);

    // Launch and wait for the task.
    t->launch();
    t->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
Warning
Due to limitations in the Visual Studio C++ compiler, this does not work on Visual Studio. You can get
around it by using a lambda function:
#include <hetcompute/hetcompute.hh>

void foo();
void
foo()
{
    HETCOMPUTE_ILOG("Hello World!\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create a task that executes foo().
    auto t = hetcompute::create_task([] { foo(); });

    // Launch and wait for the task.
    t->launch();
    t->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
• hetcompute::task_ptr<>: points to a task that neither accepts any arguments nor returns a
value
• hetcompute::task_ptr<void>: points to a task that returns void
• hetcompute::task_ptr<ReturnType>: points to a task that returns a value
• hetcompute::task_ptr<ReturnType(Args...)>: points to a task that accepts
arguments and returns a value
The above types constitute a type hierarchy, such that a hetcompute::task_ptr<> can point to a
task that returns a value or accepts arguments. Each of the above task_ptrs permits different operations
on tasks, with hetcompute::task_ptr<> providing most of the common operations on tasks and the other
two types providing more advanced operations:
• hetcompute::task_ptr<> can be launched, canceled, waited for, finished after, serve as the
source of a control dependency, etc.
• In addition to the above, hetcompute::task_ptr<ReturnType> can serve as the source of a
data dependency.
• In addition to the above, hetcompute::task_ptr<ReturnType(Args...)> can have its
arguments bound through hetcompute::task<ReturnType(Args...)>::bind_all.
Tasks are reference-counted, so they are automatically destroyed when no more
hetcompute::task_ptr<>s reference them. When a task is launched, the HetCompute runtime
increases the reference count of the task. This prevents the task from being destroyed, even if all pointers
referencing the task are reset. The HetCompute runtime decrements the reference count of the task after it
finishes (completes execution, throws an exception, or is canceled). The task reference count requires
atomic operations. Copying a hetcompute::task_ptr<> causes an atomic increment and the new
copy of the hetcompute::task_ptr<> causes an atomic decrement when it goes out of scope. For
best results, minimize the number of times your application copies hetcompute::task_ptr<>s.
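The cost model described above resembles that of std::shared_ptr, which serves here only as an analogy; nothing in this sketch is a statement about how hetcompute::task_ptr<> is implemented. The type fake_task and both function names are hypothetical. Copying the owning pointer changes the shared count, while taking a raw pointer does not.

```cpp
#include <cassert>
#include <memory>

struct fake_task { int id; };  // hypothetical payload standing in for a task

// Copying an owning pointer increments the shared reference count.
inline long count_after_copy()
{
    auto owner = std::make_shared<fake_task>(fake_task{1});
    auto copy = owner;         // one more owner; the count drops again when `copy` is destroyed
    return owner.use_count();  // two owners are alive at this point
}

// A raw pointer is non-owning: it neither increments nor decrements the count.
inline long count_after_raw_pointer()
{
    auto owner = std::make_shared<fake_task>(fake_task{1});
    fake_task* raw = owner.get();
    assert(raw != nullptr);
    return owner.use_count();  // still a single owner
}
```

The raw pointer is only valid while at least one owning pointer keeps the object alive, which is the trade-off the non-counting pointer type makes in exchange for cheaper copies.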
Some algorithms require constantly passing hetcompute::task_ptr<>s. To maintain high
performance, HetCompute provides another task pointer type that does not perform reference counting:
hetcompute::task<>* (and, correspondingly, hetcompute::task<void>*, hetcompute::task<ReturnType>*, and
hetcompute::task<ReturnType(Args...)>*). The following example demonstrates how to point
hetcompute::task<>* to a task:
Note
Or use a hetcompute::task<>* (of course, make sure that t1 does not go out of scope):
hetcompute::task_ptr<> t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!"); });

hetcompute::task<>* unsafe_t1 = t1.get();

hetcompute::task_ptr<> t2 = hetcompute::create_task([unsafe_t1] { HETCOMPUTE_ILOG("Hello World from t2!"); });

hetcompute::task_ptr<> t3 = hetcompute::create_task([&
The state of a task determines the operations permitted on it. Understanding task states and transitions is
useful to debug both correctness and performance issues with your HetCompute program. For example, as
shown below, the program may hang because a task which is never launched is waited for.
auto t = hetcompute::create_task([]{});
// wait for the task without launching it
t->wait_for(); // never released
Stepping through the operations on a task with the above task state diagram in mind may help to identify
and fix errors such as the one in the code snippet.
Note
Neither the task states nor the transitions between them are part of the HetCompute API. They are
described herein merely for the sake of understanding.
Most HetCompute tasks take the green line to successful completion, while some take the red line due to a
variety of reasons, e.g., throwing an exception. Some tasks may directly be created in the Launched state
(via hetcompute::launch(Code&&, Args&&...)), while others may directly be created and
launched in the Ready state (via hetcompute::launch(Code&&, Args&&...) with all arguments
ready).
• (1) After setting up any control dependencies (via hetcompute::task<>::then()) or data dependencies
(via hetcompute::task<ReturnType(Args...)>::bind_all()) from other tasks, use
hetcompute::task<>::launch() to register the task with the HetCompute runtime system.
No further dependencies may be added to a Launched task.
• (2) After all tasks on which a task is control- or data-dependent have transitioned to Completed, the
task becomes Ready for execution.
• (3) When an appropriate execution resource (CPU, GPU, or DSP) becomes available and any other
resources such as hetcompute::buffers that may be used by the task become available, the task
transitions to the Running state.
• (4) Finally, after the task completes execution successfully, and any other tasks after which it is set to
finish (via hetcompute::task<>::finish_after()) have also finished, the task
transitions to the Completed state.
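The four transitions above can be written down as a small state machine, which may help when stepping through the operations applied to a task. This is only a reading aid: task states are not part of the HetCompute API, the enumerator name created for the initial state is an assumption, and the sketch models the successful path only (not cancellation).

```cpp
#include <cassert>

// Hypothetical names for the states on the successful path.
enum class task_state { created, launched, ready, running, completed };

// Returns true when `from` -> `to` is one of the numbered transitions (1)-(4).
inline bool is_green_transition(task_state from, task_state to)
{
    switch (from)
    {
    case task_state::created:   return to == task_state::launched;   // (1) launch()
    case task_state::launched:  return to == task_state::ready;      // (2) all predecessors completed
    case task_state::ready:     return to == task_state::running;    // (3) execution resource and buffers available
    case task_state::running:   return to == task_state::completed;  // (4) body and finish_after targets done
    case task_state::completed: return false;                        // final state
    }
    return false;
}

// Moves a task one step along the successful path.
inline task_state advance(task_state s)
{
    assert(s != task_state::completed);
    task_state next = static_cast<task_state>(static_cast<int>(s) + 1);
    assert(is_green_transition(s, next));
    return next;
}
```

In this model a task that is never launched stays in its initial state forever, which matches the earlier observation that waiting for such a task never returns.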
Some tasks may not execute successfully (e.g., throw an exception), may be canceled programmatically, etc.;
such tasks take the red line and end up in the Canceled state. See Exceptions and Cancellation for more
details. The state transitions along the red line are described below:
1. t->launch(args...) to launch the task and bind the arguments for data dependency;
2. t->launch() to launch the task, but the arguments need to be bound already by
t->bind_all(args...). For more information about task argument binding, see Unleashing
Asynchrony and Tasks.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task.
    auto t1 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World! x = %d\n", x); });

    // t1 is ready, launch it and bind the argument.
    t1->launch(42);

    // Wait for t1 to finish.
    t1->wait_for();

    // Create a task.
    auto t2 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World! x = %d\n", x); });

    // Bind the task argument to t2.
    t2->bind_all(73);

    // t2 is ready, launch it.
    t2->launch();

    // Wait for t2 to finish.
    t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Notice that launching a task means that it is not possible to add any new predecessors, although you can add
successors. The reason is that, by launching the task, the programmer is asking the HetCompute runtime to
execute the task as soon as possible. By the time the programmer tries to add a new predecessor to the task,
the task might have already executed, and adding a predecessor to an already-executed task is not allowed.
Tasks can launch only once. Any subsequent calls to hetcompute::task<>::launch(...) or
hetcompute::group::launch(...) on a task do not cause the task to execute again. Calls of
hetcompute::group::launch(...) on a task might, however, cause the task to be added to new
groups. See Task Groups.
While control dependencies can be set up among any CPU, GPU, or DSP tasks, data dependencies can
be set up only among CPU tasks in the current HetCompute release.
Note
In the example above, the statement t1->then(t2) guarantees that t1 finishes before t2 begins
execution. Consequently, it suffices to just wait_for t2 to finish to ensure that both t1 and t2
finish.
Note
For programmatic convenience, data dependencies can be specified similar to task argument binding at any
one of the following points:
• hetcompute::create_task()
• hetcompute::task<ReturnType(Args...)>::bind_all()
• hetcompute::task<ReturnType(Args...)>::launch()
The following example illustrates the above:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] { return 42; });

    t1->launch();

    auto t2 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); },
                                      t1); // Set up data dependency from t1 to t2 when t2 is created
    t2->launch();

    auto t3 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); });
    // Set up data dependency from t1 to t3
    t3->bind_all(t1);
    t3->launch();

    auto t4 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); });
    // Set up data dependency from t1 to t4 at launch-time
    t4->launch(t1);

    t2->wait_for();
    t3->wait_for();
    t4->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Tasks t2, t3, and t4 are all data-dependent on task t1. But the data dependency is specified through
different HetCompute API calls.
Use HetCompute data dependencies to easily and automatically manage lifetime of data accessed by tasks.
The data returned by a task is kept alive until the point the task finishes execution and all references to the
task (through hetcompute::task_ptr<>) go out of scope.
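The lifetime rule for returned data can be pictured with a single-threaded stand-in built from the standard library. In the sketch below, value_node is a hypothetical class (not a HetCompute type) that runs its body at most once and caches the result; consumers hold a std::shared_ptr to it, so the result stays alive for as long as any consumer still references the producer.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>

// Hypothetical single-threaded stand-in for a value-returning task.
template <typename T>
class value_node
{
public:
    explicit value_node(std::function<T()> fn) : fn_(std::move(fn)) {}

    // Runs the body at most once and caches the result for every consumer.
    const T& get()
    {
        if (!done_)
        {
            value_ = fn_();
            done_ = true;
        }
        return value_;
    }

private:
    std::function<T()> fn_;
    T value_{};
    bool done_ = false;
};

// Two consumers read the value produced by one producer.
inline int run_data_dependency()
{
    int runs = 0;
    auto producer = std::make_shared<value_node<int>>([&runs] { ++runs; return 42; });

    // The consumer shares ownership of the producer, so the cached result
    // cannot be destroyed while the consumer may still read it.
    auto consumer = [producer](int offset) { return producer->get() + offset; };

    int total = consumer(1) + consumer(2);
    assert(runs == 1);  // the producer body ran exactly once
    return total;
}
```

The sketch leaves out scheduling and thread safety entirely; it only shows why tying the data to the reference-counted producer removes the need for manual lifetime tracking.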
As mentioned previously, tasks can encapsulate computation to be executed on the CPU, GPU, or DSP. This
allows the creation of directed acyclic graphs (DAGs) of tasks whose execution spans across all three
processing units:
1 #include <hetcompute/hetcompute.hh>
2
3 // header to include the dsp bindings, it is generated by the Hexagon SDK
4 #include <include/hetcompute_dsp.h>
5
6 static std::string const gpu_fn_string = "__kernel void gpu_fn() {"
7 "}";
8
9 int
10 main()
11 {
12 hetcompute::runtime::init();
13
14 // This is to ensure all task objects are freed before we call shutdown
15 {
16 // CPU task
17 auto t1 = hetcompute::create_task([] {});
18
19 // Create a dsp_kernel from a DSP function
20 auto dsp_kernel = hetcompute::create_dsp_kernel<>(hetcompute_dsp_return_input);
21 int i = 0;
22
In the above example, the programmer creates CPU tasks on lines 17, 30, and 33; a Hexagon DSP task on
line 20; and a GPU task on line 27. The programmer then sets up a combination of control and data
dependencies among these tasks on lines 37 through 39. These dependencies create a heterogeneous task
graph as the figure below shows.
Subsequently, on lines 42 through 46, the programmer launches the task graph for execution and waits for
its completion on line 49.
Warning
A cycle in the DAG may cause deadlock. For performance reasons, HetCompute does not check
whether there are cycles in the DAG. The programmer is responsible for avoiding them.
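Because the runtime does not check for cycles, an application can validate its own dependency structure in debug builds before launching anything. The sketch below is one possible approach, a depth-first search over an adjacency list that the application maintains itself; task_graph, has_cycle, and the idea of mirroring each then() call as an edge are assumptions of this example, not HetCompute features.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Adjacency list kept by the application: g[u] holds the successors of task u.
using task_graph = std::vector<std::vector<std::size_t>>;

namespace detail
{
    enum class mark { unvisited, in_progress, done };

    // Depth-first visit; returns true if a back edge (and therefore a cycle) is found.
    inline bool visit(const task_graph& g, std::size_t u, std::vector<mark>& marks)
    {
        marks[u] = mark::in_progress;
        for (std::size_t v : g[u])
        {
            assert(v < g.size());
            if (marks[v] == mark::in_progress)
                return true;  // v is on the current DFS path: cycle
            if (marks[v] == mark::unvisited && visit(g, v, marks))
                return true;
        }
        marks[u] = mark::done;
        return false;
    }
}

// Returns true if the dependency graph contains at least one cycle.
inline bool has_cycle(const task_graph& g)
{
    std::vector<detail::mark> marks(g.size(), detail::mark::unvisited);
    for (std::size_t u = 0; u < g.size(); ++u)
    {
        if (marks[u] == detail::mark::unvisited && detail::visit(g, u, marks))
            return true;
    }
    return false;
}
```

The check runs in time linear in the number of tasks and edges, so it is cheap enough for debug builds; the recursive form assumes dependency chains short enough not to exhaust the stack.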
All the predecessor dependencies of a task must be specified before launching the task. Specifying
inter-task dependencies in this manner ahead of task execution provides information to the HetCompute
runtime system allowing it to schedule the tasks intelligently, optimizing for performance and power. While
this is the preferred method, alternative means to specify inter-task dependencies exist: After a task starts
execution, it can invoke hetcompute::task<>::wait_for() (Waiting for Tasks) or
hetcompute::task<>::finish_after() (finish_after) on some hetcompute::task_ptr<>. Using these
methods limits the scope of optimization in the HetCompute runtime task scheduler. The following
example illustrates the two different methods:
#include <hetcompute/hetcompute.hh>

void task_dependency();
void task_waiting();

void
task_dependency()
{
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello "); });

    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("World!"); });

    // Ensure that t1 executes before t2
    // *PREFERRED METHOD* of specifying task dependency
    t1->then(t2);

    t1->launch();
    t2->launch();

    t2->wait_for();
}

void
task_waiting()
{
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello "); });

    auto t2 = hetcompute::create_task([t1] {
        // Wait for t1 to finish execution
        // *LESS PREFERRED METHOD* of specifying task dependency
        t1->wait_for();
        HETCOMPUTE_ILOG("World!");
    });

    t1->launch();
    t2->launch();

    t2->wait_for();
}

int
main()
{
    hetcompute::runtime::init();
    // preferred
    task_dependency();
    // less preferred
    task_waiting();
    hetcompute::runtime::shutdown();
}
There are three ways to add a task or kernel to a group: by creating and launching it, by just launching it, or
by adding it directly without launching it.
#include <hetcompute/hetcompute.hh>

static void
do_something(int, int)
{
}

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Create tasks from do_something and launch them into g
    for (int i = 0; i < 10; i++)
        for (int j = 10; j < 20; j++)
            g->launch(do_something, i, j);

    // Wait for all the tasks in group g to complete
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Launching
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void(int)>
    auto hello = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    // Bind hello to 42 and launch task into g
    g->launch(hello, 42);

    // Wait for g to be empty
    g->wait_for();

    hetcompute::runtime::shutdown();
}
Adding
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group g.
    auto g = hetcompute::create_group();

    // Create task t1. Its type is hetcompute::task_ptr<void()>
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });

    // Add task t1 to group g, but do not launch it.
    g->add(t1);

    auto t2 = hetcompute::launch([t1] {
        // Launch t1. Because it already belongs to group g, there is no
        // reason to use hetcompute::group::launch.
        t1->launch();
    });

    // Wait for tasks in group g to complete.
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
Use these methods when the task is part of a DAG. For example, there is a dependency between t1 and t2
in the following example:
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create Example group
    auto g = hetcompute::create_group();

    // Create tasks
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t2!\n"); });

    // Launch t1 into g
    g->launch(t1);

    // Use t1
    t1->then(t2);

    // Launch t2 into g
    g->launch(t2);

    // Wait for tasks to complete
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
• Tasks stay in the group until they complete execution. Once a task is added to a group, there is no
way to remove it from the group.
• Once a task belonging to multiple groups completes execution, HetCompute removes it from all the
groups to which it belongs.
• Neither completed nor canceled tasks can join groups.
• Tasks cannot be added to a canceled group.
• By launching it into each of the groups. For example, to add task t to groups g1 and g2, the
following would be performed:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task
    auto t = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Create groups
    auto g1 = hetcompute::create_group("Example 1");
    auto g2 = hetcompute::create_group("Example 2");

    // Launch t into g1 and g2
    g1->launch(t);
    g2->launch(t);

    t->wait_for();
    hetcompute::runtime::shutdown();
}
Notice that, in the example above, t joins both g1 and g2, but it only executes once. Therefore, the code
snippet outputs a single ’Hello World!’. However, t might never join g2 because it might complete
execution before the first launch returns. Remember that completed tasks can never join groups.
• By creating a new group that is the intersection of all the groups where the task needs to launch, and
then launching the task into it. hetcompute::group_ptr intersect(hetcompute::group_ptr const& a,
hetcompute::group_ptr const& b) returns a group
pointer to a group that represents the intersection of the two groups passed as arguments. This
method is more performant than repeatedly launching the same task into different groups.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task
    auto t = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Create groups
    auto g1 = hetcompute::create_group("Example 1");
    auto g2 = hetcompute::create_group("Example 2");

    auto g12 = hetcompute::intersect(g1, g2);

    // Launch t into g1 and g2
    g12->launch(t);
    hetcompute::runtime::shutdown();
}
It is important to understand what group intersection really means, because it might appear counterintuitive.
hetcompute::intersect returns a pointer to a group that represents the intersection of two or more
groups. Launching a task into the intersection group means simultaneously launching it into all the groups
that are part of the intersection.
For example, the following code snippet shows an application with two groups, g1 and g2, with 3000 and
2000 tasks in each, respectively.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create groups
8 auto g1 = hetcompute::create_group("Group 1");
9 auto g2 = hetcompute::create_group("Group 2");
10
11 auto g12 = hetcompute::intersect(g1, g2);
12
In line 11, g1 and g2 are intersected into g12. The returned pointer, g12, points to an empty group
because no task belongs to both g1 and g2 yet. Therefore, g12->wait_for() in line 24 returns
immediately. The wait_for calls in lines 27 and 28 only return when their tasks complete. In line 30, a
new task is launched into g12. The wait_for calls in lines 37 and 38 only return after that task
completes execution because t belongs to both g1 and g2 (and, of course, g12).
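The membership rules of group intersection can be checked in isolation with plain sets. The following sketch is a toy model in standard C++, not HetCompute code: ToyGroup, launch_into_both, intersection_size, and example are hypothetical names, and tasks are represented by integer IDs. It shows that tasks launched into only one group never appear in the intersection, while a task launched into the intersection joins every constituent group.

```cpp
#include <cassert>
#include <cstddef>
#include <set>

// Toy model only: a group is the set of IDs of the tasks currently in it.
using ToyGroup = std::set<int>;

// Launching into the "intersection" of a and b adds the task to both groups.
void launch_into_both(ToyGroup& a, ToyGroup& b, int task_id)
{
    a.insert(task_id);
    b.insert(task_id);
}

// Number of tasks that belong to both a and b.
std::size_t intersection_size(const ToyGroup& a, const ToyGroup& b)
{
    std::size_t n = 0;
    for (int id : a)
    {
        n += b.count(id);
    }
    return n;
}

// Usage: mirrors the text, with g1 and g2 holding 3000 and 2000 tasks.
std::size_t example()
{
    ToyGroup g1, g2;
    for (int i = 0; i < 3000; ++i) g1.insert(i);
    for (int i = 3000; i < 5000; ++i) g2.insert(i);
    assert(intersection_size(g1, g2) == 0); // no task is in both groups yet
    launch_into_both(g1, g2, 5000);         // models launching into g12
    return intersection_size(g1, g2);
}
```

In this model the intersection stays empty no matter how many tasks g1 and g2 hold individually, which is why waiting on it returns immediately until a task is launched into it.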
Keep in mind that group intersection is a somewhat expensive operation. If you need to intersect groups
repeatedly, just do it once and keep the pointer to the group intersection alive.
#include <memory>
#include <hetcompute/hetcompute.hh>

static void
do_something()
{
}

int
main()
{
    hetcompute::runtime::init();
    // Create groups
    auto g1 = hetcompute::create_group("Example group 1");
    auto g2 = hetcompute::create_group("Example group 2");
    auto g12 = g1 & g2;

    // Launch 10 tasks into g1 and g2
    for (int i = 0; i < 10; i++)
    {
        g12->launch(do_something);
    }

    g12->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
Therefore, the code snippet above is faster than the one below:
#include <cassert>
#include <hetcompute/hetcompute.hh>

static void
do_something()
{
}

int
main()
{
    hetcompute::runtime::init();
    // Create groups
    auto g1 = hetcompute::create_group("Example group 1");
    auto g2 = hetcompute::create_group("Example group 2");

    // Launch 10 tasks into g1 and g2
    for (int i = 0; i < 10; i++)
    {
        (g1 & g2)->launch(do_something);
    }

    (g1 & g2)->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
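One way to picture why keeping the intersection pointer alive helps is a lookup table keyed by the set of groups. The sketch below is a hypothetical illustration in standard C++ and makes no claim about how the HetCompute runtime is implemented: ToyRuntime, ToyIntersection, and lookups_when_hoisted are invented names, and groups are identified by small integers. Every call to intersect pays for a lookup, even when the result is already cached, and the order-independent key makes the operation commutative.

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <memory>
#include <utility>

// Toy model only: records which two groups an intersection refers to.
struct ToyIntersection
{
    int first;
    int second;
};

class ToyRuntime
{
public:
    // Returns the shared intersection object for groups a and b.
    // The key is order-independent, so intersect(a, b) == intersect(b, a).
    std::shared_ptr<ToyIntersection> intersect(int a, int b)
    {
        ++lookups_; // every call pays for a lookup, even on a cache hit
        const std::pair<int, int> key = std::make_pair(std::min(a, b), std::max(a, b));
        auto it = cache_.find(key);
        if (it == cache_.end())
        {
            auto created = std::make_shared<ToyIntersection>(ToyIntersection{key.first, key.second});
            it = cache_.emplace(key, created).first;
        }
        return it->second;
    }

    int lookups() const { return lookups_; }

private:
    std::map<std::pair<int, int>, std::shared_ptr<ToyIntersection>> cache_;
    int lookups_ = 0;
};

// Usage: intersect once, then reuse the pointer for ten launches.
int lookups_when_hoisted()
{
    ToyRuntime rt;
    auto g12 = rt.intersect(1, 2);
    for (int i = 0; i < 10; ++i)
    {
        assert(g12 != nullptr); // stands in for g12->launch(...)
    }
    return rt.lookups();
}
```

Calling intersect inside the loop and once more for the final wait would cost eleven lookups in this model instead of one.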
Consecutive calls to hetcompute::intersect with the same group pointers as arguments return a
pointer to the same group. In addition, group intersection is commutative:
#include <cassert>
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create groups
    auto g1 = hetcompute::create_group("Group 1");
    auto g2 = hetcompute::create_group("Group 2");

    // Get pointer to intersection groups:
    auto g12 = g1 & g2;
    auto g21 = g2 & g1;

    // This assert will never fire
    assert(g12 == g21);
    hetcompute::runtime::shutdown();
}
and associative:
1 #include <cassert>
2 #include <stdio.h>
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 // Create groups
10 auto g1 = hetcompute::create_group("Group 1");
11 auto g2 = hetcompute::create_group("Group 2");
12 auto g3 = hetcompute::create_group("Group 3");
13
14 // Get pointers to intersection groups:
hetcompute::group::wait_for() does not return until all the tasks in it have completed execution
or have been canceled.
#include <hetcompute/hetcompute.hh>
#include <stdio.h>

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Launch 10 tasks into g
    for (int i = 0; i < 10; i++)
    {
        g->launch([i] { HETCOMPUTE_ILOG("Hello World! I'm task #%d\n", i); });
    }

    // Wait for tasks to complete and exit group
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
g2->launch([]{
while(1) {}
});
// Never returns
g1->wait_for();
g2->wait_for();
hetcompute::group::wait_for() is a safe point. For information about safe points, see
Interoperability.
hetcompute::group::finish_after is a non-blocking alternative to wait_for.
hetcompute::group::finish_after returns immediately but the task calling it is guaranteed not
to finish before all the tasks in group g are finished.
Use hetcompute::group::cancel() to cancel all the tasks in a group. Canceling a group means
that:
• The group tasks that have not started execution will never execute.
• The group tasks that are executing will be canceled only when they call
hetcompute::abort_on_cancel. If any of these executing tasks is a blocking task,
HetCompute executes its cancel handler if it has not already been executed.
• Any tasks added to the group after the group is canceled will also be canceled.
In the following example, 2000 tasks are launched and then sleep for some time so that a few of those 2000
tasks are done, a few others are executing and a large majority are waiting to be executed. In line 26 the
group is canceled. This means that next time the running tasks execute
hetcompute::abort_on_cancel they will see that their group has been canceled and will abort.
g->wait_for() will not return before the running tasks end their execution — either because they call
hetcompute::abort_on_cancel() or because they finish writing all the messages.
1 #include <atomic>
2 #include <hetcompute/hetcompute.hh>
3
4 using namespace std;
5
6 int
7 main()
8 {
9 hetcompute::runtime::init();
10 // Counts the number of tasks that execute before the group gets
11 // canceled
12 atomic<size_t> counter(0);
13
14 auto group = hetcompute::create_group();
15
16 // Create 2000 tasks that increase an atomic counter
17 for (int i = 0; i < 2000; i++)
18 {
19 group->launch([&counter] {
20 counter++;
21 usleep(7);
22 });
23 }
24
25 // Cancel group
26 group->cancel();
27
28 // Wait for group to cancel
29 try
30 {
31 group->wait_for();
32 }
33 catch (const hetcompute::aggregate_exception& e)
34 {
35 // If many tasks were canceled, they each propagate a
36 // hetcompute::canceled_exception to the group, all of which get aggregated into
37 // a single hetcompute::aggregate_exception.
38 std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
39 }
40 catch (const hetcompute::canceled_exception& e)
41 {
42 // If all but one task finished by the time group cancellation took effect,
43 // then the one remaining task which was canceled will propagate a single
44 // hetcompute::canceled_exception.
45 std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
46 }
47 catch (...)
48 {
49 // Never reached
50 }
51 HETCOMPUTE_ILOG("wait_for returned after %zu tasks executed", counter.load());
52 hetcompute::runtime::shutdown();
53 return 0;
54 }
Now consider what happens when function get_char is executed asynchronously, possibly by a thread
different from the one that executes main. In the example below, the call stack of the thread executing
get_char does not contain the continuation of task t, because the continuation is possibly in one or more
(different) threads that synchronize with t through operations, such as wait_for. Consequently, normal
C++ exception propagation up the thread’s call stack is not sufficient for asynchronous programs. In the
example below, in the absence of a well-defined exceptions model for asynchronous programs, the main
thread waiting for task t on line 20 may resume execution without being aware that task t finished
unsuccessfully due to the exception.
1 #include <iostream>
2 #include <string>
3
4 #include <hetcompute/hetcompute.hh>
5
6 void get_char();
7
8 char c;
9
10 void
11 get_char()
12 {
13 c = std::string().at(1); // Illegal access of empty string
14 }
15
16 int
17 main()
18 {
19 auto t = hetcompute::launch([] { get_char(); }); // Executed asynchronously (possibly in a different thread)
20 t->wait_for(); // Synchronization point
21 std::cout << "got character " << c << " from string " << std::endl;
22 return 0;
23 }
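For comparison, standard C++ futures deal with the same problem by storing the exception in the shared state and rethrowing it at the synchronization point. The following sketch uses only the C++ standard library (std::async and std::future), not HetCompute; exception_observed_at_sync_point is a hypothetical helper name introduced for illustration.

```cpp
#include <cassert>
#include <future>
#include <stdexcept>
#include <string>

// Throws std::out_of_range: index 1 is past the end of an empty string.
char get_char()
{
    return std::string().at(1);
}

// Runs get_char asynchronously and reports whether its exception
// surfaced at the synchronization point (std::future::get).
bool exception_observed_at_sync_point()
{
    std::future<char> f = std::async(std::launch::async, get_char);
    assert(f.valid()); // the future refers to the asynchronous call's shared state
    try
    {
        f.get(); // rethrows the exception stored by the asynchronous call
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
    return false;
}
```

The waiting thread cannot resume past the synchronization point without either obtaining the value or seeing the exception, which is the property an exceptions model for asynchronous programs needs to provide.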
80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 100
Qualcomm® Snapdragon™ Heterogeneous Compute SDK User Guide
More than one task in a group of tasks or a pattern executed using a group of tasks may throw an exception.
HetCompute captures all such exceptions in a hetcompute::aggregate_exception, which is
thrown at sync points of the group or pattern. The example below illustrates its use. Note the use of
hetcompute::aggregate_exception::has_next and
hetcompute::aggregate_exception::next to iterate through all the exceptions contained in
hetcompute::aggregate_exception. The exceptions may be rethrown in any order due to the
asynchronous nature of the constructs generating the exceptions.
#include <iostream>
#include <string>

#include <hetcompute/hetcompute.hh>

void get_char();
void get_num();

char c;
int  n;

#ifdef _MSC_VER
#pragma warning(disable : 4702)
#endif

void
get_char()
{
    c = std::string().at(1); // Illegal access of empty string
}

void
get_num()
{
    throw std::exception();
    n = 1;
    n = 2;
}

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    g->launch([] { get_char(); }); // Executed asynchronously (possibly in a different thread)
    g->launch([] { get_num(); });  // Executed asynchronously (possibly in a different thread)
    try
    {
        g->wait_for(); // Synchronization point
        std::cout << "got character " << c << " from string " << std::endl;
        std::cout << "got num " << n << std::endl;
    }
    catch (hetcompute::aggregate_exception& e)
    {
        // Deal with all exceptions
        while (e.has_next())
        {
            try
            {
                e.next(); // throws contained exceptions one-by-one
            }
            catch (const std::out_of_range&)
            {
                std::cerr << "illegal string access" << std::endl;
            }
            catch (const std::exception& s)
            {
                std::cerr << s.what() << std::endl;
            }
            catch (...)
            {
                // Should never get here
            }
        }
    }
    catch (...)
    {
        // Should never get here
    }
    hetcompute::runtime::shutdown();
    return 0;
}

#ifdef _MSC_VER
#pragma warning(default : 4702)
#endif
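The idea of capturing several exceptions and rethrowing them one by one can be reproduced with std::exception_ptr. The sketch below is a simplified stand-in written in standard C++, not the hetcompute::aggregate_exception implementation: ToyAggregate, run_all, Counts, and classify are hypothetical names, and the tasks run sequentially to keep the example deterministic.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <vector>

// Toy stand-in for an aggregate exception: a list of captured exceptions.
struct ToyAggregate
{
    std::vector<std::exception_ptr> errors;
};

// Runs every task (sequentially, for simplicity) and captures whatever each one throws.
ToyAggregate run_all(const std::vector<std::function<void()>>& tasks)
{
    ToyAggregate agg;
    for (const auto& task : tasks)
    {
        try
        {
            task();
        }
        catch (...)
        {
            agg.errors.push_back(std::current_exception());
        }
    }
    return agg;
}

struct Counts
{
    int out_of_range = 0;
    int other        = 0;
};

// Rethrows the captured exceptions one by one and classifies them,
// most specific handler first.
Counts classify(const ToyAggregate& agg)
{
    Counts c;
    for (const auto& e : agg.errors)
    {
        assert(e != nullptr); // only real exceptions were captured
        try
        {
            std::rethrow_exception(e);
        }
        catch (const std::out_of_range&)
        {
            ++c.out_of_range;
        }
        catch (const std::exception&)
        {
            ++c.other;
        }
    }
    return c;
}
```

As in the listing above, the handler for std::out_of_range must come before the handler for std::exception, because the former derives from the latter.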
5.3.9.3 Cancellation
Exceptions are thrown from within an executing asynchronous construct, but there is also an external
reason why an asynchronous construct may finish unsuccessfully: in HetCompute, tasks, groups, and
patterns may all be canceled programmatically. Programmatic cancellation is useful in a variety of
scenarios:
• If a background task, such as fetching data from a remote server, takes too long, the user may cancel
it through the UI.
• If a group of tasks is engaged in searching a database, and one of them finds the intended data, the
other tasks may be canceled to avoid unnecessary work and save energy.
While the source of exceptions is intrinsic to the asynchronous construct and the source of cancellation is
extrinsic to the asynchronous construct, exceptions and cancellations both result in the asynchronous
construct finishing unsuccessfully. Therefore, HetCompute deals with both exceptions and cancellation in
an identical fashion. When a task, group, or pattern is canceled or throws an exception, HetCompute
records the fact. Subsequently, at sync points of the asynchronous construct, HetCompute throws
hetcompute::canceled_exception or rethrows the original exception thrown by the
asynchronous construct. If the exception is not handled at a sync point, then that exception propagates up
the call stack of the thread executing the sync point as per normal C++ exception propagation.
When a task is canceled or throws an exception, HetCompute cancels all of its successors in the task graph.
Any synchronization with the now-canceled successor tasks will throw the appropriate exception(s).
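The propagation rule can be modeled as a traversal of the successor graph. The following sketch is a toy model in standard C++, not HetCompute code: ToyGraph and its members are hypothetical names, and tasks are identified by their index. Canceling a task marks it and, transitively, every task reachable through successor edges, while predecessors are left untouched.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model only: a task graph stored as lists of successor indices.
struct ToyGraph
{
    std::vector<std::vector<std::size_t>> successors;
    std::vector<bool>                     canceled;

    explicit ToyGraph(std::size_t n) : successors(n), canceled(n, false) {}

    // Records that task `to` runs after task `from`.
    void then(std::size_t from, std::size_t to)
    {
        assert(from < successors.size() && to < successors.size());
        successors[from].push_back(to);
    }

    // Cancels a task and, transitively, every one of its successors.
    void cancel(std::size_t task)
    {
        if (canceled[task])
        {
            return; // already canceled; also prevents revisiting shared successors
        }
        canceled[task] = true;
        for (std::size_t next : successors[task])
        {
            cancel(next);
        }
    }
};
```

For a chain 0, 1, 2, 3 built with then(0, 1), then(1, 2), and then(2, 3), calling cancel(1) marks tasks 1, 2, and 3 and leaves task 0 free to execute.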
The following is a comprehensive list of program points at which HetCompute may throw an exception due
to an asynchronous construct throwing an exception or getting canceled:
To observe an exception thrown by a task, group, or pattern, the programmer must synchronize with
it through one of the above methods. If a sync point is not reached and the asynchronous
construct (task_ptr or group_ptr) goes out of scope, the asynchronous construct is deemed
useless and the exception is never rethrown anywhere.
5.3.9.5.1 t->cancel()
Use t->cancel() to cancel a task t and its successors. What happens to the task when the programmer
calls t->cancel() depends on the status of the task.
If a task is canceled before it is launched, it never executes, even if it is launched later. In addition, the
runtime then cancels all successors in the task’s successor graph. In the following example, two tasks,
t1 and t2, are created with a dependency between them. Notice that if either task executes, it
raises an assertion. In line 18, t1 is canceled, which causes t2 to be canceled as well. In line 21, t2 is
launched, but this has no effect: the task will not execute because it was canceled when t1 propagated its
cancellation.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 auto t1 = hetcompute::create_task([] { assert(false); });
10
Similarly, if a task is canceled after it is launched, but before it starts executing, it never executes and
propagates the cancellation request to its successors. In the following example, three tasks, t1, t2, and
t3, are created and chained. In line 19, t2 is launched, but it cannot execute because its predecessor has not
yet executed. In line 22, t2 is canceled, which means that it will never execute. Because t3 is t2’s
successor, it is also canceled; if t3 had a successor, it would also be canceled.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
10
11 auto t2 = hetcompute::create_task([] { assert(false); });
12
13 auto t3 = hetcompute::create_task([] { assert(false); });
14
15 // Create dependencies
16 t1->then(t2)->then(t3);
17
18 // Launch t2. It cannot execute as yet because t1 has not been launched.
19 t2->launch();
20
21 // Cancel t2, which propagates cancellation to t3
22 t2->cancel();
23
24 // Launch t1. It will execute because no one canceled it.
25 t1->launch();
26
27 // Returns after t1 completes execution
28 t1->wait_for();
29 hetcompute::runtime::shutdown();
30
31 return 0;
32 }
Canceling a task that is executing is more involved because HetCompute uses cooperative
multitasking. This means that, once a task is executing, it is not preempted unless it voluntarily
cedes the processor (e.g., by calling hetcompute::task<>::wait_for()). Thus, it is up to the
task to check periodically whether or not it has been canceled. Use
hetcompute::abort_on_cancel() inside a task body to abort the task immediately if the task, or
any of the groups to which it belongs, has been canceled.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9
10 auto t = hetcompute::create_task([] {
11 while (1)
12 {
13 hetcompute::abort_on_cancel();
14 HETCOMPUTE_ILOG("Waiting to be canceled.\n");
15 usleep(100);
16 }
17 assert(false); // This will never fire.
18 });
19
20 // Launch t.
21 t->launch();
22
23 // Wait for 2 seconds.
24 usleep(2000000);
25
26 // Cancel task. Returns immediately.
27 t->cancel();
28
29 try
30 {
31 // Wait for the task.
32 t->wait_for();
33 }
34 catch (const hetcompute::canceled_exception& e)
35 {
36 std::cout << e.what() << " thrown" << std::endl;
37 }
38 catch (...)
39 {
40 // Never reached.
41 }
42
43 hetcompute::runtime::shutdown();
44 return 0;
45 }
In the example above, task t will never finish unless it is canceled. Task t is launched in line 21. After
launching the task, the main thread sleeps for 2 seconds in line 24 to make sure that t is scheduled and prints
its messages. In line 27, HetCompute is asked to cancel the task, which should be running by now. The
method t->cancel() returns immediately after it marks the task as "pending for cancellation". This
means that t might still be executing after t->cancel() returns. That is why t->wait_for() is
called in line 32: to wait until t completes its execution.
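The cooperative scheme can be reproduced with a shared flag and a check function that throws. The following sketch uses only the C++ standard library and is an analogue, not HetCompute code: ToyAbort, toy_abort_on_cancel, and run_and_cancel are hypothetical names, a plain std::thread stands in for the runtime, and the catch block plays the role of the runtime catching the abort.

```cpp
#include <atomic>
#include <cassert>
#include <thread>

// Hypothetical stand-in for the exception used to unwind a canceled task.
struct ToyAbort
{
};

// Analogue of an abort-on-cancel check: throws only if cancellation was requested.
void toy_abort_on_cancel(const std::atomic<bool>& cancel_requested)
{
    if (cancel_requested.load())
    {
        throw ToyAbort{};
    }
}

// Runs a looping task on another thread, requests cancellation, and reports
// whether the task ended by unwinding through the check.
bool run_and_cancel()
{
    std::atomic<bool> cancel_requested(false);
    std::atomic<bool> started(false);
    std::atomic<bool> aborted(false);

    std::thread worker([&] {
        try
        {
            for (;;)
            {
                started.store(true);
                toy_abort_on_cancel(cancel_requested); // the only exit from the loop
                std::this_thread::yield();
            }
        }
        catch (const ToyAbort&)
        {
            aborted.store(true); // the "runtime" records the cancellation
        }
    });

    while (!started.load())
    {
        std::this_thread::yield(); // wait until the task is running
    }
    cancel_requested.store(true); // like cancel(): only marks the request and returns
    assert(worker.joinable());
    worker.join(); // like wait_for(): blocks until the task has really ended
    return aborted.load();
}
```

Setting the flag does not stop the worker by itself; the loop ends only at its next check, which is why the caller still has to wait for the task after requesting cancellation.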
Note
A task does not know whether someone has requested its cancellation unless it calls
hetcompute::abort_on_cancel() during its execution.
The method hetcompute::abort_on_cancel() never returns if the task has indeed been canceled
because it throws an exception that the Qualcomm HetCompute runtime catches. For this reason, it is
recommended that you use Resource Acquisition Is Initialization (RAII) to allocate
and deallocate the resources used inside a task. If using RAII in your code is not an option, surround
hetcompute::abort_on_cancel() with try - catch, and call throw from within the catch
block after the cleanup code:
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9
10 auto t = hetcompute::create_task([] {
11 while (1)
12 {
13 try
14 {
15 hetcompute::abort_on_cancel();
16 }
17 catch (const hetcompute::abort_task_exception&)
18 {
19 //..do cleanup
20 throw;
21 }
22 catch (...)
23 {
24 //..do cleanup
25 throw;
26 }
27 // HETCOMPUTE_ILOG("Waiting to be canceled.\n");
28 usleep(10);
29 }
30 assert(false); // This will never fire
31 });
32
33 // Launch t
34 t->launch();
35
36 // Wait for 20 micro-seconds.
37 usleep(20);
38
39 // Cancel task. Returns immediately.
40 t->cancel();
41
42 try
43 {
44 // Wait for the task to complete.
45 t->wait_for();
46 }
47 catch (const hetcompute::canceled_exception& e)
48 {
49 std::cout << e.what() << " thrown" << std::endl;
50 }
51 catch (...)
52 {
53 // Never reached
54 }
55
56 hetcompute::runtime::shutdown();
57 return 0;
58 }
Warning
If throw is replaced in lines 20 and 25 of the previous example with return, the exception would not
propagate to the runtime, HetCompute would not consider the task as canceled, and, therefore, its
successors (if any) would not be canceled.
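The RAII recommendation relies on destructors running during stack unwinding, so no explicit catch-and-rethrow is needed inside the task body. The following sketch demonstrates this in standard C++ only: ToyAbort, ScopedResource, task_body, and cleanup_ran_after_abort are hypothetical names, and the throw statement stands in for the abort check firing.

```cpp
#include <cassert>

// Hypothetical stand-in for the exception thrown when a task is aborted.
struct ToyAbort
{
};

// RAII guard: the destructor runs during stack unwinding, so cleanup happens
// even when the task body is left through an exception.
class ScopedResource
{
public:
    explicit ScopedResource(bool& released) : released_(released) { released_ = false; }
    ~ScopedResource() { released_ = true; }

    ScopedResource(const ScopedResource&) = delete;
    ScopedResource& operator=(const ScopedResource&) = delete;

private:
    bool& released_;
};

// Task body that acquires a resource and is then aborted.
void task_body(bool& released)
{
    ScopedResource resource(released);
    assert(!released); // the resource is held at this point
    throw ToyAbort{};  // stands in for the abort check firing
}

// Stand-in for the runtime: catches the abort and reports whether cleanup ran.
bool cleanup_ran_after_abort()
{
    bool released = false;
    try
    {
        task_body(released);
    }
    catch (const ToyAbort&)
    {
        // The exception reached the "runtime", so the task counts as canceled.
    }
    return released;
}
```

Because the task body never catches the exception, there is no opportunity to swallow it by mistake, which is the failure mode described in the warning above.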
Canceling a task after it has been executed has no effect on the task, nor on its successors. In the following
example, t1 is launched after a dependency is set up between t1 and t2. In line 29, t1 is canceled, but
by then it has already finished execution (the program waits for it in line 25), so t1->cancel()
has no effect. Thus, nobody cancels t2: if t2 were launched, t2->wait_for() in line 33 would never
return, because t2 would remain stuck in an infinite loop.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
8
9 auto t2 = hetcompute::create_task([] {
10 while (1)
11 {
12 hetcompute::abort_on_cancel();
13 HETCOMPUTE_ILOG("Hello World from t2!\n");
14 usleep(100);
15 }
16 });
17
18 // Create dependencies.
19 t1->then(t2);
20
21 // Launch tasks.
22 t1->launch();
23
24 // Wait for t1 to complete.
25 t1->wait_for();
26
27 // Cancel t1.
28 // Because it has already completed, it does not propagate its cancellation.
29 t1->cancel();
30
31 // If the two lines below are uncommented the wait_for will never return.
32 // t2->launch();
33 // t2->wait_for();
34
35 hetcompute::runtime::shutdown();
36 return 0;
37 }
5.3.9.5.2 hetcompute::abort_task()
Running tasks call hetcompute::abort_task() to cancel themselves and their successors. Consider
the following example. Two tasks, t1 and t2, are created with a dependency between them. The
body of t1 is very simple: it prints a message ten times and then aborts. Both tasks are launched, and the
program waits for t1 to complete its execution in line 31. Because t1 calls hetcompute::abort_task(), it is canceled
When all the hetcompute::task_ptrs referencing an unlaunched task go out of scope, the task is
canceled and it propagates the cancellation to its successors. The reasoning is simple: a task t cannot
launch without a task pointer, and none of its successors will ever be able to execute because t never
executed.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 void foo();
6
7 void
8 foo()
9 {
10 auto t1 = hetcompute::create_task([] {
11 HETCOMPUTE_ILOG("Hello World from t1\n");
12 // This will never fire
13 assert(false);
14 });
15
16 auto t2 = hetcompute::create_task([] {
17 HETCOMPUTE_ILOG("Hello World from t2\n");
18 // This will never fire
19 assert(false);
20 });
21
22 auto t3 = hetcompute::create_task([] {
23 int i = 0;
24 while (i++ < 10)
25 {
26 HETCOMPUTE_ILOG("Hello World from t3\n");
27 sleep(1);
28 };
29 });
30
31 t1 >> t2;
32
33 t2->launch();
34 t3->launch();
35
36 // t1, t2, and t3 go out of scope
37 }
38
39 int
40 main()
41 {
42 hetcompute::runtime::init();
43 foo();
44 hetcompute::runtime::shutdown();
45 return 0;
46 }
In the snippet above, three tasks, t1, t2, and t3, are created, with a dependency between the first two.
t2 and t3 are launched in lines 33 and 34. t2 cannot run because t1 has not yet executed. In line 37,
foo() ends and the three pointers go out of scope. t1 is canceled because it was never launched. t2 is
canceled because t1 propagated its cancellation. t3 is not canceled and keeps running even after foo()
returns.
A blocking kernel (and a task created out of that kernel) consists of computation that depends on external
(non-HetCompute) synchronization to make guaranteed forward progress. Typically, the external
synchronization includes completing I/O requests, other OS syscalls with indefinite run-time, and
busy-waiting. It does not include waiting on HetCompute tasks or groups using
hetcompute::task<>::wait_for, hetcompute::task<ReturnType>::copy_value,
hetcompute::task<ReturnType>::move_value, or hetcompute::group::wait_for.
There are two problems with blocking tasks. The first is that once a blocking task executes, it will take over
a thread in a HetCompute thread pool, thus preventing other tasks from executing in the same thread.
Because a blocking task spends most of its time blocking on an event, essentially one of the threads in the
thread pool is wasted. When the programmer marks the task kernel as blocking, HetCompute ensures that
the thread pool does not wastefully dedicate a thread to the task.
The second problem has to do with cancellation. If a blocking task is canceled while it is blocked on an
external event, HetCompute needs to be able to unblock the task so that it can respond to the cancellation
signal (e.g., by calling hetcompute::abort_on_cancel()). As the code snippet below shows, there
is often a well-defined means to unblock:
// blocking call
{
x = network_fetch(network_handle);
}
5.3.10.2 hetcompute::blocking
hetcompute::blocking() enables a task to enter and exit multiple blocking sections of code, and
provides means for the programmer to precisely and efficiently specify how to unblock each blocking
section. The hetcompute::blocking() construct takes two programmer-supplied arguments:
• blocking function: a C++ function, functor, or lambda, that contains the blocking code to execute, and
• cancel function: a C++ function, functor, or lambda, that contains the code to unblock the task if it is
blocked, so that the task may respond to cancellation.
In response to a task being canceled asynchronously, e.g., via hetcompute::task<>::cancel(),
the cancel function will be executed exactly once, only if the task is running.
In the following example, a CPU kernel that executes blocking code is created. The blocking statement is
wrapped in hetcompute::blocking() and specified using two lambdas, including
cancel_function. The kernel is marked as blocking (line 31). After launching task t and sleeping for
a second, t is canceled (line 39). Most likely, by the time t is canceled, it will be waiting on the condition
variable (line 26). The cancel function wakes up the task body so that it can abort (line 24).
1 #include <condition_variable>
2 #include <hetcompute/hetcompute.hh>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8
9 static std::mutex mutex;
10 static std::condition_variable cv;
11
12 // create CPU kernel and mark it as blocking
13 auto k = hetcompute::create_cpu_kernel([] {
14 auto cancel_function = [] {
15 HETCOMPUTE_ILOG("CANCEL blocking task");
16 std::lock_guard<std::mutex> lock(mutex);
17 cv.notify_all();
18 };
19
20 HETCOMPUTE_ILOG("START blocking task");
21 std::unique_lock<std::mutex> lock(mutex);
22 for (;;)
23 {
24 hetcompute::abort_on_cancel();
25 // enter hetcompute::blocking construct
26 hetcompute::blocking([&lock] { cv.wait(lock); }, // blocking function
27 cancel_function); // cancel function
28 }
29 HETCOMPUTE_ILOG("STOP blocking task");
30 });
31 k.set_blocking();
32
33 auto t = hetcompute::launch(k);
34
35 // wait for task to block
36 sleep(1);
37
38 // cancel task; it will call t’s cancel function
39 t->cancel();
40
41 try
42 {
43 // wait for t to finish
44 t->wait_for();
45 }
46 catch (const hetcompute::canceled_exception& e)
47 {
48 HETCOMPUTE_ILOG("task threw %s", e.what());
49 }
50 catch (...)
51 {
52 // Never reached
53 }
54
55 hetcompute::runtime::shutdown();
56 return 0;
57 }
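The pairing of a blocking function with a cancel function can be tried out with standard C++ alone. The sketch below is an analogue, not HetCompute code: BlockingState, blocking_function, cancel_function, and block_then_cancel are hypothetical names, and a std::thread stands in for the blocking task. Unlike the listing above, it waits with a predicate, a variation chosen here so that a notification sent before the wait begins is not lost and spurious wake-ups are ignored.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Shared state for one blocking section.
struct BlockingState
{
    std::mutex              mutex;
    std::condition_variable cv;
    bool                    cancel_requested = false;
    bool                    waiting          = false;
};

// Blocking function: sleeps until the cancel function wakes it up.
void blocking_function(BlockingState& s)
{
    std::unique_lock<std::mutex> lock(s.mutex);
    s.waiting = true;
    s.cv.notify_all(); // tell the canceling thread that we are about to block
    s.cv.wait(lock, [&s] { return s.cancel_requested; });
}

// Cancel function: records the request under the lock, then unblocks the waiter.
void cancel_function(BlockingState& s)
{
    {
        std::lock_guard<std::mutex> lock(s.mutex);
        s.cancel_requested = true;
    }
    s.cv.notify_all();
}

// Starts the blocking section on another thread, waits until it is blocked,
// cancels it, and reports whether the cancellation was delivered.
bool block_then_cancel()
{
    BlockingState s;
    std::thread   worker([&s] { blocking_function(s); });
    {
        std::unique_lock<std::mutex> lock(s.mutex);
        s.cv.wait(lock, [&s] { return s.waiting; });
    }
    cancel_function(s);
    assert(worker.joinable());
    worker.join(); // returns only because the cancel function unblocked the worker
    return s.cancel_requested;
}
```

Without the cancel function, the join would never return, which corresponds to a blocked task that cannot respond to cancellation.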
Note
Later, at runtime, when tasks a and b finish, their return values are propagated to task a + b, which then
concatenates the two strings through the + operator overloaded on the std::string datatype. The
concatenated string is then retrieved through t->move_value in main.
HetCompute supports the following non-blocking algebraic operations on the return values of collapsed
tasks:
• Unary arithmetic and bitwise operations: +, -, ~. The return value of the task (pointed to by
hetcompute::task_ptr) can be any type (built-in or user-defined) as long as the operation is
applicable to it (for user-defined types, the corresponding operator needs to be defined).
• Binary arithmetic and bitwise operations: +, -, *, /, %, &, |, ^, which can take the
following three combinations of operands:
– operand1: hetcompute::task_ptr operand2: hetcompute::task_ptr
In the example above, e is an expression tree composed of values, tasks, and operations on tasks. The
expression tree will be evaluated when tasks in each sub-expression finish and return their values.
Note
There may be certain situations in which the intent is to pass a task_ptr through dataflow. For instance, one
task may create a task t and some other task may launch task t after all of task t’s data is available. For such
situations, use hetcompute::do_not_collapse to indicate that return type collapsing should not be
performed.
#include <iostream>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // t1 returns a task_ptr; do_not_collapse keeps it as a task_ptr value
    auto t1 = hetcompute::launch(hetcompute::do_not_collapse, [] {
        auto t = hetcompute::create_task([] { return 42; });
        return t;
    });
    // t2 receives the task_ptr produced by t1, launches it, and returns it
    auto t2 = hetcompute::launch(
        [](hetcompute::task_ptr<int> t) {
            t->launch();
            return t;
        },
        t1);
    std::cout << t2->copy_value() << std::endl;
    hetcompute::runtime::shutdown();
    return 0;
}
In the above example, task t1 creates task t and passes it as a data dependency to task t2 which accepts a
task_ptr<int> as its argument. hetcompute::do_not_collapse ensures that the return type of
task t1 is task_ptr<task_ptr<int>> and that task_ptr t flows to task t2. Task t2 in turn
launches task t and returns it. Notice that because task t2 is not created with
hetcompute::do_not_collapse, t2->copy_value() accesses the int eventually computed
by task t.
compose_webpages(urls);
Assume that individual webpages can be rendered independently of each other, and that the fetching of data
and its styling can also be done in parallel. This informs the following parallel implementation of the
composite display application. The for loop in the compose_webpages function is executed in
parallel. The code to fetch data and style is launched as tasks that can execute in parallel, while the
render function is launched as a task scheduled to execute after the fetchdata and fetchstyle
tasks have finished.
In HetCompute, tasks are not related to each other except through task dependencies specified through
hetcompute::task<>::then or hetcompute::task<ReturnType(Args...)>::bind_all. A notable
consequence is that although the display_webpage task created in the for loop in compose_webpages
creates and launches three more tasks, those tasks are in no way related to the display_webpage task that
created them. Therefore, the display_webpage function must explicitly wait_for the render task to
finish before it returns, so that the display_webpage task finishes only when all tasks created and
launched by it have finished. Similarly, the compose_webpages function waits for all tasks it creates to
finish before it returns. While such a parallelization is desirable because each function was locally
parallelized in a modular fashion, the use of blocking hetcompute::task<>::wait_for to enforce the
synchronous function call interface can significantly hinder performance when there are many outstanding
hetcompute::task<>::wait_for calls in the application.
5.3.13.1 finish_after
int main() {
auto t1 = hetcompute::create_task(foo);
...
}
Both function calls are non-blocking, lightweight, and return immediately. Note that the non-blocking
parallelization using finish_after below is nearly identical to the blocking parallelization, with the
sole difference being the use of finish_after in place of wait_for. Furthermore, note that the
display_webpage and compose_webpages functions now return early, before any of the tasks they
launched finish. Consequently, these functions have become asynchronous and must be invoked from
within tasks. Therefore, in main(), compose_webpages is called from within task t rather than as a
synchronous function call.
#include <iostream>
#include <string>

#include <hetcompute/hetcompute.hh>

void display_webpage(char*);
void compose_webpages(int num_urls, char* urls[]);

void
display_webpage(char* url)
{
    auto fetchdata = hetcompute::create_task([=] {
        /*fetch(url, "fetchdata");*/
        return std::string(url) + " data";
    });
    auto fetchstyle = hetcompute::create_task([=] {
        /*fetch(url, "fetchstyle");*/
        return std::string(url) + " style";
    });
    auto render = hetcompute::create_task([](std::string data, std::string style) {
        /*render();*/
        std::cout << data + " " + style << std::endl;
    });
    // Render task may start executing only after data and style have been
    // fetched
    render->bind_all(fetchdata, fetchstyle);
    fetchdata->launch();
    fetchstyle->launch();
    render->launch();
    // Mark display_webpage as logically finishing after the render task finishes
    render->finish_after();
    // Return from function call even before any of the fetchdata, fetchstyle, or render
    // tasks finish. Such an early return makes the function asynchronous.
}

void
compose_webpages(int num_urls, char* urls[])
{
    auto g = hetcompute::create_group();
    // urls[0] is the program name (argv[0]), so the URLs start at index 1
    for (int i = 1; i < num_urls; i++)
    {
        g->launch([=] { display_webpage(urls[i]); });
    }
    // Mark compose_webpages as logically finishing after all webpages have been
    // composed and displayed
    g->finish_after();
    // Return from function call before any of the tasks finish
}

int
main(int argc, char* argv[])
{
    hetcompute::runtime::init();

    // Launch compose_webpages as a task since it is an asynchronous function
    // call
    auto t = hetcompute::launch([=, &argv] { compose_webpages(argc, argv); });
    // Waits for the composite display to be rendered!
    t->wait_for();

    // Shut down the runtime before returning
    hetcompute::runtime::shutdown();
    return 0;
}
The single, global, hetcompute::task<>::wait_for in main ensures that the composite display
is correctly rendered before the application terminates. Note that there are no other blocking calls necessary
to specify all the parallelism in the application.
hetcompute::task<>::finish_after or hetcompute::group::finish_after can be
invoked only from within a task. Note that a task can register itself as finishing after an arbitrary number of
other tasks and groups. A task can register itself as finishing after tasks or groups it did not create or launch.
finish_after is a means to semantically relate tasks to each other, e.g., in a parent-child relationship.
In the parallel mergesort example below, every node in the mergesort tree is specified to finish after
the merge step corresponding to that node finishes (line 42).
1 #include <algorithm>
2 #include <functional>
3 #include <iostream>
4 #include <iterator>
5 #include <sstream>
6 #include <vector>
7
8 #include <hetcompute/hetcompute.hh>
9
10 // Parallel mergesort using recursive fork-join parallelism.
11 // hetcompute::task<>::finish_after allows easy expression of the parallelism in the
12 // algorithm in a non-blocking manner, yielding better performance than
13 // blocking parallelization using hetcompute::task<>::wait_for.
14
16 const size_t GRANULARITY = 8192;
17
18 // Asynchronous mergesort, to be invoked in a task
19 template <typename Iterator, typename Compare>
20 void
21 mergesort(Iterator begin, Iterator end, Compare cmp)
22 {
23 size_t n = std::distance(begin, end);
24 if (n <= GRANULARITY)
25 {
26 std::sort(begin, end, cmp);
27 }
28 else
29 {
30 auto middle = begin;
31 std::advance(middle, n / 2);
32 auto left = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
33 auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
34 auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle,
end, cmp); });
35 // The left subtree and right subtree tasks must finish before the merge
36 // task can execute
37 left->then(merge);
38 right->then(merge);
39 merge->launch();
40 // mergesort(begin, end, cmp) logically finishes after the merge task
41 // finishes
42 merge->finish_after();
43 }
44 }
45
46 int
47 main(int argc, const char* argv[])
48 {
49 hetcompute::runtime::init();
50 std::vector<long> input;
51 size_t n_def = 1 << 16;
52 size_t n = n_def;
53
54 if (argc >= 2)
55 {
56 std::istringstream istr(argv[1]);
57 istr >> n;
58 }
59
60 // Create a random array of integers
61 for (size_t i = 0; i < n; i++)
62 {
63 input.push_back(rand());
64 }
65
66 // Launch mergesort inside a task since it has an asynchronous interface (due
67 // to use of hetcompute::task::finish_after)
68 auto t = hetcompute::launch([&] { mergesort(input.begin(), input.end(),
std::less<long>()); });
69 t->wait_for();
70
71 if (!std::is_sorted(input.begin(), input.end()))
72 {
73 std::cerr << "parallel mergesorting failed\n";
74 }
75
76 hetcompute::runtime::shutdown();
77 return 0;
78 }
As stated previously, calling finish_after in a function implicitly makes the function asynchronous. In
some cases, e.g., when the function is set to finish_after a single task, the asynchronous nature of the
function may be made explicit by modifying it to return that task, instead of calling finish_after. This
results in a lightweight, asynchronous, non-blocking API. For illustration, first consider the synchronous
sequential implementation below:
// synchronous sequential implementation
size_t
fibonacci_seq(size_t n)
{
if (n < 2)
{
return n;
}
else
{
auto a = fibonacci_seq(n - 1);
auto b = fibonacci_seq(n - 2);
return a + b;
}
}
This is trivially converted into the following fully asynchronous implementation in three easy steps:
The snippet illustrates how Task-Pointer Collapsing and Algebraic Operations on Tasks are synergistically
combined to enable the asynchronous fibonacci API. Note the close correspondence with the
synchronous sequential implementation shown above. The HetCompute API enables the programmer to
easily and elegantly express the concurrency in the algorithm.
As a performance optimization, the programmer may coarsen the size of tasks so that small Fibonacci terms
are computed sequentially while large ones are computed in parallel. Below is the full example with both
the optimized and unoptimized versions.
#include <hetcompute/hetcompute.hh>

// synchronous sequential Fibonacci API
size_t fibonacci_seq(size_t n);

// asynchronous Fibonacci API
hetcompute::task_ptr<size_t> fibonacci(size_t n);
5.3.13.3 Cancellation
As discussed in Cancellation, a task or group may be canceled. Consequently, any task registered to
finish_after the canceled task or group may be subject to cancellation.
#include <hetcompute/hetcompute.hh>

void foo();
void bar();

void
foo()
{
    auto tl = hetcompute::launch([] {
        while (true)
        {
            hetcompute::abort_on_cancel();
            // do something
        }
    });
    tl->finish_after();
    // do something
    tl->cancel();
}

void
bar()
{
    // do something
}

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task(foo);
    auto t2 = hetcompute::create_task(bar);

    t1->then(t2);

    t1->launch();
    t2->launch();

    try
    {
        t2->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        HETCOMPUTE_ILOG("t2 was canceled");
    }
    catch (...)
    {
        // Never reached
    }
    hetcompute::runtime::shutdown();

    return 0;
}
Referring to the above example, task t1 launches task tl, registers itself as finishing after task tl, and
subsequently cancels task tl. As a result, task t1 is itself canceled by the HetCompute runtime system and
cancellation is propagated to successors of task t1, which in this case is only task t2.
Canceling a task does not result in cancellation of other tasks or groups it is registered to finish_after.
As the example below shows, canceling task t does not cancel tl that t is registered to finish_after.
#include <atomic>
#include <cassert>

#include <hetcompute/hetcompute.hh>

void foo();

hetcompute::task_ptr<> tl;
std::atomic<bool> tl_running(false);
std::atomic<bool> stop_tl(false);

void
foo()
{
    tl = hetcompute::launch([] {
        while (!stop_tl)
        {
            hetcompute::abort_on_cancel();
            tl_running = true;
            // do something
        }
    });
    tl->finish_after();
}

int
main()
{
    hetcompute::runtime::init();
    auto t = hetcompute::launch(foo);

    // Spin until tl has started running
    while (!tl_running)
    {
    }

    t->cancel(); // Does not cancel tl

    stop_tl = true;

    try
    {
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        // Will never reach here since t->cancel is issued only after task t starts
        // running and task t never acknowledges cancellation
    }
    catch (...)
    {
        // Never reached
    }

    assert(!tl->canceled());
    HETCOMPUTE_ILOG("tl was not canceled");

    tl.reset();

    hetcompute::runtime::shutdown();
    return 0;
}
5.3.13.4 Summary
5.4 Buffers
Basic Usage of Buffers
Using Buffers with Tasks
Synchronized and Concurrent Use
Creating Buffers
Performance and Storage Optimizations When Using Buffers
Memory Regions
hetcompute::buffer_ptr<const float> b3 =
hetcompute::create_buffer<float>(100);
The runtime transparently manages the movement of the buffer data between specialized device-specific
backing stores. For example, the runtime allocates ION memory as backing store for the optimal sharing of
buffer data between the CPU and DSP. Similarly, the runtime uses an OpenCL buffer as backing store to
synchronize the buffer data between the CPU and GPU. Additionally, the runtime tries to take advantage of
any available advance knowledge of which devices may access a buffer to optimize the allocation of backing
stores from specialized device memories and to minimize the copying of data between the backing stores.
Please also refer to Textures for a GPU-only data structure suitable for image data.
There are four entities that may access a buffer’s data.
1. A CPU task
2. A GPU task
3. A DSP task
4. CPU host code
Task access: A task may access a buffer by taking the corresponding buffer pointer as an argument. A task
may access the buffer as an input, an output, or input-output, referred to as the direction of access. Recall
that a task is created using a device-specific kernel (hetcompute::cpu_kernel,
hetcompute::gpu_kernel and hetcompute::dsp_kernel). The kernel’s signature may
explicitly declare the direction for each buffer pointer parameter or the direction may be implicitly inferred
based on the mutability of the buffer pointer (hetcompute::buffer_ptr<T> versus
hetcompute::buffer_ptr<const T>).
Note that a CPU task may be created directly with a lambda, functor, or function parameter without
involving a CPU kernel. For such a CPU task, the access directions of any buffer pointer arguments are
inferred implicitly from the mutability of the buffer pointer parameters to the lambda, functor, or function.
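The implicit inference described above can be pictured with ordinary C++ type traits. The sketch below is not HetCompute code: toy_buffer_ptr, direction, and inferred_direction are hypothetical names invented for illustration, and the runtime may implement the inference differently. It only shows how the constness of the element type can select between an input and an input-output direction at compile time.

```cpp
#include <cstddef>
#include <type_traits>

// Hypothetical stand-in for a buffer pointer; NOT hetcompute::buffer_ptr.
template <typename T>
struct toy_buffer_ptr
{
    T*          data;
    std::size_t count;
};

// The two directions that can be told apart from mutability alone.
enum class direction
{
    in,
    inout
};

// Primary template is left undefined: only toy_buffer_ptr parameters
// carry an access direction in this sketch.
template <typename Param>
struct inferred_direction;

// A const element type is read-only, so it maps to "in";
// a mutable element type maps to "inout".
template <typename T>
struct inferred_direction<toy_buffer_ptr<T>>
{
    static constexpr direction value =
        std::is_const<T>::value ? direction::in : direction::inout;
};

static_assert(inferred_direction<toy_buffer_ptr<const int>>::value == direction::in,
              "const element type is inferred as input");
static_assert(inferred_direction<toy_buffer_ptr<int>>::value == direction::inout,
              "mutable element type is inferred as input-output");
```

Under this model a pure output direction cannot be inferred from mutability alone.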
Host code access: The application code on the CPU may directly access a buffer’s data using its buffer
pointer. A host code access refers to any access from the application code that is either not enclosed within
a task or uses a buffer pointer that is not a parameter to the enclosing task.
The following example illustrates the difference between task and host code access.
auto b1 = hetcompute::create_buffer<int>(3);
auto b2 = hetcompute::create_buffer<int>(3);
auto t = hetcompute::launch(
[=](hetcompute::buffer_ptr<int> x) {
// This is *task access* to b1’s buffer
// via task parameter x.
for (size_t i = 0; i < x.size(); i++)
x[i] = int(i);
b2.acquire_wi();
// This is *host code access* to b2’s buffer.
for (size_t i = 0; i < b2.size(); i++)
b2[i] = 1000 + int(i);
b2.release();
},
b1);
t->wait_for();
The following example illustrates buffer access by a CPU task created directly from a user function. Note
that the access directions are implicitly inferred from the mutability of the corresponding buffer pointer
parameters.
void user_function(hetcompute::buffer_ptr<const int> x, // x is input only
                   hetcompute::buffer_ptr<int> y)       // y is input-output
{
for (size_t i = 0; i < x.size(); i++)
y[i] = x[i] * 2;
}
int
main()
{
hetcompute::runtime::init();
auto b1 = hetcompute::create_buffer<int>(10);
auto b2 = hetcompute::create_buffer<int>(10);
b1.acquire_wi();
for (size_t i = 0; i < b1.size(); i++)
b1[i] = int(i);
b1.release();
b2.acquire_ro();
for (size_t i = 0; i < b2.size(); i++)
HETCOMPUTE_ILOG("b2[%zu]=%d", i, b2[i]);
b2.release();
hetcompute::runtime::shutdown();
return 0;
}
The following example illustrates buffer access by a CPU task created using a CPU kernel.
void user_function(hetcompute::buffer_ptr<const int> x, // x is input only
                   hetcompute::buffer_ptr<int> y)       // y is input-output
{
for (size_t i = 0; i < x.size(); i++)
y[i] = x[i] * 2;
}
int
main()
{
hetcompute::runtime::init();
auto b1 = hetcompute::create_buffer<int>(10);
auto b2 = hetcompute::create_buffer<int>(10);
b1.acquire_wi();
for (size_t i = 0; i < b1.size(); i++)
b1[i] = int(i);
b1.release();
// b1 is inferred as input
// b2 is inferred as input-output
auto t = hetcompute::launch(ck, b1, b2);
t->wait_for();
// elements of b2 are now double the corresponding elements of b1
b2.acquire_ro();
for (size_t i = 0; i < b2.size(); i++)
HETCOMPUTE_ILOG("b2[%zu]=%d", i, b2[i]);
b2.release();
hetcompute::runtime::shutdown();
return 0;
}
In the examples above, the user function accessed the buffer data by indexing the buffer pointer as an array.
The host code accesses the buffer data in a similar manner. The host code and CPU tasks may also request a
pointer to manipulate the entire contents of the buffer, as shown below.
auto b = hetcompute::create_buffer<int>(100);
Limitation
HetCompute v1.0 does not yet support explicit specification of access directions with CPU
kernels. The access directions are only allowed to be implicitly inferred.
The following example illustrates creation of a GPU task with implicitly inferred access directions, similar
to the CPU task examples above.
// Create a string containing OpenCL C kernel code.
#define OCL_KERNEL(name, k) std::string const name##_string = #k
int
main()
{
hetcompute::runtime::init();
auto buf_a = hetcompute::create_buffer<float>(3);
auto buf_b = hetcompute::create_buffer<float>(buf_a.size());
gpu_task->wait_for();
buf_b.acquire_ro();
for (size_t i = 0; i < buf_b.size(); i++)
HETCOMPUTE_ILOG("buf_b[%zu] = %f", i, buf_b[i]);
buf_b.release();
hetcompute::runtime::shutdown();
}
Access directions may be explicitly specified by wrapping the buffer pointer template parameters of the
kernel with hetcompute::in, hetcompute::out, and hetcompute::inout, as illustrated
below.
// Create a kernel object
auto gpu_vdouble =
    hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,   // explicit in direction
                                  hetcompute::out<hetcompute::buffer_ptr<float>>>  // explicit out direction
    (vdouble_kernel_string, "vdouble");
Note
A GPU kernel may be created using either an OpenCL C function or an OpenGL ES compute shader.
However, buffers interact with GPU kernels of either type in an identical manner. See GPU kernels for
OpenCL and OpenGL ES for an example program that uses a kernel created from an OpenGL ES
compute shader with buffers.
Consider a DSP function with the following IDL signature. The IDL signature explicitly identifies the in
and out access directions for the parameters.
long array_is_prime(in sequence<long> numbers, rout sequence<long> primes);
HetCompute recognizes the in and out access directions coming from the IDL signature when a
hetcompute::dsp_kernel instance is created from the DSP function, as illustrated in the following
example.
// dsp kernel creation
auto hex_kernel = hetcompute::create_dsp_kernel<>(hetcompute_dsp_array_is_prime);
// create the dsp task that will be executed on the DSP
auto hex_task = hetcompute::create_task(hex_kernel,
in_buf, // in access recognized
out_buf); // out access recognized
When a task completes execution, HetCompute no longer automatically synchronizes all the buffers
accessed by the tasks. The host code must explicitly synchronize to access the buffer data updated by the
task (see Host Access, Explicit Synchronization with Host Code).
HetCompute has deprecated the use of buffer_mode::synchronized and buffer_mode::relaxed
during buffer creation. Instead, all buffers follow the semantics of relaxed mode.
The HetCompute runtime allows multiple tasks to concurrently access the buffer provided they access the
buffer only as input. The runtime ensures that a task accessing the buffer as output or as input-output does
not execute concurrently with other tasks accessing that buffer. HetCompute disallows concurrent access to
a buffer when the buffer is being modified. The acquisition will be blocked when a concurrent task/pattern
has acquired the buffer for read-write or write-invalidate access. In rare situations, the acquisition may also
be blocked when a concurrent task/pattern has read-only access but HetCompute is unable to synchronize
the buffer data for host access until the concurrent task/pattern completes.
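The admission rule above (any number of concurrent readers, or exactly one writer) can be modeled in a few lines of standard C++. The class below is a hypothetical, single-threaded model named toy_buffer_admission; it is not the HetCompute scheduler. Here a conflicting request is simply refused, whereas the text above says the runtime prevents the conflicting task from executing concurrently.

```cpp
#include <cassert>

// Hypothetical model of the admission rule only; NOT HetCompute code.
class toy_buffer_admission
{
public:
    // A task that accesses the buffer as input may start
    // unless a writer is currently active.
    bool try_start_reader()
    {
        if (writer_active_)
        {
            return false;
        }
        ++active_readers_;
        return true;
    }

    // A task that accesses the buffer as output or input-output
    // needs exclusive access: no readers and no other writer.
    bool try_start_writer()
    {
        if (writer_active_ || active_readers_ > 0)
        {
            return false;
        }
        writer_active_ = true;
        return true;
    }

    void finish_reader()
    {
        assert(active_readers_ > 0);
        --active_readers_;
    }

    void finish_writer()
    {
        assert(writer_active_);
        writer_active_ = false;
    }

private:
    int  active_readers_ = 0;
    bool writer_active_  = false;
};
```

In this model two readers can be admitted back to back, a writer is refused until both readers finish, and no reader is admitted while the writer is active.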
The user specifies the datatype and number of elements needed in the buffer. HetCompute internally
manages the allocation of all the storage needed for the buffer.
auto b = hetcompute::create_buffer<int>(100);
The initial storage and data for the buffer can be provided by the user. The HetCompute runtime may
allocate additional backing stores as needed, and will handle the synchronization between the user-provided
storage and any internal backing stores.
// user creates storage
std::vector<int> v;
for(int i = 0; i < 100; i++)
v.push_back(i);
HetCompute v1.0 introduces the following host synchronization calls in the buffer_ptr class:
1. acquire_ro(): The host code gains read-only access. Results from a prior task become visible to
the host code.
2. acquire_rw(): The host code gains read-write access. Results from a prior task become visible to
the host code. Modifications to the buffer data by the host code will be made visible to any
subsequent tasks.
3. acquire_wi(): The host code invalidates (clobbers) the prior contents of the buffer. Results from
a prior task may be lost. The entire contents of the buffer should be treated as undefined, save for the
new contents written by the host code subsequent to this synchronization call. It is valid for the host
code to read back any new contents written by the host code subsequent to this call. The new contents
of the buffer will be made visible to subsequent tasks.
The buffer synchronization allows the host code to access the buffer data updated by the task (see Host
Access). acquire_ro() acquires the underlying buffer for read-only access by the host code. The host
code may also modify the buffer data by attempting to acquire the underlying buffer for write access using
acquire_wi() or acquire_rw() buffer APIs.
All acquire_*() calls block until any conflicting operations complete (e.g., a task concurrently
performing read-write access to the buffer), after which the buffer is acquired for access by the host code
and the call unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*() call, the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_*() call establishes the access type (read-only,
write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive acquire_*() calls
succeed only if they are compatible with the previously established access type. Subsequent recursive
acquire_wi() and acquire_rw() calls return with failure if the first acquisition was
acquire_ro(), as the access type of these calls is incompatible with the established read-only access.
However, any subsequent recursive acquire_*() call succeeds if the first acquisition was either
write-invalidate or read-write. When the established access type is write-invalidate, subsequent recursive
read-only or read-write acquisitions are considered to get access to any data written to the buffer after the
original write-invalidate. When the established access type is read-write, a subsequent recursive
write-invalidate does not destroy any prior data, as there is no additional synchronization required between
device memories to access the latest data.
The host code releases the buffer only when the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
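The recursive acquisition rules can be summarized as a small state machine. The sketch below is a hypothetical, single-threaded model written in standard C++; toy_host_acquisition and its members are invented names, not part of the buffer_ptr API, and blocking on conflicting tasks is not modeled. It tracks only the established access type and the number of successful acquisitions.

```cpp
#include <cassert>

// Access types a host acquisition can establish.
enum class access
{
    none,
    ro, // read-only
    rw, // read-write
    wi  // write-invalidate
};

// Hypothetical model of recursive host acquisition; NOT HetCompute code.
class toy_host_acquisition
{
public:
    // Returns true when the acquisition succeeds.
    bool acquire(access requested)
    {
        assert(requested != access::none);
        if (depth_ == 0)
        {
            // The first acquisition establishes the access type.
            established_ = requested;
            depth_       = 1;
            return true;
        }
        // Write requests are incompatible with established read-only access.
        if (established_ == access::ro && requested != access::ro)
        {
            return false;
        }
        ++depth_;
        return true;
    }

    // Returns true when this call fully releases the buffer, i.e. when the
    // number of releases equals the number of successful acquisitions.
    bool release()
    {
        assert(depth_ > 0);
        --depth_;
        if (depth_ == 0)
        {
            established_ = access::none;
            return true;
        }
        return false;
    }

    bool   held() const { return depth_ > 0; }
    access established() const { return established_; }

private:
    access established_ = access::none;
    int    depth_       = 0;
};
```

Failed acquisitions do not increase the count, so they do not need a matching release in this model.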
// Relaxed host synchronization:
// Select during buffer creation to get better performance
auto buf_a = hetcompute::create_buffer<float>(3);
auto buf_b = hetcompute::create_buffer<float>(buf_a.size());
buf_a.acquire_wi();
// Initialize the input
for (size_t i = 0; i < buf_a.size(); ++i)
buf_a[i] = i;
buf_a.release();
buf_b.acquire_ro();
for (size_t i = 0; i < buf_b.size(); i++)
HETCOMPUTE_ILOG("buf_b[%zu] = %f", i, buf_b[i]);
buf_b.release();
buf_a.acquire_rw();
// Read buf_a
for (size_t i = 0; i < buf_a.size(); i++)
HETCOMPUTE_ILOG("buf_a[%zu] = %f", i, buf_a[i]);
The buffer contents become undefined if the host code accesses the buffer data without explicit
synchronization when the synchronization was required (such as after a task access), or if the host code
performs accesses incompatible with the type of access chosen (e.g., writing to the buffer after invoking
acquire_ro()).
The likely devices information is used merely as an optimization hint. If the information turns out to be
partial or incorrect, the only penalty will be some avoidable backing store allocations and a performance hit
due to the avoidable copying of the buffer data between the backing stores.
5.5 Textures
Texture APIs in HetCompute are useful for image processing tasks. These APIs allow the user to create
image objects from data residing in host memory and provide them to a kernel for processing. These APIs
also significantly simplify the programming of parallel image processing, such as filtering.
When a user is processing 2D or 3D image data, HetCompute texture APIs are useful as they provide a
multitude of image formats, filtering modes and addressing modes for accessing and manipulating pixel
data in a GPU kernel effectively.
All APIs are accessible by including the following header:
#include <hetcompute/texture.hh>
Note that HetCompute internally handles the initialization of device- and platform-dependent contexts, so
programmers do not need to query or create these contexts themselves. To begin, programmers can use
the following code to create HetCompute textures directly:
input_tex = hetcompute::graphics::create_texture<img_format, 2>(
    { width, height }, static_cast<unsigned char const*>(input_img_data));
mytextureptrtype and mysamplerptrtype are the corresponding types of textures and
samplers in HetCompute, which are defined as follows:
typedef hetcompute::graphics::texture_ptr<img_format, 2> mytextureptrtype;
Please note that source_string contains the actual OpenCL kernel code that takes textures as kernel
function arguments and uses them. In addition, the template parameters must match the signature of the
kernel function. The kernel source code provided in source_string is as follows:
__kernel void box_filter(__read_only image2d_t source,
__write_only image2d_t dest,
sampler_t sampler)
{
// image dimensions
int img_width = get_global_size(0);
int img_height = get_global_size(1);
The kernel above applies an 8-by-8 box filter to an input image and generates a smoothed output image.
The kernel can be executed in the style of general HetCompute tasks:
// launch GPU kernel over 2D range
hetcompute::range<2> r(0, width, 0, height);
t->launch();
t->wait_for();
Internally, the kernel call is directed to the OpenCL driver for execution.
The processed result image can be read back using the following hetcompute::graphics::map API:
// read back result to CPU
auto ptr = static_cast<unsigned char*>(hetcompute::graphics::map(output_tex));
if (ptr != output_img_data)
HETCOMPUTE_FATAL("mapped addr does not match the original one.\n");
hetcompute::graphics::unmap(output_tex);
Please note that ptr should match the data pointer output_img_data that we use to create the output
texture as a sanity check.
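Once the result has been read back, it can be useful to compare it against a CPU reference. The sketch below is a hypothetical reference for an 8-by-8 box filter in standard C++. The function name toy_box_filter_8x8, the single-channel 8-bit layout, the window placement (4 pixels before and 3 after the center), the clamp-to-edge addressing, and the integer averaging are all assumptions made for illustration; the GPU kernel's actual window alignment, addressing mode, and rounding may differ.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical CPU reference for an 8x8 box filter on a single-channel,
// row-major 8-bit image. Assumptions: window covers [x-4, x+3] x [y-4, y+3],
// out-of-range coordinates are clamped to the nearest edge pixel.
std::vector<std::uint8_t>
toy_box_filter_8x8(const std::vector<std::uint8_t>& src, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * static_cast<std::size_t>(height));

    std::vector<std::uint8_t> dst(src.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            unsigned sum = 0;
            for (int dy = -4; dy < 4; ++dy)
            {
                for (int dx = -4; dx < 4; ++dx)
                {
                    // Clamp the sample position to the image bounds
                    const int sx = std::min(std::max(x + dx, 0), width - 1);
                    const int sy = std::min(std::max(y + dy, 0), height - 1);
                    sum += src[static_cast<std::size_t>(sy) * static_cast<std::size_t>(width) +
                               static_cast<std::size_t>(sx)];
                }
            }
            // Average over the 64 samples of the window (integer division)
            dst[static_cast<std::size_t>(y) * static_cast<std::size_t>(width) +
                static_cast<std::size_t>(x)] = static_cast<std::uint8_t>(sum / 64);
        }
    }
    return dst;
}
```

A constant image is unchanged by this filter, which gives a simple first check before comparing real image data.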
HetCompute also handles release of the OpenCL contexts and HetCompute texture objects. However, the
programmers should still call hetcompute::graphics::unmap to release the mapping between CPU
memory and GPU memory, so the same HetCompute texture object can be reused for subsequent kernel
calls.
typedef
hetcompute::graphics::texture_ptr<hetcompute::graphics::image_format::CompressedTP10unorm_int10, 2> mytextur
typedef hetcompute::graphics::sampler_ptr<hetcompute::graphics::addressing_mode::ADDRESS_NONE,
hetcompute::graphics::filter_mode::FILTER_NEAREST>
mysamplerptrtype;
auto sampler =
hetcompute::graphics::create_sampler<hetcompute::graphics::addressing_mode::ADDRESS_NONE,
hetcompute::graphics::filter_mode::FILTER_NEAREST>(
false);
auto hetcomputegpukernel =
hetcompute::create_gpu_kernel<mytextureptrtype,
mytextureptrtype,
mytextureptrtype,
mysamplerptrtype>(default_source_string, "copy_tp10_yuv_image_to_y_image_and_uv_image");
*(src_ion_mem),
true);
*(dst_ion_mem),
true);
The above example creates the source and destination ION memory, respectively. The source ION memory is
populated with the image data being read. The GPU kernel and the sampler are created as in the previous example.
The above snippet creates the parent UBWC TP10 textures. Since we are writing UBWC TP10 data to
output_tex, the example creates the derivative Y and UV planes. Both derivative textures are later
passed into the GPU kernel, which copies the data to the output Y and UV textures.
At a high level, the bounded_lfqueue is implemented as a fixed-size circular array, whose size is
defined by the user through an input parameter. Specifically, in HetCompute, the size of the array is forced to
be a power of two by taking the base-2 logarithm of the size as input. Consider the following example of a
bounded_lfqueue instantiation:
hetcompute::bounded_lfqueue<size_t> q(8);
In this example, the size of the bounded_lfqueue is set to 2^8 = 256 entries. When the queue is full, a
push operation cannot add a new value into the queue until one has been popped.
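The sizing rule can be illustrated with a small single-threaded sketch. It is not the HetCompute implementation and it is not lock-free; the class name ring_sketch and all of its members are hypothetical. It only shows how a log2 size parameter yields a power-of-two capacity, which lets index wrap-around use a bit mask instead of a modulo operation.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-threaded ring buffer; it illustrates the sizing rule only.
template <typename T>
class ring_sketch
{
public:
    // log2_size is the base-2 logarithm of the capacity, as in the text.
    explicit ring_sketch(std::size_t log2_size)
        : _slots(capacity_for(log2_size)), _mask(_slots.size() - 1)
    {
    }

    std::size_t capacity() const { return _slots.size(); }

    // Returns false when the buffer is full.
    bool push(T const& v)
    {
        if (_tail - _head == _slots.size())
            return false;
        _slots[_tail & _mask] = v; // the mask replaces "% capacity"
        ++_tail;
        return true;
    }

    // Returns false when the buffer is empty.
    bool pop(T& out)
    {
        if (_head == _tail)
            return false;
        out = _slots[_head & _mask];
        ++_head;
        return true;
    }

private:
    static std::size_t capacity_for(std::size_t log2_size)
    {
        assert(log2_size < sizeof(std::size_t) * 8);
        return std::size_t(1) << log2_size;
    }

    std::vector<T> _slots;
    std::size_t    _mask;
    std::size_t    _head = 0; // total number of successful pops
    std::size_t    _tail = 0; // total number of successful pushes
};
```

Constructing ring_sketch<size_t> with the argument 8 gives a capacity of 256; the 257th consecutive push returns false until a pop frees a slot.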
The lfqueue is of unbounded size and can be thought of as a linked list in which each node is a
bounded_lfqueue. As in the case of the bounded_lfqueue, the size parameter passed during instantiation
gives its initial size. The lfqueue then extends itself in chunks of this size whenever needed.
The following is a simple example that illustrates the use of the lfqueue.
1 #include <hetcompute/lfqueue.hh>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 hetcompute::lfqueue<size_t> q(8);
10
11 // Create groups for producer and consumer tasks
12 auto producer = hetcompute::create_group("Producer");
13 auto consumer = hetcompute::create_group("Consumer");
14
15 // Launch 2 tasks into the producer group,
16 // each of which pushes 100 values into q
17 for (size_t p = 0; p < 2; ++p)
18 {
19 producer->launch([&]() {
20 for (size_t i = 0; i < 100; i++)
21 {
22 q.push(i);
23 }
24 });
25 }
26
27 // Launch 2 tasks into the consumer group,
28 // each of which pops 100 values from q
29 for (size_t c = 0; c < 2; ++c)
30 {
31 consumer->launch([&]() {
32 size_t j = 0;
33 while (j < 100)
34 {
35 size_t result;
36 if (q.pop(result))
37 {
38 // The popped value is stored in result
39 ++j;
40 }
41 }
42 });
43 }
44
45 // wait for consumer group to finish
46 consumer->wait_for();
47
48 hetcompute::runtime::shutdown();
49 }
In the above example, two HetCompute groups, producer and consumer, are created first (lines 12 and
13). Two HetCompute tasks are then launched into each group. Each task in the producer group pushes
100 size_t values (lines 19-24) into the queue q (instantiated in line 9), and the tasks in the consumer
group concurrently pop the values (lines 32-41) from q. The program terminates only when all 200
values pushed into the queue have been popped. Therefore, it suffices to wait for the consumer group to
finish (line 46), as the consumer tasks complete only after each one has popped 100 values.
5.7 Storage
5.7.1 Task-Local Storage
Tasks, much like threads, can be associated with task-local storage, via
hetcompute::task_storage_ptr. The usage pattern consists of declaring a global variable, say
storage, which holds a pointer to the actual task-local data. Then, within task t, that variable is assigned
a pointer to a (usually) local variable, or a chunk of freshly allocated memory. After that, storage can be
used within the dynamic extent of task t:
#include <hetcompute/hetcompute.hh>

namespace
{
    hetcompute::task_storage_ptr<int> storage;
}; // namespace

void func();

void
func()
{
    HETCOMPUTE_ILOG("%d", *storage);
    ++*storage;
}

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    for (int i = 0; i < 10; ++i)
    {
        g->launch([i] {
            int v = i;
            storage = &v;
            func();
            if (v != i + 1)
            {
                HETCOMPUTE_ILOG("error");
            }
            func();
            if (v != i + 2)
            {
                HETCOMPUTE_ILOG("error");
            }
        });
    }
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
Note that accessing the value of storage affects only the current task. Attempting to modify the value of
a task_storage_ptr outside of a task yields undefined behaviour.
Optionally, a destructor (or rather, a finalizer) can be employed to dispose of resources. The destructor runs
within each task that has a value assigned to the global variable.
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::scheduler_storage_ptr<size_t> s_sls_state;
}; // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t c1 = ++*s_sls_state;
            t->launch();
            t->wait_for();
            size_t c2 = ++*s_sls_state;
            if (c1 + 1 != c2)
            {
                HETCOMPUTE_ILOG("error: mismatch");
            }
        });
    }

    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
A complete example
#include <algorithm>
#include <iterator>

#include <hetcompute/hetcompute.hh>

template <size_t N>
struct image_scratchpad
{
    image_scratchpad() { std::fill(std::begin(edge_image), std::end(edge_image), 0); }
    char edge_image[N];
};

namespace
{
    const hetcompute::scheduler_storage_ptr<image_scratchpad<4096>> image_buffers;
}; // namespace

int
main()
{
    hetcompute::runtime::init();
    int const N = 200;

    auto g = hetcompute::create_group();
    for (int i = 1; i < N; ++i)
    {
        g->launch([i] {
            // fill image buffer, which is reused across tasks
            for (auto& slot : image_buffers->edge_image)
                slot = i & 0xff;
            hetcompute::internal::yield(); // context-switch, we expect SLS to survive this
            // check contents
            for (auto const& slot : image_buffers->edge_image)
            {
                if (slot != char(i & 0xff))
                {
5.8 Affinity
The affinity APIs enable the programmer to change execution properties of program statements (arbitrary
functions), HetCompute tasks, and device threads. These properties include:
• location: to set the CPUs where the program constructs should run.
• pinning: to set whether HetCompute device threads should migrate freely among cores (also known
as thread binding).
• mode: to override local affinity settings.
Programmers can benefit from these APIs to improve performance and even to save power. The affinity APIs are
declared in hetcompute/affinity.hh and are accessible by including the main HetCompute header:
#include <hetcompute/hetcompute.hh>
Note
For setting the affinity of individual tasks (rather than all tasks using the above APIs) to big or LITTLE
in a big.LITTLE SoC, use CPU kernel attributes (Setting Kernel Attributes).
Regarding the capabilities of the APIs, location enables targeting clusters of cores in a heterogeneous
system-on-chip (SoC), such as Qualcomm Snapdragon 845 or 835, where not all cores are equal and
different clusters provide different performance/power points. For example, in a Snapdragon 845, a programmer may
choose to run only in the LITTLE cluster as illustrated in the following example, which demonstrates all
other affinity APIs as well.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings = hetcompute::affinity::settings(hetcompute::affinity::cores::big,
                                                       false,
                                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    auto g = hetcompute::create_group(__FUNCTION__);

    auto k_wout_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task without kernel affinity attribute."); });

    auto k_with_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task with kernel affinity attribute"); });
    k_with_attrib.set_little();

    // k_with_attrib kernel will run in a LITTLE core
    g->launch(k_with_attrib);

    // k_wout_attrib can run in any core
    g->launch(k_wout_attrib);

    g->wait_for();

    // Set the affinity to the LITTLE cores without pinning in
    // allow_local_setting mode
    hetcompute::affinity::set(hetcompute::affinity::settings(hetcompute::affinity::cores::little,
                                                             false,
                                                             hetcompute::affinity::mode::allow_local_setting));

    // k_wout_attrib task will run in a LITTLE core because the kernel has no
    // individual affinity specification
    g->launch(k_wout_attrib);

    // Set the affinity to the big cores with pinning in allow_local_setting mode
    // by reading the current affinity and then updating the different fields
    auto affinity = hetcompute::affinity::get();

    // Update the cores from LITTLE to big
    affinity.set_cores(hetcompute::affinity::cores::big);

    // Enable thread pinning
    affinity.set_pin_threads();

    // Update the mode from allow_local_setting to override_local_setting in the
    // settings
    affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

    // Update the affinity with the modified affinity object
    hetcompute::affinity::set(affinity);

    // The second run of k_with_attrib will run on a big core because the
    // affinity mode is override_local_setting and global affinity settings are
    // obeyed
    g->launch(k_with_attrib);

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
The example illustrates three different ways of setting affinity to program constructs and also shows how
one can override the others.
Specifying a big or LITTLE location in an SoC with homogeneous cores, such as Snapdragon 805,
will have no effect. However, the pinning request will still be fulfilled.
To update individual aspects of the current affinity settings, use the following API:
auto affinity = hetcompute::affinity::get();
In our example, the k_with_attrib kernel executes on a LITTLE core the first time it is launched; the second
time, it runs on a big core.
To respect local affinity settings, set mode to
hetcompute::affinity::mode::allow_local_setting
Note
Very few situations will benefit from pinning; due to the thermally constrained environment of mobile
packages, CPUs can go online/offline unannounced. When requesting pinning, if there are offline
CPUs, HetCompute will pin device threads as much as possible to a single CPU; however, some device
threads may remain unpinned.
And the system will return to the default state where device threads can freely move across CPUs.
All previous free functions are thread-safe, and programmers may call the affinity APIs at any point of
execution, even within CPU tasks.
step-by-step. This application takes in an array of 10 floating-point numbers, x[i], and computes x[i] * x[i]
+ 1 / x[i] for each element. That is, an input of 1.0, 2.0, ..., 10.0 produces an output of 2.0, 4.5, ...,
100.1.
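For checking the output of the heterogeneous version, it can help to have a CPU-only reference for the same arithmetic. The following is a minimal sketch in standard C++; the function name reference_result is hypothetical and is not part of HetCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU-only reference: returns x * x + 1 / x for
// x = 1.0, 2.0, ..., n.
std::vector<float> reference_result(std::size_t n)
{
    assert(n > 0);
    std::vector<float> out(n);
    for (std::size_t i = 0; i < n; ++i)
    {
        float const x = static_cast<float>(i + 1);
        out[i] = x * x + 1.0f / x;
    }
    return out;
}
```

Calling reference_result(10) yields 2.0 for x = 1, 4.5 for x = 2, and approximately 100.1 for x = 10, which can be compared element by element against the buffer contents produced by the device tasks.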
First, the device functions must be written. As explained in Kernels: The Path to Heterogeneity, at present,
CPUs, GPUs, and DSPs are programmed in different languages in HetCompute. The following example
lists three device functions that will be used in the example program, showing different device
programming styles.
#pragma once

#include <string>

// A CPU function that initializes an array
void
f1(float* b, int N)
{
    for (int i = 0; i < N; i++)
        b[i] = i + 1.0f;
}

// A GPU function that computes squares
std::string const f2_string = "__kernel void f2(__global float *in, __global float *out) {"
                              "  int i = get_global_id(0);"
                              "  out[i] = in[i] * in[i];"
                              "}";

// A DSP function that adds reciprocals to the values already in out
int
f3(float* in, int lin, float* out, int lout)
{
    int i;
    for (i = 0; i < lin && i < lout; i++)
        out[i] += 1 / in[i];
    return 0;
}
After the device functions are written, the next step is to consider how data is passed between these
functions and create data containers accordingly. HetCompute buffers serve this purpose. In particular,
buffers abstract away much of the manual data marshaling in traditional heterogeneous programming
environments, greatly simplifying multi-device programming.
Finally, the methods described in Kernels: The Path to Heterogeneity and Creating Tasks can be used to
create kernels and tasks, set dependencies between tasks, and launch them. This produces the following
program.
#include <hetcompute/hetcompute.hh>
#include "heterogeneous.hh"

int
main()
{
    hetcompute::runtime::init();
    // The number of elements to compute x^2 + 1/x for
    constexpr int N = 10;

    // Create buffers for input and output data
    auto b1 = hetcompute::create_buffer<float>(N);
    auto b2 = hetcompute::create_buffer<float>(N);

    // The CPU initializes the input data first
    auto t1 = hetcompute::create_task(f1);

    // The GPU squares every input element
    auto k2 = hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<float>, hetcompute::buffer_ptr<float>>(f2_string, "f2");
    auto t2 = hetcompute::create_task(k2, hetcompute::range<1>{ N }, b1, b2);
Note
It is worth emphasizing again that HetCompute tasks are universal across different devices. While
tasks may contain kernels customized for different devices, at the task level and above, a programmer
should not need to distinguish between these tasks.
While the example above is functionally correct, its performance can be improved by a few simple
techniques.
1. The GPU and DSP kernels in the example above are sequentially executed. However, through the use
of additional buffers, they can be launched asynchronously, giving them a chance to execute
concurrently depending on scheduling results. The caveat is to ensure that they converge in the same
host thread by using a group wait_for.
2. While the default buffer constructors need very few arguments to produce functionally correct
behavior, their performance can be improved by providing additional hints to the constructors. For
example, hetcompute::in, hetcompute::out, and hetcompute::inout can be used to
qualify the buffers and avoid unnecessary copies. Additionally, hints about likely devices can guide
storage allocation for buffers.
The optimizations above produce the following result:
#include <hetcompute/hetcompute.hh>
#include "heterogeneous2.hh"

int
main()
{
    hetcompute::runtime::init();
    // The number of elements to compute x^2 + 1/x for
    constexpr int N = 10;

    // Create buffers for input and output data
    auto b1 = hetcompute::create_buffer<float>(N);
    auto b2 = hetcompute::create_buffer<float>(N);
    auto b3 = hetcompute::create_buffer<float>(N);
    auto b4 = hetcompute::create_buffer<float>(N);

    // The CPU initializes the input data first
    auto t1 = hetcompute::create_task(f1);

    // The GPU squares every input element
    auto k2 = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
                                            hetcompute::out<hetcompute::buffer_ptr<float>>>(f2_string, "f2");
    auto t2 = hetcompute::create_task(k2, hetcompute::range<1>{ N }, b1, b2);

    // The DSP computes the reciprocals into a separate buffer
    auto k3 = hetcompute::create_dsp_kernel<>(f3);
    auto t3 = hetcompute::create_task(k3, b1, b3);

    // Run all the tasks
    auto g = hetcompute::create_group();
    t1 >> t2;
    t1 >> t3;
    t1->launch(b1, N);
    g->launch(t2);
    g->launch(t3);
    g->wait_for();

    // Combine the results for every element, starting at index 0
    for (int i = 0; i < N; i++)
        b4[i] = b2[i] + b3[i];

    // Output the combined result
    for (int i = 0; i < N; i++)
        HETCOMPUTE_ILOG("%f\n", b4[i]);

    hetcompute::runtime::shutdown();

    return 0;
}
5.10 Interoperability
The HetCompute programming model isolates programmers from threads; however, HetCompute
applications are multithreaded. This section discusses some of the interoperability issues that arise
from using threads in your application and how they interact with the HetCompute runtime.
A HetCompute application starts in a main thread and creates a thread pool. The number of threads in
the thread pool depends on the platform. Additional threads may be created based on application behavior
in order to keep all the resources busy. A HetCompute task might be created in one thread, while other
operations on it, such as launching and executing, might take place on other threads. Heterogeneous tasks
on the GPU and DSP behave similarly. Any application thread may call hetcompute::create_task(Code &&code,
Args &&...args) and hetcompute::task<>::launch(). The task will be executed by one of
the threads in the HetCompute thread pool. This distinction is important because programmers
need to ensure that data accessed in the task remains available throughout the lifetime of the task (even if it
was allocated in a different thread) and that the data is accessed in a thread-safe manner.
HetCompute provides certain guarantees with respect to task execution, as defined below.
• hetcompute::task<>::wait_for()
• hetcompute::group<>::wait_for()
• hetcompute::condition_variable::wait()
Xlib
The Xlib libraries are typically not thread-safe, although each implementation is different. It is not
possible to perform display operations from two threads at the same time because this could corrupt
internal data structures. While multiple threads can be used, the programmer must ensure that only one
thread can be using Xlib at any time.
UI Toolkits
User interface toolkits, such as Qt, typically have a main thread which is dedicated to processing
input events, manipulating a display, and then sleeping until more input occurs. It is important that
control is returned to the UI toolkit as soon as possible to ensure that the user experience is smooth and
uninterrupted. If a call from a different thread is made to a function that manipulates the UI or triggers
an event, the toolkit may corrupt a data structure, or detect this and generate an error message.
OpenGL
Each OpenGL implementation varies in how it can be used with multiple threads. In typical usage, you
create an OpenGL context in the thread where you intend to use it. The OpenGL library then sets
internal state information into TLS. This internal state is used so that when calls are made to OpenGL,
you do not need to pass the context around each time. However, the TLS is set for only one thread. So
if you try to make an OpenGL call from a different thread, the implementation may fail. Some
implementations allow multiple contexts, with each context being created on the thread where it will
be used. With multiple contexts, some implementations also allow parallel access to the OpenGL
library, although this support varies depending on the vendor. Hardware implementing OpenGL
typically uses some kind of command buffer, which can force a sequential ordering of commands.
Therefore, trying to implement calls to OpenGL in parallel may not provide any benefit, and may
actually slow things down due to contention on the mutex used to protect the command buffer. An
OpenGL application is typically used with some kind of user interface and event handler, which will
be running on the main thread. So it is recommended that you perform your OpenGL calls in the same
thread as the user interface.
While these are limitations that need to be taken into consideration, it is still possible to exploit parallelism
using HetCompute in these types of applications. For example, consider the case of a game with physics
simulation, where the user can click on the display to launch spheres into a room. In a sequential
implementation, the user touches the display, which generates a UI event. The UI thread wakes up and
processes the UI event, which needs to generate the new sphere in the physics simulation. The physics
simulation runs for the time required to compute the result. The location of all objects in the physics
simulation is then traversed, and OpenGL calls are made to draw the scene. The OpenGL buffers are then
swapped onto the display, and the thread goes back to sleep to wait for either a UI event, or a timeout to
refresh the display with no change.
In the previous example, the bulk of the calculations are performed in the physics engine. This part is
computationally expensive and is where most optimization work can be applied. So HetCompute can
be used here to perform the computation in parallel, assuming the underlying implementation supports this.
The user breaks down the parts of the simulation into a suitable number of tasks, specifies dependencies,
and then launches them with HetCompute. When the tasks are launched, the thread does a wait_for()
until the tasks have completed. In the meantime, the HetCompute thread pool begins executing the tasks,
which spreads the computational load across all available processors. When the tasks are complete, the
wait_for() will return, and execution on the main thread can continue with the calls to OpenGL for
rendering. With this arrangement, you can see that the operations that are thread-sensitive are performed in
one thread, guaranteeing safe use of libraries such as OpenGL and the user-interface toolkit. Many
computationally intensive OpenGL applications are written with an event loop very similar to that described
above, so these changes should be relatively simple to implement to take advantage of HetCompute.
The C++11 standard indicates that the iostream library should be thread safe. As of this writing,
programmers have experienced stability issues on some platforms, such as Android and OS X. To
maximize portability, HetCompute applications should avoid using cout and cerr to perform
asynchronous writes, especially to the console. The C-based stdio printf routines are
recommended instead. On Android, programmers have experienced additional issues with
stringstream objects.
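The stdio recommendation can be sketched as follows: format one complete line into a local buffer, then emit it with a single stdio call. The helper names format_task_line and log_task_line are hypothetical. The sketch assumes a POSIX-like platform on which each stdio call locks the stream; that is a platform property, not something HetCompute guarantees.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Hypothetical helper: formats one complete line into a caller-provided buffer.
// Returns the length the formatted line needs, excluding the terminating null
// character (the snprintf convention).
int format_task_line(char* buf, std::size_t len, int task_id, double value)
{
    assert(buf != nullptr && len > 0);
    return std::snprintf(buf, len, "task %d: value=%.2f\n", task_id, value);
}

// Hypothetical helper: writes the whole line with one stdio call so that the
// line is handed to the stream in a single operation.
void log_task_line(int task_id, double value)
{
    char buf[64];
    if (format_task_line(buf, sizeof buf, task_id, value) > 0)
        std::fputs(buf, stdout);
}
```

A task body would call log_task_line(id, result) instead of chaining several stream insertions, each of which is a separate operation on the shared stream.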
6 Parallel Processing Tutorial
6.1 Abstract
In this tutorial, general principles of parallel programming are introduced with an emphasis on task-based
parallel programming models. Scaling is first introduced as a metric of evaluating the potential speedup that
an algorithm can obtain. Different parallel programming paradigms are introduced with a number of
optimizations for parallel code. Examples are provided to illustrate the HetCompute programming model.
\[
\mathrm{ParallelSpeedup} = \frac{s + p}{s + p/N} = \frac{1}{s + p/N},
\]
where s and p are the serial and parallel fractions of the program, respectively, with s + p = 1, and N is
the number of processors. Using Amdahl’s law, the speedup that can be obtained with eight processors as a
function of the serial fraction is illustrated in Figure Amdahl.
Note that even if the serial fraction is only 10%, the maximum theoretical speedup achievable with eight
processors is about 4.7. In practice, however, hardware architecture characteristics, such as caching, allow
programmers to obtain much better performance from multicore systems. Amdahl’s law expresses performance
increase for constant problem size (strong scaling). Gustafson [9] demonstrates that parallel processing can be
used to perform more work in the same amount of time by increasing the problem size, thus improving scalability.
This technique is called weak scaling. Architectural artifacts [11] also play an important role; additional
processors come with additional cache and memory resources, often enabling applications to obtain
super-linear speedup. A number of optimizations that take advantage of architectural features are discussed
in Section Optimizations.
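The two scaling models can be evaluated with a few lines of standard C++. The sketch below is illustrative only; the function names are hypothetical. The Gustafson formula used here, s + (1 - s) * N, is the usual statement of scaled speedup and is not taken from this guide.

```cpp
#include <cassert>

// Amdahl's law (strong scaling): speedup on n processors for a fixed problem
// size, where s is the serial fraction and s + p = 1.
double amdahl_speedup(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    double const p = 1.0 - s;
    return 1.0 / (s + p / n);
}

// Gustafson's law (weak scaling): scaled speedup when the parallel part of the
// problem grows with the number of processors n.
double gustafson_speedup(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    return s + (1.0 - s) * n;
}
```

As n grows, amdahl_speedup approaches 1 / s, so a serial fraction of 10% caps the strong-scaling speedup at 10 regardless of the processor count, whereas gustafson_speedup keeps growing linearly in n (about 7.3 for s = 0.1 and n = 8).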
compute on the GPU as part of the programming model. The code example below shows a simple
scalar-vector multiply and add (SAXPY) using the HetCompute pfor_each pattern (see Parallel
Programming Patterns) and vector operations.
void saxpy(float* y, float a, float* x, int n) { // Y += a * X
// for simplicity only vector sizes are multiplied that are multiples of 4
assert(n%4 == 0);
// Create a Qualcomm Hexagon DSP kernel with a buffer and a long ptr parameter
auto hk = hetcompute::create_dsp_kernel<hetcompute::buffer<float>, long *>(math_sum);
long sum;
// Bind arguments and execute on DSP
g->launch(hk, buf_a, &sum);
g->wait_for();
6.5 Optimizations
Other than algorithmic decomposition of work and data, a parallel program requires tuning to a specific
platform to achieve optimal performance. As a general rule, the following process should be used:
• Serial tuning: The code executed by each task should be optimized using classical optimization
techniques: loop optimizations, strength reduction, and cache locality optimizations.
• Synchronization tuning: Coordinating parallel execution is typically considered overhead – the
program executes additional instructions that are not necessarily part of the effective work. Such
overhead includes serialization in critical sections, waiting for dependencies to be satisfied and/or
condition variables to be signaled, etc. A well-tuned parallel program spends most of its time
executing work as opposed to managing work. However, it may be necessary to replicate
computation in order to minimize synchronization. Using HetCompute asynchronous patterns is one
way of avoiding synchronization.
• Parallel efficiency: A parallel execution is optimal when all execution units are equally busy,
performing minimal redundant work. Therefore, it is important to balance the computation across all
processors. This can be achieved by combining algorithmic decomposition (finer-grained tasks
allow better load balancing) with attention to architectural characteristics, such as resource
sharing and the overhead of spawning tasks (coarser-grained tasks typically incur less overhead). In
HetCompute, one can tune a number of parameters to control the granularity of tasks that execute the
pattern.
These topics are discussed briefly in the next sections.
• Consistency: Most multicore shared memory systems provide hardware coherency [10]. However,
architectures implement different consistency models [1], thereby affecting the way shared memory
updates are visible to different threads. In particular, the ARM architecture defines a weak memory
consistency model. The C++11 standard defines primitives to enforce the ordering of memory
operations for all atomic accesses. Experienced programmers can exploit orderings weaker than sequential
consistency to obtain better performance on such systems.
• False sharing: False sharing [19] arises when independent data items used by two tasks executing
concurrently on two different cores are co-located in the same cache line. Because the unit of
coherence is the cache line, if the items are accessed by both tasks, the line will be forced by the
coherence protocol to bounce between caches. False sharing can be avoided by separating data items
accessed by different concurrent tasks into separate cache lines, using techniques such as padding
[17] and/or allocation to cache line boundaries. To improve locality of reference and limit memory
fragmentation, programmers should group data items accessed by a single task as close as possible,
preferably in contiguous blocks of memory addresses.
• Cache interference: In serial applications, cache optimizations are tuned to the entire cache.
However, in a parallel application, caches are shared by execution units. A carefully tuned parallel
program should maximize the utilization of the cache, by ensuring reuse of truly shared data. For
example, by maintaining a single copy of read-only shared data, and referencing it simultaneously,
one will exploit temporal locality and minimize the amount of cache used. To minimize contention
and interference, the working set sizes of the tasks should fit in the cache. Tiling and cache blocking
[6], [20] parameters must be tuned considering the capacity when the caches are shared.
Many other cache locality optimizations are described in the literature.
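The padding technique mentioned under false sharing can be sketched in standard C++ as follows. The 64-byte line size is an assumption (it is common but platform dependent), and the names cache_line_size, padded_counter, and on_separate_lines are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; 64 bytes is common but platform dependent.
constexpr std::size_t cache_line_size = 64;

// Each counter is aligned to a cache line boundary; the compiler pads the
// struct so that its size is a full cache line.
struct alignas(cache_line_size) padded_counter
{
    long value = 0;
};

static_assert(sizeof(padded_counter) == cache_line_size, "one counter per cache line");

// Returns true if a and b start in different cache lines.
bool on_separate_lines(void const* a, void const* b)
{
    assert(a != nullptr && b != nullptr);
    auto const line_a = reinterpret_cast<std::uintptr_t>(a) / cache_line_size;
    auto const line_b = reinterpret_cast<std::uintptr_t>(b) / cache_line_size;
    return line_a != line_b;
}
```

With an array of padded_counter, each concurrent task can update its own element without forcing the line that holds a neighboring element to bounce between caches. The cost is memory: every counter now occupies a full line.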
• Avoid waiting for single tasks: Long chains of dependencies, and/or frequently waiting for the results of
single tasks, limit the level of parallelism available in applications [13]. HetCompute groups can be
used to wait for sets of tasks, thus potentially minimizing the overall amount of stalling.
• Data synchronization: Synchronizing shared memory accesses may introduce considerable
serialization or cache conflict overhead. Such overhead can be reduced by the following
optimizations:
– Privatize data [3] – mutually exclusive partitioning of shared data. For example, partitioning an
image into tiles, where each task works on a different tile. In cases where the partitioning is not
obvious, programmers can copy shared data into private buffers, work on the private data, and then
synchronize changes to the shared copy. Parallel reductions, and parallel gather and scatter
operations are helpful in reshaping the private and shared data formats.
– Avoid large critical sections – Because critical sections guarantee mutual exclusion, they serialize
the execution of tasks that are accessing these areas. Minimizing the time spent in critical
sections, in particular, when they are highly contended, will reduce the synchronization overhead.
– Use atomic operations – Atomic operations with the appropriate memory ordering further reduce the
synchronization overhead by relying on hardware capabilities for efficient shared data accesses.
HetCompute encourages an asynchronous programming style. Besides providing parallel programming
patterns with asynchronous semantics, the execution model of HetCompute is one in which fine-grained
tasks are placed in a dependence graph, and thus minimizes the need for waits. By contrast, fork-join
models spawn a large set of work which needs to complete before the control flows from the join.
Asynchronous concurrency is also preferable in the case of heterogeneous computing because resources
need not be blocked waiting for an off-load device to complete the work.
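The data privatization and atomic operation guidelines above can be illustrated with a minimal sketch. It uses only standard C++ threads rather than the HetCompute API, and the function name privatized_sum is hypothetical. Each worker accumulates into a private local variable and publishes its result with a single atomic update, so the shared counter is touched once per worker instead of once per element.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sums `data` with `num_workers` threads (hypothetical helper, plain C++).
// Each worker accumulates privately and publishes one atomic update.
long long privatized_sum(const std::vector<int>& data, std::size_t num_workers)
{
    assert(num_workers > 0);
    std::atomic<long long> total{0};
    std::vector<std::thread> workers;
    const std::size_t chunk = (data.size() + num_workers - 1) / num_workers;

    for (std::size_t w = 0; w < num_workers; ++w) {
        const std::size_t begin = std::min(w * chunk, data.size());
        const std::size_t end = std::min(begin + chunk, data.size());
        workers.emplace_back([&data, &total, begin, end] {
            long long local = 0; // private copy: nothing is shared inside the loop
            for (std::size_t i = begin; i < end; ++i) {
                local += data[i];
            }
            // One relaxed atomic add per worker. The join() calls below provide
            // the ordering needed before the final value is read.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& t : workers) {
        t.join();
    }
    return total.load(std::memory_order_relaxed);
}
```

Relaxed ordering is sufficient here only because the result is read after all threads have been joined; a design that reads the counter while workers are still running would need a stronger ordering.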
80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 153
Qualcomm® Snapdragon™ Heterogeneous Compute SDK Parallel Processing Tutorial
• Tuning the task granularity: Task granularity represents the amount of work in a task. Ideally, if the
amount of work is known, one can balance the computation manually. However, this is not the case
for irregular applications, in which case, overdecomposition and relying on the HetCompute runtime
dynamic scheduling is a better option. Task granularity also plays an important role in managing the
overhead. As task granularity decreases, the overhead of managing the parallel execution becomes a
larger fraction of the total time. Therefore, coarser tasks are preferred to minimize the overhead. This
is an important balance that the programmers need to weigh. HetCompute makes it easy to explore
these trade-offs by providing a set of flexible APIs to create tasks.
• Overdecomposition: Overdecomposition is the mechanism by which programmers ensure that there
is enough parallel work in the system, so that the runtime always has work to schedule.
Overdecomposition is defined as creating more tasks than the number of computation units available,
such that if a task blocks or waits for dependencies to be satisfied, other independent tasks continue to
make progress. The more independent tasks are provided, the better the load balancing that can be
achieved. Of course, one needs to take into consideration the task granularity and manage the
overhead.
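The interplay between overdecomposition and task granularity can be sketched as follows. This is an illustration in standard C++, not the HetCompute scheduler; run_overdecomposed and its tasks_per_worker parameter are hypothetical names. The range is split into several chunks per worker, and idle workers claim the next chunk from a shared counter, which approximates dynamic scheduling.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Applies work(i) for every i in [0, n) using num_workers threads.
// The range is overdecomposed into up to num_workers * tasks_per_worker chunks.
template <typename Work>
void run_overdecomposed(std::size_t n, std::size_t num_workers,
                        std::size_t tasks_per_worker, Work work)
{
    assert(num_workers > 0 && tasks_per_worker > 0);
    const std::size_t num_chunks =
        std::max<std::size_t>(1, std::min(n, num_workers * tasks_per_worker));
    const std::size_t chunk = (n + num_chunks - 1) / num_chunks; // granularity
    std::atomic<std::size_t> next{0};                            // next unclaimed chunk

    std::vector<std::thread> workers;
    for (std::size_t w = 0; w < num_workers; ++w) {
        workers.emplace_back([&] {
            for (;;) {
                const std::size_t c = next.fetch_add(1, std::memory_order_relaxed);
                const std::size_t begin = c * chunk;
                if (begin >= n) {
                    break; // no chunks left
                }
                const std::size_t end = std::min(begin + chunk, n);
                for (std::size_t i = begin; i < end; ++i) {
                    work(i);
                }
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }
}
```

Raising tasks_per_worker gives the workers more opportunities to balance uneven work, at the cost of more claims on the shared counter; lowering it makes each chunk coarser and reduces that overhead.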
6.6 Conclusions
Parallel programming is fun and intellectually challenging. There are many factors that come into play
when building a parallel application, which may not be obvious. The techniques described in this tutorial
will help you reach the main goal of parallel programming — speeding up the execution of the application.
HetCompute is designed to ease this task and provide abstractions that make it convenient to express
parallel computation. The hard work of creating a parallel algorithm remains; however, HetCompute and
these techniques will help you encode these algorithms into an efficient solution.
7 Image Processing Tutorial
7.1 Abstract
The goal of this tutorial is to illustrate how to use the HetCompute programming model to process images
using task parallelism and shared memory.
$$ w(i, j) = \frac{1}{Z(i)} \, e^{-\frac{\lVert v(N_i) - v(N_j) \rVert_{2,a}^{2}}{h^{2}}} $$
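As a rough illustration of the weight formula, the sketch below computes normalized weights for one reference neighborhood against a set of candidate neighborhoods. It is not the tutorial's compute_weights function: the names patch_distance_sq and nlm_weights are hypothetical, patches are flattened into vectors of doubles, and a uniform patch weighting is assumed in place of the Gaussian kernel a.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Squared Euclidean distance between two equally sized patches.
// Assumption: uniform weighting stands in for the Gaussian kernel `a`.
double patch_distance_sq(const std::vector<double>& p, const std::vector<double>& q)
{
    assert(p.size() == q.size());
    double d = 0.0;
    for (std::size_t k = 0; k < p.size(); ++k) {
        const double diff = p[k] - q[k];
        d += diff * diff;
    }
    return d;
}

// Normalized weights w(i, j) of the reference patch N_i against each candidate N_j.
std::vector<double> nlm_weights(const std::vector<double>& reference,
                                const std::vector<std::vector<double>>& candidates,
                                double h)
{
    assert(h > 0.0);
    std::vector<double> w(candidates.size());
    double z = 0.0; // normalizing constant Z(i)
    for (std::size_t j = 0; j < candidates.size(); ++j) {
        w[j] = std::exp(-patch_distance_sq(reference, candidates[j]) / (h * h));
        z += w[j];
    }
    for (double& x : w) {
        x /= z; // weights now sum to 1
    }
    return w;
}
```

Candidates that resemble the reference neighborhood receive larger weights, and the parameter h controls how quickly the weight decays with distance.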
int denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
for (int y = 0; y < height; y++) {
for (int x = 0; x < width; x++) { //iterate through all pixel points in the input image
Point point(x, y);
// compute weights for pixel points in the search window
int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
compute_weights(input, point, w);
// denoise: compute the weighted average for this pixel point
While such a parallelization strategy is very simple and easy to implement in HetCompute, the performance
of such an implementation may not be optimal, for several reasons, as discussed in the Parallel Processing
Tutorial. In particular, this implementation is too fine-grained to overcome the parallel overhead and does
not exploit cache locality.
g->launch( [=] {
for(int ty = y; ty < y + TILE_SIZE_ROW; ty++){// iterate through pixel points in the tile
for(int tx = x; tx < x + TILE_SIZE_COL; tx++){
Point point(tx,ty);
// compute weights for pixel points in the search window
int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
compute_weights(input, point, w);
// denoise: compute the weighted average for this pixel point
for(int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
for(int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
point.y - SEARCH_WINDOW_SIZE / 2 + j);
output[point] += w[i][j] * input[neighbor];
}
}
}
}
});
}
}
g->wait_for(); // wait for all the tasks to complete
}
Alternatively, restructure the code so that the denoise kernel remains identical to the serial
implementation, and parallelize its invocation:
void denoise_kernel(Pixel* __restrict input, int startX, int width, int startY, int height, Pixel* output)
{
for(int y = startY; y < startY + height; y++){
for(int x = startX ; x < startX + width; x++){// iterate through all pixel points in the input tile
Point point(x, y);
// compute weights for pixel points in the search window
int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
compute_weights(input, point, w);
// denoise: compute the weighted average for this pixel point
for (int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
for (int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
point.y - SEARCH_WINDOW_SIZE / 2 + j);
output[point] += w[i][j] * input[neighbor];
}
}
}
}
}
void denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
// initialization, etc
auto g = create_group("denoise");
for(int y = 0; y < height; y+= TILE_SIZE_ROW) {
for(int x = 0; x < width; x+= TILE_SIZE_COL) {
g->launch( [=] {
denoise_kernel(input, x, TILE_SIZE_COL, y, TILE_SIZE_ROW, output);
});
}
}
g->wait_for(); // wait for all the tasks to complete
}
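The tiled loops above pass a full TILE_SIZE_COL by TILE_SIZE_ROW extent to every task, which is only safe when the image dimensions are exact multiples of the tile size. The following sketch shows one way to clamp the last tile in each row and column. The Tile struct and make_tiles function are hypothetical helpers written in plain C++, not part of HetCompute.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// A rectangular tile: origin (x, y) and extent in pixels.
struct Tile {
    int x, y, width, height;
};

// Enumerates tiles covering a width x height image. Tiles on the right and
// bottom edges are clamped so that they never extend past the image.
std::vector<Tile> make_tiles(int width, int height, int tile_cols, int tile_rows)
{
    assert(width >= 0 && height >= 0 && tile_cols > 0 && tile_rows > 0);
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += tile_rows) {
        for (int x = 0; x < width; x += tile_cols) {
            tiles.push_back(Tile{x, y,
                                 std::min(tile_cols, width - x),
                                 std::min(tile_rows, height - y)});
        }
    }
    return tiles;
}
```

Each resulting tile could then be handed to one task, with its width and height passed to the kernel in place of the fixed tile constants.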
8 Point Kernels (Beta feature)
A Point Kernel, combined with the hetcompute::pfor_each pattern (see Parallel Iteration), is
automatically scheduled for heterogeneous execution across the CPU, GPU, and DSP, by the HetCompute
runtime. A Point Kernel is written in C99 with some minor restrictions. The programmer is encouraged to
use the hetcompute::pattern::tuner to experiment with and set the distribution of workload
across the CPU, GPU, and DSP. For example,
hetcompute::pattern::tuner().set_cpu_load(30).set_gpu_load(50).set_dsp_load(20)
instructs the HetCompute runtime to
partition the range of iterations such that 30% of the iterations are assigned to the CPU, 50% to the GPU,
and the remaining 20% to the DSP. In the present release, there are a few constraints on the code inside a
Point Kernel:
1. The pfor_each with a Point Kernel shall write to only one output buffer.
2. Each iteration i of the pfor_each range shall write only to index i of the output buffer.
A Point Kernel captures the operations performed at a point in an iteration space. For example, a vector-add
point kernel computes the sum of two vector elements A[i] and B[i] and stores the result in C[i], for every
point i in a hetcompute::range of iterations. In contrast to OpenCL kernels, which can synchronize
across work-items in a work-group using, e.g., barriers, a Point Kernel captures pure data-parallelism such
that no two points in the iteration space can synchronize with each other during a kernel’s execution – all
synchronization is deferred until after the kernel’s execution. In practice, this is not a significant
limitation, as several algorithms in multiple domains such as image processing, video encoding, and
simultaneous localization and mapping can be expressed as Point Kernels.
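To make the two constraints and the load percentages concrete, here is a small sketch in plain C++. It does not use the Point Kernel API: vector_add_point stands in for a point kernel body, and split_by_load is a hypothetical helper that shows how a 30/50/20 distribution could divide an iteration range. How the HetCompute runtime actually partitions the range is not specified here.

```cpp
#include <cassert>
#include <cstddef>

// Point-kernel-style body: iteration i writes only to index i of one output buffer.
inline void vector_add_point(const float* a, const float* b, float* c, std::size_t i)
{
    c[i] = a[i] + b[i];
}

// End indices of the CPU, GPU, and DSP subranges of [0, n).
struct Split {
    std::size_t cpu_end, gpu_end, dsp_end;
};

// Hypothetical helper: divides [0, n) according to load percentages.
// The DSP receives whatever remains after rounding.
Split split_by_load(std::size_t n, unsigned cpu_pct, unsigned gpu_pct, unsigned dsp_pct)
{
    assert(cpu_pct + gpu_pct + dsp_pct == 100);
    Split s;
    s.cpu_end = n * cpu_pct / 100;
    s.gpu_end = s.cpu_end + n * gpu_pct / 100;
    s.dsp_end = n;
    return s;
}
```

Because no iteration reads or writes another iteration's output element, the three subranges can be processed in any order or concurrently without synchronization.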
9 Patterns Reference API
The Qualcomm HetCompute parallel patterns API provides programmers with a high-level interface to
express commonly used parallel programming idioms, such as parallel loops, parallel prefix operations,
parallel map and reduce operations, etc. Consider using one of the patterns before implementing
custom task graphs, because the Qualcomm HetCompute runtime optimizes the execution of
patterns. You can fine-tune the execution as explained in the Tuner section.
Functions
Public Attributes
• T2 _atpl
• uint64_t _cpu_task_time
• double _dsp_profile
• uint64_t _dsp_task_time
• double _gpu_profile
• uint64_t _gpu_task_time
• T1 _ktpl
• size_t _num_runs
Friends
Public Attributes
• T2 _atpl
• uint64_t _cpu_task_time
• double _dsp_profile
• uint64_t _dsp_task_time
• double _gpu_profile
• uint64_t _gpu_task_time
• size_t _num_runs
• pointkernel_type & _pk
Friends
&t=hetcompute::pattern::tuner())
• template<typename InputIterator >
void run (InputIterator first, const size_t stride, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<size_t Dims>
void run (const hetcompute::range< Dims > &r, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
Public Attributes
• T1 _fn
Friends
Examples
Parameters
Note
An "iterator" refers to an object that enables a programmer to traverse a container. It can be indices of
integral type, or pointers of RandomAccessIterator type.
In contrast to std::for_each and ptransform, the iterator is passed to the function, instead of
the element.
It is permissible to modify the elements of the range from fn, provided that InputIterator is a
mutable iterator.
Note
This function returns only after fn has been applied to the whole iteration range.
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.
Complexity
See Also
Examples
[...]
// Parallel for-loop using indices
pfor_each(size_t(0), vin.size(),
[=,&vin] (size_t i) {
vin[i] = 2 * vin[i];
});
[...]
Parameters
Note
The function object will be applied to iterators with an incremental step size (iter+=stride)
Parameters
Instead of passing in a pair of iterators, this form accepts a hetcompute::range object. Internally the
indices are linearized before passing to the kernel function. It has a default step size of one.
Parameters
Parameters
Returns
Create an asynchronous task from the hetcompute::pfor_each pattern (with step size).
Parameters
Returns
Parameters
Returns
Create an asynchronous task from the hetcompute::pfor_each pattern (with step size).
Parameters
Returns
Functions
• template<typename Fn >
ptransformer< Fn > hetcompute::pattern::create_ptransform (Fn &&fn)
• template<typename InputIterator , typename OutputIterator , typename UnaryFn >
std::enable_if<!std::is_same
< hetcompute::pattern::tuner,
typename std::remove_reference
< UnaryFn >::type >::value,
void >::type hetcompute::ptransform (InputIterator first, InputIterator last, OutputIterator d_first,
UnaryFn &&fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename BinaryFn >
std::enable_if<!std::is_same
< hetcompute::pattern::tuner,
typename std::remove_reference
< BinaryFn >::type >::value,
void >::type hetcompute::ptransform (InputIterator first1, InputIterator last1, InputIterator first2,
OutputIterator d_first, BinaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename UnaryFn >
void hetcompute::ptransform (InputIterator first, InputIterator last, UnaryFn &&fn, const
hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first, InputIterator last,
OutputIterator d_first, UnaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename BinaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first1, InputIterator
last1, InputIterator first2, OutputIterator d_first, BinaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first, InputIterator last,
UnaryFn &&fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
Friends
Examples
Parameters
Examples
Parameters
This function returns only after fn has been applied to the whole iteration range.
In contrast to pfor_each, arguments specifying ranges are restricted to RandomAccessIterator,
whereas pfor_each allows them to be of integral type representing indices.
Complexity
See Also
Examples
// arr[i] == 2*vin[i]
std::vector<size_t> arr(vin.size()); // standard container instead of a variable-length array
ptransform(begin(vin), end(vin), arr.begin(),
[=] (size_t const& v) {
return 2*v;
});
Parameters
This function returns only after fn has been applied to the whole iteration range.
In contrast to pfor_each, arguments specifying ranges are restricted to RandomAccessIterator,
whereas pfor_each allows them to be of integral type representing indices.
Complexity
See Also
Examples
Parameters
It is permissible to modify the elements of the range from fn, assuming that InputIterator is a
mutable iterator.
Note
This function returns only after fn has been applied to the whole iteration range.
Complexity
See Also
Parameters
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.
See Also
Examples
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.
See Also
Note
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.
See Also
Functions
Friends
9.3.1.1.1.1 template<typename Reduce , typename Join > preducer create_preduce ( Reduce && r,
Join && j ) [friend]
Examples
// result == 100
auto result = preduce(vec.begin(), vec.end(), identity);
Parameters
9.3.2.1 template<typename Reduce , typename Join > preducer< Reduce, Join >
hetcompute::pattern::create_preduce ( Reduce && r, Join && j )
Examples
// result == 100
auto result = preduce(vec.begin(), vec.end(), identity);
Parameters
Performs parallel reduction by reducing the results using binary operator join. Returns the result of the
reduction.
Note
Qualcomm HetCompute parallel reduction pattern operates in two stages. In the first stage, it applies
the reduction operation (reduce) to a set of subranges. In the second stage, the reduction results of all
the subranges will be aggregated (join) into the final result.
The binary operation is expected to be associative, but not necessarily commutative, as the algorithm
does not swap operands of the reduce operation while working on the range.
InputIterator can be either a RandomAccessIterator or an integral type that represents iteration
indices.
For a tiny iteration range and/or a trivial binary operator, it may not be worthwhile to parallelize the
reduction operation.
The reduce function requires pass-by-reference semantics.
For best performance, implement a move constructor and a move assignment operator for the
user-defined type being reduced.
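The two-stage scheme described in the note can be sketched in standard C++ as follows. This is not the HetCompute implementation; two_stage_reduce and its num_chunks parameter are hypothetical. The sketch reduces contiguous subranges independently and then joins the partial results from left to right, which is why an associative but non-commutative operator such as string concatenation still gives the expected result.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <thread>
#include <vector>

// Stage 1: reduce each contiguous subrange on its own thread.
// Stage 2: join the partial results in subrange order.
template <typename T, typename Op>
T two_stage_reduce(const std::vector<T>& in, const T& identity, Op op,
                   std::size_t num_chunks)
{
    assert(num_chunks > 0);
    std::vector<T> partial(num_chunks, identity);
    std::vector<std::thread> workers;
    const std::size_t chunk = (in.size() + num_chunks - 1) / num_chunks;

    for (std::size_t c = 0; c < num_chunks; ++c) {
        const std::size_t begin = std::min(c * chunk, in.size());
        const std::size_t end = std::min(begin + chunk, in.size());
        workers.emplace_back([&in, &partial, op, c, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                partial[c] = op(partial[c], in[i]); // operands are never swapped
            }
        });
    }
    for (auto& t : workers) {
        t.join();
    }

    T result = identity;
    for (const T& p : partial) {
        result = op(result, p); // left-to-right join preserves operand order
    }
    return result;
}
```

For a small input such as five short strings, the cost of starting threads far exceeds the work itself, which is the situation the note warns about for tiny ranges and trivial operators.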
Complexity
Examples
[...]
const int identity = 0;
// Parallel sum
auto p_sum = hetcompute::preduce(size_t(0), vin.size(), identity,
// reduce func: accumulate the subrange [i, j) into init
[&vin](size_t i, size_t j, int& init)
{
for(size_t k = i; k < j; ++k)
init += vin[k];
},
// join func: combine two partial sums
[](int x, int y){ return x + y; }
);
Parameters
Returns
Note: Container must define size() and be indexable with operator[].
See Also
Examples
[...]
vector<int> vin;
// Initialize vin
[...]
const int identity = 0;
// Parallel sum
auto p_sum = hetcompute::preduce(vin, identity, std::plus<int>());
Parameters
Returns
See Also
Examples
[...]
vector<int> vin;
// Initialize vector
[...]
const int identity = 0;
// Parallel sum
auto p_sum = hetcompute::preduce(vin.begin(), vin.end(), identity, std::plus<int>());
Parameters
Returns
Parameters
Returns
Parameters
Returns
Parameters
Returns
Functions
Friends
Examples
auto l = std::plus<size_t>();
auto pscan = hetcompute::pattern::create_pscan_inclusive(l);
pscan(vec.begin(), vec.end());
This function returns only after fn has been applied to the whole iteration range.
Similar to hetcompute::ptransform, range iterators are restricted to type
RandomAccessIterator.
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.
Examples
Parameters
Parameters
Functions
Friends
9.5.1.1.1.1 template<typename IsBaseFn , typename BaseFn , typename SplitFn , typename MergeFn >
pdivide_and_conquerer create_pdivide_and_conquer ( IsBaseFn && isbase, BaseFn &&
base, SplitFn && split, MergeFn && merge ) [friend]
Create a pattern object from function objects isbase, base, split, and merge.
Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.
Examples
Parameters
Parallel divide-and-conquer
Solve a problem by splitting it into independent subproblems, which may be solved in parallel, and merging
the solutions to the subproblems. A subproblem may recursively be split into yet more problems, yielding
significant parallelism, e.g., Fibonacci.
Note: For best performance, make split and merge relatively inexpensive compared to base.
#include <iostream>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Sequential Fibonacci, used for the base case and to verify the parallel result
static size_t
fibonacci_s(size_t n)
{
    if (n == 0 || n == 1)
    {
        return n;
    }
    else
    {
        return fibonacci_s(n - 1) + fibonacci_s(n - 2);
    }
}

// Terms at or below this index are computed sequentially
static const size_t GRANULARITY = 20;

// Parallel Fibonacci using the divide-and-conquer pattern
static size_t
fibonacci(size_t n)
{
    return hetcompute::pdivide_and_conquer<size_t, size_t>(
        // Problem is to compute the n-th Fibonacci term
        n,
        // When should an arbitrary Fibonacci term, represented by 'm', be
        // computed sequentially?
        // Note that programmer chooses to compute Fibonacci terms 20 and lower
        // sequentially for best performance.
        [](size_t& m) { return m <= GRANULARITY; },
        // How to compute the term sequentially
        [](size_t& m) { return fibonacci_s(m); },
        // Split problem into independent subproblems
        [](size_t& m) {
            return std::vector<size_t>({ m - 1, m - 2 });
        },
        // Merge solutions to subproblems.
        // Note that the first parameter (size_t, corresponding to the split
        // problem) is unused in this case, but may be useful while merging in
        // other cases.
        [](size_t, std::vector<size_t>& sols) { return sols[0] + sols[1]; });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    size_t n_def = 24;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    size_t out = fibonacci(n);

    if (out != fibonacci_s(n))
    {
        std::cerr << "parallel fibonacci failed\n";
    }
    hetcompute::runtime::shutdown();
    return 0;
}
Parameters
Returns
Parameters
Parallel divide-and-conquer specialized for no merge of subproblems and not returning a solution, e.g.,
quicksort.
#include <algorithm>
#include <array>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <utility>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Subproblem descriptor: the range [begin, end) and its partition point
template <typename Iterator>
struct QuickSort
{
    QuickSort(Iterator _begin, Iterator _end) : begin(_begin), end(_end), middle() {}
    Iterator begin, end, middle;
};

// Arrays of this size or smaller are sorted sequentially
const size_t GRANULARITY = 8192;

// Parallel quicksort using the divide-and-conquer pattern
template <typename Iterator, typename Compare>
void
quicksort(Iterator begin, Iterator end, Compare cmp)
{
    typedef QuickSort<Iterator> QuickSort;
    typedef typename std::iterator_traits<Iterator>::value_type value_type;
    hetcompute::pdivide_and_conquer(
        // Main problem
        QuickSort(begin, end),
        // When should an arbitrary array, represented by 'q', be sorted
        // sequentially?
        // Note that programmer chooses to sort arrays smaller than size 8192
        // sequentially for best performance.
        [&](QuickSort& q) {
            size_t n = std::distance(q.begin, q.end);
            if (n <= GRANULARITY)
            {
                return true;
            }
            // Choice of first element as pivot is arbitrary
            auto pivot = *q.begin;
            // A lambda replaces std::bind2nd, which was removed in C++17
            q.middle = std::partition(q.begin, q.end, [&](const value_type& v) { return cmp(v, pivot); });
            // If middle == begin, elements in [begin, end) are greater than or
            // equal to pivot. We could either find a new pivot or as we do here,
            // just sort sequentially.
            return q.middle == q.begin;
        },
        // Sequential sort used
        [&](QuickSort& q) { std::sort(q.begin, q.end, cmp); },
        // Split problem into two subproblems
        [&](QuickSort& q) {
            std::array<QuickSort, 2> subarrays{ { QuickSort(q.begin, q.middle), QuickSort(q.middle, q.end) } };
            return subarrays;
        });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    std::vector<long> input;
    size_t n_def = 1 << 16;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    // Create a random array of integers
    for (size_t i = 0; i < n; i++)
    {
        input.push_back(rand());
    }

    quicksort(input.begin(), input.end(), std::less<long>());

    if (!std::is_sorted(input.begin(), input.end()))
    {
        std::cerr << "parallel quicksorting failed\n";
    }

    hetcompute::runtime::shutdown();

    return 0;
}
Parameters
Parameters
Returns
Asynchronous parallel divide-and-conquer specialized for not returning a solution, e.g., mergesort.
Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.
Parameters
Returns
Asynchronous parallel divide-and-conquer specialized for no merge of subproblems and not returning a
solution, e.g., quicksort.
Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.
Parameters
Returns
Functions
Friends
9.6.1.1.1.1 template<typename Compare > psorter create_psort ( Compare && cmp ) [friend]
Examples
vector<int> vin(100000);
Rand_int rnd{0, int(vin.size() - 1)};
Parameters
Examples
vector<int> vin(100000);
Rand_int rnd{0, int(vin.size() - 1)};
Parameters
Examples
See Also
psort(RandomAccessIterator, RandomAccessIterator)
Parameters
Parameters
Parameters
psort(RandomAccessIterator, RandomAccessIterator)
Parameters
9.7 Pipeline
Classes
• class hetcompute::iteration_lag
Pipeline stage iteration lag. More...
• class hetcompute::iteration_rate
Pipeline stage iteration match rate. More...
• class hetcompute::parallel_stage
Parallel pipeline stage for specifying the type of the stages when adding to the pipeline. More...
• class hetcompute::pipeline_context<>
Pipeline_context with no user data. More...
• class hetcompute::pipeline_context_base
Pipeline context class. More...
• class hetcompute::serial_stage
Serial stage for specifying the type of the stages when adding to the pipeline. More...
• class hetcompute::sliding_window_size
Pipeline stage sliding window size. More...
Typedefs
• typedef enum
hetcompute::serial_stage_type hetcompute::serial_stage_type
Serial pipeline stage types.
Enumerations
Constructor.
Copy constructor.
Move constructor.
Returns
Constructor.
Copy constructor.
Move constructor.
Returns
Returns
Parallel pipeline stage for specifying the type of the stages when adding to the pipeline.
Constructor.
Parameters
doc Degree of concurrency for the parallel stage. Must be a positive integer.
It specifies the maximum number of consecutive stage iterations that can run
in parallel for this stage. When doc = 1, the parallel stage behaves like a
serial stage.
Copy constructor.
Returns
Pipeline class.
Template Parameters
UserData The type for the pipeline context data or empty, e.g.,
hetcompute::pattern::pipeline<size_t> or
hetcompute::pattern::pipeline<>.
Public Types
• pipeline ()
Constructor.
Copy constructor.
• virtual ~pipeline ()
Destructor.
• template<typename... Confs>
hetcompute::task_ptr create_task (UserData *...context_data, size_t num_iterations, Confs
&&...confs) const
Create a task for the pipeline for asynchronous execution.
• hetcompute::task_ptr< void(UserData
*..., size_t)> create_task (const hetcompute::pattern::tuner &t=hetcompute::pattern::tuner()) const
Create a task for the pipeline for asynchronous execution.
• void disable_sliding_window ()
Disable the pipeline sliding window launch type.
• void enable_sliding_window ()
Enable the sliding window launch type for the pipeline.
• bool is_valid ()
Pipeline sanity check for stage IO types and sliding window size.
• template<typename... Confs>
void run (UserData *...context_data, size_t num_iterations, Confs &&...confs) const
Launch and wait for the pipeline.
Constructor.
Destructor.
Reimplemented in hetcompute::beta::pattern::pipeline< UserData >.
Copy constructor.
Move constructor.
Create a task for the pipeline for asynchronous execution. Do not call this member function if the pipeline
has no stages. This would cause a fatal error.
Parameters
context_data Pointer to the data for the pipeline context if the pipeline is defined as
having one, i.e., sizeof...(UserData) == 1.
num_iterations The total number of iterations for the first stage. Note: if num_iterations ==
0, the pipeline runs indefinitely until the first stage stops the
pipeline.
confs Other configurations for launching a task out of the pipeline. Currently, only
one tuner object for the pipeline is supported (optional).
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

//
// Pipeline without context data (wocd),
// Known iterations before launch (iter)
// Launch by creating tasks (ct)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, without pipeline context data.
    hetcompute::pattern::pipeline<> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<>::context context;

    // Add a serial first stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        // some usage of iter
        HETCOMPUTE_ILOG("iter: %zu", iter);
    });

    // Add a parallel stage with degree of concurrency of 4.
    p.add_stage(hetcompute::parallel_stage(4), [](context&) {});

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context&) {});

    // Asynchronous launch.
    // Create a task of a pipeline that runs for 20 iterations.
    // Run the pipeline as if the stages are using sliding windows.
    p.enable_sliding_window();
    auto t1 = p.create_task(20);
    // Launch the pipeline and do not block.
    t1->launch();

    // Create a task of a pipeline that runs for 10 iterations.
    // Run the pipeline as if the stages are not using sliding windows.
    p.disable_sliding_window();
    auto t2 = p.create_task(10);
    // Launch the pipeline and do not block.
    t2->launch();

    // Wait for the first pipeline to stop.
    t1->wait_for();
    // Wait for the second pipeline to stop.
    t2->wait_for();

    std::cout << "pipeline1 runs 20 iters" << std::endl;
    std::cout << "pipeline2 runs 10 iters" << std::endl;

    hetcompute::runtime::shutdown();
    return 0;
}
Create a task for the pipeline for asynchronous execution. The task arguments need to be bound later. Do
not call this member function if the pipeline has no stages. This would cause a fatal error.
Parameters
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

//
// Pipeline with context data (wcd),
// On the fly stop (ofs)
// Create task and launch later (ctlchl)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, with pipeline context data of type size_t.
    hetcompute::pattern::pipeline<size_t> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<size_t>::context context;

    // Add a serial first stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        size_t data = *ctx.get_data();
        // Stop the pipeline once the requested number of iterations is reached.
        if (iter == data - 1)
        {
            ctx.stop_pipeline();
        }
    });

    // Add a parallel stage with degree of concurrency of 4.
    p.add_stage(hetcompute::parallel_stage(4), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        size_t data = *ctx.get_data();
        // some usage of iter and data here
        HETCOMPUTE_ILOG("iter: %zu, data: %zu", iter, data);
    });

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context&) {});

    // Define the context data.
    size_t num1 = 20;
    size_t num2 = 10;

    // Asynchronous launch.
    // Create a task of a pipeline that runs for num1 iterations.
    // Run the pipeline as if the stages are using sliding windows.
    //
    // Here the total number of iterations is set to be 0 (infinite number of runs).
    // The first stage of the pipeline does dynamic checking to stop the pipeline on the fly.
    // The total number of pipeline iterations is specified by using the pipeline context data.
    p.enable_sliding_window();
    auto t1 = p.create_task();
    // Launch the pipeline, bind the arguments, and do not block.
    t1->launch(&num1, 0);

    // Create a task of a pipeline that runs for num2 iterations.
    // Run the pipeline as if the stages are not using sliding windows.
    //
    // Here the total number of iterations is set to be 0 (infinite number of runs).
    // The first stage of the pipeline does dynamic checking to stop the pipeline on the fly.
    // The total number of pipeline iterations is specified by using the pipeline context data.
    p.disable_sliding_window();
    auto t2 = p.create_task();
    // Bind the arguments to the task.
    t2->bind_all(&num2, 0);
    // Launch the pipeline and do not block.
    t2->launch();

    // Wait for the first pipeline to stop.
    t1->wait_for();
    // Wait for the second pipeline to stop.
    t2->wait_for();

    std::cout << "pipeline1 runs " << num1 << " iters" << std::endl;
    std::cout << "pipeline2 runs " << num2 << " iters" << std::endl;
    hetcompute::runtime::shutdown();
    return 0;
}
Disable the pipeline sliding window launch type. With the sliding window disabled, there is no control over the pipeline's memory footprint.
Pipeline sanity check for stage IO types and sliding window size.
Returns
#include <array>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    const size_t num_iters = 100;

    hetcompute::pattern::pipeline<std::array<size_t, num_iters>> p;
    typedef hetcompute::pattern::pipeline<std::array<size_t, num_iters>>::context context;

    // Add a parallel stage which behaves like a serial stage.
    p.add_stage(hetcompute::parallel_stage(1), [](context&) {});

    // Add a parallel stage with doc = 8, no lag.
    p.add_stage(hetcompute::parallel_stage(8), [](context&) {});

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t i = ctx.get_iter_id();
        (*ctx.get_data())[i] = i;
    });

    // Sanity check.
    if (p.is_valid())
    {
        HETCOMPUTE_ILOG("The pipeline settings are valid.");
    }
    else
    {
        HETCOMPUTE_ILOG("The pipeline settings are not valid.");
    }

    hetcompute::runtime::shutdown();
    return 0;
}
Parameters
context_data Pointer to the data for the pipeline context if the pipeline is defined as having
one, i.e., sizeof...(UserData) == 1.
num_iterations The total number of iterations for the first stage.
confs Other configurations for running a pipeline. Currently, only one
tuner object for the pipeline is supported (optional).
Note: if num_iterations == 0, the pipeline runs indefinitely until the first stage stops the
pipeline.
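The two launch modes described for num_iterations can be pictured with a small standalone model. This is only a sketch under stated assumptions: first_stage_model and run_first_stage are hypothetical names that do not exist in the SDK, and the loop imitates only the iteration-count rule (a fixed count when num_iterations > 0, run until the first stage stops the pipeline when num_iterations == 0), not scheduling or later stages.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>

// Hypothetical stand-in for the first-stage context; not an SDK type.
struct first_stage_model
{
    std::size_t iter_id = 0; // current iteration of the first stage
    bool stopped = false;    // set when the stage asks the pipeline to stop
    void stop_pipeline() { stopped = true; }
};

// Runs the modeled first stage and returns how many iterations it executed.
// num_iterations > 0: fixed iteration count.
// num_iterations == 0: run until the stage body calls stop_pipeline().
inline std::size_t
run_first_stage(std::size_t num_iterations,
                const std::function<void(first_stage_model&)>& body)
{
    assert(body && "stage body must be callable");
    first_stage_model ctx;
    std::size_t executed = 0;
    while (!ctx.stopped && (num_iterations == 0 || executed < num_iterations))
    {
        ctx.iter_id = executed;
        body(ctx);
        ++executed;
    }
    return executed;
}
```

With a body that calls stop_pipeline() when iter_id reaches a chosen bound, the model mirrors the pattern used in the pipeline examples, where the context data carries the intended iteration count.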
#include <hetcompute/hetcompute.hh>

//
// Pipeline without context data (wocd),
// Known iterations before launch (iter)
// Launch by using hetcompute free function (lch)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
Note: This is the pipeline_context type for the pipeline with context data, of type UserData, i.e.,
hetcompute::pattern::pipeline<UserData>. Do not use this type directly. Instead, get the member type
from the pipeline that the context is associated with, i.e., using context =
hetcompute::pattern::pipeline<UserData>::context.
• virtual ~pipeline_context ()
Destructor.
Destructor.
Returns
UserData* The pointer to the user-defined context data, which is provided by the user when launching
the pipeline.
template<>class hetcompute::pipeline_context<>
Note: This is the pipeline_context type for the pipeline without context data, i.e.,
hetcompute::pattern::pipeline<>. Do not use this type directly. Instead, get the member type from the
pipeline that the context is associated with, i.e., using context =
hetcompute::pattern::pipeline<>::context.
• virtual ~pipeline_context ()
Destructor.
Destructor.
• virtual ~pipeline_context_base ()
Destructor.
• void cancel_pipeline ()
Cancel the pipeline.
Destructor.
Use this method to cancel a pipeline. Note that hetcompute::abort_on_cancel() needs to be called in the
user-defined pipeline stage functions for proper pipeline cancellation. A pipeline can be canceled in any
stage; however, the internal state of the pipeline could be non-deterministic.
#include <array>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    const size_t num_iters = 100;
    const size_t cancel_iter = 50;
    const size_t doc = 8;

    hetcompute::pattern::pipeline<std::array<size_t, num_iters>> p;
    typedef hetcompute::pattern::pipeline<std::array<size_t, num_iters>>::context context;

    // Add a serial stage
    p.add_stage(hetcompute::serial_stage(), [](context&) { hetcompute::abort_on_cancel(); });

    // Add a parallel stage with doc = 8, no lag
    p.add_stage(hetcompute::parallel_stage(doc), [cancel_iter](context& ctx) {
        size_t i = ctx.get_iter_id();
        (*ctx.get_data())[i] = i;
        // Request cancellation from the iteration just before cancel_iter.
        if (ctx.get_iter_id() == cancel_iter - 1)
            ctx.cancel_pipeline();
        hetcompute::abort_on_cancel();
    });

    // Add a serial stage
    p.add_stage(hetcompute::serial_stage(), [](context&) { hetcompute::abort_on_cancel(); });

    // define and reset the output array
    std::array<size_t, num_iters> out_array;
    for (size_t i = 0; i < num_iters; i++)
    {
        out_array[i] = 0;
    }

    // launch with sliding window
    p.enable_sliding_window();
    p.run(&out_array, num_iters);

    // check the results
    for (size_t i = 0; i < cancel_iter - doc; i++)
    {
        if (out_array[i] != i)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }
    }
    for (size_t i = cancel_iter + doc - 1; i < num_iters; i++)
    {
        if (out1_array[i] != 0 || out2_array[i] != 0)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }
    hetcompute::runtime::shutdown();
}
Returns
Returns
size_t maximum number of iterations for this stage. 0 means the maximum number is unknown and
the pipeline will be stopped or canceled dynamically during execution.
Returns
Check whether the maximum number of iterations for this stage is set.
Returns
true - The pipeline has an iteration limit known before running.
false - The pipeline does not have an iteration limit known before running.
See Also
pipeline_context_base::stop_pipeline()
Use this method to stop a pipeline launched without an iteration limit (num_iterations == 0). Calling this
method on a pipeline with an iteration limit will cause a fatal error. This method can only be called from
the first stage of the pipeline.
See Also
pipeline_context_base::has_iter_limit()
Serial stage for specifying the type of the stages when adding to the pipeline.
Constructor.
Parameters
t hetcompute::in_order (default) or
Copy constructor.
Move constructor.
Returns
Constructor.
Copy constructor.
Move constructor.
Returns
InputType The data type for the stage_input, which should match the return type of the
previous stage.
Public Types
• virtual ~stage_input ()
Destructor.
Destructor.
Get the iter_id for the stage iteration that generates the first element.
Returns
size_t The iteration id in the previous stage that generates the first element in the stage_input.
Parameters
Returns
Parameters
Returns
Returns
9.8 Tuner
Classes
• class hetcompute::pattern::tuner
Public Types
• tuner ()
• size_t get_chunk_size () const
• load_type get_cpu_load () const
• size_t get_doc () const
• load_type get_dsp_load () const
• load_type get_gpu_load () const
• bool has_profile () const
• bool is_serial () const
• bool is_static () const
• tuner & set_chunk_size (size_t sz)
• tuner & set_cpu_load (load_type load)
• tuner & set_dsp_load (load_type load)
• tuner & set_dynamic ()
• tuner & set_gpu_load (load_type load)
• tuner & set_max_doc (size_t doc)
• tuner & set_profile ()
• tuner & set_serial ()
• tuner & set_static ()
9.8.1.1.1.1 hetcompute::pattern::tuner::tuner ( )
Tuner constructor. Parameters to fine-tune various execution settings in Qualcomm HetCompute patterns.
Note that tuner settings are hints that the Qualcomm HetCompute runtime takes into account while
scheduling a pattern. Constraining factors may cause Qualcomm HetCompute to ignore the hints.
Returns
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the CPU.
Returns
number of units out of total work (cpu_load + gpu_load + dsp_load) to be executed on the CPU
Returns
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the DSP.
Returns
number of units out of total work (cpu_load + gpu_load + dsp_load) to be executed on the DSP
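The load values are units out of a total (cpu_load + gpu_load + dsp_load), not percentages. The following standalone sketch shows the arithmetic that implies. It is an illustration only: device_share is a hypothetical helper, load values are assumed to be unsigned integers (the definition of load_type is not shown here), and because tuner settings are hints, the runtime is not bound to this split.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper, not part of the SDK: number of iterations a device
// would receive if the load units were honored exactly.
inline std::size_t
device_share(std::size_t device_load, std::size_t cpu_load, std::size_t gpu_load,
             std::size_t dsp_load, std::size_t total_iterations)
{
    const std::size_t total_load = cpu_load + gpu_load + dsp_load;
    assert(total_load > 0 && "at least one device must carry load");
    // Integer division rounds down; a real scheduler must also place the remainder.
    return total_iterations * device_load / total_load;
}
```

For example, with loads of 1 (CPU), 2 (GPU), and 1 (DSP), the CPU would take a quarter of the iterations and the GPU half.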
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the GPU.
Returns
Returns
Returns
bool true if using static chunking and false if using dynamic work stealing.
Defines the granularity for work stealing. In data parallel patterns, Qualcomm HetCompute launches
multiple tasks (defined by doc) in parallel. Each task steals some iterations from other tasks when its
assigned iterations are completed. The chunk size parameter controls the minimum number of iterations a
task must finish before a stealer task can steal work from it. Increase the chunk size when each
iteration performs little computation, so that the cost of stealing is spread over more work.
Parameters
Returns
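To make the notion of granularity concrete, the sketch below splits an iteration range into chunks of a given size. It is a standalone illustration, not the SDK's work-stealing implementation: make_chunks is a hypothetical helper, and it shows only that a larger chunk size yields fewer, larger units of work that can change hands.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper, not SDK code: split [0, num_iterations) into
// consecutive [begin, end) chunks of at most chunk_size iterations.
inline std::vector<std::pair<std::size_t, std::size_t>>
make_chunks(std::size_t num_iterations, std::size_t chunk_size)
{
    assert(chunk_size > 0 && "chunk size must be positive");
    std::vector<std::pair<std::size_t, std::size_t>> chunks;
    std::size_t begin = 0;
    while (begin < num_iterations)
    {
        const std::size_t remaining = num_iterations - begin;
        const std::size_t len = remaining < chunk_size ? remaining : chunk_size;
        chunks.emplace_back(begin, begin + len);
        begin += len;
    }
    return chunks;
}
```

With 1000 iterations, a chunk size of 1 gives 1000 units to balance, while a chunk size of 100 gives only 10, which reduces bookkeeping when each iteration is cheap.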
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the CPU.
Parameters
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the DSP.
Parameters
set_chunk_size(size_t)
set_max_doc(size_t)
Returns
For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the GPU.
Parameters
Defines the maximum number of tasks in parallel (degree of concurrency) for load balancing. A higher
number indicates over-subscription which might be beneficial in certain usage scenarios. doc must be
larger than zero. Otherwise, it will cause a fatal error.
Parameters
Returns
Enable profiling within pattern execution. Currently this is meaningful only for the heterogeneous
pfor_each pattern, where it is used to generate an auto-tuned work distribution across heterogeneous devices.
Returns
Returns
Returns
10 Tasks Reference API
Tasks represent independent units of work that can be executed asynchronously. Qualcomm HetCompute
programmers are responsible for partitioning their application into tasks and organizing them into a task
graph using dependencies. This chapter documents the interfaces to create tasks, set up dependencies, and
launch (execute) tasks. It also discusses task synchronization (waiting) and cancellation. Grouping is the
mechanism for waiting on and canceling a set of tasks. Finally, attributes are a more advanced feature that
allows programmers to pass additional information about task behavior to the Qualcomm HetCompute
runtime system.
Qualcomm® Snapdragon™ Heterogeneous Compute SDK Tasks Reference API
10.1 Groups
Classes
• class hetcompute::group
Groups represent sets of tasks, which are used to simplify waiting and canceling multiple tasks. More...
• class hetcompute::group_ptr
Smart pointer to a group object. More...
Functions
• group_ptr hetcompute::create_group ()
Creates a group and returns a group_ptr that points to the group.
• void cancel ()
Cancels group.
• void finish_after ()
Specifies that the task invoking this function should be deemed to finish only after the tasks in the
group finish. This method returns immediately.
• hc_error wait_for ()
Blocks until all tasks in the group complete execution or are canceled (that is, the group is empty).
Use add to add a task to a group without launching it. For performance reasons, it is recommended
that tasks be added to groups at the time they are launched using hetcompute::group::launch.
Use add when your algorithm requires that the task belongs to a group, but you are not yet ready to launch
the task. For example, perhaps you want to prevent the group from being empty, so you can wait on it
somewhere else.
It is possible, though not recommended for performance reasons, to use add repeatedly to add a
task to multiple groups. Repeatedly adding a task to the same group is not an error; Qualcomm HetCompute
ignores the repeated additions. If the task has previously been launched,
hetcompute::group::launch(task_ptr<> const&) and hetcompute::group::add(task_ptr<> const&)
are equivalent. For more information about tasks joining multiple groups, see Task Groups.
Regardless of the method used to add tasks to a group, the following rules always apply:
• Tasks stay in the group until they finish execution (successfully or unsuccessfully due to exceptions
or cancellation). Once a task is added to a group, there is no way to remove it from the group.
• Once a task belonging to multiple groups completes execution, Qualcomm HetCompute removes it
from all the groups to which it belongs.
• Neither completed nor canceled tasks can join groups.
• Tasks cannot be added to a canceled group.
Do not call this method if task is nullptr. This would cause a fatal error.
Parameters
Example 1
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group g.
    auto g = hetcompute::create_group();

    // Create task t1. Its type is hetcompute::task_ptr<void()>
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });

    // Add task t1 to group g, but do not launch it.
    g->add(t1);

    auto t2 = hetcompute::launch([t1] {
        // Launch t1. Because it already belongs to group g, there is no
        // reason to use hetcompute::group::launch.
        t1->launch();
    });

    // Wait for tasks in group g to complete.
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
Example 2
#include <stdio.h>
#include <unistd.h> // sleep
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create groups g1, g2, g3
    auto g1 = hetcompute::create_group();
    auto g2 = hetcompute::create_group();
    auto g3 = hetcompute::create_group();

    // Create task t. Its type is hetcompute::task_ptr<void(int)>
    auto t = hetcompute::create_task([](int seconds) {
        HETCOMPUTE_ILOG("Hello World from t! I'll sleep for %d seconds\n", seconds);
        sleep(seconds);
        HETCOMPUTE_ILOG("Good bye from t\n");
    });

    // Launch t into g1, let it sleep for 4 seconds.
    g1->launch(t, 4);

    // t is launched, possibly running, let's add it to g2 as well.
    g2->add(t);

    // Equivalent to g3->add(t).
    g3->launch(t);

    // Wait for g2 to be empty
    g2->wait_for();

    HETCOMPUTE_ILOG("**%s**\n", g3->get_name().c_str());
    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::group::launch(task_ptr<>const&)
Similar to add(task_ptr<> const&) except that it takes a pointer to a base task instead of a base
task-pointer.
Do not call this method if task is nullptr. This would cause a fatal error.
Parameters
See Also
hetcompute::group::launch(task_ptr<>const&)
hetcompute::group::add(task_ptr<> const&)
Marks the group as canceled and returns immediately. Once a group is canceled, it cannot revert to a
non-canceled state. Canceling a group means that:
• The tasks in the group that have not started execution will never execute.
• The tasks in the group that are executing will be canceled only when they call
hetcompute::abort_on_cancel. If any of these executing tasks is executing a
hetcompute::blocking construct, Qualcomm HetCompute executes the construct's
cancellation handler if it has not been executed before.
• Any tasks added to the group after the group is canceled are also canceled.
cancel returns immediately. Call hetcompute::group::wait_for() afterwards to wait for all
the running tasks to be completed. For more information about cancellation, check Tasks.
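The cancellation rules above are cooperative: a running task ends early only at the points where it checks for cancellation. The standalone sketch below models that behavior without the SDK. All names in it (cancel_flag, canceled_error, run_steps) are hypothetical, and the single-threaded loop only imitates a cancel request arriving while a task is partway through its work.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-ins, not SDK types.
struct canceled_error : std::runtime_error
{
    canceled_error() : std::runtime_error("task canceled") {}
};

struct cancel_flag
{
    std::atomic<bool> canceled{false};
    // Marks the flag as canceled and returns immediately.
    void cancel() { canceled.store(true); }
    // Cancellation point: throws if a cancel request has been made.
    void abort_on_cancel() const
    {
        if (canceled.load())
            throw canceled_error();
    }
};

// Runs up to num_steps steps. The flag is set once cancel_after steps have
// completed, imitating a cancel request that arrives mid-execution.
// Returns the number of steps that completed.
inline int
run_steps(cancel_flag& flag, int num_steps, int cancel_after)
{
    assert(num_steps >= 0);
    int completed = 0;
    try
    {
        for (int i = 0; i < num_steps; ++i)
        {
            flag.abort_on_cancel(); // the task notices cancellation only here
            ++completed;
            if (completed == cancel_after)
                flag.cancel();
        }
    }
    catch (const canceled_error&)
    {
        // The task ends early; steps completed so far are not undone.
    }
    return completed;
}
```

A task that never reaches a cancellation point runs to completion even after the flag is set, which is why waiting on the group after canceling it is still needed.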
Example 1
#include <iostream>
#include <unistd.h> // sleep, usleep
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create group
    auto g = hetcompute::create_group();

    // Create lambda for task body.
    auto l = [](int task_id) {
        HETCOMPUTE_ILOG("Task %d begins execution.\n", task_id);
        for (int i = 0; i < 2; ++i)
        {
            hetcompute::abort_on_cancel();
            usleep(400000);
        }
        HETCOMPUTE_ILOG("Task %d ends execution normally.\n", task_id);
    };

    // Launch many tasks
    for (int j = 0; j < 10000; ++j)
    {
        g->launch(l, j);
    }

    // Sleep for a little while, to give some tasks
    // time to completely execute.
    sleep(1);

    // Cancel group and wait for the running tasks to complete
    g->cancel();
    try
    {
        g->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (...)
    {
        // Never reached
    }
    hetcompute::runtime::shutdown();
    return 0;
}
In the example above, 10000 tasks are launched into group g. Each task prints a message when it
starts execution and another one right before it ends execution. The latter is printed only if the task does
not notice that the group has been canceled. (See hetcompute::abort_on_cancel.)
Right after launching the tasks, main sleeps for a second before canceling the group. This means that the next
time the running tasks execute hetcompute::abort_on_cancel(), they will see that their group
has been canceled and will abort. wait_for will not return before the running tasks end their execution,
either because they call hetcompute::abort_on_cancel(), or because they complete their
execution without being canceled.
Example 2
#include <atomic>
#include <iostream>
#include <unistd.h> // usleep
#include <hetcompute/hetcompute.hh>

using namespace std;

int
main()
{
    hetcompute::runtime::init();
    // Counts the number of tasks that execute before the group gets
    // canceled. Initialized explicitly so the count starts at zero.
    atomic<size_t> counter(0);

    auto group = hetcompute::create_group();

    // Create 2000 tasks that increase an atomic counter
    for (int i = 0; i < 2000; i++)
    {
        group->launch([&counter] {
            counter++;
            usleep(7);
        });
    }

    // Cancel group
    group->cancel();

    // Wait for group to cancel
    try
    {
        group->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        // If many tasks were canceled, they each propagate a
        // hetcompute::canceled_exception to the group, all of which get aggregated into
        // a single hetcompute::aggregate_exception.
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
Output
hetcompute::abort_on_cancel()
hetcompute::group::wait_for()
hetcompute::task<>::cancel()
Returns true if the group has been canceled; otherwise, returns false. For more about cancellation, see
Tasks.
Returns
See Also
hetcompute::group::cancel()
Exceptions
Note
If the application disables exceptions, this API terminates the application when the pointer to the task is
nullptr, or when it is invoked from outside a task or from within a hetcompute::pfor_each.
Example
#include <iostream>
#include <string>

#include <hetcompute/hetcompute.hh>

void display_webpage(char*);
void compose_webpages(int num_urls, char* urls[]);

void
display_webpage(char* url)
{
    auto fetchdata = hetcompute::create_task([=] {
        /*fetch(url, "fetchdata");*/
        return std::string(url) + " data";
    });
    auto fetchstyle = hetcompute::create_task([=] {
        /*fetch(url, "fetchstyle");*/
        return std::string(url) + " style";
    });
    auto render = hetcompute::create_task([](std::string data, std::string style) {
        /*render();*/
        std::cout << data + " " + style << std::endl;
    });
    // Render task may start executing only after data and style have been
    // fetched
    render->bind_all(fetchdata, fetchstyle);
    fetchdata->launch();
    fetchstyle->launch();
    render->launch();
    // Mark display_webpage as logically finishing after the render task finishes
    render->finish_after();
    // Return from function call even before any of the fetchdata, fetchstyle, or render
    // tasks finish. Such an early return makes the function asynchronous.
}

void
compose_webpages(int num_urls, char* urls[])
{
    auto g = hetcompute::create_group();
    // Start at 1 to skip urls[0], which is the program name when called with argv.
    for (int i = 1; i < num_urls; i++)
    {
        g->launch([=] { display_webpage(urls[i]); });
    }
    // Mark compose_webpages as logically finishing after all webpages have been
    // composed and displayed
    g->finish_after();
    // Return from function call before any of the tasks finish
}

int
main(int argc, char* argv[])
{
    hetcompute::runtime::init();

    // Launch compose_webpages as a task since it is an asynchronous function
    // call
    auto t = hetcompute::launch([=, &argv] { compose_webpages(argc, argv); });
    // Waits for the composite display to be rendered!
    t->wait_for();

    // Shut down the runtime before returning.
    hetcompute::runtime::shutdown();
    return 0;
}
Returns string with the name of the group. If the group has no name, the returned string is empty.
Returns
See Also
hetcompute::create_group(std::string const&)
Returns a pointer to a group that represents the intersection of the group managed by ∗this and other.
Some applications require that tasks join more than one group. It is possible, though not recommended for
performance reasons, to use hetcompute::group::launch(hetcompute::task_ptr<>
const&) or hetcompute::group::add(hetcompute::task_ptr<> const&) repeatedly to
add a task to several groups. Instead, use hetcompute::group::intersect(group_ptr
const&) to create a new group that represents the intersection of all the groups where the tasks need to
launch. Again, this method is more performant than repeatedly launching the same task into different
groups.
Launching a task into the intersection group also simultaneously launches it into all the groups that are part
of the intersection.
Consecutive calls to hetcompute::group::intersect with the same group pointer as argument
return a pointer to the same group.
Group intersection is a commutative operation.
You can use the & operator instead of hetcompute::group::intersect.
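The properties described above (launching into the intersection also launches into each member group, repeated calls return the same group, and argument order does not matter) can be pictured with a small toy model. This is only a sketch in standard C++: ToyGroup and ToyRegistry are hypothetical names invented for illustration and say nothing about how the HetCompute runtime implements intersection.

```cpp
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <set>
#include <utility>
#include <vector>

// Toy stand-in for a group (NOT the HetCompute type): a set of task ids plus,
// for intersection groups only, the groups it was built from.
struct ToyGroup {
    std::set<int> tasks;
    std::vector<std::shared_ptr<ToyGroup>> parents;

    // Launching into an intersection also launches into every parent group.
    void launch(int task_id) {
        tasks.insert(task_id);
        for (auto& parent : parents) {
            parent->launch(task_id);
        }
    }
};

using ToyGroupPtr = std::shared_ptr<ToyGroup>;

// Hypothetical registry that returns the same intersection object for the
// same pair of groups, whatever the argument order.
class ToyRegistry {
public:
    ToyGroupPtr intersect(const ToyGroupPtr& a, const ToyGroupPtr& b) {
        assert(a && b);
        auto lo = reinterpret_cast<std::uintptr_t>(a.get());
        auto hi = reinterpret_cast<std::uintptr_t>(b.get());
        if (hi < lo) std::swap(lo, hi);  // order the pair: intersect is commutative
        auto key = std::make_pair(lo, hi);
        auto found = cache_.find(key);
        if (found != cache_.end()) return found->second;  // same pointer on repeat calls
        auto group = std::make_shared<ToyGroup>();
        group->parents = {a, b};
        cache_[key] = group;
        return group;
    }

private:
    std::map<std::pair<std::uintptr_t, std::uintptr_t>, ToyGroupPtr> cache_;
};
```

In this model a task launched only into one parent never shows up in the intersection, which mirrors why waiting on an intersection can return while its member groups are still busy.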
Parameters
Returns
group_ptr – Group pointer that points to a group that represents the intersection of ∗this and other.
See Also
hetcompute::intersect(hetcompute::group_ptr const&,
hetcompute::group_ptr const&).
hetcompute::operator&(hetcompute::group_ptr const&,
hetcompute::group_ptr const&)
Binds arguments to task and launches it into the group. task must be a fully-typed task-pointer to allow
argument binding, and it should not be bound already. Otherwise, launch causes a runtime error. For
more information about binding, check Tasks.
Tasks do not execute unless they are launched. By launching a task, the programmer informs the Qualcomm
HetCompute runtime that the task is ready to execute as soon as all its (control and data) dependencies have
been satisfied, required buffers, if any, are available, and a hardware context is available. For more
information about task launching, see Tasks.
Tasks can launch only once. Any subsequent calls to g->launch() do not cause the task to execute
again. Instead, they cause the task to be added to group g, if the task was not part of that group already.
When launching a task into many groups, remember that group intersection is a somewhat expensive
operation. If you need to launch into multiple groups several times, intersect the groups once and launch the
tasks into the intersection. For more information about tasks joining multiple groups, see Task Groups.
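The "launch only once" rule can be illustrated with a toy model. The sketch below uses standard C++ only; ToyTask and its members are hypothetical names, and the body runs synchronously here, whereas the real runtime schedules tasks asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <set>
#include <utility>

// Toy task (NOT a HetCompute type): its body runs at most once, but the task
// can still be added to further groups by later launch calls.
class ToyTask {
public:
    explicit ToyTask(std::function<void()> body) : body_(std::move(body)) {
        assert(body_);  // a task needs something to execute
    }

    // Records membership in group_id. Returns true only for the call that
    // actually executed the body.
    bool launch_into(int group_id) {
        groups_.insert(group_id);
        if (launched_) {
            return false;  // already launched: only the group membership changes
        }
        launched_ = true;
        body_();
        return true;
    }

    bool in_group(int group_id) const { return groups_.count(group_id) != 0; }

private:
    std::function<void()> body_;
    bool launched_ = false;
    std::set<int> groups_;
};
```

A second call to launch_into with a different group id leaves the execution count at one and only extends the set of groups the task belongs to.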
Template Parameters
FullType Task pointer type. Should be a full type (e.g., void(int, float)).
FirstArg Type of the first argument to be bound to the task.
RestArgs Type of the rest of the arguments to be bound to the task.
Parameters
Exceptions
Note
If the application disables exceptions, this API terminates the application if the pointer to task is nullptr.
Example
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void(int)>
    auto hello = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    // Bind hello to 42 and launch task into g
    g->launch(hello, 42);

    // Wait for g to be empty
    g->wait_for();

    hetcompute::runtime::shutdown();
}
See Also
hetcompute::group::launch(hetcompute::task_ptr<> const&)
Parameters
Exceptions
Note
If the application disables exceptions, this API terminates the application if the pointer to task is nullptr.
See Also
hetcompute::group::launch(hetcompute::task_ptr<FullType> const&,
FirstArg&& first_arg, RestArgs&& ...rest_args)
Launches task into the group. Tasks do not execute unless they are launched. By launching a task, the
programmer informs the Qualcomm HetCompute runtime that the task is ready to execute as soon as all its
(control and data) dependencies have been satisfied, required buffers (if any) are available, and a hardware
context is available. For more information about task launching, see Tasks.
A task executes only once regardless of how many times it has been launched. Therefore, any subsequent
call to launch does not cause the task to execute again. Instead, it causes the task to be added to a new
group, if the task was not part of that group already. For more information about tasks joining multiple
groups, see Task Groups.
Parameters
Exceptions
Note
If the application disables exceptions, this API terminates the application if the task pointer is nullptr.
Example
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void()>.
    auto hello = hetcompute::create_task([]() { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Launch hello into g.
    g->launch(hello);

    // Wait for g to be empty.
    g->wait_for();
    hetcompute::runtime::shutdown();
}
Parameters
Exceptions
Note
If the application disables exceptions, this API terminates the application if task is nullptr.
Example
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void()>.
    auto hello = hetcompute::create_task([]() { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Get regular pointer to task.
    auto hello_ptr = hello.get();

    // Launch hello into g.
    g->launch(hello_ptr);

    // Wait for g to be empty.
    g->wait_for();
    hetcompute::runtime::shutdown();
}
Creates a new task, binds arguments (if given) and launches it into a group. This is the fastest way to create
and launch a task into a group. It is recommended that it be used as much as possible. Note, however, that
this method does not return a pointer to the task. Therefore, only use this method if the new task will not be
part of a task graph. The Qualcomm HetCompute runtime executes the task as soon as all its (control and
data) dependencies have been satisfied, required buffers (if any) are available, and a hardware context is
available. For more information about task launching, see Tasks.
The new task executes the Code passed as an argument to this method.
When creating a task that will execute in the CPU, the preferred types for Code are C++11 lambda and
hetcompute::cpu_kernel, although it is possible to use other types such as function objects and
function pointers. Use hetcompute::dsp_kernel or hetcompute::gpu_kernel to create a task
that runs in the Qualcomm Hexagon DSP or in the GPU. Regardless of the Code type, it can take up to 31
arguments.
Notice that launch makes a copy of code so that the programmer does not need to worry about the
lifetime of the code object.
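Why copying the code object removes lifetime concerns can be seen in a toy model. The sketch below is plain standard C++ and is not HetCompute code: ToyQueue and Greeter are hypothetical names, and the queue runs its callables synchronously on request rather than on a runtime scheduler.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// A function object whose state we can change after it has been queued.
struct Greeter {
    std::string name;
    std::string operator()() const { return "hello " + name; }
};

// Toy queue (NOT a HetCompute type) that keeps its own copy of every callable.
class ToyQueue {
public:
    template <typename Code>
    void launch(const Code& code) {
        pending_.emplace_back(code);  // std::function stores its own copy
    }

    // Runs every queued callable and returns the results in launch order.
    std::vector<std::string> run_all() {
        std::vector<std::string> results;
        for (auto& task : pending_) {
            assert(task);
            results.push_back(task());
        }
        pending_.clear();
        return results;
    }

private:
    std::vector<std::function<std::string()>> pending_;
};
```

Because the queue owns a copy, the original Greeter can be modified or destroyed after launch without affecting what eventually runs.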
launch can also launch a HetCompute pattern object directly, in the same way that it launches a regular
task, as shown in the following example:
Examples
Notice that launch does not support launching patterns with a non-void return value. Therefore,
programmers cannot launch preduce or pdivide_and_conquer through this group launch method.
Template Parameters
Code Code that the task will execute. It can be a lambda expression, function
pointer, functor, pattern, cpu_kernel, gpu_kernel, or dsp_kernel.
Parameters
Example
#include <stdio.h>
#include <unistd.h> // sleep()
#include <hetcompute/hetcompute.hh>

static void
foo()
{
    HETCOMPUTE_ILOG("Hello World! from foo()\n");
    sleep(1);
    HETCOMPUTE_ILOG("Bye from foo()\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Create cpu_kernel that executes foo
    auto k1 = hetcompute::create_cpu_kernel(foo);

    // Create a task from a kernel and launch it
    g->launch(k1);

    // Create lambda expression l that takes two arguments
    auto l = [](int x, int y) { HETCOMPUTE_ILOG("Hello World! %d + %d = %d\n", x, y, x + y); };

    // Create tasks from l and launch them into g
    for (int i = 0; i < 3; i++)
        for (int j = 42; j < 44; j++)
            g->launch(l, i, j);

    // Wait for all the tasks in group g to complete
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
See Also
Blocks until all tasks in the group complete execution or are canceled. If new tasks are added to the group
while wait_for is blocking, wait_for does not return until all those new tasks also complete.
If wait_for is called from within a task, Qualcomm HetCompute context switches the task and finds
another task to run. If called from outside a task, this wait_for blocks the calling thread until it returns.
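The rule that wait_for also covers tasks added while it is waiting can be pictured with a single-threaded toy model. This is a sketch in standard C++ under loud assumptions: ToyGroup is a hypothetical name, tasks run inline inside wait_for, and nothing here reflects the actual scheduler, which executes tasks on other threads.

```cpp
#include <cassert>
#include <deque>
#include <functional>
#include <utility>

// Toy group (NOT the HetCompute type). A task receives the group so that it
// can launch more tasks into it while it runs.
class ToyGroup {
public:
    using Task = std::function<void(ToyGroup&)>;

    void launch(Task task) {
        assert(task);
        pending_.push_back(std::move(task));
    }

    // Returns only when no task is left, including tasks that were added
    // while earlier tasks were running. Returns the number of completed tasks.
    int wait_for() {
        while (!pending_.empty()) {
            Task task = std::move(pending_.front());
            pending_.pop_front();
            task(*this);  // may call launch() and grow pending_
            ++completed_;
        }
        return completed_;
    }

private:
    std::deque<Task> pending_;
    int completed_ = 0;
};
```

If the first task launches a second one into the same group, wait_for in this model returns only after both have run.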
Note
Example 1
#include <hetcompute/hetcompute.hh>
#include <stdio.h>

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Launch 10 tasks into g
    for (int i = 0; i < 10; i++)
    {
        g->launch([i] { HETCOMPUTE_ILOG("Hello World! I'm task #%d\n", i); });
    }

    // Wait for tasks to complete and exit group
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
Waiting for a group intersection means that wait_for returns once the tasks in the
intersection group have completed execution or been canceled.
Example 2
#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    // Create groups
    hetcompute::group_ptr g1 = hetcompute::create_group("Example 1");
    hetcompute::group_ptr g2 = hetcompute::create_group("Example 2");
    hetcompute::group_ptr g12 = g1 & g2;

    // Create and launch two tasks that never end
    g1->launch([] {
        while (1)
        {
        }
    });

    g2->launch([] {
        while (1)
        {
        }
    });

    // Returns immediately because there are no
    // tasks that belong to both g1 and g2
    g12->wait_for();

    // Never returns
    // g1->wait_for();
    // g2->wait_for();

    g1->cancel();
    g2->cancel();

    return 0;
}
See Also
hetcompute::group::finish_after
hetcompute::task::finish_after
hetcompute::task::wait_for
hetcompute::intersection
• group_ptr ()
Default constructor. Constructs a group_ptr with no group.
• group_ptr (std::nullptr_t)
Default constructor. Constructs a group_ptr with no group.
• void reset ()
Resets pointer to managed group.
10.1.1.2.1.1 hetcompute::group_ptr::group_ptr ( )
Constructs a group_ptr object that manages the same group as other. If other points to nullptr,
the newly built object points to nullptr as well.
Parameters
Constructs a group_ptr object that manages the same group as other and resets other. If other
points to nullptr, the newly built object points to nullptr as well.
Parameters
Returns pointer to the managed group. Remember that the lifetime of the group is defined by the lifetime of
the group_ptr objects managing it. If all group_ptr objects managing a group g go out of scope, any
raw group∗ pointers to g may become invalid.
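The lifetime hazard described above is the same one that applies to std::shared_ptr::get() in standard C++. The sketch below uses std::shared_ptr as an analogy only; ToyGroup is a hypothetical stand-in, and no claim is made that group_ptr is implemented with std::shared_ptr.

```cpp
#include <cassert>
#include <memory>
#include <string>

// Toy stand-in for a group (NOT the HetCompute type).
struct ToyGroup {
    std::string name;
};

// Takes a raw pointer from the only owner, lets the owner go out of scope,
// and reports whether the object was destroyed as a result.
bool group_destroyed_when_last_owner_leaves() {
    std::weak_ptr<ToyGroup> watch;  // observes the object without owning it
    ToyGroup* raw = nullptr;
    {
        auto owner = std::make_shared<ToyGroup>(ToyGroup{"g"});
        watch = owner;
        raw = owner.get();
        assert(raw != nullptr);
        assert(raw->name == "g");  // safe: the owner is still in scope
    }
    // The last owner is gone, so raw now dangles and must not be dereferenced.
    (void)raw;
    return watch.expired();
}
```

The practical consequence is to keep at least one owning pointer alive for as long as any raw pointer obtained from it is in use.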
Returns
Returns
Returns pointer to the managed group. Do not call this member function if ∗this does not manage a
group. This would cause a fatal error.
Returns
Assigns the group managed by other to ∗this. If, before the assignment, ∗this was the last
group_ptr pointing to a group g, then the assignment will cause g to be destroyed. If other manages
no object, ∗this will not manage an object either after the assignment.
Parameters
Returns
∗this.
Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last group_ptr
pointing to a group g, then the assignment causes g to be destroyed.
Returns
∗this.
Move-assigns the group managed by other to ∗this. other will manage no group after the assignment.
If, before the assignment, ∗this was the last group_ptr pointing to a group g, then the assignment will
cause g to be destroyed. If other manages no object, ∗this will not manage an object either after the
assignment.
Parameters
Returns
∗this.
Resets pointer to managed group. If ∗this was the last group_ptr pointing to a group g, then
reset() causes g to be destroyed.
Exceptions
Parameters
Checks whether ∗this is the only group_ptr managing the same group object. If ∗this does not
manage any group, unique() returns false.
It is equivalent to checking whether use_count is 1, except that it is more efficient.
Returns
true – The pointer is the only group_ptr managing the group.
false – The pointer is not the only group_ptr managing the group, or ∗this is nullptr.
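The contract above (true only for a sole owner, false for an empty pointer) can be mirrored with std::shared_ptr as an analogy. This is a sketch under that assumption: ToyGroup, is_unique, and owners_after_copy are hypothetical names, and the real group_ptr::unique() is described above as cheaper than reading use_count.

```cpp
#include <cassert>
#include <memory>

// Toy stand-in for a group (NOT the HetCompute type).
struct ToyGroup {};

// Mirrors the documented contract: false for an empty pointer, true only
// when exactly one owning pointer manages the object.
bool is_unique(const std::shared_ptr<ToyGroup>& p) {
    return p && p.use_count() == 1;
}

// Copying a pointer adds one owner for as long as the copy is alive.
long owners_after_copy(const std::shared_ptr<ToyGroup>& p) {
    assert(p);
    std::shared_ptr<ToyGroup> copy = p;
    return copy.use_count();
}
```

Note that this analogy has no counterpart for the internal reference that the runtime holds while a group contains tasks, so real use counts can be higher than the number of pointers visible in user code.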
Returns the number of group_ptr objects managing the same object (including ∗this). Notice that the
HETCOMPUTE runtime keeps one internal group_ptr to a group if the group contains one or more
tasks. This is to prevent a group from disappearing while it has tasks.
#include <atomic>
#include <cassert>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group g.
    auto g = hetcompute::create_group();

    // g's use_count should be 1
    HETCOMPUTE_ILOG("After construction: g.use_count() = %zu\n", g.use_count());

    // Copy-construct g2 from g. g and g2's use_count is 2.
    auto g2 = g;
    HETCOMPUTE_ILOG("After copy-construction: g2.use_count() = %zu\n", g2.use_count());

    std::atomic<bool> running(false);
    std::atomic<bool> finish(false);

    // Launch a task into g and wait until it is running.
    g->launch([&running, &finish] {
        running = true;
        while (!finish)
        {
        };
    });

    while (!running)
    {
    };

Output
Returns
Creates a named group and returns a group_ptr that points to it. Named groups can facilitate debugging
of complex applications. Keep in mind that Qualcomm HetCompute makes a copy of name, which
may cause a slight overhead if you repeatedly create and destroy groups.
name does not have to be unique. Qualcomm HetCompute does not enforce uniqueness, so two or more
groups can share the same name.
Parameters
Returns
Example
#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::create_group()
hetcompute::create_group(std::string const&)
Creates a named group and returns a group_ptr that points to it. Named groups can facilitate debugging
of complex applications. Keep in mind that Qualcomm HetCompute makes a copy of name, which
may cause a slight overhead if you repeatedly create and destroy groups.
name does not have to be unique. Qualcomm HetCompute does not enforce uniqueness, so two or more
groups can share the same name.
Parameters
Returns
Example
#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::create_group()
hetcompute::create_group(const char∗)
Returns
Example
#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::create_group()
hetcompute::create_group(const char∗)
Specifies that the task invoking this function should be deemed to finish only after tasks in g finish. This
method returns immediately.
If the invoking task is multi-threaded, the programmer must ensure that concurrent calls to
finish_after from within the task are properly synchronized.
Parameters
g Group pointer.
Exceptions
Note
If the application disables exceptions, the API terminates the application under the error conditions listed above.
Returns a pointer to a group that represents the intersection of two groups. Some applications require that
tasks join more than one group. It is possible, though not recommended for performance reasons, to use
hetcompute::group::launch(hetcompute::task_ptr<> const&) or
hetcompute::group::add(hetcompute::task_ptr<> const&) repeatedly to add a task to several groups.
Instead, use hetcompute::intersect(group_ptr const&, group_ptr const&) to create
a new group that represents the intersection of all the groups where the tasks need to launch. Again, this
method is more performant than repeatedly launching the same task into different groups.
Launching a task into the intersection group also simultaneously launches it into all the groups that are part
of the intersection.
Consecutive calls to hetcompute::intersect with the same group pointers as arguments return a
pointer to the same group.
Group intersection is a commutative operation.
You can use the & operator instead of hetcompute::group::intersect.
Parameters
Returns
group_ptr – Group pointer that points to a group that represents the intersection of a and b.
Example
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create groups
8 auto g1 = hetcompute::create_group("Group 1");
9 auto g2 = hetcompute::create_group("Group 2");
10
11 auto g12 = hetcompute::intersect(g1, g2);
12
13 for (int i = 0; i < 3000; i++)
14 g1->launch([] {
15 //... Do something
16 });
17
18 for (int i = 0; i < 2000; i++)
19 g2->launch([] {
20 //... Do something
21 });
22
23 // Returns immediately. g12 is empty
24 g12->wait_for();
25
26 // Return only after tasks in g1 and g2 complete
27 g1->wait_for();
28 g2->wait_for();
29
30 g12->launch([] {
31 //... Calculate the Ultimate Question of Life,
32 // the Universe, and Everything
33 HETCOMPUTE_ILOG("42\n");
34 });
35
36 // All will return after the task prints 42
37 g1->wait_for();
38 g2->wait_for();
39 hetcompute::runtime::shutdown();
40
41 return 0;
42 }
The example above shows an application with three groups: g1, g2, and their intersection g12. We
launch thousands of tasks on both g1 and g2. We then wait for g12 (line 24), but
g12->wait_for() returns immediately because g12 is empty: at this point no task
belongs to both g1 and g2. We then launch a task into g12 (line 30). The final calls to
g1->wait_for() and g2->wait_for() return only after the task in g12 completes execution,
because that task belongs to g1, g2, and g12.
See Also
hetcompute::operator&(hetcompute::group_ptr const&,
hetcompute::group_ptr const&)
Parameters
Returns
group_ptr – Group pointer that points to a group that represents the intersection of a and b.
See Also
hetcompute::intersect(hetcompute::group_ptr const&,
hetcompute::group_ptr const&).
10.2 Kernels
Classes
• class hetcompute::beta::cl_t
Type used for declaring the constant hetcompute::cl. More...
Functions
• template<typename Fn >
hetcompute::cpu_kernel
< typename
std::remove_reference< Fn >
::type > hetcompute::create_cpu_kernel (Fn &&fn)
Create a cpu_kernel object from a function object.
• template<typename... Args>
hetcompute::dsp_kernel< int(∗)(Args...)> hetcompute::create_dsp_kernel (int(∗fn)(Args...))
• template<typename... Args>
gpu_kernel< Args...> hetcompute::create_gpu_kernel (std::string const &cl_kernel_str, std::string
const &cl_kernel_name, std::string const &cl_build_options="")
• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::cl_t const &, std::string const
&cl_kernel_str, std::string const &cl_kernel_name, std::string const &cl_build_options="")
• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::gl_t const &, std::string const
&gl_kernel_str)
• template<typename... Args>
gpu_kernel< Args...> hetcompute::create_gpu_kernel (void const ∗cl_kernel_bin, size_t
cl_kernel_len, std::string const &cl_kernel_name, std::string const &cl_build_options="")
• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::cl_t const &, void const
∗cl_kernel_bin, size_t cl_kernel_len, std::string const &cl_kernel_name, std::string const
&cl_build_options="")
• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::gl_t const &, void const
∗gl_kernel_bin, size_t gl_kernel_len)
Variables
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type hetcompute::dsp_kernel< int(∗)(Args...)>::arity = parent::arity
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type hetcompute::cpu_kernel< Fn >::arity = parent::arity
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type hetcompute::cpu_kernel< FReturnType(FArgs...)>::arity = parent::arity
• cl_t const hetcompute::beta::cl
Used to explicitly indicate creation of an OpenCL kernel.
Utility template to get the tuple type of hetcompute::range, and other types.
The wrapped type can be accessed through call_tuple<...>::type.
Template Parameters
Utility template to get the tuple type of hetcompute::range, and GPU kernel argument types. Use case: get
the return type of the before synchronization lambda for a gpu pipeline stage.
The wrapped type can be accessed through call_tuple<...>::type.
Template Parameters
See Also
Data fields
hetcompute::gpu_kernel
A cpu_kernel object contains CPU executable code. It can be used to create tasks. When such a task
runs, it executes the function object in its cpu_kernel.
See Also
cpu_kernel<FReturnType(FArgs...)>
Public Types
• ∼cpu_kernel ()
Destructor.
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type arity = parent::arity
Friends
• struct ::hetcompute::internal::cpu_kernel_caller
• template<typename X , typename Y , typename Z >
struct ::hetcompute::internal::task_factory
• template<typename X , typename Y >
struct ::hetcompute::internal::task_factory_dispatch
Parameters
Parameters
Parameters
Parameters
A cpu_kernel object contains CPU executable code. It can be used to create tasks. When such a task
runs, it executes the function in its cpu_kernel.
See Also
cpu_kernel<FReturnType(FArgs...)>
Public Types
• cpu_kernel (FReturnType(∗fn)(FArgs...))
Constructor.
• ∼cpu_kernel ()
Destructor.
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type arity = parent::arity
Friends
• struct ::hetcompute::internal::cpu_kernel_caller
• template<typename X , typename Y , typename Z >
struct ::hetcompute::internal::task_factory
• template<typename X , typename Y >
struct ::hetcompute::internal::task_factory_dispatch
Parameters
Parameters
Parameters
For this DSP kernel, the template signature corresponds to the DSP kernel’s parameter list.
Template Parameters
See Also
Public Types
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type arity = parent::arity
Friends
Constructor
Parameters
Constructor
Parameters
Move constructor.
Destructor.
Equality operator.
Inequality operator.
hetcompute::gpu_kernel
A wrapper around OpenCL C kernels and OpenGL ES compute shaders for GPU compute. The template
signature corresponds to the GPU kernel parameter list.
See Also
• gpu_kernel (beta::cl_t const &, std::string const &cl_kernel_str, std::string const &cl_kernel_name,
std::string const &cl_build_options="")
Constructor, explicit for OpenCL kernel.
• gpu_kernel (beta::cl_t const &, void const ∗cl_kernel_bin, size_t cl_kernel_len, std::string const
&cl_kernel_name, std::string const &cl_build_options="")
Constructor, explicit for precompiled OpenCL kernel.
Friends
Parameters
Parameters
Returns
std::pair consisting of a pointer to an allocated buffer holding the CL binary and the size of the
allocated buffer (sized to hold the binary) in bytes.
Note
Each invocation of this function internally allocates a new buffer of an appropriate size using new[].
The user code is responsible for deleting the buffer after use by calling delete[].
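One way to meet the delete[] obligation in the note above is to hand the returned buffer to a std::unique_ptr for arrays as soon as it is received. The sketch below is an assumption-laden illustration: fake_get_binary is a hypothetical stand-in for the real call, and the element type unsigned char is a guess, since the exact pointer type of the returned pair is not stated here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical stand-in for a call that returns a new[]-allocated buffer and
// its size in bytes. The real function and its pointer type may differ.
std::pair<unsigned char*, std::size_t> fake_get_binary() {
    static const unsigned char blob[] = {0xCA, 0xFE, 0xBA, 0xBE};
    unsigned char* buf = new unsigned char[sizeof(blob)];
    std::memcpy(buf, blob, sizeof(blob));
    return std::make_pair(buf, sizeof(blob));
}

// Owns the buffer; delete[] runs automatically when this object is destroyed.
struct OwnedBinary {
    std::unique_ptr<unsigned char[]> data;
    std::size_t size;
};

// Wraps the raw pair immediately so that no code path can forget delete[].
OwnedBinary take_ownership(std::pair<unsigned char*, std::size_t> raw) {
    assert(raw.first != nullptr || raw.second == 0);
    OwnedBinary out;
    out.data.reset(raw.first);
    out.size = raw.second;
    return out;
}
```

Since every invocation allocates a fresh buffer, each returned pair needs its own owner; wrapping the result at the call site keeps that one-to-one.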
Returns
Returns
hetcompute::gpu_kernel
Example:
// A raw string literal keeps the multi-line OpenCL C source valid C++
const char* kernel_string = R"CL(
__kernel void k(__global int *a,
                __global int *b,
                __local int *c)
{
    ...
}
)CL";
hetcompute::gpu_kernel<hetcompute::buffer_ptr<int>,
                       hetcompute::buffer_ptr<int>,
                       hetcompute::local<int>> gk(kernel_string, "k");
// Pass the __local size as a number of elements (not a number of bytes, as in OpenCL)
int number_of_ints = number_of_bytes / sizeof(int);
auto t = hetcompute::create_task(gk, r, buf_a, buf_b, number_of_ints);
Create a cpu_kernel object that executes a given function. This kernel object can then be used to create
a task.
Parameters
Returns
See Also
create_cpu_kernel(Fn&& fn)
Create a cpu_kernel object that executes a given function object. A function object (also called a
functor) is any object with the () operator defined, such as a lambda expression. This kernel object can
then be used to create a task.
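The kinds of callable mentioned above (a plain function, a functor, and a lambda) can all be stored behind one wrapper. The sketch below illustrates that idea in standard C++ only: ToyKernel, Accumulate, and record_last are hypothetical names, and std::function is used purely for illustration, not as a description of how cpu_kernel stores its code.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// A functor: a class type with operator() defined.
struct Accumulate {
    int total = 0;
    void operator()(int x) { total += x; }
};

// A plain function with the same call signature.
inline int& last_seen() {
    static int value = 0;
    return value;
}
inline void record_last(int x) { last_seen() = x; }

// Toy kernel wrapper (NOT a HetCompute type): stores a copy of any callable
// that can be invoked with an int.
class ToyKernel {
public:
    template <typename Fn>
    explicit ToyKernel(Fn fn) : fn_(std::move(fn)) {}

    void run(int x) {
        assert(fn_);  // an empty wrapper has nothing to execute
        fn_(x);
    }

private:
    std::function<void(int)> fn_;
};
```

Passing std::ref(functor) instead of the functor itself makes the wrapper refer to the caller's object, which is useful when the caller wants to read state that the functor accumulates.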
Parameters
Returns
See Also
create_cpu_kernel(FReturnType(∗fn)(FArgs...))
This template creates a DSP kernel executable by Qualcomm HetCompute SDK. The template signature
corresponds to the DSP kernel parameter list.
The kernel code is specified as a C language function. The function returns an int that corresponds to the
status. When the function returns something other than 0, the Qualcomm HetCompute runtime will trigger
a hetcompute::dsp_exception().
Template Parameters
// create the DSP task that will be executed on the DSP
auto hex_task = hetcompute::create_task(hex_kernel,
                                        in_buf,   // in access recognized
                                        out_buf); // out access recognized
Creates a GPU kernel executable by HETCOMPUTE, implicitly for OpenCL. The template signature
corresponds to the GPU kernel parameter list.
The kernel code is specified as a string of OpenCL C code.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.
Parameters
Returns
A gpu_kernel object.
Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenCL. The template signature
corresponds to the GPU kernel parameter list.
The kernel code is specified as a string of OpenCL C code.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.
Returns
A gpu_kernel object.
Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenGL ES. The template signature
corresponds to the GPU kernel parameter list.
Returns
A gpu_kernel object.
Creates a GPU kernel executable by HETCOMPUTE, implicitly for OpenCL, using a precompiled OpenCL
kernel. The template signature corresponds to the GPU kernel parameter list.
The kernel code is specified as a prebuilt OpenCL kernel binary.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.
Parameters
Returns
A gpu_kernel object.
Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenCL, using a precompiled OpenCL
kernel. The template signature corresponds to the GPU kernel parameter list.
The kernel code is specified as a prebuilt OpenCL kernel binary.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.
Parameters
Returns
A gpu_kernel object.
hetcompute::gpu_kernel
hetcompute::gpu_kernel
10.3 Indices
Classes
Methods common to 1D, 2D, and 3D index objects are listed here. The value for Dims can be 1, 2, or 3.
Protected Attributes
Parameters
Returns
Parameters
Returns
Sums the corresponding values of the current index_base object and another index_base object and returns
a new index_base object.
Parameters
rhs index_base object to be used for summing with the values of the current
object.
Sums the corresponding values of the current index_base object and another index_base object and returns
a reference to the current index_base object.
Parameters
rhs index_base object to be used for summing with the values of the current object.
Subtracts the corresponding values of the current index_base object and another index_base object and
returns a new index_base object.
Parameters
rhs index_base object to be used for subtraction with the values of the current
object.
Subtracts the corresponding values of the current index_base object and another index_base object and
returns a reference to the current index_base object.
Parameters
rhs index_base object to be used for subtraction with the values of the current
object.
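The difference between the two forms of each operator (one returns a new object, the other updates the current object and returns a reference to it) can be sketched in standard C++. The type sketch_index below is a hypothetical stand-in, not the SDK's index_base, and the underflow check on subtraction is an assumption made for this sketch.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Stand-in for an index with Dims coordinates; not the SDK's index_base.
template <std::size_t Dims>
struct sketch_index
{
    std::array<std::size_t, Dims> v;

    // Compound form: updates this object and returns a reference to it.
    sketch_index& operator+=(const sketch_index& rhs)
    {
        for (std::size_t i = 0; i < Dims; ++i) v[i] += rhs.v[i];
        return *this;
    }

    sketch_index& operator-=(const sketch_index& rhs)
    {
        for (std::size_t i = 0; i < Dims; ++i)
        {
            assert(v[i] >= rhs.v[i]); // assumption: reject size_t wrap-around
            v[i] -= rhs.v[i];
        }
        return *this;
    }
};

// Binary forms: leave both operands untouched and return a new object.
template <std::size_t Dims>
sketch_index<Dims> operator+(sketch_index<Dims> lhs, const sketch_index<Dims>& rhs)
{
    lhs += rhs;
    return lhs;
}

template <std::size_t Dims>
sketch_index<Dims> operator-(sketch_index<Dims> lhs, const sketch_index<Dims>& rhs)
{
    lhs -= rhs;
    return lhs;
}
```

Implementing the binary operators in terms of the compound ones keeps the element-wise logic in a single place.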
Checks if this object is less than another index_base object. Performs a lexicographical comparison of two
index_base objects, similar to std::lexicographical_compare().
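As a rough illustration of what a lexicographical comparison means for indices, the sketch below compares plain coordinate arrays with std::lexicographical_compare. The helper names index_less and index_less_equal are hypothetical, and std::array stands in for the SDK's index type.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Lexicographical "less than" over coordinate arrays: the first coordinate
// that differs decides; later coordinates only break ties.
template <std::size_t Dims>
bool index_less(const std::array<std::size_t, Dims>& a,
                const std::array<std::size_t, Dims>& b)
{
    return std::lexicographical_compare(a.begin(), a.end(), b.begin(), b.end());
}

// "Less than or equal" follows from "less than" with the operands swapped.
template <std::size_t Dims>
bool index_less_equal(const std::array<std::size_t, Dims>& a,
                      const std::array<std::size_t, Dims>& b)
{
    return !index_less(b, a);
}
```

Under this ordering (1, 9, 9) is less than (2, 0, 0), because the comparison stops at the first coordinate.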
Parameters
Returns
Checks if this object is less than or equal to another index_base object. Does a lexicographical comparison
of two index_base objects, similar to std::lexicographical_compare().
Parameters
Returns
Replaces the contents of the current index_base object with those of another index_base object.
Parameters
rhs index_base object to be used for replacing the contents of the current object.
Parameters
Returns
Checks if this object is greater than another index_base object. Does a lexicographical comparison of two
index_base objects, similar to std::lexicographical_compare().
Parameters
Returns
Checks if this object is greater than or equal to another index_base object. Does a lexicographical
comparison of two index_base objects, similar to std::lexicographical_compare().
Parameters
Returns
Parameters
Returns
Returns a const reference to the i-th coordinate of the index_base object. No bounds checking is performed.
Parameters
Returns
10.4 Ranges
Classes
A 1-dimensional range.
// 1d vector sum using hetcompute::range<1>
constexpr size_t N = 100; // size of vector
std::vector<size_t> v(N);
hetcompute::range<1> r(0, N);
std::atomic<size_t> sum(0);
• range ()
• range (const std::array< size_t, 1 > &bb, const std::array< size_t, 1 > &ee)
• range (const std::array< size_t, 1 > &bb, const std::array< size_t, 1 > &ee, const std::array<
size_t, 1 > &ss)
• range (size_t b0, size_t e0, size_t s0)
• range (size_t b0, size_t e0)
• range (size_t e0)
• size_t index_to_linear (const hetcompute::index< 1 > &it) const
• bool is_empty () const
• hetcompute::index< 1 > linear_to_index (size_t idx) const
Creates a 1D range that spans [b0, e0) and is traversed in increments of s0. It will cause a fatal error if
b0 is greater than or equal to e0. s0 should be greater than 0.
Parameters
b0 Beginning of 1D range.
e0 End of 1D range.
s0 Stride of 1D range.
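To make the half-open, strided interval concrete, the following standard C++ sketch lists the points such a range would cover. The function points_1d is hypothetical and only mirrors the preconditions stated above; it does not use the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lists the points of the half-open range [b0, e0) visited with stride s0.
inline std::vector<std::size_t> points_1d(std::size_t b0, std::size_t e0, std::size_t s0)
{
    assert(b0 < e0); // the text above treats b0 >= e0 as a fatal error
    assert(s0 > 0);  // a zero stride would never advance
    std::vector<std::size_t> pts;
    for (std::size_t i = b0; i < e0; i += s0)
    {
        pts.push_back(i);
    }
    return pts;
}
```

For b0 = 2, e0 = 11, s0 = 3 the covered points are 2, 5, and 8; the end value 11 itself is never included.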
Parameters
b0 Beginning of 1D range.
e0 End of 1D range.
Parameters
e0 End of 1D range.
Converts a hetcompute::index<1> object to a linear number with respect to the current range object.
Parameters
it hetcompute::index<1> object
Returns
Parameters
Returns
Returns
Returns
A 2-dimensional range.
// fill in a 2d matrix by tiles
constexpr size_t N = 20; // size of matrix
constexpr size_t TILE_SIZE = 5;
size_t a[N][N];
• range ()
• range (const std::array< size_t, 2 > &bb, const std::array< size_t, 2 > &ee)
• range (const std::array< size_t, 2 > &bb, const std::array< size_t, 2 > &ee, const std::array<
size_t, 2 > &ss)
• range (size_t b0, size_t e0, size_t s0, size_t b1, size_t e1, size_t s1)
• range (size_t b0, size_t e0, size_t b1, size_t e1)
• range (size_t e0, size_t e1)
• size_t index_to_linear (const hetcompute::index< 2 > &it) const
• bool is_empty () const
• hetcompute::index< 2 > linear_to_index (size_t idx) const
• size_t linearized_distance () const
• void print () const
• size_t size () const
10.4.1.3.1.2 hetcompute::range< 2 >::range ( size_t b0, size_t e0, size_t s0, size_t b1, size_t e1,
size_t s1 )
Creates a 2D range comprising the points of the cross product [b0:e0:s0) x [b1:e1:s1).
It will cause a fatal error if b0 is greater than or equal to e0 or if b1 is greater than or equal to e1. s0 and
s1 should be greater than 0.
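The cross product of two strided intervals, each with its own begin, end, and stride, can be pictured by enumerating it. The sketch below is plain C++ with a hypothetical helper points_2d; the visiting order (first coordinate outermost) is an arbitrary choice for illustration and says nothing about how the SDK iterates.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Enumerates the cross product of two strided half-open intervals as (i, j) pairs.
// The visiting order here (i outer, j inner) is an arbitrary choice.
inline std::vector<std::pair<std::size_t, std::size_t>>
points_2d(std::size_t b0, std::size_t e0, std::size_t s0,
          std::size_t b1, std::size_t e1, std::size_t s1)
{
    assert(b0 < e0 && b1 < e1); // empty intervals are rejected, as in the text above
    assert(s0 > 0 && s1 > 0);
    std::vector<std::pair<std::size_t, std::size_t>> pts;
    for (std::size_t i = b0; i < e0; i += s0)
    {
        for (std::size_t j = b1; j < e1; j += s1)
        {
            pts.emplace_back(i, j);
        }
    }
    return pts;
}
```

The number of points is the product of the per-dimension counts, so points_2d(0, 4, 2, 1, 4, 1) yields 2 x 3 = 6 pairs.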
Parameters
10.4.1.3.1.3 hetcompute::range< 2 >::range ( size_t b0, size_t e0, size_t b1, size_t e1 )
Creates a 2D range comprising the points of the cross product [b0, e0) x [b1, e1).
It will cause a fatal error if b0 is greater than or equal to e0 or if b1 is greater than or equal to e1.
Parameters
Creates a 2D range comprising the points of the cross product [0, e0) x [0, e1).
Parameters
Converts a hetcompute::index<2> object to a linear number with respect to the current range object.
Parameters
it hetcompute::index<2> object
Returns
Parameters
Returns
Returns
Returns
A 3-dimensional range.
// 6-point stencil in 3-d
constexpr size_t N = 10; // size of matrix
constexpr size_t TILE_SIZE = 2;
float a[N][N][N];
// define a 3d range
• range ()
• range (const std::array< size_t, 3 > &bb, const std::array< size_t, 3 > &ee)
• range (const std::array< size_t, 3 > &bb, const std::array< size_t, 3 > &ee, const std::array<
size_t, 3 > &ss)
• range (size_t b0, size_t e0, size_t s0, size_t b1, size_t e1, size_t s1, size_t b2, size_t e2, size_t s2)
• range (size_t b0, size_t e0, size_t b1, size_t e1, size_t b2, size_t e2)
• range (size_t e0, size_t e1, size_t e2)
• size_t index_to_linear (const hetcompute::index< 3 > &it) const
• bool is_empty () const
• hetcompute::index< 3 > linear_to_index (size_t idx) const
• size_t linearized_distance () const
• void print () const
• size_t size () const
10.4.1.4.1.2 hetcompute::range< 3 >::range ( size_t b0, size_t e0, size_t s0, size_t b1, size_t e1,
size_t s1, size_t b2, size_t e2, size_t s2 )
Creates a 3D range comprising the points of the cross product [b0:e0:s0) x [b1:e1:s1) x [b2:e2:s2).
It will cause a fatal error if b0 is greater than or equal to e0, if b1 is greater than or equal to e1, or if b2
is greater than or equal to e2. s0, s1, and s2 should be greater than 0.
Parameters
10.4.1.4.1.3 hetcompute::range< 3 >::range ( size_t b0, size_t e0, size_t b1, size_t e1, size_t b2,
size_t e2 )
Creates a 3D range comprising the points of the cross product [b0, e0) x [b1, e1) x [b2, e2).
It will cause a fatal error if b0 is greater than or equal to e0, if b1 is greater than or equal to e1, or if b2
is greater than or equal to e2.
Parameters
Creates a 3D range comprising the points of the cross product [0, e0) x [0, e1) x [0, e2).
Parameters
Converts a hetcompute::index<3> object to a linear number with respect to the current range object.
Parameters
it hetcompute::index<3> object.
Returns
Parameters
Returns
Returns
Returns
• range_base (const std::array< size_t, Dims > &bb, const std::array< size_t, Dims > &ee)
• range_base (const std::array< size_t, Dims > &bb, const std::array< size_t, Dims > &ee, const
std::array< size_t, Dims > &ss)
• size_t begin (const size_t i) const
• const std::array< size_t, Dims > & begin () const
• size_t dims () const
• size_t end (const size_t i) const
• const std::array< size_t, Dims > & end () const
• size_t length (const size_t i) const
• size_t num_elems (const size_t i) const
• size_t stride (const size_t i) const
• const std::array< size_t, Dims > & stride () const
Protected Attributes
Returns the beginning of the range in the i-th coordinate. It will cause a fatal error if i is greater than or
equal to Dims.
Returns
Returns
Returns
Returns the end of the range in the i-th coordinate. It will cause a fatal error if i is greater than or equal to
Dims.
Returns
Returns
Returns the length of the range in the i-th coordinate. num_elems(i) does not take non-stride indices into
account, whereas length(i) returns the total length in the i-th coordinate.
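One plausible reading of this distinction, sketched in standard C++: length counts every index between begin and end, while num_elems counts only the indices that the stride actually lands on. The axis struct and the formulas below are assumptions made for illustration; the SDK's exact definitions are not reproduced here.

```cpp
#include <cassert>
#include <cstddef>

// One coordinate of a strided half-open range [begin, end).
struct axis
{
    std::size_t begin;
    std::size_t end;
    std::size_t stride;
};

// Total span of the axis, counting indices that the stride skips over.
inline std::size_t length(const axis& a)
{
    assert(a.begin <= a.end);
    return a.end - a.begin;
}

// Number of indices actually visited: ceil(length / stride).
inline std::size_t num_elems(const axis& a)
{
    assert(a.stride > 0);
    return (length(a) + a.stride - 1) / a.stride;
}
```

Under this reading, the axis [0, 10) with stride 3 has length 10 but only 4 elements (0, 3, 6, 9).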
Returns
Returns
Returns the stride of the range in the i-th coordinate. It will cause a fatal error if i is greater than or
equal to Dims.
Returns
Returns
10.5 Tasks
Classes
• struct hetcompute::do_not_collapse_t
• class hetcompute::task< ReturnType >
Tasks with a function of non-void return type. More...
• class hetcompute::task<>
Tasks as the basic unit of work. More...
• class hetcompute::task_ptr<>
Smart pointer to a task object without function information. More...
Typedefs
• template<typename Fn >
using hetcompute::collapsed_task_type = typename ::hetcompute::internal::task_factory< Fn >::collapsed_task_type
• template<typename Fn >
using hetcompute::non_collapsed_task_type = typename ::hetcompute::internal::task_factory< Fn >::non_collapsed_task_type
Functions
• void hetcompute::abort_on_cancel ()
Aborts execution of calling task if any of its groups is canceled or if someone has canceled it by calling
hetcompute::cancel().
• void hetcompute::abort_task ()
Aborts execution of calling task.
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >()
* std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())>
hetcompute::operator* (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator * for tasks.
• template<typename T >
inline ::hetcompute::task_ptr< typename ::hetcompute::task_ptr< T >::return_type >
hetcompute::operator+ (const ::hetcompute::task_ptr< T > &t)
Algebraic unary operator + for task.
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >()
+ std::declval< T2 >())>
hetcompute::operator+ (const ::hetcompute::task_ptr< T1 > &t1, T2 &&op2)
Algebraic binary operator + for tasks.
• template<typename T >
inline ::hetcompute::task_ptr< typename ::hetcompute::task_ptr< T >::return_type >
hetcompute::operator- (const ::hetcompute::task_ptr< T > &t)
Algebraic unary operator - for task.
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >()
^ std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())>
hetcompute::operator^ (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator ^ for tasks.
inline ::hetcompute::task_ptr< decltype(std::declval< T1 >()
| std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())>
hetcompute::operator| (T1 &&op1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator | for tasks.
• template<typename T >
inline ::hetcompute::task_ptr< typename ::hetcompute::task_ptr< T >::return_type >
hetcompute::operator~ (const ::hetcompute::task_ptr< T > &t)
Algebraic unary operator ~ for task.
Variables
Template Parameters
Note: An object of this class should not be instantiated. It is a facade to the internal implementation.
Public Types
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Define a lambda.
    auto l = [](int x) -> int { return x * 2; };
    // Create a task out of the lambda and launch.
    auto t1 = hetcompute::launch(l, 42);

    // Wait for t1 to finish and assign the return value to val.
    int val = t1->copy_value();

    HETCOMPUTE_ILOG("return value of t1 is: %d", val);

    hetcompute::runtime::shutdown();
}
See Also
hetcompute::task<ReturnType>::move_value()
hetcompute::task<>::wait_for()
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Define a lambda.
    auto l = [](int x) -> int { return x * 2; };
    // Create a task out of the lambda and launch.
    auto t1 = hetcompute::launch(l, 42);

    int val = t1->move_value();
    HETCOMPUTE_ILOG("return value of t1 is: %d", val);

    hetcompute::runtime::shutdown();
    // Error! value might not be there anymore!!
    // int val_error = t1->move_value();
    return 0;
}
See Also
hetcompute::task<ReturnType>::copy_value()
hetcompute::task<>::wait_for()
Template Parameters
Note: An object of this class should not be instantiated. It is a facade to the internal implementation.
Public Types
• template<typename... Arguments>
void bind_all (Arguments &&...args)
Bind all arguments to a task with a full-function signature.
• template<typename... Arguments>
void launch (Arguments &&...args)
Launches task and (optionally) binds arguments.
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type arity = sizeof...(Args)
Friends
Template Parameters
Parameters
Example
#include <atomic>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a group.
    auto g = hetcompute::create_group();

    std::atomic<size_t> value(0);

    // Create a non-collapsing task that returns hetcompute::task_ptr<hetcompute::task_ptr<size_t>()>.
    auto t1 = hetcompute::create_task(hetcompute::do_not_collapse, [&value]() -> hetcompute::task_ptr<size_t> {
        return hetcompute::create_task([&value]() -> size_t {
            value = 27;
            return value;
        });
    });

    // Create a task.
    auto t2 = hetcompute::create_task([&value] { value = 42; });

    // Create a task that takes two parameters of hetcompute::task_ptr<>.
    auto t3 = hetcompute::create_task([g](hetcompute::task_ptr<> ta, hetcompute::task_ptr<> tb) {
        // Set task dependency and launch.
        ta->then(tb);  // t1->result() >> t2
        g->launch(ta); // t1->result()->launch()
        g->launch(tb); // t2->launch();
    });

    // Create a task that takes one parameter of hetcompute::task_ptr<>.
    auto t4 = hetcompute::create_task([g](hetcompute::task_ptr<> ta) {
        // Launch the task.
        g->launch(ta); // t1->launch();
    });

    // Bind the arguments for t3.
    // Bind t1 to the first argument as data dependency (explicitly, due to ambiguity).
    // Bind t2 to the second argument (by value, no ambiguity).
    t3->bind_all(hetcompute::bind_as_data_dependency(t1), t2);

    // Bind the argument for t4.
    // Bind t1 to the argument by value (explicitly, due to ambiguity).
    t4->bind_all(hetcompute::bind_by_value(t1));

    // Launch the tasks into the group (t1 and t2 will be launched in t3 and t4).
    g->launch(t3);
    g->launch(t4);

    // Wait for the group to finish.
    g->wait_for();

    HETCOMPUTE_ILOG("%zu", value.load());

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::bind_as_data_dependency()
hetcompute::bind_by_value()
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    //...
    // Set up dependencies if needed.
    // ..

    // t1 is ready, launch and bind it.
    t1->launch(42);

    // Wait for t1 to finish.
    t1->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::group::launch(Code)
Note: An object of this class should not be instantiated. It is a facade to the internal implementation.
template<>class hetcompute::task<>
Note: An object of this class should not be instantiated. It is a facade to the internal implementation.
Public Types
• void cancel ()
Cancels task.
• void finish_after ()
Specifies that the current task finishes only after this task finishes.
• void launch ()
Launches task.
Protected Attributes
• internal_raw_task_ptr _ptr
Friends
Cancels task.
Use hetcompute::task<>::cancel() to cancel a task and its successors. The effects of
hetcompute::task<>::cancel() depend on the task status:
 1  #include <cassert>
 2  #include <iostream>
 3
 4  #include <hetcompute/hetcompute.hh>
 5
 6  int
 7  main()
 8  {
 9      hetcompute::runtime::init();
10      auto t1 = hetcompute::create_task([] { assert(false); });
11
12      auto t2 = hetcompute::create_task([] { assert(false); });
13
14      // Create control dependency.
15      t1->then(t2);
16
17      // Cancel t1, which propagates cancellation to t2
18      t1->cancel();
19
20      // Launch t2. Does nothing, t2 got canceled via cancellation propagation
21      t2->launch();
22
23      // Returns immediately, t2 is canceled.
24      try
25      {
26          t2->wait_for();
27      }
28      catch (const hetcompute::canceled_exception& e)
29      {
30          std::cout << e.what() << ": t2 was canceled" << std::endl;
31      }
32      catch (...)
33      {
34          // Never reached
35      }
36
37      hetcompute::runtime::shutdown();
38      return 0;
39  }
In the example above, a control dependency is created between two tasks, t1 and t2. Notice that if either
task executes, it will raise an assertion. In line 18, t1 is canceled, which causes t2 to be canceled as
well. In line 21, t2 is launched, but the launch has no effect: t2 will not execute because it was canceled
when t1 propagated its cancellation.
 1  #include <cassert>
 2
 3  #include <hetcompute/hetcompute.hh>
 4
 5  int
 6  main()
 7  {
 8      hetcompute::runtime::init();
 9      auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
10
11      auto t2 = hetcompute::create_task([] { assert(false); });
12
13      auto t3 = hetcompute::create_task([] { assert(false); });
14
15      // Create dependencies
16      t1->then(t2)->then(t3);
17
18      // Launch t2. It cannot execute as yet because t1 has not been launched.
19      t2->launch();
20
21      // Cancel t2, which propagates cancellation to t3
22      t2->cancel();
23
24      // Launch t1. It will execute because no one canceled it.
25      t1->launch();
26
27      // Returns after t1 completes execution
28      t1->wait_for();
29      hetcompute::runtime::shutdown();
30
31      return 0;
32  }
In the example above, three tasks are created and chained: t1, t2, and t3. In line 19, t2 is launched, but it
cannot execute because its predecessor has not yet executed. In line 22, t2 is canceled, which means that it
will never execute. Because t3 is t2's successor, it is also canceled; if t3 had a successor, it would also
be canceled.
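The propagation rule that these examples rely on can be modeled in a few lines of standard C++. This is a toy model for intuition only: toy_task is a hypothetical type, not the SDK's task, and it ignores threading, launching, and groups.

```cpp
#include <cassert>
#include <vector>

// Toy task node; an illustration only, not the SDK's task type.
struct toy_task
{
    bool                   completed = false;
    bool                   canceled  = false;
    std::vector<toy_task*> successors;

    // Mirrors t1->then(t2): next may only run after this task.
    void then(toy_task& next) { successors.push_back(&next); }

    // Canceling a task that has not completed also cancels everything
    // downstream of it. A completed task ignores the request.
    void cancel()
    {
        if (completed || canceled) return;
        canceled = true;
        for (toy_task* s : successors) s->cancel();
    }

    // Runs the task unless it was canceled; returns whether it ran.
    bool run()
    {
        if (canceled) return false;
        completed = true;
        return true;
    }
};
```

Chaining three toy tasks and canceling the middle one leaves the first runnable and marks the last as canceled, which matches the behavior the three-task example describes.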
 1  #include <cassert>
 2  #include <iostream>
 3  #include <unistd.h>
 4
 5  #include <hetcompute/hetcompute.hh>
 6
 7  int
 8  main()
 9  {
10      hetcompute::runtime::init();
11
12      auto t = hetcompute::create_task([] {
13          while (1)
14          {
15              hetcompute::abort_on_cancel();
16              HETCOMPUTE_ILOG("Waiting to be canceled.\n");
17              usleep(100);
18          }
19          assert(false); // This will never fire.
20      });
21
22      // Launch t.
23      t->launch();
24
25      // Wait for 2 seconds.
26      sleep(2);
27
28      // Cancel task. Returns immediately.
29      t->cancel();
30
31      try
32      {
33          // Wait for the task.
34          t->wait_for();
35      }
36      catch (const hetcompute::canceled_exception& e)
37      {
38          std::cout << e.what() << " thrown" << std::endl;
39      }
40      catch (...)
41      {
42          // Never reached.
43      }
44
45      hetcompute::runtime::shutdown();
46      return 0;
47  }
In the example above, task t will never finish unless it is canceled. t is launched in line 23. After
launching the task, the main thread blocks for 2 seconds in line 26 to ensure that t is scheduled and prints
its messages. In line 29, Qualcomm HetCompute is asked to cancel the task, which should be running by now.
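Cancellation of a running task is cooperative: the task only stops when it reaches a cancellation point. The standard C++ sketch below models that idea with a polled flag. All names are hypothetical, the use of an exception to unwind is an assumption about one possible mechanism, and the canceller is simulated in the same thread to keep the sketch single-threaded.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Thrown at a cancellation point; an assumed mechanism for this sketch.
struct sketch_canceled : std::runtime_error
{
    sketch_canceled() : std::runtime_error("canceled") {}
};

// Shared flag that a canceller sets and the running work polls.
struct cancel_flag
{
    std::atomic<bool> requested{false};

    void cancel() { requested.store(true); }

    // Cancellation point: unwinds the work only if cancellation was requested.
    void abort_on_cancel() const
    {
        if (requested.load()) throw sketch_canceled();
    }
};

// Loops until canceled and returns the number of completed iterations.
// The canceller is simulated: the flag is set after cancel_after iterations.
inline int run_until_canceled(cancel_flag& flag, int cancel_after)
{
    assert(cancel_after > 0);
    int iterations = 0;
    try
    {
        while (true)
        {
            flag.abort_on_cancel();
            ++iterations;
            if (iterations == cancel_after) flag.cancel();
        }
    }
    catch (const sketch_canceled&)
    {
        // Work stops only at a checkpoint, never in the middle of an iteration.
    }
    return iterations;
}
```

A loop that never calls the checkpoint would never notice the request, which is why the long-running task in the example calls hetcompute::abort_on_cancel() on every iteration.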
 1  #include <unistd.h>
 2
 3  #include <hetcompute/hetcompute.hh>
 4
 5  int
 6  main()
 7  {
 8      hetcompute::runtime::init();
 9      auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
10
11      auto t2 = hetcompute::create_task([] {
12          while (1)
13          {
14              hetcompute::abort_on_cancel();
15              HETCOMPUTE_ILOG("Hello World from t2!\n");
16              usleep(100);
17          }
18      });
19
20      // Create dependencies.
21      t1->then(t2);
22
23      // Launch t1.
24      t1->launch();
25
26      // Wait for t1 to complete.
27      t1->wait_for();
28
29      // Cancel t1.
30      // Because it has already completed, it does not propagate its cancellation.
31      t1->cancel();
32
33      // If the two lines below are uncommented the wait_for will never return.
34      // t2->launch();
35      // t2->wait_for();
36
37      hetcompute::runtime::shutdown();
38      return 0;
39  }
In the example above, t1 is launched after a dependency is set up between t1 and t2. In line 31, t1 is
canceled after it has completed (the wait in line 27 guarantees this), so t1->cancel() has no effect.
Thus, nobody cancels t2, and t2->wait_for() in line 35 would never return if t2 were launched.
See Also
hetcompute::abort_on_cancel()
hetcompute::task<>::wait_for()
Use this method to check whether a task is canceled. If the task was canceled before it started executing
(via cancellation propagation, hetcompute::group::cancel(), or hetcompute::task<>::cancel()),
hetcompute::task<>::canceled() returns true.
If the task was canceled while it was executing (via hetcompute::group::cancel() or
hetcompute::task<>::cancel()), then hetcompute::task<>::canceled() returns true
only if the task is no longer executing and it exited via hetcompute::abort_on_cancel().
Finally, if the task completed successfully, hetcompute::task<>::canceled() always returns false.
Returns
Example:
#include <cassert>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    auto t = hetcompute::create_task([] {
        while (true)
        {
            hetcompute::abort_on_cancel();
            HETCOMPUTE_ILOG("Hello World!\n");
            usleep(1); // Sleep for one microsecond.
        }
    });

    auto g = hetcompute::create_group();

    // It will never fire.
    assert(t->canceled() == false);

    // Launch task.
    g->launch(t);

    // It will never fire.
    assert(t->canceled() == false);

    // Sleep for 10 microseconds.
    usleep(10);

    // It will never fire.
    assert(t->canceled() == false);

    // Cancel both the task and the group.
    t->cancel();
    g->cancel();

    // Might be false if the task has not executed abort_on_cancel() yet.
    // Might also be true if the task has already executed abort_on_cancel().
    HETCOMPUTE_ILOG("t->canceled() = %d", t->canceled() == true);

    try
    {
        // Wait for the task to transition to canceled state.
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << "threw " << e.what() << " due to task cancellation " << std::endl;
    }
    catch (...)
    {
        // Never reached.
    }

    // It will never fire.
    assert(t->canceled() == true);
    return 0;
}
See Also
hetcompute::task<>::cancel()
hetcompute::group::cancel()
Specifies that the current task should be deemed to finish only after the task on which this method is
invoked finishes. This method returns immediately.
Example
#include <algorithm>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Parallel mergesort using recursive fork-join parallelism.
// hetcompute::task<>::finish_after allows easy expression of the parallelism in the
// algorithm in a non-blocking manner, yielding better performance than
// blocking parallelization using hetcompute::task<>::wait_for.

const size_t GRANULARITY = 8192;

// Asynchronous mergesort, to be invoked in a task
template <typename Iterator, typename Compare>
void
mergesort(Iterator begin, Iterator end, Compare cmp)
{
    size_t n = std::distance(begin, end);
    if (n <= GRANULARITY)
    {
        std::sort(begin, end, cmp);
    }
    else
    {
        auto middle = begin;
        std::advance(middle, n / 2);
        auto left = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
        auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
        auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle, end, cmp); });
        // The left subtree and right subtree tasks must finish before the merge
        // task can execute
        left->then(merge);
        right->then(merge);
        merge->launch();
Exceptions
Note
If the application disables exceptions, this API terminates the application when the pointer to the task
is nullptr, or when it is invoked from outside a task or from within a hetcompute::pfor_each.
See Also
hetcompute::group::finish_after()
Use this method to check whether all the task parameters are bound. Returns true if the task has no
parameters. Remember that only bound tasks can be launched.
Returns
true – All the task parameters are bound. If the task has none, is_bound always returns true.
false – At least one of the task parameters is not bound.
Launches task.
This method informs the Qualcomm HetCompute runtime that the task is ready to execute as soon as there
is an available hardware context and after all its predecessors (both data- and control-dependent) have
executed.
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task t1.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    //...
    // Set up dependencies if needed.
    // ..

    // t1 is ready, launch it.
    t1->launch();

    // Wait for t1 to finish.
    t1->wait_for(); // Will not return until t1 finishes.
    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::group::launch(Code)
Note: The programmer is responsible for ensuring that there are no cycles in the task graph.
Parameters
Exceptions
Note
If the application disables exceptions, this API terminates the app if the successor has already been launched.
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create group.
    auto g = hetcompute::create_group("Hello World Group");

    // Create tasks t1 and t2.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World! from task t1\n"); });

    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World! from task t2\n"); });

    // Create dependency between t1 and t2.
    t1->then(t2);

    // Launch both t1 and t2 into g.
    g->launch(t1);
    g->launch(t2);

    // Wait until t1 and t2 finish.
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Output:
Hello World! from task t1
Hello World! from task t2
Example
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task t1.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    //...
    // Set up dependencies if needed.
    // ..

    // t1 is ready, launch it.
    t1->launch();

    // Wait for t1 to finish.
    t1->wait_for(); // Will not return until t1 finishes.
    hetcompute::runtime::shutdown();
    return 0;
}
Exceptions
Note
If the application disables exceptions, the API returns hc_error::HC_TaskCanceled in the above case.
Exceptions
hetcompute::aggregate_exception  If the task or any tasks on which it is dependent threw two or more exceptions.
Note
If the application disables exceptions, the API returns hc_error::HC_TaskAggregateFailure in the above case.
Exceptions
hetcompute::gpu_exception  If the task or any tasks on which it is dependent encountered a runtime error on the GPU.
Note
If the application disables exceptions, the API returns hc_error::HC_TaskGpuFailure in the above case.
Exceptions
hetcompute::hexagon_exception  If the task or any tasks on which it is dependent encountered a runtime error on the Hexagon DSP.
Note
If the application disables exceptions, the API returns hc_error::HC_TaskDspFailure in the above case.
Exceptions
Any other exception that may be thrown by the task or any tasks on which it is
dependent.
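The exception list above suggests ordering handlers from the most specific type to the most general when waiting on a task. The following is a minimal sketch in standard C++ only, without the HetCompute runtime: `canceled_error`, `run_body`, and `classify_wait` are hypothetical names, and `std::exception_ptr` stands in for whatever the runtime stores when a task body throws.

```cpp
#include <cassert>
#include <exception>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a library-specific exception type.
struct canceled_error : std::runtime_error
{
    canceled_error() : std::runtime_error("canceled") {}
};

// Runs a fake task body and captures any exception it throws,
// which is roughly what a runtime must do before a later wait.
std::exception_ptr run_body(int mode)
{
    try
    {
        if (mode == 1) throw canceled_error();
        if (mode == 2) throw std::logic_error("bug in task");
    }
    catch (...)
    {
        return std::current_exception();
    }
    return nullptr;
}

// Rethrows the stored failure (if any) and reports which handler caught it.
std::string classify_wait(std::exception_ptr failure)
{
    try
    {
        if (failure) std::rethrow_exception(failure);
        return "ok";
    }
    catch (const canceled_error&)
    {
        return "canceled"; // most specific handler first
    }
    catch (const std::exception&)
    {
        return "other";
    }
    catch (...)
    {
        return "unknown";
    }
}
```

Calling `classify_wait(run_body(1))` selects the most specific handler; placing `catch (const std::exception&)` first would hide it, because `canceled_error` derives from `std::runtime_error`.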
See Also
hetcompute::group::wait_for()
Public Types
• task_ptr ()
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.
• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.
• ~task_ptr ()
Destructor.
• template<typename T >
task_ptr & operator%= (T &&op)
Compound assignment operator %= with value operand.
• template<typename T >
task_ptr & operator&= (T &&op)
Compound assignment operator &= with value operand.
• template<typename T >
task_ptr & operator*= (T &&op)
Compound assignment operator *= with value operand.
• template<typename T >
task_ptr & operator+= (T &&op)
Compound assignment operator += with value operand.
• template<typename T >
task_ptr & operator-= (T &&op)
Compound assignment operator -= with value operand.
• template<typename T >
task_ptr & operator/= (T &&op)
Compound assignment operator /= with value operand.
• template<typename T >
task_ptr & operator^= (T &&op)
• template<typename T >
task_ptr & operator|= (T &&op)
Compound assignment operator |= with value operand.
Parameters
Constructs a task_ptr<ReturnType> object that manages the same task as other and resets
other. If other points to nullptr, the newly built object also points to nullptr.
Parameters
Destructor.
Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<ReturnType> objects managing it. If all task_ptr<ReturnType> objects managing
a task t go out of scope, all task<ReturnType>* pointing to t may be invalid.
Returns
Note: The operator must be applicable to the return value of the current task and the operand (if the
operand is also a task, its return value is used).
Parameters
Returns
A new task whose return value is the result of this operator and can be pointed to by this shared pointer
(same type of return value).
#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch a task that returns 73
    auto t = hetcompute::launch([]() { return 73; });

    // create a task whose return value is t's return value + 27.8
    // the new task will still be pointed to by t
    // the new task t is data dependent on the original task
    // and the return type stays the same (type coercion)
    t += 27.8;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
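The type coercion mentioned in the example comments can be checked with plain C++ arithmetic, independent of the HetCompute runtime. This sketch assumes the new task applies += to the int return value the same way the built-in operator does; compound_add is a hypothetical helper and not part of the SDK.

```cpp
#include <cassert>

// Hypothetical helper (not an SDK function): mimics what a task returning
// int would yield after `t += 27.8`, assuming built-in += semantics.
int compound_add(int task_value, double operand)
{
    // The sum is computed as double (73 + 27.8 = 100.8) and then
    // converted back to int, which discards the fractional part.
    task_value = static_cast<int>(task_value + operand);
    return task_value;
}
```

Under that assumption, compound_add(73, 27.8) yields 100 rather than 100.8, because the result keeps the original return type.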
Returns pointer to managed task. Do not call this member function if *this manages no task.
Exceptions
Note
If the application disables exceptions, this API terminates the app if the task pointer is nullptr.
Returns
Assigns the task managed by other to *this. If, before the assignment, *this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, *this will also not manage an object after the assignment.
Parameters
Returns
*this.
Resets *this so that it manages no object. If, before the assignment, *this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, *this will also not manage an object after the assignment.
Returns
*this.
Parameters
Public Types
• task_ptr ()
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.
• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.
• ~task_ptr ()
Destructor.
• static
HETCOMPUTE_CONSTEXPR_CONST
size_type arity = task_type::arity
Parameters
Constructs a task_ptr<ReturnType> object that manages the same task as other and resets
other. If other points to nullptr, the newly built object also points to nullptr.
Parameters
Destructor.
Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<ReturnType> objects managing it. If all task_ptr<ReturnType> objects managing
a task t go out of scope, all task<ReturnType>* pointing to t may be invalid.
Returns
Returns pointer to managed task. Do not call this member function if *this manages no task.
Returns
Assigns the task managed by other to *this. If, before the assignment, *this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, *this will also not manage an object after the assignment.
Parameters
Returns
*this.
Resets *this so that it manages no object. If, before the assignment, *this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, *this will also not manage an object after the assignment.
Returns
*this.
Parameters
Number of parameters.
Public Types
• task_ptr ()
Default constructor. Constructs a task_ptr<void> with no task<void>.
• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<void> with no task<void>.
• ~task_ptr ()
• task_type * get () const
Returns pointer to managed task.
10.5.1.8.2.3 hetcompute::task_ptr< void >::task_ptr ( task_ptr< void > const & other )
Constructs a task_ptr<void> object that manages the same task<void> as other. If other points
to nullptr, the newly built object also points to nullptr.
Parameters
Constructs a task_ptr<void> object that manages the same task as other and resets other. If
other points to nullptr, the newly built object also points to nullptr.
Parameters
Destructor.
Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<void> objects managing it. If all task_ptr<void> objects managing a task t go out of
scope, all task<void>* pointing to t may be invalid.
Returns
Returns pointer to managed task. Do not call this member function if *this manages no task.
Returns
10.5.1.8.3.3 task_ptr& hetcompute::task_ptr< void >::operator= ( task_ptr< void > const & other )
Assigns the task managed by other to *this. If, before the assignment, *this was the last
task_ptr<void> pointing to a task t, then the assignment will cause t to be destroyed. If other
manages no object, *this will also not manage an object after the assignment.
Parameters
Returns
*this.
10.5.1.8.3.4 task_ptr& hetcompute::task_ptr< void >::operator= ( task_ptr< void > && other )
Resets *this so that it manages no object. If, before the assignment, *this was the last
task_ptr<void> pointing to a task t, then the assignment will cause t to be destroyed. If other
manages no object, *this will also not manage an object after the assignment.
Returns
*this.
10.5.1.8.3.5 void hetcompute::task_ptr< void >::swap ( task_ptr< void > & other )
Parameters
template<>class hetcompute::task_ptr<>
Public Types
• task_ptr ()
Default constructor. Constructs a task_ptr<> with no task<>.
• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<> with no task<>.
• ~task_ptr ()
• task_type * get () const
Returns the pointer to the managed task.
• void reset ()
Resets the pointer to the managed task.
Returns the number of task_ptr<> objects managing the same object (including *this).
10.5.1.9.2.1 hetcompute::task_ptr<>::task_ptr ( )
Constructs a task_ptr<> object that manages the same task<> as other. If other points to
nullptr, the newly built object also points to nullptr.
Parameters
Constructs a task_ptr<> object that manages the same task as other and resets other. If other
points to nullptr, the newly built object also points to nullptr.
Parameters
10.5.1.9.2.5 hetcompute::task_ptr<>::~task_ptr ( )
Default destructor.
Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<> objects managing it. If all task_ptr<> objects managing a task t go out of scope, all
task<>* pointing to t may be invalid.
Returns
Returns
Returns pointer to managed task. Do not call this member function if *this manages no task.
Returns
Assigns the task managed by other to *this. If, before the assignment, *this was the last
task_ptr<> pointing to a task t, then the assignment will cause t to be destroyed. If other manages
no object, *this will also not manage an object after the assignment.
Parameters
Returns
*this.
Resets *this so that it manages no object. If, before the assignment, *this was the last task_ptr<>
pointing to a task t, then the assignment will cause t to be destroyed. *this will not manage an
object after the assignment.
Returns
*this.
Move-assigns the task managed by other to *this. other will manage no task after the assignment.
If, before the assignment, *this was the last task_ptr<> pointing to a task t, then the assignment will
cause t to be destroyed. If other manages no object, *this will also not manage an object after the
assignment.
Parameters
Returns
*this.
Resets pointer to managed task. If *this was the last task_ptr<> pointing to a task t, then reset()
causes t to be destroyed.
Returns
Parameters
Returns whether this is the only task_ptr object managing the underlying task.
Returns
Returns the number of task_ptr<> objects managing the same object (including *this).
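The counting rule described here resembles the one used by std::shared_ptr. The sketch below illustrates it with the standard library only, under the assumption that task_ptr<> counts owners in the same way; fake_task, counts, and observe_use_count are hypothetical names used for illustration.

```cpp
#include <cassert>
#include <memory>

// Stand-in for a task object; std::shared_ptr plays the role of
// task_ptr<> here (an assumption made for illustration only).
struct fake_task {};

struct counts
{
    long after_construction;
    long after_copy;
    long after_get;
};

counts observe_use_count()
{
    counts c{};
    auto t = std::make_shared<fake_task>();
    c.after_construction = t.use_count(); // a single owner

    auto t2 = t;                          // copying adds a second owner
    c.after_copy = t2.use_count();

    fake_task* raw = t.get();             // a raw pointer is not an owner
    (void)raw;
    c.after_get = t.use_count();
    return c;
}
```

The raw pointer obtained from get() leaves the count unchanged, which is why such a pointer may dangle once every owning pointer has gone out of scope.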
#include <atomic>
#include <cassert>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    std::atomic<bool> running(false);
    std::atomic<bool> finish(false);

    // Create task t.
    auto t = hetcompute::create_task([&running, &finish] {
        running = true;
        while (!finish)
        {
        };
    });

    // t's use_count should be 1.
    HETCOMPUTE_ILOG("After construction: t.use_count() = %zu\n", t.use_count());

    // Copy-construct t2 from t. t and t2's use_count is 2.
    auto t2 = t;
    HETCOMPUTE_ILOG("After copy-construction: t2.use_count() = %zu\n", t2.use_count());

    auto t3 = t.get();
    HETCOMPUTE_ILOG("After calling t.get(). t.use_count() = %zu\n", t.use_count());

    // t's use_count should be 2.
    HETCOMPUTE_ILOG("After t->wait_for: t.use_count() = %zu\n", t.use_count());

    assert(t3 != nullptr);
    HETCOMPUTE_UNUSED(t3);
    hetcompute::runtime::shutdown();
    return 0;
}
Output
Returns
HetCompute uses cooperative multitasking. Therefore, it cannot abort an executing task without help from
the task. In HetCompute, each executing task is responsible for periodically checking whether it should
abort. Thus, tasks call hetcompute::abort_on_cancel() to test whether they, or any of the groups
to which they belong, have been canceled. If true, hetcompute::abort_on_cancel() does not
return. Instead, it throws hetcompute::abort_task_exception, which the HetCompute runtime
catches. The runtime then transitions the task to a canceled state and propagates cancellation to the task’s
successors, if any.
Because hetcompute::abort_on_cancel() does not return if the task has been canceled, we
recommend that you use RAII to allocate and deallocate the resources used inside a task. If using RAII
in your code is not an option, surround hetcompute::abort_on_cancel() with try/catch, and
call throw from within the catch block after the cleanup code.
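The RAII recommendation can be illustrated without the HetCompute runtime. In this minimal sketch, check_cancel plays the role that abort_on_cancel() has in the text: it throws instead of returning once cancellation has been requested. All names here (abort_signal, cancel_requested, resource_guard, run_task_body) are hypothetical and not part of the SDK.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for the exception used to unwind a canceled task.
struct abort_signal : std::runtime_error
{
    abort_signal() : std::runtime_error("task aborted") {}
};

std::atomic<bool> cancel_requested{false};
int live_resources = 0;

// Throws instead of returning once cancellation has been requested.
void check_cancel()
{
    if (cancel_requested.load())
    {
        throw abort_signal();
    }
}

// RAII guard: the destructor runs during stack unwinding, so the
// resource is released even though check_cancel() does not return.
struct resource_guard
{
    resource_guard() { ++live_resources; }
    ~resource_guard() { --live_resources; }
};

// Returns true if the body was aborted by cancellation.
bool run_task_body(int cancel_at_iteration)
{
    cancel_requested = false;
    try
    {
        resource_guard guard;
        for (int i = 0; i < 1000; ++i)
        {
            if (i == cancel_at_iteration)
            {
                cancel_requested = true; // simulates an external cancel request
            }
            check_cancel();
        }
        return false;
    }
    catch (const abort_signal&)
    {
        return true; // a runtime would catch the exception at this point
    }
}
```

After run_task_body(3) returns, live_resources is back to zero: the guard released its resource during unwinding, with no explicit cleanup code in the loop.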
Exceptions
Note
If the application disables exceptions, this API terminates the app if called from outside a task. Another
caveat of using abort_on_cancel with exceptions disabled is that the application
code can get sandwiched between functions that are able to handle exceptions, resulting in improper
cleanup in the function where exceptions are disabled.
Example 1
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create task
    auto t = hetcompute::create_task([] {
        size_t num_iters = 0;
        while (1)
        {
            HETCOMPUTE_ILOG("Task has executed %zu iterations!", num_iters);

            // Check whether the task needs to stop execution.
            // Without abort_on_cancel() the task would never
            // return
            hetcompute::abort_on_cancel();

            usleep(30);
            num_iters++;
        }
    });

    // Create group g
    auto g = hetcompute::create_group("example group");

    // Launch t into g.
    g->launch(t);
    // We don't use t after launch(), so we can reset the shared pointer
    t.reset();

    // Wait for the task to execute a few iterations
    usleep(200);

    // Cancel group g, and wait for t to complete
    g->cancel();

    try
    {
        g->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        // Do nothing
    }
    catch (...)
    {
        // Do nothing
    }

    hetcompute::runtime::shutdown();
    return 0;
}
Output
Example 2
#include <cassert>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t = hetcompute::create_task([] {
        while (1)
        {
            try
            {
                hetcompute::abort_on_cancel();
            }
            catch (const hetcompute::abort_task_exception&)
            {
                //..do cleanup
                throw;
            }
            catch (...)
            {
                //..do cleanup
                throw;
            }
            // HETCOMPUTE_ILOG("Waiting to be canceled.\n");
            usleep(10);
        }
        assert(false); // This will never fire
    });

    // Launch t
    t->launch();

    // Wait for 20 microseconds.
    usleep(20);

    // Cancel task. Returns immediately.
    t->cancel();

    try
    {
        // Wait for the task to complete.
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    hetcompute::runtime::shutdown();
    return 0;
}
Use this method from within a running task to immediately abort it and all its successors.
hetcompute::abort_task() never returns. Instead, it throws
hetcompute::abort_task_exception, which the HetCompute runtime catches. The runtime then
transitions the task to a canceled state and propagates cancellation to the task’s successors, if any.
Exceptions
abort_task_exception  If called from a task that has been canceled via hetcompute::cancel(), or from a task
that belongs to a canceled group.
api_exception  If called from outside a task.
Note
If the application disables exceptions, this API terminates the app if called from outside a task.
Example
#include <cassert>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] {
        int i = 0;
        while (true)
        {
            HETCOMPUTE_ILOG("Hello World %d\n", i);
            sleep(1);
            i++;
            if (i == 10)
            {
                hetcompute::abort_task();
            }
        }
        // This will never fire
        assert(false);
    });

    auto t2 = hetcompute::create_task([] {
        // This will never fire
        assert(false);
    });

    t1 >> t2;

    // Launch tasks
    t1->launch();
    t2->launch();

    try
    {
        // Wait for t1 to complete.
        t1->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown when syncing with t1" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    try
    {
        // Returns immediately, t2 is canceled.
        t2->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown when syncing with t2" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    hetcompute::runtime::shutdown();
    return 0;
}
Output
Hello World 0
Hello World 1
..
Hello World 9
Parameters
Parameters
Used to enclose user-code that blocks on external activity and needs to be cancelable when an enclosing
task gets canceled.
A function/functor containing the blocking code bf is executed immediately. If cancellation is
asynchronously requested for the enclosing task while bf is currently executing, the cancellation handler
function/functor cf is asynchronously executed. Once bf completes, blocking throws
hetcompute::canceled_exception if task cancellation was requested.
If cancellation of the task had already been requested prior to the execution of blocking, blocking
immediately throws hetcompute::canceled_exception without executing bf or cf.
The programmer must write bf and cf to satisfy the following requirements:
Example
bf may block on a network access causing its thread to sleep. cf writes special data into the network
handle causing bf to unblock.
do_whole_bunch_of_work(x, ...);
});
Note: It is not required that the blocking construct be enclosed in a task. Without an enclosing task bf will
execute as a normal function and cf will never be invoked.
Parameters
Create a collapsed task out of Code and (optionally) bind all arguments.
Template Parameters
Parameters
code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Argument used to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.
Returns
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task out of a lambda and bind the argument.
    auto t1 = hetcompute::create_task([](int x) { return x; }, 27);
    // Launch t1.
    t1->launch();
    // Wait for t1 to finish and show the return value.
    HETCOMPUTE_ILOG("t1->copy_value() = %d", t1->copy_value()); // Expect 27;

    // Create a task out of a lambda and bind the argument later.
    auto t2 = hetcompute::create_task([](int x) { return x; });
    // Bind the argument before launch.
    t2->bind_all(42);
    // Launch t2.
    t2->launch();
    // Wait for t2 to finish and show the return value.
    HETCOMPUTE_ILOG("t2->copy_value() = %d", t2->copy_value()); // Expect 42;

    // Create a cpu kernel out of a lambda.
    auto cpu_kn = hetcompute::create_cpu_kernel([](int x) { return x; });
    // Create a task out of a cpu kernel and bind the argument.
    auto t3 = hetcompute::create_task(cpu_kn, 73);
    // Launch t3.
    t3->launch();
    // Wait for t3 to finish and show the return value.
    HETCOMPUTE_ILOG("t3->copy_value() = %d", t3->copy_value()); // Expect 73;

    // Create a collapsed task.
    // typeof(t4) = hetcompute::task_ptr<int(int)>
    auto t4 = hetcompute::create_task(
        [](int x) {
            // Create a task.
            // typeof(t) = hetcompute::task_ptr<int(int)>
            auto t = hetcompute::create_task([](int y) { return y; }, x);
            return t;
        },
        168);
    // Launch t4.
    t4->launch();
    // Wait for t4 to finish and show the return value.
    HETCOMPUTE_ILOG("t4->copy_value() = %d", t4->copy_value()); // Expect 168;

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
hetcompute::task<ReturnType(Args...)>::bind_all
Create a non-collapsed task out of Code and (optionally) bind all arguments.
Template Parameters
Parameters
code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Argument used to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.
Returns
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create a non-collapsed task.
    // typeof(t) = hetcompute::task_ptr<hetcompute::task_ptr<int(int)>(int)>
    auto t = hetcompute::create_task(
        hetcompute::do_not_collapse,
        [](int x) {
            // Create a task.
            // typeof(tt) = hetcompute::task_ptr<int(int)>
            auto tt = hetcompute::create_task([](int y) { return y; }, x);
            return tt;
        },
        271);

    // Launch t.
    t->launch();

    // Wait for t to finish and get the return value.
    auto tt = t->copy_value();

    // Launch tt.
    tt->launch();

    // Wait for tt to finish and show the return value.
    HETCOMPUTE_ILOG("tt->copy_value() = %d", tt->copy_value()); // Expect 271;

    hetcompute::runtime::shutdown();

    return 0;
}
See Also
hetcompute::task<ReturnType(Args...)>::bind_all
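The difference between collapsed and non-collapsed tasks can be pictured with nested callables in standard C++. This is only an analogy: it assumes, as the typeof comments in the examples suggest, that a collapsed task exposes the inner task's value directly, while a non-collapsed task hands the inner task back to the caller. The names inner_fn, outer_fn, run_non_collapsed, and collapse are hypothetical.

```cpp
#include <cassert>
#include <functional>

using inner_fn = std::function<int()>;
using outer_fn = std::function<inner_fn(int)>;

// Non-collapsed view: the caller receives the inner callable
// and has to run it explicitly as a second step.
int run_non_collapsed(const outer_fn& outer, int arg)
{
    inner_fn inner = outer(arg); // step 1: obtain the inner work
    return inner();              // step 2: run it
}

// Collapsed view: both steps are wrapped in a single callable
// that yields the inner result directly.
std::function<int(int)> collapse(outer_fn outer)
{
    return [outer](int arg) { return outer(arg)(); };
}
```

With an outer callable that returns an inner callable yielding its argument, run_non_collapsed(outer, 271) and collapse(outer)(271) produce the same value; only the number of steps visible to the caller differs.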
Template Parameters
Parameters
Returns
#include <hetcompute/hetcompute.hh>

// User-defined type
struct point2d
{
    // Member variables
    int _x;
    int _y;

    // first constructor
    explicit point2d(int x) : _x(x), _y(0) {}

    // second constructor
    point2d(int x, int y) : _x(x), _y(y) {}
};

int
main()
{
    hetcompute::runtime::init();

    // Create a value task that returns an object of built-in type (int) with value 2.
    auto t = hetcompute::create_value_task<int>(2);
    // Launch t.
    t->launch();
    // Wait for t to finish.
    t->wait_for();
    HETCOMPUTE_ILOG("t->copy_value() = %d", t->copy_value()); // Expect 2;

    int x = 5;
    // Create a value task that returns a point2d constructed by the first constructor.
    auto t1 = hetcompute::create_value_task<point2d>(x);
    // Launch t1.
    t1->launch();
    // Wait for t1 to finish.
    t1->wait_for();

    HETCOMPUTE_ILOG("t1->copy_value()._x = %d", t1->copy_value()._x); // Expect 5;
    HETCOMPUTE_ILOG("t1->copy_value()._y = %d", t1->copy_value()._y); // Expect 0;

    int y = 6;
    x = 7;
    // Create a value task that returns a point2d constructed by the second constructor.
    auto t2 = hetcompute::create_value_task<point2d>(x, y);
    // Launch t2.
    t2->launch();
    // Wait for t2 to finish.
    t2->wait_for();

    HETCOMPUTE_ILOG("t2->copy_value()._x = %d", t2->copy_value()._x); // Expect 7;
    HETCOMPUTE_ILOG("t2->copy_value()._y = %d", t2->copy_value()._y); // Expect 6;

    hetcompute::runtime::shutdown();

    return 0;
}
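The way the example selects a constructor from the number of arguments can be mimicked in standard C++. This sketch assumes that a value task simply stores the arguments and constructs the ReturnType from them when it runs; make_value_producer is a hypothetical helper and not part of the SDK.

```cpp
#include <cassert>
#include <functional>

// Same user-defined type as in the example above.
struct point2d
{
    int _x;
    int _y;
    explicit point2d(int x) : _x(x), _y(0) {}
    point2d(int x, int y) : _x(x), _y(y) {}
};

// Hypothetical analog of a value task: captures the constructor arguments
// now and builds the ReturnType only when the callable is invoked.
template <typename ReturnType, typename... Args>
std::function<ReturnType()> make_value_producer(Args... args)
{
    return [args...]() { return ReturnType(args...); };
}
```

Calling make_value_producer<point2d>(5)() uses the one-argument constructor, while make_value_producer<point2d>(7, 6)() uses the two-argument one, matching the values expected in the example.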
Specifies that the task invoking this function should be deemed to finish only after task finishes. This
method returns immediately.
If the invoking task is multi-threaded, the programmer must ensure that concurrent calls to finish_after from
within the task are properly synchronized.
Do not call this function if task is nullptr. Doing so causes a fatal error.
Parameters
task  Task after which the invoking task is deemed to finish. Cannot be nullptr.
Exceptions
Create a collapsed task out of Code, bind all arguments, if any exist (mandatory), and launch the task.
Template Parameters
Parameters
code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Argument used to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.
Returns
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task out of a lambda, bind the argument, and launch.
    auto t1 = hetcompute::launch([](int x) { return x; }, 27);
See Also
hetcompute::create_task(Code&&, Args&&...)
Create a non-collapsed task out of Code, bind all arguments, if any exist (mandatory), and launch the task.
Template Parameters
Parameters
code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Argument used to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.
Returns
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a non-collapsed task, bind the argument, and launch.
    // typeof(t) = hetcompute::task_ptr<hetcompute::task_ptr<int(int)>(int)>
    auto t = hetcompute::launch(hetcompute::do_not_collapse,
                                [](int x) {
                                    // Create a task, bind the argument and launch.
                                    // typeof(tt) = hetcompute::task_ptr<int(int)>
                                    auto tt = hetcompute::launch([](int y) { return y; }, x);
                                    return tt;
                                },
                                271);

    // Wait for t to finish and get the return value.
    auto tt = t->copy_value();

    // Launch tt.
    tt->launch();
    // Wait for tt to finish and show the return value.
    HETCOMPUTE_ILOG("tt->copy_value() = %d", tt->copy_value()); // Expect 271;

    hetcompute::runtime::shutdown();
    return 0;
}
See Also
Returns
true – The pointer is nullptr (*this does not manage a task).
false – The pointer is not nullptr (*this manages a task).
Returns
true – The pointer is nullptr (*this does not manage a task).
false – The pointer is not nullptr (*this manages a task).
Returns
true – Task a is not the same as task b.
false – Task a is the same as task b.
See Also
See Also
Note: the operator must be applicable to the return values of task t1 and task t2.
Parameters
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch two tasks that return 73 and 27.8
    auto t1 = hetcompute::launch([]() { return 73; });
    auto t2 = hetcompute::launch([]() { return 27.8; });

    // create a task whose return value is t1's return value + t2's return value
    // t is data dependent on t1 and t2
    // t's return type will be the same as t2's (type promotion for +)
    auto t = t1 + t2;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
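The type-promotion remark in the example above follows the ordinary C++ rule for adding an int to a double. The following is a minimal standard C++ sketch of that rule only; it does not use the SDK, and the names lhs_type, rhs_type, sum_type, and add_results are illustrative assumptions rather than HetCompute API.

```cpp
#include <type_traits>
#include <utility>

// Illustrative aliases (not HetCompute API): the value types returned by t1 and t2.
using lhs_type = int;
using rhs_type = double;

// The type of lhs + rhs under the usual arithmetic conversions.
using sum_type = decltype(std::declval<lhs_type>() + std::declval<rhs_type>());
static_assert(std::is_same<sum_type, double>::value,
              "int + double is evaluated as double");

// Plain-function equivalent of the addition performed on the two return values.
constexpr sum_type add_results(lhs_type a, rhs_type b) { return a + b; }
```

Under this rule, a combined task built from an int-returning task and a double-returning task would be expected to yield a double, which matches the comment in the example.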
Note: the operator must be applicable to the return value of task t1 and operand op2.
Parameters
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch a task that returns -73
    auto t1 = hetcompute::launch([]() { return -73; });

    // create a task whose return value is t1's return value + 100
    // t is data dependent on t1
    auto t = t1 + 100;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
The new task is launched automatically by the runtime once the data is ready.
Note: the operator must be applicable to the return value of operand op1 and task t2.
Parameters
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch a task that returns -73
    auto t2 = hetcompute::launch([]() { return -73; });

    // create a task whose return value is 100 + t2's return value
    // t is data dependent on t2
    auto t = 100 + t2;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
Note: the operator must be applicable to the return value of task t.
Parameters
Returns
#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create a task that returns 73
    auto t = hetcompute::create_task([]() { return 73; });
    // launch the task
    t->launch();

    // create a task whose return value is the negation of the return value of t
    // t1 is data dependent on t
    auto t1 = -t;

    // wait for t1 to finish and display the return value
    std::cout << "The return value of t1 is: " << t1->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
See Also
Returns
Returns
Returns
true – Task a is the same as task b.
false – Task a is not the same as task b.
hetcompute::task<>::then
See Also
11 Buffers Reference API
The Qualcomm HetCompute buffers API provides the user with a runtime-managed heterogeneous data
structure. Tasks on the CPU, GPU and Hexagon devices can share data using a Qualcomm HetCompute
buffer. The following categories provide the API reference for buffers and related functionality.
• class hetcompute::device_set
Captures a set of device types.
Enumerations
• enum hetcompute::device_type {
    cpu_big = HETCOMPUTE_DEVICE_TYPE_CPU_BIG,
    cpu_little = HETCOMPUTE_DEVICE_TYPE_CPU_LITTLE,
    cpu = HETCOMPUTE_DEVICE_TYPE_CPU_BIG | HETCOMPUTE_DEVICE_TYPE_CPU_LITTLE,
    gpu = HETCOMPUTE_DEVICE_TYPE_GPU,
    dsp = HETCOMPUTE_DEVICE_TYPE_DSP }
The system devices capable of executing HetCompute tasks.
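Because cpu is defined as the bitwise OR of the big and little CPU values, a device set can be pictured as a bitmask. The sketch below is a standalone model of that idea in standard C++, not the SDK implementation: the bit values, the device_sketch namespace, and the helpers set_add, set_remove, set_negate, and on are all assumptions made for illustration.

```cpp
#include <cstdint>

namespace device_sketch {

// Hypothetical bit values chosen for illustration; the SDK defines the real
// HETCOMPUTE_DEVICE_TYPE_* constants.
enum device_type : std::uint32_t {
    cpu_big    = 1u << 0,
    cpu_little = 1u << 1,
    cpu        = cpu_big | cpu_little,  // either CPU cluster
    gpu        = 1u << 2,
    dsp        = 1u << 3
};

using mask = std::uint32_t;
constexpr mask all_devices = cpu | gpu | dsp;

// Set union, set difference, and complement within the known devices.
constexpr mask set_add(mask a, mask b)    { return a | b; }
constexpr mask set_remove(mask a, mask b) { return a & ~b; }
constexpr mask set_negate(mask a)         { return ~a & all_devices; }

// True if the set contains any bit of the queried device.
constexpr bool on(mask set, mask device)  { return (set & device) != 0; }

} // namespace device_sketch
```

In this model, set_add, set_remove, and set_negate play the roles that add(), remove(), and negate() play for hetcompute::device_set in the examples of this section.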
Functions
• device_set ()
Default constructor; produces an empty set.
Friends
11.1.1.1.1.1 hetcompute::device_set::device_set ( )
Parameters
Example:
Parameters
Returns
Parameters
Returns
Example:
hetcompute::device_set a{hetcompute::cpu};
hetcompute::device_set b{hetcompute::gpu};
a.add(b);
assert(true == a.on_cpu());
assert(true == a.on_gpu());
assert(false == a.on_dsp());
Returns
Copy constructor.
Copy assignment.
Move constructor.
Move assignment.
Returns
Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
a.negate();
assert(false == a.on_cpu());
assert(false == a.on_gpu());
assert(true == a.on_dsp());
Returns
Returns
Returns
Returns
Returns
Parameters
Returns
Parameters
Returns
Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
hetcompute::device_set b{hetcompute::gpu, hetcompute::dsp};
a.remove(b);
assert(true == a.on_cpu());
assert(false == a.on_gpu());
assert(false == a.on_dsp());
Returns
Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
assert(a.to_string() == "cpu gpu ");
Parameters
Returns
11.2 Buffers
Classes
Functions
• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (size_t num_elems, device_set const &likely_devices)
Creates a buffer of datatype T of the requested size.
• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (T *preallocated_ptr, size_t num_elems, device_set const
&likely_devices)
Creates a buffer of datatype T of the requested size from a pre-allocated pointer.
• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (memregion const &mr, size_t num_elems, device_set
const &likely_devices)
Creates a buffer of datatype T of the requested size from a hetcompute::memregion.
• template<typename T >
bool hetcompute::operator!= (::hetcompute::buffer_ptr< T > const &b,::std::nullptr_t)
• template<typename T >
hetcompute::buffer_ptr::const_iterator
hetcompute::buffer_ptr::cbegin()
hetcompute::buffer_ptr::cend()
hetcompute::buffer_ptr::iterator
hetcompute::buffer_ptr::begin()
hetcompute::buffer_ptr::end()
Public Types
• using data_type = T
• typedef buffer_iterator< T > iterator
Random access iterator providing mutable access to the buffer data.
• buffer_ptr ()
Create a buffer_ptr with no underlying buffer storage created.
• buffer_ptr & operator= (buffer_ptr< typename std::remove_const< T >::type > const &other)
Copy assignment: points to the underlying buffer of other.
Creates a buffer_ptr with no underlying buffer storage. The resulting buffer_ptr compares equal to nullptr.
Example:
hetcompute::buffer_ptr<int> b;
assert(b == nullptr);
Copy constructor: creates a new buffer_ptr pointing to the same underlying buffer as other. A
buffer_ptr<const T> instance may be constructed from an instance of buffer_ptr<T>.
Parameters
Example:
hetcompute::buffer_ptr<int> b = hetcompute::create_buffer<int>(10);
hetcompute::buffer_ptr<int> x(b);
hetcompute::buffer_ptr<const int> y(b);
Acquires the underlying buffer for read-only access by the host code. The host code may read the existing
contents of the buffer after this call, until the host code releases access using the release() method.
The call blocks until any conflicting operations complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*(), the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_*() call establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_*() calls succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls return with failure if the first
acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent recursive acquire_*() call succeeds if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only after the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer when the buffer is being modified. The acquisition will
be blocked when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also be blocked when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.
See Also
hetcompute::create_buffer()
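The recursive acquisition rules described above can be sketched as a small single-threaded state machine. This is a toy model in standard C++ under stated assumptions, not the HetCompute implementation: the names access and host_access_model are hypothetical, blocking on concurrent tasks is not modeled, and std::logic_error stands in for the SDK's own error reporting.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Illustrative names only; this is not the HetCompute implementation.
enum class access { none, read_only, write_invalidate, read_write };

class host_access_model {
public:
    // Returns true if the acquisition succeeds under the rules described above.
    bool acquire(access requested) {
        assert(requested != access::none);
        if (count_ == 0) {
            established_ = requested;  // first acquire establishes the access type
        } else if (established_ == access::read_only &&
                   requested != access::read_only) {
            return false;              // wi/rw cannot follow an established ro
        }
        ++count_;
        return true;
    }

    // Returns the number of acquisitions that remain to be released.
    std::size_t release() {
        if (count_ == 0) {
            throw std::logic_error("buffer is not acquired by the host");
        }
        if (--count_ == 0) {
            established_ = access::none;  // host access fully released
        }
        return count_;
    }

    access established() const { return established_; }

private:
    access established_ = access::none;
    std::size_t count_ = 0;
};
```

In this model a failed acquisition does not increase the count, so only successful acquisitions need a matching release, and the established access type is cleared once the count returns to zero.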
Attempts to acquire the underlying buffer for read-write access by the host code. Returns true if
successful, false on failure to acquire for read-write due to a prior read-only acquisition by the host code.
If successful, the host may read the prior contents of the buffer and update the contents until the host code
releases access using the release() method.
The call blocks until any conflicting operations complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*(), the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_*() call establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_*() calls succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls return with failure if the first
acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent recursive acquire_*() call succeeds if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only after the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer when the buffer is being modified. The acquisition will
be blocked when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also be blocked when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.
See Also
hetcompute::create_buffer()
Attempts to acquire the underlying buffer for write-invalidate access by the host code. Returns true if
successful, false on failure to acquire for write-invalidate due to a prior read-only acquisition by the host
code. If successful, the prior contents of the buffer are lost after this call. The host code may write the
buffer data (and read back what it wrote) after this call, until the host code releases access using the
release() method.
The call blocks until any conflicting operations complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*(), the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_*() call establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_*() calls succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls return with failure if the first
acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent recursive acquire_*() call succeeds if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only after the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer when the buffer is being modified. The acquisition will
be blocked when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also be blocked when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.
See Also
hetcompute::create_buffer()
Same as operator[] but also checks if the buffer data has been previously made host accessible via this
buffer_ptr and performs array bounds check. However, does not guarantee that the buffer data is currently
host accessible. The programmer must ensure that the buffer will not be concurrently accessed by tasks that
may invalidate the host accessible data (for example, by not launching any task that accesses a buffer_ptr to
this buffer until the host access is complete).
See saved_host_data() for host access criteria.
Parameters
index The index to the element to lookup inside the buffer data.
Exceptions
Note
If exceptions are disabled by application, the API will terminate the application if data is not
host-accessible.
Returns
See Also
saved_host_data()
Get iterator to the start of the buffer data. Allows mutable access to the buffer data.
Returns
Get const iterator to the start of the buffer data. Restricts to immutable access.
Returns
Get const iterator to the end of the buffer data. Restricts to immutable access.
Returns
Get iterator to the end of the buffer data. Allows mutable access to the buffer data.
Returns
Gets a pointer to the host accessible data of the underlying buffer, allocating the host accessible storage if
necessary. Note that this call does not ensure that the buffer data is currently host accessible. For example,
data updates by a concurrent task on the underlying buffer may not be visible yet via the host accessible
data pointer.
Returns
See Also
saved_host_data() for fast lookup of a previously queried pointer to the host accessible data.
acquire_ro()
acquire_wi()
acquire_rw()
release()
to allow the buffer to be read or written by the host code, in addition to querying a pointer to the host
accessible data.
Unlike the acquire calls, which may sometimes block when there is concurrent task access to the buffer, this
method can be called at any time without blocking to determine the pointer to the host accessible data.
Copy assignment: points to the underlying buffer of other. A buffer_ptr<const T> instance may
be assigned to from an instance of buffer_ptr<T>. If this buffer_ptr was the last one pointing to its
underlying buffer, the underlying buffer will get deallocated once the buffer_ptr is copy-assigned and points
to a different underlying buffer.
Parameters
Example:
hetcompute::buffer_ptr<int> b = hetcompute::create_buffer<int>(10);
hetcompute::buffer_ptr<int> x;
hetcompute::buffer_ptr<const int> y;
x = b;
y = b;
If the buffer data is host accessible or being accessed as a CPU task parameter, it performs an array index
lookup. Undefined behavior for host accesses if the programmer has not previously ensured that the buffer
data is host accessible.
See saved_host_data() for host access criteria.
Parameters
index The index to the element to lookup inside the buffer data.
Returns
See Also
at()
saved_host_data()
Decrements the host acquire count, releasing the buffer from host access when the count goes to zero.
release() needs to be called once for every successful recursive call to acquire_*(), after which the
buffer is released from access by the host code. The host code may not read or write the buffer contents
after the final release() call, until the host code acquires the buffer again.
The release() call never blocks.
The call returns the number of recursive acquisitions remaining to be released before the host code will
release access to the buffer. That is, the host code releases access when the return value of this call is 0.
Exceptions
hetcompute::api_exception if called when the buffer is not currently acquired by the host code.
Note
If exceptions are disabled by application, terminates the application when the buffer is not currently
acquired by the host code.
Fast lookup of a saved pointer to the host accessible data of the underlying buffer. The pointer may be saved
by either a previous host_code() call or a previous acquire_*() call.
Note that this call does not ensure that the buffer data is currently host accessible via this buffer_ptr. For
example, data updates by a concurrent task on the underlying buffer may not be visible yet via the host
accessible data pointer.
Returns
nullptr, if
i) the host accessible data pointer has not previously been queried via this buffer_ptr, or
ii) this buffer_ptr is a nullptr.
!=nullptr, if
See Also
host_code()
acquire_ro()
acquire_wi()
acquire_rw()
release()
for explicit host synchronization.
The number of elements of datatype T in the underlying buffer pointed to by this buffer_ptr.
Returns
Allows this buffer_ptr to be interpreted as a texture of a given format, dimensionality and size when passed
as an argument to a gpu_kernel expecting a hetcompute::graphics::texture_ptr. The
interpretation applies to the current buffer_ptr, not to the buffer as a whole. That is, multiple
buffer_ptrs to the same buffer may simultaneously be interpreted as textures of different formats,
dimensions and sizes.
Template Parameters
Parameters
Returns
This buffer_ptr.
Use in a kernel parameter declaration to indicate that a buffer parameter will be input-only (read-only) for
the kernel.
Use in a kernel parameter declaration to indicate that a buffer parameter will be used both as an input and
an output (read-write) by the kernel.
Use in a kernel parameter declaration to indicate that a buffer parameter will be output-only
(write-invalidate) for the kernel.
write
void f() {
    scope_acquire_ro<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}
See Also
buffer_ptr::acquire_ro();
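The scope guard above relies on RAII: the guard acquires in its constructor and releases in its destructor, so every exit path from the function releases exactly once. The following standalone sketch shows that mechanism in standard C++. The types toy_buffer and toy_scope_acquire_ro and the function read_with_guard are hypothetical stand-ins, not SDK classes, and the toy buffer only counts acquisitions.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a buffer handle; it only counts host acquisitions.
struct toy_buffer {
    std::size_t acquire_count = 0;
    void acquire_ro() { ++acquire_count; }
    std::size_t release() {
        assert(acquire_count > 0);
        return --acquire_count;
    }
};

// Acquires on construction and releases on destruction.
class toy_scope_acquire_ro {
public:
    explicit toy_scope_acquire_ro(toy_buffer& b) : buf_(b) { buf_.acquire_ro(); }
    ~toy_scope_acquire_ro() { buf_.release(); }
    toy_scope_acquire_ro(toy_scope_acquire_ro const&) = delete;
    toy_scope_acquire_ro& operator=(toy_scope_acquire_ro const&) = delete;

private:
    toy_buffer& buf_;
};

// Returns the acquire count observed while the guard is alive.
std::size_t read_with_guard(toy_buffer& b, bool leave_early) {
    toy_scope_acquire_ro guard(b);
    if (leave_early) {
        return b.acquire_count;  // the guard releases on this path
    }
    return b.acquire_count;      // and on this one
}
```

The same shape would apply to guards for read-write and write-invalidate access; only the acquire call in the constructor would differ.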
write
void f() {
    scope_acquire_rw<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}
See Also
buffer_ptr::acquire_rw();
write
void f() {
    scope_acquire_wi<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}
See Also
buffer_ptr::acquire_wi();
Parameters
Returns
The optional parameters allow for the following variants to this call.
hetcompute::create_buffer(size_t num_elems);
hetcompute::create_buffer(size_t num_elems,
hetcompute::device_set const& likely_devices);
Creates a buffer of datatype T of the requested size from a pre-allocated pointer. The pre-allocated pointer
provides initial storage and potentially initial data for the buffer. Fatal error if num_elems is 0.
Template Parameters
Parameters
Returns
The optional parameter allows for the following variants of this call.
create_buffer(T* preallocated_ptr,
size_t num_elems);
create_buffer(T* preallocated_ptr,
size_t num_elems,
device_set const& likely_devices);
Parameters
Returns
See Also
hetcompute::memregion
The optional parameters allow for the following variants to this call.
hetcompute::create_buffer(hetcompute::memregion const& mr);
• class hetcompute::glbuffer_memregion
Creates inter-operability with an OpenGL buffer.
• class hetcompute::ion_memregion
Allocates ION memory on platforms that support it.
• class hetcompute::main_memregion
Allocates aligned memory from the platform main memory.
• class hetcompute::memregion
Base class for all mem-regions.
Functions
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion())
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion(memregion const
&))
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion
&operator=(memregion const &))
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion(memregion &&))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD (main_memregion())
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD
(main_memregion(main_memregion const &))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD (main_memregion
&operator=(main_memregion const &))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD
(main_memregion(main_memregion &&))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD (ion_memregion())
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD
(ion_memregion(ion_memregion const &))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD (ion_memregion
&operator=(ion_memregion const &))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD
(ion_memregion(ion_memregion &&))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion())
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion(glbuffer_memregion const &))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion &operator=(glbuffer_memregion const &))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion(glbuffer_memregion &&))
• bool hetcompute::ion_memregion::is_cacheable () const
Returns whether the ION memregion is cacheable.
Variables
• internal::internal_memregion * hetcompute::memregion::_int_mr
• static constexpr size_t hetcompute::main_memregion::s_default_alignment = 4096
Friends
• class hetcompute::memregion::::hetcompute::internal::memregion_base_accessor
Creates inter-operability with an OpenGL buffer. The user may have an external OpenGL buffer.
Qualcomm HetCompute may access the OpenGL buffer once inter-operability has been set up using an
instance of this class. A derived class of hetcompute::memregion.
See Also
hetcompute::memregion
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion())
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion(glbuffer_memregion const &))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion &operator=(glbuffer_memregion
const &))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion(glbuffer_memregion &&))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion &operator=(glbuffer_memregion
&&))
Allocates ION memory on platforms that support it. The ION memory can be allocated as cacheable or
non-cacheable. A derived class of hetcompute::memregion.
See Also
hetcompute::memregion
• HETCOMPUTE_DELETE_METHOD (ion_memregion())
• HETCOMPUTE_DELETE_METHOD (ion_memregion(ion_memregion const &))
• HETCOMPUTE_DELETE_METHOD (ion_memregion &operator=(ion_memregion const &))
• HETCOMPUTE_DELETE_METHOD (ion_memregion(ion_memregion &&))
• HETCOMPUTE_DELETE_METHOD (ion_memregion &operator=(ion_memregion &&))
• bool is_cacheable () const
Returns whether the ION memregion is cacheable.
Allocates aligned memory from the platform main memory. The default alignment is 4096 bytes to get
page-aligned allocation. A derived class of hetcompute::memregion.
See Also
hetcompute::memregion
• HETCOMPUTE_DELETE_METHOD (main_memregion())
• HETCOMPUTE_DELETE_METHOD (main_memregion(main_memregion const &))
• HETCOMPUTE_DELETE_METHOD (main_memregion &operator=(main_memregion const
&))
• HETCOMPUTE_DELETE_METHOD (main_memregion(main_memregion &&))
• HETCOMPUTE_DELETE_METHOD (main_memregion &operator=(main_memregion &&))
Base class for all mem-regions. The only common feature across the mem-regions is that they all have a
size in bytes. The user constructs a mem-region of the appropriate type to allocate the corresponding type
of specialized device-memory (hetcompute::main_memregion and
hetcompute::ion_memregion) or to create inter-operability with data from an external framework
(hetcompute::glbuffer_memregion).
Mem-regions provide RAII semantics:
• The specialized memory is allocated or the interop created when the user constructs the mem-region
object of the appropriate type.
• The user keeps the allocated memory or the interop alive by keeping the mem-region object alive.
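The RAII behavior described above can be illustrated with standard C++ alone. The following is a minimal sketch, not the HetCompute implementation: aligned_region is a hypothetical class that allocates aligned memory when it is constructed, releases it when it is destroyed, and deletes its copy operations, much like the deleted methods listed for the mem-region classes. The 4096-byte default mirrors the default alignment documented for hetcompute::main_memregion.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>

// Hypothetical stand-in for a mem-region; NOT part of the HetCompute API.
class aligned_region
{
public:
    // Allocates size_in_bytes usable bytes aligned to `alignment`
    // (assumed to be a power of two).
    explicit aligned_region(std::size_t size_in_bytes, std::size_t alignment = 4096)
        : size_(size_in_bytes), raw_(new unsigned char[size_in_bytes + alignment]), aligned_(nullptr)
    {
        void*       p     = raw_.get();
        std::size_t space = size_in_bytes + alignment;
        // Over-allocating by `alignment` bytes guarantees std::align succeeds.
        aligned_ = std::align(alignment, size_in_bytes, p, space);
    }

    // Not copyable: exactly one object keeps the allocation alive.
    aligned_region(aligned_region const&) = delete;
    aligned_region& operator=(aligned_region const&) = delete;

    std::size_t size() const { return size_; }
    void*       data() const { return aligned_; }

private:
    std::size_t                      size_;
    std::unique_ptr<unsigned char[]> raw_; // released automatically on destruction
    void*                            aligned_;
};

// Returns true if the region starts at a multiple of `alignment`.
inline bool is_aligned(aligned_region const& r, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(r.data()) % alignment == 0;
}
```

As with a mem-region, the memory stays valid for exactly as long as the aligned_region object is alive; no explicit free call is needed.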
Note
The base class hetcompute::memregion is not user-constructible. Instead, the user constructs an
object of a derived class of hetcompute::memregion that provides the desired allocation or interop
functionality.
See Also
hetcompute::main_memregion
hetcompute::cl2svm_memregion
hetcompute::ion_memregion
hetcompute::glbuffer_memregion
• ∼memregion ()
Destructor.
• HETCOMPUTE_DELETE_METHOD (memregion())
• HETCOMPUTE_DELETE_METHOD (memregion(memregion const &))
• HETCOMPUTE_DELETE_METHOD (memregion &operator=(memregion const &))
• HETCOMPUTE_DELETE_METHOD (memregion(memregion &&))
• HETCOMPUTE_DELETE_METHOD (memregion &operator=(memregion &&))
Protected Attributes
• internal::internal_memregion ∗ _int_mr
Friends
• class ::hetcompute::internal::memregion_base_accessor
Constructor, wraps an existing OpenGL buffer to allow inter-operability with Qualcomm HetCompute. The
size of the OpenGL buffer is automatically determined and set as the size of the mem-region.
Parameters
Parameters
Constructor, uses allocated ION memory. The user is responsible for ensuring the lifetime of the ION
memory, and handling the deallocation. The lifetime of the user allocated memory MUST exceed any use
of the memory via the memregion object (say, if the memregion is used by a buffer).
Parameters
ptr Pointer to the externally allocated region. The block at ptr of size sz bytes
must be fully contained within an existing HetCompute ion_memregion.
sz Size of the allocation in bytes.
cacheable true if ptr points to a cacheable ION region, false otherwise.
Constructor, uses allocated ION memory, with the associated file descriptor. The user is responsible for
ensuring the lifetime of the ION memory, and handling the deallocation. The lifetime of the user allocated
memory MUST exceed any use of the memory via the memregion object (say, if the memregion is used by
a buffer). This variant is more flexible because it enables the construction of an ion_memregion using ION
memory that was (a) allocated by another process, or (b) allocated by the same process without using a
hetcompute::ion_memregion.
Parameters
Parameters
Constructor, uses user-allocated memory. The user is responsible for ensuring the lifetime of the memory,
and handling the deallocation. The lifetime of the user allocated memory MUST exceed any use of the
memory via the memregion object (say, if the memregion is used by a buffer).
Parameters
Gets the file descriptor associated with the pointer to the allocated ION memory.
Returns
Returns
Get the size of the mem-region in bytes. Applies to all derived mem-region classes.
Returns
Returns
Returns
Returns
12 Graphics Reference API
Functions
The following OpenCL QCOM extensions are adopted for this feature:
• Extract Derivative Image Plane: cl_qcom_extract_image_plane
• QCOM Supported Compressed Image: cl_qcom_compressed_image
• QCOM Other Non-Conventional Images [NV12, TP10]: cl_qcom_other_image
Creates a HetCompute single-plane derivative texture from a multi-plane parent HetCompute texture. The
parent texture is created with create_texture(...) using ION memory.
Template Parameters
Parameters
Returns
Note
Parameters
Returns
Parameters
Returns
Parameters
Returns
Returns
Parameters
Returns
Parameters
Enumerations
Supported image addressing modes in Qualcomm HetCompute. Each mode can be mapped to an OpenCL
sampler addressing mode.
Supported image filter modes in Qualcomm HetCompute. Each mode can be mapped to an OpenCL sampler
filter mode.
Supported image formats in Qualcomm HetCompute. Each format can be mapped to an OpenCL image format
and pixel channel.
13 Data Structures Reference API
Qualcomm HetCompute provides a set of concurrent data structures that are optimized for performance
using internal Qualcomm HetCompute primitives.
Typedefs
• typedef internal::blfq::blfq_size_t< T,(sizeof(size_t)>=sizeof(T))> hetcompute::bounded_lfqueue< T >::container_type
• typedef T hetcompute::bounded_lfqueue< T >::value_type
Functions
Public Types
• typedef internal::blfq::blfq_size_t< T,(sizeof(size_t)>=sizeof(T))> container_type
• typedef T value_type
Constructs the Bounded Lock-Free Queue, given the log (base 2) of the maximum number of entries it can
contain.
Parameters
Pop from the queue, placing the popped value in the result.
Parameters
Returns
True if the pop was successful; false if the queue was empty.
Parameters
Returns
True if the push was successful; false if the queue was full.
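The constructor argument and the boolean results of push and pop can be illustrated with a small ring buffer. The following is only a sketch under stated assumptions: bounded_ring is a hypothetical, single-threaded class that is neither lock-free nor the HetCompute implementation. It shows how a log (base 2) size translates into a capacity and an index mask, and when push and pop report failure.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical single-threaded ring buffer; NOT lock-free and NOT the
// HetCompute implementation. It only mirrors the documented contract.
template <typename T>
class bounded_ring
{
public:
    // log_size is the base-2 logarithm of the capacity, so log_size = 3
    // gives room for 8 entries.
    explicit bounded_ring(std::size_t log_size)
        : slots_(std::size_t(1) << log_size), mask_(slots_.size() - 1), head_(0), tail_(0)
    {
    }

    // Returns true if the value was stored; false if the queue was full.
    bool push(T const& value)
    {
        if (tail_ - head_ == slots_.size())
            return false;
        slots_[tail_ & mask_] = value; // power-of-two capacity allows masking
        ++tail_;
        return true;
    }

    // Returns true and writes to result if a value was available;
    // false if the queue was empty.
    bool pop(T& result)
    {
        if (head_ == tail_)
            return false;
        result = slots_[head_ & mask_];
        ++head_;
        return true;
    }

    std::size_t capacity() const { return slots_.size(); }

private:
    std::vector<T> slots_;
    std::size_t    mask_;
    std::size_t    head_; // index of the next entry to pop
    std::size_t    tail_; // index of the next free slot
};
```

Because the capacity is a power of two, an index wraps with a single bitwise AND rather than a modulo, which is one plausible reason for taking the size as a logarithm.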
Typedefs
Functions
Unbounded Lock-Free FIFO queue that is capable of dynamically growing and shrinking.
Public Types
Constructs the Unbounded Lock-Free Queue, given the log (base 2) of the size of the static array within
each node.
Parameters
Pop from the queue, placing the popped value in the result.
Parameters
Returns
True if the pop was successful; false if the queue was empty.
Push value into the queue. Since the queue is capable of growing, a push always succeeds.
Parameters
Returns
True
14 Data Sharing and Storage Reference API
Classes
Scheduler-local storage allows sharing of information across tasks on a per-scheduler basis, much as
thread-local storage does for threads. A scheduler_storage_ptr<T> stores a pointer-to-T (T∗).
In contrast to task_storage_ptr, the contents are persistent across tasks. In contrast to
thread_storage_ptr, the contents are guaranteed not to change while a task is suspended. To
maintain these guarantees, the runtime system is free to create new objects of type T whenever needed.
See Also
task_storage_ptr
thread_storage_ptr
Example
#include <algorithm>
#include <iterator>

#include <hetcompute/hetcompute.hh>

template <size_t N>
struct image_scratchpad
{
    image_scratchpad() { std::fill(std::begin(edge_image), std::end(edge_image), 0); }
    char edge_image[N];
};

namespace
{
    const hetcompute::scheduler_storage_ptr<image_scratchpad<4096>> image_buffers;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    int const N = 200;

    auto g = hetcompute::create_group();
    for (int i = 1; i < N; ++i)
    {
        g->launch([i] {
            // fill image buffer, which is reused across tasks
            for (auto& slot : image_buffers->edge_image)
                slot = i & 0xff;
        });
    }
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}
Public Types
Friends
Classes
Scoped storage allows sharing of information on a per-task basis, much as thread-local storage does for
threads. A scoped_storage_ptr<Scope,T,Allocator> stores a pointer-to-T (T∗) for a given
scope Scope<T,A>, which defines lifetime, persistence, etc. Allocator controls the allocation of T
objects.
Public Types
• scoped_storage_ptr ()
• T ∗ get () const
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr(scoped_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(scoped_storage_ptr const
&))
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(scoped_storage_ptr const
&) volatile)
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(T ∗const &))
• operator bool () const
• operator pointer_type () const
• bool operator! () const
• bool operator!= (T ∗const &other)
• T & operator∗ () const
• T ∗ operator-> () const
• bool operator== (T ∗const &other)
14.3.1.1.1.1 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::scoped_storage_ptr ( )
Exceptions
Note
14.3.1.1.2.1 template<template< class, class > class Scope, typename T, class Allocator> T∗
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::get ( ) const
Returns
Stored pointer value; a new object of type T is created and stored, if it has not been stored before.
14.3.1.1.2.2 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator bool ( ) const
[explicit]
14.3.1.1.2.3 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator pointer_type ( ) const
14.3.1.1.2.4 template<template< class, class > class Scope, typename T, class Allocator> bool
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator! ( ) const
Returns
Always false.
14.3.1.1.2.5 template<template< class, class > class Scope, typename T, class Allocator> T&
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator∗ ( ) const
Returns
Reference to the value pointed to by the stored pointer; a new object of type T is created and stored, if it
has not been stored before.
14.3.1.1.2.6 template<template< class, class > class Scope, typename T, class Allocator> T∗
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator-> ( ) const
Returns
Stored pointer value; a new object of type T is created and stored, if it has not been stored before.
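The accessors above share one behavior: the first access creates the object, so the pointer never tests as empty. The following is a minimal sketch of that create-on-first-access idea in standard C++. lazy_storage_ptr and counter are hypothetical names, and scope handling and custom allocators are deliberately left out; this is not the HetCompute implementation.

```cpp
#include <cassert>
#include <memory>

// Hypothetical illustration of create-on-first-access; NOT the HetCompute
// implementation. Scope handling and custom allocators are omitted.
template <typename T>
class lazy_storage_ptr
{
public:
    lazy_storage_ptr() = default;
    lazy_storage_ptr(lazy_storage_ptr const&) = delete;
    lazy_storage_ptr& operator=(lazy_storage_ptr const&) = delete;

    // Returns the stored pointer, default-constructing a T on first use.
    T* get() const
    {
        if (!object_)
            object_.reset(new T());
        return object_.get();
    }

    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }

    // get() never returns nullptr, so operator! is always false.
    explicit operator bool() const { return get() != nullptr; }
    bool operator!() const { return get() == nullptr; }

private:
    mutable std::unique_ptr<T> object_; // mutable: created inside const accessors
};

// Small payload type used to exercise the pointer.
struct counter
{
    int hits = 0;
};
```

A caller can therefore write p->hits on a freshly declared lazy_storage_ptr<counter> p without any prior assignment, which matches the documented behavior of get(), operator∗, and operator->.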
Classes
Task-local storage enables allocation of task-specific data, much as thread-local storage does for threads.
The value of a task_storage_ptr is local to a task. A task_storage_ptr<T> stores a
pointer-to-T (T∗).
See Also
thread_storage_ptr
Example
#include <hetcompute/hetcompute.hh>

namespace
{
    hetcompute::task_storage_ptr<int> storage;
} // namespace

void func();

// Logs the task-local value, then increments it.
void
func()
{
    HETCOMPUTE_ILOG("%d", *storage);
    ++*storage;
}

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    for (int i = 0; i < 10; ++i)
    {
        g->launch([i] {
            int v = i;
            storage = &v; // each task stores a pointer to its own local variable
            func();
            if (v != i + 1)
            {
                HETCOMPUTE_ILOG("error");
            }
            func();
            if (v != i + 2)
            {
                HETCOMPUTE_ILOG("error");
            }
        });
    }
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}
Public Types
• task_storage_ptr ()
• task_storage_ptr (T ∗const &ptr)
• task_storage_ptr (T ∗const &ptr, void(∗dtor)(T ∗))
• T ∗ get () const
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr(task_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr &operator=(task_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr &operator=(task_storage_ptr const &)
volatile)
• operator bool () const
• operator pointer_type () const
• bool operator! () const
• bool operator!= (T ∗const &other) const
• T & operator∗ () const
• T ∗ operator-> () const
• task_storage_ptr & operator= (T ∗const &ptr)
• bool operator== (T ∗const &other) const
Exceptions
Note
If exceptions are disabled in the application, an error is logged if the task_storage_ptr could not be reserved.
Exceptions
Parameters
Exceptions
Parameters
Returns
Returns
Returns
Returns
Parameters
Classes
task_storage_ptr
scheduler_storage_ptr
Example
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::thread_storage_ptr<size_t> s_tls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group("test");
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t* p1 = s_tls_state.get();
            t->launch();
            t->wait_for();
            size_t* p2 = s_tls_state.get();
            // cannot assume that p1 == p2
            (void)p1;
            (void)p2;
        });
    }

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Public Types
Friends
15 Exceptions Reference API
This chapter describes all exceptions thrown by the Qualcomm HetCompute runtime system.
Exceptions can be disabled in the library by compiling the application with the compile-time flag
-DHETCOMPUTE_DISABLE_EXCEPTIONS=1. A general caveat of disabling exceptions in the HetCompute
library is that not all APIs return an error; some may terminate the application on API-level errors, such as
a failed input parameter check or an unmet precondition for executing the API.
15.1 Exceptions
Classes
• class hetcompute::abort_task_exception
• class hetcompute::aggregate_exception
• class hetcompute::api_exception
• class hetcompute::canceled_exception
• class hetcompute::dsp_exception
• class hetcompute::error_exception
• class hetcompute::gpu_exception
• class hetcompute::hetcompute_exception
• class hetcompute::tls_exception
hetcompute::abort_on_cancel()
hetcompute::abort_task()
Returns
Implements hetcompute::hetcompute_exception.
Aggregate exception that encapsulates all exceptions thrown in a task graph leading up to a point-of-use of a
task or group, e.g., wait_for, copy_value, or move_value.
Example
auto g = hetcompute::create_group();
g->launch([]{
    std::string().at(1); // throws std::out_of_range exception
});
g->launch([]{
    std::string().at(1); // throws std::out_of_range exception
});
try {
    // Point-of-use of group g
    g->wait_for();
} catch (hetcompute::aggregate_exception& e) {
    // Rethrow and handle each collected exception in turn
    while (e.has_next()) {
        try {
            e.next(); // throws
        } catch (const std::out_of_range&) {
            // Do something
        } catch(...) {
            // Not reached
        }
    }
} catch (...) {
    // Not reached
}
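The has_next()/next() loop in the example above can be reproduced with standard C++ facilities. The following is a minimal sketch, assuming nothing about how hetcompute::aggregate_exception is implemented: exception_bundle and count_out_of_range are hypothetical names, and the bundle simply stores std::exception_ptr values and rethrows them one at a time.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical container; NOT hetcompute::aggregate_exception.
class exception_bundle
{
public:
    void add(std::exception_ptr e) { pending_.push_back(e); }
    bool has_next() const { return index_ < pending_.size(); }

    // Rethrows the next stored exception; call only while has_next() is true.
    void next() { std::rethrow_exception(pending_[index_++]); }

private:
    std::vector<std::exception_ptr> pending_;
    std::size_t                     index_ = 0;
};

// Runs two failing "tasks", collects what they throw, and returns how many
// std::out_of_range exceptions were recovered from the bundle.
inline int count_out_of_range()
{
    exception_bundle bundle;
    for (int task = 0; task < 2; ++task)
    {
        try
        {
            (void)std::string().at(1); // throws std::out_of_range
        }
        catch (...)
        {
            bundle.add(std::current_exception()); // capture instead of propagating
        }
    }

    int seen = 0;
    while (bundle.has_next())
    {
        try
        {
            bundle.next(); // rethrows one captured exception
        }
        catch (std::out_of_range const&)
        {
            ++seen;
        }
    }
    return seen;
}
```

Rethrowing inside a nested try block lets each captured exception be handled by its concrete type, which is why the documented example wraps e.next() in its own try/catch.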
Returns
Returns
Implements hetcompute::hetcompute_exception.
Represents a misuse of the Qualcomm HetCompute API, for example, invalid values passed to a function.
This exception should cause termination of the application (future releases will behave differently).
• api_exception (std::string msg, const char ∗filename, int lineno, const char ∗funcname)
Exception thrown to indicate that a task/group was canceled. Thrown at points-of-use such as wait_for,
copy_value, and move_value.
Example
auto t = hetcompute::create_task([]{...});
t->cancel();
t->launch();
try {
    // Point-of-use of task t
    t->wait_for();
} catch (const hetcompute::canceled_exception&) {
    // Do something
} catch (...) {
    // Not reached
}
Returns
Implements hetcompute::hetcompute_exception.
Returns
Implements hetcompute::hetcompute_exception.
• error_exception (std::string msg, const char ∗filename, int lineno, const char ∗fname)
• virtual const char ∗ file () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ function () const HETCOMPUTE_NOEXCEPT
• virtual int line () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ message () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ type () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT
Returns
Implements hetcompute::hetcompute_exception.
Returns
Implements hetcompute::hetcompute_exception.
Destructor.
Returns
Indicates that thread-local storage (TLS) has been misused or has become corrupted. This exception should
cause termination of the application.
• tls_exception (std::string msg, const char ∗filename, int lineno, const char ∗funcname)
15.2 ErrorCodes
Enumerations
Error codes returned by HetCompute SDK APIs. These error codes apply only if the application has
disabled exceptions. If the application has enabled exceptions, the HetCompute library throws exceptions
instead of returning error codes.
Enumerator
16 Affinity Management API
The Qualcomm HetCompute affinity API enables the programmer to request which CPU cores should
execute tasks.
Using the affinity API requires including the following header file:
#include <hetcompute/affinity.hh>
To get a detailed description of all the APIs, please follow the following link:
Note
The current version sets the affinity only for the CPU cores; the GPU and the Hexagon DSP are excluded.
• struct hetcompute_affinity_settings_t
• class hetcompute::affinity::settings
Typedefs
Enumerations
Functions
• bool hetcompute::internal::soc::is_big_little_cpu ()
• bool hetcompute::internal::soc::is_this_big_core ()
• bool hetcompute::affinity::settings::operator!= (const settings &rhs) const
• bool hetcompute::affinity::settings::operator== (const settings &rhs) const
• void hetcompute::affinity::reset ()
• void hetcompute::affinity::settings::reset_pin_threads ()
• void hetcompute::affinity::set (const settings as)
• void hetcompute::affinity::settings::set_cores (cores cores_attribute)
• void hetcompute::affinity::settings::set_mode (mode md)
• void hetcompute::affinity::settings::set_pin_threads ()
Data fields
• void reset_pin_threads ()
• void set_cores (cores cores_attribute)
• void set_mode (mode md)
• void set_pin_threads ()
hetcompute_affinity_execute()
include/hetcompute/affinity.h for the C Affinity API
Enumeration type to select the cores where affinity settings are applied in a big-little system. In
homogeneous systems, all is always used.
Enumerator
C Affinity API
See Also
Enumerator
Enumerator
hetcompute_affinity_mode_override_local_setting Set a default affinity for all cpu tasks for which
affinity was not specified. For example, if the user sets the affinity in allow_local_setting mode to
big, the big cores will execute all tasks except those marked as little
Enumeration type to select the affinity mode in big-little systems. In homogeneous systems, mode, like
cores, is ignored.
Enumerator
allow_local_setting Set a default affinity for all CPU tasks for which affinity was not specified. For
example, if the user sets the affinity to big in allow_local_setting mode, the big cores will execute
all tasks except those marked as little.
override_local_setting Set the affinity for all CPU tasks regardless of local task/scope settings. For
example, if the user sets the affinity to little in override_local_setting mode, the little cores will
execute all CPU tasks, including those marked as big.
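The difference between the two modes comes down to which setting wins when a task carries its own affinity. The following is a minimal sketch of that rule only, not the HetCompute scheduler logic: the enumerations mirror the documented names, while effective_cores and its pointer parameter are hypothetical and introduced purely for illustration.

```cpp
#include <cassert>

// Hypothetical types that mirror the documented enumerators; this is NOT
// the HetCompute scheduler logic.
enum class cores { all, big, little };
enum class mode { allow_local_setting, override_local_setting };

// Picks the cores for one CPU task. `local` is nullptr when the task or
// kernel carries no affinity of its own.
inline cores effective_cores(cores global, mode md, cores const* local)
{
    if (md == mode::override_local_setting)
        return global; // the global setting wins regardless of local settings
    return local != nullptr ? *local : global; // the global setting is only a default
}
```

With a global setting of big in allow_local_setting mode, a task marked little resolves to little and an unmarked task resolves to big; in override_local_setting mode both resolve to the global setting.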
Parameters
16.1.4.2 hetcompute::affinity::settings::∼settings ( )
Destructor
Parameters
Returns
Parameters
Resets the thread pool affinity so that all Qualcomm HetCompute threads can run on any core of the system.
Parameters
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings =
        hetcompute::affinity::settings(hetcompute::affinity::cores::big, false,
                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    hetcompute::runtime::shutdown();
    return 0;
}
Returns
Returns
Returns
Resets the thread pool affinity so that all Qualcomm HetCompute threads can run on any core of the system.
Parameters
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings =
        hetcompute::affinity::settings(hetcompute::affinity::cores::big, false,
                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    auto g = hetcompute::create_group(__FUNCTION__);

    auto k_wout_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task without kernel affinity attribute."); });

    auto k_with_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task with kernel affinity attribute"); });
    k_with_attrib.set_little();

    // k_with_attrib kernel will run on a LITTLE core
    g->launch(k_with_attrib);

    // k_wout_attrib can run on any core
    g->launch(k_wout_attrib);

    g->wait_for();

    // Set the affinity to the LITTLE cores without pinning in
    // allow_local_setting mode
    hetcompute::affinity::set(
        hetcompute::affinity::settings(hetcompute::affinity::cores::little, false,
                                       hetcompute::affinity::mode::allow_local_setting));

    // k_wout_attrib task will run on a LITTLE core because the kernel has no
    // individual affinity specification
    g->launch(k_wout_attrib);

    // Set the affinity to the big cores with pinning in override_local_setting
    // mode by reading the current affinity and then updating the different fields
    auto affinity = hetcompute::affinity::get();

    // Update the cores from LITTLE to big
    affinity.set_cores(hetcompute::affinity::cores::big);

    // Enable thread pinning
    affinity.set_pin_threads();

    // Update the mode from allow_local_setting to override_local_setting in the
    // settings
    affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

    // Update the affinity with the modified affinity object
    hetcompute::affinity::set(affinity);

    // The second run of k_with_attrib will run on a big core because the
    // affinity mode is override_local_setting and global affinity settings are
    // obeyed
    g->launch(k_with_attrib);

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Parameters
17 Miscellaneous
17.1 Interoperability
17.2 Legacy
Functions
• void hetcompute::runtime::init ()
• void hetcompute::runtime::shutdown ()
18 Class Documentation
Public Types
• pipeline ()
Constructor.
• virtual ∼pipeline ()
Destructor.
• template<typename... Args> std::enable_if<!internal::pipeline_utility::check_gpu_kernel< Args...>::has_gpu_kernel, void >::type add_stage (Args &&...args)
Add a CPU stage.
• template<typename... Args> std::enable_if< internal::pipeline_utility::check_gpu_kernel< Args...>::has_gpu_kernel, void >::type add_stage (Args &&...args)
Add a GPU stage.
UserData The type of the pipeline context data, or empty, e.g.,
hetcompute::pattern::pipeline<size_t> or
hetcompute::pattern::pipeline<>.
Constructor.
Destructor.
Reimplemented from hetcompute::pattern::pipeline< UserData...>.
Copy constructor.
Move constructor.
Parameters
See Also
Parameters
See Also
80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 446
Qualcomm® Snapdragon™ Heterogeneous Compute SDK Class Documentation
80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 447
Index
~dsp_kernel
  hetcompute::dsp_kernel< int(*)(Args...)>, 264
~hetcompute_exception
  hetcompute::hetcompute_exception, 427
~pipeline
  hetcompute::beta::pattern::pipeline, 445
  hetcompute::pattern::pipeline, 204
~pipeline_context
  hetcompute::pipeline_context< UserData >, 212
  hetcompute::pipeline_context<>, 213
~pipeline_context_base
  hetcompute::pipeline_context_base, 214
~settings
  Affinity Settings, 434
~stage_input
  hetcompute::stage_input, 220
~task_ptr
  hetcompute::task_ptr< ReturnType >, 319
  hetcompute::task_ptr< ReturnType(Args...)>, 325
  hetcompute::task_ptr< void >, 328
  hetcompute::task_ptr<>, 331

abort_on_cancel
  Tasks, 335
abort_task
  Tasks, 337
acquire_ro
  hetcompute::buffer_ptr, 372
acquire_rw
  hetcompute::buffer_ptr, 373
acquire_wi
  hetcompute::buffer_ptr, 374
add
  hetcompute::device_set, 362
  hetcompute::group, 232, 233
add_stage
  hetcompute::beta::pattern::pipeline, 446
addressing_mode
  Texture Data Types, 399
Affinity Management API, 430
Affinity Settings, 431
  ~settings, 434
  all, 433
  allow_local_setting, 434
  big, 433
  cores, 433
  execute, 434
  get, 435
  get_cores, 435
  get_mode, 435
  get_pin_threads, 435
  hetcompute_affinity_cores_big, 433
  hetcompute_affinity_cores_little, 433
  hetcompute_affinity_mode_override_local_setting, 434
  hetcompute_affinity_pin_threads_true, 434
  hetcompute_affinity_cores_t, 433
  hetcompute_affinity_execute, 435
  hetcompute_affinity_get, 436
  hetcompute_affinity_mode_t, 433
  hetcompute_affinity_pin_threads_t, 434
  hetcompute_affinity_reset, 436
  hetcompute_affinity_set, 436
  hetcompute_func_ptr_t, 433
  is_this_big_core, 437
  little, 433
  mode, 434
  operator==, 438
  override_local_setting, 434
  reset, 438
  reset_pin_threads, 438
  set, 438
  set_cores, 440
  set_mode, 440
  set_pin_threads, 440
  settings, 434
all
  Affinity Settings, 433
allow_local_setting
  Affinity Settings, 434
args_tuple
  hetcompute::task< ReturnType(Args...)>, 302
  hetcompute::task_ptr< ReturnType(Args...)>, 324
arity
  hetcompute::task< ReturnType(Args...)>, 304
  hetcompute::task_ptr< ReturnType(Args...)>, 326
at
  hetcompute::buffer_ptr, 375

begin
  hetcompute::buffer_ptr, 375
  hetcompute::range_base, 289, 290
big
wait_for
  hetcompute::group, 244
  hetcompute::task<>, 314
what
  hetcompute::abort_task_exception, 422
  hetcompute::aggregate_exception, 424
  hetcompute::canceled_exception, 425
  hetcompute::dsp_exception, 425
  hetcompute::error_exception, 426
  hetcompute::gpu_exception, 426
  hetcompute::hetcompute_exception, 427
Bibliography
[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE
Computer, 29:66–76, 1995. 152
[2] Gene M. Amdahl. Validity of the single-processor approach to achieving large scale computing
capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485, Reston, VA, April 1967.
148
[3] Christopher Barton, Călin Cascaval, and José Nelson Amaral. A characterization of shared data access
patterns in UPC programs. In Proceedings of the 19th international conference on Languages and
compilers for parallel computing, LCPC’06, pages 111–125, Berlin, Heidelberg, 2007.
Springer-Verlag. 153
[4] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising.
In Computer Vision and Pattern Recognition, 2005. 155
[5] Calin Cascaval, Seth Fowler, Pablo Montesinos-Ortego, Wayne Piekarski, Mehrdad Reshadi, Behnam
Robatmili, Michael Weber, and Vrajesh Bhavsar. Zoomm: a parallel web browser engine for
multicore mobile devices. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and
practice of parallel programming, PPoPP ’13, pages 271–280, 2013. 150
[6] Stephanie Coleman and Kathryn S. McKinley. Tile Size Selection Using Cache Organization and
Data Layout. In Proceedings of the ACM SIGPLAN Conference on Programming Languages Design
and Implementation (PLDI ’95), La Jolla, CA, June 1995. SIGPLAN. 153
[7] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on
Computers, C-21(9):948–960, Sept. 1972. 149
[8] Benedict R. Gaster and Lee Howes. Can GPGPU programming be liberated from the data-parallel
bottleneck? IEEE Computer, pages 42–52, 2012. 150
[9] John L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM, 31(5):532–533, May 1988. 149
[10] John L. Hennessy and David A. Patterson. Computer Architecture A Quantitative Approach. Morgan
Kaufmann, second edition, 1996. 152
[11] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. IEEE Computer, July 2008.
149
[12] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maurer, and D. Shippy. Introduction to the
Cell multiprocessor. IBM Journal of Research and Development, 49(4.5):589–604, 2005.
http://dx.doi.org/10.1147/rd.494.0589. 151
[13] Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali, and Calin Caşcaval. How much
parallelism is there in irregular applications? In Proceedings of the 14th ACM SIGPLAN symposium
on Principles and practice of parallel programming, PPoPP ’09, pages 3–14, 2009. 153
[14] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel
Programming. Addison-Wesley, 2013. 151
[15] NEON intrinsics. http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html, Apr 2013. 149
[16] Qualcomm Research Silicon Valley,
http://developer.qualcomm.com/snapdragon-heterogeneous-compute-sdk. Qualcomm®
Snapdragon™ Heterogeneous Compute SDK User’s Manual, 1.0.0 edition, Oct 2015. 150
[17] Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. In
Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and
Implementation (PLDI ’98), pages 38–49, June 1998. 153
[18] Anne Rogers and Keshav Pingali. Process decomposition through locality of reference. SIGPLAN
Notices, 24(7):69–80, July 1989. 152
[19] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to
Reduce Multiprocessor Cache Miss Rates. In ICPP, pages II–266–II–270, 1990. 152
[20] Michael Wolfe. More Iteration Space Tiling. In Proceedings of Supercomputing ’89, pages 655–664,
Reno, NV, November 1989. ACM. 153