Snapdragon Hetcompute SDK 1.0.0 Refman en

Qualcomm® Snapdragon™ Heterogeneous Compute SDK

Documentation and Interface Specification


80-P2432-1 B

May 2, 2018

Qualcomm® Snapdragon™ Heterogeneous Compute SDK is a product of Qualcomm Technologies, Inc.


Other Qualcomm products referenced herein are products of Qualcomm Technologies, Inc. or its other
subsidiaries.
This technical data may be subject to U.S. and international export, re-export, or transfer ("export") laws.
Diversion contrary to U.S. and international law is strictly prohibited.
© 2017 Qualcomm Technologies, Inc. All rights reserved.
Submit technical questions at:
[email protected]

Qualcomm is a trademark of Qualcomm Incorporated, registered in the United States and other countries.
All Qualcomm Incorporated trademarks are used with permission. Other product and brand names may be
trademarks or registered trademarks of their respective owners.

Qualcomm Technologies, Inc.


5775 Morehouse Drive
San Diego, CA 92121-1714
U.S.A.
Revision History
Revision  Date        Description
A         March 2018  Version 1.0

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 3


Qualcomm® Snapdragon™ Heterogeneous Compute SDK CONTENTS

Contents
1 Introduction
1.1 Purpose
1.2 Scope
1.3 Conventions
1.4 References
1.5 Technical Assistance
1.6 Acronyms

2 Installing Snapdragon™ Heterogeneous Compute SDK
2.1 Verifying your installation
2.2 Integrating HetCompute with Android NDK Applications
2.3 Hexagon DSP Support
2.4 OpenCL C++ Support

3 Snapdragon™ Heterogeneous Compute SDK Migration Guide
3.1 Migration from Symphony System Manager SDK 1.x to Snapdragon™ Heterogeneous Compute SDK 1.0

4 Getting Started
4.1 Writing your first HetCompute program
4.1.1 Building a HetCompute program using ndk-build

5 User Guide
5.1 Overview
5.1.1 Writing a HetCompute Application
5.1.1.1 Parallel vector addition
5.1.1.2 Parallel sort
5.1.1.3 Parallelism using tasks
5.1.2 Executing a HetCompute Application
5.2 Parallel Programming Patterns
5.2.1 Overview of HetCompute Patterns
5.2.2 Parallel Iteration
5.2.2.1 Parallel For Loop
5.2.2.2 Parallel Transformation
5.2.3 Parallel Reduction
5.2.4 Parallel Scan
5.2.5 Parallel Divide-and-Conquer
5.2.6 Parallel Sorting
5.2.7 Advanced Topics for Patterns
5.2.7.1 Pattern Object
5.2.7.2 Tuner
5.2.8 Pipeline
5.2.8.1 Overview
5.2.8.2 HetCompute Pipeline Example
5.2.8.3 HetCompute Pipeline Details
5.2.8.4 Launch the HetCompute pipeline
5.2.8.5 Heterogeneous Pipeline (HetCompute Beta Feature)
5.2.8.6 Heterogeneous Pipeline Details
5.3 Introduction to Tasks
5.3.1 Kernels: The Path to Heterogeneity
5.3.1.1 Revisiting Hello World
5.3.1.2 Creating a Kernel
5.3.1.3 Setting Kernel Attributes
5.3.1.4 Kernels: Advanced Topics
5.3.1.5 Poly-kernels (Beta feature)
5.3.1.6 hetcompute::range and hetcompute::index
5.3.1.7 Using hetcompute::range<N> to represent ND-Range (in OpenCL)
5.3.1.8 hetcompute::range<1>
5.3.1.9 hetcompute::range<2>
5.3.1.10 hetcompute::range<3>
5.3.1.11 Strided Ranges
5.3.2 Creating Tasks
5.3.2.1 Create Tasks Using Lambda Expressions
5.3.2.2 Create Tasks Using Classes
5.3.2.3 Create Tasks Using Function Pointers
5.3.3 Task Pointers
5.3.4 Life of a HetCompute Task
5.3.4.1 The Green Line to Successful Completion
5.3.4.2 The Red Line to Cancellation
5.3.5 Launching Tasks
5.3.6 Task Dependencies
5.3.6.1 Control Dependencies
5.3.6.2 Data Dependencies
5.3.6.3 Heterogeneous Task Graphs
5.3.7 Task Groups
5.3.7.1 Group Creation
5.3.7.2 Launching Tasks or Kernels to Groups
5.3.7.3 Group Cancellation
5.3.8 Waiting for Tasks
5.3.9 Exceptions and Cancellation
5.3.9.1 Aggregate Exception
5.3.9.2 GPU/DSP Exception
5.3.9.3 Cancellation
5.3.9.4 Synchronization Points where Exceptions are Observable
5.3.9.5 Canceling a Task
5.3.10 Blocking Tasks
5.3.10.1 Blocking Kernel
5.3.10.2 hetcompute::blocking
5.3.11 Algebraic Operations on Tasks
5.3.12 Task-Pointer Collapsing
5.3.13 Unleashing Asynchrony
5.3.13.1 finish_after
5.3.13.2 Asynchronous APIs
5.3.13.3 Cancellation
5.3.13.4 Summary
5.4 Buffers
5.4.1 Basic Usage of Buffers
5.4.2 Using Buffers with Tasks
5.4.2.1 Buffers with CPU Tasks
5.4.2.2 Buffers with GPU Tasks
5.4.2.3 Buffers with DSP Tasks
5.4.3 Synchronized and Concurrent Use
5.4.3.1 Synchronized Access to Buffers Across Host Code and Tasks
5.4.3.2 Concurrent Access by Tasks
5.4.4 Creating Buffers
5.4.4.1 With Storage Fully Managed by HetCompute
5.4.4.2 With User-provided Initial Storage and Data
5.4.4.3 With a Memory Region
5.4.5 Performance and Storage Optimizations When Using Buffers
5.4.5.1 Explicit Synchronization with Host Code
5.4.5.2 Providing Device Hints
5.4.6 Memory Regions
5.5 Textures
5.5.1 QCOM Extended Image format
5.6 Data Structures
5.6.1 HetCompute Lock-Free Queue
5.7 Storage
5.7.1 Task-Local Storage
5.7.2 Scheduler-Local Storage
5.7.3 Thread-Local Storage
5.8 Affinity
5.8.1 Overriding Local Affinity Settings
5.9 Heterogeneous Computing in Action
5.10 Interoperability
5.10.1 Safe Points
5.10.2 Using HetCompute with the Fork() System Call
5.10.3 Using HetCompute with TLS-aware Libraries
5.10.4 Distributed Computing using HetCompute
5.10.5 Avoid the Use of C++ iostream and stringstream Libraries

6 Parallel Processing Tutorial
6.1 Abstract
6.2 Parallel Speedups
6.3 Parallel Programming Paradigms
6.3.1 Data parallelism (SIMD)
6.3.2 Task parallelism (MIMD)
6.3.3 Braided parallelism
6.3.4 Pipeline parallelism or Streaming
6.4 Parallel Programming Patterns
6.5 Optimizations
6.5.1 Cache locality
6.5.2 Minimizing wait time and synchronization
6.5.3 Load balancing
6.6 Conclusions

7 Image Processing Tutorial
7.1 Abstract
7.2 Image Processing Filter
7.3 Parallel Image Processing using HetCompute
7.3.1 Naive Parallelization
7.3.2 Tiling for Parallelization
7.3.3 Parallelization using patterns

8 Point Kernels (Beta feature)

9 Patterns Reference API
9.1 Parallel For Loop
9.1.1 Class Documentation
9.1.1.1 class hetcompute::pattern::pfor
9.1.1.2 class hetcompute::pattern::pfor< hetcompute::internal::pointkernel::pointkernel< RT, PKType...>, T2 >
9.1.1.3 class hetcompute::pattern::pfor< T1, void >
9.1.2 Function Documentation
9.1.2.1 create_pfor_each
9.1.2.2 pfor_each
9.1.2.3 pfor_each
9.1.2.4 pfor_each
9.1.2.5 pfor_each_async
9.1.2.6 pfor_each_async
9.1.2.7 pfor_each_async
9.1.2.8 pfor_each_async
9.2 Parallel Transformation
9.2.1 Class Documentation
9.2.1.1 class hetcompute::pattern::ptransformer
9.2.2 Function Documentation
9.2.2.1 create_ptransform
9.2.2.2 ptransform
9.2.2.3 ptransform
9.2.2.4 ptransform
9.2.2.5 ptransform_async
9.2.2.6 ptransform_async
9.2.2.7 ptransform_async
9.3 Parallel Reduction
9.3.1 Class Documentation
9.3.1.1 class hetcompute::pattern::preducer
9.3.2 Function Documentation
9.3.2.1 create_preduce
9.3.2.2 preduce
9.3.2.3 preduce
9.3.2.4 preduce
9.3.2.5 preduce_async
9.3.2.6 preduce_async
9.3.2.7 preduce_async
9.4 Parallel Scan
9.4.1 Class Documentation
9.4.1.1 class hetcompute::pattern::pscan
9.4.2 Function Documentation
9.4.2.1 create_pscan_inclusive
9.4.2.2 pscan_inclusive
9.4.2.3 pscan_inclusive_async
9.5 Parallel Divide-and-Conquer
9.5.1 Class Documentation
9.5.1.1 class hetcompute::pattern::pdivide_and_conquerer
9.5.2 Function Documentation
9.5.2.1 create_pdivide_and_conquer
9.5.2.2 pdivide_and_conquer
9.5.2.3 pdivide_and_conquer
9.5.2.4 pdivide_and_conquer
9.5.2.5 pdivide_and_conquer_async
9.5.2.6 pdivide_and_conquer_async
9.5.2.7 pdivide_and_conquer_async
9.6 Parallel Sorting
9.6.1 Class Documentation
9.6.1.1 class hetcompute::pattern::psorter
9.6.2 Function Documentation
9.6.2.1 create_psort
9.6.2.2 psort
9.6.2.3 psort
9.6.2.4 psort_async
9.6.2.5 psort_async
9.7 Pipeline
9.7.1 Class Documentation
9.7.1.1 class hetcompute::iteration_lag
9.7.1.2 class hetcompute::iteration_rate
9.7.1.3 class hetcompute::parallel_stage
9.7.1.4 class hetcompute::pattern::pipeline
9.7.1.5 class hetcompute::pipeline_context< UserData >
9.7.1.6 class hetcompute::pipeline_context<>
9.7.1.7 class hetcompute::pipeline_context_base
9.7.1.8 class hetcompute::serial_stage
9.7.1.9 class hetcompute::sliding_window_size
9.7.1.10 class hetcompute::stage_input
9.7.2 Typedef Documentation
9.7.2.1 serial_stage_type
9.7.3 Enumeration Type Documentation
9.7.3.1 serial_stage_type
9.8 Tuner
9.8.1 Class Documentation
9.8.1.1 class hetcompute::pattern::tuner

10 Tasks Reference API
10.1 Groups
10.1.1 Class Documentation
10.1.1.1 class hetcompute::group
10.1.1.2 class hetcompute::group_ptr
10.1.2 Function Documentation
10.1.2.1 create_group
10.1.2.2 create_group
10.1.2.3 create_group
10.1.2.4 finish_after
10.1.2.5 finish_after
10.1.2.6 intersect
10.1.2.7 operator&
10.2 Kernels
10.2.1 Class Documentation
10.2.1.1 struct hetcompute::beta::call_tuple
10.2.1.2 struct hetcompute::beta::call_tuple< Dim, gpu_kernel< Args...> >
10.2.1.3 class hetcompute::beta::cl_t
10.2.1.4 class hetcompute::cpu_kernel
10.2.1.5 class hetcompute::cpu_kernel< FReturnType(FArgs...)>
10.2.1.6 class hetcompute::dsp_kernel
10.2.1.7 class hetcompute::dsp_kernel< int(*)(Args...)>
10.2.1.8 class hetcompute::beta::gl_t
10.2.1.9 class hetcompute::gpu_kernel
10.2.1.10 class hetcompute::local
10.2.2 Function Documentation
10.2.2.1 create_cpu_kernel
10.2.2.2 create_cpu_kernel
10.2.2.3 create_dsp_kernel
10.2.2.4 create_gpu_kernel
10.2.2.5 create_gpu_kernel
10.2.2.6 create_gpu_kernel
10.2.2.7 create_gpu_kernel
10.2.2.8 create_gpu_kernel
10.2.3 Variable Documentation
10.2.3.1 cl
10.2.3.2 gl
10.3 Indices
10.3.1 Class Documentation
10.3.1.1 class hetcompute::index
10.3.1.2 class hetcompute::index< 1 >
10.3.1.3 class hetcompute::index< 2 >
10.3.1.4 class hetcompute::index< 3 >
10.3.1.5 class hetcompute::index_base
10.4 Ranges
10.4.1 Class Documentation
10.4.1.1 class hetcompute::range
10.4.1.2 class hetcompute::range< 1 >
10.4.1.3 class hetcompute::range< 2 >
10.4.1.4 class hetcompute::range< 3 >
10.4.1.5 class hetcompute::range_base
10.5 Tasks
10.5.1 Class Documentation
10.5.1.1 struct hetcompute::do_not_collapse_t
10.5.1.2 class hetcompute::task< ReturnType >
10.5.1.3 class hetcompute::task< ReturnType(Args...)>
10.5.1.4 class hetcompute::task< void > . . . . . . . . . . . . . . . . . . . . . 305


10.5.1.5 class hetcompute::task<> . . . . . . . . . . . . . . . . . . . . . . . . 305
10.5.1.6 class hetcompute::task_ptr< ReturnType > . . . . . . . . . . . . . . . 316
10.5.1.7 class hetcompute::task_ptr< ReturnType(Args...)> . . . . . . . . . . . 322
10.5.1.8 class hetcompute::task_ptr< void > . . . . . . . . . . . . . . . . . . . 326
10.5.1.9 class hetcompute::task_ptr<> . . . . . . . . . . . . . . . . . . . . . . 329
10.5.2 Typedef Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.5.2.1 collapsed_task_type . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.5.2.2 non_collapsed_task_type . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.5.3 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.5.3.1 abort_on_cancel . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 335
10.5.3.2 abort_task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 337
10.5.3.3 bind_as_data_dependency . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5.3.4 bind_by_value . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5.3.5 blocking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 339
10.5.3.6 create_task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 340
10.5.3.7 create_task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 341
10.5.3.8 create_value_task . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 342
10.5.3.9 finish_after . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.5.3.10 launch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 344
10.5.3.11 launch . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 345
10.5.3.12 operator!= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.5.3.13 operator!= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 346
10.5.3.14 operator!= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
10.5.3.15 operator% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
10.5.3.16 operator% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
10.5.3.17 operator% . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 347
10.5.3.18 operator& . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
10.5.3.19 operator& . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
10.5.3.20 operator& . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 348
10.5.3.21 operator∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.5.3.22 operator∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.5.3.23 operator∗ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.5.3.24 operator+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 349
10.5.3.25 operator+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 350
10.5.3.26 operator+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
10.5.3.27 operator+ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 351
10.5.3.28 operator- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 352
10.5.3.29 operator- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
10.5.3.30 operator- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 353
10.5.3.31 operator- . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.5.3.32 operator/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.5.3.33 operator/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.5.3.34 operator/ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 354
10.5.3.35 operator== . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.5.3.36 operator== . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.5.3.37 operator== . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.5.3.38 operator>> . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 355
10.5.3.39 operator^ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
10.5.3.40 operator^ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356


10.5.3.41 operator^ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356


10.5.3.42 operator| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 356
10.5.3.43 operator| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
10.5.3.44 operator| . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
10.5.3.45 operator~ . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
10.5.4 Variable Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 357
10.5.4.1 do_not_collapse . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 358

11 Buffers Reference API 359


11.1 Heterogeneous Compute Device Types . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
11.1.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 360
11.1.1.1 class hetcompute::device_set . . . . . . . . . . . . . . . . . . . . . . . 360
11.1.2 Enumeration Type Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 365
11.1.2.1 device_type . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 365
11.1.3 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
11.1.3.1 to_string . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 366
11.2 Buffers . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 367
11.2.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 368
11.2.1.1 class hetcompute::buffer_const_iterator . . . . . . . . . . . . . . . . . 368
11.2.1.2 class hetcompute::buffer_iterator . . . . . . . . . . . . . . . . . . . . . 369
11.2.1.3 class hetcompute::buffer_ptr . . . . . . . . . . . . . . . . . . . . . . . 370
11.2.1.4 struct hetcompute::in . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
11.2.1.5 struct hetcompute::inout . . . . . . . . . . . . . . . . . . . . . . . . . . 379
11.2.1.6 struct hetcompute::out . . . . . . . . . . . . . . . . . . . . . . . . . . . 379
11.2.1.7 class hetcompute::scope_acquire_ro . . . . . . . . . . . . . . . . . . . 379
11.2.1.8 class hetcompute::scope_acquire_rw . . . . . . . . . . . . . . . . . . . 380
11.2.1.9 class hetcompute::scope_acquire_wi . . . . . . . . . . . . . . . . . . . 381
11.2.2 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
11.2.2.1 create_buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
11.2.2.2 create_buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 382
11.2.2.3 create_buffer . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 383
11.3 Memory Regions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 385
11.3.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 387
11.3.1.1 class hetcompute::glbuffer_memregion . . . . . . . . . . . . . . . . . . 387
11.3.1.2 class hetcompute::ion_memregion . . . . . . . . . . . . . . . . . . . . 387
11.3.1.3 class hetcompute::main_memregion . . . . . . . . . . . . . . . . . . . 388
11.3.1.4 class hetcompute::memregion . . . . . . . . . . . . . . . . . . . . . . 389
11.3.2 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
11.3.2.1 glbuffer_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
11.3.2.2 ion_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
11.3.2.3 ion_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 390
11.3.2.4 ion_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
11.3.2.5 main_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
11.3.2.6 main_memregion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
11.3.2.7 get_fd . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 391
11.3.2.8 get_id . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
11.3.2.9 get_num_bytes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
11.3.2.10 get_ptr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
11.3.2.11 get_ptr . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392
11.3.2.12 is_cacheable . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392


11.3.3 Variable Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392


11.3.3.1 s_default_alignment . . . . . . . . . . . . . . . . . . . . . . . . . . . . 392

12 Graphics Reference API 393


12.1 Texture APIs . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 394
12.1.1 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.1.1.1 create_derivative_texture . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.1.1.2 create_sampler . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 395
12.1.1.3 create_texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
12.1.1.4 create_texture . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
12.1.1.5 is_supported . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 396
12.1.1.6 map . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
12.1.1.7 unmap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 397
12.2 Texture Data Types . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
12.2.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 398
12.2.1.1 struct hetcompute::graphics::image_size . . . . . . . . . . . . . . . . . 398
12.2.1.2 struct hetcompute::graphics::image_size< 1 > . . . . . . . . . . . . . 398
12.2.1.3 struct hetcompute::graphics::image_size< 2 > . . . . . . . . . . . . . 399
12.2.1.4 struct hetcompute::graphics::image_size< 3 > . . . . . . . . . . . . . 399
12.2.2 Enumeration Type Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.2.2.1 addressing_mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.2.2.2 extended_format_plane_type . . . . . . . . . . . . . . . . . . . . . . . 399
12.2.2.3 filter_mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 399
12.2.2.4 image_format . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 400

13 Data Structures Reference API 401


13.1 Bounded Lock-Free Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
13.1.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 402
13.1.1.1 class hetcompute::bounded_lfqueue . . . . . . . . . . . . . . . . . . . 402
13.1.2 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
13.1.2.1 bounded_lfqueue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
13.1.2.2 pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
13.1.2.3 push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 403
13.2 Unbounded Lock-Free Queue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
13.2.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 405
13.2.1.1 class hetcompute::lfqueue . . . . . . . . . . . . . . . . . . . . . . . . . 405
13.2.2 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
13.2.2.1 lfqueue . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
13.2.2.2 pop . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406
13.2.2.3 push . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 406

14 Data Sharing and Storage Reference API 407


14.1 Data Sharing Synchronization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 408
14.2 Scheduler Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.2.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 409
14.2.1.1 class hetcompute::scheduler_storage_ptr . . . . . . . . . . . . . . . . 409
14.3 Scoped Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.3.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 412
14.3.1.1 class hetcompute::scoped_storage_ptr . . . . . . . . . . . . . . . . . . 412
14.4 Task Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415


14.4.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 415


14.4.1.1 class hetcompute::task_storage_ptr . . . . . . . . . . . . . . . . . . . 415
14.5 Thread Storage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
14.5.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 419
14.5.1.1 class hetcompute::thread_storage_ptr . . . . . . . . . . . . . . . . . . 419

15 Exceptions Reference API 421


15.1 Exceptions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
15.1.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 422
15.1.1.1 class hetcompute::abort_task_exception . . . . . . . . . . . . . . . . . 422
15.1.1.2 class hetcompute::aggregate_exception . . . . . . . . . . . . . . . . . 423
15.1.1.3 class hetcompute::api_exception . . . . . . . . . . . . . . . . . . . . . 424
15.1.1.4 class hetcompute::canceled_exception . . . . . . . . . . . . . . . . . . 424
15.1.1.5 class hetcompute::dsp_exception . . . . . . . . . . . . . . . . . . . . . 425
15.1.1.6 class hetcompute::error_exception . . . . . . . . . . . . . . . . . . . . 425
15.1.1.7 class hetcompute::gpu_exception . . . . . . . . . . . . . . . . . . . . . 426
15.1.1.8 class hetcompute::hetcompute_exception . . . . . . . . . . . . . . . . 427
15.1.1.9 class hetcompute::tls_exception . . . . . . . . . . . . . . . . . . . . . 427
15.2 ErrorCodes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429
15.2.1 Enumeration Type Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 429
15.2.1.1 hc_error . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 429

16 Affinity Management API 430


16.1 Affinity Settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 431
16.1.1 Class Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 432
16.1.1.1 struct hetcompute_affinity_settings_t . . . . . . . . . . . . . . . . . . . 432
16.1.1.2 class hetcompute::affinity::settings . . . . . . . . . . . . . . . . . . . . 432
16.1.2 Typedef Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.2.1 hetcompute_func_ptr_t . . . . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.3 Enumeration Type Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.3.1 cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.3.2 hetcompute_affinity_cores_t . . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.3.3 hetcompute_affinity_mode_t . . . . . . . . . . . . . . . . . . . . . . . 433
16.1.3.4 hetcompute_affinity_pin_threads_t . . . . . . . . . . . . . . . . . . . . 434
16.1.3.5 mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
16.1.4 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
16.1.4.1 settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
16.1.4.2 ~settings . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 434
16.1.4.3 execute . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.4 get . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.5 get_cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.6 get_mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.7 get_pin_threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.8 hetcompute_affinity_execute . . . . . . . . . . . . . . . . . . . . . . . 435
16.1.4.9 hetcompute_affinity_get . . . . . . . . . . . . . . . . . . . . . . . . . . 436
16.1.4.10 hetcompute_affinity_reset . . . . . . . . . . . . . . . . . . . . . . . . . 436
16.1.4.11 hetcompute_affinity_set . . . . . . . . . . . . . . . . . . . . . . . . . . 436
16.1.4.12 is_this_big_core . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 437
16.1.4.13 operator!= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
16.1.4.14 operator== . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438


16.1.4.15 reset . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438


16.1.4.16 reset_pin_threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
16.1.4.17 set . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 438
16.1.4.18 set_cores . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
16.1.4.19 set_mode . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440
16.1.4.20 set_pin_threads . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 440

17 Miscellaneous 441
17.1 Interoperability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 442
17.2 Legacy . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
17.2.1 Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
17.2.1.1 init . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443
17.2.1.2 shutdown . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 443

18 Class Documentation 444


18.1 cpu_kernel Class Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
18.2 dsp_kernel Class Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 444
18.3 HetComputeApp::features Class Reference . . . . . . . . . . . . . . . . . . . . . . . . . 444
18.4 hetcompute::beta::pattern::pipeline< UserData > Class Template Reference . . . . . . . 444
18.4.1 Member Typedef Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 445
18.4.1.1 context . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
18.4.2 Constructors and Destructors . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
18.4.2.1 pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 445
18.4.2.2 ~pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.2.3 pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.2.4 pipeline . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.3 Member Function Documentation . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.3.1 add_stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.3.2 add_stage . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 446
18.4.3.3 operator= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
18.4.3.4 operator= . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
18.5 hetcompute::internal::pointkernel::pointkernel< RT, Args > Class Template Reference . 447
18.6 stage_input_base Class Reference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 447
18.7 hetcompute::internal::task_factory< X, Y, Z > Struct Template Reference . . . . . . . . 447
18.8 hetcompute::internal::task_factory_dispatch< X, Y > Struct Template Reference . . . . 447

Alphabetical Index 448

Bibliography 461


List of Tables
1-1 Reference documents and standards . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
1-2 Acronyms . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17



1 Introduction

1.1 Purpose
This document describes the Qualcomm® Snapdragon™ Heterogeneous Compute SDK programming
model and API.

1.2 Scope
This document is for system developers using the Qualcomm Heterogeneous Compute SDK to develop
domain-specific libraries for high-performance applications. Qualcomm Heterogeneous Compute SDK
handles core management, providing the ability to port an application across multiple cores. Speed is
determined by the number of processors on the device.
This document provides the public interfaces necessary to use the features provided by the Qualcomm
Heterogeneous Compute SDK. A functional overview and information on leveraging the interface
functionality are also provided.

1.3 Conventions
Function declarations, function names, type declarations, and code samples appear in a different font. For
example, #include.
Code variables appear in angle brackets. For example, <number>.
Commands and command variables appear in a different font. For example, copy a:*.* b:.


1.4 References
The following table lists reference documents, which may include Qualcomm documents and
non-Qualcomm standards and resources. Reference documents that are no longer applicable are deleted
from this table; therefore, reference numbers might not be sequential. The Bibliography at the end of
this document collects the sources cited, and citations in the text link to it.

Table 1-1 Reference documents and standards


Ref. Document
Qualcomm
Q1 Application Note: Software Glossary for Customers CL93-V3077-1

1.5 Technical Assistance


For assistance or clarification on information in this guide, send email to Qualcomm Technologies, Inc. at
[email protected].

1.6 Acronyms
For definitions of commonly used terms and abbreviations, refer to Q1. The following terms are specific to
this document.
Table 1-2 Acronyms
Acronym Definition
API application programming interface
DAG directed acyclic graph
GPGPU general purpose GPU
Qualcomm HetCompute Qualcomm® Snapdragon™ Heterogeneous Compute SDK
MIMD multiple instruction, multiple data
MPI message passing interface
NDEBUG C/C++ preprocessor macro for NO DEBUG
NDK Native Development Kit
SAXPY single-precision a*x plus y (scalar-vector multiply and add)
SIMD single instruction, multiple data
SMP symmetric multiprocessing
SoC system-on-a-chip
TLS thread local storage



2 Installing Snapdragon™ Heterogeneous
Compute SDK

This chapter explains how to configure an application to use HetCompute given the binary distribution. The
installer package available from the Qualcomm Developer Network contains precompiled dynamic
libraries for Android (32-bit and 64-bit ARM). Install the distribution on your system following the installer
prompts, and then see the appropriate section below to verify the installation and integrate it with your
application.

2.1 Verifying your installation


By default, the binary installer places the HetCompute library, headers, and samples in the following
directory: /opt/Qualcomm/SnapdragonHeterogeneousComputeSDK/<version>/<platform> on Linux and macOS,
and C:\Qualcomm\SnapdragonHeterogeneousComputeSDK\<version>\<platform> on Windows. This directory
is called the HETCOMPUTE_DIR directory throughout. Substitute <platform> with either the
32bit (armeabi-v7a) or the 64bit (arm64-v8a) variant. If the SDK is installed in a different location,
that location becomes HETCOMPUTE_DIR.
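
As a host-side sanity check, the layout described above can be probed programmatically. The following
C++17 sketch is not part of the SDK: missing_sdk_dirs is a hypothetical helper, and the default
subdirectory names (include, lib, samples) are assumptions taken from paths that appear elsewhere in
this chapter, so adjust them if your installation differs.

```cpp
#include <filesystem>
#include <string>
#include <vector>

// Hypothetical host-side helper (not an SDK function): returns the names of
// the expected subdirectories that are missing under `root`.
// The default names are assumptions based on paths used in this chapter.
std::vector<std::string> missing_sdk_dirs(
    const std::filesystem::path& root,
    const std::vector<std::string>& expected = {"include", "lib", "samples"})
{
    std::vector<std::string> missing;
    for (const std::string& name : expected) {
        // is_directory() returns false when the path does not exist.
        if (!std::filesystem::is_directory(root / name)) {
            missing.push_back(name);
        }
    }
    return missing;
}
```

Calling missing_sdk_dirs with the value of HETCOMPUTE_DIR returns an empty vector when every expected
subdirectory is present.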

Android 32-bit (armeabi-v7a):

• CPU, GPU, and Hexagon DSP support: libhetCompute-1.0.0.so

Android 64-bit (arm64-v8a):

• CPU, GPU, and Hexagon DSP support: libhetCompute-1.0.0.so

HetCompute assumes the existence of a working Android NDK and SDK. We recommend using NDK
r13b or later.
Note: The Qualcomm Hexagon SDK (available on Qualcomm Developer Network) is needed
to enable support for the Hexagon DSP in the Qualcomm HetCompute library. The recommended version of
the Qualcomm Hexagon SDK for use with HetCompute is 3.3.0 or later.

Before compiling the samples, specify the path to the root of the OpenCL directory containing the headers
and the library by setting QSHETCOMPUTE_OPENCL_PATH in
$HETCOMPUTE_DIR/samples/build/android/jni/Android.mk.
To verify the installation, perform the following:


Using ndk-build:

cd $HETCOMPUTE_DIR/samples/build/android/jni
$ANDROID_NDK/ndk-build

# Create a directory on the device to push the executable
$ANDROID_SDK/adb shell mkdir /data/local/tmp/hetcompute

# Push the executable to the device
# Replace armeabi-v7a by arm64-v8a for 64 bit devices
$ANDROID_SDK/adb push ../obj/local/armeabi-v7a/hetcompute_sample_helloworld /data/local/tmp/hetcompute

# Push the hetcompute dynamic library to the device.
# The 32-bit library should be pushed to /system/vendor/lib,
# while the 64-bit library should be pushed to /system/vendor/lib64
# Make sure to replace QSHETCOMPUTE_VERSION and Target Architecture with appropriate values in the
# command below.
$ANDROID_SDK/adb push $HETCOMPUTE_DIR/lib/$(TARGET_ARCH_ABI)/libhetCompute-$(QSHETCOMPUTE_VERSION).so /system/vendor/lib
$ANDROID_SDK/adb shell /data/local/tmp/hetcompute/hetcompute_sample_helloworld

The above ndk-build command builds both the 32-bit and 64-bit variants of the samples.
Note that some of the HetCompute GPU samples require image files. The samples assume that the image
files are under /mnt/sdcard on the device. Sample image files can be found in the
HETCOMPUTE_DIR/samples/src directory.

2.2 Integrating HetCompute with Android NDK Applications


The precompiled HetCompute libraries can be easily integrated with an existing native Android application.
These libraries have been compiled with the Google NDK r13b, using the clang toolchain and linked
against the c++_static runtime. The default Android build platform is android-21. Using the same
NDK, compiler, and runtime C++ library is recommended.
You will need to make the following changes to your project files to use the HetCompute libraries. First,
edit your project’s jni/Application.mk file to include the following entries:
# APP_STL defines the C++ runtime to use
APP_STL := c++_static
# For 64-bit android, APP_ABI := arm64-v8a
APP_ABI := armeabi-v7a
NDK_TOOLCHAIN_VERSION := clang
# set the APP_PLATFORM to match your platform version
APP_PLATFORM := android-21

Next, edit your project’s jni/Android.mk to define the location of the HetCompute libraries and headers
and to declare the prebuilt shared library.
# Heterogeneous Compute SDK prebuilt
include $(CLEAR_VARS)

LOCAL_MODULE := qshetcompute
LOCAL_SRC_FILES := $(HETCOMPUTE_DIR)/$(TARGET_ARCH_ABI)/libhetCompute-$(QSHETCOMPUTE_VERSION).so
LOCAL_EXPORT_C_INCLUDES := $(HETCOMPUTE_DIR)/include

include $(PREBUILT_SHARED_LIBRARY)

If an application needs to disable exceptions, make the following changes in the Android.mk and
Application.mk files.
# Add the following CFLAGS in Android.mk
LOCAL_CFLAGS := -DHETCOMPUTE_DISABLE_EXCEPTIONS

# Disable exceptions for the app in Application.mk
APP_CPPFLAGS += -fno-exceptions
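
Application code that must build in both configurations can key its own error handling on the same
macro. The following sketch assumes that the application reuses HETCOMPUTE_DISABLE_EXCEPTIONS for this
purpose; parse_int is a hypothetical application helper, not an SDK function.

```cpp
#include <cerrno>
#include <climits>
#include <cstddef>
#include <cstdlib>
#include <string>
#ifndef HETCOMPUTE_DISABLE_EXCEPTIONS
#include <stdexcept>
#endif

// Hypothetical application helper: parses a decimal integer and reports
// failure through the return value, so callers look the same whether or not
// the build allows exceptions.
bool parse_int(const std::string& text, int& out)
{
#ifndef HETCOMPUTE_DISABLE_EXCEPTIONS
    // Exceptions enabled: std::stoi throws on bad input or overflow.
    try {
        std::size_t used = 0;
        const int value = std::stoi(text, &used);
        if (used != text.size()) {
            return false;  // trailing characters
        }
        out = value;
        return true;
    } catch (const std::exception&) {
        return false;
    }
#else
    // Exceptions disabled (-fno-exceptions): use the errno-based C API.
    errno = 0;
    char* end = nullptr;
    const long value = std::strtol(text.c_str(), &end, 10);
    if (end == text.c_str() || *end != '\0' || errno == ERANGE) {
        return false;
    }
    if (value < INT_MIN || value > INT_MAX) {
        return false;
    }
    out = static_cast<int>(value);
    return true;
#endif
}
```

Both branches return false on malformed or out-of-range input, so the calling code does not need its own
conditional compilation.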

Here is a sample Android.mk fragment that is used to build the shipped samples:
define hetcompute_add_sample
include $(CLEAR_VARS)
LOCAL_MODULE := hetcompute_sample_$1
LOCAL_C_INCLUDES := $(QSHETCOMPUTE_CORE_INCLUDE_PATH) \
                    $(QSHETCOMPUTE_OPENCL_INC_PATH) \
                    $(QSHETCOMPUTE_DSP_STUB_PATH)

LOCAL_SHARED_LIBRARIES := qshetcompute libhetcompute-hexagon-prebuilt libOpenCL-prebuilt

LOCAL_CPPFLAGS := -pthread -std=c++11 -stdlib=libstdc++
LOCAL_LDLIBS := -llog -lGLESv3 -lEGL
LOCAL_CFLAGS := -DHAVE_CONFIG_H=1 -DHAVE_ANDROID_LOG_H=1 -DHETCOMPUTE_HAVE_RTTI=1 \
                -DHETCOMPUTE_HAVE_OPENCL=1 -DHETCOMPUTE_HAVE_GPU=1 -DHETCOMPUTE_HAVE_GLES=1 \
                -DHETCOMPUTE_HAVE_QTI_DSP=1 -DHETCOMPUTE_THROW_ON_API_ASSERT=1 \
                -DHETCOMPUTE_LOG_FIRE_EVENT=1
ifeq ($(TARGET_ARCH_ABI), arm64-v8a)
LOCAL_LDFLAGS := -Wl,-allow-shlib-undefined
endif
LOCAL_SRC_FILES := $(QSHETCOMPUTE_SAMPLES_SRC_PATH)/$1.cc
include $(BUILD_EXECUTABLE)
endef

To build your application, run:


$ANDROID_NDK/ndk-build

HetCompute SDK supports heterogeneous computing with offload to the GPU and DSP. GPU offload is
supported using OpenCL and OpenGL kernels. The offload mechanism is either OpenCL 1.2 or later, or the
native Qualcomm GPU driver. The community has been collecting a list of Android devices that support
OpenCL. In addition, Qualcomm-based platforms such as the Qualcomm DragonBoard provide OpenCL
support.
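
Whether a device meets the OpenCL 1.2 requirement can be checked from the platform version string, which
the OpenCL specification defines as OpenCL<space><major.minor><space><platform-specific information> and
which clGetPlatformInfo returns for CL_PLATFORM_VERSION. A minimal sketch of the comparison, where
is_opencl_1_2_or_later is a hypothetical helper rather than an SDK function:

```cpp
#include <cstdio>
#include <string>

// Hypothetical helper (not an SDK function): returns true when an OpenCL
// platform version string of the form "OpenCL <major>.<minor> <vendor text>"
// reports version 1.2 or later.
bool is_opencl_1_2_or_later(const std::string& version)
{
    int major = 0;
    int minor = 0;
    // sscanf returns the number of fields converted; both are required.
    if (std::sscanf(version.c_str(), "OpenCL %d.%d", &major, &minor) != 2) {
        return false;  // unrecognized format
    }
    return major > 1 || (major == 1 && minor >= 2);
}
```

The helper only parses text, so it can be used on a string obtained by any means, including the output
of a device query tool.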
Using HetCompute with OpenCL as GPU backend requires the OpenCL C++ header file cl.hpp, which
needs to be patched. For details on how to patch cl.hpp, see OpenCL C++ Support. In the Android.mk
file, set LOCAL_C_INCLUDES to include the path to the OpenCL headers. In the Android.mk above, this
path is referenced as QSHETCOMPUTE_OPENCL_INC_PATH. libOpenCL-prebuilt refers to the corresponding
OpenCL library for the 32-bit or 64-bit variant.
To build an application with the Hexagon-enabled library, set LOCAL_C_INCLUDES to include the DSP stub
headers generated using the Hexagon SDK. This path is referenced as QSHETCOMPUTE_DSP_STUB_PATH in the
samples' Android.mk file.

2.3 Hexagon DSP Support


Using HetCompute with Hexagon DSP tasks requires a working installation of the Hexagon SDK. This
section assumes that the programmer is familiar with the Hexagon SDK. Also, HetCompute Hexagon DSP
offloading requires a proper installation of OpenCL libraries.
Building the HetCompute DSP samples is a two-step process. First, build the hetcompute_dsp
stub and skel libraries with the Hexagon SDK; then compile and build the HetCompute DSP samples against
the previously generated stub library. For ease of use, the installation distributes the stub and skel
libraries used by the samples. These libraries can be found in $HETCOMPUTE_DIR/external/dsp/.
Both the skel and stub libraries need to be pushed to the device before running the HetCompute DSP
samples. Refer to the Hexagon SDK documentation to determine where the libraries need to be installed
on the device.


Running the HetCompute DSP samples:

• A README file is provided in $HETCOMPUTE_DIR/external/dsp/ that explains how to
compile the dsp skel & stub libraries in case one wants to add new functions in the stub & skel
libraries. This step can be skipped if one wants to use the precompiled dsp skel & stub libraries
distributed in HetCompute SDK package.
• Push the libraries into the device and run the dsp samples.
$ANDROID_SDK/adb shell mkdir -p /data/local/tmp/hetcompute/
cd $HETCOMPUTE_DIR
$ANDROID_SDK/adb push lib/armeabi-v7a/libhetcompute-@[email protected] /system/vendor/lib
$ANDROID_SDK/adb push samples/build/android/libs/armeabi-v7a/hetcompute_sample_hexagon_is_prime /data/local/tmp/hetcompute
$ANDROID_SDK/adb push external/dsp/lib/libhetcompute_dsp_skel.so /system/lib/rfsa/adsp
$ANDROID_SDK/adb push external/dsp/lib/libhetcompute_dsp.so /system/vendor/lib
$ANDROID_SDK/adb shell chmod 0755 /data/local/tmp/hetcompute/hetcompute_sample_hexagon_is_prime
$ANDROID_SDK/adb shell /data/local/tmp/hetcompute/hetcompute_sample_hexagon_is_prime

Note

If issues are encountered, verify that the calculator example shipped with the Hexagon SDK works
properly on your device and that your device properly supports DSP execution.

2.4 OpenCL C++ Support


Using HetCompute with OpenCL as the GPU backend requires the presence of the OpenCL C++ header
file from Khronos (the version must match your OpenCL driver installation):
http://www.khronos.org/registry/cl/
The header file should be installed in a subdirectory ${includedir}/OpenCL/ that is searched by the
compiler (e.g., /usr/local/include). Alternatively, additional options need to be passed as compilation
flags (e.g., -I${includedir}).



3 Snapdragon™ Heterogeneous Compute
SDK Migration Guide

This chapter discusses the steps needed if you are transitioning from Symphony System Manager SDK to
Heterogeneous Compute SDK.

• Migration from Symphony System Manager SDK 1.x to Snapdragon™ Heterogeneous Compute
SDK 1.0

3.1 Migration from Symphony System Manager SDK 1.x to
Snapdragon™ Heterogeneous Compute SDK 1.0



4 Getting Started

4.1 Writing your first HetCompute program


Let’s explore a simple example using HetCompute:
1 #include <vector>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8     hetcompute::runtime::init();
9     // initialize the input vector
10     std::vector<size_t> vin(1024, 0);
11
12     // in-place update of the input vector
13     // equivalent to the following code
14     // for (size_t i = 0; i < vin.size(); ++i) {
15     //     vin[i] = 2 * i;
16     // }
17     hetcompute::pfor_each(size_t(0), vin.size(), [&vin](size_t i) { vin[i] = 2 * i; });
18
19     hetcompute::runtime::shutdown();
20     return 0;
21 }

The above program does the following: Given an input vector vin containing 1024 elements, all of which
are initialized to 0, every element is updated to store 2*i, where i is the index of that element.
In line 3, the hetcompute.hh header is included, which is needed for any HetCompute program. All the
HetCompute classes and functions are declared in the hetcompute namespace.
This simple example illustrates the use of the HetCompute pfor_each pattern, which allows the elements
of a collection to be processed in parallel. Because there are no dependencies between iterations (termed
inter-iteration dependencies), the values can be computed and updated in parallel. This pattern can be used
to replace all loops in the user’s program that do not have inter-iteration dependencies. HetCompute
provides a variety of other patterns, which are described in Parallel Programming Patterns.
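The notion of inter-iteration dependencies can be made concrete with a short sketch in standard C++. It makes no HetCompute calls, and the function names fill_doubled, running_sum, and check_dependency_examples are illustrative, not part of the SDK. Visiting the indices in reverse stands in for any order a parallel runtime might choose: the independent loop is unaffected, whereas the loop with a dependency produces a different result.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Independent iterations: element i depends only on i, so any visiting
// order (and therefore parallel execution) gives the same result.
std::vector<std::size_t> fill_doubled(std::size_t n, bool reverse_order)
{
    std::vector<std::size_t> v(n, 0);
    for (std::size_t k = 0; k < n; ++k)
    {
        const std::size_t i = reverse_order ? n - 1 - k : k;
        v[i] = 2 * i;
    }
    return v;
}

// Loop-carried dependency: element i reads element i - 1, so the result
// changes with the visiting order. Such a loop is not a pfor_each candidate.
std::vector<std::size_t> running_sum(std::size_t n, bool reverse_order)
{
    std::vector<std::size_t> v(n, 1);
    for (std::size_t k = 1; k < n; ++k)
    {
        const std::size_t i = reverse_order ? n - k : k;
        v[i] += v[i - 1];
    }
    return v;
}

// Usage: the first loop is order-insensitive, the second is not.
void check_dependency_examples()
{
    assert(fill_doubled(8, false) == fill_doubled(8, true));
    assert(running_sum(4, false) != running_sum(4, true));
}
```

A loop of the second kind needs a different pattern (for example a scan) or an explicit restructuring before it can run in parallel.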
HetCompute also provides programmers with another layer of abstraction, allowing them to think about
algorithms in terms of concurrent tasks and letting the HetCompute runtime schedule them onto available
resources in the system. Programmers can create dynamic task graphs by setting dependencies between
tasks, which the runtime enforces. Another key HetCompute abstraction, not shown in the example, is the
group. Groups allow the programmer to easily manage sets of tasks. Tasks and groups are discussed in
more detail in Introduction to Tasks.


4.1.1 Building a HetCompute program using ndk-build


The pfor_helloworld program above can be built using ndk-build. Refer to
$HETCOMPUTE_DIR/samples/build/android/jni/Android.mk for the complete build file of the
pfor_helloworld sample. Snippets from the key build steps are listed below.
The following is the project’s Android.mk file.
include $(CLEAR_VARS)
LOCAL_MODULE := pfor_helloworld
LOCAL_SHARED_LIBRARIES := libhetcompute
LOCAL_SRC_FILES := pfor_helloworld.cc
include $(BUILD_EXECUTABLE)


The jni/Application.mk file is shown below:
APP_STL := c++_static
APP_ABI := armeabi-v7a arm64-v8a
NDK_TOOLCHAIN_VERSION := clang
#set the APP_PLATFORM to match your platform version.
APP_PLATFORM := android-21

The example can then be built by typing the following at the command prompt:
$ ndk-build



5 User Guide

5.1 Overview
All current hardware platforms, from desktops to smartphones, are built around multicore and
heterogeneous systems-on-a-chip (SoC). Servers and supercomputers are also using specialized cores, such
as GPUs, to improve performance and power efficiency.
Qualcomm Heterogeneous Compute SDK (HetCompute) enables the full utilization of the hardware at the
user application level, in the following ways:

• By providing a parallel programming model that allows programmers to express the concurrency in
their applications. HetCompute’s powerful abstractions ease the burden of parallel programming
through a design that builds on dynamic concurrency from the ground up. At the high level,
HetCompute provides a set of parallel programming patterns that capture many of the existing
parallel building blocks, and adds dataflow and work cancellation as first-class primitives that
improve programmer productivity.
• By seamlessly integrating heterogeneous execution into a concurrent task graph and removing the
burden of managing data transfers and explicit data copies between kernels executing on different
devices. At the low level, HetCompute provides state-of-the-art algorithms for work stealing and
power optimization that hide hardware idiosyncrasies and enable the development of
portable applications. In addition, HetCompute is designed to support dynamic mapping to
heterogeneous execution units. Moreover, expert programmers can take charge of the execution
through a carefully designed system of attributes and directives that provide the runtime system with
additional semantic information about the patterns, tasks, and buffers that HetCompute uses as
building blocks.
• By embedding the programming model in C++ and providing a C++ library API. C++ is a familiar
language for a large number of performance-oriented programmers, thus making it easy for
programmers to pick up the abstractions quickly. C++ embedding also allows incremental
development of existing applications, because HetCompute interoperates with existing libraries, such
as pthreads and OpenGL.
HetCompute runs on top of a runtime system that will execute the concurrent applications on all the
available computational resources on the SoC. The HetCompute runtime system is essentially a resource
manager for threads, address spaces, and devices. It builds on a set of state-of-the-art algorithms to free
programmers from the need to manage these resources explicitly and provide the best performance for the
HetCompute execution model.
The remaining sections of this chapter provide a high-level overview of the HetCompute parallel patterns
and concurrent abstractions, and its execution model. The rest of the User Guide provides additional
details on the design decisions in HetCompute, which will allow the programmer to choose the right level of
primitives to use in the application. The Reference Manual includes the API details.


The following figure illustrates the HetCompute architecture:

Figure 5-1 HetCompute overview

HetCompute is a user-level library that integrates with OS services to hide the complexity of hardware as
much as possible, while still providing programmers with control over performance. HetCompute takes
advantage of existing standards to enable execution on the entire SoC: POSIX and C++11 for exploiting
multicore, OpenCL to dispatch onto GPUs, and OpenDSP to dispatch to the Qualcomm Hexagon™ DSP.
The advantage of using HetCompute is that it provides a seamless interface for all these devices, therefore
enabling the programmer to focus on the application being developed, rather than managing hardware,
different execution models, and data transfers.
HetCompute’s execution model is a concurrent task graph with acyclic control and/or data
dependencies that define which tasks may execute concurrently. Tasks (defined formally in sections
Introduction to Tasks and Tasks Reference API) are units of independent work. They are an intuitive way of
specifying chunks of computation that can map to different execution units. Dependencies (control and data)
provide the mechanism to dynamically build a concurrent task graph. Tasks execute in parallel on as
many execution units as are available on the platform at that moment. Note that on a mobile device, because of
power and thermal constraints, some execution units may be unavailable, or may even disappear dynamically.
Therefore, it is best for programmers to focus on expressing the concurrency using HetCompute tasks
and let the runtime map them to all the available resources. In HetCompute, heterogeneous execution is no
different from multicore execution. However, to provide the best performance, HetCompute requires
programmers to write specialized kernels. The current version of HetCompute supports writing GPU
kernels in the OpenCL language and DSP kernels in C99.
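The task-graph model described above can be sketched in a few lines of standard C++. The class tiny_task_graph and its members are hypothetical and exist only for illustration; this is not the HetCompute API or its scheduler. The sketch shows the one rule that matters: a task becomes ready only when all of its predecessors have finished. For simplicity, ready tasks run one at a time here, whereas a real runtime would hand them to multiple execution units.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <queue>
#include <utility>
#include <vector>

// Hypothetical miniature task graph, for illustration only.
class tiny_task_graph
{
public:
    // Adds a task and returns its id.
    std::size_t add(std::function<void()> body)
    {
        bodies_.push_back(std::move(body));
        successors_.emplace_back();
        pending_.push_back(0);
        return bodies_.size() - 1;
    }

    // Control dependency: 'succ' may start only after 'pred' has finished.
    void then(std::size_t pred, std::size_t succ)
    {
        assert(pred < bodies_.size() && succ < bodies_.size());
        successors_[pred].push_back(succ);
        ++pending_[succ];
    }

    // Runs every reachable task once, respecting dependencies.
    // Returns the number of tasks executed (less than the task count
    // if the graph contains a cycle).
    std::size_t run()
    {
        std::vector<std::size_t> pending = pending_;
        std::queue<std::size_t> ready;
        for (std::size_t t = 0; t < pending.size(); ++t)
            if (pending[t] == 0)
                ready.push(t);

        std::size_t executed = 0;
        while (!ready.empty())
        {
            const std::size_t t = ready.front();
            ready.pop();
            bodies_[t]();
            ++executed;
            for (std::size_t s : successors_[t])
                if (--pending[s] == 0)
                    ready.push(s); // last predecessor finished
        }
        return executed;
    }

private:
    std::vector<std::function<void()>> bodies_;
    std::vector<std::vector<std::size_t>> successors_;
    std::vector<std::size_t> pending_; // unfinished predecessors per task
};
```

Because only the dependency counts decide when a task may start, the same graph can be executed by one worker or by many without changing the program.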
The HetCompute runtime manages tasks and maps them to platform resources using a state-of-the-art work
scheduler. The scheduler implements pervasive work stealing and dynamic mapping of tasks to execution
units, based on heuristics that are driven by the programmer using novel high-level APIs. Later in this
guide, several examples show how the programmer can control the behavior of the runtime,
for example by using pattern tuners and task attributes. These are particularly relevant for mobile devices.
HetCompute provides two levels of APIs:

• A set of high-level APIs that includes parallel programming patterns and basic task and group
creation and launch. These APIs are intended for programmers who focus first and foremost on
productivity. Using these APIs provides the best performance in most instances, with
relatively little coding effort. The semantics of these APIs are precisely defined, and the HetCompute
type system is designed to catch many concurrent programming errors.
• A set of low-level APIs that allow expert programmers finer control over the parallel execution.
These APIs may offer better performance, at the cost of removing some of the guarantees that the
high-level APIs provide. Direct access to task pointer objects, task attributes and pattern tuners,
specialized allocators, buffer consistency and synchronization, and storage classes are some examples
of these APIs. These are the foundation for the high-level APIs, and thus the two levels work in
concert. However, using the low-level APIs requires a good understanding of parallel programming
and the side effects that concurrent execution can have on your program; therefore, use them with caution.
The target audience for HetCompute is programmers who require performance. HetCompute is envisioned
to be used by application programmers and library programmers to build high-performance applications
and domain-specific libraries. It is designed to make composing libraries easy: HetCompute tasks can be
launched from any application thread (no need to join a particular thread pool), tasks can be launched
hierarchically and synchronized individually or as a group, and patterns and tasks share a unified
representation. These novel characteristics make HetCompute uniquely positioned as a framework for heterogeneous
execution. Many other application programmers can benefit from HetCompute by embedding such
Qualcomm HetCompute-enabled libraries and thus indirectly benefit from parallel and heterogeneous
execution without the burden of parallel programming.


5.1.1 Writing a HetCompute Application


Integrating HetCompute in your application is quite easy, as long as you understand the principles of
parallel programming and have a good idea of where the concurrency is in your application. The
fundamental design goal of HetCompute is to easily express a parallel algorithm and incrementally build a
parallel application.
The figure below illustrates how to write an application using HetCompute:

Figure 5-2 HetCompute workflow

The workflow from start to completion of a HetCompute application is as follows:

• Identify the algorithm to be parallelized and design a parallel version of the algorithm.
• Encode the algorithm using HetCompute abstractions:
– If the algorithm matches one of the HetCompute patterns, use the pattern directly and enjoy the
speedups.
– More complex applications will require either the use of multiple patterns or they may exhibit
parallelism that does not match one of the existing patterns. In this case, use the HetCompute
building blocks of task and group to partition the algorithm into tasks, set dependencies
between the tasks (building the execution task graph), and launch the tasks for execution. Also
consider how to partition the data for concurrent access.
• Patterns and tasks are interoperable, as the HetCompute library maps patterns to tasks. Thus, a
HetCompute application consists of a forest of DAGs. The runtime system schedules the tasks once
their dependencies are satisfied.
• HetCompute task graphs execute across different devices when the programmer provides device
kernels. To execute on the GPU, write kernels in OpenCL. To run on the DSP, write kernels in C99.
These kernels are integrated into the task graph just like other tasks that are designed for the
CPU. Kernels: The Path to Heterogeneity has details on how to design and build heterogeneous
kernels.


The fastest way to build a HetCompute application is by using the HetCompute patterns. If your parallel
algorithm matches one of the parallel programming patterns in HetCompute (pfor_each, preduce,
ptransform, pscan, psort, pdivide_and_conquer, or pipeline), directly using the
pattern is recommended. The HetCompute runtime understands the semantics of these constructs and
optimizes for their concurrent execution.
Below are examples of how to use several of these patterns.

5.1.1.1 Parallel vector addition

One of the most common operations for parallel programming is parallel iteration. HetCompute provides
the pfor_each pattern to support parallel iteration. Below is an example:
1 #include <cstdlib>
2 #include <vector>
3
4 #include <hetcompute/hetcompute.hh>
5
6 using namespace std;
7
8 int
9 main()
10 {
11 hetcompute::runtime::init();
12 const size_t N = 100;
13 vector<float> a(N), b(N), c(N);
14
15 // Initialize the source arrays with random numbers
16 for (size_t i = 0; i < N; i++)
17 {
18 a[i] = static_cast<float>(rand()) / ((1ULL << 31) - 1);
19 b[i] = static_cast<float>(rand()) / ((1ULL << 31) - 1);
20 }
21 float alpha = 0.2f;
22
23 // add the two vectors concurrently
24 hetcompute::pfor_each(size_t(0), N, [&](size_t i) { c[i] = alpha * a[i] + b[i]; });
25
26 hetcompute::runtime::shutdown();
27 return 0;
28 }

The use of HetCompute is highlighted in this example. Line 4 includes the HetCompute library headers.
Everything up to and including line 20 is standard C++11 that initializes two vectors, a and b, of size N. In line 24 the
hetcompute::pfor_each construct is invoked. It is very similar to a for loop, except that the
iterations will be executed in parallel on as many execution units as are available on the platform. A more
detailed description of these patterns is in Patterns Reference API.
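As a rough mental model of what a parallel loop does with the vector addition above, the following standard C++ sketch splits the index range into one contiguous chunk per thread. The helper static_pfor and the function scaled_add are hypothetical names, and the static partitioning is a deliberately simple stand-in: it is not how HetCompute schedules iterations.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: applies fn(i) for every i in [first, last), splitting
// the range into at most 'workers' contiguous chunks, one thread per chunk.
template <typename Fn>
void static_pfor(std::size_t first, std::size_t last, Fn fn, std::size_t workers = 4)
{
    assert(first <= last && workers > 0);
    const std::size_t n = last - first;
    const std::size_t chunk = (n + workers - 1) / workers; // ceiling division
    std::vector<std::thread> pool;
    for (std::size_t begin = first; begin < last; begin += chunk)
    {
        const std::size_t end = std::min(begin + chunk, last);
        pool.emplace_back([=] {
            for (std::size_t i = begin; i < end; ++i)
                fn(i);
        });
    }
    for (auto& t : pool)
        t.join();
}

// c = alpha * a + b, computed with the helper above. Each iteration writes
// a distinct element of c, so no synchronization is needed inside the loop.
std::vector<float> scaled_add(float alpha, const std::vector<float>& a, const std::vector<float>& b)
{
    assert(a.size() == b.size());
    std::vector<float> c(a.size());
    static_pfor(0, a.size(), [&](std::size_t i) { c[i] = alpha * a[i] + b[i]; });
    return c;
}
```

Static chunks work well only when every iteration costs about the same; uneven workloads are the reason a runtime balances iterations dynamically.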

5.1.1.2 Parallel sort

Another example of a common operation that benefits from concurrent execution is sorting of large arrays.
Here is an example of how to sort in parallel in HetCompute:
1 #include <random>
2 #include <sstream>
3 #include <vector>
4
5 #include <hetcompute/hetcompute.hh>
6
12
13 int
14 main(int argc, const char* argv[])
15 {
16 hetcompute::runtime::init();
17 std::vector<long> input;
18 size_t n_def = 20;
19 size_t n = n_def;
20
21 if (argc >= 2)
22 {
23 std::istringstream istr(argv[1]);
24 istr >> n;
25 }
26
27 std::random_device rd;
28 std::mt19937 generator(rd());
29 std::uniform_int_distribution<long> dis;
30 const size_t num_ints = 1ULL << n;
31 // Create a random array of integers
32 for (size_t i = 0; i < num_ints; i++)
33 {
34 input.push_back(dis(generator));
35 }
36
37 hetcompute::psort(input.begin(), input.end());
38
39 if (!std::is_sorted(input.begin(), input.end()))
40 {
41 std::cerr << "psorting failed\n";
42 }
43
44 hetcompute::runtime::shutdown();
45 return 0;
46 }

Most of the code in this example is standard C++ to initialize the data structures. In line 37, the
hetcompute::psort parallel sorting function is invoked. It takes two iterators, the beginning and the
end of the list, and it sorts (in place) the input vector in the interval [begin, end).
As these examples show, introducing parallelism into an application is straightforward when the
application fits one of the predefined HetCompute patterns.

5.1.1.3 Parallelism using tasks

HetCompute exposes its fundamental building blocks, tasks and groups, to parallelize algorithms that
do not fit into one of the HetCompute patterns. Below is an example of sorting (using merge sort) that is
parallelized using HetCompute tasks:
1 #include <algorithm>
2 #include <functional>
3 #include <iostream>
4 #include <iterator>
5 #include <sstream>
6 #include <vector>
7
8 #include <hetcompute/hetcompute.hh>
9
10 // Parallel mergesort using recursive fork-join parallelism.
11 // hetcompute::task<>::finish_after allows easy expression of the parallelism in the
12 // algorithm in a non-blocking manner, yielding better performance than
13 // blocking parallelization using hetcompute::task<>::wait_for.
14
16 const size_t GRANULARITY = 8192;
17
18 // Asynchronous mergesort, to be invoked in a task
19 template <typename Iterator, typename Compare>
20 void
21 mergesort(Iterator begin, Iterator end, Compare cmp)
22 {
23 size_t n = std::distance(begin, end);
24 if (n <= GRANULARITY)
25 {
26 std::sort(begin, end, cmp);
27 }
28 else
29 {
30 auto middle = begin;
31 std::advance(middle, n / 2);
32 auto left = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
33 auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
34 auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle, end, cmp); });
35 // The left subtree and right subtree tasks must finish before the merge
36 // task can execute
37 left->then(merge);
38 right->then(merge);
39 merge->launch();
40 // mergesort(begin, end, cmp) logically finishes after the merge task
41 // finishes
42 merge->finish_after();
43 }
44 }
45
46 int
47 main(int argc, const char* argv[])
48 {
49 hetcompute::runtime::init();
50 std::vector<long> input;
51 size_t n_def = 1 << 16;
52 size_t n = n_def;
53
54 if (argc >= 2)
55 {
56 std::istringstream istr(argv[1]);
57 istr >> n;
58 }
59
60 // Create a random array of integers
61 for (size_t i = 0; i < n; i++)
62 {
63 input.push_back(rand());
64 }
65
66 // Launch mergesort inside a task since it has an asynchronous interface (due
67 // to use of hetcompute::task::finish_after)
68 auto t = hetcompute::launch([&] { mergesort(input.begin(), input.end(), std::less<long>()); });
69 t->wait_for();
70
71 if (!std::is_sorted(input.begin(), input.end()))
72 {
73 std::cerr << "parallel mergesorting failed\n";
74 }
75
76 hetcompute::runtime::shutdown();
77 return 0;
78 }

Note how much easier it is to use the HetCompute patterns. In this example, a dynamic DAG of
tasks is constructed by splitting the array into halves, sorting each half in parallel, and then merging the
results using another task. Lines 32-33 create and launch the recursive sorting tasks. Line 34 creates the merge task.
Lines 37-38 set the dependencies between the tasks, thus building the DAG. Line 39 launches the merge task
into the runtime, and line 42 makes the enclosing mergesort task finish only after the merge task finishes. In the main
function, a task is created for the mergesort over the entire array; because it has no dependencies, it
is launched directly (line 68), and the main thread then waits for it to complete (line 69).
This is a quick illustration of the power of HetCompute’s abstractions. The rest of this guide walks
through the design in detail to help you use HetCompute to extract the most benefit
for your application.
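The comments in the mergesort listing contrast non-blocking completion (finish_after) with blocking parallelization (wait_for). For comparison, the blocking style can be sketched in standard C++ with std::async. The function blocking_mergesort and its granularity parameter are hypothetical; the sketch does not use HetCompute and says nothing about its performance. Its purpose is to show where a thread sits idle waiting for a child, which is the cost that the non-blocking formulation avoids.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <future>
#include <iterator>
#include <vector>

// Blocking fork-join mergesort in standard C++ (illustration only).
// 'granularity' plays the role of GRANULARITY in the listing above.
template <typename Iterator, typename Compare>
void blocking_mergesort(Iterator begin, Iterator end, Compare cmp, std::size_t granularity = 8192)
{
    assert(granularity > 0);
    const std::size_t n = static_cast<std::size_t>(std::distance(begin, end));
    if (n <= granularity)
    {
        std::sort(begin, end, cmp);
        return;
    }
    Iterator middle = begin;
    std::advance(middle, n / 2);
    // Fork: sort the left half on another thread, the right half on this one.
    auto left = std::async(std::launch::async,
                           [=] { blocking_mergesort(begin, middle, cmp, granularity); });
    blocking_mergesort(middle, end, cmp, granularity);
    // Join: this thread blocks here until the left half is sorted.
    left.get();
    std::inplace_merge(begin, middle, end, cmp);
}
```

A call such as blocking_mergesort(v.begin(), v.end(), std::less<long>()) sorts v in place. Every level of the recursion holds a thread in left.get(), whereas in the task-based listing the function returns immediately and the merge task runs once both halves are done.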

5.1.2 Executing a HetCompute Application


The figure below illustrates how the HetCompute runtime executes HetCompute applications.

Figure 5-3 HetCompute execution

The HetCompute runtime fundamentally implements a thread pool over which tasks are scheduled at the
user level. When the application starts running, the thread pool is initialized such that it makes optimal use
of the existing hardware contexts on the device. The scheduler is a throughput-oriented scheduler. Tasks are
scheduled in a non-preemptive manner as they are ready for execution (dependencies are satisfied). They
are mapped to devices based on the kernel type. The runtime performs additional optimizations for
performance and energy efficiency based on pattern semantics, pattern tuning, and task attributes.
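The idea of a user-level thread pool that runs ready tasks to completion can be sketched in standard C++ as follows. The class tiny_pool is hypothetical and far simpler than the HetCompute runtime: it has a single shared queue, no work stealing, no device mapping, and no dependency tracking. It only illustrates non-preemptive execution of ready tasks by a fixed set of worker threads.

```cpp
#include <cassert>
#include <condition_variable>
#include <cstddef>
#include <functional>
#include <mutex>
#include <queue>
#include <thread>
#include <utility>
#include <vector>

// Hypothetical fixed-size thread pool, for illustration only.
class tiny_pool
{
public:
    explicit tiny_pool(std::size_t workers)
    {
        assert(workers > 0);
        for (std::size_t i = 0; i < workers; ++i)
            threads_.emplace_back([this] { work(); });
    }

    // Lets the workers drain the queue, then stops and joins them.
    ~tiny_pool()
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            stopping_ = true;
        }
        wakeup_.notify_all();
        for (auto& t : threads_)
            t.join();
    }

    // Enqueues a ready task (one whose dependencies are already satisfied).
    void launch(std::function<void()> task)
    {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            ready_.push(std::move(task));
        }
        wakeup_.notify_one();
    }

private:
    void work()
    {
        for (;;)
        {
            std::function<void()> task;
            {
                std::unique_lock<std::mutex> lock(mutex_);
                wakeup_.wait(lock, [this] { return stopping_ || !ready_.empty(); });
                if (ready_.empty())
                    return; // stopping and nothing left to run
                task = std::move(ready_.front());
                ready_.pop();
            }
            task(); // non-preemptive: runs to completion on this worker
        }
    }

    std::mutex mutex_;
    std::condition_variable wakeup_;
    std::queue<std::function<void()>> ready_;
    bool stopping_ = false;
    std::vector<std::thread> threads_;
};
```

Because a task is never preempted by the pool, a long-running task occupies its worker until it returns, which is one reason to keep tasks reasonably fine-grained.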
Parallel Programming Patterns
Introduction to Tasks
Buffers
Textures
Data Structures
Storage
Affinity
Heterogeneous Computing in Action
Interoperability


5.2 Parallel Programming Patterns


Overview of HetCompute Patterns
Parallel Iteration
Parallel Reduction
Parallel Scan
Parallel Divide-and-Conquer
Parallel Sorting
Advanced Topics for Patterns
Pipeline

5.2.1 Overview of HetCompute Patterns


One of HetCompute’s main goals is to simplify parallel programming. To this end, it provides several
constructs that encapsulate commonly used parallel programming patterns. These patterns reflect how
parallel programming experts think about parallel algorithms, and capture these essential understandings
with ready-to-use HetCompute APIs and performance tuning facilities. HetCompute’s parallel
programming patterns incorporate a variety of algorithmic styles exploiting concurrency. Examples include:
data parallelism operations (parallel loop, parallel reduce, parallel transform, and parallel scan),
multi-branched recursion (parallel divide and conquer), and staged computation often encountered in
streaming applications (pipeline). Programmers are encouraged to use these patterns as basic building
blocks to construct complex concurrent applications such as physical simulation, image/video processing,
or linear algebra routines. More patterns that are representative of key computational requirements will be
developed. In addition, because these patterns are layered on top of the Qualcomm HetCompute
abstractions, programmers are welcome to add to the library of patterns themselves.
When selecting a HetCompute construct to parallelize an algorithm, first stop at the pattern warehouse, as one
of the existing patterns may meet your needs. If so, use the pattern, measure the performance and efficiency,
and refine the implementation using the pattern performance tuner. Use other HetCompute constructs only
if the algorithm does not map well to the existing patterns, or the pattern implementation does not satisfy
the intended performance criteria.
This section provides a high-level overview of what to expect from the parallel patterns. First, the basic
uses of the parallel patterns are demonstrated. Next, some advanced topics related to patterns are discussed.
Finally, the pipeline pattern is presented as a relatively complex pattern. Note that all patterns presented
herein only apply to CPU. Heterogeneous patterns will be explored and included in future releases. Readers
may also refer to Kernels: The Path to Heterogeneity and Heterogeneous Computing in Action for
heterogeneous computing tutorials in HetCompute.

5.2.2 Parallel Iteration


The most commonly used pattern is data-parallel computing, in which the same function is applied to
different pieces of data. In this chapter, two closely related parallel programming patterns are introduced
which express data parallelism: hetcompute::pfor_each and hetcompute::ptransform.


5.2.2.1 Parallel For Loop

The parallel for loop pattern, hetcompute::pfor_each, supports concurrent application of a given
function object on each element in the input collection returned by the input iterator taken as an argument.
The input iterator can be expressed as a pair of integers (lower bound, upper bound), or a pair of random
access iterators (begin, end). This pattern is best suited to replacing a serial loop that has no loop-carried
dependence (a dependence that exists across iterations). The following example illustrates the use
of the parallel iteration pattern for a simple computation.
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<size_t> vin(1024, 0);

    // in-place update of the input vector
    // equivalent to the following code
    // for (size_t i = 0; i < vin.size(); ++i) {
    //     vin[i] = 2 * i;
    // }
    hetcompute::pfor_each(size_t(0), vin.size(), [&vin](size_t i) { vin[i] = 2 * i; });

    hetcompute::runtime::shutdown();
    return 0;
}

Despite its simple appearance, the underlying implementation is highly efficient in workload parallelization and
load balancing. A lock-free work-stealing algorithm is employed to balance the workload, i.e., the iterations to
work on, across multiple computational cores. It attempts to exploit the maximum degree of concurrency
available in the loop computation, and has a very low synchronization overhead.
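To illustrate why dynamic load balancing helps, the sketch below distributes loop iterations through a shared atomic counter: each worker repeatedly claims the next small batch of iterations until none remain. This is a simpler technique than the lock-free work stealing described above and is not the HetCompute implementation; the helper dynamic_pfor and its grain parameter are hypothetical. Like work stealing, however, it lets a worker that finishes early pick up more iterations instead of idling.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Hypothetical helper: workers repeatedly claim the next 'grain' iterations
// from a shared atomic counter until the range [0, n) is exhausted.
template <typename Fn>
void dynamic_pfor(std::size_t n, std::size_t grain, std::size_t workers, Fn fn)
{
    assert(grain > 0 && workers > 0);
    std::atomic<std::size_t> next{0};
    auto worker = [&] {
        for (;;)
        {
            const std::size_t begin = next.fetch_add(grain); // claim a batch
            if (begin >= n)
                return; // nothing left to claim
            const std::size_t end = begin + grain < n ? begin + grain : n;
            for (std::size_t i = begin; i < end; ++i)
                fn(i);
        }
    };
    std::vector<std::thread> pool;
    for (std::size_t w = 0; w < workers; ++w)
        pool.emplace_back(worker);
    for (auto& t : pool)
        t.join();
}
```

The grain size trades scheduling overhead against balance: a small grain means more atomic operations, while a large grain leaves fewer batches to redistribute.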
The API also takes two optional parameters: stride and tuner. Pattern tuners are covered in Tuner. The
stride parameter represents the step size of the incremental iterator, and has a default value of one. For
example, the parallel version of the following code snippet
for(size_t i = 0; i < vin.size(); i += 2)
vin[i] = 2 * i;

is the following statement.


hetcompute::pfor_each(size_t(0), 2, vin.size(), [&vin](size_t i){ vin[i] = 2 * i; });

The parallel iteration pattern can be nested. However, it is usually sufficient to decorate only the outermost
loop with hetcompute::pfor_each, provided the outermost loop has enough iterations to keep all cores
busy.
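The effect of a stride can be described with two small formulas: a loop from first to last with step stride runs ceil((last - first) / stride) iterations, and iteration number k visits index first + k * stride. The standard C++ sketch below spells this out; the helper names are hypothetical and the code is serial, but because each iteration number maps to its index independently, the iterations can be distributed freely.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Number of iterations of: for (i = first; i < last; i += stride)
std::size_t strided_count(std::size_t first, std::size_t last, std::size_t stride)
{
    assert(stride > 0);
    return first < last ? (last - first + stride - 1) / stride : 0;
}

// Index visited by the k-th iteration (k counts from zero).
std::size_t strided_index(std::size_t first, std::size_t stride, std::size_t k)
{
    return first + k * stride;
}

// Applies vin[i] = 2 * i to every second element by enumerating iteration
// numbers k instead of indices i. Each k is independent of the others.
void double_even_slots(std::vector<std::size_t>& vin)
{
    const std::size_t count = strided_count(0, vin.size(), 2);
    for (std::size_t k = 0; k < count; ++k)
    {
        const std::size_t i = strided_index(0, 2, k);
        vin[i] = 2 * i;
    }
}
```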

5.2.2.2 Parallel Transformation

The parallel transformation pattern, hetcompute::ptransform, has three versions. The first two
versions apply a given function object to a range and store the result in another range. They are essentially
parallel versions of std::transform (one applies a unary function and the other a binary
function). The third version, similar to hetcompute::pfor_each, performs an in-place transformation.
The major difference between the two patterns is that hetcompute::ptransform passes the
dereferenced input iterator to the function object, whereas hetcompute::pfor_each passes the input
iterator directly to the function object. Therefore, the input iterator passed to
hetcompute::ptransform is restricted to random access iterators. Integral iterators are not allowed
because they cannot be dereferenced.
The parallel transformation pattern is useful when the programmer wishes to directly manipulate the
dereferenced input iterator in the function object. For example, the following code performs a binary
operation on different segments of the input range, and stores the result in the output range.
#include <functional>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Initialize input vector: vin[i] = i
    std::vector<int> vin(1024);
    int j = 0;
    for (auto& i : vin)
        i = j++;

    // vout[i] = vin[i] + vin[i+1]
    std::vector<int> vout(vin.size() - 1);
    hetcompute::ptransform(begin(vin),
                           begin(vin) + vout.size(), // end of the first input range
                           begin(vin) + 1,           // start of the second input range
                           begin(vout),              // start of the output range
                           std::plus<int>());

    hetcompute::runtime::shutdown();
    return 0;
}
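For reference, the serial computation that the example above parallelizes can be written with std::transform, next to an index-based version in the style that hetcompute::pfor_each expects. Both functions below are plain standard C++ with hypothetical names; they make the difference between the two calling conventions visible: the first receives the elements themselves (dereferenced iterators), the second receives an index and looks the elements up.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Element-passing style (serial reference): vout[i] = vin[i] + vin[i + 1].
std::vector<int> adjacent_sums(const std::vector<int>& vin)
{
    assert(!vin.empty());
    std::vector<int> vout(vin.size() - 1);
    const auto count = static_cast<std::ptrdiff_t>(vout.size());
    std::transform(vin.begin(),
                   vin.begin() + count, // end of the first input range
                   vin.begin() + 1,     // start of the second input range
                   vout.begin(),        // start of the output range
                   std::plus<int>());
    return vout;
}

// Index-passing style: the loop body receives i and indexes the containers.
std::vector<int> adjacent_sums_by_index(const std::vector<int>& vin)
{
    assert(!vin.empty());
    std::vector<int> vout(vin.size() - 1);
    for (std::size_t i = 0; i < vout.size(); ++i)
        vout[i] = vin[i] + vin[i + 1];
    return vout;
}
```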

5.2.3 Parallel Reduction


Programmers often need to compute a reduction over a range: a sum, min, or max, or something as complex as
the multiplication of a chain of matrices. The parallel reduction pattern, hetcompute::preduce, processes
a list of elements using a join function object and computes a return value. The join function object is applied
to two elements and produces a result that can be combined, again using the join function, with the remaining
elements in the range. Parallelizing a reduction in HetCompute is simple: the programmer only needs
to pass the input container to work on, or a pair of random access iterators specifying the range. An initial
value (the identity element) also needs to be specified for the reduction. The binary operation defining the
reduction is expected to be associative, but not necessarily commutative. Putting it all together, the
following example demonstrates a parallel sum implementation in HetCompute:
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<int> vin(1024, 0);
    int val = 1;
    for (auto& i : vin)
    {
        i = val++;
    }

    const int identity = 0;
    // parallel_sum = 1 + 2 + 3 + ... + 1024
    int parallel_sum = hetcompute::preduce(vin, identity, std::plus<int>());

    // check result
    int serial_sum = std::accumulate(vin.begin(), vin.end(), 0);
    if (parallel_sum != serial_sum)
    {
        std::cout << "Parallel reduction failed!" << std::endl;
    }
    hetcompute::runtime::shutdown();
    return 0;
}

If the programmer passes a pair of dereferenceable input iterators, the example becomes:
int parallel_sum = hetcompute::preduce(vin.begin(), vin.end(), identity,
std::plus<int>());

Or, with pointers into a 1024-element array arr:
int parallel_sum = hetcompute::preduce(arr, arr + 1024, identity, std::plus<int>());

However, if the programmer passes a pair of integral input iterators, which are not dereferenceable, the
programmer needs another function object to capture the input container and to define the accumulation
operation for a subrange starting with some initial value. This is necessary because the join function
cannot dereference integral iterators. The parallel sum example then becomes the following:
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<int> vin(1024, 0);
    int val = 1;
    for (auto& i : vin)
    {
        i = val++;
    }

    const int identity = 0;
    // parallel_sum = 1 + 2 + 3 + ... + 1024
    int parallel_sum = hetcompute::preduce(size_t(0),
                                           vin.size(),
                                           identity,
                                           // aggregate subrange
                                           [&vin](size_t f, size_t l, int& init) {
                                               for (size_t k = f; k < l; ++k)
                                               {
                                                   init += vin[k];
                                               }
                                           },
                                           // join intermediate results
                                           std::plus<int>());

    // check result
    int serial_sum = std::accumulate(vin.begin(), vin.end(), 0);
    if (parallel_sum != serial_sum)
    {
        std::cout << "Parallel reduction failed!" << std::endl;
    }

    hetcompute::runtime::shutdown();

    return 0;
}
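The requirement stated earlier, that the reduction operation be associative but not necessarily commutative, can be illustrated with a serial sketch in standard C++. The function chunked_reduce is a hypothetical name and does not reflect how hetcompute::preduce partitions its input. It reduces contiguous chunks independently and then joins the partial results from left to right, so only the grouping of operands changes, never their order.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Reduces 'in' in 'chunks' contiguous pieces, then joins the partial results
// from left to right. 'join' must be associative; it need not be commutative.
template <typename T, typename Join>
T chunked_reduce(const std::vector<T>& in, const T& identity, Join join, std::size_t chunks)
{
    assert(chunks > 0);
    const std::size_t n = in.size();
    std::vector<T> partial(chunks, identity);
    for (std::size_t c = 0; c < chunks; ++c) // each chunk could run in parallel
    {
        const std::size_t begin = c * n / chunks;
        const std::size_t end = (c + 1) * n / chunks;
        for (std::size_t i = begin; i < end; ++i)
            partial[c] = join(partial[c], in[i]);
    }
    T result = identity;
    for (const T& p : partial)
        result = join(result, p);
    return result;
}
```

String concatenation is a convenient test case: it is associative but not commutative, and the chunked result matches the serial one for any number of chunks. An operation such as subtraction, which is not associative, would give results that depend on the chunking.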

Internally, an efficient work stealing algorithm has been implemented to parallelize the reduction
computation. The algorithm builds a reduction tree and first performs accumulation of subranges in a
top-down manner. The intermediate values are then joined together bottom-up to obtain a final result.
Because of the work stealing implementation, programmers can put some in-place transformation
computation ahead of reduction to build more complex algorithms. In this sense,
hetcompute::preduce can be viewed as the combination of the parallel for loop pattern and the
parallel reduction pattern. The in-place transformations are completed during the top-down accumulation
process, as exhibited by the following code snippet.
#include <functional>
#include <iostream>
#include <numeric>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector: vin = 1, 2, ..., 1024
    std::vector<int> vin(1024, 0);
    int val = 1;
    for (auto& i : vin)
    {
        i = val++;
    }

    const int identity = 0;
    // parallel_sum = 2 + 4 + 6 + ... + 2048
    int parallel_sum = hetcompute::preduce(size_t(0),
                                           vin.size(),
                                           identity,
                                           // aggregate subrange
                                           [&vin](size_t f, size_t l, int& init) {
                                               for (size_t k = f; k < l; ++k)
                                               {
                                                   // some transformation func applied to vin
                                                   vin[k] *= 2;
                                                   init += vin[k];
                                               }
                                           },
                                           // join intermediate results
                                           std::plus<int>());

    // check result (vin already holds the transformed values at this point)
    int serial_sum = std::accumulate(vin.begin(), vin.end(), 0);
    if (parallel_sum != serial_sum)
    {
        std::cout << "Parallel reduction failed!" << std::endl;
    }

    hetcompute::runtime::shutdown();
    return 0;
}
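
The two phases described above (accumulate subranges, then join the partial results) can be modeled sequentially with the standard library alone. The sketch below is not the HetCompute implementation: chunked_reduce, double_and_sum, and the fixed chunk parameter are hypothetical names introduced for illustration, and a real work-stealing scheduler picks subranges dynamically rather than in fixed-size chunks. It only shows why the accumulate callback starts from the identity and why an in-place transformation can ride along with the accumulation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential model of a reduction over [first, last).
// accumulate(f, l, acc) folds the subrange [f, l) into acc;
// join(a, b) combines two partial results.
template <typename T, typename Accumulate, typename Join>
T chunked_reduce(std::size_t first, std::size_t last, const T& identity,
                 Accumulate accumulate, Join join, std::size_t chunk)
{
    assert(chunk > 0);
    std::vector<T> partials;
    // Phase 1: accumulate each subrange independently, starting from identity.
    for (std::size_t f = first; f < last; f += chunk)
    {
        const std::size_t l = std::min(f + chunk, last);
        T acc = identity;
        accumulate(f, l, acc);
        partials.push_back(acc);
    }
    // Phase 2: join the partial results.
    T result = identity;
    for (const T& p : partials)
    {
        result = join(result, p);
    }
    return result;
}

// Doubles every element in place and returns the sum of the doubled values.
inline int double_and_sum(std::vector<int>& v, std::size_t chunk)
{
    return chunked_reduce<int>(
        0, v.size(), 0,
        [&v](std::size_t f, std::size_t l, int& acc) {
            for (std::size_t k = f; k < l; ++k)
            {
                v[k] *= 2;   // in-place transformation
                acc += v[k]; // accumulation of the transformed value
            }
        },
        [](int a, int b) { return a + b; },
        chunk);
}
```

Because each element is visited exactly once during the accumulation phase, the transformation is applied exactly once regardless of how the range is split into subranges.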


5.2.4 Parallel Scan


Parallel scan is a useful building block for many parallel algorithms. HetCompute implements
hetcompute::pscan_inclusive, a Sklansky-style, in-place parallel prefix operation. Example
applications include stream compaction and sorting. The scan is inclusive because it generates a new range
where each element i is the result of the prefix operation of all elements up to and including i. If the scan
result of each element includes operations on all previous elements, but not the element itself, it is called an
exclusive scan. An exclusive scan can be easily generated from an inclusive scan by shifting the resulting
range right by one element and inserting the identity element at the leftmost place. A
hetcompute::pscan_exclusive API may be provided in a future release.
The most commonly used prefix scan operation is prefix sum, which computes an output range consisting
of all sums of prefixes of some input range. An example of parallel prefix sum is given below (the prefix
scan operation is in-place, that is, vin will include the prefix sum of its original values after execution):
#include <functional>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Initialize input vector: vin[i] = 1
    std::vector<int> vin(1024, 1);

    // After the scan, vin[i] == i + 1
    hetcompute::pscan_inclusive(vin.begin(), vin.end(), std::plus<int>());

    hetcompute::runtime::shutdown();
    return 0;
}
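
The shift that turns an inclusive scan result into an exclusive one can be sketched with the standard library alone. The helper names inclusive_to_exclusive and exclusive_prefix_sum are hypothetical and not part of the HetCompute API, and std::partial_sum stands in here for the inclusive scan step.

```cpp
#include <cstddef>
#include <numeric>
#include <vector>

// Turns an inclusive scan result into an exclusive one, in place:
// every element moves right by one position and the identity element is
// placed at the front. The last inclusive value (the total) is dropped.
template <typename T>
void inclusive_to_exclusive(std::vector<T>& v, const T& identity)
{
    if (v.empty())
    {
        return;
    }
    for (std::size_t i = v.size() - 1; i > 0; --i)
    {
        v[i] = v[i - 1];
    }
    v[0] = identity;
}

// Sequential reference: exclusive prefix sum of a vector of ints.
inline std::vector<int> exclusive_prefix_sum(std::vector<int> v)
{
    std::partial_sum(v.begin(), v.end(), v.begin()); // inclusive scan
    inclusive_to_exclusive(v, 0);                    // 0 is the identity of +
    return v;
}
```

For an input of eight ones, the inclusive scan yields 1, 2, ..., 8 and the shifted result is 0, 1, ..., 7.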

5.2.5 Parallel Divide-and-Conquer


A common parallel pattern arising in various domains is divide-and-conquer. Examples are quicksort and
tree building/traversal. Use hetcompute::pdivide_and_conquer to solve an abstract problem p
by splitting it into subproblems that are solved in parallel. For example, in the Fibonacci problem, p is
simply an integer representing the Fibonacci term to compute in parallel. HetCompute internally uses a
high-performance non-blocking algorithm to solve this problem.
In the following example, the hetcompute::pdivide_and_conquer pattern is used to
calculate a Fibonacci term:
#include <iostream>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Sequential Fibonacci, used below the granularity threshold and for checking
static size_t
fibonacci_s(size_t n)
{
    if (n == 0 || n == 1)
    {
        return n;
    }
    else
    {
        return fibonacci_s(n - 1) + fibonacci_s(n - 2);
    }
}

static const size_t GRANULARITY = 20;

static size_t
fibonacci(size_t n)
{
    return hetcompute::pdivide_and_conquer<size_t, size_t>(
        // Problem is to compute the n-th Fibonacci term
        n,
        // When should an arbitrary Fibonacci term, represented by 'm', be
        // computed sequentially?
        // Note that programmer chooses to compute Fibonacci terms 20 and lower
        // sequentially for best performance.
        [](size_t& m) { return m <= GRANULARITY; },
        // How to compute the term sequentially
        [](size_t& m) { return fibonacci_s(m); },
        // Split problem into independent subproblems
        [](size_t& m) {
            return std::vector<size_t>({ m - 1, m - 2 });
        },
        // Merge solutions to subproblems.
        // Note that the first parameter (size_t, corresponding to the split
        // problem) is unused in this case, but may be useful while merging in
        // other cases.
        [](size_t, std::vector<size_t>& sols) { return sols[0] + sols[1]; });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    size_t n_def = 24;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    size_t out = fibonacci(n);

    if (out != fibonacci_s(n))
    {
        std::cerr << "parallel fibonacci failed\n";
    }
    hetcompute::runtime::shutdown();
    return 0;
}


The Fibonacci example demonstrates one form of hetcompute::pdivide_and_conquer, which
returns a solution. There also exist problems that do not expect a returned solution and/or do not have a
merge stage after the splitting phase. The hetcompute::pdivide_and_conquer pattern supports both
combinations. Refer to Patterns Reference API for the complete reference. The following example shows
quicksort, which has neither a merge phase nor a returned solution.
#include <algorithm>
#include <array>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <utility>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Problem descriptor: the range [begin, end) and its partition point
template <typename Iterator>
struct QuickSort
{
    QuickSort(Iterator _begin, Iterator _end) : begin(_begin), end(_end), middle() {}
    Iterator begin, end, middle;
};

const size_t GRANULARITY = 8192;

template <typename Iterator, typename Compare>
void
quicksort(Iterator begin, Iterator end, Compare cmp)
{
    typedef QuickSort<Iterator> QuickSort;
    typedef typename std::iterator_traits<Iterator>::value_type value_type;
    hetcompute::pdivide_and_conquer(
        // Main problem
        QuickSort(begin, end),
        // When should an arbitrary array, represented by 'q', be sorted
        // sequentially?
        // Note that programmer chooses to sort arrays of size 8192 or smaller
        // sequentially for best performance.
        [&](QuickSort& q) {
            size_t n = std::distance(q.begin, q.end);
            if (n <= GRANULARITY)
            {
                return true;
            }
            // Choice of first element as pivot is arbitrary
            auto pivot = *q.begin;
            // Move the elements that compare before the pivot to the front
            q.middle = std::partition(q.begin, q.end,
                                      [&](const value_type& x) { return cmp(x, pivot); });
            // If middle == begin, elements in [begin, end) are greater than or
            // equal to pivot. We could either find a new pivot or as we do here,
            // just sort sequentially.
            return q.middle == q.begin;
        },
        // Sequential sort used
        [&](QuickSort& q) { std::sort(q.begin, q.end, cmp); },
        // Split problem into two subproblems
        [&](QuickSort& q) {
            std::array<QuickSort, 2> subarrays{ { QuickSort(q.begin, q.middle),
                                                  QuickSort(q.middle, q.end) } };
            return subarrays;
        });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    std::vector<long> input;
    size_t n_def = 1 << 16;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    // Create a random array of integers
    for (size_t i = 0; i < n; i++)
    {
        input.push_back(rand());
    }

    quicksort(input.begin(), input.end(), std::less<long>());

    if (!std::is_sorted(input.begin(), input.end()))
    {
        std::cerr << "parallel quicksorting failed\n";
    }

    hetcompute::runtime::shutdown();

    return 0;
}

The parallel divide-and-conquer pattern, like the other patterns, is built using the basic HetCompute
constructs of tasks and groups. However, the patterns are optimized using knowledge about the pattern
structure and the operations in the runtime to minimize the amount of synchronization, and to avoid other
bookkeeping operations that are needed for more generic use.
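
The roles of the four callbacks in the solution-returning form can be illustrated with a purely sequential model. The template below is a sketch written for this guide, not the library's implementation or signature: divide_and_conquer and fib are hypothetical names, the subproblems are solved one after another instead of in parallel, and the base case is reduced to m <= 1 to keep the example short.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sequential model of the four-callback form:
//   is_base(p)  -> true if p should be solved directly
//   base(p)     -> solution of a base-case problem
//   split(p)    -> container of independent subproblems
//   merge(p, s) -> solution of p built from the subproblem solutions s
template <typename Solution, typename Problem,
          typename IsBase, typename Base, typename Split, typename Merge>
Solution divide_and_conquer(Problem p, IsBase is_base, Base base,
                            Split split, Merge merge)
{
    if (is_base(p))
    {
        return base(p);
    }
    std::vector<Solution> solutions;
    for (Problem& sub : split(p))
    {
        // A parallel runtime would solve these subproblems concurrently.
        solutions.push_back(
            divide_and_conquer<Solution>(sub, is_base, base, split, merge));
    }
    return merge(p, solutions);
}

// Fibonacci expressed with the model above.
inline std::size_t fib(std::size_t n)
{
    return divide_and_conquer<std::size_t>(
        n,
        [](std::size_t& m) { return m <= 1; },
        [](std::size_t& m) { return m; },
        [](std::size_t& m) { return std::vector<std::size_t>{ m - 1, m - 2 }; },
        [](std::size_t&, std::vector<std::size_t>& s) { return s[0] + s[1]; });
}
```

The subproblems produced by split must be independent of each other; that independence is what allows a runtime to solve them concurrently and merge the results afterwards.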

5.2.6 Parallel Sorting


HetCompute provides a parallel sorting utility, hetcompute::psort. It performs an
unstable, in-place comparison sort of an input range. The programmer may either provide a customized
compare function, or use the default compare function (std::less<T>(), where T is the value type of
the iterators). An example of HetCompute parallel sort is listed below.
#include <iostream>
#include <random>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    std::vector<long> input;
    size_t n_def = 20;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    std::random_device rd;
    std::mt19937 generator(rd());
    std::uniform_int_distribution<long> dis;
    // The input holds 2^n elements
    const size_t num_ints = 1ULL << n;
    // Create a random array of integers
    for (size_t i = 0; i < num_ints; i++)
    {
        input.push_back(dis(generator));
    }

    hetcompute::psort(input.begin(), input.end());

    if (!std::is_sorted(input.begin(), input.end()))
    {
        std::cerr << "psorting failed\n";
    }

    hetcompute::runtime::shutdown();
    return 0;
}


5.2.7 Advanced Topics for Patterns


HetCompute patterns are more than a collection of convenient methods. Programmers can create
HetCompute tasks from the patterns and launch them in the asynchronous runtime environment.
HetCompute also offers performance tuners to fine-tune the patterns using knowledge of the system and
the algorithm. This section covers some advanced topics to further facilitate the understanding of
HetCompute patterns. Note that the topics covered in this section apply to all the aforementioned patterns.
The semantics of the HetCompute pipeline pattern differ considerably from those of the other patterns.
Therefore, its advanced topics, such as asynchronous launch, are covered separately in Pipeline.

5.2.7.1 Pattern Object

Programmers can create a pattern object and invoke the pattern by using the run method or the () operator
with arguments, as illustrated in the following code example.
#include <vector>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<size_t> vin(1024, 0);

    // declare function object to be applied
    auto func = [&vin](size_t i) { vin[i] = 2 * i; };

    auto pfor = hetcompute::pattern::create_pfor_each(func);
    pfor.run(size_t(0), vin.size());

    hetcompute::runtime::shutdown();
    return 0;
}


Patterns are blocking by default, meaning that the caller does not proceed until the pattern call returns.
This is sometimes undesirable: the programmer might want patterns to run asynchronously, like other
HetCompute tasks. For this purpose, every pattern in HetCompute has a corresponding asynchronous API that
does not wait for termination. These APIs are named after the original patterns with the suffix _async. In
HetCompute, the most common way to launch a pattern asynchronously is to (1) create a pattern object and (2)
use hetcompute::create_task or hetcompute::launch to invoke the pattern. This way,
programmers can apply the rich semantics defined for HetCompute tasks and groups (that is, dependencies,
wait_for, finish_after, etc.) to patterns.
#include <vector>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // initialize the input vector
    std::vector<size_t> vin(1024, 0);

    // declare function object to be applied
    auto func = [&vin](size_t i) { vin[i] = 2 * i; };
    auto pfor = hetcompute::pattern::create_pfor_each(func);

    // create a pfor task and launch!
    auto t1 = hetcompute::create_task(pfor, size_t(0), vin.size());
    t1->launch();
    t1->wait_for();

    // launch pfor directly
    auto t2 = hetcompute::launch(pfor, size_t(0), vin.size());
    t2->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

5.2.7.2 Tuner

The default pattern implementations should cover the majority of use cases. However, no single
implementation is the best fit for all workload types. For that reason, HetCompute offers programmers a
collection of commonly used algorithm parameters that serve as performance tuning knobs (tuners). In
particular, the programmer can declare a HetCompute tuner object, set its properties up front, and pass the
tuner object to the pattern API for the purpose of performance tuning. This is illustrated by the following
example:
// declare hetcompute::tuner object and use the static chunking algorithm for parallelization.
hetcompute::pattern::tuner t;
t.set_static();

// start pfor
hetcompute::pfor_each(size_t(0), vin.size(), [&vin](size_t i) { vin[i] = 2 * i; },
t);

Performance settings can be chained in the tuner declaration.

// create a pfor object
auto func = [&vin](size_t i) { vin[i] = 2 * i; };
auto pfor = hetcompute::pattern::create_pfor_each(func);

// start pfor
pfor.run(size_t(0),
         vin.size(),
         hetcompute::pattern::tuner()
             .set_max_doc(8)      // Use 8 tasks for load balancing
             .set_chunk_size(16)  // The minimum stealing granularity is 16
);

Some settings have no effect on a given pattern because the pattern has nothing that the setting maps to. The
current HetCompute release focuses on performance tuning for hetcompute::pfor_each. The settings most
useful for tuning hetcompute::pfor_each are listed below, together with their usage.


Setting Name      Explanation

set_max_doc       Maximum degree of concurrency, which defines the maximum number of tasks
                  launched internally for load balancing. Defaults to the number of available
                  cores. A higher number indicates oversubscription, which might be beneficial
                  in certain usage scenarios.

set_chunk_size    Work stealing granularity. Defaults to one (each task tries to steal work
                  after finishing one iteration). For long loops with a tiny computational
                  kernel, the default setting is problematic because of the large
                  synchronization overhead. If the default setting does not meet the
                  performance requirement, increase the chunk size gradually until the best
                  setting is located.

set_static        Use the simple static chunking algorithm for parallelization.

set_dynamic       Use the dynamic work-stealing algorithm for parallelization (default).

set_serial        Use serial execution, which is convenient for performance comparison and
                  for calculating speedup.
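
To make the difference between static chunking and work stealing more concrete, the sketch below computes one possible static partition of an iteration range. It is only an assumption that set_static() selects a scheme of this kind; the function name static_chunks and the near-equal block sizes are illustrative choices, not documented HetCompute behavior.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Splits [first, last) into at most `doc` contiguous blocks whose sizes
// differ by at most one. Each block would be handed to one task up front,
// with no rebalancing afterwards (in contrast to work stealing).
inline std::vector<std::pair<std::size_t, std::size_t>>
static_chunks(std::size_t first, std::size_t last, std::size_t doc)
{
    assert(first <= last && doc > 0);
    const std::size_t n = last - first;
    const std::size_t base = n / doc;
    const std::size_t extra = n % doc; // the first `extra` blocks get one more
    std::vector<std::pair<std::size_t, std::size_t>> blocks;
    std::size_t begin = first;
    for (std::size_t b = 0; b < doc && begin < last; ++b)
    {
        const std::size_t len = base + (b < extra ? 1 : 0);
        blocks.emplace_back(begin, begin + len);
        begin += len;
    }
    return blocks;
}
```

A static partition like this has almost no scheduling overhead but cannot adapt when iterations take unequal time, which is the situation the dynamic setting and the chunk size knob are meant to address.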

5.2.8 Pipeline
The HetCompute Pipeline pattern supports the pipeline parallel programming model, which is often used in
streaming applications.
The HetCompute Pipeline API allows the programmer to describe a linear chain of processing stages such
that the output of each stage is the input of the next. The programmer associates a C++ stage function with
each stage, and can specify a basic C++ type or a user-defined data-type for handing over data between
stages. Once launched, each Pipeline stage repeatedly executes its stage function over a data stream. A
successor stage starts executing on one data unit after its predecessor stage finishes processing the same
unit. While the stages in the pipeline process any one data unit sequentially (from the first stage to the
last), they can process different data units at the same time.
Note that in contrast to a typical pipeline model, where all the stages always execute exactly the same
number of iterations, HetCompute supports stages executing a different number of total iterations in one
Pipeline (by using hetcompute::iteration_rate). Also in contrast to the standard pipeline model,
where the successor stage can start executing immediately after its predecessor finishes one iteration,
HetCompute Pipeline supports iteration delays between stages so that the predecessor can run at least n
iterations ahead of its immediate successor (by using hetcompute::iteration_lag).
HetCompute Pipeline is compatible with the HetCompute asynchronous semantics so that it can be
launched and waited on, just like any other task.
Algorithms for streaming applications can be expected to map to Pipeline in a straightforward manner.


5.2.8.1 Overview

The HetCompute Pipeline API is designed with the following intents:

• The Pipeline is created dynamically by a C++ program as a HetCompute pipeline object. Pipeline
stages should be added sequentially to the pipeline prior to launching, that is, once launched, the
Pipeline can no longer be modified.
• The Pipeline allows arbitrary C++ code in a stage function, though the parameter list is dictated by
the pipeline context (mandatory,
hetcompute::pattern::pipeline<UserData...>::context) and the data packet
(hetcompute::stage_input) that is expected to be handed over between stages.
• The data handed over between stages can be of any type that is copyable (assignable and constructible)
and default constructible.
– One stage iteration can produce at most 1 data unit.
– One stage iteration can consume 0 to n data units, depending on the features of the stage (see
Iteration Rate (hetcompute::iteration_rate)).
• The HetCompute pipeline can control the memory footprint of the stages.
– By default, the pipeline runs in a special execution mode
(hetcompute::pattern::pipeline::enable_sliding_window) which favors pipeline throughput:
instead of allowing a stage to proceed freely with many future iterations and storing the produced
data, the Pipeline schedules the successor stage to consume the data as soon as possible so as to save
the memory space for storage. Thus, a pipeline stage can specify a fixed amount of memory as a
circular buffer to store the produced data. HetCompute calls this circular buffer the Sliding Window.
Note that this execution mode is a property of the pipeline rather than a stage feature.
– A pipeline can also run more freely
(hetcompute::pattern::pipeline::disable_sliding_window) if no sliding window is used in any
of its stages. This execution mode may lead to a higher level of parallelism at runtime, but gives no
control over the memory footprint. It is mostly useful for performance tuning when the level of
parallelism is critical and memory footprint control is not.
– The pipeline internally uses different buffer data structures for inter-stage data transfers under the
two modes. A static circular buffer is used when sliding window mode is enabled, while a dynamic
pool bucket buffer is used when sliding window mode is disabled. The implementation details of the
inter-stage buffer are transparent to the user; they are mentioned here as a side note for advanced
users who need to reason about the performance and memory usage of their applications under the
different modes.
• The user can set the following parameters for each Pipeline stage:
– Stage Type: the execution order of the iterations for a stage
◦ Serial Stage (hetcompute::serial_stage): runs every iteration sequentially.
◦ Parallel Stage (hetcompute::parallel_stage): can run multiple consecutive iterations
concurrently.
· Degree of Concurrency (doc): number of consecutive iterations that can run in parallel.
· A parallel stage with doc = 1 is equivalent to a serial stage.
– Iteration Lag (hetcompute::iteration_lag): minimum number of iterations that a stage
should run ahead of its successor (should be ≥ 0).
– Iteration Rate (hetcompute::iteration_rate): rate of iterations between two consecutive
stages
◦ Each stage can have different iteration numbers
◦ Iteration lag will be scaled up according to the iteration rate
– Sliding Window Size (hetcompute::sliding_window_size): the unit size of the circular
buffer between stages in a Pipeline so as to control the memory footprint
◦ HetCompute Pipeline supports a sanity check
(hetcompute::pattern::pipeline::is_valid()) before launching the pipeline.
An exception will be raised if it fails.
– Pipeline Context (hetcompute::pattern::pipeline::context): the information that
is available to all user-defined stage functions for a specific pipeline:
◦ Contains user-defined pipeline-specific data
◦ Contains pipeline execution information: stage id, stage iteration id, etc.
◦ Provides control to the pipeline when possible, that is, stops the pipeline on the fly.
• The default pipeline pattern implementation should cover most basic use cases. However,
no single implementation is the best fit for all pipeline workload types. User knowledge of the
application can sometimes help the pattern to perform better. The user can also use the HetCompute
tuners for performance tuning (Tuner):
– Stage Tuner: provided when adding a pipeline stage. Currently, HetCompute pipeline supports
stage iteration chunking, which lets the user tweak the granularity of work performed together in the
scheduler. Iteration chunking is also known as iteration fusion: multiple iterations are
executed sequentially as one big entity to avoid scheduling overhead between iterations. This is very
helpful if the pipeline stage has a very light workload with many iterations to perform. On the other
hand, iteration chunking can reduce the level of parallelism due to the serialization. For parallel
stages, the chunk size should not be larger than the degree of concurrency. Otherwise, it will be
ignored by the pattern scheduler. For serial stages, the chunk size can be larger than 1. This helps
to reduce scheduling overhead; however, the user needs to make sure that chunking with a specific
size is safe for the correctness of the application. Chunking of serial stages is allowed mainly for
performance tuning of some special cases, although it violates the philosophy of pattern tuning, which
is that tuners are only supposed to be used for performance and never affect the correctness of the
pattern.
– Pipeline Tuner: provided when launching the pipeline. Currently, HetCompute pipeline supports
the maximum degree of concurrency, that is, the number of initial concurrent tasks that are launched
for pipeline scheduling. This helps if the first stage is a parallel stage. The user may want to control
the level of parallelism of the pipeline because other applications may occupy some of the
computational resources in the system.
Here is an example for the pipeline and stage tuner usage:
#include <iostream>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, with pipeline context data of type size_t.
    hetcompute::pattern::pipeline<size_t> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<size_t>::context context;

    // Add a parallel stage with degree of concurrency of 8 and a stage tuner
    // that chunks 2 iterations together.
    p.add_stage(hetcompute::parallel_stage(8),
                hetcompute::pattern::tuner().set_chunk_size(2),
                [](context& ctx) {
                    size_t iter = ctx.get_iter_id();
                    size_t data = *ctx.get_data();
                    // some usage of iter and data here
                    HETCOMPUTE_ILOG("iter: %zu, data: %zu", iter, data);
                });

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        // size_t iter = ctx.get_iter_id();
        // size_t data = *ctx.get_data();
        auto dp = ctx.get_data();
        *dp = *dp + 1;
        // some usage of iter and data here
    });

    // Define the context data.
    size_t num = 0;

    // Run the pipeline with 10 iterations with tuner.
    p.run(&num, 10, hetcompute::pattern::tuner().set_max_doc(4));

    std::cout << "pipeline runs " << num << " iters" << std::endl;

    hetcompute::runtime::shutdown();
    return 0;
}
Note that tuners are only suggestions from the user to the scheduler. It is up to the implementation of
the scheduler whether to respect or ignore the tuning hints.

5.2.8.2 HetCompute Pipeline Example

This section demonstrates how to express a simple video processing application using the HetCompute
Pipeline pattern.
The video processing application (as shown in Figure 5-4) contains three stages:

• Stage 0: Read the video stream frame-by-frame (sequentially).
• Stage 1: Process the frames.
– No intra-stage data dependency, that is, processing the current frame does not require previously
processed frames. This qualifies the stage to be "parallel".
– Inter-stage dependency: processing the current frame refers to the previous two original
frames. Note that the inter-stage data dependency does not disqualify
the stage from being parallel. This dependency can be ensured by setting the stage lag to 2.
• Stage 2: Save every other frame to the output file.

Figure 5-4 HetCompute Pipeline Example

The application expressed as a HetCompute Pipeline:

// define the FileInfo struct
// will use this type for the pipeline-specific data
typedef struct {
    File* InputFile;      // Input video file
    File* OutputFileOdd;  // Output video file for the odd frames
    File* OutputFileEven; // Output video file for the even frames
    size_t num_frames;    // Number of total frames in the input file
} FileInfo;

// Create pipeline with pipeline-specific data (context data) of type FileInfo
hetcompute::pattern::pipeline<FileInfo> p;

// alias to the pipeline context
using context = hetcompute::pattern::pipeline<FileInfo>::context;

// define stage function lambdas
auto read_frame_from_stream = [](context& ctx) {
    // read frame ctx.get_iter_id() from ctx.get_data()->InputFile;
    return ctx.get_iter_id();
};

auto process_one_frame = [](context& ctx, hetcompute::stage_input<size_t>& in) {
    // process frame in[0];
};

auto save_every_other_frame = [](context& ctx) {
    // save frame ctx.get_iter_id() * 2 to ctx.get_data()->OutputFileEven;
    // save frame ctx.get_iter_id() * 2 + 1 to ctx.get_data()->OutputFileOdd;
};

// Add the first stage to read the frames from the video stream
p.add_stage(hetcompute::serial_stage(),          // serial stage
            hetcompute::sliding_window_size(16), // use a sliding window for 16 frames
            read_frame_from_stream);

// Add the second parallel stage to process the frames
p.add_stage(hetcompute::parallel_stage(4),       // parallel stage and doc = 4
            hetcompute::iteration_lag(2),        // iteration delay of 2
            hetcompute::sliding_window_size(16), // use a sliding window for 16 frames
            process_one_frame);

// Add the last serial stage to save every other frame back to a new output video file
p.add_stage(hetcompute::serial_stage(),          // serial stage
            hetcompute::iteration_rate(2, 1),    // Every 2 iterations in the 2nd stage map to 1 here
            save_every_other_frame);

// define the pipeline-specific data
FileInfo finfo;
// initialize the content for the pipeline-specific data
init_file_info(&finfo);
// launch the pipeline and process finfo.num_frames frames
// control the memory footprint of the pipeline execution
p.enable_sliding_window();
p.run(&finfo, finfo.num_frames);

5.2.8.3 HetCompute Pipeline Details


5.2.8.3.1 Stage Function

• The stage function of a HetCompute Pipeline can be any of the following user-defined C++ entities:
– lambda expression
– callable object
– function pointer
• The return type of the stage function:
– void means no data needs to be passed over to the next stage;
– Non-void means one object of the return type will be handed over to the next stage after each
iteration.
– The last stage in the pipeline should always return void
• The parameter list of the stage function:
– The 1st parameter of the stage function is mandatory. It should be a reference to the context type
of the pipeline the stage belongs to (hetcompute::pattern::pipeline::context).
This makes it possible for the user-defined stage function to get access to the pipeline state.
– The 2nd parameter is optional. When needed, the 2nd parameter of a pipeline stage function has to
be a reference to hetcompute::stage_input<type>, here type should match the return
type of the stage function for the predecessor stage.
– The stage function for the first stage should always take one parameter since it does not have a
predecessor stage.
– For any other stages, the stage function takes one parameter if their predecessor stages return void;
otherwise, it takes two parameters.
– Before launching the pipeline, the HetCompute pipeline sanity check verifies that the return type of
the predecessor stage matches the stage_input type of the successor stage.
An exception will be raised if the types do not match.
Code snippet for defining stage function in HetCompute:
// alias to the context that belongs to a pipeline without specific data
using context = hetcompute::pattern::pipeline<>::context;

// stage function as a function pointer
// @param context& pipeline context
// @return size_t
size_t s0(context& ctx) {
    // do something for iteration ctx.get_iter_id();
    return ctx.get_iter_id();
}

// stage function as a lambda expression
// @param context& pipeline context
// @param hetcompute::stage_input<size_t>& data handed over by the predecessor stage
// @return void
auto s1 = [](context& ctx, hetcompute::stage_input<size_t>& in) {
    // do something for iteration ctx.get_iter_id() and use the data in[0];
};

// stage function as a functor
class S2 {
public:
    // stage function using a callable object
    // @param context& pipeline context
    // @return void
    void operator()(context& ctx) {
        // 2nd parameter is not needed here because the previous stage returns void
        // do something for iteration ctx.get_iter_id()
    }
};

S2 s2;

// define and run the pipeline
void foo() {
    // define the pipeline object
    hetcompute::pattern::pipeline<> p;

    // add the first stage, where ... are other possible stage features, such as
    // lag, rate, sliding window size
    p.add_stage(hetcompute::serial_stage(), ..., s0);

    // add the second stage, where ... are other possible stage features, such as
    // lag, rate, sliding window size
    p.add_stage(hetcompute::parallel_stage(8), ..., s1);

    // add the third stage, where ... are other possible stage features, such as
    // lag, rate, sliding window size
    p.add_stage(hetcompute::serial_stage(), ..., s2);

    // launch the pipeline
    ...
}

5.2.8.3.2 Iteration Rate (hetcompute::iteration_rate):

Rate of iterations between two consecutive stages.

• Allows different Pipeline stages to run a different number of iterations.
• Iteration Rate is always defined in the successor stage as
hetcompute::iteration_rate(r1, r2) for a pair of consecutive stages. Here r1/r2 is a
positive rational number.
• HetCompute uses the simplest form of r1/r2 (both values divided by their greatest common factor),
that is, (4 : 8) → (1 : 2), (21 : 9) → (7 : 3).
• The default value for iteration rate is (1, 1).
• Execution Semantics:
– Stage s1 is followed by s2 and the iteration rate between them is (r1, r2).
– Each iteration of s1 has the weight of r2, while each iteration of s2 has the weight of r1.
– Banking analogy for the execution semantics of the HetCompute Pipeline stage iteration rate:
◦ stage s1 puts r2 coins into the bank when it finishes one iteration
◦ stage s2 needs at least r1 coins in the bank and takes r1 coins out once it is launched
– Mathematical definition: The i-th iteration of s2 cannot be launched until s1 finishes its
⌈i ∗ r1/r2⌉-th iteration.
– Only specify the number of iterations for the FIRST stage when launching the pipeline.
Code snippet for defining stage iteration rate in HetCompute:
// stage s1 is followed by s2
// iteration rate between s1 and s2 is (r1, r2)
p.add_stage(..., stage1_body);

p.add_stage(..., hetcompute::iteration_rate(r1, r2), stage2_body);
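The reduction to simplest form and the launch condition described above can be sketched in plain C++. This is an illustration only, not HetCompute code: the helper names simplify_rate and required_predecessor_iters are hypothetical, and iterations are assumed to be counted from 1.

```cpp
#include <cassert>
#include <cstddef>
#include <numeric>
#include <utility>

// Reduce (r1 : r2) to its simplest form, e.g. (4 : 8) -> (1 : 2).
// Hypothetical helper, not part of the HetCompute API.
inline std::pair<std::size_t, std::size_t> simplify_rate(std::size_t r1, std::size_t r2)
{
    assert(r1 > 0 && r2 > 0);
    const std::size_t g = std::gcd(r1, r2); // requires C++17
    return {r1 / g, r2 / g};
}

// Number of s1 iterations that must have finished before the i-th iteration
// (1-based) of s2 may launch: ceil(i * r1 / r2).
inline std::size_t required_predecessor_iters(std::size_t i, std::size_t r1, std::size_t r2)
{
    assert(r1 > 0 && r2 > 0);
    return (i * r1 + r2 - 1) / r2; // integer ceiling division
}
```

With a rate of (1, 2), for example, each s1 iteration deposits two coins and each s2 iteration withdraws one, so one finished s1 iteration is enough to launch the first two iterations of s2.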

5.2.8.3.3 Iteration Lag (hetcompute::iteration_lag):

Minimum number of iterations that a stage runs ahead of its successor (lag ≥ 0)

• Iteration Lag is always described in the successor stage for a pair of consecutive stages.
• Iteration Lag is scaled according to the stage rates (s1 is followed by s2), that is, L iterations in s2
correspond to ⌈L ∗ r1/r2⌉ iterations in s1.
• The default value for iteration lag is 0
• Execution Semantics:
– At any time, if i1 and i2 are the numbers of iterations that stages s1 and s2 have finished, the following
inequality holds: i1 ∗ r2 ≥ (i2 + L) ∗ r1
Code snippet for defining stage iteration lag in HetCompute:
// stage s1 is followed by s2
// iteration rate between s1 and s2 is (r1, r2)
// the lag between them is L
p.add_stage(..., stage1_body);
p.add_stage(..., hetcompute::iteration_lag(L),
hetcompute::iteration_rate(r1, r2), stage2_body);
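The scaled lag and the lag inequality above can be written as two small checks. This is a sketch in plain C++ under the definitions given in this section; scaled_lag and lag_constraint_holds are hypothetical names, not HetCompute functions.

```cpp
#include <cassert>
#include <cstddef>

// Lag of L iterations in s2 expressed in s1 iterations: ceil(L * r1 / r2).
inline std::size_t scaled_lag(std::size_t L, std::size_t r1, std::size_t r2)
{
    assert(r1 > 0 && r2 > 0);
    return (L * r1 + r2 - 1) / r2;
}

// Checks the inequality i1 * r2 >= (i2 + L) * r1, where i1 and i2 are the
// numbers of iterations finished by s1 and s2 respectively.
inline bool lag_constraint_holds(std::size_t i1, std::size_t i2, std::size_t L,
                                 std::size_t r1, std::size_t r2)
{
    assert(r1 > 0 && r2 > 0);
    return i1 * r2 >= (i2 + L) * r1;
}
```

For a rate of (1, 1) and a lag of 2, the inequality reduces to i1 ≥ i2 + 2, that is, s1 stays at least two iterations ahead of s2.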

5.2.8.3.4 Sliding Window Size (hetcompute::sliding_window_size):

The unit size of the circular buffer between stages in a Pipeline that is needed to control the memory
footprint.

• Sliding window size is always defined in the current stage
• Controls the memory footprint between stages and favors pushing the execution towards the last
stage.
• If not specified, HetCompute assumes the stage does not use a sliding window
• Sizing the stage sliding window
– In order to be algorithmically correct, the minimum sliding window size depends on the value of
iteration lag, iteration rate and the degree of concurrency (doc) of a stage.
– Minimum size = ⌈L ∗ r1/r2⌉ + max{doc, 8, ⌈r1/r2⌉} in HetCompute
– HetCompute performs a sanity check before launching if the sliding window size is explicitly
specified in a pipeline. An exception will be raised if the check fails.
Code snippet for defining stage sliding window size in HetCompute:
// stage s1 is followed by stage s2
// iteration rate between s1 and s2 is (r1, r2)
// the lag between them is L
// the sliding window size for s1 is sws1
p.add_stage(..., hetcompute::sliding_window_size(sws1), stage1_body);
p.add_stage(..., hetcompute::iteration_lag(L),
hetcompute::iteration_rate(r1, r2), stage2_body);
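The minimum-size formula above can be evaluated ahead of time to pick a valid value for sws1. The following plain C++ sketch assumes that doc is the degree of concurrency of the stage (for example, the argument given to a parallel stage); min_sliding_window_size and ceil_div are hypothetical helper names, not part of HetCompute.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Integer ceiling division: ceil(a / b).
inline std::size_t ceil_div(std::size_t a, std::size_t b)
{
    assert(b > 0);
    return (a + b - 1) / b;
}

// Minimum size = ceil(L * r1 / r2) + max{doc, 8, ceil(r1 / r2)}
// L: iteration lag, (r1, r2): iteration rate, doc: degree of concurrency.
inline std::size_t min_sliding_window_size(std::size_t L, std::size_t r1,
                                           std::size_t r2, std::size_t doc)
{
    return ceil_div(L * r1, r2) + std::max({doc, std::size_t{8}, ceil_div(r1, r2)});
}
```

With the defaults (lag 0, rate (1, 1)) and a serial stage, the formula yields 8, so any explicitly specified window smaller than that would be expected to fail the sanity check.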

5.2.8.3.5 Pipeline Context:

The information that is available to the user-defined stage functions for a specific pipeline

• Contains user-defined pipeline-specific data if needed
• Contains pipeline stage execution information: iteration id, stage id, etc.
• Access to control the pipeline: stop the pipeline to terminate its execution on the fly
• Access to cancel the pipeline: cancel the pipeline execution. Note that
hetcompute::abort_on_cancel() needs to be called in the pipeline user-defined stage
functions for proper pipeline cancellation. A pipeline can be cancelled in any stage; however, the
internal state of a cancelled pipeline is non-deterministic.
Code snippet for defining pipeline context in HetCompute:
// User-defined pipeline-specific data
typedef struct FileInfo{
File* infile;
File* outfile;
} FileInfo;

// define a HetCompute pipeline object with data of type FileInfo
hetcompute::pattern::pipeline<FileInfo> p;

// get the type of the pipeline context
using finfo_pcontext = hetcompute::pattern::pipeline<FileInfo>::context;

// define the stage function, which returns a size_t value and takes no stage input
auto stage_body = [](finfo_pcontext& ctx)->size_t{
foo(ctx.get_data()->infile, ctx.get_iter_id());
bar(ctx.get_data()->outfile, ctx.get_stage_id());
// stop the pipeline on the fly
if(...)
ctx.stop_pipeline();
else if(...)
ctx.cancel_pipeline();
return 0;
};
// add the stage to the pipeline
p.add_stage(..., stage_body);

// define two pipeline-specific data objects
FileInfo info1, info2;

// run the pipeline for file1 with info1
p.run(&info1, 0);
// run the pipeline for file2 with info2
p.run(&info2, 0);

5.2.8.4 Launch the HetCompute pipeline

5.2.8.4.1 Launch with a known total number of pipeline iterations

• The total number of pipeline iterations is known before launching
• Only the total iteration number for the first stage needs to be provided. The iteration numbers for the
following stages are propagated according to the stage iteration rates.
Code snippet for launching a pipeline with number of iterations:
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;

// add stages
...
// synchronous launch with 10 iterations for the first stage
// launch without sliding window
p.disable_sliding_window();
p.run(10);
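How the iteration count of the first stage might carry over to later stages can be estimated with the banking analogy from the Iteration Rate section. The sketch below is plain C++, not HetCompute code, and rests on an assumption: a stage runs as many iterations as its predecessor's deposited coins allow, rounding down. The guide does not state how remainders are handled, and propagate_iterations is a hypothetical name.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// rates[k] is the (r1, r2) pair declared on stage k+1 relative to stage k.
// Returns the estimated iteration count of every stage, first stage included.
inline std::vector<std::size_t>
propagate_iterations(std::size_t first_stage_iters,
                     const std::vector<std::pair<std::size_t, std::size_t>>& rates)
{
    std::vector<std::size_t> iters{first_stage_iters};
    for (const auto& rate : rates)
    {
        assert(rate.first > 0 && rate.second > 0);
        // The predecessor deposits r2 coins per iteration; each successor
        // iteration withdraws r1 coins (assumption: leftover coins are unused).
        iters.push_back(iters.back() * rate.second / rate.first);
    }
    return iters;
}
```

Under this assumption, a three-stage pipeline launched with 10 iterations and rates (1, 2) and (4, 1) would run 10, 20, and 5 iterations in its three stages.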

5.2.8.4.2 Launch without knowing the total number of pipeline iterations

• The total number of pipeline iterations is not known before launching
• Stop the pipeline on the fly during execution (only the first stage can stop a pipeline through the
pipeline context)
• May incur a performance penalty compared to launching with a known number of iterations
Code snippet for launching a pipeline and stopping it on the fly:
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;

using context = hetcompute::pattern::pipeline<>::context;

// add the first stage that will stop the pipeline at some point
p.add_stage(hetcompute::serial_stage(), // stage type
..., // other features of the stage, that is, lag, rate, sliding window size
[](context&ctx) { // stage function
// do something
// when condition is true, stop the pipeline
if(condition) {
ctx.stop_pipeline();
}
});

// add other stages


...
// using the launch type without control on the memory footprint,
p.disable_sliding_window();
// synchronous launch and stop it on the fly
p.run(0);
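The control flow of a pipeline that is launched with 0 iterations and stopped by its first stage can be pictured with a toy driver loop. This is a simplified plain C++ model, not the HetCompute implementation: toy_context and run_first_stage are hypothetical, and the sketch assumes that the iteration that requests the stop still counts as executed.

```cpp
#include <cstddef>

// Minimal stand-in for a pipeline context (hypothetical, for illustration).
class toy_context
{
public:
    std::size_t get_iter_id() const { return iter_id_; }
    void stop_pipeline() { stop_requested_ = true; }
    bool stop_requested() const { return stop_requested_; }
    void set_iter_id(std::size_t id) { iter_id_ = id; }

private:
    std::size_t iter_id_ = 0;
    bool stop_requested_ = false;
};

// Runs the first stage until it requests a stop, or until num_iters
// iterations have run; num_iters == 0 means "not known in advance".
template <typename FirstStage>
std::size_t run_first_stage(FirstStage first_stage, std::size_t num_iters)
{
    toy_context ctx;
    std::size_t executed = 0;
    while (!ctx.stop_requested() && (num_iters == 0 || executed < num_iters))
    {
        ctx.set_iter_id(executed);
        first_stage(ctx);
        ++executed;
    }
    return executed;
}
```

In this model, a first stage that calls stop_pipeline() in iteration 4 ends the loop after five iterations (ids 0 to 4).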

5.2.8.4.3 Synchronous Launch:

Launches the pipeline and blocks until the pipeline finishes execution.
Code snippet for synchronously launching a pipeline:
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;
// add stages
...
// synchronous launch with known number of iterations
p.run(10);
// use HetCompute free function for synchronous launch with known number of iterations
hetcompute::launch(p, 10);
// pipeline finishes execution at this point

5.2.8.4.4 Asynchronous Launch:

Pipeline execution behaves like a normal HetCompute task and supports all the asynchronous task
semantics, that is, launch, wait_for, dependency, finish_after, etc. Because of these asynchronous semantics,
take the same precautions with the life cycle of the data used in the pipeline as with the data used by
any other HetCompute task.
Code snippet for pipeline asynchronous semantics:
// define a HetCompute pipeline object
hetcompute::pattern::pipeline<> p;

// add stages
...
// launch without control of the memory footprint
p.disable_sliding_window();
// asynchronous launch with a known number of iterations
auto t1 = p.create_task(10); // t1 is of type hetcompute::task_ptr<>
// use t1 as a regular HetCompute task: set dependencies, launch, wait_for, finish_after, ...

// launch with control of the memory footprint
p.enable_sliding_window();
// use HetCompute free function for creating an asynchronous task of the pattern
// t2 is of type hetcompute::task_ptr<void(size_t)>
auto t2 = hetcompute::create_task(p);
// bind the argument to the task with a known number of iterations
t2->bind(10);
// use t2 as a regular HetCompute task: set dependencies, launch, wait_for, finish_after, ...

5.2.8.5 Heterogeneous Pipeline (HetCompute Beta Feature)

HetCompute pipeline supports a BETA feature of heterogeneous pipeline execution. A heterogeneous
pipeline can have predefined CPU or GPU pipeline stages so that they can run concurrently on the
heterogeneous computational components to fully exploit the underlying hardware platform. The
heterogeneous pipeline is desirable for pipeline applications with some stages that are data intensive and
most suitable for GPU processing, while the others can run efficiently on CPU cores at the same time.
Here is a simple code example of a heterogeneous pipeline with two CPU stages (S0 and S2) and one GPU
stage (S1) for vector addition in the middle.
1 #include <cmath>
2 #include <assert.h>
3 #include <hetcompute/hetcompute.hh>
4
5 using namespace hetcompute;
6 using namespace std;
7
8 const size_t num_iters = 32;
9 std::vector<hetcompute::buffer_ptr<float>> a_bufs;
10 std::vector<hetcompute::buffer_ptr<float>> b_bufs;
11 std::vector<hetcompute::buffer_ptr<float>> c_bufs;
12
13 void init_bufs(size_t size);
14 void reset_bufs(size_t size);
15 void cleanup_bufs();
16
17 void
18 init_bufs(size_t size)
19 {
20 for (size_t j = 0; j < num_iters; j++)
21 {
22 // Create input buffers.
23 auto buf_a = hetcompute::create_buffer<float>(size);
24 auto buf_b = hetcompute::create_buffer<float>(size);
25 auto buf_c = hetcompute::create_buffer<float>(size);

26
27 a_bufs.push_back(buf_a);
28 b_bufs.push_back(buf_b);
29 c_bufs.push_back(buf_c);
30 }
31 }
32
33 void
34 reset_bufs(size_t size)
35 {
36 // Reset the initial value of the buffers.
37 for (size_t j = 0; j < num_iters; j++)
38 {
39 a_bufs[j].acquire_wi();
40 b_bufs[j].acquire_wi();
41 c_bufs[j].acquire_wi();
42
43 for (size_t i = 0; i < size; ++i)
44 {
45 a_bufs[j][i] = i;
46 b_bufs[j][i] = size - i;
47 c_bufs[j][i] = j + 1;
48 }
49
50 a_bufs[j].release();
51 b_bufs[j].release();
52 c_bufs[j].release();
53 }
54 }
55
56 // Release the memory of the buffers.
57 void
58 cleanup_bufs()
59 {
60 a_bufs.clear();
61 b_bufs.clear();
62 c_bufs.clear();
63 }
64
65 // A GPU kernel string which does the vector addition.
66 #define OCL_KERNEL(name, k) std::string const name##_string = #k
67 OCL_KERNEL(vadd_kernel, __kernel void vadd(__global float* a, __global float* b, __global float* c,
unsigned int size) {
68 unsigned int i = get_global_id(0);
69 if (i < size)
70 c[i] = a[i] + b[i];
71 });
72
73 int
74 main()
75 {
76 hetcompute::runtime::init();
77 const size_t size = 32;
78
79 // Initialize the buffers.
80 init_bufs(size);
81 // Reset the buffer values.
82 reset_bufs(size);
83
84 // Define a hetcompute heterogeneous pipeline and its context.
85 hetcompute::beta::pattern::pipeline<> p;
86 using context = hetcompute::beta::pattern::pipeline<>::context;
87
88 // S0: A CPU stage.
89 // Add a serial cpu stage.
90 p.add_stage(hetcompute::serial_stage(), [](context& ctx) -> size_t {
91 // Return the iteration id.
92 return ctx.get_iter_id();
93 });
94

95 // S1: A GPU stage.
96 // Define a GPU kernel for the GPU stage.
97 std::string kernel_name("vadd");
98 auto gpu_vadd = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
99 hetcompute::in<hetcompute::buffer_ptr<float>>,
100 hetcompute::out<hetcompute::buffer_ptr<float>>,
101 unsigned int>(vadd_kernel_string, kernel_name);
102
103 // Define the return type of the gpu stage before lambda.
104 // Can also use the utility template to get the return type (beta feature):
105 // using gk_tuple_type =
106 // typename hetcompute::beta::call_tuple<1, decltype(gpu_vadd)>::type;
107 using gk_tuple_type =
108 std::tuple<hetcompute::range<1>, hetcompute::buffer_ptr<float>,
hetcompute::buffer_ptr<float>, hetcompute::buffer_ptr<float>, unsigned int>;
109
110 // Add a parallel gpu stage.
111 p.add_stage(hetcompute::parallel_stage(4),
112 hetcompute::iteration_lag(0),
113 hetcompute::iteration_rate(1, 1),
114 // before lambda
115 // (prepare the parameters for the GPU kernel as a return tuple)
116 [&](context& ctx, hetcompute::stage_input<size_t>&) ->
gk_tuple_type {
117
118 // Get the iteration id to index the buffers.
119 size_t y = ctx.get_iter_id();
120
121 auto r = hetcompute::range<1>(size);
122 auto buf_a = a_bufs[y];
123 auto buf_b = b_bufs[y];
124 auto buf_c = c_bufs[y];
125
126 return std::make_tuple(r, buf_a, buf_b, buf_c, size);
127 },
128 // The GPU kernel for the stage.
129 gpu_vadd,
130
131 // after lambda (optional)
132 [&](context&, gk_tuple_type&) -> size_t { return 0; });
133
134 // S2: A CPU stage.
135 // Add a serial cpu stage.
136 p.add_stage(hetcompute::serial_stage(hetcompute::in_order), [&](
context& ctx, hetcompute::stage_input<size_t>&) {
137
138 size_t y = ctx.get_iter_id();
139
140 // Verify the GPU stage outputs.
141 a_bufs[y].acquire_ro();
142 b_bufs[y].acquire_ro();
143 c_bufs[y].acquire_ro();
144
145 for (size_t i = 0; i < size; ++i)
146 {
147 assert(size == c_bufs[y][i]);
148 assert((a_bufs[y][i] + b_bufs[y][i]) == c_bufs[y][i]);
149 if (size != c_bufs[y][i])
150 HETCOMPUTE_ILOG("The output of the GPU stage is incorrect.");
151 if (a_bufs[y][i] + b_bufs[y][i] != c_bufs[y][i])
152 HETCOMPUTE_ILOG("The inputs of the GPU stage are incorrect.");
153 }
154
155 a_bufs[y].release();
156 b_bufs[y].release();
157 c_bufs[y].release();
158 });
159
160 // Launch hetero-pipeline through hetero-pipeline pattern.

161 p.run(num_iters);
162
163 // Clean up the buffers.
164 cleanup_bufs();
165
166 hetcompute::runtime::shutdown();
167 return 0;
168 }
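The verification in stage S2 relies on how reset_bufs fills the inputs: a[i] = i and b[i] = size - i, so every element of the sum equals size. A CPU-only reference of that expectation can be sketched as follows; expected_vadd_output is a hypothetical helper and does not use HetCompute buffers.

```cpp
#include <cstddef>
#include <vector>

// CPU reference for the vector addition in the example above:
// a[i] = i, b[i] = size - i, so c[i] = a[i] + b[i] = size for every i.
inline std::vector<float> expected_vadd_output(std::size_t size)
{
    std::vector<float> c(size);
    for (std::size_t i = 0; i < size; ++i)
    {
        const float a = static_cast<float>(i);
        const float b = static_cast<float>(size - i);
        c[i] = a + b;
    }
    return c;
}
```

This is why the example can compare each c_bufs[y][i] against size directly: the initial value j + 1 written into c_bufs by reset_bufs is overwritten by the GPU stage.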

5.2.8.6 Heterogeneous Pipeline Details

Besides the GPU stage, the heterogeneous pipeline maintains the same features as a regular homogeneous
CPU pipeline (see HetCompute Pipeline Details).

5.2.8.6.1 GPU Pipeline Stages

To define the functionality of a GPU stage, we need three components:

• a before lambda
• a GPU kernel
• an after lambda (optional).
The before lambda (can also be a callable object or a function pointer) connects the stage with its previous
stage and provides the arguments to the GPU kernel for the current stage. The before lambda takes the
same parameters as a CPU stage function, i.e., a reference to
hetcompute::beta::pattern::pipeline::context and a reference to
hetcompute::stage_input<type> (optional), see Stage Function. It should return a
std::tuple of the range for the GPU kernel and the variables that will be fed as the arguments for the
GPU kernel. In the simple example above (lines 94 to 129), the GPU kernel is 1-d and takes three buffer
pointers and one unsigned int as its input arguments. Therefore, the before lambda of the GPU stage
returns a std::tuple<hetcompute::range<1>, hetcompute::buffer_ptr<float>,
hetcompute::buffer_ptr<float>, hetcompute::buffer_ptr<float>, unsigned
int>, with the range as the first element in the tuple followed by the kernel arguments in order (see Basic
Usage of Buffers, Kernels: The Path to Heterogeneity).
The GPU kernel defines the main functionality of the stage. It will be launched as a GPU task and executes
on the GPU component.
The after lambda (optional; can also be a callable object or a function pointer) takes the output of the
GPU kernel and performs post-processing (if necessary) to prepare for the following stages. It takes
two parameters:

• A reference to hetcompute::beta::pattern::pipeline::context, as with regular
pipeline CPU stage functions.
• A reference to the return tuple of the before lambda. Because the tuple contains both the inputs and
outputs of the GPU kernel, this reference makes it possible for the after lambda to access the output
results of the GPU kernel for potential post-processing. The after lambda can return any type, just as
a regular CPU stage function (see Stage Function).
The before lambda and after lambda of a GPU stage synchronize the data between the current GPU stage
and its predecessor/successor stage. In the current design, the synchronization only happens on the CPU
side. Since the GPU kernel takes both its inputs and outputs as arguments, the
std::tuple approach is used to provide the inputs to and get the outputs from the GPU kernel, instead of
passing parameters and returning values as with regular functions.
A GPU pipeline stage has the same stage parameters as a regular CPU stage, such as
hetcompute::serial_stage or hetcompute::parallel_stage,
hetcompute::iteration_rate, hetcompute::iteration_lag, and
hetcompute::sliding_window_size (see HetCompute Pipeline Details).
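The tuple-based hand-off between the before lambda, the kernel, and the after lambda can be mimicked on the CPU with std::apply. The sketch below is plain C++17 and only models the data flow; args_tuple, vadd_kernel, and run_gpu_stage_once are hypothetical stand-ins, the "kernel" is an ordinary function, and raw vector pointers replace HetCompute buffers.

```cpp
#include <cstddef>
#include <tuple>
#include <vector>

// Range first, then the kernel arguments in order (inputs a and b, output c).
using args_tuple = std::tuple<std::size_t,
                              const std::vector<float>*,
                              const std::vector<float>*,
                              std::vector<float>*>;

// CPU stand-in for the GPU kernel: c[i] = a[i] + b[i] for i in [0, range).
inline void vadd_kernel(std::size_t range,
                        const std::vector<float>* a,
                        const std::vector<float>* b,
                        std::vector<float>* c)
{
    for (std::size_t i = 0; i < range; ++i)
        (*c)[i] = (*a)[i] + (*b)[i];
}

// One iteration of a modeled GPU stage: before -> kernel -> after.
template <typename Before, typename After>
auto run_gpu_stage_once(Before before, After after)
{
    args_tuple args = before();    // before lambda prepares range and arguments
    std::apply(vadd_kernel, args); // tuple elements become the kernel arguments
    return after(args);            // after lambda reads the outputs via the tuple
}
```

Because the output buffer travels inside the same tuple as the inputs, the after step needs no separate return channel from the kernel, which mirrors the design rationale given above.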

5.3 Introduction to Tasks


HetCompute programmers partition their applications into independent units of work that can be executed
asynchronously on the CPU, the GPU, or the Qualcomm Hexagon DSP. These units of work are called tasks.
The simplest way to create and launch a task into the HetCompute runtime is by using
hetcompute::launch(Code&&, Args&&...):
1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create task t that prints "Hello World!"
9 auto t = hetcompute::launch([] { HETCOMPUTE_ILOG("Hello World!\n"); });
10
11 HETCOMPUTE_ILOG("This is Qualcomm HetCompute!\n");
12
13 // Wait for t to complete
14 t->wait_for();
15
16 hetcompute::runtime::shutdown();
17 return 0;
18 }

The example above shows a HetCompute application that writes Hello World! and This is Qualcomm HetCompute!
concurrently. In line 9, hetcompute::launch(Code&&) creates and launches a task t that prints
Hello World!. hetcompute::launch(Code&&) takes as a parameter a lambda expression
(an anonymous function) that defines the work the task executes. As will be seen in the
remainder of the guide, hetcompute::launch(Code&&, Args&&...) is extremely versatile, and
accepts function pointers, CPU kernels, OpenCL kernels, etc. hetcompute::launch(Code&&)
returns a pointer to the task immediately and execution proceeds to the next statement,
HETCOMPUTE_ILOG("This is Qualcomm HetCompute!\n"), while the HetCompute runtime schedules
task t on the first available CPU core. Because t might run concurrently with main, the following two
program outputs are feasible:
Hello World!
This is Qualcomm HetCompute!

This is Qualcomm HetCompute!
Hello World!

Note that HETCOMPUTE_ILOG("This is Qualcomm HetCompute!\n") might execute before t does
(perhaps the system is busy). The t->wait_for() statement in line 14 ensures that the program finishes
only after t has completed its execution.
In HetCompute, tasks can have predecessors and successors, forming directed acyclic task graphs. The
predecessors of a task t are the tasks that must complete before t can execute. Conversely, the successors of t
are the set of tasks that will execute only after t has completed its execution. Programmers can specify a
predecessor/successor relationship between two tasks by creating a dependency between them using the
hetcompute::task<>::then() member function:


1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8
9 // Create task that prints "Hello "
10 auto hello = hetcompute::launch([] { HETCOMPUTE_ILOG("Hello "); });
11
12 // Create task that prints "World!"
13 auto world = hetcompute::create_task([] { HETCOMPUTE_ILOG("World!\n"); });
14
15 // Make sure that "World!" prints after "Hello"
16 hello->then(world);
17
18 // Launch world
19 world->launch();
20
21 // Wait for world to complete
22 world->wait_for();
23
24 hetcompute::runtime::shutdown();
25 return 0;
26 }

The example creates two tasks — hello and world — and sets up a dependency between them (line 16)
to ensure that World! is printed after Hello. Without this dependency, the HetCompute runtime could
execute world first, or concurrently. Note that the order in which tasks are launched does not reflect the
order in which the HetCompute runtime executes them. In line 22, the example waits for world to finish.
There is no need to explicitly wait for hello because the HetCompute runtime guarantees that hello
completes before world executes.
It is important to notice the use of two separate API calls to create and launch world (lines 13 and 19,
respectively). This makes it possible to create the dependency between hello and world before the latter is
launched. It is not possible to add a predecessor to a launched task, because the task might have already
executed by the time the program calls hetcompute::task<>::then(). This is the reason why, in HetCompute,
task creation, launch, and execution are different operations; hetcompute::launch(Code&&,
Args&&...) combines them for the sake of programmatic convenience and efficiency. In HetCompute,
tasks must be launched in order to be executed. Launching a task t means that the programmer has finished
adding predecessors to t and wants the HetCompute runtime to execute t as soon as its
predecessors have completed and there are execution units available.
These two simple samples illustrate two basic HetCompute abstractions: tasks and dependencies. In
HetCompute, programmers think about algorithms in terms of concurrent tasks and let the HetCompute
runtime schedule them onto available resources in the system. Programmers can create dynamic task graphs
by setting dependencies between tasks that the runtime enforces.
Most parallel algorithms launch more than one task, unlike the example above. Waiting for each individual
task can get cumbersome: the programmer would need to store the task pointers in some container (e.g.,
a queue or a list), and call hetcompute::task<>::wait_for() on each of them. For programmatic
convenience and efficiency, HetCompute provides another asynchronous abstraction called groups. A group
is a set of tasks that can be waited for as a unit.
1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()

6 {
7 hetcompute::runtime::init();
8 // Create group g
9 auto g = hetcompute::create_group();
10
11 // Launch 10 tasks into g
12 for (int i = 0; i < 10; i++)
13 {
14 g->launch([i] { HETCOMPUTE_ILOG("Hello World! I'm task #%d\n", i); });
15 }
16
17 // Wait for tasks to complete and exit group
18 g->wait_for();
19 hetcompute::runtime::shutdown();
20
21 return 0;
22 }

The example above creates a group g in line 9. Then, rather than creating tasks out of lambda functions and
launching them explicitly as in the earlier samples, this example launches the lambda functions directly into
group g (lines 12-15). HetCompute internally creates optimized tasks that execute as part of the group.
Finally, all the tasks are waited for as a unit in line 18.
With this basic introduction to the fundamental units of asynchrony in HetCompute, tasks and groups, read
through the next several chapters to discover various exciting operations and capabilities of these
asynchronous abstractions, including heterogeneous execution, dataflow, non-blocking parallelization,
cancellation, exception handling, and algebraic operations.
Kernels: The Path to Heterogeneity
Creating Tasks
Task Pointers
Life of a HetCompute Task
Launching Tasks
Task Dependencies
Task Groups
Waiting for Tasks
Exceptions and Cancellation
Blocking Tasks
Algebraic Operations on Tasks
Task-Pointer Collapsing
Unleashing Asynchrony

5.3.1 Kernels: The Path to Heterogeneity


As mentioned in Introduction to Tasks, a HetCompute task contains work that can be executed on any
device in a system: the CPU, the GPU, or the Qualcomm Hexagon DSP. This allows HetCompute
programmers to write applications composed of a variety of tasks, fully exploiting the performance and
power efficiency of the heterogeneous devices available in modern computing systems.
HetCompute tasks use kernels to achieve this computational heterogeneity. A kernel contains the
computation, that is, the actual device code, a task executes. This could be CPU code, GPU code, or
Qualcomm Hexagon DSP code, resulting in three different types of kernels. In the current HetCompute
release, every task contains exactly one kernel, dictating which device this task executes on.

5.3.1.1 Revisiting Hello World

The hello world example in the previous section uses hetcompute::launch to create and launch a
CPU task in one step. However, hetcompute::launch is in fact one of many convenient methods that
combine multiple steps into one method. In particular, hetcompute::launch creates anonymous
kernels and tasks as necessary along the way.
The general steps to write a HetCompute program are as follows:

1. Create a kernel using a CPU, GPU, or DSP function


2. Set attributes on the kernel such as blocking or not
3. Use the kernel to create one or more tasks
4. Repeat the previous steps to create more tasks
5. Set dependencies between tasks and launch them
Note

Some of the steps above can be interleaved with other steps. For example, new tasks can still be
created, and dependencies set up for them, even after some tasks have already been launched. Meanwhile,
many convenient methods exist to combine multiple steps into one; refer to other sections in this
chapter to see some of these in action.

As an example, the following code rewrites the hello world program by explicitly carrying out the steps
listed above:
1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create a cpu_kernel k that prints "Hello World!"
9 auto k = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Hello World!\n"); });
10
11 // Use k to create a task t
12 auto t = hetcompute::create_task(k);
13
14 // Launch the task t
15 t->launch();
16
17 // Print another line after t is asynchronously launched
18 HETCOMPUTE_ILOG("This is HETCOMPUTE!\n");
19

20 // Wait for t to complete


21 t->wait_for();
22
23 hetcompute::runtime::shutdown();
24 return 0;
25 }

Note

Although the modified hello world example above creates a task using a CPU kernel (which itself is
created from a lambda expression, that is, an anonymous CPU function), hetcompute::create_task
can take other types of kernels created from various device functions: a CPU kernel, a GPU
kernel, or a DSP kernel. The details on how to create these heterogeneous kernels follow.

5.3.1.2 Creating a Kernel

Currently, there are three types of kernels in HetCompute. Use the following methods to create each of
them:

1. To create a CPU kernel, use hetcompute::create_cpu_kernel. It takes either a function or
a function object (that is, a lambda expression or a functor).
2. To create a GPU kernel, use hetcompute::create_gpu_kernel. There are two variants: one
for wrapping OpenCL C kernels and another for wrapping OpenGL ES compute shaders. The
arguments for the OpenCL variant are 1) a string containing an OpenCL GPU device function source,
and 2) a string containing the name of the device function. The OpenGL variant takes as argument a
string containing a compute shader program source. The hetcompute::create_gpu_kernel
must be explicitly typed by the programmer according to the GPU function’s signature; see Buffers
regarding how to translate arrays in GPU function parameters to hetcompute::buffer_ptr.
3. To create a DSP kernel, use hetcompute::create_dsp_kernel. It takes a DSP-compatible C
function.
Note

The three kernel creation methods have different semantics, due to the difference in how CPU, GPU,
and DSP functions are written. Refer to Kernels in the API Reference Manual for details regarding
each method.

The following example shows how to use the methods in practice:


1 #include <hetcompute/hetcompute.hh>
2
3 static int
4 f1(int x)
5 {
6 return x * 2;
7 }
8
9 static auto f2 = [](int x) -> int { return x * 2; };
10
11 static struct
12 {
13 int operator()(int x) { return x * 2; }
14 } f3;
15
16 // Source string for an OpenCL C kernel
17 static std::string const f4_string = "__kernel void f4(__global int *x, __global int *y) {"
18 " int i = get_global_id(0);"
19 " y[i] = x[i];"
20 "}";
21

22 static int
23 f5(int* x, int* y, int l)
24 {
25 int i = 0;
26 for (i = 0; i < l; i++)
27 y[i] = x[i] * 2;
28 return 0;
29 }
30
31 int
32 main()
33 {
34 hetcompute::runtime::init();
35
36 // Create a cpu_kernel from a function
37 auto k1 = hetcompute::create_cpu_kernel(f1);
38
39 // Create a cpu_kernel from a lambda expression
40 auto k2 = hetcompute::create_cpu_kernel(f2);
41
42 // Create a cpu_kernel from a functor
43 auto k3 = hetcompute::create_cpu_kernel(f3);
44
45 // Create a gpu_kernel from an OpenCL C GPU function
46 auto k4 = hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<int>,
hetcompute::buffer_ptr<int>>(f4_string, "f4");
47
48 // Create a hexagon_kernel from a DSP function
49 auto k5 = hetcompute::create_dsp_kernel<>(f5);
50
51 hetcompute::runtime::shutdown();
52 return 0;
53 }

Once a kernel is created, it can be used by hetcompute::create_task to create a task. Moreover, a
kernel can be used to create multiple independent tasks. (Kernels: Advanced Topics explains what the
independence means.)

5.3.1.2.1 GPU kernels for OpenCL and OpenGL ES

The previous example in Creating a Kernel showed the basic usage of
hetcompute::create_gpu_kernel. A source string was passed, which was implicitly assumed to
contain an OpenCL C function. The beta variant hetcompute::beta::create_gpu_kernel also allows the user to
provide an OpenGL ES compute shader program. The user must explicitly pass hetcompute::beta::gl as the
first parameter to hetcompute::beta::create_gpu_kernel to indicate an OpenGL ES compute
shader, as shown in the following example.
1 #include <hetcompute/hetcompute.hh>
2
3 #define LOCAL_SIZE 16 // This should match the local_size_x value in the shader
4
5 const char* shader_code = R"GLCODE(
6
7 #version 310 es
8 precision highp float;
9 layout(local_size_x = 16) in;
10 layout(std430) buffer;
11
12 layout(binding = 2) writeonly buffer Output {
13 float elements[];
14 } output_data;
15
16 layout(binding = 0) readonly buffer Input0 {
17 float elements[];
18 } input_data0;
19

20 layout(binding = 1) readonly buffer Input1 {


21 float elements[];
22 } input_data1;
23
24 void main()
25 {
26 uint ident = gl_GlobalInvocationID.x;
27 output_data.elements[ident] = input_data0.elements[ident] + input_data1.elements[ident];
28 }
29
30 )GLCODE";
31
32 int
33 main()
34 {
35 hetcompute::runtime::init();
36 auto buf_a = hetcompute::create_buffer<float>(1024);
37 auto buf_b = hetcompute::create_buffer<float>(buf_a.size());
38
39 buf_a.acquire_wi();
40 buf_b.acquire_wi();
41 // Initialize the input vectors
42 for (size_t i = 0; i < buf_a.size(); ++i)
43 {
44 buf_a[i] = i;
45 buf_b[i] = buf_a.size() - i;
46 }
47 buf_a.release();
48 buf_b.release();
49
50 auto buf_c = hetcompute::create_buffer<float>(buf_a.size());
51
52 auto gl_vadd = hetcompute::beta::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
53 hetcompute::in<hetcompute::buffer_ptr<float>>,
54 hetcompute::out<hetcompute::buffer_ptr<float>>>(
hetcompute::beta::gl, shader_code);
55
56 hetcompute::range<1> global_range(buf_a.size());
57
58 hetcompute::range<1> local_range(LOCAL_SIZE);
59
60 // Create a task
61 auto gpu_task = hetcompute::create_task(gl_vadd, // gpu kernel
62 global_range, // global range
63 local_range, // local range
64 buf_a, // rest correspond to gpu_vadd template
parameters
65 buf_b,
66 buf_c);
67 gpu_task->launch();
68
69 gpu_task->wait_for();
70 hetcompute::runtime::shutdown();
71 }

Correspondingly, when passing OpenCL C functions, the user may optionally pass
hetcompute::beta::cl to make the use of OpenCL explicit, as illustrated below.
auto k4 = hetcompute::beta::create_gpu_kernel<hetcompute::buffer_ptr<int>,
                                              hetcompute::buffer_ptr<int>>(hetcompute::beta::cl, f4_string, "f4");

5.3.1.2.2 Global and Local Ranges

When launching a GPU kernel, a global range parameter must always be provided. The global range
identifies the total number and shape of GPU threads that will be executed. Optionally, a local range
parameter may also be provided. When omitted, the local range is assumed to be of size 1 (of the
1D/2D/3D dimensionality corresponding to the global range).
With GPU kernels created from OpenCL C, the programmer typically doesn’t face correctness issues when
omitting the local range parameter. However, OpenGL ES shader program sources specify a local range.
When launching GPU kernels created from OpenGL ES shaders, the programmer must ensure that
the local range is provided and matches the local range used in the compute shader program. For
example, the sample program in GPU kernels for OpenCL and OpenGL ES uses a 1D local range of size 16.
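The relationship between the two ranges can be illustrated with a small standalone sketch. It uses only the C++ standard library, not the HetCompute API: work_group_count is a hypothetical helper, and the rule that the global size must be a whole multiple of the local size is an assumption made for this illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical helper (not part of HetCompute): number of work groups in a 1D launch.
// Assumes that the local size must be non-zero and must divide the global size evenly.
std::size_t work_group_count(std::size_t global_size, std::size_t local_size)
{
    if (local_size == 0 || global_size % local_size != 0)
    {
        throw std::invalid_argument("local size must be non-zero and divide the global size");
    }
    return global_size / local_size;
}
```

Under these assumptions, the sample program's global range of 1024 and local range of 16 correspond to work_group_count(1024, 16), that is, 64 work groups.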

5.3.1.3 Setting Kernel Attributes

After a kernel is created—from a CPU, GPU, or DSP function—Qualcomm HetCompute users cannot swap
the underlying function for another one in the kernel. However, users can change the following attributes of
the kernel:

1. Blocking: denotes whether a CPU task made from this kernel is expected to block on external events
such as I/O activities (Blocking Tasks). The blocking attribute can be set and queried using the
set_blocking and is_blocking methods of a kernel object.
2. Big: denotes that the CPU kernel is preferably executed on a big core (i.e. has affinity to the big core)
in a big.LITTLE SoC. Users can override the kernel affinity setting through the HetCompute affinity
APIs (Affinity). The big attribute can be set and queried using the set_big and is_big methods
of a kernel object.
3. Little: denotes that the CPU kernel is preferably executed on a LITTLE core (i.e. has affinity to the
LITTLE core) in a big.LITTLE SoC. Users can override the kernel affinity setting through the
HetCompute affinity APIs (Affinity). The little attribute can be set and queried using the
set_little and is_little methods of a kernel object.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto k1 = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("big task executed"); });
    // inform the HetCompute runtime that the kernel is best executed on a big
    // core in a big.LITTLE SoC
    k1.set_big();

    auto t1 = hetcompute::launch(k1);
    t1->wait_for();

    auto k2 = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("LITTLE task executed"); });
    // inform the HetCompute runtime that the kernel is best executed on a
    // LITTLE core in a big.LITTLE SoC
    k2.set_little();

    auto t2 = hetcompute::launch(k2);
    t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Note

While the HetCompute runtime system will try to enforce the big and little kernel attributes, dynamic
system conditions such as offline cores may prevent it from doing so.
Changing a kernel’s attributes does not affect tasks created from this kernel prior to the change.

5.3.1.4 Kernels: Advanced Topics

Because object lifetime management is generally tricky in asynchronous and parallel programming, at task
creation time, HetCompute copies a kernel into the resulting task. This has a few implications:

1. The task can exist independent of the kernel’s lifetime. This is particularly useful when a kernel is
created inside a scope such as a function, but the tasks created from this kernel can live beyond the
end of this scope and get launched later, even after the original kernel object has already been
destroyed.
2. This also means programmers should be particularly careful with kernel-owned data. As shown in the
example below, the copying of kernel objects during task creation may lead to non-obvious results.
#include <hetcompute/hetcompute.hh>
#include <stdio.h>

// A functor that stores some internal state
struct foo
{
    int n;

    foo() : n(1) {}

    void operator()()
    {
        HETCOMPUTE_ILOG("n = %d\n", n);
        n++;
    }
};

int
main()
{
    hetcompute::runtime::init();

    // Create a cpu_kernel from the functor
    foo bar;
    auto k = hetcompute::create_cpu_kernel(bar);

    // Create two independent tasks from k
    auto t1 = hetcompute::create_task(k);
    auto t2 = hetcompute::create_task(k);

    // Set dependency and launch the tasks
    t1->then(t2);
    t1->launch();
    t2->launch();

    // Expected output: 1 and 1, not 1 and 2
    t2->wait_for();

    hetcompute::runtime::shutdown();

    return 0;
}
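
The same copy semantics can be reproduced without the SDK. The sketch below is an analogy only: std::function stands in for a task that stores its own copy of the callable, and counter and run_two_copies are hypothetical names introduced for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <vector>

// A functor with internal state, mirroring foo in the listing above.
struct counter
{
    int n = 1;
    int operator()() { return n++; }
};

// Stand-in for task creation: each std::function stores its own copy of the callable.
std::vector<int> run_two_copies()
{
    counter original;
    std::function<int()> first = original;  // copy #1
    std::function<int()> second = original; // copy #2
    // Each copy starts from n == 1; the original is never invoked.
    return { first(), second(), original.n };
}
```

Each copy increments its own n, so both calls see the initial value and the original object is left unchanged.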

Kernel objects can generally be moved and copied in construction and assignment, similar to regular
objects. However, one exception is that cpu_kernel objects created from lambda functions cannot be
copy-assigned, because lambda functions do not have copy-assignment operators.
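
The language rule behind this restriction can be checked with standard type traits. The sketch below uses only the C++ standard library; capturing_lambda_is_copy_only is a hypothetical name. It uses a capturing lambda, because since C++20 a lambda without captures does have a copy-assignment operator.

```cpp
#include <cassert>
#include <type_traits>

// Returns true when a capturing lambda can be copy-constructed but not copy-assigned.
bool capturing_lambda_is_copy_only()
{
    int x = 42;
    auto f = [x] { return x; };
    using closure = decltype(f);

    auto g = f; // copy construction is allowed
    // g = f;   // would not compile: the copy-assignment operator is deleted

    return g() == 42
           && std::is_copy_constructible<closure>::value
           && !std::is_copy_assignable<closure>::value;
}
```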

Finally, it is possible to create kernel objects by directly instantiating them via their constructors, instead of
using the factory methods (that is, hetcompute::create_cpu_kernel,
hetcompute::create_gpu_kernel, and hetcompute::create_dsp_kernel). This may be useful
for passing kernel pointers around and for extending kernel classes. However, in those situations, one usually
cannot use the auto keyword, and it is important to explicitly and correctly type the kernel objects.

5.3.1.5 Poly-kernels (Beta feature)

The HetCompute runtime automatically balances load across the various devices in a heterogeneous
system: the CPU (big or LITTLE), GPU, and DSP. Developers can exploit this feature by constructing
"poly-kernels" – kernels having multiple implementations of the same interface – and have the HetCompute
runtime system dynamically pick the most suitable device on which to execute the task constructed from the
poly-kernel.
1 #include <hetcompute/hetcompute.hh>
2
3 // Macro which creates a string containing the OpenCL C kernel.
4 #define OCL_KERNEL(name, k) std::string const name##_string = #k
5
6 OCL_KERNEL(vadd_kernel, __kernel void vadd(__global float* A, __global float* B, __global float* C,
unsigned int size) {
7 unsigned int i = get_global_id(0);
8 if (i < size)
9 C[i] = A[i] + B[i];
10 });
11
12 int
13 main(void)
14 {
15 hetcompute::runtime::init();
16
17 {
18 // Create input buffers, automatically host accessible
19 auto buf_a = hetcompute::create_buffer<float>(1024);
20 auto buf_b = hetcompute::create_buffer<float>(buf_a.size());
21
22 buf_a.acquire_wi();
23 buf_b.acquire_wi();
24 // Initialize the input buffers
25 for (size_t i = 0; i < buf_a.size(); ++i)
26 {
27 buf_a[i] = i;
28 buf_b[i] = buf_a.size() - i;
29 }
30 buf_a.release();
31 buf_b.release();
32
33 // Create an output buffer in relaxed mode: not automatically accessible by host
34 auto buf_c = hetcompute::create_buffer<float>(buf_a.size());
35
36 // Name of the OpenCL C kernel.
37 std::string kernel_name("vadd");
38
39 // Create a gpu kernel. Note the optional in/out directions that allow HETCOMPUTE
40 // to perform copy optimizations. By default, the buffers are treated as
41 // inout.
42 auto gpu_vadd = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
43 hetcompute::in<hetcompute::buffer_ptr<float>>,
44 hetcompute::out<hetcompute::buffer_ptr<float>>,
45 unsigned int>(vadd_kernel_string, kernel_name);
46 unsigned int size = buf_a.size();
47
48 // Create a hetcompute::range object, 1D in this case.
49 hetcompute::range<1> range_1d(buf_a.size());
50
51 // Create cpu kernel with the same interface as the gpu kernel
52 auto cpu_vadd = [](hetcompute::range<1> r,
53 hetcompute::buffer_ptr<float> a,
54 hetcompute::buffer_ptr<float> b,
55 hetcompute::buffer_ptr<float> c,
56 unsigned int) {
57 HETCOMPUTE_ILOG("running on the CPU");
58 hetcompute::pfor_each(r, [&](
hetcompute::index<1> i) { c[i[0]] = a[i[0]] + b[i[0]]; });
59 };
60
61 // Create a task out of a poly-kernel.
62 // The HetCompute runtime will automatically choose the appropriate gpu or cpu
63 // variant to dispatch based on runtime load on the devices.
64 auto poly_task = hetcompute::beta::create_task(std::make_tuple(
gpu_vadd, cpu_vadd),
65 range_1d, // global range
66 buf_a, // rest correspond to gpu_vadd template parameters
67 buf_b,
68 buf_c,
69 size);
70 // Launch the task
71 poly_task->launch();
72
73 // Wait for task completion.
74 poly_task->wait_for();
75
76 buf_a.acquire_ro();
77 buf_b.acquire_ro();
78 buf_c.acquire_ro();
79
80 // Access the results on the host and verify their correctness.
81 for (size_t i = 0; i < buf_a.size(); ++i)
82 {
83 HETCOMPUTE_INTERNAL_ASSERT(buf_a[i] + buf_b[i] == buf_c[i] && buf_c[i] == buf_a.size(),
84 "comparison failed at ix %zu: %f + %f == %f == %zu",
85 i,
86 buf_a[i],
87 buf_b[i],
88 buf_c[i],
89 buf_a.size());
90 }
91
92 buf_a.release();
93 buf_b.release();
94 buf_c.release();
95 }
96
97 hetcompute::runtime::shutdown();
98 }

In the example above, the programmer is interested in adding two vectors. A GPU kernel that
performs the vector addition is created on line 42. A CPU kernel with the same interface is implemented
(using hetcompute::pfor_each) on line 52. Both alternatives are exposed to the HetCompute
runtime system by constructing a task out of a poly-kernel, expressed as an std::tuple, on line 64. The
constructed poly_task is like any other task; the programmer can wait for the task, add the task to
groups, set dependencies, etc. Once the task is launched (on line 71), the HetCompute runtime performs
"late binding" of function to device, wherein either the CPU or the GPU may execute the task depending on
their relative load.
Note

The poly-kernel is a beta feature in this HetCompute release, and its API may change in the future.
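
The idea of late binding can be sketched in plain C++ without the SDK. Everything in this sketch is hypothetical: kernel_variant, vadd, and pick_variant are illustrative names, and the "choose the less loaded device" policy is an assumption, not a description of how the HetCompute scheduler actually decides.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <string>
#include <vector>

using vec = std::vector<float>;

// One implementation of the shared interface, tagged with its target device.
struct kernel_variant
{
    std::string device;
    std::function<vec(const vec&, const vec&)> run;
};

// Element-wise addition, standing in for both the GPU and the CPU implementation.
vec vadd(const vec& a, const vec& b)
{
    vec c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i)
    {
        c[i] = a[i] + b[i];
    }
    return c;
}

// Toy policy (an assumption, not the HetCompute scheduler): choose the less loaded device.
const kernel_variant& pick_variant(const kernel_variant& gpu, const kernel_variant& cpu,
                                   double gpu_load, double cpu_load)
{
    return gpu_load <= cpu_load ? gpu : cpu;
}
```

The caller supplies both variants up front and only learns at dispatch time which one runs, which is the property that requires every variant to share one interface.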

5.3.1.6 hetcompute::range and hetcompute::index

hetcompute::range<N> provides an abstraction to specify the bounds of an N-dimensional space (N can
currently be 1, 2, or 3). Similarly, hetcompute::index<N> represents a point in N-dimensional space. Together, these
can be used to iterate over an N-dimensional space. hetcompute::range is primarily used for the creation of GPU
tasks in HetCompute, as it maps very well to the OpenCL NDRange; however, it can also be used with CPU tasks if the
iteration space has more than one dimension, or if there is a need for strided access.

5.3.1.7 Using hetcompute::range<N> to represent ND-Range (in OpenCL)

hetcompute::range<N> is useful to represent an ND-Range (in OpenCL). The table below shows different
ways to create the hetcompute::range<N> object and how it maps to OpenCL constructs.
hetcompute::range                                   OpenCL NDRange
hetcompute::range<1>(w)                             cl::NDRange(w)
hetcompute::range<2>(w, h)                          cl::NDRange(w, h)
hetcompute::range<3>(w, h, d)                       cl::NDRange(w, h, d)
hetcompute::range<1>(off_x, w)                      cl::NDRange(off_x) is used as the offset and
                                                    cl::NDRange(w) as the global size in the
                                                    GPU kernel launch
hetcompute::range<2>(off_x, w, off_y, h)            cl::NDRange(off_x, off_y) is used as the offset and
                                                    cl::NDRange(w, h) as the global size in the
                                                    GPU kernel launch
hetcompute::range<3>(off_x, w, off_y, h, off_z, d)  cl::NDRange(off_x, off_y, off_z) is used as the offset and
                                                    cl::NDRange(w, h, d) as the global size in the
                                                    GPU kernel launch

5.3.1.8 hetcompute::range<1>

Represents a 1D range and is useful if your problem space is linear, as in vector addition. The example below
shows a simple vector addition where each work item reads one element from each of the two input vectors
and produces one element in the output vector.
// Macro which creates a string containing the OpenCL C kernel.
#define OCL_KERNEL(name, k) std::string const name##_string = #k

OCL_KERNEL(vadd_kernel, __kernel void vadd(__global float* A, __global float* B, __global float* C,
                                           unsigned int size) {
    unsigned int i = get_global_id(0);
    if (i < size)
        C[i] = A[i] + B[i];
});

// Name of the OpenCL C kernel.
std::string kernel_name("vadd");

// Create a gpu kernel object. Note the optional in/out directions that allow HetCompute to perform
// copy optimizations. By default, the buffers are treated as inout.
auto gpu_vadd = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
                                              hetcompute::in<hetcompute::buffer_ptr<float>>,
                                              hetcompute::out<hetcompute::buffer_ptr<float>>,
                                              unsigned int>(vadd_kernel_string, kernel_name);

hetcompute::range<1> range_1d(buf_a.size());

// Create a task
auto gpu_task = hetcompute::create_task(gpu_vadd, // gpu kernel
                                        range_1d, // global range
                                        buf_a,    // rest correspond to gpu_vadd template parameters
                                        buf_b,
                                        buf_c,
                                        size);
vadd_kernel_string represents a string containing the OpenCL C code for vector addition.

5.3.1.9 hetcompute::range<2>

Represents a 2D range and is useful if your problem space is two-dimensional, as in matrix
multiplication or many image processing applications. The example below shows a simple matrix
multiplication, where each work item computes one element in the output matrix.
// Macro which creates a string containing the OpenCL C kernel.
#define OCL_KERNEL(name, k) std::string const name##_string = #k

OCL_KERNEL(matrix_multiply_kernel,
           __kernel void matrix_multiply(__global float* a, __global float* b, __global float* c, int M,
                                         int P, int N) {
               int i = get_global_id(1);
               int j = get_global_id(0);
               if (i >= M || j >= N)
                   return;

               c[i * N + j] = 0;
               for (int k = 0; k < P; k++)
               {
                   c[i * N + j] += a[i * P + k] * b[k * N + j];
               }
           });

// Name of the OpenCL C kernel.
std::string kernel_name("matrix_multiply");

// Create a gpu kernel object. Note the optional in/out directions that allow HetCompute to perform
// copy optimizations. By default, the buffers are treated as inout.
auto gpu_mm = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
                                            hetcompute::in<hetcompute::buffer_ptr<float>>,
                                            hetcompute::out<hetcompute::buffer_ptr<float>>,
                                            unsigned int,
                                            unsigned int,
                                            unsigned int>(matrix_multiply_kernel_string, kernel_name);

hetcompute::range<2> range_2d(n, m);

// Create a task
auto gpu_task = hetcompute::create_task(gpu_mm,   // gpu kernel
                                        range_2d, // global range
                                        buf_a,    // rest correspond to gpu_mm template parameters
                                        buf_b,
                                        buf_c,
                                        m,
                                        p,
                                        n);

matrix_multiply_kernel_string represents a string containing the OpenCL C code for matrix
multiplication.
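
When checking the output of such a kernel on the host, a straightforward CPU reference with the same row-major indexing can be useful. The following sketch is not part of the SDK: matrix_multiply_reference is a hypothetical helper that takes plain std::vector data rather than HetCompute buffers.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: c (M x N) = a (M x P) * b (P x N), all row-major.
std::vector<float> matrix_multiply_reference(const std::vector<float>& a,
                                             const std::vector<float>& b,
                                             std::size_t M, std::size_t P, std::size_t N)
{
    std::vector<float> c(M * N, 0.0f);
    for (std::size_t i = 0; i < M; ++i)
    {
        for (std::size_t j = 0; j < N; ++j)
        {
            // Same indexing as the OpenCL C kernel: one output element per (i, j).
            for (std::size_t k = 0; k < P; ++k)
            {
                c[i * N + j] += a[i * P + k] * b[k * N + j];
            }
        }
    }
    return c;
}
```

Each (i, j) iteration of the two outer loops corresponds to one work item of the 2D range in the GPU version.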

5.3.1.10 hetcompute::range<3>

Represents a 3D range and is useful if your problem space is three-dimensional.

5.3.1.11 Strided Ranges

Consider the example of a denoise_kernel, described in more detail in Parallelization using patterns.
void denoise_image()
{
    // initialization, etc
    hetcompute::range<2> r(0, w, TILE_SIZE, 0, h, TILE_SIZE);
    // r is captured by reference so that the lambda can call linear_to_index on it
    hetcompute::pfor_each(r, [input, &output, &r] (size_t index) {
        hetcompute::index<2> idx = r.linear_to_index(index);
        denoise_kernel(input, idx[0], TILE_SIZE, idx[1], TILE_SIZE, output);
    });
}

hetcompute::range<2> above defines a two-dimensional range, from [0,w) x [0, h) with a stride of
TILE_SIZE in each dimension. Each dimension can have a different stride. We use this range as our
iteration space for the parallel loop.
hetcompute::index<2> defines a two-dimensional index. HetCompute ranges know how to iterate to the
appropriate points. hetcompute::pfor_each provides a linear index to the lambda (size_t index) in the code
above; using the linear_to_index call returns a hetcompute::index<2> object that has the appropriate
coordinates in each dimension of the range. We can directly access the dimensions using the [] operator on
the object.
Note

Tasks which run on the GPU should have a stride of 1.
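
The conversion from a linear index to strided coordinates can be modeled in a few lines of plain C++. This is a sketch under stated assumptions, not the HetCompute implementation: strided_range_2d is a hypothetical type, and the ordering (dimension 0 varies fastest) is an assumption made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Hypothetical model of a strided 2D range [x0, x1) x [y0, y1); not the HetCompute class.
struct strided_range_2d
{
    std::size_t x0, x1, sx; // begin, end, and stride in dimension 0
    std::size_t y0, y1, sy; // begin, end, and stride in dimension 1

    // Number of points visited in each dimension (ceiling division).
    std::size_t count_x() const { return (x1 - x0 + sx - 1) / sx; }
    std::size_t count_y() const { return (y1 - y0 + sy - 1) / sy; }
    std::size_t size() const { return count_x() * count_y(); }

    // Assumed ordering: dimension 0 varies fastest.
    std::pair<std::size_t, std::size_t> linear_to_index(std::size_t linear) const
    {
        return { x0 + (linear % count_x()) * sx, y0 + (linear / count_x()) * sy };
    }
};
```

For a 64 x 32 image with a tile size of 16, this model visits 4 x 2 = 8 tile origins, and linear index 5 maps to the tile origin (16, 16).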

5.3.2 Creating Tasks


HetCompute offers multiple templated methods to create tasks, covering several use cases:

1. Create and launch a HetCompute task in a group:
   • template<typename Code, typename... Args>
     void hetcompute::group::launch(Code&& code, Args&&... args)
   • template<typename Code, typename... Args>
     void hetcompute::group::launch(do_not_collapse_t, Code&& code, Args&&... args)
   Launching tasks in a group is the fastest method to create and execute tasks in HetCompute.
   Launching into a group also simplifies the lifetime maintenance of the task pointers, because
   the runtime takes care of it. If the task logically belongs with other tasks, consider creating
   a group and launching the tasks in that group.
2. Create a HetCompute task and launch it:
   • template<typename Code, typename... Args>
     collapsed_task_type<Code> hetcompute::launch(Code&& code, Args&&... args)
   • template<typename Code, typename... Args>
     non_collapsed_task_type<Code> hetcompute::launch(do_not_collapse_t, Code&& code, Args&&... args)
   If the task is not part of a group, creating and launching it in one step allows the programmer
   to pass the task directly to the runtime. The runtime knows that the task does not have any
   predecessors and is ready for execution.
3. Create a HetCompute task without launching it:
   • template<typename ReturnType, typename... Args>
     hetcompute::task_ptr<ReturnType> hetcompute::create_value_task(Args&&... args)
   • template<typename Code, typename... Args>
     collapsed_task_type<Code> hetcompute::create_task(Code&&, Args&&...)
   • template<typename Code, typename... Args>
     non_collapsed_task_type<Code> hetcompute::create_task(do_not_collapse_t, Code&&, Args&&...)
   Creating a task using the create_task methods returns a pointer to the task. The task is not
   launched yet, so the programmer has the opportunity to set up dependencies (that is, make this
   task part of a task graph). The runtime maintains the validity of the pointer while the task is
   executing. However, because the programmer also holds a pointer to the task, the lifetime of that
   pointer also affects how long the task is kept in the system. It is recommended that the
   programmer reset the pointer when finished with the task.
There are three essential parts to creating a HetCompute task (non-value task):

1. Type of the task, that is, collapsed or non-collapsed. By default, collapsed tasks are created.
   Pass hetcompute::do_not_collapse to force the creation of non-collapsed tasks. For more
   information about collapsed and non-collapsed tasks, see Task-Pointer Collapsing.
2. Computation of the task, that is, Code&& code in the templated methods. HetCompute supports
   the following computation constructs to encapsulate in tasks:
   • C++ basic function-like blocks, that is, lambda expressions (preferred), function objects, or
     function pointers.
   • HetCompute Kernels (see Kernels: The Path to Heterogeneity)
   • HetCompute Patterns (see Parallel Programming Patterns)
3. Arguments bound to the task, that is, Args&&... args in the templated methods (optional). If
   the computation construct of a task takes arguments, the arguments can be bound to the task
   when creating the task. Argument binding can also happen after the creation of the task
   (hetcompute::task<ReturnType(Args...)>::bind_all), but has to happen before
   launching.
The methods in the third group (create without launching) create a task and return a pointer to it
(hetcompute::task_ptr<...>). The task_ptr is typed according to the type of computation
construct encapsulated in the task. For example, the following code creates three HetCompute tasks.
auto t1 = hetcompute::create_task([&] {i ++;});
auto t2 = hetcompute::create_task([&] {return 42;});
auto t3 = hetcompute::create_task([&] (bool b, float f){
return b ? int(f) : 0;
});

The type of t1 is hetcompute::task_ptr<>.
The type of t2 is hetcompute::task_ptr<int>.
The type of t3 is hetcompute::task_ptr<int(bool, float)>.
The type of a task_ptr determines which operations are possible on the task through that pointer:

1. hetcompute::task_ptr<> can only set up control dependencies, e.g., t1->then(t2).
2. hetcompute::task_ptr<ReturnType> can set up control dependencies and be the predecessor in
data dependencies, e.g., t2->then(t1), t3->bind_all(t2). However, this type of
task_ptr cannot bind any data, because it carries no argument list information.
3. hetcompute::task_ptr<ReturnType(Args...)> can set up control and data
dependencies (as predecessor or successor), and bind data to its computation construct, e.g.,
t3->then(t1), t3->bind_all(t2).
For more information about task data dependency, see Task Dependencies.
Note: hetcompute::task_ptr<ReturnType> is-a hetcompute::task_ptr<>.
Note: hetcompute::task_ptr<ReturnType(Args...)> is-a hetcompute::task_ptr<ReturnType>.
A HetCompute value task is a launched task with no computation construct but only the return value. It is
useful for task algebra, see Algebraic Operations on Tasks. For more information about
hetcompute::task_ptr<...> and HetCompute value tasks, see Tasks and Unleashing Asynchrony.
The methods in the second group (create and launch) create a task, launch it, and return a pointer to it,
all in the same call. They combine hetcompute::create_task(...) and
hetcompute::task::launch(...). Task arguments must be bound when using this group of
methods if the task computation construct takes any.
The methods in the first group (launch in a group) create a task and launch it into a group. Using
hetcompute::group::launch(...) is the fastest way to create tasks in HetCompute and should
be used as often as possible. For more information about groups, see Task Groups.
Note: The third group of templated methods makes it possible to decouple task creation and task launching
(Launching Tasks) so that task dependencies can be set between these two steps.
Note: The second group of templated methods is recommended when there are no predecessors to the
created task. It is more performant than the decoupled creation and launching.
Note: The first group of templated methods should be used as often as possible when no
dependencies are needed for the created task. It is the most performant way to create and launch a task.

5.3.2.1 Create Tasks Using Lambda Expressions

Lambda expressions were introduced in C++11 and are the preferred argument type for creating HetCompute
tasks. Lambda expressions are unnamed function objects that are able to capture variables from the
enclosing scopes. A description of this C++11 feature is outside the scope of this document. Find detailed
information about lambda expressions in the following resources:

• C++11 Tutorial: Lambda Expressions -- The Nuts and Bolts of Functional Programming
• Lambda functions
• Lambda Functions in C++11 - the Definitive Guide
• Michael Caisse: Lambda Functions
The following code uses a lambda expression to create a task t1 that prints ’Hello World!’:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create a task that prints Hello World!
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Launch the task.
    t1->launch();
    // Wait for the task to finish.
    t1->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Alternatively, you could create the task using hetcompute::launch(...) or
hetcompute::group::launch(...).
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create and launch a task that prints Hello World!
    auto t = hetcompute::launch([] { HETCOMPUTE_ILOG("Hello World!\n"); });
    // wait for t to finish
    t->wait_for();

    // create a group
    auto g = hetcompute::create_group();
    // launch a task in g
    g->launch([] { HETCOMPUTE_ILOG("Hello World!\n"); });
    // wait for g to finish
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

The lambda expression in the previous example is very simple as it does not capture any variables. Let’s
suppose that you want to capture a string with the user name to do a proper greeting:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    std::string name = "HETCOMPUTE";

    // Launching a task in the group.
    g->launch([name] { HETCOMPUTE_ILOG("Hello World, %s!\n", name.c_str()); });

    // Wait for g to finish.
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

By capturing name by value in the lambda expression, the task holds its own copy of the string and can use it
when the task executes, which may happen outside the scope where the task is created. If you capture
variables by reference instead, make sure that the original object still exists when the task executes. For
example, consider the following code:
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto g = hetcompute::create_group();
8
9 {
10 std::string name = "HETCOMPUTE";
11
12 // Launching a task in the group.
13 g->launch([&name] { HETCOMPUTE_ILOG("Hello World, %s!\n", name.c_str()); });
14 } // "name" goes out of scope here.
15
16 // Wait for g to finish.
17 g->wait_for();
18 hetcompute::runtime::shutdown();
19
20 return 0;
21 }

The string name goes out of scope in line 14, and its destructor is then called. If the scheduler executes the
task after that happens and the task refers to the destroyed string, the program will most likely crash.
Refer to Task Pointers for information about why a hetcompute::task_ptr<...> should never be
captured by reference.
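
The lifetime rule can be demonstrated without the SDK. In the sketch below, std::function stands in for a task that runs after the enclosing scope has ended; make_greeter is a hypothetical name introduced for this illustration.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Returns a callable that outlives the local string it was built from.
std::function<std::string()> make_greeter()
{
    std::string name = "HETCOMPUTE";
    // Capture by value: the closure owns a copy of name.
    // Capturing with [&name] here would leave a dangling reference once this function returns.
    return [name] { return "Hello World, " + name + "!"; };
}
```

Calling make_greeter()() is safe because the closure carries its own copy of the string; the by-reference variant would have undefined behavior.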
Warning

Using default capture by copy ([=]) or by reference ([&]) will capture all variables from the
enclosing scope, which may increase the size of your tasks considerably if the compiler cannot figure
out that many of them are not used and do not need to be captured. It is recommended that you
capture only the variables that your lambda expression uses.

5.3.2.2 Create Tasks Using Classes

You can use any custom class as <typename Code> by overloading the class’s operator(). The
following code shows how to create a task from a class instance. When the HetCompute scheduler executes
the task, the operator() method is called.
#include <hetcompute/hetcompute.hh>

class user_class
{
public:
    explicit user_class(int value) : x(value) {}

    void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }

    void set_x(int value) { x = value; }

private:
    int x;
};

int
main()
{
    hetcompute::runtime::init();
    // Create a hetcompute group.
    auto g = hetcompute::create_group();

    // Create and launch a task into group g.
    g->launch(user_class(42), 27);

    // Wait for the group to finish.
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

It is also possible to create an object from user_class and then create a task using that object:
1 #include <hetcompute/hetcompute.hh>
2
3 class user_class
4 {
5 public:
6 explicit user_class(int value) : x(value) {}
7
8 void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }
9
10 void set_x(int value) { x = value; }
11
12 private:
13 int x;
14 };
15
16 int
17 main()
18 {
19 hetcompute::runtime::init();
20 // Create a hetcompute group.
21 auto g = hetcompute::create_group();
22
23 // Instantiate an object of user_class.
24 user_class obj(42);
25
26 // Create a hetcompute task.
27 auto t = hetcompute::create_task(obj);
28
29 // Launch the task into group g.
30 g->launch(t, 27);
31
32 // Wait for the group to finish.
33 g->wait_for();
34 hetcompute::runtime::shutdown();
35 return 0;
36 }

The previous example raises an interesting question: What would the task print if obj.set_x(100) was
called between lines 27 and 30?
1 #include <hetcompute/hetcompute.hh>
2
3 class user_class
4 {
5 public:
6 explicit user_class(int value) : x(value) {}
7
8 void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }
9
10 void set_x(int value) { x = value; }
11
12 private:
13 int x;
14 };
15
16 int
17 main()
18 {
19 hetcompute::runtime::init();
20 // Create a hetcompute group.
21 auto g = hetcompute::create_group();
22
23 // Instantiate an object of user_class.
24 user_class obj(42);
25
26 // Create a hetcompute task.
27 auto t = hetcompute::create_task(obj);
28
29 obj.set_x(100);
30 // Launch the task into group g.
31 g->launch(t, 27);
32
33 // Wait for the group to finish.
34 g->wait_for();
35 hetcompute::runtime::shutdown();
36 return 0;
37 }

As always, the answer to the ultimate question of life, the universe, and everything (including HetCompute
task execution) is 42. The reason is that HetCompute makes a copy of obj when it creates the task in line
27. Otherwise, users would need to keep track of the lifetime of the objects used to create tasks. However,
if you were to construct the object in-place, no copies would be made:
#include <hetcompute/hetcompute.hh>

class user_class
{
public:
    explicit user_class(int value) : x(value) {}

    void operator()(int y) { HETCOMPUTE_ILOG("x = %d, y = %d\n", x, y); }

    void set_x(int value) { x = value; }

private:
    int x;
};

int
main()
{
    hetcompute::runtime::init();
    // Create a hetcompute group.
    auto g = hetcompute::create_group();

    // Create a hetcompute task.
    auto t = hetcompute::create_task(user_class(42));

    // Launch the task into group g.
    g->launch(t, 27);

    // Wait for the group to finish.
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

5.3.2.3 Create Tasks Using Function Pointers

The last way to create a task is by using a function pointer.
#include <hetcompute/hetcompute.hh>

void foo();
void
foo()
{
    HETCOMPUTE_ILOG("Hello World!\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create a task that executes foo().
    auto t = hetcompute::create_task(foo);

    // Launch and wait for the task.
    t->launch();
    t->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

Warning

Due to limitations in the Visual Studio C++ compiler, this does not work on Visual Studio. You can get
around it by using a lambda function:
#include <hetcompute/hetcompute.hh>

void foo();
void
foo()
{
    HETCOMPUTE_ILOG("Hello World!\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create a task that executes foo() through a lambda.
    auto t = hetcompute::create_task([] { foo(); });

    // Launch and wait for the task.
    t->launch();
    t->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

5.3.3 Task Pointers


Managing the lifetime of an object in a parallel application can be challenging. This is why HetCompute
provides shared_ptr-type access to tasks (and several other HetCompute constructs) to automatically
manage task lifetime. HetCompute automatically destroys tasks once they finish and all references are no
longer in scope.
The various methods to create tasks, e.g., hetcompute::create_task(Body&&), return an
appropriately typed custom smart pointer to the task object. The task_ptr may be of the following types
depending on the type of computation encapsulated by the task:

• hetcompute::task_ptr<>: points to a task that neither accepts any arguments nor returns a
value
• hetcompute::task_ptr<void>: points to a task that returns void
• hetcompute::task_ptr<ReturnType>: points to a task that returns a value
• hetcompute::task_ptr<ReturnType(Args...)>: points to a task that accepts
arguments and returns a value

The above types constitute a type hierarchy, such that a hetcompute::task_ptr<> can point to a
task that returns a value or accepts arguments. Each of the above task_ptrs permits different operations
on tasks, with hetcompute::task_ptr<> providing most of the common operations on tasks and the other
two types providing more advanced operations:

• hetcompute::task_ptr<> can be launched, canceled, waited for, finished after, serve as the
source of a control dependency, etc.
• In addition to the above, hetcompute::task_ptr<ReturnType> can serve as the source of a
data dependency.
• In addition to the above, hetcompute::task_ptr<ReturnType(Args...)> can have its
arguments bound through hetcompute::task<ReturnType(Args...)>::bind_all.
Tasks are reference-counted, so they are automatically destroyed when no more
hetcompute::task_ptr<>s reference them. When a task is launched, the HetCompute runtime
increases the reference count of the task. This prevents the task from being destroyed, even if all pointers
referencing the task are reset. The HetCompute runtime decrements the reference count of the task after it
finishes (completes execution, throws an exception, or is canceled). The task reference count requires
atomic operations. Copying a hetcompute::task_ptr<> causes an atomic increment and the new
copy of the hetcompute::task_ptr<> causes an atomic decrement when it goes out of scope. For
best results, minimize the number of times your application copies hetcompute::task_ptr<>s.
Some algorithms require constantly passing hetcompute::task_ptr<>s. To maintain high
performance, HetCompute provides another task pointer type that does not perform reference counting:
hetcompute::task<>* (and, likewise, hetcompute::task<void>*, hetcompute::task<ReturnType>*, and
hetcompute::task<ReturnType(Args...)>*). Call get() on a hetcompute::task_ptr<> to obtain a
hetcompute::task<>* that points to the same task, as the last example in this section shows.

Note

Task lifetime is determined by the number of hetcompute::task_ptr<>s referencing it.


Programmers must ensure that there is always a valid hetcompute::task_ptr<> while using a
hetcompute::task<>*; otherwise memory corruption and/or segmentation faults may result.
Any operation permitted on a hetcompute::task_ptr<> is also permitted on a
hetcompute::task<>*.

It is incorrect to capture a hetcompute::task_ptr<> by reference, as in the following example:

hetcompute::task_ptr<> t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!"); });
hetcompute::task_ptr<> t2 = hetcompute::create_task([&t1] { HETCOMPUTE_ILOG("Hello World from t2!"); });

Instead, copy the pointer:

hetcompute::task_ptr<> t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!"); });
hetcompute::task_ptr<> t2 = hetcompute::create_task([t1] { HETCOMPUTE_ILOG("Hello World from t2!"); });

Or use a hetcompute::task<>* (of course, make sure that t1 does not go out of scope):

hetcompute::task_ptr<> t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!"); });

hetcompute::task<>* unsafe_t1 = t1.get();

hetcompute::task_ptr<> t2 = hetcompute::create_task([unsafe_t1] { HETCOMPUTE_ILOG("Hello World from t2!"); });

hetcompute::task_ptr<> t3 = hetcompute::create_task([&unsafe_t1] { HETCOMPUTE_ILOG("Hello World from t3!"); });
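The difference between copying the pointer and using a raw pointer can be observed with std::shared_ptr, used here only as an analogy for the reference counting described above. This sketch makes assumptions: hetcompute::task_ptr is a custom smart pointer whose internals may differ, and fake_task and the three functions are hypothetical names that do not exist in HetCompute.

```cpp
#include <cassert>
#include <functional>
#include <memory>

// Stand-in for a task object; not a HetCompute type.
struct fake_task
{
    int id;
};

// Capturing the smart pointer by value copies it, so the count rises to 2.
inline long count_with_value_capture()
{
    auto t1 = std::make_shared<fake_task>(fake_task{1});
    auto body = [t1] { return t1->id; };
    (void)body;
    return t1.use_count();
}

// Capturing a raw pointer adds no reference, so the count stays at 1.
inline long count_with_raw_pointer()
{
    auto t1 = std::make_shared<fake_task>(fake_task{1});
    fake_task* unsafe_t1 = t1.get();
    auto body = [unsafe_t1] { return unsafe_t1->id; };
    assert(body() == 1); // valid only because t1 is still in scope
    return t1.use_count();
}

// A by-value capture keeps the object alive after the original pointer
// goes out of scope.
inline bool value_capture_keeps_task_alive()
{
    std::weak_ptr<fake_task> observer;
    std::function<int()> body;
    {
        auto t1 = std::make_shared<fake_task>(fake_task{7});
        observer = t1;
        body = [t1] { return t1->id; };
    } // t1 is gone here; the copy held by body remains
    return !observer.expired() && body() == 7;
}
```

In this model the by-value capture pays for one extra counted reference, which is the cost the text recommends minimizing, while the raw pointer is free but is only safe for as long as some owning pointer stays alive.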

5.3.4 Life of a HetCompute Task


A HetCompute task transitions through various stages from the time it is created to the time it finishes,
that is, either completes execution successfully or is canceled (the two outcomes together are called the
Finished state). The figure below presents a high-level view of a task's lifetime. The nodes represent task
states and the numbered edges represent transitions from one state to the next. Most of the operations on
tasks (through hetcompute::task_ptr<> or hetcompute::task<>*) result in a task transitioning from one
state to another.

Figure 5-5 Lifetime of a HetCompute Task

The state of a task determines the operations permitted on it. Understanding task states and transitions is
useful to debug both correctness and performance issues with your HetCompute program. For example, as
shown below, the program may hang because a task which is never launched is waited for.
auto t = hetcompute::create_task([]{});
// wait for the task without launching it
t->wait_for(); // never released

Stepping through the operations on a task with the above task state diagram in mind may help to identify
and fix errors such as the one in the code snippet.
Note

Neither the task states nor the transitions between them are part of the HetCompute API. They are
described herein merely for the sake of understanding.

Most HetCompute tasks take the green line to successful completion, while some take the red line due to a
variety of reasons, e.g., throwing an exception. Some tasks may directly be created in the Launched state
(via hetcompute::launch(Code&&, Args&&...)), while others may directly be created and
launched in the Ready state (via hetcompute::launch(Code&&, Args&&...) with all arguments
ready).


5.3.4.1 The Green Line to Successful Completion

A task created using hetcompute::create_task(Code&&), hetcompute::launch(Code&&), etc. has at least
one hetcompute::task_ptr<> alive in user code. This hetcompute::task_ptr<> may be used to perform
operations on the task. The state transitions along the green line are described below:

• (1) After setting up any control dependencies (via hetcompute::task<>::then()) or data dependencies
(via hetcompute::task<ReturnType(Args...)>::bind_all()) from other tasks, use
hetcompute::task<>::launch() to register the task with the HetCompute runtime system.
No further dependencies may be added to a Launched task.
• (2) After all tasks on which a task is control- or data-dependent have transitioned to Completed, the
task becomes Ready for execution.
• (3) When an appropriate execution resource (CPU, GPU, or DSP) becomes available and any other
resources such as hetcompute::buffers that may be used by the task become available, the task
transitions to the Running state.
• (4) Finally, after the task completes execution successfully, and any other tasks after which it is set to
finish (via hetcompute::task<>::finish_after()) have also finished, the task
transitions to the Completed state.

5.3.4.2 The Red Line to Cancellation

Some tasks may not execute successfully (e.g., throw an exception), may be canceled programmatically, etc.;
such tasks take the red line and end up in the Canceled state. See Exceptions and Cancellation for more
details. The state transitions along the red line are described below:

• (5) If a task or any task on which it is control- or data-dependent is canceled via
hetcompute::task<>::cancel(), the task transitions to the Canceled state.
• (6) If a task or any task on which it is control- or data-dependent is canceled via
hetcompute::task<>::cancel(), the task transitions to the Canceled state.
• (7) If a task or any task on which it is control- or data-dependent is canceled via
hetcompute::task<>::cancel(), the task transitions to the Canceled state.
• (8) If a task or any task on which it is control- or data-dependent is canceled via
hetcompute::task<>::cancel(), the task transitions to the Canceled state. Additionally,
an executing task may throw an exception as a result of which the task and its successor tasks are
canceled. An executing task may choose to ignore a cancellation request and may complete
successfully. Alternatively, an executing task may accept a cancellation request and choose to abort
(e.g., via hetcompute::abort_on_cancel()) and transition to the Canceled state.
• (9) Finally, some created tasks may never be launched. When the last
hetcompute::task_ptr<> pointing to such a task goes out of scope, the task is automatically
canceled. Any successor tasks are also canceled.
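
The green and red lines described above can be summarized as a small state machine. The sketch below is only a reading aid under stated assumptions: as the note earlier in this section says, task states are not part of the HetCompute API, so task_state, task_event, next_state, and walk_green_line are hypothetical names. The model also simplifies cancellation by treating every cancel request as accepted, whereas a running task may ignore one.

```cpp
#include <cassert>

// Hypothetical model of the states in the task lifetime figure.
enum class task_state { created, launched, ready, running, completed, canceled };

// Hypothetical events that drive the numbered transitions.
enum class task_event { launch, dependencies_done, resource_available, body_done, cancel };

inline bool is_finished(task_state s)
{
    return s == task_state::completed || s == task_state::canceled;
}

// Returns the state that follows s when event e occurs (model only).
inline task_state next_state(task_state s, task_event e)
{
    if (is_finished(s))
        return s; // finished tasks never change state
    if (e == task_event::cancel)
        return task_state::canceled; // red line, transitions (5) to (9), simplified
    switch (s)
    {
    case task_state::created:
        return e == task_event::launch ? task_state::launched : s; // (1)
    case task_state::launched:
        return e == task_event::dependencies_done ? task_state::ready : s; // (2)
    case task_state::ready:
        return e == task_event::resource_available ? task_state::running : s; // (3)
    case task_state::running:
        return e == task_event::body_done ? task_state::completed : s; // (4)
    default:
        return s;
    }
}

// Walks the green line from creation to successful completion.
inline task_state walk_green_line()
{
    task_state s = next_state(task_state::created, task_event::launch);
    assert(s == task_state::launched);
    s = next_state(s, task_event::dependencies_done);
    s = next_state(s, task_event::resource_available);
    return next_state(s, task_event::body_done);
}
```

In this model a task that is never launched stays in the created state no matter which other events arrive, so is_finished never becomes true for it. That is the modeled counterpart of the wait_for() call that never returns in the earlier snippet.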


5.3.5 Launching Tasks


Tasks do not execute unless they are launched. Creating Tasks describes how to create and launch tasks
using hetcompute::launch(...) and hetcompute::group::launch(...). Tasks launched in this way execute as
soon as hardware contexts are available.
Note: hetcompute::group::launch(Code&&, Args&&...) does not return a
hetcompute::task_ptr<...>; the task it creates is therefore anonymous and cannot be part of a
DAG.
Tasks that are part of a DAG must be created using one of the following:

1. template<typename ReturnType, typename... Args>
   hetcompute::task_ptr<ReturnType>
   hetcompute::create_value_task(Args&& ...args)
2. template<typename Code, typename... Args>
   collapsed_task_type<Code>
   hetcompute::create_task(Code&&, Args&&...)
3. template<typename Code, typename... Args>
   non_collapsed_task_type<Code>
   hetcompute::create_task(do_not_collapse_t, Code&&, Args&&...)

These template methods return a task pointer that can be used to set up dependencies between tasks. Once
the control dependencies of a task t are set, use

1. t->launch(args...) to launch the task and bind the arguments for data dependency;
2. t->launch() to launch the task, but the arguments need to be bound already by
t->bind_all(args...). For more information about task argument binding, see Unleashing
Asynchrony and Tasks.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task.
    auto t1 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World! x = %d\n", x); });

    // t1 is ready, launch it and bind the argument.
    t1->launch(42);

    // Wait for t1 to finish.
    t1->wait_for();

    // Create a task.
    auto t2 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World! x = %d\n", x); });

    // Bind the task argument to t2.
    t2->bind_all(73);

    // t2 is ready, launch it.
    t2->launch();

    // Wait for t2 to finish.
    t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

The methods hetcompute::launch(...) and hetcompute::group::launch(...) inform the HetCompute runtime
that the task is ready to execute as soon as a hardware context is available and after all its
predecessors have executed. In the following example, task t2 is launched, but it will never execute
because its predecessor t1 is never launched:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task t1.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!"); });

    // Create task t2.
    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t2!"); });

    // Set dependency between t1 and t2 so that t2 won't execute until t1 finishes.
    t1->then(t2);

    // Launch t2.
    t2->launch();

    // If uncommented, the wait_for below will never return
    // because t2 will not execute until t1
    // does. However, t1 has not been launched.
    // t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Notice that launching a task means that it is not possible to add any new predecessors, although you can add
successors. The reason is that, by launching the task, the programmer is asking the HetCompute runtime to
execute the task as soon as possible. By the time the programmer tries to add a new predecessor to the task,
the task might have already executed, and adding a predecessor to an already-executed task is not allowed.
Tasks can launch only once. Any subsequent calls to hetcompute::task<>::launch(...) or
hetcompute::group::launch(...) on a task do not cause the task to execute again. Calls of
hetcompute::group::launch(...) on a task might, however, cause the task to be added to new
groups. See Task Groups.

5.3.6 Task Dependencies


Often, the concurrency in an application is not regularly structured or is highly dynamic, and therefore
cannot be easily expressed through any composition of the HetCompute Parallel Programming Patterns. To
express such irregular or dynamic concurrency, HetCompute provides a means to specify control and data
dependencies among tasks. The programmer can construct rich acyclic task graphs that span the CPU,
GPU, and DSP, in a unified fashion. A task may have multiple predecessors and successors in the task
graph. A task becomes ready for execution only after all its (control and data dependency) predecessors
have completed successfully.
Note

While control dependencies can be set up among any CPU, GPU, or DSP tasks, data dependencies can
be set up only among CPU tasks in the current HetCompute release.


5.3.6.1 Control Dependencies

Control dependencies may be set up among tasks using hetcompute::task<>::then() to specify
the relative order of task execution. The following example shows how to ensure that task t1 executes
before task t2. The HetCompute runtime guarantees that t2 will not begin execution until t1 completes
execution, regardless of how many hardware execution contexts are available in the system.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello "); });

    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("World!"); });

    // Ensure that t1 executes before t2
    t1->then(t2);

    t1->launch();
    t2->launch();

    t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Note

In the example above, the statement t1->then(t2) guarantees that t1 finishes before t2 begins
execution. Consequently, it suffices to just wait_for t2 to finish to ensure that both t1 and t2
finish.

5.3.6.2 Data Dependencies

A task t2 can be data-dependent on another task t1 as the example below shows:


#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] { return 42; });

    auto t2 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); });

    // Set up data dependency from t1 to t2
    t2->bind_all(t1);

    t1->launch();
    t2->launch();

    t2->wait_for();

    hetcompute::runtime::shutdown();

    return 0;
}

In the example, the data dependency is specified using
hetcompute::task<ReturnType(Args...)>::bind_all(). Binding a task argument to a
hetcompute::task_ptr<ReturnType> (or hetcompute::task_ptr<ReturnType(Args...)>) implicitly creates a
data dependency from the latter to the former. Specifying data dependencies in this manner uses the same
syntax as normal binding of values to function arguments, and the two can be combined, as the following
example illustrates:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] { return 42; });

    auto t2 = hetcompute::create_task([](int i, float f) { HETCOMPUTE_ILOG("int %d float %f", i, f); });

    // Bind i to t1: data dependency from t1 to t2
    // Bind f to value: normal function argument binding
    t2->bind_all(t1, 42.0f);

    t1->launch();
    t2->launch();

    t2->wait_for();

    hetcompute::runtime::shutdown();

    return 0;
}

bind_all must be invoked on a task_ptr of type hetcompute::task_ptr<ReturnType(Args...)>
with an argument of type hetcompute::task_ptr<ReturnType> or
hetcompute::task_ptr<ReturnType(Args...)>. bind_all cannot be called on plain
hetcompute::task_ptr<>s, and a plain hetcompute::task_ptr<> cannot be passed to it as a data source:
auto t1 = hetcompute::create_task([]{return 42;});
auto t2 = hetcompute::create_task([](int i) {/*do something*/});
hetcompute::task_ptr<> t = t2;
t2->bind_all(t1); // Allowed
t->bind_all(t1); // Not allowed

auto t1 = hetcompute::create_task([]{return 42;});
hetcompute::task_ptr<> t = t1;
auto t2 = hetcompute::create_task([](int i) {/*do something*/});
t2->bind_all(t1); // Allowed
t2->bind_all(t); // Not allowed

Note

In this release of HetCompute, task arguments cannot be references.

For programmatic convenience, data dependencies can be specified similar to task argument binding at any
one of the following points:

• hetcompute::create_task()
• hetcompute::task<ReturnType(Args...)>::bind_all()
• hetcompute::task<ReturnType(Args...)>::launch()
The following example illustrates the above:
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] { return 42; });

    t1->launch();

    auto t2 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); },
                                      t1); // Set up data dependency from t1 to t2 when t2 is created
    t2->launch();

    auto t3 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); });
    // Set up data dependency from t1 to t3
    t3->bind_all(t1);
    t3->launch();

    auto t4 = hetcompute::create_task([](int i) { HETCOMPUTE_ILOG("The answer to life the universe and everything = %d", i); });
    // Set up data dependency from t1 to t4 at launch-time
    t4->launch(t1);

    t2->wait_for();
    t3->wait_for();
    t4->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Tasks t2, t3, and t4 are all data-dependent on task t1. But the data dependency is specified through
different HetCompute API calls.
Use HetCompute data dependencies to easily and automatically manage lifetime of data accessed by tasks.
The data returned by a task is kept alive until the point the task finishes execution and all references to the
task (through hetcompute::task_ptr<>) go out of scope.

5.3.6.3 Heterogeneous Task Graphs

As mentioned previously, tasks can encapsulate computation to be executed on the CPU, GPU, or DSP. This
allows the creation of directed acyclic graphs (DAGs) of tasks whose execution spans across all three
processing units:
1 #include <hetcompute/hetcompute.hh>
2
3 // header to include the dsp bindings, it is generated by the Hexagon SDK
4 #include <include/hetcompute_dsp.h>
5
6 static std::string const gpu_fn_string = "__kernel void gpu_fn() {"
7 "}";
8
9 int
10 main()
11 {
12 hetcompute::runtime::init();
13
14 // This is to ensure all task objects are freed before we call shutdown
15 {
16 // CPU task
17 auto t1 = hetcompute::create_task([] {});
18
19 // Create a dsp_kernel from a DSP function
20 auto dsp_kernel = hetcompute::create_dsp_kernel<>(hetcompute_dsp_return_input);
21 int i = 0;
22
23 auto t2 = hetcompute::create_task(dsp_kernel, i);
24
25 // Create a GPU kernel from an OpenCL string
26 auto gpu_kernel = hetcompute::create_gpu_kernel<>(gpu_fn_string, "gpu_fn");
27 auto t3 = hetcompute::create_task(gpu_kernel, hetcompute::range<1>(1024));
28
29 // CPU task
30 auto t4 = hetcompute::create_task([i] { return i; });
31
32 // CPU task
33 auto t5 = hetcompute::create_task([](int) { /*do something*/ });
34
35 // Create a heterogeneous task DAG consisting of control and data
36 // dependencies
37 t1->then(t2)->then(t4);
38 t2->then(t3)->then(t5);
39 t5->bind_all(t4);
40
41 // Launch tasks for execution
42 t1->launch();
43 t2->launch();
44 t3->launch();
45 t4->launch();
46 t5->launch();
47
48 // Wait for the DAG to finish
49 t5->wait_for();
50 // Note that task dependencies ensure that tasks t1, t2, t3, and t4 have also
51 // finished by this point.
52 }
53
54 hetcompute::runtime::shutdown();
55 return 0;
56 }

In the above example, the programmer creates CPU tasks on lines 17, 30, and 33; a Hexagon DSP task on
line 23 (from the kernel created on line 20); and a GPU task on line 27. The programmer then sets up a combination of control and data
dependencies among these tasks on lines 37 through 39. These dependencies create a heterogeneous task
graph as the figure below shows.

Figure 5-6 Heterogeneous task graph

Subsequently, on lines 42 through 46, the programmer launches the task graph for execution and waits for
its completion on line 49.


Warning

A cycle in the DAG may cause deadlock. For performance reasons, HetCompute does not check
whether there are cycles in the DAG. The programmer is responsible for avoiding them.
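
Because the runtime does not check for cycles, an application that builds its graph dynamically may want to check its own dependency edges before launching anything. The following is a minimal sketch of such a check using a depth-first search over an adjacency list. It assumes the application keeps its own record of the edges; the names graph, has_cycle, and dfs_finds_cycle are hypothetical, and nothing here calls or inspects HetCompute.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// edges[u] lists the successors of task u, as recorded by the application
// (for example, one entry per then() or bind_all() dependency it sets up).
using graph = std::vector<std::vector<std::size_t>>;

namespace detail
{
enum class mark { unvisited, in_progress, done };

// Depth-first search; reaching a node that is still in progress means a back edge.
inline bool dfs_finds_cycle(const graph& edges, std::size_t u, std::vector<mark>& marks)
{
    marks[u] = mark::in_progress;
    for (std::size_t v : edges[u])
    {
        assert(v < edges.size()); // every successor index must name a known task
        if (marks[v] == mark::in_progress)
            return true;
        if (marks[v] == mark::unvisited && dfs_finds_cycle(edges, v, marks))
            return true;
    }
    marks[u] = mark::done;
    return false;
}
} // namespace detail

// Returns true if the recorded dependency edges contain a cycle.
inline bool has_cycle(const graph& edges)
{
    std::vector<detail::mark> marks(edges.size(), detail::mark::unvisited);
    for (std::size_t u = 0; u < edges.size(); ++u)
    {
        if (marks[u] == detail::mark::unvisited && detail::dfs_finds_cycle(edges, u, marks))
            return true;
    }
    return false;
}
```

The check visits each task and each edge once, so it is cheap enough to run in debug builds right before the launch calls.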

All the predecessor dependencies of a task must be specified before launching the task. Specifying
inter-task dependencies in this manner ahead of task execution provides information to the HetCompute
runtime system, allowing it to schedule the tasks intelligently, optimizing for performance and power. While
this is the preferred method, alternative means to specify inter-task dependencies exist: after a task starts
execution, it can invoke hetcompute::task<>::wait_for() (see Waiting for Tasks) or
hetcompute::task<>::finish_after() (see finish_after) on some hetcompute::task_ptr<>. Using these
methods limits the scope of optimization in the HetCompute runtime task scheduler. The following
example illustrates the two different methods:
#include <hetcompute/hetcompute.hh>

void task_dependency();
void task_waiting();

void
task_dependency()
{
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello "); });

    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("World!"); });

    // Ensure that t1 executes before t2
    // *PREFERRED METHOD* of specifying task dependency
    t1->then(t2);

    t1->launch();
    t2->launch();

    t2->wait_for();
}

void
task_waiting()
{
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello "); });

    auto t2 = hetcompute::create_task([t1] {
        // Wait for t1 to finish execution
        // *LESS PREFERRED METHOD* of specifying task dependency
        t1->wait_for();
        HETCOMPUTE_ILOG("World!");
    });

    t1->launch();
    t2->launch();

    t2->wait_for();
}

int
main()
{
    hetcompute::runtime::init();
    // preferred
    task_dependency();
    // less preferred
    task_waiting();
    hetcompute::runtime::shutdown();
}


5.3.7 Task Groups


One of the most common parallel programming patterns is fork and join. The idea is quite intuitive. At the
fork, the application splits the job-to-be-done into many tasks. At the join, the application waits for all of
them to complete before continuing with its execution. An example of fork and join could be styling a
website. During the styling phase, a parallel web browser could traverse the DOM tree and spawn one task
per DOM element. Each of the tasks would be responsible for styling an element. The browser can only
render the page once all the styling tasks complete.
In this scenario, waiting for each individual task would be cumbersome: the programmer would have to
store the task pointers in some container and call hetcompute::task::wait_for on each of them
after traversing the DOM tree. A group is a HetCompute abstraction that allows the programmer to wait for
a set of tasks (which may each run on a different kind of device) to complete, relieving the programmer from
having to wait for each task separately.
In addition to the styling tasks, the parallel browser would have launched other tasks to do HTML parsing,
JavaScript execution, etc. A logical design decision would be to have all tasks working on the same page
belong to the same group. Notice that the styling tasks would need to belong to two groups: the "Page
XYZ" group and the "Page XYZ-styling" group. HetCompute supports tasks that belong to multiple
groups.
Cancellation is another important operation that can be done with groups. The method
hetcompute::group::cancel() cancels all the tasks in a group. In our parallel browser example,
this would allow the browser to easily cancel all the tasks in the "Page XYZ" group when the user decides
to navigate to a new page before the current one displays.
In summary, groups are sets of tasks that can be canceled or waited for as a unit.

5.3.7.1 Group Creation

Use hetcompute::create_group() to create a new unnamed group. Use
hetcompute::create_group(std::string const& name) to create a new group called
<name>. Group names are used only for debugging applications; HetCompute does not check for
duplicate group names. These functions return a hetcompute::group_ptr pointing to the created
group. Use hetcompute::group::get_name() to get the group name. The following code
uses hetcompute::create_group and group::get_name to create two named groups and one
unnamed group and then display their names.
#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}

5.3.7.1.1 The group_ptr pointer

The method hetcompute::group_ptr hetcompute::create_group(std::string const& name)
returns an object of type hetcompute::group_ptr, which is a custom smart
pointer to the group object. hetcompute::group_ptr pointers behave similarly to
hetcompute::task_ptr. Therefore, groups are reference counted, and they are automatically
destroyed when there are no more hetcompute::group_ptr pointers pointing to them and there is no
task in the group (an empty group). This means that even if the user has no pointers to the group,
HetCompute will not destroy the group until all its tasks complete.

5.3.7.2 Launching Tasks or Kernels to Groups

There are three ways to add a task or kernel to a group: by creating and launching it in one step, by
launching an existing task into the group, or by adding it to the group without launching it.

Creating and Launching

By creating a new task and immediately launching it using void
hetcompute::group::launch(Code&& code, Args&&... args). Use this
method when the task (or kernel) does not have predecessors or successors. This is the most performant
way to create and launch a task in HetCompute, because the HetCompute runtime knows that
the programmer has no pointer to the task and can perform aggressive optimizations. It is for this
reason that this method does not return a hetcompute::task_ptr. Use this method as much as
you can. This method also binds its arguments args to the arguments of the task or kernel that is
being launched. The following example creates multiple tasks and binds their arguments when
launching them all into a group:

#include <hetcompute/hetcompute.hh>

static void
do_something(int, int)
{
}

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Create tasks from do_something and launch them into g
    for (int i = 0; i < 10; i++)
        for (int j = 10; j < 20; j++)
            g->launch(do_something, i, j);

    // Wait for all the tasks in group g to complete
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}


Launching

By launching an existing task using void
hetcompute::group::launch(hetcompute::task_ptr<> const& task). If the task has arguments that need to be
bound when launched, the arguments can be bound by using hetcompute::task::bind_all before
launching. Alternatively, you can use void
hetcompute::group::launch(hetcompute::task<TaskType>* task, FirstArg&& first_arg,
RestArgs&&... rest_args), which binds the arguments as you launch. In the following
example, the constant 42 will be bound to the argument x of the task when launched into the group:

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void(int)>
    auto hello = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    // Bind hello to 42 and launch task into g
    g->launch(hello, 42);

    // Wait for g to be empty
    g->wait_for();

    hetcompute::runtime::shutdown();
}

Adding

By adding an existing task to a group using hetcompute::group::add(task_ptr<>
const& task). This function does not launch the task; the launch has to be done separately.

1 #include <stdio.h>
2 #include <hetcompute/hetcompute.hh>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create group g.
9 auto g = hetcompute::create_group();
10
11 // Create task t1. Its type is hetcompute::task_ptr<void()>
12 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n")
; });
13
14 // Add task t1 to group g, but do not launch it.
15 g->add(t1);
16
17 auto t2 = hetcompute::launch([t1] {
18 // Launch t1. Because it already belongs to group g, there is no
19 // reason to use hetcompute::group::launch.
20 t1->launch();
21 });
22
23 // Wait for tasks in group g to complete.
24 g->wait_for();
25 hetcompute::runtime::shutdown();
26
27 return 0;
28 }

Use these methods when the task is part of a DAG. For example, there is a dependency between t1 and t2
in the following example:
1 #include <stdio.h>
2 #include <hetcompute/hetcompute.hh>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create Example group
9 auto g = hetcompute::create_group();
10
11 // Create tasks
12 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
13 auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t2!\n"); });
14
15 // Launch t1 into g
16 g->launch(t1);
17
18 // Make t2 a successor of t1
19 t1->then(t2);
20
21 // Launch t2 into g
22 g->launch(t2);
23
24 // Wait for tasks to complete
25 g->wait_for();
26
27 hetcompute::runtime::shutdown();
28 return 0;
29 }

Regardless of the method used, the following rules always apply:

• Tasks stay in the group until they complete execution. Once a task is added to a group, there is no
way to remove it from the group.
• Once a task belonging to multiple groups completes execution, HetCompute removes it from all the
groups to which it belongs.
• Neither completed nor canceled tasks can join groups.
• Tasks cannot be added to a canceled group.

5.3.7.2.1 Adding a Task to Multiple Groups

There are two ways to add a task to multiple groups:

• By launching it into each of the groups. For example, to add task t to groups g1 and g2, the
following would be performed:
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create task
8 auto t = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });
9
10 // Create groups
11 auto g1 = hetcompute::create_group("Example 1");
12 auto g2 = hetcompute::create_group("Example 2");
13
14 // Launch t into g1 and g2
15 g1->launch(t);
16 g2->launch(t);
17
18 t->wait_for();
19 hetcompute::runtime::shutdown();
20 }

Notice that, in the example above, t joins both g1 and g2, but it only executes once. Therefore, the code
snippet outputs a single ’Hello World!’. However, t might never join g2 because it might complete
execution before the first launch returns. Remember that completed tasks can never join groups.

• By creating a new group that is the intersection of all the groups into which the task needs to launch, and
then launching the task into it. hetcompute::group_ptr intersect(hetcompute::group_ptr const& a,
hetcompute::group_ptr const& b) returns a pointer to a group that represents the intersection
of the two groups passed as arguments. This
method is more performant than repeatedly launching the same task into different groups.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create task
8 auto t = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });
9
10 // Create groups
11 auto g1 = hetcompute::create_group("Example 1");
12 auto g2 = hetcompute::create_group("Example 2");
13
14 auto g12 = hetcompute::intersect(g1, g2);
15
16 // Launch t into g1 and g2
17 g12->launch(t);
18 hetcompute::runtime::shutdown();
19 }

5.3.7.2.1.1 Group Intersection

It is important to understand what group intersection really means, because it might appear counterintuitive.
hetcompute::intersect returns a pointer to a group that represents the intersection of two or more
groups. Launching a task into the intersection group means simultaneously launching it into all the groups
that are part of the intersection.
For example, the following code snippet shows an application with two groups, g1 and g2, with 3000 and
2000 tasks in each, respectively.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create groups
8 auto g1 = hetcompute::create_group("Group 1");
9 auto g2 = hetcompute::create_group("Group 2");
10
11 auto g12 = hetcompute::intersect(g1, g2);
12
13 for (int i = 0; i < 3000; i++)
14 g1->launch([] {
15 //... Do something
16 });
17
18 for (int i = 0; i < 2000; i++)
19 g2->launch([] {
20 //... Do something
21 });
22
23 // Returns immediately. g12 is empty
24 g12->wait_for();
25
26 // Return only after tasks in g1 and g2 complete
27 g1->wait_for();
28 g2->wait_for();
29
30 g12->launch([] {
31 //... Calculate the Ultimate Question of Life,
32 // the Universe, and Everything
33 HETCOMPUTE_ILOG("42\n");
34 });
35
36 // All will return after the task prints 42
37 g1->wait_for();
38 g2->wait_for();
39 hetcompute::runtime::shutdown();
40
41 return 0;
42 }

In line 11, g1 and g2 are intersected into g12. The returned pointer, g12, points to an empty group
because no task belongs to both g1 and g2 yet. Therefore, g12->wait_for() in line 24 returns
immediately. The wait_for calls in lines 27 and 28 only return when their tasks complete. In line 30, a
new task is launched into g12. The wait_for calls in lines 37 and 38 only return after that task
completes execution, because the task belongs to both g1 and g2 (and, of course, g12).
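
The membership rule described above can be modeled with ordinary sets. The following is a minimal sketch that uses only the C++ standard library; toy_group, launch_into, pending_in_both, and intersection_size_after are hypothetical names chosen for illustration and are not part of the HetCompute API. Tasks are represented by integer IDs.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <set>
#include <vector>

// Toy model (not HetCompute): a group is the set of IDs of its tasks.
struct toy_group
{
    std::set<int> tasks;
};

// Launching into an intersection adds the task to every member group.
inline void launch_into(const std::vector<toy_group*>& members, int task_id)
{
    for (toy_group* g : members)
    {
        assert(g != nullptr);
        g->tasks.insert(task_id);
    }
}

// The tasks an intersection would wait for: those present in both groups.
inline std::set<int> pending_in_both(const toy_group& a, const toy_group& b)
{
    std::set<int> both;
    std::set_intersection(a.tasks.begin(), a.tasks.end(),
                          b.tasks.begin(), b.tasks.end(),
                          std::inserter(both, both.begin()));
    return both;
}

// Mirrors the listing: 3000 tasks in g1, 2000 different tasks in g2,
// then optionally one task launched into the intersection of g1 and g2.
inline std::size_t intersection_size_after(bool launch_shared_task)
{
    toy_group g1, g2;
    for (int id = 0; id < 3000; ++id) g1.tasks.insert(id);
    for (int id = 3000; id < 5000; ++id) g2.tasks.insert(id);
    if (launch_shared_task)
    {
        launch_into({&g1, &g2}, 5000);
    }
    return pending_in_both(g1, g2).size();
}
```

In this model the intersection is empty even though both groups are busy, and it holds exactly one task after a task is launched into both groups, which matches the behavior of g12->wait_for() in the listing.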

Note: You can use the & operator instead of hetcompute::intersect.

Keep in mind that group intersection is a somewhat expensive operation. If you need to intersect the same groups
repeatedly, do it once and keep the pointer to the intersection group alive, as in the following example:
1 #include <memory>
2 #include <hetcompute/hetcompute.hh>
3
4 static void
5 do_something()
6 {
7 }
8
9 int
10 main()
11 {
12 hetcompute::runtime::init();
13 // Create groups
14 auto g1 = hetcompute::create_group("Example group 1");
15 auto g2 = hetcompute::create_group("Example group 2");
16 auto g12 = g1 & g2;
17
18 // Launch 10 tasks into g1 and g2
19 for (int i = 0; i < 10; i++)
20 {
21 g12->launch(do_something);
22 }
23
24 g12->wait_for();
25 hetcompute::runtime::shutdown();
26 return 0;
27 }

Therefore, the code snippet above is faster than the one below:
1 #include <cassert>
2 #include <hetcompute/hetcompute.hh>
3
4 static void
5 do_something()
6 {
7 }
8
9 int
10 main()
11 {
12 hetcompute::runtime::init();
13 // Create groups
14 auto g1 = hetcompute::create_group("Example group 1");
15 auto g2 = hetcompute::create_group("Example group 2");
16
17 // Launch 10 tasks into g1 and g2
18 for (int i = 0; i < 10; i++)
19 {
20 (g1 & g2)->launch(do_something);
21 }
22
23 (g1 & g2)->wait_for();
24 hetcompute::runtime::shutdown();
25 return 0;
26 }

Consecutive calls to hetcompute::intersect with the same group pointers as arguments return a
pointer to the same group. In addition, group intersection is commutative:
1 #include <cassert>
2 #include <stdio.h>
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 // Create groups
10 auto g1 = hetcompute::create_group("Group 1");
11 auto g2 = hetcompute::create_group("Group 2");
12
13 // Get pointer to intersection groups:
14 auto g12 = g1 & g2;
15 auto g21 = g2 & g1;
16
17 // This assert will never fire
18 assert(g12 == g21);
19 hetcompute::runtime::shutdown();
20 }

and associative:
1 #include <cassert>
2 #include <stdio.h>
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 // Create groups
10 auto g1 = hetcompute::create_group("Group 1");
11 auto g2 = hetcompute::create_group("Group 2");
12 auto g3 = hetcompute::create_group("Group 3");
13
14 // Get pointers to intersection groups:
15 auto g12_3 = (g1 & g2) & g3;
16 auto g1_23 = g1 & (g2 & g3);
17 auto g2_13 = g2 & (g1 & g3);
18
19 // These asserts will never fire
20 assert(g12_3 == g1_23);
21 assert(g12_3 == g2_13);
22 hetcompute::runtime::shutdown();
23 }
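
One way to picture why these identities can hold is to treat an intersection as the set of leaf groups it covers and to keep one object per distinct set. The sketch below is a hypothetical model in standard C++ only; it is not taken from the HetCompute implementation, and toy_registry, leaf, intersect, and intersection_identities_hold are illustrative names.

```cpp
#include <cassert>
#include <map>
#include <memory>
#include <set>

// Toy model (not HetCompute): a group is identified by the set of
// leaf-group IDs that it covers.
struct toy_group
{
    std::set<int> leaves;
};

using toy_group_ptr = std::shared_ptr<toy_group>;

class toy_registry
{
public:
    // Create a new leaf group with a fresh ID and remember it.
    toy_group_ptr leaf()
    {
        auto g = std::make_shared<toy_group>();
        g->leaves.insert(next_id_++);
        cache_.emplace(g->leaves, g);
        return g;
    }

    // Intersecting merges the leaf sets and looks the result up in a
    // cache, so equal sets always map to the same object.
    toy_group_ptr intersect(const toy_group_ptr& a, const toy_group_ptr& b)
    {
        assert(a && b);
        std::set<int> key = a->leaves;
        key.insert(b->leaves.begin(), b->leaves.end());

        auto it = cache_.find(key);
        if (it != cache_.end())
        {
            return it->second;
        }
        auto g = std::make_shared<toy_group>();
        g->leaves = key;
        cache_.emplace(key, g);
        return g;
    }

private:
    int next_id_ = 0;
    std::map<std::set<int>, toy_group_ptr> cache_;
};

// Checks the three properties stated in the text on three leaf groups.
inline bool intersection_identities_hold()
{
    toy_registry r;
    auto g1 = r.leaf();
    auto g2 = r.leaf();
    auto g3 = r.leaf();

    bool same_pointer = r.intersect(g1, g2) == r.intersect(g1, g2);
    bool commutative = r.intersect(g1, g2) == r.intersect(g2, g1);
    bool associative = r.intersect(r.intersect(g1, g2), g3) ==
                       r.intersect(g1, r.intersect(g2, g3));
    return same_pointer && commutative && associative;
}
```

Even in this simplified model every call to intersect copies a set and performs a lookup, which illustrates why intersecting once and reusing the pointer is cheaper than intersecting inside a loop.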

5.3.7.2.2 Waiting For a Group

hetcompute::group::wait_for() does not return until all the tasks in the group have completed execution
or have been canceled.
1 #include <hetcompute/hetcompute.hh>
2 #include <stdio.h>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8 // Create group g
9 auto g = hetcompute::create_group();
10
11 // Launch 10 tasks into g
12 for (int i = 0; i < 10; i++)
13 {
14 g->launch([i] { HETCOMPUTE_ILOG("Hello World! I’m task #%d\n", i); });
15 }
16
17 // Wait for tasks to complete and exit group
18 g->wait_for();
19 hetcompute::runtime::shutdown();
20
21 return 0;
22 }

Note: As with hetcompute::task::wait_for(), if hetcompute::group::wait_for() is
called from within a task, HetCompute context-switches the task and finds another task to run. If
called from outside a task, it blocks the calling thread until it returns.
Waiting for an intersection group means that wait_for returns once the tasks in the intersection group
have completed or been canceled.
For example, g12->wait_for() in the following code returns immediately, because there are no tasks
in g12. Neither g1->wait_for() nor g2->wait_for() would return.
// Create and launch two tasks that never end
g1->launch([]{
while(1) {}
});

g2->launch([]{
while(1) {}
});

// Returns immediately because there are no
// tasks that belong to both g1 and g2
g12->wait_for();

// Never returns
g1->wait_for();
g2->wait_for();

hetcompute::group::wait_for() is a safe point. For information about safe points, see
Interoperability.
hetcompute::group::finish_after is a non-blocking alternative to wait_for.
It returns immediately, but the task calling it is guaranteed not
to finish before all the tasks in the group are finished.

5.3.7.3 Group Cancellation

Use hetcompute::group::cancel() to cancel all the tasks in a group. Canceling a group means
that:

• The group tasks that have not started execution will never execute.
• The group tasks that are executing will be canceled only when they call
hetcompute::abort_on_cancel. If any of these executing tasks is a blocking task,
HetCompute executes its cancel handler if it has not already been executed.
• Any tasks added to the group after the group is canceled will also be canceled.
In the following example, 2000 tasks are launched. Each task increments a counter and sleeps briefly, so that by
the time the group is canceled a few of the 2000 tasks are done, a few others are executing, and a large majority
are waiting to be executed. In line 29 the group is canceled. Running tasks that call
hetcompute::abort_on_cancel will see that their group has been canceled and abort the next time they call it.
group->wait_for() will not return before the running tasks end their execution, either because they call
hetcompute::abort_on_cancel() or because they run to completion.
1 #include <atomic>
2 #include <iostream>
3 #include <unistd.h>
4
5 #include <hetcompute/hetcompute.hh>
6
7 using namespace std;
8
9 int
10 main()
11 {
12     hetcompute::runtime::init();
13     // Counts the number of tasks that execute before the group gets
14     // canceled
15     atomic<size_t> counter(0);
16
17     auto group = hetcompute::create_group();
18
19     // Create 2000 tasks that increase an atomic counter
20     for (int i = 0; i < 2000; i++)
21     {
22         group->launch([&counter] {
23             counter++;
24             usleep(7);
25         });
26     }
27
28     // Cancel group
29     group->cancel();
30
31     // Wait for group to cancel
32     try
33     {
34         group->wait_for();
35     }
36     catch (const hetcompute::aggregate_exception& e)
37     {
38         // If many tasks were canceled, they each propagate a
39         // hetcompute::canceled_exception to the group, all of which get aggregated into
40         // a single hetcompute::aggregate_exception.
41         std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
42     }
43     catch (const hetcompute::canceled_exception& e)
44     {
45         // If all but one task finished by the time group cancellation took effect,
46         // then the one remaining task which was canceled will propagate a single
47         // hetcompute::canceled_exception.
48         std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
49     }
50     catch (...)
51     {
52         // Never reached
53     }
54     HETCOMPUTE_ILOG("wait_for returned after %zu tasks executed", counter.load());
55     hetcompute::runtime::shutdown();
56     return 0;
57 }

Like hetcompute::cancel(hetcompute::task_ptr const &),
hetcompute::group::cancel() returns immediately. Use hetcompute::group::wait_for
after hetcompute::group::cancel() to block execution until the group is empty.
Warning

Once a group is canceled, it cannot be "uncanceled".

5.3.8 Waiting for Tasks


Waiting for tasks is a necessary evil in parallel programming. Because computation is launched
asynchronously, most algorithms need a mechanism to guarantee that the computation has completed.
There are several ways to ensure that in Qualcomm HetCompute, such as setting dependencies between
successive computations or using other forms of nonblocking synchronization (see Unleashing Asynchrony).
While it is recommended to avoid blocking waits, sometimes you simply have to wait. This section shows
how.
The method hetcompute::task<>::wait_for() does not return until the task completes
execution. It returns as soon as the task completes or is canceled.

Note: If hetcompute::task<>::wait_for() is called from within a task, HetCompute
context-switches the task and finds another one to run. If called from outside a task (that is, the main
thread), HetCompute blocks the thread until hetcompute::task<>::wait_for() returns
(see Interoperability).
Note: It is helpful to use HetCompute primitives for blocking, because they communicate your intent to the
runtime. The runtime can then take action to keep the available resources fully utilized, rather than
just spinning.
Both hetcompute::task<>::wait_for() and hetcompute::group::wait_for() (see Task
Groups) are safe points. For information about safe points, see Interoperability.

5.3.9 Exceptions and Cancellation


In C++, exceptions provide a means to transfer control flow to special code blocks to gracefully handle
exceptional circumstances (e.g. runtime errors). In the sequential example below, function get_char
throws an exception due to illegal string access, and the exception is propagated up the call stack to main
which catches the exception and responds appropriately. The purpose of the exception in this example is to
notify callers of get_char that the function may not have completed successfully and therefore all of its
normal side effects (the write to the global variable c in the example) may not have taken effect.
1 #include <iostream>
2 #include <string>
3
4 #include <hetcompute/hetcompute.hh>
5
6 void get_char();
7
8 char c;
9
10 void
11 get_char()
12 {
13 c = std::string().at(1); // Illegal access of empty string
14 }
15
16 int
17 main()
18 {
19 hetcompute::runtime::init();
20 try
21 {
22 get_char();
23 std::cout << "got character " << c << " from string " << std::endl;
24 }
25 catch (const std::out_of_range&)
26 {
27 // Deal with exception
28 std::cerr << "illegal string access" << std::endl;
29 }
30 catch (...)
31 {
32 // Should never get here
33 }
34 hetcompute::runtime::shutdown();
35 return 0;
36 }

Now consider what happens when function get_char is executed asynchronously, possibly by a thread
different from the one that executes main. In the example below, the call stack of the thread executing
get_char does not contain the continuation of task t, because the continuation is possibly in one or more
(different) threads that synchronize with t through operations, such as wait_for. Consequently, normal
C++ exception propagation up the thread’s call stack is not sufficient for asynchronous programs. In the
example below, in the absence of a well-defined exceptions model for asynchronous programs, the main
thread waiting for task t on line 20 may resume execution without being aware that task t finished
unsuccessfully due to the exception.
1 #include <iostream>
2 #include <string>
3
4 #include <hetcompute/hetcompute.hh>
5
6 void get_char();
7
8 char c;
9
10 void
11 get_char()
12 {
13 c = std::string().at(1); // Illegal access of empty string
14 }
15
16 int
17 main()
18 {
19 auto t = hetcompute::launch([] { get_char(); }); // Executed asynchronously (possibly in a different thread)
20 t->wait_for(); // Synchronization point
21 std::cout << "got character " << c << " from string " << std::endl;
22 return 0;
23 }

HetCompute provides a well-defined exceptions model for asynchronous programs. If an asynchronous
HetCompute construct (such as a task, group, or pattern) does not finish successfully due to exceptional
circumstances, HetCompute captures the thrown exceptions and rethrows them at sync points of the
construct; these sync points are the dynamic program points at which synchronization operations on the
construct are invoked. Recall that a properly synchronized parallel program invokes synchronization
operations on, for example, a task t (such as wait_for, copy_value, move_value, or an inter-task dependency)
to ensure that any side effects of task t, such as its return value and stores to global variables, are
visible to instructions after the synchronization point. Below is the example from before, modified to
handle exceptions in the face of asynchronous execution. Note how copy_value acts as a
synchronization point at which exceptions are observed, similar to wait_for.
1 #include <iostream>
2 #include <string>
3
4 #include <hetcompute/hetcompute.hh>
5
6 char get_char();
7
8 char
9 get_char()
10 {
11 auto c = std::string().at(1); // Illegal access of empty string
12 return c;
13 }
14
15 int
16 main()
17 {
18 hetcompute::runtime::init();
19 auto t = hetcompute::launch([] { return get_char(); }); // Executed asynchronously (possibly in a different thread)
20 try
21 {
22 auto c = t->copy_value(); // Synchronization point
23 std::cout << "got character " << c << " from string " << std::endl;
24 }
25 catch (const std::out_of_range&)
26 {
27 // Deal with exception
28 std::cerr << "illegal string access" << std::endl;
29 }
30 catch (...)
31 {
32 // Should never get here
33 }
34 hetcompute::runtime::shutdown();
35 return 0;
36 }
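
The capture-and-rethrow behavior described above can be sketched with std::exception_ptr. The following is a single-threaded model in standard C++ only, written to illustrate the idea rather than to describe how HetCompute is implemented; toy_task, run, and exception_seen_at_sync_point are hypothetical names, and run() stands in for the runtime executing the task body on some thread.

```cpp
#include <cassert>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>

// Toy task (not HetCompute): runs a callable and remembers any
// exception that the callable throws.
class toy_task
{
public:
    explicit toy_task(std::function<char()> body) : body_(std::move(body)) {}

    // Stands in for the runtime executing the task body.
    void run()
    {
        try
        {
            value_ = body_();
        }
        catch (...)
        {
            error_ = std::current_exception(); // capture, do not propagate
        }
        done_ = true;
    }

    // Sync point: rethrows the captured exception, if there is one.
    char copy_value() const
    {
        assert(done_);
        if (error_)
        {
            std::rethrow_exception(error_);
        }
        return value_;
    }

private:
    std::function<char()> body_;
    char value_ = '\0';
    std::exception_ptr error_;
    bool done_ = false;
};

// Returns true if the std::out_of_range thrown inside the task body is
// observed at the sync point rather than while the body runs.
inline bool exception_seen_at_sync_point()
{
    toy_task t([] { return std::string().at(1); }); // illegal access
    t.run(); // no exception escapes here
    try
    {
        t.copy_value();
    }
    catch (const std::out_of_range&)
    {
        return true;
    }
    return false;
}
```

The point of the design is that the thread that runs the body never sees the exception leave run(); only the code that synchronizes with the task does.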

5.3.9.1 Aggregate Exception

More than one task in a group of tasks or a pattern executed using a group of tasks may throw an exception.
HetCompute captures all such exceptions in a hetcompute::aggregate_exception, which is
thrown at sync points of the group or pattern. The example below illustrates its use. Note the use of
hetcompute::aggregate_exception::has_next and
hetcompute::aggregate_exception::next to iterate through all the exceptions contained in
hetcompute::aggregate_exception. The exceptions may be rethrown in any order due to the
asynchronous nature of the constructs generating the exceptions.
1 #include <iostream>
2 #include <string>
3
4 #include <hetcompute/hetcompute.hh>
5
6 void get_char();
7 void get_num();
8
9 char c;
10 int n;
11
12 #ifdef _MSC_VER
13 #pragma warning(disable : 4702)
14 #endif
15
16 void
17 get_char()
18 {
19 c = std::string().at(1); // Illegal access of empty string
20 }
21
22 void
23 get_num()
24 {
25 throw std::exception();
26 n = 1;
27 n = 2;
28 }
29
30 int
31 main()
32 {
33 hetcompute::runtime::init();
34 auto g = hetcompute::create_group();
35 g->launch([] { get_char(); }); // Executed asynchronously (possibly in a different thread)
36 g->launch([] { get_num(); }); // Executed asynchronously (possibly in a different thread)
37 try
38 {
39 g->wait_for(); // Synchronization point
40 std::cout << "got character " << c << " from string " << std::endl;
41 std::cout << "got num " << n << std::endl;
42 }
43 catch (hetcompute::aggregate_exception& e)
44 {
45 // Deal with all exceptions
46 while (e.has_next())
47 {
48 try
49 {
50 e.next(); // throws contained exceptions one-by-one
51 }
52 catch (const std::out_of_range&)
53 {
54 std::cerr << "illegal string access" << std::endl;
55 }
56 catch (const std::exception& s)
57 {
58 std::cerr << s.what() << std::endl;
59 }
60 catch (...)
61 {
62 // Should never get here
63 }
64 }
65 }
66 catch (...)
67 {
68 // Should never get here
69 }
70 hetcompute::runtime::shutdown();
71 return 0;
72 }
73
74 #ifdef _MSC_VER
75 #pragma warning(default : 4702)
76 #endif
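
The has_next/next iteration pattern can be illustrated without the SDK. The sketch below is a single-threaded model in standard C++ only; toy_aggregate, run_all, count_failures, and demo_counts are hypothetical names, and the class only imitates the interface shape used in the listing above, not the real hetcompute::aggregate_exception.

```cpp
#include <cassert>
#include <cstddef>
#include <exception>
#include <functional>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Toy aggregate (not HetCompute): holds every exception captured from a
// batch of task bodies.
class toy_aggregate
{
public:
    void add(std::exception_ptr e) { errors_.push_back(e); }
    bool has_next() const { return index_ < errors_.size(); }

    // Rethrows the next stored exception. Precondition: has_next().
    [[noreturn]] void next()
    {
        assert(has_next());
        std::rethrow_exception(errors_[index_++]);
    }

private:
    std::vector<std::exception_ptr> errors_;
    std::size_t index_ = 0;
};

// Runs every body, collecting failures instead of stopping at the first.
inline toy_aggregate run_all(const std::vector<std::function<void()>>& bodies)
{
    toy_aggregate agg;
    for (const auto& body : bodies)
    {
        try
        {
            body();
        }
        catch (...)
        {
            agg.add(std::current_exception());
        }
    }
    return agg;
}

// Drains the aggregate and counts std::out_of_range separately from any
// other std::exception.
inline std::pair<int, int> count_failures(toy_aggregate& agg)
{
    int out_of_range = 0;
    int other = 0;
    while (agg.has_next())
    {
        try
        {
            agg.next(); // rethrows the stored exceptions one by one
        }
        catch (const std::out_of_range&)
        {
            ++out_of_range;
        }
        catch (const std::exception&)
        {
            ++other;
        }
    }
    return std::make_pair(out_of_range, other);
}

// Two failing bodies and one successful body.
inline std::pair<int, int> demo_counts()
{
    std::vector<std::function<void()>> bodies;
    bodies.push_back([] { (void)std::string().at(1); }); // std::out_of_range
    bodies.push_back([] { throw std::runtime_error("bad num"); });
    bodies.push_back([] {});
    toy_aggregate agg = run_all(bodies);
    return count_failures(agg);
}
```

Because the handlers count by type, the result does not depend on the order in which the exceptions were stored, which matters when the order is not deterministic.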

5.3.9.2 GPU/DSP Exception

If a GPU/DSP task encounters a runtime error, a corresponding hetcompute::gpu_exception or
hetcompute::dsp_exception is thrown at the sync points of that task.

5.3.9.3 Cancellation

While exceptions are thrown from within an executing asynchronous construct, there is another external
reason as to why an asynchronous construct may finish unsuccessfully. In HetCompute, tasks, groups, and
patterns may all be canceled programmatically. Programmatic cancellation is very useful in a variety of
scenarios:

• If a background task, such as fetching data from a remote server, takes too long, the user may cancel
it through the UI.
• If a group of tasks is engaged in searching a database, and one of them finds the intended data, the
other tasks may be canceled to avoid unnecessary work and save energy.
While the source of exceptions is intrinsic to the asynchronous construct and the source of cancellation is
extrinsic to the asynchronous construct, exceptions and cancellations both result in the asynchronous
construct finishing unsuccessfully. Therefore, HetCompute deals with both exceptions and cancellation in
an identical fashion. When a task, group, or pattern is canceled or throws an exception, HetCompute
records the fact. Subsequently, at sync points of the asynchronous construct, HetCompute throws
hetcompute::canceled_exception or rethrows the original exception thrown by the
asynchronous construct. If the exception is not handled at a sync point, then that exception propagates up
the call stack of the thread executing the sync point as per normal C++ exception propagation.
When a task is canceled or throws an exception, HetCompute cancels all of its successors in the task graph.
Any synchronization with the now-canceled successor tasks will throw the appropriate exception(s).
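
Propagation to successors amounts to a reachability walk over the task graph. The sketch below models it in standard C++ only; toy_graph, then, cancel, and cancel_middle_of_chain are hypothetical names, tasks are plain indices, and the model deliberately ignores task state (for example, it does not treat completed tasks specially).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy task graph (not HetCompute): task i lists its direct successors.
struct toy_graph
{
    std::vector<std::vector<std::size_t>> successors;
    std::vector<bool> canceled;

    explicit toy_graph(std::size_t n) : successors(n), canceled(n, false) {}

    // Corresponds to pred->then(succ) in the listings.
    void then(std::size_t pred, std::size_t succ)
    {
        assert(pred < successors.size() && succ < successors.size());
        successors[pred].push_back(succ);
    }

    // Marks a task as canceled and propagates the cancellation to every
    // task reachable from it.
    void cancel(std::size_t task)
    {
        assert(task < successors.size());
        std::vector<std::size_t> pending{task};
        while (!pending.empty())
        {
            std::size_t current = pending.back();
            pending.pop_back();
            if (canceled[current])
            {
                continue; // already handled
            }
            canceled[current] = true;
            for (std::size_t next : successors[current])
            {
                pending.push_back(next);
            }
        }
    }
};

// Chain 0 -> 1 -> 2 plus an unrelated task 3. Canceling task 1 cancels
// task 2 as well, but leaves tasks 0 and 3 untouched.
inline std::vector<bool> cancel_middle_of_chain()
{
    toy_graph g(4);
    g.then(0, 1);
    g.then(1, 2);
    g.cancel(1);
    return g.canceled;
}
```

Note that cancellation flows only forward along dependencies: the predecessor of the canceled task is unaffected.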

5.3.9.4 Synchronization Points where Exceptions are Observable

The following is a comprehensive list of program points at which HetCompute may throw an exception due
to an asynchronous construct throwing an exception or getting canceled:

• for task t, t->wait_for()
• for task t, t->copy_value()
• for task t, t->move_value()
• for group g, g->wait_for()
Note

To observe an exception thrown by a task, group, or pattern, the programmer must synchronize with
it through one of the above methods. If no sync point is reached before the handle to the asynchronous
construct (task_ptr or group_ptr) goes out of scope, the asynchronous construct is deemed
useless and the exception is never rethrown anywhere.

5.3.9.5 Canceling a Task

There are three main ways to cancel an individual task t:

• If you have a pointer to the task, you can use t->cancel().
• To cancel a running task from within the task body, call hetcompute::abort_task().
• An unlaunched task is canceled when every hetcompute::task_ptr pointing to the task goes
out-of-scope.
In this section, each of these cancellation methods is examined in detail.

5.3.9.5.1 t->cancel()

Use t->cancel() to cancel a task t and its successors. What happens to the task when the programmer
calls t->cancel() depends on the status of the task.

5.3.9.5.1.1 Canceling a Task Before It Executes

If a task is canceled before it is launched, it never executes, even if it is launched later. In addition, the
runtime cancels all successors in the task’s successor graph. In the following example, two tasks,
t1 and t2, are created with a dependency between them. Notice that either task would raise an
assertion if it executed. In line 18, t1 is canceled, which causes t2 to be canceled as well. In line 21, t2 is
launched, but this has no effect because the task will not execute, as it was canceled when t1 propagated its
cancellation.
1 #include <cassert>
2 #include <iostream>
3
4 #include <hetcompute/hetcompute.hh>
5
6 int
7 main()
8 {
9     hetcompute::runtime::init();
10     auto t1 = hetcompute::create_task([] { assert(false); });
11
12     auto t2 = hetcompute::create_task([] { assert(false); });
13
14     // Create control dependency.
15     t1->then(t2);
16
17     // Cancel t1, which propagates cancellation to t2
18     t1->cancel();
19
20     // Launch t2. Does nothing, t2 got canceled via cancellation propagation
21     t2->launch();
22
23     // Returns immediately, t2 is canceled.
24     try
25     {
26         t2->wait_for();
27     }
28     catch (const hetcompute::canceled_exception& e)
29     {
30         std::cout << e.what() << ": t2 was canceled" << std::endl;
31     }
32     catch (...)
33     {
34         // Never reached
35     }
36
37     hetcompute::runtime::shutdown();
38     return 0;
39 }

Similarly, if a task is canceled after it is launched, but before it starts executing, it never executes and
propagates the cancellation request to its successors. In the following example, three tasks, t1, t2, and t3,
are created and chained. In line 19, t2 is launched, but it cannot execute because its predecessor has not
yet executed. In line 22, t2 is canceled, which means that it will never execute. Because t3 is t2’s
successor, it is also canceled; if t3 had a successor, it would be canceled as well.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
10
11 auto t2 = hetcompute::create_task([] { assert(false); });
12
13 auto t3 = hetcompute::create_task([] { assert(false); });
14
15 // Create dependencies
16 t1->then(t2)->then(t3);
17
18 // Launch t2. It cannot execute as yet because t1 has not been launched.
19 t2->launch();
20
21 // Cancel t2, which propagates cancellation to t3
22 t2->cancel();
23
24 // Launch t1. It will execute because no one canceled it.
25 t1->launch();
26
27 // Returns after t1 completes execution
28 t1->wait_for();
29 hetcompute::runtime::shutdown();
30
31 return 0;
32 }

5.3.9.5.1.2 Canceling a Task While It Executes

Canceling a task that is executing is more involved because HetCompute uses cooperative
multitasking. This means that, once a task is executing, it is not pre-empted unless it voluntarily
cedes the processor (e.g., by calling hetcompute::task<>::wait_for()). Thus, it is up to the
task to check periodically whether or not it has been canceled. Use
hetcompute::abort_on_cancel() inside a task body to abort the task immediately if the task, or
any of the groups to which it belongs, has been canceled.
1 #include <cassert>
2 #include <iostream>
3 #include <unistd.h>
4
5 #include <hetcompute/hetcompute.hh>
6
7 int
8 main()
9 {
10     hetcompute::runtime::init();
11
12     auto t = hetcompute::create_task([] {
13         while (1)
14         {
15             hetcompute::abort_on_cancel();
16             HETCOMPUTE_ILOG("Waiting to be canceled.\n");
17             usleep(100);
18         }
19         assert(false); // This will never fire.
20     });
21
22     // Launch t.
23     t->launch();
24
25     // Wait for 200 microseconds.
26     usleep(200);
27
28     // Cancel task. Returns immediately.
29     t->cancel();
30
31     try
32     {
33         // Wait for the task.
34         t->wait_for();
35     }
36     catch (const hetcompute::canceled_exception& e)
37     {
38         std::cout << e.what() << " thrown" << std::endl;
39     }
40     catch (...)
41     {
42         // Never reached.
43     }
44
45     hetcompute::runtime::shutdown();
46     return 0;
47 }

In the example above, task t will never finish unless it is canceled. Task t is launched in line 23. After
launching the task, the main thread sleeps for 200 microseconds in line 26 to give t time to be scheduled and
print its messages. In line 29, HetCompute is asked to cancel the task, which should be running by now. The
method t->cancel() returns immediately after it marks the task as "pending for cancellation". This
means that t might still be executing after t->cancel() returns. That is why t->wait_for() is
called in line 34, to make sure that t has finished executing.

Note

A task does not know whether someone has requested its cancellation unless it calls
hetcompute::abort_on_cancel() during its execution.
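
The consequence of this note is that a cancellation request takes effect only at the next check, so the amount of work between checks bounds how quickly a task stops. The sketch below illustrates this in standard C++ only. It is a single-threaded model in which the external cancel request is simulated from inside the loop; toy_canceled, toy_abort_on_cancel, and work_done_before_abort are hypothetical names, not HetCompute API.

```cpp
#include <atomic>
#include <cassert>
#include <exception>

// Thrown by the check below; plays the role of the exception that a
// cancellation point throws when cancellation has been requested.
struct toy_canceled : std::exception
{
    const char* what() const noexcept override { return "task canceled"; }
};

// Cooperative cancellation point: throws only if a request is pending.
inline void toy_abort_on_cancel(const std::atomic<bool>& cancel_requested)
{
    if (cancel_requested.load())
    {
        throw toy_canceled();
    }
}

// Task body that performs steps_per_check units of work between checks.
// The cancel request arrives (simulated) once cancel_at units are done.
// Returns how many units were done before the task actually stopped.
inline int work_done_before_abort(int steps_per_check, int cancel_at)
{
    assert(steps_per_check > 0 && cancel_at >= 0);
    // A request at 0 units means the task was canceled before it started.
    std::atomic<bool> cancel_requested(cancel_at == 0);
    int done = 0;
    try
    {
        while (true)
        {
            toy_abort_on_cancel(cancel_requested);
            for (int i = 0; i < steps_per_check; ++i)
            {
                ++done;
                if (done == cancel_at)
                {
                    cancel_requested.store(true); // simulated cancel()
                }
            }
        }
    }
    catch (const toy_canceled&)
    {
        // In HetCompute the runtime catches the exception; here we stop.
    }
    return done;
}
```

With one unit of work per check, a request after three units stops the task after three units. With ten units per check, the same request is noticed only after ten units, which is why long-running task bodies benefit from checking often.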

The method hetcompute::abort_on_cancel() never returns if the task has indeed been canceled
because it throws an exception that the Qualcomm HetCompute runtime catches. For this reason, it is
recommended that you use Resource Acquisition Is Initialization (RAII) to allocate
and deallocate the resources used inside a task. If using RAII in your code is not an option, surround
hetcompute::abort_on_cancel() with try - catch, and call throw from within the catch
block after the cleanup code:
1 #include <cassert>
2 #include <iostream>
3 #include <unistd.h>
4
5 #include <hetcompute/hetcompute.hh>
6
7 int
8 main()
9 {
10     hetcompute::runtime::init();
11
12     auto t = hetcompute::create_task([] {
13         while (1)
14         {
15             try
16             {
17                 hetcompute::abort_on_cancel();
18             }
19             catch (const hetcompute::abort_task_exception&)
20             {
21                 //..do cleanup
22                 throw;
23             }
24             catch (...)
25             {
26                 //..do cleanup
27                 throw;
28             }
29             // HETCOMPUTE_ILOG("Waiting to be canceled.\n");
30             usleep(10);
31         }
32         assert(false); // This will never fire
33     });
34
35     // Launch t
36     t->launch();
37
38     // Wait for 20 microseconds.
39     usleep(20);
40
41     // Cancel task. Returns immediately.
42     t->cancel();
43
44     try
45     {
46         // Wait for the task to complete.
47         t->wait_for();
48     }
49     catch (const hetcompute::canceled_exception& e)
50     {
51         std::cout << e.what() << " thrown" << std::endl;
52     }
53     catch (...)
54     {
55         // Never reached
56     }
57
58     hetcompute::runtime::shutdown();
59     return 0;
60 }

Warning

If throw is replaced with return in lines 22 and 27 of the previous example, the exception would not
propagate to the runtime, HetCompute would not consider the task as canceled, and, therefore, its
successors (if any) would not be canceled.
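
The RAII alternative recommended above needs no try/catch inside the task body, because destructors run while the cancellation exception unwinds the stack. The following minimal sketch shows this in standard C++ only; toy_resource, toy_cancellation_point, and run_canceled_task are hypothetical names, and a std::runtime_error stands in for the exception that a real cancellation point would throw.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// RAII guard: records acquisition in the constructor and release in the
// destructor, so release happens on every exit path, including exceptions.
class toy_resource
{
public:
    toy_resource(std::vector<std::string>& log, std::string name)
        : log_(log), name_(std::move(name))
    {
        assert(!name_.empty());
        log_.push_back("acquire " + name_);
    }
    ~toy_resource() { log_.push_back("release " + name_); }

    toy_resource(const toy_resource&) = delete;
    toy_resource& operator=(const toy_resource&) = delete;

private:
    std::vector<std::string>& log_;
    std::string name_;
};

// Stand-in for a cancellation point that fires.
inline void toy_cancellation_point()
{
    throw std::runtime_error("canceled");
}

// Runs a task body that holds a resource when cancellation fires and
// returns the resulting event log.
inline std::vector<std::string> run_canceled_task()
{
    std::vector<std::string> log;
    try
    {
        toy_resource buffer(log, "buffer");
        toy_cancellation_point(); // unwinds through buffer's destructor
        log.push_back("unreachable");
    }
    catch (const std::runtime_error&)
    {
        // Plays the role of the runtime catching the exception.
        log.push_back("runtime saw cancellation");
    }
    return log;
}
```

The log shows the release happening before the handler runs, and the exception still reaches the outer handler unchanged, so the cancellation is not swallowed.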

5.3.9.5.1.3 Canceling a Task After It Completes Execution

Canceling a task after it has completed execution has no effect on the task or on its successors. In the following
example, a dependency is set up between t1 and t2, and t1 is launched. In line 29, t1 is canceled,
but by then it has already finished execution (the main thread waits for it in line 25), so t1->cancel()
has no effect. Thus, nobody cancels t2: if lines 32 and 33 were uncommented, t2->wait_for() in line 33
would never return, because t2 would remain stuck in an infinite loop.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });
8
9 auto t2 = hetcompute::create_task([] {
10 while (1)
11 {
12 hetcompute::abort_on_cancel();
13 HETCOMPUTE_ILOG("Hello World from t2!\n");
14 usleep(100);
15 }
16 });
17
18 // Create dependencies.
19 t1->then(t2);
20
21 // Launch tasks.
22 t1->launch();
23
24 // Wait for t1 to complete.
25 t1->wait_for();
26
27 // Cancel t1.
28 // Because it has already completed, it does not propagate its cancellation.
29 t1->cancel();
30
31 // If the two lines below are uncommented the wait_for will never return.
32 // t2->launch();
33 // t2->wait_for();
34
35 hetcompute::runtime::shutdown();
36 return 0;
37 }

5.3.9.5.2 hetcompute::abort_task()

Running tasks call hetcompute::abort_task() to cancel themselves and their successors. Consider
the following example. Two tasks, t1 and t2, are created with a dependency between them. The
body of t1 is very simple: it prints a message ten times and then aborts. Both tasks are launched, and
line 40 waits for t1 to complete its execution. Because t1 calls hetcompute::abort_task(), it is
canceled and propagates its cancellation to its successor, t2.


1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9
10 auto t1 = hetcompute::create_task([] {
11 int i = 0;
12 while (true)
13 {
14 HETCOMPUTE_ILOG("Hello World %d\n", i);
15 sleep(1);
16 i++;
17 if (i == 10)
18 {
19 hetcompute::abort_task();
20 }
21 }
22 // This will never fire
23 assert(false);
24 });
25
26 auto t2 = hetcompute::create_task([] {
27 // This will never fire
28 assert(false);
29 });
30
31 t1 >> t2;
32
33 // Launch tasks
34 t1->launch();
35 t2->launch();
36
37 try
38 {
39 // Wait for t1 to complete.
40 t1->wait_for();
41 }
42 catch (const hetcompute::canceled_exception& e)
43 {
44 std::cout << e.what() << " thrown when syncing with t1" << std::endl;
45 }
46 catch (...)
47 {
48 // Never reached
49 }
50
51 try
52 {
53 // Returns immediately, t2 is canceled.
54 t2->wait_for();
55 }
56 catch (const hetcompute::canceled_exception& e)
57 {
58 std::cout << e.what() << " thrown when syncing with t2" << std::endl;
59 }
60 catch (...)
61 {
62 // Never reached
63 }
64
65 hetcompute::runtime::shutdown();
66 return 0;
67 }

5.3.9.5.3 Cancellation by Abandonment

When all the hetcompute::task_ptrs referencing an unlaunched task go out of scope, the task is
canceled and it propagates the cancellation to its successors. The reasoning is simple: a task t cannot
launch without a task pointer, and none of its successors will ever be able to execute because t never
executed.
1 #include <cassert>
2
3 #include <hetcompute/hetcompute.hh>
4
5 void foo();
6
7 void
8 foo()
9 {
10 auto t1 = hetcompute::create_task([] {
11 HETCOMPUTE_ILOG("Hello World from t1\n");
12 // This will never fire
13 assert(false);
14 });
15
16 auto t2 = hetcompute::create_task([] {
17 HETCOMPUTE_ILOG("Hello World from t2\n");
18 // This will never fire
19 assert(false);
20 });
21
22 auto t3 = hetcompute::create_task([] {
23 int i = 0;
24 while (i++ < 10)
25 {
26 HETCOMPUTE_ILOG("Hello World from t3\n");
27 sleep(1);
28 }
29 });
30
31 t1 >> t2;
32
33 t2->launch();
34 t3->launch();
35
36 // t1, t2, and t3 go out of scope
37 }
38
39 int
40 main()
41 {
42 hetcompute::runtime::init();
43 foo();
44 hetcompute::runtime::shutdown();
45 return 0;
46 }

In the snippet above, three tasks are created (t1, t2, and t3), with a dependency between the first two.
t2 and t3 are launched in lines 33 and 34. t2 cannot run because t1 has not yet executed. In line 37,
foo() returns and the three pointers go out of scope. t1 is canceled because it was never launched. t2 is
canceled because t1 propagates its cancellation. t3 is not canceled and keeps running even after foo()
returns.
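The abandonment rule resembles reference counting with a custom deleter. The following standard C++ sketch is only an analogy and makes no claim about how HetCompute implements hetcompute::task_ptr; task_model, make_task_model, and drop_all_pointers are hypothetical names. When the last std::shared_ptr to an unlaunched object is destroyed, the deleter records that the object was abandoned.

```cpp
#include <cassert>
#include <memory>

// Hypothetical model of a task's state (not a HetCompute type).
struct task_model
{
    bool launched = false;
};

// Creates a shared pointer whose deleter sets `abandoned` when the last
// reference disappears while the task was never launched.
std::shared_ptr<task_model> make_task_model(bool& abandoned)
{
    return std::shared_ptr<task_model>(new task_model, [&abandoned](task_model* t) {
        if (!t->launched)
        {
            abandoned = true; // nobody is left who could launch it
        }
        delete t;
    });
}

// Returns true if dropping every pointer abandoned the task.
bool drop_all_pointers(bool launch_first)
{
    bool abandoned = false;
    {
        auto p = make_task_model(abandoned);
        auto q = p; // second reference to the same task
        assert(p.use_count() == 2);
        if (launch_first)
        {
            q->launched = true;
        }
    } // p and q go out of scope here, so the deleter runs
    return abandoned;
}
```

In this model, drop_all_pointers(false) corresponds to t1 in the listing above (never launched, therefore abandoned), while drop_all_pointers(true) corresponds to t3.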

5.3.10 Blocking Tasks


Programmers can mark CPU kernels with attributes that help HetCompute make better scheduling
decisions. The current HetCompute release supports just one attribute (to indicate that the kernel may
execute blocking code), but there are plans to include others in future releases. The following example
shows how to create a CPU kernel object that is marked as blocking.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7
8 auto k = hetcompute::create_cpu_kernel([] {
9 // execute some blocking I/O request
10 });
11 // inform the HetCompute runtime that the kernel may block on some external event
12 k.set_blocking();
13
14 auto t = hetcompute::launch(k);
15 t->wait_for();
16
17 hetcompute::runtime::shutdown();
18 return 0;
19 }

5.3.10.1 Blocking Kernel

A blocking kernel (and a task created out of that kernel) consists of computation that depends on external
(non-HetCompute) synchronization to make guaranteed forward progress. Typically, the external
synchronization includes completing I/O requests, other OS syscalls with indefinite run-time, and
busy-waiting. It does not include waiting on HetCompute tasks or groups using
hetcompute::task<>::wait_for, hetcompute::task<ReturnType>::copy_value,
hetcompute::task<ReturnType>::move_value, or hetcompute::group::wait_for.
There are two problems with blocking tasks. The first is that once a blocking task executes, it will take over
a thread in a HetCompute thread pool, thus preventing other tasks from executing in the same thread.
Because a blocking task spends most of its time blocking on an event, essentially one of the threads in the
thread pool is wasted. When the programmer marks the task kernel as blocking, HetCompute ensures that
the thread pool does not wastefully dedicate a thread to the task.
The second problem has to do with cancellation. If a blocking task is canceled while it is blocked on an
external event, HetCompute needs to be able to unblock the task so that it can respond to the cancellation
signal (e.g., by calling hetcompute::abort_on_cancel()). As the code snippet below shows, there
is often a well-defined means to unblock:
// blocking call
{
    x = network_fetch(network_handle);
}

// means to unblock network_fetch above
{
    write_spurious(network_handle);
}

write_spurious(network_handle) can be called asynchronously while network_fetch is
executing to unblock the latter. HetCompute captures the above idea through the
hetcompute::blocking() construct to enable efficient cancellation of blocking tasks.
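The pairing of a blocking call with a means to unblock it can be mimicked with a condition variable in standard C++. This is a minimal sketch, not HetCompute code: cancellable_wait and its member names are hypothetical, wait_for_data() plays the role of network_fetch, and cancel() plays the role of write_spurious.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not part of HetCompute) that pairs a blocking wait
// with a cancel operation able to wake it up.
class cancellable_wait
{
public:
    // Blocking function: returns true if data arrived, false if the wait
    // was canceled before any data was published.
    bool wait_for_data()
    {
        std::unique_lock<std::mutex> lock(_mutex);
        _cv.wait(lock, [this] { return _ready || _canceled; });
        assert(_ready || _canceled);
        return _ready;
    }

    // Cancel function: wakes up a thread blocked in wait_for_data().
    void cancel()
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _canceled = true;
        }
        _cv.notify_all();
    }

    // Normal completion path: the awaited event occurred.
    void publish()
    {
        {
            std::lock_guard<std::mutex> lock(_mutex);
            _ready = true;
        }
        _cv.notify_all();
    }

private:
    std::mutex              _mutex;
    std::condition_variable _cv;
    bool                    _ready    = false;
    bool                    _canceled = false;
};
```

In a real program one thread would sit in wait_for_data() while another calls cancel(). Because both flags are checked under the mutex, the wait returns correctly whether cancel() runs before or after the wait begins.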

5.3.10.2 hetcompute::blocking

hetcompute::blocking() enables a task to enter and exit multiple blocking sections of code, and
provides means for the programmer to precisely and efficiently specify how to unblock each blocking
section. The hetcompute::blocking() construct takes two programmer-supplied arguments:

• blocking function: a C++ function, functor, or lambda, that contains the blocking code to execute, and
• cancel function: a C++ function, functor, or lambda, that contains the code to unblock the task if it is
blocked, so that the task may respond to cancellation.
In response to a task being canceled asynchronously, e.g., via hetcompute::task<>::cancel(),
the cancel function is executed exactly once, and only if the task is running.
In the following example, a CPU kernel that executes blocking code is created. The blocking statement is
wrapped in hetcompute::blocking() and specified using two lambdas, one of which is
cancel_function. The kernel is marked as blocking (line 31). After launching task t and sleeping for
a second, t is canceled (line 39). Most likely, by the time t is canceled, it is waiting on the condition
variable (line 26). The cancel function wakes up the task body so that it can abort (line 24).
1 #include <condition_variable>
2 #include <hetcompute/hetcompute.hh>
3
4 int
5 main()
6 {
7 hetcompute::runtime::init();
8
9 static std::mutex mutex;
10 static std::condition_variable cv;
11
12 // create CPU kernel and mark it as blocking
13 auto k = hetcompute::create_cpu_kernel([] {
14 auto cancel_function = [] {
15 HETCOMPUTE_ILOG("CANCEL blocking task");
16 std::lock_guard<std::mutex> lock(mutex);
17 cv.notify_all();
18 };
19
20 HETCOMPUTE_ILOG("START blocking task");
21 std::unique_lock<std::mutex> lock(mutex);
22 for (;;)
23 {
24 hetcompute::abort_on_cancel();
25 // enter hetcompute::blocking construct
26 hetcompute::blocking([&lock] { cv.wait(lock); }, // blocking function
27 cancel_function); // cancel function
28 }
29 HETCOMPUTE_ILOG("STOP blocking task");
30 });
31 k.set_blocking();
32
33 auto t = hetcompute::launch(k);
34
35 // wait for task to block
36 sleep(1);
37
38 // cancel task; it will call t’s cancel function
39 t->cancel();
40
41 try
42 {
43 // wait for t to finish
44 t->wait_for();
45 }
46 catch (const hetcompute::canceled_exception& e)
47 {
48 HETCOMPUTE_ILOG("task threw %s", e.what());
49 }
50 catch (...)
51 {
52 // Never reached
53 }
54
55 hetcompute::runtime::shutdown();
56 return 0;
57 }

Note

If a task receives a cancellation request prior to entering a hetcompute::blocking() section,
the task throws hetcompute::canceled_exception immediately upon entering the section. Neither the
blocking function nor the cancel function is executed.

5.3.11 Algebraic Operations on Tasks


As a programmatic convenience, HetCompute provides operator overloads on tasks with return values. A
task represents a value that will eventually materialize. The overloaded operators provide a means to
express computation using such eventual values. In the example below, greet creates two tasks and
returns their sum (through operator +). HetCompute internally creates a task that depends on tasks a and b,
and launches it.
1 #include <hetcompute/hetcompute.hh>
2
3 hetcompute::task_ptr<std::string> greet();
4
5 hetcompute::task_ptr<std::string>
6 greet()
7 {
8 auto a = hetcompute::launch([] { return std::string("hello"); });
9 auto b = hetcompute::launch([] { return std::string(" world"); });
10 return a + b; // returns a launched task which is
11 // data-dependent on tasks a and b
12 }
13
14 int
15 main()
16 {
17 hetcompute::runtime::init();
18 auto t = greet();
19 std::cout << t->move_value() << std::endl;
20 hetcompute::runtime::shutdown();
21 }

Later, at runtime, when tasks a and b finish, their return values are propagated to task a + b, which then
concatenates the two strings through the + operator overloaded on the std::string datatype. The
concatenated string is then retrieved through t->move_value in main.
HetCompute supports the following non-blocking algebraic operations on the return values of collapsed
tasks:

• Unary arithmetic and bitwise operations: +, -, ~. The return value of the task (pointed to by
hetcompute::task_ptr) can be any type (built-in or user-defined) as long as the operation is
applicable to it (for user-defined types, the corresponding operator needs to be defined).
• Binary arithmetic and bitwise operations: +, -, *, /, %, &, |, ^, which can take the
following three combinations of operands:
– operand1: hetcompute::task_ptr operand2: hetcompute::task_ptr
– operand1: hetcompute::task_ptr operand2: value
– operand1: value operand2: hetcompute::task_ptr
The value operand and the return value of the task (pointed to by hetcompute::task_ptr)
can be any type (built-in or user-defined) as long as the operation is applicable between them
(for user-defined types, the corresponding operator needs to be defined). The type of the return value
of the new task is the promoted type of the two operand types.
• Binary compound arithmetic and bitwise assignment operations: +=, -=, *=, /=, %=, &=,
|=, ^=, which can take the following two combinations of operands:
– operand1: hetcompute::task_ptr operand2: hetcompute::task_ptr
– operand1: hetcompute::task_ptr operand2: value
The value operand and the return value of the task (pointed to by
hetcompute::task_ptr) can be any type (built-in or user-defined) as long as the operation is
applicable between them (for user-defined types, the corresponding operator needs to be defined).
Also, the result is cast to the type of the return value of the original task so that
the new task can be pointed to by the original hetcompute::task_ptr.
The result of these operations is a newly launched task whose return value is the result of the corresponding
operations. If any of the operands are tasks, the resulting task will be data dependent (see Task
Dependencies) on the operand tasks, and be launched when the operand values are ready.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto a = hetcompute::launch([] { return 11; });
8 auto b = hetcompute::launch([] { return 4; });
9 auto c = hetcompute::launch([] { return 24; });
10 auto e = (a * b) - (c / 12); // e is a task that will eventually compute
11 HETCOMPUTE_ILOG("The answer to life, the universe and everything = %d", e->copy_value()); // copy_value waits for e to compute
12 hetcompute::runtime::shutdown();
13 }

In the example above, e is an expression tree composed of values, tasks, and operations on tasks. The
expression tree will be evaluated when tasks in each sub-expression finish and return their values.
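The idea of operators that build an expression tree over eventual values, including the promoted result type, can be sketched in standard C++. This is a simplified, single-threaded model under stated assumptions, not the HetCompute implementation: eventual is a hypothetical stand-in for a task with a return value, get() evaluates the tree on demand instead of waiting for launched tasks, and only the three operators needed for the expression above are defined.

```cpp
#include <cassert>
#include <functional>
#include <type_traits>
#include <utility>

// Hypothetical stand-in for a task that eventually yields a T.
template <typename T>
class eventual
{
public:
    explicit eventual(std::function<T()> compute) : _compute(std::move(compute)) {}

    // Evaluates this node and, recursively, the nodes it depends on.
    T get() const
    {
        assert(_compute);
        return _compute();
    }

private:
    std::function<T()> _compute;
};

// eventual * eventual: the result type is the promoted type of A * B.
template <typename A, typename B>
auto operator*(const eventual<A>& a, const eventual<B>& b)
    -> eventual<decltype(std::declval<A>() * std::declval<B>())>
{
    using R = decltype(std::declval<A>() * std::declval<B>());
    return eventual<R>([a, b]() -> R { return a.get() * b.get(); });
}

// eventual - eventual
template <typename A, typename B>
auto operator-(const eventual<A>& a, const eventual<B>& b)
    -> eventual<decltype(std::declval<A>() - std::declval<B>())>
{
    using R = decltype(std::declval<A>() - std::declval<B>());
    return eventual<R>([a, b]() -> R { return a.get() - b.get(); });
}

// eventual / plain value
template <typename A, typename B>
auto operator/(const eventual<A>& a, const B& b)
    -> eventual<decltype(std::declval<A>() / std::declval<B>())>
{
    using R = decltype(std::declval<A>() / std::declval<B>());
    return eventual<R>([a, b]() -> R { return a.get() / b; });
}
```

With three eventual<int> objects yielding 11, 4, and 24, the expression (a * b) - (c / 12) builds a new eventual<int> whose get() combines the three values, mirroring how e depends on a, b, and c in the listing.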
Note

Algebraic operators return launched tasks; do not attempt to re-launch them.

Use the algebraic operators when it is known that the operation is computationally expensive.
The following operators are not supported in the current version:
• Comparison and logical operators: ==, !=, >, <, >=, <=, !, &&, ||
• Bitwise shift operators: <<, >>, <<=, >>=
• Increment and decrement arithmetic operators: ++, --
• Other operators that are not meaningful for task_ptr: [], *, ->*, (), comma, ""_, type, etc.

5.3.12 Task-Pointer Collapsing


As shown in the example below, a task may return another task. Recall that a task’s type is determined by
its return value. The type of the return value determines the tasks to which the returned data may flow
through data dependencies. In the example, task t has type task_ptr<int> (actually
task_ptr<int(void)>, but the void argument is irrelevant for this discussion). By the same
reasoning, task t1 should have type task_ptr<task_ptr<int>>. However, if a task is viewed as a
construct that eventually computes a value, and therefore is merely a placeholder for that value, then
task_ptr<int> represents an int that will eventually materialize and therefore
task_ptr<task_ptr<int>> is really just task_ptr<int>. The degeneration of
task_ptr<task_ptr<int>> to task_ptr<int> is called return type collapsing. When a task is
created/launched, HetCompute performs return type collapsing by default. Consequently, in the example,
task t1 can be bound as a data dependency to task t2 which expects an int as its argument.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto t1 = hetcompute::launch([] {
8 auto t = hetcompute::launch([] { return 42; });
9 return t;
10 });
11 auto t2 = hetcompute::launch([](int i) { std::cout << i << std::endl; }, t1);
12 t2->wait_for();
13 hetcompute::runtime::shutdown();
14 return 0;
15 }
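Return type collapsing can be pictured as a type-level rule that strips nested wrappers. The sketch below expresses that rule with a small template metafunction in standard C++. It is an illustration under stated assumptions: eventual_value is a hypothetical stand-in for task_ptr, and collapse and collapsed_task are invented names, not HetCompute types.

```cpp
#include <type_traits>

// Hypothetical stand-in for task_ptr<T>: a placeholder for a T that will
// eventually materialize.
template <typename T>
struct eventual_value
{
};

// collapse<T>::type strips every layer of eventual_value from T.
template <typename T>
struct collapse
{
    using type = T;
};

template <typename T>
struct collapse<eventual_value<T>>
{
    using type = typename collapse<T>::type;
};

// Type of a task whose body returns T, after collapsing.
template <typename T>
using collapsed_task = eventual_value<typename collapse<T>::type>;

// A body returning int yields eventual_value<int>.
static_assert(std::is_same<collapsed_task<int>, eventual_value<int>>::value,
              "plain return types are left alone");

// A body returning eventual_value<int> also yields eventual_value<int>.
static_assert(std::is_same<collapsed_task<eventual_value<int>>, eventual_value<int>>::value,
              "nested wrappers collapse to one layer");
```

The second static_assert corresponds to task t1 in the listing above: its body returns a task pointer, yet its collapsed type is the same as that of a task returning int.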

There may be certain situations in which the intent is to pass a task_ptr through dataflow. For instance, one
task may create a task t and some other task may launch task t after all of task t’s data is available. For such
situations, use hetcompute::do_not_collapse to indicate that return type collapsing should not be
performed.
1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 auto t1 = hetcompute::launch(hetcompute::do_not_collapse, [] {
8 auto t = hetcompute::create_task([] { return 42; });
9 return t;
10 });
11 auto t2 = hetcompute::launch(
12 [](hetcompute::task_ptr<int> t) {
13 t->launch();
14 return t;
15 },
16 t1);
17 std::cout << t2->copy_value() << std::endl;
18 hetcompute::runtime::shutdown();
19 return 0;
20 }

In the above example, task t1 creates task t and passes it as a data dependency to task t2, which accepts a
task_ptr<int> as its argument. hetcompute::do_not_collapse ensures that the return type of
task t1 is task_ptr<task_ptr<int>> and that task_ptr t flows to task t2. Task t2 in turn
launches task t and returns it. Notice that because task t2 is not created with
hetcompute::do_not_collapse, t2->copy_value() accesses the int eventually computed
by task t.

5.3.13 Unleashing Asynchrony


Applications should be parallelized or made asynchronous in a modular fashion, with the different modules
being composable with each other both programmatically and in terms of performance gains from the
independently parallelized modules. Using HetCompute, functions in an application may be split into tasks,
with tasks specified as related to each other through dependencies. Dependencies are typically specified
statically using hetcompute::task<>::then (control dependency) or
hetcompute::task<ReturnType(Args...)>::bind_all (data dependency), or dynamically
using the blocking function call hetcompute::task<>::wait_for. For best performance using
HetCompute, dependencies among tasks should be statically specified ahead of execution time using
hetcompute::task<>::then or hetcompute::task<ReturnType(Args...)>::bind_all,
while avoiding dynamic discovery of dependencies during execution through use of
hetcompute::task<>::wait_for as much as possible. However, as the following example shows,
achieving modular and composable parallelism without blocking is hard using just the above APIs.
The application pseudocode below calls compose_webpages to build a composite display of multiple
webpages. compose_webpages calls display_webpage to display each webpage;
display_webpage in turn fetches data to be displayed and the styling of the data, both of which are
used to render the webpage on the composite display.
void display_webpage(string url) {
    fetch(url, "data");
    fetch(url, "style");
    render();
}

void compose_webpages(vector<string> urls) {
    for (auto url : urls) {
        display_webpage(url);
    }
}

compose_webpages(urls);

Assume that individual webpages can be rendered independently of each other, and that the fetching of data
and its styling can also be done in parallel. This informs the following parallel implementation of the
composite display application. The for loop in the compose_webpages function is executed in
parallel. The code to fetch data and style is launched as tasks that can execute in parallel, while the
render function is launched as a task scheduled to execute after the fetchdata and fetchstyle
tasks have finished.

In HetCompute, tasks are not related to each other except through task dependencies specified through
hetcompute::task<>::then or hetcompute::task<ReturnType(Args...)>::bind_all.
A notable consequence of this is that although the display_webpage task created in the for
loop in compose_webpages creates and launches three more tasks, the latter tasks are in no way related
to the display_webpage task that created them. Therefore, the display_webpage function must
explicitly wait_for the render task to finish before it returns, so that the display_webpage task
that created them finishes only when all tasks created and launched by it have finished. Similarly, the
compose_webpages function also waits for all tasks it creates to finish before it returns. While such a
parallelization is desirable because each function was locally parallelized in a modular fashion, the use of
blocking hetcompute::task<>::wait_for to enforce the synchronous function call interface can
significantly hinder performance when there are many outstanding
hetcompute::task<>::wait_fors in the application.

5.3.13.1 finish_after

To enable modular and composable parallelization in a high-performance non-blocking style, HetCompute
provides a unique method called hetcompute::task<>::finish_after (and
hetcompute::group::finish_after).
A task t1 can register itself as finishing after another task t2 by calling t2->finish_after(). By
doing so, the HetCompute runtime system is informed that although task t1 may have completed
execution, the task will logically finish only after task t2 has finished. Therefore, just as the use of
t2->wait_for() extends the lifetime of task t1 to encapsulate that of task t2,
t2->finish_after() achieves the same but without blocking the thread executing task t1.
extern task_ptr<> t2;
void foo() {
...
t2->finish_after();
...
}

int main() {
auto t1 = hetcompute::create_task(foo);
...
}

A task can register itself as finishing after a group g by calling
hetcompute::group::finish_after on group g. Note that a task can register itself as finishing
after any number of tasks and groups.
extern task_ptr<> t2;
extern group_ptr g;
void foo() {
...
t2->finish_after();
g->finish_after();
...
}

int main() {
auto t1 = hetcompute::create_task(foo);
...
}

Both function calls are non-blocking, lightweight, and return immediately. Note that the non-blocking
parallelization using finish_after below is nearly identical to the blocking parallelization, with the
sole difference being the use of finish_after in place of wait_for. Furthermore, note that the
display_webpage and compose_webpages functions now return early, before any of the tasks they
launched finish. Consequently, these functions have become asynchronous and must be invoked from
within tasks. Therefore, in main(), compose_webpages is called from within task t rather than as a
synchronous function call.
1 #include <string>
2
3 #include <hetcompute/hetcompute.hh>
4
5 void display_webpage(char*);
6 void compose_webpages(int num_urls, char* urls[]);
7
8 void
9 display_webpage(char* url)
10 {
11 auto fetchdata = hetcompute::create_task([=] {
12 /*fetch(url, "fetchdata");*/
13 return std::string(url) + " data";
14 });
15 auto fetchstyle = hetcompute::create_task([=] {
16 /*fetch(url, "fetchstyle");*/
17 return std::string(url) + " style";
18 });
19 auto render = hetcompute::create_task([](std::string data, std::string style) {
20 /*render();*/
21 std::cout << data + " " + style << std::endl;
22 });
23 // Render task may start executing only after data and style have been
24 // fetched
25 render->bind_all(fetchdata, fetchstyle);
26 fetchdata->launch();
27 fetchstyle->launch();
28 render->launch();
29 // Mark display_webpage as logically finishing after the render task finishes
30 render->finish_after();
31 // Return from function call even before any of the fetchdata, fetchstyle, or render
32 // tasks finish. Such an early return makes the function asynchronous.
33 }
34
35 void
36 compose_webpages(int num_urls, char* urls[])
37 {
38 auto g = hetcompute::create_group();
39 for (int i = 1; i < num_urls; i++)
40 {
41 g->launch([=] { display_webpage(urls[i]); });
42 }
43 // Mark compose_webpages as logically finishing after all webpages have been
44 // composed and displayed
45 g->finish_after();
46 // Return from function call before any of the tasks finish
47 }
48
51 int
52 main(int argc, char* argv[])
53 {
54 hetcompute::runtime::init();
55
56 // Launch compose_webpages as a task since it is an asynchronous function
57 // call
58 auto t = hetcompute::launch([=, &argv] { compose_webpages(argc, argv); });
59 // Waits for the composite display to be rendered!
60 t->wait_for();
61
62 hetcompute::runtime::shutdown();
63 return 0;
64 }

The single, global, hetcompute::task<>::wait_for in main ensures that the composite display
is correctly rendered before the application terminates. Note that there are no other blocking calls necessary
to specify all the parallelism in the application.
hetcompute::task<>::finish_after or hetcompute::group::finish_after can be
invoked only from within a task. Note that a task can register itself as finishing after an arbitrary number of
other tasks and groups. A task can register itself as finishing after tasks or groups it did not create or launch.
finish_after is a means to semantically relate tasks to each other, e.g., in a parent-child relationship.
In the parallel mergesort example below, every node in the mergesort tree is specified to finish after
the merge step corresponding to that node finishes (line 42).
1 #include <algorithm>
2 #include <functional>
3 #include <iostream>
4 #include <iterator>
5 #include <sstream>
6 #include <vector>
7
8 #include <hetcompute/hetcompute.hh>
9
10 // Parallel mergesort using recursive fork-join parallelism.
11 // hetcompute::task<>::finish_after allows easy expression of the parallelism in the
12 // algorithm in a non-blocking manner, yielding better performance than
13 // blocking parallelization using hetcompute::task<>::wait_for.
14
16 const size_t GRANULARITY = 8192;
17
18 // Asynchronous mergesort, to be invoked in a task
19 template <typename Iterator, typename Compare>
20 void
21 mergesort(Iterator begin, Iterator end, Compare cmp)
22 {
23 size_t n = std::distance(begin, end);
24 if (n <= GRANULARITY)
25 {
26 std::sort(begin, end, cmp);
27 }
28 else
29 {
30 auto middle = begin;
31 std::advance(middle, n / 2);
32 auto left = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
33 auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
34 auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle, end, cmp); });
35 // The left subtree and right subtree tasks must finish before the merge
36 // task can execute
37 left->then(merge);
38 right->then(merge);
39 merge->launch();
40 // mergesort(begin, end, cmp) logically finishes after the merge task
41 // finishes
42 merge->finish_after();
43 }
44 }
45
46 int
47 main(int argc, const char* argv[])
48 {
49 hetcompute::runtime::init();
50 std::vector<long> input;
51 size_t n_def = 1 << 16;
52 size_t n = n_def;
53
54 if (argc >= 2)
55 {
56 std::istringstream istr(argv[1]);
57 istr >> n;
58 }
59
60 // Create a random array of integers
61 for (size_t i = 0; i < n; i++)
62 {
63 input.push_back(rand());
64 }
65
66 // Launch mergesort inside a task since it has an asynchronous interface (due
67 // to use of hetcompute::task::finish_after)
68 auto t = hetcompute::launch([&] { mergesort(input.begin(), input.end(), std::less<long>()); });
69 t->wait_for();
70
71 if (!std::is_sorted(input.begin(), input.end()))
72 {
73 std::cerr << "parallel mergesorting failed\n";
74 }
75
76 hetcompute::runtime::shutdown();
77 return 0;
78 }
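For comparison, the same divide, sort, and merge structure can be written sequentially in standard C++, which may serve as a reference when validating the parallel version. This is a sketch under stated assumptions: mergesort_seq and sorted_copy are hypothetical names, the granularity is passed as a parameter rather than taken from a global constant, and no HetCompute calls are involved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <iterator>
#include <vector>

// Sequential counterpart of the parallel mergesort: ranges no larger than
// `granularity` are sorted directly; larger ranges are split, sorted
// recursively, and merged in place.
template <typename Iterator, typename Compare>
void mergesort_seq(Iterator begin, Iterator end, Compare cmp, std::size_t granularity)
{
    const std::size_t n = static_cast<std::size_t>(std::distance(begin, end));
    if (n < 2 || n <= granularity)
    {
        std::sort(begin, end, cmp);
        return;
    }
    auto middle = begin;
    std::advance(middle, n / 2);
    mergesort_seq(begin, middle, cmp, granularity); // left subtree
    mergesort_seq(middle, end, cmp, granularity);   // right subtree
    std::inplace_merge(begin, middle, end, cmp);    // merge step
    assert(std::is_sorted(begin, end, cmp));
}

// Usage: returns a sorted copy of the input, using a small granularity so
// that the recursive path is exercised.
std::vector<int> sorted_copy(std::vector<int> values)
{
    mergesort_seq(values.begin(), values.end(), std::less<int>(), 4);
    return values;
}
```

The n < 2 check keeps the recursion finite even when a granularity of zero is passed. In the parallel listing the two recursive calls become launched tasks and the merge step becomes a task that both of them precede.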

5.3.13.2 Asynchronous APIs

As stated previously, calling finish_after in a function implicitly makes the function asynchronous. In
some cases, e.g., when the function is set to finish_after a single task, the asynchronous nature of the
function may be made explicit by modifying it to return that task, instead of calling finish_after. This
results in a lightweight, asynchronous, non-blocking API. For illustration, first consider the synchronous
sequential implementation below:
// synchronous sequential implementation
size_t
fibonacci_seq(size_t n)
{
if (n < 2)
{
return n;
}
else
{
auto a = fibonacci_seq(n - 1);
auto b = fibonacci_seq(n - 2);
return a + b;
}
}

This is converted into the following fully asynchronous implementation in three steps:

1. Convert sequential function calls into task launches (hetcompute::launch(...))
2. Change the return type of the function to hetcompute::task_ptr<size_t>
3. Convert integer n into a value task using hetcompute::create_value_task<size_t>
// fully asynchronous non-blocking implementation
hetcompute::task_ptr<size_t>
fibonacci(size_t n)
{
if (n < 2)
{
return hetcompute::create_value_task<size_t>(n);
}
else
{
// task_ptr collapsing
// typeof(a) is task_ptr<size_t>, not task_ptr<task_ptr<size_t>>
auto a = hetcompute::launch(fibonacci, n - 1);
auto b = hetcompute::launch(fibonacci, n - 2);
return a + b; // task algebra
}
}

The snippet illustrates how Task-Pointer Collapsing and Algebraic Operations on Tasks are
combined to enable the asynchronous fibonacci API. Note the close correspondence with the
synchronous sequential implementation shown above. The HetCompute API enables the programmer to
express the concurrency in the algorithm with minimal changes to the sequential code.
As a performance optimization, the programmer may coarsen the size of tasks so that small Fibonacci terms
are computed sequentially while large ones are computed in parallel. Below is the full example with both
the optimized and unoptimized versions.
1 #include <hetcompute/hetcompute.hh>
2
3 // synchronous sequential Fibonacci API
4 size_t fibonacci_seq(size_t n);
5
6 // asynchronous Fibonacci API
7 hetcompute::task_ptr<size_t> fibonacci(size_t n);
8 hetcompute::task_ptr<size_t> fibonacci_opti(size_t n);
9
11 // synchronous sequential implementation
12 size_t
13 fibonacci_seq(size_t n)
14 {
15 if (n < 2)
16 {
17 return n;
18 }
19 else
20 {
21 auto a = fibonacci_seq(n - 1);
22 auto b = fibonacci_seq(n - 2);
23 return a + b;
24 }
25 }
27
29 // fully asynchronous non-blocking implementation
30 hetcompute::task_ptr<size_t>
31 fibonacci(size_t n)
32 {
33 if (n < 2)
34 {
35 return hetcompute::create_value_task<size_t>(n);
36 }
37 else
38 {
39 // task_ptr collapsing
40 // typeof(a) is task_ptr<size_t>, not task_ptr<task_ptr<size_t>>
41 auto a = hetcompute::launch(fibonacci, n - 1);
42 auto b = hetcompute::launch(fibonacci, n - 2);
43 return a + b; // task algebra
44 }
45 }
47
49 // optimized asynchronous non-blocking version that dispatches to sequential
50 // implementation for small Fibonacci terms
51 const size_t GRANULARITY = 16;
52 hetcompute::task_ptr<size_t>
53 fibonacci_opti(size_t n)
54 {
55 if (n < GRANULARITY)
56 {
57 return hetcompute::create_value_task<size_t>(fibonacci_seq(n));
58 }
59 else
60 {
61 // task_ptr collapsing
62 // typeof(a) is task_ptr<size_t>, not task_ptr<task_ptr<size_t>>
63 auto a = hetcompute::launch(fibonacci_opti, n - 1);
64 auto b = hetcompute::launch(fibonacci_opti, n - 2);
65 return a + b; // task algebra
66 }
67 }
69
70 // e.g., ./hetcompute_examples_async_fibonacci 30
71 int
72 main(int argc, char* argv[])
73 {
74 hetcompute::runtime::init();
75
76 size_t n_def = 20;
77 size_t n = n_def;
78
79 if (argc >= 2)
80 {
81 std::istringstream istr(argv[1]);
82 istr >> n;
83 }
84

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 120
Qualcomm® Snapdragon™ Heterogeneous Compute SDK User Guide

85 // fibonacci_opti is typically much faster than fibonacci


86 // std::cout << "Fibonacci term " << n << " is " << fibonacci(n)->copy_value() << std::endl;
87 std::cout << "Fibonacci term " << n << " is " << fibonacci_opti(n)->copy_value() << std::endl;
88
89 hetcompute::runtime::shutdown();
90 return 0;
91 }
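The granularity cutoff can also be tried without the SDK. The following is a minimal sketch in standard C++, not HetCompute code: std::async stands in for hetcompute::launch, the names fib_seq and fib_par and the default cutoff of 16 are illustrative assumptions, and unlike the task_ptr version above it blocks on future::get().

```cpp
#include <cstddef>
#include <future>

// Sequential reference implementation.
std::size_t fib_seq(std::size_t n)
{
    return n < 2 ? n : fib_seq(n - 1) + fib_seq(n - 2);
}

// Parallel version: terms below the cutoff are computed sequentially,
// larger terms spawn one asynchronous branch and compute the other inline.
// The cutoff value is a hypothetical tuning parameter.
std::size_t fib_par(std::size_t n, std::size_t cutoff = 16)
{
    if (n < cutoff || n < 2)
    {
        return fib_seq(n);
    }
    auto a = std::async(std::launch::async, fib_par, n - 1, cutoff);
    std::size_t b = fib_par(n - 2, cutoff);
    return a.get() + b; // blocks here; the HetCompute version does not
}
```

Calling fib_par(20) creates only a handful of asynchronous branches, because every subproblem below term 16 runs sequentially. A lower cutoff creates more parallel work items, each of which does less useful work.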

5.3.13.3 Cancellation

As discussed in Cancellation, a task or group may be canceled. Consequently, any task registered to
finish_after the canceled task or group may be subject to cancellation.
#include <hetcompute/hetcompute.hh>

void foo();
void bar();

void
foo()
{
    auto tl = hetcompute::launch([] {
        while (true)
        {
            hetcompute::abort_on_cancel();
            // do something
        }
    });
    tl->finish_after();
    // do something
    tl->cancel();
}

void
bar()
{
    // do something
}

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task(foo);
    auto t2 = hetcompute::create_task(bar);

    t1->then(t2);

    t1->launch();
    t2->launch();

    try
    {
        t2->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        HETCOMPUTE_ILOG("t2 was canceled");
    }
    catch (...)
    {
        // Never reached
    }
    hetcompute::runtime::shutdown();

    return 0;
}


Referring to the above example, task t1 launches task tl, registers itself as finishing after task tl, and
subsequently cancels task tl. As a result, task t1 is itself canceled by the HetCompute runtime system and
cancellation is propagated to successors of task t1, which in this case is only task t2.
Canceling a task does not result in cancellation of other tasks or groups it is registered to finish_after.
As the example below shows, canceling task t does not cancel tl that t is registered to finish_after.
#include <atomic>
#include <cassert>

#include <hetcompute/hetcompute.hh>

void foo();

hetcompute::task_ptr<> tl;
std::atomic<bool> tl_running(false);
std::atomic<bool> stop_tl(false);

void
foo()
{
    tl = hetcompute::launch([] {
        while (!stop_tl)
        {
            hetcompute::abort_on_cancel();
            tl_running = true;
            // do something
        }
    });
    tl->finish_after();
}

int
main()
{
    hetcompute::runtime::init();
    auto t = hetcompute::launch(foo);

    while (!tl_running)
    {
    }

    t->cancel(); // Does not cancel tl

    stop_tl = true;

    try
    {
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        // Will never reach here since t->cancel is issued only after task t starts
        // running and task t never acknowledges cancellation
    }
    catch (...)
    {
        // Never reached
    }

    assert(!tl->canceled());
    HETCOMPUTE_ILOG("tl was not canceled");

    tl.reset();

    hetcompute::runtime::shutdown();
    return 0;
}
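The direction of cancellation propagation in the two examples above can be summarized with a small model. This is a sketch under stated assumptions and not the HetCompute implementation: toy_task and its members are hypothetical names, and timing is ignored (a real running task is canceled only once it acknowledges the request, for example through abort_on_cancel).

```cpp
#include <vector>

// Toy task node that records who depends on it.
struct toy_task
{
    bool canceled = false;
    std::vector<toy_task*> successors; // tasks registered with then()
    std::vector<toy_task*> waiters;    // tasks that finish_after this task

    void then(toy_task& succ) { successors.push_back(&succ); }

    // Records that 'waiter' called finish_after on this task.
    void finish_after_by(toy_task& waiter) { waiters.push_back(&waiter); }

    // Cancellation flows to successors and to tasks finishing after this one.
    // It never flows back to the task that a waiter is waiting on.
    void cancel()
    {
        if (canceled)
        {
            return;
        }
        canceled = true;
        for (toy_task* s : successors)
        {
            s->cancel();
        }
        for (toy_task* w : waiters)
        {
            w->cancel();
        }
    }
};
```

In this model, canceling tl cancels t1 (which finishes after tl) and then t2 (a successor of t1), as in the first example. Canceling a task that merely finishes after tl leaves tl untouched, as in the second example.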


5.3.13.4 Summary

The non-blocking APIs discussed above, including hetcompute::task::finish_after,
hetcompute::group::finish_after,
hetcompute::task<ReturnType(Args...)>::bind_all, and the algebraic operators on
hetcompute::task_ptr, enable the design and composition of asynchronous applications in a
modular fashion. Furthermore, in applications with many outstanding wait_for calls, replacing
wait_for with finish_after can yield significant performance gains. Application developers
are encouraged to use these non-blocking APIs to implement highly responsive parallel applications.

5.4 Buffers
Basic Usage of Buffers
Using Buffers with Tasks
Synchronized and Concurrent Use
Creating Buffers
Performance and Storage Optimizations When Using Buffers
Memory Regions

5.4.1 Basic Usage of Buffers


The HetCompute buffers API provides the user with a runtime-managed heterogeneous data structure.
Tasks on the CPU, GPU, and DSP can share data using a HetCompute buffer. A HetCompute buffer is a
contiguous array of a user-defined data-type T. Each buffer can have one or more buffer pointers of type
hetcompute::buffer_ptr<T> or hetcompute::buffer_ptr<const T> pointing to it.
The buffer is ref-counted: the HetCompute runtime will automatically deallocate the buffer when there are
no more buffer pointers pointing to it. hetcompute::buffer_ptr<T> allows mutable access to the
buffer data, while hetcompute::buffer_ptr<const T> allows only immutable access (similar to
int* versus const int* access to an instance of int). The following code illustrates the most basic
API call for the creation of buffers of an intended number of elements.
hetcompute::buffer_ptr<float> b1 = hetcompute::create_buffer<float>(100);

hetcompute::buffer_ptr<const float> b2 = hetcompute::create_buffer<const float>(100);

hetcompute::buffer_ptr<const float> b3 = hetcompute::create_buffer<float>(100);
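The ref-counting and mutability rules resemble those of std::shared_ptr. The following sketch is an analogy in standard C++ only: the storage alias and the make_storage and sum functions are hypothetical names, and nothing here models device backing stores.

```cpp
#include <cstddef>
#include <memory>
#include <vector>

using storage = std::vector<float>;

// Analogue of create_buffer<float>(n): shared, reference-counted storage.
std::shared_ptr<storage> make_storage(std::size_t n)
{
    return std::make_shared<storage>(n, 0.0f);
}

// Analogue of a buffer_ptr<const float> parameter: only read access
// compiles through a pointer to const storage.
float sum(const std::shared_ptr<const storage>& p)
{
    float s = 0.0f;
    for (float v : *p)
    {
        s += v;
    }
    return s;
}
```

A pointer to mutable storage converts implicitly to a pointer to const storage, in the same way that b3 above is initialized from create_buffer<float>. The storage is freed when the last pointer of either kind goes away.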

The runtime transparently manages the movement of the buffer data between specialized device-specific
backing stores. For example, the runtime allocates ION memory as backing store for the optimal sharing of
buffer data between the CPU and DSP. Similarly, the runtime uses an OpenCL buffer as backing store to
synchronize the buffer data between the CPU and GPU. Additionally, the runtime tries to take advantage of
any available advance knowledge of which devices may access a buffer to optimize the allocation of backing
stores from specialized device memories and to minimize the copying of data between the backing stores.
Please also refer to Textures for a GPU-only data structure suitable for image data.
There are four entities that may access a buffer’s data.

1. A CPU task


2. A GPU task
3. A DSP task
4. CPU host code
Task access: A task may access a buffer by taking the corresponding buffer pointer as an argument. A task
may access the buffer as an input, an output, or input-output, referred to as the direction of access. Recall
that a task is created using a device-specific kernel (hetcompute::cpu_kernel,
hetcompute::gpu_kernel and hetcompute::dsp_kernel). The kernel’s signature may
explicitly declare the direction for each buffer pointer parameter or the direction may be implicitly inferred
based on the mutability of the buffer pointer (hetcompute::buffer_ptr<T> versus
hetcompute::buffer_ptr<const T>).
Note that a CPU task may be created directly with a lambda, functor, or function parameter without
involving a CPU kernel. For such a CPU task, the access directions of any buffer pointer arguments are
inferred implicitly from the mutability of the buffer pointer parameters to the lambda, functor, or function.
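The implicit inference rule can be expressed as a compile-time trait. The sketch below is illustrative and does not reproduce the SDK's internals: toy_buffer_ptr, direction, and inferred_direction are hypothetical names standing in for hetcompute::buffer_ptr and the runtime's own bookkeeping.

```cpp
#include <type_traits>

enum class direction { in, inout };

// Stand-in for hetcompute::buffer_ptr<T>; only the element type matters here.
template <typename T>
struct toy_buffer_ptr
{
};

template <typename P>
struct inferred_direction;

// A pointer to const elements is treated as input only,
// a pointer to mutable elements as input-output.
template <typename T>
struct inferred_direction<toy_buffer_ptr<T>>
{
    static constexpr direction value =
        std::is_const<T>::value ? direction::in : direction::inout;
};
```

With this trait, inferred_direction<toy_buffer_ptr<const int>>::value is direction::in, while inferred_direction<toy_buffer_ptr<int>>::value is direction::inout.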
Host code access: The application code on the CPU may directly access a buffer’s data using its buffer
pointer. A host code access refers to any access from the application code that is either not enclosed within
a task or uses a buffer pointer that is not a parameter to the enclosing task.
The following example illustrates the difference between task and host code access.
auto b1 = hetcompute::create_buffer<int>(3);
auto b2 = hetcompute::create_buffer<int>(3);

auto t = hetcompute::launch(
    [=](hetcompute::buffer_ptr<int> x) {
        // This is *task access* to b1's buffer
        // via task parameter x.
        for (size_t i = 0; i < x.size(); i++)
            x[i] = int(i);

        b2.acquire_wi();
        // This is *host code access* to b2's buffer.
        for (size_t i = 0; i < b2.size(); i++)
            b2[i] = 1000 + int(i);
        b2.release();
    },
    b1);
t->wait_for();

// This is host code access to b1's buffer.
b1.acquire_ro();
for (size_t i = 0; i < b1.size(); i++)
    printf("b1[%zu]=%d\n", i, b1[i]);
b1.release();

// This is host code access to b2's buffer.
b2.acquire_ro();
for (size_t i = 0; i < b2.size(); i++)
    printf("b2[%zu]=%d\n", i, b2[i]);
b2.release();

Please see Using Buffers with Tasks for more details.

5.4.2 Using Buffers with Tasks


The following sections elaborate on the use of buffers with tasks on each device type.


5.4.2.1 Buffers with CPU Tasks

The following example illustrates buffer access by a CPU task created directly from a user function. Note
that the access directions are implicitly inferred from the mutability of the corresponding buffer pointer
parameters.
void user_function(hetcompute::buffer_ptr<const int> x, // x is input only
                   hetcompute::buffer_ptr<int> y)       // y is input-output
{
for (size_t i = 0; i < x.size(); i++)
y[i] = x[i] * 2;
}

int
main()
{
hetcompute::runtime::init();
auto b1 = hetcompute::create_buffer<int>(10);
auto b2 = hetcompute::create_buffer<int>(10);

b1.acquire_wi();
for (size_t i = 0; i < b1.size(); i++)
b1[i] = int(i);
b1.release();

// launch a CPU task with a user-function:


// b1 is inferred as input
// b2 is inferred as input-output
auto t = hetcompute::launch(user_function, b1, b2);
t->wait_for();
// elements of b2 are now double the corresponding elements of b1

b2.acquire_ro();
for (size_t i = 0; i < b2.size(); i++)
HETCOMPUTE_ILOG("b2[%zu]=%d", i, b2[i]);
b2.release();

hetcompute::runtime::shutdown();
return 0;
}

The following example illustrates buffer access by a CPU task created using a CPU kernel.
void user_function(hetcompute::buffer_ptr<const int> x, // x is input only
                   hetcompute::buffer_ptr<int> y)       // y is input-output
{
for (size_t i = 0; i < x.size(); i++)
y[i] = x[i] * 2;
}

int
main()
{
hetcompute::runtime::init();

auto b1 = hetcompute::create_buffer<int>(10);
auto b2 = hetcompute::create_buffer<int>(10);

b1.acquire_wi();
for (size_t i = 0; i < b1.size(); i++)
b1[i] = int(i);
b1.release();

// The CPU kernel infers the access directions of the


// buffer parameters from the user function’s signature.
auto ck = hetcompute::create_cpu_kernel(user_function);

// create a CPU task with a cpukernel:


// b1 is inferred as input
// b2 is inferred as input-output
auto t = hetcompute::launch(ck, b1, b2);
t->wait_for();
// elements of b2 are now double the corresponding elements of b1

b2.acquire_ro();
for (size_t i = 0; i < b2.size(); i++)
HETCOMPUTE_ILOG("b2[%zu]=%d", i, b2[i]);
b2.release();

hetcompute::runtime::shutdown();
return 0;
}

In the examples above, the user function accessed the buffer data by indexing the buffer pointer as an array.
The host code accesses the buffer data in a similar manner. The host code and CPU tasks may also request a
pointer to manipulate the entire contents of the buffer, as shown below.
auto b = hetcompute::create_buffer<int>(100);

void* ptr = b.host_data();
size_t size_in_bytes = b.size() * sizeof(int);

// manipulate the byte range [ptr, ptr + size_in_bytes) directly.
// The semantics of accessing the ptr data are the same
// as host code access on the buffer_ptr b

Limitation: HetCompute v1.0 does not yet support explicit specification of access directions with CPU
kernels. The access directions can only be inferred implicitly.

5.4.2.2 Buffers with GPU Tasks

The following example illustrates creation of a GPU task with implicitly inferred access directions, similar
to the CPU task examples above.
// Create a string containing OpenCL C kernel code.
#define OCL_KERNEL(name, k) std::string const name##_string = #k

OCL_KERNEL(vdouble_kernel, __kernel void vdouble(__global float* A, __global float* B) {
    unsigned int i = get_global_id(0);
    // 2.0f keeps the multiplication in single precision
    B[i] = 2.0f * A[i];
});

int
main()
{
hetcompute::runtime::init();
auto buf_a = hetcompute::create_buffer<float>(3);
auto buf_b = hetcompute::create_buffer<float>(buf_a.size());

// Initialize the input


buf_a.acquire_wi();
for (size_t i = 0; i < buf_a.size(); ++i)
buf_a[i] = i;
buf_a.release();

// Create a kernel object
auto gpu_vdouble =
    hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<const float>, // inferred as in direction
                                  hetcompute::buffer_ptr<float>>       // inferred as inout direction
    (vdouble_kernel_string, "vdouble");

auto gpu_task = hetcompute::launch(gpu_vdouble,
                                   hetcompute::range<1>(buf_a.size()),
                                   buf_a,  // accessed as in
                                   buf_b); // accessed as inout

gpu_task->wait_for();

buf_b.acquire_ro();
for (size_t i = 0; i < buf_b.size(); i++)
HETCOMPUTE_ILOG("buf_b[%zu] = %f", i, buf_b[i]);
buf_b.release();

hetcompute::runtime::shutdown();
}

Access directions may be explicitly specified by wrapping the buffer pointer template parameters of the
kernel with hetcompute::in, hetcompute::out, and hetcompute::inout, as illustrated
below.
// Create a kernel object
auto gpu_vdouble =
    hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,  // explicit in direction
                                  hetcompute::out<hetcompute::buffer_ptr<float>>> // explicit out direction
    (vdouble_kernel_string, "vdouble");

auto gpu_task = hetcompute::launch(gpu_vdouble,
                                   hetcompute::range<1>(buf_a.size()),
                                   buf_a,  // accessed as in
                                   buf_b); // accessed as out

Note

A GPU kernel may be created using either an OpenCL C function or an OpenGL ES compute shader.
However, buffers interact with GPU kernels of either type in an identical manner. See GPU kernels for
OpenCL and OpenGL ES for an example program that uses a kernel created from an OpenGL ES
compute shader with buffers.

5.4.2.3 Buffers with DSP Tasks

Consider a DSP function with the following IDL signature. The IDL signature explicitly identifies the in
and out access directions for the parameters.
long array_is_prime(in sequence<long> numbers, rout sequence<long> primes);

HetCompute recognizes the in and out access directions coming from the IDL signature when a
hetcompute::dsp_kernel instance is created from the DSP function, as illustrated in the following
example.
// dsp kernel creation
auto hex_kernel = hetcompute::create_dsp_kernel<>(hetcompute_dsp_array_is_prime);

// create the DSP task that will be executed on the DSP
auto hex_task = hetcompute::create_task(hex_kernel,
in_buf, // in access recognized
out_buf); // out access recognized

5.4.3 Synchronized and Concurrent Use

5.4.3.1 Synchronized Access to Buffers Across Host Code and Tasks

When a task completes execution, HetCompute no longer automatically synchronizes the buffers
accessed by the task. The host code must explicitly synchronize to access the buffer data updated by the
task (see Host Access, Explicit Synchronization with Host Code).
HetCompute has deprecated the use of buffer_mode::synchronized and buffer_mode::relaxed
during buffer creation. Instead, all buffers follow the semantics of relaxed mode.


5.4.3.2 Concurrent Access by Tasks

The HetCompute runtime allows multiple tasks to access a buffer concurrently, provided they access the
buffer only as input. The runtime ensures that a task accessing the buffer as output or as input-output does
not execute concurrently with other tasks accessing that buffer; that is, HetCompute disallows concurrent
access to a buffer while the buffer is being modified. A host-code acquisition likewise blocks while a
concurrent task/pattern has acquired the buffer for read-write or write-invalidate access. In rare situations,
the acquisition may also block while a concurrent task/pattern has read-only access, if HetCompute is unable
to synchronize the buffer data for host access until the concurrent task/pattern completes.
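The concurrency rule for tasks reduces to a simple conflict check. The following sketch is a toy model and not the runtime's scheduler: the access enum, may_overlap, and batch_sizes are hypothetical names, and the model assumes that accesses to a single buffer are served in launch order.

```cpp
#include <cstddef>
#include <vector>

enum class access { in, out, inout };

// Two tasks may touch the same buffer at the same time only if both read it.
bool may_overlap(access a, access b)
{
    return a == access::in && b == access::in;
}

// Groups a launch-ordered list of accesses to ONE buffer into batches.
// Tasks within a batch may run concurrently; batches run one after another.
std::vector<std::size_t> batch_sizes(const std::vector<access>& accesses)
{
    std::vector<std::size_t> batches;
    bool last_was_reader = false;
    for (access a : accesses)
    {
        if (a == access::in && last_was_reader)
        {
            ++batches.back(); // joins the current group of readers
        }
        else
        {
            batches.push_back(1); // a writer, or the first reader after a writer
        }
        last_was_reader = (a == access::in);
    }
    return batches;
}
```

For the sequence in, in, out, in, in, in, inout, the model yields batches of sizes 2, 1, 3, and 1: readers share the buffer, and every writer runs alone.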

5.4.4 Creating Buffers


A buffer can be created in three basic ways. Each of the ways may take additional parameters covered in
Providing Device Hints.

5.4.4.1 With Storage Fully Managed by HetCompute

The user specifies the datatype and number of elements needed in the buffer. HetCompute internally
manages the allocation of all the storage needed for the buffer.
auto b = hetcompute::create_buffer<int>(100);

5.4.4.2 With User-provided Initial Storage and Data

The initial storage and data for the buffer can be provided by the user. The HetCompute runtime may
allocate additional backing stores as needed, and will handle the synchronization between the user-provided
storage and any internal backing stores.
// user creates storage
std::vector<int> v;
for(int i = 0; i < 100; i++)
v.push_back(i);

// create buffer with initial storage:


// v will serve as the main-memory backing store for the buffer
auto b = hetcompute::create_buffer(v.data(), v.size());

5.4.4.3 With a Memory Region

hetcompute::memregion allows the user to allocate specialized memory or to interoperate
with data from other frameworks (see Memory Regions).
The user may create a buffer from a previously created memory region. The memory region may also
contain initial data for the buffer.
For example, the following code creates a buffer with an ION memregion:
// user creates storage in a specialized "memory region" -- ION memory in this case
hetcompute::ion_memregion imr(100 * sizeof(int)); // allocate storage in ION memory

// user optionally initializes the data
// (get_ptr() is assumed to return an untyped pointer, hence the cast)
int* p = static_cast<int*>(imr.get_ptr());
for (int i = 0; i < 100; i++)
    p[i] = i;

// create buffer with specialized storage:
// imr will serve as the ION backing store for the buffer
auto b = hetcompute::create_buffer<int>(imr);


5.4.5 Performance and Storage Optimizations When Using Buffers


The runtime tries to take advantage of any advance knowledge of which devices may access a buffer’s data.
The knowledge may be gained from HetCompute’s internal scheduling graph when the user creates tasks
for specific devices, and also directly from the user as hints provided at the time of buffer creation.
HetCompute uses the current best knowledge to judiciously allocate limited resources (such as ION
memory), and minimizes the use of data copy and synchronization steps between specialized backing
stores. For example, if it is known up-front that the buffer will be accessed by a DSP task, HetCompute can
allocate ION memory at the time of buffer creation even if the first tasks accessing the buffer run only on
the CPU and GPU. The allocation of ION as the initial backing store eliminates all data copies between
devices. However, if it were not known up-front that the buffer would be accessed by a DSP (Hexagon) task, the
runtime would initially allocate the backing store from the much cheaper system main memory. Later, the
execution of a DSP task would force the allocation of an ION backing store, followed by data copies
between the main memory and the ION backing stores. As a second example, if the CPU is not expected to
access the buffer data, the allocation of the main memory backing store can be skipped entirely, reducing
the number of backing stores whose contents have to be kept synchronized.

5.4.5.1 Explicit Synchronization with Host Code

HetCompute v1.0 introduces the following host synchronization calls in the buffer_ptr class:

1. acquire_ro(): The host code gains read-only access. Results from a prior task become visible to
the host code.
2. acquire_rw(): The host code gains read-write access. Results from a prior task become visible to
the host code. Modifications to the buffer data by the host code will be made visible to any
subsequent tasks.
3. acquire_wi(): The host code invalidates (clobbers) the prior contents of the buffer. Results from
a prior task may be lost. The entire contents of the buffer should be treated as undefined, save for the
new contents written by the host code subsequent to this synchronization call. It is valid for the host
code to read back any new contents written by the host code subsequent to this call. The new contents
of the buffer will be made visible to subsequent tasks.
The buffer synchronization allows the host code to access the buffer data updated by the task (see Host
Access). acquire_ro() acquires the underlying buffer for read-only access by the host code. The host
code may also modify the buffer data by attempting to acquire the underlying buffer for write access using
acquire_wi() or acquire_rw() buffer APIs.
All acquire_*() calls block until any conflicting operations complete (e.g., a task concurrently
performing read-write access to the buffer), after which the buffer is acquired for access by the host code
and the call unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*() call, the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_*() call establishes the access type (read-only,
write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive acquire_*() calls
succeed only if they are compatible with the previously established access type. Subsequent recursive
acquire_wi() and acquire_rw() calls return with failure if the first acquisition was
acquire_ro(), as the access type of these calls is incompatible with the established read-only access.
However, any subsequent recursive acquire_*() call succeeds if the first acquisition was either
write-invalidate or read-write. When the established access type is write-invalidate, subsequent recursive
read-only or read-write acquisitions get access to any data written to the buffer after the
original write-invalidate. When the established access type is read-write, a subsequent recursive
write-invalidate does not destroy any prior data, as there is no additional synchronization required between
device memories to access the latest data.
The host code releases the buffer only when the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
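The recursive acquisition rules above can be captured in a small state machine. The sketch below models only the compatibility and counting rules stated in this section; it is not the buffer_ptr implementation, it ignores blocking and threads, and host_access and host_acquire_model are hypothetical names.

```cpp
#include <cstddef>

enum class host_access { none, ro, rw, wi };

class host_acquire_model
{
public:
    // Returns true if the acquisition is compatible and therefore succeeds.
    bool acquire(host_access requested)
    {
        if (requested == host_access::none)
        {
            return false;
        }
        if (depth_ == 0)
        {
            established_ = requested; // the first acquire fixes the access type
        }
        else if (established_ == host_access::ro && requested != host_access::ro)
        {
            return false; // write request under established read-only access
        }
        ++depth_;
        return true;
    }

    // Returns true when this call balances the last outstanding acquire,
    // which is the point at which the host code actually releases the buffer.
    bool release()
    {
        if (depth_ == 0)
        {
            return false;
        }
        if (--depth_ == 0)
        {
            established_ = host_access::none;
            return true;
        }
        return false;
    }

    host_access established() const { return established_; }
    std::size_t depth() const { return depth_; }

private:
    host_access established_ = host_access::none;
    std::size_t depth_ = 0;
};
```

In this model, an acquire_rw() request made while read-only access is established fails and does not count toward the number of release() calls required.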
// Relaxed host synchronization:
// Select during buffer creation to get better performance
auto buf_a = hetcompute::create_buffer<float>(3);
auto buf_b = hetcompute::create_buffer<float>(buf_a.size());

buf_a.acquire_wi();
// Initialize the input
for (size_t i = 0; i < buf_a.size(); ++i)
buf_a[i] = i;
buf_a.release();

// Create a kernel object
auto gpu_vdouble =
    hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,  // explicit in direction
                                  hetcompute::out<hetcompute::buffer_ptr<float>>> // explicit out direction
    (vdouble_kernel_string, "vdouble");

// Execute the GPU kernel


auto gpu_task1 = hetcompute::launch(gpu_vdouble,
hetcompute::range<1>(buf_a.size()), buf_a, buf_b);
gpu_task1->wait_for();

buf_b.acquire_ro();
for (size_t i = 0; i < buf_b.size(); i++)
HETCOMPUTE_ILOG("buf_b[%zu] = %f", i, buf_b[i]);
buf_b.release();

buf_a.acquire_rw();
// Read buf_a
for (size_t i = 0; i < buf_a.size(); i++)
HETCOMPUTE_ILOG("buf_a[%zu] = %f", i, buf_a[i]);

// Read and modify buf_a


for (size_t i = 0; i < buf_a.size(); ++i)
buf_a[i] = buf_a[i] + 5;
buf_a.release();

// Execute the GPU kernel again


auto gpu_task2 = hetcompute::launch(gpu_vdouble,
hetcompute::range<1>(buf_a.size()), buf_a, buf_b);
gpu_task2->wait_for();

// Relaxed host synchronization:
// Host explicitly requests permission to read buf_b.
buf_b.acquire_ro();
for (size_t i = 0; i < buf_b.size(); i++)
HETCOMPUTE_ILOG("buf_b[%zu] = %f", i, buf_b[i]);
buf_b.release();

The buffer contents become undefined if the host code accesses the buffer data without explicit
synchronization when synchronization was required (such as after a task access), or if the host code
performs accesses incompatible with the type of access chosen (e.g., writing to the buffer after invoking
acquire_ro()).


5.4.5.2 Providing Device Hints

Each of the variants of hetcompute::create_buffer() optionally takes a list of devices likely to
access the buffer. The upfront knowledge of likely devices allows the HetCompute runtime to allocate
backing storage more optimally. For example, the total ION memory available on a platform tends to be
much more limited than the size of the main memory. However, if a buffer will be accessed by the CPU,
GPU, and Hexagon, allocating the buffer’s backing store in ION memory ensures that tasks on each device
can access the same ION backing store and no copies and further allocation/deallocation of backing stores
will be needed. If the likely devices knowledge for a buffer is not available, the HetCompute runtime will
allocate the least costly storage based on the known tasks: a GPU task will cause the allocation of an
OpenCL AHP buffer as a backing store, then a CPU task will allocate a main memory backing store to
which the data will be copied, and then finally when a Hexagon task executes a separate ION backing store
will be allocated and the data copied into it.
The following illustrates how to provide likely-device hints during buffer creation.
auto b = hetcompute::create_buffer<int>(100, {hetcompute::gpu, hetcompute::dsp});

The likely devices information is used merely as an optimization hint. If the information turns out to be
partial or incorrect, the only penalty will be some avoidable backing store allocations and a performance hit
due to the avoidable copying of the buffer data between the backing stores.
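The cost difference described in this section can be illustrated with a toy simulation. The policy below is a deliberately simplified assumption and not the runtime's allocation algorithm: device, store_cost, and simulate are hypothetical names, a DSP hint is modeled as one ION store shared by all devices, and later synchronization traffic is ignored.

```cpp
#include <cstddef>
#include <set>
#include <vector>

enum class device { cpu, gpu, dsp };

struct store_cost
{
    std::size_t allocations = 0; // backing stores created
    std::size_t copies = 0;      // transfers between backing stores
};

// Toy policy (an assumption for illustration):
//  - a DSP hint makes the buffer start with one ION store usable by every device;
//  - otherwise each device type allocates its own store on first access, and
//    the data is copied into it from a store that already exists.
store_cost simulate(const std::set<device>& hints, const std::vector<device>& accesses)
{
    store_cost cost;
    if (hints.count(device::dsp) != 0)
    {
        cost.allocations = 1;
        return cost;
    }
    std::set<device> stores;
    for (device d : accesses)
    {
        if (stores.insert(d).second) // first access from this device type
        {
            ++cost.allocations;
            if (stores.size() > 1)
            {
                ++cost.copies;
            }
        }
    }
    return cost;
}
```

For the access order GPU, CPU, DSP described above, the model reports three allocations and two copies without hints, and one allocation and no copies when the hint list contains the DSP.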

5.4.6 Memory Regions


hetcompute::memregion allows abstract allocation of specialized device memory regions (such as
ION memory) and easy inter-operability with data allocated in other frameworks (such as the use of
existing OpenGL buffers or pre-allocated ION memory with HetCompute). hetcompute::memregion
provides RAII semantics over the specialized memory or framework data it is wrapping: the user constructs
the object to allocate the corresponding memory or setup the interop, and controls the lifetime of the object
to control the lifetime of the allocated memory or interop.
Currently, HetCompute provides the following three kinds of memory regions:

1. Main memory mem-region hetcompute::main_memregion: Provides a convenient mechanism
for allocating aligned memory.
2. ION memory mem-region hetcompute::ion_memregion: Allocates ION memory. The user
can choose if the memory will be cacheable (default) or non-cacheable. The user may also wrap
pre-allocated ION memory into a hetcompute::ion_memregion.
3. GL Buffer interop mem-region hetcompute::glbuffer_memregion: Creates a wrapper
around an existing OpenGL buffer.
The above three specializations derive from the hetcompute::memregion base class. A mem-region
can be passed to hetcompute::create_buffer() to create a buffer with the mem-region as a
backing store (see Creating Buffers). It is the user’s responsibility to ensure that the backing mem-region of
a buffer is kept alive while the buffer itself is going to be accessed. The user is not required to keep the
mem-region object alive beyond the point of last access to the buffer.
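The RAII semantics described above can be sketched with a minimal aligned-memory wrapper. This is an illustration in standard C++17 and not the hetcompute::main_memregion implementation: the class name aligned_region is hypothetical, the accessor names merely mirror get_ptr() and get_num_bytes() as used elsewhere in this guide, and the alignment is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// Owns a block of aligned memory for exactly as long as the object lives.
class aligned_region
{
public:
    aligned_region(std::size_t bytes, std::size_t alignment)
        : ptr_(::operator new(bytes, std::align_val_t(alignment))),
          bytes_(bytes),
          alignment_(alignment)
    {
    }

    // The memory is returned when the region object is destroyed.
    ~aligned_region() { ::operator delete(ptr_, std::align_val_t(alignment_)); }

    // Non-copyable: exactly one object controls the lifetime of the memory.
    aligned_region(const aligned_region&) = delete;
    aligned_region& operator=(const aligned_region&) = delete;

    void* get_ptr() const { return ptr_; }
    std::size_t get_num_bytes() const { return bytes_; }

private:
    void* ptr_;
    std::size_t bytes_;
    std::size_t alignment_;
};
```

As with a real mem-region backing a buffer, any pointer obtained from get_ptr() must not be used after the region object goes out of scope.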

5.5 Textures
Texture APIs in HetCompute are useful for image processing tasks. These APIs allow the user to create
image objects from data residing in host memory and provide them to a kernel for processing. These APIs
also significantly simplify the programming of parallel image processing, such as filtering.
When a user is processing 2D or 3D image data, HetCompute texture APIs are useful as they provide a
multitude of image formats, filtering modes and addressing modes for accessing and manipulating pixel
data in a GPU kernel effectively.
All APIs are accessible by including the following header:
#include <hetcompute/texture.hh>

Note that HetCompute internally handles the initialization of device- and platform-dependent contexts, so
programmers do not need to query or create these contexts themselves. To begin,
programmers can use the following code to create HetCompute textures directly:
input_tex = hetcompute::graphics::create_texture<img_format, 2>(
    { width, height }, static_cast<unsigned char const*>(input_img_data));

output_tex = hetcompute::graphics::create_texture<img_format, 2>({ width, height }, output_img_data);

The hetcompute::graphics::create_texture API takes an image format, image dimensions,
and a valid host pointer to raw pixel data in memory as inputs. The code above creates an input texture and
an output texture for the GPU filter example.
To use a HetCompute texture, a GPU kernel can be created as follows:
auto boxfilter_gpukernel =
hetcompute::create_gpu_kernel<mytextureptrtype, mytextureptrtype, mysamplerptrtype>(source_string,
"box_filter");

mytextureptrtype and mysamplerptrtype are the HetCompute texture and sampler types,
defined as follows:
typedef hetcompute::graphics::texture_ptr<img_format, 2> mytextureptrtype;

typedef hetcompute::graphics::sampler_ptr<addr_mode, fil_mode> mysamplerptrtype;

Please note that source_string contains the actual OpenCL kernel code, which takes textures as kernel
function arguments and uses them. In addition, the template parameters must match the signature of the kernel
function. The kernel source code provided in source_string is as follows:
__kernel void box_filter(__read_only image2d_t source,
                         __write_only image2d_t dest,
                         sampler_t sampler)
{
    // image dimensions
    int img_width = get_global_size(0);
    int img_height = get_global_size(1);

    int2 out_coord = (int2) ( get_global_id(0), get_global_id(1) );

    if( out_coord.x < img_width && out_coord.y < img_height )
    {
        int2 in_coord = out_coord;

        // sample an 8x8 region and average the results
        float4 sum = 0.0f;
        for( int i = 0; i < 8; i++ )
        {
            for( int j = 0; j < 8; j++ )
            {
                sum += read_imagef(source, sampler, in_coord + (int2) (i - 4, j - 4));
            }
        }

        // compute the average
        float4 avg_color = sum / 64.0f;
        // write the result to the output image
        write_imagef( dest, out_coord, avg_color );
    }
}

The kernel above applies an 8-by-8 box filter to an input image and generates a smoothed output image.
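A CPU reference implementation is useful for checking the output of such a kernel. The sketch below is not part of the SDK: the function name box_filter_ref is hypothetical, it handles a single-channel float image stored row by row, and it assumes a clamp-to-edge addressing mode with unnormalized integer coordinates, which may differ from the sampler actually used.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// Averages the 8x8 window covering offsets -4..3 around each pixel,
// mirroring the loop bounds of the box_filter kernel.
std::vector<float> box_filter_ref(const std::vector<float>& src, int width, int height)
{
    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            for (int i = 0; i < 8; ++i)
            {
                for (int j = 0; j < 8; ++j)
                {
                    // clamp-to-edge addressing (assumed)
                    int sx = std::min(std::max(x + i - 4, 0), width - 1);
                    int sy = std::min(std::max(y + j - 4, 0), height - 1);
                    sum += src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = sum / 64.0f;
        }
    }
    return dst;
}
```

Because the window spans offsets -4 to 3 in each dimension, it is not centered on the output pixel; the reference reproduces this half-pixel bias rather than correcting it.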
The kernel can be executed in the style of general HetCompute tasks:
// launch GPU kernel over 2D range
hetcompute::range<2> r(0, width, 0, height);

auto t = hetcompute::create_task(boxfilter_gpukernel, r, input_tex, output_tex, sampler);

t->launch();

t->wait_for();

Internally, the kernel call is directed to the OpenCL driver for execution.
The processed result image can be read back using the following hetcompute::graphics::map API:
// read back result to CPU
auto ptr = static_cast<unsigned char*>(hetcompute::graphics::map(output_tex));

if (ptr != output_img_data)
HETCOMPUTE_FATAL("mapped addr does not match the original one.\n");

hetcompute::graphics::unmap(output_tex);

As a sanity check, ptr should match the data pointer output_img_data that was used to create the output
texture.
HetCompute also handles release of the OpenCL contexts and HetCompute texture objects. However, the
programmers should still call hetcompute::graphics::unmap to release the mapping between CPU
memory and GPU memory, so the same HetCompute texture object can be reused for subsequent kernel
calls.
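When validating the mapped output, it can help to compare it against a CPU reference. The following is a minimal sketch in plain C++, not part of the HetCompute API: box_filter_reference is a hypothetical helper that works on a single-channel float image and clamps out-of-range taps to the nearest edge pixel. The clamping is an assumption; the GPU result at the borders depends on the addressing mode of the sampler that is actually used.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical CPU reference for the 8x8 box filter (single channel, row-major).
// Out-of-range taps are clamped to the nearest edge pixel (an assumption).
std::vector<float> box_filter_reference(const std::vector<float>& src, int width, int height)
{
    assert(width > 0 && height > 0);
    assert(src.size() == static_cast<std::size_t>(width) * height);

    std::vector<float> dst(src.size());
    for (int y = 0; y < height; ++y)
    {
        for (int x = 0; x < width; ++x)
        {
            float sum = 0.0f;
            // Same tap offsets as the kernel: -4 .. +3 in both directions.
            for (int i = -4; i < 4; ++i)
            {
                for (int j = -4; j < 4; ++j)
                {
                    const int sx = std::min(std::max(x + i, 0), width - 1);
                    const int sy = std::min(std::max(y + j, 0), height - 1);
                    sum += src[static_cast<std::size_t>(sy) * width + sx];
                }
            }
            dst[static_cast<std::size_t>(y) * width + x] = sum / 64.0f;
        }
    }
    return dst;
}
```

A constant input image is a convenient first check, because every output pixel must then equal the input value regardless of the border handling.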

5.5.1 QCOM Extended Image format


HetCompute supports creation of textures in the QCOM Extended Image formats, namely TP10, NV12, and
P010. Both linear and UBWC variants of these formats are supported. These image formats are based on YUV
formats and do not allow direct writes to the parent plane image; however, they do support writes to the
derivative Y and UV planes.
Derivative planes are created by invoking
hetcompute::graphics::create_derivative_texture and passing the parent texture, width,
and height. hetcompute::graphics::create_derivative_texture is a wrapper around newer
OpenCL QCOM extensions that support QCOM extended formats and vector operations. The code snippet
below shows a simple Compressed TP10 to Compressed TP10 copy using HetCompute parent
textures and derivative textures.
hetcompute::ion_memregion *src_ion_mem = new hetcompute::ion_memregion(buffer_size, false);
memset(src_ion_mem->get_ptr(), 0, src_ion_mem->get_num_bytes());

hetcompute::ion_memregion *dst_ion_mem = new hetcompute::ion_memregion(buffer_size, false);
memset(dst_ion_mem->get_ptr(), 0, dst_ion_mem->get_num_bytes());

if (memcmp(src_ion_mem->get_ptr(), dst_ion_mem->get_ptr(), buffer_size) != 0)
{
    HETCOMPUTE_DLOG("Initial checking of memcmp is failing");
}

// Copy image data into source ion buffer
memcpy(src_ion_mem->get_ptr(), image.data(), image.size());

typedef
hetcompute::graphics::texture_ptr<hetcompute::graphics::image_format::CompressedTP10unorm_int10, 2> mytextur
typedef hetcompute::graphics::sampler_ptr<hetcompute::graphics::addressing_mode::ADDRESS_NONE,
hetcompute::graphics::filter_mode::FILTER_NEAREST>
mysamplerptrtype;
auto sampler =
hetcompute::graphics::create_sampler<hetcompute::graphics::addressing_mode::ADDRESS_NONE,
hetcompute::graphics::filter_mode::FILTER_NEAREST>(
false);

auto hetcomputegpukernel =
    hetcompute::create_gpu_kernel<mytextureptrtype,
                                  mytextureptrtype,
                                  mytextureptrtype,
                                  mysamplerptrtype>(default_source_string,
                                                    "copy_tp10_yuv_image_to_y_image_and_uv_image");

// Create src parent texture for TP10 compressed


auto input_tex =
hetcompute::graphics::create_texture<hetcompute::graphics::image_format::CompressedTP10unorm_int10, 2>({ wid

*(src_ion_mem),

true);

// Create dst parent texture for TP10 compressed


auto output_tex =
hetcompute::graphics::create_texture<hetcompute::graphics::image_format::CompressedTP10unorm_int10, 2>({ wid

*(dst_ion_mem),

true);

// Create dst derivative Y & UV plane


auto output_tex_y =
hetcompute::graphics::create_derivative_texture<
hetcompute::graphics::image_format::CompressedTP10unorm_int10,
2>(output_tex,
hetcompute::graphics::extended_format_plane_type::ExtendedFormatYPlane);
auto output_tex_uv =
hetcompute::graphics::create_derivative_texture<
hetcompute::graphics::image_format::CompressedTP10unorm_int10,
2>(output_tex,
hetcompute::graphics::extended_format_plane_type::ExtendedFormatUVPlane);

The example above creates the source and destination ION memory regions. The source ION memory is
populated with the image data that was read. The GPU kernel and the sampler are created as in the previous
example. The snippet then creates the parent UBWC TP10 textures. Because UBWC TP10 data is written to
output_tex, the example creates the derivative Y and UV planes. Both derivative textures are later
passed to the GPU kernel, which copies the data to the output Y and UV textures.
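To make the Y/UV plane split concrete, the following sketch computes the plane sizes for the simplest of these formats, linear NV12. It assumes a tightly packed 8-bit 4:2:0 layout with no row padding. Real buffers usually carry stride and alignment padding, and the TP10, P010, and UBWC layouts differ, so nv12_sizes is a hypothetical helper for illustration only and must not be used to size actual ION buffers.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: byte sizes of the two planes of a tightly packed,
// linear NV12 image (full-resolution Y, half-resolution interleaved UV).
struct nv12_plane_sizes
{
    std::size_t y_bytes;
    std::size_t uv_bytes;
};

nv12_plane_sizes nv12_sizes(std::size_t width, std::size_t height)
{
    // 4:2:0 subsampling needs even dimensions.
    assert(width % 2 == 0 && height % 2 == 0);

    nv12_plane_sizes s;
    s.y_bytes = width * height;                   // one byte per pixel
    s.uv_bytes = (width / 2) * (height / 2) * 2;  // one U and one V byte per 2x2 block
    return s;
}
```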

5.6 Data Structures


5.6.1 HetCompute Lock-Free Queue


HetCompute provides two variants of a concurrent lock-free first-in first-out (FIFO) queue data structure: a
fixed-size implementation denoted by bounded_lfqueue, and an unbounded version denoted by
lfqueue. Both variants support two operations: push and pop. The push operation inserts a value into
the queue and returns true if it was successful. A push operation can return false only in the case of
the bounded_lfqueue, when the queue is full.
A pop operation removes a value from a non-empty queue, and returns true. If the queue is empty, a pop
operation returns false. Multiple threads can execute push and pop operations in parallel on the queues,
and synchronization is achieved without the use of locks.
All lfqueue APIs are accessible by including the following header:
#include <hetcompute/lfqueue.hh>

The bounded_lfqueue APIs are accessible by including the following header:


#include <hetcompute/bounded_lfqueue.hh>

At a high level, the bounded_lfqueue is implemented as a fixed-size circular array whose size is
defined by the user through an input parameter. Specifically, in HetCompute, the size of the array is forced
to be a power of two by taking the base-2 logarithm of the size as input. Consider the following example of a
bounded_lfqueue instantiation:
hetcompute::bounded_lfqueue<size_t> q(8);

In this example, the size of the bounded_lfqueue is set to 2^8 = 256 entries. When the queue is full, a
push operation cannot add a new value to the queue until one has been popped.
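The source does not state why the size is restricted to a power of two. One common reason in circular-array designs is that wrap-around then reduces to a bit mask. The sketch below illustrates that arithmetic only; ring_capacity and ring_slot are hypothetical helpers, not HetCompute functions, and nothing here describes the actual bounded_lfqueue implementation.

```cpp
#include <cassert>
#include <cstddef>

// Capacity of a circular array whose size is given as log2(entries).
std::size_t ring_capacity(std::size_t log2_size)
{
    assert(log2_size < sizeof(std::size_t) * 8);
    return static_cast<std::size_t>(1) << log2_size;
}

// Slot addressed by a monotonically increasing push or pop counter.
// Because the capacity is a power of two, "counter % capacity" is a mask.
std::size_t ring_slot(std::size_t counter, std::size_t log2_size)
{
    return counter & (ring_capacity(log2_size) - 1);
}
```

With a constructor argument of 8, ring_capacity(8) yields the 256 entries mentioned above, and counter values 256 and 257 wrap around to slots 0 and 1.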
The lfqueue can be thought of as a linked list in which each node is a bounded_lfqueue; the list as a
whole is of unbounded size. As in the case of the bounded_lfqueue, the size parameter passed during
instantiation gives its initial size. The lfqueue then extends itself in chunks of this size whenever needed.
The following is a simple example that illustrates the use of the lfqueue.
1 #include <hetcompute/lfqueue.hh>
2
3 #include <hetcompute/hetcompute.hh>
4
5 int
6 main()
7 {
8 hetcompute::runtime::init();
9 hetcompute::lfqueue<size_t> q(8);
10
11 // Create groups for producer and consumer tasks
12 auto producer = hetcompute::create_group("Producer");
13 auto consumer = hetcompute::create_group("Consumer");
14
15 // Launch 2 tasks into the producer group,
16 // each of which pushes 100 values into q
17 for (size_t p = 0; p < 2; ++p)
18 {
19 producer->launch([&]() {
20 for (size_t i = 0; i < 100; i++)
21 {
22 q.push(i);
23 }
24 });
25 }
26
27 // Launch 2 tasks into the consumer group,
28 // each of which pops 100 values from q
29 for (size_t c = 0; c < 2; ++c)


30 {
31 consumer->launch([&]() {
32 size_t j = 0;
33 while (j < 100)
34 {
35 size_t result;
36 if (q.pop(result))
37 {
38 // The popped value is stored in result
39 ++j;
40 }
41 }
42 });
43 }
44
45 // wait for consumer group to finish
46 consumer->wait_for();
47
48 hetcompute::runtime::shutdown();
49 }

In the above example, two HetCompute groups, producer and consumer, are created first (lines 12 and
13). Two HetCompute tasks are then launched into each group. Each task in the producer group pushes
100 size_t values (lines 19-24) into the queue q (instantiated in line 9), and the tasks in the consumer
group concurrently pop the values (lines 32-41) from q. The program terminates only when all 200
values pushed into the queue have been popped. Therefore, it suffices to wait for the consumer group to
finish (line 46), as the consumer tasks complete only after each one has popped 100 values.
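The counting argument above (200 values pushed, each consumer stops after 100 successful pops) can be tried without the SDK. The sketch below uses std::thread and a mutex-protected std::queue as a stand-in. locked_queue is a hypothetical, lock-based substitute with the same push/pop return convention; it is explicitly not lock-free and says nothing about how lfqueue is implemented.

```cpp
#include <cstddef>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Lock-based stand-in for an unbounded queue with push/pop returning bool.
class locked_queue
{
public:
    bool push(std::size_t v)
    {
        std::lock_guard<std::mutex> lock(m_);
        q_.push(v);
        return true; // unbounded: push always succeeds
    }

    bool pop(std::size_t& out)
    {
        std::lock_guard<std::mutex> lock(m_);
        if (q_.empty())
            return false;
        out = q_.front();
        q_.pop();
        return true;
    }

private:
    std::mutex m_;
    std::queue<std::size_t> q_;
};

// Runs 2 producers and 2 consumers; returns the sum of all popped values.
std::size_t run_producers_and_consumers()
{
    locked_queue q;
    std::size_t sums[2] = {0, 0};
    std::vector<std::thread> threads;

    for (int p = 0; p < 2; ++p)
        threads.emplace_back([&q] {
            for (std::size_t i = 0; i < 100; ++i)
                q.push(i);
        });

    for (int c = 0; c < 2; ++c)
        threads.emplace_back([&q, &sums, c] {
            std::size_t popped = 0;
            while (popped < 100)
            {
                std::size_t value;
                if (q.pop(value))
                {
                    sums[c] += value; // each consumer writes only its own slot
                    ++popped;
                }
                else
                {
                    std::this_thread::yield(); // queue momentarily empty
                }
            }
        });

    for (auto& t : threads)
        t.join();
    return sums[0] + sums[1];
}
```

Because every pushed value is popped exactly once, the combined sum is twice the sum of 0 through 99, independent of how the threads interleave.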

5.7 Storage
5.7.1 Task-Local Storage
Tasks, much like threads, can be associated with task-local storage, via
hetcompute::task_storage_ptr. The usage pattern consists of declaring a global variable, say
storage, which holds a pointer to the actual task-local data. Then, within task t, that variable is assigned
a pointer to a (usually) local variable, or a chunk of freshly allocated memory. After that, storage can be
used within the dynamic extent of task t:
#include <hetcompute/hetcompute.hh>

namespace
{
    hetcompute::task_storage_ptr<int> storage;
} // namespace

void func();

void
func()
{
    HETCOMPUTE_ILOG("%d", *storage);
    ++*storage;
}

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    for (int i = 0; i < 10; ++i)
    {
        g->launch([i] {
            int v = i;
            storage = &v;
            func();
            if (v != i + 1)
            {
                HETCOMPUTE_ILOG("error");
            }
            func();
            if (v != i + 2)
            {
                HETCOMPUTE_ILOG("error");
            }
        });
    }
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

Note that accessing the value of storage affects only the current task. Attempting to modify the value of
a task_storage_ptr outside of a task yields undefined behavior.
Optionally, a destructor (or rather, a finalizer) can be employed to dispose of resources. The destructor
runs within each task that has a value assigned to the global variable.
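Since tasks are described as being "much like threads" in this respect, the same usage pattern can be tried in standard C++ with a thread_local pointer. The sketch below is an analogy only: std::thread stands in for a task, and the names bump and run_in_thread are hypothetical. It does not use or model hetcompute::task_storage_ptr itself.

```cpp
#include <cassert>
#include <thread>

namespace
{
    // Plays the role of the global storage variable: one pointer per thread.
    thread_local int* storage = nullptr;

    // Uses the pointer without receiving it as an argument.
    void bump()
    {
        assert(storage != nullptr);
        ++*storage;
    }
} // namespace

// Runs a "task" on its own thread: points storage at a local variable,
// bumps it twice through the global, and returns the final value.
int run_in_thread(int start)
{
    int result = 0;
    std::thread t([start, &result] {
        int v = start;
        storage = &v; // visible only on this thread
        bump();
        bump();
        result = v;
    });
    t.join();
    return result;
}
```

The calling thread's own copy of storage is never assigned, which mirrors the statement that assignments affect only the current task.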

5.7.2 Scheduler-Local Storage


Another use case is scratchpads: data that is persistent across task boundaries, usually to avoid per-task
memory allocation or initialization. HetCompute can avoid synchronizing access to scratchpads if each
scheduler creates its own scratchpad (which can then be used like task-local storage). As a further
optimization, hetcompute::scheduler_storage_ptr<T> instances are created lazily when they are
written to inside of a task. Note that variable initialization and destruction happen through the constructor
and destructor of T:
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::scheduler_storage_ptr<size_t> s_sls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([i] {
            size_t c = ++*s_sls_state;
            // values for c are consecutive on a per-scheduler basis
            (void)c;
        });
    }

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

Scheduler-local storage is unaffected by context switches (e.g., via hetcompute::task<>::wait_for):
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::scheduler_storage_ptr<size_t> s_sls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t c1 = ++*s_sls_state;
            t->launch();
            t->wait_for();
            size_t c2 = ++*s_sls_state;
            if (c1 + 1 != c2)
            {
                HETCOMPUTE_ILOG("error: mismatch");
            }
        });
    }

    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

A complete example:
#include <algorithm>
#include <iterator>

#include <hetcompute/hetcompute.hh>

template <size_t N>
struct image_scratchpad
{
    image_scratchpad() { std::fill(std::begin(edge_image), std::end(edge_image), 0); }
    char edge_image[N];
};

namespace
{
    const hetcompute::scheduler_storage_ptr<image_scratchpad<4096>> image_buffers;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    int const N = 200;

    auto g = hetcompute::create_group();
    for (int i = 1; i < N; ++i)
    {
        g->launch([i] {
            // fill image buffer, which is reused across tasks
            for (auto& slot : image_buffers->edge_image)
                slot = i & 0xff;
            hetcompute::internal::yield(); // context-switch, we expect SLS to survive this
            // check contents
            for (auto const& slot : image_buffers->edge_image)
            {
                if (slot != char(i & 0xff))
                {
                    HETCOMPUTE_ILOG("mismatch at position %d", i);
                }
            }
        });
    }
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

5.7.3 Thread-Local Storage


If a group of tasks needs scratchpads but does not require the data to persist across context switches,
hetcompute::thread_storage_ptr is a viable alternative to
hetcompute::scheduler_storage_ptr. Because HetCompute thread-local storage is tied to
HetCompute's device threads, HetCompute allocates fewer instances of T than with
hetcompute::scheduler_storage_ptr (see the earlier example): at most one per device thread.
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::thread_storage_ptr<size_t> s_tls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group("test");
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t* p1 = s_tls_state.get();
            t->launch();
            t->wait_for();
            size_t* p2 = s_tls_state.get();
            // cannot assume that p1 == p2
            (void)p1;
            (void)p2;
        });
    }

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
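The "at most one instance per device thread" property, combined with lazy creation, can be illustrated in standard C++ with a function-local thread_local object. This is a sketch under the assumption that std::thread is an acceptable stand-in for a device thread; scratchpad, local_scratchpad, and count_scratchpads are hypothetical names and do not model the HetCompute implementation.

```cpp
#include <atomic>
#include <cassert>
#include <thread>
#include <vector>

namespace
{
    std::atomic<int> scratchpads_constructed{0};

    struct scratchpad
    {
        scratchpad() { ++scratchpads_constructed; }
        char bytes[4096];
    };

    // Constructed lazily, the first time each thread calls this function.
    scratchpad& local_scratchpad()
    {
        thread_local scratchpad pad;
        return pad;
    }
} // namespace

// Runs `units` pieces of work on each of `num_threads` threads and returns
// how many scratchpads were constructed during this call.
int count_scratchpads(int num_threads, int units)
{
    assert(num_threads >= 0 && units >= 0);
    const int before = scratchpads_constructed.load();

    std::vector<std::thread> threads;
    for (int t = 0; t < num_threads; ++t)
        threads.emplace_back([units] {
            for (int u = 0; u < units; ++u)
                local_scratchpad().bytes[0] = static_cast<char>(u); // reuse, no reallocation
        });
    for (auto& th : threads)
        th.join();

    return scratchpads_constructed.load() - before;
}
```

However many units of work each thread performs, the number of constructed scratchpads equals the number of threads that touched one.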

5.8 Affinity
The affinity APIs enable the programmer to change execution properties of program statements (arbitrary
functions), HetCompute tasks, and device threads. These properties include:

• location: to set the CPUs where the program constructs should run.
• pinning: to set whether HetCompute device threads should migrate freely among cores (also known
as thread binding).
• mode: to override local affinity settings.

Programmers can use these APIs to improve performance and even to save power. The affinity APIs are
defined in hetcompute/affinity.hh and are accessible by including the main HetCompute header:
#include <hetcompute/hetcompute.hh>

Note

To set the affinity of individual tasks (rather than of all tasks, as the above APIs do) to big or LITTLE
in a big.LITTLE SoC, use CPU kernel attributes (see Setting Kernel Attributes).

Regarding the capabilities of the APIs, location enables targeting clusters of cores in a heterogeneous
System-on-Chip (SoC), such as the Qualcomm Snapdragon 845 or 835, where not all cores are equal and
different clusters provide different performance/power points. For example, on a Snapdragon 845, a
programmer may choose to run only on the LITTLE cluster, as illustrated in the following example, which
demonstrates the other affinity APIs as well.
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings = hetcompute::affinity::settings(hetcompute::affinity::cores::big,
                                                       false,
                                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    auto g = hetcompute::create_group(__FUNCTION__);

    auto k_wout_attrib =
        hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task without kernel affinity attribute."); });

    auto k_with_attrib =
        hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task with kernel affinity attribute"); });
    k_with_attrib.set_little();

    // k_with_attrib kernel will run in a LITTLE core
    g->launch(k_with_attrib);

    // k_wout_attrib can run in any core
    g->launch(k_wout_attrib);

    g->wait_for();

    // Set the affinity to the LITTLE cores without pinning in
    // allow_local_setting mode
    hetcompute::affinity::set(
        hetcompute::affinity::settings(hetcompute::affinity::cores::little,
                                       false,
                                       hetcompute::affinity::mode::allow_local_setting));

    // k_wout_attrib task will run in a LITTLE core because the kernel has no
    // individual affinity specification
    g->launch(k_wout_attrib);

    // Set the affinity to the big cores with pinning in allow_local_setting mode
    // by reading the current affinity and then updating the different fields
    auto affinity = hetcompute::affinity::get();

    // Update the cores from LITTLE to big
    affinity.set_cores(hetcompute::affinity::cores::big);

    // Enable thread pinning
    affinity.set_pin_threads();

    // Update the mode from allow_local_setting to override_local_setting in the
    // settings
    affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

    // Update the affinity with the modified affinity object
    hetcompute::affinity::set(affinity);

    // The second run of k_with_attrib will run on a big core because the
    // affinity mode is override_local_setting and global affinity settings are
    // obeyed
    g->launch(k_with_attrib);

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

The example illustrates three different ways of setting affinity to program constructs and also shows how
one can override the others.

1. hetcompute::affinity::execute is the easiest, most portable, and most efficient way to enforce
affinity for program statements expressible as function calls, function objects, or C++ lambdas. In the
example, the programmer calls fn with argument 42, and HetCompute ensures that the function
executes on a big core. Note that the programmer need not be concerned about whether the code is
executing on a big.LITTLE SoC, whether the thread calling
hetcompute::affinity::execute(..., fn, ...) is executing on a big or a LITTLE core, and so on.
HetCompute determines the most efficient way to execute the function with the desired affinity.
2. The programmer can set the affinity of individual kernels. In the example, the HetCompute CPU
kernel k_with_attrib is marked as having affinity for the LITTLE cores through the statement:
k_with_attrib.set_little();
HetCompute will run the CPU kernel k_with_attrib on a LITTLE core, in contrast to
k_wout_attrib, which may run on any core.
3. The programmer can set the affinity of all program statements via
hetcompute::affinity::set(
hetcompute::affinity::settings(
hetcompute::affinity::cores::little, false,
hetcompute::affinity::mode::allow_local_setting));
The above creates an affinity settings object containing the indicated location, pinning, and mode.
The call to hetcompute::affinity::set will update the HetCompute affinity settings. In this
case, HetCompute’s device threads will run only in the LITTLE CPUs and can migrate freely among
CPUs. This setting is very useful to save power because it guarantees that all CPU tasks are executed
in the low-power cluster of the SoC. Tasks constructed from CPU kernels without the big/LITTLE
attribute will automatically be routed to the appropriate CPU cores through the affinity setting
(LITTLE cores in the example).
Note

Specifying a big or LITTLE location in an SoC with homogeneous cores, such as Snapdragon 805,
will have no effect. However, the pinning request will still be fulfilled.

To update individual aspects of the current affinity settings, use the following API:
auto affinity = hetcompute::affinity::get();

// Update core affinity from LITTLE to big
affinity.set_cores(hetcompute::affinity::cores::big);

5.8.1 Overriding Local Affinity Settings


There are situations in which programmers would like to pin the device threads to the big cores in order to
maximize locality and performance, for example in a highly optimized linear algebra library. They can
achieve this goal by setting the location to big and pinning to true. To guarantee that all tasks run on the
big cores (that is, that big/LITTLE CPU kernel affinity attributes and
hetcompute::affinity::execute() affinity settings are discarded), they can also set the mode to
override_local_setting.
// Update the mode from allow_local_setting to override_local_setting in the settings
affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

// Update the affinity with the modified affinity object
hetcompute::affinity::set(affinity);

In our example, the k_with_attrib kernel executes on a LITTLE core the first time it is launched; the
second time, after the mode has been set to override_local_setting, it runs on a big core.
To respect local affinity settings, set mode to
hetcompute::affinity::mode::allow_local_setting.
Note

Very few situations will benefit from pinning; due to the thermally constrained environment of mobile
packages, CPUs can go online/offline unannounced. When requesting pinning, if there are offline
CPUs, HetCompute will pin device threads as much as possible to a single CPU; however, some device
threads may remain unpinned.

Reset affinity settings when they are no longer required using:
hetcompute::affinity::reset();

The system then returns to the default state, in which device threads can move freely across CPUs.
All of the preceding free functions are thread-safe, and programmers may call the affinity APIs at any
point of execution, even within CPU tasks.
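The interaction between the global mode and a local request (a kernel attribute or an execute() call) can be summarized as a small decision function. The sketch below is a hypothetical model of the rules stated in this section, written in plain C++. The type and function names are illustrative, cores::all is used here to mean "no local request", and none of this mirrors the HetCompute implementation.

```cpp
// Hypothetical model of the affinity rules described above.
enum class cores { all, big, little };
enum class mode { allow_local_setting, override_local_setting };

struct global_settings
{
    cores where;
    mode  how;
};

// local == cores::all models a kernel without a big/LITTLE attribute.
cores effective_cores(global_settings global, cores local)
{
    if (global.how == mode::override_local_setting)
        return global.where; // local requests are discarded

    // allow_local_setting: a local request wins; otherwise fall back to global.
    return local != cores::all ? local : global.where;
}
```

Applied to the example program: with global settings (little, allow_local_setting), a kernel without an attribute resolves to little, and after switching to (big, override_local_setting), a kernel marked little resolves to big.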

5.9 Heterogeneous Computing in Action


This section demonstrates how to write a simple heterogeneous HetCompute application that executes tasks
on multiple devices in a system, coherently bringing together many of the HetCompute constructs described
previously.
In general, the steps for writing a heterogeneous HetCompute application are as follows:

1. Write device functions for the devices that will be used.


2. Declare HetCompute buffers that pass data between devices.
3. Create HetCompute kernels from appropriate device functions and buffers.
4. Use the kernels to create HetCompute tasks.
5. Construct and launch a HetCompute program using the tasks.
The last step in the list above is the same for any program, whether it is CPU-only or
heterogeneous. Therefore, the remainder of this section focuses on the first four steps.
To explain those steps, the remainder of this section develops a simple HetCompute application
step by step. This application takes in an array of 10 floating-point numbers, x[i], and computes
x[i] * x[i] + 1 / x[i] for each element. That is, an input of 1.0, 2.0, ..., 10.0 would produce an
output of 2.0, 4.5, ..., 100.1.
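A CPU-only reference is useful for checking the heterogeneous result. The following sketch computes the same quantity in plain C++; reference_result is a hypothetical helper and assumes the inputs 1.0, 2.0, ..., n described above. For the last of the ten elements it gives 10 * 10 + 1 / 10 = 100.1.

```cpp
#include <cassert>
#include <vector>

// Hypothetical CPU-only reference for x * x + 1 / x over inputs 1.0 .. n.
std::vector<float> reference_result(int n)
{
    assert(n >= 0);
    std::vector<float> out(n);
    for (int i = 0; i < n; ++i)
    {
        const float x = i + 1.0f; // inputs 1.0, 2.0, ..., n
        out[i] = x * x + 1.0f / x;
    }
    return out;
}
```

When comparing against device output, a small tolerance is advisable, because GPU and DSP floating-point results need not match the CPU bit for bit.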
First, the device functions must be written. As explained in Kernels: The Path to Heterogeneity, at present,
CPUs, GPUs, and DSPs are programmed in different languages in HetCompute. The following example
lists three device functions that will be used in the example program, showing different device
programming styles.
#pragma once

#include <string>

// A CPU function that initializes an array
void
f1(float* b, int N)
{
    for (int i = 0; i < N; i++)
        b[i] = i + 1.0f;
}

// A GPU function that computes squares
std::string const f2_string = "__kernel void f2(__global float *in, __global float *out) {"
                              "  int i = get_global_id(0);"
                              "  out[i] = in[i] * in[i];"
                              "}";

// A DSP function that adds reciprocals to the output
int
f3(float* in, int lin, float* out, int lout)
{
    int i;
    for (i = 0; i < lin && i < lout; i++)
        out[i] += 1 / in[i];
    return 0;
}

After the device functions are written, the next step is to consider how data is passed between these
functions and create data containers accordingly. HetCompute buffers serve this purpose. In particular,
buffers abstract away much of the manual data marshaling in traditional heterogeneous programming
environments, greatly simplifying multi-device programming.
Finally, the methods described in Kernels: The Path to Heterogeneity and Creating Tasks can be used to
create kernels and tasks, set dependencies between tasks, and launch them. This yields the following
program.
#include <hetcompute/hetcompute.hh>
#include "heterogeneous.hh"

int
main()
{
    hetcompute::runtime::init();
    // The number of elements to compute x^2 + 1/x for
    constexpr int N = 10;

    // Create buffers for input and output data
    auto b1 = hetcompute::create_buffer<float>(N);
    auto b2 = hetcompute::create_buffer<float>(N);

    // The CPU initializes the input data first
    auto t1 = hetcompute::create_task(f1);

    // The GPU squares every input element
    auto k2 = hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<float>,
                                            hetcompute::buffer_ptr<float>>(f2_string, "f2");
    auto t2 = hetcompute::create_task(k2, hetcompute::range<1>{ N }, b1, b2);

    // The DSP adds the reciprocals to the result
    auto k3 = hetcompute::create_dsp_kernel<>(f3);
    auto t3 = hetcompute::create_task(k3, b1, b2);

    // Run all the tasks
    t1 >> t2 >> t3;
    t1->launch(b1, N);
    t2->launch();
    t3->launch();
    t3->wait_for();

    // Output the result
    for (int i = 0; i < N; i++)
        HETCOMPUTE_ILOG("%f\n", b2[i]);

    hetcompute::runtime::shutdown();
    return 0;
}

Note

It is worth emphasizing again that HetCompute tasks are universal across different devices. While
tasks may contain kernels customized for different devices, at the task level and above, a programmer
should not need to distinguish between these tasks.

While the example above is functionally correct, its performance can be improved by a few simple
techniques.

1. The GPU and DSP kernels in the example above are sequentially executed. However, through the use
of additional buffers, they can be launched asynchronously, giving them a chance to execute
concurrently depending on scheduling results. The caveat is to ensure that they converge in the same
host thread by using a group wait_for.
2. While the default buffer constructors need very few arguments to produce functionally correct
behavior, their performance can be improved by providing additional hints to the constructors. For
example, hetcompute::in, hetcompute::out, and hetcompute::inout can be used to
qualify the buffers and avoid unnecessary copies. Additionally, hints about likely devices can guide
storage allocation for buffers.
The optimizations above produce the following result:
#include <hetcompute/hetcompute.hh>
#include "heterogeneous2.hh"

int
main()
{
    hetcompute::runtime::init();
    // The number of elements to compute x^2 + 1/x for
    constexpr int N = 10;

    // Create buffers for input, intermediate, and output data
    auto b1 = hetcompute::create_buffer<float>(N);
    auto b2 = hetcompute::create_buffer<float>(N);
    auto b3 = hetcompute::create_buffer<float>(N);
    auto b4 = hetcompute::create_buffer<float>(N);

    // The CPU initializes the input data first
    auto t1 = hetcompute::create_task(f1);

    // The GPU squares every input element
    auto k2 = hetcompute::create_gpu_kernel<hetcompute::in<hetcompute::buffer_ptr<float>>,
                                            hetcompute::out<hetcompute::buffer_ptr<float>>>(f2_string, "f2");
    auto t2 = hetcompute::create_task(k2, hetcompute::range<1>{ N }, b1, b2);

    // The DSP computes the reciprocals into b3
    // (f3 comes from heterogeneous2.hh, which is not shown here)
    auto k3 = hetcompute::create_dsp_kernel<>(f3);
    auto t3 = hetcompute::create_task(k3, b1, b3);

    // Run all the tasks
    auto g = hetcompute::create_group();
    t1 >> t2;
    t1 >> t3;
    t1->launch(b1, N);
    g->launch(t2);
    g->launch(t3);
    g->wait_for();

    // Combine the results
    for (int i = 0; i < N; i++)
        b4[i] = b2[i] + b3[i];

    // Output the result
    for (int i = 0; i < N; i++)
        HETCOMPUTE_ILOG("%f\n", b4[i]);

    hetcompute::runtime::shutdown();

    return 0;
}
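The shape of the optimized program (two independent stages reading the same input, each writing a private output, followed by a join and a combine step) can be tried in standard C++. The sketch below uses std::async as a stand-in for launching tasks into a group. squares and reciprocals are plain CPU functions standing in for the GPU and DSP kernels, and all names are hypothetical.

```cpp
#include <cstddef>
#include <functional>
#include <future>
#include <vector>

// Stand-in for the GPU stage.
std::vector<float> squares(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = in[i] * in[i];
    return out;
}

// Stand-in for the DSP stage.
std::vector<float> reciprocals(const std::vector<float>& in)
{
    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = 1.0f / in[i];
    return out;
}

// Runs both stages concurrently on the same read-only input, waits for both
// (the join), then combines the two private outputs element by element.
std::vector<float> fork_join(const std::vector<float>& in)
{
    auto sq = std::async(std::launch::async, squares, std::cref(in));
    auto rc = std::async(std::launch::async, reciprocals, std::cref(in));
    const std::vector<float> a = sq.get();
    const std::vector<float> b = rc.get();

    std::vector<float> out(in.size());
    for (std::size_t i = 0; i < in.size(); ++i)
        out[i] = a[i] + b[i];
    return out;
}
```

Giving each stage its own output is what removes the ordering constraint between them; with a shared output buffer the stages would have to run one after the other.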

5.10 Interoperability
The HetCompute programming model isolates programmers from threads; however, HetCompute
applications are multithreaded. This section discusses some of the interoperability issues that arise
from using threads in your application and how they interact with the HetCompute runtime.
A HetCompute application starts in a main thread and creates a thread pool. The number of threads in
the thread pool depends on the platform. Additional threads may be created based on application behavior
in order to keep all the resources busy. A HetCompute task might be created in any thread, while other
operations, such as launching and executing, may occur on any other threads. Heterogeneous tasks on the
GPU and DSP behave similarly. Any application thread may call hetcompute::create_task(Code &&code,
Args &&...args) and hetcompute::task<>::launch(). The task will be executed by one of
the threads in the HetCompute thread pool. It is important to note this distinction because programmers
need to ensure that data accessed in the task remains available throughout the lifetime of the task (even if
it has been allocated in a different thread) and that the data is accessed in a thread-safe manner.
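One way to satisfy the lifetime requirement is shared ownership: the task body holds its own reference to the data, so the data outlives the scope that created it. The sketch below shows the idea in standard C++, with std::thread standing in for a thread-pool thread. start_sum and sum_on_other_thread are hypothetical names, and the approach is one option among several rather than a HetCompute requirement.

```cpp
#include <memory>
#include <numeric>
#include <thread>
#include <vector>

// Starts work on another thread; the lambda co-owns the data it reads.
std::thread start_sum(std::shared_ptr<const std::vector<int>> data, long& result)
{
    return std::thread([data, &result] {
        result = std::accumulate(data->begin(), data->end(), 0L);
    });
}

long sum_on_other_thread()
{
    long result = 0;
    std::thread worker;
    {
        auto data = std::make_shared<const std::vector<int>>(std::vector<int>{1, 2, 3, 4});
        worker = start_sum(data, result);
    } // the creator's handle is gone; the worker's copy keeps the vector alive

    worker.join(); // join before reading result, and before result leaves scope
    return result;
}
```

The input is immutable here, which also takes care of thread-safe access. The result variable is captured by reference, so it must stay alive until the join, as it does in this function.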
HetCompute provides certain guarantees with respect to task execution, as defined below.

5.10.1 Safe Points


Safe points are HetCompute API methods where the following property holds: the thread on which the task
executes before the API call might not be the same as the thread on which the task executes after the API
call. The APIs in HetCompute that may switch threads are:

• hetcompute::task<>::wait_for()
• hetcompute::group<>::wait_for()
• hetcompute::condition_variable::wait()


5.10.2 Using HetCompute with the Fork() System Call


The Unix fork() system call is designed to duplicate the current process as a new process. This call is
commonly used within shells to start new commands, within web servers to handle new connections in a
separate process, and within web browsers to implement security between different browser tabs.
An important limitation of fork() is that it copies the memory of the process, but starts the child with
only one thread, cloned from the thread that made the fork() system call. This is a known problem for
multithreaded programs, and HetCompute is no exception. Calling fork() from a task running in the
thread pool starts the new process with only one thread from the pool, and no other threads would exist.
Tasks running on the other threads would be copied in an inconsistent state, and the output would be
indeterminate. HetCompute implements various features to prevent this misuse of fork().
No calls to fork() are allowed after any HetCompute function is invoked. This includes functions like
hetcompute::create_task and hetcompute::create_group. If the application requires the
use of fork(), it should be called before any HetCompute routines are invoked.
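The ordering rule can be sketched with the POSIX calls alone: fork while the process is still single-threaded, let the child do its work and exit, and only afterwards initialize a multithreaded runtime in the parent. fork_before_runtime_init is a hypothetical helper and the child does nothing but exit; it illustrates the call order, not a HetCompute facility.

```cpp
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

// Forks a child that exits immediately with `code`. Returns the exit code
// observed by the parent, or -1 on failure. Intended to be called before any
// multithreaded runtime has been initialized in this process.
int fork_before_runtime_init(int code)
{
    const pid_t pid = fork();
    if (pid < 0)
        return -1; // fork failed

    if (pid == 0)
        _exit(code); // child: exit without running the parent's cleanup handlers

    int status = 0;
    if (waitpid(pid, &status, 0) != pid || !WIFEXITED(status))
        return -1;
    return WEXITSTATUS(status);
}
```

In an application, the call to hetcompute::runtime::init() would come only after such a function has returned in the parent.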

5.10.3 Using HetCompute with TLS-aware Libraries


Other libraries, such as those that use thread-local storage (TLS), also interact with the
HetCompute runtime in non-intuitive ways. Remember that when HetCompute tasks execute, they stay on
the same thread until they complete execution or they arrive at a safe point. In most cases, this is not a
problem; however, when a task makes calls to libraries that use TLS, there are a number of issues, as
discussed below.
Following are examples of common TLS-aware libraries:

Xlib

The Xlib libraries are typically not thread-safe, although each implementation is different. It is not
possible to perform display operations from two threads at the same time because this could corrupt
internal data structures. While multiple threads can be used, the programmer must ensure that only one
thread can be using Xlib at any time.

UI Toolkits

User interface toolkits, such as Qt, typically have a main thread which is dedicated to processing
input events, manipulating a display, and then sleeping until more input occurs. It is important that
control is returned to the UI toolkit as soon as possible to ensure that the user experience is smooth and
uninterrupted. If a call from a different thread is made to a function that manipulates the UI or triggers
an event, the toolkit may corrupt a data structure, or detect this and generate an error message.

OpenGL

Each OpenGL implementation varies in how it can be used with multiple threads. In typical usage, you
create an OpenGL context in the thread where you intend to use it. The OpenGL library then sets
internal state information into TLS. This internal state is used so that when calls are made to OpenGL,
you do not need to pass the context around each time. However, the TLS is set for only one thread. So
if you try to make an OpenGL call from a different thread, the implementation may fail. Some
implementations allow multiple contexts, with each context being created on the thread where it will
be used. With multiple contexts, some implementations also allow parallel access to the OpenGL
library, although this support varies depending on the vendor. Hardware implementing OpenGL
typically uses some kind of command buffer, which can force a sequential ordering of commands.
Therefore, trying to implement calls to OpenGL in parallel may not provide any benefit, and may
actually slow things down due to contention on the mutex used to protect the command buffer. An
OpenGL application is typically used with some kind of user interface and event handler, which will
be running on the main thread. So it is recommended that you perform your OpenGL calls in the same
thread as the user interface.

While these are limitations that need to be taken into consideration, it is still possible to exploit parallelism
using HetCompute in these types of applications. For example, consider the case of a game with physics
simulation, where the user can click on the display to launch spheres into a room. In a sequential
implementation, the user touches the display, which generates a UI event. The UI thread wakes up and
processes the UI event, which needs to generate the new sphere in the physics simulation. The physics
simulation runs for the time required to compute the result. The location of all objects in the physics
simulation is then traversed, and OpenGL calls are made to draw the scene. The OpenGL buffers are then
swapped onto the display, and the thread goes back to sleep to wait for either a UI event, or a timeout to
refresh the display with no change.
In the previous example, the bulk of the calculations are performed in the physics engine. This is the most
computationally expensive part and is where optimization work pays off most. So HetCompute can
be used here to perform the computation in parallel, assuming the underlying implementation supports this.
The user breaks down the parts of the simulation into a suitable number of tasks, specifies dependencies,
and then launches them with HetCompute. After the tasks are launched, the main thread calls wait_for()
and blocks until the tasks have completed. In the meantime, the HetCompute thread pool begins executing
the tasks, which spreads the computational load across all available processors. When the tasks are
complete, wait_for() returns, and execution on the main thread can continue with the calls to OpenGL for
rendering. With this arrangement, the thread-sensitive operations are all performed in
one thread, guaranteeing safe use of libraries such as OpenGL and the user-interface toolkit. Many
computationally intensive OpenGL applications are written with an event loop very similar to that described
above, so these changes should be relatively simple to implement to take advantage of HetCompute.
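The frame structure described above can be sketched as follows. This is an illustration under stated assumptions: std::thread workers and join() stand in for HetCompute tasks and wait_for(), the Sphere type and both function names are hypothetical, and the draw-call counter stands in for thread-sensitive OpenGL or UI calls.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

struct Sphere { float y; float vy; };

// Advances every sphere by dt using `workers` threads and returns once all
// of them have finished (the joins play the role of wait_for()).
void simulate_step(std::vector<Sphere>& spheres, float dt, unsigned workers)
{
    assert(workers > 0);
    std::vector<std::thread> pool;
    const std::size_t n = spheres.size();
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        pool.emplace_back([&spheres, dt, begin, end] {
            for (std::size_t i = begin; i < end; ++i) {
                spheres[i].vy -= 9.81f * dt;         // apply gravity
                spheres[i].y += spheres[i].vy * dt;  // integrate position
            }
        });
    }
    for (auto& t : pool) t.join();
}

// Runs one frame: parallel physics first, then "rendering" on the calling
// thread only. Returns the number of draw calls issued.
int run_frame(std::vector<Sphere>& spheres, float dt, unsigned workers)
{
    simulate_step(spheres, dt, workers);
    int draw_calls = 0;
    for (const Sphere& s : spheres) {
        (void)s;       // a real application would issue OpenGL calls here
        ++draw_calls;
    }
    return draw_calls;
}
```

Only simulate_step touches worker threads; everything that would involve the UI toolkit or OpenGL stays on the thread that calls run_frame.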

5.10.4 Distributed Computing using HetCompute


HetCompute may be used to provide SMP concurrency in an MPI (Message Passing Interface)
application. HetCompute has been tested with MPICH2. All the caveats above about fork() and threads
apply.

5.10.5 Avoid the Use of C++ iostream and stringstream Libraries


Warning

The C++11 standard indicates that the iostream library should be thread safe. As of this writing,
programmers have experienced stability issues on some platforms, such as Android and OS X. To
maximize portability, HetCompute applications should avoid using cout and cerr to perform
asynchronous writes, especially to the console. The C-based stdio printf routines are recommended
instead. On Android, programmers have experienced additional issues with
stringstream objects.

6 Parallel Processing Tutorial

6.1 Abstract
In this tutorial, general principles of parallel programming are introduced with an emphasis on task-based
parallel programming models. Scaling is first introduced as a metric of evaluating the potential speedup that
an algorithm can obtain. Different parallel programming paradigms are introduced with a number of
optimizations for parallel code. Examples are provided to illustrate the HetCompute programming model.

6.2 Parallel Speedups


Amdahl [2] put forward an argument that the maximum speedup that can be obtained by a parallel
algorithm is bounded by the serial fraction of the program. Intuitively, even if HetCompute could execute
the parallel fraction infinitely fast (zero time), the serial fraction will determine the total execution time.
This argument, commonly known as Amdahl’s Law, can be summarized by the following equation, when
considering N parallel processors:

$$\text{ParallelSpeedup} = \frac{s + p}{s + p/N} = \frac{1}{s + p/N},$$

where s and p are the serial and parallel fractions of the program, respectively, and s + p = 1. Using
Amdahl’s law, the speedup that can be obtained with eight processors as a function of the serial fraction
is illustrated in Figure 6-1.

Figure 6-1 Theoretical speedup on eight processors using Amdahl’s Law.

Note that even if the serial fraction is only 10%, the maximum theoretical speedup achievable on eight
processors is about 4.71, that is, 1 / (0.1 + 0.9/8). In
practice, however, hardware architecture characteristics, such as caching, allow programmers to obtain
much better performance from multicore systems. Amdahl’s law expresses performance increase for
constant problem size (strong scaling). Gustafson [9] demonstrates that parallel processing can be used to
perform more work in the same amount of time by increasing the problem size, thus improving scalability.
This technique is called weak scaling. Architectural artifacts [11] also play an important role; additional
processors come with additional cache and memory resources, often enabling applications to obtain
super-linear speedup. A number of optimizations that take advantage of architectural features are discussed
in Section 6.5, Optimizations.
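As a worked illustration of the two scaling models, the sketch below evaluates Amdahl's formula for a fixed problem size next to Gustafson's scaled speedup, s + (1 - s)N, where s is the serial fraction. The function names are illustrative and are not part of HetCompute.

```cpp
#include <cassert>

// Strong scaling (Amdahl): fixed problem size, serial fraction s, n processors.
double amdahl_speedup(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    return 1.0 / (s + (1.0 - s) / n);
}

// Weak scaling (Gustafson): the problem grows with n; s is the serial
// fraction of the time measured on the parallel system.
double gustafson_speedup(double s, int n)
{
    assert(s >= 0.0 && s <= 1.0 && n >= 1);
    return s + (1.0 - s) * n;
}
```

For s = 0.1, Amdahl's bound approaches 1/s = 10 no matter how many processors are added, whereas the scaled speedup on eight processors is 0.1 + 0.9 x 8 = 7.3 and keeps growing with N.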

6.3 Parallel Programming Paradigms


When discussing parallel programming, practitioners classify the different types of parallelism loosely
following machine organizations [7]:

6.3.1 Data parallelism (SIMD)


SIMD machines include vector units, array processors, and GPUs. In this model, the program is executing
the same code on different data elements. Data parallel algorithms are typically expressed as operations on
a multi-dimensional array. Control flow is uniform; however, operations on certain elements may be
masked out. Image processing algorithms are prototypical for data parallelism. In the current version of
HetCompute, one can exploit data parallelism by using vector intrinsics [15] to target the NEON units, or
by calling OpenGL functions to execute on the GPU. Future versions of HetCompute will support SIMD
compute on the GPU as part of the programming model. The code example below shows a simple
scalar-vector multiply-add (SAXPY, Y += a * X) using the HetCompute pfor_each pattern (see Parallel
Programming Patterns) and vector operations.
#include <arm_neon.h> // NEON intrinsics (ARM targets only)
#include <cassert>
#include <cstring>    // std::memcpy
// The HetCompute SDK header must also be included.

void saxpy(float* y, float a, float* x, int n) { // Y += a * X
    // for simplicity, only vector sizes that are multiples of 4 are supported
    assert(n % 4 == 0);

    float32x4_t av = vmovq_n_f32(a); // initialize all lanes to a

    // implicitly waits for the loop to complete;
    // the strided range creates an iteration for each vector op
    hetcompute::pfor_each(hetcompute::range<1>(0, n, 4),
        [&av, x, y](const hetcompute::index<1>& i) {
            float32x4_t xv, yv;
            auto sz = 4 * sizeof(float);
            std::memcpy(&xv, &x[i[0]], sz); // initialize the vector regs
            std::memcpy(&yv, &y[i[0]], sz);
            yv = vmlaq_f32(yv, av, xv);     // yv += a * x[i];
            std::memcpy(&y[i[0]], &yv, sz); // copy result back into y[i]
        });
}

6.3.2 Task parallelism (MIMD)


MIMD machines are multiprocessors. In this model, different hardware execution contexts (e.g., threads,
cores, processors) execute different code on different data elements. Tasks are either independent or
cooperate on processing over a shared data structure. Thus, tasks may have control or data dependencies.
The irregular structure of the computation complicates the handling of dependencies and synchronization.
Typical applications include physical simulations, computer simulations, browsers [5], etc. Task
parallelism is supported in HetCompute, and examples are provided in the HetCompute User’s Manual
[16].
In the example below, a simple parallel depth first traversal is demonstrated for an n-ary tree data structure
(each node has a variable number of children stored in the children collection). The hetcompute::pfor_each
construct will recursively launch tasks for each of the children of a node. Examples with more complicated
task graphs are provided in Unleashing Asynchrony.
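// Assumes root->children holds node* elements and that childIterator is the
// iterator type of that collection.
void depth_first_traversal(node* root)
{
    // process all the children of the node in parallel
    hetcompute::pfor_each(root->children.begin(), root->children.end(), [](childIterator it) {
        depth_first_traversal(*it);
    });
    // process the node
    root->mark_as_traversed();
}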
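For readers who want to experiment with the traversal without the SDK, the same children-first recursion can be sketched with standard C++ only. Here std::async stands in for the tasks that pfor_each would launch; the Node type, the traverse function, and the parallel_depth cut-off are assumptions made for this illustration.

```cpp
#include <cassert>
#include <future>
#include <memory>
#include <vector>

struct Node {
    std::vector<std::unique_ptr<Node>> children;
    bool traversed = false;
};

// Visits the subtree rooted at `root`: children first, then the node itself.
// The top `parallel_depth` levels visit their children concurrently; deeper
// levels recurse serially to bound the number of threads.
// Returns the number of nodes visited.
int traverse(Node& root, int parallel_depth)
{
    assert(parallel_depth >= 0);
    int visited = 0;
    if (parallel_depth > 0) {
        std::vector<std::future<int>> pending;
        for (auto& child : root.children) {
            Node* c = child.get();
            pending.push_back(std::async(std::launch::async, [c, parallel_depth] {
                return traverse(*c, parallel_depth - 1);
            }));
        }
        for (auto& f : pending) visited += f.get();  // wait for every child
    } else {
        for (auto& child : root.children) visited += traverse(*child, 0);
    }
    root.traversed = true;  // process the node after its children
    return visited + 1;
}
```

The depth cut-off is one simple way to control task granularity: subtrees below it are handled by a single thread.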

6.3.3 Braided parallelism


Modern machines combine CPUs and GPUs for heterogeneous general-purpose computation. Recently,
there has been a significant increase in GPGPU (general purpose GPU) programming. The braided
parallelism model combines task parallel computation with data parallel execution [8]. This unified model
is used to dynamically exploit data parallelism on SIMD units and GPUs from within concurrent tasks
executing on the MIMD units. Examples include gaming applications, which have many concurrent tasks
(physics, AI, UI) that are composed from data-parallel computations, such as particle simulations, image
processing and rendering.
The following example (in pseudocode) shows how HetCompute can execute tasks on CPU, GPU, and DSP
at the same time. Additional examples are available in Kernels: The Path to Heterogeneity.


// Create a set of input buffers for kernels


auto buf_a = hetcompute::create_buffer(1024);
auto buf_b = ...; auto buf_c = ...;
auto g = hetcompute::create_group(); // aggregate of bound kernels/tasks

// Create a CPU kernel that takes an int as an argument


auto ck = hetcompute::create_cpu_kernel<>([](int){...});
// Bind 42 as the argument to kernel; executes on CPU
g->launch(ck, 42);

// Create a GPU kernel that takes three buffer arguments


auto gk = hetcompute::create_gpu_kernel<hetcompute::buffer_ptr<float>,
hetcompute::buffer_ptr<float>,
hetcompute::buffer_ptr<float>>
(vadd_kernel_string, "vadd");
// Bind the buffer arguments and execute on GPU
g->launch(gk, hetcompute::range<1>(1024), buf_a, buf_b, buf_c);

// Create a Qualcomm Hexagon DSP kernel with a buffer and a long ptr parameter
auto hk = hetcompute::create_dsp_kernel<hetcompute::buffer<float>, long *>(math_sum);
long sum;
// Bind arguments and execute on DSP
g->launch(hk, buf_a, &sum);
g->wait_for();

6.3.4 Pipeline parallelism or Streaming


Distributed processor architectures, such as IBM’s Cell processor [12], are composed of separate hardware
contexts and are designed to exploit small working sets and/or algorithms with little locality.
Computation organized as pipeline stages allows code to reside on each unit while the data is streamed across.
The pattern of dependencies is fixed, simplifying the parallel execution. Tuning and balancing is
complicated by the predefined computation structure, and may require significant reorganization. Many
algorithms are amenable to pipeline parallelism when partitioned. Examples include computer vision
algorithms, search, etc. HetCompute supports the pipeline pattern, and examples illustrating this pattern
are shown in Pipeline.
HetCompute supports all the models described above, which includes data parallel through the use of
hetcompute::pfor_each and GPU tasks, task parallel using tasks and groups, braided parallelism by
mapping tasks to different execution units, and pipeline parallelism using the hetcompute::pipeline
pattern. Programmers are encouraged to think in terms of patterns. If the algorithm matches one of the
parallel patterns, the path to speedup is short. Otherwise, task graphs are easy to construct in
HetCompute, and they can express essentially any parallel computation. The HetCompute User Guide
provides a detailed description of the HetCompute parallelization constructs, their design philosophy, and
best practices.
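The streaming idea behind pipeline parallelism can be sketched without the SDK as two stages connected by a queue. This is not the hetcompute::pipeline API; the Channel class and run_pipeline function are assumptions for this illustration, built only on the C++ standard library.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <queue>
#include <thread>
#include <vector>

// Minimal thread-safe queue connecting two pipeline stages.
class Channel {
public:
    void push(int v) {
        { std::lock_guard<std::mutex> lock(m_); q_.push(v); }
        cv_.notify_one();
    }
    void close() {
        { std::lock_guard<std::mutex> lock(m_); closed_ = true; }
        cv_.notify_all();
    }
    // Blocks until an item is available; returns false once the channel is
    // closed and drained.
    bool pop(int& out) {
        std::unique_lock<std::mutex> lock(m_);
        cv_.wait(lock, [this] { return !q_.empty() || closed_; });
        if (q_.empty()) return false;
        out = q_.front();
        q_.pop();
        return true;
    }
private:
    std::mutex m_;
    std::condition_variable cv_;
    std::queue<int> q_;
    bool closed_ = false;
};

// Stage 1 squares each input; stage 2 adds one. Items stream from one stage
// to the next while both stages run concurrently.
std::vector<int> run_pipeline(const std::vector<int>& input)
{
    Channel ch;
    std::vector<int> output;
    std::thread stage1([&] {
        for (int v : input) ch.push(v * v);
        ch.close();
    });
    std::thread stage2([&] {
        int v;
        while (ch.pop(v)) output.push_back(v + 1);
    });
    stage1.join();
    stage2.join();
    assert(output.size() == input.size());
    return output;
}
```

Because the dependency pattern is fixed (stage 2 always consumes what stage 1 produced), no synchronization is needed beyond the queue itself.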

6.4 Parallel Programming Patterns


Parallel patterns are to parallel programming what standard containers are to the shipping industry. If your
algorithm fits into a parallel pattern, it can be quickly packaged, and you can readily exploit multiple
resources on your platform. The Parallel Programming Patterns section describes the HetCompute parallel
patterns, and [14] is a good resource to learn more about parallel programming using patterns.
Moreover, the HetCompute patterns are built upon the basic APIs to create and manage tasks. Therefore,
your own patterns can be built and they will seamlessly interoperate with the rest of the HetCompute
runtime system. Patterns can also be composed with other patterns. For example, a stage in a
hetcompute::pipeline pattern may contain any of the other HetCompute patterns or other
HetCompute tasks.


6.5 Optimizations
Other than algorithmic decomposition of work and data, a parallel program requires tuning to a specific
platform to achieve optimal performance. As a general rule, the following process should be used:

• Serial tuning: The code executed by each task should be optimized using classical optimization
techniques: loop optimizations, strength reduction, and cache locality optimizations.
• Synchronization tuning: Coordinating parallel execution is typically considered overhead – the
program executes additional instructions that are not necessarily part of the effective work. Such
overhead includes serialization in critical sections, waiting for dependencies to be satisfied and/or
condition variables to be signaled, etc. A well-tuned parallel program spends most of its time
executing work as opposed to managing work. However, it may be necessary to replicate
computation in order to minimize synchronization. Using HetCompute asynchronous patterns is one
way of avoiding synchronization.
• Parallel efficiency: A parallel execution is optimal when all execution units are equally busy,
performing minimal redundant work. Therefore, it is important to balance the computation across all
processors. This can be achieved by combining algorithmic decomposition (finer-grain tasks
allow better load balancing) with attention to architectural characteristics, such as resource
sharing and the overhead of spawning tasks (coarser-grain tasks typically incur less overhead). In
HetCompute, one can tune a number of parameters to control the granularity of tasks that execute the
pattern.
These topics are discussed briefly in the next sections.

6.5.1 Cache locality


There are two types of memory reference locality: temporal locality, in which program references to a given
memory address are clustered together in time, and spatial locality, in which program references to
neighboring addresses are clustered in time [10]. Caches transparently take advantage of both types of
locality: replacement policies exploit temporal locality, while wide cache lines and prefetching techniques
exploit spatial locality. Moreover, current architectures provide several levels of caches, with different
sharing patterns. For example, level one caches (L1) are typically split between instructions and data, and
are private to a core (shared by the hyperthreads in the case of an SMT architecture), and level two caches
(L2) are shared by multiple cores.
Although programmers do not have direct control over caching, code and data can be structured to improve
the locality of reference, thus making effective use of cache mechanisms [18]. In a multicore system,
caches are a shared resource, thus programmers should consider the following:

• Consistency: Most multicore shared memory systems provide hardware coherency [10]. However,
architectures implement different consistency models [1], thereby affecting the way shared memory
updates are visible to different threads. In particular, the ARM architecture defines a weak memory
consistency model. The C++11 standard defines primitives to enforce the ordering of memory
operations for all atomic accesses. Experienced programmers can exploit orderings weaker than
sequential consistency to obtain better performance on such systems.
• False sharing: False sharing [19] arises when independent data items used by two tasks executing
concurrently on two different cores are co-located in the same cache line. Because the unit of
coherence is the cache line, if the items are accessed by both tasks, the line will be forced by the
coherence protocol to bounce between caches. False sharing can be avoided by separating data items
accessed by different concurrent tasks into separate cache lines, using techniques such as padding
[17] and/or allocation to cache line boundaries. To improve locality of reference and limit memory
fragmentation, programmers should group data items accessed by a single task as close as possible,
preferably in contiguous blocks of memory addresses.
• Cache interference: In serial applications, cache optimizations are tuned to the entire cache.
However, in a parallel application, caches are shared by execution units. A carefully tuned parallel
program should maximize the utilization of the cache by ensuring reuse of truly shared data. For
example, by maintaining a single copy of read-only shared data, and referencing it simultaneously,
one will exploit temporal locality and minimize the amount of cache used. To minimize contention
and interference, the working set sizes of the tasks should fit in the cache. Tiling and cache blocking
[6], [20] parameters must be tuned with the capacity of the shared caches in mind.
Many other cache locality optimizations are described in the literature.
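The padding technique mentioned under false sharing can be sketched in standard C++ with alignas. The 64-byte line size is an assumption (it varies by processor), and the names PaddedCounter, kMaxWorkers, and count_in_parallel are illustrative.

```cpp
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

constexpr std::size_t kCacheLine = 64;  // assumed cache line size in bytes
constexpr unsigned kMaxWorkers = 8;

// Aligning the struct to the line size also pads it to that size, so each
// counter occupies its own cache line.
struct alignas(kCacheLine) PaddedCounter {
    long value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per cache line");

// Every worker increments only its own counter; returns the grand total.
long count_in_parallel(unsigned workers, long increments_per_worker)
{
    assert(workers > 0 && workers <= kMaxWorkers);
    PaddedCounter counters[kMaxWorkers];  // stack storage honors alignas
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&counters, w, increments_per_worker] {
            for (long i = 0; i < increments_per_worker; ++i) ++counters[w].value;
        });
    }
    for (auto& t : pool) t.join();
    long total = 0;
    for (unsigned w = 0; w < workers; ++w) total += counters[w].value;
    return total;
}
```

With an unpadded array of long, up to eight counters would share one 64-byte line and every increment could force that line to move between cores.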

6.5.2 Minimizing wait time and synchronization


As mentioned, efficient parallel execution implies balanced execution of non-redundant work on all
computing units, while minimizing management overhead. As shown in Parallel Speedups, the fraction of
serial execution limits the effectiveness of the parallel application (Amdahl’s law). Therefore, programmers
should carefully consider different factors that serialize execution, including:

• Avoid waiting for single tasks: Long chains of dependencies, and frequent waits for the results of
single tasks, limit the level of parallelism available in applications [13]. HetCompute groups can be
used to wait for sets of tasks, thus potentially minimizing the overall amount of stalling.
• Data synchronization: Synchronizing shared memory accesses may introduce considerable
serialization or cache conflict overhead. Such overhead can be reduced by the following
optimizations:
– Privatize data [3] – mutually exclusive partitioning of shared data. For example, partitioning an
image into tiles, where each task works on a different tile. In cases where the partitioning is not
obvious, programmers can copy shared data into private buffers, work on the private data, and then
synchronize changes to the shared copy. Parallel reductions, and parallel gather and scatter
operations are helpful in reshaping the private and shared data formats.
– Avoid large critical sections – Because critical sections guarantee mutual exclusion, they serialize
the execution of tasks that are accessing these areas. Minimizing the time spent in critical
sections, in particular, when they are highly contended, will reduce the synchronization overhead.
– Use atomic operations – Atomic operations with the appropriate memory ordering further reduce the
synchronization overhead by relying on hardware capabilities for efficient shared data accesses.
HetCompute encourages an asynchronous programming style. Besides providing parallel programming
patterns with asynchronous semantics, the execution model of HetCompute is one in which fine-grained
tasks are placed in a dependence graph, and thus minimizes the need for waits. By contrast, fork-join
models spawn a large set of work which needs to complete before the control flows from the join.
Asynchronous concurrency is also preferable in the case of heterogeneous computing because resources
need not be blocked waiting for an off-load device to complete the work.
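Two of the optimizations listed above, data privatization and atomic operations with a relaxed memory ordering, can be combined in a short standard C++ sketch. The function name parallel_sum is illustrative, and std::thread stands in for HetCompute tasks.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Sums `data` with `workers` threads. Each worker accumulates into a private
// local variable and touches the shared total exactly once.
long parallel_sum(const std::vector<int>& data, unsigned workers)
{
    assert(workers > 0);
    std::atomic<long> total{0};
    std::vector<std::thread> pool;
    const std::size_t n = data.size();
    for (unsigned w = 0; w < workers; ++w) {
        const std::size_t begin = n * w / workers;
        const std::size_t end = n * (w + 1) / workers;
        pool.emplace_back([&data, &total, begin, end] {
            long local = 0;  // privatized accumulator, no sharing in the loop
            for (std::size_t i = begin; i < end; ++i) local += data[i];
            // Relaxed ordering suffices: only the final value matters, and
            // join() below orders every update before the read.
            total.fetch_add(local, std::memory_order_relaxed);
        });
    }
    for (auto& t : pool) t.join();
    return total.load(std::memory_order_relaxed);
}
```

The shared variable is updated once per worker rather than once per element, which keeps the serialized portion of the computation small.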


6.5.3 Load balancing


Serialization is one of the potential pitfalls of parallel programming. Another is the under-utilization of
compute units. If the parallel computation is unbalanced, some processors will be idle, thereby cutting into
the potential performance gains. To avoid such scenarios, programmers should pay attention to balancing
the work. This can be achieved in several ways, the most popular being:

• Tuning the task granularity: Task granularity represents the amount of work in a task. Ideally, if the
amount of work is known, one can balance the computation manually. However, this is not the case
for irregular applications, in which case, overdecomposition and relying on the HetCompute runtime
dynamic scheduling is a better option. Task granularity also plays an important role in managing the
overhead. As task granularity decreases, the overhead of managing the parallel execution becomes a
larger fraction of the total time. Therefore, coarser tasks are preferred to minimize the overhead. This
is an important balance that the programmers need to weigh. HetCompute makes it easy to explore
these trade-offs by providing a set of flexible APIs to create tasks.
• Overdecomposition: Overdecomposition is the mechanism by which programmers ensure that there
is enough parallel work in the system, so that the runtime always has work to schedule.
Overdecomposition is defined as creating more tasks than the number of computation units available,
such that if a task blocks or waits for dependencies to be satisfied, other independent tasks continue to
make progress. The more independent tasks are provided, the better the load balancing that can be
achieved. Of course, one needs to take into consideration the task granularity and manage the
overhead.
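The interplay of granularity and overdecomposition can be explored with a small sketch in which idle workers claim the next chunk of iterations from a shared counter. This only mimics dynamic scheduling with standard C++; it is not how the HetCompute runtime is implemented, and parallel_for_chunks and its grain parameter are hypothetical names.

```cpp
#include <algorithm>
#include <atomic>
#include <cassert>
#include <cstddef>
#include <thread>
#include <vector>

// Splits [0, n) into chunks of `grain` iterations. Each worker repeatedly
// claims the next unclaimed chunk, so a worker that finishes early simply
// takes more chunks. Returns the number of chunks that were executed.
template <typename Fn>
std::size_t parallel_for_chunks(std::size_t n, std::size_t grain, unsigned workers, Fn fn)
{
    assert(grain > 0 && workers > 0);
    std::atomic<std::size_t> next{0};
    std::atomic<std::size_t> chunks{0};
    std::vector<std::thread> pool;
    for (unsigned w = 0; w < workers; ++w) {
        pool.emplace_back([&] {
            for (;;) {
                const std::size_t begin = next.fetch_add(grain);
                if (begin >= n) break;
                const std::size_t end = std::min(begin + grain, n);
                for (std::size_t i = begin; i < end; ++i) fn(i);
                chunks.fetch_add(1);
            }
        });
    }
    for (auto& t : pool) t.join();
    return chunks.load();
}
```

A small grain yields many more chunks than workers (overdecomposition) and better balance at the cost of more atomic operations; a large grain reduces that overhead but can leave workers idle near the end.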

6.6 Conclusions
Parallel programming is fun and intellectually challenging. There are many factors that come into play
when building a parallel application, which may not be obvious. The techniques described in this tutorial
will help you reach the main goal of parallel programming — speeding up the execution of the application.
HetCompute is designed to ease this task and provide abstractions that make it convenient to express
parallel computation. The hard work of creating a parallel algorithm remains; however, HetCompute and
these techniques will help you encode these algorithms into an efficient solution.

7 Image Processing Tutorial

7.1 Abstract
The goal of this tutorial is to illustrate how to use the HetCompute programming model to process images
using task parallelism and shared memory.

7.2 Image Processing Filter


As an example, the non-local means (NL-means) image denoising algorithm [4] is explored. In this
algorithm, the estimated value of a pixel is computed as a weighted average of all pixels in the image. The
weights depend on the similarity between pairs of pixels, which is defined as a decreasing
function of the weighted Euclidean distance between their neighborhoods. Pixels with a similar gray level
neighborhood have, on average, larger weights. For a practical computational algorithm, the search for
similar windows is restricted to an SxS search window. In [4], a 21x21 search window with a 7x7 similarity
square neighborhood is considered robust enough to denoise, while taking care of the finer details.
The weights are computed using the following equation:

$$w(i, j) = \frac{1}{Z(i)}\, e^{-\frac{\|v(N_i) - v(N_j)\|_{2,a}^{2}}{h^{2}}},$$

where Z(i) is the normalizing constant:


$$Z(i) = \sum_{j} e^{-\frac{\|v(N_i) - v(N_j)\|_{2,a}^{2}}{h^{2}}}$$

Pseudo-code for implementing this algorithm is shown below.


#define SEARCH_WINDOW_SIZE 21
#define SIMILARITY_WINDOW_SIZE 7

void compute_weights(Pixel* __restrict input, const Point& point,
                     int weights[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE])
{
    // compute similarity using Euclidean distance in the similarity window,
    // using the equation above.
    // reads from the input, writes into the array weights
}

void denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) { // iterate through all pixel points in the input image
            Point point(x, y);
            // compute weights for pixel points in the search window
            int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
            compute_weights(input, point, w);
            // denoise: compute the weighted average for this pixel point
            for (int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
                for (int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
                    Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
                                   point.y - SEARCH_WINDOW_SIZE / 2 + j);
                    output[point] += w[i][j] * input[neighbor];
                }
            }
        }
    }
}
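For experimentation, the pseudo-code can be turned into a small self-contained serial filter. The sketch below makes several simplifying assumptions that are not taken from this guide: the image is a row-major array of float gray levels, coordinates are clamped at the borders, the patch distance is unweighted (the Gaussian kernel a is omitted), and the names Image, patch_distance2, denoise_pixel, and nl_means are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct Image {
    int width, height;
    std::vector<float> px;  // row-major gray levels
    // Reads a pixel, clamping coordinates to the image border.
    float at(int x, int y) const {
        x = std::min(std::max(x, 0), width - 1);
        y = std::min(std::max(y, 0), height - 1);
        return px[static_cast<std::size_t>(y) * width + x];
    }
};

// Squared Euclidean distance between the patches centred at (ax, ay) and (bx, by).
float patch_distance2(const Image& im, int ax, int ay, int bx, int by, int radius)
{
    float d2 = 0.0f;
    for (int dy = -radius; dy <= radius; ++dy)
        for (int dx = -radius; dx <= radius; ++dx) {
            const float d = im.at(ax + dx, ay + dy) - im.at(bx + dx, by + dy);
            d2 += d * d;
        }
    return d2;
}

// Weighted average over the search window, with weights exp(-d2 / h^2)
// normalized by their sum z.
float denoise_pixel(const Image& im, int x, int y, int search_radius, int patch_radius, float h)
{
    assert(h > 0.0f);
    float z = 0.0f, acc = 0.0f;
    for (int j = -search_radius; j <= search_radius; ++j)
        for (int i = -search_radius; i <= search_radius; ++i) {
            const float d2 = patch_distance2(im, x, y, x + i, y + j, patch_radius);
            const float w = std::exp(-d2 / (h * h));
            z += w;
            acc += w * im.at(x + i, y + j);
        }
    return acc / z;  // z >= 1 because the centre pixel has weight 1
}

// Serial driver: every output pixel depends only on the input image.
Image nl_means(const Image& in, int search_radius, int patch_radius, float h)
{
    Image out{in.width, in.height, std::vector<float>(in.px.size())};
    for (int y = 0; y < in.height; ++y)
        for (int x = 0; x < in.width; ++x)
            out.px[static_cast<std::size_t>(y) * in.width + x] =
                denoise_pixel(in, x, y, search_radius, patch_radius, h);
    return out;
}
```

A search radius of 10 and a patch radius of 3 correspond to the 21x21 and 7x7 windows cited from [4]; smaller radii are convenient for quick tests.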

7.3 Parallel Image Processing using HetCompute


The denoising algorithm presented above is embarrassingly parallel. This implementation is not in-place
(the output is written to a separate image); therefore, each pixel can be processed in
parallel with all other pixels.

7.3.1 Naive Parallelization


Given this algorithm, a naive implementation will simply parallelize the loop nest in denoise_image,
creating a task for each pixel, and launching it asynchronously:
void denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
    auto g = hetcompute::create_group("denoise");

    for (int y = 0; y < height; y++) {
        for (int x = 0; x < width; x++) { // iterate through all pixel points in the input image
            g->launch([=] { // one task per pixel
                Point point(x, y);
                // compute weights for pixel points in the search window
                int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
                compute_weights(input, point, w);
                // denoise: compute the weighted average for this pixel point
                for (int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
                    for (int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
                        Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
                                       point.y - SEARCH_WINDOW_SIZE / 2 + j);
                        output[point] += w[i][j] * input[neighbor];
                    }
                }
            });
        }
    }
    g->wait_for(); // wait for all the tasks to complete
}

While such a parallelization strategy is very simple and easy to implement in HetCompute, the performance
of such an implementation may not be optimal, for several reasons, as discussed in the Parallel Processing
Tutorial. In particular, this implementation is too fine-grained to overcome the parallel overhead and does
not exploit cache locality.

7.3.2 Tiling for Parallelization


A simple method to coarsen the granularity of tasks is to tile the image and spawn tasks for each tile. This
can be done by either tiling the loop directly:
void denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
    // assumes width and height are multiples of the tile dimensions
    auto g = hetcompute::create_group("denoise");
    for (int y = 0; y < height; y += TILE_SIZE_ROW) {
        for (int x = 0; x < width; x += TILE_SIZE_COL) { // iterate through all tiles in the input image
            g->launch([=] { // one task per tile
                for (int ty = y; ty < y + TILE_SIZE_ROW; ty++) { // iterate through pixel points in the tile
                    for (int tx = x; tx < x + TILE_SIZE_COL; tx++) {
                        Point point(tx, ty);
                        // compute weights for pixel points in the search window
                        int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
                        compute_weights(input, point, w);
                        // denoise: compute the weighted average for this pixel point
                        for (int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
                            for (int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
                                Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
                                               point.y - SEARCH_WINDOW_SIZE / 2 + j);
                                output[point] += w[i][j] * input[neighbor];
                            }
                        }
                    }
                }
            });
        }
    }
    g->wait_for(); // wait for all the tasks to complete
}

or by restructuring the code so that a denoise kernel identical to the serial implementation is preserved,
and parallelizing its invocation:
// width and height are the dimensions of the tile that starts at (startX, startY)
void denoise_kernel(Pixel* __restrict input, int startX, int width, int startY, int height, Pixel* output)
{
    for (int y = startY; y < startY + height; y++) {
        for (int x = startX; x < startX + width; x++) { // iterate through all pixel points in the input tile
            Point point(x, y);
            // compute weights for pixel points in the search window
            int w[SEARCH_WINDOW_SIZE][SEARCH_WINDOW_SIZE];
            compute_weights(input, point, w);
            // denoise: compute the weighted average for this pixel point
            for (int i = 0; i < SEARCH_WINDOW_SIZE; i++) {
                for (int j = 0; j < SEARCH_WINDOW_SIZE; j++) {
                    Point neighbor(point.x - SEARCH_WINDOW_SIZE / 2 + i,
                                   point.y - SEARCH_WINDOW_SIZE / 2 + j);
                    output[point] += w[i][j] * input[neighbor];
                }
            }
        }
    }
}
void denoise_image(Pixel* __restrict input, int width, int height, Pixel* output)
{
    // initialization, etc
    auto g = hetcompute::create_group("denoise");
    for (int y = 0; y < height; y += TILE_SIZE_ROW) {
        for (int x = 0; x < width; x += TILE_SIZE_COL) {
            g->launch([=] { // one task per tile
                denoise_kernel(input, x, TILE_SIZE_COL, y, TILE_SIZE_ROW, output);
            });
        }
    }
    g->wait_for(); // wait for all the tasks to complete
}
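The tiled drivers above step through the image in whole tiles, which is only safe when the image dimensions are multiples of the tile size. One way to handle other sizes is to clip the tiles on the right and bottom edges, as in the following sketch. The Tile type and make_tiles function are assumptions for this illustration and are not part of HetCompute.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Tile { int x, y, width, height; };

// Enumerates tiles covering a width x height image. Tiles on the right and
// bottom edges are clipped so that no tile extends past the image.
std::vector<Tile> make_tiles(int width, int height, int tile_cols, int tile_rows)
{
    assert(width >= 0 && height >= 0 && tile_cols > 0 && tile_rows > 0);
    std::vector<Tile> tiles;
    for (int y = 0; y < height; y += tile_rows)
        for (int x = 0; x < width; x += tile_cols)
            tiles.push_back({x, y,
                             std::min(tile_cols, width - x),
                             std::min(tile_rows, height - y)});
    return tiles;
}
```

Each resulting tile could then be passed to a kernel with the same parameter order as denoise_kernel, for example (input, t.x, t.width, t.y, t.height, output).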


7.3.3 Parallelization using patterns


The tiling optimization takes care of locality, because each task computes at a coarser granularity.
However, as you may notice, the driver launches a task for every tile and then waits for all of them to
finish. While HetCompute is efficient in launching many tasks, it also has a pattern API that provides better
load balancing and management of resources. In particular, for this example the hetcompute::pfor_each
construct is used.
The kernel remains the same, but the driver function is rewritten:
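void denoise_image(Pixel * __restrict input, int width, int height, Pixel *output)
{
  // initialization, etc
  hetcompute::range<2> r(0, width, TILE_SIZE_COL, 0, height, TILE_SIZE_ROW);
  // r is captured by reference; this is safe because pfor_each returns only after all iterations complete
  hetcompute::pfor_each(r, [input, output, &r] (size_t index) {
    hetcompute::index<2> idx = r.linear_to_index(index);
    denoise_kernel(input, idx[0], TILE_SIZE_COL, idx[1], TILE_SIZE_ROW, output);
  });
}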

In analyzing the code:

• hetcompute::range<2> defines a 2-dimensional range over [0, width) x [0, height), with a stride of
TILE_SIZE_COL in the first dimension and TILE_SIZE_ROW in the second. Each dimension can have a
different stride. This range is used as the iteration space for the parallel loop.
• The hetcompute::pfor_each() construct expresses the fact that this is a parallel loop, that is, all iterations can
execute concurrently. A lambda expression is passed that encapsulates the kernel (just as a lambda was
passed to each task launched in the previous section). In this case, the lambda takes one argument, the
linear index in the range. The hetcompute::pfor_each pattern allows HetCompute to aggregate tasks as
appropriate for the platform.
• hetcompute::index<2> defines a 2-dimensional index. HetCompute ranges know how to iterate to the
appropriate points. The linear_to_index call returns an object that holds the coordinates in each
dimension of the range. These can be accessed directly using the [] operator on the object.
• Note that there is no hetcompute::wait_for() at the end of the loop. hetcompute::pfor_each is a synchronous
operation and waits internally for all tasks in the iteration to complete. HetCompute also provides an
asynchronous version of pfor_each, namely hetcompute::pfor_each_async(), which should be used when
the caller can proceed without the data produced in the loop, for example, to overlap
the denoising of several images. See the HetCompute reference manual for the
hetcompute::pfor_each_async() specification.
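To make the relationship between the linear index and the 2-dimensional coordinates concrete, here is a minimal sketch of one possible mapping for a strided range. Range2D is a hypothetical stand-in, not the HetCompute type, and the ordering (dimension 0 varies fastest) is an assumption; the ordering used by hetcompute::range is not stated in this section.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for a strided 2-D range; not the HetCompute type.
struct Range2D {
    std::size_t begin0, end0, stride0;
    std::size_t begin1, end1, stride1;

    // Number of points visited along one dimension: ceil((end - begin) / stride).
    static std::size_t count(std::size_t b, std::size_t e, std::size_t s) {
        assert(s > 0 && e >= b);
        return (e - b + s - 1) / s;
    }

    // Total number of points in the iteration space.
    std::size_t size() const {
        return count(begin0, end0, stride0) * count(begin1, end1, stride1);
    }

    // Assumed ordering: dimension 0 varies fastest.
    std::array<std::size_t, 2> linear_to_index(std::size_t linear) const {
        assert(linear < size());
        const std::size_t n0 = count(begin0, end0, stride0);
        return {{begin0 + (linear % n0) * stride0,
                 begin1 + (linear / n0) * stride1}};
    }
};
```

With a 10 x 5 image and 4 x 4 tiles, this model yields six points, and each point is the top-left corner of one tile.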

8 Point Kernels (Beta feature)

A Point Kernel, combined with the hetcompute::pfor_each pattern (see Parallel Iteration), is
automatically scheduled for heterogeneous execution across the CPU, GPU, and DSP by the HetCompute
runtime. A Point Kernel is written in C99 with some minor restrictions. The programmer is encouraged to
use the hetcompute::pattern::tuner to experiment with and set the distribution of workload
across the CPU, GPU, and DSP. For example,
hetcompute::pattern::tuner().set_cpu_load(30).set_gpu_load(50).set_dsp_load(20) instructs the HetCompute
runtime to partition the range of iterations such that 30% of the iterations are assigned to the CPU, 50% to
the GPU, and the remaining 20% to the DSP. In the present release, there are a few constraints on the code
inside a Point Kernel:

1. The pfor_each with a Point Kernel shall write to only one output buffer.
2. Each iteration i of the pfor_each range shall write only to index i of the output buffer.
A Point Kernel captures the operations performed at a point in an iteration space. For example, a vector-add
point kernel computes the sum of two vector elements A[i] and B[i] and stores the result in C[i], for every
point i in a hetcompute::range of iterations. In contrast to OpenCL kernels, which can synchronize
across work-items in a work-group using, e.g., barriers, a Point Kernel captures pure data-parallelism:
no two points in the iteration space can synchronize with each other during a kernel’s execution, and all
synchronization is deferred until after the kernel’s execution. In practice, this is not a significant
limitation, as several algorithms in multiple domains such as image processing, video encoding, and
simultaneous localization and mapping can be expressed as Point Kernels.
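As an illustration of what a 30/50/20 split means for a concrete iteration count, the following sketch divides n iterations into three shares by integer percentage. It is only an arithmetic model: split_iterations is a hypothetical helper, and how the HetCompute runtime actually rounds or orders the partitions is not described here.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: split `n` iterations into CPU, GPU, and DSP shares
// according to integer percentages that must sum to 100.
// Returns the number of iterations assigned to each device, in that order.
std::array<std::size_t, 3> split_iterations(std::size_t n, unsigned cpu,
                                            unsigned gpu, unsigned dsp) {
    assert(cpu + gpu + dsp == 100);
    const std::size_t cpu_n = n * cpu / 100;
    const std::size_t gpu_n = n * gpu / 100;
    // The last device absorbs any rounding remainder so no iteration is lost.
    const std::size_t dsp_n = n - cpu_n - gpu_n;
    return {{cpu_n, gpu_n, dsp_n}};
}
```

Because each point writes only its own output index, the three shares can be processed independently without any synchronization between devices.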

9 Patterns Reference API

The Qualcomm HetCompute parallel patterns API provides programmers with a high-level interface to
express commonly used parallel programming idioms, such as parallel loops, parallel prefix operations,
and parallel map and reduce operations. We recommend using one of the patterns before trying
to implement custom task graphs, because the Qualcomm HetCompute runtime optimizes the execution of
patterns. You can fine-tune the execution as explained in Section Tuner.


9.1 Parallel For Loop


Classes

• class hetcompute::pattern::pfor< T1, T2 >
• class hetcompute::pattern::pfor< hetcompute::internal::pointkernel::pointkernel< RT, PKType...>, T2 >
• class hetcompute::pattern::pfor< T1, void >

Functions

• template<typename UnaryFn >
pfor< UnaryFn, void > hetcompute::pattern::create_pfor_each (UnaryFn &&fn)
• template<typename KernelTuple , typename ArgTuple >
hetcompute::pattern::pfor
< KernelTuple, ArgTuple > hetcompute::beta::pattern::create_pfor_each_helper (KernelTuple
&&ktpl, ArgTuple &&atpl)
• template<class InputIterator , typename UnaryFn >
void hetcompute::pfor_each (InputIterator first, InputIterator last, UnaryFn &&fn, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<class InputIterator , typename UnaryFn >
void hetcompute::pfor_each (InputIterator first, const size_t stride, InputIterator last, UnaryFn &&fn,
const hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<size_t Dims, typename UnaryFn >
void hetcompute::pfor_each (const hetcompute::range< Dims > &r, UnaryFn &&fn, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
Parallel version of std::for_each.

• template<class InputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::pfor_each_async (InputIterator first, InputIterator last,
UnaryFn fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<class InputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::pfor_each_async (InputIterator first, const size_t stride,
InputIterator last, UnaryFn fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<size_t Dims, typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::pfor_each_async (const hetcompute::range< Dims >
&r, UnaryFn fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<size_t Dims, typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::pfor_each_async (const hetcompute::range< Dims >
&r, const size_t stride, UnaryFn fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())


9.1.1 Class Documentation

9.1.1.1 class hetcompute::pattern::pfor

template<typename T1, typename T2>class hetcompute::pattern::pfor< T1, T2 >

Public member functions

• pfor (T1 &&ktpl, T2 &&atpl)
• void add_run ()
• uint64_t get_cpu_task_time () const
• uint64_t get_dsp_task_time () const
• uint64_t get_gpu_task_time () const
• template<size_t Dims>
void operator() (const hetcompute::range< Dims > &r, hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner().set_cpu_load(100))
• double query_dsp_profile () const
• double query_gpu_profile () const
• template<size_t Dims>
void run (const hetcompute::range< Dims > &r, hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner().set_cpu_load(100))
• void set_cpu_task_time (uint64_t ct)
• void set_dsp_profile (double dp)
• void set_dsp_task_time (uint64_t ht)
• void set_gpu_profile (double gp)
• void set_gpu_task_time (uint64_t gt)

Public Attributes

• T2 _atpl
• uint64_t _cpu_task_time
• double _dsp_profile
• uint64_t _dsp_task_time
• double _gpu_profile
• uint64_t _gpu_task_time
• T1 _ktpl
• size_t _num_runs


Friends

• template<typename KT , typename AT >
pfor< KT, AT > hetcompute::beta::pattern::create_pfor_each_helper (KT &&ktpl, AT &&atpl)
• template<size_t Dims, typename KernelTuple , typename ArgTuple , typename KernelFirst , typename... KernelRest,
typename... Args, typename Boolean , typename Buf_Tuple >
void hetcompute::internal::pfor_each (hetcompute::pattern::pfor< KernelTuple, ArgTuple >
∗const p, const hetcompute::range< Dims > &r, std::tuple< KernelFirst, KernelRest...> &klist,
hetcompute::pattern::tuner &tuner, const Boolean called_with_pointkernel, Buf_Tuple &&buf_tup,
Args &&...args)

9.1.1.2 class hetcompute::pattern::pfor< hetcompute::internal::pointkernel::pointkernel< RT, PKType...>, T2 >

template<typename RT, typename... PKType, typename T2>class hetcompute::pattern::pfor< hetcompute::internal::pointkernel::pointkernel< RT, PKType...>, T2 >

Public member functions

• pfor (pointkernel_type &pk, T2 &&atpl)
• void add_run ()
• uint64_t get_cpu_task_time () const
• uint64_t get_dsp_task_time () const
• uint64_t get_gpu_task_time () const
• template<size_t Dims>
void operator() (const hetcompute::range< Dims > &r, hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner().set_cpu_load(100))
• double query_dsp_profile () const
• double query_gpu_profile () const
• template<size_t Dims>
void run (const hetcompute::range< Dims > &r, hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner().set_cpu_load(100))
• void set_cpu_task_time (uint64_t ct)
• void set_dsp_profile (double dp)
• void set_dsp_task_time (uint64_t ht)
• void set_gpu_profile (double gp)
• void set_gpu_task_time (uint64_t gt)


Public Attributes

• T2 _atpl
• uint64_t _cpu_task_time
• double _dsp_profile
• uint64_t _dsp_task_time
• double _gpu_profile
• uint64_t _gpu_task_time
• size_t _num_runs
• pointkernel_type & _pk

Friends

• template<typename RetType , typename... PKArgs, typename ArgTuple >
pfor< hetcompute::internal::pointkernel::pointkernel< RetType, PKArgs...>, ArgTuple >
hetcompute::beta::pattern::create_pfor_each_helper
(hetcompute::internal::pointkernel::pointkernel< RetType, PKArgs...> &pk, ArgTuple &&atpl)
• template<size_t Dims, typename KernelTuple , typename ArgTuple , typename KernelFirst , typename... KernelRest,
typename... Args, typename Boolean , typename Buf_Tuple >
void hetcompute::internal::pfor_each (hetcompute::pattern::pfor< KernelTuple, ArgTuple >
∗const p, const hetcompute::range< Dims > &r, std::tuple< KernelFirst, KernelRest...> &klist,
hetcompute::pattern::tuner &tuner, const Boolean called_with_pointkernel, Buf_Tuple &&buf_tup,
Args &&...args)

9.1.1.3 class hetcompute::pattern::pfor< T1, void >

template<typename T1>class hetcompute::pattern::pfor< T1, void >

Public member functions

• pfor (T1 &&fn)
• template<typename InputIterator >
void operator() (InputIterator first, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<typename InputIterator >
void operator() (InputIterator first, const size_t stride, InputIterator last, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<size_t Dims>
void operator() (const hetcompute::range< Dims > &r, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<typename InputIterator >
void run (InputIterator first, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<typename InputIterator >
void run (InputIterator first, const size_t stride, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<size_t Dims>
void run (const hetcompute::range< Dims > &r, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())

Public Attributes

• T1 _fn

Friends

• pfor create_pfor_each (T1 &&fn)
• template<typename Fn , typename... Args>
hetcompute::task_ptr< void()> hetcompute::create_task (const pfor< Fn, void > &pf, Args
&&...args)

9.1.2 Function Documentation

9.1.2.1 template<typename UnaryFn > pfor< UnaryFn, void > hetcompute::pattern::create_pfor_each ( UnaryFn && fn )

Create a pattern object from function object fn.

Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

auto l = [](size_t) {};
auto pfor = hetcompute::pattern::create_pfor_each(l);
pfor(size_t(0), size_t(100)); // apply l to every index in [0, 100)

Parameters

fn Function object to apply to the range.

9.1.2.2 template<class InputIterator , typename UnaryFn > void hetcompute::pfor_each ( InputIterator first, InputIterator last, UnaryFn && fn, const hetcompute::pattern::tuner & t = hetcompute::pattern::tuner() )

Parallel version of std::for_each.

Applies function object fn in parallel to every iterator in the range [first, last). It has a default step size of
one.


Note

An "iterator" refers to an object that enables a programmer to traverse a container. It can be an index of
integral type or an iterator that satisfies RandomAccessIterator, such as a pointer.
In contrast to std::for_each and ptransform, the iterator itself is passed to the function, instead of
the element it refers to.

It is permissible to modify the elements of the range from fn, provided that InputIterator is a
mutable iterator.
Note

This function returns only after fn has been applied to the whole iteration range.
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.

Complexity

Exactly std::distance(first,last) applications of fn.

See Also

ptransform(InputIterator, InputIterator, UnaryFn)
abort_on_cancel()

Examples

[...]
// Parallel for-loop using indices
pfor_each(size_t(0), vin.size(),
[=,&vin] (size_t i) {
vin[i] = 2 * vin[i];
});
[...]

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
fn Unary function object to be applied.
t Qualcomm HetCompute pattern tuner object (optional).
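The calling convention described in the notes above (fn receives the iterator or index, not the element) can be illustrated with a serial reference model in standard C++. This is a sketch of the semantics only; reference_pfor_each and double_twice are hypothetical names, and nothing here runs in parallel or uses the HetCompute runtime.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference model of the calling convention only: fn receives the
// iterator (or index) itself, never the dereferenced element.
template <typename InputIterator, typename UnaryFn>
void reference_pfor_each(InputIterator first, InputIterator last, UnaryFn fn) {
    for (InputIterator it = first; it != last; ++it) {
        fn(it);
    }
}

// Double every element twice: once index-style, once with std::for_each.
std::vector<int> double_twice(std::vector<int> v) {
    reference_pfor_each(std::size_t(0), v.size(),
                        [&v](std::size_t i) { v[i] *= 2; });    // fn gets an index
    std::for_each(v.begin(), v.end(), [](int& x) { x *= 2; });  // fn gets an element
    return v;
}
```

Passing the index rather than the element lets one function object address several containers at the same position, which is the typical use in the examples in this section.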

9.1.2.3 template<class InputIterator , typename UnaryFn > void hetcompute::pfor_each ( InputIterator first, const size_t stride, InputIterator last, UnaryFn && fn, const hetcompute::pattern::tuner & t = hetcompute::pattern::tuner() )

Parallel version of std::for_each with step size.

Applies function object fn in parallel to every iterator in the range [first, last) with step size defined by the
stride parameter.


Note

The function object is applied to iterators that advance by the step size (iter += stride).

Parameters

first Start of the range to which to apply fn.
stride Step size.
last End of the range to which to apply fn.
fn Unary function object to be applied.
t Qualcomm HetCompute pattern tuner object (optional).
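As a quick check on which iterators a strided loop touches, the following serial sketch lists the indices visited for a given first, stride, and last. It assumes the half-open range [first, last) described above, so fn would be applied ceil((last - first) / stride) times; strided_indices is a hypothetical helper, not part of the API.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Serial reference model: visit first, first + stride, first + 2 * stride, ...
// while the index is still below last. Returns the visited indices.
std::vector<std::size_t> strided_indices(std::size_t first, std::size_t stride,
                                         std::size_t last) {
    assert(stride > 0);
    std::vector<std::size_t> visited;
    for (std::size_t i = first; i < last; i += stride) {
        visited.push_back(i);
    }
    return visited;
}
```

Note that last itself is never visited, and the final visited index need not be last - 1 when the stride does not divide the range length.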

9.1.2.4 template<size_t Dims, typename UnaryFn > void hetcompute::pfor_each ( const hetcompute::range< Dims > & r, UnaryFn && fn, const hetcompute::pattern::tuner & t = hetcompute::pattern::tuner() )

Instead of passing in a pair of iterators, this form accepts a hetcompute::range object. Internally, the
indices are linearized before being passed to the kernel function. It has a default step size of one.

Parameters

r Range object (1D, 2D or 3D) representing the iteration space.
fn Unary function object to be applied.
t Qualcomm HetCompute pattern tuner object (optional).

9.1.2.5 template<class InputIterator , typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::pfor_each_async ( InputIterator first, InputIterator last, UnaryFn fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::pfor_each pattern.

Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task. This API has a default step size of one.

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.
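The key point above is that the returned task is unlaunched: no iteration runs until the caller launches it. The sketch below models that contract in standard C++ only. DeferredTask and make_doubling_task are hypothetical stand-ins, not HetCompute types, and launch() here runs the work inline, whereas a real runtime would schedule it asynchronously.

```cpp
#include <cassert>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for an unlaunched task: the work is stored at
// construction and runs only when launch() is called.
class DeferredTask {
public:
    explicit DeferredTask(std::function<void()> work) : work_(std::move(work)) {}

    void launch() {
        assert(!launched_);  // this model allows a task to be launched only once
        launched_ = true;
        work_();             // a real runtime would schedule this asynchronously
    }

    bool launched() const { return launched_; }

private:
    std::function<void()> work_;
    bool launched_ = false;
};

// Build a task that doubles every element of v when it is launched.
// The caller must keep v alive until the task has run.
DeferredTask make_doubling_task(std::vector<int>& v) {
    return DeferredTask([&v] { for (int& x : v) x *= 2; });
}
```

Creating the task has no visible effect on the data; only the launch does, which is why forgetting to launch an async pattern leaves the output untouched.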


9.1.2.6 template<class InputIterator , typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::pfor_each_async ( InputIterator first, const size_t stride, InputIterator last, UnaryFn fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::pfor_each pattern (with step size).

Parameters

first Start of the range to which to apply fn.
stride Step size.
last End of the range to which to apply fn.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.

9.1.2.7 template<size_t Dims, typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::pfor_each_async ( const hetcompute::range< Dims > & r, UnaryFn fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::pfor_each pattern.

Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task. This API has a default step size of one.

Parameters

r Range object (1D, 2D or 3D) representing the iteration space.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.

9.1.2.8 template<size_t Dims, typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::pfor_each_async ( const hetcompute::range< Dims > & r, const size_t stride, UnaryFn fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::pfor_each pattern (with step size).


Parameters

r Range object (1D, 2D or 3D) representing the iteration space.
stride Step size.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.


9.2 Parallel Transformation


Classes

• class hetcompute::pattern::ptransformer< Fn >

Functions

• template<typename Fn >
ptransformer< Fn > hetcompute::pattern::create_ptransform (Fn &&fn)
• template<typename InputIterator , typename OutputIterator , typename UnaryFn >
std::enable_if<!std::is_same
< hetcompute::pattern::tuner,
typename std::remove_reference
< UnaryFn >::type >::value,
void >::type hetcompute::ptransform (InputIterator first, InputIterator last, OutputIterator d_first,
UnaryFn &&fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename BinaryFn >
std::enable_if<!std::is_same
< hetcompute::pattern::tuner,
typename std::remove_reference
< BinaryFn >::type >::value,
void >::type hetcompute::ptransform (InputIterator first1, InputIterator last1, InputIterator first2,
OutputIterator d_first, BinaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename UnaryFn >
void hetcompute::ptransform (InputIterator first, InputIterator last, UnaryFn &&fn, const
hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first, InputIterator last,
OutputIterator d_first, UnaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator , typename BinaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first1, InputIterator
last1, InputIterator first2, OutputIterator d_first, BinaryFn &&fn, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename InputIterator , typename UnaryFn >
hetcompute::task_ptr< void()> hetcompute::ptransform_async (InputIterator first, InputIterator last,
UnaryFn &&fn, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())

9.2.1 Class Documentation

9.2.1.1 class hetcompute::pattern::ptransformer


template<typename Fn>class hetcompute::pattern::ptransformer< Fn >

Public member functions

• template<typename InputIterator >
void operator() (InputIterator first, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator >
void operator() (InputIterator first, InputIterator last, OutputIterator d_first, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename InputIterator1 , typename InputIterator2 , typename OutputIterator >
void operator() (InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, OutputIterator d_first,
const hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename InputIterator >
void run (InputIterator first, InputIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())
• template<typename InputIterator , typename OutputIterator >
void run (InputIterator first, InputIterator last, OutputIterator d_first, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename InputIterator1 , typename InputIterator2 , typename OutputIterator >
void run (InputIterator1 first1, InputIterator1 last1, InputIterator2 first2, OutputIterator d_first, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())

Friends

• ptransformer create_ptransform (Fn &&fn)
• template<typename F , typename... Args>
hetcompute::task_ptr< void()> hetcompute::create_task (const ptransformer< F > &ptf, Args
&&...args)

9.2.1.1.1 Related Function Documentation

9.2.1.1.1.1 template<typename Fn > ptransformer create_ptransform ( Fn && fn ) [friend]

Create a pattern object from a function object fn.

Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

auto l = [](size_t) {};
auto ptransform = hetcompute::pattern::create_ptransform(l);
ptransform(vin.begin(), vin.end());


Parameters

fn Function object to be applied.

9.2.2 Function Documentation

9.2.2.1 template<typename Fn > ptransformer< Fn > hetcompute::pattern::create_ptransform ( Fn && fn )

Create a pattern object from a function object fn.

Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

auto l = [](size_t) {};
auto ptransform = hetcompute::pattern::create_ptransform(l);
ptransform(vin.begin(), vin.end());

Parameters

fn Function object to be applied.

9.2.2.2 template<typename InputIterator , typename OutputIterator , typename UnaryFn > std::enable_if<!std::is_same<hetcompute::pattern::tuner, typename std::remove_reference<UnaryFn>::type>::value, void>::type hetcompute::ptransform ( InputIterator first, InputIterator last, OutputIterator d_first, UnaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::transform.

Applies function object fn in parallel to every dereferenced iterator in the range [first, last) and stores the
return value in another range, starting at d_first.
Note

This function returns only after fn has been applied to the whole iteration range.
In contrast to pfor_each, arguments specifying ranges are restricted to RandomAccessIterator,
whereas pfor_each allows them to be of integral type representing indices.

Complexity

Exactly std::distance(first,last) applications of fn.

See Also

ptransform(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)
pfor_each(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)


Examples

// arr[i] == 2*vin[i]
// std::vector replaces the variable-length array, which is not standard C++
std::vector<size_t> arr(vin.size());
ptransform(begin(vin), end(vin), begin(arr),
           [=] (size_t const& v) {
             return 2*v;
           });

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
d_first Start of the destination range.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

9.2.2.3 template<typename InputIterator , typename OutputIterator , typename BinaryFn > std::enable_if<!std::is_same<hetcompute::pattern::tuner, typename std::remove_reference<BinaryFn>::type>::value, void>::type hetcompute::ptransform ( InputIterator first1, InputIterator last1, InputIterator first2, OutputIterator d_first, BinaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::transform.

Applies function object fn in parallel to every pair of dereferenced input iterators taken from the ranges
[first1, last1) and [first2, first2 + (last1 - first1)), and stores the return value in another range, starting at
d_first.
Note

This function returns only after fn has been applied to the whole iteration range.
In contrast to pfor_each, arguments specifying ranges are restricted to RandomAccessIterator,
whereas pfor_each allows them to be of integral type representing indices.

Complexity

Exactly std::distance(first1,last1) applications of fn.

See Also

ptransform(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)
pfor_each(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)

Examples

// vout[i] == vin[i] + vin[i+1]
vector<size_t> vout(vin.size()-1);
ptransform(begin(vin), begin(vin)+vout.size(),
           begin(vin)+1,
           begin(vout),
           [=] (size_t const& op1, size_t const& op2) {
             return op1 + op2;
           });


Parameters

first1 Start of the range to which to apply fn.
last1 End of the range to which to apply fn.
first2 Start of the second range to which to apply fn.
d_first Start of the destination range.
fn Binary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).
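For comparison, the same adjacent-sum computation can be written serially with std::transform, whose binary form takes the same (first1, last1, first2, d_first, fn) argument order. This is a sketch of the expected result only, not of how HetCompute executes the pattern; adjacent_sums is a hypothetical helper, and the guard for short inputs is an addition that the example above does not have.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Serial equivalent of the binary form, using std::transform:
// out[i] = in[i] + in[i + 1] for every adjacent pair.
std::vector<std::size_t> adjacent_sums(const std::vector<std::size_t>& in) {
    if (in.size() < 2) return {};  // no adjacent pairs; avoids size() - 1 underflow
    std::vector<std::size_t> out(in.size() - 1);
    std::transform(in.begin(), in.begin() + out.size(),  // [first1, last1)
                   in.begin() + 1,                       // first2
                   out.begin(),                          // d_first
                   [](std::size_t a, std::size_t b) { return a + b; });
    return out;
}
```

The second input range has no explicit end; it is read for exactly as many elements as [first1, last1) contains, so the caller must make sure that many elements exist.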

9.2.2.4 template<typename InputIterator , typename UnaryFn > void hetcompute::ptransform ( InputIterator first, InputIterator last, UnaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::transform.

Applies function object fn in parallel to every dereferenced iterator in the range [first, last).
Note

In contrast to pfor_each, the dereferenced iterator is passed to the function.
In contrast to pfor_each, arguments specifying ranges are restricted to RandomAccessIterator,
whereas pfor_each allows them to be of integral type representing indices.

It is permissible to modify the elements of the range from fn, assuming that InputIterator is a
mutable iterator.
Note

This function returns only after fn has been applied to the whole iteration range.

Complexity

Exactly std::distance(first,last) applications of fn.

See Also

pfor_each(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)

Examples

// In-place double the value for all elements in range
ptransform(begin(vin), end(vin),
           [] (size_t& v) {
             v *= 2;
           });

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
fn Unary function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).


9.2.2.5 template<typename InputIterator , typename OutputIterator , typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::ptransform_async ( InputIterator first, InputIterator last, OutputIterator d_first, UnaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::ptransform pattern.

Note

The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.

See Also

ptransform(InputIterator, InputIterator, OutputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)

Examples

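// create an async task from the ptransform pattern
// arr: start of a destination range holding at least vin.size() elements
auto t = ptransform_async(begin(vin), end(vin), arr,
                          [=] (size_t const& i) {
                            return 2 * i;
                          });
t->launch();   // the task does not run until it is launched
t->wait_for(); // block until the pattern has completed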

9.2.2.6 template<typename InputIterator , typename OutputIterator , typename BinaryFn > hetcompute::task_ptr<void()> hetcompute::ptransform_async ( InputIterator first1, InputIterator last1, InputIterator first2, OutputIterator d_first, BinaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::ptransform pattern.

Note

The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.

See Also

ptransform(InputIterator, InputIterator, InputIterator, OutputIterator, BinaryFn&&, const hetcompute::pattern::tuner&)

9.2.2.7 template<typename InputIterator , typename UnaryFn > hetcompute::task_ptr<void()> hetcompute::ptransform_async ( InputIterator first, InputIterator last, UnaryFn && fn, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::ptransform pattern.


Note

The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.

See Also

ptransform(InputIterator, InputIterator, UnaryFn&&, const hetcompute::pattern::tuner&)


9.3 Parallel Reduction


Classes

• class hetcompute::pattern::preducer< Reduce, Join >

Functions

• template<typename Reduce , typename Join >
preducer< Reduce, Join > hetcompute::pattern::create_preduce (Reduce &&r, Join &&j)
• template<typename T , class InputIterator , typename Reduce , typename Join >
T hetcompute::preduce (InputIterator first, InputIterator last, const T &identity, Reduce &&reduce,
Join &&join, hetcompute::pattern::tuner tuner=hetcompute::pattern::tuner())
• template<typename T , typename Container , typename Join >
T hetcompute::preduce (Container &c, const T &identity, Join &&join, hetcompute::pattern::tuner
tuner=hetcompute::pattern::tuner())
Perform parallel reduction on some container c.

• template<typename T , typename Iterator , typename Join >
T hetcompute::preduce (Iterator first, Iterator last, const T &identity, Join &&join,
hetcompute::pattern::tuner tuner=hetcompute::pattern::tuner())
Perform parallel reduction on range defined by iterators.

• template<typename T , class InputIterator , typename Reduce , typename Join >
hetcompute::task_ptr< T > hetcompute::preduce_async (InputIterator first, InputIterator last, const T
&identity, Reduce &&reduce, Join &&join, hetcompute::pattern::tuner
tuner=hetcompute::pattern::tuner())
• template<typename T , typename Container , typename Join >
hetcompute::task_ptr< T > hetcompute::preduce_async (Container &c, const T &identity, Join
&&join, hetcompute::pattern::tuner tuner=hetcompute::pattern::tuner())
• template<typename T , typename Iterator , typename Join >
hetcompute::task_ptr< T > hetcompute::preduce_async (Iterator first, Iterator last, const T &identity,
Join &&join, hetcompute::pattern::tuner tuner=hetcompute::pattern::tuner())

9.3.1 Class Documentation

9.3.1.1 class hetcompute::pattern::preducer

template<typename Reduce, typename Join>class hetcompute::pattern::preducer< Reduce, Join >

Public member functions

• template<typename T , typename InputIterator >
T operator() (InputIterator first, InputIterator last, const T &identity, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename T , typename InputIterator >
T run (InputIterator first, InputIterator last, const T &identity, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())

Friends

• preducer create_preduce (Reduce &&r, Join &&j)
• template<typename T , typename R , typename J , typename... Args>
hetcompute::task_ptr< T > hetcompute::create_task (const preducer< R, J > &p, Args &&...args)

9.3.1.1.1 Related Function Documentation

9.3.1.1.1.1 template<typename Reduce , typename Join > preducer create_preduce ( Reduce && r,
Join && j ) [friend]

Create pattern object from function objects: Reduce and Join.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

// parallel sum over a vector of elements
std::vector<size_t> vec(100, 1);
const size_t identity = 0;
typedef std::vector<size_t>::iterator iter_type;

auto reduce = [&vec](iter_type it1, iter_type it2, size_t init){
    return std::accumulate(it1, it2, init);
};
auto join = std::plus<size_t>();
auto preduce = hetcompute::pattern::create_preduce(reduce, join);

// result == 100
auto result = preduce(vec.begin(), vec.end(), identity);

Parameters

r Function object accumulating result for subrange.
j Function object combining two subrange results.

9.3.2 Function Documentation

9.3.2.1 template<typename Reduce , typename Join > preducer< Reduce, Join >
hetcompute::pattern::create_preduce ( Reduce && r, Join && j )

Create pattern object from function objects: Reduce and Join.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

// parallel sum over a vector of elements
std::vector<size_t> vec(100, 1);
const size_t identity = 0;
typedef std::vector<size_t>::iterator iter_type;

auto reduce = [&vec](iter_type it1, iter_type it2, size_t init){
    return std::accumulate(it1, it2, init);
};
auto join = std::plus<size_t>();
auto preduce = hetcompute::pattern::create_preduce(reduce, join);

// result == 100
auto result = preduce(vec.begin(), vec.end(), identity);

Parameters

r Function object accumulating result for subrange.
j Function object combining two subrange results.

9.3.2.2 template<typename T , class InputIterator , typename Reduce , typename Join >
T hetcompute::preduce ( InputIterator first, InputIterator last, const T &
identity, Reduce && reduce, Join && join, hetcompute::pattern::tuner tuner =
hetcompute::pattern::tuner() )

Performs a parallel reduction, combining partial results with the binary operator join, and returns
the result of the reduction.

Note

The Qualcomm HetCompute parallel reduction pattern operates in two stages. In the first stage, it
applies the reduction operation (reduce) to a set of subranges. In the second stage, the results of all
the subranges are aggregated (join) into the final result.
The binary operation is expected to be associative, but not necessarily commutative, because the
algorithm does not swap operands of the reduce operation while working on the range.
InputIterator can be either a RandomAccessIterator or an integral type that represents iteration
indices.
For a tiny iteration range or a trivial binary operator, parallelizing the reduction may not be
worthwhile.
The reduce function requires pass-by-reference semantics.
For best performance, implement a move constructor and move assignment operator for any
user-defined type that is being reduced.
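
The two stages described in the note can be modeled sequentially with the standard library alone. The sketch below is illustrative and is not the SDK implementation: two_stage_reduce and concat_words are hypothetical helpers, the chunking policy is an assumption, and the reduce callable follows the pass-by-reference form noted above. String concatenation is used because it is associative but not commutative, so the result is correct only if operand order is preserved.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper (not part of the HetCompute API): a sequential model of
// the two-stage reduction. Stage 1 reduces each subrange starting from
// `identity`; stage 2 joins the partial results from left to right.
template <typename T, typename Reduce, typename Join>
T two_stage_reduce(std::size_t n, std::size_t chunks, const T& identity,
                   Reduce reduce, Join join)
{
    assert(chunks > 0);
    const std::size_t step = (n + chunks - 1) / chunks; // ceil(n / chunks)
    std::vector<T> partial;
    for (std::size_t lo = 0; lo < n; lo += step)
    {
        const std::size_t hi = std::min(n, lo + step);
        T acc = identity;    // every subrange starts from the identity
        reduce(lo, hi, acc); // stage 1: accumulate by reference
        partial.push_back(acc);
    }
    T result = identity;
    for (const T& p : partial)
    {
        result = join(result, p); // stage 2: operand order is preserved
    }
    return result;
}

// Concatenate words in order; the chunk count must not affect the result.
inline std::string concat_words(const std::vector<std::string>& words,
                                std::size_t chunks)
{
    return two_stage_reduce<std::string>(
        words.size(), chunks, std::string(),
        [&words](std::size_t i, std::size_t j, std::string& init) {
            for (std::size_t k = i; k < j; ++k) init += words[k];
        },
        [](const std::string& x, const std::string& y) { return x + y; });
}
```

Because join is only ever applied to neighboring partial results in their original order, the output is the same for any number of subranges.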

Complexity

Exactly last-first-1 applications of join.

Examples

[...]
const int identity = 0;
// Parallel sum over the index range [0, vin.size())
auto p_sum = hetcompute::preduce(size_t(0), vin.size(), identity,
    // reduce func: accumulates vin[i], ..., vin[j - 1] into init
    [&vin](size_t i, size_t j, int& init)
    {
        for (size_t k = i; k < j; ++k)
            init += vin[k];
    },
    // join func
    [](int x, int y){ return x + y; }
);

Parameters

first Start of the range to parallel reduce.
last End of the range to parallel reduce.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
reduce Reduce function applied to each subrange to produce that subrange's result.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

T Returns the reduction result of type T.
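
To see why identity must be a true identity element, consider a model in which every subrange is seeded with identity before its elements are accumulated. That seeding is an assumption made for illustration (chunked_sum is a hypothetical helper, not an SDK function), but under it a wrong identity is counted once per subrange rather than once overall:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Hypothetical model: sum `v` in `chunks` subranges, seeding every subrange
// with `identity`, then add the partial sums together.
inline int chunked_sum(const std::vector<int>& v, std::size_t chunks, int identity)
{
    assert(chunks > 0);
    const std::size_t step = (v.size() + chunks - 1) / chunks; // ceil division
    int total = 0;
    for (std::size_t lo = 0; lo < v.size(); lo += step)
    {
        const std::size_t hi = std::min(v.size(), lo + step);
        total += std::accumulate(v.begin() + lo, v.begin() + hi, identity);
    }
    return total;
}

inline void check_identity()
{
    const std::vector<int> v(100, 1);
    assert(chunked_sum(v, 4, 0) == 100); // 0 is the identity for addition
    assert(chunked_sum(v, 4, 1) == 104); // 1 is added once per subrange
}
```

With a non-identity seed the error depends on how many subranges the range is split into, which makes the result vary with the partitioning.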

9.3.2.3 template<typename T , typename Container , typename Join >
T hetcompute::preduce ( Container & c, const T & identity, Join && join,
hetcompute::pattern::tuner tuner = hetcompute::pattern::tuner() )

A convenience API that applies parallel reduction to a passed-in container.

Note: The container must define size() and be indexable with operator[].

See Also

T preduce(InputIterator, InputIterator, const T&, Reduce&&, Join&&)

Examples

[...]
vector<int> vin;
// Initialize vin
[...]
const int identity = 0;
// Parallel sum
auto p_sum = hetcompute::preduce(vin, identity, std::plus<int>());

Parameters

c Container on which parallel reduce is applied.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

T Returns the reduction result of type T.
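
The container requirement for this overload (size() and operator[]) can be illustrated with a sequential, index-based reduction. index_reduce below is a hypothetical helper, not an SDK function; it only shows which standard containers satisfy the stated interface:

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <deque>
#include <functional>
#include <vector>

// Hypothetical sequential stand-in: needs only c.size() and c[i].
template <typename T, typename Container, typename Join>
T index_reduce(const Container& c, const T& identity, Join join)
{
    T acc = identity;
    for (std::size_t i = 0; i < c.size(); ++i)
    {
        acc = join(acc, c[i]);
    }
    return acc;
}

inline void check_index_reduce()
{
    const std::vector<int> v(10, 3);
    const std::array<int, 4> a = { {1, 2, 3, 4} };
    const std::deque<int> d(5, 2);
    assert(index_reduce(v, 0, std::plus<int>()) == 30);
    assert(index_reduce(a, 1, std::multiplies<int>()) == 24);
    assert(index_reduce(d, 0, std::plus<int>()) == 10);
}
```

A container such as std::list has size() but no operator[], so it would not satisfy the stated requirement.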

9.3.2.4 template<typename T , typename Iterator , typename Join >
T hetcompute::preduce ( Iterator first, Iterator last, const T & identity, Join && join,
hetcompute::pattern::tuner tuner = hetcompute::pattern::tuner() )

A convenience API that applies parallel reduction to a range specified by iterators.

Note: The caller must ensure that the iterators are valid.

See Also

T preduce(InputIterator, InputIterator, const T&, Reduce&&, Join&&)

Examples

[...]
vector<int> vin;
// Initialize vector
[...]
const int identity = 0;
// Parallel sum
auto p_sum = hetcompute::preduce(vin.begin(), vin.end(), identity, std::plus<int>());

Parameters

first Iterator to range start.
last Iterator to range end.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

T Returns the reduction result of type T.

9.3.2.5 template<typename T , class InputIterator , typename Reduce , typename Join >
hetcompute::task_ptr<T> hetcompute::preduce_async ( InputIterator first,
InputIterator last, const T & identity, Reduce && reduce, Join && join,
hetcompute::pattern::tuner tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::preduce pattern.

Parameters

first Start of the range to parallel reduce.
last End of the range to parallel reduce.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
reduce Reduce function applied to each subrange to produce that subrange's result.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr<T> Task created from the preduce pattern; it yields the reduction result of type T.

9.3.2.6 template<typename T , typename Container , typename Join >
hetcompute::task_ptr<T> hetcompute::preduce_async ( Container & c, const T &
identity, Join && join, hetcompute::pattern::tuner tuner =
hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::preduce pattern.

Parameters

c Container on which parallel reduce is applied.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr<T> Task created from the preduce pattern; it yields the reduction result of type T.

9.3.2.7 template<typename T , typename Iterator , typename Join >
hetcompute::task_ptr<T> hetcompute::preduce_async ( Iterator first, Iterator last,
const T & identity, Join && join, hetcompute::pattern::tuner tuner =
hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::preduce pattern.

Parameters

first Iterator to range start.
last Iterator to range end.
identity A special element that leaves other elements unchanged when combined with
them, i.e., x = identity ∗ x and x = x ∗ identity. For example, 0 is the identity
element under addition for the real numbers.
join Joins the calculated results from two subranges.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr<T> Task created from the preduce pattern; it yields the reduction result of type T.

9.4 Parallel Scan


Classes

• class hetcompute::pattern::pscan< BinaryFn >

Functions

• template<typename BinaryFn >
pscan< BinaryFn > hetcompute::pattern::create_pscan_inclusive (BinaryFn &&fn)
• template<typename RandomAccessIterator , typename BinaryFn >
void hetcompute::pscan_inclusive (RandomAccessIterator first, RandomAccessIterator last,
BinaryFn &&fn, hetcompute::pattern::tuner tuner=hetcompute::pattern::tuner())
• template<typename RandomAccessIterator , typename BinaryFn >
hetcompute::task_ptr< void()> hetcompute::pscan_inclusive_async (RandomAccessIterator first,
RandomAccessIterator last, BinaryFn fn, hetcompute::pattern::tuner
tuner=hetcompute::pattern::tuner())

9.4.1 Class Documentation

9.4.1.1 class hetcompute::pattern::pscan

template<typename BinaryFn>class hetcompute::pattern::pscan< BinaryFn >

Public member functions

• template<typename RandomAccessIterator >
void operator() (RandomAccessIterator first, RandomAccessIterator last, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename RandomAccessIterator >
void run (RandomAccessIterator first, RandomAccessIterator last, hetcompute::pattern::tuner
t=hetcompute::pattern::tuner())

Friends

• pscan create_pscan_inclusive (BinaryFn &&fn)
• template<typename Fn , typename... Args>
hetcompute::task_ptr< void()> hetcompute::create_task (const pscan< Fn > &ps, Args &&...args)

9.4.1.1.1 Related Function Documentation

9.4.1.1.1.1 template<typename BinaryFn > pscan create_pscan_inclusive ( BinaryFn && fn ) [friend]

Create pattern object from function object fn.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

auto l = std::plus<size_t>();
auto pscan = hetcompute::pattern::create_pscan_inclusive(l);
pscan(vec.begin(), vec.end());

9.4.2 Function Documentation

9.4.2.1 template<typename BinaryFn > pscan< BinaryFn >
hetcompute::pattern::create_pscan_inclusive ( BinaryFn && fn )

Create pattern object from function object fn.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

auto l = std::plus<size_t>();
auto pscan = hetcompute::pattern::create_pscan_inclusive(l);
pscan(vec.begin(), vec.end());

9.4.2.2 template<typename RandomAccessIterator , typename BinaryFn > void
hetcompute::pscan_inclusive ( RandomAccessIterator first, RandomAccessIterator last,
BinaryFn && fn, hetcompute::pattern::tuner tuner = hetcompute::pattern::tuner() )

Sklansky-style parallel inclusive scan.


Performs an in-place parallel prefix computation using the function object fn for the range [first, last).
fn should be associative, because the order of applications is not fixed.
Note

This function returns only after fn has been applied to the whole iteration range.
Similar to hetcompute::ptransform, range iterators are restricted to type
RandomAccessIterator.
The usual rules for cancellation apply, i.e., within fn the cancellation must be acknowledged using
abort_on_cancel.

Examples

// After: v' = { v[0], v[0] x v[1], v[0] x v[1] x v[2], ... }
pscan_inclusive(begin(v), end(v),
[] (size_t const& i, size_t const& j) {
return i + j;
});
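
The access pattern behind a Sklansky-style scan can be sketched sequentially as follows. This is an illustration of the general technique only, not the SDK's implementation; sklansky_inclusive_scan is a hypothetical name. At level d, each element whose index has bit d set is combined with the last element of the preceding half-block; updates within a level touch disjoint elements, which is what makes them candidates for parallel execution.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential sketch of a Sklansky-style inclusive scan (illustrative only).
// Elements read at level d have bit d clear and are not written at that
// level, so the in-place update within a level is safe.
template <typename T, typename BinaryFn>
void sklansky_inclusive_scan(std::vector<T>& v, BinaryFn fn)
{
    const std::size_t n = v.size();
    for (std::size_t d = 0; (std::size_t(1) << d) < n; ++d)
    {
        for (std::size_t i = 0; i < n; ++i)
        {
            if ((i >> d) & 1)
            {
                // Last index of the preceding half-block at this level.
                const std::size_t j = ((i >> d) << d) - 1;
                v[i] = fn(v[j], v[i]); // left operand stays on the left
            }
        }
    }
}

inline void check_scan_example()
{
    std::vector<int> v = {1, 2, 3, 4, 5};
    sklansky_inclusive_scan(v, [](int a, int b) { return a + b; });
    assert((v == std::vector<int>{1, 3, 6, 10, 15}));
}
```

The grouping of operands differs from a left-to-right scan, which is why fn needs to be associative for the two to agree.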

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
fn Binary function object to be applied.
tuner HetCompute pattern tuner object (optional).

9.4.2.3 template<typename RandomAccessIterator , typename BinaryFn >
hetcompute::task_ptr<void()> hetcompute::pscan_inclusive_async (
RandomAccessIterator first, RandomAccessIterator last, BinaryFn fn,
hetcompute::pattern::tuner tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::pscan_inclusive pattern.


Returns a task that represents the pattern’s execution. Operations on the task translate into operations on
the executing pattern. The caller must launch the task.

Parameters

first Start of the range to which to apply fn.
last End of the range to which to apply fn.
fn Binary function object to be applied.
tuner HetCompute pattern tuner object (optional).

9.5 Parallel Divide-and-Conquer


Classes

• class hetcompute::pattern::pdivide_and_conquerer< IsBaseFn, BaseFn, SplitFn, MergeFn >

Functions

• template<typename IsBaseFn , typename BaseFn , typename SplitFn , typename MergeFn >
pdivide_and_conquerer< IsBaseFn, BaseFn, SplitFn, MergeFn >
hetcompute::pattern::create_pdivide_and_conquer (IsBaseFn &&isbase, BaseFn
&&base, SplitFn &&split, MergeFn &&merge)
• template<typename Problem , typename Solution , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
Solution hetcompute::pdivide_and_conquer (Problem p, Fn1 &&is_base, Fn2 &&base, Fn3 &&split,
Fn4 &&merge, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
void hetcompute::pdivide_and_conquer (Problem p, Fn1 &&is_base, Fn2 &&base, Fn3 &&split, Fn4
&&merge, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 >
void hetcompute::pdivide_and_conquer (Problem p, Fn1 &&is_base, Fn2 &&base, Fn3 &&split,
const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<typename Problem , typename Solution , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
hetcompute::task_ptr< Solution > hetcompute::pdivide_and_conquer_async (Problem p, Fn1
&&is_base, Fn2 &&base, Fn3 &&split, Fn4 &&merge, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
hetcompute::task_ptr hetcompute::pdivide_and_conquer_async (Problem p, Fn1 &&is_base, Fn2
&&base, Fn3 &&split, Fn4 &&merge, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 >
hetcompute::task_ptr hetcompute::pdivide_and_conquer_async (Problem p, Fn1 &&is_base, Fn2
&&base, Fn3 &&split, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())

9.5.1 Class Documentation

9.5.1.1 class hetcompute::pattern::pdivide_and_conquerer

template<typename IsBaseFn, typename BaseFn, typename SplitFn, typename MergeFn>class
hetcompute::pattern::pdivide_and_conquerer< IsBaseFn, BaseFn, SplitFn, MergeFn >

Public member functions

• _Solution operator() (_Problem &p, const hetcompute::pattern::tuner
&pt=hetcompute::pattern::tuner())
• _Solution run (_Problem &p, const hetcompute::pattern::tuner &pt=hetcompute::pattern::tuner())

Friends

• pdivide_and_conquerer create_pdivide_and_conquer (IsBaseFn &&isbase, BaseFn &&base, SplitFn
&&split, MergeFn &&merge)
• template<typename Problem , typename Solution , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
hetcompute::task_ptr< Solution > hetcompute::create_task (const pdivide_and_conquerer< Fn1,
Fn2, Fn3, Fn4 > &pdnc, Problem p, const hetcompute::pattern::tuner &t)
• template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 , typename Fn4 >
hetcompute::task_ptr hetcompute::create_task (const pdivide_and_conquerer< Fn1, Fn2, Fn3, Fn4
> &pdnc, Problem p, const hetcompute::pattern::tuner &t)
• template<typename Code , typename IsPattern , class Enable >
struct hetcompute::internal::task_factory

9.5.1.1.1 Related Function Documentation

9.5.1.1.1.1 template<typename IsBaseFn , typename BaseFn , typename SplitFn , typename MergeFn >
pdivide_and_conquerer create_pdivide_and_conquer ( IsBaseFn && isbase, BaseFn &&
base, SplitFn && split, MergeFn && merge ) [friend]

Create a pattern object from function objects isbase, base, split, and merge.
Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

// Calculate Fibonacci sequence in parallel
auto fib = hetcompute::pattern::create_pdivide_and_conquer(
    [](size_t m){ return m <= 1; },
    [](size_t m){ return m; },
    [](size_t m){ return std::vector<size_t>({m - 1, m - 2}); },
    [](size_t, std::vector<size_t>& sols){ return sols[0] + sols[1]; }
);

size_t input = 10;
auto par = fib.run(input);

Parameters

isbase Indicates if the problem is a base case problem.
base Solves a base case problem.
split Splits the problem into subproblems and returns them.
merge Merges the solutions to subproblems of p returned by split.

9.5.2 Function Documentation

9.5.2.1 template<typename IsBaseFn , typename BaseFn , typename SplitFn ,
typename MergeFn > pdivide_and_conquerer< IsBaseFn, BaseFn, SplitFn,
MergeFn > hetcompute::pattern::create_pdivide_and_conquer ( IsBaseFn &&
isbase, BaseFn && base, SplitFn && split, MergeFn && merge )

Create a pattern object from function objects isbase, base, split, and merge.
Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

// Calculate Fibonacci sequence in parallel
auto fib = hetcompute::pattern::create_pdivide_and_conquer(
    [](size_t m){ return m <= 1; },
    [](size_t m){ return m; },
    [](size_t m){ return std::vector<size_t>({m - 1, m - 2}); },
    [](size_t, std::vector<size_t>& sols){ return sols[0] + sols[1]; }
);

size_t input = 10;
auto par = fib.run(input);

Parameters

isbase Indicates if the problem is a base case problem.
base Solves a base case problem.
split Splits the problem into subproblems and returns them.
merge Merges the solutions to subproblems of p returned by split.

9.5.2.2 template<typename Problem , typename Solution , typename Fn1 ,
typename Fn2 , typename Fn3 , typename Fn4 > Solution
hetcompute::pdivide_and_conquer ( Problem p, Fn1 && is_base, Fn2 && base, Fn3
&& split, Fn4 && merge, const hetcompute::pattern::tuner & tuner =
hetcompute::pattern::tuner() )

Parallel divide-and-conquer.

Solves a problem by splitting it into independent subproblems, which may be solved in parallel, and merging
the solutions to the subproblems. A subproblem may recursively be split into yet more subproblems, yielding
significant parallelism, e.g., Fibonacci.

Note: For best performance, make split and merge relatively inexpensive compared to base.
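
The contract between the four functions can be modeled by a short sequential recursion. The sketch below is not the SDK implementation (which may solve subproblems in parallel); divide_and_conquer and range_sum are hypothetical names, and the callable signatures follow the Fibonacci examples in this section.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Sequential model of the is_base / base / split / merge contract.
template <typename Solution, typename Problem, typename IsBase, typename Base,
          typename Split, typename Merge>
Solution divide_and_conquer(Problem p, IsBase is_base, Base base, Split split,
                            Merge merge)
{
    if (is_base(p))
    {
        return base(p); // small enough: solve directly
    }
    auto subproblems = split(p);
    std::vector<Solution> sols;
    for (auto& sub : subproblems)
    {
        sols.push_back(
            divide_and_conquer<Solution>(sub, is_base, base, split, merge));
    }
    return merge(p, sols); // combine the subproblem solutions
}

// Example: sum a vector by halving the index range [first, second).
inline long range_sum(const std::vector<int>& v)
{
    typedef std::pair<std::size_t, std::size_t> Range;
    return divide_and_conquer<long>(
        Range(0, v.size()),
        [](Range& r) { return r.second - r.first <= 4; },
        [&v](Range& r) -> long {
            long s = 0;
            for (std::size_t k = r.first; k < r.second; ++k) s += v[k];
            return s;
        },
        [](Range& r) -> std::vector<Range> {
            const std::size_t mid = r.first + (r.second - r.first) / 2;
            return std::vector<Range>{ Range(r.first, mid), Range(mid, r.second) };
        },
        [](Range&, std::vector<long>& sols) { return sols[0] + sols[1]; });
}

inline void check_range_sum()
{
    assert(range_sum(std::vector<int>(100, 2)) == 200);
}
```

Here split and merge each do constant work while base does the summation, which matches the guidance that split and merge should be inexpensive relative to base.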

Example: Compute n-th Fibonacci term

#include <iostream>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Sequential reference implementation
static size_t
fibonacci_s(size_t n)
{
    if (n == 0 || n == 1)
    {
        return n;
    }
    else
    {
        return fibonacci_s(n - 1) + fibonacci_s(n - 2);
    }
}

// Terms at or below this index are computed sequentially
static const size_t GRANULARITY = 20;

static size_t
fibonacci(size_t n)
{
    return hetcompute::pdivide_and_conquer<size_t, size_t>(
        // Problem is to compute the n-th Fibonacci term
        n,
        // When should an arbitrary Fibonacci term, represented by 'm', be
        // computed sequentially?
        // Note that programmer chooses to compute Fibonacci terms 20 and lower
        // sequentially for best performance.
        [](size_t& m) { return m <= GRANULARITY; },
        // How to compute the term sequentially
        [](size_t& m) { return fibonacci_s(m); },
        // Split problem into independent subproblems
        [](size_t& m) {
            return std::vector<size_t>({ m - 1, m - 2 });
        },
        // Merge solutions to subproblems.
        // Note that the first parameter (size_t, corresponding to the split
        // problem) is unused in this case, but may be useful while merging in
        // other cases.
        [](size_t, std::vector<size_t>& sols) { return sols[0] + sols[1]; });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    size_t n_def = 24;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    size_t out = fibonacci(n);

    if (out != fibonacci_s(n))
    {
        std::cerr << "parallel fibonacci failed\n";
    }
    hetcompute::runtime::shutdown();
    return 0;
}

Parameters

p Problem data structure operated on by the functions.
is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem.
split Splits p into subproblems and returns them.
merge Merges the solutions to subproblems of p returned by split.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

Solution data structure representing solution to the problem.

9.5.2.3 template<typename Problem , typename Fn1 , typename Fn2 , typename
Fn3 , typename Fn4 > void hetcompute::pdivide_and_conquer ( Problem
p, Fn1 && is_base, Fn2 && base, Fn3 && split, Fn4 && merge, const
hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel divide-and-conquer specialized for not returning a solution, e.g., mergesort.

Parameters

p Problem data structure operated on by the functions.
is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem, but does not return any solution.
split Splits p into subproblems and returns them.
merge Merges subproblems returned by split.
tuner Qualcomm HetCompute pattern tuner object (optional).
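
As a sketch of how the mergesort case fits this form, the code below sorts in place: base and merge modify the data and nothing is returned. It is a sequential model under stated assumptions, not SDK code. SortRange, divide_and_conquer_void, and merge_sort are hypothetical names, and the merge signature (parent problem plus the subproblems) is assumed for illustration only.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <vector>

// Problem type: a half-open iterator range [begin, end).
template <typename It>
struct SortRange
{
    It begin, end;
};

// Hypothetical sequential driver for the void, four-function form.
template <typename Problem, typename IsBase, typename Base, typename Split,
          typename Merge>
void divide_and_conquer_void(Problem p, IsBase is_base, Base base, Split split,
                             Merge merge)
{
    if (is_base(p))
    {
        base(p); // solves in place, returns nothing
        return;
    }
    auto subs = split(p);
    for (auto& s : subs)
    {
        divide_and_conquer_void(s, is_base, base, split, merge);
    }
    merge(p, subs); // assumed signature: parent problem and its subproblems
}

template <typename It>
void merge_sort(It first, It last)
{
    typedef SortRange<It> Range;
    divide_and_conquer_void(
        Range{first, last},
        [](Range& r) { return std::distance(r.begin, r.end) <= 16; },
        [](Range& r) { std::sort(r.begin, r.end); },
        [](Range& r) -> std::vector<Range> {
            It mid = std::next(r.begin, std::distance(r.begin, r.end) / 2);
            return std::vector<Range>{ Range{r.begin, mid}, Range{mid, r.end} };
        },
        [](Range& r, std::vector<Range>& subs) {
            // Both halves are sorted; merge them within the parent range.
            std::inplace_merge(r.begin, subs[0].end, r.end);
        });
}

inline void check_merge_sort()
{
    std::vector<int> v;
    for (int i = 100; i > 0; --i) v.push_back(i);
    merge_sort(v.begin(), v.end());
    assert(std::is_sorted(v.begin(), v.end()));
}
```

The two halves never overlap, so they are independent subproblems; only the merge step touches the whole parent range.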

9.5.2.4 template<typename Problem , typename Fn1 , typename Fn2 , typename Fn3 >
void hetcompute::pdivide_and_conquer ( Problem p, Fn1 && is_base,
Fn2 && base, Fn3 && split, const hetcompute::pattern::tuner & tuner =
hetcompute::pattern::tuner() )

Parallel divide-and-conquer specialized for no merge of subproblems and not returning a solution, e.g.,
quicksort.

Example: Quicksort n random integers

#include <algorithm>
#include <array>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <utility>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Problem type: the range [begin, end) plus the partition point
template <typename Iterator>
struct QuickSort
{
    QuickSort(Iterator _begin, Iterator _end) : begin(_begin), end(_end), middle() {}
    Iterator begin, end, middle;
};

// Ranges of this size or smaller are sorted sequentially
const size_t GRANULARITY = 8192;

template <typename Iterator, typename Compare>
void
quicksort(Iterator begin, Iterator end, Compare cmp)
{
    typedef QuickSort<Iterator> QuickSort;
    hetcompute::pdivide_and_conquer(
        // Main problem
        QuickSort(begin, end),
        // When should an arbitrary array, represented by 'q', be sorted
        // sequentially?
        // Note that programmer chooses to sort arrays smaller than size 8192
        // sequentially for best performance.
        [&](QuickSort& q) {
            size_t n = std::distance(q.begin, q.end);
            if (n <= GRANULARITY)
            {
                return true;
            }
            // Choice of first element as pivot is arbitrary
            auto pivot = *q.begin;
            // A lambda replaces std::bind2nd, which was removed in C++17
            q.middle = std::partition(q.begin, q.end,
                                      [&](decltype(pivot) const& x) { return cmp(x, pivot); });
            // If middle == begin, elements in [begin, end) are greater than or
            // equal to pivot. We could either find a new pivot or as we do here,
            // just sort sequentially.
            return q.middle == q.begin;
        },
        // Sequential sort used
        [&](QuickSort& q) { std::sort(q.begin, q.end, cmp); },
        // Split problem into two subproblems
        [&](QuickSort& q) {
            std::array<QuickSort, 2> subarrays{ { QuickSort(q.begin, q.middle),
                                                  QuickSort(q.middle, q.end) } };
            return subarrays;
        });
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    std::vector<long> input;
    size_t n_def = 1 << 16;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    // Create a random array of integers
    for (size_t i = 0; i < n; i++)
    {
        input.push_back(rand());
    }

    quicksort(input.begin(), input.end(), std::less<long>());

    if (!std::is_sorted(input.begin(), input.end()))
    {
        std::cerr << "parallel quicksorting failed\n";
    }

    hetcompute::runtime::shutdown();

    return 0;
}

Parameters

p Problem data structure operated on by the functions.
is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem, but does not return any solution.
split Splits p into subproblems and returns them.
tuner Qualcomm HetCompute pattern tuner object (optional).

9.5.2.5 template<typename Problem , typename Solution , typename Fn1 , typename
Fn2 , typename Fn3 , typename Fn4 > hetcompute::task_ptr<Solution>
hetcompute::pdivide_and_conquer_async ( Problem p, Fn1 && is_base, Fn2
&& base, Fn3 && split, Fn4 && merge, const hetcompute::pattern::tuner &
tuner = hetcompute::pattern::tuner() )

Asynchronous parallel divide-and-conquer.


Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.

Parameters

p Problem data structure operated on by the functions.
is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem.
split Splits p into subproblems and returns them.
merge Merges solutions to subproblems of p returned by split.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.

9.5.2.6 template<typename Problem , typename Fn1 , typename Fn2 , typename
Fn3 , typename Fn4 > hetcompute::task_ptr
hetcompute::pdivide_and_conquer_async ( Problem p, Fn1 && is_base, Fn2 && base, Fn3
&& split, Fn4 && merge, const hetcompute::pattern::tuner & tuner =
hetcompute::pattern::tuner() )

Asynchronous parallel divide-and-conquer specialized for not returning a solution, e.g., mergesort.
Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.

Parameters

p Problem data structure operated on by the functions.
is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem.
split Splits p into subproblems and returns them.
merge Merges solutions to subproblems of p returned by split.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.

9.5.2.7 template<typename Problem , typename Fn1 , typename Fn2 , typename
Fn3 > hetcompute::task_ptr hetcompute::pdivide_and_conquer_async
( Problem p, Fn1 && is_base, Fn2 && base, Fn3 && split, const
hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Asynchronous parallel divide-and-conquer, specialized for problems that need no merge of subproblems and do not
return a solution (for example, quicksort).
Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.

Parameters

p Problem data structure operated on by the functions.


is_base Returns TRUE if p is a base case problem, else FALSE.
base Solves a base case problem, but does not return any solution.
split Splits p into subproblems and returns them.
tuner Qualcomm HetCompute pattern tuner object (optional).

Returns

task_ptr Unlaunched task representing pattern’s execution.


9.6 Parallel Sorting


Classes

• class hetcompute::pattern::psorter< Compare >

Functions

• template<typename Compare >
psorter< Compare > hetcompute::pattern::create_psort (Compare &&cmp)
• template<class RandomAccessIterator , class Compare >
void hetcompute::psort (RandomAccessIterator first, RandomAccessIterator last, Compare cmp,
const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<class RandomAccessIterator >
void hetcompute::psort (RandomAccessIterator first, RandomAccessIterator last, const
hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())
• template<class RandomAccessIterator , class Compare >
hetcompute::task_ptr< void()> hetcompute::psort_async (RandomAccessIterator first,
RandomAccessIterator last, Compare &&cmp, const hetcompute::pattern::tuner
&tuner=hetcompute::pattern::tuner())
• template<class RandomAccessIterator >
hetcompute::task_ptr< void()> hetcompute::psort_async (RandomAccessIterator first,
RandomAccessIterator last, const hetcompute::pattern::tuner &tuner=hetcompute::pattern::tuner())

9.6.1 Class Documentation

9.6.1.1 class hetcompute::pattern::psorter

template<typename Compare>class hetcompute::pattern::psorter< Compare >

Public member functions

• psorter (Compare &&cmp)


• template<typename RandomAccessIterator >
void operator() (RandomAccessIterator first, RandomAccessIterator last, const
hetcompute::pattern::tuner &t=hetcompute::pattern::tuner())
• template<typename RandomAccessIterator >
void run (RandomAccessIterator first, RandomAccessIterator last, const hetcompute::pattern::tuner
&t=hetcompute::pattern::tuner())

Friends

• psorter create_psort (Compare &&cmp)


• template<typename Cmp , typename... Args>
hetcompute::task_ptr< void()> hetcompute::create_task (const psorter< Cmp > &p, Args
&&...args)


9.6.1.1.1 Related Function Documentation

9.6.1.1.1.1 template<typename Compare > psorter create_psort ( Compare && cmp ) [friend]

Creates a hetcompute::psort pattern object from the function object cmp.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

vector<int> vin(100000);
Rand_int rnd{0, int(vin.size() - 1)};

// Generate 100,000 random integers
for (auto& i : vin)
    i = rnd();

// Sort the vector in descending order using hetcompute::psort
auto p = hetcompute::pattern::create_psort([](int l, int r) { return r < l; });
p(vin.begin(), vin.end());

Parameters

cmp User-customized compare function object to be applied.

9.6.2 Function Documentation

9.6.2.1 template<typename Compare > psorter< Compare > hetcompute::pattern::create_psort ( Compare && cmp )

Creates a hetcompute::psort pattern object from the function object cmp.


Returns a pattern object which can be invoked (1) synchronously, using the run method or the () operator
with arguments; or (2) asynchronously, using hetcompute::create_task or hetcompute::launch.

Examples

vector<int> vin(100000);
Rand_int rnd{0, int(vin.size() - 1)};

// Generate 100,000 random integers
for (auto& i : vin)
    i = rnd();

// Sort the vector in descending order using hetcompute::psort
auto p = hetcompute::pattern::create_psort([](int l, int r) { return r < l; });
p(vin.begin(), vin.end());

Parameters

cmp User-customized compare function object to be applied.


9.6.2.2 template<class RandomAccessIterator , class Compare > void hetcompute::psort ( RandomAccessIterator first, RandomAccessIterator last, Compare cmp, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::sort.


Performs an unstable in-place comparison sort of a container using the supplied cmp function.

Examples

// Sort vin using hetcompute::psort
psort(vin.begin(), vin.end(), [](int l, int r) { return r < l; });

See Also

psort(RandomAccessIterator, RandomAccessIterator)

Parameters

first Start of the range to sort.


last End of the range to sort.
cmp User-customized compare function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).
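
Because psort is described as a parallel version of std::sort, the effect of the comparator used in the example can be checked with std::sort itself. The sketch below assumes that psort follows the std::sort comparator convention (cmp(l, r) returns true when l must come before r); sort_descending is a hypothetical helper name, and no HetCompute call is made.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Returns a copy of `in` sorted with the same comparator as the psort example.
std::vector<int> sort_descending(std::vector<int> in)
{
    // cmp(l, r) must return true when l is ordered before r.
    // Returning r < l therefore places larger values first.
    auto cmp = [](int l, int r) { return r < l; };
    std::sort(in.begin(), in.end(), cmp);
    assert(std::is_sorted(in.begin(), in.end(), cmp));
    return in;
}
```

The comparator is a strict weak ordering, as std::sort requires. Because the sort is unstable, the relative order of equal elements is not guaranteed.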

9.6.2.3 template<class RandomAccessIterator > void hetcompute::psort ( RandomAccessIterator first, RandomAccessIterator last, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::sort.


Performs an unstable in-place comparison sort of a container. Equivalent to psort(first, last, std::less<T>()),
where T is the value type of the iterators.

See Also

psort(RandomAccessIterator, RandomAccessIterator, Compare)

Parameters

first Start of the range to sort.


last End of the range to sort.
tuner Qualcomm HetCompute pattern tuner object (optional).


9.6.2.4 template<class RandomAccessIterator , class Compare > hetcompute::task_ptr<void()> hetcompute::psort_async ( RandomAccessIterator first, RandomAccessIterator last, Compare && cmp, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Create an asynchronous task from the hetcompute::psort pattern.

Parameters

first Start of the range to sort.


last End of the range to sort.
cmp User-customized compare function object to be applied.
tuner Qualcomm HetCompute pattern tuner object (optional).

9.6.2.5 template<class RandomAccessIterator > hetcompute::task_ptr<void()> hetcompute::psort_async ( RandomAccessIterator first, RandomAccessIterator last, const hetcompute::pattern::tuner & tuner = hetcompute::pattern::tuner() )

Parallel version of std::sort (asynchronous).


Returns a task that represents the pattern’s execution. Operations on the task translate into operations on the
executing pattern. The caller must launch the task.

See Also

psort(RandomAccessIterator, RandomAccessIterator)

Parameters

first Start of the range to sort.


last End of the range to sort.
tuner Qualcomm HetCompute pattern tuner object (optional).
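
The idea of an unlaunched task, where nothing is sorted until the caller launches it, can be illustrated with a deferred closure in standard C++. This is only an analogy: make_deferred_sort and deferred_sort_demo are hypothetical names, std::function stands in for hetcompute::task_ptr, and invoking the closure stands in for launching the task.

```cpp
#include <algorithm>
#include <cassert>
#include <functional>
#include <vector>

// Hypothetical stand-in for an unlaunched task: a closure that sorts the
// range [first, last) only when it is eventually invoked.
template <typename RandomAccessIterator>
std::function<void()> make_deferred_sort(RandomAccessIterator first, RandomAccessIterator last)
{
    assert(!(last < first)); // the range must be valid
    return [first, last] { std::sort(first, last); };
}

// Creating the closure does not touch the data; invoking it does.
bool deferred_sort_demo()
{
    std::vector<int> v = {4, 2, 3, 1};
    auto task = make_deferred_sort(v.begin(), v.end());
    bool untouched = (v == std::vector<int>{4, 2, 3, 1}); // nothing has run yet
    task();                                               // the "launch"
    return untouched && std::is_sorted(v.begin(), v.end());
}
```

As with the real task, the iterators captured at creation time must remain valid until the work has finished.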


9.7 Pipeline
Classes

• class hetcompute::iteration_lag
Pipeline stage iteration lag.

• class hetcompute::iteration_rate
Pipeline stage iteration match rate.

• class hetcompute::parallel_stage
Parallel pipeline stage for specifying the type of the stages when adding to the pipeline.

• class hetcompute::pattern::pipeline< UserData >
Pipeline class.

• class hetcompute::pipeline_context< UserData >
Pipeline_context with one user data.

• class hetcompute::pipeline_context<>
Pipeline_context with no user data.

• class hetcompute::pipeline_context_base
Pipeline context class.

• class hetcompute::serial_stage
Serial stage for specifying the type of the stages when adding to the pipeline.

• class hetcompute::sliding_window_size
Pipeline stage sliding window size.

• class hetcompute::stage_input< InputType >
Pipeline stage input class.

Typedefs

• typedef enum
hetcompute::serial_stage_type hetcompute::serial_stage_type
Serial pipeline stage types.

Enumerations

• enum hetcompute::serial_stage_type { in_order = 0 }


Serial pipeline stage types.

9.7.1 Class Documentation


9.7.1.1 class hetcompute::iteration_lag

Pipeline stage iteration lag.

Public member functions

• iteration_lag (size_t lag)


Constructor.

• iteration_lag (iteration_lag const &other)


Copy constructor.

• iteration_lag (iteration_lag &&other)


Move constructor.

• size_t get_iter_lag () const


Get the stage lag.

• HETCOMPUTE_DELETE_METHOD (iteration_lag &operator=(iteration_lag &&other))


• iteration_lag & operator= (iteration_lag const &other)
Copy assignment operator.

9.7.1.1.1 Constructors and Destructors

9.7.1.1.1.1 hetcompute::iteration_lag::iteration_lag ( size_t lag ) [explicit]

Constructor.

9.7.1.1.1.2 hetcompute::iteration_lag::iteration_lag ( iteration_lag const & other )

Copy constructor.

9.7.1.1.1.3 hetcompute::iteration_lag::iteration_lag ( iteration_lag && other ) [explicit]

Move constructor.

9.7.1.1.2 Member Function Documentation

9.7.1.1.2.1 size_t hetcompute::iteration_lag::get_iter_lag ( ) const

Get the stage lag.

Returns

size_t Stage lag.


9.7.1.1.2.2 iteration_lag& hetcompute::iteration_lag::operator= ( iteration_lag const & other )

Copy assignment operator.

9.7.1.2 class hetcompute::iteration_rate

Pipeline stage iteration match rate.

Public member functions

• iteration_rate (size_t p, size_t c)


Constructor.

• iteration_rate (iteration_rate const &other)


Copy constructor.

• iteration_rate (iteration_rate &&other)


Move constructor.

• size_t get_iter_rate_curr () const
Get the iteration rate for the current stage.

• size_t get_iter_rate_pred () const
Get the iteration rate for the previous stage.

• HETCOMPUTE_DELETE_METHOD (iteration_rate &operator=(iteration_rate &&other))


• iteration_rate & operator= (iteration_rate const &other)
Copy assignment operator.

9.7.1.2.1 Constructors and Destructors

9.7.1.2.1.1 hetcompute::iteration_rate::iteration_rate ( size_t p, size_t c )

Constructor.

9.7.1.2.1.2 hetcompute::iteration_rate::iteration_rate ( iteration_rate const & other )

Copy constructor.

9.7.1.2.1.3 hetcompute::iteration_rate::iteration_rate ( iteration_rate && other ) [explicit]

Move constructor.

9.7.1.2.2 Member Function Documentation


9.7.1.2.2.1 size_t hetcompute::iteration_rate::get_iter_rate_curr ( ) const

Get the iteration rate for the current stage.

Returns

size_t Iteration rate for the current stage.

9.7.1.2.2.2 size_t hetcompute::iteration_rate::get_iter_rate_pred ( ) const

Get the iteration rate for the previous stage.

Returns

size_t Iteration rate for the previous stage.

9.7.1.2.2.3 iteration_rate& hetcompute::iteration_rate::operator= ( iteration_rate const & other )

Copy assignment operator.

9.7.1.3 class hetcompute::parallel_stage

Parallel pipeline stage for specifying the type of the stages when adding to the pipeline.

Public member functions

• parallel_stage (size_t doc)


Constructor.

• parallel_stage (parallel_stage const &other)


Copy constructor.

• size_t get_degree_of_concurrency () const


Get the degree of concurrency for a parallel stage.

• HETCOMPUTE_DELETE_METHOD (parallel_stage &operator=(parallel_stage &&other))


• parallel_stage & operator= (parallel_stage const &other)
Copy assignment operator.

9.7.1.3.1 Constructors and Destructors

9.7.1.3.1.1 hetcompute::parallel_stage::parallel_stage ( size_t doc ) [explicit]

Constructor.


Parameters

doc Degree of concurrency for the parallel stage. Must be a positive integer.
It specifies the maximum number of consecutive stage iterations that can
run in parallel for this stage. When doc = 1, the parallel stage behaves
like a serial stage.
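
One way to read the doc parameter is as a sliding limit on the iterations that may be in flight at once. The toy timing model below illustrates that reading only. It is an assumption made for this sketch, not a description of the HetCompute scheduler, and stage_makespan is a hypothetical name.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Toy timing model (an assumption, not the SDK scheduler): iteration i of a
// stage may start once every iteration up to i - doc has finished, so at most
// `doc` consecutive iterations are in flight. Returns the finish time of the
// last iteration to complete, given per-iteration durations.
std::size_t stage_makespan(const std::vector<std::size_t>& durations, std::size_t doc)
{
    assert(doc >= 1); // the degree of concurrency must be positive
    std::vector<std::size_t> finish(durations.size(), 0);
    std::size_t window_closed = 0; // time at which iterations <= i - doc are all done
    std::size_t makespan = 0;
    for (std::size_t i = 0; i < durations.size(); ++i) {
        if (i >= doc) {
            window_closed = std::max(window_closed, finish[i - doc]);
        }
        finish[i] = window_closed + durations[i];
        makespan = std::max(makespan, finish[i]);
    }
    return makespan;
}
```

Under this model, eight unit-length iterations take 8 time units with doc = 1 (the serial case), 2 with doc = 4, and 1 with doc = 8.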

9.7.1.3.1.2 hetcompute::parallel_stage::parallel_stage ( parallel_stage const & other )

Copy constructor.

9.7.1.3.2 Member Function Documentation

9.7.1.3.2.1 size_t hetcompute::parallel_stage::get_degree_of_concurrency ( ) const

Get the degree of concurrency for a parallel stage.

Returns

size_t Degree of concurrency.

9.7.1.3.2.2 parallel_stage& hetcompute::parallel_stage::operator= ( parallel_stage const & other )

Copy assignment operator.

9.7.1.4 class hetcompute::pattern::pipeline

template<typename... UserData>class hetcompute::pattern::pipeline< UserData >

Pipeline class.

Template Parameters

UserData The type for the pipeline context data, or empty; for example,
hetcompute::pattern::pipeline<size_t> or
hetcompute::pattern::pipeline<>.

Public Types

• using context = pipeline_context< UserData...>


Context type for the pipeline.

Public member functions

• pipeline ()
Constructor.

• pipeline (pipeline const &other)


Copy constructor.

• pipeline (pipeline &&other)


Move constructor.

• virtual ∼pipeline ()
Destructor.

• template<typename... Confs>
hetcompute::task_ptr create_task (UserData ∗...context_data, size_t num_iterations, Confs
&&...confs) const
Create a task for the pipeline for asynchronous execution.

• hetcompute::task_ptr< void(UserData
∗..., size_t)> create_task (const hetcompute::pattern::tuner &t=hetcompute::pattern::tuner()) const
Create a task for the pipeline for asynchronous execution.

• void disable_sliding_window ()
Disable the pipeline sliding window launch type.

• void enable_sliding_window ()
Enable the sliding window launch type for the pipeline.

• bool is_valid ()
Pipeline sanity check for stage IO types and sliding window size.

• pipeline & operator= (pipeline const &other)


Copy assignment operator.

• pipeline & operator= (pipeline &&other)


Move assignment operator.

• template<typename... Confs>
void run (UserData ∗...context_data, size_t num_iterations, Confs &&...confs) const
Launch and wait for the pipeline.

9.7.1.4.1 Member Typedef Documentation

9.7.1.4.1.1 template<typename... UserData> using hetcompute::pattern::pipeline< UserData >::context = pipeline_context<UserData...>

Context type for the pipeline.

9.7.1.4.2 Constructors and Destructors

9.7.1.4.2.1 template<typename... UserData> hetcompute::pattern::pipeline< UserData >::pipeline ( )

Constructor.


9.7.1.4.2.2 template<typename... UserData> virtual hetcompute::pattern::pipeline< UserData >::∼pipeline ( ) [virtual]

Destructor.
Reimplemented in hetcompute::beta::pattern::pipeline< UserData >.

9.7.1.4.2.3 template<typename... UserData> hetcompute::pattern::pipeline< UserData >::pipeline ( pipeline< UserData > const & other )

Copy constructor.

9.7.1.4.2.4 template<typename... UserData> hetcompute::pattern::pipeline< UserData >::pipeline ( pipeline< UserData > && other )

Move constructor.

9.7.1.4.3 Member Function Documentation

9.7.1.4.3.1 template<typename... UserData> template<typename... Confs> hetcompute::task_ptr hetcompute::pattern::pipeline< UserData >::create_task ( UserData ∗... context_data, size_t num_iterations, Confs &&... confs ) const

Create a task for the pipeline for asynchronous execution. Do not call this member function if the pipeline
has no stages; doing so causes a fatal error.

Parameters

context_data Pointer to the data for the pipeline context if the pipeline is defined as
having one, i.e., sizeof...(UserData) == 1.
num_iterations The total number of iterations for the first stage. Note: if num_iterations ==
0, the pipeline runs indefinitely until the first stage stops the
pipeline.
confs Other configurations for launching a task out of the pipeline. Currently, only
a single tuner object is supported (optional).

Returns

hetcompute::task_ptr<> The pointer to the task in which the pipeline is running.

#include <hetcompute/hetcompute.hh>
#include <iostream>

//
// Pipeline without context data (wocd),
// Known iterations before launch (iter)
// Launch by creating tasks (ct)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, without pipeline context data.
    hetcompute::pattern::pipeline<> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<>::context context;

    // Add a serial first stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        // some usage of iter
        HETCOMPUTE_ILOG("iter: %zu", iter);
    });

    // Add a parallel stage with degree of concurrency of 4.
    p.add_stage(hetcompute::parallel_stage(4), [](context&) {});

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context&) {});

    // Asynchronous launch.
    // Create a task of a pipeline that runs for 20 iterations.
    // Run the pipeline as if the stages are using sliding windows.
    p.enable_sliding_window();
    auto t1 = p.create_task(20);
    // Launch the pipeline and do not block.
    t1->launch();

    // Create a task of a pipeline that runs for 10 iterations.
    // Run the pipeline as if the stages are not using sliding windows.
    p.disable_sliding_window();
    auto t2 = p.create_task(10);
    // Launch the pipeline and do not block.
    t2->launch();

    // Wait for the first pipeline to stop.
    t1->wait_for();
    // Wait for the second pipeline to stop.
    t2->wait_for();

    std::cout << "pipeline1 runs 20 iters" << std::endl;
    std::cout << "pipeline2 runs 10 iters" << std::endl;

    hetcompute::runtime::shutdown();
    return 0;
}
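
The meaning of num_iterations, including the special value 0, can be sketched with a small sequential driver in standard C++. This is a mock written for illustration, not the HetCompute implementation: mock_context, stage_fn, and run_sketch are hypothetical names, the stages run one after another rather than overlapped, and the mock lets any stage request a stop, whereas the text above only describes the first stage doing so.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Hypothetical mock of the context handed to each stage function.
struct mock_context {
    std::size_t iter_id = 0;
    std::size_t stage_id = 0;
    bool stop_requested = false;
    void stop_pipeline() { stop_requested = true; }
};

using stage_fn = std::function<void(mock_context&)>;

// Runs every stage once per iteration, in order. num_iterations == 0 means
// "no preset limit": keep going until a stage calls stop_pipeline().
// The iteration that requests the stop is still completed.
// Returns the number of iterations executed.
std::size_t run_sketch(const std::vector<stage_fn>& stages, std::size_t num_iterations)
{
    assert(!stages.empty()); // mirrors the rule that a pipeline needs stages
    mock_context ctx;
    std::size_t done = 0;
    while (num_iterations == 0 || done < num_iterations) {
        ctx.iter_id = done;
        for (std::size_t s = 0; s < stages.size(); ++s) {
            ctx.stage_id = s;
            stages[s](ctx);
        }
        ++done;
        if (ctx.stop_requested) {
            break;
        }
    }
    return done;
}
```

With a first stage that calls stop_pipeline() when iter_id reaches limit - 1, run_sketch(stages, 0) executes exactly limit iterations. A nonzero num_iterations bounds the run regardless.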

9.7.1.4.3.2 template<typename... UserData> hetcompute::task_ptr<void(UserData∗..., size_t)> hetcompute::pattern::pipeline< UserData >::create_task ( const hetcompute::pattern::tuner & t = hetcompute::pattern::tuner() ) const

Create a task for the pipeline for asynchronous execution. The task arguments must be bound later. Do
not call this member function if the pipeline has no stages; doing so causes a fatal error.

Parameters

t One tuner object for the pipeline (optional).

Returns

hetcompute::task_ptr<void(UserData∗..., size_t num_iterations)> The pointer to the task in which the
pipeline is running. Here, UserData∗... is for the pipeline context data (if there is one), and size_t is for
specifying the number of iterations. Both must be bound before launching the task.


#include <hetcompute/hetcompute.hh>
#include <iostream>

//
// Pipeline with context data (wcd),
// On the fly stop (ofs)
// Create task and launch later (ctlchl)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, with pipeline context data of type size_t.
    hetcompute::pattern::pipeline<size_t> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<size_t>::context context;

    // Add a serial first stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        size_t data = *ctx.get_data();
        if (iter == data - 1)
        {
            ctx.stop_pipeline();
        }
    });

    // Add a parallel stage with degree of concurrency of 4.
    p.add_stage(hetcompute::parallel_stage(4), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        size_t data = *ctx.get_data();
        // some usage of iter and data here
        HETCOMPUTE_ILOG("iter: %zu, data: %zu", iter, data);
    });

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context&) {});

    // Define the context data.
    size_t num1 = 20;
    size_t num2 = 10;

    // Asynchronous launch.
    // Create a task of a pipeline that runs for num1 iterations.
    // Run the pipeline as if the stages are using sliding windows.
    //
    // Here the total number of iterations is set to be 0 (infinite number of runs).
    // The first stage of the pipeline does dynamic checking to stop the pipeline on the fly.
    // The total number of pipeline iterations is specified by using the pipeline context data.
    p.enable_sliding_window();
    auto t1 = p.create_task();
    // Launch the pipeline, bind the arguments, and do not block.
    t1->launch(&num1, 0);

    // Create a task of a pipeline that runs for num2 iterations.
    // Run the pipeline as if the stages are not using sliding windows.
    //
    // Here the total number of iterations is set to be 0 (infinite number of runs).
    // The first stage of the pipeline does dynamic checking to stop the pipeline on the fly.
    // The total number of pipeline iterations is specified by using the pipeline context data.
    p.disable_sliding_window();
    auto t2 = p.create_task();
    // Bind the arguments to the task.
    t2->bind_all(&num2, 0);
    // Launch the pipeline and do not block.
    t2->launch();

    // Wait for the first pipeline to stop.
    t1->wait_for();
    // Wait for the second pipeline to stop.
    t2->wait_for();

    std::cout << "pipeline1 runs " << num1 << " iters" << std::endl;
    std::cout << "pipeline2 runs " << num2 << " iters" << std::endl;

    hetcompute::runtime::shutdown();
    return 0;
}


9.7.1.4.3.3 template<typename... UserData> void hetcompute::pattern::pipeline< UserData >::disable_sliding_window ( )

Disable the sliding window launch type for the pipeline. Without it, the memory footprint is no longer controlled.

9.7.1.4.3.4 template<typename... UserData> void hetcompute::pattern::pipeline< UserData >::enable_sliding_window ( )

Enable the sliding window launch type for the pipeline.

9.7.1.4.3.5 template<typename... UserData> bool hetcompute::pattern::pipeline< UserData >::is_valid ( )

Pipeline sanity check for stage IO types and sliding window size.

Returns

TRUE (pass) or FALSE (fail)

#include <hetcompute/hetcompute.hh>
#include <array>

int
main()
{
    hetcompute::runtime::init();
    const size_t num_iters = 100;

    hetcompute::pattern::pipeline<std::array<size_t, num_iters>> p;
    typedef hetcompute::pattern::pipeline<std::array<size_t, num_iters>>::context context;

    // Add a parallel stage which behaves like a serial stage.
    p.add_stage(hetcompute::parallel_stage(1), [](context&) {});

    // Add a parallel stage with doc = 8, no lag.
    p.add_stage(hetcompute::parallel_stage(8), [](context&) {});

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t i = ctx.get_iter_id();
        (*ctx.get_data())[i] = i;
    });

    // Sanity check.
    if (p.is_valid())
    {
        HETCOMPUTE_ILOG("The pipeline settings are valid.");
    }
    else
    {
        HETCOMPUTE_ILOG("The pipeline settings are not valid.");
    }

    hetcompute::runtime::shutdown();
    return 0;
}

9.7.1.4.3.6 template<typename... UserData> pipeline& hetcompute::pattern::pipeline< UserData >::operator= ( pipeline< UserData > const & other )

Copy assignment operator.

9.7.1.4.3.7 template<typename... UserData> pipeline& hetcompute::pattern::pipeline< UserData >::operator= ( pipeline< UserData > && other )

Move assignment operator.

9.7.1.4.3.8 template<typename... UserData> template<typename... Confs> void hetcompute::pattern::pipeline< UserData >::run ( UserData ∗... context_data, size_t num_iterations, Confs &&... confs ) const

Launch and wait for the pipeline.

Parameters

context_data Pointer to the data for the pipeline context if the pipeline is defined as having
one, i.e., sizeof...(UserData) == 1.
num_iterations The total number of iterations for the first stage.
confs Other configurations for running a pipeline. Currently, only a single
tuner object is supported (optional).

Note: if num_iterations == 0, the pipeline runs indefinitely until the first stage stops the
pipeline.
#include <hetcompute/hetcompute.hh>
#include <iostream>

//
// Pipeline without context data (wocd),
// Known iterations before launch (iter)
// Launch by using hetcompute free function (lch)
// Through the pipeline object (obj)
//
int
main()
{
    hetcompute::runtime::init();
    // Define a pipeline skeleton, without pipeline context data.
    hetcompute::pattern::pipeline<> p;

    // Pipeline context type.
    typedef hetcompute::pattern::pipeline<>::context context;

    // Add a serial first stage.
    p.add_stage(hetcompute::serial_stage(), [](context& ctx) {
        size_t iter = ctx.get_iter_id();
        // some usage of iter
        HETCOMPUTE_ILOG("iter: %zu", iter);
    });

    // Add a parallel stage with degree of concurrency of 4.
    p.add_stage(hetcompute::parallel_stage(4), [](context&) {});

    // Add a serial stage.
    p.add_stage(hetcompute::serial_stage(), [](context&) {});

    // Copy the pipeline.
    hetcompute::pattern::pipeline<> p1(p);

    // Synchronous launch through the run member function.
    // Launch and wait for pipeline for 15 iterations.
    // Run the pipeline as if the stages are using sliding windows.
    p1.enable_sliding_window();
    p1.run(15);

    // Launch and wait for pipeline for 25 iterations.
    // Run the pipeline as if the stages are not using sliding windows.
    p.disable_sliding_window();
    p.run(25);

    std::cout << "pipeline1 runs 15 iters" << std::endl;
    std::cout << "pipeline2 runs 25 iters" << std::endl;

    hetcompute::runtime::shutdown();
    return 0;
}

9.7.1.5 class hetcompute::pipeline_context< UserData >

template<typename UserData>class hetcompute::pipeline_context< UserData >

Pipeline_context with one user data.


Template Parameters

UserData The type for the pipeline context data.

Note: This is the pipeline_context type for the pipeline with context data, of type UserData, i.e.,
hetcompute::pattern::pipeline<UserData>. Do not use this type directly. Instead, get the member type
from the pipeline that the context is associated with, i.e., using context =
hetcompute::pattern::pipeline<UserData>::context.

Public member functions

• virtual ∼pipeline_context ()
Destructor.

• UserData ∗ get_data () const


Get the pointer to the programmer-defined context data.

• HETCOMPUTE_DELETE_METHOD (pipeline_context(pipeline_context const &other))


• HETCOMPUTE_DELETE_METHOD (pipeline_context(pipeline_context &&other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context &operator=(pipeline_context const
&other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context &operator=(pipeline_context &&other))

9.7.1.5.1 Constructors and Destructors

9.7.1.5.1.1 template<typename UserData > virtual hetcompute::pipeline_context< UserData >::∼pipeline_context ( ) [virtual]

Destructor.

9.7.1.5.2 Member Function Documentation

9.7.1.5.2.1 template<typename UserData > UserData∗ hetcompute::pipeline_context< UserData >::get_data ( ) const

Get the pointer to the programmer-defined context data.

Returns

UserData∗ The pointer to the user-defined context data, which is provided by the user when launching
the pipeline.
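
The pointer returned by get_data refers to the object the user supplied at launch, so a stage writes straight into the caller's storage. The mock below sketches that relationship in standard C++. mock_data_context and fill_by_iteration are hypothetical names introduced for illustration and are not part of the SDK.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical mock of a context that carries one user data object.
template <typename UserData>
class mock_data_context {
public:
    mock_data_context(UserData* data, std::size_t iter_id) : data_(data), iter_id_(iter_id)
    {
        assert(data != nullptr); // the caller supplies the data at launch time
    }
    UserData* get_data() const { return data_; } // non-owning pointer
    std::size_t get_iter_id() const { return iter_id_; }

private:
    UserData* data_;
    std::size_t iter_id_;
};

// Mimics a stage body that records its iteration id in the caller's array.
template <std::size_t N>
void fill_by_iteration(std::array<std::size_t, N>& out)
{
    for (std::size_t i = 0; i < N; ++i) {
        mock_data_context<std::array<std::size_t, N>> ctx(&out, i);
        (*ctx.get_data())[ctx.get_iter_id()] = ctx.get_iter_id();
    }
}
```

Because the context holds a non-owning pointer, the user data must outlive the pipeline run that uses it.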

9.7.1.6 class hetcompute::pipeline_context<>

template<>class hetcompute::pipeline_context<>

Pipeline_context with no user data.

Note: This is the pipeline_context type for the pipeline without context data, i.e.,
hetcompute::pattern::pipeline<>. Do not use this type directly. Instead, get the member type from the
pipeline that the context is associated with, i.e., using context =
hetcompute::pattern::pipeline<>::context.

Public member functions

• virtual ∼pipeline_context ()
Destructor.

• HETCOMPUTE_DELETE_METHOD (pipeline_context(pipeline_context const &other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context(pipeline_context &&other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context &operator=(pipeline_context const &other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context &operator=(pipeline_context &&other))

9.7.1.6.1 Constructors and Destructors

9.7.1.6.1.1 virtual hetcompute::pipeline_context<>::∼pipeline_context ( ) [virtual]

Destructor.

9.7.1.7 class hetcompute::pipeline_context_base

Pipeline context class.


This structure gives a user-defined pipeline stage function access to information about the pipeline and
limited control over it. Through the pipeline_context, the stage function can query the stage_id and the
iteration_id during execution and can exert some control over the underlying pipeline, such as stopping the
pipeline during execution. When defining a pipeline stage function (a function, lambda, or callable object),
the first parameter should always be of type pipeline_context.

Public member functions

• virtual ∼pipeline_context_base ()
Destructor.

• void cancel_pipeline ()
Cancel the pipeline.

• size_t get_iter_id () const
Get the current iteration id.

• size_t get_max_stage_iter () const
Get the maximum number of iterations for this stage.

• size_t get_stage_id () const
Get the current stage id.

• bool has_iter_limit () const
Check whether the maximum number of iterations for this stage is set.

• HETCOMPUTE_DELETE_METHOD (pipeline_context_base(pipeline_context_base const &other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context_base(pipeline_context_base &&other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context_base &operator=(pipeline_context_base const &other))
• HETCOMPUTE_DELETE_METHOD (pipeline_context_base &operator=(pipeline_context_base &&other))
• void stop_pipeline ()
Stop the pipeline.

9.7.1.7.1 Constructors and Destructors

9.7.1.7.1.1 virtual hetcompute::pipeline_context_base::∼pipeline_context_base ( ) [virtual]

Destructor.

9.7.1.7.2 Member Function Documentation

9.7.1.7.2.1 void hetcompute::pipeline_context_base::cancel_pipeline ( )

Use this method to cancel a pipeline. Note that hetcompute::abort_on_cancel() needs to be called in the
user-defined pipeline stage functions for proper pipeline cancellation. A pipeline can be canceled from any
stage; however, the internal state of the pipeline at the time of cancellation may be non-deterministic.

#include <array> // std::array
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    const size_t num_iters   = 100;
    const size_t cancel_iter = 50;
    const size_t doc         = 8;

    hetcompute::pattern::pipeline<std::array<size_t, num_iters>> p;
    typedef hetcompute::pattern::pipeline<std::array<size_t, num_iters>>::context context;

    // Add a serial stage
    p.add_stage(hetcompute::serial_stage(), [](context&) { hetcompute::abort_on_cancel(); });

    // Add a parallel stage with doc = 8, no lag
    p.add_stage(hetcompute::parallel_stage(doc), [cancel_iter](context& ctx) {
        size_t i = ctx.get_iter_id();
        (*ctx.get_data())[i] = i;
        if (ctx.get_iter_id() == cancel_iter - 1)
            ctx.cancel_pipeline();
        hetcompute::abort_on_cancel();
    });

    // Add a serial stage
    p.add_stage(hetcompute::serial_stage(), [](context&) { hetcompute::abort_on_cancel(); });

    // define and reset the output array
    std::array<size_t, num_iters> out_array;
    for (size_t i = 0; i < num_iters; i++)
    {
        out_array[i] = 0;
    }

    // launch with sliding window
    p.enable_sliding_window();
    p.run(&out_array, num_iters);

    // check the results: iterations well before the cancellation point must have run
    for (size_t i = 0; i < cancel_iter - doc; i++)
    {
        if (out_array[i] != i)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }
    // iterations well after the cancellation point must not have run
    for (size_t i = cancel_iter + doc - 1; i < num_iters; i++)
    {
        if (out_array[i] != 0)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }

    // parallel launching
    // reset the output arrays
    std::array<size_t, num_iters> out1_array;
    std::array<size_t, num_iters> out2_array;
    for (size_t i = 0; i < num_iters; i++)
    {
        out1_array[i] = 0;
        out2_array[i] = 0;
    }

    p.enable_sliding_window();
    auto t1 = hetcompute::create_task(p, &out1_array, num_iters);
    t1->launch();

    p.disable_sliding_window();
    auto t2 = p.create_task(&out2_array, num_iters);
    t2->launch();

    try
    {
        t1->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        HETCOMPUTE_ILOG("threw %s due to group cancellation. \n", e.what());
    }
    catch (const hetcompute::canceled_exception& e)
    {
        HETCOMPUTE_ILOG("threw %s due to group cancellation. \n", e.what());
    }
    catch (...)
    {
        // unreachable.
        return -1;
    }

    try
    {
        t2->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        HETCOMPUTE_ILOG("threw %s due to group cancellation. \n", e.what());
    }
    catch (const hetcompute::canceled_exception& e)
    {
        HETCOMPUTE_ILOG("threw %s due to group cancellation. \n", e.what());
    }
    catch (...)
    {
        // unreachable.
        return -1;
    }

    // checking the results
    for (size_t i = 0; i < cancel_iter - doc; i++)
    {
        if (out1_array[i] != i || out2_array[i] != i)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }
    for (size_t i = cancel_iter + doc - 1; i < num_iters; i++)
    {
        if (out1_array[i] != 0 || out2_array[i] != 0)
        {
            HETCOMPUTE_ILOG("The pipeline cancellation is not correct.");
            return -1;
        }
    }
    hetcompute::runtime::shutdown();
}

9.7.1.7.2.2 size_t hetcompute::pipeline_context_base::get_iter_id ( ) const

Get the current iteration id (begins from 0).

Returns

size_t Stage iteration id.

9.7.1.7.2.3 size_t hetcompute::pipeline_context_base::get_max_stage_iter ( ) const

Get the maximum number of iterations for this stage.

Returns

size_t maximum number of iterations for this stage. 0 means the maximum number is unknown and
the pipeline will be stopped or canceled dynamically during execution.

9.7.1.7.2.4 size_t hetcompute::pipeline_context_base::get_stage_id ( ) const

Get the current stage id.

Returns

size_t Stage id.

9.7.1.7.2.5 bool hetcompute::pipeline_context_base::has_iter_limit ( ) const

Check whether the maximum number of iterations for this stage is set.

Returns

true - The pipeline has an iteration limit known before running.
false - The pipeline does not have an iteration limit known before running.

See Also

pipeline_context_base::stop_pipeline()

9.7.1.7.2.6 void hetcompute::pipeline_context_base::stop_pipeline ( )

Use this method to stop a pipeline launched with an iteration limit. Calling this method on a pipeline
without an iteration limit will cause a fatal error. This method can only be called from the first stage of the
pipeline.
See Also

pipeline_context_base::has_iter_limit()

9.7.1.8 class hetcompute::serial_stage

Serial stage for specifying the type of the stages when adding to the pipeline.

Public member functions

• serial_stage (serial_stage_type t=serial_stage_type::in_order)
Constructor.

• serial_stage (serial_stage const &other)
Copy constructor.

• serial_stage (serial_stage &&other)
Move constructor.

• serial_stage_type get_type () const
Get the type of the serial stage.

• HETCOMPUTE_DELETE_METHOD (serial_stage &operator=(serial_stage &&other))

• serial_stage & operator= (serial_stage const &other)
Copy assignment operator.

9.7.1.8.1 Constructors and Destructors

9.7.1.8.1.1 hetcompute::serial_stage::serial_stage ( serial_stage_type t = serial_stage_type::in_order ) [explicit]

Constructor.

Parameters

t hetcompute::in_order (default) or

9.7.1.8.1.2 hetcompute::serial_stage::serial_stage ( serial_stage const & other )

Copy constructor.

9.7.1.8.1.3 hetcompute::serial_stage::serial_stage ( serial_stage && other ) [explicit]

Move constructor.

9.7.1.8.2 Member Function Documentation

9.7.1.8.2.1 serial_stage_type hetcompute::serial_stage::get_type ( ) const

Get the type of the serial stage.

Returns

serial_stage_type The type for the serial stage.

9.7.1.8.2.2 serial_stage& hetcompute::serial_stage::operator= ( serial_stage const & other )

Copy assignment operator.

9.7.1.9 class hetcompute::sliding_window_size

Pipeline stage sliding window size.

Public member functions

• sliding_window_size (size_t size)
Constructor.

• sliding_window_size (sliding_window_size const &other)
Copy constructor.

• sliding_window_size (sliding_window_size &&other)
Move constructor.

• size_t get_size () const
Get the size of the sliding window.

• HETCOMPUTE_DELETE_METHOD (sliding_window_size &operator=(sliding_window_size &&other))

• sliding_window_size & operator= (sliding_window_size const &other)
Copy assignment operator.

9.7.1.9.1 Constructors and Destructors

9.7.1.9.1.1 hetcompute::sliding_window_size::sliding_window_size ( size_t size ) [explicit]

Constructor.

9.7.1.9.1.2 hetcompute::sliding_window_size::sliding_window_size ( sliding_window_size const & other )

Copy constructor.

9.7.1.9.1.3 hetcompute::sliding_window_size::sliding_window_size ( sliding_window_size && other ) [explicit]

Move constructor.

9.7.1.9.2 Member Function Documentation

9.7.1.9.2.1 size_t hetcompute::sliding_window_size::get_size ( ) const

Get the size of the sliding window.

Returns

size_t Sliding window size.

9.7.1.9.2.2 sliding_window_size& hetcompute::sliding_window_size::operator= ( sliding_window_size const & other )

Copy assignment operator.

9.7.1.10 class hetcompute::stage_input

template<typename InputType>class hetcompute::stage_input< InputType >

Pipeline stage input class.


Template Parameters

InputType The data type for the stage_input, which should match the return type of the
previous stage.

Public Types

• typedef InputType input_type
Type of the input data.

Public member functions

• virtual ∼stage_input ()
Destructor.

• size_t get_first_elem_iter_id () const
Get the iter_id for the stage iteration that generates the first element.

• InputType & get_ith_element (size_t i)
Get the ith element from the input.

• HETCOMPUTE_DELETE_METHOD (stage_input(stage_input const &other))
• HETCOMPUTE_DELETE_METHOD (stage_input(stage_input &&other))
• HETCOMPUTE_DELETE_METHOD (stage_input &operator=(stage_input const &other))
• HETCOMPUTE_DELETE_METHOD (stage_input &operator=(stage_input &&other))
• InputType & operator[ ] (size_t i)
[] operator to get the ith element from the input.

• size_t size () const
Get the number of elements of type InputType in the stage input.

9.7.1.10.1 Member Typedef Documentation

9.7.1.10.1.1 template<typename InputType > typedef InputType hetcompute::stage_input< InputType >::input_type

Type of the input data.

9.7.1.10.2 Constructors and Destructors

9.7.1.10.2.1 template<typename InputType > virtual hetcompute::stage_input< InputType >::∼stage_input ( ) [virtual]

Destructor.

9.7.1.10.3 Member Function Documentation

9.7.1.10.3.1 template<typename InputType > size_t hetcompute::stage_input< InputType >::get_first_elem_iter_id ( ) const

Get the iter_id for the stage iteration that generates the first element.

Returns

size_t The iteration id in the previous stage that generates the first element in the stage_input.

9.7.1.10.3.2 template<typename InputType > InputType& hetcompute::stage_input< InputType >::get_ith_element ( size_t i )

Get the ith element from the input.

Parameters

i The index of the element to retrieve.

Returns

InputType The ith element in the input.

9.7.1.10.3.3 template<typename InputType > InputType& hetcompute::stage_input< InputType >::operator[ ] ( size_t i )

[] operator to get the ith element from the input.

Parameters

i The index of the element to retrieve.

Returns

InputType The ith element in the input.

9.7.1.10.3.4 template<typename InputType > size_t hetcompute::stage_input< InputType >::size ( ) const

Get the number of elements of type InputType in the stage input.

Returns

size_t The number of elements in the input for current iteration.

9.7.2 Typedef Documentation

9.7.2.1 typedef enum hetcompute::serial_stage_type hetcompute::serial_stage_type

Serial pipeline stage types.

9.7.3 Enumeration Type Documentation

9.7.3.1 enum hetcompute::serial_stage_type

Serial pipeline stage types.

9.8 Tuner
Classes

• class hetcompute::pattern::tuner

9.8.1 Class Documentation

9.8.1.1 class hetcompute::pattern::tuner

Public Types

• using load_type = size_t

Public member functions

• tuner ()
• size_t get_chunk_size () const
• load_type get_cpu_load () const
• size_t get_doc () const
• load_type get_dsp_load () const
• load_type get_gpu_load () const
• bool has_profile () const
• bool is_serial () const
• bool is_static () const
• tuner & set_chunk_size (size_t sz)
• tuner & set_cpu_load (load_type load)
• tuner & set_dsp_load (load_type load)
• tuner & set_dynamic ()
• tuner & set_gpu_load (load_type load)
• tuner & set_max_doc (size_t doc)
• tuner & set_profile ()
• tuner & set_serial ()
• tuner & set_static ()

9.8.1.1.1 Constructors and Destructors

9.8.1.1.1.1 hetcompute::pattern::tuner::tuner ( )

Tuner constructor. A tuner holds parameters that fine-tune various execution settings in HETCOMPUTE
patterns. Note that tuner settings are hints that the HETCOMPUTE runtime takes into account while
scheduling a pattern. Constraining factors may cause HETCOMPUTE to ignore the hints.

9.8.1.1.2 Member Function Documentation

9.8.1.1.2.1 size_t hetcompute::pattern::tuner::get_chunk_size ( ) const

Query the granularity of work stealing.

Returns

size_t minimum chunk size.

9.8.1.1.2.2 load_type hetcompute::pattern::tuner::get_cpu_load ( ) const

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the CPU.

Returns

number of units out of total work (cpu_load + gpu_load + dsp_load) to be executed on the CPU

9.8.1.1.2.3 size_t hetcompute::pattern::tuner::get_doc ( ) const

Query the maximum number of tasks launched in parallel.

Returns

size_t degree of concurrency.

9.8.1.1.2.4 load_type hetcompute::pattern::tuner::get_dsp_load ( ) const

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the DSP.

Returns

number of units out of total work (cpu_load + gpu_load + dsp_load) to be executed on the DSP

9.8.1.1.2.5 load_type hetcompute::pattern::tuner::get_gpu_load ( ) const

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), get fraction of pattern
work to be executed on the GPU.

9.8.1.1.2.6 bool hetcompute::pattern::tuner::has_profile ( ) const

Check if hetero pfor_each pattern is profiled.

Returns

bool TRUE if execution is profiled and false otherwise.

9.8.1.1.2.7 bool hetcompute::pattern::tuner::is_serial ( ) const

Check if pattern execution is serialized

Returns

bool TRUE if execution is serialized and false otherwise.

9.8.1.1.2.8 bool hetcompute::pattern::tuner::is_static ( ) const

Check if the parallelization algorithm is static chunking.

Returns

bool TRUE if using static chunking and false if using dynamic work stealing.

9.8.1.1.2.9 tuner& hetcompute::pattern::tuner::set_chunk_size ( size_t sz )

Defines granularity for work stealing. In data parallel patterns, Qualcomm HETCOMPUTE launches
multiple tasks (defined by doc) in parallel. Each task steals some iterations from other tasks when its
assigned iterations are completed. The chunk size parameter controls the minimum number of iterations a
task must finish before a stealer task can steal work from it. Consider increasing the chunk size when
each iteration performs little computation.

Parameters

sz Minimum chunk size.

Returns

tuner& reference to the tuner object.

9.8.1.1.2.10 tuner& hetcompute::pattern::tuner::set_cpu_load ( load_type load )

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the CPU.

Parameters

load Number of units out of total work (cpu_load + gpu_load + dsp_load) to execute on the CPU.

Returns

tuner& reference to the tuner object.

9.8.1.1.2.11 tuner& hetcompute::pattern::tuner::set_dsp_load ( load_type load )

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the DSP.

Parameters

load Number of units out of total work (cpu_load + gpu_load + dsp_load) to execute on the DSP.

Returns

tuner& reference to the tuner object.

9.8.1.1.2.12 tuner& hetcompute::pattern::tuner::set_dynamic ( )

Set the parallelization algorithm to dynamic work stealing (default)


Qualcomm HETCOMPUTE implements a highly efficient work-stealing algorithm and uses it as the
common backend for parallel iteration patterns. It works well with most workload types, especially for
uneven workload distribution across iterations. To improve performance further, consider tuning chunk size
(set_chunk_size) and degree of concurrency (set_max_doc).
See Also

set_chunk_size(size_t)
set_max_doc(size_t)

Returns

tuner& reference to the tuner object.

9.8.1.1.2.13 tuner& hetcompute::pattern::tuner::set_gpu_load ( load_type load )

For patterns executable heterogeneously on multiple devices (e.g. CPU, GPU, DSP), set fraction of pattern
work to execute on the GPU.

Parameters

load Number of units out of total work (cpu_load + gpu_load + dsp_load) to execute on the GPU.

Returns

tuner& reference to the tuner object.

9.8.1.1.2.14 tuner& hetcompute::pattern::tuner::set_max_doc ( size_t doc )

Defines the maximum number of tasks in parallel (degree of concurrency) for load balancing. A higher
number indicates over-subscription which might be beneficial in certain usage scenarios. doc must be
larger than zero. Otherwise, it will cause a fatal error.

Parameters

doc Degree of concurrency, set to the number of available cores by default.

Returns

tuner& reference to the tuner object

9.8.1.1.2.15 tuner& hetcompute::pattern::tuner::set_profile ( )

Enable profiling within pattern execution. Currently this is meaningful only for the hetero pfor_each
pattern, where it is used to generate an auto-tuned work distribution across heterogeneous devices.

Returns

tuner& reference to the tuner object.

9.8.1.1.2.16 tuner& hetcompute::pattern::tuner::set_serial ( )

Execute pattern sequentially

Returns

tuner& reference to the tuner object.

9.8.1.1.2.17 tuner& hetcompute::pattern::tuner::set_static ( )

Set the parallelization algorithm to static chunking.


The static chunking algorithm simply divides the iteration range equally and allocates the chunks to parallel
launched tasks. This algorithm has lower synchronization overhead compared with the default dynamic
work stealing algorithm, and features good locality. However, it does not provide load balancing, and in
most cases is outperformed by the dynamic work stealing algorithm (default).

Returns

tuner& reference to the tuner object.

10 Tasks Reference API

Tasks represent independent units of work that can be executed asynchronously. Qualcomm HetCompute
programmers are responsible for partitioning their application into tasks and organizing them into a task
graph using dependencies. This chapter documents the interfaces to create tasks, set up dependencies, and
launch (execute) tasks. It also discusses task synchronization (waiting) and cancellation. Grouping is the
mechanism for waiting on and canceling a set of tasks. Finally, attributes are a more advanced feature that
allows programmers to pass additional information about task behavior to the Qualcomm HetCompute
runtime system.

10.1 Groups
Classes

• class hetcompute::group
Groups represent sets of tasks, which are used to simplify waiting and canceling multiple tasks. More...

• class hetcompute::group_ptr
Smart pointer to a group object. More...

Functions

• group_ptr hetcompute::create_group (const char ∗name)
Creates a named group and returns a group_ptr that points to the group.

• group_ptr hetcompute::create_group (std::string const &name)
Creates a named group and returns a group_ptr that points to the group.

• group_ptr hetcompute::create_group ()
Creates a group and returns a group_ptr that points to the group.

• void hetcompute::finish_after (group ∗g)

• void hetcompute::finish_after (group_ptr const &g)
Specifies that the task invoking this function should be deemed to finish only after tasks in group g finish.

• group_ptr hetcompute::intersect (group_ptr const &a, group_ptr const &b)
Returns a pointer to a group that represents the intersection of two groups.

• bool hetcompute::operator!= (group_ptr const &g, std::nullptr_t)
Compares group g to nullptr.

• bool hetcompute::operator!= (std::nullptr_t, group_ptr const &g)
Compares nullptr to group g.

• bool hetcompute::operator!= (group_ptr const &a, group_ptr const &b)
Compares group a to group b.

• group_ptr hetcompute::operator& (group_ptr const &a, group_ptr const &b)
Returns a pointer to a group that represents the intersection of two groups.

• bool hetcompute::operator== (group_ptr const &g, std::nullptr_t)
Compares group g to nullptr.

• bool hetcompute::operator== (std::nullptr_t, group_ptr const &g)
Compares nullptr to group g.

• bool hetcompute::operator== (group_ptr const &a, group_ptr const &b)
Compares group a to group b.

10.1.1 Class Documentation

10.1.1.1 class hetcompute::group

Public member functions

• void add (task_ptr<> const &task)
Adds a task to group without launching it.

• void add (task<> ∗task)
Adds a task to group without launching it.

• void cancel ()
Cancels group.

• bool canceled () const
Checks whether the group is canceled.

• void finish_after ()
Specifies that the task invoking this function should be deemed to finish only after the tasks in the group
finish. This method returns immediately.

• std::string get_name () const
Returns the group name.

• group_ptr intersect (group_ptr const &other)
Returns a pointer to a group that represents the intersection of two groups.

• group_ptr intersect (group ∗other)

• template<typename FullType , typename FirstArg , typename... RestArgs>
void launch (hetcompute::task_ptr< FullType > const &task, FirstArg &&first_arg, RestArgs
&&...rest_args)
Binds arguments to task and launches it into the group.

• template<typename TaskType , typename FirstArg , typename... RestArgs>
void launch (hetcompute::task< TaskType > ∗task, FirstArg &&first_arg, RestArgs &&...rest_args)
Binds arguments to task and launches it into the group.

• void launch (hetcompute::task_ptr<> const &task)
Launches task into group.

• void launch (hetcompute::task<> ∗task)
Launches task into group.

• template<typename Code , typename... Args>
void launch (Code &&code, Args &&...args)
Creates a new task, binds arguments (if given) and launches it into a group.

• hc_error wait_for ()
Blocks until all tasks in the group complete execution or are canceled (that is, the group is empty).

10.1.1.1.1 Member Function Documentation

10.1.1.1.1.1 void hetcompute::group::add ( task_ptr<> const & task )

Use add to add a task to a group without launching it. For performance reasons, it is recommended
that tasks be added to groups at the time they are launched, using hetcompute::group::launch.
Use add when your algorithm requires that the task belongs to a group, but you are not yet ready to launch
the task. For example, perhaps you want to prevent the group from being empty, so you can wait on it
somewhere else.
It is possible, though not recommended for performance reasons, to use add repeatedly to add a
task to multiple groups. Repeatedly adding a task to the same group is not an error; Qualcomm HetCompute
ignores the repeated additions. If the task has previously been launched,
hetcompute::group::launch(task_ptr<> const&) and hetcompute::group::add(task_ptr<> const&)
are equivalent. For more information about tasks joining multiple groups, see Task Groups.
Regardless of the method used to add tasks to a group, the following rules always apply:

• Tasks stay in the group until they finish execution (successfully or unsuccessfully due to exceptions
or cancellation). Once a task is added to a group, there is no way to remove it from the group.
• Once a task belonging to multiple groups completes execution, Qualcomm HetCompute removes it
from all the groups to which it belongs.
• Neither completed nor canceled tasks can join groups.
• Tasks cannot be added to a canceled group.

Do not call this method if task is nullptr. This would cause a fatal error.

Parameters

task Base task-pointer.

Example 1

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group g.
    auto g = hetcompute::create_group();

    // Create task t1. Its type is hetcompute::task_ptr<void()>
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });

    // Add task t1 to group g, but do not launch it.
    g->add(t1);

    auto t2 = hetcompute::launch([t1] {
        // Launch t1. Because it already belongs to group g, there is no
        // reason to use hetcompute::group::launch.
        t1->launch();
    });

    // Wait for tasks in group g to complete.
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

Example 2

#include <stdio.h>
#include <unistd.h> // sleep
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create groups g1, g2, g3
    auto g1 = hetcompute::create_group();
    auto g2 = hetcompute::create_group();
    auto g3 = hetcompute::create_group();

    // Create task t. Its type is hetcompute::task_ptr<void(int)>
    auto t = hetcompute::create_task([](int seconds) {
        HETCOMPUTE_ILOG("Hello World from t! I'll sleep for %d seconds\n", seconds);
        sleep(seconds);
        HETCOMPUTE_ILOG("Good bye from t\n");
    });

    // Launch t into g1, let it sleep for 4 seconds.
    g1->launch(t, 4);

    // t is launched, possibly running, let's add it to g2 as well.
    g2->add(t);

    // Equivalent to g3->add(t).
    g3->launch(t);

    // Wait for g2 to be empty
    g2->wait_for();

    HETCOMPUTE_ILOG("**%s**\n", g3->get_name().c_str());
    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::group::launch(task_ptr<>const&)

10.1.1.1.1.2 void hetcompute::group::add ( task<> ∗ task )

Similar to add(task_ptr<> const&) except that it takes a pointer to a base task instead of a base
task-pointer.
Do not call this method if task is nullptr. This would cause a fatal error.

Parameters

task Pointer to base task.

See Also

hetcompute::group::launch(task_ptr<>const&)
hetcompute::group::add(task_ptr<> const&)

10.1.1.1.1.3 void hetcompute::group::cancel ( )

Marks the group as canceled and returns immediately. Once a group is canceled, it cannot revert to a
non-canceled state. Canceling a group means that:

• The tasks in the group that have not started execution will never execute.
• The tasks in the group that are executing will be canceled only when they call
hetcompute::abort_on_cancel. If any of these executing tasks is executing a
hetcompute::blocking construct, Qualcomm HetCompute executes the construct's
cancellation handler if it has not already been executed.
• Any tasks added to the group after the group is canceled are also canceled.

Because cancel returns immediately, call hetcompute::group::wait_for() afterwards to wait for all
the running tasks to complete. For more information about cancellation, see Tasks.

Example 1

#include <iostream> // std::cout
#include <unistd.h> // sleep, usleep
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create group
    auto g = hetcompute::create_group();

    // Create lambda for task body.
    auto l = [](int task_id) {
        HETCOMPUTE_ILOG("Task %d begins execution.\n", task_id);
        for (int i = 0; i < 2; ++i)
        {
            hetcompute::abort_on_cancel();
            usleep(400000);
        }
        HETCOMPUTE_ILOG("Task %d ends execution normally.\n", task_id);
    };

    // Launch many tasks
    for (int j = 0; j < 10000; ++j)
    {
        g->launch(l, j);
    }

    // Sleep for a little while, to give some tasks
    // time to completely execute.
    sleep(1);

    // Cancel group and wait for the running tasks to complete
    g->cancel();
    try
    {
        g->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (...)
    {
        // Never reached
    }
    hetcompute::runtime::shutdown();
    return 0;
}

In the example above, 10000 tasks are launched into group g. Each task prints a message when it
starts execution and another one right before it ends execution. The latter message prints only if the task does
not notice that the group has been canceled (see hetcompute::abort_on_cancel).
After launching the tasks, main sleeps for a second and then cancels the group. The next
time a running task executes hetcompute::abort_on_cancel(), it sees that its group
has been canceled and aborts. wait_for does not return before the running tasks end their execution,
either because they call hetcompute::abort_on_cancel() or because they complete their
execution without being canceled.

Example 2

#include <atomic>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

using namespace std;

int
main()
{
    hetcompute::runtime::init();
    // Counts the number of tasks that execute before the group gets
    // canceled
    atomic<size_t> counter(0);

    auto group = hetcompute::create_group();

    // Create 2000 tasks that increase an atomic counter
    for (int i = 0; i < 2000; i++)
    {
        group->launch([&counter] {
            counter++;
            usleep(7);
        });
    }

    // Cancel group
    group->cancel();

    // Wait for group to cancel
    try
    {
        group->wait_for();
    }
    catch (const hetcompute::aggregate_exception& e)
    {
        // If many tasks were canceled, they each propagate a
        // hetcompute::canceled_exception to the group, all of which get aggregated into
        // a single hetcompute::aggregate_exception.
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (const hetcompute::canceled_exception& e)
    {
        // If all but one task finished by the time group cancellation took effect,
        // then the one remaining task which was canceled will propagate a single
        // hetcompute::canceled_exception.
        std::cout << "threw " << e.what() << " due to group cancellation " << std::endl;
    }
    catch (...)
    {
        // Never reached
    }
    HETCOMPUTE_ILOG("wait_for returned after %zu tasks executed", counter.load());
    hetcompute::runtime::shutdown();
    return 0;
}

Output

wait_for returned after 87 tasks executed.

Note that this is an example output. Actual output is timing dependent.


See Also

hetcompute::abort_on_cancel()
hetcompute::group::wait_for()
hetcompute::task<>::cancel()

10.1.1.1.1.4 bool hetcompute::group::canceled ( ) const

Returns true if the group has been canceled; otherwise, returns false. For more about cancellation, see
Tasks.

Returns

true – The group is canceled.


false – The group is not canceled.

See Also

hetcompute::group::cancel()

10.1.1.1.1.5 void hetcompute::group::finish_after ( )

Exceptions

api_exception If invoked from outside a task or from within a hetcompute::pfor_each, or if 'task' points to null.

Note

If exceptions are disabled by the application, this API terminates the application if the pointer to task is
nullptr, or if it is invoked from outside a task or from within a hetcompute::pfor_each.

Example

#include <iostream>
#include <string>

#include <hetcompute/hetcompute.hh>

void display_webpage(char*);
void compose_webpages(int num_urls, char* urls[]);

void
display_webpage(char* url)
{
    auto fetchdata = hetcompute::create_task([=] {
        /*fetch(url, "fetchdata");*/
        return std::string(url) + " data";
    });
    auto fetchstyle = hetcompute::create_task([=] {
        /*fetch(url, "fetchstyle");*/
        return std::string(url) + " style";
    });
    auto render = hetcompute::create_task([](std::string data, std::string style) {
        /*render();*/
        std::cout << data + " " + style << std::endl;
    });
    // Render task may start executing only after data and style have been
    // fetched
    render->bind_all(fetchdata, fetchstyle);
    fetchdata->launch();
    fetchstyle->launch();
    render->launch();
    // Mark display_webpage as logically finishing after the render task finishes
    render->finish_after();
    // Return from function call even before any of the fetchdata, fetchstyle, or render
    // tasks finish. Such an early return makes the function asynchronous.
}

void
compose_webpages(int num_urls, char* urls[])
{
    auto g = hetcompute::create_group();
    // Start at 1: urls[0] is the program name when urls is argv
    for (int i = 1; i < num_urls; i++)
    {
        g->launch([=] { display_webpage(urls[i]); });
    }
    // Mark compose_webpages as logically finishing after all webpages have been
    // composed and displayed
    g->finish_after();
    // Return from function call before any of the tasks finish
}

int
main(int argc, char* argv[])
{
    hetcompute::runtime::init();

    // Launch compose_webpages as a task since it is an asynchronous function
    // call
    auto t = hetcompute::launch([=, &argv] { compose_webpages(argc, argv); });
    // Waits for the composite display to be rendered!
    t->wait_for();

    // Shut down the runtime before returning
    hetcompute::runtime::shutdown();
    return 0;
}

10.1.1.1.1.6 std::string hetcompute::group::get_name ( ) const

Returns string with the name of the group. If the group has no name, the returned string is empty.

Returns

std::string containing the name of the group.

See Also

hetcompute::create_group(std::string const&)

10.1.1.1.1.7 group_ptr hetcompute::group::intersect ( group_ptr const & other )

Returns a pointer to a group that represents the intersection of the group managed by ∗this and other.
Some applications require that tasks join more than one group. It is possible, though not recommended for
performance reasons, to use hetcompute::group::launch(hetcompute::task_ptr<>
const&) or hetcompute::group::add(hetcompute::task_ptr<> const&) repeatedly to
add a task to several groups. Instead, use hetcompute::group::intersect(group_ptr
const&) to create a new group that represents the intersection of all the groups where the tasks need to
launch. Again, this method is more performant than repeatedly launching the same task into different
groups.
Launching a task into the intersection group also simultaneously launches it into all the groups that are part
of the intersection.
Consecutive calls to hetcompute::group::intersect with the same group pointer as argument
return a pointer to the same group.
Group intersection is a commutative operation.
You can use the & operator instead of hetcompute::group::intersect.

Parameters

other group pointer to the group to intersect with.

Returns

group_ptr – Group pointer that points to a group that represents the intersection of ∗this and other.

See Also

hetcompute::intersect(hetcompute::group_ptr const&,
hetcompute::group_ptr const&).
hetcompute::operator&(hetcompute::group_ptr const&,
hetcompute::group_ptr const&)

10.1.1.1.1.8 template<typename FullType , typename FirstArg , typename... RestArgs> void hetcompute::group::launch ( hetcompute::task_ptr< FullType > const & task, FirstArg && first_arg, RestArgs &&... rest_args )

Binds arguments to task and launches it into the group. task must be a fully-typed task pointer to allow
argument binding, and it must not be bound already. Otherwise, launch causes a runtime error. For
more information about binding, see Tasks.
Tasks do not execute unless they are launched. By launching a task, the programmer informs the Qualcomm
HetCompute runtime that the task is ready to execute as soon as all its (control and data) dependencies have
been satisfied, required buffers, if any, are available, and a hardware context is available. For more
information about task launching, see Tasks.
Tasks can launch only once. Any subsequent calls to g->launch() do not cause the task to execute
again. Instead, they cause the task to be added to group g, if the task was not part of that group already.
When launching a task into many groups, remember that group intersection is a somewhat expensive
operation. If you need to launch into multiple groups several times, intersect the groups once and launch the
tasks into the intersection. For more information about tasks joining multiple groups, see Task Groups.

Template Parameters

FullType Task pointer type. Should be a full type (e.g., void(int, float)).
FirstArg Type of the first argument to be bound to the task.
RestArgs Types of the rest of the arguments to be bound to the task.

Parameters

task Fully-typed task pointer.


first_arg First task argument.
rest_args Rest of the task arguments.

Exceptions

api_exception If task pointer is nullptr.

Note

If exceptions are disabled by application, this API will terminate the app if pointer to task is nullptr

Example

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void(int)>
    auto hello = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    // Bind hello to 42 and launch task into g
    g->launch(hello, 42);

    // Wait for g to be empty
    g->wait_for();

    hetcompute::runtime::shutdown();
}

See Also

hetcompute::group::launch(hetcompute::task_ptr<> const&)

10.1.1.1.1.9 template<typename TaskType , typename FirstArg , typename... RestArgs> void hetcompute::group::launch ( hetcompute::task< TaskType > ∗ task, FirstArg && first_arg, RestArgs &&... rest_args )

Similar to hetcompute::group::launch(hetcompute::task_ptr<FullType> const&,
FirstArg&& first_arg, RestArgs&& ...rest_args), except that it takes a raw pointer to a task
(hetcompute::task<TaskType>∗) instead of a task pointer (hetcompute::task_ptr<FullType>).

Template Parameters

TaskType Task type. Should be a full type (e.g., void(int, float)).
FirstArg Type of the first argument to be bound to the task.
RestArgs Types of the rest of the arguments to be bound to the task.

Parameters

task Pointer to task.


first_arg First argument value to bind.
rest_args The rest of the argument values to bind.

Exceptions

api_exception If pointer to task is nullptr.

Note

If exceptions are disabled by application, this API will terminate the app if pointer to task is nullptr

See Also

hetcompute::group::launch(hetcompute::task_ptr<FullType> const&,
FirstArg&& first_arg, RestArgs&& ...rest_args)

10.1.1.1.1.10 void hetcompute::group::launch ( hetcompute::task_ptr<> const & task )

Launches task into the group. Tasks do not execute unless they are launched. By launching a task, the
programmer informs the Qualcomm HetCompute runtime that the task is ready to execute as soon as all its
(control and data) dependencies have been satisfied, required buffers (if any) are available, and a hardware
context is available. For more information about task launching, see Tasks.

A task executes only once regardless of how many times it has been launched. Therefore, any subsequent
call to launch does not cause the task to execute again. Instead, it causes the task to be added to a new
group, if the task was not part of that group already. For more information about tasks joining multiple
groups, see Task Groups.

Parameters

task Base task pointer.

Exceptions

api_exception If task pointer is nullptr.

Note

If exceptions are disabled by application, this API will terminate the app if task pointer is nullptr

Example

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void()>.
    auto hello = hetcompute::create_task([]() { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Launch hello into g.
    g->launch(hello);

    // Wait for g to be empty.
    g->wait_for();
    hetcompute::runtime::shutdown();
}

10.1.1.1.1.11 void hetcompute::group::launch ( hetcompute::task<> ∗ task )

Similar to hetcompute::group::launch(hetcompute::task_ptr<> const&), except that it
takes a raw pointer to a base task instead of a base task pointer.

Parameters

task Pointer to base task.

Exceptions

api_exception If task is nullptr.

Note

If exceptions are disabled by application, this API will terminate the app if task is nullptr

Example

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group
    auto g = hetcompute::create_group();

    // hello is a fully-typed task pointer of type
    // hetcompute::task_ptr<void()>.
    auto hello = hetcompute::create_task([]() { HETCOMPUTE_ILOG("Hello World!\n"); });

    // Get regular pointer to task.
    auto hello_ptr = hello.get();

    // Launch hello into g.
    g->launch(hello_ptr);

    // Wait for g to be empty.
    g->wait_for();
    hetcompute::runtime::shutdown();
}

10.1.1.1.1.12 template<typename Code , typename... Args> void hetcompute::group::launch ( Code && code, Args &&... args )

Creates a new task, binds arguments (if given), and launches it into a group. This is the fastest way to create
and launch a task into a group, and it is recommended whenever possible. Note, however, that
this method does not return a pointer to the task. Therefore, use this method only if the new task will not be
part of a task graph. The Qualcomm HetCompute runtime executes the task as soon as all its (control and
data) dependencies have been satisfied, required buffers (if any) are available, and a hardware context is
available. For more information about task launching, see Tasks.
The new task executes the Code passed as an argument to this method.
When creating a task that will execute on the CPU, the preferred types for Code are C++11 lambda and
hetcompute::cpu_kernel, although it is possible to use other types such as function objects and
function pointers. Use hetcompute::dsp_kernel or hetcompute::gpu_kernel to create a task
that runs on the Qualcomm Hexagon DSP or on the GPU. Regardless of the Code type, it can take up to 31
arguments.
Notice that launch makes a copy of code so that the programmer does not need to worry about the
lifetime of the code object.
launch can also launch a hetcompute pattern object directly, in the same way as a regular task, as shown in the
following example:

Examples

// Increment every element in input vector vin
auto l = [&vin](size_t i) { vin[i]++; };
auto pfor = hetcompute::pattern::create_pfor_each(l);
g->launch(pfor, size_t(0), vin.size());

Notice that launch does not support launching patterns with a non-void return value. Therefore,
programmers cannot launch preduce or pdivide_and_conquer using this group launch semantic.

Template Parameters

Code Code that the task will execute. It can be a lambda expression, function
pointer, functor, pattern, cpu_kernel, gpu_kernel, or dsp_kernel.

Parameters

code Task body.


args Arguments to bind to the parameters.

Example

#include <stdio.h>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

static void
foo()
{
    HETCOMPUTE_ILOG("Hello World! from foo()\n");
    sleep(1);
    HETCOMPUTE_ILOG("Bye from foo()\n");
}

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Create cpu_kernel that executes foo
    auto k1 = hetcompute::create_cpu_kernel(foo);

    // Create a task from a kernel and launch it
    g->launch(k1);

    // Create lambda expression l that takes two arguments
    auto l = [](int x, int y) { HETCOMPUTE_ILOG("Hello World! %d + %d = %d\n", x, y, x + y); };

    // Create tasks from l and launch them into g
    for (int i = 0; i < 3; i++)
    {
        for (int j = 42; j < 44; j++)
        {
            g->launch(l, i, j);
        }
    }

    // Wait for all the tasks in group g to complete
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

See Also

hetcompute::launch(Code&& code, Args&& ...args)


hetcompute::group::launch(hetcompute::task_ptr<FullType> const&,
FirstArg&& first_arg, RestArgs&& ...rest_args)
hetcompute::group::launch(hetcompute::task_ptr<> const&)

10.1.1.1.1.13 hc_error hetcompute::group::wait_for ( )

Blocks until all tasks in the group complete execution or are canceled. If new tasks are added to the group
while wait_for is blocking, wait_for does not return until all those new tasks also complete.
If wait_for is called from within a task, Qualcomm HetCompute context switches the task and finds
another task to run. If called from outside a task, wait_for blocks the calling thread until it returns.

Note

If exceptions are disabled by the application, wait_for returns hetcompute::hc_error instead of
throwing exceptions.

wait_for is a safe point.

Example 1

#include <hetcompute/hetcompute.hh>
#include <stdio.h>

int
main()
{
    hetcompute::runtime::init();
    // Create group g
    auto g = hetcompute::create_group();

    // Launch 10 tasks into g
    for (int i = 0; i < 10; i++)
    {
        g->launch([i] { HETCOMPUTE_ILOG("Hello World! I'm task #%d\n", i); });
    }

    // Wait for tasks to complete and exit group
    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

Waiting for a group intersection means that wait_for returns once the tasks in the
intersection group have completed execution or been canceled.

Example 2

#include <stdio.h>
#include <hetcompute/hetcompute.hh>

int
main()
{
    // Create groups
    hetcompute::group_ptr g1 = hetcompute::create_group("Example 1");
    hetcompute::group_ptr g2 = hetcompute::create_group("Example 2");
    hetcompute::group_ptr g12 = g1 & g2;

    // Create and launch two tasks that never end
    g1->launch([] {
        while (1)
        {
        }
    });

    g2->launch([] {
        while (1)
        {
        }
    });

    // Returns immediately because there are no
    // tasks that belong to both g1 and g2
    g12->wait_for();

    // Never returns
    // g1->wait_for();
    // g2->wait_for();

    g1->cancel();
    g2->cancel();

    return 0;
}

See Also

hetcompute::group::finish_after
hetcompute::task::finish_after
hetcompute::task::wait_for
hetcompute::intersect

10.1.1.2 class hetcompute::group_ptr

Smart pointer to a group object, similar to std::shared_ptr.

Public member functions

• group_ptr ()
Default constructor. Constructs a group_ptr with no group.

• group_ptr (std::nullptr_t)
Constructs a group_ptr with no group.

• group_ptr (group_ptr const &other)
Copy constructor. Constructs a group_ptr that manages the same group as other.

• group_ptr (group_ptr &&other)
Move constructor. Move-constructs a group_ptr that manages the same group as other.

• group ∗ get () const
Returns pointer to managed group.

• operator bool () const
Checks whether pointer is not nullptr.

• group ∗ operator-> () const
Dereference operator. Returns pointer to the managed group.

• group_ptr & operator= (group_ptr const &other)
Assignment operator. Assigns the group managed by other to ∗this.

• group_ptr & operator= (std::nullptr_t)
Assignment operator. Resets ∗this.

• group_ptr & operator= (group_ptr &&other)
Move-assignment operator. Move-assigns the group managed by other to ∗this.

• void reset ()
Resets pointer to managed group.

• void swap (group_ptr &other)
Exchanges managed groups between ∗this and other.

• bool unique () const
Checks whether ∗this is the only group_ptr managing the same group object.

• size_t use_count () const
Returns the number of group_ptr objects managing the same object (including ∗this).

10.1.1.2.1 Constructors and Destructors

10.1.1.2.1.1 hetcompute::group_ptr::group_ptr ( )

Constructs a group_ptr that manages no group. group_ptr::get returns nullptr.

10.1.1.2.1.2 hetcompute::group_ptr::group_ptr ( std::nullptr_t )

Constructs a group_ptr that manages no group. group_ptr::get returns nullptr.

10.1.1.2.1.3 hetcompute::group_ptr::group_ptr ( group_ptr const & other )

Constructs a group_ptr object that manages the same group as other. If other points to nullptr,
the newly built object points to nullptr as well.

Parameters

other Group pointer to copy.

10.1.1.2.1.4 hetcompute::group_ptr::group_ptr ( group_ptr && other )

Constructs a group_ptr object that manages the same group as other and resets other. If other
points to nullptr, the newly built object points to nullptr as well.

Parameters

other Group pointer to move from.

10.1.1.2.2 Member Function Documentation

10.1.1.2.2.1 group∗ hetcompute::group_ptr::get ( ) const

Returns pointer to the managed group. Remember that the lifetime of the group is defined by the lifetime of
the group_ptr objects managing it. If all group_ptr objects managing a group g go out of scope, all
group∗ pointing to g may be invalid.

Returns

Pointer to managed group object.

10.1.1.2.2.2 hetcompute::group_ptr::operator bool ( ) const [explicit]

Checks whether ∗this manages a group.

Returns

true – The pointer is not nullptr (∗this manages a group).


false – The pointer is nullptr (∗this does not manage a group).

10.1.1.2.2.3 group∗ hetcompute::group_ptr::operator-> ( ) const

Returns pointer to the managed group. Do not call this member function if ∗this does not manage a
group. This would cause a fatal error.

Returns

Pointer to managed group object.

10.1.1.2.2.4 group_ptr& hetcompute::group_ptr::operator= ( group_ptr const & other )

Assigns the group managed by other to ∗this. If, before the assignment, ∗this was the last
group_ptr pointing to a group g, then the assignment will cause g to be destroyed. If other manages
no object, ∗this will not manage an object either after the assignment.

Parameters

other Group pointer to copy.

Returns

∗this.

10.1.1.2.2.5 group_ptr& hetcompute::group_ptr::operator= ( std::nullptr_t )

Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last group_ptr
pointing to a group g, then the assignment will cause g to be destroyed.

Returns

∗this.

10.1.1.2.2.6 group_ptr& hetcompute::group_ptr::operator= ( group_ptr && other )

Move-assigns the group managed by other to ∗this. other will manage no group after the assignment.
If, before the assignment, ∗this was the last group_ptr pointing to a group g, then the assignment will
cause g to be destroyed. If other manages no object, ∗this will not manage an object either after the
assignment.

Parameters

other Group pointer to move from.

Returns

∗this.

10.1.1.2.2.7 void hetcompute::group_ptr::reset ( )

Resets pointer to managed group. If ∗this was the last group_ptr pointing to a group g, then
reset() causes g to be destroyed.

Exceptions

api_exception If group pointer is nullptr.

10.1.1.2.2.8 void hetcompute::group_ptr::swap ( group_ptr & other )

Exchanges managed groups between ∗this and other.

Parameters

other Group pointer to exchange with.

10.1.1.2.2.9 bool hetcompute::group_ptr::unique ( ) const

Checks whether ∗this is the only group_ptr managing the same group object. If ∗this does not
manage any group, unique() returns false.
It is equivalent to checking whether use_count is 1, except that it is more efficient.

Returns

true – The pointer is the only group_ptr managing the group.
false – The pointer is not the only group_ptr managing the group or ∗this is nullptr.

10.1.1.2.2.10 size_t hetcompute::group_ptr::use_count ( ) const

Returns the number of group_ptr objects managing the same object (including ∗this). Notice that the
Qualcomm HetCompute runtime keeps one internal group_ptr to a group if the group contains one or more
tasks. This is to prevent a group from disappearing while it has tasks.

Example

#include <atomic>
#include <cassert>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group g.
    auto g = hetcompute::create_group();

    // g's use_count should be 1
    HETCOMPUTE_ILOG("After construction: g.use_count() = %zu\n", g.use_count());

    // Copy-construct g2 from g. g and g2's use_count is 2.
    auto g2 = g;
    HETCOMPUTE_ILOG("After copy-construction: g2.use_count() = %zu\n", g2.use_count());

    std::atomic<bool> running(false);
    std::atomic<bool> finish(false);

    // Launch a task into g. The task keeps running until finish is set.
    g->launch([&running, &finish] {
        running = true;
        while (!finish)
        {
        }
    });

    // Wait until the task has started running.
    while (!running)
    {
    }

    // The runtime holds an internal group_ptr while g has tasks, so use_count is 3.
    HETCOMPUTE_ILOG("Task in g running. g.use_count() = %zu\n", g.use_count());

    // Call g.get to get pointer to managed group. g's use_count is still 3.
    auto g3 = g.get();
    HETCOMPUTE_ILOG("After calling g.get(). g.use_count() = %zu\n", g.use_count());

    finish = true;

    g->wait_for();

    // g's use_count should be 2
    HETCOMPUTE_ILOG("After g->wait_for: g.use_count() = %zu\n", g.use_count());

    assert(g3 != nullptr);
    HETCOMPUTE_UNUSED(g3);

    hetcompute::runtime::shutdown();
    return 0;
}

Output

After construction: g.use_count() = 1


After copy-construction: g2.use_count() = 2
Task in g running. g.use_count() = 3
After calling g.get(). g.use_count() = 3
After g->wait_for: g.use_count() = 2

Returns

Total number of group_ptr objects managing the group.

10.1.2 Function Documentation

10.1.2.1 group_ptr hetcompute::create_group ( const char ∗ name )

Creates a named group and returns a group_ptr that points to it. Named groups can facilitate debugging
of complex applications. Keep in mind that Qualcomm HetCompute makes a copy of name, which
may cause a slight overhead if you repeatedly create and destroy groups.
name does not have to be unique. Qualcomm HetCompute does not enforce uniqueness, so two or more groups can
share the same name.

Parameters

name Group name.

Returns

group_ptr – Pointer to the new group.

Example

#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::create_group()
hetcompute::create_group(std::string const&)

10.1.2.2 group_ptr hetcompute::create_group ( std::string const & name )

Creates a named group and returns a group_ptr that points to it. Named groups can facilitate debugging
of complex applications. Keep in mind that Qualcomm HetCompute makes a copy of name, which
may cause a slight overhead if you repeatedly create and destroy groups.
name does not have to be unique. Qualcomm HetCompute does not enforce uniqueness, so two or more groups can
share the same name.

Parameters

name Group name.

Returns

group_ptr – Pointer to the new group.

Example

#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::create_group()
hetcompute::create_group(const char∗)

10.1.2.3 group_ptr hetcompute::create_group ( )

Creates a group and returns a group_ptr that points to it.

Returns

group_ptr – Pointer to the new group.

Example

#include <cassert>
#include <string>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create group named "Example 1"
    auto g1 = hetcompute::create_group("Example 1");

    // Create group named "Example 2"
    std::string g2_name("Example 2");
    auto g2 = hetcompute::create_group(g2_name);

    // Create unnamed group
    auto g3 = hetcompute::create_group();

    HETCOMPUTE_ILOG("g1 name = %s\n", g1->get_name().c_str());
    HETCOMPUTE_ILOG("g2 name = %s\n", g2->get_name().c_str());
    HETCOMPUTE_ILOG("g3 name = %s\n", g3->get_name().c_str());

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::create_group(const char∗)
hetcompute::create_group(std::string const&)

10.1.2.4 void hetcompute::finish_after ( group ∗ g )

PRIVATE

10.1.2.5 void hetcompute::finish_after ( group_ptr const & g )

Specifies that the task invoking this function should be deemed to finish only after tasks in g finish. This
method returns immediately.
If the invoking task is multi-threaded, the programmer must ensure that concurrent calls to
finish_after from within the task are properly synchronized.

Parameters

g Group pointer.

Exceptions

api_exception If invoked from outside a task or from within a hetcompute::pfor_each, or if g points to nullptr.

Note

If exceptions are disabled by the application, this API terminates the application in the error conditions listed above.

10.1.2.6 group_ptr hetcompute::intersect ( group_ptr const & a, group_ptr const & b )

Returns a pointer to a group that represents the intersection of two groups. Some applications require that
tasks join more than one group. It is possible, though not recommended for performance reasons, to use
hetcompute::group::launch(hetcompute::task_ptr<> const&) or
hetcompute::group::add(hetcompute::task_ptr<> const&) repeatedly to add a task to several groups.
Instead, use hetcompute::intersect(group_ptr const&, group_ptr const&) to create
a new group that represents the intersection of all the groups where the tasks need to launch. This
method is more performant than repeatedly launching the same task into different groups.
Launching a task into the intersection group also simultaneously launches it into all the groups that are part
of the intersection.
Consecutive calls to hetcompute::intersect with the same group pointers as arguments return a
pointer to the same group.
Group intersection is a commutative operation.
You can use the & operator instead of hetcompute::intersect.

Parameters

a Group pointer to the first group.


b Group pointer to the second group.

Returns

group_ptr – Group pointer that points to a group that represents the intersection of a and b.


Example

1 #include <hetcompute/hetcompute.hh>
2
3 int
4 main()
5 {
6 hetcompute::runtime::init();
7 // Create groups
8 auto g1 = hetcompute::create_group("Group 1");
9 auto g2 = hetcompute::create_group("Group 2");
10
11 auto g12 = hetcompute::intersect(g1, g2);
12
13 for (int i = 0; i < 3000; i++)
14 g1->launch([] {
15 //... Do something
16 });
17
18 for (int i = 0; i < 2000; i++)
19 g2->launch([] {
20 //... Do something
21 });
22
23 // Returns immediately. g12 is empty
24 g12->wait_for();
25
26 // Return only after tasks in g1 and g2 complete
27 g1->wait_for();
28 g2->wait_for();
29
30 g12->launch([] {
31 //... Calculate the Ultimate Question of Life,
32 // the Universe, and Everything
33 HETCOMPUTE_ILOG("42\n");
34 });
35
36 // All will return after the task prints 42
37 g1->wait_for();
38 g2->wait_for();
39 hetcompute::runtime::shutdown();
40
41 return 0;
42 }

The example above shows an application with three groups: g1, g2, and their intersection g12. We
launch thousands of tasks into both g1 and g2. We then wait for g12 (line 24), but
g12->wait_for() returns immediately because g12 is empty: at this point no task
belongs to both g1 and g2. We then launch a task into g12 (line 30). The final g1->wait_for() and
g2->wait_for() calls return only after the task in g12 completes execution because it belongs to g1,
g2, and g12.
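
The bookkeeping described for hetcompute::intersect can be illustrated with a small self-contained model. This is only a sketch: it does not use the HetCompute headers and is not the SDK's implementation, and toy_group, toy_group_ptr, toy_intersect, and the pending counter are hypothetical names introduced here. It mirrors three documented properties: intersecting the same two groups twice yields the same object, the operation is commutative, and a task launched into the intersection is also counted in both parent groups.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <map>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical stand-in for a group: it only counts the tasks launched into it.
struct toy_group
{
    int pending = 0;                                  // tasks launched so far
    std::vector<std::shared_ptr<toy_group>> parents;  // non-empty only for intersections

    // Launching into an intersection also launches into every parent group.
    void launch()
    {
        ++pending;
        for (auto& p : parents)
            p->launch();
    }
};

using toy_group_ptr = std::shared_ptr<toy_group>;

// Returns the cached intersection of a and b, creating it on first use.
// The key stores the smaller address first, so (a, b) and (b, a) share an entry.
inline toy_group_ptr toy_intersect(toy_group_ptr const& a, toy_group_ptr const& b)
{
    assert(a && b);
    if (a == b)
        return a; // a group intersected with itself is that group

    static std::map<std::pair<std::uintptr_t, std::uintptr_t>, toy_group_ptr> cache;
    auto const x = reinterpret_cast<std::uintptr_t>(a.get());
    auto const y = reinterpret_cast<std::uintptr_t>(b.get());
    std::pair<std::uintptr_t, std::uintptr_t> const key(std::min(x, y), std::max(x, y));

    auto it = cache.find(key);
    if (it != cache.end())
        return it->second;

    auto g = std::make_shared<toy_group>();
    g->parents = {a, b};
    cache.emplace(key, g);
    return g;
}
```

In this model, launching one task into g1 and then one into toy_intersect(g1, g2) leaves g1 with two counted tasks and g2 with one, which is why waiting on either parent group also waits for the task launched into the intersection.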

See Also

hetcompute::operator&(hetcompute::group_ptr const&,
hetcompute::group_ptr const&)

10.1.2.7 group_ptr hetcompute::operator& ( group_ptr const & a, group_ptr const & b )


Parameters

a Group pointer to the first group to intersect.


b Group pointer to the second group to intersect.

Returns

group_ptr – Group pointer that points to a group that represents the intersection of a and b.

See Also

hetcompute::intersect(hetcompute::group_ptr const&,
hetcompute::group_ptr const&).


10.2 Kernels
Classes

• struct hetcompute::beta::call_tuple< Dim, Args >
Utility base template to get the tuple type.

• struct hetcompute::beta::call_tuple< Dim, gpu_kernel< Args...> >
Utility template to get the tuple type for GPU pipeline stages.

• class hetcompute::beta::cl_t
Type used for declaring the constant hetcompute::beta::cl.

• class hetcompute::cpu_kernel< Fn >
A wrapper around a function object.

• class hetcompute::cpu_kernel< FReturnType(FArgs...)>
A wrapper around a function.

• class hetcompute::dsp_kernel< Fn >

• class hetcompute::dsp_kernel< int(*)(Args...)>

• class hetcompute::beta::gl_t
Type used for declaring the constant hetcompute::beta::gl.

• class hetcompute::gpu_kernel< Args >
A wrapper around OpenCL C kernels and OpenGL ES compute shaders for GPU compute.

• class hetcompute::local< T >
Used as a template parameter to hetcompute::gpu_kernel to indicate a locally allocated parameter.

Functions

• template<typename FReturnType , typename... FArgs>
hetcompute::cpu_kernel< FReturnType(FArgs...)> hetcompute::create_cpu_kernel (FReturnType(*fn)(FArgs...))
Create a cpu_kernel object from a function.

• template<typename Fn >
hetcompute::cpu_kernel< typename std::remove_reference< Fn >::type > hetcompute::create_cpu_kernel (Fn &&fn)
Create a cpu_kernel object from a function object.

• template<typename... Args>
hetcompute::dsp_kernel< int(*)(Args...)> hetcompute::create_dsp_kernel (int(*fn)(Args...))

• template<typename... Args>
gpu_kernel< Args...> hetcompute::create_gpu_kernel (std::string const &cl_kernel_str, std::string const &cl_kernel_name, std::string const &cl_build_options="")

• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::cl_t const &, std::string const &cl_kernel_str, std::string const &cl_kernel_name, std::string const &cl_build_options="")

• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::gl_t const &, std::string const &gl_kernel_str)

• template<typename... Args>
gpu_kernel< Args...> hetcompute::create_gpu_kernel (void const *cl_kernel_bin, size_t cl_kernel_len, std::string const &cl_kernel_name, std::string const &cl_build_options="")

• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::cl_t const &, void const *cl_kernel_bin, size_t cl_kernel_len, std::string const &cl_kernel_name, std::string const &cl_build_options="")

• template<typename... Args>
gpu_kernel< Args...> hetcompute::beta::create_gpu_kernel (beta::gl_t const &, void const *gl_kernel_bin, size_t gl_kernel_len)

Variables

• static HETCOMPUTE_CONSTEXPR_CONST size_type hetcompute::dsp_kernel< int(*)(Args...)>::arity = parent::arity
• static HETCOMPUTE_CONSTEXPR_CONST size_type hetcompute::cpu_kernel< Fn >::arity = parent::arity
• static HETCOMPUTE_CONSTEXPR_CONST size_type hetcompute::cpu_kernel< FReturnType(FArgs...)>::arity = parent::arity

• cl_t const hetcompute::beta::cl
Used to explicitly indicate creation of an OpenCL kernel.

• gl_t const hetcompute::beta::gl {}
Used to explicitly indicate creation of an OpenGL ES compute kernel.

10.2.1 Class Documentation

10.2.1.1 struct hetcompute::beta::call_tuple

template<size_t Dim, typename... Args>struct hetcompute::beta::call_tuple< Dim, Args >

Utility template to get the tuple type of hetcompute::range, and other types.
The wrapped type can be accessed through call_tuple<...>::type.


Template Parameters

Dim dimension of hetcompute::range.


Args... other types.
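
As a rough illustration of how a trait of this shape exposes its result through a nested type alias, the following self-contained sketch defines a look-alike. The names toy_range and toy_call_tuple are hypothetical, and the tuple layout (the range first, followed by the other types) is an assumption made for the example; the manual only states that the tuple type combines hetcompute::range with the other types.

```cpp
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical stand-in for hetcompute::range<Dim>.
template <std::size_t Dim>
struct toy_range
{
};

// Sketch of a call_tuple-style trait. The nested alias `type` names a tuple
// whose first element is the range and whose remaining elements are Args...
// (assumed layout, not confirmed by the manual).
template <std::size_t Dim, typename... Args>
struct toy_call_tuple
{
    using type = std::tuple<toy_range<Dim>, Args...>;
};

// The wrapped type is reached through ::type, as with call_tuple<...>::type.
static_assert(std::is_same<toy_call_tuple<2, int, float>::type,
                           std::tuple<toy_range<2>, int, float>>::value,
              "type is the tuple of the range followed by the other types");
```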

10.2.1.2 struct hetcompute::beta::call_tuple< Dim, gpu_kernel< Args...> >

template<size_t Dim, typename... Args>struct hetcompute::beta::call_tuple< Dim, gpu_kernel< Args...> >

Utility template to get the tuple type of hetcompute::range and the GPU kernel argument types. Use case: get
the return type of the before synchronization lambda for a GPU pipeline stage.
The wrapped type can be accessed through call_tuple<...>::type.

Template Parameters

Dim dimension of hetcompute::range.


Args... GPU kernel argument type list.

See Also

template<typename... Args> void add_gpu_stage(Args&&... args)

Data fields

Type Field Description


type

10.2.1.3 class hetcompute::beta::cl_t


See Also

hetcompute::gpu_kernel

10.2.1.4 class hetcompute::cpu_kernel

template<typename Fn>class hetcompute::cpu_kernel< Fn >

A cpu_kernel object contains CPU executable code. It can be used to create tasks. When such a task
runs, it executes the function object in its cpu_kernel.
See Also

cpu_kernel<FReturnType(FArgs...)>

Public Types

• using args_tuple = typename parent::args_tuple


• using collapsed_task_type = typename parent::collapsed_task_type
• using non_collapsed_task_type = typename parent::non_collapsed_task_type


• using return_type = typename parent::return_type


• using size_type = typename parent::size_type

Public member functions

• cpu_kernel (Fn const &fn)


Constructor.

• cpu_kernel (Fn &&fn)


Constructor.

• cpu_kernel (cpu_kernel const &other)


Copy constructor.

• cpu_kernel (cpu_kernel &&other)


Move constructor.

• ∼cpu_kernel ()
Destructor.

• bool is_big () const


Returns whether a cpu_kernel object is meant for a big core in a big.LITTLE SoC.

• bool is_blocking () const


Returns whether this cpu_kernel object is blocking.

• bool is_little () const


Returns whether a cpu_kernel object is meant for a LITTLE core in a big.LITTLE SoC.

• cpu_kernel & operator= (cpu_kernel const &other)


Copy assignment.

• cpu_kernel & operator= (cpu_kernel &&other)


Move assignment.

• cpu_kernel & set_big ()


Set this cpu_kernel object as meant for a big core in a big.LITTLE SoC.

• cpu_kernel & set_blocking ()


Set this cpu_kernel object as blocking.

• cpu_kernel & set_little ()


Set this cpu_kernel object as meant for a LITTLE core in a big.LITTLE SoC.

Static Public Attributes

• static HETCOMPUTE_CONSTEXPR_CONST size_type arity = parent::arity


Friends

• struct ::hetcompute::internal::cpu_kernel_caller
• template<typename X , typename Y , typename Z >
struct ::hetcompute::internal::task_factory
• template<typename X , typename Y >
struct ::hetcompute::internal::task_factory_dispatch

10.2.1.4.1 Constructors and Destructors

10.2.1.4.1.1 template<typename Fn > hetcompute::cpu_kernel< Fn >::cpu_kernel ( Fn const & fn ) [explicit]

Parameters

fn An lvalue function object.

10.2.1.4.1.2 template<typename Fn > hetcompute::cpu_kernel< Fn >::cpu_kernel ( Fn && fn ) [explicit]

Parameters

fn An rvalue function object.

10.2.1.4.1.3 template<typename Fn > hetcompute::cpu_kernel< Fn >::cpu_kernel ( cpu_kernel< Fn > const & other )

Parameters

other Another cpu_kernel object.

10.2.1.4.1.4 template<typename Fn > hetcompute::cpu_kernel< Fn >::cpu_kernel ( cpu_kernel< Fn > && other )

Parameters

other Another cpu_kernel object.

10.2.1.5 class hetcompute::cpu_kernel< FReturnType(FArgs...)>

template<typename FReturnType, typename... FArgs>class hetcompute::cpu_kernel< FReturnType(FArgs...)>

A cpu_kernel object contains CPU executable code. It can be used to create tasks. When such a task
runs, it executes the function in its cpu_kernel.


See Also

cpu_kernel<FReturnType(FArgs...)>

Public Types

• using args_tuple = typename parent::args_tuple


• using collapsed_task_type = typename parent::collapsed_task_type
• using non_collapsed_task_type = typename parent::non_collapsed_task_type
• using return_type = typename parent::return_type
• using size_type = typename parent::size_type

Public member functions

• cpu_kernel (FReturnType(∗fn)(FArgs...))
Constructor.

• cpu_kernel (cpu_kernel const &other)


Copy constructor.

• cpu_kernel (cpu_kernel &&other)


Move constructor.

• ∼cpu_kernel ()
Destructor.

• bool is_big () const


Returns whether a cpu_kernel object is meant for a big core in a big.LITTLE SoC.

• bool is_blocking () const


Returns whether a cpu_kernel object is blocking.

• bool is_little () const


Returns whether a cpu_kernel object is meant for a LITTLE core in a big.LITTLE SoC.

• cpu_kernel & operator= (cpu_kernel const &other)


Copy assignment.

• cpu_kernel & operator= (cpu_kernel &&other)


Move assignment.

• cpu_kernel & set_big ()


Set this cpu_kernel object as meant for a big core in a big.LITTLE SoC.

• cpu_kernel & set_blocking ()


Set a cpu_kernel object as blocking.

• cpu_kernel & set_little ()


Set this cpu_kernel object as meant for a LITTLE core in a big.LITTLE SoC.


Static Public Attributes

• static HETCOMPUTE_CONSTEXPR_CONST size_type arity = parent::arity

Friends

• struct ::hetcompute::internal::cpu_kernel_caller
• template<typename X , typename Y , typename Z >
struct ::hetcompute::internal::task_factory
• template<typename X , typename Y >
struct ::hetcompute::internal::task_factory_dispatch

10.2.1.5.1 Constructors and Destructors

10.2.1.5.1.1 template<typename FReturnType , typename... FArgs> hetcompute::cpu_kernel< FReturnType(FArgs...)>::cpu_kernel ( FReturnType(*)(FArgs...) fn ) [explicit]

Parameters

fn A function name or function pointer.

10.2.1.5.1.2 template<typename FReturnType , typename... FArgs> hetcompute::cpu_kernel< FReturnType(FArgs...)>::cpu_kernel ( cpu_kernel< FReturnType(FArgs...)> const & other )

Parameters

other Another cpu_kernel object.

10.2.1.5.1.3 template<typename FReturnType , typename... FArgs> hetcompute::cpu_kernel< FReturnType(FArgs...)>::cpu_kernel ( cpu_kernel< FReturnType(FArgs...)> && other )

Parameters

other Another cpu_kernel object.

10.2.1.6 class hetcompute::dsp_kernel

template<typename Fn>class hetcompute::dsp_kernel< Fn >


10.2.1.7 class hetcompute::dsp_kernel< int(∗)(Args...)>

template<typename... Args>class hetcompute::dsp_kernel< int(∗)(Args...)>

For this DSP kernel, the template signature corresponds to the DSP kernel’s parameter list.
Template Parameters

Args Arguments of the DSP function run by the kernel.

See Also

create_dsp_kernel() for creating a dsp_kernel.

Public Types

• using args_tuple = typename parent::args_tuple


• using collapsed_task_type = typename parent::collapsed_task_type
• using fn_type = typename parent::dsp_code_type
• using non_collapsed_task_type = typename parent::non_collapsed_task_type
• using return_type = typename parent::return_type
• using size_type = typename parent::size_type

Public member functions

• dsp_kernel (fn_type const &fn)


• dsp_kernel (fn_type &&fn)
• dsp_kernel (dsp_kernel const &other)
• dsp_kernel (dsp_kernel &&other)
• ∼dsp_kernel ()
• dsp_kernel & operator= (dsp_kernel const &other)
• dsp_kernel & operator= (dsp_kernel &&other)

Static Public Attributes

• static HETCOMPUTE_CONSTEXPR_CONST size_type arity = parent::arity

Friends

• template<typename X , typename Y , typename Z >


struct ::hetcompute::internal::task_factory
• template<typename X , typename Y >
struct ::hetcompute::internal::task_factory_dispatch


10.2.1.7.1 Constructors and Destructors

10.2.1.7.1.1 template<typename... Args> hetcompute::dsp_kernel< int(*)(Args...)>::dsp_kernel ( fn_type const & fn ) [explicit]

Constructor

Parameters

fn The DSP function to be called.

10.2.1.7.1.2 template<typename... Args> hetcompute::dsp_kernel< int(*)(Args...)>::dsp_kernel ( fn_type && fn ) [explicit]

Constructor

Parameters

fn The DSP function to be called.

10.2.1.7.1.3 template<typename... Args> hetcompute::dsp_kernel< int(*)(Args...)>::dsp_kernel ( dsp_kernel< int(*)(Args...)> && other )

Move constructor.

10.2.1.7.1.4 template<typename... Args> hetcompute::dsp_kernel< int(∗)(Args...)>::∼dsp_kernel ( )

Destructor.

10.2.1.7.2 Member Function Documentation

10.2.1.7.2.1 template<typename... Args> dsp_kernel& hetcompute::dsp_kernel< int(*)(Args...)>::operator= ( dsp_kernel< int(*)(Args...)> const & other )

Copy assignment operator.

10.2.1.7.2.2 template<typename... Args> dsp_kernel& hetcompute::dsp_kernel< int(*)(Args...)>::operator= ( dsp_kernel< int(*)(Args...)> && other )

Move assignment operator.

10.2.1.8 class hetcompute::beta::gl_t


See Also

hetcompute::gpu_kernel


10.2.1.9 class hetcompute::gpu_kernel

template<typename... Args>class hetcompute::gpu_kernel< Args >

A wrapper around OpenCL C kernels and OpenGL ES compute shaders for GPU compute. The template
signature corresponds to the GPU kernel parameter list.
See Also

hetcompute::create_gpu_kernel for creating a gpu_kernel.

Public member functions

• gpu_kernel (std::string const &cl_kernel_str, std::string const &cl_kernel_name, std::string const &cl_build_options="")
Constructor, implicit for OpenCL kernel.

• gpu_kernel (beta::cl_t const &, std::string const &cl_kernel_str, std::string const &cl_kernel_name, std::string const &cl_build_options="")
Constructor, explicit for OpenCL kernel.

• gpu_kernel (beta::gl_t const &, std::string const &gl_kernel_str)
Constructor, explicit for OpenGL ES compute kernel.

• gpu_kernel (void const *cl_kernel_bin, size_t cl_kernel_len, std::string const &cl_kernel_name, std::string const &cl_build_options="")
Constructor, implicit for precompiled OpenCL kernel.

• gpu_kernel (beta::cl_t const &, void const *cl_kernel_bin, size_t cl_kernel_len, std::string const &cl_kernel_name, std::string const &cl_build_options="")
Constructor, explicit for precompiled OpenCL kernel.

• std::pair< void const *, size_t > get_cl_kernel_binary () const
Extracts the CL binary. Error if invoked on a non-OpenCL gpu_kernel.

• HETCOMPUTE_DEFAULT_METHOD (gpu_kernel(gpu_kernel const &))
• HETCOMPUTE_DEFAULT_METHOD (gpu_kernel &operator=(gpu_kernel const &))
• HETCOMPUTE_DEFAULT_METHOD (gpu_kernel(gpu_kernel &&))
• HETCOMPUTE_DEFAULT_METHOD (gpu_kernel &operator=(gpu_kernel &&))

• bool is_cl () const
Identifies if kernel type is OpenCL.

• bool is_gl () const
Identifies if kernel type is OpenGL ES.


Friends

• template<typename GPUKernel , typename... CallArgs>


void hetcompute::internal::execute_gpu (GPUKernel const &gk, CallArgs &&...args)
• template<typename GPUKernel , size_t Dims, typename... CallArgs>
struct hetcompute::internal::executor
• template<typename Code , class Enable >
struct hetcompute::internal::task_factory_dispatch

10.2.1.9.1 Constructors and Destructors

10.2.1.9.1.1 template<typename... Args> hetcompute::gpu_kernel< Args >::gpu_kernel ( std::string const & cl_kernel_str, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Constructor, implicit for OpenCL kernel.

Parameters

cl_kernel_str The OpenCL C kernel code as a string.


cl_kernel_name The name of the kernel function to be called.
cl_build_options Build options to pass to OpenCL (optional).

10.2.1.9.1.2 template<typename... Args> hetcompute::gpu_kernel< Args >::gpu_kernel ( beta::cl_t const & , std::string const & cl_kernel_str, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Constructor, explicit for OpenCL kernel.

• Pass hetcompute::beta::cl to explicitly select this OpenCL kernel constructor.


Parameters

cl_kernel_str The OpenCL C kernel code as a string.


cl_kernel_name The name of the kernel function to be called.
cl_build_options Build options to pass to OpenCL (optional).

10.2.1.9.1.3 template<typename... Args> hetcompute::gpu_kernel< Args >::gpu_kernel ( beta::gl_t const & , std::string const & gl_kernel_str )

Constructor, explicit for OpenGL ES compute kernel.

• Pass hetcompute::beta::gl to explicitly select this OpenGL ES kernel constructor.


Parameters

gl_kernel_str The OpenGL ES shader code as a string.
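
The cl and gl arguments follow the tag-dispatch idiom: an empty type whose only purpose is to pick a constructor overload. The sketch below shows the idiom in isolation. Everything in the toy namespace is a hypothetical stand-in written for this example, not SDK code, and it omits the build options and precompiled-binary forms.

```cpp
#include <cassert>
#include <string>
#include <utility>

namespace toy {

// Empty tag types; their only job is to select a constructor overload.
struct cl_t {};
struct gl_t {};
const cl_t cl{};
const gl_t gl{};

class kernel
{
public:
    // Implicit form: no tag, treated as an OpenCL kernel.
    kernel(std::string src, std::string name)
        : _src(std::move(src)), _name(std::move(name)), _is_cl(true)
    {
    }

    // Explicit OpenCL form: the unnamed tag parameter is never read.
    kernel(cl_t const&, std::string src, std::string name)
        : kernel(std::move(src), std::move(name))
    {
    }

    // Explicit OpenGL ES form: a shader string only, no kernel name.
    kernel(gl_t const&, std::string src)
        : _src(std::move(src)), _name(), _is_cl(false)
    {
    }

    bool is_cl() const { return _is_cl; }
    bool is_gl() const { return !_is_cl; }

private:
    std::string _src;
    std::string _name;
    bool _is_cl;
};

} // namespace toy
```

With these overloads, toy::kernel(toy::gl, source) can only resolve to the OpenGL ES constructor, because a gl_t value does not convert to std::string.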


10.2.1.9.1.4 template<typename... Args> hetcompute::gpu_kernel< Args >::gpu_kernel ( void const * cl_kernel_bin, size_t cl_kernel_len, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Constructor, implicit for precompiled OpenCL kernel.

Parameters

cl_kernel_bin The pointer to a precompiled OpenCL C kernel binary.


cl_kernel_len The length of the precompiled kernel in bytes.
cl_kernel_name The name of the kernel function to be called.
cl_build_options Build options to pass to OpenCL (optional).

10.2.1.9.1.5 template<typename... Args> hetcompute::gpu_kernel< Args >::gpu_kernel ( beta::cl_t const & , void const * cl_kernel_bin, size_t cl_kernel_len, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Constructor, explicit for precompiled OpenCL kernel.

• Pass hetcompute::beta::cl to explicitly select this OpenCL kernel constructor.


Parameters

cl_kernel_bin The pointer to a precompiled OpenCL C kernel binary.


cl_kernel_len The length of the precompiled kernel in bytes.
cl_kernel_name The name of the kernel function to be called.
cl_build_options Build options to pass to OpenCL (optional).

10.2.1.9.2 Member Function Documentation

10.2.1.9.2.1 template<typename... Args> std::pair<void const*, size_t> hetcompute::gpu_kernel< Args >::get_cl_kernel_binary ( ) const

Extracts the CL binary. Error if invoked on a non-OpenCL gpu_kernel.

Returns

std::pair consisting of a pointer to an allocated buffer holding the CL binary and the size of the
allocated buffer (sized to hold the binary) in bytes.

Note

Each invocation of this function internally allocates a new buffer of an appropriate size using new[].
The user code is responsible for deleting the buffer after use by calling delete[].
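
Because the caller owns the returned buffer, one way to make sure delete[] runs exactly once is to hand the pointer to a smart pointer immediately. The sketch below is self-contained and does not call the SDK: toy_get_binary is a hypothetical stand-in for get_cl_kernel_binary(), and binary_owner and adopt_binary are names introduced here. It assumes the buffer was allocated as an array of unsigned char; the manual does not state the element type, so check this against the SDK headers before adopting the pattern.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <memory>
#include <utility>

// Hypothetical stand-in for gpu_kernel::get_cl_kernel_binary(): returns a
// freshly new[]-allocated buffer and its size in bytes.
// Assumption: the buffer is an array of unsigned char.
inline std::pair<void const*, std::size_t> toy_get_binary()
{
    static const unsigned char blob[] = {0xCA, 0xFE, 0xBA, 0xBE};
    unsigned char* copy = new unsigned char[sizeof(blob)];
    std::memcpy(copy, blob, sizeof(blob));
    return std::pair<void const*, std::size_t>(copy, sizeof(blob));
}

// Owning handle whose deleter calls delete[] on the adopted buffer.
using binary_owner = std::unique_ptr<unsigned char const[]>;

// Takes ownership of the buffer in bin.first. The cast restores the element
// type assumed above so that delete[] is applied to a typed pointer.
inline binary_owner adopt_binary(std::pair<void const*, std::size_t> const& bin)
{
    return binary_owner(static_cast<unsigned char const*>(bin.first));
}
```

The size in the pair is still needed after adoption, for example when writing the binary to a file, so keep it alongside the owning pointer.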


10.2.1.9.2.2 template<typename... Args> bool hetcompute::gpu_kernel< Args >::is_cl ( ) const

Identifies if kernel type is OpenCL.

Returns

true if this is an OpenCL kernel, false otherwise.

10.2.1.9.2.3 template<typename... Args> bool hetcompute::gpu_kernel< Args >::is_gl ( ) const

Identifies if kernel type is OpenGL ES.

Returns

true if this is an OpenGL ES kernel, false otherwise.

10.2.1.10 class hetcompute::local

template<typename T>class hetcompute::local< T >

Used as a template parameter to hetcompute::gpu_kernel to indicate a locally allocated parameter.


Corresponds to a __local parameter of an OpenCL kernel. During task creation or launch, the
corresponding argument takes the size of the local allocation in number of elements of type T.
See Also

hetcompute::gpu_kernel

Example:
// Raw string literal: an ordinary string literal cannot span several lines.
const char* kernel_string = R"CLC(
__kernel void k(__global int *a,
                __global int *b,
                __local int *c)
{
    ...
}
)CLC";

hetcompute::gpu_kernel<hetcompute::buffer_ptr<int>,
                       hetcompute::buffer_ptr<int>,
                       hetcompute::local<int>> gk(kernel_string, "k");

// Pass the __local size in number of elements (not number of bytes as for OpenCL)
int number_of_ints = number_of_bytes / sizeof(int);
auto t = hetcompute::create_task(gk, r, buf_a, buf_b, number_of_ints);
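
The byte-to-element conversion in the example above is easy to get wrong when porting code that passed a byte count to OpenCL. A minimal helper, with the hypothetical name local_elements, makes the arithmetic explicit; it is plain C++ and not part of the SDK.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Converts a local-memory budget in bytes into the element count that a
// hetcompute::local<T> argument is documented to expect.
// Integer division truncates, so a budget that is not a multiple of
// sizeof(T) is rounded down to whole elements.
template <typename T>
constexpr std::size_t local_elements(std::size_t number_of_bytes)
{
    return number_of_bytes / sizeof(T);
}

static_assert(local_elements<std::int32_t>(256) == 64,
              "256 bytes hold 64 32-bit integers");
```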

10.2.2 Function Documentation

10.2.2.1 template<typename FReturnType , typename... FArgs> hetcompute::cpu_kernel<FReturnType(FArgs...)> hetcompute::create_cpu_kernel ( FReturnType(*)(FArgs...) fn )

Create a cpu_kernel object that executes a given function. This kernel object can then be used to create
a task.


Parameters

fn A function name or function pointer.

Returns

cpu_kernel A kernel object that executes the given function.

See Also

create_cpu_kernel(Fn&& fn)

10.2.2.2 template<typename Fn > hetcompute::cpu_kernel<typename std::remove_reference<Fn>::type> hetcompute::create_cpu_kernel ( Fn && fn )

Create a cpu_kernel object that executes a given function object. A function object (also called a
functor) is any object with the () operator defined, such as a lambda expression. This kernel object can
then be used to create a task.

Parameters

fn A function object such as a lambda expression.

Returns

cpu_kernel A kernel object that executes the given function object.

See Also

create_cpu_kernel(FReturnType(∗fn)(FArgs...))
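
The return type of this overload applies std::remove_reference to Fn. The reason is a general C++ rule rather than anything specific to the SDK: when an lvalue is passed to a forwarding reference, Fn is deduced as a reference type, which would be unsuitable as the stored callable type. The self-contained sketch below reproduces the pattern with hypothetical names (toy_kernel, make_toy_kernel); it is not the SDK's implementation.

```cpp
#include <cassert>
#include <type_traits>
#include <utility>

// Hypothetical miniature of a kernel wrapper; it stores the callable by value.
template <typename Fn>
class toy_kernel
{
public:
    explicit toy_kernel(Fn const& fn) : _fn(fn) {}
    explicit toy_kernel(Fn&& fn) : _fn(std::move(fn)) {}

    // Forwards the call to the stored callable.
    template <typename... Args>
    auto operator()(Args&&... args)
        -> decltype(std::declval<Fn&>()(std::forward<Args>(args)...))
    {
        return _fn(std::forward<Args>(args)...);
    }

private:
    Fn _fn;
};

// With a forwarding reference, an lvalue argument deduces Fn as `F&`.
// remove_reference strips the reference so the wrapper always holds a plain F.
template <typename Fn>
toy_kernel<typename std::remove_reference<Fn>::type> make_toy_kernel(Fn&& fn)
{
    using plain = typename std::remove_reference<Fn>::type;
    return toy_kernel<plain>(std::forward<Fn>(fn));
}
```

Passing a named lambda selects the copying constructor, and passing a temporary selects the moving one; in both cases the resulting wrapper type is the same.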

10.2.2.3 template<typename... Args> hetcompute::dsp_kernel<int (*)(Args...)> hetcompute::create_dsp_kernel ( int(*)(Args...) fn )

This template creates a DSP kernel executable by Qualcomm HetCompute SDK. The template signature
corresponds to the DSP kernel parameter list.
The kernel code is specified as a C language function. The function returns an int that corresponds to the
status. When the function returns something other than 0, the Qualcomm HetCompute runtime will trigger
a hetcompute::dsp_exception().
Template Parameters

dsp_function The DSP function pointer.


// DSP kernel creation
auto hex_kernel = hetcompute::create_dsp_kernel<>(hetcompute_dsp_array_is_prime);

// Create the task that will be executed on the DSP
auto hex_task = hetcompute::create_task(hex_kernel,
                                        in_buf,   // in access recognized
                                        out_buf); // out access recognized
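
The status convention described above (zero means success, any other return value becomes an exception) can be sketched without the SDK as follows. The names toy_dsp_error, run_checked, and toy_copy are hypothetical, and toy_dsp_error merely stands in for hetcompute::dsp_exception; how the runtime actually raises its exception is not shown in this manual.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical exception type standing in for hetcompute::dsp_exception.
struct toy_dsp_error : std::runtime_error
{
    explicit toy_dsp_error(int status)
        : std::runtime_error("DSP function returned status " + std::to_string(status))
    {
    }
};

// Calls an int-returning C-style function and turns a non-zero status
// into an exception.
template <typename... Params, typename... Args>
void run_checked(int (*fn)(Params...), Args&&... args)
{
    int const status = fn(std::forward<Args>(args)...);
    if (status != 0)
        throw toy_dsp_error(status);
}

// Example function following the convention: 0 on success, non-zero on failure.
inline int toy_copy(int const* in, int* out, int n)
{
    if (in == nullptr || out == nullptr || n < 0)
        return -1;
    for (int i = 0; i < n; ++i)
        out[i] = in[i];
    return 0;
}
```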


10.2.2.4 template<typename... Args> gpu_kernel<Args...> hetcompute::create_gpu_kernel ( std::string const & cl_kernel_str, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Creates a GPU kernel executable by HETCOMPUTE, implicitly for OpenCL. The template signature
corresponds to the GPU kernel parameter list.
The kernel code is specified as a string of OpenCL C code.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.

Parameters

cl_kernel_str The OpenCL C kernel code as a string.


cl_kernel_name The name of the kernel function to be called.
cl_build_options The build options to pass to OpenCL (optional).

Returns

A gpu_kernel object.

10.2.2.5 template<typename... Args> gpu_kernel<Args...> hetcompute::beta::create_gpu_kernel ( beta::cl_t const & , std::string const & cl_kernel_str, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenCL. The template signature
corresponds to the GPU kernel parameter list.
The kernel code is specified as a string of OpenCL C code.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.

• Pass hetcompute::beta::cl to explicitly select OpenCL.


Parameters

cl_kernel_str The OpenCL C kernel code as a string.


cl_kernel_name The name of the kernel function to be called.
cl_build_options The build options to pass to OpenCL (optional).

Returns

A gpu_kernel object.

10.2.2.6 template<typename... Args> gpu_kernel<Args...> hetcompute::beta::create_gpu_kernel ( beta::gl_t const & , std::string const & gl_kernel_str )

Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenGL ES. The template signature
corresponds to the GPU kernel parameter list.
The OpenGL ES shader code is specified as a string.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.

• Pass hetcompute::beta::gl to explicitly select OpenGL ES.


Parameters

gl_kernel_str The OpenGL ES shader code as a string.

Returns

A gpu_kernel object.

10.2.2.7 template<typename... Args> gpu_kernel<Args...> hetcompute::create_gpu_kernel ( void const * cl_kernel_bin, size_t cl_kernel_len, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Creates a GPU kernel executable by HETCOMPUTE, implicitly for OpenCL, using a precompiled OpenCL
kernel. The template signature corresponds to the GPU kernel parameter list.
The kernel code is specified as a prebuilt OpenCL kernel binary.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.

Parameters

cl_kernel_bin The pointer to a precompiled OpenCL C kernel binary.


cl_kernel_len The length of the precompiled kernel in bytes.
cl_kernel_name The name of the kernel function to be called.
cl_build_options The build options to pass to OpenCL (optional).

Returns

A gpu_kernel object.

10.2.2.8 template<typename... Args> gpu_kernel<Args...> hetcompute::beta::create_gpu_kernel ( beta::cl_t const & , void const * cl_kernel_bin, size_t cl_kernel_len, std::string const & cl_kernel_name, std::string const & cl_build_options = "" )

Creates a GPU kernel executable by HETCOMPUTE, explicitly for OpenCL, using a precompiled OpenCL
kernel. The template signature corresponds to the GPU kernel parameter list.
The kernel code is specified as a prebuilt OpenCL kernel binary.
Equivalent to calling the hetcompute::gpu_kernel constructor directly.

• Pass hetcompute::beta::cl to explicitly select OpenCL.


Parameters

cl_kernel_bin The pointer to a precompiled OpenCL C kernel binary.


cl_kernel_len The length of the precompiled kernel in bytes.
cl_kernel_name The name of the kernel function to be called.
cl_build_options The build options to pass to OpenCL (optional).

Returns

A gpu_kernel object.

10.2.3 Variable Documentation

10.2.3.1 cl_t const hetcompute::beta::cl


See Also

hetcompute::gpu_kernel

10.2.3.2 gl_t const hetcompute::beta::gl {}


See Also

hetcompute::gpu_kernel


10.3 Indices
Classes

• class hetcompute::index< Dims >


• class hetcompute::index< 1 >
• class hetcompute::index< 2 >
• class hetcompute::index< 3 >
• class hetcompute::index_base< Dims >

10.3.1 Class Documentation

10.3.1.1 class hetcompute::index

template<size_t Dims>class hetcompute::index< Dims >

10.3.1.2 class hetcompute::index< 1 >

template<>class hetcompute::index< 1 >

Public member functions

• index (const std::array< size_t, 1 > &rhs)


• index (size_t i)
• void print () const

Additional Inherited Members

10.3.1.3 class hetcompute::index< 2 >

template<>class hetcompute::index< 2 >

Public member functions

• index (const std::array< size_t, 2 > &rhs)


• index (size_t i, size_t j)
• void print () const

Additional Inherited Members

10.3.1.4 class hetcompute::index< 3 >

template<>class hetcompute::index< 3 >


Public member functions

• index (const std::array< size_t, 3 > &rhs)


• index (size_t i, size_t j, size_t k)
• void print () const

Additional Inherited Members

10.3.1.5 class hetcompute::index_base

template<size_t Dims>class hetcompute::index_base< Dims >

Methods common to 1D, 2D, and 3D index objects are listed here. The value for Dims can be 1, 2, or 3.

Public member functions

• index_base (const std::array< size_t, Dims > &rhs)


• const std::array< size_t, Dims > & data () const
• bool operator!= (const index_base< Dims > &rhs) const
• index_base< Dims > operator+ (const index_base< Dims > &rhs)
• index_base< Dims > & operator+= (const index_base< Dims > &rhs)
• index_base< Dims > operator- (const index_base< Dims > &rhs)
• index_base< Dims > & operator-= (const index_base< Dims > &rhs)
• bool operator< (const index_base< Dims > &rhs) const
• bool operator<= (const index_base< Dims > &rhs) const
• index_base< Dims > & operator= (const index_base< Dims > &rhs)
• bool operator== (const index_base< Dims > &rhs) const
• bool operator> (const index_base< Dims > &rhs) const
• bool operator>= (const index_base< Dims > &rhs) const
• size_t & operator[ ] (size_t i)
• const size_t & operator[ ] (size_t i) const

Protected Attributes

• std::array< size_t, Dims > _data

10.3.1.5.1 Constructors and Destructors

10.3.1.5.1.1 template<size_t Dims> hetcompute::index_base< Dims >::index_base ( const std::array< size_t, Dims > & rhs ) [explicit]

Constructs an index_base object from an std::array.


Parameters

rhs std::array to be used for constructing a new object.

10.3.1.5.2 Member Function Documentation

10.3.1.5.2.1 template<size_t Dims> const std::array<size_t, Dims>& hetcompute::index_base< Dims >::data ( ) const

Returns a const reference to an std::array holding all coordinates of the index_base object.

Returns

Const reference to an std::array of all coordinates of an index_base object.

10.3.1.5.2.2 template<size_t Dims> bool hetcompute::index_base< Dims >::operator!= ( const index_base< Dims > & rhs ) const

Checks for inequality of this with another index_base object.

Parameters

rhs Reference to index_base to be compared with this.

Returns

TRUE – The two indices have different values.


FALSE – The two indices have the same values.

10.3.1.5.2.3 template<size_t Dims> index_base<Dims> hetcompute::index_base< Dims >::operator+ ( const index_base< Dims > & rhs )

Sums the corresponding values of the current index_base object and another index_base object and returns
a new index_base object.

Parameters

rhs index_base object to be used for summing with the values of the current
object.
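
The difference between the value-returning and the in-place form can be sketched without the SDK. The block below is a minimal illustration on plain std::array values; coords, add, add_in_place, and demo_add are hypothetical names introduced here and are not part of hetcompute.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical stand-in for the coordinate storage of an index object.
template <std::size_t Dims>
using coords = std::array<std::size_t, Dims>;

// Element-wise sum that returns a new value and leaves both inputs
// untouched, mirroring the behaviour described for operator+.
template <std::size_t Dims>
coords<Dims> add(const coords<Dims>& lhs, const coords<Dims>& rhs)
{
    coords<Dims> out{};
    for (std::size_t d = 0; d < Dims; ++d)
        out[d] = lhs[d] + rhs[d];
    return out;
}

// In-place variant that modifies lhs and returns a reference to it,
// mirroring the behaviour described for operator+=.
template <std::size_t Dims>
coords<Dims>& add_in_place(coords<Dims>& lhs, const coords<Dims>& rhs)
{
    for (std::size_t d = 0; d < Dims; ++d)
        lhs[d] += rhs[d];
    return lhs;
}

// Small self-check showing the two forms side by side.
inline void demo_add()
{
    coords<2> a{{1, 2}};
    const coords<2> b{{10, 20}};
    const coords<2> c = add(a, b);  // a is unchanged
    assert(c[0] == 11 && c[1] == 22 && a[0] == 1);
    add_in_place(a, b);             // a now holds the sum
    assert(a[0] == 11 && a[1] == 22);
}
```

With the SDK, the same distinction would apply to index objects directly; the sketch only shows the element-wise arithmetic.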

10.3.1.5.2.4 template<size_t Dims> index_base<Dims>& hetcompute::index_base< Dims >::operator+= ( const index_base< Dims > & rhs )

Sums the corresponding values of the current index_base object and another index_base object and returns
a reference to the current index_base object.


Parameters

rhs index_base object to be used for summing with the values of the current object.

10.3.1.5.2.5 template<size_t Dims> index_base<Dims> hetcompute::index_base< Dims >::operator- ( const index_base< Dims > & rhs )

Subtracts the values of another index_base object from the corresponding values of the current index_base
object and returns a new index_base object.

Parameters

rhs index_base object whose values are subtracted from those of the current object.

10.3.1.5.2.6 template<size_t Dims> index_base<Dims>& hetcompute::index_base< Dims >::operator-= ( const index_base< Dims > & rhs )

Subtracts the values of another index_base object from the corresponding values of the current index_base
object and returns a reference to the current index_base object.

Parameters

rhs index_base object whose values are subtracted from those of the current object.

10.3.1.5.2.7 template<size_t Dims> bool hetcompute::index_base< Dims >::operator< ( const index_base< Dims > & rhs ) const

Checks if this object is less than another index_base object. Performs a lexicographical comparison of two
index_base objects, similar to std::lexicographical_compare().

Parameters

rhs Reference to the index_base to be compared with this.

Returns

TRUE – If this is lexicographically smaller than rhs.

FALSE – If this is lexicographically larger than or equal to rhs.
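
The ordering described above can be illustrated on plain std::array values. This is a minimal sketch that does not use the SDK: lex_less and demo_lex_less are hypothetical names, and the sketch assumes that coordinate 0 is the most significant one, which is what std::lexicographical_compare yields when applied from the first array element.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>

// Lexicographical "less than" on coordinate arrays: the first coordinate
// that differs decides the result; equal arrays are not "less".
template <std::size_t Dims>
bool lex_less(const std::array<std::size_t, Dims>& a,
              const std::array<std::size_t, Dims>& b)
{
    return std::lexicographical_compare(a.begin(), a.end(),
                                        b.begin(), b.end());
}

inline void demo_lex_less()
{
    const std::array<std::size_t, 2> p{{1, 9}};
    const std::array<std::size_t, 2> q{{2, 0}};
    assert(lex_less(p, q));   // first coordinate decides: 1 < 2
    assert(!lex_less(q, p));
    assert(!lex_less(p, p));  // equal values compare as "not less"
}
```

Note that under this ordering {1, 9} is smaller than {2, 0} even though its second coordinate is larger.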


10.3.1.5.2.8 template<size_t Dims> bool hetcompute::index_base< Dims >::operator<= ( const index_base< Dims > & rhs ) const

Checks if this object is less than or equal to another index_base object. Performs a lexicographical comparison
of two index_base objects, similar to std::lexicographical_compare().

Parameters

rhs Reference to the index_base to be compared with this.

Returns

TRUE – If this is lexicographically smaller than or equal to rhs.

FALSE – If this is lexicographically larger than rhs.

10.3.1.5.2.9 template<size_t Dims> index_base<Dims>& hetcompute::index_base< Dims >::operator= ( const index_base< Dims > & rhs )

Replaces the contents of the current index_base object with those of another index_base object.

Parameters

rhs index_base object whose contents replace those of the current object.

10.3.1.5.2.10 template<size_t Dims> bool hetcompute::index_base< Dims >::operator== ( const index_base< Dims > & rhs ) const

Compares this with another index_base object.

Parameters

rhs Reference to index_base to be compared with this.

Returns

TRUE – The two indices have the same values.


FALSE – The two indices have different values.

10.3.1.5.2.11 template<size_t Dims> bool hetcompute::index_base< Dims >::operator> ( const index_base< Dims > & rhs ) const

Checks if this object is greater than another index_base object. Does a lexicographical comparison of two
index_base objects, similar to std::lexicographical_compare().


Parameters

rhs Reference to index_base to be compared with this.

Returns

TRUE – If this is lexicographically larger than rhs.

FALSE – If this is lexicographically smaller than or equal to rhs.

10.3.1.5.2.12 template<size_t Dims> bool hetcompute::index_base< Dims >::operator>= ( const index_base< Dims > & rhs ) const

Checks if this object is greater than or equal to another index_base object. Performs a lexicographical
comparison of two index_base objects, similar to std::lexicographical_compare().

Parameters

rhs Reference to index_base to be compared with this.

Returns

TRUE – If this is lexicographically larger than or equal to rhs.

FALSE – If this is lexicographically smaller than rhs.

10.3.1.5.2.13 template<size_t Dims> size_t& hetcompute::index_base< Dims >::operator[ ] ( size_t i )

Returns a reference to the i-th coordinate of the index_base object. No bounds checking is performed.

Parameters

i Specifies which coordinate to return.

Returns

Reference to i-th coordinate of the index_base object.

10.3.1.5.2.14 template<size_t Dims> const size_t& hetcompute::index_base< Dims >::operator[ ] ( size_t i ) const

Returns a const reference to the i-th coordinate of the index_base object. No bounds checking is performed.

Parameters

i Specifies which coordinate to return.


Returns

Const reference to i-th coordinate of the index_base object.


10.4 Ranges
Classes

• class hetcompute::range< Dims >


• class hetcompute::range< 1 >
• class hetcompute::range< 2 >
• class hetcompute::range< 3 >
• class hetcompute::range_base< Dims >

10.4.1 Class Documentation

10.4.1.1 class hetcompute::range

template<size_t Dims>class hetcompute::range< Dims >

10.4.1.2 class hetcompute::range< 1 >

template<>class hetcompute::range< 1 >

A 1-dimensional range.
// 1d vector sum using hetcompute::range<1>
constexpr size_t N = 100; // size of vector
std::vector<size_t> v(N);
hetcompute::range<1> r(0, N);
std::atomic<size_t> sum(0);

// initialize the vector
for (size_t i = 0; i < N; i++)
    v[i] = i + 1;

// compute the sum in parallel; v is captured by reference to avoid copying it
hetcompute::pfor_each(r, [&v, &sum](hetcompute::index<1> idx) {
    sum += v[idx[0]];
});

Public member functions

• range ()
• range (const std::array< size_t, 1 > &bb, const std::array< size_t, 1 > &ee)
• range (const std::array< size_t, 1 > &bb, const std::array< size_t, 1 > &ee, const std::array<
size_t, 1 > &ss)
• range (size_t b0, size_t e0, size_t s0)
• range (size_t b0, size_t e0)
• range (size_t e0)
• size_t index_to_linear (const hetcompute::index< 1 > &it) const
• bool is_empty () const
• hetcompute::index< 1 > linear_to_index (size_t idx) const


• size_t linearized_distance () const


• void print () const
• size_t size () const

Additional Inherited Members

10.4.1.2.1 Constructors and Destructors

10.4.1.2.1.1 hetcompute::range< 1 >::range ( )

Creates an empty 1D range.

10.4.1.2.1.2 hetcompute::range< 1 >::range ( size_t b0, size_t e0, size_t s0 )

Creates a 1D range that spans [b0, e0) and advances in steps of s0. A fatal error occurs if b0 is greater
than or equal to e0. s0 should be greater than 0.

Parameters

b0 Beginning of 1D range.
e0 End of 1D range.
s0 Stride of 1D range.
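
As an illustration of the half-open, strided interval described above, the following standalone sketch lists the values that a range [b0, e0) with stride s0 would visit. It does not use the SDK: strided_points and demo_strided_points are hypothetical helpers, and the assert calls only mimic the documented preconditions rather than reproduce the SDK's error handling.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Lists the values visited by the half-open interval [b0, e0) when
// stepping by s0. Hypothetical helper, not an SDK function.
std::vector<std::size_t> strided_points(std::size_t b0, std::size_t e0,
                                        std::size_t s0)
{
    assert(b0 < e0);  // documented: b0 >= e0 is a fatal error
    assert(s0 > 0);   // documented: the stride should be greater than 0
    std::vector<std::size_t> pts;
    for (std::size_t i = b0; i < e0; i += s0)
        pts.push_back(i);
    return pts;
}

inline void demo_strided_points()
{
    // [0, 10) with stride 3 visits 0, 3, 6, 9; the end value 10 is excluded.
    const std::vector<std::size_t> expected{0, 3, 6, 9};
    assert(strided_points(0, 10, 3) == expected);
}
```

The last visited value need not be e0 - 1; it is the largest b0 + k * s0 that is still smaller than e0.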

10.4.1.2.1.3 hetcompute::range< 1 >::range ( size_t b0, size_t e0 ) [explicit]

Creates a 1D range that spans [b0, e0).

A fatal error occurs if b0 is greater than or equal to e0.

Parameters

b0 Beginning of 1D range.
e0 End of 1D range.

10.4.1.2.1.4 hetcompute::range< 1 >::range ( size_t e0 ) [explicit]

Creates a 1D range that spans [0, e0).

Parameters

e0 End of 1D range.

10.4.1.2.2 Member Function Documentation


10.4.1.2.2.1 size_t hetcompute::range< 1 >::index_to_linear ( const hetcompute::index< 1 > & it ) const

Converts a hetcompute::index<1> object to a linear number with respect to the current range object.

Parameters

it hetcompute::index<1> object

Returns

A linear ID with respect to the current range object.

10.4.1.2.2.2 hetcompute::index<1> hetcompute::range< 1 >::linear_to_index ( size_t idx ) const

Converts a linear ID to a hetcompute::index<1> with respect to the current range object.

Parameters

idx Linear ID to be converted to an hetcompute::index<1> object.

Returns

hetcompute::index<1> with respect to the current range object.

10.4.1.2.2.3 size_t hetcompute::range< 1 >::linearized_distance ( ) const

Returns the linearized distance of the range.

Returns

Linearized distance of the range, that is, the product of length() in each dimension.

10.4.1.2.2.4 size_t hetcompute::range< 1 >::size ( ) const

Returns the size of the range.

Returns

Size of the range, product of the number of elements in each dimension.

10.4.1.3 class hetcompute::range< 2 >


template<>class hetcompute::range< 2 >

A 2-dimensional range.
// fill in a 2d matrix by tiles
constexpr size_t N = 20; // size of matrix
constexpr size_t TILE_SIZE = 5;
size_t a[N][N];

// define a 2d range with stride
hetcompute::range<2> r(0, N, TILE_SIZE, 0, N, TILE_SIZE);

// fill in each tile with the linear index of the tile
hetcompute::pfor_each(r, [&a, r, TILE_SIZE](hetcompute::index<2> idx) {
    size_t id = r.index_to_linear(idx);
    // iterate through the tile
    hetcompute::range<2> rt(idx[0], idx[0] + TILE_SIZE, idx[1], idx[1] + TILE_SIZE);
    for (size_t i = rt.begin(0); i < rt.end(0); i++)
        for (size_t j = rt.begin(1); j < rt.end(1); j++)
            a[i][j] = id;
});

Public member functions

• range ()
• range (const std::array< size_t, 2 > &bb, const std::array< size_t, 2 > &ee)
• range (const std::array< size_t, 2 > &bb, const std::array< size_t, 2 > &ee, const std::array<
size_t, 2 > &ss)
• range (size_t b0, size_t e0, size_t s0, size_t b1, size_t e1, size_t s1)
• range (size_t b0, size_t e0, size_t b1, size_t e1)
• range (size_t e0, size_t e1)
• size_t index_to_linear (const hetcompute::index< 2 > &it) const
• bool is_empty () const
• hetcompute::index< 2 > linear_to_index (size_t idx) const
• size_t linearized_distance () const
• void print () const
• size_t size () const

Additional Inherited Members

10.4.1.3.1 Constructors and Destructors

10.4.1.3.1.1 hetcompute::range< 2 >::range ( )

Creates an empty 2D range.


10.4.1.3.1.2 hetcompute::range< 2 >::range ( size_t b0, size_t e0, size_t s0, size_t b1, size_t e1,
size_t s1 )

Creates a 2D range comprising the points of the cross product [b0:e0:s0) x [b1:e1:s1).
A fatal error occurs if b0 is greater than or equal to e0 or if b1 is greater than or equal to e1. s0 and
s1 should be greater than 0.

Parameters

b0 First coordinate of the beginning of 2D range.


e0 First coordinate of the end of 2D range.
s0 Stride of the first dimension of 2D range.
b1 Second coordinate of the beginning of 2D range.
e1 Second coordinate of the end of 2D range.
s1 Stride of the second dimension of 2D range.

10.4.1.3.1.3 hetcompute::range< 2 >::range ( size_t b0, size_t e0, size_t b1, size_t e1 )

Creates a 2D range comprising the points of the cross product [b0, e0) x [b1, e1).
A fatal error occurs if b0 is greater than or equal to e0 or if b1 is greater than or equal to e1.

Parameters

b0 First coordinate of the beginning of 2D range.


e0 First coordinate of the end of 2D range.
b1 Second coordinate of the beginning of 2D range.
e1 Second coordinate of the end of 2D range.

10.4.1.3.1.4 hetcompute::range< 2 >::range ( size_t e0, size_t e1 )

Creates a 2D range comprising the points of the cross product [0, e0) x [0, e1).

Parameters

e0 First coordinate of the end of 2D range.


e1 Second coordinate of the end of 2D range.

10.4.1.3.2 Member Function Documentation

10.4.1.3.2.1 size_t hetcompute::range< 2 >::index_to_linear ( const hetcompute::index< 2 > & it ) const

Converts a hetcompute::index<2> object to a linear number with respect to the current range object.


Parameters

it hetcompute::index<2> object

Returns

A linear ID with respect to the current range object.

10.4.1.3.2.2 hetcompute::index<2> hetcompute::range< 2 >::linear_to_index ( size_t idx ) const

Converts a linear ID to a hetcompute::index<2> with respect to the current range object.

Parameters

idx Linear ID to be converted to a hetcompute::index<2> object.

Returns

hetcompute::index<2> with respect to the current range object.
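
The idea of a linear ID and its inverse can be sketched without the SDK. This manual does not state which ordering index_to_linear uses, so the block below is only one possible convention and may differ from the SDK: it assumes a unit-stride box [b0, e0) x [b1, e1) in which the second coordinate varies fastest. box2, to_linear, to_index, and demo_round_trip are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical unit-stride 2D box [b0, e0) x [b1, e1); not an SDK type.
struct box2
{
    std::size_t b0, e0, b1, e1;
};

// Assumed convention: offsets are taken from the beginning of the box and
// the second coordinate varies fastest.
std::size_t to_linear(const box2& r, const std::array<std::size_t, 2>& idx)
{
    const std::size_t len1 = r.e1 - r.b1;
    return (idx[0] - r.b0) * len1 + (idx[1] - r.b1);
}

// Inverse of to_linear under the same assumed convention.
std::array<std::size_t, 2> to_index(const box2& r, std::size_t linear)
{
    const std::size_t len1 = r.e1 - r.b1;
    return {{r.b0 + linear / len1, r.b1 + linear % len1}};
}

inline void demo_round_trip()
{
    const box2 r{2, 5, 10, 14};  // 3 x 4 points
    const std::array<std::size_t, 2> idx{{3, 12}};
    const std::size_t id = to_linear(r, idx);  // (3 - 2) * 4 + (12 - 10)
    assert(id == 6);
    assert(to_index(r, id) == idx);            // the mapping is invertible
}
```

Whatever ordering the SDK uses, the two conversions are expected to be inverses of each other for indices that belong to the range.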

10.4.1.3.2.3 size_t hetcompute::range< 2 >::linearized_distance ( ) const

Returns the linearized distance of the range.

Returns

Linearized distance of the range, that is, the product of length() in each dimension.

10.4.1.3.2.4 size_t hetcompute::range< 2 >::size ( ) const

Returns the size of the range.

Returns

Size of the range, product of the number of elements in each dimension.

10.4.1.4 class hetcompute::range< 3 >

template<>class hetcompute::range< 3 >

A 3-dimensional range.
// 6-point stencil in 3-d
constexpr size_t N = 10; // size of matrix
constexpr size_t TILE_SIZE = 2;
float a[N][N][N];

// define a 3d range
hetcompute::range<3> r(0, N, TILE_SIZE, 0, N, TILE_SIZE, 0, N, TILE_SIZE);

// initialize all the tiles in parallel
hetcompute::pfor_each(r, [&a, TILE_SIZE](hetcompute::index<3> idx) {
    // iterate through the tile
    hetcompute::range<3> rt(idx[0], idx[0] + TILE_SIZE, idx[1], idx[1] + TILE_SIZE,
                            idx[2], idx[2] + TILE_SIZE);
    for (size_t i = rt.begin(0); i < rt.end(0); i++)
        for (size_t j = rt.begin(1); j < rt.end(1); j++)
            for (size_t k = rt.begin(2); k < rt.end(2); k++)
                if (i == 0 || j == 0 || k == 0)
                    a[i][j][k] = 1;
                else
                    a[i][j][k] = 0;
});

// stencil: define a range excluding the border elements
hetcompute::range<3> ri(1, N - 1, TILE_SIZE, 1, N - 1, TILE_SIZE, 1, N - 1, TILE_SIZE);
hetcompute::pfor_each(ri, [&a, TILE_SIZE](hetcompute::index<3> idx) {
    // iterate through the tile
    hetcompute::range<3> rt(idx[0], idx[0] + TILE_SIZE, idx[1], idx[1] + TILE_SIZE,
                            idx[2], idx[2] + TILE_SIZE);
    for (size_t i = rt.begin(0); i < rt.end(0); i++)
        for (size_t j = rt.begin(1); j < rt.end(1); j++)
            for (size_t k = rt.begin(2); k < rt.end(2); k++)
            {
                a[i][j][k] = (a[i - 1][j][k] + a[i + 1][j][k] + a[i][j - 1][k] + a[i][j + 1][k] +
                              a[i][j][k - 1] + a[i][j][k + 1]) / 6.0f;
            }
});

Public member functions

• range ()
• range (const std::array< size_t, 3 > &bb, const std::array< size_t, 3 > &ee)
• range (const std::array< size_t, 3 > &bb, const std::array< size_t, 3 > &ee, const std::array<
size_t, 3 > &ss)
• range (size_t b0, size_t e0, size_t s0, size_t b1, size_t e1, size_t s1, size_t b2, size_t e2, size_t s2)
• range (size_t b0, size_t e0, size_t b1, size_t e1, size_t b2, size_t e2)
• range (size_t e0, size_t e1, size_t e2)
• size_t index_to_linear (const hetcompute::index< 3 > &it) const
• bool is_empty () const
• hetcompute::index< 3 > linear_to_index (size_t idx) const
• size_t linearized_distance () const
• void print () const
• size_t size () const

Additional Inherited Members


10.4.1.4.1 Constructors and Destructors

10.4.1.4.1.1 hetcompute::range< 3 >::range ( )

Creates an empty 3D range.

10.4.1.4.1.2 hetcompute::range< 3 >::range ( size_t b0, size_t e0, size_t s0, size_t b1, size_t e1,
size_t s1, size_t b2, size_t e2, size_t s2 )

Creates a 3D range comprising the points of the cross product [b0:e0:s0) x [b1:e1:s1) x [b2:e2:s2).
A fatal error occurs if b0 is greater than or equal to e0, if b1 is greater than or equal to e1, or if b2
is greater than or equal to e2. s0, s1, and s2 should be greater than 0.

Parameters

b0 First coordinate of the beginning of 3D range.


e0 First coordinate of the end of 3D range.
s0 Stride of the first dimension of 3D range.
b1 Second coordinate of the beginning of 3D range.
e1 Second coordinate of the end of 3D range.
s1 Stride of the second dimension of 3D range.
b2 Third coordinate of the beginning of 3D range.
e2 Third coordinate of the end of 3D range.
s2 Stride of the third dimension of 3D range.

10.4.1.4.1.3 hetcompute::range< 3 >::range ( size_t b0, size_t e0, size_t b1, size_t e1, size_t b2,
size_t e2 )

Creates a 3D range comprising the points of the cross product [b0, e0) x [b1, e1) x [b2, e2).
A fatal error occurs if b0 is greater than or equal to e0, if b1 is greater than or equal to e1, or if b2
is greater than or equal to e2.

Parameters

b0 First coordinate of the beginning of 3D range.


e0 First coordinate of the end of 3D range.
b1 Second coordinate of the beginning of 3D range.
e1 Second coordinate of the end of 3D range.
b2 Third coordinate of the beginning of 3D range.
e2 Third coordinate of the end of 3D range.

10.4.1.4.1.4 hetcompute::range< 3 >::range ( size_t e0, size_t e1, size_t e2 )

Creates a 3D range comprising the points of the cross product [0, e0) x [0, e1) x [0, e2).


Parameters

e0 First coordinate of the end of 3D range.


e1 Second coordinate of the end of 3D range.
e2 Third coordinate of the end of 3D range.

10.4.1.4.2 Member Function Documentation

10.4.1.4.2.1 size_t hetcompute::range< 3 >::index_to_linear ( const hetcompute::index< 3 > & it ) const

Converts a hetcompute::index<3> object to a linear number with respect to the current range object.

Parameters

it hetcompute::index<3> object.

Returns

A linear ID with respect to the current range object.

10.4.1.4.2.2 hetcompute::index<3> hetcompute::range< 3 >::linear_to_index ( size_t idx ) const

Converts a linear ID to a hetcompute::index<3> with respect to the current range object.

Parameters

idx Linear ID to be converted to a hetcompute::index<3> object.

Returns

hetcompute::index<3> with respect to the current range object.

10.4.1.4.2.3 size_t hetcompute::range< 3 >::linearized_distance ( ) const

Returns the linearized distance of the range.

Returns

Linearized distance of the range, that is, the product of length() in each dimension.


10.4.1.4.2.4 size_t hetcompute::range< 3 >::size ( ) const

Returns the size of the range.

Returns

Size of the range, product of the number of elements in each dimension.

10.4.1.5 class hetcompute::range_base

template<size_t Dims>class hetcompute::range_base< Dims >

Methods common to 1D, 2D, and 3D ranges.

Public member functions

• range_base (const std::array< size_t, Dims > &bb, const std::array< size_t, Dims > &ee)
• range_base (const std::array< size_t, Dims > &bb, const std::array< size_t, Dims > &ee, const
std::array< size_t, Dims > &ss)
• size_t begin (const size_t i) const
• const std::array< size_t, Dims > & begin () const
• size_t dims () const
• size_t end (const size_t i) const
• const std::array< size_t, Dims > & end () const
• size_t length (const size_t i) const
• size_t num_elems (const size_t i) const
• size_t stride (const size_t i) const
• const std::array< size_t, Dims > & stride () const

Protected Attributes

• std::array< size_t, Dims > _b


• std::array< size_t, Dims > _e
• std::array< size_t, Dims > _s

10.4.1.5.1 Member Function Documentation


10.4.1.5.1.1 template<size_t Dims> size_t hetcompute::range_base< Dims >::begin ( const size_t i ) const

Returns the beginning of the range in the i-th coordinate. It will cause a fatal error if i is greater than or
equal to Dims.

Returns

The beginning of the range in the i-th coordinate. (i < Dims).

10.4.1.5.1.2 template<size_t Dims> const std::array<size_t, Dims>& hetcompute::range_base< Dims >::begin ( ) const

Returns all the coordinates of the beginning of the range.

Returns

An array of all the coordinates at the beginning of the range.

10.4.1.5.1.3 template<size_t Dims> size_t hetcompute::range_base< Dims >::dims ( ) const

Returns the dimensions of the range.

Returns

Dimensions of the range, between [1,3].

10.4.1.5.1.4 template<size_t Dims> size_t hetcompute::range_base< Dims >::end ( const size_t i ) const

Returns the end of the range in the i-th coordinate. It will cause a fatal error if i is greater than or equal to
Dims.

Returns

End of the range in the i-th coordinate. (i < Dims)

10.4.1.5.1.5 template<size_t Dims> const std::array<size_t, Dims>& hetcompute::range_base< Dims >::end ( ) const

Returns all the coordinates of the end of the range.

Returns

An array of all the coordinates of the end of the range.


10.4.1.5.1.6 template<size_t Dims> size_t hetcompute::range_base< Dims >::length ( const size_t i ) const

Returns the length of the range in the i-th coordinate. length(i) is the total extent in the i-th coordinate,
including the indices skipped by the stride, whereas num_elems(i) counts only the indices that the stride visits.

Returns

The length of the range in the i-th coordinate.

10.4.1.5.1.7 template<size_t Dims> size_t hetcompute::range_base< Dims >::num_elems ( const size_t i ) const

The number of elements in the i-th coordinate. Equivalent to size(i).

Returns

the number of elements in the i-th coordinate. (i < Dims)
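
One reading of the difference between length(i) and num_elems(i) can be sketched for a single dimension without the SDK. The formulas are assumptions drawn from the descriptions above and are not taken from the SDK sources: length is taken to be e - b, and num_elems is taken to be the count of strided points, that is, (e - b) / s rounded up. length_1d, num_elems_1d, and demo_length_vs_num_elems are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>

// Assumed meaning of length(i): the full extent of [b, e), stride ignored.
std::size_t length_1d(std::size_t b, std::size_t e)
{
    assert(b < e);
    return e - b;
}

// Assumed meaning of num_elems(i): how many points b, b + s, b + 2s, ...
// fall inside [b, e). This is (e - b) / s rounded up.
std::size_t num_elems_1d(std::size_t b, std::size_t e, std::size_t s)
{
    assert(b < e);
    assert(s > 0);
    return (e - b + s - 1) / s;
}

inline void demo_length_vs_num_elems()
{
    // [0, 10) with stride 3 has extent 10 but visits only 0, 3, 6, 9.
    assert(length_1d(0, 10) == 10);
    assert(num_elems_1d(0, 10, 3) == 4);
}
```

Under this reading the two values coincide when the stride is 1, and size() and linearized_distance() would be the per-dimension products of num_elems and length respectively; check the SDK headers before relying on this.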

10.4.1.5.1.8 template<size_t Dims> size_t hetcompute::range_base< Dims >::stride ( const size_t i ) const

Returns the stride of the range in the i-th coordinate. A fatal error occurs if i is greater than or equal to
Dims.

Returns

Stride of the range in the i-th coordinate. (i < Dims)

10.4.1.5.1.9 template<size_t Dims> const std::array<size_t, Dims>& hetcompute::range_base< Dims >::stride ( ) const

Returns all the coordinates of the stride of the range.

Returns

An array of all the coordinates of the stride of the range.


10.5 Tasks
Classes

• struct hetcompute::do_not_collapse_t

• class hetcompute::task< ReturnType >
Tasks with a function of non-void return type.

• class hetcompute::task< ReturnType(Args...)>
Tasks with the full function signature (return type + parameter list).

• class hetcompute::task< void >
Tasks with a function of void return type.

• class hetcompute::task<>
Tasks as the basic unit of work.

• class hetcompute::task_ptr< ReturnType >
Smart pointer to a task object with function with non-void return type.

• class hetcompute::task_ptr< ReturnType(Args...)>
Smart pointer to a task object with full-function signature (return type + parameter list).

• class hetcompute::task_ptr< void >
Smart pointer to a task object with function with void return type.

• class hetcompute::task_ptr<>
Smart pointer to a task object without function information.

Typedefs

• template<typename Fn >
using hetcompute::collapsed_task_type = typename ::hetcompute::internal::task_factory< Fn >::collapsed_task_type
• template<typename Fn >
using hetcompute::non_collapsed_task_type = typename ::hetcompute::internal::task_factory< Fn >::non_collapsed_task_type

Functions

• void hetcompute::abort_on_cancel ()
Aborts execution of calling task if any of its groups is canceled or if someone has canceled it by calling
hetcompute::cancel().

• void hetcompute::abort_task ()
Aborts execution of calling task.

• template<typename Task >
hetcompute::internal::by_data_dep_t< Task && > hetcompute::bind_as_data_dependency (Task &&t)
Explicitly bind a hetcompute::task_ptr<...> or hetcompute::task<...>* as data dependency.

• template<typename Task >
hetcompute::internal::by_value_t< Task && > hetcompute::bind_by_value (Task &&t)
Explicitly bind a hetcompute::task_ptr<...> or hetcompute::task<...>* by value.

• template<typename BlockingFunction , typename CancelFunction >
void hetcompute::blocking (BlockingFunction &&bf, CancelFunction &&cf)
Enclose user-code that blocks on external activity.

• template<typename Code , typename... Args>
collapsed_task_type< Code > hetcompute::create_task (Code &&code, Args &&...args)
• template<typename Code , typename... Args>
non_collapsed_task_type< Code > hetcompute::create_task (do_not_collapse_t, Code &&code, Args &&...args)
• template<typename ReturnType , typename... Args>
::hetcompute::task_ptr< ReturnType > hetcompute::create_value_task (Args &&...args)
• void hetcompute::finish_after (::hetcompute::task<> *task)
Specifies that the task invoking this function should be deemed to finish only after the task finishes.

• void hetcompute::finish_after (::hetcompute::task_ptr<> const &task)
• template<typename Code , typename... Args>
collapsed_task_type< Code > hetcompute::launch (Code &&code, Args &&...args)
• template<typename Code , typename... Args>
non_collapsed_task_type< Code > hetcompute::launch (do_not_collapse_t, Code &&code, Args &&...args)
• bool hetcompute::operator!= (::hetcompute::task_ptr<> const &t, std::nullptr_t)
Compare tasks.

• bool hetcompute::operator!= (std::nullptr_t, ::hetcompute::task_ptr<> const &t)
Compare tasks.

• bool hetcompute::operator!= (::hetcompute::task_ptr<> const &a, ::hetcompute::task_ptr<> const &b)
Compare tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() % std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator% (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator % for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() % std::declval< T2 >())> hetcompute::operator% (const ::hetcompute::task_ptr< T1 > &t1, T2 &&op2)
Algebraic binary operator % for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< T1 >() % std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator% (T1 &&op1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator % for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() & std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator& (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator & for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() & std::declval< T2 >())> hetcompute::operator& (const ::hetcompute::task_ptr< T1 > &t1, T2 &&op2)
Algebraic binary operator & for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< T1 >() & std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator& (T1 &&op1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator & for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() * std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator* (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator * for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() * std::declval< T2 >())> hetcompute::operator* (const ::hetcompute::task_ptr< T1 > &t1, T2 &&op2)
Algebraic binary operator * for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< T1 >() * std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator* (T1 &&op1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator * for tasks.

• template<typename T >
inline ::hetcompute::task_ptr< typename ::hetcompute::task_ptr< T >::return_type > hetcompute::operator+ (const ::hetcompute::task_ptr< T > &t)
Algebraic unary operator + for task.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() + std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator+ (const ::hetcompute::task_ptr< T1 > &t1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator + for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< typename ::hetcompute::task_ptr< T1 >::return_type >() + std::declval< T2 >())> hetcompute::operator+ (const ::hetcompute::task_ptr< T1 > &t1, T2 &&op2)
Algebraic binary operator + for tasks.

• template<typename T1 , typename T2 >
inline ::hetcompute::task_ptr< decltype(std::declval< T1 >() + std::declval< typename ::hetcompute::task_ptr< T2 >::return_type >())> hetcompute::operator+ (T1 &&op1, const ::hetcompute::task_ptr< T2 > &t2)
Algebraic binary operator + for tasks.

• template<typename T>
inline ::hetcompute::task_ptr<typename ::hetcompute::task_ptr<T>::return_type>
hetcompute::operator- (const ::hetcompute::task_ptr<T> &t)
Algebraic unary operator - for task.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() - std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator- (const ::hetcompute::task_ptr<T1> &t1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator - for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() - std::declval<T2>())>
hetcompute::operator- (const ::hetcompute::task_ptr<T1> &t1, T2 &&op2)
Algebraic binary operator - for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<T1>() - std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator- (T1 &&op1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator - for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() / std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator/ (const ::hetcompute::task_ptr<T1> &t1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator / for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() / std::declval<T2>())>
hetcompute::operator/ (const ::hetcompute::task_ptr<T1> &t1, T2 &&op2)
Algebraic binary operator / for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<T1>() / std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator/ (T1 &&op1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator / for tasks.

• bool hetcompute::operator== (task_ptr<> const &t, std::nullptr_t)
Compares a task pointer against nullptr.

• bool hetcompute::operator== (std::nullptr_t, task_ptr<> const &t)
Compares a task pointer against nullptr.

• bool hetcompute::operator== (::hetcompute::task_ptr<> const &a, ::hetcompute::task_ptr<> const &b)
Compares two task pointers.

• inline ::hetcompute::task_ptr<> & hetcompute::operator>> (::hetcompute::task_ptr<> &pred, ::hetcompute::task_ptr<> &succ)
Creates a control dependency between two tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ^ std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator^ (const ::hetcompute::task_ptr<T1> &t1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator ^ for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ^ std::declval<T2>())>
hetcompute::operator^ (const ::hetcompute::task_ptr<T1> &t1, T2 &&op2)
Algebraic binary operator ^ for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<T1>() ^ std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator^ (T1 &&op1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator ^ for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() | std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator| (const ::hetcompute::task_ptr<T1> &t1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator | for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() | std::declval<T2>())>
hetcompute::operator| (const ::hetcompute::task_ptr<T1> &t1, T2 &&op2)
Algebraic binary operator | for tasks.

• template<typename T1, typename T2>
inline ::hetcompute::task_ptr<decltype(std::declval<T1>() | std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())>
hetcompute::operator| (T1 &&op1, const ::hetcompute::task_ptr<T2> &t2)
Algebraic binary operator | for tasks.

• template<typename T>
inline ::hetcompute::task_ptr<typename ::hetcompute::task_ptr<T>::return_type>
hetcompute::operator~ (const ::hetcompute::task_ptr<T> &t)
Algebraic unary operator ~ for task.
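
The binary operator signatures above all declare their result the same way: decltype applied to the operator on std::declval values of the two operand types. The sketch below reproduces only that type computation in standard C++. The names result_of_plus_t and plus_values are hypothetical and are not part of the SDK; no HetCompute header is used.

```cpp
#include <cassert>
#include <string>
#include <type_traits>
#include <utility>

// Hypothetical helper (not an SDK name): the type produced by `a + b`
// for operands of type A and B, computed without evaluating anything.
template <typename A, typename B>
using result_of_plus_t = decltype(std::declval<A>() + std::declval<B>());

// int + double promotes to double, so operands whose return types are
// int and double would be combined into a result of type double.
static_assert(std::is_same<result_of_plus_t<int, double>, double>::value,
              "int + double is double");

// std::string + const char* yields std::string.
static_assert(std::is_same<result_of_plus_t<std::string, const char*>, std::string>::value,
              "string + const char* is string");

// Runtime counterpart of the compile-time computation above.
template <typename A, typename B>
result_of_plus_t<A, B> plus_values(const A& a, const B& b)
{
    return a + b;
}
```

Calling plus_values(1, 2.5) returns a double, mirroring how the declared result type follows the usual C++ arithmetic conversions rather than the type of either operand.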

Variables

• const do_not_collapse_t hetcompute::do_not_collapse {}

10.5.1 Class Documentation

10.5.1.1 struct hetcompute::do_not_collapse_t

Helper struct for specifying do_not_collapse task.

10.5.1.2 class hetcompute::task< ReturnType >

template<typename ReturnType>class hetcompute::task< ReturnType >

Template Parameters

ReturnType Return type of the task function.

Note: An object of this class should not be instantiated. It is a facade to the internal implementation.

Public Types

• using return_type = ReturnType

Public member functions

• return_type const & copy_value ()
Returns a const reference to the value returned by the task.

• return_type && move_value ()
Moves the value returned by the task.

Additional Inherited Members

10.5.1.2.1 Member Typedef Documentation

10.5.1.2.1.1 template<typename ReturnType > using hetcompute::task< ReturnType >::return_type = ReturnType

Type returned by the task.

10.5.1.2.2 Member Function Documentation

10.5.1.2.2.1 template<typename ReturnType > return_type const& hetcompute::task< ReturnType >::copy_value ( )

Returns a const reference to the value returned by the task.


This method behaves like hetcompute::task<>::wait_for, and it does not return until the task
completes its execution. It returns immediately if the task has already finished.
hetcompute::task<ReturnType>::copy_value() can be called multiple times (unlike
hetcompute::task<ReturnType>::move_value).

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Define a lambda.
    auto l = [](int x) -> int { return x * 2; };
    // Create a task out of the lambda and launch it.
    auto t1 = hetcompute::launch(l, 42);

    // Wait for t1 to finish and assign the return value to val.
    int val = t1->copy_value();

    HETCOMPUTE_ILOG("return value of t1 is: %d", val);

    hetcompute::runtime::shutdown();
}

See Also

hetcompute::task<ReturnType>::move_value()
hetcompute::task<>::wait_for()

10.5.1.2.2.2 template<typename ReturnType > return_type&& hetcompute::task< ReturnType >::move_value ( )

Moves the value returned by the task.


This method behaves like hetcompute::task<>::wait_for, and it does not return until the task
completes its execution. It returns immediately if the task has already finished.
hetcompute::task<ReturnType>::move_value() can only be called once. Any subsequent
call may raise an exception.

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Define a lambda.
    auto l = [](int x) -> int { return x * 2; };
    // Create a task out of the lambda and launch it.
    auto t1 = hetcompute::launch(l, 42);

    // Wait for t1 to finish and move its return value into val.
    int val = t1->move_value();
    HETCOMPUTE_ILOG("return value of t1 is: %d", val);

    // Error! The value might not be there anymore.
    // int val_error = t1->move_value();

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::task<ReturnType>::copy_value()
hetcompute::task<>::wait_for()
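
The contrast between copy_value() (repeatable, returns a const reference) and move_value() (valid once) can be modeled in a few lines of standard C++. The sketch below is only a model: result_slot is a hypothetical class, not SDK code, and throwing std::logic_error on a second move is an assumption made here, since the text above only says a subsequent call "may raise an exception". Rejecting copy_value() after a move is likewise a choice of this sketch.

```cpp
#include <cassert>
#include <stdexcept>
#include <string>
#include <utility>

// Hypothetical stand-in for a finished task's result storage.
// It is not SDK code; it only models "read many times" vs "move once".
template <typename T>
class result_slot
{
public:
    explicit result_slot(T value) : value_(std::move(value)), moved_(false) {}

    // Like copy_value(): hands out a const reference, repeatable.
    const T& copy_value() const
    {
        if (moved_) throw std::logic_error("value already moved out");
        return value_;
    }

    // Like move_value(): transfers ownership, valid only once.
    T&& move_value()
    {
        if (moved_) throw std::logic_error("value already moved out");
        moved_ = true;
        return std::move(value_);
    }

private:
    T    value_;
    bool moved_;
};

// Returns true when a second move_value() call is rejected.
inline bool second_move_is_rejected()
{
    result_slot<std::string> slot("forty-two");
    std::string owned = slot.move_value();  // first call succeeds
    try
    {
        slot.move_value();                  // second call must fail
    }
    catch (const std::logic_error&)
    {
        return owned == "forty-two";
    }
    return false;
}
```

Prefer the copy_value() style when several consumers read the same result, and the move_value() style when a single consumer takes ownership of a value that is expensive to copy.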

10.5.1.3 class hetcompute::task< ReturnType(Args...)>

template<typename ReturnType, typename... Args>class hetcompute::task< ReturnType(Args...)>

Template Parameters

Args... Parameter type list of the task function.


ReturnType Return type of the task function.

Note: An object of this class should not be instantiated. It is a facade to the internal implementation.

Public Types

• using args_tuple = std::tuple< Args...>


• using return_type = ReturnType
• using size_type = task<>::size_type

Public member functions

• template<typename... Arguments>
void bind_all (Arguments &&...args)
Bind all arguments to a task with a full-function signature.

• template<typename... Arguments>
void launch (Arguments &&...args)
Launches task and (optionally) binds arguments.

Static Public Attributes

• static HETCOMPUTE_CONSTEXPR_CONST size_type arity = sizeof...(Args)

Friends

• template<typename R, typename... As>
friend ::hetcompute::internal::cputask_arg_layer<R(As...)> * c_ptr2 (::hetcompute::task<R(As...)> &t)

Additional Inherited Members

10.5.1.3.1 Member Typedef Documentation

10.5.1.3.1.1 template<typename ReturnType , typename... Args> using hetcompute::task< ReturnType(Args...)>::args_tuple = std::tuple<Args...>

Tuple whose types are the types of the task arguments.

10.5.1.3.1.2 template<typename ReturnType , typename... Args> using hetcompute::task< ReturnType(Args...)>::return_type = ReturnType

Type returned by the task.

10.5.1.3.1.3 template<typename ReturnType , typename... Args> using hetcompute::task< ReturnType(Args...)>::size_type = task<>::size_type

Unsigned integral type.

10.5.1.3.2 Member Function Documentation

10.5.1.3.2.1 template<typename ReturnType , typename... Args> template<typename... Arguments> void hetcompute::task< ReturnType(Args...)>::bind_all ( Arguments &&... args )

Bind all arguments to a task with a full-function signature.

The arguments should be provided in the same order as the task's parameter list. If an arg is not a
hetcompute::task_ptr or hetcompute::task*, its type should match the corresponding task
parameter (bound by value).
If an arg is a hetcompute::task_ptr or hetcompute::task*, either its return type should be C++
compatible with the corresponding task parameter (bound as data dependency), or the type of the arg
itself should be C++ compatible with the corresponding task parameter type (bound by value). When an
arg can be bound either as data dependency or by value, the binding type needs to be explicitly
specified using hetcompute::bind_as_data_dependency or hetcompute::bind_by_value to avoid
ambiguity.
The number of arguments must equal the task's arity.

Template Parameters

Arguments... Task arguments type.

Parameters

args Arguments to the task.

Example

#include <atomic>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a group.
    auto g = hetcompute::create_group();

    std::atomic<size_t> value(0);

    // Create a non-collapsing task; t1 is a
    // hetcompute::task_ptr<hetcompute::task_ptr<size_t>()>.
    auto t1 = hetcompute::create_task(hetcompute::do_not_collapse, [&value]() -> hetcompute::task_ptr<size_t> {
        return hetcompute::create_task([&value]() -> size_t {
            value = 27;
            return value;
        });
    });

    // Create a task.
    auto t2 = hetcompute::create_task([&value] { value = 42; });

    // Create a task that takes two parameters of type hetcompute::task_ptr<>.
    auto t3 = hetcompute::create_task([g](hetcompute::task_ptr<> ta, hetcompute::task_ptr<> tb) {
        // Set task dependency and launch.
        ta->then(tb);  // t1->result() >> t2
        g->launch(ta); // t1->result()->launch()
        g->launch(tb); // t2->launch();
    });

    // Create a task that takes one parameter of type hetcompute::task_ptr<>.
    auto t4 = hetcompute::create_task([g](hetcompute::task_ptr<> ta) {
        // Launch the task.
        g->launch(ta); // t1->launch();
    });

    // Bind the arguments for t3.
    // Bind t1 to the first argument as data dependency (explicitly, due to ambiguity).
    // Bind t2 to the second argument (by value, no ambiguity).
    t3->bind_all(hetcompute::bind_as_data_dependency(t1), t2);

    // Bind the argument for t4.
    // Bind t1 to the argument by value (explicitly, due to ambiguity).
    t4->bind_all(hetcompute::bind_by_value(t1));

    // Launch the tasks into the group (t1 and t2 will be launched in t3 and t4).
    g->launch(t3);
    g->launch(t4);

    // Wait for the group to finish.
    g->wait_for();

    HETCOMPUTE_ILOG("%zu", value.load());

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::bind_as_data_dependency()
hetcompute::bind_by_value()

10.5.1.3.2.2 template<typename ReturnType , typename... Args> template<typename... Arguments> void hetcompute::task< ReturnType(Args...)>::launch ( Arguments &&... args )

Launches task and (optionally) binds arguments.


This method informs the Qualcomm HetCompute runtime that the task is ready to execute as soon as there
is an available hardware context and after all its predecessors (both data- and control-dependent) have
executed.
launch() can have arguments. If so, the number of arguments in launch() should be the same as
arity.

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task([](int x) { HETCOMPUTE_ILOG("Hello World %d!\n", x); });

    // ...
    // Set up dependencies if needed.
    // ...

    // t1 is ready: bind its argument and launch it.
    t1->launch(42);

    // Wait for t1 to finish.
    t1->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::group::launch(Code)

10.5.1.3.3 Member Data Documentation

10.5.1.3.3.1 template<typename ReturnType , typename... Args> HETCOMPUTE_CONSTEXPR_CONST size_type hetcompute::task< ReturnType(Args...)>::arity = sizeof...(Args) [static]

Number of task arguments.
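
The members return_type, args_tuple, and arity are all derived from the function signature used as the template argument. The sketch below shows how a partial specialization on ReturnType(Args...) can expose the same three pieces of information. signature_info is a hypothetical trait written for illustration, not an SDK type, and std::size_t stands in for size_type as an assumption.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <type_traits>

// Hypothetical trait (not an SDK type) that mirrors the three members
// documented for task<ReturnType(Args...)>.
template <typename Signature>
struct signature_info;  // primary template left undefined on purpose

template <typename ReturnType, typename... Args>
struct signature_info<ReturnType(Args...)>
{
    using return_type = ReturnType;
    using args_tuple  = std::tuple<Args...>;
    static constexpr std::size_t arity = sizeof...(Args);
};

using info = signature_info<double(int, float)>;

static_assert(std::is_same<info::return_type, double>::value, "return type is double");
static_assert(std::is_same<info::args_tuple, std::tuple<int, float>>::value,
              "tuple of parameter types");
static_assert(info::arity == 2, "two parameters");
static_assert(signature_info<void()>::arity == 0, "no parameters");
```

Because everything is computed at compile time, a mismatch between the number of arguments supplied and arity can be diagnosed with a static_assert before any task runs.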

10.5.1.4 class hetcompute::task< void >

template<>class hetcompute::task< void >

Note: An object of this class should not be instantiated. It is a facade to the internal implementation.

Additional Inherited Members

10.5.1.5 class hetcompute::task<>

template<>class hetcompute::task<>

Note: This is the basic task without function signature information.

Note: An object of this class should not be instantiated. It is a facade to the internal implementation.

Public Types

• using size_type = ::hetcompute::internal::task::size_type

Public member functions

• void cancel ()
Cancels task.

• bool canceled () const
Checks whether a task is canceled.

• void finish_after ()
Makes the current task finish only after this task finishes.

• bool is_bound () const
Checks whether a task is bound.

• void launch ()
Launches task.

• task_ptr<> & then (task_ptr<> &succ)
Creates a control dependency between two tasks.

• task * then (task<> *succ)

• hc_error wait_for ()
Waits for the task to complete execution.

Protected Member Functions

• internal_raw_task_ptr get_raw_ptr () const


• HETCOMPUTE_DELETE_METHOD (task())
• HETCOMPUTE_DELETE_METHOD (task(task const &))

• HETCOMPUTE_DELETE_METHOD (task(task &&))


• HETCOMPUTE_DELETE_METHOD (task &operator=(task const &))
• HETCOMPUTE_DELETE_METHOD (task &operator=(task &&))

Protected Attributes

• internal_raw_task_ptr _ptr

Friends

• friend ::hetcompute::internal::task * hetcompute::internal::c_ptr (::hetcompute::task<> *t)

10.5.1.5.1 Member Typedef Documentation

10.5.1.5.1.1 using hetcompute::task<>::size_type = ::hetcompute::internal::task::size_type

Unsigned integral type.

10.5.1.5.2 Member Function Documentation

10.5.1.5.2.1 void hetcompute::task<>::cancel ( )

Cancels task.
Use hetcompute::task<>::cancel() to cancel a task and its successors. The effects of
hetcompute::task<>::cancel() depend on the task status:

• If a task is canceled before it launches, it never executes, even if it is launched afterwards. In
addition, it propagates the cancellation to the task's successors. This is called "cancellation
propagation".
• If a task is canceled after it is launched, but before it starts executing, it will never execute and it will
propagate cancellation to its successors.
• If the task is running when someone else calls hetcompute::task<>::cancel(), it is up to
the task to ignore the cancellation request and continue its execution, or to honor the request via
hetcompute::abort_on_cancel(), which aborts the task’s execution and propagates the
cancellation to the task’s successors.
• Finally, if a task is canceled after it completes its execution (successfully or not), it does not change
its status and it does not propagate cancellation.
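
The four cases in the list above can be condensed into a small state model. The sketch below is illustrative only: toy_task and its state enum are hypothetical, the SDK's internal state machine is not documented in this section and may differ, and the model's abort_on_cancel() returns a bool where the real call aborts the task's execution.

```cpp
#include <cassert>
#include <vector>

// Hypothetical model of the task states discussed above.
enum class state { created, launched, running, completed, canceled };

struct toy_task
{
    state                  st = state::created;
    bool                   cancel_requested = false;
    std::vector<toy_task*> successors;

    // Control dependency: succ must not run before *this finishes.
    void then(toy_task& succ) { successors.push_back(&succ); }

    void cancel()
    {
        switch (st)
        {
        case state::completed:
        case state::canceled:
            return;  // finished tasks keep their status, nothing propagates
        case state::running:
            cancel_requested = true;  // the body decides whether to honor it
            return;
        case state::created:
        case state::launched:
            st = state::canceled;  // will never execute
            for (toy_task* s : successors)
            {
                s->cancel();  // cancellation propagation
            }
            return;
        }
    }

    // Called by a running body; honors a pending request.
    // Returns true if the task aborted.
    bool abort_on_cancel()
    {
        if (st != state::running || !cancel_requested)
        {
            return false;
        }
        st = state::canceled;
        for (toy_task* s : successors)
        {
            s->cancel();
        }
        return true;
    }
};
```

In this model a running task only becomes canceled, and only propagates to its successors, once its body polls abort_on_cancel(); a completed task is unaffected by cancel().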

Example 1: Canceling a task before launching it:

#include <cassert>
#include <iostream>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task([] { assert(false); });

    auto t2 = hetcompute::create_task([] { assert(false); });

    // Create control dependency.
    t1->then(t2);

    // Cancel t1, which propagates cancellation to t2.
    t1->cancel();

    // Launch t2. Does nothing, t2 got canceled via cancellation propagation.
    t2->launch();

    // Returns immediately, t2 is canceled.
    try
    {
        t2->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << ": t2 was canceled" << std::endl;
    }
    catch (...)
    {
        // Never reached.
    }

    hetcompute::runtime::shutdown();
    return 0;
}

In the example above, a control dependency is created between two tasks, t1 and t2. Notice that, if either
task executes, it will raise an assertion. Calling t1->cancel() cancels t1, which causes t2 to be canceled
as well. t2 is launched afterwards, but it does not matter: it will not execute because it was canceled
when t1 propagated its cancellation.

Example 2: Canceling a task after launching it, but before it executes:

#include <cassert>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });

    auto t2 = hetcompute::create_task([] { assert(false); });

    auto t3 = hetcompute::create_task([] { assert(false); });

    // Create dependencies.
    t1->then(t2)->then(t3);

    // Launch t2. It cannot execute yet because t1 has not been launched.
    t2->launch();

    // Cancel t2, which propagates cancellation to t3.
    t2->cancel();

    // Launch t1. It will execute because no one canceled it.
    t1->launch();

    // Returns after t1 completes execution.
    t1->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

In the example above, three tasks are created and chained: t1, t2, and t3. t2 is launched first, but it
cannot execute because its predecessor t1 has not yet executed. t2 is then canceled, which means that it
will never execute. Because t3 is t2's successor, it is also canceled; if t3 had a successor, it would also
be canceled.

Example 3: Canceling a task while it executes:

#include <cassert>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t = hetcompute::create_task([] {
        while (1)
        {
            hetcompute::abort_on_cancel();
            HETCOMPUTE_ILOG("Waiting to be canceled.\n");
            usleep(100);
        }
        assert(false); // This will never fire.
    });

    // Launch t.
    t->launch();

    // Wait for 2 seconds.
    sleep(2);

    // Cancel task. Returns immediately.
    t->cancel();

    try
    {
        // Wait for the task.
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown" << std::endl;
    }
    catch (...)
    {
        // Never reached.
    }

    hetcompute::runtime::shutdown();
    return 0;
}

In the example above, task t will never finish unless it is canceled. t is launched with t->launch(). After
launching the task, the main thread sleeps for 2 seconds to ensure that t is scheduled and prints its
messages. t->cancel() then asks Qualcomm HetCompute to cancel the task, which should be running by now.
hetcompute::task<>::cancel() returns immediately after it marks the task as "pending for
cancellation". This means that t might still be executing after t->cancel() returns. That is why
t->wait_for() is called afterwards, to ensure that we wait for t to complete its execution. Remember:
a task does not know whether someone has requested its cancellation unless it calls
hetcompute::abort_on_cancel() during its execution.
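
The cooperative nature of cancellation described above (the request is only a mark, and the running body must poll for it) can be illustrated with standard C++ threads. This is a sketch under stated assumptions, not the SDK's mechanism: the atomic flag plays the role of the "pending for cancellation" mark, check_cancel() plays the role of hetcompute::abort_on_cancel(), and canceled_error, check_cancel, and run_and_cancel are all hypothetical names.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <stdexcept>
#include <thread>

// Hypothetical exception used to unwind out of the task body.
struct canceled_error : std::runtime_error
{
    canceled_error() : std::runtime_error("canceled") {}
};

// Stand-in for a cancellation point: throws if a request is pending.
inline void check_cancel(const std::atomic<bool>& cancel_requested)
{
    if (cancel_requested.load())
    {
        throw canceled_error();
    }
}

// Runs a body that polls for cancellation, requests cancellation from
// the calling thread, and reports whether the body exited via the poll.
inline bool run_and_cancel()
{
    std::atomic<bool> cancel_requested(false);
    std::atomic<bool> aborted(false);

    std::thread worker([&] {
        try
        {
            for (;;)
            {
                check_cancel(cancel_requested);  // the only cancellation point
                std::this_thread::sleep_for(std::chrono::microseconds(100));
            }
        }
        catch (const canceled_error&)
        {
            aborted = true;
        }
    });

    cancel_requested = true;  // returns immediately, like cancel()
    worker.join();            // like wait_for(): blocks until the body exits
    return aborted.load();
}
```

If the loop never called check_cancel(), the join would never return, which is the same hazard as a task body that never calls hetcompute::abort_on_cancel().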

Example 4: Canceling a completed task.

#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World from t1!\n"); });

    auto t2 = hetcompute::create_task([] {
        while (1)
        {
            hetcompute::abort_on_cancel();
            HETCOMPUTE_ILOG("Hello World from t2!\n");
            usleep(100);
        }
    });

    // Create dependencies.
    t1->then(t2);

    // Launch t1.
    t1->launch();

    // Wait for t1 to complete.
    t1->wait_for();

    // Cancel t1.
    // Because it has already completed, it does not propagate its cancellation.
    t1->cancel();

    // If the two lines below are uncommented, the wait_for will never return.
    // t2->launch();
    // t2->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

In the example above, a dependency is set up between t1 and t2, and t1 is launched. t1 is canceled only
after it has completed (t1->wait_for() has already returned), so t1->cancel() has no effect. Thus, nobody
cancels t2: if t2 were launched, t2->wait_for() would never return.

See Also

hetcompute::abort_on_cancel()
hetcompute::task<>::wait_for()

10.5.1.5.2.2 bool hetcompute::task<>::canceled ( ) const

Use this method to check whether a task is canceled. If the task was canceled before it started
executing (via cancellation propagation, hetcompute::group::cancel(), or
hetcompute::task<>::cancel()), hetcompute::task<>::canceled() returns true.
If the task was canceled while it was executing (via hetcompute::task<>::cancel() or
hetcompute::group::cancel()), then hetcompute::task<>::canceled() returns true
only if the task is not executing any more and it exited via hetcompute::abort_on_cancel().
Finally, if the task completed successfully, hetcompute::task<>::canceled() always returns false.

Returns

true – The task has transitioned to a canceled state.


false – The task has not yet transitioned to a canceled state.

Example:

#include <cassert>
#include <iostream>
#include <unistd.h>

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t = hetcompute::create_task([] {
        while (true)
        {
            hetcompute::abort_on_cancel();
            HETCOMPUTE_ILOG("Hello World!\n");
            usleep(1); // Sleep for one microsecond.
        }
    });

    auto g = hetcompute::create_group();

    // It will never fire.
    assert(t->canceled() == false);

    // Launch task.
    g->launch(t);

    // It will never fire.
    assert(t->canceled() == false);

    // Sleep for 10 microseconds.
    usleep(10);

    // It will never fire.
    assert(t->canceled() == false);

    // Cancel both the task and the group.
    t->cancel();
    g->cancel();

    // Might be false if the task has not executed abort_on_cancel() yet.
    // Might also be true if the task has already executed abort_on_cancel().
    HETCOMPUTE_ILOG("t->canceled() = %d", t->canceled() == true);

    try
    {
        // Wait for the task to transition to canceled state.
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << "threw " << e.what() << " due to task cancellation" << std::endl;
    }
    catch (...)
    {
        // Never reached.
    }

    // It will never fire.
    assert(t->canceled() == true);

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::task<>::cancel()
hetcompute::group::cancel()

10.5.1.5.2.3 void hetcompute::task<>::finish_after ( )

Specifies that the current task should be deemed to finish only after the task on which this method is
invoked finishes. This method returns immediately.

Example

#include <algorithm>
#include <cstdlib>
#include <functional>
#include <iostream>
#include <iterator>
#include <sstream>
#include <vector>

#include <hetcompute/hetcompute.hh>

// Parallel mergesort using recursive fork-join parallelism.
// hetcompute::task<>::finish_after allows easy expression of the parallelism in the
// algorithm in a non-blocking manner, yielding better performance than
// blocking parallelization using hetcompute::task<>::wait_for.

const size_t GRANULARITY = 8192;

// Asynchronous mergesort, to be invoked in a task.
template <typename Iterator, typename Compare>
void
mergesort(Iterator begin, Iterator end, Compare cmp)
{
    size_t n = std::distance(begin, end);
    if (n <= GRANULARITY)
    {
        std::sort(begin, end, cmp);
    }
    else
    {
        auto middle = begin;
        std::advance(middle, n / 2);
        auto left  = hetcompute::launch([=] { mergesort(begin, middle, cmp); });
        auto right = hetcompute::launch([=] { mergesort(middle, end, cmp); });
        auto merge = hetcompute::create_task([=] { std::inplace_merge(begin, middle, end, cmp); });
        // The left subtree and right subtree tasks must finish before the merge
        // task can execute.
        left->then(merge);
        right->then(merge);
        merge->launch();
        // mergesort(begin, end, cmp) logically finishes after the merge task
        // finishes.
        merge->finish_after();
    }
}

int
main(int argc, const char* argv[])
{
    hetcompute::runtime::init();
    std::vector<long> input;
    size_t n_def = 1 << 16;
    size_t n = n_def;

    if (argc >= 2)
    {
        std::istringstream istr(argv[1]);
        istr >> n;
    }

    // Create a random array of integers.
    for (size_t i = 0; i < n; i++)
    {
        input.push_back(rand());
    }

    // Launch mergesort inside a task since it has an asynchronous interface (due
    // to use of hetcompute::task<>::finish_after).
    auto t = hetcompute::launch([&] { mergesort(input.begin(), input.end(), std::less<long>()); });
    t->wait_for();

    if (!std::is_sorted(input.begin(), input.end()))
    {
        std::cerr << "parallel mergesorting failed\n";
    }

    hetcompute::runtime::shutdown();
    return 0;
}

Exceptions

api_exception If invoked from outside a task, from within a hetcompute::pfor_each, or if the task
pointer is null.

Note

If exceptions are disabled by the application, this API terminates the app when the task pointer is
nullptr, or when it is invoked from outside a task or from within a hetcompute::pfor_each.

See Also

hetcompute::group::finish_after()

10.5.1.5.2.4 bool hetcompute::task<>::is_bound ( ) const

Use this method to check whether all the task parameters are bound. Returns true if the task has no
parameters. Remember that only bound tasks can be launched.

Returns

true – All the task parameters are bound. If the task has none, is_bound always returns true.

false – At least one of the task parameters is not bound.
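
The rule above (a task with no parameters is always bound, and only bound tasks can be launched) can be modeled with a few lines of standard C++. toy_bindable is a hypothetical class, not SDK code. Rejecting the launch of an unbound task with std::logic_error is an assumption of this sketch; this section does not say how the runtime reacts in that case.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>

// Hypothetical model, not SDK code: a task that knows its arity and
// whether arguments have been supplied for all of its parameters.
class toy_bindable
{
public:
    explicit toy_bindable(std::size_t arity)
        : arity_(arity), bound_(arity == 0), launched_(false)
    {
    }

    // Supplies all arguments at once; the count must equal the arity.
    void bind_all(std::size_t argument_count)
    {
        if (argument_count != arity_)
        {
            throw std::invalid_argument("argument count must equal arity");
        }
        bound_ = true;
    }

    // True when every parameter has an argument; always true for arity 0.
    bool is_bound() const { return bound_; }

    // Assumption of this sketch: launching an unbound task is rejected.
    void launch()
    {
        if (!bound_)
        {
            throw std::logic_error("task is not bound");
        }
        launched_ = true;
    }

    bool launched() const { return launched_; }

private:
    std::size_t arity_;
    bool        bound_;
    bool        launched_;
};
```

Checking is_bound() before launching lets calling code decide whether it still needs to supply arguments, instead of discovering the problem at launch time.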

10.5.1.5.2.5 void hetcompute::task<>::launch ( )

Launches task.
This method informs the Qualcomm HetCompute runtime that the task is ready to execute as soon as there
is an available hardware context and after all its predecessors (both data- and control-dependent) have
executed.

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task t1.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // ...
    // Set up dependencies if needed.
    // ...

    // t1 is ready, launch it.
    t1->launch();

    // Wait for t1 to finish.
    t1->wait_for(); // Will not return until t1 finishes.
    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::group::launch(Code)

10.5.1.5.2.6 task_ptr& hetcompute::task<>::then ( task_ptr<> & succ )

Creates a control dependency between two tasks.


The Qualcomm HetCompute runtime ensures that succ starts executing only after this has completed its
execution, regardless of how many hardware execution contexts are available in the device. Use this method
to create task dependency graphs.

Note: The programmer is responsible for ensuring that there are no cycles in the task graph.

If succ has already been launched, hetcompute::task<>::then() will throw an api_exception.
This is because it makes little sense to add a predecessor to a task that might already be running. On the
other hand, if this has already completed execution successfully, no dependency will be created, and
hetcompute::task<>::then returns immediately. If this has been canceled (or if it is canceled in the
future), succ will be canceled as well due to cancellation propagation.
Do not call this member if succ is nullptr. This would cause a fatal error.

Parameters

succ Successor task. Cannot be nullptr.

Exceptions

api_exception If succ has already been launched.

Note

If exceptions are disabled by the application, this method terminates the app if succ has already been launched.

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create group.
    auto g = hetcompute::create_group("Hello World Group");

    // Create tasks t1 and t2.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World! from task t1\n"); });

    auto t2 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World! from task t2\n"); });

    // Create dependency between t1 and t2.
    t1->then(t2);

    // Launch both t1 and t2 into g.
    g->launch(t1);
    g->launch(t2);

    // Wait until t1 and t2 finish.
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}
Output:
Hello World! from task t1
Hello World! from task t2

No other output is possible because of the dependency between t1 and t2.


10.5.1.5.2.7 hc_error hetcompute::task<>::wait_for ( )

Waits for the task to complete execution.


This method does not return until the task completes its execution. It returns immediately if the task has
already finished.
If hetcompute::task<>::wait_for() is called from within a task, Qualcomm HetCompute
context-switches the calling task and finds another one to run. If it is called from outside a task (for
example, from the main thread), Qualcomm HetCompute blocks the calling thread until the awaited task
completes.
This method is a safe point. Safe points are Qualcomm HetCompute API methods where the following
property holds:
The thread on which the task executes before the API call might not be the same as the thread on
which the task executes after the API call.

Example

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create task t1.
    auto t1 = hetcompute::create_task([] { HETCOMPUTE_ILOG("Hello World!\n"); });

    // ...
    // Set up dependencies if needed.
    // ...

    // t1 is ready, launch it.
    t1->launch();

    // Wait for t1 to finish.
    t1->wait_for(); // Will not return until t1 finishes.
    hetcompute::runtime::shutdown();
    return 0;
}

Exceptions

hetcompute::canceled_exception If the task or any task on which it is dependent was canceled.

Note

If exceptions are disabled by the application, the API returns hc_error::HC_TaskCanceled in this
case.

Exceptions

hetcompute::aggregate_exception If the task or any task on which it is dependent threw two or more
exceptions.

Note

If exceptions are disabled by the application, the API returns hc_error::HC_TaskAggregateFailure
in this case.

Exceptions

hetcompute::gpu_exception If the task or any task on which it is dependent encountered a runtime
error on the GPU.

Note

If exceptions are disabled by the application, the API returns hc_error::HC_TaskGpuFailure in this
case.

Exceptions

hetcompute::hexagon_exception If the task or any task on which it is dependent encountered a
runtime error on the Hexagon DSP.

Note

If exceptions are disabled by the application, the API returns hc_error::HC_TaskDspFailure in this
case.

Exceptions

Any other exception that may be thrown by the task or by any task on which it is dependent.
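
When exceptions are disabled, the caller has to inspect the value returned by wait_for(). The sketch below shows one way to turn the four codes named above into log messages. It is standalone C++ and does not include the SDK: the hc_error enum here is a local stand-in, describe_wait_result is a hypothetical helper, and HC_PlaceholderOther is an invented enumerator that stands for any value not listed in this section (the real enumerators and their values are defined by the SDK headers).

```cpp
#include <cassert>
#include <string>

// Local stand-in for illustration only. Only the four HC_Task* names are
// taken from this section; HC_PlaceholderOther is hypothetical.
enum class hc_error
{
    HC_PlaceholderOther,
    HC_TaskCanceled,
    HC_TaskAggregateFailure,
    HC_TaskGpuFailure,
    HC_TaskDspFailure
};

// Hypothetical helper: maps a wait_for() result to a log-friendly message.
std::string describe_wait_result(hc_error e)
{
    switch (e)
    {
    case hc_error::HC_TaskCanceled:
        return "task or one of its predecessors was canceled";
    case hc_error::HC_TaskAggregateFailure:
        return "two or more exceptions were raised";
    case hc_error::HC_TaskGpuFailure:
        return "runtime error on the GPU";
    case hc_error::HC_TaskDspFailure:
        return "runtime error on the Hexagon DSP";
    default:
        return "no task failure listed in this section";
    }
}
```

In real code the argument would be the value returned by t->wait_for(); the switch keeps a default branch so that codes not covered here are still handled.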

See Also

hetcompute::group::wait_for()

10.5.1.6 class hetcompute::task_ptr< ReturnType >

template<typename ReturnType>class hetcompute::task_ptr< ReturnType >

Smart pointer to a task object, i.e., hetcompute::task<ReturnType>, similar to std::shared_ptr.

Template Parameters

ReturnType Return type of the task function

Public Types

• using return_type = ReturnType


• using task_type = task< return_type >

Public member functions

• task_ptr ()
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.


• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.

• task_ptr (task_ptr< return_type > const &other)


Copy constructor. Constructs a task_ptr<ReturnType> that manages the same task as other.

• task_ptr (task_ptr< return_type > &&other)


Move constructor. Move-constructs a task_ptr<ReturnType> that manages the same task as
other.

• ∼task_ptr ()
Destructor.

• task_type ∗ get () const


Returns pointer to managed task.

• template<typename T >
task_ptr & operator%= (T &&op)
Compound assignment operator %= with value operand.

• template<typename T >
task_ptr & operator&= (T &&op)
Compound assignment operator &= with value operand.

• template<typename T >
task_ptr & operator∗= (T &&op)
Compound assignment operator ∗= with value operand.

• template<typename T >
task_ptr & operator+= (T &&op)
Compound assignment operator += with value operand.

• template<typename T >
task_ptr & operator-= (T &&op)
Compound assignment operator -= with value operand.

• task_type ∗ operator-> () const


Dereference operator. Returns pointer to managed task.

• template<typename T >
task_ptr & operator/= (T &&op)
Compound assignment operator /= with value operand.

• task_ptr & operator= (task_ptr< return_type > const &other)


Assignment operator. Assigns the task managed by other to ∗this.

• task_ptr & operator= (task_ptr< return_type > &&other)


Assignment operator. Resets ∗this.

• template<typename T >
task_ptr & operator^= (T &&op)
Compound assignment operator ^= with value operand.

• template<typename T >
task_ptr & operator|= (T &&op)
Compound assignment operator |= with value operand.

• void swap (task_ptr< return_type > &other)


Exchanges managed tasks between ∗this and other.

10.5.1.6.1 Member Typedef Documentation

10.5.1.6.1.1 template<typename ReturnType > using hetcompute::task_ptr< ReturnType


>::return_type = ReturnType

Return type of the task function.

10.5.1.6.1.2 template<typename ReturnType > using hetcompute::task_ptr< ReturnType >::task_type


= task<return_type>

Task object type.

10.5.1.6.2 Constructors and Destructors

10.5.1.6.2.1 template<typename ReturnType > hetcompute::task_ptr< ReturnType >::task_ptr ( )

Constructs a task_ptr<ReturnType> that manages no task<ReturnType>.


task_ptr<ReturnType>::get returns nullptr.

10.5.1.6.2.2 template<typename ReturnType > hetcompute::task_ptr< ReturnType >::task_ptr (


std::nullptr_t )

Constructs a task_ptr<ReturnType> that manages no task<ReturnType>.


task_ptr<ReturnType>::get returns nullptr.

10.5.1.6.2.3 template<typename ReturnType > hetcompute::task_ptr< ReturnType >::task_ptr (


task_ptr< return_type > const & other )

Constructs a task_ptr<ReturnType> object that manages the same task<ReturnType> as other.


If other points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to copy.


10.5.1.6.2.4 template<typename ReturnType > hetcompute::task_ptr< ReturnType >::task_ptr (


task_ptr< return_type > && other )

Constructs a task_ptr<ReturnType> object that manages the same task as other and resets
other. If other points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to move from.
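
Since task_ptr is described as similar to std::shared_ptr, the effect of move construction can be illustrated with the standard type. The sketch below is an analogy only: fake_task and move_transfers_ownership are hypothetical names, and the code does not use the HetCompute API. After the move, the destination manages the object and the source manages nothing.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical stand-in for a task object.
struct fake_task
{
    int id;
};

// Returns true if move construction left the source empty and made the
// destination the sole owner of the original object.
bool move_transfers_ownership()
{
    std::shared_ptr<fake_task> source = std::make_shared<fake_task>(fake_task{7});
    fake_task* raw = source.get();

    // Move-construct: dest takes over the managed object, source is reset.
    std::shared_ptr<fake_task> dest(std::move(source));

    return source == nullptr && dest.get() == raw && dest.use_count() == 1;
}
```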

10.5.1.6.2.5 template<typename ReturnType > hetcompute::task_ptr< ReturnType >::∼task_ptr ( )

Destructor.

10.5.1.6.3 Member Function Documentation

10.5.1.6.3.1 template<typename ReturnType > task_type∗ hetcompute::task_ptr< ReturnType >::get (


) const

Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<ReturnType> objects managing it. If all task_ptr<ReturnType> objects managing
a task t go out of scope, all task<ReturnType>∗ pointing to t may be invalid.

Returns

Pointer to managed task object.

10.5.1.6.3.2 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator%= ( T && op )

Compound assignment operator %= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.3 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator&= ( T && op )

Compound assignment operator &= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)


10.5.1.6.3.4 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator*= ( T && op )

Compound assignment operator ∗= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.5 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator+= ( T && op )

Compound assignment operator += with value operand.

Creates a new task whose return value is the result of applying this operator to the return value of the
original task pointed to by this task_ptr and the operand on the right side of the operator.
The new task is data dependent on the original task and, if the right-side operand is also a task, on that
operand as well. The runtime launches the new task automatically once the data is ready.
This task_ptr releases its reference to the old task and points to the newly created task, which holds the
result value.

Note: The operator must be applicable to the return value of the current task and the operand (if the
operand is also a task, its return value is used).

Parameters

op Operand on the right side of this operator.

Returns

A new task whose return value is the result of this operator and can be pointed to by this shared pointer
(same type of return value).

#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create and launch a task that returns 73.
    auto t = hetcompute::launch([]() { return 73; });

    // Create a task whose return value is t's return value + 27.8.
    // The new task is still pointed to by t.
    // The new task is data dependent on the original task,
    // and the return type stays the same (type coercion).
    t += 27.8;

    // Wait for t to finish and display the return value.
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
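
The rebinding behavior of the compound assignment operators can be modeled without the runtime. The sketch below is a single-threaded analogy in standard C++, not the SDK implementation: lazy_value is a hypothetical handle that stores a computation, and its operator+= replaces that computation with a new one that depends on the old result. Converting the sum back to T mirrors the statement that the return type stays the same.

```cpp
#include <cassert>
#include <functional>
#include <utility>

// Hypothetical stand-in for a task handle; not a HetCompute type.
template <typename T>
class lazy_value
{
public:
    explicit lazy_value(std::function<T()> fn) : fn_(std::move(fn)) {}

    // Rebinds the handle to a new computation that consumes the old one.
    template <typename U>
    lazy_value& operator+=(U operand)
    {
        std::function<T()> previous = std::move(fn_);
        fn_ = [previous, operand]() { return static_cast<T>(previous() + operand); };
        return *this;
    }

    // Evaluates the current computation, including everything it depends on.
    T get() const { return fn_(); }

private:
    std::function<T()> fn_;
};

// Mirrors the example above: start from 73, add 27.8, keep the int type.
int demo_sum()
{
    lazy_value<int> t([] { return 73; });
    t += 27.8; // 73 + 27.8 = 100.8, converted back to int
    return t.get();
}
```

In this model demo_sum() yields 100, because the double result 100.8 is truncated when it is converted back to int.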


10.5.1.6.3.6 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator-= ( T && op )

Compound assignment operator -= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.7 template<typename ReturnType > task_type∗ hetcompute::task_ptr< ReturnType


>::operator-> ( ) const

Returns pointer to managed task. Do not call this member function if *this manages no task.

Exceptions

api_exception If task pointer is nullptr.

Note

If exceptions are disabled in the application, the runtime terminates the application if the task pointer is
nullptr.

Returns

Pointer to managed task object.

10.5.1.6.3.8 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator/= ( T && op )

Compound assignment operator /= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.9 template<typename ReturnType > task_ptr& hetcompute::task_ptr< ReturnType


>::operator= ( task_ptr< return_type > const & other )

Assigns the task managed by other to ∗this. If, before the assignment, ∗this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, ∗this will also not manage an object after the assignment.

Parameters

other Task pointer to copy.

Returns

∗this.
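
The destruction rule for copy assignment matches that of std::shared_ptr, so it can be observed with the standard type. The sketch below is an analogy only: counted_task and destroyed_after_assignment are hypothetical names, and no HetCompute API is used. The destructor counter shows that the object previously managed by the left-hand side is destroyed at the assignment when no other pointer manages it.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a task; counts how many instances were destroyed.
struct counted_task
{
    explicit counted_task(int& destroyed) : destroyed_(destroyed) {}
    ~counted_task() { ++destroyed_; }
    int& destroyed_;
};

// Returns the number of destructor calls observed right after the assignment.
int destroyed_after_assignment()
{
    int destroyed = 0;
    auto a = std::make_shared<counted_task>(destroyed);
    auto b = std::make_shared<counted_task>(destroyed);

    // a was the last pointer managing the first object, so assigning b to a
    // destroys that object; a and b now share the second object.
    a = b;
    assert(a.use_count() == 2);

    return destroyed;
}
```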


10.5.1.6.3.10 template<typename ReturnType > task_ptr& hetcompute::task_ptr< ReturnType


>::operator= ( task_ptr< return_type > && other )

Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, ∗this will also not manage an object after the assignment.

Returns

∗this.

10.5.1.6.3.11 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator^= ( T && op )

Compound assignment operator ^= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.12 template<typename ReturnType > template<typename T > task_ptr&
hetcompute::task_ptr< ReturnType >::operator|= ( T && op )

Compound assignment operator |= with value operand.


See Also

template<typename T> task_ptr& operator+=(T&& op2)

10.5.1.6.3.13 template<typename ReturnType > void hetcompute::task_ptr< ReturnType >::swap (


task_ptr< return_type > & other )

Exchanges managed tasks between ∗this and other.

Parameters

other Task pointer to exchange with.

10.5.1.7 class hetcompute::task_ptr< ReturnType(Args...)>

template<typename ReturnType, typename... Args>class hetcompute::task_ptr< ReturnType(Args...)>

Smart pointer to a task object, i.e., hetcompute::task<ReturnType(Args...)>, similar to
std::shared_ptr.

Template Parameters

ReturnType Return type of the task function.
Args... The types of the task parameters.

Public Types

• using args_tuple = typename task_type::args_tuple


• using return_type = typename task_type::return_type
• using size_type = typename task_type::size_type
• using task_type = task< ReturnType(Args...)>

Public member functions

• task_ptr ()
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.

• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<ReturnType> with no task<ReturnType>.

• task_ptr (task_ptr< ReturnType(Args...)> const &other)


Copy constructor. Constructs a task_ptr<ReturnType> that manages the same task as other.

• task_ptr (task_ptr< ReturnType(Args...)> &&other)


Move constructor. Move-constructs a task_ptr<ReturnType> that manages the same task as
other.

• ∼task_ptr ()
Destructor.

• task_type ∗ get () const


Returns pointer to managed task.

• task_type ∗ operator-> () const


Dereference operator. Returns pointer to managed task.

• task_ptr & operator= (task_ptr< ReturnType(Args...)> const &other)


Assignment operator. Assigns the task managed by other to ∗this.

• task_ptr & operator= (task_ptr< ReturnType(Args...)> &&other)


Assignment operator. Resets ∗this.

• void swap (task_ptr< ReturnType(Args...)> &other)


Exchanges managed tasks between ∗this and other.

Static Public Attributes

• static HETCOMPUTE_CONSTEXPR_CONST size_type arity = task_type::arity


10.5.1.7.1 Member Typedef Documentation

10.5.1.7.1.1 template<typename ReturnType , typename... Args> using hetcompute::task_ptr<


ReturnType(Args...)>::args_tuple = typename task_type::args_tuple

Tuple for the task object parameter types.

10.5.1.7.1.2 template<typename ReturnType , typename... Args> using hetcompute::task_ptr<


ReturnType(Args...)>::return_type = typename task_type::return_type

Return type of the task function.

10.5.1.7.1.3 template<typename ReturnType , typename... Args> using hetcompute::task_ptr<


ReturnType(Args...)>::size_type = typename task_type::size_type

Unsigned integral type.

10.5.1.7.1.4 template<typename ReturnType , typename... Args> using hetcompute::task_ptr<


ReturnType(Args...)>::task_type = task<ReturnType(Args...)>

Task object type.

10.5.1.7.2 Constructors and Destructors

10.5.1.7.2.1 template<typename ReturnType , typename... Args> hetcompute::task_ptr<


ReturnType(Args...)>::task_ptr ( )

Constructs a task_ptr<ReturnType> that manages no task<ReturnType>.


task_ptr<ReturnType>::get returns nullptr.

10.5.1.7.2.2 template<typename ReturnType , typename... Args> hetcompute::task_ptr<


ReturnType(Args...)>::task_ptr ( std::nullptr_t )

Constructs a task_ptr<ReturnType> that manages no task<ReturnType>.


task_ptr<ReturnType>::get returns nullptr.

10.5.1.7.2.3 template<typename ReturnType , typename... Args> hetcompute::task_ptr<


ReturnType(Args...)>::task_ptr ( task_ptr< ReturnType(Args...)> const & other )

Constructs a task_ptr<ReturnType> object that manages the same task<ReturnType> as other.


If other points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to copy.


10.5.1.7.2.4 template<typename ReturnType , typename... Args> hetcompute::task_ptr<


ReturnType(Args...)>::task_ptr ( task_ptr< ReturnType(Args...)> && other )

Constructs a task_ptr<ReturnType> object that manages the same task as other and resets
other. If other points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to move from.

10.5.1.7.2.5 template<typename ReturnType , typename... Args> hetcompute::task_ptr<


ReturnType(Args...)>::∼task_ptr ( )

Destructor.

10.5.1.7.3 Member Function Documentation

10.5.1.7.3.1 template<typename ReturnType , typename... Args> task_type∗ hetcompute::task_ptr<


ReturnType(Args...)>::get ( ) const

Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<ReturnType> objects managing it. If all task_ptr<ReturnType> objects managing
a task t go out of scope, all task<ReturnType>∗ pointing to t may be invalid.

Returns

Pointer to managed task object.

10.5.1.7.3.2 template<typename ReturnType , typename... Args> task_type∗ hetcompute::task_ptr<


ReturnType(Args...)>::operator-> ( ) const

Returns pointer to managed task. Do not call this member function if ∗this manages no task.

Returns

Pointer to managed task object.

10.5.1.7.3.3 template<typename ReturnType , typename... Args> task_ptr& hetcompute::task_ptr<


ReturnType(Args...)>::operator= ( task_ptr< ReturnType(Args...)> const & other )

Assigns the task managed by other to ∗this. If, before the assignment, ∗this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, ∗this will also not manage an object after the assignment.


Parameters

other Task pointer to copy.

Returns

∗this.

10.5.1.7.3.4 template<typename ReturnType , typename... Args> task_ptr& hetcompute::task_ptr<


ReturnType(Args...)>::operator= ( task_ptr< ReturnType(Args...)> && other )

Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last
task_ptr<ReturnType> pointing to a task t, then the assignment will cause t to be destroyed. If
other manages no object, ∗this will also not manage an object after the assignment.

Returns

∗this.

10.5.1.7.3.5 template<typename ReturnType , typename... Args> void hetcompute::task_ptr<


ReturnType(Args...)>::swap ( task_ptr< ReturnType(Args...)> & other )

Exchanges managed tasks between ∗this and other.

Parameters

other Task pointer to exchange with.

10.5.1.7.4 Member Data Documentation

10.5.1.7.4.1 template<typename ReturnType , typename... Args> HETCOMPUTE_CONSTEXPR_CONST
size_type hetcompute::task_ptr< ReturnType(Args...)>::arity = task_type::arity [static]

Number of parameters.
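
For a function type of the form ReturnType(Args...), the number of parameters can be computed at compile time from the parameter pack. The sketch below shows the general technique in standard C++; function_arity is a hypothetical trait written for illustration and is not claimed to be how task_type::arity is implemented.

```cpp
#include <cassert>
#include <cstddef>

// Primary template intentionally left undefined: only function types match.
template <typename Signature>
struct function_arity;

// Partial specialization for function types; sizeof... counts the parameters.
template <typename ReturnType, typename... Args>
struct function_arity<ReturnType(Args...)>
{
    static constexpr std::size_t value = sizeof...(Args);
};

static_assert(function_arity<int(float, char)>::value == 2, "two parameters");
static_assert(function_arity<void()>::value == 0, "no parameters");
```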

10.5.1.8 class hetcompute::task_ptr< void >

template<>class hetcompute::task_ptr< void >

Smart pointer to a task object, i.e., hetcompute::task<void>, similar to std::shared_ptr.

Public Types

• using task_type = task< void >


Public member functions

• task_ptr ()
Default constructor. Constructs a task_ptr<void> with no task<void>.

• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<void> with no task<void>.

• task_ptr (task_ptr< void > const &other)


Copy constructor. Constructs a task_ptr<void> that manages the same task as other.

• task_ptr (task_ptr< void > &&other)


Move constructor. Move-constructs a task_ptr<void> that manages the same task as other.

• ∼task_ptr ()
• task_type ∗ get () const
Returns pointer to managed task.

• task_type ∗ operator-> () const


Dereference operator. Returns pointer to managed task.

• task_ptr & operator= (task_ptr< void > const &other)


Assignment operator. Assigns the task managed by other to ∗this.

• task_ptr & operator= (task_ptr< void > &&other)


Assignment operator. Resets ∗this.

• void swap (task_ptr< void > &other)


Exchanges managed tasks between ∗this and other.

Protected Member Functions

• task_ptr (::hetcompute::internal::task ∗t,::hetcompute::internal::task_shared_ptr::ref_policy policy)

10.5.1.8.1 Member Typedef Documentation

10.5.1.8.1.1 using hetcompute::task_ptr< void >::task_type = task<void>

Task object type.

10.5.1.8.2 Constructors and Destructors

10.5.1.8.2.1 hetcompute::task_ptr< void >::task_ptr ( )

Constructs a task_ptr<void> that manages no task<void>. task_ptr<void>::get returns nullptr.


10.5.1.8.2.2 hetcompute::task_ptr< void >::task_ptr ( std::nullptr_t )

Constructs a task_ptr<void> that manages no task<void>. task_ptr<void>::get returns nullptr.

10.5.1.8.2.3 hetcompute::task_ptr< void >::task_ptr ( task_ptr< void > const & other )

Constructs a task_ptr<void> object that manages the same task<void> as other. If other points
to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to copy.

10.5.1.8.2.4 hetcompute::task_ptr< void >::task_ptr ( task_ptr< void > && other )

Constructs a task_ptr<void> object that manages the same task as other and resets other. If
other points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to move from.

10.5.1.8.2.5 hetcompute::task_ptr< void >::∼task_ptr ( )

Destructor.

10.5.1.8.3 Member Function Documentation

10.5.1.8.3.1 task_type∗ hetcompute::task_ptr< void >::get ( ) const

Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<void> objects managing it. If all task_ptr<void> objects managing a task t go out of
scope, all task<void>∗ pointing to t may be invalid.

Returns

Pointer to managed task object.


10.5.1.8.3.2 task_type∗ hetcompute::task_ptr< void >::operator-> ( ) const

Returns pointer to managed task. Do not call this member function if ∗this manages no task.

Returns

Pointer to managed task object.

10.5.1.8.3.3 task_ptr& hetcompute::task_ptr< void >::operator= ( task_ptr< void > const & other )

Assigns the task managed by other to ∗this. If, before the assignment, ∗this was the last
task_ptr<void> pointing to a task t, then the assignment will cause t to be destroyed. If other
manages no object, ∗this will also not manage an object after the assignment.

Parameters

other Task pointer to copy.

Returns

∗this.

10.5.1.8.3.4 task_ptr& hetcompute::task_ptr< void >::operator= ( task_ptr< void > && other )

Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last
task_ptr<void> pointing to a task t, then the assignment will cause t to be destroyed. If other
manages no object, ∗this will also not manage an object after the assignment.

Returns

∗this.

10.5.1.8.3.5 void hetcompute::task_ptr< void >::swap ( task_ptr< void > & other )

Exchanges managed tasks between ∗this and other.

Parameters

other Task pointer to exchange with.

10.5.1.9 class hetcompute::task_ptr<>


template<>class hetcompute::task_ptr<>

Smart pointer to a task object, i.e., hetcompute::task<>, similar to std::shared_ptr.

Public Types

• using task_type = task<>

Public member functions

• task_ptr ()
Default constructor. Constructs a task_ptr<> with no task<>.

• task_ptr (std::nullptr_t)
Default constructor. Constructs a task_ptr<> with no task<>.

• task_ptr (task_ptr const &other)


Copy constructor. Constructs a task_ptr<> that manages the same task as other.

• task_ptr (task_ptr &&other)


Move constructor. Move-constructs a task_ptr<> that manages the same task as other.

• ∼task_ptr ()
• task_type ∗ get () const
Returns the pointer to the managed task.

• operator bool () const


Checks whether pointer is not nullptr.

• task_type ∗ operator-> () const


Dereference operator. Returns pointer to managed task.

• task_ptr & operator= (task_ptr const &other)


Assignment operator. Assigns the task managed by other to ∗this.

• task_ptr & operator= (std::nullptr_t)


Assignment operator. Resets ∗this.

• task_ptr & operator= (task_ptr &&other)


Move-assignment operator. Move-assigns the task managed by other to ∗this.

• void reset ()
Resets the pointer to the managed task.

• void swap (task_ptr &other)


Exchanges managed tasks between ∗this and other.

• bool unique () const


Returns whether this is the only task_ptr managing the underlying task.

• size_t use_count () const


Returns the number of task_ptr<> objects managing the same object (including ∗this).

10.5.1.9.1 Member Typedef Documentation

10.5.1.9.1.1 using hetcompute::task_ptr<>::task_type = task<>

Task object type.

10.5.1.9.2 Constructors and Destructors

10.5.1.9.2.1 hetcompute::task_ptr<>::task_ptr ( )

Constructs a task_ptr<> that manages no task<>. task_ptr<>::get returns nullptr.

10.5.1.9.2.2 hetcompute::task_ptr<>::task_ptr ( std::nullptr_t )

Constructs a task_ptr<> that manages no task<>. task_ptr<>::get returns nullptr.

10.5.1.9.2.3 hetcompute::task_ptr<>::task_ptr ( task_ptr<> const & other )

Constructs a task_ptr<> object that manages the same task<> as other. If other points to
nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to copy.

10.5.1.9.2.4 hetcompute::task_ptr<>::task_ptr ( task_ptr<> && other )

Constructs a task_ptr<> object that manages the same task as other and resets other. If other
points to nullptr, the newly built object also points to nullptr.

Parameters

other Task pointer to move from.

10.5.1.9.2.5 hetcompute::task_ptr<>::∼task_ptr ( )

Default destructor.

10.5.1.9.3 Member Function Documentation


10.5.1.9.3.1 task_type∗ hetcompute::task_ptr<>::get ( ) const

Returns pointer to the managed task. Remember that the lifetime of the task is defined by the lifetime of the
task_ptr<> objects managing it. If all task_ptr<> objects managing a task t go out of scope, all
task<>∗ pointing to t may be invalid.

Returns

Pointer to managed task object.

10.5.1.9.3.2 hetcompute::task_ptr<>::operator bool ( ) const [explicit]

Checks whether ∗this manages a task.

Returns

true – The pointer is not nullptr (∗this manages a task).


false – The pointer is nullptr (∗this does not manage a task).

10.5.1.9.3.3 task_type∗ hetcompute::task_ptr<>::operator-> ( ) const

Returns pointer to managed task. Do not call this member function if ∗this manages no task.

Returns

Pointer to managed task object.
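
Because operator-> must not be called on a pointer that manages no task, a common pattern is to test the pointer first through its explicit operator bool. The sketch below shows the pattern with std::shared_ptr as a stand-in, since no SDK is assumed to be available here; fake_task and launch_if_set are hypothetical names. The same template would apply to any pointer type that offers an explicit operator bool and operator->.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in for a task object.
struct fake_task
{
    int launches = 0;
    void launch() { ++launches; }
};

// Hypothetical helper: launches the task only if the pointer manages one.
template <typename Ptr>
bool launch_if_set(const Ptr& p)
{
    // An if condition performs a contextual conversion to bool, which is
    // permitted even when operator bool is declared explicit.
    if (!p)
        return false;

    p->launch();
    return true;
}
```

Note that an explicit operator bool does not allow implicit conversions such as bool b = p; write static_cast<bool>(p) or use the pointer directly in a condition.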

10.5.1.9.3.4 task_ptr& hetcompute::task_ptr<>::operator= ( task_ptr<> const & other )

Assigns the task managed by other to ∗this. If, before the assignment, ∗this was the last
task_ptr<> pointing to a task t, then the assignment will cause t to be destroyed. If other manages
no object, ∗this will also not manage an object after the assignment.

Parameters

other Task pointer to copy.

Returns

∗this.


10.5.1.9.3.5 task_ptr& hetcompute::task_ptr<>::operator= ( std::nullptr_t )

Resets ∗this so that it manages no object. If, before the assignment, ∗this was the last task_ptr<>
pointing to a task t, then the assignment will cause t to be destroyed. ∗this will also not manage an
object after the assignment.

Returns

∗this.

10.5.1.9.3.6 task_ptr& hetcompute::task_ptr<>::operator= ( task_ptr<> && other )

Move-assigns the task managed by other to ∗this. other will manage no task after the assignment.
If, before the assignment, ∗this was the last task_ptr<> pointing to a task t, then the assignment will
cause t to be destroyed. If other manages no object, ∗this will also not manage an object after the
assignment.

Parameters

other Task pointer to move from.

Returns

∗this.

10.5.1.9.3.7 void hetcompute::task_ptr<>::reset ( )

Resets the pointer to the managed task so that *this manages no task. If *this was the last task_ptr<>
pointing to a task t, then reset() causes t to be destroyed. This member function returns no value.

10.5.1.9.3.8 void hetcompute::task_ptr<>::swap ( task_ptr<> & other )

Exchanges managed tasks between ∗this and other.

Parameters

other Task pointer to exchange with.


10.5.1.9.3.9 bool hetcompute::task_ptr<>::unique ( ) const

Returns whether this is the only task_ptr object managing the underlying task.

Returns

A boolean indicating whether this task_ptr uniquely manages the underlying task.

10.5.1.9.3.10 size_t hetcompute::task_ptr<>::use_count ( ) const

Returns the number of task_ptr<> objects managing the same object (including ∗this).
#include <atomic>
#include <cassert>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    std::atomic<bool> running(false);
    std::atomic<bool> finish(false);

    // Create task t.
    auto t = hetcompute::create_task([&running, &finish] {
        running = true;
        while (!finish)
        {
        }
    });

    // t's use_count should be 1.
    HETCOMPUTE_ILOG("After construction: t.use_count() = %zu\n", t.use_count());

    // Copy-construct t2 from t. The use_count of t and t2 is 2.
    auto t2 = t;
    HETCOMPUTE_ILOG("After copy-construction: t2.use_count() = %zu\n", t2.use_count());

    // get() returns a raw pointer and does not change use_count.
    auto t3 = t.get();
    HETCOMPUTE_ILOG("After calling t.get(). t.use_count() = %zu\n", t.use_count());

    // t's use_count should still be 2.
    HETCOMPUTE_ILOG("After t->wait_for: t.use_count() = %zu\n", t.use_count());

    assert(t3 != nullptr);
    HETCOMPUTE_UNUSED(t3);
    hetcompute::runtime::shutdown();
    return 0;
}

Output

After construction: t.use_count() = 1
After copy-construction: t2.use_count() = 2
After calling t.get(). t.use_count() = 2
After t->wait_for: t.use_count() = 2

Returns

Total number of task_ptr<> objects that point to the same task.

10.5.2 Typedef Documentation

10.5.2.1 template<typename Fn > using hetcompute::collapsed_task_type = typedef typename
::hetcompute::internal::task_factory<Fn>::collapsed_task_type

Collapsed task type.

10.5.2.2 template<typename Fn > using hetcompute::non_collapsed_task_type = typedef typename
::hetcompute::internal::task_factory<Fn>::non_collapsed_task_type

Non-collapsed task type.

10.5.3 Function Documentation

10.5.3.1 void hetcompute::abort_on_cancel ( )

HetCompute uses cooperative multitasking. Therefore, it cannot abort an executing task without help from
the task. In HetCompute, each executing task is responsible for periodically checking whether it should
abort. Tasks call hetcompute::abort_on_cancel() to test whether they, or any of the groups
to which they belong, have been canceled. If so, hetcompute::abort_on_cancel() does not
return. Instead, it throws hetcompute::abort_task_exception, which the HetCompute runtime
catches. The runtime then transitions the task to a canceled state and propagates cancellation to the task's
successors, if any.
Because hetcompute::abort_on_cancel() does not return if the task has been canceled, we
recommend that you use RAII to allocate and deallocate the resources used inside a task. If using RAII
in your code is not an option, surround hetcompute::abort_on_cancel() with try/catch, and
rethrow with throw from within the catch block after the cleanup code.
Exceptions

abort_task_exception If called from a task that has been canceled via hetcompute::cancel() or
that belongs to a canceled group.
api_exception If called from outside a task.

Note

If exceptions are disabled in the application, this function terminates the app if called from outside a
task. Another caveat when using abort_on_cancel with exceptions disabled: the application
code can get sandwiched between functions that are able to handle exceptions, resulting in improper
cleanup in the function where exceptions are disabled.

Example 1

#include <unistd.h> // usleep

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create task
    auto t = hetcompute::create_task([] {
        size_t num_iters = 0;
        while (1)
        {
            HETCOMPUTE_ILOG("Task has executed %zu iterations!", num_iters);

            // Check whether the task needs to stop execution.
            // Without abort_on_cancel() the task would never
            // return
            hetcompute::abort_on_cancel();

            usleep(30);
            num_iters++;
        }
    });

    // Create group g
    auto g = hetcompute::create_group("example group");

    // Launch t into g.
    g->launch(t);
    // We don't use t after launch(), so we can reset the shared pointer
    t.reset();

    // Wait for the task to execute a few iterations
    usleep(200);

    // Cancel group g, and wait for t to complete
    g->cancel();

    try
    {
        g->wait_for();
    }
    catch (const hetcompute::canceled_exception&)
    {
        // Do nothing
    }
    catch (...)
    {
        // Do nothing
    }

    hetcompute::runtime::shutdown();
    return 0;
}

Output

Task has executed 1 iterations!
Task has executed 2 iterations!
...
Task has executed 47 iterations!

Example 2

#include <cassert>
#include <iostream>
#include <unistd.h> // usleep

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t = hetcompute::create_task([] {
        while (1)
        {
            try
            {
                hetcompute::abort_on_cancel();
            }
            catch (const hetcompute::abort_task_exception&)
            {
                //..do cleanup
                throw;
            }
            catch (...)
            {
                //..do cleanup
                throw;
            }
            // HETCOMPUTE_ILOG("Waiting to be canceled.\n");
            usleep(10);
        }
        assert(false); // This will never fire
    });

    // Launch t
    t->launch();

    // Wait for 20 micro-seconds.
    usleep(20);

    // Cancel task. Returns immediately.
    t->cancel();

    try
    {
        // Wait for the task to complete.
        t->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    hetcompute::runtime::shutdown();
    return 0;
}
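
The RAII recommendation for hetcompute::abort_on_cancel() can be illustrated without the SDK. The following is a minimal sketch in standard C++ only: cancel_token, abort_requested, resource_guard, and run_cancelable are hypothetical names that stand in for the runtime's cancellation check and exception, and are not HetCompute APIs. It shows that a guard's destructor still runs when the cancellation check throws.

```cpp
#include <atomic>
#include <cassert>
#include <exception>

// Hypothetical stand-ins; none of these names are HetCompute APIs.
struct abort_requested : std::exception
{
};

struct cancel_token
{
    std::atomic<bool> canceled{false};

    // Analogue of a cooperative cancellation check: returns normally
    // unless cancellation was requested, in which case it throws.
    void abort_if_canceled() const
    {
        if (canceled.load())
        {
            throw abort_requested{};
        }
    }
};

// RAII guard: the destructor runs during stack unwinding, so the
// resource is released even when the cancellation check throws.
struct resource_guard
{
    int& released;
    explicit resource_guard(int& r) : released(r) {}
    ~resource_guard() { ++released; }
};

// Runs up to max_iters iterations and simulates a cancellation request
// at iteration cancel_at. Returns the number of completed iterations.
inline int run_cancelable(int max_iters, int cancel_at, int& released)
{
    cancel_token token;
    int done = 0;
    try
    {
        resource_guard guard(released);
        for (int i = 0; i < max_iters; ++i)
        {
            if (i == cancel_at)
            {
                token.canceled = true; // simulates cancel() issued elsewhere
            }
            token.abort_if_canceled();
            ++done;
        }
    }
    catch (const abort_requested&)
    {
        // Plays the role of the runtime, which catches the exception.
    }
    return done;
}
```

With max_iters = 10 and cancel_at = 3, three iterations complete before the check throws, and the guard releases its resource exactly once.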

10.5.3.2 void hetcompute::abort_task ( )

Use this method from within a running task to immediately abort it and all its successors.
hetcompute::abort_task() never returns. Instead, it throws
hetcompute::abort_task_exception, which the HetCompute runtime catches. The runtime then
transitions the task to a canceled state and propagates cancellation to the task's successors, if any.
Exceptions

abort_task_exception If called from a task that has been canceled via hetcompute::cancel() or a task
that belongs to a canceled group.
api_exception If called from outside a task.

Note

If exceptions are disabled in the application, this function terminates the app if called from outside a task.

Example

#include <cassert>
#include <iostream>
#include <unistd.h> // sleep

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto t1 = hetcompute::create_task([] {
        int i = 0;
        while (true)
        {
            HETCOMPUTE_ILOG("Hello World %d\n", i);
            sleep(1);
            i++;
            if (i == 10)
            {
                hetcompute::abort_task();
            }
        }
        // This will never fire
        assert(false);
    });

    auto t2 = hetcompute::create_task([] {
        // This will never fire
        assert(false);
    });

    t1 >> t2;

    // Launch tasks
    t1->launch();
    t2->launch();

    try
    {
        // Wait for t1 to complete.
        t1->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown when syncing with t1" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    try
    {
        // Returns immediately, t2 is canceled.
        t2->wait_for();
    }
    catch (const hetcompute::canceled_exception& e)
    {
        std::cout << e.what() << " thrown when syncing with t2" << std::endl;
    }
    catch (...)
    {
        // Never reached
    }

    hetcompute::runtime::shutdown();
    return 0;
}

Output

Hello World!
Hello World!
..
Hello World!

10.5.3.3 template<typename Task > hetcompute::internal::by_data_dep_t<Task&&>
hetcompute::bind_as_data_dependency ( Task && t )

Explicitly bind a hetcompute::task_ptr<...> or hetcompute::task<...>∗ as a data dependency. The type of
the return value of the hetcompute::task_ptr<...> or the hetcompute::task<...>∗ should match the
corresponding parameter type for the task to bind.

Parameters

t A hetcompute::task_ptr or hetcompute::task∗ which has the return type information.

10.5.3.4 template<typename Task > hetcompute::internal::by_value_t<Task&&>
hetcompute::bind_by_value ( Task && t )

Explicitly bind a hetcompute::task_ptr<...> or hetcompute::task<...>∗ by value. The
type of the hetcompute::task_ptr<...> or the hetcompute::task<...>∗ should match
the corresponding parameter type for the task to bind.

Parameters

t A hetcompute::task_ptr or hetcompute::task∗ which has the return type information.

10.5.3.5 template<typename BlockingFunction , typename CancelFunction > void
hetcompute::blocking ( BlockingFunction && bf, CancelFunction && cf )

Used to enclose user-code that blocks on external activity and needs to be cancelable when an enclosing
task gets canceled.
A function/functor containing the blocking code bf is executed immediately. If cancellation is
asynchronously requested for the enclosing task while bf is currently executing, the cancellation handler
function/functor cf is asynchronously executed. Once bf completes, blocking throws
hetcompute::canceled_exception if task cancellation was requested.
If cancellation of the task had already been requested prior to the execution of blocking, blocking
immediately throws hetcompute::canceled_exception without executing bf or cf.
The programmer must write bf and cf to satisfy the following requirements:

1. cf can be safely called concurrently with bf.
2. The blocked work inside bf must somehow be able to unblock and resume execution when signalled
to do so by cf.

Example

bf may block on a network access causing its thread to sleep. cf writes special data into the network
handle causing bf to unblock.

auto t = hetcompute::create_task([&x, &handle]()
{
    hetcompute::blocking(
        [&]() { x = network_fetch(handle); }, // blocking code
        [&]() { write_spurious(handle); }     // cancellation handler
    );
    // throws hetcompute::canceled_exception if encapsulating task is canceled

    do_whole_bunch_of_work(x, ...);
});

Note: It is not required that the blocking construct be enclosed in a task. Without an enclosing task bf will
execute as a normal function and cf will never be invoked.

Parameters

bf Function/functor/lambda with signature void(void) that encapsulates blocking work.
cf Function/functor/lambda with signature void(void) that is capable of
canceling blocked work in bf.
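
The two requirements on bf and cf can be sketched in standard C++ only. In the following illustration, event_source, blocking_fetch, cancel_fetch, and demo_cancel_unblocks are hypothetical names, not HetCompute APIs: blocking_fetch plays the role of bf, and cancel_fetch plays the role of cf. A mutex makes the concurrent call safe, and a condition variable lets the handler unblock the waiting code.

```cpp
#include <cassert>
#include <condition_variable>
#include <mutex>
#include <thread>

// Hypothetical stand-in for an external event source; not a HetCompute type.
struct event_source
{
    std::mutex m;
    std::condition_variable cv;
    bool data_ready = false;
    bool spurious = false; // set by the cancellation handler

    // Role of bf: blocks until data arrives or a wake-up is forced.
    // Returns true only if real data arrived.
    bool blocking_fetch()
    {
        std::unique_lock<std::mutex> lock(m);
        cv.wait(lock, [this] { return data_ready || spurious; });
        return data_ready;
    }

    // Role of cf: safe to call concurrently with blocking_fetch(),
    // and forces the blocked call to return.
    void cancel_fetch()
    {
        {
            std::lock_guard<std::mutex> lock(m);
            spurious = true;
        }
        cv.notify_all();
    }
};

// Starts the blocking call on a worker thread, cancels it, and reports
// whether real data was received.
inline bool demo_cancel_unblocks()
{
    event_source src;
    bool got_data = true;
    std::thread worker([&] { got_data = src.blocking_fetch(); });
    src.cancel_fetch();
    worker.join();
    return got_data;
}
```

Because the wait uses a predicate, the sketch behaves the same whether the cancellation handler runs before or after the worker starts waiting.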

10.5.3.6 template<typename Code , typename... Args> collapsed_task_type<Code>
hetcompute::create_task ( Code && code, Args &&... args )

Create a collapsed task out of Code and (optionally) bind all arguments.

Template Parameters

Code Type of the work for this task.
Args... Parameter types of the task.

Parameters

code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Arguments to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.

Returns

hetcompute::task_ptr<ReturnType(Params...)>, the task_ptr with full function signature.

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task out of a lambda and bind the argument.
    auto t1 = hetcompute::create_task([](int x) { return x; }, 27);
    // Launch t1.
    t1->launch();
    // Wait for t1 to finish and show the return value.
    HETCOMPUTE_ILOG("t1->copy_value() = %d", t1->copy_value()); // Expect 27;

    // Create a task out of a lambda and bind the argument later.
    auto t2 = hetcompute::create_task([](int x) { return x; });
    // Bind the argument before launch.
    t2->bind_all(42);
    // Launch t2.
    t2->launch();
    // Wait for t2 to finish and show the return value.
    HETCOMPUTE_ILOG("t2->copy_value() = %d", t2->copy_value()); // Expect 42;

    // Create a cpu kernel out of a lambda.
    auto cpu_kn = hetcompute::create_cpu_kernel([](int x) { return x; });
    // Create a task out of a cpu kernel and bind the argument.
    auto t3 = hetcompute::create_task(cpu_kn, 73);
    // Launch t3.
    t3->launch();
    // Wait for t3 to finish and show the return value.
    HETCOMPUTE_ILOG("t3->copy_value() = %d", t3->copy_value()); // Expect 73;

    // Create a collapsed task.
    // typeof(t4) = hetcompute::task_ptr<int(int)>
    auto t4 = hetcompute::create_task(
        [](int x) {
            // Create a task.
            // typeof(t) = hetcompute::task_ptr<int(int)>
            auto t = hetcompute::create_task([](int y) { return y; }, x);
            return t;
        },
        168);
    // Launch t4.
    t4->launch();
    // Wait for t4 to finish and show the return value.
    HETCOMPUTE_ILOG("t4->copy_value() = %d", t4->copy_value()); // Expect 168;

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::task<ReturnType(Args...)>::bind_all

10.5.3.7 template<typename Code , typename... Args> non_collapsed_task_type<Code>
hetcompute::create_task ( do_not_collapse_t , Code && code, Args &&... args )

Create a non-collapsed task out of Code and (optionally) bind all arguments.

Template Parameters

Code Type of the work for this task.
Args... Parameter types of the task.

Parameters

code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Arguments to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.

Returns

hetcompute::task_ptr<ReturnType(Params...)>, the task_ptr with full function signature.

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // Create a non-collapsed task.
    // typeof(t) = hetcompute::task_ptr<hetcompute::task_ptr<int(int)>(int)>
    auto t = hetcompute::create_task(
        hetcompute::do_not_collapse,
        [](int x) {
            // Create a task.
            // typeof(tt) = hetcompute::task_ptr<int(int)>
            auto tt = hetcompute::create_task([](int y) { return y; }, x);
            return tt;
        },
        271);

    // Launch t.
    t->launch();

    // Wait for t to finish and get the return value.
    auto tt = t->copy_value();

    // Launch tt.
    tt->launch();

    // Wait for tt to finish and show the return value.
    HETCOMPUTE_ILOG("tt->copy_value() = %d", tt->copy_value()); // Expect 271;

    hetcompute::runtime::shutdown();

    return 0;
}

See Also

hetcompute::task<ReturnType(Args...)>::bind_all
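
The difference between the collapsed and non-collapsed forms can be pictured with nested std::future objects. The following is a loose analogy in standard C++ only, not a description of how the SDK implements collapsing: make_nested and make_collapsed are hypothetical helper names. In the nested form the caller receives an inner computation and must obtain its value separately; in the collapsed form the caller sees a single computation that yields the inner value.

```cpp
#include <cassert>
#include <future>

// Non-collapsed analogue: the outer computation returns another
// computation, which the caller must evaluate in a second step.
inline std::future<std::future<int>> make_nested(int x)
{
    return std::async(std::launch::deferred, [x] {
        return std::async(std::launch::deferred, [x] { return x; });
    });
}

// Collapsed analogue: the inner computation is resolved inside the
// outer one, so the caller gets the final value directly.
inline std::future<int> make_collapsed(int x)
{
    return std::async(std::launch::deferred, [x] {
        std::future<int> inner = std::async(std::launch::deferred, [x] { return x; });
        return inner.get();
    });
}
```

This matches the shape of the types in the examples above: the collapsed task in the first example has type task_ptr<int(int)>, whereas the non-collapsed task returns a task_ptr that must be launched and read separately.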

10.5.3.8 template<typename ReturnType , typename... Args> ::hetcompute::task_ptr<ReturnType>
hetcompute::create_value_task ( Args &&... args )

Create a value task.
The task needs to be launched and will return immediately.

Template Parameters

ReturnType Return type of the task.
Args... Parameter types of the task.

Parameters

args Arguments used to construct an object of type ReturnType.

Returns

hetcompute::task_ptr<ReturnType> whose return value is an object of type ReturnType
constructed with args....

#include <hetcompute/hetcompute.hh>

// User-defined type
struct point2d
{
    // Member variables
    int _x;
    int _y;

    // first constructor
    explicit point2d(int x) : _x(x), _y(0) {}

    // second constructor
    point2d(int x, int y) : _x(x), _y(y) {}
};

int
main()
{
    hetcompute::runtime::init();

    // Create a value task that returns an object of built-in type (int) with value 2.
    auto t = hetcompute::create_value_task<int>(2);
    // Launch t.
    t->launch();
    // Wait for t to finish.
    t->wait_for();
    HETCOMPUTE_ILOG("t->copy_value() = %d", t->copy_value()); // Expect 2;

    int x = 5;
    // Create a value task that returns a point2d constructed by the first constructor.
    auto t1 = hetcompute::create_value_task<point2d>(x);
    // Launch t1.
    t1->launch();
    // Wait for t1 to finish.
    t1->wait_for();

    HETCOMPUTE_ILOG("t1->copy_value()._x = %d", t1->copy_value()._x); // Expect 5;
    HETCOMPUTE_ILOG("t1->copy_value()._y = %d", t1->copy_value()._y); // Expect 0;

    int y = 6;
    x = 7;
    // Create a value task that returns a point2d constructed by the second constructor.
    auto t2 = hetcompute::create_value_task<point2d>(x, y);
    // Launch t2.
    t2->launch();
    // Wait for t2 to finish.
    t2->wait_for();

    HETCOMPUTE_ILOG("t2->copy_value()._x = %d", t2->copy_value()._x); // Expect 7;
    HETCOMPUTE_ILOG("t2->copy_value()._y = %d", t2->copy_value()._y); // Expect 6;

    hetcompute::runtime::shutdown();

    return 0;
}

10.5.3.9 void hetcompute::finish_after ( ::hetcompute::task<> ∗ task )

Specifies that the task invoking this function should be deemed to finish only after task finishes. This
method returns immediately.
If the invoking task is multi-threaded, the programmer must ensure that concurrent calls to finish_after from
within the task are properly synchronized.
Do not call this function if task is nullptr. Doing so causes a fatal error.

Parameters

task Task after which the invoking task is deemed to finish. Cannot be nullptr.

Exceptions

api_exception If invoked from outside a task or from within a hetcompute::pfor_each, or if
task is nullptr.

10.5.3.10 template<typename Code , typename... Args> collapsed_task_type<Code>
hetcompute::launch ( Code && code, Args &&... args )

Create a collapsed task out of Code, bind all arguments, if any exist (mandatory), and launch the task.

Template Parameters

Code Type of the work for this task.
Args... Parameter types of the task.

Parameters

code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Arguments to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.

Returns

hetcompute::task_ptr<ReturnType(Params...)>, the task_ptr with full function signature.

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a task out of a lambda, bind the argument, and launch.
    auto t1 = hetcompute::launch([](int x) { return x; }, 27);
    // Wait for t1 to finish and show the return value.
    HETCOMPUTE_ILOG("t1->copy_value() = %d", t1->copy_value()); // Expect 27;

    // Create a cpu kernel out of a lambda.
    auto cpu_kn = hetcompute::create_cpu_kernel([](int x) { return x; });
    // Create a task out of a cpu kernel, bind the argument and launch.
    auto t2 = hetcompute::launch(cpu_kn, 73);
    // Wait for t2 to finish and show the return value.
    HETCOMPUTE_ILOG("t2->copy_value() = %d", t2->copy_value()); // Expect 73;

    // Create a collapsed task, bind the argument, and launch.
    // typeof(t3) = hetcompute::task_ptr<int(int)>
    auto t3 = hetcompute::launch(
        [](int x) {
            // Create a task.
            // typeof(t) = hetcompute::task_ptr<int(int)>
            auto t = hetcompute::create_task([](int y) { return y; }, x);
            return t;
        },
        168);

    // Wait for t3 to finish and show the return value.
    HETCOMPUTE_ILOG("t3->copy_value() = %d", t3->copy_value()); // Expect 168;

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::create_task(Code&&, Args&&...)

10.5.3.11 template<typename Code , typename... Args> non_collapsed_task_type<Code>
hetcompute::launch ( do_not_collapse_t , Code && code, Args &&... args )

Create a non-collapsed task out of Code, bind all arguments, if any exist (mandatory), and launch the task.

Template Parameters

Code Type of the work for this task.
Args... Parameter types of the task.

Parameters

code The work for the task. code can be Qualcomm HetCompute kernels (CPU,
GPU, or DSP), a lambda expression, a function object, or a function pointer.
args Arguments to bind to the task (only supported by CPU tasks). If left
empty, no arguments will be bound to the task.

Returns

hetcompute::task_ptr<ReturnType(Params...)>, the task_ptr with full function signature.

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();
    // Create a non-collapsed task, bind the argument, and launch.
    // typeof(t) = hetcompute::task_ptr<hetcompute::task_ptr<int(int)>(int)>
    auto t = hetcompute::launch(hetcompute::do_not_collapse,
                                [](int x) {
                                    // Create a task, bind the argument and launch.
                                    // typeof(tt) = hetcompute::task_ptr<int(int)>
                                    auto tt = hetcompute::launch([](int y) { return y; }, x);
                                    return tt;
                                },
                                271);

    // Wait for t to finish and get the return value.
    auto tt = t->copy_value();

    // Launch tt.
    tt->launch();
    // Wait for tt to finish and show the return value.
    HETCOMPUTE_ILOG("tt->copy_value() = %d", tt->copy_value()); // Expect 271;

    hetcompute::runtime::shutdown();
    return 0;
}

See Also

hetcompute::create_task(do_not_collapse_t, Code&&, Args&&...)

10.5.3.12 bool hetcompute::operator!= ( ::hetcompute::task_ptr<> const & t, std::nullptr_t )

Compares task t to nullptr.

Returns

true – The pointer is not nullptr (t manages a task). false – The pointer is nullptr (t does not
manage a task).

10.5.3.13 bool hetcompute::operator!= ( std::nullptr_t , ::hetcompute::task_ptr<> const & t )

Compares task t to nullptr.

Returns

true – The pointer is not nullptr (t manages a task). false – The pointer is nullptr (t does not
manage a task).

10.5.3.14 bool hetcompute::operator!= ( ::hetcompute::task_ptr<> const & a,
::hetcompute::task_ptr<> const & b )

Compares task a to task b.

Returns

true – Task a is not the same as task b. false – Task a is the same as task b.

10.5.3.15 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() %
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator% ( const
::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator % for tasks.

See Also

template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(const
::hetcompute::task_ptr<T1>& t1, const ::hetcompute::task_ptr<T2>& t2)

10.5.3.16 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() %
std::declval<T2>())> hetcompute::operator% ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator % for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.17 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<T1>() %
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator% ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator % for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.18 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() &
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator& ( const
::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator & for tasks.

See Also

template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(const
::hetcompute::task_ptr<T1>& t1, const ::hetcompute::task_ptr<T2>& t2)

10.5.3.19 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() &
std::declval<T2>())> hetcompute::operator& ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator & for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.20 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<T1>() &
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator& ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator & for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.21 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ∗
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator∗ ( const
::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator ∗ for tasks.

See Also

template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(const
::hetcompute::task_ptr<T1>& t1, const ::hetcompute::task_ptr<T2>& t2)

10.5.3.22 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ∗
std::declval<T2>())> hetcompute::operator∗ ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator ∗ for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.23 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<T1>() ∗
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator∗ ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator ∗ for tasks.

See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.24 template<typename T > inline ::hetcompute::task_ptr<typename
::hetcompute::task_ptr<T>::return_type> hetcompute::operator+ ( const
::hetcompute::task_ptr< T > & t )

Algebraic unary operator + for task.

See Also

template<typename T> inline ::hetcompute::task_ptr<typename
::hetcompute::task_ptr<T>::return_type> operator-(const ::hetcompute::task_ptr<T>& t);

10.5.3.25 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator+ ( const
::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator + for tasks.

Creates a new task whose return value is the result of this operator applied to the return values of task t1
and task t2.
The new task is data dependent on tasks t1 and t2.
The runtime launches the new task automatically once the data are ready.

Note: the operator must be applicable to the return values of task t1 and task t2.

Parameters

t1 First task operand (should have return value).
t2 Second task operand (should have return value).

Returns

A new task whose return value is the result of this operator.

#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch two tasks that return 73 and 27.8
    auto t1 = hetcompute::launch([]() { return 73; });
    auto t2 = hetcompute::launch([]() { return 27.8; });

    // create a task whose return value is t1's return value + t2's return value
    // t is data dependent on t1 and t2
    // t's return type will be the same as t2's (type promotion for +)
    auto t = t1 + t2;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}
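
The decltype expression in the operator signature determines the return type of the new task from the operand types. The following is a minimal sketch in standard C++ only, using std::shared_future as a stand-in for a task with a return value; add_results and demo_add are hypothetical names and are not HetCompute APIs. It shows how the same decltype pattern yields double for an int operand and a double operand, and how the combined computation depends on both inputs.

```cpp
#include <cassert>
#include <future>
#include <type_traits>
#include <utility>

// Hypothetical helper, not a HetCompute API: combines two shared futures
// into a deferred computation that depends on both results. The return
// type is deduced the same way as in the operator+ signature above.
template <typename T1, typename T2>
auto add_results(std::shared_future<T1> a, std::shared_future<T2> b)
    -> std::future<decltype(std::declval<T1>() + std::declval<T2>())>
{
    return std::async(std::launch::deferred, [a, b] { return a.get() + b.get(); });
}

// Mirrors the example above: one operand yields 73, the other 27.8.
inline double demo_add()
{
    std::shared_future<int> t1 =
        std::async(std::launch::deferred, [] { return 73; }).share();
    std::shared_future<double> t2 =
        std::async(std::launch::deferred, [] { return 27.8; }).share();

    auto t = add_results(t1, t2);
    static_assert(std::is_same<decltype(t.get()), double>::value,
                  "int + double promotes to double");
    return t.get();
}
```

The deferred launch policy means the sum is only evaluated when the result is requested, at which point both inputs are resolved first. This is only an analogy for the data dependency described above, where the runtime launches the new task once the operand values are ready.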

10.5.3.26 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() +
std::declval<T2>())> hetcompute::operator+ ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator + for task.

Creates a new task whose return value is the result of this operator applied to the return value of task t1
and operand op2.
The new task is data dependent on task t1.
The runtime launches the new task automatically once the data is ready.

Note: the operator must be applicable to the return value of task t1 and operand op2.

Parameters

t1 Task operand (should have return value).
op2 Value operand.

Returns

A new task whose return value is the result of this operator.

#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch a task that returns -73
    auto t1 = hetcompute::launch([]() { return -73; });

    // create a task whose return value is t1's return value + 100
    // t is data dependent on t1
    auto t = t1 + 100;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}

10.5.3.27 template<typename T1 , typename T2 > inline
::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator+ ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator + for tasks.

Creates a new task whose return value is the result of this operator applied to operand op1 and the
return value of task t2.
The new task is data dependent on task t2.
The runtime launches the new task automatically once the data is ready.

Note: the operator must be applicable to operand op1 and the return value of task t2.

Parameters

op1 Value operand.
t2 Task operand (should have return value).

Returns

A new task whose return value is the result of this operator.

#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create and launch a task that returns -73
    auto t2 = hetcompute::launch([]() { return -73; });

    // create a task whose return value is 100 + t2's return value
    // t is data dependent on t2
    auto t = 100 + t2;

    // wait for t to finish and display the return value
    std::cout << "The return value of t is: " << t->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}

10.5.3.28 template<typename T> inline ::hetcompute::task_ptr<typename
::hetcompute::task_ptr<T>::return_type> hetcompute::operator- ( const
::hetcompute::task_ptr< T > & t )

Algebraic unary operator - for task.


Creates a new task whose return value is the result of applying this operator to the return value of task t.
The new task is data dependent on task t.
The runtime launches the new task automatically once the data is ready.

Note: the operator must be applicable to the return value of task t.

Parameters

t Task operand (should have return value).

Returns

A new task whose return value is the result of this operator.

#include <iostream>
#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    // create a task that returns 73
    auto t = hetcompute::create_task([]() { return 73; });
    // launch the task
    t->launch();

    // create a task whose return value is the negation of the return value of t
    // t1 is data dependent on t
    auto t1 = -t;

    // wait for t1 to finish and display the return value
    std::cout << "The return value of t1 is: " << t1->copy_value() << std::endl;

    hetcompute::runtime::shutdown();
}

10.5.3.29 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() -
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator- (
const ::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator - for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<typename
::hetcompute::task_ptr<T1>::return_type>() + std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> operator+(const ::hetcompute::task_ptr<T1>& t1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.30 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() -
std::declval<T2>())> hetcompute::operator- ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator - for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)


10.5.3.31 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<T1>() - std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator- ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator - for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.32 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() /
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator/ (
const ::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator / for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<typename
::hetcompute::task_ptr<T1>::return_type>() + std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> operator+(const ::hetcompute::task_ptr<T1>& t1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.33 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() /
std::declval<T2>())> hetcompute::operator/ ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator / for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.34 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<T1>() / std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator/ ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator / for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.35 bool hetcompute::operator== ( task_ptr<> const & t, std::nullptr_t )

Compares task t to nullptr.

Returns

true – The pointer is nullptr (t does not manage a task).


false – The pointer is not nullptr (t manages a task).

10.5.3.36 bool hetcompute::operator== ( std::nullptr_t , task_ptr<> const & t )

Compares task t to nullptr.

Returns

true – The pointer is nullptr (t does not manage a task).


false – The pointer is not nullptr (t manages a task).

10.5.3.37 bool hetcompute::operator== ( ::hetcompute::task_ptr<> const & a,
::hetcompute::task_ptr<> const & b )

Compares task a to task b.

Returns

true – Task a is the same as task b.


false – Task a is not the same as task b.

10.5.3.38 inline ::hetcompute::task_ptr& hetcompute::operator>> ( ::hetcompute::task_ptr<> & pred,
::hetcompute::task_ptr<> & succ )

Sets a control dependency from task pred to task succ.


See Also

hetcompute::task<>::then


10.5.3.39 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ^
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator^ (
const ::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator ^ for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<typename
::hetcompute::task_ptr<T1>::return_type>() + std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> operator+(const ::hetcompute::task_ptr<T1>& t1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.40 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() ^
std::declval<T2>())> hetcompute::operator^ ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator ^ for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.41 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<T1>() ^ std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator^ ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator ^ for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.42 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() |
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator| (
const ::hetcompute::task_ptr< T1 > & t1, const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator | for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<typename
::hetcompute::task_ptr<T1>::return_type>() + std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> operator+(const ::hetcompute::task_ptr<T1>& t1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.43 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<typename ::hetcompute::task_ptr<T1>::return_type>() |
std::declval<T2>())> hetcompute::operator| ( const ::hetcompute::task_ptr< T1 > & t1, T2 && op2 )

Algebraic binary operator | for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.44 template<typename T1, typename T2> inline
::hetcompute::task_ptr<decltype(std::declval<T1>() | std::declval<typename
::hetcompute::task_ptr<T2>::return_type>())> hetcompute::operator| ( T1 && op1,
const ::hetcompute::task_ptr< T2 > & t2 )

Algebraic binary operator | for tasks.


See Also

template<typename T1, typename T2> inline ::hetcompute::task_ptr<decltype(std::declval<T1>() +
std::declval<typename ::hetcompute::task_ptr<T2>::return_type>())> operator+(T1&& op1,
const ::hetcompute::task_ptr<T2>& t2)

10.5.3.45 template<typename T> inline ::hetcompute::task_ptr<typename
::hetcompute::task_ptr<T>::return_type> hetcompute::operator~ ( const
::hetcompute::task_ptr< T > & t )

Algebraic unary operator ~ for task.


See Also

template<typename T> inline ::hetcompute::task_ptr<typename
::hetcompute::task_ptr<T>::return_type> operator-(const ::hetcompute::task_ptr<T>& t);

10.5.4 Variable Documentation


10.5.4.1 const do_not_collapse_t hetcompute::do_not_collapse {}

Object of type do_not_collapse_t.

11 Buffers Reference API

The Qualcomm HetCompute buffers API provides the user with a runtime-managed heterogeneous data
structure. Tasks on the CPU, GPU and Hexagon devices can share data using a Qualcomm HetCompute
buffer. The following categories provide the API reference for buffers and related functionality.


11.1 Heterogeneous Compute Device Types


Classes

• class hetcompute::device_set
Captures a set of device types. More...

Enumerations

• enum hetcompute::device_type {
cpu_big = HETCOMPUTE_DEVICE_TYPE_CPU_BIG,
cpu_little = HETCOMPUTE_DEVICE_TYPE_CPU_LITTLE,
cpu = HETCOMPUTE_DEVICE_TYPE_CPU_BIG | HETCOMPUTE_DEVICE_TYPE_CPU_LITTLE,
gpu = HETCOMPUTE_DEVICE_TYPE_GPU,
dsp = HETCOMPUTE_DEVICE_TYPE_DSP }
The system devices capable of executing HetCompute tasks.

Functions

• std::string hetcompute::to_string (device_type d)


Converts device_type to string.

11.1.1 Class Documentation

11.1.1.1 class hetcompute::device_set

Captures a set of device types.


Supports addition and removal of device_types from the set. Supports set union and set subtraction
with another device_set object.

Public member functions

• device_set ()
Default constructor produces empty set.

• device_set (std::initializer_list< device_type > device_list)


Constructor with initialization.

• device_set & add (device_type d)


Add a device_type to the device_set.

• device_set & add (device_set const &other)


Set union with another device_set.

• bool empty () const


Checks whether the device set is empty.

• HETCOMPUTE_DEFAULT_METHOD (device_set(device_set const &))


• HETCOMPUTE_DEFAULT_METHOD (device_set &operator=(device_set const &))


• HETCOMPUTE_DEFAULT_METHOD (device_set(device_set &&))


• HETCOMPUTE_DEFAULT_METHOD (device_set &operator=(device_set &&))
• device_set & negate ()
Negate the device_set.

• bool on_cpu () const


Query if cpu is part of the device_set.

• bool on_cpu_big () const


Query if cpu big core is part of the device_set.

• bool on_cpu_little () const


Query if cpu LITTLE core is part of the device_set.

• bool on_dsp () const


Query if dsp is part of the device_set.

• bool on_gpu () const


Query if gpu is part of the device_set.

• device_set & remove (device_type d)


Remove a device_type from the device_set.

• device_set & remove (device_set const &other)


Set subtraction with another device_set.

• std::string to_string () const


Convert the device_set to a string representation.

Friends

• hetcompute_device_set_t internal::get_raw_device_set_t (device_set const &d)

11.1.1.1.1 Constructors and Destructors

11.1.1.1.1.1 hetcompute::device_set::device_set ( )

Default constructor produces empty set.

11.1.1.1.1.2 hetcompute::device_set::device_set ( std::initializer_list< device_type > device_list )

Constructor with initialization.

Parameters

device_list An initializer list of device_type elements.

Example:


hetcompute::device_set ds{hetcompute::cpu, hetcompute::dsp};

11.1.1.1.2 Member Function Documentation

11.1.1.1.2.1 device_set& hetcompute::device_set::add ( device_type d )

Add a device_type to the device_set.

Parameters

d device_type to add (no effect if already present).

Returns

Reference to the updated device_set.

11.1.1.1.2.2 device_set& hetcompute::device_set::add ( device_set const & other )

Set union with another device_set.

Parameters

other Another device_set.

Returns

Reference to the updated device_set.

Example:
hetcompute::device_set a{hetcompute::cpu};
hetcompute::device_set b{hetcompute::gpu};
a.add(b);
assert(true == a.on_cpu());
assert(true == a.on_gpu());
assert(false == a.on_dsp());

11.1.1.1.2.3 bool hetcompute::device_set::empty ( ) const

Checks whether the device set is empty.

Returns

true if device_set has no devices,


false otherwise.


11.1.1.1.2.4 hetcompute::device_set::HETCOMPUTE_DEFAULT_METHOD ( device_set(device_set


const &) )

Copy constructor.

11.1.1.1.2.5 hetcompute::device_set::HETCOMPUTE_DEFAULT_METHOD ( device_set & operator =


(device_set const &) )

Copy assignment.

11.1.1.1.2.6 hetcompute::device_set::HETCOMPUTE_DEFAULT_METHOD ( device_set(device_set &&)


)

Move constructor.

11.1.1.1.2.7 hetcompute::device_set::HETCOMPUTE_DEFAULT_METHOD ( device_set & operator =


(device_set &&) )

Move assignment.

11.1.1.1.2.8 device_set& hetcompute::device_set::negate ( )

Negate the device_set.

Returns

Reference to the updated device_set.

Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
a.negate();
assert(false == a.on_cpu());
assert(false == a.on_gpu());
assert(true == a.on_dsp());
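
The add, remove, and negate examples are consistent with a bitmask view of device_set, where negate() is the complement relative to all device types. The sketch below is a minimal stand-alone model under that assumption. The toy_device_set class and the bit_* values are hypothetical; the real HETCOMPUTE_DEVICE_TYPE_* constants and the internal representation are defined by the SDK and may differ.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical bit assignments, chosen only for this illustration.
enum device_bits : std::uint32_t {
    bit_cpu_big    = 1u << 0,
    bit_cpu_little = 1u << 1,
    bit_cpu        = bit_cpu_big | bit_cpu_little, // composite, like device_type::cpu
    bit_gpu        = 1u << 2,
    bit_dsp        = 1u << 3,
    bit_all        = bit_cpu | bit_gpu | bit_dsp
};

// Toy model of a device set; not the SDK class.
class toy_device_set {
public:
    toy_device_set() = default; // empty set
    explicit toy_device_set(std::uint32_t bits) : bits_(bits & bit_all) {}

    toy_device_set& add(std::uint32_t d) { bits_ |= (d & bit_all); return *this; }
    toy_device_set& add(toy_device_set const& other) { bits_ |= other.bits_; return *this; }
    toy_device_set& remove(std::uint32_t d) { bits_ &= ~d; return *this; }
    toy_device_set& remove(toy_device_set const& other) { bits_ &= ~other.bits_; return *this; }

    // Complement relative to the universe of known device types.
    toy_device_set& negate() { bits_ = ~bits_ & bit_all; return *this; }

    bool empty() const { return bits_ == 0; }
    // Assumption: "on cpu" means at least one CPU cluster is in the set.
    bool on_cpu() const { return (bits_ & bit_cpu) != 0; }
    bool on_gpu() const { return (bits_ & bit_gpu) != 0; }
    bool on_dsp() const { return (bits_ & bit_dsp) != 0; }

private:
    std::uint32_t bits_ = 0;
};
```

Because each mutator returns a reference to the updated set, calls can be chained in this model, for example toy_device_set().add(bit_cpu).add(bit_gpu).negate(), which leaves only the dsp bit set.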

11.1.1.1.2.9 bool hetcompute::device_set::on_cpu ( ) const

Query if cpu is part of the device_set.

Returns

true if cpu present,


false otherwise.


11.1.1.1.2.10 bool hetcompute::device_set::on_cpu_big ( ) const

Query if cpu big core is part of the device_set.

Returns

true if cpu big core present,


false otherwise.

11.1.1.1.2.11 bool hetcompute::device_set::on_cpu_little ( ) const

Query if cpu LITTLE core is part of the device_set.

Returns

true if cpu LITTLE core present,


false otherwise.

11.1.1.1.2.12 bool hetcompute::device_set::on_dsp ( ) const

Query if dsp is part of the device_set.

Returns

true if dsp present,


false otherwise.

11.1.1.1.2.13 bool hetcompute::device_set::on_gpu ( ) const

Query if gpu is part of the device_set.

Returns

true if gpu present,


false otherwise.

11.1.1.1.2.14 device_set& hetcompute::device_set::remove ( device_type d )

Remove a device_type from the device_set.

Parameters

d device_type to remove (no effect if not present).


Returns

Reference to the updated device_set.

11.1.1.1.2.15 device_set& hetcompute::device_set::remove ( device_set const & other )

Set subtraction with another device_set.

Parameters

other Another device_set.

Returns

Reference to the updated device_set.

Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
hetcompute::device_set b{hetcompute::gpu, hetcompute::dsp};
a.remove(b);
assert(true == a.on_cpu());
assert(false == a.on_gpu());
assert(false == a.on_dsp());

11.1.1.1.2.16 std::string hetcompute::device_set::to_string ( ) const

Convert the device_set to a string representation.

Returns

std::string representation of the devices present.

Example:
hetcompute::device_set a{hetcompute::cpu, hetcompute::gpu};
assert(a.to_string() == "cpu gpu ");

11.1.2 Enumeration Type Documentation

11.1.2.1 enum hetcompute::device_type

The system devices capable of executing HetCompute tasks.


11.1.3 Function Documentation

11.1.3.1 std::string hetcompute::to_string ( device_type d )

Converts device_type to string.

Parameters

d device_type (e.g., cpu, cpu_big, cpu_little, gpu, or dsp).

Returns

std::string (e.g., "cpu", "cpu_big", "cpu_little", "gpu", or "dsp").


11.2 Buffers
Classes

• class hetcompute::buffer_const_iterator< T >


Const random access iterator over a buffer. More...

• class hetcompute::buffer_iterator< T >


Random access iterator over a buffer. More...

• class hetcompute::buffer_ptr< T >


Pointer to an underlying runtime-managed buffer data structure. More...

• struct hetcompute::in< BufferPtr >


Indicates that a buffer parameter is input-only (read-only) for a kernel. More...

• struct hetcompute::inout< BufferPtr >


Indicates that a buffer parameter is used both as an input and an output (read-write) by a kernel. More...

• struct hetcompute::out< BufferPtr >


Indicates that a buffer parameter is output-only (write-invalidate) for a kernel. More...

• class hetcompute::scope_acquire_ro< T >


Scope guard for read-only acquire of a buffer by the host code. More...

• class hetcompute::scope_acquire_rw< T >


Scope guard for read-write acquire of a buffer by the host code. More...

• class hetcompute::scope_acquire_wi< T >


Scope guard for write-invalidate acquire of a buffer by the host code. More...

Functions

• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (size_t num_elems, device_set const &likely_devices)
Creates a buffer of datatype T of the requested size.

• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (T *preallocated_ptr, size_t num_elems, device_set const
&likely_devices)
Creates a buffer of datatype T of the requested size from a pre-allocated pointer.

• template<typename T >
buffer_ptr< T > hetcompute::create_buffer (memregion const &mr, size_t num_elems, device_set
const &likely_devices)
Creates a buffer of datatype T of the requested size from a hetcompute::memregion.

• template<typename T >
bool hetcompute::operator!= (::hetcompute::buffer_ptr< T > const &b,::std::nullptr_t)
• template<typename T >


bool hetcompute::operator!= (::std::nullptr_t,::hetcompute::buffer_ptr< T > const &b)


• template<typename T >
bool hetcompute::operator!= (::hetcompute::buffer_ptr< T > const &b1,::hetcompute::buffer_ptr<
T > const &b2)
• template<typename T >
bool hetcompute::operator== (::hetcompute::buffer_ptr< T > const &b,::std::nullptr_t)
• template<typename T >
bool hetcompute::operator== (::std::nullptr_t,::hetcompute::buffer_ptr< T > const &b)
• template<typename T >
bool hetcompute::operator== (::hetcompute::buffer_ptr< T > const
&b1,::hetcompute::buffer_ptr< T > const &b2)

11.2.1 Class Documentation

11.2.1.1 class hetcompute::buffer_const_iterator

template<typename T>class hetcompute::buffer_const_iterator< T >

Const random access iterator over a buffer.


See Also

hetcompute::buffer_ptr::const_iterator
hetcompute::buffer_ptr::cbegin()
hetcompute::buffer_ptr::cend()

Public member functions

• buffer_const_iterator (buffer_iterator< T > const &it)


• HETCOMPUTE_DEFAULT_METHOD (buffer_const_iterator(buffer_const_iterator const &))
• HETCOMPUTE_DEFAULT_METHOD (buffer_const_iterator &operator=(buffer_const_iterator
const &))
• HETCOMPUTE_DEFAULT_METHOD (buffer_const_iterator(buffer_const_iterator &&))
• HETCOMPUTE_DEFAULT_METHOD (buffer_const_iterator &operator=(buffer_const_iterator
&&))
• bool operator!= (buffer_const_iterator const &it) const
• T const & operator* () const
• buffer_const_iterator operator+ (size_t offset)
• buffer_const_iterator & operator++ ()
• buffer_const_iterator operator++ (int)
• buffer_const_iterator & operator+= (size_t offset)
• buffer_const_iterator operator- (size_t offset)
• int operator- (buffer_const_iterator const &it)


• buffer_const_iterator & operator-- ()


• buffer_const_iterator operator-- (int)
• buffer_const_iterator & operator-= (size_t offset)
• bool operator< (buffer_const_iterator const &it) const
• bool operator<= (buffer_const_iterator const &it) const
• bool operator== (buffer_const_iterator const &it) const
• bool operator> (buffer_const_iterator const &it) const
• bool operator>= (buffer_const_iterator const &it) const
• T const & operator[ ] (size_t n) const

11.2.1.2 class hetcompute::buffer_iterator

template<typename T>class hetcompute::buffer_iterator< T >

Random access iterator over a buffer.


See Also

hetcompute::buffer_ptr::iterator
hetcompute::buffer_ptr::begin()
hetcompute::buffer_ptr::end()

Public member functions

• HETCOMPUTE_DEFAULT_METHOD (buffer_iterator(buffer_iterator const &))


• HETCOMPUTE_DEFAULT_METHOD (buffer_iterator &operator=(buffer_iterator const &))
• HETCOMPUTE_DEFAULT_METHOD (buffer_iterator(buffer_iterator &&))
• HETCOMPUTE_DEFAULT_METHOD (buffer_iterator &operator=(buffer_iterator &&))
• bool operator!= (buffer_iterator const &it) const
• T & operator* () const
• buffer_iterator operator+ (size_t offset)
• buffer_iterator & operator++ ()
• buffer_iterator operator++ (int)
• buffer_iterator & operator+= (size_t offset)
• buffer_iterator operator- (size_t offset)
• int operator- (buffer_iterator const &it)
• buffer_iterator & operator-- ()
• buffer_iterator operator-- (int)
• buffer_iterator & operator-= (size_t offset)


• bool operator< (buffer_iterator const &it) const


• bool operator<= (buffer_iterator const &it) const
• bool operator== (buffer_iterator const &it) const
• bool operator> (buffer_iterator const &it) const
• bool operator>= (buffer_iterator const &it) const
• T & operator[ ] (size_t n) const

11.2.1.3 class hetcompute::buffer_ptr

template<typename T>class hetcompute::buffer_ptr< T >

Pointer to an underlying runtime-managed buffer data structure, if != nullptr.


Similar to std::shared_ptr, the buffer pointer can be assigned to another buffer pointer, compared for
equality/inequality against another buffer pointer or against nullptr, and provides ref-counted access.
However, HetCompute does not expose the underlying buffer as an API object. Therefore, there is no
support for getting the "address-of" the buffer to assign to a buffer pointer.
The underlying buffer is ref-counted: it is automatically deallocated when there are no longer any buffer
pointers to it. Once the programmer creates a buffer via a hetcompute::create_buffer() call, the
user can control the lifetime of the buffer by controlling the lifetime of a buffer pointer object pointing to it.
See Also

Shared pointers http://en.cppreference.com/w/cpp/memory/shared_ptr
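
The ref-counted lifetime described above follows the same pattern as std::shared_ptr. As a rough analogy only, the sketch below uses a std::vector<int> as a stand-in for the underlying buffer and a std::weak_ptr to observe when the storage is freed. The function name freed_only_after_last_handle is hypothetical, and nothing here reflects how the SDK implements buffer_ptr.

```cpp
#include <cassert>
#include <memory>
#include <vector>

// Returns true if the shared storage survives while at least one handle
// exists and is freed once the last handle is gone.
inline bool freed_only_after_last_handle()
{
    // "b" plays the role of a buffer pointer returned by a create call.
    auto b = std::make_shared<std::vector<int>>(10);

    // Non-owning observer; it does not keep the storage alive.
    std::weak_ptr<std::vector<int>> observer = b;

    // A second handle to the same storage (like copying a buffer pointer).
    auto x = b;

    b.reset(); // drop the first handle; x still owns the storage
    bool alive_with_one_handle = !observer.expired();

    x.reset(); // drop the last handle; the storage is deallocated
    return alive_with_one_handle && observer.expired();
}
```

The practical consequence matches the text: keeping any one pointer alive keeps the buffer alive, and letting the last one go out of scope releases it.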

Public Types

• typedef buffer_const_iterator< T > const_iterator


Random access iterator providing immutable access to the buffer data.

• using data_type = T
• typedef buffer_iterator< T > iterator
Random access iterator providing mutable access to the buffer data.

Public member functions

• buffer_ptr ()
Create a buffer_ptr with no underlying buffer storage created.

• buffer_ptr (buffer_ptr< typename std::remove_const< T >::type > const &other)


Copy constructor: creates a new buffer_ptr pointing to the same underlying buffer as other.

• void acquire_ro () const


Acquires the underlying buffer for read-only access by the host code.

• bool acquire_rw () const


Attempts to acquire the underlying buffer for read-write access by the host code.


• bool acquire_wi () const


Attempts to acquire the underlying buffer for write-invalidate access by the host code.

• T & at (size_t index) const


Indexed lookup of buffer data with array-bounds and host-access allocation checks.

• iterator begin () const


Iterator to the start of the buffer data.

• const_iterator cbegin () const


Const iterator to the start of the buffer data.

• const_iterator cend () const


Const iterator to the end of the buffer data.

• iterator end () const


Iterator to the end of the buffer data.

• void * host_data () const


Gets a pointer to the host accessible data of the underlying buffer, allocating if necessary.

• buffer_ptr & operator= (buffer_ptr< typename std::remove_const< T >::type > const &other)
Copy assignment: points to the underlying buffer of other.

• T & operator[ ] (size_t index) const


Indexed lookup of buffer data.

• size_t release () const


Decrements the host acquire count, releasing the buffer from host access when the count goes to zero.

• void * saved_host_data () const


Fast lookup of a saved pointer to the host accessible data of the underlying buffer.

• size_t size () const


The number of elements of datatype T in the underlying buffer pointed to by this buffer_ptr.

• std::string to_string () const


Gets a string with basic information about the buffer_ptr.

• template<hetcompute::graphics::image_format img_format, int dims>


buffer_ptr & treat_as_texture (hetcompute::graphics::image_size< dims > const &is)
Allows this buffer_ptr to be passed to a gpu_kernel where a
hetcompute::graphics::texture_ptr parameter was expected.

11.2.1.3.1 Member Typedef Documentation

11.2.1.3.1.1 template<typename T> typedef buffer_const_iterator<T> hetcompute::buffer_ptr< T


>::const_iterator

Random access iterator providing immutable access to the buffer data.


11.2.1.3.1.2 template<typename T> using hetcompute::buffer_ptr< T >::data_type = T

Data type of buffer

11.2.1.3.1.3 template<typename T> typedef buffer_iterator<T> hetcompute::buffer_ptr< T >::iterator

Random access iterator providing mutable access to the buffer data.

11.2.1.3.2 Constructors and Destructors

11.2.1.3.2.1 template<typename T> hetcompute::buffer_ptr< T >::buffer_ptr ( )

Create a buffer_ptr with no underlying buffer storage created. Tests equal to nullptr.
Example:
hetcompute::buffer_ptr<int> b;
assert(b == nullptr);

11.2.1.3.2.2 template<typename T> hetcompute::buffer_ptr< T >::buffer_ptr ( buffer_ptr< typename


std::remove_const< T >::type > const & other )

Copy constructor: creates a new buffer_ptr pointing to the same underlying buffer as other. A
buffer_ptr<const T> instance may be constructed from an instance of buffer_ptr<T>.

Parameters

other An existing buffer pointer.

Example:
hetcompute::buffer_ptr<int> b = hetcompute::create_buffer<int>(10);
hetcompute::buffer_ptr<int> x(b);
hetcompute::buffer_ptr<const int> y(b);

11.2.1.3.3 Member Function Documentation

11.2.1.3.3.1 template<typename T> void hetcompute::buffer_ptr< T >::acquire_ro ( ) const

Acquires the underlying buffer for read-only access by the host code. The host code may read the existing
contents of the buffer after this call, until the host code releases access using the release() method.
The call blocks until any conflicting operations complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_*(), the call returns immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_* establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_* calls will succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls will return with failure if the first
recursive acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent acquire_*() recursive calls will succeed if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions are considered to get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only when the number of release() calls equals the number of
successful recursive acquire_*() calls.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer while the buffer is being modified. The acquisition
blocks when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also block when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.
See Also

hetcompute::create_buffer()
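
The recursive acquire rules above can be summarized with a small stand-alone model. This is a sketch only: host_acquire_model and the access enum are hypothetical names that do not exist in HetCompute, and the model deliberately ignores blocking, threads, and synchronization between device memories. It shows only which recursive requests succeed and when the host access type is reset.

```cpp
#include <cassert>
#include <cstddef>

// Toy model of the host-side recursive acquire rules described above.
// NOT the HetCompute implementation; every name here is hypothetical.
enum class access { none, read_only, write_invalidate, read_write };

class host_acquire_model {
public:
    // Returns true if the request is compatible with the established type.
    bool acquire(access requested) {
        assert(requested != access::none);
        if (count_ == 0) {
            established_ = requested;      // first acquire sets the access type
        } else if (established_ == access::read_only &&
                   requested != access::read_only) {
            return false;                  // wi/rw after ro is incompatible
        }
        ++count_;                          // only successful acquires are counted
        return true;
    }

    // Returns the number of acquisitions that still have to be released.
    std::size_t release() {
        assert(count_ > 0);                // the real API reports an error here
        if (--count_ == 0) {
            established_ = access::none;   // host access is fully released
        }
        return count_;
    }

    access established() const { return established_; }

private:
    access established_ = access::none;
    std::size_t count_ = 0;
};
```

In this model a failed request does not increase the count, which matches the statement that release() must be called once per successful acquire.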

11.2.1.3.3.2 template<typename T> bool hetcompute::buffer_ptr< T >::acquire_rw ( ) const

Attempts to acquire the underlying buffer for read-write access by the host code. Returns true if
successful, or false if the buffer cannot be acquired for read-write because of a prior read-only acquisition by the host code.
If successful, the host may read the prior contents of the buffer and update the contents until the host code
releases access using the release() method.
The call will block for any conflicting operations to complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_∗(), the call will return immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_∗ establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_∗ calls will succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls will return with failure if the first
recursive acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent acquire_∗() recursive calls will succeed if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions are considered to get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only when a number of release() calls equal to the number of
successful recursive acquire_∗() calls are made.


Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer when the buffer is being modified. The acquisition will
be blocked when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also be blocked when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.
See Also

hetcompute::create_buffer()

11.2.1.3.3.3 template<typename T> bool hetcompute::buffer_ptr< T >::acquire_wi ( ) const

Attempts to acquire the underlying buffer for write-invalidate access by the host code. Returns true if
successful, or false if the buffer cannot be acquired for write-invalidate because of a prior read-only acquisition
by the host code. If successful, the prior contents of the buffer are lost after this call. The host code may write the
buffer data (and read back what it wrote) after this call, until the host code releases access using the
release() method.
The call will block for any conflicting operations to complete (e.g., a task concurrently performing
read-write access to the buffer), after which the buffer is acquired for access by the host code and the call
unblocks. However, if the buffer has already been acquired for the host code by a preceding
acquire_∗(), the call will return immediately.
The host code may recursively acquire the buffer using a combination of acquire_ro(),
acquire_wi() and acquire_rw() calls. The first acquire_∗ establishes the access type
(read-only, write-invalidate, or read-write) of the buffer for the host code. Subsequent recursive
acquire_∗ calls will succeed only if they are compatible with the previously established access type.
Subsequent recursive acquire_wi() and acquire_rw() calls will return with failure if the first
recursive acquisition was acquire_ro(), as the access type of these calls is incompatible with the
established read-only access. However, any subsequent acquire_∗() recursive calls will succeed if the
first acquisition was either write-invalidate or read-write. When the established access type is
write-invalidate, subsequent recursive read-only or read-write acquisitions are considered to get access to
any data written to the buffer after the original write-invalidate. When the established access type is
read-write, a subsequent recursive write-invalidate does not destroy any prior data, as there is no additional
synchronization required between device memories to access the latest data.
The host code releases the buffer only when a number of release() calls equal to the number of
successful recursive acquire_∗() calls are made.
Note that access by concurrent threads of the host code is also considered recursive, even when the
acquire-release calls do not properly nest across threads. The first acquire by any one thread establishes the
host access type for all threads of the host code, until the host code releases.
HetCompute disallows concurrent access to a buffer when the buffer is being modified. The acquisition will
be blocked when a concurrent task/pattern has acquired the buffer for read-write or write-invalidate access.
In rare situations, the acquisition may also be blocked when a concurrent task/pattern has read-only access
but HetCompute is unable to synchronize the buffer data for host access until the concurrent task/pattern
completes.


See Also

hetcompute::create_buffer()

11.2.1.3.3.4 template<typename T> T& hetcompute::buffer_ptr< T >::at ( size_t index ) const

Same as operator[] but also checks if the buffer data has been previously made host accessible via this
buffer_ptr and performs array bounds check. However, does not guarantee that the buffer data is currently
host accessible. The programmer must ensure that the buffer will not be concurrently accessed by tasks that
may invalidate the host accessible data (for example, by not launching any task that accesses a buffer_ptr to
this buffer until the host access is complete).
See saved_host_data() for host access criteria.

Parameters

index The index to the element to lookup inside the buffer data.

Exceptions

hetcompute::api_exception if data is not host-accessible.
std::out_of_range if index exceeds array bounds.

Note

If exceptions are disabled by the application, the API terminates the application if data is not
host-accessible.

Returns

A reference to the indexed element.

See Also

saved_host_data()
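
The difference between the unchecked operator[] and the checked at() can be sketched with a stand-alone toy type. Everything below is an assumption made for illustration: checked_view, make_host_accessible() and api_error are hypothetical stand-ins (api_error plays the role of hetcompute::api_exception), and the order of the two checks is a guess rather than documented behavior.

```cpp
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for hetcompute::api_exception.
struct api_error : std::logic_error {
    using std::logic_error::logic_error;
};

// Toy buffer view; not the HetCompute buffer_ptr.
template <typename T>
class checked_view {
public:
    explicit checked_view(std::size_t n) : data_(n) {}

    // Models an earlier host_data() or acquire_*() call through this pointer.
    void make_host_accessible() { host_accessible_ = true; }

    // Unchecked lookup, like operator[]: the caller guarantees validity.
    T& operator[](std::size_t i) { return data_[i]; }

    // Checked lookup, like at(): verifies host access, then array bounds.
    T& at(std::size_t i) {
        if (!host_accessible_) throw api_error("data is not host-accessible");
        if (i >= data_.size()) throw std::out_of_range("index exceeds array bounds");
        return data_[i];
    }

private:
    std::vector<T> data_;
    bool host_accessible_ = false;
};
```

As in the description above, the checks only confirm that host access was set up at some point; they cannot detect a task that concurrently invalidates the data.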

11.2.1.3.3.5 template<typename T> iterator hetcompute::buffer_ptr< T >::begin ( ) const

Get iterator to the start of the buffer data. Allows mutable access to the buffer data.

Returns

Iterator to the start of the buffer data.


11.2.1.3.3.6 template<typename T> const_iterator hetcompute::buffer_ptr< T >::cbegin ( ) const

Get const iterator to the start of the buffer data. Restricts to immutable access.

Returns

Const iterator to the start of the buffer data.

11.2.1.3.3.7 template<typename T> const_iterator hetcompute::buffer_ptr< T >::cend ( ) const

Get const iterator to the end of the buffer data. Restricts to immutable access.

Returns

Const iterator to the end of the buffer data.

11.2.1.3.3.8 template<typename T> iterator hetcompute::buffer_ptr< T >::end ( ) const

Get iterator to the end of the buffer data. Allows mutable access to the buffer data.

Returns

Iterator to the end of the buffer data.
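
The four iterator accessors are all const member functions, yet begin() and end() allow mutation. A stand-alone sketch of that handle-like behavior follows. host_view is a hypothetical type, and the use of raw pointers as iterators is an assumption; the actual iterator types of buffer_ptr are not specified here.

```cpp
#include <cstddef>
#include <memory>
#include <numeric>
#include <vector>

// Toy handle with the same four accessors; not the HetCompute buffer_ptr.
template <typename T>
class host_view {
public:
    using iterator = T*;
    using const_iterator = T const*;

    explicit host_view(std::size_t n)
        : data_(std::make_shared<std::vector<T>>(n)) {}

    // The accessors are const but return mutable iterators: constness applies
    // to the handle, not to the shared data it refers to.
    iterator begin() const { return data_->data(); }
    iterator end() const { return data_->data() + data_->size(); }
    const_iterator cbegin() const { return begin(); }
    const_iterator cend() const { return end(); }

private:
    std::shared_ptr<std::vector<T>> data_;
};

// Writes 1, 2, 3, ... through begin()/end(), then sums through cbegin()/cend().
inline int fill_and_sum(host_view<int> const& v) {
    std::iota(v.begin(), v.end(), 1);
    return std::accumulate(v.cbegin(), v.cend(), 0);
}
```

With the real API, such a loop would only be valid while the host code holds a suitable acquisition of the buffer.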

11.2.1.3.3.9 template<typename T> void∗ hetcompute::buffer_ptr< T >::host_data ( ) const

Gets a pointer to the host accessible data of the underlying buffer, allocating the host accessible storage if
necessary. Note that this call does not ensure that the buffer data is currently host accessible. For example,
data updates by a concurrent task on the underlying buffer may not be visible yet via the host accessible
data pointer.

Returns

nullptr, if this buffer_ptr is nullptr.


!=nullptr, if this buffer_ptr points to a valid buffer.

See Also

saved_host_data() for fast lookup of a previously queried pointer to the host accessible data.
acquire_ro(), acquire_wi(), acquire_rw() and release() to allow the buffer to be read or written by the
host code, in addition to querying a pointer to the host accessible data.

Unlike the acquire calls, which may sometimes block when there is concurrent task access to the buffer, this
method can be called at any time without blocking to determine the pointer to the host accessible data.


11.2.1.3.3.10 template<typename T> buffer_ptr& hetcompute::buffer_ptr< T >::operator= (
buffer_ptr< typename std::remove_const< T >::type > const & other )

Copy assignment: makes this buffer_ptr point to the underlying buffer of other. A buffer_ptr<const T>
instance may be assigned from an instance of buffer_ptr<T>. If this buffer_ptr was the last one pointing to its
previous underlying buffer, that buffer is deallocated once the assignment makes this buffer_ptr point to a
different underlying buffer.

Parameters

other An existing buffer pointer.

Example:
hetcompute::buffer_ptr<int> b = hetcompute::create_buffer<int>(10);
hetcompute::buffer_ptr<int> x;
hetcompute::buffer_ptr<const int> y;
x = b;
y = b;
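
The ownership rule for copy assignment can be illustrated by analogy with std::shared_ptr. This is only an analogy under stated assumptions: std::shared_ptr stands in for buffer_ptr, and the helper run_ownership_demo() and its result struct are hypothetical. How HetCompute tracks buffer lifetime internally is not shown.

```cpp
#include <memory>

// Analogy only: std::shared_ptr stands in for buffer_ptr.
struct ownership_demo_result {
    long owners_after_copy;    // owners of the first buffer after the const copy
    bool first_buffer_freed;   // true once the last owner has been reassigned
};

inline ownership_demo_result run_ownership_demo() {
    std::shared_ptr<int> x = std::make_shared<int>(1);  // first "buffer"
    std::weak_ptr<int> watch = x;                        // observes without owning

    std::shared_ptr<const int> y = x;   // const view of the same buffer
    long owners = x.use_count();        // x and y now share ownership

    std::shared_ptr<int> other = std::make_shared<int>(2);
    x = other;                          // x points elsewhere; y keeps buffer 1 alive
    y = other;                          // last owner leaves; buffer 1 is freed

    return {owners, watch.expired()};
}
```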

11.2.1.3.3.11 template<typename T> T& hetcompute::buffer_ptr< T >::operator[ ] ( size_t index ) const

Performs an array index lookup if the buffer data is host accessible or is being accessed as a CPU task
parameter. Host access is undefined behavior if the programmer has not previously ensured that the buffer
data is host accessible.
See saved_host_data() for host access criteria.

Parameters

index The index to the element to lookup inside the buffer data.

Returns

A reference to the indexed element.

See Also

at()
saved_host_data()

11.2.1.3.3.12 template<typename T> size_t hetcompute::buffer_ptr< T >::release ( ) const

Decrements the host acquire count, releasing the buffer from host access when the count goes to zero.
release() needs to be called once for every successful recursive call to acquire_∗(), after which the
buffer is released from access by the host code. The host code may not read or write the buffer contents
after the final release() call, until the host code acquires the buffer again.
The release() call never blocks.
The call returns the number of recursive acquisitions remaining to be released before the host code will
release access to the buffer. That is, the host code releases access when the return value of this call is 0.

Exceptions

hetcompute::api_exception if called when the buffer is not currently acquired by the host code.

Note

If exceptions are disabled by the application, the call terminates the application when the buffer is not
currently acquired by the host code.

11.2.1.3.3.13 template<typename T> void∗ hetcompute::buffer_ptr< T >::saved_host_data ( ) const

Fast lookup of a saved pointer to the host accessible data of the underlying buffer. The pointer may have been
saved by a previous host_data() call or by a previous acquire_*() call.
Note that this call does not ensure that the buffer data is currently host accessible via this buffer_ptr. For
example, data updates by a concurrent task on the underlying buffer may not be visible yet via the host
accessible data pointer.

Returns

nullptr, if
i) the host accessible data pointer has not previously been queried via this buffer_ptr, or
ii) this buffer_ptr is a nullptr.
!=nullptr, otherwise.

See Also

host_data()
acquire_ro(), acquire_wi(), acquire_rw() and release() for explicit host synchronization.
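
The relationship between host_data() and saved_host_data() can be sketched as lazy allocation followed by a cached lookup. The class below is a hypothetical stand-alone model, not HetCompute code; it assumes a single pointer per object and ignores the acquire calls, which may also save the pointer.

```cpp
#include <cstddef>
#include <memory>

// Toy model of allocate-on-demand plus cached lookup; names are hypothetical.
class host_storage_model {
public:
    explicit host_storage_model(std::size_t bytes) : bytes_(bytes) {}

    // Like host_data(): allocates host storage if necessary, saves the pointer.
    void* host_data() {
        if (!storage_) {
            storage_ = std::make_unique<unsigned char[]>(bytes_);
        }
        saved_ = storage_.get();
        return saved_;
    }

    // Like saved_host_data(): a plain lookup that never allocates.
    void* saved_host_data() const { return saved_; }

private:
    std::size_t bytes_;
    std::unique_ptr<unsigned char[]> storage_;
    void* saved_ = nullptr;
};
```

Neither call in the model says anything about whether the data is up to date, which mirrors the caveat above about concurrent task updates.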

11.2.1.3.3.14 template<typename T> size_t hetcompute::buffer_ptr< T >::size ( ) const

The number of elements of datatype T in the underlying buffer pointed to by this buffer_ptr.

11.2.1.3.3.15 template<typename T> std::string hetcompute::buffer_ptr< T >::to_string ( ) const

Gets a string with basic information about the buffer_ptr.

Returns

String with basic information about the buffer_ptr.


11.2.1.3.3.16 template<typename T> template<hetcompute::graphics::image_format img_format, int dims>
buffer_ptr& hetcompute::buffer_ptr< T >::treat_as_texture (
hetcompute::graphics::image_size< dims > const & is )

Allows this buffer_ptr to be interpreted as a texture of a given format, dimensionality and size when passed
as an argument to a gpu_kernel expecting a hetcompute::graphics::texture_ptr. The
interpretation applies to the current buffer_ptr, not to the buffer as a whole. That is, multiple
buffer_ptrs to the same buffer may simultaneously be interpreted as textures of different formats,
dimensions and sizes.
Template Parameters

img_format The texture image format to interpret this buffer_ptr as.


dims The texture dimensions to interpret this buffer_ptr as.

Parameters

is The texture image size to interpret this buffer_ptr as.

Returns

This buffer_ptr.

Throws an assertion if the HetCompute library is built without GPU support.

11.2.1.4 struct hetcompute::in

template<typename BufferPtr>struct hetcompute::in< BufferPtr >

Use in a kernel parameter declaration to indicate that a buffer parameter will be input-only (read-only) for
the kernel.

11.2.1.5 struct hetcompute::inout

template<typename BufferPtr>struct hetcompute::inout< BufferPtr >

Use in a kernel parameter declaration to indicate that a buffer parameter will be used both as an input and
an output (read-write) by the kernel.

11.2.1.6 struct hetcompute::out

template<typename BufferPtr>struct hetcompute::out< BufferPtr >

Use in a kernel parameter declaration to indicate that a buffer parameter will be output-only
(write-invalidate) for the kernel.
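
One way to picture how the three direction wrappers carry access information at compile time is a small trait. This is a sketch under assumptions, not the HetCompute headers: in_tag, inout_tag, out_tag, access_of, kernel_access and float_buffer are all hypothetical names, chosen to avoid any claim about how hetcompute::in, hetcompute::inout and hetcompute::out are implemented.

```cpp
// Hypothetical mirror of the three direction wrappers.
enum class kernel_access { read_only, read_write, write_invalidate };

template <typename BufferPtr> struct in_tag {};
template <typename BufferPtr> struct inout_tag {};
template <typename BufferPtr> struct out_tag {};

// Maps a wrapped parameter type to the access it requests for the kernel.
template <typename Param> struct access_of;

template <typename B> struct access_of<in_tag<B>> {
    static constexpr kernel_access value = kernel_access::read_only;
};
template <typename B> struct access_of<inout_tag<B>> {
    static constexpr kernel_access value = kernel_access::read_write;
};
template <typename B> struct access_of<out_tag<B>> {
    static constexpr kernel_access value = kernel_access::write_invalidate;
};

// Placeholder for a buffer pointer type such as buffer_ptr<float>.
struct float_buffer {};
```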

11.2.1.7 class hetcompute::scope_acquire_ro


template<typename T>class hetcompute::scope_acquire_ro< T >

Scope guard for read-only acquire of a buffer by the host code.


Example: instead of writing
void f() {
    b.acquire_ro();
    if(...) {
        b.release();
        return;
    }
    ...
    b.release();
}

write
void f() {
    scope_acquire_ro<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}

See Also

buffer_ptr::acquire_ro();
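
The scope guard idea can be shown with a stand-alone sketch that does not depend on HetCompute. counting_buffer and scoped_ro are hypothetical stand-ins for buffer_ptr and scope_acquire_ro; the sketch assumes only that the guard acquires in its constructor and releases in its destructor, as the example above implies.

```cpp
// Minimal stand-in exposing only the two calls the guard needs.
struct counting_buffer {
    int outstanding = 0;
    void acquire_ro() { ++outstanding; }
    void release() { --outstanding; }
};

// RAII guard: acquires in the constructor, releases in the destructor.
template <typename Buffer>
class scoped_ro {
public:
    explicit scoped_ro(Buffer& b) : b_(b) { b_.acquire_ro(); }
    ~scoped_ro() { b_.release(); }

    // Non-copyable (and therefore non-movable), like the deleted methods listed below.
    scoped_ro(scoped_ro const&) = delete;
    scoped_ro& operator=(scoped_ro const&) = delete;

private:
    Buffer& b_;
};

// Returns the acquire count seen inside the scope; the guard releases on
// both the early-return path and the normal path.
inline int outstanding_during(counting_buffer& b, bool early_exit) {
    scoped_ro<counting_buffer> guard(b);
    if (early_exit) {
        return b.outstanding;
    }
    return b.outstanding;
}
```

The same pattern applies to the read-write and write-invalidate guards, with the corresponding acquire call in the constructor.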

Public member functions

• scope_acquire_ro (hetcompute::buffer_ptr< T > const &b)


• HETCOMPUTE_DELETE_METHOD (scope_acquire_ro())
• HETCOMPUTE_DELETE_METHOD (scope_acquire_ro(scope_acquire_ro const &))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_ro &operator=(scope_acquire_ro const
&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_ro(scope_acquire_ro &&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_ro &operator=(scope_acquire_ro &&))

11.2.1.8 class hetcompute::scope_acquire_rw

template<typename T>class hetcompute::scope_acquire_rw< T >

Scope guard for read-write acquire of a buffer by the host code.


Example: instead of writing
void f() {
    b.acquire_rw();
    if(...) {
        b.release();
        return;
    }
    ...
    b.release();
}

write
void f() {
    scope_acquire_rw<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}

See Also

buffer_ptr::acquire_rw();

Public member functions

• scope_acquire_rw (hetcompute::buffer_ptr< T > const &b)


• HETCOMPUTE_DELETE_METHOD (scope_acquire_rw())
• HETCOMPUTE_DELETE_METHOD (scope_acquire_rw(scope_acquire_rw const &))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_rw &operator=(scope_acquire_rw const
&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_rw(scope_acquire_rw &&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_rw &operator=(scope_acquire_rw &&))

11.2.1.9 class hetcompute::scope_acquire_wi

template<typename T>class hetcompute::scope_acquire_wi< T >

Scope guard for write-invalidate acquire of a buffer by the host code.


Example: instead of writing
void f() {
    b.acquire_wi();
    if(...) {
        b.release();
        return;
    }
    ...
    b.release();
}

write
void f() {
    scope_acquire_wi<decltype(b)::data_type> guard(b);
    if(...) {
        return;
    }
    ...
}

See Also

buffer_ptr::acquire_wi();

Public member functions

• scope_acquire_wi (hetcompute::buffer_ptr< T > const &b)


• HETCOMPUTE_DELETE_METHOD (scope_acquire_wi())


• HETCOMPUTE_DELETE_METHOD (scope_acquire_wi(scope_acquire_wi const &))


• HETCOMPUTE_DELETE_METHOD (scope_acquire_wi &operator=(scope_acquire_wi const
&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_wi(scope_acquire_wi &&))
• HETCOMPUTE_DELETE_METHOD (scope_acquire_wi &operator=(scope_acquire_wi &&))

11.2.2 Function Documentation

11.2.2.1 template<typename T > buffer_ptr<T> hetcompute::create_buffer ( size_t num_elems,
device_set const & likely_devices )

Creates a buffer of datatype T of the requested size. Fatal error if num_elems is 0.


Template Parameters

T User data type for buffer.

Parameters

num_elems Number of elements of type T in buffer. Must be larger than zero.


likely_devices Optional, default is an empty hetcompute::device_set. Allows the
programmer to convey advance knowledge of which device-types may
access this buffer, thereby allowing Qualcomm HetCompute to internally
determine an optimal storage and data-transfer policy for this buffer. This
information is only used as a hint to guide internal storage-allocation and
data-transfer optimizations, and is allowed to be partial or incorrect.

Returns

A buffer pointer to the created buffer.

The optional parameter allows for the following variants of this call.
hetcompute::create_buffer(size_t num_elems);

hetcompute::create_buffer(size_t num_elems,
hetcompute::device_set const& likely_devices);

11.2.2.2 template<typename T > buffer_ptr<T> hetcompute::create_buffer ( T * preallocated_ptr,
size_t num_elems, device_set const & likely_devices )

Creates a buffer of datatype T of the requested size from a pre-allocated pointer. The pre-allocated pointer
provides initial storage and potentially initial data for the buffer. Fatal error if num_elems is 0.
Template Parameters

T User data type for buffer.


Parameters

preallocated_ptr Pointer to pre-allocated contiguous memory of size at least
num_elems * sizeof(T) bytes.
num_elems Number of elements of type T in buffer. Must be larger than zero.
likely_devices Optional, default is an empty hetcompute::device_set. Allows the
programmer to convey advance knowledge of which device-types may
access this buffer, thereby allowing Qualcomm HetCompute to internally
determine an optimal storage and data-transfer policy for this buffer. This
information is only used as a hint to guide internal storage-allocation and
data-transfer optimizations, and is allowed to be partial or incorrect.

Returns

A buffer pointer to the created buffer.

The optional parameter allows for the following variants of this call.
create_buffer(T* preallocated_ptr,
size_t num_elems);

create_buffer(T* preallocated_ptr,
size_t num_elems,
device_set const& likely_devices);

11.2.2.3 template<typename T > buffer_ptr<T> hetcompute::create_buffer ( memregion const & mr,
size_t num_elems, device_set const & likely_devices )

Creates a buffer of datatype T of the requested size from a hetcompute::memregion. The
hetcompute::memregion provides initial storage and potentially initial data for the buffer. Fatal error if both
num_elems == 0 and mr.get_num_bytes() == 0, or if (num_elems * sizeof(T)) >
mr.get_num_bytes().
Template Parameters

T User data type for buffer.

Parameters

mr A memory region of the desired type allocated by the user.


num_elems Optional, default is 0. If 0, the number of elements is determined by the
capacity of mr, computed as mr.get_num_bytes() / sizeof(T). If non-zero,
specifies the number of elements of type T in the buffer, which must fit in
the capacity of mr.
likely_devices Optional, default is an empty hetcompute::device_set. Allows the
programmer to convey advance knowledge of which device-types may
access this buffer, thereby allowing Qualcomm HetCompute to internally
determine an optimal storage and data-transfer policy for this buffer. This
information is only used as a hint to guide internal storage-allocation and
data-transfer optimizations, and is allowed to be partial or incorrect.


Returns

A buffer pointer to the created buffer.

See Also

hetcompute::memregion

The optional parameters allow for the following variants to this call.
hetcompute::create_buffer(hetcompute::memregion const& mr);

hetcompute::create_buffer(hetcompute::memregion const& mr,
                          size_t num_elems);

hetcompute::create_buffer(hetcompute::memregion const& mr,
                          hetcompute::device_set const& likely_devices);

hetcompute::create_buffer(hetcompute::memregion const& mr,
                          size_t num_elems,
                          hetcompute::device_set const& likely_devices);
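
The sizing rule for a buffer created from a mem-region can be written out as a small stand-alone function. resolve_num_elems() is a hypothetical helper, not part of HetCompute, and it throws std::invalid_argument where the documentation describes a fatal error; that substitution is an assumption made so the rule can be exercised in isolation.

```cpp
#include <cstddef>
#include <stdexcept>

// Hypothetical helper mirroring the documented sizing rule.
// region_bytes plays the role of mr.get_num_bytes().
template <typename T>
std::size_t resolve_num_elems(std::size_t region_bytes, std::size_t num_elems = 0) {
    if (num_elems == 0 && region_bytes == 0) {
        throw std::invalid_argument("num_elems and region size are both zero");
    }
    if (num_elems == 0) {
        return region_bytes / sizeof(T);   // use the full capacity of the region
    }
    // Same test as num_elems * sizeof(T) > region_bytes, without overflow.
    if (num_elems > region_bytes / sizeof(T)) {
        throw std::invalid_argument("requested elements exceed region capacity");
    }
    return num_elems;
}
```

For example, a 4096-byte region with the default num_elems yields 4096 / sizeof(T) elements, while an explicit count is accepted only if it fits in the region.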


11.3 Memory Regions


Classes

• class hetcompute::glbuffer_memregion
Creates inter-operability with an OpenGL buffer. More...

• class hetcompute::ion_memregion
Allocates ION memory on platforms that support it. More...

• class hetcompute::main_memregion
Allocates aligned memory from the platform main memory. More...

• class hetcompute::memregion
Base class for all mem-regions. More...

Functions

• hetcompute::glbuffer_memregion::glbuffer_memregion (GLuint id)


Constructor, wraps an existing OpenGL buffer to allow inter-operability with Qualcomm HetCompute.

• hetcompute::ion_memregion::ion_memregion (size_t sz, bool cacheable=true)


Constructor, allocates ION memory.

• hetcompute::ion_memregion::ion_memregion (void ∗ptr, size_t sz, bool cacheable)


Constructor, uses allocated ION memory.

• hetcompute::ion_memregion::ion_memregion (void ∗ptr, int fd, size_t sz, bool cacheable)


Constructor, uses allocated ION memory, with the associated file descriptor.

• hetcompute::main_memregion::main_memregion (size_t sz, size_t alignment=s_default_alignment)

Constructor, allocates aligned memory.

• hetcompute::main_memregion::main_memregion (void ∗ptr, size_t sz)


Constructor, uses user-allocated memory.

• hetcompute::memregion::memregion (internal::internal_memregion ∗int_mr)


• hetcompute::memregion::∼memregion ()
Destructor.

• int hetcompute::ion_memregion::get_fd () const


Gets the file descriptor associated with the pointer to the allocated ION memory.

• GLuint hetcompute::glbuffer_memregion::get_id () const


Gets the id of the wrapped OpenGL buffer.

• size_t hetcompute::memregion::get_num_bytes () const


Get the size of the mem-region in bytes.


• void ∗ hetcompute::main_memregion::get_ptr () const


Gets a pointer to the allocated memory.

• void ∗ hetcompute::ion_memregion::get_ptr () const


Gets a pointer to the allocated ION memory.

• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion())
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion(memregion const
&))
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion
&operator=(memregion const &))
• hetcompute::memregion::HETCOMPUTE_DELETE_METHOD (memregion(memregion &&))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD (main_memregion())
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD
(main_memregion(main_memregion const &))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD (main_memregion
&operator=(main_memregion const &))
• hetcompute::main_memregion::HETCOMPUTE_DELETE_METHOD
(main_memregion(main_memregion &&))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD (ion_memregion())
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD
(ion_memregion(ion_memregion const &))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD (ion_memregion
&operator=(ion_memregion const &))
• hetcompute::ion_memregion::HETCOMPUTE_DELETE_METHOD
(ion_memregion(ion_memregion &&))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion())
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion(glbuffer_memregion const &))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion &operator=(glbuffer_memregion const &))
• hetcompute::glbuffer_memregion::HETCOMPUTE_DELETE_METHOD
(glbuffer_memregion(glbuffer_memregion &&))
• bool hetcompute::ion_memregion::is_cacheable () const
Returns whether the ION memregion is cacheable.

Variables

• internal::internal_memregion ∗ hetcompute::memregion::_int_mr
• static constexpr size_t hetcompute::main_memregion::s_default_alignment = 4096


The default alignment needed for the allocation to be page aligned.

Friends

• class hetcompute::memregion::::hetcompute::internal::memregion_base_accessor

11.3.1 Class Documentation

11.3.1.1 class hetcompute::glbuffer_memregion

Creates inter-operability with an OpenGL buffer. The user may have an external OpenGL buffer;
Qualcomm HetCompute may access that buffer once inter-operability has been set up using an
instance of this class. A derived class of hetcompute::memregion.
See Also

hetcompute::memregion

Public member functions

• glbuffer_memregion (GLuint id)


Constructor, wraps an existing OpenGL buffer to allow inter-operability with Qualcomm HetCompute.

• GLuint get_id () const


Gets the id of the wrapped OpenGL buffer.

• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion())
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion(glbuffer_memregion const &))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion &operator=(glbuffer_memregion
const &))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion(glbuffer_memregion &&))
• HETCOMPUTE_DELETE_METHOD (glbuffer_memregion &operator=(glbuffer_memregion
&&))

Additional Inherited Members

11.3.1.2 class hetcompute::ion_memregion

Allocates ION memory on platforms that support it. The ION memory can be allocated as cacheable or
non-cacheable. A derived class of hetcompute::memregion.
See Also

hetcompute::memregion

Public member functions

• ion_memregion (size_t sz, bool cacheable=true)


Constructor, allocates ION memory.

• ion_memregion (void ∗ptr, size_t sz, bool cacheable)


Constructor, uses allocated ION memory.

• ion_memregion (void ∗ptr, int fd, size_t sz, bool cacheable)


Constructor, uses allocated ION memory, with the associated file descriptor.

• int get_fd () const


Gets the file descriptor associated with the pointer to the allocated ION memory.

• void ∗ get_ptr () const


Gets a pointer to the allocated ION memory.

• HETCOMPUTE_DELETE_METHOD (ion_memregion())
• HETCOMPUTE_DELETE_METHOD (ion_memregion(ion_memregion const &))
• HETCOMPUTE_DELETE_METHOD (ion_memregion &operator=(ion_memregion const &))
• HETCOMPUTE_DELETE_METHOD (ion_memregion(ion_memregion &&))
• HETCOMPUTE_DELETE_METHOD (ion_memregion &operator=(ion_memregion &&))
• bool is_cacheable () const
Returns whether the ION memregion is cacheable.

Additional Inherited Members

11.3.1.3 class hetcompute::main_memregion

Allocates aligned memory from the platform main memory. The default alignment is 4096 bytes to get
page-aligned allocation. A derived class of hetcompute::memregion.
See Also

hetcompute::memregion

Public member functions

• main_memregion (size_t sz, size_t alignment=s_default_alignment)


Constructor, allocates aligned memory.

• main_memregion (void ∗ptr, size_t sz)


Constructor, uses user-allocated memory.

• void ∗ get_ptr () const


Gets a pointer to the allocated memory.

• HETCOMPUTE_DELETE_METHOD (main_memregion())
• HETCOMPUTE_DELETE_METHOD (main_memregion(main_memregion const &))
• HETCOMPUTE_DELETE_METHOD (main_memregion &operator=(main_memregion const
&))
• HETCOMPUTE_DELETE_METHOD (main_memregion(main_memregion &&))
• HETCOMPUTE_DELETE_METHOD (main_memregion &operator=(main_memregion &&))


Static Public Attributes

• static constexpr size_t s_default_alignment = 4096


The default alignment needed for the allocation to be page aligned.
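
What a page-aligned allocation that lives as long as its owning object looks like can be sketched without HetCompute. aligned_block is a hypothetical class that only borrows the member names get_ptr(), get_num_bytes() and the 4096-byte default; it over-allocates and uses std::align, which is an assumption made for portability and says nothing about how main_memregion allocates internally. The alignment is assumed to be a power of two.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>
#include <vector>

// Toy RAII block with an aligned start address; not hetcompute::main_memregion.
class aligned_block {
public:
    static constexpr std::size_t default_alignment = 4096;

    explicit aligned_block(std::size_t sz, std::size_t alignment = default_alignment)
        : raw_(sz + alignment), size_(sz) {
        void* p = raw_.data();
        std::size_t space = raw_.size();
        // First address in raw_ that is aligned and leaves room for sz bytes.
        ptr_ = std::align(alignment, sz, p, space);
    }

    // Non-copyable: the memory stays alive exactly as long as this object.
    aligned_block(aligned_block const&) = delete;
    aligned_block& operator=(aligned_block const&) = delete;

    void* get_ptr() const { return ptr_; }
    std::size_t get_num_bytes() const { return size_; }

private:
    std::vector<unsigned char> raw_;   // backing storage, freed by the destructor
    std::size_t size_;
    void* ptr_ = nullptr;
};

// True if p is a multiple of alignment.
inline bool is_aligned(void const* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```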

Additional Inherited Members

11.3.1.4 class hetcompute::memregion

Base class for all mem-regions. The only common feature across the mem-regions is that they all have a
size in bytes. The user constructs a mem-region of the appropriate type to allocate the corresponding type
of specialized device-memory (hetcompute::main_memregion and
hetcompute::ion_memregion) or to create inter-operability with data from an external framework
(hetcompute::glbuffer_memregion).
Mem-regions provide RAII semantics:

• The specialized memory is allocated or the interop created when the user constructs the mem-region
object of the appropriate type.
• The user keeps the allocated memory or the interop alive by keeping the mem-region object alive.
Note

The base class hetcompute::memregion is not user-constructible. Instead, the user constructs an
instance of a derived class of hetcompute::memregion that provides the desired allocation or interop
functionality.

See Also

hetcompute::main_memregion
hetcompute::cl2svm_memregion
hetcompute::ion_memregion
hetcompute::glbuffer_memregion

Public member functions

• ∼memregion ()
Destructor.

• size_t get_num_bytes () const


Get the size of the mem-region in bytes.

• HETCOMPUTE_DELETE_METHOD (memregion())
• HETCOMPUTE_DELETE_METHOD (memregion(memregion const &))
• HETCOMPUTE_DELETE_METHOD (memregion &operator=(memregion const &))
• HETCOMPUTE_DELETE_METHOD (memregion(memregion &&))
• HETCOMPUTE_DELETE_METHOD (memregion &operator=(memregion &&))


Protected Member Functions

• memregion (internal::internal_memregion ∗int_mr)

Protected Attributes

• internal::internal_memregion ∗ _int_mr

Friends

• class ::hetcompute::internal::memregion_base_accessor

11.3.2 Function Documentation

11.3.2.1 hetcompute::glbuffer_memregion::glbuffer_memregion ( GLuint id )


[explicit]

Constructor, wraps an existing OpenGL buffer to allow inter-operability with Qualcomm HetCompute. The
size of the OpenGL buffer is automatically determined and set as the size of the mem-region.

Parameters

id The GLuint id of an existing OpenGL buffer.

11.3.2.2 hetcompute::ion_memregion::ion_memregion ( size_t sz, bool cacheable =


true ) [explicit]

Constructor, allocates ION memory.

Parameters

sz Size of the allocation in bytes.


cacheable Optional, cacheable if true, non-cacheable if false. Default is
cacheable.

11.3.2.3 hetcompute::ion_memregion::ion_memregion ( void ∗ ptr, size_t sz, bool


cacheable )

Constructor, uses allocated ION memory. The user is responsible for ensuring the lifetime of the ION
memory, and handling the deallocation. The lifetime of the user allocated memory MUST exceed any use
of the memory via the memregion object (say, if the memregion is used by a buffer).

Parameters

ptr Pointer to the externally allocated region. The block at ptr of size sz bytes
must be fully contained within an existing HetCompute ion_memregion.
sz Size of the allocation in bytes.
cacheable true if ptr points to a cacheable ion region, false otherwise.
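
The containment requirement on ptr and sz can be checked with plain address arithmetic. The following is a minimal sketch, not part of the SDK: is_contained is a hypothetical helper, and base and base_sz stand for the pointer and size of the enclosing region (for example, the values returned by get_ptr() and get_num_bytes() of the enclosing ion_memregion).

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not an SDK function): returns true when the block
// [ptr, ptr + sz) lies entirely inside [base, base + base_sz).
inline bool is_contained(void const* base, std::size_t base_sz, void const* ptr, std::size_t sz)
{
    auto const b = reinterpret_cast<std::uintptr_t>(base);
    auto const p = reinterpret_cast<std::uintptr_t>(ptr);
    if (p < b)
        return false; // block starts before the enclosing region

    std::size_t const offset = static_cast<std::size_t>(p - b);
    // Written this way to avoid overflow in offset + sz.
    return offset <= base_sz && sz <= base_sz - offset;
}

inline void example_usage()
{
    char region[128];
    assert(is_contained(region, sizeof(region), region + 32, 64));   // inside
    assert(!is_contained(region, sizeof(region), region + 100, 64)); // runs past the end
}
```

A check of this kind would typically run before constructing the sub-region, so that a bad offset is caught at the call site rather than inside the runtime.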


11.3.2.4 hetcompute::ion_memregion::ion_memregion ( void ∗ ptr, int fd, size_t sz,


bool cacheable )

Constructor, uses allocated ION memory, with the associated file descriptor. The user is responsible for
ensuring the lifetime of the ION memory, and handling the deallocation. The lifetime of the user allocated
memory MUST exceed any use of the memory via the memregion object (say, if the memregion is used by
a buffer). This variant is more flexible as it enables the construction of an ion_memregion using ion
memory that was (a) allocated by another process, or, (b) allocated by the same process without using a
hetcompute::ion_memregion.

Parameters

ptr Pointer to the externally allocated region.


fd File descriptor associated with the allocated ion pointer.
sz Size of the allocation in bytes.
cacheable true if ptr points to a cacheable ion region, false otherwise.

11.3.2.5 hetcompute::main_memregion::main_memregion ( size_t sz, size_t


alignment = s_default_alignment ) [explicit]

Constructor, allocates aligned memory.

Parameters

sz Size of the allocation in bytes.


alignment Optional, desired alignment. Default is page aligned.
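
To illustrate what "allocates aligned memory" with RAII semantics can look like, here is a minimal sketch in standard C++. The class aligned_region and the helper is_aligned are hypothetical stand-ins written for this example; they are not the SDK's main_memregion, and the real implementation may allocate differently. The sketch assumes the alignment is a non-zero power of two.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <new>

// Hypothetical stand-in for illustration only; NOT the SDK's main_memregion.
class aligned_region
{
public:
    static constexpr std::size_t default_alignment = 4096; // page-sized default, as in the manual

    // Assumption: alignment is a non-zero power of two.
    explicit aligned_region(std::size_t sz, std::size_t alignment = default_alignment)
        : _raw(std::malloc(sz + alignment)), _ptr(nullptr), _size(sz)
    {
        if (_raw == nullptr)
            throw std::bad_alloc();
        // Round the raw address up to the next multiple of alignment.
        auto const addr = reinterpret_cast<std::uintptr_t>(_raw);
        auto const mask = static_cast<std::uintptr_t>(alignment) - 1;
        _ptr = reinterpret_cast<void*>((addr + mask) & ~mask);
    }

    // The memory lives exactly as long as the object does.
    ~aligned_region() { std::free(_raw); }

    // Non-copyable and non-movable, like the deleted operations of main_memregion.
    aligned_region(aligned_region const&) = delete;
    aligned_region& operator=(aligned_region const&) = delete;
    aligned_region(aligned_region&&) = delete;
    aligned_region& operator=(aligned_region&&) = delete;

    void* get_ptr() const { return _ptr; }
    std::size_t get_num_bytes() const { return _size; }

private:
    void* _raw;        // address returned by malloc, passed to free
    void* _ptr;        // aligned address handed to the user
    std::size_t _size; // usable size in bytes
};

inline bool is_aligned(void const* p, std::size_t alignment)
{
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}

inline void example_usage()
{
    aligned_region region(1000); // default: page aligned
    assert(is_aligned(region.get_ptr(), aligned_region::default_alignment));
    assert(region.get_num_bytes() == 1000);
}
```

Deleting the copy and move operations keeps ownership unambiguous: exactly one object is responsible for releasing the allocation.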

11.3.2.6 hetcompute::main_memregion::main_memregion ( void ∗ ptr, size_t sz )

Constructor, uses user-allocated memory. The user is responsible for ensuring the lifetime of the memory,
and handling the deallocation. The lifetime of the user allocated memory MUST exceed any use of the
memory via the memregion object (say, if the memregion is used by a buffer).

Parameters

ptr Pointer to the externally allocated region.


sz Size of the allocation in bytes.

11.3.2.7 int hetcompute::ion_memregion::get_fd ( ) const

Gets the file descriptor associated with the pointer to the allocated ION memory.

Returns

The file descriptor associated with the allocated ION memory.


11.3.2.8 GLuint hetcompute::glbuffer_memregion::get_id ( ) const

Gets the id of the wrapped OpenGL buffer.

Returns

The id of the wrapped OpenGL buffer.

11.3.2.9 size_t hetcompute::memregion::get_num_bytes ( ) const

Get the size of the mem-region in bytes. Applies to all derived mem-region classes.

Returns

Size of the mem-region in bytes.

11.3.2.10 void∗ hetcompute::main_memregion::get_ptr ( ) const

Gets a pointer to the allocated memory.

Returns

A pointer to the allocated memory.

11.3.2.11 void∗ hetcompute::ion_memregion::get_ptr ( ) const

Gets a pointer to the allocated ION memory.

Returns

A pointer to the allocated ION memory.

11.3.2.12 bool hetcompute::ion_memregion::is_cacheable ( ) const

Returns whether the ION memregion is cacheable.

Returns

True if the ION memregion is cacheable; false otherwise.

11.3.3 Variable Documentation

11.3.3.1 constexpr size_t hetcompute::main_memregion::s_default_alignment = 4096


[static]

The default alignment needed for the allocation to be page aligned.

12 Graphics Reference API


12.1 Texture APIs


Typedefs

• template<addressing_mode addr_mode, filter_mode fil_mode>


using hetcompute::graphics::sampler_ptr = ::hetcompute::internal::hetcompute_shared_ptr<
internal::sampler_cl< addr_mode, fil_mode >>
• template<image_format img_format, int dims>
using hetcompute::graphics::texture_ptr = ::hetcompute::internal::hetcompute_shared_ptr<
internal::texture_cl< img_format, dims >>

Functions

• template<image_format img_format, int dims>


texture_ptr< img_format, dims > hetcompute::graphics::create_derivative_texture (texture_ptr<
img_format, dims > &parent_texture, extended_format_plane_type derivative_plane_type, bool
read_only)
Create HetCompute single-plane derivative texture from a multi-plane parent texture. The parent texture is
created with create_texture(...) using ION memory.

• template<addressing_mode addr_mode, filter_mode fil_mode>


sampler_ptr< addr_mode, fil_mode > hetcompute::graphics::create_sampler (bool
normalized_coords)
Create HetCompute sampler with create_sampler(...).

• template<image_format img_format, int dims, typename T >


texture_ptr< img_format, dims > hetcompute::graphics::create_texture (image_size< dims > const
&is, T ∗host_ptr)
Create Qualcomm HetCompute texture with create_texture(...).

• template<image_format img_format, int dims>


texture_ptr< img_format, dims > hetcompute::graphics::create_texture (image_size< dims > const
&is, ion_memregion const &ion_mr, bool read_only=false)
Create HetCompute texture with create_texture(...) using ION memory.

• bool hetcompute::graphics::is_supported (image_format img_format)


Test if given image format is supported by current platform and context at runtime.

• template<image_format img_format, int dims>


void ∗ hetcompute::graphics::map (texture_ptr< img_format, dims > &tp)
Map data from GPU to CPU with map(...).

• template<image_format img_format, int dims>


void hetcompute::graphics::unmap (texture_ptr< img_format, dims > &tp)
Unmap data from CPU to GPU with unmap(...).


12.1.1 Function Documentation

12.1.1.1 template<image_format img_format, int dims> texture_ptr<img_format,


dims> hetcompute::graphics::create_derivative_texture ( texture_ptr<
img_format, dims > & parent_texture, extended_format_plane_type
derivative_plane_type, bool read_only )

The following Qualcomm OpenCL extensions are used for this feature:

• Extract Derivative Image Plane: cl_qcom_extract_image_plane
• QCOM Supported Compressed Image: cl_qcom_compressed_image
• QCOM Other Non-Conventional Images [NV12, TP10]: cl_qcom_other_image

Create HetCompute single-plane derivative texture from a multi-plane parent HetCompute texture. The
parent texture is created with create_texture(...) using ION memory.

Template Parameters

img_format HetCompute image format


dims image dimensions

Parameters

parent_texture multi-plane parent texture created using create_texture(...)


derivative_plane_type Type of child plane (Y or UV).
read_only Indicates whether the created derivative texture is read-only (RO) or read-write (RW).

Returns

a pointer to the created derivative texture

Note

img_format should match the image format of parent texture

12.1.1.2 template<addressing_mode addr_mode, filter_mode fil_mode> sampler_ptr<addr_mode, fil_mode> hetcompute::graphics::create_sampler ( bool normalized_coords )

Template Parameters

addr_mode Addressing mode.


fil_mode Filtering mode.

Parameters

normalized_coords Whether to use normalized coordinates for pixel access.


Returns

A pointer to the created sampler.

12.1.1.3 template<image_format img_format, int dims, typename T > texture_ptr<img_format, dims> hetcompute::graphics::create_texture ( image_size< dims > const & is, T ∗ host_ptr )

Template Parameters

img_format Qualcomm HetCompute image format.


dims Image dimensions.

Parameters

is Image dimension size.


host_ptr Host pointer to image data in CPU. host_ptr must not be nullptr.

Returns

A pointer to the created texture.

12.1.1.4 template<image_format img_format, int dims> texture_ptr<img_format,


dims> hetcompute::graphics::create_texture ( image_size< dims > const &
is, ion_memregion const & ion_mr, bool read_only = false )
Template Parameters

img_format HetCompute image format


dims image dimensions

Parameters

is image dimension size


ion_mr instance of ion_memregion

Returns

a pointer to the created texture

12.1.1.5 bool hetcompute::graphics::is_supported ( image_format img_format )


Parameters

img_format Qualcomm HetCompute image format.


Returns

True if this image format is supported by the current device; false otherwise.

12.1.1.6 template<image_format img_format, int dims> void∗ hetcompute::graphics::map ( texture_ptr< img_format, dims > & tp )

Template Parameters

img_format Qualcomm HetCompute image format.


dims Image dimensions.

Parameters

tp Qualcomm HetCompute texture pointer.

Returns

A pointer to image data in CPU.

12.1.1.7 template<image_format img_format, int dims> void hetcompute::graphics::unmap ( texture_ptr< img_format, dims > & tp )

Template Parameters

img_format Qualcomm HetCompute image format.


dims Image dimensions.

Parameters

tp Qualcomm HetCompute texture pointer.
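
Because map(...) and unmap(...) come as a pair, one common way to keep them balanced is a small RAII guard. The sketch below shows the pattern only: fake_texture, fake_map, and fake_unmap are hypothetical stand-ins invented for this example, so that it compiles without the SDK. With the real API the guard would presumably hold a texture_ptr and call hetcompute::graphics::map and unmap instead.

```cpp
#include <cassert>
#include <vector>

// Hypothetical stand-in for a texture; the real texture_ptr comes from the SDK.
struct fake_texture
{
    std::vector<unsigned char> pixels;
    bool mapped = false;
};

// Stand-ins with the same shape as map(...) and unmap(...).
inline void* fake_map(fake_texture& t)
{
    t.mapped = true;
    return t.pixels.data();
}

inline void fake_unmap(fake_texture& t)
{
    t.mapped = false;
}

// RAII guard: maps on construction, unmaps on destruction, so an early
// return or an exception cannot leave the texture mapped.
class scoped_map
{
public:
    explicit scoped_map(fake_texture& t) : _tex(t), _data(fake_map(t)) {}
    ~scoped_map() { fake_unmap(_tex); }

    scoped_map(scoped_map const&) = delete;
    scoped_map& operator=(scoped_map const&) = delete;

    void* data() const { return _data; }

private:
    fake_texture& _tex;
    void* _data;
};

inline void example_usage()
{
    fake_texture tex;
    tex.pixels.assign(16, 0);
    {
        scoped_map m(tex);
        static_cast<unsigned char*>(m.data())[0] = 42; // CPU-side write while mapped
    }
    assert(!tex.mapped); // guard has unmapped the texture
}
```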


12.2 Texture Data Types


Classes

• struct hetcompute::graphics::image_size< dims >


• struct hetcompute::graphics::image_size< 1 >
• struct hetcompute::graphics::image_size< 2 >
• struct hetcompute::graphics::image_size< 3 >

Enumerations

• enum hetcompute::graphics::addressing_mode { ADDRESS_NONE,


ADDRESS_CLAMP_TO_EDGE, ADDRESS_CLAMP, ADDRESS_REPEAT }
• enum hetcompute::graphics::extended_format_plane_type { ExtendedFormatYPlane,
ExtendedFormatUVPlane }
• enum hetcompute::graphics::filter_mode { FILTER_NEAREST, FILTER_LINEAR }
• enum hetcompute::graphics::image_format : int {
first, RGBAsnorm_int8 = first, RGBAunorm_int8, RGBAsigned_int8,
RGBAunsigned_int8, RGBAunorm_int16, RGBA_float, RGBA_half,
ARGBsnorm_int8, ARGBunorm_int8, ARGBsigned_int8, ARGBunsigned_int8,
BGRAsnorm_int8, BGRAunorm_int8, BGRAsigned_int8, BGRAunsigned_int8,
RGsnorm_int8, RGunorm_int8, RGsigned_int8, RGunsigned_int8,
RGunorm_int16, RG_float, RG_half, INTENSITYsnorm_int8,
INTENSITYsnorm_int16, INTENSITYunorm_int8, INTENSITYunorm_int16,
INTENSITY_float,
LUMINANCEsnorm_int8, LUMINANCEsnorm_int16, LUMINANCEunorm_int8,
LUMINANCEunorm_int16,
LUMINANCE_float, Rsnorm_int8, Runorm_int8, Rsigned_int8,
Runsigned_int8, Runorm_int16, R_float, R_half,
NV12unorm_int8, P010unorm_int10, TP10unorm_int10, TiledNV12unorm_int8,
TiledP010unorm_int10, TiledTP10unorm_int10, CompressedNV12unorm_int8,
CompressedNV124Runorm_int8,
CompressedP010unorm_int10, CompressedTP10unorm_int10, last =
CompressedTP10unorm_int10 }

12.2.1 Class Documentation

12.2.1.1 struct hetcompute::graphics::image_size

template<int dims>struct hetcompute::graphics::image_size< dims >

HetCompute image dimension description.

12.2.1.2 struct hetcompute::graphics::image_size< 1 >


template<>struct hetcompute::graphics::image_size< 1 >

Qualcomm HetCompute 1D image dimension description.


Data fields

Type Field Description


size_t _width Width.

12.2.1.3 struct hetcompute::graphics::image_size< 2 >

template<>struct hetcompute::graphics::image_size< 2 >

Qualcomm HetCompute 2D image dimension description.


Data fields

Type Field Description


size_t _height Height.
size_t _width Width.

12.2.1.4 struct hetcompute::graphics::image_size< 3 >

template<>struct hetcompute::graphics::image_size< 3 >

Qualcomm HetCompute 3D image dimension description.


Data fields

Type Field Description


size_t _depth Depth.
size_t _height Height.
size_t _width Width.
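
As a worked example of how these dimension fields are typically used, the sketch below computes the pixel count and buffer size of an image. The struct definitions are stand-ins that reuse the documented field names (the member order in the real headers is not shown in this manual, so fields are assigned by name), and bytes_per_pixel is an assumption supplied by the caller, for example 4 for an RGBA format with 8 bits per channel.

```cpp
#include <cassert>
#include <cstddef>

// Stand-in definitions that reuse the documented field names; the real
// structs live in the SDK headers and their member order may differ.
template <int dims> struct image_size;
template <> struct image_size<1> { std::size_t _width; };
template <> struct image_size<2> { std::size_t _width; std::size_t _height; };
template <> struct image_size<3> { std::size_t _width; std::size_t _height; std::size_t _depth; };

inline std::size_t num_pixels(image_size<1> const& s) { return s._width; }
inline std::size_t num_pixels(image_size<2> const& s) { return s._width * s._height; }
inline std::size_t num_pixels(image_size<3> const& s) { return s._width * s._height * s._depth; }

// bytes_per_pixel is chosen by the caller to match the image format.
template <int dims>
std::size_t num_bytes(image_size<dims> const& s, std::size_t bytes_per_pixel)
{
    return num_pixels(s) * bytes_per_pixel;
}

inline void example_usage()
{
    image_size<2> vga;
    vga._width  = 640;
    vga._height = 480;
    assert(num_pixels(vga) == 307200u);    // 640 * 480
    assert(num_bytes(vga, 4) == 1228800u); // 4 bytes per pixel
}
```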

12.2.2 Enumeration Type Documentation

12.2.2.1 enum hetcompute::graphics::addressing_mode [strong]

Supported image addressing mode in Qualcomm HetCompute. Each mode can be mapped to OpenCL
sampler addressing mode.

12.2.2.2 enum hetcompute::graphics::extended_format_plane_type [strong]

Qualcomm HetCompute supported extended format derivative types

12.2.2.3 enum hetcompute::graphics::filter_mode [strong]

Supported image filter mode in Qualcomm HetCompute. Each mode can be mapped to OpenCL sampler
filter mode.


12.2.2.4 enum hetcompute::graphics::image_format : int [strong]

Supported image format in Qualcomm HetCompute. Each format can be mapped to OpenCL image format
and pixel channel.
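
The first and last enumerators make it possible to walk the whole range of formats, for example to list which ones a device supports. The sketch below uses an abbreviated stand-in enum and a hypothetical is_supported that arbitrarily rejects one format; it assumes, as the enumerator list suggests, that the values between first and last are contiguous.

```cpp
#include <cassert>
#include <vector>

// Abbreviated stand-in: the real image_format has many more enumerators but
// follows the same first/last convention.
enum class image_format : int
{
    first,
    RGBAsnorm_int8 = first,
    RGBAunorm_int8,
    RGBA_float,
    NV12unorm_int8,
    last = NV12unorm_int8
};

// Hypothetical stand-in for the runtime query; here it rejects one format
// purely for demonstration.
inline bool is_supported(image_format f)
{
    return f != image_format::RGBA_float;
}

// Walks [first, last] and collects the formats reported as supported.
// Assumption: enumerator values are contiguous.
inline std::vector<image_format> supported_formats()
{
    std::vector<image_format> result;
    int const lo = static_cast<int>(image_format::first);
    int const hi = static_cast<int>(image_format::last);
    for (int i = lo; i <= hi; ++i)
    {
        auto const f = static_cast<image_format>(i);
        if (is_supported(f))
            result.push_back(f);
    }
    return result;
}

inline void example_usage()
{
    auto const formats = supported_formats();
    assert(formats.size() == 3); // four formats in the stand-in enum, one rejected
}
```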

13 Data Structures Reference API

Qualcomm HetCompute provides a set of concurrent data structures that are optimized for performance
using internal Qualcomm HetCompute primitives.


13.1 Bounded Lock-Free Queue


Classes

• class hetcompute::bounded_lfqueue< T >

Typedefs

• typedef internal::blfq::blfq_size_t< T,(sizeof(size_t) >= sizeof(T))> hetcompute::bounded_lfqueue< T >::container_type
• typedef T hetcompute::bounded_lfqueue< T >::value_type

Functions

• hetcompute::bounded_lfqueue< T >::bounded_lfqueue (size_t log_size)


• hetcompute::bounded_lfqueue< T >::HETCOMPUTE_DELETE_METHOD
(bounded_lfqueue(bounded_lfqueue const &))
• hetcompute::bounded_lfqueue< T >::HETCOMPUTE_DELETE_METHOD
(bounded_lfqueue(bounded_lfqueue &&))
• hetcompute::bounded_lfqueue< T >::HETCOMPUTE_DELETE_METHOD
(bounded_lfqueue &operator=(bounded_lfqueue const &))
• bool hetcompute::bounded_lfqueue< T >::pop (value_type &r)
• bool hetcompute::bounded_lfqueue< T >::push (value_type const &v)

13.1.1 Class Documentation

13.1.1.1 class hetcompute::bounded_lfqueue

template<typename T>class hetcompute::bounded_lfqueue< T >

A Bounded Lock-Free FIFO Queue.

Note: The size of the queue is bounded at creation time.

Public Types

• typedef internal::blfq::blfq_size_t< T,(sizeof(size_t) >= sizeof(T))> container_type
• typedef T value_type


Public member functions

• bounded_lfqueue (size_t log_size)


• HETCOMPUTE_DELETE_METHOD (bounded_lfqueue(bounded_lfqueue const &))
• HETCOMPUTE_DELETE_METHOD (bounded_lfqueue(bounded_lfqueue &&))
• HETCOMPUTE_DELETE_METHOD (bounded_lfqueue &operator=(bounded_lfqueue const &))
• HETCOMPUTE_DELETE_METHOD (bounded_lfqueue &operator=(bounded_lfqueue &&))
• bool pop (value_type &r)
• bool push (value_type const &v)

13.1.2 Function Documentation

13.1.2.1 template<typename T > hetcompute::bounded_lfqueue< T >::bounded_lfqueue ( size_t log_size ) [explicit]

Constructs the Bounded Lock-Free Queue, given the log (base 2) of the maximum number of entries it can
contain.

Parameters

log_size Log (base 2) of the maximum number of entries the queue can contain.

13.1.2.2 template<typename T > bool hetcompute::bounded_lfqueue< T >::pop (


value_type & r )

Pop from the queue, placing the popped value in the result.

Parameters

r The object to store the popped value in, if successful.

Returns

True if the pop was successful; false if the queue was empty.

Note: The contents of r are not modified if the pop is unsuccessful.

13.1.2.3 template<typename T > bool hetcompute::bounded_lfqueue< T >::push (


value_type const & v )

Push value into the queue.


Parameters

v Value to be pushed into the queue.

Returns

True if the push was successful; false if the queue was full.
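
The push/pop contract and the meaning of log_size can be seen in a small model. The class below, toy_bounded_queue, is a hypothetical single-threaded stand-in written for this example: it is not lock-free and is not the SDK's implementation, but it follows the documented interface (capacity of 2^log_size entries, push fails when full, pop fails when empty and leaves its argument untouched).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Single-threaded stand-in that mimics the documented interface only.
// It is NOT lock-free and NOT the SDK's bounded_lfqueue.
template <typename T>
class toy_bounded_queue
{
public:
    // Capacity is 2^log_size entries, fixed at construction.
    explicit toy_bounded_queue(std::size_t log_size)
        : _slots(std::size_t(1) << log_size), _head(0), _count(0)
    {
    }

    bool push(T const& v)
    {
        if (_count == _slots.size())
            return false; // full
        // Power-of-two capacity lets a mask replace the modulo.
        _slots[(_head + _count) & (_slots.size() - 1)] = v;
        ++_count;
        return true;
    }

    bool pop(T& r)
    {
        if (_count == 0)
            return false; // empty: r is left untouched
        r = _slots[_head];
        _head = (_head + 1) & (_slots.size() - 1);
        --_count;
        return true;
    }

private:
    std::vector<T> _slots;
    std::size_t _head;  // index of the oldest entry
    std::size_t _count; // number of stored entries
};

inline void example_usage()
{
    toy_bounded_queue<int> q(2); // 2^2 = 4 entries
    for (int i = 0; i < 4; ++i)
    {
        bool const pushed = q.push(i);
        assert(pushed);
    }
    assert(!q.push(99)); // fifth push fails: queue is full

    int r = -1;
    bool const popped = q.pop(r);
    assert(popped && r == 0); // FIFO order
}
```

Callers of the real queue would typically handle a false return from push by retrying later or applying back-pressure; which is appropriate depends on the application.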


13.2 Unbounded Lock-Free Queue


Classes

• class hetcompute::lfqueue< T >

Typedefs

• typedef internal::lfq::lfq< T > hetcompute::lfqueue< T >::container_type


• typedef T hetcompute::lfqueue< T >::value_type

Functions

• hetcompute::lfqueue< T >::lfqueue (size_t log_size)


• hetcompute::lfqueue< T >::HETCOMPUTE_DELETE_METHOD (lfqueue(lfqueue const &))
• hetcompute::lfqueue< T >::HETCOMPUTE_DELETE_METHOD (lfqueue(lfqueue &&))
• hetcompute::lfqueue< T >::HETCOMPUTE_DELETE_METHOD (lfqueue
&operator=(lfqueue const &))
• bool hetcompute::lfqueue< T >::pop (value_type &r)
• bool hetcompute::lfqueue< T >::push (value_type const &v)

13.2.1 Class Documentation

13.2.1.1 class hetcompute::lfqueue

template<typename T>class hetcompute::lfqueue< T >

Unbounded Lock-Free FIFO queue that is capable of dynamically growing and shrinking.

Public Types

• typedef internal::lfq::lfq< T > container_type


• typedef T value_type

Public member functions

• lfqueue (size_t log_size)


• HETCOMPUTE_DELETE_METHOD (lfqueue(lfqueue const &))
• HETCOMPUTE_DELETE_METHOD (lfqueue(lfqueue &&))
• HETCOMPUTE_DELETE_METHOD (lfqueue &operator=(lfqueue const &))
• HETCOMPUTE_DELETE_METHOD (lfqueue &operator=(lfqueue &&))
• bool pop (value_type &r)
• bool push (value_type const &v)


13.2.2 Function Documentation

13.2.2.1 template<typename T> hetcompute::lfqueue< T >::lfqueue ( size_t log_size


) [explicit]

Constructs the Unbounded Lock-Free Queue, given the log (base 2) of the size of the static array within
each node.

Parameters

log_size Log (base 2) of the maximum number of entries in each node.

13.2.2.2 template<typename T> bool hetcompute::lfqueue< T >::pop ( value_type &


r )

Pop from the queue, placing the popped value in the result.

Parameters

r The object to store the popped value in, if successful.

Returns

True if the pop was successful; false if the queue was empty.

Note: The contents of r are not modified if the pop is unsuccessful.

13.2.2.3 template<typename T> bool hetcompute::lfqueue< T >::push ( value_type


const & v )

Push value into the queue. Since the queue is capable of growing, a push always succeeds.

Parameters

v Value to be pushed into the queue.

Returns

Always true.
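
The role of log_size as a per-node array size, and the growing and shrinking behavior, can be modeled as a chain of fixed-size nodes. The class toy_unbounded_queue below is a hypothetical single-threaded stand-in for illustration: it is not lock-free and says nothing about how the SDK actually manages its nodes. node_count() is an extra accessor added here only to make the growth visible.

```cpp
#include <cassert>
#include <cstddef>
#include <deque>
#include <vector>

// Single-threaded stand-in, NOT lock-free and NOT the SDK's lfqueue:
// a chain of nodes, each holding up to 2^log_size entries.
template <typename T>
class toy_unbounded_queue
{
public:
    explicit toy_unbounded_queue(std::size_t log_size)
        : _node_capacity(std::size_t(1) << log_size), _read_pos(0)
    {
    }

    // Always succeeds: a new node is appended when the last one is full.
    bool push(T const& v)
    {
        if (_nodes.empty() || _nodes.back().size() == _node_capacity)
        {
            _nodes.emplace_back();
            _nodes.back().reserve(_node_capacity);
        }
        _nodes.back().push_back(v);
        return true;
    }

    bool pop(T& r)
    {
        if (_nodes.empty())
            return false; // empty: r is left untouched
        std::vector<T>& front = _nodes.front();
        r = front[_read_pos];
        ++_read_pos;
        if (_read_pos == front.size())
        {
            _nodes.pop_front(); // node fully consumed: release it (shrink)
            _read_pos = 0;
        }
        return true;
    }

    // Illustration-only accessor, not part of the documented interface.
    std::size_t node_count() const { return _nodes.size(); }

private:
    std::size_t _node_capacity;
    std::size_t _read_pos; // next unread index in the front node
    std::deque<std::vector<T>> _nodes;
};

inline void example_usage()
{
    toy_unbounded_queue<int> q(1); // 2 entries per node
    for (int i = 0; i < 5; ++i)
        q.push(i);
    assert(q.node_count() == 3); // 5 entries need three 2-entry nodes
}
```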

14 Data Sharing and Storage Reference API


14.1 Data Sharing Synchronization


The primitives defined in this chapter allow concurrent access to data.


14.2 Scheduler Storage


Using scheduler storage requires including the following header file:
#include <hetcompute/schedulerstorage.hh>

Classes

• class hetcompute::scheduler_storage_ptr< T, Allocator >

14.2.1 Class Documentation

14.2.1.1 class hetcompute::scheduler_storage_ptr

template<typename T, class Allocator = std::allocator<T>>class hetcompute::scheduler_storage_ptr<


T, Allocator >

Scheduler-local storage allows sharing of information across tasks on a per-scheduler basis, much as
thread-local storage does for threads. A scheduler_storage_ptr<T> stores a pointer-to-T (T∗).
In contrast to task_storage_ptr, the contents persist across tasks. In contrast to
thread_storage_ptr, the contents are guaranteed not to change while a task is suspended. To
maintain these guarantees, the runtime system is free to create new objects of type T whenever needed.
See Also

task_storage_ptr
thread_storage_ptr

Example

// Example 1: a scratch buffer that is reused across tasks.
#include <algorithm>
#include <iterator>

#include <hetcompute/hetcompute.hh>

template <size_t N>
struct image_scratchpad
{
    image_scratchpad() { std::fill(std::begin(edge_image), std::end(edge_image), 0); }
    char edge_image[N];
};

namespace
{
    const hetcompute::scheduler_storage_ptr<image_scratchpad<4096>> image_buffers;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    int const N = 200;

    auto g = hetcompute::create_group();
    for (int i = 1; i < N; ++i)
    {
        g->launch([i] {
            // fill image buffer, which is reused across tasks
            for (auto& slot : image_buffers->edge_image)
                slot = i & 0xff;

            hetcompute::internal::yield(); // context-switch, we expect SLS to survive this

            // check contents
            for (auto const& slot : image_buffers->edge_image)
            {
                if (slot != char(i & 0xff))
                {
                    // i identifies the task, not the position within the buffer
                    HETCOMPUTE_ILOG("mismatch in buffer of task %d", i);
                }
            }
        });
    }
    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

// Example 2: a counter kept per scheduler.
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::scheduler_storage_ptr<size_t> s_sls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([i] {
            size_t c = ++*s_sls_state;
            // values for c are consecutive on a per-scheduler basis
            (void)c;
        });
    }

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

// Example 3: the counter is unchanged while a task waits on another task.
#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::scheduler_storage_ptr<size_t> s_sls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t c1 = ++*s_sls_state;
            t->launch();
            t->wait_for();
            size_t c2 = ++*s_sls_state;
            if (c1 + 1 != c2)
            {
                HETCOMPUTE_ILOG("error: mismatch");
            }
        });
    }

    g->wait_for();
    hetcompute::runtime::shutdown();

    return 0;
}

Public Types

• typedef Allocator allocator_type

Static Public Member Functions

• static void ∗ get_specific (internal::storage_key key)


• static int key_create (internal::storage_key ∗key, void(∗dtor)(void ∗))
• static int set_specific (internal::storage_key key, void const ∗value)

Friends

• class scoped_storage_ptr<::hetcompute::scheduler_storage_ptr, T, Allocator >

Additional Inherited Members


14.3 Scoped Storage


Using scoped storage requires including the following header file:
#include <hetcompute/scopedstorage.hh>

Classes

• class hetcompute::scoped_storage_ptr< Scope, T, Allocator >

14.3.1 Class Documentation

14.3.1.1 class hetcompute::scoped_storage_ptr

template<template< class, class > class Scope, typename T, class Allocator>class


hetcompute::scoped_storage_ptr< Scope, T, Allocator >

Scoped storage allows sharing of information on a per-task basis, similar to what thread-local storage
does for threads. A scoped_storage_ptr<Scope,T,Allocator> stores a pointer-to-T (T∗) for a given
scope Scope<T,A>, which defines lifetime, persistence, etc. Allocator controls the allocation of T
objects.

Public Types

• typedef T const ∗ pointer_type


• typedef Scope< T, Allocator > scope_type

Public member functions

• scoped_storage_ptr ()
• T ∗ get () const
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr(scoped_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(scoped_storage_ptr const
&))
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(scoped_storage_ptr const
&) volatile)
• HETCOMPUTE_DELETE_METHOD (scoped_storage_ptr &operator=(T ∗const &))
• operator bool () const
• operator pointer_type () const
• bool operator! () const
• bool operator!= (T ∗const &other)
• T & operator∗ () const
• T ∗ operator-> () const
• bool operator== (T ∗const &other)


14.3.1.1.1 Constructors and Destructors

14.3.1.1.1.1 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::scoped_storage_ptr ( )

Exceptions

hetcompute::tls_exception If scoped_storage_ptr could not be reserved.

Note

If exceptions are disabled, an error is logged if key creation is unsuccessful.

14.3.1.1.2 Member Function Documentation

14.3.1.1.2.1 template<template< class, class > class Scope, typename T, class Allocator> T∗
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::get ( ) const

Returns

Stored pointer value; a new object of type T is created and stored, if it has not been stored before.

14.3.1.1.2.2 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator bool ( ) const
[explicit]

Casting operator to bool (constantly true).

14.3.1.1.2.3 template<template< class, class > class Scope, typename T, class Allocator>
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator pointer_type ( ) const

Casting operator to T∗ pointer type.

14.3.1.1.2.4 template<template< class, class > class Scope, typename T, class Allocator> bool
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator! ( ) const

Returns

Constantly false.

14.3.1.1.2.5 template<template< class, class > class Scope, typename T, class Allocator> T&
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator∗ ( ) const

Returns

Reference to value pointed to by stored pointer; A new object of type T is created and stored, if it has
not been stored before.


14.3.1.1.2.6 template<template< class, class > class Scope, typename T, class Allocator> T∗
hetcompute::scoped_storage_ptr< Scope, T, Allocator >::operator-> ( ) const

Returns

Stored pointer value; a new object of type T is created and stored, if it has not been stored before.
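
The accessors above share one behavior: the pointed-to object is created on first use, which is why the conversion to bool is constantly true. The sketch below models that behavior with a function-local thread_local slot standing in for the scope. The class lazy_thread_ptr is hypothetical and much simpler than scoped_storage_ptr: it ignores the Scope and Allocator parameters, and all instances for the same T share one slot per thread.

```cpp
#include <cassert>
#include <memory>

// Hypothetical stand-in, NOT the SDK class: one lazily created T per thread.
// Simplification: every lazy_thread_ptr<T> object shares the same slot.
template <typename T>
class lazy_thread_ptr
{
public:
    // Returns the stored pointer, creating the object on first access.
    T* get() const
    {
        // Assumption: a thread_local slot plays the role of the scope.
        thread_local std::unique_ptr<T> slot;
        if (!slot)
            slot.reset(new T());
        return slot.get();
    }

    T& operator*() const { return *get(); }
    T* operator->() const { return get(); }

    // An object always exists after access, so these are constant.
    explicit operator bool() const { return true; }
    bool operator!() const { return false; }
};

inline void example_usage()
{
    lazy_thread_ptr<long> counter;
    long* const first = counter.get(); // object created here
    ++*counter;
    assert(counter.get() == first);    // same object on later accesses
    assert(static_cast<bool>(counter));
}
```

Lazy creation means callers never need a null check before dereferencing, at the cost of requiring T to be default-constructible in this sketch.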


14.4 Task Storage


Using task storage requires including the following header file:
#include <hetcompute/taskstorage.hh>

Classes

• class hetcompute::task_storage_ptr< T >

14.4.1 Class Documentation

14.4.1.1 class hetcompute::task_storage_ptr

template<typename T>class hetcompute::task_storage_ptr< T >

Task-local storage enables allocation of task-specific data, much as thread-local storage does for threads.
The value of a task_storage_ptr is local to a task. A task_storage_ptr<T> stores a
pointer-to-T (T∗).
See Also

thread_storage_ptr

Example

#include <hetcompute/hetcompute.hh>

namespace
{
    hetcompute::task_storage_ptr<int> storage;
} // namespace

void func();

void
func()
{
    // reads and increments the int that the current task stored
    HETCOMPUTE_ILOG("%d", *storage);
    ++*storage;
}

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group();
    for (int i = 0; i < 10; ++i)
    {
        g->launch([i] {
            int v = i;
            storage = &v; // task-local: each task sees its own pointer
            func();
            if (v != i + 1)
            {
                HETCOMPUTE_ILOG("error");
            }
            func();
            if (v != i + 2)
            {
                HETCOMPUTE_ILOG("error");
            }
        });
    }
    g->wait_for();
    hetcompute::runtime::shutdown();
    return 0;
}

Public Types

• typedef T const ∗ pointer_type

Public member functions

• task_storage_ptr ()
• task_storage_ptr (T ∗const &ptr)
• task_storage_ptr (T ∗const &ptr, void(∗dtor)(T ∗))
• T ∗ get () const
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr(task_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr &operator=(task_storage_ptr const &))
• HETCOMPUTE_DELETE_METHOD (task_storage_ptr &operator=(task_storage_ptr const &)
volatile)
• operator bool () const
• operator pointer_type () const
• bool operator! () const
• bool operator!= (T ∗const &other) const
• T & operator∗ () const
• T ∗ operator-> () const
• task_storage_ptr & operator= (T ∗const &ptr)
• bool operator== (T ∗const &other) const

14.4.1.1.1 Constructors and Destructors

14.4.1.1.1.1 template<typename T> hetcompute::task_storage_ptr< T >::task_storage_ptr ( )

Exceptions

hetcompute::tls_exception If task_storage_ptr could not be reserved.

Note

If exceptions are disabled in the application, an error is logged if task_storage_ptr could not be reserved.


14.4.1.1.1.2 template<typename T> hetcompute::task_storage_ptr< T >::task_storage_ptr ( T ∗const


& ptr ) [explicit]

Exceptions

hetcompute::tls_exception If task_storage_ptr could not be reserved.

Parameters

ptr Initial value of task-local storage.

14.4.1.1.1.3 template<typename T> hetcompute::task_storage_ptr< T >::task_storage_ptr ( T ∗const & ptr, void(∗)(T ∗) dtor )

Exceptions

hetcompute::tls_exception If task_storage_ptr could not be reserved.

Parameters

ptr Initial value of task-local storage.


dtor Destructor function.
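
The shape of a destructor function suitable for the dtor parameter can be sketched in standard C++. This is an illustration under assumptions, not SDK code: destroy_int, make_and_release, and g_destroyed are hypothetical names, and std::unique_ptr stands in for the owner because this section does not describe when the runtime invokes dtor.

```cpp
#include <memory>

namespace
{
// Counts how many times the destructor function has run.
int g_destroyed = 0;
} // namespace

// Matches the void(*)(T*) shape of the dtor parameter, with T = int.
void destroy_int(int* p)
{
    ++g_destroyed;
    delete p;
}

// Creates an owned int with a custom destructor function, lets it go out of
// scope, and returns the running count of destructor calls.
// std::unique_ptr is only a stand-in owner (assumption), not task_storage_ptr.
int make_and_release()
{
    {
        std::unique_ptr<int, void (*)(int*)> owner(new int(7), &destroy_int);
    } // owner is destroyed here, which calls destroy_int
    return g_destroyed;
}
```

A function with this signature would be passed as the second constructor argument; whatever allocation scheme the initial pointer uses, the destructor function should release it the same way.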

14.4.1.1.2 Member Function Documentation

14.4.1.1.2.1 template<typename T> T∗ hetcompute::task_storage_ptr< T >::get ( ) const

Returns

The stored pointer value.

14.4.1.1.2.2 template<typename T> hetcompute::task_storage_ptr< T >::operator bool ( ) const [explicit]

Casting operator to bool.

14.4.1.1.2.3 template<typename T> hetcompute::task_storage_ptr< T >::operator pointer_type ( ) const

Casting operator to T∗ pointer type.


14.4.1.1.2.4 template<typename T> bool hetcompute::task_storage_ptr< T >::operator! ( ) const

Returns

True if stored pointer is nullptr.

14.4.1.1.2.5 template<typename T> T& hetcompute::task_storage_ptr< T >::operator∗ ( ) const

Returns

Reference to value pointed to by stored pointer.

Note: No checking for nullptr is performed.

14.4.1.1.2.6 template<typename T> T∗ hetcompute::task_storage_ptr< T >::operator-> ( ) const

Returns

The stored pointer value.

14.4.1.1.2.7 template<typename T> task_storage_ptr& hetcompute::task_storage_ptr< T >::operator= ( T ∗const & ptr )

Assignment operator, stores T∗.

Parameters

ptr Pointer value to store.


14.5 Thread Storage


Using thread storage requires including the following header file:
#include <hetcompute/threadstorage.hh>

Classes

• class hetcompute::thread_storage_ptr< T, Allocator >

14.5.1 Class Documentation

14.5.1.1 class hetcompute::thread_storage_ptr

template<typename T, class Allocator = std::allocator<T>> class hetcompute::thread_storage_ptr< T, Allocator >

Thread-local storage allows sharing of information across tasks on a per-thread basis. A
thread_storage_ptr<T> stores a pointer-to-T (T∗). In contrast to task_storage_ptr, the
contents are persistent across tasks. In contrast to scheduler_storage_ptr, the thread-local storage
may be accessed by other tasks while a task is suspended.

See Also

task_storage_ptr
scheduler_storage_ptr

Example

#include <hetcompute/hetcompute.hh>

namespace
{
    const hetcompute::thread_storage_ptr<size_t> s_tls_state;
} // namespace

int
main()
{
    hetcompute::runtime::init();
    auto g = hetcompute::create_group("test");
    auto t = hetcompute::create_task([] {});

    for (size_t i = 0; i < 200; ++i)
    {
        g->launch([=] {
            size_t* p1 = s_tls_state.get();
            t->launch();
            t->wait_for();
            size_t* p2 = s_tls_state.get();
            // cannot assume that p1 == p2
            (void)p1;
            (void)p2;
        });
    }

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}


Public Types

• typedef Allocator allocator_type

Static Public Member Functions

• static void ∗ get_specific (internal::storage_key key)


• static int key_create (internal::storage_key ∗key, void(∗dtor)(void ∗))
• static int set_specific (internal::storage_key key, void const ∗value)

Friends

• class scoped_storage_ptr<::hetcompute::thread_storage_ptr, T, Allocator >

Additional Inherited Members

15 Exceptions Reference API

This chapter describes all exceptions thrown by the Qualcomm HetCompute runtime system.
Exceptions can be disabled in the library by compiling the application with the compile-time flag
-DHETCOMPUTE_DISABLE_EXCEPTIONS=1. A general caveat to disabling exceptions in the HetCompute
library is that not all APIs return an error code; some may terminate the application on API-level errors,
such as a failed input parameter check or an unmet precondition for executing the API.


15.1 Exceptions
Classes

• class hetcompute::abort_task_exception
• class hetcompute::aggregate_exception
• class hetcompute::api_exception
• class hetcompute::canceled_exception
• class hetcompute::dsp_exception
• class hetcompute::error_exception
• class hetcompute::gpu_exception
• class hetcompute::hetcompute_exception
• class hetcompute::tls_exception

15.1.1 Class Documentation

15.1.1.1 class hetcompute::abort_task_exception

Exception thrown to abort the current task.


See Also

hetcompute::abort_on_cancel()
hetcompute::abort_task()

Public member functions

• abort_task_exception (std::string msg="aborted task")


• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.1.1 Member Function Documentation

15.1.1.1.1.1 virtual const char∗ hetcompute::abort_task_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.


15.1.1.2 class hetcompute::aggregate_exception

Aggregate exception encapsulating all exceptions thrown in a task graph leading up to a point-of-use of a
task/group, e.g., wait_for, copy_value, or move_value.

Example

auto g = hetcompute::create_group();
g->launch([]{
    std::string().at(1); // throws std::out_of_range exception
});
g->launch([]{
    std::string().at(1); // throws std::out_of_range exception
});
try {
    // Point-of-use of group g
    g->wait_for();
} catch (hetcompute::aggregate_exception& e) {
    // Rethrow and handle each contained exception in turn
    while (e.has_next()) {
        try {
            e.next(); // throws
        } catch (const std::out_of_range&) {
            // Do something
        } catch (...) {
            // Not reached
        }
    }
} catch (...) {
    // Not reached
}

Public member functions

• aggregate_exception (std::vector< std::exception_ptr > ∗exceptions)


• aggregate_exception (const aggregate_exception &other)
• aggregate_exception (aggregate_exception &&other)
• bool has_next () const
• void next ()
• aggregate_exception & operator= (const aggregate_exception &other)
• aggregate_exception & operator= (aggregate_exception &&other)
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.2.1 Member Function Documentation

15.1.1.2.1.1 bool hetcompute::aggregate_exception::has_next ( ) const

Returns whether the aggregate_exception contains more exceptions

Returns

true if it contains more exceptions, false otherwise


15.1.1.2.1.2 void hetcompute::aggregate_exception::next ( )

Throws the next exception contained in the aggregate_exception.


Note

Does nothing if there are no more exceptions to be thrown.
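
The has_next()/next() protocol described above can be sketched in standard C++. This is a minimal stand-in, not the SDK's implementation: the class aggregate_sketch and the helpers capture_out_of_range() and count_out_of_range() are hypothetical names, and the sketch assumes the aggregate simply holds std::exception_ptr values and rethrows them one at a time.

```cpp
#include <cstddef>
#include <exception>
#include <stdexcept>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for illustration; NOT hetcompute::aggregate_exception.
class aggregate_sketch
{
public:
    explicit aggregate_sketch(std::vector<std::exception_ptr> exceptions)
        : _exceptions(std::move(exceptions))
    {
    }

    // True while at least one stored exception has not been rethrown yet.
    bool has_next() const { return _index < _exceptions.size(); }

    // Rethrows the next stored exception; does nothing once all are consumed.
    void next()
    {
        if (has_next())
        {
            std::rethrow_exception(_exceptions[_index++]);
        }
    }

private:
    std::vector<std::exception_ptr> _exceptions;
    std::size_t                     _index = 0;
};

// Captures the std::out_of_range thrown by std::string().at(1).
std::exception_ptr capture_out_of_range()
{
    try
    {
        (void)std::string().at(1);
    }
    catch (...)
    {
        return std::current_exception();
    }
    return std::exception_ptr();
}

// Drains the aggregate and counts how many std::out_of_range it contained.
int count_out_of_range(aggregate_sketch& e)
{
    int count = 0;
    while (e.has_next())
    {
        try
        {
            e.next();
        }
        catch (const std::out_of_range&)
        {
            ++count;
        }
        catch (...)
        {
            // Other exception types are skipped in this sketch.
        }
    }
    return count;
}
```

The drain loop has the same structure as the wait_for example for this class: each call to next() advances past one exception, so the loop terminates even when a handler swallows the error.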

15.1.1.2.1.3 virtual const char∗ hetcompute::aggregate_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.

15.1.1.3 class hetcompute::api_exception

Represents a misuse of the Qualcomm HetCompute API. For example, invalid values passed to a function.
Should cause termination of the application (future releases will behave differently).

Public member functions

• api_exception (std::string msg, const char ∗filename, int lineno, const char ∗funcname)

15.1.1.4 class hetcompute::canceled_exception

Exception thrown to indicate that a task/group was canceled. Thrown at points-of-use such as wait_for,
copy_value, and move_value.

Example

auto t = hetcompute::create_task([]{...});
t->cancel();
t->launch();
try {
    // Point-of-use of task t
    t->wait_for();
} catch (const hetcompute::canceled_exception&) {
    // Do something
} catch (...) {
    // Not reached
}

Public member functions

• HETCOMPUTE_DEFAULT_METHOD (canceled_exception(const canceled_exception &))
• HETCOMPUTE_DEFAULT_METHOD (canceled_exception(canceled_exception &&))
• HETCOMPUTE_DEFAULT_METHOD (canceled_exception &operator=(const canceled_exception &)&)
• HETCOMPUTE_DEFAULT_METHOD (canceled_exception &operator=(canceled_exception &&)&)
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.4.1 Member Function Documentation

15.1.1.4.1.1 virtual const char∗ hetcompute::canceled_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.

15.1.1.5 class hetcompute::dsp_exception

Thrown by the HetCompute runtime if there is a problem with a DSP kernel.

Public member functions

• HETCOMPUTE_DEFAULT_METHOD (dsp_exception(const dsp_exception &))


• HETCOMPUTE_DEFAULT_METHOD (dsp_exception(dsp_exception &&))
• HETCOMPUTE_DEFAULT_METHOD (dsp_exception &operator=(const dsp_exception &)&)
• HETCOMPUTE_DEFAULT_METHOD (dsp_exception &operator=(dsp_exception &&)&)
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.5.1 Member Function Documentation

15.1.1.5.1.1 virtual const char∗ hetcompute::dsp_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.

15.1.1.6 class hetcompute::error_exception

Superclass of all HetCompute-generated exceptions that indicate internal or programmer errors.


Public member functions

• error_exception (std::string msg, const char ∗filename, int lineno, const char ∗fname)
• virtual const char ∗ file () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ function () const HETCOMPUTE_NOEXCEPT
• virtual int line () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ message () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ type () const HETCOMPUTE_NOEXCEPT
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.6.1 Member Function Documentation

15.1.1.6.1.1 virtual const char∗ hetcompute::error_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.

15.1.1.7 class hetcompute::gpu_exception

Thrown by the HetCompute runtime if there is a problem with a GPU kernel.

Public member functions

• HETCOMPUTE_DEFAULT_METHOD (gpu_exception(const gpu_exception &))


• HETCOMPUTE_DEFAULT_METHOD (gpu_exception(gpu_exception &&))
• HETCOMPUTE_DEFAULT_METHOD (gpu_exception &operator=(const gpu_exception &)&)
• HETCOMPUTE_DEFAULT_METHOD (gpu_exception &operator=(gpu_exception &&)&)
• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT

15.1.1.7.1 Member Function Documentation


15.1.1.7.1.1 virtual const char∗ hetcompute::gpu_exception::what ( ) const [virtual]

Returns exception description.

Returns

C string describing the exception.

Implements hetcompute::hetcompute_exception.

15.1.1.8 class hetcompute::hetcompute_exception

Superclass of all Qualcomm HetCompute-generated exceptions.

Public member functions

• virtual ∼hetcompute_exception () HETCOMPUTE_NOEXCEPT


• virtual const char ∗ what () const HETCOMPUTE_NOEXCEPT=0

15.1.1.8.1 Constructors and Destructors

15.1.1.8.1.1 virtual hetcompute::hetcompute_exception::∼hetcompute_exception ( ) [virtual]

Destructor.

15.1.1.8.2 Member Function Documentation

15.1.1.8.2.1 virtual const char∗ hetcompute::hetcompute_exception::what ( ) const [pure virtual]

Returns exception description.

Returns

C string describing the exception.

Implemented in hetcompute::dsp_exception, hetcompute::gpu_exception,
hetcompute::aggregate_exception, hetcompute::canceled_exception, hetcompute::abort_task_exception,
and hetcompute::error_exception.

15.1.1.9 class hetcompute::tls_exception

Indicates that the thread TLS has been misused or become corrupted. Should cause termination of the
application.


Public member functions

• tls_exception (std::string msg, const char ∗filename, int lineno, const char ∗funcname)


15.2 ErrorCodes
Enumerations

• enum hetcompute::hc_error : int {
first, hetcompute::hc_error::HC_Success = first, hetcompute::hc_error::HC_TaskGpuFailure,
hetcompute::hc_error::HC_TaskDspFailure,
hetcompute::hc_error::HC_TaskCanceled, hetcompute::hc_error::HC_TaskAggregateFailure,
hetcompute::hc_error::HC_TaskGenericError, hetcompute::hc_error::HC_GroupCanceled,
hetcompute::hc_error::HC_GroupAggregateFailure, hetcompute::hc_error::HC_GroupGenericError,
last = HC_GroupGenericError }

15.2.1 Enumeration Type Documentation

15.2.1.1 enum hetcompute::hc_error : int [strong]

Error codes returned by HetCompute SDK APIs. These error codes apply only if the application has
disabled exceptions. If the application has enabled exceptions, the HetCompute library throws exceptions
instead of returning error codes.

Enumerator

HC_Success HetCompute API successfully completed.

Error codes pertaining to task execution:
HC_TaskGpuFailure Returned by HetCompute runtime if there is a problem with a GPU kernel.
HC_TaskDspFailure Returned by HetCompute runtime if there is a problem with a DSP kernel.
HC_TaskCanceled Returned by HetCompute runtime if the task was canceled.
HC_TaskAggregateFailure Returned by HetCompute runtime if there were multiple failures in a task
graph.
HC_TaskGenericError Indicates any error not categorized by the above task failures.

Error codes pertaining to group execution:
HC_GroupCanceled Returned by HetCompute runtime to indicate that the group was canceled.
HC_GroupAggregateFailure Returned by HetCompute runtime if there were multiple failures in
groups. One possible scenario where this could be returned is if multiple tasks within a group
encountered runtime errors.
HC_GroupGenericError Indicates any error not categorized by the above group failures.
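
When exceptions are disabled, calling code typically classifies the returned error code. The following is a sketch under assumptions, not SDK code: hc_error_sketch is a local mirror of the enumerator names listed above (the first and last helper enumerators are omitted) so that the example compiles without the SDK headers, and is_task_error(), is_group_error(), and describe() are hypothetical helpers.

```cpp
#include <string>

// Local mirror of the documented enumerator names, for illustration only.
// A real application would use hetcompute::hc_error from the SDK headers.
enum class hc_error_sketch : int
{
    HC_Success,
    HC_TaskGpuFailure,
    HC_TaskDspFailure,
    HC_TaskCanceled,
    HC_TaskAggregateFailure,
    HC_TaskGenericError,
    HC_GroupCanceled,
    HC_GroupAggregateFailure,
    HC_GroupGenericError
};

// True for the codes that pertain to task execution.
bool is_task_error(hc_error_sketch e)
{
    return e >= hc_error_sketch::HC_TaskGpuFailure && e <= hc_error_sketch::HC_TaskGenericError;
}

// True for the codes that pertain to group execution.
bool is_group_error(hc_error_sketch e)
{
    return e >= hc_error_sketch::HC_GroupCanceled && e <= hc_error_sketch::HC_GroupGenericError;
}

// Short human-readable summary, paraphrasing the enumerator descriptions.
std::string describe(hc_error_sketch e)
{
    switch (e)
    {
    case hc_error_sketch::HC_Success:               return "success";
    case hc_error_sketch::HC_TaskGpuFailure:        return "GPU kernel problem";
    case hc_error_sketch::HC_TaskDspFailure:        return "DSP kernel problem";
    case hc_error_sketch::HC_TaskCanceled:          return "task canceled";
    case hc_error_sketch::HC_TaskAggregateFailure:  return "multiple failures in task graph";
    case hc_error_sketch::HC_TaskGenericError:      return "other task error";
    case hc_error_sketch::HC_GroupCanceled:         return "group canceled";
    case hc_error_sketch::HC_GroupAggregateFailure: return "multiple failures in groups";
    case hc_error_sketch::HC_GroupGenericError:     return "other group error";
    }
    return "unknown";
}
```

The range checks rely on the declaration order shown in the enumeration; code written against the real header should prefer explicit comparisons if that order is not guaranteed.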

16 Affinity Management API

The Qualcomm HetCompute affinity API enables the programmer to request which CPU cores should
execute tasks.
Using the affinity API requires including the following header file:
#include <hetcompute/affinity.hh>

For a detailed description of all the APIs, see Section 16.1, Affinity Settings.
Note

The current version sets the affinity only for the CPU cores; the GPU and the Hexagon DSP are excluded.


16.1 Affinity Settings


Classes

• struct hetcompute_affinity_settings_t
• class hetcompute::affinity::settings

Typedefs

• typedef void(∗ hetcompute_func_ptr_t )(void ∗data)

Enumerations

• enum hetcompute::affinity::cores { hetcompute::affinity::cores::all, hetcompute::affinity::cores::big,
hetcompute::affinity::cores::little }
• enum hetcompute_affinity_cores_t { hetcompute_affinity_cores_all = 0,
hetcompute_affinity_cores_big, hetcompute_affinity_cores_little }
• enum hetcompute_affinity_mode_t { hetcompute_affinity_mode_allow_local_setting = 0,
hetcompute_affinity_mode_override_local_setting }
• enum hetcompute_affinity_pin_threads_t { hetcompute_affinity_pin_threads_false = 0,
hetcompute_affinity_pin_threads_true }
• enum hetcompute::affinity::mode { hetcompute::affinity::mode::allow_local_setting,
hetcompute::affinity::mode::override_local_setting }

Functions

• hetcompute::affinity::settings::settings (cores cores_attribute, bool pin_threads, mode
md=::hetcompute::affinity::mode::allow_local_setting)
• hetcompute::affinity::settings::∼settings ()
• template<typename Function , typename... Args>
void hetcompute::affinity::execute (hetcompute::affinity::settings desired_aff, Function &&f,
Args...args)
• settings hetcompute::affinity::get ()
• cores hetcompute::affinity::settings::get_cores () const
• mode hetcompute::affinity::settings::get_mode () const
• hetcompute::affinity::settings hetcompute::internal::affinity::get_non_local_affinity_settings ()
• bool hetcompute::affinity::settings::get_pin_threads () const
• void hetcompute_affinity_execute (const hetcompute_affinity_settings_t desired_aff,
hetcompute_func_ptr_t f, void ∗args)
• hetcompute_affinity_settings_t hetcompute_affinity_get ()
• void hetcompute_affinity_reset ()
• void hetcompute_affinity_set (const hetcompute_affinity_settings_t as)


• bool hetcompute::internal::soc::is_big_little_cpu ()
• bool hetcompute::internal::soc::is_this_big_core ()
• bool hetcompute::affinity::settings::operator!= (const settings &rhs) const
• bool hetcompute::affinity::settings::operator== (const settings &rhs) const
• void hetcompute::affinity::reset ()
• void hetcompute::affinity::settings::reset_pin_threads ()
• void hetcompute::affinity::set (const settings as)
• void hetcompute::affinity::settings::set_cores (cores cores_attribute)
• void hetcompute::affinity::settings::set_mode (mode md)
• void hetcompute::affinity::settings::set_pin_threads ()

16.1.1 Class Documentation

16.1.1.1 struct hetcompute_affinity_settings_t

Data fields

Type                                 Field          Description
hetcompute_affinity_cores_t          cores          Group of cores to set the affinity to
hetcompute_affinity_mode_t           mode           Affinity mode (see hetcompute_affinity_mode_t)
hetcompute_affinity_pin_threads_t    pin_threads    Pin threads to individual cores

16.1.1.2 class hetcompute::affinity::settings

Affinity settings class.

This class defines the affinity conditions desired by the programmer.

Public member functions

• settings (cores cores_attribute, bool pin_threads, mode
md=::hetcompute::affinity::mode::allow_local_setting)
• ∼settings ()
• cores get_cores () const
• mode get_mode () const
• bool get_pin_threads () const
• bool operator!= (const settings &rhs) const
• bool operator== (const settings &rhs) const


• void reset_pin_threads ()
• void set_cores (cores cores_attribute)
• void set_mode (mode md)
• void set_pin_threads ()

16.1.2 Typedef Documentation

16.1.2.1 typedef void(∗ hetcompute_func_ptr_t)(void ∗data)

Function pointer type to pass to hetcompute_affinity_execute()


See Also

hetcompute_affinity_execute()
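
Because the callback receives a single void ∗ argument, several inputs and a result are usually packed into one struct. The sketch below shows that idiom in standard C++ under stated assumptions: func_ptr_sketch_t is a local typedef with the same shape as hetcompute_func_ptr_t, scale_args and scale() are hypothetical, and execute_sketch() is a stand-in that only calls the function; it applies no affinity.

```cpp
// Same shape as hetcompute_func_ptr_t, declared locally so the sketch
// compiles without the SDK headers.
typedef void (*func_ptr_sketch_t)(void* data);

// Hypothetical argument bundle: inputs and output travel through one pointer.
struct scale_args
{
    int input;
    int factor;
    int result;
};

// Callback matching the function pointer type; unpacks the bundle.
void scale(void* data)
{
    scale_args* a = static_cast<scale_args*>(data);
    a->result     = a->input * a->factor;
}

// Stand-in for an API that takes a callback plus an opaque pointer.
// It only invokes f(args) and returns when f completes.
void execute_sketch(func_ptr_sketch_t f, void* args)
{
    f(args);
}
```

The argument struct must outlive the call; since the documented call returns upon completion of the function, a stack-allocated struct is sufficient in this pattern.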

16.1.3 Enumeration Type Documentation

16.1.3.1 enum hetcompute::affinity::cores [strong]

C++ Affinity API

Enumeration type to select the cores to which affinity settings apply in a big.LITTLE system. In
homogeneous systems, all is always used.

See Also

include/hetcompute/affinity.h for the C Affinity API

Enumerator

all Use all SoC cores to set the affinity
big Set all threads to be eligible to run in the big cluster of the SoC
little Set all threads to be eligible to run in the LITTLE cluster of the SoC

16.1.3.2 enum hetcompute_affinity_cores_t

C Affinity API
See Also

include/hetcompute/affinity.hh for the C++ Affinity API

Enumerator

hetcompute_affinity_cores_all Use all SoC cores to set the affinity
hetcompute_affinity_cores_big Set all threads to be eligible to run in the big cluster of the SoC
hetcompute_affinity_cores_little Set all threads to be eligible to run in the LITTLE cluster of the SoC

16.1.3.3 enum hetcompute_affinity_mode_t


Enumerator

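hetcompute_affinity_mode_allow_local_setting Set a default affinity for all CPU tasks for which
affinity was not specified. For example, if the user sets the affinity in allow_local_setting mode to
big, the big cores will execute all tasks except those marked as little
hetcompute_affinity_mode_override_local_setting Set the affinity for all CPU tasks regardless of
local task/scope settings (see hetcompute::affinity::mode for the equivalent C++ enumerator)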

16.1.3.4 enum hetcompute_affinity_pin_threads_t


Enumerator

hetcompute_affinity_pin_threads_false Do not pin threads to individual cores
hetcompute_affinity_pin_threads_true Pin threads to individual cores

16.1.3.5 enum hetcompute::affinity::mode [strong]

Enumeration type to select the affinity mode in big.LITTLE systems. In homogeneous systems, mode, like
cores, is ignored.

Enumerator

allow_local_setting Set a default affinity for all cpu tasks for which affinity was not specified. For
example, if the user sets the affinity in allow_local_setting mode to big, the big cores will execute
all tasks except those marked as little
override_local_setting Set the affinity for all cpu tasks regardless of local task/scope settings. For
example, if the user sets the affinity to little in override_local_setting mode, the little cores will
execute all cpu tasks including those marked as big
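
The two modes can be read as a small decision rule. The following sketch models that rule in standard C++; it is an interpretation of the descriptions above, not the runtime's scheduler. All names are hypothetical: cores_sketch and mode_sketch mirror the documented enumerators, and kernel_attr_sketch represents whether a kernel was marked with set_big(), set_little(), or neither.

```cpp
// Local mirrors of the documented enumerators, for illustration only.
enum class cores_sketch { all, big, little };
enum class mode_sketch { allow_local_setting, override_local_setting };

// Hypothetical: the per-kernel marking (none, set_big(), or set_little()).
enum class kernel_attr_sketch { none, big, little };

// Returns the cores a CPU task is eligible for under this reading of the modes:
//  - override_local_setting: the global setting always wins;
//  - allow_local_setting: the kernel attribute wins when present, otherwise
//    the global setting acts as the default.
cores_sketch effective_cores(cores_sketch global, mode_sketch md, kernel_attr_sketch attr)
{
    if (md == mode_sketch::override_local_setting || attr == kernel_attr_sketch::none)
    {
        return global;
    }
    return attr == kernel_attr_sketch::big ? cores_sketch::big : cores_sketch::little;
}
```

For instance, with the global affinity set to big in allow_local_setting mode, a kernel marked little still resolves to little, while the same kernel resolves to big once the mode is override_local_setting.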

16.1.4 Function Documentation

16.1.4.1 hetcompute::affinity::settings::settings ( cores cores_attribute, bool pin_threads, mode md = ::hetcompute::affinity::mode::allow_local_setting ) [explicit]

Constructor with cores and pin_threads arguments

Parameters

in cores_attribute Type of cores: hetcompute::affinity::cores::all,
hetcompute::affinity::cores::big, or hetcompute::affinity::cores::little
in pin_threads If true, enable pinning
in md Operation mode: hetcompute::affinity::mode::allow_local_setting or
hetcompute::affinity::mode::override_local_setting

16.1.4.2 hetcompute::affinity::settings::∼settings ( )

Destructor


16.1.4.3 template<typename Function , typename... Args> void hetcompute::affinity::execute ( hetcompute::affinity::settings desired_aff, Function && f, Args... args )

Execute function/lambda/function-object with args, while enforcing desired affinity (big/LITTLE/all).


Note

Call returns upon completion of function f

Parameters

desired_aff desired affinity for execution


f function to execute; may be lambda/function/function-object
args arguments to pass to function f for execution

16.1.4.4 settings hetcompute::affinity::get ( )

Return the current affinity settings.

16.1.4.5 cores hetcompute::affinity::settings::get_cores ( ) const

Return the cores affinity member

16.1.4.6 mode hetcompute::affinity::settings::get_mode ( ) const

Return the mode member

16.1.4.7 bool hetcompute::affinity::settings::get_pin_threads ( ) const

Return the pin affinity member

Returns

true–Device threads are pinned.


false–Device threads are not pinned.

16.1.4.8 void hetcompute_affinity_execute ( const hetcompute_affinity_settings_t desired_aff, hetcompute_func_ptr_t f, void ∗ args )

Execute function with args, while enforcing desired affinity (big/LITTLE/all).


Note

Call returns upon completion of function f

Parameters

desired_aff desired affinity for execution
f function to execute
args arguments to pass to function f for execution

16.1.4.9 hetcompute_affinity_settings_t hetcompute_affinity_get ( )

Return the current affinity settings.

16.1.4.10 void hetcompute_affinity_reset ( )

Reset the thread pool affinity so that all Qualcomm HetCompute threads can run in any core of the system.

16.1.4.11 void hetcompute_affinity_set ( const hetcompute_affinity_settings_t as )

Set the affinity of all Qualcomm HetCompute runtime pool threads.

A successful call to this function sets the affinity of all runtime threads to the requested settings.
The HetCompute affinity settings control three knobs:

• cores: Set where to run the task.
• pinning: Set whether threads can migrate among cores.
• mode: Set whether kernel affinity attributes are honored. When mode is allow_local_setting
(called normal here), the set_big() and set_little() kernel attributes are respected; when mode is
override_local_setting (called force here), they are ignored. Force mode can be useful when the
programmer wants to guarantee that certain cores are not used, e.g., to leave the little cores for an
audio library in a game. The default mode for HetCompute is normal.

For example, to run all Qualcomm HetCompute threads on the big cores of the SoC while allowing kernel
attributes, call this function with a settings object whose cores, pin, and mode equal big, false, and normal,
respectively. To enable pinning in a system with homogeneous cores, call with all and true; the mode is
then normal, since it is the default.

Parameters

in as Affinity settings object.

Example 1 Setting the affinity with force and normal modes

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings =
        hetcompute::affinity::settings(hetcompute::affinity::cores::big, false,
                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    auto g = hetcompute::create_group(__FUNCTION__);

    auto k_wout_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task without kernel affinity attribute."); });

    auto k_with_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task with kernel affinity attribute"); });
    k_with_attrib.set_little();

    // k_with_attrib kernel will run in a LITTLE core
    g->launch(k_with_attrib);

    // k_wout_attrib can run in any core
    g->launch(k_wout_attrib);

    g->wait_for();

    // Set the affinity to the LITTLE cores without pinning in
    // allow_local_setting mode
    hetcompute::affinity::set(
        hetcompute::affinity::settings(hetcompute::affinity::cores::little, false,
                                       hetcompute::affinity::mode::allow_local_setting));

    // k_wout_attrib task will run in a LITTLE core because the kernel has no
    // individual affinity specification
    g->launch(k_wout_attrib);

    // Set the affinity to the big cores with pinning in override_local_setting
    // mode by reading the current affinity and then updating the different fields
    auto affinity = hetcompute::affinity::get();

    // Update the cores from LITTLE to big
    affinity.set_cores(hetcompute::affinity::cores::big);

    // Enable thread pinning
    affinity.set_pin_threads();

    // Update the mode from allow_local_setting to override_local_setting in the
    // settings
    affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

    // Update the affinity with the modified affinity object
    hetcompute::affinity::set(affinity);

    // The second run of k_with_attrib will run on a big core because the
    // affinity mode is override_local_setting and global affinity settings are
    // obeyed
    g->launch(k_with_attrib);

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

16.1.4.12 bool hetcompute::internal::soc::is_this_big_core ( )

Is the core on which the calling thread is running a big core?

Returns

true if big core, false if LITTLE core


16.1.4.13 bool hetcompute::affinity::settings::operator!= ( const settings & rhs ) const

Inequality operator to check for different settings objects

Returns

true–Compared settings are different.


false–Compared settings are equal.

16.1.4.14 bool hetcompute::affinity::settings::operator== ( const settings & rhs ) const

Equality operator to compare settings object

Returns

true–Compared settings are equal.


false–Compared settings are different.

16.1.4.15 void hetcompute::affinity::reset ( )

Reset the thread pool affinity so that all Qualcomm HetCompute threads can run in any core of the system.

16.1.4.16 void hetcompute::affinity::settings::reset_pin_threads ( )

Reset the pin affinity member

16.1.4.17 void hetcompute::affinity::set ( const settings as )

Set the affinity of all Qualcomm HetCompute runtime pool threads.

A successful call to this function sets the affinity of all runtime threads to the requested settings.
The HetCompute affinity settings control three knobs:

• cores: Set where to run the task.
• pinning: Set whether threads can migrate among cores.
• mode: Set whether kernel affinity attributes are honored. When mode is allow_local_setting
(called normal here), the set_big() and set_little() kernel attributes are respected; when mode is
override_local_setting (called force here), they are ignored. Force mode can be useful when the
programmer wants to guarantee that certain cores are not used, e.g., to leave the little cores for an
audio library in a game. The default mode for HetCompute is normal.

For example, to run all Qualcomm HetCompute threads on the big cores of the SoC while allowing kernel
attributes, call this function with a settings object whose cores, pin, and mode equal big, false, and normal,
respectively. To enable pinning in a system with homogeneous cores, call with all and true; the mode is
then normal, since it is the default.


Parameters

in as Affinity settings object.

Example 1 Setting the affinity with force and normal modes

#include <hetcompute/hetcompute.hh>

int
main()
{
    hetcompute::runtime::init();

    auto fn = [](int i) { HETCOMPUTE_ILOG("Function executed with specified affinity on arg %d", i); };
    auto aff_settings =
        hetcompute::affinity::settings(hetcompute::affinity::cores::big, false,
                                       hetcompute::affinity::mode::allow_local_setting);
    // In a big.LITTLE SoC, function fn executes on a big core.
    hetcompute::affinity::execute(aff_settings, fn, 42);

    auto g = hetcompute::create_group(__FUNCTION__);

    auto k_wout_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task without kernel affinity attribute."); });

    auto k_with_attrib = hetcompute::create_cpu_kernel([] { HETCOMPUTE_ILOG("Task with kernel affinity attribute"); });
    k_with_attrib.set_little();

    // k_with_attrib kernel will run in a LITTLE core
    g->launch(k_with_attrib);

    // k_wout_attrib can run in any core
    g->launch(k_wout_attrib);

    g->wait_for();

    // Set the affinity to the LITTLE cores without pinning in
    // allow_local_setting mode
    hetcompute::affinity::set(
        hetcompute::affinity::settings(hetcompute::affinity::cores::little, false,
                                       hetcompute::affinity::mode::allow_local_setting));

    // k_wout_attrib task will run in a LITTLE core because the kernel has no
    // individual affinity specification
    g->launch(k_wout_attrib);

    // Set the affinity to the big cores with pinning in override_local_setting
    // mode by reading the current affinity and then updating the different fields
    auto affinity = hetcompute::affinity::get();

    // Update the cores from LITTLE to big
    affinity.set_cores(hetcompute::affinity::cores::big);

    // Enable thread pinning
    affinity.set_pin_threads();

    // Update the mode from allow_local_setting to override_local_setting in the
    // settings
    affinity.set_mode(hetcompute::affinity::mode::override_local_setting);

    // Update the affinity with the modified affinity object
    hetcompute::affinity::set(affinity);

    // The second run of k_with_attrib will run on a big core because the
    // affinity mode is override_local_setting and global affinity settings are
    // obeyed
    g->launch(k_with_attrib);

    g->wait_for();

    hetcompute::runtime::shutdown();
    return 0;
}

16.1.4.18 void hetcompute::affinity::settings::set_cores ( cores cores_attribute )

Sets the cores affinity member.

Parameters

in cores_attribute Type of desired cores.

16.1.4.19 void hetcompute::affinity::settings::set_mode ( mode md )

Sets the mode member.

When the mode is force (override_local_setting in Example 1), all tasks are executed by the CPUs selected by the current affinity settings. When the mode is normal (allow_local_setting in Example 1), CPU tasks run on that same set of CPUs by default, except when the user has manually set the kernel affinity with set_big() or set_little().
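
The rule described above can be modeled in a few lines of standalone C++. This is only a sketch and not SDK code: cores, mode, kernel_attr, effective_cores, and replay_example_1 are hypothetical stand-ins that borrow the names used in Example 1, and the assumption that the default global setting is cores::all is inferred from the comment "can run in any core" rather than stated by the SDK.

```cpp
#include <cassert>

// Hypothetical stand-ins for the SDK enumerations (illustration only).
enum class cores { all, big, little };
enum class mode { allow_local_setting, override_local_setting };

// Models a kernel's own affinity attribute: none, set_big(), or set_little().
enum class kernel_attr { none, big, little };

// Returns the core type a CPU task would be scheduled on under this model.
inline cores effective_cores(cores global, mode md, kernel_attr attr)
{
    // override_local_setting (force): the global setting always wins.
    // With no kernel attribute, the global setting is also used.
    if (md == mode::override_local_setting || attr == kernel_attr::none)
    {
        return global;
    }
    // allow_local_setting (normal): the kernel attribute takes precedence.
    return attr == kernel_attr::big ? cores::big : cores::little;
}

// Replays the four launches of Example 1 against the model.
// Assumption: the global setting before the first set() call is cores::all.
inline void replay_example_1()
{
    assert(effective_cores(cores::all, mode::allow_local_setting, kernel_attr::little) == cores::little);
    assert(effective_cores(cores::all, mode::allow_local_setting, kernel_attr::none) == cores::all);
    assert(effective_cores(cores::little, mode::allow_local_setting, kernel_attr::none) == cores::little);
    assert(effective_cores(cores::big, mode::override_local_setting, kernel_attr::little) == cores::big);
}
```

The model ignores thread pinning, which restricts a thread to a single core but does not change which core type is selected.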

16.1.4.20 void hetcompute::affinity::settings::set_pin_threads ( )

Sets the pin affinity member.

When the pin is set, each hetcompute thread is pinned to a single core.

17 Miscellaneous

17.1 Interoperability

17.2 Legacy
Functions

• void hetcompute::runtime::init ()
• void hetcompute::runtime::shutdown ()

17.2.1 Function Documentation

17.2.1.1 void hetcompute::runtime::init ( )

Starts up HetCompute SDK’s runtime.


Initializes HetCompute internal data structures, tasks, schedulers, and thread pools.

17.2.1.2 void hetcompute::runtime::shutdown ( )

Shuts down HetCompute SDK’s runtime.


Shuts down the runtime. It returns only when all running tasks have finished.
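
Because init() and shutdown() must be paired, one possible way to keep them balanced is a small scope guard. The sketch below is an illustration under assumptions, not part of the SDK: runtime_guard and run_with_runtime are hypothetical names, and fake_runtime is a stand-in so the example compiles without the SDK headers. In real code the two calls would be hetcompute::runtime::init() and hetcompute::runtime::shutdown().

```cpp
#include <cassert>

// Stand-in for the SDK runtime (hypothetical, for illustration only).
namespace fake_runtime
{
    // Number of init() calls that have not yet been matched by shutdown().
    inline int& live_count()
    {
        static int n = 0;
        return n;
    }

    inline void init() { ++live_count(); }

    inline void shutdown()
    {
        assert(live_count() > 0); // shutdown() without a prior init() is a bug
        --live_count();
    }
} // namespace fake_runtime

// Hypothetical helper: calls init() on construction and shutdown() on
// destruction, so every exit path from the scope shuts the runtime down.
class runtime_guard
{
public:
    runtime_guard() { fake_runtime::init(); }
    ~runtime_guard() { fake_runtime::shutdown(); }

    // Non-copyable: a copy would shut the runtime down twice.
    runtime_guard(runtime_guard const&) = delete;
    runtime_guard& operator=(runtime_guard const&) = delete;
};

// Returns the number of live runtimes observed while the guard is in scope.
inline int run_with_runtime()
{
    runtime_guard guard;
    // Tasks would be created, launched, and waited for here.
    return fake_runtime::live_count();
}
```

Since shutdown() returns only when all running tasks have finished, a guard like this makes the destructor a blocking call, which is worth keeping in mind when choosing the scope it lives in.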

18 Class Documentation

18.1 cpu_kernel Class Reference


The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/cpukernel.hh

18.2 dsp_kernel Class Reference


The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/dspkernel.hh

18.3 HetComputeApp::features Class Reference


Static Public Member Functions

• static bool supportException ()


The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/runtime.hh

18.4 hetcompute::beta::pattern::pipeline< UserData > Class Template Reference

Heterogeneous Pipeline class.

Public Types

• using context = typename parent_type::context


Context type for the pipeline.

Public Member Functions

• pipeline ()
Constructor.

• pipeline (pipeline const &other)


Copy constructor.

• pipeline (pipeline &&other)


Move constructor.

• virtual ~pipeline ()
Destructor.

• template<typename... Args> std::enable_if<!internal::pipeline_utility::check_gpu_kernel< Args...>::has_gpu_kernel, void >::type add_stage (Args &&...args)
Add a CPU stage.

• template<typename... Args> std::enable_if< internal::pipeline_utility::check_gpu_kernel< Args...>::has_gpu_kernel, void >::type add_stage (Args &&...args)
Add a GPU stage.

• pipeline & operator= (pipeline const &other)


Copy assignment operator.

• pipeline & operator= (pipeline &&other)


Move assignment operator.

template<typename... UserData> class hetcompute::beta::pattern::pipeline< UserData >

Heterogeneous Pipeline class.


Template Parameters

UserData The type of the pipeline context data, or empty, e.g.,
hetcompute::pattern::pipeline<size_t> or
hetcompute::pattern::pipeline<>.

18.4.1 Member Typedef Documentation

18.4.1.1 template<typename... UserData> using hetcompute::beta::pattern::pipeline< UserData >::context = typename parent_type::context

Context type for the pipeline.

18.4.2 Constructors and Destructors

18.4.2.1 template<typename... UserData> hetcompute::beta::pattern::pipeline< UserData >::pipeline ( )

Constructor.

18.4.2.2 template<typename... UserData> virtual hetcompute::beta::pattern::pipeline< UserData >::~pipeline ( ) [virtual]

Destructor.
Reimplemented from hetcompute::pattern::pipeline< UserData...>.

18.4.2.3 template<typename... UserData> hetcompute::beta::pattern::pipeline< UserData >::pipeline ( pipeline< UserData > const & other )

Copy constructor.

18.4.2.4 template<typename... UserData> hetcompute::beta::pattern::pipeline< UserData >::pipeline ( pipeline< UserData > && other )

Move constructor.

18.4.3 Member Function Documentation

18.4.3.1 template<typename... UserData> template<typename... Args> std::enable_if<!internal::pipeline_utility::check_gpu_kernel<Args...>::has_gpu_kernel, void>::type hetcompute::beta::pattern::pipeline< UserData >::add_stage ( Args &&... args )

Add a CPU stage.

Parameters

args The features of the CPU stage.

See Also

template<typename... Args> void add_cpu_stage(Args&&... args)

18.4.3.2 template<typename... UserData> template<typename... Args> std::enable_if<internal::pipeline_utility::check_gpu_kernel<Args...>::has_gpu_kernel, void>::type hetcompute::beta::pattern::pipeline< UserData >::add_stage ( Args &&... args )

Add a GPU stage.

Parameters

args The features of the GPU stage.

See Also

template<typename... Args> void add_gpu_stage(Args&&... args)
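
The two add_stage overloads share one parameter list and are distinguished only by the std::enable_if condition in their return types. The standalone sketch below shows how that kind of selection works. It is an illustration under assumptions, not the SDK's implementation: cpu_kernel_tag, gpu_kernel_tag, toy_pipeline, and demo_stage_dispatch are hypothetical, and this toy check_gpu_kernel trait only mimics the shape of internal::pipeline_utility::check_gpu_kernel, whose real definition is not shown in this document.

```cpp
#include <cassert>
#include <string>
#include <type_traits>

// Hypothetical kernel tags standing in for CPU and GPU kernel types.
struct cpu_kernel_tag {};
struct gpu_kernel_tag {};

// Toy trait: has_gpu_kernel is true when any argument type, after removing
// references and cv-qualifiers, is gpu_kernel_tag.
template <typename... Args>
struct check_gpu_kernel;

template <>
struct check_gpu_kernel<>
{
    static constexpr bool has_gpu_kernel = false;
};

template <typename First, typename... Rest>
struct check_gpu_kernel<First, Rest...>
{
    static constexpr bool has_gpu_kernel =
        std::is_same<typename std::decay<First>::type, gpu_kernel_tag>::value ||
        check_gpu_kernel<Rest...>::has_gpu_kernel;
};

class toy_pipeline
{
public:
    // Participates in overload resolution only when no argument is a GPU kernel.
    template <typename... Args>
    typename std::enable_if<!check_gpu_kernel<Args...>::has_gpu_kernel, void>::type
    add_stage(Args&&...)
    {
        _stages += 'C';
    }

    // Participates in overload resolution only when some argument is a GPU kernel.
    template <typename... Args>
    typename std::enable_if<check_gpu_kernel<Args...>::has_gpu_kernel, void>::type
    add_stage(Args&&...)
    {
        _stages += 'G';
    }

    // One character per stage, in insertion order: 'C' for CPU, 'G' for GPU.
    std::string const& stages() const { return _stages; }

private:
    std::string _stages;
};

// Adds a CPU stage, a GPU stage, and another CPU stage, then reports the order.
inline std::string demo_stage_dispatch()
{
    toy_pipeline p;
    p.add_stage(cpu_kernel_tag{}, 4); // no GPU kernel among the arguments
    gpu_kernel_tag gk;
    p.add_stage(gk);                  // lvalue GPU kernel, detected via std::decay
    p.add_stage(2.0);                 // no GPU kernel among the arguments
    return p.stages();
}
```

Because the two conditions are exact negations of each other, exactly one overload is viable for any argument list, so the call is never ambiguous.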

18.4.3.3 template<typename... UserData> pipeline& hetcompute::beta::pattern::pipeline< UserData >::operator= ( pipeline< UserData > const & other )

Copy assignment operator.

18.4.3.4 template<typename... UserData> pipeline& hetcompute::beta::pattern::pipeline< UserData >::operator= ( pipeline< UserData > && other )

Move assignment operator.


The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/pipeline.hh

18.5 hetcompute::internal::pointkernel::pointkernel< RT, Args > Class Template Reference

The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/pfor_each.hh

18.6 stage_input_base Class Reference


The documentation for this class was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/pipelinedata.hh

18.7 hetcompute::internal::task_factory< X, Y, Z > Struct Template Reference

The documentation for this struct was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/cpukernel.hh

18.8 hetcompute::internal::task_factory_dispatch< X, Y > Struct Template Reference

The documentation for this struct was generated from the following file:
• /het-compute-sdk/core/include/hetcompute/cpukernel.hh

Index
∼dsp_kernel get, 435
hetcompute::dsp_kernel< int(∗)(Args...)>, 264 get_cores, 435
∼hetcompute_exception get_mode, 435
hetcompute::hetcompute_exception, 427 get_pin_threads, 435
∼pipeline hetcompute_affinity_cores_big, 433
hetcompute::beta::pattern::pipeline, 445 hetcompute_affinity_cores_little, 433
hetcompute::pattern::pipeline, 204 hetcompute_affinity_mode_override_local_-
∼pipeline_context setting,
hetcompute::pipeline_context< UserData >, 434
212 hetcompute_affinity_pin_threads_true, 434
hetcompute::pipeline_context<>, 213 hetcompute_affinity_cores_t, 433
∼pipeline_context_base hetcompute_affinity_execute, 435
hetcompute::pipeline_context_base, 214 hetcompute_affinity_get, 436
∼settings hetcompute_affinity_mode_t, 433
Affinity Settings, 434 hetcompute_affinity_pin_threads_t, 434
∼stage_input hetcompute_affinity_reset, 436
hetcompute::stage_input, 220 hetcompute_affinity_set, 436
∼task_ptr hetcompute_func_ptr_t, 433
hetcompute::task_ptr< ReturnType >, 319 is_this_big_core, 437
hetcompute::task_ptr< ReturnType(Args...)>, little, 433
325 mode, 434
hetcompute::task_ptr< void >, 328 operator==, 438
hetcompute::task_ptr<>, 331 override_local_setting, 434
reset, 438
abort_on_cancel reset_pin_threads, 438
Tasks, 335 set, 438
abort_task set_cores, 440
Tasks, 337 set_mode, 440
acquire_ro set_pin_threads, 440
hetcompute::buffer_ptr, 372 settings, 434
acquire_rw all
hetcompute::buffer_ptr, 373 Affinity Settings, 433
acquire_wi allow_local_setting
hetcompute::buffer_ptr, 374 Affinity Settings, 434
add args_tuple
hetcompute::device_set, 362 hetcompute::task< ReturnType(Args...)>, 302
hetcompute::group, 232, 233 hetcompute::task_ptr< ReturnType(Args...)>,
add_stage 324
hetcompute::beta::pattern::pipeline, 446 arity
addressing_mode hetcompute::task< ReturnType(Args...)>, 304
Texture Data Types, 399 hetcompute::task_ptr< ReturnType(Args...)>,
Affinity Management API, 430 326
Affinity Settings, 431 at
∼settings, 434 hetcompute::buffer_ptr, 375
all, 433
allow_local_setting, 434 begin
big, 433 hetcompute::buffer_ptr, 375
cores, 433 hetcompute::range_base, 289, 290
execute, 434 big

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 448
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

Affinity Settings, 433 create_buffer


bind_all Buffers, 382, 383
hetcompute::task< ReturnType(Args...)>, 302 create_cpu_kernel
bind_as_data_dependency Kernels, 268, 269
Tasks, 339 create_derivative_texture
bind_by_value Texture APIs, 395
Tasks, 339 create_dsp_kernel
blocking Kernels, 269
Tasks, 339 create_gpu_kernel
Bounded Lock-Free Queue, 402 Kernels, 269–271
bounded_lfqueue, 403 create_group
pop, 403 Groups, 250–252
push, 403 create_pdivide_and_conquer
bounded_lfqueue hetcompute::pattern::pdivide_and_conquerer,
Bounded Lock-Free Queue, 403 188
buffer_ptr Parallel Divide-and-Conquer, 188
hetcompute::buffer_ptr, 372 create_pfor_each
Buffers, 367 Parallel For Loop, 165
create_buffer, 382, 383 create_preduce
Buffers Reference API, 359 hetcompute::pattern::preducer, 178
Parallel Reduction, 178
cancel create_pscan_inclusive
hetcompute::group, 234 hetcompute::pattern::pscan, 184
hetcompute::task<>, 306 Parallel Scan, 185
cancel_pipeline create_psort
hetcompute::pipeline_context_base, 214 hetcompute::pattern::psorter, 196
canceled Parallel Sorting, 196
hetcompute::group, 236 create_ptransform
hetcompute::task<>, 309 hetcompute::pattern::ptransformer, 171
cbegin Parallel Transformation, 172
hetcompute::buffer_ptr, 375 create_sampler
cend Texture APIs, 395
hetcompute::buffer_ptr, 376 create_task
cl hetcompute::pattern::pipeline, 205, 206
Kernels, 272 Tasks, 340, 341
collapsed_task_type create_texture
Tasks, 335 Texture APIs, 396
const_iterator create_value_task
hetcompute::buffer_ptr, 371 Tasks, 342
context
hetcompute::beta::pattern::pipeline, 445 data
hetcompute::pattern::pipeline, 204 hetcompute::index_base, 275
copy_value Data Sharing and Storage Reference API, 407
hetcompute::task< ReturnType >, 300 Data Sharing Synchronization, 408
cores Data Structures Reference API, 401
Affinity Settings, 433 data_type
cpu_kernel, 444 hetcompute::buffer_ptr, 371
hetcompute::cpu_kernel, 260 device_set
hetcompute::cpu_kernel< hetcompute::device_set, 361
FReturnType(FArgs...)>, 262 device_type

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 449
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

Heterogeneous Compute Device Types, 365 hetcompute::task_ptr< void >, 328


dims hetcompute::task_ptr<>, 331
hetcompute::range_base, 290 hetcompute::task_storage_ptr, 417
disable_sliding_window get_chunk_size
hetcompute::pattern::pipeline, 209 hetcompute::pattern::tuner, 224
do_not_collapse get_cl_kernel_binary
Tasks, 357 hetcompute::gpu_kernel, 267
dsp_kernel, 444 get_cores
hetcompute::dsp_kernel< int(∗)(Args...)>, 264 Affinity Settings, 435
get_cpu_load
empty hetcompute::pattern::tuner, 224
hetcompute::device_set, 362 get_data
enable_sliding_window hetcompute::pipeline_context< UserData >,
hetcompute::pattern::pipeline, 209 212
end get_degree_of_concurrency
hetcompute::buffer_ptr, 376 hetcompute::parallel_stage, 203
hetcompute::range_base, 290 get_doc
ErrorCodes hetcompute::pattern::tuner, 224
HC_GroupAggregateFailure, 429 get_dsp_load
HC_GroupCanceled, 429 hetcompute::pattern::tuner, 224
HC_GroupGenericError, 429 get_fd
HC_Success, 429 Memory Regions, 391
HC_TaskAggregateFailure, 429 get_first_elem_iter_id
HC_TaskCanceled, 429 hetcompute::stage_input, 220
HC_TaskDspFailure, 429 get_gpu_load
HC_TaskGenericError, 429 hetcompute::pattern::tuner, 224
HC_TaskGpuFailure, 429 get_id
ErrorCodes, 429 Memory Regions, 391
hc_error, 429 get_iter_id
Exceptions, 422 hetcompute::pipeline_context_base, 216
Exceptions Reference API, 421 get_iter_lag
execute hetcompute::iteration_lag, 200
Affinity Settings, 434 get_iter_rate_curr
extended_format_plane_type hetcompute::iteration_rate, 201
Texture Data Types, 399 get_iter_rate_pred
hetcompute::iteration_rate, 202
filter_mode
get_ith_element
Texture Data Types, 399
hetcompute::stage_input, 220
finish_after
get_max_stage_iter
Groups, 252
hetcompute::pipeline_context_base, 216
hetcompute::group, 236
get_mode
hetcompute::task<>, 311
Affinity Settings, 435
Tasks, 344
get_name
get hetcompute::group, 238
Affinity Settings, 435 get_num_bytes
hetcompute::group_ptr, 247 Memory Regions, 392
hetcompute::scoped_storage_ptr, 413 get_pin_threads
hetcompute::task_ptr< ReturnType >, 319 Affinity Settings, 435
hetcompute::task_ptr< ReturnType(Args...)>, get_ptr
325 Memory Regions, 392

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 450
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

get_size hetcompute::aggregate_exception, 422


hetcompute::sliding_window_size, 219 hetcompute::api_exception, 424
get_stage_id hetcompute::beta::call_tuple, 257
hetcompute::pipeline_context_base, 216 hetcompute::beta::call_tuple< Dim, gpu_kernel<
get_type Args...> >, 258
hetcompute::serial_stage, 218 hetcompute::beta::cl_t, 258
gl hetcompute::beta::gl_t, 264
Kernels, 272 hetcompute::bounded_lfqueue, 402
glbuffer_memregion hetcompute::buffer_const_iterator, 368
Memory Regions, 390 hetcompute::buffer_iterator, 369
gpu_kernel hetcompute::buffer_ptr, 370
hetcompute::gpu_kernel, 266, 267 hetcompute::canceled_exception, 424
Graphics Reference API, 393 hetcompute::cpu_kernel, 258
group_ptr hetcompute::cpu_kernel< FReturnType(FArgs...)>,
hetcompute::group_ptr, 246 260
Groups, 230 hetcompute::device_set, 360
create_group, 250–252 hetcompute::do_not_collapse_t, 299
finish_after, 252 hetcompute::dsp_exception, 425
intersect, 253 hetcompute::dsp_kernel, 262
operator&, 254 hetcompute::dsp_kernel< int(∗)(Args...)>, 262
hetcompute::error_exception, 425
HC_GroupAggregateFailure hetcompute::glbuffer_memregion, 387
ErrorCodes, 429 hetcompute::gpu_exception, 426
HC_GroupCanceled hetcompute::gpu_kernel, 264
ErrorCodes, 429 hetcompute::graphics::image_size, 398
HC_GroupGenericError hetcompute::graphics::image_size< 1 >, 398
ErrorCodes, 429 hetcompute::graphics::image_size< 2 >, 399
HC_Success hetcompute::graphics::image_size< 3 >, 399
ErrorCodes, 429 hetcompute::group, 231
HC_TaskAggregateFailure hetcompute::group_ptr, 245
ErrorCodes, 429 hetcompute::hetcompute_exception, 427
HC_TaskCanceled hetcompute::in, 379
ErrorCodes, 429 hetcompute::index, 273
HC_TaskDspFailure hetcompute::index< 1 >, 273
ErrorCodes, 429 hetcompute::index< 2 >, 273
HC_TaskGenericError hetcompute::index< 3 >, 273
ErrorCodes, 429 hetcompute::index_base, 274
HC_TaskGpuFailure hetcompute::inout, 379
ErrorCodes, 429 hetcompute::ion_memregion, 387
has_iter_limit hetcompute::iteration_lag, 199
hetcompute::pipeline_context_base, 216 hetcompute::iteration_rate, 201
has_next hetcompute::lfqueue, 405
hetcompute::aggregate_exception, 423 hetcompute::local, 268
has_profile hetcompute::main_memregion, 388
hetcompute::pattern::tuner, 225 hetcompute::memregion, 389
hc_error hetcompute::out, 379
ErrorCodes, 429 hetcompute::parallel_stage, 202
HetComputeApp::features, 444 hetcompute::pattern::pdivide_and_conquerer, 187
hetcompute::abort_task_exception, 422 hetcompute::pattern::pfor, 162
hetcompute::affinity::settings, 432

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 451
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

hetcompute::pattern::pfor< hetcompute::internal- next, 423


::pointkernel::pointkernel< RT, what, 424
PKType...>, T2 >, 163 hetcompute::beta::pattern::pipeline
hetcompute::pattern::pfor< T1, void >, 164 ∼pipeline, 445
hetcompute::pattern::pipeline, 203 add_stage, 446
hetcompute::pattern::preducer, 177 context, 445
hetcompute::pattern::pscan, 184 operator=, 446, 447
hetcompute::pattern::psorter, 195 pipeline, 445, 446
hetcompute::pattern::ptransformer, 170 hetcompute::beta::pattern::pipeline< UserData >,
hetcompute::pattern::tuner, 223 444
hetcompute::pipeline_context< UserData >, 211 hetcompute::buffer_ptr
hetcompute::pipeline_context<>, 212 acquire_ro, 372
hetcompute::pipeline_context_base, 213 acquire_rw, 373
hetcompute::range, 280 acquire_wi, 374
hetcompute::range< 1 >, 280 at, 375
hetcompute::range< 2 >, 282 begin, 375
hetcompute::range< 3 >, 285 buffer_ptr, 372
hetcompute::range_base, 289 cbegin, 375
hetcompute::scheduler_storage_ptr, 409 cend, 376
hetcompute::scope_acquire_ro, 379 const_iterator, 371
hetcompute::scope_acquire_rw, 380 data_type, 371
hetcompute::scope_acquire_wi, 381 end, 376
hetcompute::scoped_storage_ptr, 412 host_data, 376
hetcompute::serial_stage, 217 iterator, 372
hetcompute::sliding_window_size, 218 operator=, 376
hetcompute::stage_input, 219 release, 377
hetcompute::task< ReturnType >, 299 saved_host_data, 378
hetcompute::task< ReturnType(Args...)>, 301 size, 378
hetcompute::task< void >, 304 to_string, 378
hetcompute::task<>, 305 treat_as_texture, 378
hetcompute::task_ptr< ReturnType >, 316 hetcompute::canceled_exception
hetcompute::task_ptr< ReturnType(Args...)>, 322 what, 425
hetcompute::task_ptr< void >, 326 hetcompute::cpu_kernel
hetcompute::task_ptr<>, 329 cpu_kernel, 260
hetcompute::task_storage_ptr, 415 hetcompute::cpu_kernel< FReturnType(FArgs...)>
hetcompute::thread_storage_ptr, 419 cpu_kernel, 262
hetcompute::tls_exception, 427 hetcompute::device_set
hetcompute_affinity_cores_big add, 362
Affinity Settings, 433 device_set, 361
hetcompute_affinity_cores_little empty, 362
Affinity Settings, 433 negate, 363
hetcompute_affinity_mode_override_local_setting on_cpu, 363
Affinity Settings, 434 on_cpu_big, 363
hetcompute_affinity_pin_threads_true on_cpu_little, 364
Affinity Settings, 434 on_dsp, 364
hetcompute_affinity_settings_t, 432 on_gpu, 364
hetcompute::abort_task_exception remove, 364, 365
what, 422 to_string, 365
hetcompute::aggregate_exception hetcompute::dsp_exception
has_next, 423 what, 425

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 452
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

hetcompute::dsp_kernel< int(∗)(Args...)> hetcompute::internal::task_factory< X, Y, Z >, 447


∼dsp_kernel, 264 hetcompute::internal::task_factory_dispatch< X, Y
dsp_kernel, 264 >, 447
operator=, 264 hetcompute::iteration_lag
hetcompute::error_exception get_iter_lag, 200
what, 426 iteration_lag, 200
hetcompute::gpu_exception operator=, 200
what, 426 hetcompute::iteration_rate
hetcompute::gpu_kernel get_iter_rate_curr, 201
get_cl_kernel_binary, 267 get_iter_rate_pred, 202
gpu_kernel, 266, 267 iteration_rate, 201
is_cl, 267 operator=, 202
is_gl, 268 hetcompute::parallel_stage
hetcompute::group get_degree_of_concurrency, 203
add, 232, 233 operator=, 203
cancel, 234 parallel_stage, 202, 203
canceled, 236 hetcompute::pattern::pdivide_and_conquerer
finish_after, 236 create_pdivide_and_conquer, 188
get_name, 238 hetcompute::pattern::pipeline
intersect, 238 ∼pipeline, 204
launch, 238, 240–242 context, 204
wait_for, 244 create_task, 205, 206
hetcompute::group_ptr disable_sliding_window, 209
get, 247 enable_sliding_window, 209
group_ptr, 246 is_valid, 209
operator bool, 247 operator=, 210
operator->, 247 pipeline, 204, 205
operator=, 247, 248 run, 210
reset, 248 hetcompute::pattern::preducer
swap, 248 create_preduce, 178
unique, 249 hetcompute::pattern::pscan
use_count, 249 create_pscan_inclusive, 184
hetcompute::hetcompute_exception hetcompute::pattern::psorter
∼hetcompute_exception, 427 create_psort, 196
what, 427 hetcompute::pattern::ptransformer
hetcompute::index_base create_ptransform, 171
data, 275 hetcompute::pattern::tuner
index_base, 274 get_chunk_size, 224
operator<, 276 get_cpu_load, 224
operator<=, 276 get_doc, 224
operator>, 277 get_dsp_load, 224
operator>=, 278 get_gpu_load, 224
operator+, 275 has_profile, 225
operator+=, 275 is_serial, 225
operator-, 276 is_static, 225
operator-=, 276 set_chunk_size, 225
operator=, 277 set_cpu_load, 225
operator==, 277 set_dsp_load, 226
hetcompute::internal::pointkernel::pointkernel< RT, set_dynamic, 226
Args >, 447 set_gpu_load, 226

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 453
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

set_max_doc, 227 hetcompute::serial_stage


set_profile, 227 get_type, 218
set_serial, 227 operator=, 218
set_static, 227 serial_stage, 217, 218
tuner, 223 hetcompute::sliding_window_size
hetcompute::pipeline_context< UserData > get_size, 219
∼pipeline_context, 212 operator=, 219
get_data, 212 sliding_window_size, 219
hetcompute::pipeline_context<> hetcompute::stage_input
∼pipeline_context, 213 ∼stage_input, 220
hetcompute::pipeline_context_base get_first_elem_iter_id, 220
∼pipeline_context_base, 214 get_ith_element, 220
cancel_pipeline, 214 input_type, 220
get_iter_id, 216 size, 221
get_max_stage_iter, 216 hetcompute::task< ReturnType >
get_stage_id, 216 copy_value, 300
has_iter_limit, 216 move_value, 300
stop_pipeline, 217 return_type, 300
hetcompute::range< 1 > hetcompute::task< ReturnType(Args...)>
index_to_linear, 281 args_tuple, 302
linear_to_index, 282 arity, 304
linearized_distance, 282 bind_all, 302
range, 281 launch, 304
size, 282 return_type, 302
hetcompute::range< 2 > size_type, 302
index_to_linear, 284 hetcompute::task<>
linear_to_index, 285 cancel, 306
linearized_distance, 285 canceled, 309
range, 283, 284 finish_after, 311
size, 285 is_bound, 312
hetcompute::range< 3 > launch, 313
index_to_linear, 288 size_type, 306
linear_to_index, 288 then, 313
linearized_distance, 288 wait_for, 314
range, 287 hetcompute::task_ptr< ReturnType >
size, 288 ∼task_ptr, 319
hetcompute::range_base get, 319
begin, 289, 290 operator∗=, 319
dims, 290 operatorΓA30C=, 322
end, 290 operator∧ =, 322
length, 291 operator+=, 320
num_elems, 291 operator->, 321
stride, 291 operator-=, 320
hetcompute::scoped_storage_ptr operator/=, 321
get, 413 operator=, 321
operator bool, 413 operator%=, 319
operator pointer_type, 413 operator&=, 319
operator∗, 413 return_type, 318
operator->, 413 swap, 322
scoped_storage_ptr, 413 task_ptr, 318

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 454
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

task_type, 318 Affinity Settings, 434


hetcompute::task_ptr< ReturnType(Args...)> hetcompute_affinity_reset
∼task_ptr, 325 Affinity Settings, 436
args_tuple, 324 hetcompute_affinity_set
arity, 326 Affinity Settings, 436
get, 325 hetcompute_func_ptr_t
operator->, 325 Affinity Settings, 433
operator=, 325, 326 Heterogeneous Compute Device Types, 360
return_type, 324 device_type, 365
size_type, 324 to_string, 366
swap, 326 host_data
task_ptr, 324 hetcompute::buffer_ptr, 376
task_type, 324
hetcompute::task_ptr< void > image_format
∼task_ptr, 328 Texture Data Types, 399
get, 328 index_base
operator->, 328 hetcompute::index_base, 274
operator=, 329 index_to_linear
swap, 329 hetcompute::range< 1 >, 281
task_ptr, 327, 328 hetcompute::range< 2 >, 284
task_type, 327 hetcompute::range< 3 >, 288
hetcompute::task_ptr<> Indices, 273
∼task_ptr, 331 init
get, 331 Legacy, 443
operator bool, 332 input_type
operator->, 332 hetcompute::stage_input, 220
operator=, 332, 333 Interoperability, 442
reset, 333 intersect
swap, 333 Groups, 253
task_ptr, 331 hetcompute::group, 238
task_type, 331 ion_memregion
unique, 333 Memory Regions, 390, 391
use_count, 334 is_bound
hetcompute::task_storage_ptr hetcompute::task<>, 312
get, 417 is_cacheable
operator bool, 417 Memory Regions, 392
operator pointer_type, 417 is_cl
operator∗, 418 hetcompute::gpu_kernel, 267
operator->, 418 is_gl
operator=, 418 hetcompute::gpu_kernel, 268
task_storage_ptr, 416, 417 is_serial
hetcompute_affinity_cores_t hetcompute::pattern::tuner, 225
Affinity Settings, 433 is_static
hetcompute_affinity_execute hetcompute::pattern::tuner, 225
Affinity Settings, 435 is_supported
hetcompute_affinity_get Texture APIs, 396
Affinity Settings, 436 is_this_big_core
hetcompute_affinity_mode_t Affinity Settings, 437
Affinity Settings, 433 is_valid
hetcompute_affinity_pin_threads_t hetcompute::pattern::pipeline, 209

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 455
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

iteration_lag Miscellaneous, 441


hetcompute::iteration_lag, 200 mode
iteration_rate Affinity Settings, 434
hetcompute::iteration_rate, 201 move_value
iterator hetcompute::task< ReturnType >, 300
hetcompute::buffer_ptr, 372
negate
Kernels, 256 hetcompute::device_set, 363
cl, 272 next
create_cpu_kernel, 268, 269 hetcompute::aggregate_exception, 423
create_dsp_kernel, 269 non_collapsed_task_type
create_gpu_kernel, 269–271 Tasks, 335
gl, 272 num_elems
hetcompute::range_base, 291
launch
hetcompute::group, 238, 240–242 on_cpu
hetcompute::task< ReturnType(Args...)>, 304 hetcompute::device_set, 363
hetcompute::task<>, 313 on_cpu_big
Tasks, 344, 345 hetcompute::device_set, 363
Legacy, 443 on_cpu_little
init, 443 hetcompute::device_set, 364
shutdown, 443 on_dsp
length hetcompute::device_set, 364
hetcompute::range_base, 291 on_gpu
lfqueue hetcompute::device_set, 364
Unbounded Lock-Free Queue, 406 operator bool
linear_to_index hetcompute::group_ptr, 247
hetcompute::range< 1 >, 282 hetcompute::scoped_storage_ptr, 413
hetcompute::range< 2 >, 285 hetcompute::task_ptr<>, 332
hetcompute::range< 3 >, 288 hetcompute::task_storage_ptr, 417
linearized_distance operator pointer_type
hetcompute::range< 1 >, 282 hetcompute::scoped_storage_ptr, 413
hetcompute::range< 2 >, 285 hetcompute::task_storage_ptr, 417
hetcompute::range< 3 >, 288 operator<
little hetcompute::index_base, 276
Affinity Settings, 433 operator<=
hetcompute::index_base, 276
main_memregion operator>
Memory Regions, 391 hetcompute::index_base, 277
map operator>>
Texture APIs, 397 Tasks, 355
Memory Regions, 385 operator>=
get_fd, 391 hetcompute::index_base, 278
get_id, 391 operator∗
get_num_bytes, 392 hetcompute::scoped_storage_ptr, 413
get_ptr, 392 hetcompute::task_storage_ptr, 418
glbuffer_memregion, 390 Tasks, 348, 349
ion_memregion, 390, 391 operator∗=
is_cacheable, 392 hetcompute::task_ptr< ReturnType >, 319
main_memregion, 391 operatorΓA30C
s_default_alignment, 392

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 456
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

Tasks, 356, 357 hetcompute::task_ptr<>, 332, 333


operatorΓA30C= hetcompute::task_storage_ptr, 418
hetcompute::task_ptr< ReturnType >, 322 operator==
operator∼ Affinity Settings, 438
Tasks, 357 hetcompute::index_base, 277
operator∧ Tasks, 355
Tasks, 355, 356 operator%
operator∧ = Tasks, 347
hetcompute::task_ptr< ReturnType >, 322 operator%=
operator+ hetcompute::task_ptr< ReturnType >, 319
hetcompute::index_base, 275 operator&
Tasks, 349–351 Groups, 254
operator+= Tasks, 348
hetcompute::index_base, 275 operator&=
hetcompute::task_ptr< ReturnType >, 320 hetcompute::task_ptr< ReturnType >, 319
operator- override_local_setting
hetcompute::index_base, 276 Affinity Settings, 434
Tasks, 352, 353
operator-> Parallel Divide-and-Conquer, 187
hetcompute::group_ptr, 247 create_pdivide_and_conquer, 188
hetcompute::scoped_storage_ptr, 413 pdivide_and_conquer, 189, 191
hetcompute::task_ptr< ReturnType >, 321 pdivide_and_conquer_async, 193, 194
hetcompute::task_ptr< ReturnType(Args...)>, Parallel For Loop, 161
325 create_pfor_each, 165
hetcompute::task_ptr< void >, 328 pfor_each, 165–167
hetcompute::task_ptr<>, 332 pfor_each_async, 167, 168
hetcompute::task_storage_ptr, 418 Parallel Reduction, 177
operator-= create_preduce, 178
hetcompute::index_base, 276 preduce, 179, 180
hetcompute::task_ptr< ReturnType >, 320 preduce_async, 181, 182
operator/ Parallel Scan, 184
Tasks, 354 create_pscan_inclusive, 185
operator/= pscan_inclusive, 185
hetcompute::task_ptr< ReturnType >, 321 pscan_inclusive_async, 186
operator= Parallel Sorting, 195
hetcompute::beta::pattern::pipeline, 446, 447 create_psort, 196
hetcompute::buffer_ptr, 376 psort, 196, 197
hetcompute::dsp_kernel< int(∗)(Args...)>, 264 psort_async, 197, 198
hetcompute::group_ptr, 247, 248 Parallel Transformation, 170
hetcompute::index_base, 277 create_ptransform, 172
hetcompute::iteration_lag, 200 ptransform, 172–174
hetcompute::iteration_rate, 202 ptransform_async, 174, 175
hetcompute::parallel_stage, 203 parallel_stage
hetcompute::pattern::pipeline, 210 hetcompute::parallel_stage, 202, 203
hetcompute::serial_stage, 218 Patterns Reference API, 160
hetcompute::sliding_window_size, 219 pdivide_and_conquer
hetcompute::task_ptr< ReturnType >, 321 Parallel Divide-and-Conquer, 189, 191
hetcompute::task_ptr< ReturnType(Args...)>, pdivide_and_conquer_async
325, 326 Parallel Divide-and-Conquer, 193, 194
hetcompute::task_ptr< void >, 329 pfor_each

80-P2432-1 B MAY CONTAIN U.S. AND INTERNATIONAL EXPORT CONTROLLED INFORMATION 457
Qualcomm® Snapdragon™ Heterogeneous Compute SDK INDEX

    Parallel For Loop, 165–167
pfor_each_async
    Parallel For Loop, 167, 168
Pipeline, 199
    serial_stage_type, 221
pipeline
    hetcompute::beta::pattern::pipeline, 445, 446
    hetcompute::pattern::pipeline, 204, 205
pop
    Bounded Lock-Free Queue, 403
    Unbounded Lock-Free Queue, 406
preduce
    Parallel Reduction, 179, 180
preduce_async
    Parallel Reduction, 181, 182
pscan_inclusive
    Parallel Scan, 185
pscan_inclusive_async
    Parallel Scan, 186
psort
    Parallel Sorting, 196, 197
psort_async
    Parallel Sorting, 197, 198
ptransform
    Parallel Transformation, 172–174
ptransform_async
    Parallel Transformation, 174, 175
push
    Bounded Lock-Free Queue, 403
    Unbounded Lock-Free Queue, 406

range
    hetcompute::range< 1 >, 281
    hetcompute::range< 2 >, 283, 284
    hetcompute::range< 3 >, 287
Ranges, 280
release
    hetcompute::buffer_ptr, 377
remove
    hetcompute::device_set, 364, 365
reset
    Affinity Settings, 438
    hetcompute::group_ptr, 248
    hetcompute::task_ptr<>, 333
reset_pin_threads
    Affinity Settings, 438
return_type
    hetcompute::task< ReturnType >, 300
    hetcompute::task< ReturnType(Args...)>, 302
    hetcompute::task_ptr< ReturnType >, 318
    hetcompute::task_ptr< ReturnType(Args...)>, 324
run
    hetcompute::pattern::pipeline, 210

s_default_alignment
    Memory Regions, 392
saved_host_data
    hetcompute::buffer_ptr, 378
Scheduler Storage, 409
Scoped Storage, 412
scoped_storage_ptr
    hetcompute::scoped_storage_ptr, 413
serial_stage
    hetcompute::serial_stage, 217, 218
serial_stage_type
    Pipeline, 221
set
    Affinity Settings, 438
set_chunk_size
    hetcompute::pattern::tuner, 225
set_cores
    Affinity Settings, 440
set_cpu_load
    hetcompute::pattern::tuner, 225
set_dsp_load
    hetcompute::pattern::tuner, 226
set_dynamic
    hetcompute::pattern::tuner, 226
set_gpu_load
    hetcompute::pattern::tuner, 226
set_max_doc
    hetcompute::pattern::tuner, 227
set_mode
    Affinity Settings, 440
set_pin_threads
    Affinity Settings, 440
set_profile
    hetcompute::pattern::tuner, 227
set_serial
    hetcompute::pattern::tuner, 227
set_static
    hetcompute::pattern::tuner, 227
settings
    Affinity Settings, 434
shutdown
    Legacy, 443
size
    hetcompute::buffer_ptr, 378
    hetcompute::range< 1 >, 282
    hetcompute::range< 2 >, 285
    hetcompute::range< 3 >, 288
    hetcompute::stage_input, 221
size_type
    hetcompute::task< ReturnType(Args...)>, 302
    hetcompute::task<>, 306
    hetcompute::task_ptr< ReturnType(Args...)>, 324
sliding_window_size
    hetcompute::sliding_window_size, 219
stage_input_base, 447
stop_pipeline
    hetcompute::pipeline_context_base, 217
stride
    hetcompute::range_base, 291
swap
    hetcompute::group_ptr, 248
    hetcompute::task_ptr< ReturnType >, 322
    hetcompute::task_ptr< ReturnType(Args...)>, 326
    hetcompute::task_ptr< void >, 329
    hetcompute::task_ptr<>, 333

Task Storage, 415
task_ptr
    hetcompute::task_ptr< ReturnType >, 318
    hetcompute::task_ptr< ReturnType(Args...)>, 324
    hetcompute::task_ptr< void >, 327, 328
    hetcompute::task_ptr<>, 331
task_storage_ptr
    hetcompute::task_storage_ptr, 416, 417
task_type
    hetcompute::task_ptr< ReturnType >, 318
    hetcompute::task_ptr< ReturnType(Args...)>, 324
    hetcompute::task_ptr< void >, 327
    hetcompute::task_ptr<>, 331
Tasks, 292
    abort_on_cancel, 335
    abort_task, 337
    bind_as_data_dependency, 339
    bind_by_value, 339
    blocking, 339
    collapsed_task_type, 335
    create_task, 340, 341
    create_value_task, 342
    do_not_collapse, 357
    finish_after, 344
    launch, 344, 345
    non_collapsed_task_type, 335
    operator>>, 355
    operator*, 348, 349
    operator|, 356, 357
    operator~, 357
    operator^, 355, 356
    operator+, 349–351
    operator-, 352, 353
    operator/, 354
    operator==, 355
    operator%, 347
    operator&, 348
Tasks Reference API, 229
Texture APIs, 394
    create_derivative_texture, 395
    create_sampler, 395
    create_texture, 396
    is_supported, 396
    map, 397
    unmap, 397
Texture Data Types, 398
    addressing_mode, 399
    extended_format_plane_type, 399
    filter_mode, 399
    image_format, 399
then
    hetcompute::task<>, 313
Thread Storage, 419
to_string
    hetcompute::buffer_ptr, 378
    hetcompute::device_set, 365
    Heterogeneous Compute Device Types, 366
treat_as_texture
    hetcompute::buffer_ptr, 378
Tuner, 223
tuner
    hetcompute::pattern::tuner, 223

Unbounded Lock-Free Queue, 405
    lfqueue, 406
    pop, 406
    push, 406
unique
    hetcompute::group_ptr, 249
    hetcompute::task_ptr<>, 333
unmap
    Texture APIs, 397
use_count
    hetcompute::group_ptr, 249
    hetcompute::task_ptr<>, 334

wait_for
    hetcompute::group, 244
    hetcompute::task<>, 314
what
    hetcompute::abort_task_exception, 422
    hetcompute::aggregate_exception, 424
    hetcompute::canceled_exception, 425
    hetcompute::dsp_exception, 425
    hetcompute::error_exception, 426
    hetcompute::gpu_exception, 426
    hetcompute::hetcompute_exception, 427

Bibliography
[1] Sarita V. Adve and Kourosh Gharachorloo. Shared memory consistency models: A tutorial. IEEE
Computer, 29:66–76, 1995. 152
[2] Gene M. Amdahl. Validity of the single-processor approach to achieving large scale computing
capabilities. In AFIPS Conference Proceedings, volume 30, pages 483–485, Reston, VA, April 1967.
148
[3] Christopher Barton, Călin Cascaval, and José Nelson Amaral. A characterization of shared data access
patterns in UPC programs. In Proceedings of the 19th international conference on Languages and
compilers for parallel computing, LCPC’06, pages 111–125, Berlin, Heidelberg, 2007.
Springer-Verlag. 153
[4] Antoni Buades, Bartomeu Coll, and Jean-Michel Morel. A non-local algorithm for image denoising.
In Computer Vision and Pattern Recognition, 2005. 155
[5] Calin Cascaval, Seth Fowler, Pablo Montesinos-Ortego, Wayne Piekarski, Mehrdad Reshadi, Behnam
Robatmili, Michael Weber, and Vrajesh Bhavsar. Zoomm: a parallel web browser engine for
multicore mobile devices. In Proceedings of the 18th ACM SIGPLAN symposium on Principles and
practice of parallel programming, PPoPP ’13, pages 271–280, 2013. 150
[6] Stephanie Coleman and Kathryn S. McKinley. Tile Size Selection Using Cache Organization and
Data Layout. In Proceedings of the ACM SIGPLAN Conference on Programming Languages Design
and Implementation (PLDI ’95), La Jolla, CA, June 1995. SIGPLAN. 153
[7] Michael J. Flynn. Some computer organizations and their effectiveness. IEEE Transactions on
Computers, C-21(9):948–960, Sept. 1972. 149
[8] Benedict R. Gaster and Lee Howes. Can GPGPU programming be liberated from the data-parallel
bottleneck? IEEE Computer, pages 42–52, 2012. 150
[9] John L. Gustafson. Reevaluating Amdahl’s law. Commun. ACM, 31(5):532–533, May 1988. 149
[10] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan
Kaufmann, second edition, 1996. 152
[11] Mark D. Hill and Michael R. Marty. Amdahl’s law in the multicore era. IEEE Computer, July 2008.
149
[12] J. A. Kahle, M. N. Day, H. P. Hofstee, C. R. Johns, T. R. Maurer, and D. Shippy. Introduction to the
Cell multiprocessor. IBM Journal of Research and Development, 49(4.5):589–604, 2005.
http://dx.doi.org/10.1147/rd.494.0589. 151
[13] Milind Kulkarni, Martin Burtscher, Rajeshkar Inkulu, Keshav Pingali, and Calin Caşcaval. How much
parallelism is there in irregular applications? In Proceedings of the 14th ACM SIGPLAN symposium
on Principles and practice of parallel programming, PPoPP ’09, pages 3–14, 2009. 153
[14] Timothy G. Mattson, Beverly A. Sanders, and Berna L. Massingill. Patterns for Parallel
Programming. Addison-Wesley, 2013. 151
[15] NEON intrinsics. http://gcc.gnu.org/onlinedocs/gcc/ARM-NEON-Intrinsics.html, Apr 2013. 149
[16] Qualcomm Research Silicon Valley,
http://developer.qualcomm.com/snapdragon-heterogeneous-compute-sdk. Qualcomm®
Snapdragon™ Heterogeneous Compute SDK User’s Manual, 1.0.0 edition, Oct 2015. 150
[17] Gabriel Rivera and Chau-Wen Tseng. Data transformations for eliminating conflict misses. In
Proceedings of the ACM SIGPLAN Conference on Programming Languages Design and
Implementation (PLDI ’98), pages 38–49, June 1998. 153
[18] Anne Rogers and Keshav Pingali. Process decomposition through locality of reference. SIGPLAN
Notices, 24(7):69–80, July 1989. 152
[19] Josep Torrellas, Monica S. Lam, and John L. Hennessy. Shared Data Placement Optimizations to
Reduce Multiprocessor Cache Miss Rates. In ICPP, pages II–266–II–270, 1990. 152
[20] Michael Wolfe. More Iteration Space Tiling. In Proceedings of Supercomputing ’89, pages 655–664,
Reno, NV, November 1989. ACM. 153
