C++ AMP - Language and Programming Model, Microsoft Corp.
© 2012 Microsoft Corporation. All rights reserved.
Copyright License. Microsoft grants you a license under its copyrights in the specification to (a) make copies of this specification to develop your implementation of this specification, and (b) distribute portions of this specification in your implementation or your documentation of your implementation.
Other Contributors. This specification reflects input from NVIDIA Corporation (Nvidia) and Advanced Micro Devices, Inc. (AMD).
Patent Notice. Microsoft provides you certain patent rights for implementations of this specification under the terms of Microsoft's Community Promise, available at http://www.microsoft.com/openspecifications/en/us/programs/community-promise/default.aspx, which states: Microsoft irrevocably promises not to assert any Microsoft Necessary Claims against you for making, using, selling, offering for sale, importing or distributing any implementation, to the extent it conforms to one of the Covered Specifications, and is compliant with all of the required parts of the mandatory provisions of that specification ("Covered Implementation"), subject to the following: This is a personal promise directly from Microsoft to you, and you acknowledge as a condition of benefiting from it that no Microsoft rights are received from suppliers, distributors, or otherwise in connection with this promise. If you file, maintain, or voluntarily participate in a patent infringement lawsuit against a Microsoft implementation of any Covered Specification, then this personal promise does not apply with respect to any Covered Implementation made or used by you.
To clarify, "Microsoft Necessary Claims" are those claims of Microsoft-owned or Microsoft-controlled patents that are necessary to implement the required portions (which also include the required elements of optional portions) of the Covered Specification that are described in detail and not those merely referenced in the Covered Specification. This promise by Microsoft is not an assurance that either (i) any of Microsoft's issued patent claims covers a Covered Implementation or are enforceable, or (ii) a Covered Implementation would not infringe patents or other intellectual property rights of any third party. No other rights except those expressly stated in this promise shall be deemed granted, waived or received by implication, exhaustion, estoppel, or otherwise.
C++ AMP : Language and Programming Model : Version 0.9 : January 2012
Disclaimers. This specification is provided "as-is"; Microsoft makes no representations or warranties, express, implied, statutory, or otherwise, regarding this specification, including but not limited to any warranties of merchantability, fitness for a particular purpose, non-infringement, or title. The entire risk as to implementing or otherwise using the Specification is assumed by the user and implementer. IN NO EVENT WILL ANY PARTY BE LIABLE TO ANY OTHER PARTY FOR LOST PROFITS OR ANY FORM OF INDIRECT, SPECIAL, INCIDENTAL, OR CONSEQUENTIAL DAMAGES OF ANY CHARACTER FROM ANY CAUSES OF ACTION OF ANY KIND WITH RESPECT TO THIS AGREEMENT, WHETHER BASED ON BREACH OF CONTRACT, TORT (INCLUDING NEGLIGENCE), OR OTHERWISE, AND WHETHER OR NOT THE OTHER PARTY HAS BEEN ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
ABSTRACT
C++ AMP (Accelerated Massive Parallelism) is a native programming model that contains elements that span the C++ programming language and its runtime library. It provides an easy way to write programs that compile and execute on data-parallel hardware such as graphics cards (GPUs). The syntactic changes that are introduced by C++ AMP are minimal, but additional restrictions are enforced to reflect the limitations of data-parallel hardware. Data-parallel algorithms are supported by the introduction of multi-dimensional array types, array operations on those types, indexing, asynchronous memory transfer, shared memory, synchronization, and tiling/partitioning techniques.
1 Overview
  1.1 Conformance
  1.2 Definitions
  1.3 Error Model
  1.4 Programming Model
2 C++ Language Extensions for Accelerated Computing
  2.1 Syntax
    2.1.1 Function Declarator Syntax
    2.1.2 Lambda Expression Syntax
    2.1.3 Type Specifiers
  2.2 Meaning of Function Modifiers
    2.2.1 Function Definitions
    2.2.2 Constructors and Destructors
    2.2.3 Lambda Expressions
  2.3 Expressions That Involve Restricted Functions
    2.3.1 Function Pointer Conversions
    2.3.2 Function Overloading
      2.3.2.1 Overload Resolution
      2.3.2.2 Name Hiding
    2.3.3 Casting
  2.4 amp Restriction Modifier
    2.4.1 Restrictions on Types
      2.4.1.1 Type Qualifiers
      2.4.1.2 Fundamental Types
      2.4.1.3 Compound Types
    2.4.2
    2.4.3
      Literals
      Primary Expressions (C++11 5.1)
      Lambda Expressions
      Function Calls (C++11 5.2.2)
      Local Declarations
      tile_static Variables
      Type-Casting Restrictions
      Miscellaneous Restrictions
3 Device Modeling
  3.1 The Concept of a Compute Accelerator
  3.2 Accelerator
    3.2.1 Default Accelerator
    3.2.2 Synopsis
    3.2.3 Static Members
    3.2.4 Constructors
    3.2.5 Members
    3.2.6 Properties
  3.3 accelerator_view
    3.3.1 Synopsis
    3.3.2 Queuing Mode
    3.3.3 Constructors
    3.3.4 Members
  3.4 Device Enumeration and Selection API
    3.4.1 Synopsis
4 Basic Data Elements
  4.1 index<N>
    4.1.1 Synopsis
    4.1.2 Constructors
    4.1.3 Members
    4.1.4 Operators
  4.2 extent<N>
    4.2.1 Synopsis
    4.2.2 Constructors
    4.2.3 Members
    4.2.4 Operators
  4.3 tiled_extent<D0,D1,D2>
    4.3.1 Synopsis
    4.3.2 Constructors
    4.3.3 Members
    4.3.4 Operators
  4.4 tiled_index<D0,D1,D2>
    4.4.1 Synopsis
    4.4.2 Constructors
    4.4.3 Members
  4.5 tile_barrier
    4.5.1 Synopsis
    4.5.2 Constructors
    4.5.3 Members
    4.5.4 Other Memory Fences and Barriers
5 Data Containers
  5.1 array<T,N>
    5.1.1 Synopsis
    5.1.2 Constructors
      5.1.2.1 Staging Array Constructors
    5.1.3 Members
    5.1.4 Indexing
    5.1.5 View Operations
  5.2 array_view<T,N>
    5.2.1 Synopsis
      5.2.1.1 array_view<T,N>
      5.2.1.2 array_view<const T,N>
    5.2.2
    5.2.3
    5.2.4
    5.2.5
  5.3 Copying Data
    5.3.1 Synopsis
    5.3.2 Copying Between array and array_view
    5.3.3 Copying from Standard Containers to arrays or array_views
    5.3.4 Copying from arrays or array_views to Standard Containers
6 Atomic Operations
  6.1 Synopsis
  6.2 Atomically Exchanging Values
  6.3 Atomically Applying an Integer Numerical Operation
7
  7.1 Capturing Data in the Kernel Function Object
  7.2 Exception Behavior
8 Correctly Synchronized C++ AMP Programs
  8.1 Concurrency of Sibling Threads That Are Launched by a parallel_for_each Call
    8.1.1 Correct Usage of Tile Barriers
    8.1.2 Establishing Order Between Operations of Concurrent parallel_for_each Threads
      8.1.2.1 Barrier-incorrect Programs
      8.1.2.2 Compatible Memory Operations
      8.1.2.3 Concurrent Memory Operations
      8.1.2.4 Racy Programs
      8.1.2.5 Race-free Programs
  8.2 Cumulative Effects of a parallel_for_each Call
  8.3 Effects of copy and copy_async Operations
  8.4 Effects of array_view::synchronize, synchronize_async, and Refresh Functions
9
  9.1 fast_math
  9.2 precise_math
10 Graphics (Optional)
  10.1 texture<T,N>
    10.1.1 Synopsis
    10.1.2 Introduced typedefs
    10.1.3 Constructing an Uninitialized Texture
    10.1.4 Constructing a Texture from a Host-side Iterator
    10.1.5 Constructing a Texture from a Host-side Data Source
    10.1.6 Constructing a Texture by Cloning Another One
    10.1.7 Assignment Operator
    10.1.8 Copying Textures
    10.1.9 Moving Textures
    10.1.10 Querying the Physical Characteristics of a Texture
    10.1.11 Querying the Logical Dimensions of a Texture
    10.1.12 Querying the accelerator_view Where the Texture Resides
    10.1.13 Reading and Writing Textures
    10.1.14 Global texture copy functions
      10.1.14.1 Global async Texture copy Functions
    10.1.15 Direct3D Interop Functions
  10.2 writeonly_texture_view<T,N>
    10.2.1 Synopsis
    10.2.2 Introduced typedefs
    10.2.3 Construct a Write-only View Over a Texture
    10.2.4 Copy Constructors and Assignment Operators
    10.2.5 Destructor
    10.2.6 Querying the Physical Characteristics of an Underlying Texture
      10.2.6.1 Querying the Logical Dimensions of an Underlying Texture (Through a View)
      10.2.6.2 Writing a Write-only Texture View
    Global writeonly_texture_view copy Functions
      Global async writeonly_texture_view copy Functions
    Direct3D Interop Functions
  10.3 norm and unorm
    10.3.1 Synopsis
    10.3.2 Constructors and Assignment
    10.3.3 Operators
  10.4 Short Vector Types
    10.4.1 Synopsis
    10.4.2 Constructors
      10.4.2.1 Constructors from Components
      10.4.2.2 Explicit conversion constructors
    10.4.3 Component Access (Swizzling)
      10.4.3.1 Single-component Access
      10.4.3.2 Two-component Access
      10.4.3.3 Three-component Access
      10.4.3.4 Four-component Access
11 Direct3D Interoperability (Optional)
12 Error Handling
  12.1 static_assert
  12.2 Runtime Errors
    12.2.1 runtime_exception
      12.2.1.1 Specific Runtime Exceptions
    12.2.2 out_of_memory
    12.2.3 invalid_compute_domain
    12.2.4 unsupported_feature
  12.3
13 Appendix: C++ AMP Future Directions (Informative)
  13.1 Versioning Restrictions
    13.1.1 auto restriction
    13.1.2 Automatic Restriction Deduction
    13.1.3 amp Version
  13.2
Overview
C++ AMP is a programming model that enables the acceleration of C++ code on data-parallel hardware. One example of data-parallel hardware is the discrete graphics card (GPU), which is becoming increasingly relevant for general-purpose parallel computations in addition to its main function as a graphics accelerator. A GPU is conceptually (and usually physically) remote from the CPU, has a discrete memory address space, and incurs a high cost when data is transferred between CPU memory and GPU memory. The programmer must carefully balance the cost of this data-transfer overhead against the computational acceleration that can be achieved by parallel execution on the device.
Another example of data-parallel hardware is the SIMD vector instruction set, and associated registers, that are found in all modern processors.
For the remainder of this specification, we refer to the data-parallel hardware as the accelerator. In the few places where the distinction matters, we refer to a GPU or a VectorCPU.
The C++ AMP programming model gives you explicit control over the above aspects: copying data between CPU and accelerator, and the computations performed on the GPU. You can explicitly manage all communication between the CPU and the accelerator, and this communication can be either synchronous or asynchronous. The data-parallel computations that are performed on the accelerator are expressed by using multi-dimensional arrays, high-level array-manipulation functions, multi-dimensional indexing operations, and other high-level abstractions, all of which are based on a large subset of the C++ programming language.
The programming model contains multiple layers so that you can trade off ease-of-use with maximum performance.
C++ AMP has three broad categories of functionality:
1. C++ language and compiler
   a. Vector functions are compiled into code that is specific to the accelerator.
2. Runtime
   a. The runtime contains an AMP abstraction of lower-level accelerator APIs, and also supports multiple host threads, processors, and accelerators.
   b. Asynchronous execution is supported through an eventing model.
3. Programming model
   a. The programming model mostly comprises C++ AMP entry points and call sites, along with runtime boilerplate code.
   b. The programming model may be categorized as:
      1. C++ language extensions and restrictions.
      2. Runtime library.
1.1 Conformance
The text in this specification falls into one of the following categories:
Informative: shown in this style. Informative text is non-normative; for background information only; not required to be implemented to conform to this specification.
Microsoft-specific: shown in this style.
C++ AMP : Language and Programming Model : Version 0.9 : January 2012
Microsoft-specific text is non-normative; for background information only; not required to be implemented to conform to this specification; explains features that are specific to the Microsoft implementation of the C++ AMP programming model. However, you may implement these features, or any subset thereof.
Normative: all text that is not otherwise marked (see the previous categories) is normative. Normative text falls into one of the following sub-categories:
o Optional: the title of each section of the specification that falls into this sub-category includes the suffix (Optional). A conforming implementation of C++ AMP may support such features, or not. (Microsoft-specific portions of the text are also considered Optional.)
o Required: unless it is marked as Optional, all Normative text is Required. A conforming implementation of C++ AMP must support all Required features.
Conforming implementations must provide all required Normative features and may provide any number of optional features. Implementations may provide additional features so long as they are exposed in namespaces other than those that are listed in this specification. Implementations may provide additional language support for amp-restricted functions (defined in section 2.1) by following the rules in section 13.
The programming model uses the Microsoft Visual C++ syntax for properties. Any such property is considered to be optional. An implementation may use equivalent mechanisms for introducing such properties as long as they provide the same functionality of indirection to a member function that the Visual C++ properties provide.
1.2 Definitions
This section introduces terms that are used in this specification.
Accelerator: A hardware device or capability that enables accelerated computation on data-parallel workloads. Examples include:
o Graphics processing unit (GPU), or other coprocessor, that is accessible through the PCIe bus.
o SIMD units of the host node that are exposed through software emulation of a hardware accelerator.
Array: A dense N-dimensional data structure.
Array view: A view into a linear piece of memory that adds array-like dimensionality.
Compressed texture format: A format that divides a texture into blocks so that it can be reduced in size by a fixed ratio; typically 4:1 or 6:1. Compressed textures are useful when perfect image/texel fidelity is not necessary and minimization of memory storage and bandwidth are critical to application performance.
Constant memory: Read-only accelerator memory that is used internally by the C++ AMP runtime. Typically, it holds metadata that describes a compute kernel and captured user data.
Divergence; Divergent code: When two threads execute different code paths (for example, the then and else branches of the same if statement), they are said to be divergent.
Extent: A vector of integers that describes lengths of N-dimensional geometric objects.
Global memory: On a GPU, global memory is the main off-chip memory store.
Informative: Typically, on current-generation GPUs, global memory is implemented in DRAM, with access times of 400-1000 cycles; the GPU clock speed is around 1 GHz; and global memory is non-cached. Global memory is accessed in a coalesced pattern with a granularity of 128 bytes, so when 4 bytes of global memory are accessed, 32 successive threads must read the 32 successive 4-byte addresses to be fully coalesced.
GPGPU: A General Purpose GPU, which is a GPU that can run non-graphics computations.
GPU: A specialized (co)processor that offloads graphics computation and rendering from the host. As GPUs have evolved, they have become increasingly able to offload non-graphics computations as well (see GPGPU).
Informative: The memory space of current-generation GPUs is almost always disjoint from the host system.
GPU register model:
Informative: On typical, current-generation GPUs, registers are partitioned among threads that are in-flight. Suppose that on a given multiprocessor, there are 16,384 registers and a given thread uses 32 registers. In that case, there is a maximum of 512 threads that can be in-flight. If the threads are in thread groups of 256 threads, there can only be 2 thread groups in-flight at a time, which is not enough to mask global memory latency. Ideally there should be between 3 and 8 thread groups in-flight. Therefore, experiment with thread groups of 196 or 128 or even 64 threads. In general, the number of registers per thread should be no more than 64, and programs that use more than 32 registers per thread might have characteristics that are difficult to optimize. Spilling is an option only as a last resort: a spilled variable is around 1000x slower to access because you can only spill to global memory.
Heterogeneous programming: A workload that combines kernels that execute on data-parallel compute nodes with algorithms that run on CPUs.
Host: The operating system process and the CPU(s) that it is running on.
Host thread: The operating system thread and the CPU(s) that it is running on. A host thread may initiate a copy operation or parallel loop operation that can run on an accelerator.
Index: A vector of integers that describes an N-dimensional point in iteration space or index space.
Kernel; Kernel function: A program that is designed to be executed at a C++ AMP call site. More generally, a kernel is a unit of computation that executes on an accelerator. A kernel function is a special case; it is the root of a logical call graph of functions that execute on an accelerator. A C++ analogy is that it is the main() function for an accelerator program.
Perfect loop nest: A loop nest in which the body of each outer loop is a single statement that is a loop.
Pixel: A pixel, or picture element, represents one element in a digital image. Typically, pixels are composed of multiple color components such as a red value, a green value, and a blue value. Other color representations exist; these include one-channel images that just represent intensity or black and white values.
Shared (or local) memory: A user-defined cache on streaming multiprocessors on GPUs. Shared memory is local to a multiprocessor and is shared across threads that execute on that multiprocessor. Shared memory allocations per thread group affect the total number of thread groups that are in-flight per multiprocessor. For example, if each thread group uses 4 KB of shared memory where the limit is 16 KB per multiprocessor, then the number of thread groups in-flight is limited to 4 and may be less, depending on register allocation patterns.
Informative: On the nVIDIA Fermi architecture, shared memory and L1 cache are the same; that is, the same memory is partitioned to be shared memory and L1 cache.
SIMD unit: Single Instruction Multiple Data. A machine programming model where a single instruction operates over multiple pieces of data. The translation of a program to use SIMD is known as vectorization. GPUs have multiple SIMD units, which are the streaming multiprocessors.
Informative: An SSE (Nehalem, Phenom) or AVX (Sandy Bridge) or LRBni (Larrabee) vector unit is a SIMD unit or vector processor.
SMP: Symmetric Multi-Processor. Standard PC multiprocessor architecture.
Streaming multiprocessor:
Informative: nVIDIA terminology for a collection of scalar processors that must either all execute the same instruction simultaneously or execute no-ops. The equivalent ATI/AMD terminology is stream processor, which is also known as streaming multiprocessor.
Stream processor:
Informative: ATI/AMD terminology for a streaming multiprocessor.
Texel: A texel or texture element represents one element of a texture space. Texel elements are mapped to 1D, 2D, or 3D surfaces during sampling, rendering, and/or rasterization, and end up as pixel elements on a display.
Texture: A texture is a 1D, 2D, or 3D logical array of texels that is optimized in hardware for spatial access by using texture caches. Typically, textures are used to represent image, volumetric, or other visual information, but they are also efficient for many data arrays that have to be optimized for spatial access or have to interpolate between adjacent elements. Textures provide virtualization of storage, whereby shader code can sample a texture object as if it contained logical elements of one type (for example, float4), but the concrete physical storage of the texture is represented as a second type (for example, four 8-bit channels). This enables the application of the same shader algorithms on different types of concrete data.
Texture format: Texture formats define the type and arrangement of the underlying bytes that represent a texel value.
Informative: Direct3D supports many types of formats, which are described under the DXGI_FORMAT enumeration.
Texture memory: Texture memory space resides in GPU memory and is cached in a texture cache. A texture fetch costs one memory read from GPU memory only on a cache miss; otherwise, it just costs one read from the texture cache. The texture cache is optimized for 2D spatial locality; therefore, threads of the same scheduling unit that read texture addresses that are close together in 2D achieve the best performance. Also, texture memory is designed for streaming fetches that have a constant latency; a cache hit reduces global-memory bandwidth demand but not fetch latency.
Thread block:
Informative: The nVIDIA term for a thread group.
Thread group; Thread tile: A set of threads that is executed on one multiprocessor. While a scheduling unit represents the set of threads that are executing at one moment on a multiprocessor, a thread group (which is a multiple of scheduling units) is the granularity level that can be executed independently. All threads in a thread group may participate in a barrier; this is not true for any smaller or larger collection. When a scheduling unit accesses global memory and stalls, the whole thread group is switched out and the next thread group in-flight for that multiprocessor is scheduled in its place. This is why you should try to have at least 4 thread groups in-flight per multiprocessor, to mask the latency of global memory access.
Tiling: Tiling is the partitioning of an N-dimensional array into same-sized tiles, which are N-dimensional rectangles that have sides that are parallel to the coordinate axes. Essentially, the local view abstraction, which is also known as tiling, is the process of recognizing the current thread group as a cooperative gang of threads, with the decomposition of a global index into a local index plus a tile offset. In C++ AMP, it is viewing a global index as a local index and a tile ID, as described by this canonical correspondence:
    compute grid ~ dispatch grid x thread group
In particular, tiling provides the local geometry with which to take advantage of shared memory and barriers whose usage patterns enable the coalescing of global memory access.
Restricted function: A function that is declared to obey the restrictions of a particular C++ AMP subset. A function can be cpu-restricted so that it can run on a host CPU. A function can be amp-restricted so that it can run on an amp-capable accelerator such as a GPU or VectorCPU. A function can carry more than one restriction.
Vector processor: Same as a SIMD unit or streaming multiprocessor or stream processor.
Warp:
Informative: The nVIDIA term for a scheduling unit.
Wave; Wavefront:
Informative: The AMD terms for a scheduling unit.
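Informative: The index decomposition described under Tiling can be sketched in standard C++. The Index2 and TileExtent2 types and the tile_id, local_index, and global_index functions below are hypothetical names chosen for illustration; they are not the C++ AMP index or tiled_index types. The sketch assumes a 2-dimensional compute grid and non-negative indices.

```cpp
#include <cassert>

// Illustrative 2-D index; a stand-in, not the C++ AMP index<N> class.
struct Index2 {
    int row;
    int col;
};

// Illustrative tile extent: the fixed height and width of every tile.
struct TileExtent2 {
    int rows;
    int cols;
};

// Tile ID: which tile of the dispatch grid the global index falls in.
constexpr Index2 tile_id(Index2 global, TileExtent2 tile) {
    return Index2{global.row / tile.rows, global.col / tile.cols};
}

// Local index: position of the global index inside its own tile.
constexpr Index2 local_index(Index2 global, TileExtent2 tile) {
    return Index2{global.row % tile.rows, global.col % tile.cols};
}

// Recomposition: global index = tile ID * tile extent + local index.
constexpr Index2 global_index(Index2 tile_idx, Index2 local, TileExtent2 tile) {
    return Index2{tile_idx.row * tile.rows + local.row,
                  tile_idx.col * tile.cols + local.col};
}

// Thread (37, 5) in a grid tiled 16 x 16 lies in tile (2, 0) at local index (5, 5).
static_assert(tile_id(Index2{37, 5}, TileExtent2{16, 16}).row == 2, "tile row");
static_assert(tile_id(Index2{37, 5}, TileExtent2{16, 16}).col == 0, "tile column");
static_assert(local_index(Index2{37, 5}, TileExtent2{16, 16}).row == 5, "local row");
static_assert(local_index(Index2{37, 5}, TileExtent2{16, 16}).col == 5, "local column");
```

The recomposition function makes the correspondence explicit: applying global_index to the tile ID and local index of a thread yields the original global index.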
1.3 Error Model
Host-side runtime library code for C++ AMP has a different error model than device-side code has. For details, examples, and exception categorization, see Error Handling.
Host-Side Error Model: On a host, C++ exceptions and _DEBUG assertions are used to present semantic errors, and therefore are categorized and listed as error states in API descriptions.
Device-Side Error Model: On a device, error state is conveyed through the assert intrinsic. The debug_printf intrinsic is additionally supported for logging messages from within the accelerator code.
Compile-time asserts: The C++ intrinsic static_assert is often used to handle error states that are detectable at compile time. This use of static_assert is a technique for conveying static semantic errors, which therefore are categorized like exception types.
1.4 Programming Model
Here are the types and patterns in C++ AMP:
Indexing level
o index<N>
o extent<N>
o tiled_extent<D0,D1,D2>
o tiled_index<D0,D1,D2>
Data level
o array<T,N>
o array_view<T,N>, array_view<const T,N>
o texture<T,N>
o writeonly_texture_view<T,N>
Runtime level
o accelerator
o accelerator_view
Call-site level
o parallel_for_each
o copy: various commands to move data between compute nodes
Kernel level
o tile_barrier
o restrict() clause
o fixed_array
o tile_static
C++ AMP adds a closed set of restriction specifiers to the C++ type system, together with new syntax, and also adds rules that govern how they behave with respect to conversion rules and overloading. Restriction specifiers apply to function declarators only. The restriction specifiers perform the following functions:
1. They become part of the signature of the function.
2. They enforce restrictions on the content and/or behavior of that function.
3. They may designate a particular subset of the C++ language.
For example, an amp restriction would imply that a function must conform to the defined subset of C++, such that the function can be used on a typical GPU device.
2.1 Syntax
restriction-specifier-seq:
    restriction-specifier
    restriction-specifier-seq restriction-specifier
restriction-specifier:
    restrict ( restriction-seq )
restriction-seq:
    restriction
    restriction-seq , restriction
restriction:
    amp-restriction
    cpu
amp-restriction:
    amp
The restrict keyword is contextual. The restriction specifiers in a restrict clause are not reserved words. Multiple restrict clauses, such as restrict(A) restrict(B), behave the same as restrict(A,B). Duplicate restrictions are allowed and behave as if the duplicates are discarded.
The cpu restriction specifies that this function can only run on the host CPU. If a declarator elides the restriction specifier, it behaves as if it were specified with restrict(cpu). If a declarator contains a restriction specifier, then it specifies the entire set of restrictions (in other words: restrict(amp) means that it runs only on the amp target, not on the CPU).
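Informative: The combination rules for restrict clauses can be modeled with a small standard C++ helper. The function effective_restrictions is a hypothetical name that is not part of C++ AMP, and it assumes well-formed input; it only illustrates that multiple clauses merge, that duplicates are discarded, and that an elided specifier behaves as restrict(cpu).

```cpp
#include <cassert>
#include <cctype>
#include <cstddef>
#include <set>
#include <string>

// Hypothetical helper (not a C++ AMP API): collects the restrictions named in a
// sequence of restrict(...) clauses into one set. Assumes well-formed input.
inline std::set<std::string> effective_restrictions(const std::string& clauses) {
    std::set<std::string> result;
    const std::string keyword = "restrict";
    std::size_t pos = 0;
    while ((pos = clauses.find(keyword, pos)) != std::string::npos) {
        const std::size_t open = clauses.find('(', pos + keyword.size());
        const std::size_t close = clauses.find(')', open);
        if (open == std::string::npos || close == std::string::npos) {
            break;  // malformed clause: stop scanning
        }
        std::string name;
        for (std::size_t i = open + 1; i <= close; ++i) {
            const char c = clauses[i];
            if (c == ',' || c == ')') {
                if (!name.empty()) {
                    result.insert(name);  // a set discards duplicates
                }
                name.clear();
            } else if (!std::isspace(static_cast<unsigned char>(c))) {
                name += c;
            }
        }
        pos = close + 1;
    }
    if (result.empty()) {
        result.insert("cpu");  // elided specifier behaves as restrict(cpu)
    }
    return result;
}
```

Under this model, effective_restrictions("restrict(amp) restrict(cpu)") and effective_restrictions("restrict(amp,cpu)") produce the same set, and effective_restrictions("restrict(amp)") does not contain cpu.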
2.1.1 Function Declarator Syntax
The function declarator grammar (classic and trailing return type variation) is adjusted as follows:
    D1 ( parameter-declaration-clause ) cv-qualifier-seq_opt ref-qualifier_opt restriction-specifier-seq_opt exception-specification_opt attribute-specifier_opt
    D1 ( parameter-declaration-clause ) cv-qualifier-seq_opt ref-qualifier_opt restriction-specifier-seq_opt exception-specification_opt attribute-specifier_opt trailing-return-type
Restriction specifiers may not be applied to other declarators (for example, arrays, pointers, references). They can be applied to all kinds of functions; these include free functions, static and non-static member functions, special member functions, and overloaded operators. Examples:
    auto grod() restrict(amp);
    auto freedle() restrict(amp) -> double;
    class Fred
    {
    public:
        Fred() restrict(amp) : member-initializer { }
        Fred& operator=(const Fred&) restrict(amp);
        int kreeble(int x, int y) const restrict(amp);
        static void zot() restrict(amp);
    };
2.1.2 Lambda Expression Syntax
The lambda expression syntax is adjusted as follows:
lambda-declarator:
When a restriction modifier is applied to a lambda expression, the behavior is as if all member functions of the generated functor are restriction-modified.
2.1.3 Type Specifiers
Restriction specifiers are not allowed anywhere in the type specifier grammar, even if it specifies a function type. For example, the following is not well-formed and will produce a syntax error:
    typedef float FuncType(int);
    restrict(cpu) FuncType* pf; // Illegal; restriction specifiers not allowed in type specifiers
or just:
    float (*pf)(int) restrict(cpu);
2.2
The restriction specifiers on the declaration of a given function F must agree with those that are specified on the definition of function F.
Multiple restriction specifiers can be specified for a given function. The effect is that the function enforces the union of the restrictions that are defined by each restriction modifier.
The restriction specifiers on a function become part of its signature, and therefore can be used to overload. The restrictions are mangled into the exported function name in a manner similar to how member, based, near are mangled.
2.2.1 Function Definitions
The restriction specifiers that are applied to a function definition are recursively applied to all functions that are defined in its body and do not have explicit restriction specifiers (that is, through nested classes that have member functions, and through lambdas). For example:
    void f1() restrict(amp)
    {
        class C1
        {
            void f2() {} // f2 is amp-restricted
        };
        auto f3 = [] (int y) { }; // Lambda is amp-restricted
        auto f4 = [] (int y) restrict(cpu) { }; // Lambda is cpu-restricted
    }
This also applies to the function scope of a lambda body.
2.2.2 Constructors and Destructors
Constructors can have overloads that are differentiated by restriction specifiers.
Because destructors cannot be overloaded, the destructor must contain a restriction specifier that covers the union of the restrictions on all of the constructors. (A destructor can also achieve an overloading effect by calling auxiliary cleanup functions that have different restriction specifiers.) For example:
    class C1
    {
    public:
        C1() { }
        C1() restrict(amp) { }
        ~C1() restrict(cpu,amp);
    };
    void UnrestrictedFunction()
    {
        C1 a; // calls C1::C1()
        // a is destructed with C1::~C1()
    }
    void RestrictedFunction() restrict(amp)
    {
        C1 b; // calls C1::C1() restrict(amp)
        // b is destructed with C1::~C1()
    }
    class C2
    {
    public:
        C2() { }
        C2() restrict(amp) { }
        ~C2(); // error: restrict(cpu,amp) required
    };
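Informative: The destructor rule amounts to a subset test over restriction sets. The following standard C++ sketch models the sets as bit flags; the names Restriction, covers, and destructor_ok are hypothetical and do not describe how any compiler implements the check.

```cpp
#include <cassert>

// Illustrative bit flags for the two restrictions that the grammar defines.
enum Restriction : unsigned { kCpu = 1u << 0, kAmp = 1u << 1 };

// True when every restriction in `needed` also appears in `provided`.
constexpr bool covers(unsigned provided, unsigned needed) {
    return (provided & needed) == needed;
}

// A destructor is acceptable when its restrictions cover the union of the
// restrictions on all constructors of the class.
constexpr bool destructor_ok(unsigned ctor_union, unsigned dtor) {
    return covers(dtor, ctor_union);
}

// C1: constructors restrict(cpu) and restrict(amp); destructor restrict(cpu,amp).
static_assert(destructor_ok(kCpu | kAmp, kCpu | kAmp), "C1 is accepted");
// C2: same constructors, but ~C2() elides the specifier, which behaves as restrict(cpu).
static_assert(!destructor_ok(kCpu | kAmp, kCpu), "C2 is rejected");
```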
A virtual function declaration in a derived class can override a virtual function declaration in a base class only if the derived class function has the same restriction specifiers as the base. For example:
    class Base
    {
    public:
        virtual void f1() restrict(R1);
    };
    class Derived : public Base
    {
    public:
        virtual void f1() restrict(R2); // Does not override Base::f1
    };
2.2.3 Lambda Expressions
When restriction specifiers are applied to a lambda declarator, the behavior is as if the restriction specifiers are applied to all member functions of the compiler-generated function object. For example:
    C1 ambientVar;
    auto functor = [ambientVar] (int y) restrict(amp) -> int { return y + ambientVar.z; };
is equivalent to:
    C1 ambientVar;
    class <lambdaName>
    {
    public:
        <lambdaName>(const C1& c1) restrict(amp)
            : capturedC1(c1) // C1's copy ctor must also be amp
        { }
        ~<lambdaName>() restrict(amp) { } // C1's dtor must also be amp
        int operator()(int y) restrict(amp) { return y + capturedC1.z; }
    private:
        C1 capturedC1; // copy of ambientVar held by the function object
    };
    <lambdaName> functor(ambientVar);
2.3
2.3.1 Function Pointer Conversions
New implicit conversion rules must be added to account for restricted function pointers (and references). Given an expression of type pointer to R1-function, this type can be implicitly converted to type pointer to R2-function if-and-only-if R1 has all the restriction specifiers of R2. Stated more intuitively, it is acceptable for the target function to be more restricted than the function pointer that invokes it; it is unacceptable for it to be less restricted. For example:
    int func(int) restrict(R1,R2);
    int (*pfn)(int) restrict(R1) = func; // ok, since func(int) restrict(R1,R2) is at least R1
(C++ AMP does not support function pointers in the current restrict(amp) subset.)
2.3.2 Function Overloading
Restriction specifiers become part of the function type to which they are attached. That is, they become part of the signature of the function. Therefore, functions can be overloaded by differing modifiers, and each unique set of modifiers forms a unique overload.
The restriction specifiers of a function must not overlap with restriction specifiers in another function in the same overload set.
    int func(int x) restrict(cpu,amp);
    int func(int x) restrict(cpu); // error, overlaps with previous declaration
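Informative: The no-overlap rule can be checked mechanically when each overload's restrictions are modeled as a set. The sketch below uses bit flags in standard C++; Restriction and overloads_are_disjoint are hypothetical names used for illustration only.

```cpp
#include <cassert>
#include <vector>

// Illustrative bit flags for the restrictions that the grammar defines.
enum Restriction : unsigned { kCpu = 1u << 0, kAmp = 1u << 1 };

// Returns true when no restriction appears on more than one overload of the set.
// Each element of `overload_set` holds the restrictions of one overload.
inline bool overloads_are_disjoint(const std::vector<unsigned>& overload_set) {
    unsigned seen = 0;
    for (unsigned restrictions : overload_set) {
        if ((seen & restrictions) != 0) {
            return false;  // shares a restriction with an earlier overload
        }
        seen |= restrictions;
    }
    return true;
}
```

For the two declarations of func above, the set {cpu|amp, cpu} is reported as overlapping because cpu appears twice, whereas {cpu, amp} would be accepted.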
The target of the function call operator must resolve to an overloaded set of functions that is at least as restricted as the body of the calling function (see Overload Resolution). For example:
    void f1();
    void f2() restrict(amp);
    void f3() restrict(amp)
    {
        f2(); // okay: f2 has amp restriction
        f1(); // error: f1 lacks amp restriction
    }
It is permissible for a less restrictive call site to call a more restrictive function. Compiler-generated constructors and destructors (and other special member functions) behave as if they were declared conforming to the restrictions of the calling context. (This may cause an error if the class contains members that violate the restrictions of the calling context.) For example:
    struct S1
    {
        int a;
        int b;
        int f1() restrict(amp) { return a+b; }
        int f2() restrict(cpu) { return a*b; }
    };
    void d3dCaller() restrict(amp)
    {
        S1 s; // okay, behaves as if compiler-generated ctor was amp
        int x = s.f1();
        // s.~S1() called here; also okay
    }
    void d3dCaller() restrict(cpu)
    {
        S1 s; // okay, behaves as if compiler-generated ctor was cpu
        int x = s.f2();
        // s.~S1() called here; also okay
    }
The compiler must behave this way because the local usage of S1 in this case should not affect potential uses of it in other restricted or unrestricted scopes.
2.3.2.1 Overload Resolution
Overload resolution depends on the set of restrictions (function modifiers) that are in force at the call site.
    int func(int x) restrict(A);
    int func(int x) restrict(B,C);
    int func(int x) restrict(D);
    void f1() restrict(B)
    {
        int x = func(5); // calls func(int x) restrict(B,C)
    }
A call to function F is valid if-and-only-if the overload set of F covers all of the restrictions that are in force in the calling function. This rule can be satisfied by just one function F that contains all of the required restrictions, or by a set of overloaded functions F that each specify a subset of the restrictions that are in force at the call site. For example:
    void Z() restrict(amp,sse,cpu) { }
    void Z_caller() restrict(amp,sse,cpu)
    {
        Z(); // okay; all restrictions available in a single function
    }
    void X() restrict(amp) { }
    void X() restrict(sse) { }
    void X() restrict(cpu) { }
    void X_caller() restrict(amp,sse,cpu)
    {
        X(); // okay; all restrictions available in separate functions
    }
    void Y() restrict(amp) { }
    void Y_caller() restrict(cpu,amp)
    {
        Y(); // error; no available Y() that satisfies CPU restriction
    }
(When a call to a restricted function is satisfied by more than one function, the compiler must generate an as-if-runtime dispatch to the correctly restricted version.)
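Informative: The coverage rule and the as-if-runtime dispatch can be modeled in standard C++ with restriction sets held as bit flags. The names Restriction, offered, call_is_valid, and select_overload are hypothetical; the sse flag mirrors the example above and is not a restriction that the grammar in section 2.1 defines. This is a sketch of the rule, not a description of how a compiler implements it.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative bit flags; kSse exists only to mirror the example in the text.
enum Restriction : unsigned { kCpu = 1u << 0, kAmp = 1u << 1, kSse = 1u << 2 };

// Union of the restrictions offered by all overloads in the set.
inline unsigned offered(const std::vector<unsigned>& overloads) {
    unsigned all = 0;
    for (unsigned r : overloads) {
        all |= r;
    }
    return all;
}

// A call is valid when the overload set covers every restriction that is in
// force at the call site.
inline bool call_is_valid(const std::vector<unsigned>& overloads, unsigned caller) {
    return (offered(overloads) & caller) == caller;
}

// As-if-runtime dispatch: index of the overload that serves one concrete
// target restriction, or -1 when no overload carries that restriction.
inline int select_overload(const std::vector<unsigned>& overloads, unsigned target) {
    for (std::size_t i = 0; i < overloads.size(); ++i) {
        if ((overloads[i] & target) != 0) {
            return static_cast<int>(i);
        }
    }
    return -1;
}
```

With the overload set {amp, sse, cpu} that models X(), a caller restricted to amp, sse, and cpu is valid, and select_overload picks index 1 when the code is executing on the sse target. With the set {amp} that models Y(), a caller restricted to cpu and amp is rejected.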
2.3.2.2 Name Hiding
Overloading by using restriction specifiers does not affect the name-hiding rules. For example:
    void f1(int x) restrict(amp) { ... }
    namespace N1
    {
        void f1(double d) restrict(cpu) { ... }
        void f1_caller() restrict(amp)
        {
            f1(10); // error; global f1() is hidden by N1::f1
        }
    }
The name-hiding rules in C++11 Section 3.3.10 state that within namespace N1, the global name f1 is hidden by the local name f1, and is not overloaded by it.
2.3.3 Casting
A restricted function type can be cast to a more restricted function type by using a normal C-style cast or reinterpret_cast. (A cast is not required when you are losing restrictions, only when you are gaining them.) For example:
    void unrestricted_func(int,int);
    void restricted_caller() restrict(amp)
    {
        ((void (*)(int,int) restrict(amp))unrestricted_func)(6, 7);
        reinterpret_cast<void (*)(int,int) restrict(amp)>(unrestricted_func)(6, 7);
    }
A program that does unsafe casting such as this can exhibit undefined behavior.
2.4
The amp restriction modifier applies a relatively small set of restrictions that reflect the current limitations of GPU hardware and the underlying programming model.
2.4.1 Restrictions on Types
Not all types can be supported on current GPU hardware. The amp restriction modifier restricts functions from using unsupported types, in their function signatures or in their function bodies.
We refer to the set of supported types as being amp-compatible. Any type that is referenced in an amp-restricted function must be amp-compatible. Some uses require further restrictions.
2.4.1.1 Type Qualifiers
The volatile type qualifier is not supported in an amp-modified function. A variable or member that is qualified by using volatile may not be declared or accessed in amp-restricted code.
2.4.1.2 Fundamental Types
Of the set of C++ fundamental types, only the following ones are supported in an amp-modified function.
Footnote 2 (to the as-if-runtime dispatch rule in section 2.3.2.1): Compilers are always free to optimize this if they can determine the target statically.
The representation of these types on a device that is running an amp function is identical to that of its host.
Some additional types can be used (or generated as the result type of an expression), but they are not completely supported. These include:
o bool
o std::nullptr_t
These types can be used as local variables, parameters, and return types, but there are limitations that concern their aggregation into compound types (see section 2.4.1.3). They are not considered amp-compatible.
2.4.1.3 Compound Types
The element type of an array must be an amp-compatible type. An array type whose element type is amp-compatible is itself amp-compatible.
Pointers must point to amp-compatible types and/or bool. Pointers to pointers are not supported. No pointer type is considered amp-compatible. Pointers are only supported as local variables and/or function parameters and/or function return types.
References (lvalue and rvalue) must refer to amp-compatible types and/or bool and/or concurrency::array and/or concurrency::graphics::texture. Additionally, references to bool types and/or references to pointers are supported as local variables and/or function parameters and/or return types (as long as the pointer type is itself supported).
Classes, structs, and unions must contain only members whose types are amp-compatible. Furthermore, members must not be bitfields, pointers, or references. As an exception to this rule, classes, structs, and unions are allowed to have members that are references to instances of classes array and texture. Classes, structs, and unions are also allowed to have members of type bool, as long as such members are at least four-byte aligned. Classes may have amp-compatible base classes, but must not have virtual base classes.
Class array_view is an amp-compatible type.
Empty classes (and structs and unions, and pure lambdas) are allowed as local variables or parameters, but not as members of a class or elements of an array.
Pointers to members (C++11 8.3.3) must refer to non-static data members.
Enumeration types must have underlying types that consist of int, unsigned int, long, or unsigned long.
The representation of an amp-compatible compound type (with the exception of pointer and reference) on a device is identical to that of its host.
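Informative: Two of the compound-type rules above can be expressed as compile-time checks in standard C++. The traits has_amp_enum_underlying_type and is_pointer_to_pointer, and the enumerations Color and Small, are hypothetical names for illustration; they are not part of C++ AMP and cover only the enumeration and pointer-to-pointer rules, not the full definition of amp-compatible.

```cpp
#include <cassert>
#include <type_traits>

// Hypothetical trait: true when enumeration E has one of the underlying types
// that this section lists (int, unsigned int, long, unsigned long).
template <typename E>
struct has_amp_enum_underlying_type {
    static_assert(std::is_enum<E>::value, "E must be an enumeration type");
    using U = typename std::underlying_type<E>::type;
    static constexpr bool value =
        std::is_same<U, int>::value || std::is_same<U, unsigned int>::value ||
        std::is_same<U, long>::value || std::is_same<U, unsigned long>::value;
};

// Hypothetical trait: detects pointers to pointers, which are not supported.
template <typename T>
struct is_pointer_to_pointer
    : std::integral_constant<
          bool, std::is_pointer<T>::value &&
                    std::is_pointer<typename std::remove_pointer<T>::type>::value> {};

enum class Color : int { Red, Green };  // underlying type int: allowed
enum class Small : char { A, B };       // underlying type char: not in the list

static_assert(has_amp_enum_underlying_type<Color>::value, "int is allowed");
static_assert(!has_amp_enum_underlying_type<Small>::value, "char is not allowed");
static_assert(is_pointer_to_pointer<int**>::value, "int** is a pointer to a pointer");
static_assert(!is_pointer_to_pointer<float*>::value, "float* has one level of indirection");
```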
2.4.2 Restrictions on Function Declarators
The function declarator (C++11 8.3.5) of an amp-modified function:
- must not have a trailing ellipsis (...) in its parameter list
- must have no parameters, or must have parameters whose types are amp-compatible
- must have a return type that is amp-compatible
- must not be virtual
- must not have a throw specification
- must not have extern "C" linkage when multiple restriction specifiers are present
C++ AMP : Language and Programming Model : Version 0.9 : January 2012
2.4.3 Restrictions on Function Scopes
The function scope of an amp-modified function may contain any valid C++ declaration, statement, or expression, except for those that are specified here.
2.4.3.1 Literals
A C++ AMP program is ill-formed if the value of an integer constant or floating-point constant exceeds the allowable range of any of the above types.
2.4.3.2 Primary Expressions (C++11 5.1)
An identifier or qualified identifier that refers to an object must refer only to:
- a parameter to the function, or
- a local variable that is declared at a block scope in the function, or
- a non-static member of the class of which this function is a member, or
- a static const member that can be reduced to a literal, or
- a captured variable in a lambda expression
2.4.3.3 Lambda Expressions
If a lambda expression appears in the body of an amp-modified function, the amp modifier may be elided and the lambda is still considered an amp lambda. A lambda expression must not capture any context variable by reference, except for context variables of type concurrency::array and concurrency::graphics::texture.
2.4.3.4 Function Calls (C++11 5.2.2)
The target of a function call operator:
- must not be a virtual function
- must not be a pointer to a function
- must not recursively invoke itself or any other function that is directly or indirectly recursive
These restrictions apply to all function-like invocations. These include:
- object constructors and destructors
- overloaded operators, including new and delete
2.4.3.5 Local Declarations
Local declarations must not specify any storage class other than register, auto, or tile_static. Variables must have types that are amp-compatible or bool.
2.4.3.5.1 tile_static Variables
A variable that is declared with the tile_static storage class can be accessed by all threads in a tile (group of threads). The tile_static storage class is valid only within a restrict(amp) context. The storage lifetime of a tile_static variable begins when the execution of a thread in a tile reaches the point of declaration, and ends when the kernel function is exited by the last thread in the tile. Each thread tile that accesses the variable must observe its own separate, per-tile instance of the variable. A tile_static variable declaration does not constitute a barrier. tile_static variables are not initialized by the compiler and assume no default initial values. The tile_static storage class must only be used to declare local (function or block scope) variables. The type of a tile_static variable must not be a pointer or reference type.
A tile_static variable must not have an initializer, and no constructors or destructors will be called for it; its initial contents are undefined.
2.4.3.6 Type-Casting Restrictions
A type-cast must not be used to convert a pointer to an integral type, nor an integral type to a pointer. This restriction applies to reinterpret_cast (C++11 5.2.10) and to C-style casts (C++11 5.4). Casting away const-ness may cause a compiler warning and/or undefined behavior.
2.4.3.7 Miscellaneous Restrictions
The pointer-to-member operators .* and ->* must only be used to access pointer-to-data member objects. Pointer arithmetic must not be performed on pointers to bool values. Furthermore, an amp-restricted function must not contain any of these:
- dynamic_cast or typeid operators
- goto statements or labeled statements
- asm declarations
- function try blocks, try blocks, catch blocks, or throw
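The following is a minimal sketch of how a computation can be shaped to stay within the restrictions of section 2.4.3: iteration replaces recursion, a Boolean status replaces throw, and control flow uses a structured loop rather than goto. It is written in standard C++ with the restrict(amp) specifier omitted so that any compiler accepts it, and the function name checked_factorial is hypothetical.

```cpp
#include <cassert>
#include <climits>

// Computes n! iteratively. Returns false, leaving result untouched,
// when the product would not fit in unsigned int. No recursion, no
// exceptions, no goto, and no pointer-to-integer casts are used.
inline bool checked_factorial(unsigned int n, unsigned int& result) {
    unsigned int acc = 1u;
    for (unsigned int i = 2u; i <= n; ++i) {
        if (acc > UINT_MAX / i) {
            return false; // the next multiplication would overflow
        }
        acc *= i;
    }
    result = acc;
    return true;
}
```

The reference parameter refers to unsigned int, which keeps the signature within the reference rules of section 2.4.1.3. Whether a particular compiler accepts the function under restrict(amp) unchanged is not verified here.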
3 Device Modeling
3.1 The Concept of a Compute Accelerator
A compute accelerator is a hardware capability that is optimized for data-parallel computing. An accelerator might be a device that is attached to a PCIe bus (such as a GPU), or it might be an extended instruction set on the main CPU (such as SSE or AVX). Informative: Future architectures might bridge these two extremes, for example, AMD's Fusion or Intel's Knights Ferry. C++ AMP has functionality for copying data between host and accelerator memories. A copy from accelerator to host is always a synchronization point, unless an asynchronous copy is specified. In general, for optimal performance, memory content should stay on an accelerator for as long as possible. In some cases, accelerator memory and CPU memory are one and the same. Depending on the architecture, there may never be a need to copy between the two physical locations of memory.
3.2 Accelerator
An accelerator is an abstraction of a physical data-parallel-optimized compute node. An accelerator is often a discrete GPU, but it can also be a virtual host-side entity such as the Microsoft DirectX REF device, or WARP (a CPU-side device that is accelerated by using SSE instructions), or it can refer to the CPU itself.
3.2.1 Default Accelerator
C++ AMP supports the notion of a default accelerator, which is an accelerator that is chosen automatically when the program does not choose one explicitly. A user may explicitly create a default accelerator object in one of two ways:
1. Invoke the default constructor:
accelerator def;
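2. Pass the default_accelerator device path to the path constructor:
accelerator def(accelerator::default_accelerator);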
You may also influence which accelerator is chosen as the default by calling accelerator::set_default prior to invoking any operation that would otherwise choose the default. Such operations include the above two calls, invoking parallel_for_each without an explicit accelerator_view argument, creating an array that is not bound to an explicit accelerator_view, and other such operations. If you do not call accelerator::set_default, the default is chosen in an implementation-specific manner.
Microsoft-specific: The Microsoft implementation of C++ AMP uses the following heuristic to select a default accelerator when one is not specified by a call to accelerator::set_default:
1. If the debug runtime is used, prefer an accelerator that supports debugging.
2. If the process environment variable CPPAMP_DEFAULT_ACCELERATOR is set, interpret its value as a device path and prefer the device that corresponds to it.
3. Otherwise, the following criteria are used to determine the best accelerator:
   a. Prefer non-emulated devices.
   b. Prefer the device that has the most available memory.
   c. Prefer the device that is not attached to the display.
3.2.2 Synopsis
class accelerator
{
public:
    static const wchar_t default_accelerator[]; // = L"default"

    // Microsoft-specific:
    static const wchar_t direct3d_warp[];       // = L"direct3d\\warp"
    static const wchar_t direct3d_ref[];        // = L"direct3d\\ref"

    static const wchar_t cpu_accelerator[];     // = L"cpu"

    accelerator();
    explicit accelerator(const wstring& path);
    accelerator(const accelerator& other);

    static vector<accelerator> get_all();
    static bool set_default(const wstring& path);

    accelerator& operator=(const accelerator& other);

    __declspec(property(get)) wstring device_path;
    __declspec(property(get)) unsigned int version; // hiword=major, loword=minor
    __declspec(property(get)) wstring description;
    __declspec(property(get)) bool is_debug;
    __declspec(property(get)) bool is_emulated;
    __declspec(property(get)) bool has_display;
    __declspec(property(get)) bool supports_double_precision;
    __declspec(property(get)) size_t dedicated_memory;
    __declspec(property(get)) accelerator_view default_view;

    accelerator_view create_view();
    accelerator_view create_view(queuing_mode qmode);

    bool operator==(const accelerator& other) const;
    bool operator!=(const accelerator& other) const;
};
class accelerator
Represents a physical accelerated computing device. An object of this type can be created by enumerating the available devices, or by getting the default device, the reference device, or the Warp device.
3.2.3 Static Members
static bool set_default(const wstring& path);
Sets the default accelerator to the device path that is named by the path argument. See the constructor accelerator(const wstring& path) for a description of the allowable path strings.
This establishes a process-wide default accelerator and influences all subsequent operations that might create a default accelerator.
Parameters:
    path    The device path of the default accelerator.
Return Value: A Boolean flag that indicates whether the default was set. This value is false if the default has already been set for this process.
accelerator()
Constructs a new accelerator object that represents the default accelerator, which is usually chosen as the fastest available accelerator. This is equivalent to calling the constructor accelerator(accelerator::default_accelerator). The actual accelerator that is chosen as the default can be affected by calling accelerator::set_default prior to calling this constructor. Parameters: None.
accelerator(const wstring& path)
Constructs a new accelerator object that represents the physical device that is named by the path argument. The path can be one of these:
1. accelerator::default_accelerator (or L"default"), which represents the path of the fastest available accelerator, as chosen by the runtime.
2. accelerator::cpu_accelerator (or L"cpu"), which represents the CPU. A parallel_for_each must not be invoked over this accelerator.
3. A valid device path that uniquely identifies a hardware accelerator that is available on the host system.
Microsoft-specific:
4. accelerator::direct3d_warp (or L"direct3d\\warp"), which represents the WARP accelerator.
5. accelerator::direct3d_ref (or L"direct3d\\ref"), which represents the REF accelerator.
accelerator(const accelerator& other);
Copy-constructs an accelerator object. This function does a shallow copy that has the newly created accelerator object pointing to the same underlying device as the passed accelerator parameter.
Parameters:
    other    The accelerator object to be copied.
Members
static const wchar_t default_accelerator[]
static const wchar_t direct3d_warp[]
static const wchar_t direct3d_ref[]
static const wchar_t cpu_accelerator[]
These are static constant string literals that represent device paths for known accelerators, or in the case of default_accelerator, that direct the runtime to choose an accelerator automatically.
default_accelerator: The string L"default" represents the default accelerator, which directs the runtime to choose the fastest available accelerator. The selection criteria are discussed in section 3.2.1 Default Accelerator.
cpu_accelerator: The string L"cpu" represents the host system. This accelerator is used to provide a location for system-allocated memory such as arrays and staging arrays. It is not a valid target for accelerated computations.
Microsoft-specific:
direct3d_warp: The string L"direct3d\\warp" represents the device path of the CPU-accelerated Warp device. On non-Direct3D platforms, this member may not exist.
direct3d_ref: The string L"direct3d\\ref" represents the software rasterizer, or Reference, device. This particular device is useful for debugging. On non-Direct3D platforms, this member may not exist.
accelerator& operator=(const accelerator& other)
Assigns an accelerator object to this accelerator object and returns a reference to this object. This function does a shallow assignment, after which this accelerator object points to the same underlying device as the passed accelerator parameter.
Parameters:
    other    The accelerator object to be assigned from.
Return Value: A reference to this accelerator object.
__declspec(property(get)) accelerator_view default_view
Returns the default accelerator view that is associated with the accelerator. The queuing_mode of the default accelerator_view is queuing_mode_automatic.
Return Value: The default accelerator_view object that is associated with the accelerator.
accelerator_view create_view(queuing_mode qmode)
Creates and returns a new accelerator view on the accelerator that has the supplied queuing mode.
Parameters:
    qmode    The queuing mode of the accelerator_view to be created. See Queuing Mode.
Return Value: The new accelerator_view object that is created on the compute device.
accelerator_view create_view()
Creates and returns a new accelerator view on the accelerator. Equivalent to create_view(queuing_mode_automatic).
Return Value: The new accelerator_view object that is created on the compute device.
bool operator==(const accelerator& other) const
Compares this accelerator with the passed accelerator object to determine whether they represent the same underlying device.
Parameters:
    other    The accelerator object to be compared against.
Return Value: A Boolean value that indicates whether the passed accelerator object is the same as this accelerator.
bool operator!=(const accelerator& other) const
Compares this accelerator with the passed accelerator object to determine whether they represent different devices.
Parameters:
    other    The accelerator object to be compared against.
Return Value: A Boolean value that indicates whether the passed accelerator object is different from this accelerator.
3.2.6 Properties
The following read-only properties are part of the public interface of the class accelerator, to enable querying for the accelerator characteristics:
__declspec(property(get)) wstring device_path
Returns a system-wide unique device instance path that matches the Device Instance Path property for the device in Device Manager, or one of the predefined path constants direct3d_warp or direct3d_ref.
__declspec(property(get)) wstring description
Returns a short textual description of the accelerator device.
__declspec(property(get)) unsigned int version
Returns a 32-bit unsigned integer that represents the version number of this accelerator. The format of the integer is major.minor, where the major version number is in the high-order 16 bits, and the minor version number is in the low-order 16 bits.
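As an illustration of the packing described above, the following sketch splits a version value into its two 16-bit halves and packs it again. It is plain C++ and does not query a real accelerator; the names version_parts, split_version, and make_version are hypothetical helpers, not part of C++ AMP.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical holder for the two halves of a packed version number.
struct version_parts {
    unsigned int major_part; // taken from the high-order 16 bits
    unsigned int minor_part; // taken from the low-order 16 bits
};

// Splits a packed 32-bit version into its major and minor halves.
inline version_parts split_version(std::uint32_t version) {
    version_parts parts;
    parts.major_part = (version >> 16) & 0xFFFFu;
    parts.minor_part = version & 0xFFFFu;
    return parts;
}

// Packs major and minor halves back into a 32-bit value.
// Both inputs are expected to fit in 16 bits.
inline std::uint32_t make_version(unsigned int major_part, unsigned int minor_part) {
    assert(major_part <= 0xFFFFu && minor_part <= 0xFFFFu);
    return (static_cast<std::uint32_t>(major_part) << 16) |
           static_cast<std::uint32_t>(minor_part);
}
```

For example, split_version(0x000B0001) yields a major part of 11 and a minor part of 1, and make_version(11, 1) returns 0x000B0001.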
__declspec(property(get)) bool has_display
Returns a Boolean value that indicates whether the accelerator is attached to a display.
__declspec(property(get)) size_t dedicated_memory
Returns the amount of dedicated memory (in KB) on an accelerator device. There is no guarantee that this amount of memory is actually available to use.
__declspec(property(get)) bool supports_double_precision
Returns a Boolean value that indicates whether this accelerator supports double-precision (double) computations.
__declspec(property(get)) bool is_debug
Returns a Boolean value that indicates whether the accelerator supports debugging.
__declspec(property(get)) bool is_emulated
Returns a Boolean value that indicates whether the accelerator is emulated. This is true, for example, with the reference accelerator.
3.3 accelerator_view
An accelerator_view represents a logical (isolated) view of an accelerator. One physical compute device may have many logical (isolated) accelerator views. Each accelerator has a default accelerator view, and additional accelerator views may be optionally created by the user. Physical devices must potentially be shared among many client threads. Client threads may choose to cooperatively use the same accelerator_view of an accelerator, or each client may communicate with a compute device through an independent accelerator_view object for isolation from other client threads. An accelerator_view can be created with a queuing mode of immediate or automatic. (See Queuing Mode).
3.3.1 Synopsis
class accelerator_view
{
public:
    accelerator_view() = delete;
    accelerator_view(const accelerator_view& other);
    accelerator_view& operator=(const accelerator_view& other);

    __declspec(property(get)) Concurrency::accelerator accelerator;
    __declspec(property(get)) bool is_debug;
    __declspec(property(get)) unsigned int version;
    __declspec(property(get)) queuing_mode queuing_mode;

    void flush();
    void wait();
    std::shared_future<void> create_marker();

    bool operator==(const accelerator_view& other) const;
    bool operator!=(const accelerator_view& other) const;
};
class accelerator_view
Represents a logical (isolated) accelerator view of a compute accelerator. An object of this type can be obtained by calling the default_view property or create_view member functions on an accelerator object.
3.3.2 Queuing Mode
If the queuing mode is queuing_mode_immediate, then any commands (such as copy or parallel_for_each) are sent to the corresponding accelerator before control is returned to the caller. If the queuing mode is queuing_mode_automatic, then such commands are queued up on a command queue that corresponds to this accelerator_view. Commands are not actually sent to the device until flush() is called.
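The difference between the two modes can be pictured with a small host-side model. This is only a sketch under stated assumptions: toy_queuing_mode and toy_command_queue are hypothetical names, a command is any callable, and sending a command to the device is modeled by simply running it. A real runtime may also send queued commands on its own, as the description of flush() notes.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <utility>
#include <vector>

// Hypothetical stand-in for the two queuing modes.
enum class toy_queuing_mode { immediate, automatic };

// Toy model of a command queue; not the accelerator_view class.
class toy_command_queue {
public:
    explicit toy_command_queue(toy_queuing_mode mode) : mode_(mode) {}

    // Immediate mode runs the command before returning to the caller.
    // Automatic mode only records it for a later flush.
    void submit(std::function<void()> command) {
        if (mode_ == toy_queuing_mode::immediate) {
            command();
        } else {
            pending_.push_back(std::move(command));
        }
    }

    // Runs everything that is still queued, in submission order.
    // Has nothing to do in immediate mode because nothing is ever queued.
    void flush() {
        std::vector<std::function<void()>> batch;
        batch.swap(pending_);
        for (auto& command : batch) {
            command();
        }
    }

    // Number of commands that have been submitted but not yet run.
    std::size_t pending() const { return pending_.size(); }

private:
    toy_queuing_mode mode_;
    std::vector<std::function<void()>> pending_;
};
```

In this model, a command submitted to an automatic queue has no visible effect until flush() is called, whereas the same command on an immediate queue takes effect before submit() returns.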
3.3.3 Constructors
An accelerator_view object may only be constructed by using a copy or move constructor. There is no default constructor.
accelerator_view(const accelerator_view& other)
Copy-constructs an accelerator_view object. This function does a shallow copy that has the newly created accelerator_view object pointing to the same underlying view as the other parameter.
Parameters:
    other    The accelerator_view object to be copied.
__declspec(property(get)) queuing_mode queuing_mode
Returns the queuing mode that this accelerator_view was created with. See Queuing Mode. Return Value: The queuing mode.
__declspec(property(get)) unsigned int version
Returns a 32-bit unsigned integer that represents the version number of this accelerator view. The format of the integer is major.minor, where the major version number is in the high-order 16 bits, and the minor version number is in the low-order 16 bits. The version of the accelerator view is usually the same as that of the parent accelerator. Microsoft-specific: The version may differ from the accelerator only when the accelerator_view is created from a Direct3D device by using the interop API.
__declspec(property(get)) Concurrency::accelerator accelerator
Returns the accelerator that this accelerator_view was created on.
__declspec(property(get)) bool is_debug
Returns a Boolean value that indicates whether the accelerator_view supports debugging through extensive error reporting. The is_debug property of the accelerator view is usually the same as that of the parent accelerator. The value may differ from the accelerator only when the accelerator_view is created from a Direct3D device by using the interop API.
void wait()
Performs a blocking wait for completion of all commands that are submitted to the accelerator view prior to calling wait. If the queuing_mode is queuing_mode_immediate, this function returns immediately without blocking. Return Value: None
void flush()
Sends the queued-up commands in the accelerator_view to the device for execution. An accelerator_view internally maintains a buffer of commands such as data transfers between the host memory and device buffers, and kernel invocations (parallel_for_each calls). This member function sends the commands to the device for processing. Normally, these commands are sent to the GPU automatically whenever the runtime determines that they must be, for example, when the command buffer is full or when it is waiting for transfer of data from the device buffers to host memory. The flush member function sends the commands manually to the device. Calling this member function incurs an overhead and must be used with discretion. A typical use of this member function is when the CPU waits for an arbitrary amount of time and wants to force the execution of queued device commands in the meantime. Because flush operates asynchronously, it can return either before or after the device finishes executing the buffered commands. However, the commands always complete eventually. If the queuing_mode is queuing_mode_immediate, this function does nothing. Return Value: None
std::shared_future<void> create_marker()
Inserts a marker event into the command queue of the accelerator_view. This marker is returned as a std::shared_future<void>. When all commands that were submitted prior to the marker event creation have completed, the future unblocks.
Return Value: A future that can be waited on, and will block until the current batch of commands has completed.
bool operator==(const accelerator_view& other) const
Compares this accelerator_view with the passed accelerator_view object to determine whether they represent the same underlying object.
Parameters:
    other    The accelerator_view object to be compared against.
Return Value: A Boolean value that indicates whether the passed accelerator_view object is the same as this accelerator_view.
3.4
The physical compute devices can be enumerated or selected by calling the following static member function of the class accelerator.
3.4.1 Synopsis
vector<accelerator> accelerator::get_all();
As an example, if you want to enumerate the available accelerators and select the one that is the Warp accelerator, you could use the following code:
vector<accelerator> gpus = accelerator::get_all();
auto warpIter = std::find_if(gpus.begin(), gpus.end(), [] (accelerator& accl) {
    return accl.device_path == accelerator::direct3d_warp;
});
As a second example, if you want to find an accelerator that is not emulated and is not attached to a display, you could do this:
vector<accelerator> gpus = accelerator::get_all();
auto headlessIter = std::find_if(gpus.begin(), gpus.end(), [] (accelerator& accl) {
    return !accl.has_display && !accl.is_emulated;
});
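The same standard-algorithm pattern extends to ranking devices by several properties at once. The sketch below is plain C++ so that it compiles without amp.h: accel_info is a hypothetical stand-in for the accelerator properties device_path, is_emulated, has_display, and dedicated_memory, and pick_preferred is a hypothetical helper. It treats the three criteria listed in section 3.2.1 as successive tie-breakers, which is an assumption rather than something the text specifies.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical stand-in for a few read-only accelerator properties.
struct accel_info {
    std::wstring device_path;
    bool is_emulated;
    bool has_display;
    std::size_t dedicated_memory; // in KB
};

// Returns an iterator to the preferred entry, or candidates.end() when
// the list is empty. Preference order (assumed): non-emulated first,
// then more dedicated memory, then not attached to a display.
inline std::vector<accel_info>::const_iterator
pick_preferred(const std::vector<accel_info>& candidates) {
    return std::max_element(
        candidates.begin(), candidates.end(),
        [](const accel_info& a, const accel_info& b) {
            // Returns true when a is a worse choice than b.
            if (a.is_emulated != b.is_emulated) {
                return a.is_emulated;
            }
            if (a.dedicated_memory != b.dedicated_memory) {
                return a.dedicated_memory < b.dedicated_memory;
            }
            return a.has_display && !b.has_display;
        });
}
```

With real accelerator objects, the same comparison could be applied to the vector returned by accelerator::get_all(); that variant is not shown because it cannot be compiled outside a C++ AMP toolchain.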
In C++ AMP, you can express solutions to data-parallel problems in terms of N-dimensional data aggregates and operations over them. Keep in mind the concept of an array. An array associates values in an index space with an element type. For example, an array could be the set of pixels on a screen where each pixel is represented by four 32-bit values: Red, Green, Blue, and Alpha. The index space would then be the screen resolution, for example, all points: { {y, x} | 0 <= y < 1200, 0 <= x < 1600, x and y are integers }.
Index space properties:
1. An affine space is the iteration space of an affine loop nest.
2. An index point is a point in N-space {i0, i1, ..., in}, where each ik is a 32-bit signed integer.
3. An index space is the set of all index points in an affine space.
4. A general array is defined over an index space; that is, for every index point in the index space, there is an associated array element.
A canonical index space is an index space that has sides that are parallel to the coordinate axes in N-space. If the C++ AMP compute node is a GPU, then before a kernel is computed, the index space of every array must be transformed into a canonical index space.
4.1 index<N>
Defines an N-dimensional index point; this may also be viewed as a vector that is based at the origin in N-space. The index<N> type represents an N-dimensional vector of int that specifies a unique position in an N-dimensional space. The values in the coordinate vector are ordered from most-significant to least-significant. Therefore, in Cartesian 3-dimensional space, the index vector (7,5,3) represents the position at (z=7, y=5, x=3). The position is relative to the origin in the N-dimensional space, and can contain negative component values. Informative: As a scoping decision, we decided to limit specializations of index, extent, and other properties to 1, 2, and 3 dimensions (not 4, as before). This also applies to arrays and array_views. General N-dimensional support is still provided with slightly reduced convenience.
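To make the component ordering concrete, the sketch below flattens a three-component position, given most-significant first, into a single offset. It assumes the common row-major reading in which the least-significant component varies fastest; flatten3 and its bounds argument are hypothetical and are not part of index<N>.

```cpp
#include <cassert>

// Flattens a position (z, y, x), given most-significant first, into a
// row-major offset within a space of size (depth, height, width).
// Every component must satisfy 0 <= idx[k] < bounds[k]; negative
// components are valid for an index in general but not for flattening.
inline int flatten3(const int idx[3], const int bounds[3]) {
    int offset = 0;
    for (int k = 0; k < 3; ++k) {
        assert(idx[k] >= 0 && idx[k] < bounds[k]);
        offset = offset * bounds[k] + idx[k];
    }
    return offset;
}
```

Under these assumptions, the position (7,5,3) inside bounds (8,6,4) maps to offset 191, the last element of the 192-element space, and stepping x by one changes the offset by one.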
4.1.1 Synopsis
template <int N>
class index {
public:
    static const int rank = N;
    typedef int value_type;

    index() restrict(amp,cpu);
    index(const index& other) restrict(amp,cpu);
    explicit index(int i0) restrict(amp,cpu); // N==1
    index(int i0, int i1) restrict(amp,cpu); // N==2
    index(int i0, int i1, int i2) restrict(amp,cpu); // N==3
    explicit index(const int components[]) restrict(amp,cpu);

    index& operator=(const index& other) restrict(amp,cpu);

    int operator[](unsigned int c) const restrict(amp,cpu);
    int& operator[](unsigned int c) restrict(amp,cpu);

    template <int N>
    friend bool operator==(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend bool operator!=(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator+(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator-(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu);

    index& operator+=(const index& rhs) restrict(amp,cpu);
    index& operator-=(const index& rhs) restrict(amp,cpu);

    template <int N>
    friend index<N> operator+(const index<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator+(int lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator-(const index<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator-(int lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator*(const index<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator*(int lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator/(const index<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator/(int lhs, const index<N>& rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator%(const index<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N>
    friend index<N> operator%(int lhs, const index<N>& rhs) restrict(amp,cpu);

    index& operator+=(int rhs) restrict(amp,cpu);
    index& operator-=(int rhs) restrict(amp,cpu);
    index& operator*=(int rhs) restrict(amp,cpu);
    index& operator/=(int rhs) restrict(amp,cpu);
    index& operator%=(int rhs) restrict(amp,cpu);

    index& operator++() restrict(amp,cpu);
    index operator++(int) restrict(amp,cpu);
    index& operator--() restrict(amp,cpu);
    index operator--(int) restrict(amp,cpu);
};
index(const index& other) restrict(amp,cpu)
Copy constructor. Constructs a new index<N> from the supplied argument other. Parameters: other
explicit index(int i0) restrict(amp,cpu) // N==1
index(int i0, int i1) restrict(amp,cpu) // N==2
index(int i0, int i1, int i2) restrict(amp,cpu) // N==3
Constructs an index<N> that has the coordinate values provided by i0, i1, and i2, as applicable. These are specialized constructors that are only valid when the rank N of the index is 1, 2, or 3. Invoking a specialized constructor whose argument count is not equal to N causes a compilation error.
Parameters:
    i0 [, i1 [, i2 ] ]    The component values of the index vector.
explicit index(const int components[]) restrict(amp,cpu)
Constructs an index<N> that has the coordinate values that are provided by the array of int component values. If the coordinate array length is not equal to N, the behavior is undefined. If the array value is NULL or not a valid pointer, the behavior is undefined. Parameters: components
int operator[](unsigned int c) const restrict(amp,cpu) int& operator[](unsigned int c) restrict(amp,cpu)
Returns the index component value at position c.
Parameters:
    c    The dimension axis whose coordinate is to be accessed.
Return Value: The component value at position c.
Compares two objects of index<N> (operator== and operator!=). The expression leftIdx == rightIdx is true if leftIdx[i] == rightIdx[i] for every i from 0 to N-1; leftIdx != rightIdx is true whenever leftIdx == rightIdx is false.
Parameters:
    lhs    The left-hand index<N> to be compared.
    rhs    The right-hand index<N> to be compared.
template <int N> friend index<N> operator+(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
template <int N> friend index<N> operator-(const index<N>& lhs, const index<N>& rhs) restrict(amp,cpu)
Binary arithmetic operations that produce a new index<N> that is the result of performing the corresponding pair-wise binary arithmetic operation on the elements of the operands. The result index<N> is such that for a given operator op (+ or -), result[i] = leftIdx[i] op rightIdx[i] for every i from 0 to N-1.
Parameters:
    lhs    The left-hand index<N> of the arithmetic operation.
    rhs    The right-hand index<N> of the arithmetic operation.
index& operator+=(const index& rhs) restrict(amp,cpu)
index& operator-=(const index& rhs) restrict(amp,cpu)
For a given operator op (+ or -), produces the same effect as (*this) = (*this) op rhs; the return value is *this. Parameters: rhs
template <int N> friend index<N> operator+(const index<N>& idx, int value) restrict(amp,cpu)
template <int N> friend index<N> operator+(int value, const index<N>& idx) restrict(amp,cpu)
template <int N> friend index<N> operator-(const index<N>& idx, int value) restrict(amp,cpu)
template <int N> friend index<N> operator-(int value, const index<N>& idx) restrict(amp,cpu)
template <int N> friend index<N> operator*(const index<N>& idx, int value) restrict(amp,cpu)
template <int N> friend index<N> operator*(int value, const index<N>& idx) restrict(amp,cpu)
template <int N> friend index<N> operator/(const index<N>& idx, int value) restrict(amp,cpu)
template <int N> friend index<N> operator/(int value, const index<N>& idx) restrict(amp,cpu)
template <int N> friend index<N> operator%(const index<N>& idx, int value) restrict(amp,cpu)
template <int N> friend index<N> operator%(int value, const index<N>& idx) restrict(amp,cpu)
Binary arithmetic operations that produce a new index<N> that is the result of performing the corresponding binary arithmetic operation on the elements of the index operands. The result index<N> is such that for a given operator op (+, -, *, /, or %), result[i] = idx[i] op value or result[i] = value op idx[i] for every i from 0 to N-1. Parameters: idx value
index& operator+=(int value) restrict(amp,cpu)
index& operator-=(int value) restrict(amp,cpu)
index& operator*=(int value) restrict(amp,cpu)
index& operator/=(int value) restrict(amp,cpu)
index& operator%=(int value) restrict(amp,cpu)
For a given operator , produces the same effect as (*this) = (*this) value;
index& operator++() restrict(amp,cpu)
index operator++(int) restrict(amp,cpu)
index& operator--() restrict(amp,cpu)
index operator--(int) restrict(amp,cpu)
For a given operator ⊕ (+ for increment, - for decrement), produces the same effect as (*this) = (*this) ⊕ 1; For prefix increment and decrement, the return value is *this. Otherwise, a new index<N> is returned.
4.2 extent<N>
The extent<N> type represents an N-dimensional vector of int that specifies the bounds of an N-dimensional space that has an origin of 0. The values in the coordinate vector are ordered from most-significant to least-significant. Therefore, in Cartesian 3-dimensional space, the extent vector (7,5,3) represents a space where the z coordinate ranges from 0 to 6, the y coordinate ranges from 0 to 4, and the x coordinate ranges from 0 to 2 (that is, 7, 5, and 3 values, respectively).
4.2.1 Synopsis
template <int N>
class extent {
public:
    static const int rank = N;
    typedef int value_type;

    extent() restrict(amp,cpu);
    extent(const extent& other) restrict(amp,cpu);
    explicit extent(int e0) restrict(amp,cpu); // N==1
    extent(int e0, int e1) restrict(amp,cpu); // N==2
    extent(int e0, int e1, int e2) restrict(amp,cpu); // N==3
    explicit extent(const int components[]) restrict(amp,cpu);

    extent& operator=(const extent& other) restrict(amp,cpu);

    int operator[](unsigned int c) const restrict(amp,cpu);
    int& operator[](unsigned int c) restrict(amp,cpu);

    int size() const restrict(amp,cpu);
    bool contains(const index<N>& idx) const restrict(amp,cpu);

    template <int D0> tiled_extent<D0> tile() const;
    template <int D0, int D1> tiled_extent<D0,D1> tile() const;
    template <int D0, int D1, int D2> tiled_extent<D0,D1,D2> tile() const;

    extent operator+(const index<N>& idx) restrict(amp,cpu);
    extent operator-(const index<N>& idx) restrict(amp,cpu);

    template <int N> friend bool operator==(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu);
    template <int N> friend bool operator!=(const extent<N>& lhs, const extent<N>& rhs) restrict(amp,cpu);

    template <int N> friend extent<N> operator+(const extent<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator+(int lhs, const extent<N>& rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator-(const extent<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator-(int lhs, const extent<N>& rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator*(const extent<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator*(int lhs, const extent<N>& rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator/(const extent<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator/(int lhs, const extent<N>& rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator%(const extent<N>& lhs, int rhs) restrict(amp,cpu);
    template <int N> friend extent<N> operator%(int lhs, const extent<N>& rhs) restrict(amp,cpu);

    extent& operator+=(int rhs) restrict(amp,cpu);
    extent& operator-=(int rhs) restrict(amp,cpu);
    extent& operator*=(int rhs) restrict(amp,cpu);
    extent& operator/=(int rhs) restrict(amp,cpu);
    extent& operator%=(int rhs) restrict(amp,cpu);

    extent& operator++() restrict(amp,cpu);
    extent operator++(int) restrict(amp,cpu);
    extent& operator--() restrict(amp,cpu);
    extent operator--(int) restrict(amp,cpu);
};
extent(const extent& other) restrict(amp,cpu)
Copy constructor. Constructs a new extent<N> from the supplied argument other.
Parameters:
other  An object of type extent<N> from which to initialize this new extent.
explicit extent(int e0) restrict(amp,cpu) // N==1
extent(int e0, int e1) restrict(amp,cpu) // N==2
extent(int e0, int e1, int e2) restrict(amp,cpu) // N==3
Constructs an extent<N> that has the coordinate values that are provided by e0, e1, and e2. These are specialized constructors that are only valid when the rank of the extent N ∈ {1,2,3}. Invoking a specialized constructor whose argument count ≠ N causes a compilation error.
Parameters:
e0 [, e1 [, e2 ] ]  The component values of the extent vector.
explicit extent(const int components[]) restrict(amp,cpu);
Constructs an extent<N> with the coordinate values provided by the array of int component values. If the coordinate array length ≠ N, the behavior is undefined. If the array value is NULL or not a valid pointer, the behavior is undefined.
Parameters:
components  An array of N int values.
int operator[](unsigned int c) const restrict(amp,cpu)
int& operator[](unsigned int c) restrict(amp,cpu)
Returns the extent component value at position c. Parameters: c The dimension axis whose coordinate is to be accessed. Return Value: The component value at position c.
bool contains(const index<N>& idx) const restrict(amp,cpu)
Tests whether the index idx is properly contained in this extent (with an assumed origin of zero).
Parameters:
idx  An object of type index<N>.
Return Value: Returns true if idx is contained in the space that is defined by this extent (with an assumed origin of zero).
int size() const restrict(amp,cpu)
This member function returns the total linear size of this extent<N> (in units of elements), which is computed as: extent[0] * extent[1] * … * extent[N-1]
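The containment test and the linear size can be sketched in portable C++ as follows. This is an illustration only: `Extent`, `Index`, `contains`, and `linear_size` are hypothetical names, and the half-open bound 0 <= idx[i] < extent[i] is an assumption that follows from the zero origin, not a quotation of this specification.

```cpp
#include <cassert>

// Hypothetical stand-ins for extent<N> and index<N>.
template <int N> struct Extent { int e[N]; };
template <int N> struct Index  { int c[N]; };

// Assumed rule: idx is inside ext when 0 <= idx[i] < ext[i] in every dimension.
template <int N>
bool contains(const Extent<N>& ext, const Index<N>& idx) {
    for (int i = 0; i < N; ++i) {
        if (idx.c[i] < 0 || idx.c[i] >= ext.e[i]) return false;
    }
    return true;
}

// Product of all components: ext[0] * ext[1] * ... * ext[N-1].
template <int N>
int linear_size(const Extent<N>& ext) {
    int total = 1;
    for (int i = 0; i < N; ++i) total *= ext.e[i];
    return total;
}
```

For the extent (7,5,3), the linear size is 105, the index (6,4,2) is the last contained element, and (7,0,0) lies outside.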
template <int D0> tiled_extent<D0> tile() const
template <int D0, int D1> tiled_extent<D0,D1> tile() const
template <int D0, int D1, int D2> tiled_extent<D0,D1,D2> tile() const
Produces a tiled_extent object that has the tile extents that are given by D0, D1, and D2.
tile<D0,D1,D2>() is only supported on extent<3>. It produces a compile-time error if it is used on an extent where N ≠ 3. tile<D0,D1>() is only supported on extent<2>. It produces a compile-time error if it is used on an extent where N ≠ 2. tile<D0>() is only supported on extent<1>. It produces a compile-time error if it is used on an extent where N ≠ 1.
Compares two objects of extent<N>. The expression leftExt ⊕ rightExt is true if leftExt[i] ⊕ rightExt[i] for every i from 0 to N-1.
Parameters:
lhs  The left-hand extent<N> to be compared.
rhs  The right-hand extent<N> to be compared.
extent<N> operator+(const index<N>& idx) restrict(amp,cpu)
extent<N> operator-(const index<N>& idx) restrict(amp,cpu)
Adds (or subtracts) an object of type index<N> to (or from) this extent to form a new extent. The result extent<N> is such that for a given operator ⊕, result[i] = this[i] ⊕ idx[i].
Parameters:
idx  The right-hand index<N> to be added or subtracted.
template <int N> friend extent<N> operator+(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N> friend extent<N> operator+(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N> friend extent<N> operator-(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N> friend extent<N> operator-(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N> friend extent<N> operator*(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N> friend extent<N> operator*(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N> friend extent<N> operator/(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N> friend extent<N> operator/(int value, const extent<N>& ext) restrict(amp,cpu)
template <int N> friend extent<N> operator%(const extent<N>& ext, int value) restrict(amp,cpu)
template <int N> friend extent<N> operator%(int value, const extent<N>& ext) restrict(amp,cpu)
Binary arithmetic operations that produce a new extent<N> that is the result of performing the corresponding binary arithmetic operation on the elements of the extent operand and the int operand. The result extent<N> is such that for a given operator ⊕, result[i] = ext[i] ⊕ value or result[i] = value ⊕ ext[i] for every i from 0 to N-1.
Parameters:
ext  The extent<N> operand.
value  The int operand.
extent& operator+=(int value) restrict(amp,cpu)
extent& operator-=(int value) restrict(amp,cpu)
extent& operator*=(int value) restrict(amp,cpu)
extent& operator/=(int value) restrict(amp,cpu)
extent& operator%=(int value) restrict(amp,cpu)
For a given operator ⊕, produces the same effect as (*this) = (*this) ⊕ value. The return value is *this.
Parameters:
value  The right-hand int of the arithmetic operation.
extent& operator++() restrict(amp,cpu)
extent operator++(int) restrict(amp,cpu)
extent& operator--() restrict(amp,cpu)
extent operator--(int) restrict(amp,cpu)
For a given operator ⊕ (+ for increment, - for decrement), produces the same effect as (*this) = (*this) ⊕ 1. For prefix increment and decrement, the return value is *this. Otherwise, a new extent<N> is returned.
4.3 tiled_extent<D0,D1,D2>
A tiled_extent is an extent of 1 to 3 dimensions that also subdivides the index space into 1-, 2-, or 3-dimensional tiles. It has three specialized forms: tiled_extent<D0>, tiled_extent<D0,D1>, and tiled_extent<D0,D1,D2>, where D0, D1, and D2 specify the positive length of the tile along each dimension, with D0 being the most-significant dimension and D2 being the least-significant. Partial template specializations are provided to represent 2-D and 1-D tiled extents. A tiled_extent can be formed from an extent by calling extent<N>::tile<D0,D1,D2>() or one of the other two specializations of extent<N>::tile(). A tiled_extent exposes much the same interface as an extent does.
4.3.1 Synopsis
template <int D0, int D1=0, int D2=0>
class tiled_extent : public extent<3> {
public:
    static const int rank = 3;

    tiled_extent();
    tiled_extent(const tiled_extent& other);
    tiled_extent(const extent<3>& extent);

    tiled_extent& operator=(const tiled_extent& other);

    int size() const restrict(amp,cpu);
    bool contains(const index<3>& idx) const restrict(amp,cpu);

    tiled_extent pad() const;
    tiled_extent truncate() const;

    __declspec(property(get=get_tile_extent)) extent<3> tile_extent;
    extent<3> get_tile_extent() const restrict(amp,cpu);

    static const int tile_dim0 = D0;
    static const int tile_dim1 = D1;
    static const int tile_dim2 = D2;

    friend bool operator==(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
    friend bool operator!=(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
};

template <int D0, int D1>
class tiled_extent<D0,D1,0> : public extent<2> {
public:
    static const int rank = 2;

    tiled_extent() restrict(amp,cpu);
    tiled_extent(const tiled_extent& other) restrict(amp,cpu);
    tiled_extent(const extent<2>& extent) restrict(amp,cpu);

    tiled_extent& operator=(const tiled_extent& other);

    int size() const restrict(amp,cpu);
    bool contains(const index<2>& idx) const restrict(amp,cpu);

    tiled_extent pad() const;
    tiled_extent truncate() const;

    __declspec(property(get=get_tile_extent)) extent<2> tile_extent;
    extent<2> get_tile_extent() const restrict(amp,cpu);

    static const int tile_dim0 = D0;
    static const int tile_dim1 = D1;

    friend bool operator==(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
    friend bool operator!=(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
};

template <int D0>
class tiled_extent<D0,0,0> : public extent<1> {
public:
    static const int rank = 1;

    tiled_extent();
    tiled_extent(const tiled_extent& other);
    tiled_extent(const extent<1>& extent);

    tiled_extent& operator=(const tiled_extent& other);

    int size() const restrict(amp,cpu);
    bool contains(const index<1>& idx) const restrict(amp,cpu);

    tiled_extent pad() const;
    tiled_extent truncate() const;

    __declspec(property(get=get_tile_extent)) extent<1> tile_extent;
    extent<1> get_tile_extent() const restrict(amp,cpu);

    static const int tile_dim0 = D0;

    friend bool operator==(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
    friend bool operator!=(const tiled_extent& lhs, const tiled_extent& rhs) restrict(amp,cpu);
};
template <int D0, int D1=0, int D2=0> class tiled_extent
template <int D0, int D1> class tiled_extent<D0,D1,0>
template <int D0> class tiled_extent<D0,0,0>
Represents an extent that is subdivided into 1-, 2-, or 3-dimensional tiles.
Template Arguments
D0, D1, D2  The length of the tile in each specified dimension, where D0 is the most-significant dimension and D2 is the least-significant.
tiled_extent()
Default constructor. The origin and extent are default-constructed and are therefore zero.
Parameters: None.
tiled_extent(const tiled_extent& other)
Copy constructor. Constructs a new tiled_extent from the supplied argument other.
Parameters:
other  An object of type tiled_extent from which to initialize this new extent.
tiled_extent(const extent<N>& extent)
Constructs a tiled_extent<N> with the extent extent. The origin is default-constructed and is therefore zero. Notice that this constructor allows implicit conversions from extent<N> to tiled_extent<N>.
Parameters:
extent  The extent of this tiled_extent.
bool contains(const index<N>& idx) const restrict(amp,cpu)
Tests whether the index idx is properly contained in the origin and extent of this tiled extent.
Parameters:
idx  An object of type index<N>.
Return Value: Returns true if idx is contained in the space that is defined by this tiled extent.
int size() const restrict(amp,cpu)
Returns the total linear size of this tiled extent (in units of elements), which is computed as: extent[0] * extent[1] * … * extent[N-1]
tiled_extent pad() const
Returns a new tiled_extent that has the extents adjusted up to be evenly divisible by the tile dimensions. The origin of the new tiled_extent is the same as the origin of this one.
tiled_extent truncate() const
Returns a new tiled_extent that has the extents adjusted down to be evenly divisible by the tile dimensions. The origin of the new tiled_extent is the same as the origin of this one.
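The adjustments that pad() and truncate() describe amount to rounding each extent component to a multiple of the matching tile length. The sketch below shows that arithmetic for a single component. `pad_component` and `truncate_component` are hypothetical helper names, and the code assumes a non-negative extent and a positive tile length.

```cpp
#include <cassert>

// Rounds one extent component up to the next multiple of the tile length.
// Assumes extent >= 0 and tile > 0.
int pad_component(int extent, int tile) {
    return ((extent + tile - 1) / tile) * tile;
}

// Rounds one extent component down to the previous multiple of the tile length.
// Assumes extent >= 0 and tile > 0.
int truncate_component(int extent, int tile) {
    return (extent / tile) * tile;
}
```

For an extent component of 22 and a tile length of 5, padding gives 25 and truncation gives 20. A component that is already a multiple of the tile length is left unchanged by both.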
__declspec(property(get=get_tile_extent)) extent<N> tile_extent
Returns an instance of an extent<N> that captures the values of the tiled_extent template arguments D0, D1, and D2. For example:
tiled_extent<64,16,4> tg;
extent<3> myTileExtent = tg.tile_extent;
assert(myTileExtent[0] == 64); // most-significant (z)
assert(myTileExtent[1] == 16); // y
assert(myTileExtent[2] == 4);  // least-significant (x)
extent<1> get_tile_extent() const restrict(amp,cpu); // for N==1
extent<2> get_tile_extent() const restrict(amp,cpu); // for N==2
extent<3> get_tile_extent() const restrict(amp,cpu); // for N==3
This is a getter member function for the tile_extent property. It returns extent<1> for a 1-dimensional tiled_extent, extent<2> for a 2-dimensional tiled_extent, and extent<3> for a 3-dimensional tiled_extent. The returned extent holds the values that were passed as the tiled_extent template arguments.
static const int tile_dim0
static const int tile_dim1
static const int tile_dim2
These constants enable access to the template arguments of tiled_extent.
lhs ⊕ rhs is true if lhs.extent ⊕ rhs.extent and lhs.origin ⊕ rhs.origin, where ⊕ is == or !=.
Parameters:
lhs  The left-hand tiled_extent to be compared.
rhs  The right-hand tiled_extent to be compared.
4.4 tiled_index<D0,D1,D2>
A tiled_index is a set of indices of 1 to 3 dimensions that have been subdivided into 1-, 2-, or 3-dimensional tiles in a tiled_extent. It has three specialized forms: tiled_index<D0>, tiled_index<D0,D1>, and tiled_index<D0,D1,D2>, where D0, D1, and D2 specify the length of the tile along each dimension, with D0 being the most-significant dimension and D2 being the least-significant. Partial template specializations are provided to represent 2-D and 1-D tiled indices.
A tiled_index is implicitly convertible to an index<N>, where the implicit index represents the global index.
A tiled_index contains 4 member indices that are related to one another mathematically and help the user pinpoint a global index to an index in a tiled space. A tiled_index contains a global index into an extent space. The other indices obey the following relations:
.local ≡ .global % (D0,D1,D2)
.tile ≡ .global / (D0,D1,D2)
.tile_origin ≡ .global - .local
This is shown in the following example and diagram:
parallel_for_each(extent<2>(20,24).tile<5,4>(), [&](tiled_index<5,4> ti) { /* ... */ });
[Diagram: a grid of 20 rows (numbered 00 to 19) by 24 columns (numbered 00 to 23), one cell per thread, divided into 5-by-4 tiles.]
1. Each cell in the diagram represents one thread that is scheduled by the parallel_for_each call. As with the non-tiled parallel_for_each, notice that the number of threads that are scheduled is given by the extent parameter to the parallel_for_each call.
2. In vector notation, the total number of tiles that are scheduled is <20,24> / <5,4> = <4,6>, which we can observe in the diagram as 4 tiles along the vertical axis and 6 tiles along the horizontal axis.
3. The tile that is shown in red is tile number <0,0>. The tile in yellow is tile number <1,2>.
4. The thread in blue:
   a. Has a global id of <5,8>.
   b. Has a local id of <0,0> within its tile. That is, it lies on the origin of the tile.
5. The thread in green:
   a. Has a global id of <6,9>.
   b. Has a local id of <1,1> within its tile.
   c. The blue thread (number <5,8>) is the tile origin of the green thread.
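The relations between .global, .local, .tile, and .tile_origin can be checked numerically against the green thread described in the list. The following portable C++ sketch is only an illustration: `Index2`, `TiledIndex2`, and `make_tiled_index` are hypothetical names, and the code covers the 2-dimensional case only.

```cpp
#include <cassert>

// Hypothetical 2-D stand-ins; these are not the C++ AMP types.
struct Index2 { int c[2]; };
struct TiledIndex2 { Index2 global, local, tile, tile_origin; };

// Derives the dependent indices from a global index for D0-by-D1 tiles.
template <int D0, int D1>
TiledIndex2 make_tiled_index(Index2 global) {
    const int dims[2] = {D0, D1};
    TiledIndex2 t;
    t.global = global;
    for (int i = 0; i < 2; ++i) {
        t.local.c[i] = global.c[i] % dims[i];                 // .local = .global % (D0,D1)
        t.tile.c[i] = global.c[i] / dims[i];                  // .tile = .global / (D0,D1)
        t.tile_origin.c[i] = global.c[i] - t.local.c[i];      // .tile_origin = .global - .local
    }
    return t;
}
```

For the global index <6,9> with 5-by-4 tiles this yields a local index of <1,1>, tile <1,2>, and tile origin <5,8>, which matches the green thread.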
4.4.1 Synopsis
template <int D0, int D1=0, int D2=0>
class tiled_index {
public:
    static const int rank = 3;

    const index<3> global;
    const index<3> local;
    const index<3> tile;
    const index<3> tile_origin;
    const tile_barrier barrier;

    tiled_index(const index<3>& global,
                const index<3> local,
                const index<3> tile,
                const index<3> tile_origin,
                const tile_barrier& barrier) restrict(amp,cpu);
    tiled_index(const tiled_index& other) restrict(amp,cpu);

    operator index<3>() const restrict(amp,cpu);

    extent<3> get_tile_extent() const restrict(amp,cpu);
    __declspec(property(get=get_tile_extent)) extent<3> tile_extent;

    static const int tile_dim0 = D0;
    static const int tile_dim1 = D1;
    static const int tile_dim2 = D2;
};

template <int D0, int D1>
class tiled_index<D0,D1,0> {
public:
    static const int rank = 2;

    const index<2> global;
    const index<2> local;
    const index<2> tile;
    const index<2> tile_origin;
    const tile_barrier barrier;

    tiled_index(const index<2>& global,
                const index<2> local,
                const index<2> tile,
                const index<2> tile_origin,
                const tile_barrier& barrier) restrict(amp,cpu);
    tiled_index(const tiled_index& other) restrict(amp,cpu);

    operator index<2>() const restrict(amp,cpu);

    extent<2> get_tile_extent() const restrict(amp,cpu);
    __declspec(property(get)) extent<2> tile_extent;

    static const int tile_dim0 = D0;
    static const int tile_dim1 = D1;
};

template <int D0>
class tiled_index<D0,0,0> {
public:
    static const int rank = 1;

    const index<1> global;
    const index<1> local;
    const index<1> tile;
    const index<1> tile_origin;
    const tile_barrier barrier;

    tiled_index(const index<1>& global,
                const index<1> local,
                const index<1> tile,
                const index<1> tile_origin,
                const tile_barrier& barrier) restrict(amp,cpu);
    tiled_index(const tiled_index& other) restrict(amp,cpu);

    operator index<1>() const restrict(amp,cpu);

    extent<1> get_tile_extent() const restrict(amp,cpu);
    __declspec(property(get)) extent<1> tile_extent;

    static const int tile_dim0 = D0;
};
template <int D0, int D1=0, int D2=0> class tiled_index
template <int D0, int D1> class tiled_index<D0,D1,0>
template <int D0> class tiled_index<D0,0,0>
Represents a set of related indices that are subdivided into 1-, 2-, or 3-dimensional tiles.
Template Arguments
D0, D1, D2  The length of the tile in each specified dimension, where D0 is the most-significant dimension and D2 is the least-significant.
Constructs a new tiled_index out of the index of the tile (in global coordinates) and the relative position within the tile (in local coordinates). The other indices (global and tile_origin) are computed.
Parameters:
global  An object of type index<N> that is taken to be the global index of this tile.
local  An object of type index<N> that is taken to be the local index within this tile.
tile  An object of type index<N> that is taken to be the coordinates of the current tile.
tile_origin  An object of type index<N> that is taken to be the global index of the top-left corner of the tile.
barrier  An object of type tile_barrier.
tiled_index(const tiled_index& other) restrict(amp,cpu)
Copy constructor. Constructs a new tiled_index from the supplied argument other.
Parameters:
other  An object of type tiled_index from which to initialize this.
const index<N> local
An index of rank 1, 2, or 3 that represents the relative index in the current tile of a tiled extent.
const index<N> tile
An index of rank 1, 2, or 3 that represents the coordinates of the current tile of a tiled extent.
const index<N> tile_origin
An index of rank 1, 2, or 3 that represents the global coordinates of the origin of the current tile in a tiled extent.
const tile_barrier barrier
An object that represents a barrier within the current tile of threads.
operator index<N>() const restrict(amp,cpu)
Implicit conversion operator that converts a tiled_index<D0,D1,D2> into an index<N>. The implicit conversion converts to the .global index member.
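The effect of this conversion is that a tiled index can be passed wherever a plain index is expected, and the callee sees the global index. A minimal portable C++ sketch of that pattern follows. `Index2`, `TiledIndex2`, and `first_component` are hypothetical names and not part of C++ AMP.

```cpp
#include <cassert>

// Hypothetical 2-D stand-ins; these are not the C++ AMP types.
struct Index2 { int c[2]; };

struct TiledIndex2 {
    Index2 global;
    Index2 local;
    // Implicit conversion that yields the global index member.
    operator Index2() const { return global; }
};

// A function that knows nothing about tiling and takes a plain index.
int first_component(Index2 idx) { return idx.c[0]; }
```

Calling first_component with a TiledIndex2 whose global index is (6,9) returns 6, because the conversion hands over the global index and not the local one.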
static const int tile_dim0
static const int tile_dim1
static const int tile_dim2
These constants enable access to the template arguments of tiled_index.
4.5 tile_barrier
The tile_barrier class is a capability class that can only be created by the system, and passed to a tiled parallel_for_each function object as part of the tiled_index parameter. It provides member functions, such as wait, whose purpose is to synchronize execution of threads that are running within the thread tile. A call to wait must not occur in divergent code within a thread tile. Section 8 defines divergence and lack thereof.
4.5.1 Synopsis
class tile_barrier {
public:
    tile_barrier(const tile_barrier& other) restrict(amp,cpu);

    void wait() restrict(amp);
    void wait_with_all_memory_fence() restrict(amp);
    void wait_with_global_memory_fence() restrict(amp);
    void wait_with_tile_static_memory_fence() restrict(amp);
};
4.5.2 Constructors
The tile_barrier class does not have a public default constructor, only a copy-constructor.
tile_barrier(const tile_barrier& other) restrict(amp,cpu)
Copy constructor. Constructs a new tile_barrier from the supplied argument other.
Parameters:
other  An object of type tile_barrier from which to initialize this.
The tile_barrier class does not have an assignment operator. Section 8 describes the C++ AMP memory model, of which class tile_barrier is an important part.
void wait() restrict(amp)
Blocks execution of all threads in the thread tile until all of them have reached this call. Establishes a memory fence on all tile_static and global memory operations that are executed by the threads in the tile such that all memory operations that are issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the memory operations that occur after the barrier are speculatively executed before they hit the barrier. This is identical to wait_with_all_memory_fence.
void wait_with_all_memory_fence() restrict(amp)
Blocks execution of all threads in the thread tile until all of them have reached this call. Establishes a memory fence on all tile_static and global memory operations that are executed by the threads in the tile such that all memory operations that are issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the memory operations that occur after the barrier are speculatively executed before they hit the barrier. This is identical to wait.
void wait_with_global_memory_fence() restrict(amp)
Blocks execution of all threads in the thread tile until all of them have reached this call. Establishes a memory fence on global memory operations (but not tile-static memory operations) that are executed by the threads in the tile such that all global memory operations that are issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the global memory operations that occur after the barrier are speculatively executed before they hit the barrier.
void wait_with_tile_static_memory_fence() restrict(amp)
Blocks execution of all threads in the thread tile until all of them have reached this call. Establishes a memory fence on tile-static memory operations (but not global memory operations) that are executed by the threads in the tile such that all tile-static memory operations that are issued prior to hitting the barrier are visible to all other threads after the barrier has completed and none of the tile-static memory operations that occur after the barrier are speculatively executed before they hit the barrier.
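One way to read the barrier semantics is that every thread in the tile finishes the code before the barrier, and its memory operations become visible, before any thread runs the code after it. The sketch below emulates that on the CPU with a tree reduction: each phase is a loop over the threads of one tile, and the end of each loop plays the role of the barrier. It is a sequential emulation in portable C++, not C++ AMP code. `TILE`, `tile_sum`, and `scratch` are hypothetical names.

```cpp
#include <cassert>

const int TILE = 8;  // hypothetical tile length; assumed to be a power of two

// Sums TILE values the way a tiled kernel might: a tree reduction in
// tile-shared scratch storage, with a barrier between reduction steps.
int tile_sum(const int (&input)[TILE]) {
    int scratch[TILE];  // stands in for tile-static storage shared by the tile

    // Phase 0: each emulated thread t stores its own element.
    for (int t = 0; t < TILE; ++t) scratch[t] = input[t];
    // End of loop = barrier: all stores are complete before anyone reads.

    for (int stride = TILE / 2; stride > 0; stride /= 2) {
        // Threads with t < stride add their partner's value. Reads touch only
        // slots >= stride, which nobody writes in this step.
        for (int t = 0; t < stride; ++t) scratch[t] += scratch[t + stride];
        // End of loop = barrier. In a real kernel every thread of the tile
        // would reach this point, including those that did no addition.
    }
    return scratch[0];
}
```

Without the barrier between steps, a thread could read a partner slot before the partner had finished writing it. The phase boundaries in the emulation rule that out by construction.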
4.5.4 Other Memory Fences and Barriers
C++ AMP provides functions that serve as memory fences, which establish a happens-before relationship between memory operations that are performed by threads within the same thread tile. These functions are available in the concurrency namespace. Section 8 describes the C++ AMP memory model.
void all_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for both global and tile-static memory operations.
void global_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for global (but not tile-static) memory operations.
void tile_static_memory_fence(const tile_barrier&) restrict(amp)
Establishes a thread-tile scoped memory fence for tile-static (but not global) memory operations.
5 Data Containers
5.1 array<T,N>
The type array<T,N> represents a dense and regular (not jagged) N-dimensional array that resides on a specific location such as an accelerator or the CPU. The element type of the array is T, which is necessarily of a type that is compatible with the target accelerator. While the rank of the array is determined statically and is part of the type, the extent of the array is runtime-determined, and is expressed by using class extent<N>. The array element type T must be a standard-layout C++ class. Array data is laid out contiguously in memory. Elements that differ by one in the least-significant dimension are adjacent in memory.
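A contiguous layout in which the least-significant dimension is adjacent in memory corresponds to row-major addressing. The following portable C++ sketch computes the linear element offset of an index under that reading. `linear_offset` is a hypothetical helper, and the formula is an illustration of the stated layout, not a quotation of the implementation.

```cpp
#include <cassert>
#include <cstddef>

// Row-major offset: ((idx[0] * ext[1] + idx[1]) * ext[2] + idx[2]) ...
// idx and ext are ordered from most-significant to least-significant.
template <std::size_t N>
int linear_offset(const int (&idx)[N], const int (&ext)[N]) {
    int offset = 0;
    for (std::size_t i = 0; i < N; ++i) {
        offset = offset * ext[i] + idx[i];
    }
    return offset;
}
```

For an extent of (7,5,3), the indices (1,2,1) and (1,2,2) map to offsets 22 and 23, so elements that differ by one in the least-significant dimension are neighbors. The last element (6,4,2) maps to offset 104, one less than the total size of 105.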
Arrays are logically considered to be value types in that when an array is copied to another array, a deep copy is performed. Two arrays never point to the same data.
The array<T,N> type is used in several distinct scenarios:
- As a data container to be used in computations on an accelerator
- As a data container to hold memory on the host CPU (to be used to copy to and from other arrays)
- As a staging object to act as a fast intermediary in host-to-accelerator copies
An array can have any number of dimensions, although some functionality is specialized for array<T,1>, array<T,2>, and array<T,3>. The dimension defaults to 1 if the template argument is elided.
5.1.1 Synopsis
template <typename T, int N=1>
class array {
public:
    static const int rank = N;
    typedef T value_type;

    array() = delete;
    explicit array(const extent<N>& extent);
    array(const extent<N>& extent, accelerator_view av, accelerator_view associated_av); // staging

    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin);
    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(const extent<N>& extent, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);

    explicit array(const array_view<const T,N>& src);
    array(const array_view<const T,N>& src,
          accelerator_view av, accelerator_view associated_av); // staging
    array(const array_view<const T,N>& src, accelerator_view av);

    array(const array& other);
    array(array&& other);

    array& operator=(const array& other);
    array& operator=(array&& other);
    array& operator=(const array_view<const T,N>& src);

    void copy_to(array& dest) const;
    void copy_to(array_view<T,N>& dest) const;

    __declspec(property(get)) extent<N> extent;
    __declspec(property(get)) accelerator_view accelerator_view;
    __declspec(property(get)) accelerator_view associated_accelerator_view;

    T& operator[](const index<N>& idx) restrict(amp,cpu);
    const T& operator[](const index<N>& idx) const restrict(amp,cpu);
    array_view<T,N-1> operator[](int i) restrict(amp,cpu);
    array_view<const T,N-1> operator[](int i) const restrict(amp,cpu);

    const T& operator()(const index<N>& idx) const restrict(amp,cpu);
    T& operator()(const index<N>& idx) restrict(amp,cpu);
    array_view<T,N-1> operator()(int i) restrict(amp,cpu);
    array_view<const T,N-1> operator()(int i) const restrict(amp,cpu);

    array_view<T,N> section(const index<N>& idx, const extent<N>& ext) restrict(amp,cpu);
    array_view<const T,N> section(const index<N>& idx, const extent<N>& ext) const restrict(amp,cpu);
    array_view<T,N> section(const index<N>& idx) restrict(amp,cpu);
    array_view<const T,N> section(const index<N>& idx) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<ElementType,1> reinterpret_as() restrict(amp,cpu);
    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<T,K> view_as(const extent<K>& viewExtent) restrict(amp,cpu);
    template <int K>
    array_view<const T,K> view_as(const extent<K>& viewExtent) const restrict(amp,cpu);

    operator std::vector<T>() const;

    T* data() restrict(amp,cpu);
    const T* data() const restrict(amp,cpu);
};

template<typename T>
class array<T,1> {
public:
    static const int rank = 1;
    typedef T value_type;
    const extent<1> extent;

    array() = delete;
    explicit array(const extent<1>& extent);
    explicit array(int e0);
    array(const extent<1>& extent, accelerator_view av, accelerator_view associated_av); // staging
    array(int e0, accelerator_view av, accelerator_view associated_av); // staging
C++ AMP : Language and Programming Model : Version 0.9 : January 2012
    array(const extent<1>& extent, accelerator_view av);
    array(int e0, accelerator_view av);

    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin);
    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin);
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin, InputIterator srcEnd);

    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging

    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(const extent<1>& extent, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(int e0, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);

    array(const array_view<const T,1>& src);
    array(const array_view<const T,1>& src, accelerator_view av, accelerator_view associated_av); // staging
    array(const array_view<const T,1>& src, accelerator_view av);

    array(const array& other);
    array(array&& other);

    array& operator=(const array& other);
    array& operator=(array&& other);
    array& operator=(const array_view<const T,1>& src);

    void copy_to(array& dest) const;
    void copy_to(array_view<T,1>& dest) const;

    __declspec(property(get)) extent<1> extent;
    __declspec(property(get)) int x;
    __declspec(property(get)) accelerator_view accelerator_view;

    T& operator[](const index<1>& idx) restrict(amp,cpu);
    const T& operator[](const index<1>& idx) const restrict(amp,cpu);
    T& operator[](int i0) restrict(amp,cpu);
    const T& operator[](int i0) const restrict(amp,cpu);

    T& operator()(const index<1>& idx) restrict(amp,cpu);
    const T& operator()(const index<1>& idx) const restrict(amp,cpu);
    T& operator()(int i0) restrict(amp,cpu);
    const T& operator()(int i0) const restrict(amp,cpu);

    array_view<T,1> section(const index<1>& idx, const extent<1>& ext) restrict(amp,cpu);
    array_view<const T,1> section(const index<1>& idx, const extent<1>& ext) const restrict(amp,cpu);
    array_view<T,1> section(const index<1>& idx) restrict(amp,cpu);
    array_view<const T,1> section(const index<1>& idx) const restrict(amp,cpu);
    array_view<T,1> section(int i0, int e0) restrict(amp,cpu);
    array_view<const T,1> section(int i0, int e0) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<ElementType,1> reinterpret_as() restrict(amp,cpu);
    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<T,K> view_as(const extent<K>& viewExtent) restrict(amp,cpu);
    template <int K>
    array_view<const T,K> view_as(const extent<K>& viewExtent) const restrict(amp,cpu);

    operator std::vector<T>() const;

    T* data() restrict(amp,cpu);
    const T* data() const restrict(amp,cpu);
};
template<typename T>
class array<T,2>
{
public:
    static const int rank = 2;
    typedef T value_type;
    const extent<2> extent;

    array() = delete;

    explicit array(const extent<2>& extent);
    array(int e0, int e1);
    array(const extent<2>& extent, accelerator_view av, accelerator_view associated_av); // staging
    array(int e0, int e1, accelerator_view av, accelerator_view associated_av); // staging
    array(const extent<2>& extent, accelerator_view av);
    array(int e0, int e1, accelerator_view av);

    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd);

    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging

    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(const extent<2>& extent, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);

    array(const array_view<const T,2>& src);
    array(const array_view<const T,2>& src, accelerator_view av, accelerator_view associated_av); // staging
    array(const array_view<const T,2>& src, accelerator_view av);

    array(const array& other);
    array(array&& other);

    array& operator=(const array& other);
    array& operator=(array&& other);
    array& operator=(const array_view<const T,2>& src);

    void copy_to(array& dest) const;
    void copy_to(array_view<T,2>& dest) const;

    __declspec(property(get)) extent<2> extent;
    __declspec(property(get)) int y;
    __declspec(property(get)) int x;
    __declspec(property(get)) accelerator_view accelerator_view;

    T& operator[](const index<2>& idx) restrict(amp,cpu);
    const T& operator[](const index<2>& idx) const restrict(amp,cpu);
    array_view<T,1> operator[](int i0) restrict(amp,cpu);
    array_view<const T,1> operator[](int i0) const restrict(amp,cpu);

    T& operator()(const index<2>& idx) restrict(amp,cpu);
    const T& operator()(const index<2>& idx) const restrict(amp,cpu);
    T& operator()(int i0, int i1) restrict(amp,cpu);
    const T& operator()(int i0, int i1) const restrict(amp,cpu);

    array_view<T,2> section(const index<2>& idx, const extent<2>& ext) restrict(amp,cpu);
    array_view<const T,2> section(const index<2>& idx, const extent<2>& ext) const restrict(amp,cpu);
    array_view<T,2> section(const index<2>& idx) restrict(amp,cpu);
    array_view<const T,2> section(const index<2>& idx) const restrict(amp,cpu);
    array_view<T,2> section(int i0, int i1, int e0, int e1) restrict(amp,cpu);
    array_view<const T,2> section(int i0, int i1, int e0, int e1) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<ElementType,1> reinterpret_as() restrict(amp,cpu);
    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<T,K> view_as(const extent<K>& viewExtent) restrict(amp,cpu);
    template <int K>
    array_view<const T,K> view_as(const extent<K>& viewExtent) const restrict(amp,cpu);

    operator std::vector<T>() const;

    T* data() restrict(amp,cpu);
    const T* data() const restrict(amp,cpu);
};
template<typename T>
class array<T,3>
{
public:
    static const int rank = 3;
    typedef T value_type;
    const extent<3> extent;

    array() = delete;

    explicit array(const extent<3>& extent);
    array(int e0, int e1, int e2);
    array(const extent<3>& extent, accelerator_view av, accelerator_view associated_av); // staging
    array(int e0, int e1, int e2, accelerator_view av, accelerator_view associated_av); // staging
    array(const extent<3>& extent, accelerator_view av);
    array(int e0, int e1, int e2, accelerator_view av);

    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin);
    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin, InputIterator srcEnd);
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin);
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin, InputIterator srcEnd);

    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin,
          accelerator_view av, accelerator_view associated_av); // staging
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin, InputIterator srcEnd,
          accelerator_view av, accelerator_view associated_av); // staging

    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(const extent<3>& extent, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin, accelerator_view av);
    template <typename InputIterator>
    array(int e0, int e1, int e2, InputIterator srcBegin, InputIterator srcEnd, accelerator_view av);

    array(const array_view<const T,3>& src);
    array(const array_view<const T,3>& src, accelerator_view av, accelerator_view associated_av); // staging
    array(const array_view<const T,3>& src, accelerator_view av);

    array(const array& other);
    array(array&& other);

    array& operator=(const array& other);
    array& operator=(array&& other);
    array& operator=(const array_view<const T,3>& src);

    void copy_to(array& dest) const;
    void copy_to(array_view<T,3>& dest) const;

    __declspec(property(get)) extent<3> extent;
    __declspec(property(get)) accelerator_view accelerator_view;

    T& operator[](const index<3>& idx) restrict(amp,cpu);
    const T& operator[](const index<3>& idx) const restrict(amp,cpu);
    array_view<T,2> operator[](int i0) restrict(amp,cpu);
    array_view<const T,2> operator[](int i0) const restrict(amp,cpu);

    T& operator()(const index<3>& idx) restrict(amp,cpu);
    const T& operator()(const index<3>& idx) const restrict(amp,cpu);
    T& operator()(int i0, int i1, int i2) restrict(amp,cpu);
    const T& operator()(int i0, int i1, int i2) const restrict(amp,cpu);

    array_view<T,3> section(const index<3>& idx, const extent<3>& ext) restrict(amp,cpu);
    array_view<const T,3> section(const index<3>& idx, const extent<3>& ext) const restrict(amp,cpu);
    array_view<T,3> section(const index<3>& idx) restrict(amp,cpu);
    array_view<const T,3> section(const index<3>& idx) const restrict(amp,cpu);
    array_view<T,3> section(int i0, int i1, int i2, int e0, int e1, int e2) restrict(amp,cpu);
    array_view<const T,3> section(int i0, int i1, int i2, int e0, int e1, int e2) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<ElementType,1> reinterpret_as() restrict(amp,cpu);
    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<T,K> view_as(const extent<K>& viewExtent) restrict(amp,cpu);
    template <int K>
    array_view<const T,K> view_as(const extent<K>& viewExtent) const restrict(amp,cpu);

    operator std::vector<T>() const;

    T* data() restrict(amp,cpu);
    const T* data() const restrict(amp,cpu);
};
static const int rank = N
The rank of this array.

typedef T value_type;
The element type of this array.

5.1.2 Constructors
There is no default constructor for array<T,N>. All constructors are restricted to run on the CPU only (they cannot be executed on an amp target).
array(const array& other)
Copy constructor. Constructs a new array<T,N> from the supplied argument other. A deep copy is performed.
Parameters:
    other    An object of type array<T,N> from which to initialize this new array.
array(array&& other)
Move constructor. Constructs a new array<T,N> by moving from the supplied argument other.
Parameters:
    other    An object of type array<T,N> from which to initialize this new array.
explicit array(const extent<N>& extent)
Constructs a new array, located on the default accelerator, by using the supplied extent.
Parameters:
    extent    The extent in each dimension of this array.
explicit array<T,1>::array(int e0)
array<T,2>::array(int e0, int e1)
array<T,3>::array(int e0, int e1, int e2)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]])).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that form the extent of this array.
template <typename InputIterator>
array(const extent<N>& extent, InputIterator srcBegin [, InputIterator srcEnd])
Constructs a new array, located on the default accelerator, that has the supplied extent and is initialized by using the contents of a source container that is specified by a beginning iterator and an optional ending iterator. The source data is copied by value into this array as if by calling copy(). If the number of available container elements is less than this->extent.size(), undefined behavior results.
Parameters:
    extent      The extent in each dimension of this array.
    srcBegin    A beginning iterator into the source container.
    srcEnd      An ending iterator into the source container.
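Because a source range that is shorter than this->extent.size() leads to undefined behavior, a caller may want to validate the range before constructing the array. The following is a minimal sketch in standard C++, not C++ AMP code: copy_exactly is a hypothetical helper introduced here for illustration, and a std::vector stands in for the array's storage.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical helper, not part of C++ AMP: copies exactly `count` elements
// from [first, last) and throws if the range runs out early.
template <typename InputIterator>
std::vector<typename std::iterator_traits<InputIterator>::value_type>
copy_exactly(std::size_t count, InputIterator first, InputIterator last)
{
    using value_type = typename std::iterator_traits<InputIterator>::value_type;
    std::vector<value_type> out;
    out.reserve(count);
    for (; first != last && out.size() < count; ++first) {
        out.push_back(*first);
    }
    if (out.size() < count) {
        throw std::length_error("source range is shorter than extent.size()");
    }
    return out;
}

// Usage sketch: a 2 x 3 extent needs 6 source elements.
inline std::vector<int> copy_exactly_example()
{
    const std::vector<int> source{1, 2, 3, 4, 5, 6};
    const std::size_t extent_size = 2 * 3;
    std::vector<int> staged = copy_exactly(extent_size, source.begin(), source.end());
    assert(staged.size() == extent_size);
    return staged;
}
```

A check of this kind is only possible when an ending iterator is available; with the srcBegin-only overloads the caller alone is responsible for the length.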
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin [, InputIterator srcEnd])
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin [, InputIterator srcEnd])
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin [, InputIterator srcEnd])
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), srcBegin [, srcEnd]).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that form the extent of this array.
    srcBegin              A beginning iterator into the source container.
    srcEnd                An ending iterator into the source container.
explicit array(const array_view<const T,N>& src)
Constructs a new array that is initialized by using the contents of the array_view src. The extent of this array is taken from the extent of the source array_view. The src is copied by value into this array as if by calling copy(src, *this) (see 5.3.2).
Parameters:
    src    An array_view object from which to copy the data into this array (and also to determine the extent of this array).
explicit array(const extent<N>& extent, accelerator_view av)
Constructs a new array, located on the accelerator that is bound to the accelerator_view av, that has the supplied extent.
Parameters:
    extent    The extent in each dimension of this array.
    av        An accelerator_view object that specifies the location of this array.
array<T,1>::array(int e0, accelerator_view av)
array<T,2>::array(int e0, int e1, accelerator_view av)
array<T,3>::array(int e0, int e1, int e2, accelerator_view av)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), av).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that form the extent of this array.
    av                    An accelerator_view object that specifies the location of this array.
template <typename InputIterator> array(const extent<N>& extent, InputIterator srcBegin [, InputIterator srcEnd], accelerator_view av)
Constructs a new array, located on the accelerator that is bound to the accelerator_view av, using the supplied extent, initialized by using the contents of the source container specified by a beginning iterator and an optional ending iterator. The data is copied by value into this array as if by calling copy().
Parameters:
    extent      The extent in each dimension of this array.
    srcBegin    A beginning iterator into the source container.
    srcEnd      An ending iterator into the source container.
    av          An accelerator_view object that specifies the location of this array.
array(const array_view<const T,N>& src, accelerator_view av)
Constructs a new array that is initialized by using the contents of the array_view src. The extent of this array is taken from the extent of the source array_view. The src is copied by value into this array as if by calling copy(src, *this) (see 5.3.2). The new array is located on the accelerator that is bound to the accelerator_view av.
Parameters:
    src    An array_view object from which to copy the data into this array (and also to determine the extent of this array).
    av     An accelerator_view object that specifies the location of this array.
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin [, InputIterator srcEnd], accelerator_view av)
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin [, InputIterator srcEnd], accelerator_view av)
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin [, InputIterator srcEnd], accelerator_view av)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), srcBegin [, srcEnd], av).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that form the extent of this array.
    srcBegin              A beginning iterator into the source container.
    srcEnd                An ending iterator into the source container.
    av                    An accelerator_view object that specifies the location of this array.
5.1.2.1 Staging Array Constructors
Staging arrays are used as a hint to optimize repeated copies between two accelerators (in this version of C++ AMP, this is between the CPU and an accelerator). Staging arrays are optimized for data transfers, and do not have stable user-space memory.

Microsoft-specific: On Windows, staging arrays are backed by DirectX staging buffers, which have the correct hardware alignment to ensure efficient DMA transfer between the CPU and a device.

Staging arrays are differentiated from normal arrays by their construction using a second accelerator. The accelerator_view property of a staging array returns the value of the first accelerator argument that it was constructed with (acclSrc, below). Changing or examining the contents of a staging array while it is involved in a transfer operation is not supported (for example, between lines 17 and 22 in the following example).

 1. class SimulationServer
 2. {
 3.     array<float,2> acceleratorArray;
 4.     array<float,2> stagingArray;
 5. public:
 6.     SimulationServer(const accelerator_view& av)
 7.         :acceleratorArray(extent<2>(1000,1000), av),
 8.          stagingArray(extent<2>(1000,1000), accelerator(cpu).default_view,
 9.                       accelerator(gpu).default_view)
10.     {
11.     }
12.
13.     void OnCompute()
14.     {
15.         array<float,2> &a = acceleratorArray;
16.         ApplyNetworkChanges(stagingArray.data());
17.         a = stagingArray;
18.         parallel_for_each(a.extent, [&](index<2> idx) restrict(amp)
19.         {
20.             // Update a[idx] according to simulation
21.         });
22.         stagingArray = a;
23.         SendToClient(stagingArray.data());
24.     }
25. };
array<T,1>::array(int e0, accelerator_view av, accelerator_view associated_av)
array<T,2>::array(int e0, int e1, accelerator_view av, accelerator_view associated_av)
array<T,3>::array(int e0, int e1, int e2, accelerator_view av, accelerator_view associated_av)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), acclSrc, acclDest).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that form the extent of this array.
    acclSrc               An accelerator object that specifies the home location of this array.
    acclDest              An accelerator object that specifies a target device accelerator.
template <typename InputIterator> array(const extent<N>& extent, InputIterator srcBegin, accelerator_view av, accelerator_view associated_av)
Constructs a staging array using the given extent, which acts as a staging area between accelerators acclSrc (which must be the CPU accelerator) and acclDest. The staging array will be initialized by using the data that is specified by src as if by calling copy(src, *this) (see 5.3.2).
Parameters:
    extent      The extent in each dimension of this array.
    src         A template argument that must resolve to a linear container that supports .data() and .size() members (such as std::vector or std::array).
    acclSrc     An accelerator object that specifies the home location of this array.
    acclDest    An accelerator object that specifies a target device accelerator.
array(const extent<N>& extent, const value_type* src, accelerator_view av, accelerator_view associated_av)
Constructs a staging array using the given extent, which acts as a staging area between accelerators acclSrc (which must be the CPU accelerator) and acclDest. The staging array will be initialized by using the data specified by src as if by calling copy(src, *this) (see 5.3.2).
Parameters:
    extent      The extent in each dimension of this array.
    src         A pointer to the source data that will be copied into this array.
    acclSrc     An accelerator object that specifies the home location of this array.
    acclDest    An accelerator object that specifies a target device accelerator.
array(const array_view<const T,N>& src, accelerator_view av, accelerator_view associated_av)
Constructs a staging array that is initialized by using the array_view that is given by src, which acts as a staging area between accelerators acclSrc (which must be the CPU accelerator) and acclDest. The extent of this array is taken from the extent of the source array_view. The staging array will be initialized from src as if by calling copy(src, *this) (see 5.3.2).
Parameters:
    src         An array_view object from which to copy the data into this array (and also to determine the extent of this array).
    acclSrc     An accelerator object that specifies the home location of this array.
    acclDest    An accelerator object that specifies a target device accelerator.
template <typename InputIterator>
array<T,1>::array(int e0, InputIterator srcBegin, accelerator_view av, accelerator_view associated_av)
template <typename InputIterator>
array<T,2>::array(int e0, int e1, InputIterator srcBegin, accelerator_view av, accelerator_view associated_av)
template <typename InputIterator>
array<T,3>::array(int e0, int e1, int e2, InputIterator srcBegin, accelerator_view av, accelerator_view associated_av)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), src, acclSrc, acclDest).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that will form the extent of this array.
    src                   A template argument that must resolve to a linear container that supports .data() and .size() members (such as std::vector or std::array).
    acclSrc               An accelerator object that specifies the home location of this array.
    acclDest              An accelerator object that specifies a target device accelerator.
array<T,1>::array(int e0, const value_type* src, accelerator_view av, accelerator_view associated_av)
array<T,2>::array(int e0, int e1, const value_type* src, accelerator_view av, accelerator_view associated_av)
array<T,3>::array(int e0, int e1, int e2, const value_type* src, accelerator_view av, accelerator_view associated_av)
Equivalent to construction by using array(extent<N>(e0 [, e1 [, e2 ]]), src, acclSrc, acclDest).
Parameters:
    e0 [, e1 [, e2 ] ]    The component values that will form the extent of this array.
    src                   A pointer to the source data that will be copied into this array.
    acclSrc               An accelerator object that specifies the home location of this array.
    acclDest              An accelerator object that specifies a target device accelerator.
__declspec(property(get)) int z
__declspec(property(get)) int y
__declspec(property(get)) int x
These properties are shortcuts for extent component access when N ≤ 3.
__declspec(property(get)) accelerator_view accelerator_view
Returns the accelerator_view that represents the location where this array has been allocated. This property is only accessible on the CPU.
array& operator=(const array& other)
Assigns the contents of the array other to this array by using a deep copy. This function can only be called on the CPU.
Parameters:
    other    An object of type array<T,N> from which to copy into this array.
Return Value:
    Returns *this.
array& operator=(array&& other)
Moves the contents of the array other to this array. This function can only be called on the CPU.
Parameters:
    other    An object of type array<T,N> from which to move into this array.
Return Value:
    Returns *this.
array& operator=(const array_view<const T,N>& src)
Assigns the contents of the array_view src, as if by calling copy(src, *this) (see 5.3.2).
Parameters:
    src    An object of type array_view<T,N> from which to copy into this array.
Return Value:
    Returns *this.
void copy_to(array<T,N>& dest)
Copies the contents of this array to the array that is given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
    dest    An object of type array<T,N> to which to copy data from this array.
void copy_to(array_view<T,N>& dest)
Copies the contents of this array to the array_view that is given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
    dest    An object of type array_view<T,N> to which to copy data from this array.
T* data() restrict(amp,cpu)
const T* data() const restrict(amp,cpu)
Returns a pointer to the raw data that underlies this array.
Return Value:
    A (const) pointer to the first element in the linearized array.
operator std::vector<T>() const
Implicitly converts an array to a std::vector, as if by copy(*this, vector) (see 5.3.2).
Return Value:
    An object of type vector<T> that contains a copy of the data that is held in the array.
T& operator[](const index<N>& idx) restrict(amp,cpu)
T& operator()(const index<N>& idx) restrict(amp,cpu)
Returns a reference to the element of this array that is at the location in N-dimensional space that is specified by idx.
Parameters:
    idx    An object of type index<N> that specifies the location of the element.
const T& operator[](const index<N>& idx) const restrict(amp,cpu)
const T& operator()(const index<N>& idx) const restrict(amp,cpu)
Returns a const reference to the element of this array that is at the location in N-dimensional space that is specified by idx.
Parameters:
    idx    An object of type index<N> that specifies the location of the element.
T& array<T,1>::operator()(int i0) restrict(amp,cpu)
T& array<T,2>::operator()(int i0, int i1) restrict(amp,cpu)
T& array<T,3>::operator()(int i0, int i1, int i2) restrict(amp,cpu)
Equivalent to array<T,N>::operator()(index<N>(i0 [, i1 [, i2 ]])).
Parameters:
    i0 [, i1 [, i2 ] ]    The component values that will form the index into this array.
const T& array<T,1>::operator()(int i0) const restrict(amp,cpu)
const T& array<T,2>::operator()(int i0, int i1) const restrict(amp,cpu)
const T& array<T,3>::operator()(int i0, int i1, int i2) const restrict(amp,cpu)
Equivalent to array<T,N>::operator()(index<N>(i0 [, i1 [, i2 ]])) const.
Parameters:
    i0 [, i1 [, i2 ] ]    The component values that will form the index into this array.
array_view<T,N-1> operator[](int i0) restrict(amp,cpu)
array_view<const T,N-1> operator[](int i0) const restrict(amp,cpu)
This overload is defined for array<T,N> where N ≥ 2. This mode of indexing is equivalent to projecting on the most-significant dimension. It enables C-style indexing. For example:
    array<float,4> myArray(myExtents, ...);
    myArray[index<4>(5,4,3,2)] = 7;
    assert(myArray[5][4][3][2] == 7);
Parameters:
    i0    An integer that is the index into the most-significant dimension of this array.
Return Value:
    Returns an array_view whose dimension is one lower than that of this array.
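The equivalence between indexing with an index<N> and chained C-style projection can be illustrated outside C++ AMP. The sketch below is standard C++ only and assumes a row-major layout in which dimension 0 is the most significant; linear_offset and flat_view are hypothetical names introduced for illustration and are not part of the C++ AMP API.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Row-major offset of idx within ext; dimension 0 is the most significant.
template <std::size_t N>
std::size_t linear_offset(const std::array<int, N>& ext, const std::array<int, N>& idx)
{
    std::size_t offset = 0;
    for (std::size_t d = 0; d < N; ++d) {
        assert(idx[d] >= 0 && idx[d] < ext[d]);
        offset = offset * static_cast<std::size_t>(ext[d]) + static_cast<std::size_t>(idx[d]);
    }
    return offset;
}

// Hypothetical rank-N view over a flat buffer (a stand-in for array_view).
template <typename T, std::size_t N>
struct flat_view {
    T* base;          // first element of this view
    const int* ext;   // extents of the N dimensions, most significant first

    // Project on the most-significant dimension: fix it to i and drop it.
    flat_view<T, N - 1> operator[](int i) const
    {
        assert(i >= 0 && i < ext[0]);
        std::size_t stride = 1;
        for (std::size_t d = 1; d < N; ++d) {
            stride *= static_cast<std::size_t>(ext[d]);
        }
        return flat_view<T, N - 1>{base + static_cast<std::size_t>(i) * stride, ext + 1};
    }
};

// Rank 1 ends the recursion and yields an element reference.
template <typename T>
struct flat_view<T, 1> {
    T* base;
    const int* ext;

    T& operator[](int i) const
    {
        assert(i >= 0 && i < ext[0]);
        return base[i];
    }
};
```

Each projection step skips i whole sub-blocks of the remaining dimensions, so after N steps the accumulated offset equals the row-major offset of the full index.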
array_view<T,N> section(const index<N>& offset, const extent<N>& ext) restrict(amp,cpu)
array_view<const T,N> section(const index<N>& offset, const extent<N>& ext) const restrict(amp,cpu)
See array_view<T,N>::section(const index<N>&, const extent<N>&) in section 5.2.2 for a description of this function.
array_view<T,N> section(const index<N>& idx) restrict(amp,cpu)
array_view<const T,N> section(const index<N>& idx) const restrict(amp,cpu)
Equivalent to section(idx, this->extent - idx).
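The single-argument section overload covers everything from the given origin to the end of each dimension, that is, an extent equal to this->extent minus idx, component by component. A minimal standard C++ sketch of that arithmetic follows; remaining_extent and section_fits are hypothetical helpers, and std::array<int, N> stands in for both index<N> and extent<N>.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Hypothetical helper: extent left over when a section starts at `origin`
// and runs to the end of every dimension (component-wise ext - origin).
template <std::size_t N>
std::array<int, N> remaining_extent(const std::array<int, N>& ext,
                                    const std::array<int, N>& origin)
{
    std::array<int, N> rest{};
    for (std::size_t d = 0; d < N; ++d) {
        assert(origin[d] >= 0 && origin[d] <= ext[d]);
        rest[d] = ext[d] - origin[d];
    }
    return rest;
}

// True when the section [origin, origin + size) lies inside ext in every dimension.
template <std::size_t N>
bool section_fits(const std::array<int, N>& ext,
                  const std::array<int, N>& origin,
                  const std::array<int, N>& size)
{
    for (std::size_t d = 0; d < N; ++d) {
        if (origin[d] < 0 || size[d] < 0 || origin[d] + size[d] > ext[d]) {
            return false;
        }
    }
    return true;
}
```

For a 10 x 8 array and an origin of (2, 3), the remaining extent is 8 x 5, which is the largest section that still fits at that origin.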
array_view<T,1> array<T,1>::section(int i0, int e0) restrict(amp,cpu)
array_view<const T,1> array<T,1>::section(int i0, int e0) const restrict(amp,cpu)
array_view<T,2> array<T,2>::section(int i0, int i1, int e0, int e1) restrict(amp,cpu)
array_view<const T,2> array<T,2>::section(int i0, int i1, int e0, int e1) const restrict(amp,cpu)
array_view<T,3> array<T,3>::section(int i0, int i1, int i2, int e0, int e1, int e2) restrict(amp,cpu)
array_view<const T,3> array<T,3>::section(int i0, int i1, int i2, int e0, int e1, int e2) const restrict(amp,cpu)
Equivalent to array<T,N>::section(index<N>(i0 [, i1 [, i2 ]]), extent<N>(e0 [, e1 [, e2 ]])) const.
Parameters:
    i0 [, i1 [, i2 ] ]    The component values that will form the origin of the section.
    e0 [, e1 [, e2 ] ]    The component values that will form the extent of the section.
template<typename ElementType> array_view<ElementType,1> reinterpret_as() restrict(amp,cpu) template<typename ElementType> array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu)
Sometimes it is desirable to view the data of an N-dimensional array as a linear array, possibly with a (unsafe) reinterpretation of the element type. This can be achieved through the reinterpret_as member function. For example: struct RGB { float r; float g; float b; }; array<RGB,3> a = ...; array_view<float,1> v = a.reinterpret_as<float>(); assert(v.extent == 3*a.extent); The size of the reinterpreted ElementType must evenly divide into the total size of this array. Return Value: Returns an array_view from this array<T,N> with the element type reinterpreted from T to ElementType, and the rank reduced from N to 1.
template <int K> array_view<T,K> view_as(extent<K> viewExtent) restrict(amp,cpu)
template <int K> array_view<const T,K> view_as(extent<K> viewExtent) const restrict(amp,cpu)
An array of higher rank can be reshaped into an array of lower rank, or vice versa, by using the view_as member function. For example:
    array<float,1> a(100);
    array_view<float,2> av = a.view_as(extent<2>(2,50));
Return Value: Returns an array_view from this array<T,N> with the rank changed to K from N.
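As a rough illustration of what such a reshaping means for element addressing, the following standard C++ sketch models a 2-dimensional view over linear storage. It is not the amp.h implementation. The helper names are hypothetical, and the row-major layout with dimension 0 as the most significant is an assumption of this sketch.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: row-major offset of the 2-D index (i0, i1) in a view
// with extent (e0, e1); dimension 0 is assumed to be the most significant.
inline std::size_t linear_offset_2d(int i0, int i1, int e0, int e1) {
    assert(i0 >= 0 && i0 < e0 && i1 >= 0 && i1 < e1);
    return static_cast<std::size_t>(i0) * e1 + i1;
}

// Models reading element (i0, i1) of a 2-D reshaping of linear storage.
inline float read_as_2d(const std::vector<float>& data,
                        int i0, int i1, int e0, int e1) {
    assert(data.size() == static_cast<std::size_t>(e0) * e1);
    return data[linear_offset_2d(i0, i1, e0, e1)];
}
```

With the extent (2,50) from the example, element (1,7) of the reshaped view corresponds to element 57 of the original 100-element array.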
5.2 array_view<T,N>
The array_view<T,N> type represents a possibly cached view into the data that is held in an array<T,N>, or a section thereof. It also provides such views over native CPU data. It exposes an indexing interface that is congruent to that of array<T,N>.

Like an array, an array_view is an N-dimensional object, where N defaults to 1 if it is elided. The array element type T must be a standard-layout C++ class.

array_views may be accessed locally, where their source data lives, or remotely on a different accelerator or coherence domain. When they are accessed remotely, views are copied and cached as necessary. Except for the effects of automatic caching, array_views have a performance profile similar to that of arrays (small to negligible access penalty when the data is accessed through views).

There are three remote usage scenarios:
1. A view to a system memory pointer is passed through a parallel_for_each call to an accelerator and accessed on the accelerator.
2. A view to an accelerator-residing array is passed by using a parallel_for_each to another accelerator and is accessed there.
3. A view to an accelerator-residing array is accessed on the CPU.
When any of these scenarios occurs, the referenced views are implicitly copied by the system to the remote location and, if they are modified through the array_view, they are copied back to the home location. An implementation is free to optimize the copying back of changes, and may copy only changed elements or may copy unchanged portions as well. Overlapping array_views to the same data source are not guaranteed to maintain aliasing between arrays/array_views on a remote location.

Multi-threaded access to the same data source, either directly or through views, must be synchronized by the user.

The runtime makes the following guarantees regarding the caching of data inside array views:
1. Let A be an array and V a view to the array. Then, all well-synchronized accesses to A and V in program order obey a serial happens-before relationship.
2. Let A be an array and V1 and V2 be overlapping views to the array. When they are executing on the accelerator where A has been allocated, all well-synchronized accesses through A, V1, and V2 are aliased through A and induce a total happens-before relationship which obeys program order. (No caching.) Otherwise, if they are executing on different accelerators, then the behavior of writes to V1 and V2 is undefined (a race).
When an array_view is created over a pointer in system memory, you commit to one of the following:
1. Changing the data only through the view class, or
2. Adhering to the following rules when accessing the data directly (not through the view):
   a. Calling synchronize() before the data is accessed directly, and
   b. If the underlying data is modified, calling refresh() prior to further accessing it through the view.
Either action will notify the array_view that the underlying native memory has changed and that any accelerator-residing copies are now stale. If the user abides by these rules, then the guarantees that are provided by the system for pointer-based views are identical to those that are provided to views of data-parallel arrays.

5.2.1 Synopsis

The array_view<T,N> has the following specializations:
    array_view<T,1>
    array_view<T,2>
    array_view<T,3>
    array_view<const T,N>
    array_view<const T,1>
    array_view<const T,2>
    array_view<const T,3>

5.2.1.1 array_view<T,N>
The generic array_view<T,N> represents a view over elements of type T with rank N. The elements are both readable and writeable.
template <typename T, int N = 1>
class array_view
{
public:
    static const int rank = N;
    typedef T value_type;
    const extent<N> extent;
    array_view() = delete;
    array_view(array<T,N>& src, bool discard_original = false) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<N>& extent, Container src, bool discard_original = false);
    array_view(const extent<N>& extent, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(const array_view& other, bool discard_original = false) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,N>& dest) const;
    void copy_to(array_view& dest) const;

    __declspec(property(get)) extent<N> extent;
    // These are restrict(amp,cpu)
    T& operator[](const index<N>& idx) restrict(amp,cpu);
    const T& operator[](const index<N>& idx) const restrict(amp,cpu);
    array_view<T,N-1> operator[](int i) restrict(amp,cpu);
    array_view<const T,N-1> operator[](int i) const restrict(amp,cpu);

    T& operator()(const index<N>& idx) restrict(amp,cpu);
    const T& operator()(const index<N>& idx) const restrict(amp,cpu);
    array_view<T,N-1> operator()(int i) restrict(amp,cpu);
    array_view<const T,N-1> operator()(int i) const restrict(amp,cpu);

    array_view<T,N> section(const index<N>& idx, const extent<N>& ext) restrict(amp,cpu);
    array_view<const T,N> section(const index<N>& idx, const extent<N>& ext) const restrict(amp,cpu);
    array_view<T,N> section(const index<N>& idx) restrict(amp,cpu);
    array_view<const T,N> section(const index<N>& idx) const restrict(amp,cpu);

    void synchronize();
    std::shared_future<void> synchronize_async() const;
    void refresh();
    void discard_data();
};

template <typename T>
class array_view<T,1>
{
public:
    static const int rank = 1;
    typedef T value_type;
    const extent<1> extent;

    array_view() = delete;
    array_view(array<T,1>& src, bool discard_original = false) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<1>& extent, Container src, bool discard_original = false);
    template <typename Container>
    array_view(int e0, Container src, bool discard_original = false);
    array_view(const extent<1>& extent, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(int e0, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(const array_view& other, bool discard_original = false) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,1>& dest) const;
    void copy_to(array_view& dest) const;

    __declspec(property(get)) extent<1> extent;

    T& operator[](const index<1>& idx) restrict(amp,cpu);
    const T& operator[](const index<1>& idx) const restrict(amp,cpu);
    T& operator[](int i) restrict(amp,cpu);
    const T& operator[](int i) const restrict(amp,cpu);

    T& operator()(const index<1>& idx) restrict(amp,cpu);
    const T& operator()(const index<1>& idx) const restrict(amp,cpu);
    T& operator()(int i) restrict(amp,cpu);
    const T& operator()(int i) const restrict(amp,cpu);

    array_view<T,1> section(const index<1>& idx, const extent<1>& ext) restrict(amp,cpu);
    array_view<const T,1> section(const index<1>& idx, const extent<1>& ext) const restrict(amp,cpu);
    array_view<T,1> section(const index<1>& idx) restrict(amp,cpu);
    array_view<const T,1> section(const index<1>& idx) const restrict(amp,cpu);
    array_view<T,1> section(int i0) restrict(amp,cpu);
    array_view<const T,1> section(int i0) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<ElementType,1> reinterpret_as() restrict(amp,cpu);
    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<T,K> view_as(extent<K> viewExtent) restrict(amp,cpu);
    template <int K>
    array_view<const T,K> view_as(extent<K> viewExtent) const restrict(amp,cpu);

    T* data() restrict(amp,cpu);
    const T* data() const restrict(amp,cpu);

    void synchronize();
    std::shared_future<void> synchronize_async() const;
    void refresh();
    void discard_data();
};

template <typename T>
class array_view<T,2>
{
public:
    static const int rank = 2;
    typedef T value_type;
    const extent<2> extent;
    array_view() = delete;
    array_view(array<T,2>& src, bool discard_original = false) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<2>& extent, Container src, bool discard_original = false);
    template <typename Container>
    array_view(int e0, int e1, Container src, bool discard_original = false);
    array_view(const extent<2>& extent, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(int e0, int e1, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(const array_view& other, bool discard_original = false) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,2>& dest) const;
    void copy_to(array_view& dest) const;

    __declspec(property(get)) extent<2> extent;

    T& operator[](const index<2>& idx) restrict(amp,cpu);
    const T& operator[](const index<2>& idx) const restrict(amp,cpu);
    array_view<T,1> operator[](int i) restrict(amp,cpu);
    array_view<const T,1> operator[](int i) const restrict(amp,cpu);

    T& operator()(const index<2>& idx) restrict(amp,cpu);
    const T& operator()(const index<2>& idx) const restrict(amp,cpu);
    T& operator()(int i0, int i1) restrict(amp,cpu);
    const T& operator()(int i0, int i1) const restrict(amp,cpu);

    array_view<T,2> section(const index<2>& idx, const extent<2>& ext) restrict(amp,cpu);
    array_view<const T,2> section(const index<2>& idx, const extent<2>& ext) const restrict(amp,cpu);
    array_view<T,2> section(const index<2>& idx) restrict(amp,cpu);
    array_view<const T,2> section(const index<2>& idx) const restrict(amp,cpu);
    array_view<T,2> section(int i0, int i1) restrict(amp,cpu);
    array_view<const T,2> section(int i0, int i1) const restrict(amp,cpu);

    void synchronize();
    std::shared_future<void> synchronize_async() const;
    void refresh();
    void discard_data();
};

template <typename T>
class array_view<T,3>
{
public:
    static const int rank = 3;
    typedef T value_type;
    const extent<3> extent;

    array_view() = delete;
    array_view(array<T,3>& src, bool discard_original = false) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<3>& extent, Container src, bool discard_original = false);
    template <typename Container>
    array_view(int e0, int e1, int e2, Container src, bool discard_original = false);
    array_view(const extent<3>& extent, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(int e0, int e1, int e2, value_type* src, bool discard_original = false) restrict(amp,cpu);
    array_view(const array_view& other, bool discard_original = false) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,3>& dest) const;
    void copy_to(array_view& dest) const;

    __declspec(property(get)) extent<3> extent;

    T& operator[](const index<3>& idx) restrict(amp,cpu);
    const T& operator[](const index<3>& idx) const restrict(amp,cpu);
    array_view<T,2> operator[](int i) restrict(amp,cpu);
    array_view<const T,2> operator[](int i) const restrict(amp,cpu);

    T& operator()(const index<3>& idx) restrict(amp,cpu);
    const T& operator()(const index<3>& idx) const restrict(amp,cpu);
    T& operator()(int i0, int i1, int i2) restrict(amp,cpu);
    const T& operator()(int i0, int i1, int i2) const restrict(amp,cpu);

    array_view<T,3> section(const index<3>& idx, const extent<3>& ext) restrict(amp,cpu);
    array_view<const T,3> section(const index<3>& idx, const extent<3>& ext) const restrict(amp,cpu);
    array_view<T,3> section(const index<3>& idx) restrict(amp,cpu);
    array_view<const T,3> section(const index<3>& idx) const restrict(amp,cpu);
    array_view<T,3> section(int i0, int i1, int i2) restrict(amp,cpu);
    array_view<const T,3> section(int i0, int i1, int i2) const restrict(amp,cpu);

    void synchronize();
    std::shared_future<void> synchronize_async() const;
    void refresh();
    void discard_data();
};
5.2.1.2 array_view<const T,N>
The partial specialization array_view<const T,N> represents a view over elements of type const T with rank N. The elements are read-only. At the boundary of a call site (such as parallel_for_each), this form of array_view need only be copied to the target accelerator if it is not already there. It will not be copied out.
template <typename T, int N=1>
class array_view<const T,N>
{
public:
    static const int rank = N;
    typedef const T value_type;
    const extent<N> extent;

    array_view() = delete;
    array_view(const array<T,N>& src) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<N>& extent, const Container src);
    array_view(const extent<N>& extent, const value_type* src) restrict(amp,cpu);
    array_view(const array_view& other) restrict(amp,cpu);
    array_view(const array_view<const T,N>& other) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,N>& dest) const;
    void copy_to(array_view<T,N>& dest) const;

    __declspec(property(get)) extent<N> extent;

    const T& operator[](const index<N>& idx) const restrict(amp,cpu);
    array_view<const T,N-1> operator[](int i) const restrict(amp,cpu);

    const T& operator()(const index<N>& idx) const restrict(amp,cpu);
    array_view<const T,N-1> operator()(int i) const restrict(amp,cpu);

    array_view<const T,N> section(const index<N>& idx, const extent<N>& ext) const restrict(amp,cpu);
    array_view<const T,N> section(const index<N>& idx) const restrict(amp,cpu);

    void refresh();
};

template <typename T>
class array_view<const T,1>
{
public:
    static const int rank = 1;
    typedef const T value_type;
    const extent<1> extent;

    array_view() = delete;
    array_view(const array<T,1>& src) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<1>& extent, const Container src);
    template <typename Container>
    array_view(int e0, const Container src);
    array_view(const extent<1>& extent, const value_type* src) restrict(amp,cpu);
    array_view(int e0, const value_type* src) restrict(amp,cpu);
    array_view(const array_view& other) restrict(amp,cpu);
    array_view(const array_view<const T,1>& other) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,1>& dest) const;
    void copy_to(array_view<T,1>& dest) const;

    __declspec(property(get)) extent<1> extent;
    // These are restrict(amp,cpu)
    const T& operator[](const index<1>& idx) const restrict(amp,cpu);
    const T& operator[](int i) const restrict(amp,cpu);

    const T& operator()(const index<1>& idx) const restrict(amp,cpu);
    const T& operator()(int i) const restrict(amp,cpu);

    array_view<const T,1> section(const index<1>& idx, const extent<1>& ext) const restrict(amp,cpu);
    array_view<const T,1> section(const index<1>& idx) const restrict(amp,cpu);
    array_view<const T,1> section(int i0) const restrict(amp,cpu);

    template <typename ElementType>
    array_view<const ElementType,1> reinterpret_as() const restrict(amp,cpu);

    template <int K>
    array_view<const T,K> view_as(extent<K> viewExtent) const restrict(amp,cpu);

    const T* data() const restrict(amp,cpu);

    void refresh();
};

template <typename T>
class array_view<const T,2>
{
public:
    static const int rank = 2;
    typedef const T value_type;
    const extent<2> extent;

    array_view() = delete;
    array_view(const array<T,2>& src) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<2>& extent, const Container src);
    template <typename Container>
    array_view(int e0, int e1, const Container src);
    array_view(const extent<2>& extent, const value_type* src) restrict(amp,cpu);
    array_view(int e0, int e1, const value_type* src) restrict(amp,cpu);
    array_view(const array_view& other) restrict(amp,cpu);
    array_view(const array_view<const T,2>& other) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,2>& dest) const;
    void copy_to(array_view<T,2>& dest) const;

    __declspec(property(get)) extent<2> extent;

    const T& operator[](const index<2>& idx) const restrict(amp,cpu);
    array_view<const T,1> operator[](int i) const restrict(amp,cpu);

    const T& operator()(const index<2>& idx) const restrict(amp,cpu);
    const T& operator()(int i0, int i1) const restrict(amp,cpu);

    array_view<const T,2> section(const index<2>& idx, const extent<2>& ext) const restrict(amp,cpu);
    array_view<const T,2> section(const index<2>& idx) const restrict(amp,cpu);
    array_view<const T,2> section(int i0, int i1) const restrict(amp,cpu);

    void refresh();
};

template <typename T>
class array_view<const T,3>
{
public:
    static const int rank = 3;
    typedef const T value_type;
    const extent<3> extent;

    array_view() = delete;
    array_view(const array<T,3>& src) restrict(amp,cpu);
    template <typename Container>
    array_view(const extent<3>& extent, const Container src);
    template <typename Container>
    array_view(int e0, int e1, int e2, const Container src);
    array_view(const extent<3>& extent, const value_type* src) restrict(amp,cpu);
    array_view(int e0, int e1, int e2, const value_type* src) restrict(amp,cpu);
    array_view(const array_view& other) restrict(amp,cpu);
    array_view(const array_view<const T,3>& other) restrict(amp,cpu);

    array_view& operator=(const array_view& other) restrict(amp,cpu);

    void copy_to(array<T,3>& dest) const;
    void copy_to(array_view<T,3>& dest) const;

    __declspec(property(get)) extent<3> extent;
    // These are restrict(amp,cpu)
    const T& operator[](const index<3>& idx) const restrict(amp,cpu);
    array_view<const T,2> operator[](int i) const restrict(amp,cpu);

    const T& operator()(const index<3>& idx) const restrict(amp,cpu);
    const T& operator()(int i0, int i1, int i2) const restrict(amp,cpu);

    array_view<const T,3> section(const index<3>& idx, const extent<3>& ext) const restrict(amp,cpu);
    array_view<const T,3> section(const index<3>& idx) const restrict(amp,cpu);
    array_view<const T,3> section(int i0, int i1, int i2) const restrict(amp,cpu);

    void refresh();
};
5.2.2 Constructors
The array_view type cannot be default-constructed. It must be bound at construction time to a memory location. No bounds-checking is performed when array_views are constructed.
array_view<T,N>::array_view(array<T,N>& src, bool discard_original = false) restrict(amp,cpu)
array_view<const T,N>::array_view(const array<T,N>& src) restrict(amp,cpu)
Constructs an array_view that is bound to the data that is contained in the src array. The extent of the array_view is that of the src array, and the origin of the array view is at zero.
Parameters:
src    An array that contains the data that this array_view is bound to.
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
template <typename Container>
array_view<T,N>::array_view(const extent<N>& extent, Container src, bool discard_original = false)
template <typename Container>
array_view<const T,N>::array_view(const extent<N>& extent, const Container src)
Constructs an array_view that is bound to the data that is contained in the src container. The extent of the array_view is the one that is given by the extent argument, and the origin of the array view is at zero.
Parameters:
src    A template argument that must resolve to a linear container that supports .data() and .size() members (such as std::vector or std::array).
extent    The extent of this array_view.
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
array_view<T,N>::array_view(const extent<N>& extent, value_type* src, bool discard_original = false) restrict(amp,cpu)
array_view<const T,N>::array_view(const extent<N>& extent, const value_type* src) restrict(amp,cpu)
Constructs an array_view that is bound to the data that src points to. The extent of the array_view is the one that is given by the extent argument, and the origin of the array view is at zero.
Parameters:
src    A pointer to the source data that this array_view is bound to.
extent    The extent of this array_view.
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
template <typename Container>
array_view<T,1>::array_view(int e0, Container src, bool discard_original = false)
template <typename Container>
array_view<T,2>::array_view(int e0, int e1, Container src, bool discard_original = false)
template <typename Container>
array_view<T,3>::array_view(int e0, int e1, int e2, Container src, bool discard_original = false)
template <typename Container>
array_view<const T,1>::array_view(int e0, const Container src)
template <typename Container>
array_view<const T,2>::array_view(int e0, int e1, const Container src)
template <typename Container>
array_view<const T,3>::array_view(int e0, int e1, int e2, const Container src)
Equivalent to construction by using array_view(extent<N>(e0 [, e1 [, e2 ]]), src).
Parameters:
e0 [, e1 [, e2 ] ]    The component values that will form the extent of this array_view.
src    A template argument that must resolve to a linear container that supports .data() and .size() members (such as std::vector or std::array).
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
array_view<T,1>::array_view(int e0, value_type* src, bool discard_original = false) restrict(amp,cpu)
array_view<T,2>::array_view(int e0, int e1, value_type* src, bool discard_original = false) restrict(amp,cpu)
array_view<T,3>::array_view(int e0, int e1, int e2, value_type* src, bool discard_original = false) restrict(amp,cpu)
array_view<const T,1>::array_view(int e0, const value_type* src) restrict(amp,cpu)
array_view<const T,2>::array_view(int e0, int e1, const value_type* src) restrict(amp,cpu)
array_view<const T,3>::array_view(int e0, int e1, int e2, const value_type* src) restrict(amp,cpu)
Equivalent to construction by using array_view(extent<N>(e0 [, e1 [, e2 ]]), src, discard_original).
Parameters:
e0 [, e1 [, e2 ] ]    The component values that will form the extent of this array_view.
src    A pointer to the source data that this array_view is bound to.
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
array_view(const array_view& other, bool discard_original = false) restrict(amp,cpu)
array_view(const array_view<const T,N>& other) restrict(amp,cpu)
Copy constructor. Constructs a new array_view<T,N> from the supplied argument other. A shallow copy is performed.
Parameters:
other    An object of type array_view<T,N> or array_view<const T,N> from which to initialize this new array_view.
discard_original    A Boolean flag that indicates whether the current data that underlies this view is to be discarded. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed. This parameter is ignored if an array_view is constructed in a restrict(amp) function.
array_view& operator=(const array_view& other) restrict(amp,cpu)
Assigns the contents of the array_view other to this array_view, by using a shallow copy.
Parameters:
other    An object of type array_view<T,N> from which to copy into this array_view.
Return Value: Returns *this.
void copy_to(array<T,N>& dest)
Copies the contents of this array_view to the array given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
dest    An object of type array<T,N> to which to copy data from this array_view.
void copy_to(array_view& dest)
Copies the contents of this array_view to the array_view given by dest, as if by calling copy(*this, dest) (see 5.3.2).
Parameters:
dest    An object of type array_view<T,N> to which to copy data from this array_view.
T* array_view<T,1>::data() restrict(amp,cpu)
const T* array_view<T,1>::data() const restrict(amp,cpu)
Returns a pointer to the raw data that underlies this array_view. This is only available on array_views of rank 1. Return Value: A (const) pointer to the first element in the linearized array.
void array_view<T, N>::refresh()
void array_view<const T, N>::refresh()
Calling this member function informs the array_view that its bound memory has been modified outside the array_view interface. This renders all cached information stale.
void array_view<T, N>::synchronize()
Calling this member function synchronizes any modifications made to this array_view to its underlying data container. For example, for an array_view on system memory, if the contents of the view are modified on a remote accelerator_view through a parallel_for_each invocation, calling synchronize ensures that the modifications are synchronized to the source data and will be visible through the system memory pointer that the array_view was created over.
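The interplay between synchronize() and refresh() can be pictured with a toy model in standard C++. It is not the C++ AMP runtime and does not use amp.h. The class cached_view_model and its members are hypothetical, and the remote accelerator is simulated by a private cached copy of the host data.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy model (hypothetical, not the C++ AMP runtime): a view over host memory
// whose "remote" accesses go through a private cached copy.
class cached_view_model {
public:
    cached_view_model(int* host, std::size_t n) : host_(host), n_(n) {}

    // Simulates a read on a remote accelerator; fills the cache on demand.
    int remote_read(std::size_t i) {
        assert(i < n_);
        load_if_stale();
        return cache_[i];
    }

    // Simulates a write on a remote accelerator; only the cache changes.
    void remote_write(std::size_t i, int value) {
        assert(i < n_);
        load_if_stale();
        cache_[i] = value;
        dirty_ = true;
    }

    // Models synchronize(): copy pending modifications back to the source.
    void synchronize() {
        if (!dirty_) return;
        for (std::size_t i = 0; i < n_; ++i) host_[i] = cache_[i];
        dirty_ = false;
    }

    // Models refresh(): the host memory was changed directly, so the cached
    // copy is stale. Under the rules above, synchronize() has already been
    // called, so no pending writes are lost here.
    void refresh() {
        valid_ = false;
        dirty_ = false;
    }

private:
    void load_if_stale() {
        if (!valid_) {
            cache_.assign(host_, host_ + n_);
            valid_ = true;
        }
    }

    int* host_;
    std::size_t n_;
    std::vector<int> cache_;
    bool valid_ = false;
    bool dirty_ = false;
};
```

In this model, a remote write becomes visible through the host pointer only after synchronize(), and a direct change to host memory becomes visible through the view only after refresh().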
std::shared_future<void> synchronize_async()
An asynchronous version of synchronize, which returns an STL future. When the future is ready, the synchronization operation is complete.
void array_view<T, N>::discard_data()
Indicates to the runtime that it may discard the current logical contents of this array_view. This is an optimization hint to the runtime and is used to avoid copying the current contents of the view to a target accelerator_view, and its use is recommended if the existing content is not needed.
const T& operator[](const index<N>& idx) const restrict(amp,cpu)
const T& operator()(const index<N>& idx) const restrict(amp,cpu)
Returns a const reference to the element of this array_view that is at the location in N-dimensional space that is specified by idx.
Parameters:
idx    An object of type index<N> that specifies the location of the element.
T& array_view<T,1>::operator()(int i0) restrict(amp,cpu)
T& array_view<T,1>::operator[](int i0) restrict(amp,cpu)
T& array_view<T,2>::operator()(int i0, int i1) restrict(amp,cpu)
T& array_view<T,3>::operator()(int i0, int i1, int i2) restrict(amp,cpu)
Equivalent to array_view<T,N>::operator()(index<N>(i0 [, i1 [, i2 ]])).
Parameters:
i0 [, i1 [, i2 ] ]    The component values that will form the index into this array_view.
const T& array_view<T,1>::operator()(int i0) const restrict(amp,cpu)
const T& array_view<T,2>::operator()(int i0, int i1) const restrict(amp,cpu)
const T& array_view<T,3>::operator()(int i0, int i1, int i2) const restrict(amp,cpu)
Equivalent to array_view<T,N>::operator()(index<N>(i0 [, i1 [, i2 ]])) const.
Parameters:
i0 [, i1 [, i2 ] ]    The component values that will form the index into this array_view.
array_view<T,N-1> operator[](int i0) restrict(amp,cpu)
array_view<const T,N-1> operator[](int i0) const restrict(amp,cpu)
This overload is defined for array_view<T,N> where N ≥ 2. This mode of indexing is equivalent to projecting on the most-significant dimension. It enables C-style indexing. For example:
    array<float,4> myArray(myExtents, ...);
    myArray[index<4>(5,4,3,2)] = 7;
    assert(myArray[5][4][3][2] == 7);
Parameters:
i0    An integer that is the index into the most-significant dimension of this array_view.
Return Value: Returns an array_view whose dimension is one lower than that of this array_view.
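The equivalence between chained projection and direct indexing can be sketched in standard C++ as follows. This is a model only, without amp.h. The names view_model, project, and offset_of are hypothetical, and a row-major layout with dimension 0 as the most significant is assumed.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical model of a row-major N-dimensional view: a base offset into
// linear storage plus the extents of the remaining dimensions.
struct view_model {
    std::size_t base;
    std::vector<int> extents;  // extents[0] is the most-significant dimension

    // Models operator[](int): project on the most-significant dimension,
    // yielding a view whose rank is one lower.
    view_model project(int i) const {
        assert(!extents.empty() && i >= 0 && i < extents[0]);
        std::size_t stride = 1;
        for (std::size_t d = 1; d < extents.size(); ++d) stride *= extents[d];
        return view_model{base + static_cast<std::size_t>(i) * stride,
                          std::vector<int>(extents.begin() + 1, extents.end())};
    }
};

// Models operator[](const index<N>&): direct row-major offset computation.
inline std::size_t offset_of(const std::vector<int>& idx,
                             const std::vector<int>& extents) {
    assert(idx.size() == extents.size());
    std::size_t off = 0;
    for (std::size_t d = 0; d < idx.size(); ++d) {
        off = off * extents[d] + idx[d];
    }
    return off;
}
```

Projecting four times with the components 5, 4, 3, 2 reaches the same storage offset as indexing once with the index (5,4,3,2), which is what the assertion in the example relies on.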
array_view<T,N> section(const index<N>& idx, const extent<N>& ext) restrict(amp,cpu) array_view<const T,N> section(const index<N>& idx, const extent<N>& ext) const restrict(amp,cpu)
Returns a subsection of the source array_view at the origin that is specified by idx and with the extent that is specified by ext.
Example:
array<float,2> a(extent<2>(200,100));
array_view<float,2> v1(a); // v1.extent = <200,100>
array_view<float,2> v2 = v1.section(index<2>(15,25), extent<2>(40,50));
assert(v2(0,0) == v1(15,25));
Parameters:
idx    Provides the offset/origin of the resulting section.
ext    Provides the extent of the resulting section.
Return Value: Returns a subsection of the source array at the specified origin, and with the specified extent.
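Informative: The behavior of section can be sketched with a host-side model in standard C++. The type View2D below is hypothetical and not part of C++ AMP. It assumes row-major storage, and its bounds check (including the choice of std::out_of_range) is an assumption of the model rather than something this section specifies.

```cpp
#include <cassert>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Hypothetical host-side model (not part of C++ AMP) of a rank-2 view and
// its section() operation over row-major storage.
struct View2D {
    float* base;    // element (0,0) of the underlying storage
    int pitch;      // elements per row of the underlying storage
    int origin[2];  // offset of this view inside the storage
    int extent[2];  // extent of this view

    float& operator()(int i0, int i1) const {
        assert(i0 >= 0 && i0 < extent[0] && i1 >= 0 && i1 < extent[1]);
        return base[static_cast<std::size_t>(origin[0] + i0) * pitch + (origin[1] + i1)];
    }

    // section(idx, ext): same storage, shifted origin, smaller extent.
    // The bounds check is an assumption of this model.
    View2D section(int i0, int i1, int e0, int e1) const {
        if (i0 < 0 || i1 < 0 || e0 < 0 || e1 < 0 ||
            i0 + e0 > extent[0] || i1 + e1 > extent[1])
            throw std::out_of_range("section exceeds the source view");
        return View2D{base, pitch, {origin[0] + i0, origin[1] + i1}, {e0, e1}};
    }

    // section(idx): equivalent to section(idx, extent - idx).
    View2D section(int i0, int i1) const {
        return section(i0, i1, extent[0] - i0, extent[1] - i1);
    }
};
```

A section shares storage with its source, so element (0,0) of the section and element (15,25) of the source are the same object in this model.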
array_view<T,N> section(const index<N>& idx) const restrict(amp,cpu) array_view<const T,N> section(const index<N>& idx) restrict(amp,cpu)
Equivalent to section(idx, this->extent - idx).
array_view<T,1> array_view<T,1>::section(int i0, int e0) restrict(amp,cpu) array_view<const T,1> array_view<T,1>::section(int i0, int e0) const restrict(amp,cpu) array_view<T,2> array_view<T,2>::section(int i0, int i1, int e0, int e1) restrict(amp,cpu) array_view<const T,2> array_view<T,2>::section(int i0, int i1, int e0, int e1) const restrict(amp,cpu) array_view<T,3> array_view<T,3>::section(int i0, int i1, int i2, int e0, int e1, int e2) restrict(amp,cpu) array_view<const T,3> array_view<T,3>::section(int i0, int i1, int i2, int e0, int e1, int e2) const restrict(amp,cpu)
Equivalent to section(index<N>(i0 [, i1 [, i2 ]]), extent<N>(e0 [, e1 [, e2 ]])). Parameters: i0 [, i1 [, i2 ] ] The component values that will form the origin of the section. e0 [, e1 [, e2 ] ] The component values that will form the extent of the section.
template<typename ElementType> array_view<ElementType,1> array_view<T,1>::reinterpret_as() restrict(amp,cpu) template<typename ElementType> array_view<const ElementType,1> array_view<T,1>::reinterpret_as() const restrict(amp,cpu)
This member function is similar to array<T,N>::reinterpret_as (see 5.1.5), although it only supports array_views of rank 1 (only those guarantee that all elements are laid out contiguously). The size of the reinterpreted ElementType must evenly divide into the total size of this array_view. Return Value: Returns an array_view from this array_view<T,1> with the element type reinterpreted from T to ElementType.
template <int K> array_view<T,K> array_view<T,1>::view_as(extent<K> viewExtent) restrict(amp,cpu) template <int K> array_view<const T,K> array_view<T,1>::view_as(extent<K> viewExtent) const restrict(amp,cpu)
This member function is similar to array<T,N>::view_as (see 5.1.5), although it only supports array_views of rank 1 (only those guarantee that all elements are laid out contiguously). Return Value: Returns an array_view from this array_view<T,1> with the rank changed to K from 1.
5.3 Copying Data
C++ AMP offers a universal copy function that covers all synchronous data transfer requirements. In all cases, copying data is not supported while executing on an accelerator (in other words, the copy functions do not have a restrict(amp) clause). The general form of copy is:
copy(src, dest);
Informative: This more closely follows the STL convention (destination is the last argument, as in std::copy) and is the opposite of the C-style convention (destination is the first argument, as in memcpy). Copying to array and array_view types is supported from the following sources: An array or array_view that has the same rank and element type as the destination array or array_view. A standard container whose element type is the same as the destination array or array_view.
Informative: Containers that expose .size() and .data() members (for example, std::vector and std::array) can be handled more efficiently.
The copy operation always performs a deep copy. Asynchronous copy has the same semantics as synchronous copy, except that it returns a shared_future<void> that can be waited on.
5.3.1 Synopsis
template <typename T, int N> void copy(const array<T,N>& src, array<T,N>& dest);
template <typename T, int N> void copy(const array<T,N>& src, array_view<T,N>& dest);
template <typename T, int N> void copy(const array_view<const T,N>& src, array<T,N>& dest);
template <typename T, int N> void copy(const array_view<const T,N>& src, array_view<T,N>& dest);
template <typename T, int N> void copy(const array_view<T,N>& src, array<T,N>& dest);
template <typename T, int N> void copy(const array_view<T,N>& src, array_view<T,N>& dest);
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest);
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, InputIter srcEnd, array_view<T,N>& dest);
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, array<T,N>& dest);
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, array_view<T,N>& dest);
template <typename OutputIter, typename T, int N> void copy(const array<T,N>& src, OutputIter destBegin);
template <typename OutputIter, typename T, int N> void copy(const array_view<T,N>& src, OutputIter destBegin);
template <typename T, int N> shared_future<void> copy_async(const array<T,N>& src, array<T,N>& dest);
template <typename T, int N> shared_future<void> copy_async(const array<T,N>& src, array_view<T,N>& dest);
template <typename T, int N> shared_future<void> copy_async(const array_view<const T,N>& src, array<T,N>& dest);
template <typename T, int N> shared_future<void> copy_async(const array_view<const T,N>& src, array_view<T,N>& dest);
template <typename T, int N> shared_future<void> copy_async(const array_view<T,N>& src, array<T,N>& dest);
template <typename T, int N> shared_future<void> copy_async(const array_view<T,N>& src, array_view<T,N>& dest);
template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest); template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, InputIter srcEnd, array_view<T,N>& dest); template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, array<T,N>& dest); template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, array_view<T,N>& dest); template <typename OutputIter, typename T, int N> shared_future<void> copy_async(const array<T,N>& src, OutputIter destBegin); template <typename OutputIter, typename T, int N> shared_future<void> copy_async(const array_view<T,N>& src, OutputIter destBegin);
5.3.2
template <typename T, int N> void copy(const array<T,N>& src, array_view<T,N>& dest) template <typename T, int N> shared_future<void> copy_async(const array<T,N>& src, array_view<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest do not match, a runtime exception is thrown.
Parameters:
src    An object of type array<T,N> to be copied from.
dest    An object of type array_view<T,N> to be copied to.
template <typename T, int N> void copy(const array_view<const T,N>& src, array<T,N>& dest) template <typename T, int N> void copy(const array_view<T,N>& src, array<T,N>& dest) template <typename T, int N> shared_future<void> copy_async(const array_view<const T,N>& src, array<T,N>& dest) template <typename T, int N> shared_future<void> copy_async(const array_view<T,N>& src, array<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest do not match, a runtime exception is thrown.
Parameters:
src    An object of type array_view<T,N> (or array_view<const T,N>) to be copied from.
dest    An object of type array<T,N> to be copied to.
template <typename T, int N> void copy(const array_view<const T,N>& src, array_view<T,N>& dest) template <typename T, int N> shared_future<void> copy_async(const array_view<const T,N>& src, array_view<T,N>& dest)
The contents of src are copied into dest. If the extents of src and dest do not match, a runtime exception is thrown.
Parameters:
src    An object of type array_view<T,N> (or array_view<const T,N>) to be copied from.
dest    An object of type array_view<T,N> to be copied to.
5.3.3 Copying from Standard Containers to arrays or array_views
A standard container can be copied into an array or array_view by specifying an iterator range. Informative: Standard containers that provide .size() and .data() members (such as std::vector and std::array) can be handled very efficiently.
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest) template <typename InputIter, typename T, int N>
void copy(InputIter srcBegin, array<T,N>& dest) template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, InputIter srcEnd, array<T,N>& dest) template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, array<T,N>& dest)
The contents of a source container from the iterator range [srcBegin,srcEnd) are copied into dest. If the number of elements in the iterator range is not equal to dest.extent.size(), an exception is thrown. In the overloads that do not take an end-iterator, it is assumed that the source iterator is able to provide at least dest.extent.size() elements; no checking is performed, nor is it possible.
Parameters:
srcBegin    An iterator to the first element of a source container.
srcEnd    An iterator to the end of a source container.
dest    An object of type array<T,N> to be copied to.
template <typename InputIter, typename T, int N> void copy(InputIter srcBegin, InputIter srcEnd, array_view<T,N>& dest) template <typename InputIter, typename T, int N> shared_future<void> copy_async(InputIter srcBegin, InputIter srcEnd, array_view<T,N>& dest)
The contents of a source container from the iterator range [srcBegin,srcEnd) are copied into dest. If the number of elements in the iterator range is not equal to dest.extent.size(), an exception is thrown.
Parameters:
srcBegin    An iterator to the first element of a source container.
srcEnd    An iterator to the end of a source container.
dest    An object of type array_view<T,N> to be copied to.
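Informative: The element-count rule for the iterator-range overloads can be sketched on the host in standard C++. The functions checked_copy and unchecked_copy are hypothetical and not part of C++ AMP. A std::vector stands in for the destination array or array_view, with dest.size() playing the role of dest.extent.size(). The exception type std::length_error is an assumption, because this section only says that an exception is thrown. The sketch also assumes iterators that can be traversed more than once.

```cpp
#include <cassert>
#include <cstddef>
#include <iterator>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for copy(srcBegin, srcEnd, dest): the number of
// source elements must equal the number of destination elements.
template <typename InputIter, typename T>
void checked_copy(InputIter srcBegin, InputIter srcEnd, std::vector<T>& dest) {
    const auto count = std::distance(srcBegin, srcEnd);
    if (count < 0 || static_cast<std::size_t>(count) != dest.size())
        throw std::length_error("source range and destination sizes differ");
    std::size_t i = 0;
    for (InputIter it = srcBegin; it != srcEnd; ++it) dest[i++] = *it;
}

// Hypothetical stand-in for copy(srcBegin, dest): the caller guarantees that
// at least dest.size() elements are readable; nothing can be checked here.
template <typename InputIter, typename T>
void unchecked_copy(InputIter srcBegin, std::vector<T>& dest) {
    for (std::size_t i = 0; i < dest.size(); ++i, ++srcBegin) dest[i] = *srcBegin;
}
```

The overload without an end iterator trades safety for convenience: it reads exactly as many elements as the destination holds and relies on the caller for validity.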
5.3.4 Copying from arrays or array_views to Standard Containers
An array or array_view can be copied into a standard container by specifying the begin iterator. Standard containers that provide .size() and .data() members (such as std::vector and std::array) can be handled very efficiently.
template <typename OutputIter, typename T, int N> void copy(const array<T,N>& src, OutputIter destBegin) template <typename OutputIter, typename T, int N> shared_future<void> copy_async(const array<T,N>& src, OutputIter destBegin)
The contents of a source array are copied into the destination container, starting at the iterator destBegin. If the number of elements in the range that starts with destBegin in the destination container is smaller than src.extent.size(), an exception is thrown.
Parameters:
src    An object of type array<T,N> to be copied from.
destBegin    An output iterator that addresses the position of the first element in the destination container.
template <typename OutputIter, typename T, int N> void copy(const array_view<T,N>& src, OutputIter destBegin) template <typename OutputIter, typename T, int N> shared_future<void> copy_async(const array_view<T,N>& src, OutputIter destBegin)
The contents of a source array_view are copied into the destination container, starting at the iterator destBegin. If the number of elements in the range that starts with destBegin in the destination container is smaller than src.extent.size(), an exception is thrown.
Parameters:
src    An object of type array_view<T,N> to be copied from.
destBegin    An output iterator that addresses the position of the first element in the destination container.
6 Atomic Operations
C++ AMP provides a set of atomic operations in the concurrency namespace. These operations are applicable in restrict(amp) contexts and may be applied to memory locations in concurrency::array instances and to memory locations in tile_static variables. Section 8 describes the C++ AMP memory model and how atomic operations fit into it.
6.1 Synopsis
int atomic_exchange(int * dest, int val) restrict(amp)
unsigned int atomic_exchange(unsigned int * dest, unsigned int val) restrict(amp)
float atomic_exchange(float * dest, float val) restrict(amp)
bool atomic_compare_exchange(int * dest, int * expected_value, int val) restrict(amp)
bool atomic_compare_exchange(unsigned int * dest, unsigned int * expected_value, unsigned int val) restrict(amp)
int atomic_fetch_add(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_add(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_sub(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_sub(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_max(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_max(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_min(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_min(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_and(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_and(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_or(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_or(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_xor(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_xor(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_inc(int * dest) restrict(amp)
unsigned int atomic_fetch_inc(unsigned int * dest) restrict(amp)
int atomic_fetch_dec(int * dest) restrict(amp)
unsigned int atomic_fetch_dec(unsigned int * dest) restrict(amp)
6.2
int atomic_exchange(int * dest, int val) restrict(amp) unsigned int atomic_exchange(unsigned int * dest, unsigned int val) restrict(amp) float atomic_exchange(float * dest, float val) restrict(amp)
Atomically reads the value that is stored in dest, replaces it with the value that is given in val, and returns the old value to the caller. This function provides overloads for int, unsigned int, and float parameters.
Parameters:
dest    A pointer to the location that has to be atomically modified. The location may reside in a concurrency::array or in a tile_static variable.
val    The new value to be stored in the location that is pointed to by dest.
Return value: These functions return the old value that was previously stored at dest, and that was atomically replaced. These functions always succeed.
bool atomic_compare_exchange(int * dest, int * expected_val, int val) restrict(amp) bool atomic_compare_exchange(unsigned int * dest, unsigned int * expected_val, unsigned int val) restrict(amp)
These functions attempt to perform these three steps atomically:
1. Read the value that is stored in the location that is pointed to by dest.
2. Compare the value that is read in the previous step with the value that is contained in the location that is pointed to by expected_val.
3. Carry out one of the following operations, depending on the result of the comparison of the previous step:
a. If the values are identical, then the function tries to atomically change the value that is pointed to by dest to the value in val. The function indicates by its return value whether this transformation succeeded or not.
b. If the values are not identical, then the function stores the value that is read in step (1) into the location that is pointed to by expected_val, and returns false.
In terms of sequential semantics, the function is equivalent to the following pseudo-code:
auto t = *dest;
bool eq = t == *expected_val;
if (eq) *dest = val;
*expected_val = t;
return eq;
The function may fail spuriously. It is guaranteed that the system as a whole will make progress when threads are contending to atomically modify a variable, but there is no upper bound on the number of failed attempts that any particular thread may experience.
Parameters:
dest    A pointer to the location that has to be atomically modified. The location may reside in a concurrency::array or in a tile_static variable.
expected_val    A pointer to a local variable or function parameter. On calling the function, the location that is pointed to by expected_val contains the value that the caller expects dest to contain. On return from the function, expected_val contains the most recent value that is read from dest.
val    The new value to be stored in the location that is pointed to by dest.
Return value: The return value indicates whether the function succeeded in atomically reading, comparing, and modifying the contents of the memory location.
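Informative: Because the function may fail spuriously and refreshes the expected value on failure, it is normally used in a retry loop. The following host-side sketch in standard C++ is an analogy only and is not C++ AMP code. It relies on std::atomic<int>::compare_exchange_weak, which has a comparable contract in standard C++ (it may fail spuriously, and on failure it stores the observed value into its first argument). The function name fetch_multiply is hypothetical.

```cpp
#include <atomic>
#include <cassert>

// Host-side analogy (not C++ AMP): build an atomic read-modify-write that is
// not provided directly, here a multiplication, from compare-and-exchange.
inline int fetch_multiply(std::atomic<int>& dest, int factor) {
    int expected = dest.load();
    // Retry until the value the product was computed from is still stored.
    // On failure, `expected` is refreshed with the most recent value.
    while (!dest.compare_exchange_weak(expected, expected * factor)) {
        // Nothing to do: the next iteration recomputes from `expected`.
    }
    return expected;  // the old value that was replaced
}
```

Each failed attempt means that some other thread succeeded in changing the location, which corresponds to the guarantee above that the system as a whole makes progress.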
6.3
int atomic_fetch_add(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_add(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_sub(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_sub(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_max(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_max(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_min(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_min(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_and(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_and(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_or(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_or(unsigned int * dest, unsigned int val) restrict(amp)
int atomic_fetch_xor(int * dest, int val) restrict(amp)
unsigned int atomic_fetch_xor(unsigned int * dest, unsigned int val) restrict(amp)
Atomically reads the value that is stored in dest, applies the binary numerical operation that is specific to the function, with the read value and val serving as input operands, and stores the result back to the location that is pointed to by dest. In terms of sequential semantics, the operation that is performed by any of the above functions is described by this pseudocode:
*dest = *dest <op> val;
where <op> stands for the binary operation that is specific to the function (add, sub, max, min, and, or, or xor).
Parameters:
dest    A pointer to the location that has to be atomically modified. The location may reside in a concurrency::array or in a tile_static variable.
val    The second operand that participates in the calculation of the binary operation whose result is stored into the location that is pointed to by dest.
Return value: These functions return the old value that was previously stored at dest, and that was atomically replaced. These functions always succeed.
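Informative: The sequential semantics of the fetch-and-modify functions, including the fact that they return the old value, can be sketched in standard C++. The names fetch_op_model, model_fetch_add, model_fetch_max, and model_fetch_and are hypothetical, and the model is deliberately not atomic: it only shows what one call does when no other thread interferes.

```cpp
#include <algorithm>
#include <cassert>

// Sequential model (illustration only, not atomic): apply `op` to the stored
// value and `val`, store the result, and return the previously stored value.
template <typename T, typename BinaryOp>
T fetch_op_model(T* dest, T val, BinaryOp op) {
    T old = *dest;
    *dest = op(old, val);
    return old;
}

inline int model_fetch_add(int* dest, int val) {
    return fetch_op_model(dest, val, [](int a, int b) { return a + b; });
}

inline int model_fetch_max(int* dest, int val) {
    return fetch_op_model(dest, val, [](int a, int b) { return std::max(a, b); });
}

inline unsigned int model_fetch_and(unsigned int* dest, unsigned int val) {
    return fetch_op_model(dest, val,
                          [](unsigned int a, unsigned int b) { return a & b; });
}
```

Returning the old value lets a caller use, for example, a fetch-and-add as a way to reserve a unique slot in an output buffer.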
int atomic_fetch_inc(int * dest) restrict(amp) unsigned int atomic_fetch_inc(unsigned int * dest) restrict(amp) int atomic_fetch_dec(int * dest) restrict(amp) unsigned int atomic_fetch_dec(unsigned int * dest) restrict(amp)
Atomically increment or decrement the value that is stored at the location that is pointed to by dest.
Parameters:
dest    A pointer to the location that has to be atomically modified. The location may reside in a concurrency::array or in a tile_static variable.
Return value: These functions return the old value that was previously stored at dest, and that was atomically replaced. These functions always succeed.
In C++ AMP you use a form of parallel_for_each() to launch data-parallel computations on accelerators. The behavior of parallel_for_each is similar to that of std::for_each: execute a function for each element in a container. The C++ AMP specializations over containers of type extent and tiled_extent enable execution of functions on accelerators. The parallel_for_each function takes the following general forms:
1. Non-tiled:
template <int N, typename Kernel> void parallel_for_each(extent<N> compute_domain, Kernel f);
2. Tiled:
template <int D0, int D1, int D2, typename Kernel> void parallel_for_each(tiled_extent<D0,D1,D2> compute_domain, Kernel f); template <int D0, int D1, typename Kernel> void parallel_for_each(tiled_extent<D0,D1> compute_domain, Kernel f); template <int D0, typename Kernel> void parallel_for_each(tiled_extent<D0> compute_domain, Kernel f);
3. Tiled, with an explicit accelerator_view:
template <int D0, int D1, int D2, typename Kernel> void parallel_for_each(const accelerator_view& accl_view, tiled_extent<D0,D1,D2> compute_domain, Kernel f); template <int D0, int D1, typename Kernel> void parallel_for_each(const accelerator_view& accl_view, tiled_extent<D0,D1> compute_domain, Kernel f); template <int D0, typename Kernel> void parallel_for_each(const accelerator_view& accl_view, tiled_extent<D0> compute_domain, Kernel f);
A parallel_for_each over an extent represents a dense loop nest of independent serial loops. When parallel_for_each executes, a parallel activity is spawned for each index in the compute domain. Each parallel activity is associated with an index value. (This index is an index<N> in the case of a non-tiled parallel_for_each, or a tiled_index<D0,D1,D2> in the case of a tiled parallel_for_each.) A parallel activity typically uses its index to access the appropriate locations in the input/output arrays. A call to parallel_for_each behaves as if it were synchronous. In practice, the call may be asynchronous because it executes on a separate device, but because data copy-out is a synchronizing event, you cannot tell the difference. There are no guarantees on the order and concurrency of the parallel activities that are spawned by the non-tiled parallel_for_each. Therefore, do not assume that one activity can wait for another sibling activity to complete in order to make progress itself. This is discussed in further detail in section 8.
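Informative: The "dense loop nest" view of a non-tiled parallel_for_each can be sketched as a sequential reference model in standard C++. This is not C++ AMP code: the names for_each_index_2d and add_matrices are hypothetical, a std::array<int, 2> stands in for index<2>, and the fixed visiting order of the loops is an artifact of the model that a real kernel must not rely on.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical sequential model (not C++ AMP): invoke `f` once per index of
// a rank-2 compute domain. A real parallel_for_each gives no ordering or
// concurrency guarantees, so `f` must not depend on the visiting order.
template <typename Kernel>
void for_each_index_2d(int e0, int e1, Kernel f) {
    for (int i0 = 0; i0 < e0; ++i0)
        for (int i1 = 0; i1 < e1; ++i1)
            f(std::array<int, 2>{{i0, i1}});
}

// Example kernel: element-wise sum of two row-major e0 x e1 matrices. Each
// activity touches only the element selected by its own index.
inline std::vector<int> add_matrices(const std::vector<int>& a,
                                     const std::vector<int>& b,
                                     int e0, int e1) {
    std::vector<int> c(a.size());
    for_each_index_2d(e0, e1, [&](std::array<int, 2> idx) {
        const std::size_t k = static_cast<std::size_t>(idx[0]) * e1 + idx[1];
        c[k] = a[k] + b[k];
    });
    return c;
}
```

Because every invocation writes a distinct element and reads nothing written by a sibling, the result is the same for any execution order, which is the property the text above asks kernels to have.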
The tiled version of parallel_for_each organizes the parallel activities into fixed-size tiles of 1, 2, or 3 dimensions, as given by the tiled_extent<> argument. The tiled_extent that is provided as the first parameter to parallel_for_each must be divisible, along each of its dimensions, by the respective tile extent. Tiling beyond 3 dimensions is not supported. Threads (parallel activities) in the same tile have access to shared tile_static memory, and can use tiled_index::barrier.wait (4.5.3) to synchronize access to it.
When an amp-restricted kernel is launched, the implementation of tiled parallel_for_each provides the following minimum capabilities:
• The maximum number of tiles per dimension will be no less than 65535.
• The maximum number of threads in a tile will be no less than 1024.
  o In 3D tiling, the maximal value of D0 will be no less than 64.
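Informative: A portable program may want to check a tiled launch configuration against the divisibility rule and the minimum capabilities listed above before launching. The helper below is a hypothetical sketch in standard C++ and not a C++ AMP API. It treats the minimum capabilities as hard limits, which is a conservative assumption, since an implementation is allowed to support more.

```cpp
#include <cassert>
#include <string>

// Hypothetical helper (not a C++ AMP API): checks a 3D tiled launch against
// the divisibility rule and the minimum portable capabilities. Index 0 is
// the D0 dimension. Returns an empty string when the configuration passes.
inline std::string check_tiled_launch(const int (&extent)[3], const int (&tile)[3]) {
    long long threads_per_tile = 1;
    for (int d = 0; d < 3; ++d) {
        if (tile[d] <= 0 || extent[d] <= 0) return "non-positive extent";
        if (extent[d] % tile[d] != 0) return "extent not divisible by tile extent";
        if (extent[d] / tile[d] > 65535) return "too many tiles in one dimension";
        threads_per_tile *= tile[d];
    }
    if (threads_per_tile > 1024) return "too many threads in a tile";
    if (tile[0] > 64) return "D0 too large for 3D tiling";
    return "";
}
```

For example, a 64 x 128 x 128 domain with 4 x 16 x 16 tiles passes, because each tile has exactly 1024 threads and every extent is a multiple of its tile extent.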
Microsoft-specific: When an amp-restricted kernel is launched, the tiled parallel_for_each provides the above portable guarantees and no more. That is:
• The maximum number of tiles per dimension is 65535.
• The maximum number of threads in a tile is 1024.
  o In 3D tiling, the maximum value that is supported for D0 is 64.
The execution behind the parallel_for_each occurs on an accelerator. This accelerator may be passed explicitly to parallel_for_each (as an optional first argument). Otherwise, the target accelerator is chosen from the objects of type array<T,N> that were captured in the kernel lambda. All arrays must be bound to the same accelerator; if they are not, an exception is thrown.
The tiled_index<> argument that is passed to the kernel contains a collection of indices that include those that are relative to the current tile.
The argument f of template-argument type Kernel to the parallel_for_each function must be a lambda or functor that offers an appropriate function call operator, which the implementation of parallel_for_each invokes by using the instantiated index type. To execute on an accelerator, the function call operator must be marked restrict(amp) (but may have additional restrictions), and it must be callable from a caller that is passing in the instantiated index type. Overload resolution is handled as if the caller contained this code:
template <typename IndexType, typename Kernel> void parallel_for_each_stub(IndexType i, Kernel f) restrict(amp) { f(i); }
Where the Kernel f argument is the same one that is passed into parallel_for_each by the caller, and the index instance i is the thread identifier, where IndexType is the following type:
• Non-tiled parallel_for_each: index<N>, where N must be the same rank as the extent<N> that is used in the parallel_for_each.
• Tiled parallel_for_each: tiled_index<D0 [, D1 [, D2]]>, where the tile extents must match those of the tiled_extent that is used in the parallel_for_each.
The value that is returned by the kernel function, if any, is ignored. Microsoft-specific:
In the Microsoft implementation of C++ AMP, every function that is referenced directly or indirectly by the kernel function, as well as the kernel function itself, must be inlineable (see footnote 3).
7.1
Because the kernel function object does not take any other arguments, all other data that is operated on by the kernel, other than the thread index, must be captured in the lambda or function object that is passed to parallel_for_each. The function object must be an amp-compatible class, struct, or union type, including those that are introduced by lambda expressions. Note: class array_view is an amp-compatible type.
7.2 Exception Behavior
If an error occurs when the parallel_for_each is trying to launch, an exception is thrown. Exceptions can be thrown for the following reasons:
1. Failure to create shader
2. Failure to create buffers
3. Invalid extent passed
4. Mismatched accelerators
Correctly synchronized C++ AMP programs are correctly synchronized C++ programs that also adhere to these additional C++ AMP rules:
2.
8.1
In this section, we will consider the relationship between sibling threads in a parallel_for_each call. Interaction between separate parallel_for_each calls, copy operations, and other host-side operations will be considered in the following subsections. A parallel_for_each call logically initiates the operation of multiple sibling threads, one for each coordinate in the extent or tiled_extent that is passed to it. All the threads that are launched by a parallel_for_each are potentially concurrent. Unless barriers are used, an implementation is free to schedule these threads in any order. In addition, the memory model for normal memory accesses is weak; that is, operations can be arbitrarily reordered as long as each thread executes in its original program order. Therefore, any two memory operations from any two threads in a parallel_for_each are by default concurrent, unless the application has explicitly enforced an order between these two operations by using atomic operations, fences, or barriers.
Footnote 3: An implementation can employ whole-program compilation (such as link-time code-gen) to achieve this.
Conversely, an implementation may also schedule only one logical thread at a time, in a non-cooperative manner; that is, without letting any other threads make any progress except for hitting a tile barrier or terminating. When a thread encounters a tile barrier, an implementation must wrest control from that thread and provide progress to some other thread in the tile until they all have reached the barrier. Similarly, when a thread finishes execution, the system is obligated to execute steps from some other thread. Therefore, an implementation is obligated to switch context between threads only when a thread has hit a barrier (barriers pertain just to the tiled parallel_for_each), or is finished. An implementation does not have to admit any concurrency at a finer level than that which is dictated by barriers and thread termination. All implementations, however, are obligated to ensure that progress is continually made, until all threads that are launched by a parallel_for_each are completed. An immediate corollary is that C++ AMP does not provide a mechanism that a thread could use, without using tile barriers, to poll for a change that has to be effected by another thread. In particular, C++ AMP does not support locks that are implemented by using atomic operations and fences, because a thread could end up polling forever, while waiting for a lock to become available. The usage of tile barriers enables the creation of a limited form of locking that is scoped to a thread tile. For example:
void tile_lock_example()
{
    parallel_for_each(
        extent<1>(TILE_SIZE).tile<TILE_SIZE>(),
        [] (tiled_index<TILE_SIZE> tidx) restrict(amp)
        {
            tile_static int lock;
            // Initialize lock:
            if (tidx.local[0] == 0) lock = 0;
            tidx.barrier.wait();
            bool performed_my_exclusive_work = false;
            for (;;)
            {
                // atomic_compare_exchange takes a pointer to the expected value and
                // overwrites it on failure, so it is reset on every iteration.
                int expected = 0;
                // Try to acquire the lock
                if (!performed_my_exclusive_work && atomic_compare_exchange(&lock, &expected, 1))
                {
                    // The lock has been acquired - mutual exclusion from the rest of the
                    // threads in the tile is provided here....
                    some_synchronized_op();
                    // Release the lock
                    atomic_exchange(&lock, 0);
                    performed_my_exclusive_work = true;
                }
                else
                {
                    // The lock wasn't acquired, or we are already finished. Perhaps we can
                    // do something else in the meanwhile.
                    some_non_exclusive_op();
                }
                // The tile barrier ensures progress, so threads can spin in the for loop
                // until they are successful in acquiring the lock.
                tidx.barrier.wait();
            }
        });
}
Informative: More often than not, such non-deterministic locking within a tile is not really necessary, because a static schedule of the threads that is based on integer thread IDs is possible, and results in more efficient and more maintainable code. But we bring this example here for completeness and to illustrate a valid form of polling.
8.1.1 Correct Usage of Tile Barriers
Correct C++ AMP programs require all threads in a tile to hit all tile barriers uniformly. That is, at a minimum, when a thread encounters a particular tile_barrier::wait call site (or any other barrier method of class tile_barrier), all other threads in the tile must encounter the same call site.
Informative: This requirement, however, is typically not sufficient to allow for efficient implementations. For example, it allows the call stacks of threads to differ when they hit a barrier. To be able to generate good quality code for vector targets, much stronger constraints should be placed on the usage of barriers, as explained below.

C++ AMP requires all active control flow expressions that lead to a tile barrier to be tile-uniform. Active control flow expressions are those that guard the scopes of all control flow constructs and logical expressions that are actively being executed when a barrier is called. For example, the condition of an if statement is an active control flow expression as long as either the true or the false branch of the if statement is still executing. If either of those branches contains a tile barrier, or leads to one through an arbitrary nesting of scopes and function calls, then the control flow expression that controls the if statement must be tile-uniform. What follows is the list of control flow constructs that may lead to a barrier and their corresponding control expressions:
    if (<control-expression>) <statement> else <statement>
    switch (<control-expression>) { <cases> }
    for (<init-expression>; <control-expression>; <iteration-expression>) <statement>
    while (<control-expression>) <statement>
    do <statement> while (<control-expression>);
    <control-expression> ? <expression> : <expression>
    <control-expression> && <expression>
    <control-expression> || <expression>
All active control flow constructs are strictly nested in accordance with the program's text, starting from the scope of the lambda at the parallel_for_each all the way to the scope that contains the barrier. C++ AMP requires that, when a barrier is encountered by one thread:

1. That the same barrier will be encountered by all other threads in the tile.
2. That the sequence of active control flow statements and/or expressions be identical for all threads when they reach the barrier.
3. That each of the corresponding control expressions be tile-uniform (which is defined below).
4. That any active control flow statement or expression has not been departed (necessarily in a non-uniform fashion) by a break, continue, or return statement. That is, any breaking statement that instructs the program to leave an active scope must in itself behave as if it was a barrier; that is, it must adhere to the preceding rules.
Informally, a tile-uniform expression is an expression that only involves variables, literals, and function calls that have a uniform value throughout the tile. Formally, C++ AMP specifies that:

1. Tile-uniform expressions may reference literals and template parameters.
2. Tile-uniform expressions may reference const (or effectively const) data members of the function object parameter of parallel_for_each.
3. Tile-uniform expressions may reference tiled_index<,,>::tile.
4. Tile-uniform expressions may reference values that are loaded from tile_static variables as long as those values are loaded immediately and uniformly after a tile barrier. That is, if the barrier and the load of the value occur in the same function, the barrier dominates the load, and no potential store into the same tile_static variable intervenes between the barrier and the load, then the loaded value will be considered tile-uniform.
5. Control expressions may reference tile-uniform local variables and parameters. Uniform local variables and parameters are variables and parameters that are always initialized and assigned-to under uniform control flow (that is, by using the same rules that are defined here for barriers), and that are only assigned tile-uniform expressions.
6. Tile-uniform expressions may reference the return values of functions that return tile-uniform expressions.
7. Tile-uniform expressions may not reference any expression that is not explicitly listed by the previous rules.
An implementation is not obligated to warn when a barrier does not meet the criteria that are set forth above. An implementation may disqualify the compilation of programs that contain incorrect barrier usage. Conversely, an implementation may accept programs that contain incorrect barrier usage and may execute them with undefined behavior.

8.1.2 Establishing Order Between Operations of Concurrent parallel_for_each Threads

Threads may employ atomic operations, barriers, and fences to establish a happens-before relationship that encompasses their cumulative execution. When the correctness of the synchronization of programs is considered, the following three aspects of the programs are relevant:

1. The types of memory that are potentially accessed concurrently by different threads. The memory type can be:
   a. Global memory
   b. Tile-static memory
2. The relationship between the threads that could potentially access the same piece of memory. They could be:
   a. Within the same thread tile
   b. Within separate thread tiles, or sibling threads in the basic (non-tiled) parallel_for_each model
3. Memory operations that the program contains:
   a. Normal memory reads and writes
   b. Atomic read-modify-write operations
   c. Memory fences and barriers
Informally, the C++ AMP memory model is a weak memory model that is consistent with the C++ memory model, with the following exceptions:

1. Atomic operations do not necessarily create a sequentially consistent subset of execution. Atomic operations are only coherent, not sequentially consistent. That is, there does not necessarily exist a global linear order that contains all atomic operations that affect all memory locations that were subjects of such operations. Rather, a separate global order exists for each memory location, and these per-location memory orders are not necessarily combinable into one global order. (This means an atomic operation does not constitute a memory fence.)
2. Memory fence operations are limited in their effects to the thread tile that they are performed within. When a thread from tile A executes a fence, the fence operation does not necessarily affect any other thread from any tile other than A.
3. As a result of (1) and (2), the only mechanism that is available for cross-tile communication is atomic operations, and even when atomic operations are concerned, a linear order is only guaranteed to exist on a per-location basis, but not necessarily globally.
4. Fences are bi-directional, which means that they have both acquire and release semantics.
5. Fences can also be further scoped to a particular memory type (global vs. tile-static).
6. Applying normal stores and atomic operations concurrently to the same memory location causes undefined behavior.
7. Applying a normal load and an atomic operation concurrently to the same memory location is allowed (that is, it results in defined behavior).
We will now provide a more formal characterization of the different categories of programs, based on their adherence to synchronization rules. The three classes of adherence are:

1. barrier-incorrect programs
2. racy programs
3. correctly-synchronized programs

8.1.2.1 Barrier-incorrect Programs
A barrier-incorrect program is a program that does not adhere to the correct barrier usage rules that are specified in the previous section. Such programs always have undefined behavior. The remainder of this section discusses barrier-correct programs only.
8.1.2.2 Compatible Memory Operations
The following definition is later used in the definition of racy programs. Two memory operations that are applied to the same (or overlapping) memory location are compatible if they are both aligned and have the same data width, and either both operations are reads, both operations are atomic, or one operation is a read and the other is atomic. This is summarized in the following table, in which T1 is a thread that is executing operation Op1 and T2 is a thread that is executing operation Op2.
| Op1    | Op2    | Compatible? |
|--------|--------|-------------|
| Atomic | Atomic | Yes         |
| Read   | Read   | Yes         |
| Read   | Atomic | Yes         |
| Write  | Any    | No          |
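The compatibility rule can be restated as a small predicate. The following is a minimal host-side sketch in standard C++, not part of the C++ AMP API; the names MemOp, Access, and compatible are hypothetical and exist only for illustration.

```cpp
#include <cassert>

// Kinds of memory operation distinguished by the compatibility table.
enum class MemOp { Read, Write, Atomic };

// Hypothetical descriptor of one access; the field names are illustrative.
struct Access {
    MemOp kind;
    bool aligned;      // the access is aligned
    unsigned width;    // data width in bytes
};

// Returns true if two accesses to the same (or overlapping) location are
// compatible: both aligned, same width, and neither is a normal write.
inline bool compatible(const Access& a, const Access& b) {
    if (!a.aligned || !b.aligned || a.width != b.width) return false;
    if (a.kind == MemOp::Write || b.kind == MemOp::Write) return false;
    // What remains is read/read, read/atomic, or atomic/atomic.
    return true;
}
```

With this sketch, an aligned 4-byte read paired with an aligned 4-byte atomic is compatible, while any pairing that involves a normal write is not.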
8.1.2.3 Concurrent Memory Operations
The following definition is later used in the definition of racy programs. Informally, two memory operations by different threads are considered concurrent if no order has been established between them. Order can be established between two memory operations only when they are executed by threads within the same tile. Therefore, any two memory operations by threads from different tiles are always concurrent, even if they are atomic. Within the same tile, order is established by using fences and barriers. Barriers are a strong form of a fence.

Formally, let {T1,...,TN} be the threads of a tile. Fix a sharable memory type (be it global or tile-static). Let M be the total set of memory operations of the given memory type that are performed by the collective of the threads in the tile. Let F = <F1,...,FL> be the set of memory fence operations of the given memory type, performed by the collection of threads in the tile, and organized arbitrarily into an ordered sequence. Let P be a partitioning of M into a sequence of subsets P = <M0,...,ML>, organized into an ordered sequence in an arbitrary fashion. Let S be the interleaving of F and P, S = <M0,F1,M1,...,FL,ML>. S is conforming if all of the following conditions hold:

1. Adherence to program order: For each Ti, S respects the fences that are performed by Ti. That is, any operation that is performed by Ti before Ti performed fence Fj appears strictly before Fj in S, and similarly, any operation that is performed by Ti after Fj appears strictly after Fj in S.
2. Self-consistency: For i<j, let Mi be a subset that contains at least one store (atomic or non-atomic) into location L and let Mj be a subset that contains at least one load of L, and no stores into L. Further assume that no subset in between Mi and Mj stores into L. Then S provides that all loads in Mj:
   a. Must return values that are stored into L by operations in Mi.
   b. And, for each thread Ti, the subset of Ti operations in Mj reading L must all return the same value (which is necessarily one that is stored by an operation in Mi, as specified by condition (a) above).
3. Respecting initial values: Let Mj be a subset that contains a load of L, and no stores into L. Further assume that there is no Mi where i<j such that Mi contains a store into L. Then all loads of L in Mj will return the initial value of L.
In such a conforming sequence S, two operations are concurrent if they have been executed by different threads and they belong to some common subset Mi. Two operations are concurrent in an execution history of a tile, if there exists a conforming interleaving S, as described herein, in which the operations are concurrent. Two operations of a program are concurrent if there possibly exists an execution of the program in which they are concurrent.

A barrier behaves like a fence to establish order between operations, except that it provides additional guarantees on the order of execution. Based on the above definition, a barrier is like a fence that only permits a certain kind of interleaving; specifically, one in which the sequence of fences (F in the above formalization) has the fences, corresponding to the barrier execution by individual threads, appearing uninterrupted in S, without any memory operations interleaved between them. For example, consider the following program:

    C1
    Barrier
    C2

Assume that C1 and C2 are arbitrary sequences of code. Assume this program is executed by two threads T1 and T2; then, the only possible conforming interleavings are given by the following pattern:

    T1(C1) || T2(C1)
    T1(Barrier) || T2(Barrier)
    T1(C2) || T2(C2)

Where the || operator implies arbitrary interleaving of the two operand sequences.

8.1.2.4 Racy Programs
Racy programs are programs that have possible executions where at least two operations that are performed by two separate threads are both (a) incompatible AND (b) concurrent. Racy programs do not have semantics assigned to them. They have undefined behavior.

8.1.2.5 Race-free Programs
Race-free programs are, simply, programs that are not racy. Race-free programs have the following semantics assigned to them:

1. If two memory operations are ordered (that is, not concurrent) by fences and/or barriers, then the values that are loaded/stored will respect the ordering.
2. If two memory operations are concurrent, then they must be atomic and/or reads that are performed by threads within the same tile. For each memory location X there exists an eventual total order that includes all such concurrent operations applied to X and that obeys the semantics of loads and atomic read-modify-write transactions.
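The definitions of concurrent, compatible, and racy can be combined into a small checker. The sketch below is an assumption-laden model in standard C++, not C++ AMP code: it inspects one given conforming sequence only (a program is racy if any possible execution has a race), it assumes all accesses are aligned and of equal width, and the names Kind, Op, concurrent, compatible, and has_race are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Kind { Read, Write, Atomic };

// One memory operation placed in a conforming sequence S = <M0, F1, M1, ...>.
struct Op {
    int thread;     // index of the executing thread within the tile
    int subset;     // index i of the subset Mi that contains the operation
    int location;   // abstract memory location
    Kind kind;
};

// Compatible: neither operation is a normal write (alignment and width are
// assumed equal in this simplified model).
inline bool compatible(Kind a, Kind b) {
    return a != Kind::Write && b != Kind::Write;
}

// Concurrent within S: different threads, same subset Mi.
inline bool concurrent(const Op& a, const Op& b) {
    return a.thread != b.thread && a.subset == b.subset;
}

// A race is a pair of operations on the same location that is both
// concurrent and incompatible.
inline bool has_race(const std::vector<Op>& s) {
    for (std::size_t i = 0; i < s.size(); ++i)
        for (std::size_t j = i + 1; j < s.size(); ++j)
            if (s[i].location == s[j].location &&
                concurrent(s[i], s[j]) &&
                !compatible(s[i].kind, s[j].kind))
                return true;
    return false;
}
```

In this model, a write by one thread and a read of the same location by another thread race when they fall into the same subset, and stop racing once a fence or barrier places them in different subsets.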
8.2 Cumulative Effects of a parallel_for_each Call
An invocation of parallel_for_each receives a function object, the contents of which are made available on the device. The function object may contain: concurrency::array reference data members, concurrency::array_view value data members, concurrency::texture, and concurrency::writeonly_texture_view reference data members. Each of these members could be constrained in the type of access that it provides to kernel code. For example, an array<int,2>& member provides both read and write access to the array, while a const array<int,2>& member provides just read access to the array. Similarly, an array_view<int,2> member provides read and write access, while an array_view<const int,2> member provides read access only.

The C++ AMP specification permits implementations in which the memory that backs an array, array_view, or texture could be shared between different accelerators, and possibly also the host, while also permitting implementations where data has to be copied, by the implementation, between different memory regions to support access by some hardware. Simulating coherence at a very granular level is too expensive in the case where disjoint memory regions are required by the hardware. Therefore, to support both styles of implementation, this specification stipulates that parallel_for_each has the freedom to implement coherence over array, array_view, and texture by using coarse copying. Specifically, while a parallel_for_each call is being evaluated, implementations may:

1. Load and/or store any location, in any order, any number of times, of each container that is passed into parallel_for_each in read/write mode.
2. Load from any location, in any order, any number of times, of each container that is passed into parallel_for_each in read-only mode.
A parallel_for_each always behaves synchronously. That is, any observable side effects that are caused by any thread that is executing within a parallel_for_each call, or any side effects that are further effected by the implementation due to the freedom it has in moving memory around, as stipulated above, must be visible by the time parallel_for_each returns. However, because the effects of parallel_for_each are constrained to changing values within arrays, array_views, and textures, and each of these objects can synchronize its contents lazily upon access, an asynchronous implementation of parallel_for_each is possible, and encouraged.

Informative: Future versions of parallel_for_each may be less constrained in the changes that they may effect to shared memory, and at that point, an asynchronous implementation will no longer be valid. At that point, an explicitly asynchronous parallel_for_each_async would be added to the specification.

Even though an implementation could be coarse in the way it implements coherence, it still must provide true aliasing for array_views that refer to the same home location. For example, assume that a1 and a2 are both array_views that are constructed on top of a 100-wide one-dimensional array, with a1 referring to elements [0...10] of the array and a2 referring to elements [10...20] of the same array. If both a1 and a2 are accessible on a parallel_for_each call, then accessing a1 at position 10 is identical to accessing the view a2 at position 0, because they both refer to the same location of the array that they are providing a view over, namely, position 10 in the original array. This rule holds whenever and wherever a1 and a2 are accessible simultaneously; that is, on the host and in parallel_for_each calls.

Therefore, for example, an implementation could clone an array_view that is passed into a parallel_for_each in read-only mode, and pass the cloned data to the device. It can create the clone by using any order of reads from the original. The implementation may read the original multiple times, perhaps to implement load-balancing or reliability features.

Similarly, an implementation could copy back results from an internally cloned array, array_view, or texture, onto the original data. It may overwrite any data in the original container, and it can do so multiple times in the realization of a single parallel_for_each call.

When two or more overlapping array views are passed to a parallel_for_each, an implementation could create a temporary array that corresponds to a section of the original container that contains at a minimum the union of the views that are necessary for the call. This temporary array will hold the clones of the overlapping array_views while it maintains their aliasing requirements.

The guarantee for the aliasing of array_views is provided for views that share the same home location. The home location of an array_view is defined as:

1. In the case of an array_view that is ultimately derived from an array, the home location is the array.
2. In the case of an array_view that is ultimately derived from a host pointer, the home location is the original array view that was created by using the pointer.
This means that two different array_views that have both been created, independently, on top of the same memory region are not guaranteed to appear coherent. In fact, creating and using top-level array_views on the same host storage is not supported. For such array_views to appear coherent, they must have a common top-level array_view ancestor that they both ultimately were derived from, and that top-level array_view must be the only one that is constructed on top of the memory region.
// OK // OK
    parallel_for_each(extent<1>(1), [=] (index<1>) restrict(amp) { av3[2] = 15; });
    parallel_for_each(extent<1>(1), [=] (index<1>) restrict(amp) { av2[7] = 16; });
    parallel_for_each(extent<1>(1), [=] (index<1>) restrict(amp) { av1[7] = 17; });
    assert(av1[7] == av2[7]); // OK, never fails, both equal 17
    assert(av1[7] == av3[2]); // OK, never fails, both equal 17
}
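The aliasing guarantee for views that share a home location (the a1 and a2 example earlier in this section) can be modelled on the host without the C++ AMP runtime. The following is a minimal sketch in standard C++; View is a hypothetical stand-in and is not the concurrency::array_view class, and it models only the index arithmetic, not copying or synchronization.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Toy stand-in for a one-dimensional view; NOT concurrency::array_view.
struct View {
    std::vector<int>* home;   // home location shared by all derived views
    std::size_t offset;       // first element of the home location covered
    std::size_t length;       // number of elements covered

    // Element i of the view is element (offset + i) of the home location.
    int& operator[](std::size_t i) const { return (*home)[offset + i]; }

    // Derive a sub-view; it keeps the same home location.
    View section(std::size_t first, std::size_t count) const {
        return View{home, offset + first, count};
    }
};
```

If a1 covers elements 0 through 10 and a2 covers elements 10 through 20 of the same 100-element home location, then a1[10] and a2[0] name the same element, so a write through one is observed through the other. Two View objects built independently over separate copies of the data would not have this property, which mirrors the restriction on independent top-level array_views.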
8.3 Effects of copy and copy_async Operations
Copy operations are offered on array, array_view, and texture. Copy operations copy a source host buffer, array, array_view, or a texture to a destination object that can also be one of these four varieties (except host buffer to host buffer, which is handled by std::copy). A copy operation reads all elements of its source. It may read each element multiple times and it may read elements in any order. It may employ memory load instructions that are either coarser or more granular than the width of the primitive data types in the container, but it is guaranteed to never read a memory location that is strictly outside of the source container. Similarly, copy overwrites every element in its output range. It may do so multiple times and in any order, and may coarsen or break apart individual store operations, but it is guaranteed to never write a memory location that is strictly outside of the target container.
A synchronous copy operation extends from the time when the function is called until it has returned. During this time, any source location may be read and any destination location may be written. An asynchronous copy extends from the time when copy_async is called until the time when the returned std::future is signaled. As always, it is the programmer's responsibility not to call functions that could result in a race. For example, this program is racy because the two copy operations are concurrent and b is written to by the first parallel activity while it is being read by the second parallel activity.
    array<int> a(100), b(100), c(100);
    parallel_invoke(
        [&] { copy(a,b); },
        [&] { copy(b,c); });
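One way to remove the race is to sequence the two copies so that the write to b completes before b is read. The sketch below is a host-side analogue in standard C++ that uses std::vector and std::copy in place of concurrency::array and concurrency::copy; the function name chained_copy is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Copies a into b, then b into c, strictly one after the other.
// std::vector stands in for concurrency::array in this illustration.
inline std::vector<int> chained_copy(const std::vector<int>& a) {
    std::vector<int> b(a.size()), c(a.size());
    std::copy(a.begin(), a.end(), b.begin());  // the first copy finishes ...
    std::copy(b.begin(), b.end(), c.begin());  // ... before b is read here
    return c;
}
```

Because the second copy starts only after the first has returned, the two operations on b are ordered rather than concurrent.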
8.4 Effects of array_view::synchronize, synchronize_async and refresh Functions
An array_view may be constructed to wrap over a host-side pointer. For such array_views, it is not supported to access the underlying array_view storage directly, as long as the array_view exists. Access to the storage area is generally accomplished indirectly through the array_view. However, array_view offers mechanisms to synchronize and refresh its contents, and they do enable access to the underlying memory directly. These mechanisms are described below. Reading of the underlying storage is possible under the condition that the view has first been synchronized back to its home storage. This is performed by using the synchronize or synchronize_async member functions of array_view. When a top-level view is initially created on top of a raw buffer, it is synchronized with it. After it has been constructed, a top-level view, and also derived views, may lose coherence with the underlying host-side raw memory buffer if the array_view is passed to parallel_for_each as a mutable view, or if the view is a target of a copy operation, or if the view is written into directly on the host, by using the subscript operator. To restore coherence with host-side underlying memory, synchronize or synchronize_async must be called. Synchronization is restored when synchronize returns, or when the future that is returned by synchronize_async is ready. For the sake of composition with parallel_for_each, copy, and all other host-side operations that involve a view, synchronize should be considered a read of the entire data section that is referred to by the view, as if it was the source of a copy operation, and therefore must not be executed concurrently with any other operation that involves writing the view. Even though synchronize does potentially modify the underlying host memory, it is logically a no-op because it does not affect the logical contents of the array. As such, it is allowed to execute concurrently with other operations that read the array view. 
As with copy, synchronize works at the granularity of the view that it is applied to. For example, synchronizing a view that represents a sub-section of a parent view does not necessarily synchronize the entire parent view. It is just guaranteed to synchronize the overlapping portions of such related views.

array_views are also required to synchronize their home storage:

1. Before they are destructed.
2. When they are accessed by using the subscript operator (on that home location).
As a result of (1), any errors in synchronization that may be encountered during destruction of array views are not propagated through the destructor. Therefore, we encourage you to ensure that array_views that may contain unsynchronized data are explicitly synchronized before they are destructed.

As a result of (2), the implementation of the subscript operator may have to contain a coherence-enforcing check, especially on platforms where the accelerator hardware and host memory are not shared, and therefore, coherence is managed explicitly by the C++ AMP runtime. Such a check may be detrimental for code that is written to achieve high performance through vectorization of the array view accesses. We recommend that such performance-sensitive code be written to obtain a pointer to the beginning of a run and perform the low-level accesses that are required, based on the raw pointer into the array_view. array_views are guaranteed to be contiguous in the unit-stride dimension, which enables this style of coding. Furthermore, the code may explicitly synchronize the array_view and at that point read the home storage directly, without the mediation of the view.

Sometimes it is desirable to also allow the refreshing of a view directly from its underlying memory. The refresh member function is provided for this task. This function revokes any caches that are associated with the view and resynchronizes the view's contents with the underlying memory. As such, it may not be invoked concurrently with any other operation that accesses the view's data. However, it is safe to assume that refresh does not modify the view's underlying data and therefore, that concurrent read access to the underlying data is allowed during refresh's operation and after refresh has returned, until the point when coherence may have been lost again, as was described in the earlier discussion about the synchronize member function.
9 Math Functions
C++ AMP contains a rich library of floating-point math functions that can be used in an accelerated computation. The C++ AMP library comes in two flavors, each of which is contained in a separate namespace. The functions in the concurrency::fast_math namespace support only single-precision (float) operands and are optimized for performance at the expense of accuracy. The functions in the concurrency::precise_math namespace support both single- and double-precision (double) operands and are optimized for accuracy at the expense of performance. The two namespaces cannot be used together without introducing ambiguities. The accuracy of the functions in the concurrency::precise_math namespace must be at least as high as those in the concurrency::fast_math namespace. All functions are available in the <amp_math.h> header file, and all are decorated restrict(amp).
9.1 fast_math
Functions in the fast_math namespace are designed for computations where accuracy is not a prime requirement, and therefore, the minimum precision is implementation-defined. Not all functions that are available in precise_math are available in fast_math.
| C++ API function | Description |
|------------------|-------------|
| float acosf(float x); float acos(float x) | Returns the arc cosine in radians and the value is mathematically defined to be between 0 and PI (inclusive). |
| float asinf(float x); float asin(float x) | Returns the arc sine in radians and the value is mathematically defined to be between -PI/2 and PI/2 (inclusive). |
| float atanf(float x); float atan(float x) | Returns the arc tangent in radians and the value is mathematically defined to be between -PI/2 and PI/2 (inclusive). |
| float atan2f(float y, float x); float atan2(float y, float x) | Calculates the arc tangent of the two variables x and y. It is similar to calculating the arc tangent of y / x, except that the signs of both arguments are used to determine the quadrant of the result. Returns the result in radians, which is between -PI and PI (inclusive). |
| float ceilf(float x); float ceil(float x) | Rounds x up to the nearest integer. |
| float cosf(float x); float cos(float x) | Returns the cosine of x. |
| float coshf(float x); float cosh(float x) | Returns the hyperbolic cosine of x. |
| float expf(float x); float exp(float x) | Returns the value of e (the base of natural logarithms) raised to the power of x. |
| float exp2f(float x); float exp2(float x) | Returns the value of 2 raised to the power of x. |
| float fabsf(float x); float fabs(float x) | Returns the absolute value of floating-point number x. |
| float floorf(float x); float floor(float x) | Rounds x down to the nearest integer. |
| float fmaxf(float x, float y); float fmax(float x, float y) | Selects the greater of x and y. |
| float fminf(float x, float y); float fmin(float x, float y) | Selects the lesser of x and y. |
| float fmodf(float x, float y); float fmod(float x, float y) | Computes the remainder of dividing x by y. The return value is x - n * y, where n is the quotient of x / y, rounded towards zero to an integer. |
| float frexpf(float x, int * exp); float frexp(float x, int * exp) | Splits the number x into a normalized fraction and an exponent which is stored in exp. |
| int isfinite(float x) | Determines if x is finite. |
| int isinf(float x) | Determines if x is infinite. |
| int isnan(float x) | Determines if x is NaN. |
| float ldexpf(float x, float exp); float ldexp(float x, float exp) | Returns the result of multiplying the floating-point number x by 2 raised to the power exp. |
| float logf(float x); float log(float x) | Returns the natural logarithm of x. |
| float log10f(float x); float log10(float x) | Returns the base 10 logarithm of x. |
| float log2f(float x); float log2(float x) | Returns the base 2 logarithm of x. |
| float modff(float x, float * iptr); float modf(float x, float * iptr) | Breaks the argument x into an integral part and a fractional part, each of which has the same sign as x. The integral part is stored in iptr. |
| float powf(float x, float y); float pow(float x, float y) | Returns the value of x raised to the power of y. |
| float roundf(float x); float round(float x) | Rounds x to the nearest integer. |
| float rsqrtf(float x); float rsqrt(float x) | Returns the reciprocal of the square root of x. |
| int signbit(float x); int signbit(double x) | Returns a non-zero value if the value of x has its sign bit set. |
| float sinf(float x); float sin(float x) | Returns the sine of x. |
| void sincosf(float x, float* s, float* c); void sincos(float x, float* s, float* c) | Returns the sine and cosine of x. |
| float sinhf(float x); float sinh(float x) | Returns the hyperbolic sine of x. |
| float sqrtf(float x); float sqrt(float x) | Returns the non-negative square root of x. |
| float tanf(float x); float tan(float x) | Returns the tangent of x. |
| float tanhf(float x); float tanh(float x) | Returns the hyperbolic tangent of x. |
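The description of fmod above (x - n * y, with n truncated toward zero) can be checked on the host against the standard library. The following is a minimal sketch in standard C++ using <cmath>, not the concurrency::fast_math functions; fmod_by_definition is a hypothetical name, and the naive formula is for illustration only because it loses accuracy when x / y is large.

```cpp
#include <cassert>
#include <cmath>

// fmod written out from its description: x - n * y, where n is the
// quotient x / y rounded toward zero to an integer.
inline float fmod_by_definition(float x, float y) {
    const float n = std::trunc(x / y);
    return x - n * y;
}
```

For example, with x = 7.5 and y = 2 the quotient 3.75 truncates to n = 3 and the result is 1.5; with x = -7.5 the quotient truncates to n = -3 and the result is -1.5, so the result takes the sign of x.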
9.2 precise_math
Functions in the precise_math namespace are designed for computations where accuracy is required. In the table below, the precision of each function is stated as a maximum error in ulps (units in the last place). Functions in the precise_math namespace also support both single and double precision, and are therefore dependent on double-precision support in the underlying hardware, even for single-precision variants.
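To make the ulp unit concrete, the sketch below counts how many representable float values separate two finite floats. It is a host-side illustration in standard C++, not part of C++ AMP; it assumes a 32-bit IEEE 754 float, does not handle NaN, and the names ordered_bits and ulp_distance are hypothetical. A precision figure in the table bounds the error against the exact mathematical result, which this helper only approximates when the reference value is itself a rounded float.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Map a float onto an integer line on which adjacent representable
// floats differ by exactly 1 (assumes 32-bit IEEE 754 float, no NaN).
inline std::int64_t ordered_bits(float f) {
    static_assert(sizeof(float) == sizeof(std::int32_t), "assumes 32-bit float");
    std::int32_t bits;
    std::memcpy(&bits, &f, sizeof bits);
    const std::int64_t min32 = std::numeric_limits<std::int32_t>::min();
    // Negative floats sort in reverse bit order; mirror them below zero.
    return bits < 0 ? min32 - bits : static_cast<std::int64_t>(bits);
}

// Number of representable floats between a and b (0 if they are equal).
inline std::int64_t ulp_distance(float a, float b) {
    const std::int64_t d = ordered_bits(a) - ordered_bits(b);
    return d < 0 ? -d : d;
}
```

For instance, 1.0f and the next float above it are one ulp apart, and +0.0f and -0.0f are zero ulps apart in this model.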
C++ API function float acosf(float x) float acos(float x) double acos(double x) float acoshf(float x) float acosh(float x) double acosh(float x) float asinf(float x) float asin(float x) double asin(double x) float asinhf(float x) float asinh(float x) double asinh(float x) float atanf(float x) float atan(float x) double atan(double x) float atanhf(float x) float atanh(float x) double atanh(float x) float atan2f(float y, float x) float atan2(float y, float x) double atan2(double y, double x) float cbrtf(float x) float cbrt(float x) double cbrt(double x) float ceilf(float x) float ceil(float x) double ceil(double x) float copysignf(float x, float y) float copysign(float x, float y) double copysign(double x, double y) float cosf(float x) float cos(float x) Return a value whose absolute value matches that of x, but whose sign matches that of y. If x is a NaN, then a NaN with the sign of y is returned. Returns the cosine of x. N/A N/A Rounds x up to the nearest integer. 0 0 Calculates the arc tangent of the two variables x and y. It is similar to calculating the arc tangent of y / x, except that the signs of both arguments are used to determine the quadrant of the result.). Returns the result in radians, which is between -PI and PI (inclusive). Returns the (real) cube root of x. 3 2 Returns the hyperbolic arctangent. 3 2 Returns the arc tangent in radians and the value is mathematically defined to be between -PI/2 and PI/2 (inclusive). 2 2 Returns the hyperbolic arcsine. 3 2 Returns the arc sine in radians and the value is mathematically defined to be between -PI/2 and PI/2 (inclusive). 4 2 Returns the hyperbolic arccosine. 4 2 Description Returns the arc cosine in radians and the value is mathematically defined to be between 0 and PI (inclusive). Precision (float) 3 Precision (double) 2
double cos(double x) float coshf(float x) float cosh(float x) double cosh(double x) float cospif(float x) float cospi(float x) double cospi(double x) float erff(float x) float erf(float x) double erf(double x) float erfcf(float x) float erfc(float x) double erfc(double x) float erfinvf(float x) float erfinv(float x) double erfinv(double x) float erfcinvf(float x) float erfcinv(float x) double erfcinv(double x) float expf(float x) float exp(float x) double exp(double x) float exp2f(float x) float exp2(float x) double exp2(double x) float exp10f(float x) float exp10(float x) double exp10(double x) float expm1f(float x) float expm1(float x) double expm1(double x) float fabsf(float x) float fabs(float x) double fabs(double x) float fdimf(float x, float y) float fdim(float x, float y) double fdim(double x, double y) float floorf(float x) float floor(float x) double floor(double x) float fmaf(float x, float y, float z) Computes (x * y) + z, rounded as one ternary operation: they compute the value (as if) to infinite precision and round once to 0 05 Rounds x down to the nearest integer. 0 0 These functions return max(x-y,0). If x or y or both are NaN, Nan is returned. 0 0 Returns the absolute value of floating-point number N/A N/A Returns a value equivalent to 'exp (x) - 1' 1 1 Returns the value of 10 raised to the power of x. 2 1 Returns the value of 2 raised to the power of x. 2 1 Returns the value of e (the base of natural logarithms) raised to the power of x. 2 1 Returns the inverse of the complementary error function. 7 8 Returns the inverse error function. 3 8 Returns the complementary error function of x that is 1.0 - erf (x). 6 5 Returns the error function of x; defined as erf(x) = 2/sqrt(pi)* integral from 0 to x of exp(-t*t) dt 3 2 Returns the cosine of pi * x. 2 2 Returns the hyperbolic cosine of x. 2 2
float fma(float x, float y, float z) double fma(double x, double y, double z) float fmaxf(float x, float y) float fmax(float x, float y) double fmax(double x, double y) float fminf(float x, float y) float fmin(float x, float y) double fmin(double x, double y) float fmodf(float x, float y) float fmod(float x, float y) double fmod(double x, double y) int fpclassify(float x); int fpclassify(double x);
the result format, according to the current rounding mode. A range error may occur. Selects the greater of x and y. N/A N/A
N/A
N/A
Computes the remainder of dividing x by y. The return value is x - n * y, where n is the quotient of x / y, rounded towards zero to an integer. Floating point numbers can have special values, such as infinite or NaN. With the macro fpclassify(x) you can find out what type x is. The function takes any floating-point expression as argument. The result is one of the following values: FP_NAN: x is "Not a Number". FP_INFINITE: x is either plus or minus infinity. FP_ZERO: x is zero. FP_SUBNORMAL: x is too small to be represented in normalized format. FP_NORMAL: if none of the above applies, x is a normal floating-point number.
N/A
N/A
float frexpf(float x, int * exp) float frexp(float x, int * exp) double frexp(double x, int * exp) float hypotf(float x, float y) float hypot(float x, float y) double hypot(double x, double y) int ilogbf (float x) int ilogb(float x) int ilogb(double x) int isfinite(float x) int isfinite(double x) int isinf(float x) int isinf(double x) int isnan(float x) int isnan(double x) int isnormal(float x) int isnormal(double x) float ldexpf(float x, float exp) float ldexp(float x, float exp) double ldexpf(double x, double exp)
Splits the number x into a normalized fraction and an exponent which is stored in exp.
Returns sqrt(x*x+y*y). This is the length of the hypotenuse of a right-angle triangle with sides of length x and y, or the distance of the point (x,y) from the origin. Return the exponent part of their argument as a signed integer. When no error occurs, these functions are equivalent to the corresponding logb() functions, cast to (int). An error will occur for zero and infinity and NaN, and possibly for overflow. Determines if x is finite.
N/A
N/A
Determines if x is infinite.
N/A
N/A
Determines if x is NAN.
N/A
N/A
Determines if x is normal.
N/A
N/A
Returns the result of multiplying the floating-point number x by 2 raised to the power exp
float lgammaf(float x) float lgamma(float x) double lgamma(double x) float logf(float x) float log(float x) double log(double x) float log10f(float x) float log10(float x) double log10(double x) float log2f(float x) float log2(float x) double log2(double x) float log1pf (float x) float log1p(float x) double log1p(double x) float logbf(float x) float logb(float x) double logb(double x)
Computes the natural logarithm of the absolute value of gamma of x. A range error occurs if x is too large. A range error may occur if x is a negative integer or zero. Returns the natural logarithm of x.
66
47
Returns a value equivalent to 'log (1 + x)'. It is computed in a way that is accurate even if the value of x is near zero.
These functions extract the exponent of x and return it as a floating-point value. If FLT_RADIX is two, logb(x) is equal to floor(log2(x)), except it's probably faster. If x is de-normalized, logb() returns the exponent x would have if it were normalized.
float modff(float x, float * iptr) float modf(float x, float * iptr) double modf(double x, double * iptr) float nanf(int tagp) float nan(int tagp) double nan(int tagp) float nearbyintf(float x) float nearbyint(float x) double nearbyint(double x) float nextafterf(float x, float y) float nextafter(float x, float y) double nextafter(double x, double y)
Breaks the argument x into an integral part and a fractional part, each of which has the same sign as x. The integral part is stored in iptr. Returns a representation (determined by tagp) of a quiet NaN. If the implementation does not support quiet NaNs, these functions return zero. Rounds the argument to an integer value in floating point format, using the current rounding direction.
N/A
N/A
Returns the next representable neighbor of x in the direction towards y. The size of the step between x and the result depends on the type of the result. If x = y the function simply returns y. If either value is NaN, then NaN is returned. Otherwise a value corresponding to the value of the least significant bit in the mantissa is added or subtracted, depending on the direction. Returns the value of x raised to the power of y.
N/A
N/A
float powf(float x, float y) float pow(float x, float y) double pow(double x, double y) float rcbrtf(float x) float rcbrt(float x) double rcbrt(double x) float remainderf(float x, float y)
6 7
Outside interval -10.001 ... -2.264; larger inside. Outside interval -10.001 ... -2.264; larger inside.
float remainder(float x, float y) double remainder(double x, double y) float remquof(float x, float y, int * quo) float remquo(float x, float y, int * quo) double remquo(double x, double y, int * quo) float roundf(float x) float round(float x) double round(double x) float rsqrtf(float x) float rsqrt(float x) double rsqrt(double x) float sinpif(float x) float sinpi(float x) double sinpi(double x) float scalbf(float x, float exp) float scalb(float x, float exp) double scalb(double x, double exp) float scalbnf(float x, int exp) float scalbn(float x, int exp) double scalbn(double x, int exp) int signbit(float x) int signbit(double x) float sinf(float x) float sin(float x) double sin(double x) void sincosf(float x, float * s, float * c) void sincos(float x, float * s, float * c) void sincos(double x, double * s, double * c) float sinhf(float x) float sinh(float x) double sinh(double x) float sqrtf(float x) float sqrt(float x) double sqrt(double x) float tgammaf(float x) float tgamma(float x) double tgamma(double x) float tanf(float x) float tan(float x) double tan(double x) float tanhf(float x)
8
n * y, where n is the value x / y, rounded to the nearest integer. If this quotient is 1/2 (mod 1), it is rounded to the nearest even number (independent of the current rounding mode). If the return value is 0, it has the sign of x. Computes the remainder and part of the quotient upon division of x by y. A few bits of the quotient are stored via the quo pointer. The remainder is returned. Rounds x to the nearest integer. 0 0
Multiplies their first argument x by FLT_RADIX (probably 2) to the power exp. If FLT_RADIX equals 2, then scalbn() is equivalent to ldexp(). The value of FLT_RADIX is found in <float.h>. Returns a non-zero value if the value of X has its sign bit set. Returns the sine of x.
N/A 2
N/A 2
08
This function returns the value of the Gamma function for the argument x.
11
float tanh(float x) double tanh(double x) float tanpif(float x) float tanpi(float x) double tanpi(double x) float truncf(float x) float trunc(float x) double trunc(double x) Rounds x to the nearest integer not larger in absolute value. 0 0 Returns the tangent of pi * x. 2 2
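The table above describes two remainder rules: fmod rounds the quotient towards zero, while remainder rounds it to the nearest integer (ties to even). The sketch below spells out both rules on the host with the standard <cmath> library. It assumes the precise_math functions follow the semantics described in the table; the helper names are hypothetical, and the naive x - n * y formula is for illustration only, since it loses accuracy for large quotients.

```cpp
#include <cmath>

// x - n * y with n = x / y truncated towards zero (the fmod rule).
inline double remainder_truncated(double x, double y) {
    return x - std::trunc(x / y) * y;
}

// x - n * y with n = x / y rounded to nearest, ties to even (the remainder rule).
// std::nearbyint rounds ties to even under the default rounding mode.
inline double remainder_nearest(double x, double y) {
    return x - std::nearbyint(x / y) * y;
}
```

For x = 5.5 and y = 2.0 the quotient is 2.75, so the truncating rule uses n = 2 and the nearest rule uses n = 3, which is why the two results differ in both magnitude and sign.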
10 Graphics (Optional)
Programming model elements that are defined in <amp_graphics.h> and <amp_short_vectors.h> are designed for graphics programming in conjunction with accelerated compute on an accelerator device, and are therefore appropriate only for GPU accelerators. Accelerator devices that do not support native graphics functionality need not implement these features. All types in this section are defined in the concurrency::graphics namespace.
10.1 texture<T,N>
The texture class provides the means to create textures from raw memory or from file. Textures are similar to arrays in that they are containers of data, and they behave like STL containers with respect to assignment and copy construction. Textures are templated on T, the element type, and on N, the rank of the texture. N can be 1, 2, or 3. The element type of the texture, also referred to as the texture's logical element type, is one of a closed set of short vector types that are defined in the concurrency::graphics namespace and covered elsewhere in this specification. The next table briefly enumerates the supported element types.
Rank of element type (also referred to as number of scalar elements):

Element kind                                    1        2         3         4
Signed integer                                  int      int2      int3      int4
Unsigned integer                                uint     uint2     uint3     uint4
Single precision floating point number          float    float2    float3    float4
Single precision signed normalized number       norm     norm2     norm3     norm4
Single precision unsigned normalized number     unorm    unorm2    unorm3    unorm4
Double precision floating point number          double   double2   double3   double4
Remarks:
1. norm and unorm vector types are vectors of floats and are normalized to the ranges [-1, 1] and [0, 1], respectively.
2. Grayed-out cells represent vector types that are defined by C++ AMP but are not necessarily supported as texture value types. Implementations can optionally support the types in the grayed-out cells in the above table.
Microsoft-specific: Grayed-out cells in the above table are not supported.
10.1.1 Synopsis
template <typename T, int N>
class texture
{
public:
    static const int rank = _Rank;
    typedef typename T value_type;
    typedef short_vectors_traits<T>::scalar_type scalar_type;

    texture(const extent<N>& _Ext);
    texture(int _E0);
    texture(int _E0, int _E1);
    texture(int _E0, int _E1, int _E2);

    texture(const extent<N>& _Ext, const accelerator_view& _Acc_view);
    texture(int _E0, const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, int _E2, const accelerator_view& _Acc_view);

    texture(const extent<N>& _Ext, unsigned int _Bits_per_scalar_element);
    texture(int _E0, unsigned int _Bits_per_scalar_element);
    texture(int _E0, int _E1, unsigned int _Bits_per_scalar_element);
    texture(int _E0, int _E1, int _E2, unsigned int _Bits_per_scalar_element);

    texture(const extent<N>& _Ext, unsigned int _Bits_per_scalar_element,
            const accelerator_view& _Acc_view);
    texture(int _E0, unsigned int _Bits_per_scalar_element,
            const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, unsigned int _Bits_per_scalar_element,
            const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, int _E2, unsigned int _Bits_per_scalar_element,
            const accelerator_view& _Acc_view);

    template <typename TInputIterator>
    texture(const extent<N>&, TInputIterator _Src_first, TInputIterator _Src_last);
    template <typename TInputIterator>
    texture(int _E0, TInputIterator _Src_first, TInputIterator _Src_last);
    template <typename TInputIterator>
    texture(int _E0, int _E1, TInputIterator _Src_first, TInputIterator _Src_last);
    template <typename TInputIterator>
    texture(int _E0, int _E1, int _E2, TInputIterator _Src_first, TInputIterator _Src_last);

    template <typename TInputIterator>
    texture(const extent<N>&, TInputIterator _Src_first, TInputIterator _Src_last,
            const accelerator_view& _Acc_view);
    template <typename TInputIterator>
    texture(int _E0, TInputIterator _Src_first, TInputIterator _Src_last,
            const accelerator_view& _Acc_view);
    template <typename TInputIterator>
    texture(int _E0, int _E1, TInputIterator _Src_first, TInputIterator _Src_last,
            const accelerator_view& _Acc_view);
    template <typename TInputIterator>
    texture(int _E0, int _E1, int _E2, TInputIterator _Src_first, TInputIterator _Src_last,
            const accelerator_view& _Acc_view);

    texture(const extent<N>&, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element);
    texture(int _E0, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element);
    texture(int _E0, int _E1, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element);
    texture(int _E0, int _E1, int _E2, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element);

    texture(const extent<N>&, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
    texture(int _E0, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
    texture(int _E0, int _E1, int _E2, const void * _Source, unsigned int _Src_byte_size,
            unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);

    texture(const texture& _Src);
    texture(const texture& _Src, const accelerator_view& _Acc_view);
    texture& operator=(const texture& _Src);
    texture(texture&& _Other);
    texture& operator=(const texture&& _Other);

    void copy_to(texture& _Dest) const;
    void copy_to(writeonly_texture_view<T,N>& _Dest) const;

    unsigned int get_Bits_per_scalar_element() const;
    __declspec(property(get=get_Bits_per_scalar_element)) unsigned int bits_per_scalar_element;

    unsigned int get_data_length() const;
    __declspec(property(get=get_data_length)) unsigned int data_length;

    extent<_Rank> get_extent() const restrict(cpu,direct3d);
    __declspec(property(get=get_extent)) extent<_Rank> extent;

    accelerator_view get_accelerator_view() const;
    __declspec(property(get=get_accelerator_view)) accelerator_view accelerator_view;

    value_type operator[] (const index<_Rank>& _Index) const restrict(amp);
    value_type operator[] (int _I0) const restrict(amp);
    value_type operator() (const index<_Rank>& _Index) const restrict(amp);
    value_type operator() (int _I0) const restrict(amp);
    value_type operator() (int _I0, int _I1) const restrict(amp);
    value_type operator() (int _I0, int _I1, int _I2) const restrict(amp);
    value_type get(const index<_Rank>& _Index) const restrict(amp);
};
typedef ... scalar_type;
The scalar type that serves as the component of the texture's value type. For example, for texture<int2, 3>, the scalar type would be int.
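The relationship between a value type and its scalar type can be sketched as a traits template. The types and the element_traits template below are hypothetical stand-ins written in standard C++; they are not the short vector types or the short_vectors_traits template defined by C++ AMP, and only illustrate the mapping described above.

```cpp
#include <type_traits>

namespace sketch {

// Hypothetical stand-ins for short vector types; not the C++ AMP definitions.
struct int2   { int x, y; };
struct float4 { float x, y, z, w; };

// Hypothetical traits template: maps an element type to its scalar component
// type and to the number of scalar components it holds.
template <typename T> struct element_traits;  // left undefined for unsupported types

template <> struct element_traits<int>    { typedef int   scalar_type; static const int size = 1; };
template <> struct element_traits<int2>   { typedef int   scalar_type; static const int size = 2; };
template <> struct element_traits<float4> { typedef float scalar_type; static const int size = 4; };

// Mirrors the example in the text: the scalar type of int2 is int.
static_assert(std::is_same<element_traits<int2>::scalar_type, int>::value,
              "the scalar type of int2 is int");

} // namespace sketch
```

Leaving the primary template undefined means that asking for the traits of an unsupported element type fails at compile time, which matches the idea of a closed set of element types.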
texture(const extent<N>& _Ext);
texture(int _E0);
texture(int _E0, int _E1);
texture(int _E0, int _E1, int _E2);
texture(const extent<N>& _Ext, const accelerator_view& _Acc_view);
texture(int _E0, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, int _E2, const accelerator_view& _Acc_view);
texture(const extent<N>& _Ext, unsigned int _Bits_per_scalar_element);
texture(int _E0, unsigned int _Bits_per_scalar_element);
texture(int _E0, int _E1, unsigned int _Bits_per_scalar_element);
texture(int _E0, int _E1, int _E2, unsigned int _Bits_per_scalar_element);
texture(const extent<N>& _Ext, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, int _E2, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
Creates an uninitialized texture that has the specified shape and number of bits per scalar element, on the specified accelerator view.
Parameters:
    _Ext                        Extents of the texture to create
    _E0                         Extent of dimension 0
    _E1                         Extent of dimension 1
    _E2                         Extent of dimension 2
    _Bits_per_scalar_element    Number of bits per each scalar element in the underlying scalar type of the texture. If 0 is specified, the number of bits is defaulted to the value that is specified in the table later in this document.
    _Acc_view                   Accelerator view in which to create the texture
Error condition                                                  Exception thrown
    Out of memory                                                concurrency::runtime_exception
    Invalid number of bits per scalar element specified          concurrency::runtime_exception
    Invalid combination of value_type and bits per scalar element    concurrency::unsupported_feature
    accelerator_view does not support textures                   concurrency::unsupported_feature
The next table summarizes the valid combinations of underlying scalar types (columns), ranks (rows), supported values for bits-per-scalar-element (inside the table cells), and the default value of bits-per-scalar-element for each given combination (highlighted in green). Implementations can optionally support textures of double4 by using implementation-specific values of bits-per-scalar-element. Microsoft-specific: the current implementation does not support textures of double4.
Rank 1 2
norm 8, 16 8, 16
unorm 8, 16 8, 16
double 64 64
8, 16, 32
8, 16, 32
16, 32
8, 16
8, 16
concurrency::unsupported_feature
10.1.5 Constructing a Texture from a Host-side Data Source
texture(const extent<N>&, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element);
texture(int _E0, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element);
texture(int _E0, int _E1, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element);
texture(int _E0, int _E1, int _E2, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element);
texture(const extent<N>&, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
texture(int _E0, int _E1, int _E2, const void * _Source, unsigned int _Src_byte_size, unsigned int _Bits_per_scalar_element, const accelerator_view& _Acc_view);
Creates a texture from a host-side provided buffer. The format of the data source must be compatible with the texture's vector type, and the amount of data in the data source must be exactly the amount that is required to initialize a texture in the specified format, with the given number of bits per scalar element. For example, a 2D texture of uint2 that is initialized by using the extent of 100x200 and has _Bits_per_scalar_element equal to 8 requires a total of 100 * 200 * 2 * 8 = 320,000 bits available to copy from _Source, which is equal to 40,000 bytes. (In other words, one byte per scalar element, for each scalar element of each pixel in the texture.)
Parameters:
    _Ext                        Extents of the texture to create
    _E0                         Extent of dimension 0
    _E1                         Extent of dimension 1
    _E2                         Extent of dimension 2
    _Source                     Pointer to a host buffer
    _Src_byte_size              Number of bytes of the host source buffer
    _Bits_per_scalar_element    Number of bits per each scalar element in the underlying scalar type of the texture. If 0 is specified, the number of bits is defaulted to the value that is specified in the table in the previous section.
    _Acc_view                   Accelerator view in which to create the texture
Error condition                                                  Exception thrown
    Out of memory                                                concurrency::runtime_exception
    Inadequate amount of data supplied through the host buffer (_Src_byte_size < texture.data_length)    concurrency::runtime_exception
    Invalid number of bits per scalar element specified          concurrency::runtime_exception
    Invalid combination of value_type and bits per scalar element    concurrency::unsupported_feature
    Accelerator_view does not support textures                   concurrency::unsupported_feature
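The size arithmetic in the uint2 example above can be captured in a small host-side helper. This is a minimal sketch in standard C++; the function name required_source_bytes is hypothetical and not part of C++ AMP, and it assumes that the bits-per-scalar-element value is a multiple of 8, as all the values listed in this section are.

```cpp
#include <cstddef>

// Bytes needed in the host buffer: one value per texel, each value made of
// `scalars_per_element` scalars that occupy `bits_per_scalar` bits each.
inline std::size_t required_source_bytes(std::size_t texel_count,
                                         unsigned scalars_per_element,
                                         unsigned bits_per_scalar) {
    const std::size_t total_bits =
        texel_count * scalars_per_element * bits_per_scalar;
    return total_bits / 8;  // assumes bits_per_scalar is a multiple of 8
}
```

For the 100x200 texture of uint2 with 8 bits per scalar element, required_source_bytes(100 * 200, 2, 8) yields 40,000, which is the value that _Src_byte_size has to match.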
texture(const texture& _Src, const accelerator_view& _Acc_view);
Initializes one texture from another.
Parameters:
    _Src         Source texture or texture_view to copy from
    _Acc_view    Accelerator view in which to create the texture
Error condition                                   Exception thrown
    Out of memory                                 concurrency::runtime_exception
    Accelerator_view does not support textures    concurrency::unsupported_feature
texture& operator=(const texture& _Src);
Releases the resource of this texture, allocates a new resource according to the properties of _Src, and then deep-copies the content of _Src to this texture.
Parameters:
    _Src         Source texture or texture_view to copy from
Error condition      Exception thrown
    Out of memory    concurrency::runtime_exception
unsigned int get_Bits_per_scalar_element() const;
__declspec(property(get=get_Bits_per_scalar_element)) unsigned int bits_per_scalar_element;
Gets the bits-per-scalar-element of the texture.
Error conditions: none
unsigned int get_data_length() const;
__declspec(property(get=get_data_length)) unsigned int data_length;
Gets the physical data length (in bytes) that is required to represent the texture on the host side with its native format.
Error conditions: none
10.1.12 Querying the accelerator_view Where the Texture Resides
accelerator_view get_accelerator_view() const;
__declspec(property(get=get_accelerator_view)) accelerator_view accelerator_view;
Retrieves the accelerator_view where the texture resides.
Error conditions: none
10.1.13 Reading and Writing Textures
This is the core function of class texture on the accelerator. Unlike arrays, the entire value type has to be read or written as a whole. Textures do not support returning references to their internal data representation.
Due to platform restrictions, only a limited number of texture types support simultaneous reading and writing. Reading is supported on all texture types, but writing through a texture& is only supported for textures of int, uint, and float, and even in those cases, the number of bits in the physical format must be 32. If a lower number of bits is used (8 or 16) and a kernel is invoked that contains code that could possibly both write into and read from one of these rank-1 texture types, an implementation is permitted to raise a runtime exception.
Microsoft-specific: The Microsoft implementation always raises a runtime exception in such a situation.
Trying to call set on a texture& of a different element type (that is, on other than int, uint, and float) causes a static assert. To write into textures of other value types, you must go through a writeonly_texture_view<T,N>.
value_type operator[] (const index<_Rank>& _Index) const restrict(amp);
value_type operator[] (int _I0) const restrict(amp);
value_type operator() (const index<_Rank>& _Index) const restrict(amp);
value_type operator() (int _I0) const restrict(amp);
value_type operator() (int _I0, int _I1) const restrict(amp);
value_type operator() (int _I0, int _I1, int _I2) const restrict(amp);
value_type get(const index<_Rank>& _Index) const restrict(amp);
void set(const index<_Rank>& _Index, const value_type& _Value) const restrict(amp);
Loads one texel out of the texture. If an integer-tuple overload is used whose arity does not agree with the rank of the texture, a static_assert ensues and the program fails to compile. If the texture is indexed, at runtime, outside of its logical bounds, behavior is undefined.
Parameters:
    _Index           An N-dimension logical integer coordinate to read from
    _I0, _I1, _I2    Index components, equivalent to providing index<1>(_I0), index<2>(_I0,_I1), or index<3>(_I0,_I1,_I2). The arity of the function that is used must agree with the rank of the texture. For example, the overload that takes (_I0,_I1) is only available on textures of rank 2.
    _Value           Value to write into the texture
Error conditions: if set is called on texture types that are not supported, a static_assert ensues.
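The rule that the arity of an integer-tuple overload must agree with the rank can be sketched with a static_assert inside each overload. The class below is a hypothetical host-side toy in standard C++, not the C++ AMP texture class; it assumes a row-major layout and only illustrates how a mismatched overload can be rejected at compile time while values are still returned by copy rather than by reference.

```cpp
#include <array>
#include <cstddef>
#include <vector>

namespace sketch {

// Hypothetical host-side stand-in for a rank-N texture; not the C++ AMP class.
template <typename T, int N>
class toy_texture {
public:
    explicit toy_texture(const std::array<int, N>& ext)
        : ext_(ext), data_(element_count(ext)) {}

    // Rank-1 read. The value is returned by copy, never by reference.
    T operator()(int i0) const {
        static_assert(N == 1, "this overload requires a rank-1 texture");
        return data_[static_cast<std::size_t>(i0)];
    }

    // Rank-2 read, assuming a row-major layout.
    T operator()(int i0, int i1) const {
        static_assert(N == 2, "this overload requires a rank-2 texture");
        return data_[static_cast<std::size_t>(i0) * ext_[1] + i1];
    }

    // Rank-2 write of one whole element.
    void set(int i0, int i1, const T& value) {
        static_assert(N == 2, "this overload requires a rank-2 texture");
        data_[static_cast<std::size_t>(i0) * ext_[1] + i1] = value;
    }

private:
    static std::size_t element_count(const std::array<int, N>& ext) {
        std::size_t n = 1;
        for (int e : ext) n *= static_cast<std::size_t>(e);
        return n;
    }

    std::array<int, N> ext_;   // extent of each dimension
    std::vector<T> data_;      // value-initialized element storage
};

} // namespace sketch
```

Because member function bodies of a class template are only instantiated when they are used, calling the two-argument overload on a toy_texture<int, 1> fails to compile, whereas simply declaring such an object does not.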
(*) Out of memory errors may occur due to the need to allocate temporary buffers in some memory-transfer scenarios.
template <typename T, int N>
void copy(const void * _Src, unsigned int _Src_byte_size, texture<T,N>& _Texture);
Copies raw texture data to a device-side texture. The buffer must be laid out in accordance with the texture format and dimensions.
Parameters:
    _Texture          Destination texture
    _Src              Pointer to the source buffer on the host
    _Src_byte_size    Number of bytes in the source buffer
Error condition       Exception thrown
    Out of memory
    Buffer too small
10.1.14.1 Global async Texture copy Functions
For each copy function that is specified above, a copy_async function is also provided, and returns a shared_future<void>.
10.1.15 Direct3D Interop Functions
The following functions are provided in the direct3d namespace to convert between DX COM interfaces and textures.
template <typename T, int N>
texture<T,N> make_texture(const Concurrency::accelerator_view &_Av, const IUnknown* pTexture);
Creates a texture from the corresponding DX interface.
Parameters:
    _Av             A Direct3D accelerator view on which the texture is to be created.
    pTexture        A pointer to a suitable texture
Return value        The created texture
Error condition     Exception thrown
    Out of memory
    Invalid D3D texture argument
template <typename T, int N>
IUnknown * get_texture(const texture<T, N>& _Texture);
Retrieves a DX interface pointer from a C++ AMP texture object. Class texture allows the retrieval of a texture interface pointer (the exact interface depends on the rank of the class).
Parameters:
    _Texture        Source texture
Return value        Texture interface as IUnknown *
Error conditions: none
10.2 writeonly_texture_view<T,N>
C++ AMP write-only texture views, which are coded as writeonly_texture_view<T, N>, provide write-only access into any texture.
10.2.1 Synopsis
template <typename T, int N>
class writeonly_texture_view
{
public:
    static const int rank = _Rank;
    typedef typename T value_type;
    typedef short_vectors_traits<T>::scalar_type scalar_type;

    writeonly_texture_view(texture<T,N>& _Src) restrict(cpu,direct3d);
    writeonly_texture_view(const writeonly_texture_view&) restrict(cpu,direct3d);
    writeonly_texture_view operator=(const writeonly_texture_view&) restrict(cpu,direct3d);
    ~writeonly_texture_view() restrict(cpu,direct3d);

    unsigned int get_Bits_per_scalar_element() const;
    __declspec(property(get=get_Bits_per_scalar_element)) int bits_per_scalar_element;

    unsigned int get_data_length() const;
    __declspec(property(get=get_data_length)) unsigned int data_length;

    extent<_Rank> get_extent() const restrict(cpu,direct3d);
    __declspec(property(get=get_extent)) extent<_Rank> extent;

    accelerator_view get_accelerator_view() const;
    __declspec(property(get=get_accelerator_view)) accelerator_view accelerator_view;

    void set(const index<_Rank>& _Index, const value_type& _Val) restrict(amp);
};
typedef ... scalar_type;
The scalar type that serves as the component of the texture's value type. For example, for writeonly_texture_view<int2,3>, the scalar type would be int.
10.2.6 Querying the Physical Characteristics of an Underlying Texture
unsigned int get_Bits_per_scalar_element()const; __declspec(property(get=get_Bits_per_scalar_element)) int bits_per_scalar_element; Gets the bits-per-scalar-element of the texture. Error conditions: none
unsigned int get_data_length() const; __declspec(property(get=get_data_length)) unsigned int data_length; Gets the physical data length (in bytes) that is required to represent the texture on the host side with its native format. Error conditions: none
__declspec(property(get=get_extent)) extent<_Rank> extent; These members have the same meaning as the equivalent ones on the array class. Error conditions: none
10.2.6.2 Writing a Write-only Texture View
Writing is the main purpose of this type. All texture types can be written through a write-only view.
void set(const index<_Rank>& _Index, const value_type& _Val) const restrict(amp);
Stores one texel in the texture. If the texture is indexed, at runtime, outside of its logical bounds, behavior is undefined.
Parameters
_Index    An N-dimension logical integer coordinate to write to
_I0, _I1, _I2    Index components
_Val    Value to store into the texture
Error conditions: none
10.2.7.1 Global async writeonly_texture_view copy Functions
For each copy function that is specified above, a copy_async function is also provided, and returns a shared_future<void>.
10.2.8 Direct3D Interop Functions
The following functions are provided in the direct3d namespace to convert between DX COM interfaces and writeonly_texture_views.
template <typename T, int N> IUnknown * get_texture(const writeonly_texture_view<T, N>& _TextureView);
Retrieves a DX interface pointer from a C++ AMP writeonly_texture_view object.
Parameters
_TextureView    Source texture view
Return value    Texture interface as IUnknown *
Error conditions: none
unorm operator+(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator+(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator-(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator-(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator*(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator*(const norm& lhs, const norm& rhs) restrict(cpu, amp);
unorm operator/(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
norm operator/(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator==(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator==(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator!=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator!=(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator>(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator>(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator<(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator<(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator>=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator>=(const norm& lhs, const norm& rhs) restrict(cpu, amp);
bool operator<=(const unorm& lhs, const unorm& rhs) restrict(cpu, amp);
bool operator<=(const norm& lhs, const norm& rhs) restrict(cpu, amp);

#define UNORM_MIN ((unorm)0.0f)
#define UNORM_MAX ((unorm)1.0f)
#define UNORM_ZERO ((unorm)0.0f)
#define NORM_ZERO ((norm)0.0f)
#define NORM_MIN ((norm)-1.0f)
#define NORM_MAX ((norm)1.0f)
10.3.2 Constructors and Assignment
An object of type norm or unorm can be explicitly constructed from one of the following types: float, double, int, unsigned int, norm, unorm.
In all of these constructors, the object is initialized by first converting the argument to the float data type, and then clamping the value into the range that is defined by the type. Assignment from norm to norm is defined, as is assignment from unorm to unorm. Assignment from other types requires an explicit conversion.
10.3.3 Operators
All arithmetic operators that are defined for the float type are also defined for norm and unorm. For each supported operator, the result is computed in single-precision floating-point arithmetic, and, if required, is then clamped back to the appropriate range.
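The convert-then-clamp rule for construction and for arithmetic can be sketched in plain host-side C++. The types norm_sketch and unorm_sketch below are hypothetical stand-ins written for this illustration only; they are not the C++ AMP norm and unorm types, and they omit the restrict specifiers. The ranges [-1, 1] and [0, 1] are taken from the NORM_MIN/NORM_MAX and UNORM_MIN/UNORM_MAX definitions.

```cpp
#include <algorithm>

// Hypothetical stand-in for norm: a float clamped to [-1.0f, 1.0f].
struct norm_sketch {
    float value;
    // Convert the argument to float first, then clamp into the type's range.
    explicit norm_sketch(double v)
        : value(std::min(1.0f, std::max(-1.0f, static_cast<float>(v)))) {}
};

// Hypothetical stand-in for unorm: a float clamped to [0.0f, 1.0f].
struct unorm_sketch {
    float value;
    explicit unorm_sketch(double v)
        : value(std::min(1.0f, std::max(0.0f, static_cast<float>(v)))) {}
};

// Arithmetic is carried out in single precision; constructing the result
// clamps it back into the range of the type.
inline norm_sketch operator+(const norm_sketch& lhs, const norm_sketch& rhs) {
    return norm_sketch(lhs.value + rhs.value);
}

inline unorm_sketch operator-(const unorm_sketch& lhs, const unorm_sketch& rhs) {
    return unorm_sketch(lhs.value - rhs.value);
}
```

Under these assumptions, adding two norm_sketch values of 0.75f saturates at 1.0f rather than producing 1.5f, and subtracting a larger unorm_sketch from a smaller one saturates at 0.0f.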
There is no functional difference between the type scalar_N and scalarN. Unlike index<N> and extent<N>, short vector types have no notion of significance or endian-ness, as they are not assumed to be describing the shape of data or compute (even though a user might choose to use them in this way). Also unlike extents and indices, short vector types cannot be indexed by using the subscript operator. Components of short vector types can be accessed by name. By convention, short vector type components can use either Cartesian coordinate names (x, y, z, and w) or color scalar element names (r, g, b, and a). For length-2 vectors, only the names x, y and r, g are available. For length-3 vectors, only the names x, y, z, and r, g, b are available. For length-4 vectors, the full set of names x, y, z, w, and r, g, b, a are available.
The names that are derived from the color channel space (rgba) are available only as properties, not as getter and setter functions.
10.4.1 Synopsis
Because the full synopsis of all the short vector types is quite large, this section summarizes the basic structure of all of the short vector types. In the following summary class definition, the word "scalartype" is one of { int, uint, float, double, norm, unorm }. The value N is 2, 3, or 4.
class scalartype_N
{
public:
    typedef scalartype value_type;
    static const int size = N;

    scalartype_N() restrict(cpu, amp);
    scalartype_N(scalartype value) restrict(cpu, amp);
    scalartype_N(const scalartype_N& other) restrict(cpu, amp);

    // Component-wise constructor: see 10.4.2.1 Constructors from Components
    // Constructors that explicitly convert from other short vector types
    // See 10.4.2.2 Explicit conversion constructors.

    scalartype_N& operator=(const scalartype_N& other) restrict(cpu, amp);

    // Operators
    scalartype_N& operator++() restrict(cpu, amp);
    scalartype_N operator++(int) restrict(cpu, amp);
    scalartype_N& operator--() restrict(cpu, amp);
    scalartype_N operator--(int) restrict(cpu, amp);
    scalartype_N& operator+=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator-=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator*=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator/=(const scalartype_N& rhs) restrict(cpu, amp);

    // Unary negation: not for scalartype == uint
    scalartype_N operator-() const restrict(cpu, amp);

    // More integer operators (only for scalartype == int or uint)
    scalartype_N operator~() const restrict(cpu, amp);
    scalartype_N& operator%=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator^=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator|=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator&=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator>>=(const scalartype_N& rhs) restrict(cpu, amp);
    scalartype_N& operator<<=(const scalartype_N& rhs) restrict(cpu, amp);

    // Component accessors and properties (a.k.a. swizzling):
    // See 10.4.3 Component Access (Swizzling)
};

scalartype_N operator+(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator-(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator*(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator/(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator==(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator!=(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator>(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator<(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator>=(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
bool operator<=(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
// More integer operators (only for scalartype == int or uint)
scalartype_N operator^(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator|(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator&(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator<<(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
scalartype_N operator>>(const scalartype_N& lhs, const scalartype_N& rhs) restrict(cpu, amp);
10.4.2 Constructors
scalartype_N() restrict(cpu,amp)
Default constructor. Initializes all components to zero.
scalartype_N(scalartype value) restrict(cpu,amp)
Initializes all components of the short vector to value. Parameters:
value
scalartype_N(const scalartype_N& other) restrict(cpu,amp)
Copy constructor. Copies the contents of other to this. Parameters: other The source vector to copy from.
10.4.2.1 Constructors from Components
A short vector type can also be constructed with values for each of its components.
scalartype_2(scalartype v1, scalartype v2) restrict(cpu,amp) // only for length 2
scalartype_3(scalartype v1, scalartype v2, scalartype v3) restrict(cpu,amp) // only for length 3
scalartype_4(scalartype v1, scalartype v2, scalartype v3, scalartype v4) restrict(cpu,amp) // only for length 4
Creates a short vector that has the provided initial values for each component.
Parameters:
v1    The value with which to initialize the x (or r) component.
v2    The value with which to initialize the y (or g) component.
v3    The value with which to initialize the z (or b) component.
v4    The value with which to initialize the w (or a) component.
10.4.2.2 Explicit conversion constructors
A short vector of type scalartype1_N can be constructed from an object of type scalartype2_N, as long as N is the same in both types. For example, a uint_4 can be constructed from a float_4.
explicit scalartype_N(const int_N& other) restrict(cpu,amp)
explicit scalartype_N(const uint_N& other) restrict(cpu,amp)
explicit scalartype_N(const float_N& other) restrict(cpu,amp)
explicit scalartype_N(const double_N& other) restrict(cpu,amp)
explicit scalartype_N(const norm_N& other) restrict(cpu,amp)
explicit scalartype_N(const unorm_N& other) restrict(cpu,amp)
Constructs a short vector from a differently-typed short vector, and performs an explicit conversion. From the earlier list of 6 constructors, each short vector type will have 5 of them. Parameters: other The source vector to copy/convert from.
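As a rough host-side illustration of an explicit conversion constructor, the sketch below converts a two-component float vector into a two-component int vector. The names float_2_sketch and int_2_sketch are hypothetical stand-ins, not the C++ AMP types, and the per-component use of static_cast (which truncates toward zero) is an assumption of this sketch; the text above does not state the rounding behavior.

```cpp
// Hypothetical stand-in for float_2.
struct float_2_sketch {
    float x, y;
    float_2_sketch(float v1, float v2) : x(v1), y(v2) {}
};

// Hypothetical stand-in for int_2.
struct int_2_sketch {
    int x, y;
    int_2_sketch(int v1, int v2) : x(v1), y(v2) {}

    // Explicit conversion from a differently-typed vector of the same length.
    // Assumption: each component is converted on its own with static_cast.
    explicit int_2_sketch(const float_2_sketch& other)
        : x(static_cast<int>(other.x)), y(static_cast<int>(other.y)) {}
};
```

Because the constructor is explicit, a statement such as int_2_sketch v = some_float_2; does not compile; the conversion has to be spelled out, for example int_2_sketch v{some_float_2};.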
10.4.3 Component Access (Swizzling)
The components of a short vector may be accessed in a large variety of ways, depending on the length of the short vector.
As single scalar components (N ≥ 2).
As pairs of components, in any permutation (N ≥ 2).
As triplets of components, in any permutation (N ≥ 3).
As quadruplets of components, in any permutation (N = 4).
Because the number of permutations of such component accessors is so large, they are described here by using symmetric group notation. In such notation, Sxy represents all permutations of the letters x and y, namely xy and yx. Similarly, Sxyz represents all 3! = 6 permutations of the letters x, y, and z, namely xyz, xzy, yxz, yzx, zxy, and zyx. Recall that the z (or b) component of a short vector is only available for vector lengths 3 and 4. The w (or a) component of a short vector is only available for vector length 4.
10.4.3.1 Single-component Access
scalartype get_x() const restrict(cpu,amp)
scalartype get_y() const restrict(cpu,amp)
scalartype get_z() const restrict(cpu,amp)
scalartype get_w() const restrict(cpu,amp)
void set_x(scalartype v) restrict(cpu,amp)
void set_y(scalartype v) restrict(cpu,amp)
void set_z(scalartype v) restrict(cpu,amp)
void set_w(scalartype v) restrict(cpu,amp)
__declspec(property(get=get_x, put=set_x)) scalartype x
__declspec(property(get=get_y, put=set_y)) scalartype y
__declspec(property(get=get_z, put=set_z)) scalartype z
__declspec(property(get=get_w, put=set_w)) scalartype w
__declspec(property(get=get_x, put=set_x)) scalartype r
__declspec(property(get=get_y, put=set_y)) scalartype g
__declspec(property(get=get_z, put=set_z)) scalartype b
__declspec(property(get=get_w, put=set_w)) scalartype a
These functions (and properties) enable access to individual components of a short vector type. The properties in the rgba space map to functions in the xyzw space.
10.4.3.2 Two-component Access
scalartype_2 get_Sxy() const restrict(cpu,amp)
scalartype_2 get_Sxz() const restrict(cpu,amp)
scalartype_2 get_Sxw() const restrict(cpu,amp)
scalartype_2 get_Syz() const restrict(cpu,amp)
scalartype_2 get_Syw() const restrict(cpu,amp)
scalartype_2 get_Szw() const restrict(cpu,amp)
void set_Sxy(scalartype_2 v) restrict(cpu,amp)
void set_Sxz(scalartype_2 v) restrict(cpu,amp)
void set_Sxw(scalartype_2 v) restrict(cpu,amp)
void set_Syz(scalartype_2 v) restrict(cpu,amp)
void set_Syw(scalartype_2 v) restrict(cpu,amp)
void set_Szw(scalartype_2 v) restrict(cpu,amp)
__declspec(property(get=get_Sxy, put=set_Sxy)) scalartype_2 Sxy
__declspec(property(get=get_Sxz, put=set_Sxz)) scalartype_2 Sxz
__declspec(property(get=get_Sxw, put=set_Sxw)) scalartype_2 Sxw
__declspec(property(get=get_Syz, put=set_Syz)) scalartype_2 Syz
__declspec(property(get=get_Syw, put=set_Syw)) scalartype_2 Syw
__declspec(property(get=get_Szw, put=set_Szw)) scalartype_2 Szw
__declspec(property(get=get_Sxy, put=set_Sxy)) scalartype_2 Srg
__declspec(property(get=get_Sxz, put=set_Sxz)) scalartype_2 Srb
__declspec(property(get=get_Sxw, put=set_Sxw)) scalartype_2 Sra
__declspec(property(get=get_Syz, put=set_Syz)) scalartype_2 Sgb
__declspec(property(get=get_Syw, put=set_Syw)) scalartype_2 Sga
__declspec(property(get=get_Szw, put=set_Szw)) scalartype_2 Sba
These functions (and properties) enable access to pairs of components. For example:
int_3 f3(1,2,3);
int_2 yz = f3.yz; // yz = (2,3)
10.4.3.3 Three-component Access
scalartype_3 get_Sxyz() const restrict(cpu,amp)
scalartype_3 get_Sxyw() const restrict(cpu,amp)
scalartype_3 get_Sxzw() const restrict(cpu,amp)
scalartype_3 get_Syzw() const restrict(cpu,amp)
void set_Sxyz(scalartype_3 v) restrict(cpu,amp)
void set_Sxyw(scalartype_3 v) restrict(cpu,amp)
void set_Sxzw(scalartype_3 v) restrict(cpu,amp)
void set_Syzw(scalartype_3 v) restrict(cpu,amp)
__declspec(property(get=get_Sxyz, put=set_Sxyz)) scalartype_3 Sxyz
__declspec(property(get=get_Sxyw, put=set_Sxyw)) scalartype_3 Sxyw
__declspec(property(get=get_Sxzw, put=set_Sxzw)) scalartype_3 Sxzw
__declspec(property(get=get_Syzw, put=set_Syzw)) scalartype_3 Syzw
__declspec(property(get=get_Sxyz, put=set_Sxyz)) scalartype_3 Srgb
__declspec(property(get=get_Sxyw, put=set_Sxyw)) scalartype_3 Srga
__declspec(property(get=get_Sxzw, put=set_Sxzw)) scalartype_3 Srba
__declspec(property(get=get_Syzw, put=set_Syzw)) scalartype_3 Sgba
These functions (and properties) enable access to triplets of components (for vectors of length 3 or 4). For example:
int_4 f3(1,2,3,4);
int_3 wzy = f3.wzy; // wzy = (4,3,2)
10.4.3.4 Four-component Access
scalartype_4 get_Sxyzw() const restrict(cpu,amp)
void set_Sxyzw(scalartype_4 v) restrict(cpu,amp)
__declspec(property(get=get_Sxyzw, put=set_Sxyzw)) scalartype_4 Sxyzw
__declspec(property(get=get_Sxyzw, put=set_Sxyzw)) scalartype_4 Srgba
These functions (and properties) enable access to all four components (only for vectors of length 4). For example:
int_4 f3(1,2,3,4);
int_4 wzyx = f3.wzyx; // wzyx = (4,3,2,1)
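The swizzle accessors can be pictured with a small host-side sketch. The types int_2_sketch, int_3_sketch, and int_4_sketch are hypothetical stand-ins and only a handful of the many permutations are written out. The setter semantics shown (the first component of the argument goes to the first letter of the accessor name) is an assumption of this sketch, chosen to mirror the getters.

```cpp
// Hypothetical stand-ins for int_2 and int_3 (plain aggregates).
struct int_2_sketch { int x, y; };
struct int_3_sketch { int x, y, z; };

// Hypothetical stand-in for int_4 with a few permuted accessors.
struct int_4_sketch {
    int x, y, z, w;

    // Getters return the named components in the order the name lists them.
    int_2_sketch get_yz() const { return {y, z}; }
    int_3_sketch get_wzy() const { return {w, z, y}; }
    int_4_sketch get_wzyx() const { return {w, z, y, x}; }

    // Assumed setter semantics: v.x is stored into y and v.y into x,
    // following the letter order of the accessor name "yx".
    void set_yx(int_2_sketch v) { y = v.x; x = v.y; }
};
```

With a value initialized as {1, 2, 3, 4}, get_wzy() yields (4, 3, 2) and get_wzyx() yields (4, 3, 2, 1), matching the examples in the text.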
creation of redundant intermediate copies. By using these features, you can incrementally accelerate the compute-intensive portions of DirectX applications that use C++ AMP, and use the Direct3D API on data that is produced from C++ AMP computations. The following Direct3D interoperability functions are available in the direct3d namespace:
accelerator_view create_accelerator_view(IUnknown *_D3d_device_interface)
Creates an accelerator_view from an existing Direct3D device interface pointer. On failure, the function throws a runtime_exception exception. On success, the reference count of the parameter is incremented by making an AddRef call on the interface to record the C++ AMP reference to the interface. You can safely Release the object when it is no longer required in the DirectX code.
The accelerator_view that is created by using this function is thread-safe, just as any C++ AMP created accelerator_view is. This enables concurrent submission of commands to it from multiple host threads. However, you must correctly synchronize concurrent use of the accelerator_view and the raw ID3D11Device interface from multiple host threads to ensure mutual exclusion. Unsynchronized concurrent usage of the accelerator_view and the raw ID3D11Device interface causes undefined behavior.
The C++ AMP runtime provides detailed error information in debug mode by using the Direct3D Debug layer. However, if the Direct3D device that is passed to the above function was not created with the D3D11_CREATE_DEVICE_DEBUG flag, the C++ AMP debug mode detailed error information support is unavailable.
Parameters:
_D3d_device_interface    An AMP-supported Direct3D device interface pointer to be used to create the accelerator_view. The parameter must meet all of the following conditions for successful creation of an accelerator_view:
1) Must be a supported Direct3D device interface. For this release, only the ID3D11Device interface is supported.
2) The device must have an AMP-supported feature level. For this release, this means a D3D_FEATURE_LEVEL_11_0.
3) The Direct3D device must not have been created with the D3D11_CREATE_DEVICE_SINGLETHREADED flag.
Return Value: The newly created accelerator_view object.
Exceptions: runtime_exception
1) "Failed to create accelerator_view from D3D device.", E_INVALIDARG
2) "NULL D3D device pointer.", E_INVALIDARG
IUnknown * get_device(const accelerator_view &_Rv)
Returns a Direct3D device interface pointer that underlies the passed accelerator_view. Fails with a runtime_exception exception if the passed accelerator_view is not a Direct3D device resource view. On success, it increments the reference count of the Direct3D device interface by calling AddRef on the interface. You must call Release on the returned interface after you are finished using it, for correct reclamation of the resources that are associated with the object.
You must correctly synchronize concurrent use of the accelerator_view and the raw ID3D11Device interface from multiple host threads to ensure mutual exclusion. Unsynchronized concurrent usage of the accelerator_view and the raw ID3D11Device interface causes undefined behavior.
Parameters:
_Rv    The accelerator_view object for which the Direct3D device interface is needed.
Return Value: An IUnknown interface pointer that corresponds to the Direct3D device that underlies the passed accelerator_view. You must use the QueryInterface member function on the returned interface to obtain the correct Direct3D device interface pointer.
Exceptions: runtime_exception
1) "Uninitialized resource view argument.", E_INVALIDARG
2) "Cannot get D3D device from a non-D3D accelerator_view.", E_INVALIDARG
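The reference-counting contract described for create_accelerator_view and get_device can be illustrated with a mock that models only the count. Everything below is hypothetical: mock_device stands in for a COM object, mock_accelerator_view stands in for accelerator_view, and the assumption that the view releases its reference on destruction is made for this sketch and is not stated in the text above.

```cpp
// Minimal stand-in for a COM object; only the reference count is modeled.
struct mock_device {
    unsigned long ref_count = 1;  // the creator holds the first reference
    unsigned long AddRef() { return ++ref_count; }
    unsigned long Release() { return --ref_count; }
};

// Hypothetical stand-in for an accelerator_view created from a device.
class mock_accelerator_view {
public:
    // Mirrors create_accelerator_view: AddRef records the view's reference.
    explicit mock_accelerator_view(mock_device* device) : device_(device) {
        device_->AddRef();
    }
    // Assumption of this sketch: the view drops its reference when destroyed.
    ~mock_accelerator_view() { device_->Release(); }

    mock_accelerator_view(const mock_accelerator_view&) = delete;
    mock_accelerator_view& operator=(const mock_accelerator_view&) = delete;

    // Mirrors get_device: hands out a new reference the caller must Release.
    mock_device* get_device() const {
        device_->AddRef();
        return device_;
    }

private:
    mock_device* device_;
};
```

In this model, every pointer obtained from get_device has to be paired with exactly one Release call by the caller, independently of the reference that the view itself holds.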
template <typename T, int N> array<T,N> make_array(const extent<N> &_Extent, const accelerator_view &_Rv, IUnknown *_D3d_buffer_interface)
Creates an array that has the specified extents on the specified accelerator_view from an existing Direct3D buffer interface pointer. On failure, the function throws a runtime_exception exception. On success, the reference count of the Direct3D buffer object is incremented by making an AddRef call on the interface to record the C++ AMP reference to the interface, and you can safely Release the object when it is no longer required in the DirectX code.
Parameters:
_Extent    The extent of the array to be created.
_Rv    The accelerator_view that the array is to be created on.
_D3d_buffer_interface    An AMP-supported Direct3D device buffer pointer to be used to create the array. The parameter must meet all of the following conditions for successful creation of an array:
1) Must be a supported Direct3D buffer interface. For this release, only the ID3D11Buffer interface is supported.
2) The Direct3D device on which the buffer was created must be the same as the one underlying the accelerator_view parameter _Rv.
3) The Direct3D buffer must also satisfy the following conditions:
   a. The buffer size in bytes must be equal to the size in bytes of the array to be created (g.get_size() * sizeof(_Elem_type)).
   b. Must have been created by using DEFAULT_USAGE.
   c. SHADER_RESOURCE and UNORDERED_ACCESS bindings should be allowed for the buffer.
4) The Direct3D buffer must be a STRUCTURED_BUFFER that has a structure byte stride of 4.
Return Value: The newly created array object.
Exceptions: runtime_exception
1) "Invalid extents argument.", E_INVALIDARG
2) "Uninitialized resource view argument.", E_INVALIDARG
3) "NULL D3D buffer pointer.", E_INVALIDARG
4) "Invalid D3D buffer argument.", E_INVALIDARG
5) "Cannot create D3D buffer on a non-D3D accelerator_view.", E_INVALIDARG
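The size and stride conditions on the buffer can be checked arithmetically before calling make_array. The sketch below is a host-side illustration only: buffer_desc_sketch and buffer_matches_extent are hypothetical names, and the two fields loosely mirror the ByteWidth and StructureByteStride members of D3D11_BUFFER_DESC. The usage and binding conditions are not modeled.

```cpp
#include <cstddef>

// Hypothetical, simplified description of a Direct3D buffer.
struct buffer_desc_sketch {
    std::size_t byte_width;             // total buffer size in bytes
    std::size_t structure_byte_stride;  // stride of the structured buffer
};

// Returns true when the buffer size equals (number of elements) * sizeof(Elem)
// and the structure byte stride is 4, as the conditions above require.
template <typename Elem, std::size_t Rank>
bool buffer_matches_extent(const std::size_t (&extent)[Rank],
                           const buffer_desc_sketch& desc) {
    std::size_t elements = 1;
    for (std::size_t i = 0; i < Rank; ++i) {
        elements *= extent[i];
    }
    return desc.byte_width == elements * sizeof(Elem) &&
           desc.structure_byte_stride == 4;
}
```

For example, a 4 x 8 extent of float elements needs a buffer of 4 * 8 * sizeof(float) bytes; a buffer that is one element short, or that uses a different stride, fails the check.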
template <size_t RANK, typename _Elem_type> IUnknown * get_d3d_buffer_interface(const array<_Elem_type, RANK> &_F)
Returns a Direct3D buffer interface pointer that underlies the passed array. Fails with a runtime_exception exception if the passed array is not on a Direct3D device resource view. On success, it increments the reference count of the D3D buffer interface by calling AddRef on the interface. You must call Release on the returned interface after you are finished using it, for correct reclamation of the resources that are associated with the object.
Parameters:
_F    The array for which the underlying Direct3D buffer interface is needed.
Return Value: An IUnknown interface pointer that corresponds to the Direct3D buffer that underlies the passed array. You must use the QueryInterface member function on the returned interface to obtain the correct Direct3D buffer interface pointer.
Exceptions: runtime_exception
"Cannot get D3D buffer from a non-D3D array.", E_INVALIDARG
12 Error Handling
12.1 static_assert
The C++ intrinsic static_assert is often used to handle error states that are detectable at compile time. In this way, static_assert is a technique for conveying static semantic errors so that they will be categorized in a way that resembles exception types.
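As a small illustration of reporting a static semantic error at compile time, the sketch below uses static_assert inside two templates. Both coord_sketch and twice are hypothetical helpers invented for this example; they are not part of C++ AMP.

```cpp
#include <type_traits>

// Hypothetical fixed-rank coordinate, loosely modeled on index<N>.
template <int Rank>
struct coord_sketch {
    // Rejected at compile time: coord_sketch<0> or coord_sketch<-1>.
    static_assert(Rank > 0, "rank must be greater than zero");
    int components[Rank];
};

// Hypothetical helper that accepts only arithmetic element types.
template <typename T>
T twice(T value) {
    static_assert(std::is_arithmetic<T>::value,
                  "twice requires an arithmetic element type");
    return value + value;
}
```

Instantiating coord_sketch<0>, or calling twice with a non-arithmetic argument such as a std::string, stops compilation with the quoted message instead of failing at run time.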
12.2.1 runtime_exception
A runtime_exception instance comprises a textual description of the error and an HRESULT error code to indicate the cause of the error.
class runtime_exception
The exception type that all AMP runtime exceptions derive from. A runtime_exception instance comprises a textual description of the error and an HRESULT error code to indicate the cause of the error.
runtime_exception(const char * _Message, HRESULT _Hresult) throw()
Constructs a runtime_exception exception that has the specified message and HRESULT error code.
Parameters:
_Message    Descriptive message of error
_Hresult    HRESULT error code that caused this exception
runtime_exception (HRESULT _Hresult) throw() Constructs a runtime_exception exception that has the specified HRESULT error code. Parameters: _Hresult HRESULT error code that caused this exception
HRESULT get_error_code() const throw() Returns the error code that caused this exception. Return Value: Returns the HRESULT error code that caused this exception.
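A portable sketch of an exception that carries both a message and an error code is shown below. It is only a model of the interface described above: runtime_exception_sketch, hresult_sketch, and require_device are hypothetical names, deriving from std::exception is an assumption, and the code is stored as an unsigned 32-bit value for portability even though the Windows HRESULT type is a signed 32-bit integer. The constant 0x80070057 is the value of E_INVALIDARG.

```cpp
#include <cstdint>
#include <exception>
#include <string>

// Assumption: a 32-bit unsigned stand-in for the Windows HRESULT type.
using hresult_sketch = std::uint32_t;
const hresult_sketch e_invalidarg_sketch = 0x80070057u;  // E_INVALIDARG

// Hypothetical stand-in for runtime_exception.
class runtime_exception_sketch : public std::exception {
public:
    runtime_exception_sketch(const char* message, hresult_sketch hresult)
        : message_(message), hresult_(hresult) {}
    explicit runtime_exception_sketch(hresult_sketch hresult)
        : message_("unspecified runtime error"), hresult_(hresult) {}

    const char* what() const noexcept override { return message_.c_str(); }
    hresult_sketch get_error_code() const noexcept { return hresult_; }

private:
    std::string message_;
    hresult_sketch hresult_;
};

// Hypothetical interop-style check that rejects a null device pointer.
inline void require_device(const void* device) {
    if (device == nullptr) {
        throw runtime_exception_sketch("NULL D3D device pointer.",
                                       e_invalidarg_sketch);
    }
}
```

A caller would typically catch the exception by const reference and branch on get_error_code(), using what() only for diagnostics.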
12.2.2 out_of_memory
An instance of this exception type is thrown when an underlying OS/DirectX API call fails due to failure to allocate system or device memory (E_OUTOFMEMORY HRESULT error code). If the runtime fails to allocate memory from the heap by using the C++ new operator, a std::bad_alloc exception is thrown instead of the C++ AMP out_of_memory exception.
explicit out_of_memory(const char * _Message) throw()
Constructs an out_of_memory exception that has the specified message.
Parameters:
_Message    Descriptive message of error
out_of_memory() throw() Constructs an out_of_memory exception. Parameters: None.
12.2.3 invalid_compute_domain
An instance of this exception type is thrown when the runtime fails to devise a dispatch for the compute domain that is specified at a parallel_for_each call site.
explicit invalid_compute_domain(const char * _Message) throw() Constructs an invalid_compute_domain exception that has the specified message. Parameters: _Message Descriptive message of error
invalid_compute_domain() throw() Constructs an invalid_compute_domain exception. Parameters: None.
12.2.4 unsupported_feature
An instance of this exception type is thrown on executing a d3d11-qualified function on the host when the function uses an intrinsic that is unsupported on the host (such as tiled_index<>::barrier.wait()).
class unsupported_feature : public runtime_exception
Exception that is thrown when an unsupported feature is used.
explicit unsupported_feature (const char * _Message) throw() Constructs an unsupported_feature exception that has the specified message. Parameters: _Message Descriptive message of error
unsupported_feature () throw() Constructs an unsupported_feature exception. Parameters: None.
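Because the specific exception types derive from runtime_exception, handlers for the derived types need to appear before a handler for the base type. The sketch below models that ordering with hypothetical stand-in classes built on std::runtime_error; the names ending in _sketch, the failure_kind enumeration, and the classify helper are all invented for this illustration.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-ins for the exception hierarchy described above.
struct runtime_exception_sketch : std::runtime_error {
    explicit runtime_exception_sketch(const std::string& message)
        : std::runtime_error(message) {}
};
struct out_of_memory_sketch : runtime_exception_sketch {
    explicit out_of_memory_sketch(const std::string& message)
        : runtime_exception_sketch(message) {}
};
struct invalid_compute_domain_sketch : runtime_exception_sketch {
    explicit invalid_compute_domain_sketch(const std::string& message)
        : runtime_exception_sketch(message) {}
};
struct unsupported_feature_sketch : runtime_exception_sketch {
    explicit unsupported_feature_sketch(const std::string& message)
        : runtime_exception_sketch(message) {}
};

enum class failure_kind {
    none, out_of_memory, invalid_compute_domain, unsupported_feature, other_runtime
};

// Runs the callable and reports which kind of failure it raised, if any.
// Derived types are caught first; the base-class handler is the fallback.
template <typename Work>
failure_kind classify(Work work) {
    try {
        work();
    } catch (const out_of_memory_sketch&) {
        return failure_kind::out_of_memory;
    } catch (const invalid_compute_domain_sketch&) {
        return failure_kind::invalid_compute_domain;
    } catch (const unsupported_feature_sketch&) {
        return failure_kind::unsupported_feature;
    } catch (const runtime_exception_sketch&) {
        return failure_kind::other_runtime;
    }
    return failure_kind::none;
}
```

If the base-class handler were listed first, it would intercept every derived exception and the more specific handlers would never run.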
1) The debug version of the runtime is being used (that is, the code is compiled with the _DEBUG preprocessor definition).
2) The debug layer is available, which in turn requires the DirectX SDK to be installed on the system.
3) The accelerator_view on which the kernel is invoked must be on a device that supports the printf and abort intrinsics. As of the publication date of this document, only the REF device supports these intrinsics.
void direct3d_printf(const char *_Format_string, ...) restrict(amp)
Prints formatted output from a kernel to the debug output and optionally to one user-configured output stream per accelerator_view. The function's semantics are the same as those of the C Library printf function, except that it does not have a return value. Also, this function is executed like any other device-side function: per-thread, and in the context of the calling thread. Due to the asynchronous nature of kernel execution, the output from this call may appear at any time between the launch of the kernel that contains the printf call and the completion of the kernel's execution. When it is executed on the host, this function prints the formatted output only to the debug output.
Parameters:
_Format_string    The format string.
...    An optional list of parameters of variable count.
Return Value: None.
void direct3d_errorf(char *_Format_string, ...) restrict(amp)
This intrinsic aborts the execution of a kernel and prints the formatted output to the debug output and optionally to one user-configured debug output stream per resource view. The formatted output is prepended with the string "ASSERTION FAILURE:". This function is executed only on the first thread that reaches the call, upon which the kernel is immediately aborted. Also, the kernel is terminated without executing any destructors for local or group shared variables. Due to the asynchronous nature of kernel execution, the actual abort may happen asynchronously at any time between the dispatch of the kernel and the completion of the kernel's execution. When the abort is detected by the runtime, it raises an assertion_failure exception on the host, with the abort call instance's formatted output as the error message. On the host, this function prints the formatted output to the debug output and raises an assertion_failure exception, with the abort call instance's formatted output as the error message.
Parameters:
_Format_string    The format string.
...    An optional list of parameters of variable count.
void direct3d_abort() restrict(amp)
This intrinsic aborts the execution of a kernel. This function is executed only on the first thread that reaches the call, upon which the kernel is immediately aborted. Also, the kernel is terminated without executing any destructors for local variables. Due to the asynchronous nature of kernel execution, the actual abort may happen asynchronously at any time between the dispatch of the kernel and the completion of the kernel's execution.
Due to the asynchronous nature of kernel execution, the direct3d_printf and direct3d_errorf messages from kernels that execute on a device appear asynchronously during the execution of the shader or after its completion, and not immediately after the async launch of the kernel. Therefore, these messages from a kernel may be interleaved in the debug output with messages from other kernels that are executing concurrently, or with error messages from other runtime calls. It is the programmer's responsibility to include appropriate information in the messages that originate from kernels to indicate the origin of the messages.
In this example, f2 is verified for compulsory adherence to the restrict(cpu) restriction. This causes an error because f2 calls f1, which is not cpu-restricted. Had we changed the restriction on f1 to restrict(cpu), then f2 would pass the adherence test for the explicitly specified restrict(cpu). With respect to the auto restriction, the compiler has to check whether f2 conforms to restrict(amp), which is the only other restriction that is not explicitly specified. In the context of verifying the plausibility of inferring an amp-restriction for f2, the compiler notices that f2 calls f1, which is, in our modified example, not amp-restricted, and therefore f2 is also inferred to be not amp-restricted. Thus, the total inferred restriction for f2 is restrict(cpu). If we now change the restriction for f1 into restrict(cpu,amp), then the inference for f2 would reach the conclusion that f2 is also restrict(cpu,amp).
When two overloads are available to call from a given restriction context, and they differ only because one is explicitly restricted but the other one is implicitly inferred to be restricted, the explicitly restricted overload is chosen.
13.1.2 Automatic Restriction Deduction
Implementations are encouraged to support a mode in which functions that have their definitions accompany their declarations (and where no other declarations occur for such functions) have their restriction set automatically deduced.
In such a mode, when the compiler encounters a function declaration that is also a definition, and no previous declaration of the function has been encountered, the compiler analyzes the function as if it were restricted with restrict(cpu,auto). This enables easy reuse of existing code in amp-restricted code, at the cost of prolonged compilation times.

13.1.3 amp Version

The amp-restriction production of the C++ grammar is amended thus:

    amp-restriction:
        amp amp-version_opt

    amp-version:
        : integer-constant
        : integer-constant . integer-constant

An amp version specifies the lowest version of amp that this function supports. In other words, if a function is decorated with restrict(amp:1), then that function also supports any version greater than or equal to 1.

When the amp version is elided, the implied version is implementation-defined. Implementations are encouraged to support a compiler flag that controls the default version assumed.

When versioning is used in conjunction with restrict(auto) and/or automatic restriction deduction, the compiler will infer the maximal version of the amp restriction that the function adheres to.

Section 2.3.2 specifies that restriction specifiers of a function must not overlap with any restriction specifiers in another function within the same overload set.
    int func(int x) restrict(cpu,amp);
    int func(int x) restrict(cpu); // error, overlaps with previous declaration
This rule is relaxed in the case of versioning: functions that are overloaded with amp versions are not considered to overlap:
    int func(int x) restrict(cpu);
    int func(int x) restrict(amp:1);
    int func(int x) restrict(amp:2);
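The overlap rule and its relaxation for versions can be sketched as a small predicate in standard C++. The type RestrictionSpec and the function overlaps are hypothetical names, and the model rests on two assumptions that the text does not state outright: two amp restrictions overlap only when their versions are equal, and an unversioned amp has already been mapped to whatever default version the implementation assumes.

```cpp
#include <cassert>

// Hypothetical model of one restriction specifier.
struct RestrictionSpec {
    bool cpu;       // true when 'cpu' appears in the specifier
    bool amp;       // true when 'amp' appears in the specifier
    int amp_major;  // amp version, meaningful only when amp is true
    int amp_minor;
};

// Two specifiers overlap when both contain cpu, or when both contain amp
// with the same version (assumed reading of the relaxed rule).
bool overlaps(const RestrictionSpec& a, const RestrictionSpec& b) {
    assert(a.amp_major >= 0 && a.amp_minor >= 0);
    assert(b.amp_major >= 0 && b.amp_minor >= 0);
    if (a.cpu && b.cpu) {
        return true;
    }
    return a.amp && b.amp &&
           a.amp_major == b.amp_major && a.amp_minor == b.amp_minor;
}
```

Under this model restrict(cpu,amp) and restrict(cpu) overlap because both contain cpu, while restrict(cpu), restrict(amp:1) and restrict(amp:2) are pairwise disjoint and can share an overload set.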
When an overload set contains multiple versions of the amp specifier, the function that has the highest version number that is not higher than that of the caller is chosen:
    void glorp() restrict(amp:1) { }
    void glorp() restrict(amp:2) { }
    void glorp_caller() restrict(amp:2)
    {
        glorp(); // okay; resolves to call glorp() restrict(amp:2)
    }
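The selection among versioned overloads can be sketched as follows, reading the rule as "pick the highest version that does not exceed the version of the calling context", which is what the glorp example shows. The names AmpVersion and select_overload are hypothetical, and versions are modelled as {major, minor} pairs compared lexicographically, which is an assumption about how versions such as 1.1 and 2 are ordered.

```cpp
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical representation of an amp version as {major, minor}.
using AmpVersion = std::pair<int, int>;

// Returns the index of the candidate with the highest version that is not
// higher than the caller's version, or -1 when no candidate qualifies.
int select_overload(const std::vector<AmpVersion>& candidates, AmpVersion caller) {
    assert(caller.first >= 0 && caller.second >= 0);
    int best = -1;
    for (std::size_t i = 0; i < candidates.size(); ++i) {
        if (candidates[i] > caller) {
            continue;  // requires a newer amp version than the caller provides
        }
        if (best < 0 || candidates[i] > candidates[static_cast<std::size_t>(best)]) {
            best = static_cast<int>(i);
        }
    }
    return best;
}
```

For the overload set {amp:1, amp:2}, a caller at amp:2 selects the second candidate and a caller at amp:1.1 falls back to the first; a caller below amp:1 finds no viable overload in this model.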
Area / Feature                                amp:1    amp:1.1  amp:1.2  amp:2    cpu

volatile                                      No       Yes      Yes      Yes      Yes

Local/Param/Function Return
  char (8 - signed/unsigned/plain)            No       Yes      Yes      Yes      Yes
  short (16 - signed/unsigned)                No       Yes      Yes      Yes      Yes
  int (32 - signed/unsigned)                  Yes      Yes      Yes      Yes      Yes
  long (32 - signed/unsigned)                 Yes      Yes      Yes      Yes      Yes
  long long (64 - signed/unsigned)            No       No       Yes      Yes      Yes
  half-precision float (16)                   No       No       No       No       No
  float (32)                                  Yes      Yes      Yes      Yes      Yes
  double (64)                                 Yes      Yes      Yes      Yes      Yes
  long double (?)                             No       No       No       No       Yes
  bool (8)                                    Yes      Yes      Yes      Yes      Yes
  wchar_t (16)                                No       Yes      Yes      Yes      Yes
  Pointer (single-indirection)                Yes      Yes      Yes      Yes      Yes
  Pointer (multiple-indirection)              No       No       Yes      Yes      Yes
  Reference                                   Yes      Yes      Yes      Yes      Yes
  Reference to pointer                        Yes      Yes      Yes      Yes      Yes
  Reference/pointer to function               No       No       Yes      Yes      Yes
  static local                                No       No       Yes      Yes      Yes

Struct/class/union members
  char (8 - signed/unsigned/plain)            No       Yes      Yes      Yes      Yes
  short (16 - signed/unsigned)                No       Yes      Yes      Yes      Yes
  int (32 - signed/unsigned)                  Yes      Yes      Yes      Yes      Yes
  long (32 - signed/unsigned)                 Yes      Yes      Yes      Yes      Yes
  long long (64 - signed/unsigned)            No       No       Yes      Yes      Yes
  half-precision float (16)                   No       No       No       No       No
  float (32)                                  Yes      Yes      Yes      Yes      Yes
  double (64)                                 Yes      Yes      Yes      Yes      Yes
  long double (?)                             No       No       No       No       Yes
  bool (8)                                    No       Yes      Yes      Yes      Yes
  wchar_t (16)                                No       Yes      Yes      Yes      Yes
  Pointer                                     No       No       Yes      Yes      Yes
  Reference                                   No       No       Yes      Yes      Yes
  Reference/pointer to function               No       No       No       Yes      Yes
  bitfields                                   No       No       No       Yes      Yes
  unaligned members                           No       No       No       No       Yes
  pointer-to-member (data)                    No       No       Yes      Yes      Yes
  pointer-to-member (function)                No       No       Yes      Yes      Yes
  static data members                         No       No       No       Yes      Yes
  static member functions                     Yes      Yes      Yes      Yes      Yes
  non-static member functions                 Yes      Yes      Yes      Yes      Yes
  Virtual member functions                    No       No       Yes      Yes      Yes
Area / Feature

Struct/class/union members
  Constructors                                Yes      Yes      Yes      Yes
  Destructors                                 Yes      Yes      Yes      Yes

Enums
  char (8 - signed/unsigned/plain)            Yes      Yes      Yes      Yes
  short (16 - signed/unsigned)                Yes      Yes      Yes      Yes
  int (32 - signed/unsigned)                  Yes      Yes      Yes      Yes
  long (32 - signed/unsigned)                 Yes      Yes      Yes      Yes
  long long (64 - signed/unsigned)            No       No       No       Yes

Structs/Classes
  Non-virtual base classes                    Yes      Yes      Yes      Yes
  Virtual base classes                        Yes      Yes      Yes      Yes

Arrays
  of pointers                                 No       Yes      Yes      Yes
  of non-POD classes                          Yes      Yes      Yes      Yes
  of POD classes                              Yes      Yes      Yes      Yes
  of arrays                                   Yes      Yes      Yes      Yes

Declarations
  tile_static                                 Yes      Yes      Yes      No

Function Declarators
  Varargs ()                                  No       No       No       Yes
  throw() specification                       No       No       No       Yes

Statements
  global variables                            No       No       Yes      Yes
  static class members                        No       No       Yes      Yes
  Lambda capture-by-reference (on gpu)        No       Yes      Yes      Yes
  Lambda capture-by-reference (in p_f_e)      No       No       Yes      Yes
  Recursive function call                     No       Yes      Yes      Yes
  conversion between pointer and integral     Yes      Yes      Yes      Yes
  new                                         No       Yes      Yes      Yes
  delete                                      No       Yes      Yes      Yes
  dynamic_cast                                No       No       No       Yes
  typeid                                      No       No       No       Yes
  goto                                        No       No       No       Yes
  labels                                      No       No       No       Yes
  asm                                         No       No       No       Yes
  throw                                       No       No       No       Yes
  try/catch                                   No       No       No       Yes
  __try/__except                              No       No       No       Yes
  __leave                                     No       No       No       Yes