AMD_HIP_Programming_Guide
DISCLAIMER
The information contained herein is for informational purposes only, and is subject to change without notice. In addition, any stated support is
planned and is also subject to change. While every precaution has been taken in the preparation of this document, it may contain technical
inaccuracies, omissions and typographical errors, and AMD is under no obligation to update or otherwise correct this information. Advanced Micro
Devices, Inc. makes no representations or warranties with respect to the accuracy or completeness of the contents of this document, and assumes no
liability of any kind, including the implied warranties of noninfringement, merchantability or fitness for particular purposes, with respect to the
operation or use of AMD hardware, software or other products described herein. No license, including implied or arising by estoppel, to any
intellectual property rights is granted by this document. Terms and limitations applicable to the purchase or use of AMD’s products are as set forth
in a signed agreement between the parties or in AMD's Standard Terms and Conditions of Sale.
AMD®, the AMD Arrow logo, AMD Instinct™, Radeon™, ROCm® and combinations thereof are trademarks of Advanced Micro Devices, Inc. Linux® is the registered trademark of Linus Torvalds in the U.S. and other countries. PCIe® is a registered trademark of PCI-SIG Corporation. Other product names used in this publication are for identification purposes only and may be trademarks of their respective companies.
Table of Contents
Chapter 1 Introduction
1.1 Features
1.2 Accessing HIP
1.2.1 Release Tagging
1.3 HIP Portability and Compiler Technology
Chapter 2 Installing HIP
2.1 Installing Pre-built Packages
2.2 Prerequisites
2.3 AMD Platform
2.4 NVIDIA Platform
2.5 Building HIP from Source
2.5.1 Get HIP Source Code
2.5.2 Set Environment Variables
2.5.3 Build HIP
2.5.4 Default Paths and Environment Variables
2.6 Verifying HIP Installation
Chapter 3 Programming with HIP
3.1 HIP Terminology
3.2 Getting Started with HIP API
3.2.1 HIP API Overview
3.2.2 HIP API Examples
3.3 Introduction to Memory Allocation
3.3.1 Host Memory
3.3.2 Memory Allocation Flags
3.3.3 NUMA-aware Host Memory Allocation
3.3.4 Managed Memory Allocation
3.3.5 HIP Stream Memory Operations
3.3.6 Coherency Controls
3.3.7 Visibility of Zero-Copy Host Memory
3.4 HIP Kernel Language
Chapter 1 Introduction
HIP is a C++ Runtime API and kernel language that allows developers to create portable
applications for AMD and NVIDIA GPUs from a single source code.
1.1 Features
The key features are:
• HIP has little or no performance impact over coding directly in CUDA.
• HIP allows coding in a single-source C++ programming language, including features such as:
o Templates
o C++11 lambdas
o Classes
o Namespaces
• HIP allows developers to use the development environment and tools on each target
platform.
• The HIPify tools automatically convert source code from CUDA to HIP.
• Developers can specialize code for the platform (CUDA or AMD) to tune for performance.
New projects can be developed directly in the portable HIP C++ language and run on either NVIDIA or AMD platforms. Additionally, HIP provides porting tools, which make it easy to port existing CUDA code to the HIP layer with no loss of performance compared to the original CUDA application. Thus, you can compile the HIP source code to run on either platform and isolate features to a specific platform using conditional compilation.
NOTE: HIP is not intended to be a drop-in replacement for CUDA, and developers should expect
to do some manual coding and performance tuning work to complete the port.
1.2.1 Release Tagging
The HIP repository maintains several branches:
• Develop branch: This is the default branch, which consists of new features still under development.
CAUTION: This branch and its features may be unstable.
• Main branch: This is the stable branch and is up to date with the latest release branch. For
example, if the latest HIP release is rocm-4.1.x, the main repository is based on this
release.
• Release branch: The release branch corresponds to each ROCM release listed with release
tags, such as rocm-4.0.x, rocm-4.1.x, and others.
1.3 HIP Portability and Compiler Technology
On the NVIDIA CUDA platform, HIP provides a header file that translates from the HIP runtime APIs to CUDA runtime APIs. The header file contains mostly inline functions and thus has very low overhead; developers coding in HIP should expect the same performance as coding in native CUDA. The code is then compiled with nvcc, the standard C++ compiler provided with the CUDA SDK. Developers can use any tools supported by the CUDA SDK, including the CUDA profiler and debugger.
Thus, HIP provides source portability to either platform. HIP provides the hipcc compiler driver
which will call the appropriate toolchain depending on the desired platform. The source code for
all headers and the library implementation is available on GitHub.
Chapter 2 Installing HIP
2.2 Prerequisites
You can develop HIP code on the AMD ROCm platform using the HIP-Clang compiler, and on a CUDA platform with nvcc. For ROCm installation instructions, see:
https://rocmdocs.amd.com/en/latest/Installation_Guide/Installation-Guide.html
Note, HIP-Clang is the compiler for compiling HIP programs on the AMD platform.
2.4 NVIDIA Platform
• Add the ROCm package server to your system as per the OS-specific guide available here.
• Install the "hip-nvcc" package. This installs the CUDA SDK and the HIP porting layer.
• By default, HIP looks for the CUDA SDK in /usr/local/cuda (this can be overridden by setting the CUDA_PATH environment variable).
• By default, HIP is installed into /opt/rocm/hip (this can be overridden by setting the HIP_PATH environment variable).
• Optionally, consider adding /opt/rocm/bin to your PATH to make it easier to use the tools.
HIP uses Radeon Open Compute Common Language Runtime (ROCclr), a virtual device interface
defined on the AMD platform, and HIP runtimes that interact with different backends. For more
information, see
https://github.com/ROCm-Developer-Tools/ROCclr
The HIPAMD repository provides implementation specifically for the AMD platform. For more
details, see
https://github.com/ROCm-Developer-Tools/hipamd
2.6 Verifying HIP Installation
Run hipconfig to check your installation:
/opt/rocm/bin/hipconfig --full
To verify that HIP is installed correctly, build and run the square sample:
https://github.com/ROCm-Developer-Tools/HIP/tree/main/samples/0_Intro/square
Chapter 3 Programming with HIP
3.1 HIP Terminology
• host, host cpu: Executes the HIP runtime API and is capable of initiating kernel launches to one or more devices.
• default device: Each host thread maintains a "default device". Most HIP runtime APIs (including memory allocation, copy commands, and kernel launches) do not accept an explicit device argument but instead implicitly use the default device. The default device can be set with hipSetDevice.
• HIP-Clang: Heterogeneous AMDGPU compiler, capable of compiling HIP programs on the AMD platform (https://github.com/RadeonOpenCompute/llvm-project).
• hipify tools: Tools to convert CUDA code to portable C++ code (https://github.com/ROCm-Developer-Tools/HIPIFY).
• ROCclr: A virtual device interface through which compute runtimes interact with different backends, such as ROCr on Linux or PAL on Windows. ROCclr is an abstraction layer that allows runtimes to work on both OSes without much effort (https://github.com/ROCm-Developer-Tools/ROCclr).
3.2 Getting Started with HIP API
The HIP API documentation is available at:
https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD-HIP-API-4.5.pdf
The HIP kernel language defines builtins for determining grid and block coordinates, math
functions, short vectors, atomics, and timer functions. It also specifies additional defines and
keywords for function types, address spaces, and optimization controls. For a detailed description,
see Section 3.4 in this document.
3.2.2.2 Example 2
The HIP runtime API code and the compute kernel definition can exist in the same source file; HIP takes care of generating host and device code appropriately. More examples are available at:
https://github.com/ROCm-Developer-Tools/HIP/tree/main/samples
3.3.1 Host Memory
• Faster HostToDevice and DeviceToHost data transfers: The runtime tracks hipHostMalloc allocations and can avoid some of the setup required for regular unpinned memory. For exact measurements on a specific system, experiment with the --unpinned and --pinned switches of the hipBusBandwidth tool. A short allocation sketch follows this list.
• Zero-copy GPU access: The GPU can directly access host memory over the CPU/GPU interconnect, without needing to copy the data. This avoids the copy, but each memory access made during kernel execution must traverse the interconnect, which can be tens of times slower than accessing the GPU's local device memory. Zero-copy memory can be a good choice when the memory accesses are infrequent (perhaps only once). Zero-copy memory is typically "Coherent" and thus not cached by the GPU, but this can be overridden if desired and is explained in more detail below.
3.3.3 NUMA-aware Host Memory Allocation
hipHostMallocNumaUser is the flag that allows host memory allocation to follow the NUMA policy set by the user. For more information, see the hipHostMalloc API in the HIP API guide:
https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD-HIP-API-4.5.pdf
NUMA also measures the distance between the GPU and CPU devices. By default, each GPU selects the NUMA CPU node that has the least NUMA distance between them; that is, host memory is automatically allocated closest to the memory pool of the NUMA node of the current GPU device. Note, using the hipSetDevice API with a different GPU still provides access to the host allocation; however, it may have a longer NUMA distance. A NUMA allocation sketch follows.
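As a hedged sketch (the helper name allocate_numa is hypothetical), the flag is passed directly to hipHostMalloc:

#include <hip/hip_runtime.h>

hipError_t allocate_numa(float** h_ptr, size_t n) {
    // With hipHostMallocNumaUser, the allocation follows the NUMA policy set
    // by the user (e.g., via numactl) rather than the GPU-nearest node.
    return hipHostMalloc(h_ptr, n * sizeof(float), hipHostMallocNumaUser);
}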
3.3.4 Managed Memory Allocation
HIP supports managed memory allocation through hipMallocManaged. Before using it, check that the device supports managed memory. For example,
int managed_memory = 0;
HIPCHECK(hipDeviceGetAttribute(&managed_memory,
    hipDeviceAttributeManagedMemory, p_gpuDevice));
if (!managed_memory) {
    printf("info: managed memory access not supported on the device %d\n Skipped\n", p_gpuDevice);
}
else {
    HIPCHECK(hipSetDevice(p_gpuDevice));
    HIPCHECK(hipMallocManaged(&Hmm, N * sizeof(T)));
    . . .
}
NOTE: The managed memory capability check may not be necessary; however, if HMM is not supported, managed malloc falls back to using system memory, and other managed memory API calls will then have undefined behavior.
For more information on the managed memory APIs, see:
https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD-HIP-API-4.5.pdf
3.3.5 HIP Stream Memory Operations
HIP supports stream memory operations, which allow a stream to wait on or update a 32-bit or 64-bit memory value:
• hipStreamWaitValue32
• hipStreamWaitValue64
• hipStreamWriteValue32
• hipStreamWriteValue64
NOTE: CPU access to the semaphore's memory requires the volatile keyword to disable the CPU compiler's optimizations on memory access. A usage sketch follows.
For more details, please check the documentation HIP-API.pdf.
3.3.6 Coherency Controls
• Coherent memory: Supports fine-grain synchronization while the kernel is running. For
example, a kernel can perform atomic operations visible to the host CPU or to other (peer)
GPUs. Synchronization instructions include threadfence_system and C++11-style atomic
operations. However, coherent memory cannot be cached by the GPU and thus may have
lower performance.
• Non-coherent memory: Can be cached by GPU but cannot support synchronization while
the kernel is running. Non-coherent memory can be optionally synchronized only at
command (end-of-kernel or copy command) boundaries. This memory is appropriate for
high-performance access when fine-grain synchronization is not required.
HIP provides the developer with controls to select which type of memory is used via allocation
flags passed to hipHostMalloc and the HIP_HOST_COHERENT environment variable. By
default, the environment variable HIP_HOST_COHERENT is set to 0 in HIP.
• If no flags are passed in, the host memory allocation is coherent, and the HIP_HOST_COHERENT environment variable is ignored. The sketch after this list illustrates the flag choices.
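A minimal sketch of both allocation types (names are illustrative):

#include <hip/hip_runtime.h>

void allocate_both(size_t bytes) {
    float* coherent = nullptr;
    float* noncoherent = nullptr;
    // Fine-grained, coherent allocation: supports in-kernel synchronization
    // but is not cached by the GPU.
    hipHostMalloc(&coherent, bytes, hipHostMallocCoherent);
    // Non-coherent allocation: may be cached by the GPU; synchronized only
    // at command boundaries.
    hipHostMalloc(&noncoherent, bytes, hipHostMallocNonCoherent);
    hipHostFree(coherent);
    hipHostFree(noncoherent);
}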
3.3.7 Visibility of Zero-Copy Host Memory
hipEventSynchronize: The host waits for the specified event to complete. Fence: device-scope release. Coherent host memory visibility: yes. Non-coherent host memory visibility: depends; see the description below.
3.3.7.1 hipEventSynchronize
Developers can control the release scope for hipEvents. By default, the GPU performs a device-
scope acquire and release operation with each recorded event. This will make host and device
memory visible to other commands executing on the same device.
A stronger system-level fence can be specified when the event is created with
hipEventCreateWithFlags.
hipEventDisableTiming: Events created with this flag do not record profiling data, thus,
providing optimal performance if used for synchronization.
If events are used across multiple dispatches, for example, start and stop events from different hipExtLaunchKernelGGL/hipExtLaunchKernel calls, they are treated as invalid unrecorded events, and HIP returns the error "hipErrorInvalidHandle" from hipEventElapsedTime. An event-creation sketch follows.
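A hedged sketch of creating an event with these flags (combining a system-scope release with timing disabled):

#include <hip/hip_runtime.h>

void event_demo(hipStream_t stream) {
    hipEvent_t e;
    // hipEventReleaseToSystem requests a system-scope release when the event
    // is recorded; hipEventDisableTiming skips profiling data for lower cost.
    hipEventCreateWithFlags(&e, hipEventReleaseToSystem | hipEventDisableTiming);
    hipEventRecord(e, stream);
    hipEventSynchronize(e); // host waits, as described above
    hipEventDestroy(e);
}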
• Coherent host memory is the default and easiest to use since the memory is visible to the
CPU at specific synchronization points. This memory allows in-kernel synchronization
commands such as threadfence_system to work transparently.
• HIP/ROCm also supports caching host memory in the GPU using "Non-Coherent" host memory allocations. This can provide a performance benefit, but care must be taken to use the correct synchronization.
By default, Direct Dispatch is enabled in the HIP runtime. With this feature, the conventional producer-consumer model, in which the runtime creates a worker thread (consumer) for each HIP stream while the host thread (producer) enqueues commands to a per-stream command queue, no longer applies.
With Direct Dispatch, the runtime directly queues a packet to the AQL queue (the user-mode queue to the GPU) for kernel dispatch and some synchronization operations. This has been shown to reduce the total latency of the HIP dispatch API and the latency to launch the first wave on the GPU.
In addition, eliminating the threads in runtime has reduced the variance in the dispatch numbers as
the thread scheduling delays and atomics/locks synchronization latencies are reduced.
This feature can be disabled via setting the following environment variable,
AMD_DIRECT_DISPATCH=0
HIP supports runtime compilation (hipRTC), which can enable optimizations and performance improvements compared with regular offline static compilation.
hipRTC APIs accept HIP source code as a character string and create program handles by compiling the source without spawning separate processes. For an example of how to program a HIP application using the runtime compilation mechanism, see:
https://github.com/ROCm-Developer-Tools/HIP/blob/main/tests/src/hiprtc/saxpy.cpp
A condensed sketch of the same flow appears below.
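A hedged, condensed sketch of that flow (error checks omitted; the kernel string is illustrative):

#include <hip/hiprtc.h>
#include <vector>

std::vector<char> compile_saxpy() {
    const char* src =
        "extern \"C\" __global__ void saxpy(float a, float* x, float* y, int n) {\n"
        "  int i = blockIdx.x * blockDim.x + threadIdx.x;\n"
        "  if (i < n) y[i] = a * x[i] + y[i];\n"
        "}\n";
    hiprtcProgram prog;
    hiprtcCreateProgram(&prog, src, "saxpy.cu", 0, nullptr, nullptr);
    hiprtcCompileProgram(prog, 0, nullptr); // no extra options
    size_t codeSize = 0;
    hiprtcGetCodeSize(prog, &codeSize);
    std::vector<char> code(codeSize);
    hiprtcGetCode(prog, code.data());
    hiprtcDestroyProgram(&prog);
    return code; // load with hipModuleLoadData + hipModuleGetFunction
}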
HIP Graph is supported. For more details, refer to the HIP API Guide at,
https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD-HIP-API-4.5.pdf
In HIP-Clang, the long double type is an 80-bit extended precision format for x86_64, which is not supported by AMDGPU. HIP-Clang treats the long double type as the IEEE double type for AMDGPU. Using long double in HIP source code causes no issue as long as data of long double type is not transferred between host and device. However, long double should not be used as a kernel argument type.
By default, HIP-Clang assumes -ffp-contract=fast. For x86_64, FMA is off by default since the generic x86_64 target does not support FMA by default. To turn on FMA on x86_64, use either -mfma or -march=native on CPUs that support FMA.
When contractions are enabled and the CPU has not enabled FMA instructions, the GPU can
produce different numerical results than the CPU for expressions that can be contracted.
If a host function is shared between Clang (or hipcc) and gcc for x86_64, that is, its definition is compiled by one compiler but a different compiler compiles the caller, _Float16 or aggregates containing _Float16 must not be used as a function argument or return type. This is due to the lack of a stable ABI for _Float16 on x86_64. Passing _Float16 or aggregates containing _Float16 between clang and gcc could cause undefined behavior.
HIP does not support math functions with rounding modes ru (round up), rd (round down), and rz
(round towards zero). HIP only supports math functions with rounding mode rn (round to nearest).
The math functions with postfixes _ru, _rd, and _rz are implemented the same way as those with
postfix _rn. They serve as a workaround to get programs using them compiled.
• The first type of static library does not export device functions, and only exports and
launches host functions within the same library. The advantage of this type is the ability to
link with a non-hipcc compiler such as gcc.
• The second type exports device functions to be linked by other code objects. However, this type requires using hipcc as the linker.
In addition, the first type of library contains host objects with device code embedded as fat binaries. It is generated using the flag --emit-static-lib. The second type of library contains relocatable device objects and is generated using ar. Hypothetical build commands for both types are sketched below.
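As a hedged illustration (the file names are hypothetical), the two types might be built as follows:

# First type: host-callable only; linkable with a non-hipcc compiler
hipcc lib_src.cpp -fPIC --emit-static-lib -o libHipHost.a
# Second type: exports device functions; must be linked with hipcc
hipcc -fgpu-rdc -c lib_src.cpp -o lib_src.o
ar rcs libHipDevice.a lib_src.o
hipcc -fgpu-rdc main.cpp libHipDevice.a -o app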
3.4 HIP Kernel Language
The HIP kernel language includes:
• A kernel-launch syntax that uses standard C++, resembles a function call, and is portable to all HIP targets
• Short-vector headers that can serve on a host or a device
• Math functions resembling those in the "math.h" header included with standard C++
compilers
• Built-in functions for accessing specific GPU hardware capabilities
This section describes the built-in variables and functions accessible from the HIP kernel. It is
intended for readers familiar with CUDA kernel syntax and who want to understand how HIP is
different.
3.4.1.1 __device__
The __device__ keyword can combine with the __host__ keyword (see __host__).
3.4.1.2 __global__
HIP __global__ functions must have a void return type. See the Kernel Launch example for more
information.
HIP lacks dynamic-parallelism support, so __global__ functions cannot be called from the device.
3.4.1.3 __host__
__host__ can combine with __device__, in which case the function compiles for both the host and
device. These functions cannot use the HIP grid coordinate functions. For example, "threadIdx_x".
A possible workaround is to pass the necessary coordinate info as an argument to the function.
__host__ cannot combine with __global__.
HIP parses the __noinline__ and __forceinline__ keywords and converts them to the appropriate
Clang attributes.
__global__ functions are often referred to as kernels, and calling one is termed launching the
kernel. These functions require the caller to specify an "execution configuration" that includes the
grid and block dimensions. The execution configuration can also include other information for the
launch, such as the amount of additional shared memory to allocate and the stream where the
kernel should execute. HIP introduces a standard C++ calling convention to pass the execution
configuration to the kernel in addition to the CUDA <<< >>> syntax.
• In HIP, kernels launch with either the <<< >>> syntax or the "hipLaunchKernel" function.
• The first five parameters to hipLaunchKernel are the following:
o symbol kernelName: the name of the kernel to launch. To support template kernels
that contain "," use the HIP_KERNEL_NAME macro. The hipify tools insert this
automatically.
o dim3 gridDim: 3D-grid dimensions specifying the number of blocks to launch.
o dim3 blockDim: 3D-block dimensions specifying the number of threads in each block.
o size_t dynamicShared: the amount of additional shared memory to allocate when
launching the kernel (see shared)
o hipStream_t: stream where the kernel should execute. A value of 0 corresponds to the
NULL stream (see Synchronization Functions).
• Kernel arguments must follow the five parameters.
The hipLaunchKernel macro always starts with the five parameters specified above, followed by
the kernel arguments. HIPIFY tools optionally convert CUDA launch syntax to hipLaunchKernel,
including conversion of optional arguments in <<< >>> to the five required hipLaunchKernel
parameters. The dim3 constructor accepts zero to three arguments and will by default initialize
unspecified dimensions to 1. See dim3. The kernel uses coordinate built-ins (thread*, block*,
grid*) to determine the coordinate index and coordinate bounds of the work item that is currently
executing. For more information, see Coordinate Built-Ins.
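A minimal sketch of both launch forms (the kernel and sizes are illustrative; in current HIP headers the macro is spelled hipLaunchKernelGGL):

#include <hip/hip_runtime.h>

__global__ void vec_add(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x; // coordinate built-ins
    if (i < n) c[i] = a[i] + b[i];
}

void launch(const float* a, const float* b, float* c, int n) {
    dim3 grid((n + 255) / 256); // unspecified dimensions default to 1
    dim3 block(256);
    // Triple-chevron syntax:
    vec_add<<<grid, block, 0, 0>>>(a, b, c, n);
    // Equivalent macro form: kernel name, grid, block, dynamic shared
    // memory bytes, stream, then the kernel arguments.
    hipLaunchKernelGGL(vec_add, grid, block, 0, 0, a, b, c, n);
}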
3.4.2.1 __constant__
The __constant__ keyword is supported. The host writes constant memory before launching the kernel; from the GPU, this memory is read-only during kernel execution. The functions for accessing constant memory (hipGetSymbolAddress(), hipGetSymbolSize(), hipMemcpyToSymbol(), hipMemcpyToSymbolAsync(), hipMemcpyFromSymbol(), hipMemcpyFromSymbolAsync()) are available.
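A hedged sketch (symbol and kernel names are illustrative) of writing and reading constant memory:

#include <hip/hip_runtime.h>

__constant__ float c_coeffs[4];

__global__ void apply(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= c_coeffs[i % 4]; // read-only during execution
}

void upload_coeffs(const float host_coeffs[4]) {
    // The host writes the symbol before launching the kernel.
    hipMemcpyToSymbol(HIP_SYMBOL(c_coeffs), host_coeffs, 4 * sizeof(float));
}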
3.4.2.2 __shared__
extern __shared__ allows the host to dynamically allocate shared memory and is specified as a
launch parameter.
Now, the HIP-Clang compiler provides support for extern shared declarations, and the
HIP_DYNAMIC_SHARED option is no longer required.
3.4.2.3 __managed__
3.4.2.4 __restrict__
The __restrict__ keyword tells the compiler that the associated memory pointer will not alias with
any other pointer in the kernel or function. This feature can help the compiler generate better code.
In most cases, all pointer arguments must use this keyword to realize the benefit.
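A minimal sketch of marking non-aliasing pointers so the compiler may optimize more aggressively (names are illustrative):

__global__ void scale(const float* __restrict__ in,
                      float* __restrict__ out,
                      float s, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    // Because in and out are promised not to alias, the compiler can keep
    // loaded values in registers across the store.
    if (i < n) out[i] = s * in[i];
}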
3.4.3.1 Coordinate Built-Ins
These built-ins determine the coordinate of the active work item in the execution grid. They are defined in hip_runtime.h (rather than being implicitly defined by the compiler). The coordinate built-ins threadIdx.{x,y,z}, blockIdx.{x,y,z}, blockDim.{x,y,z}, and gridDim.{x,y,z} are identical in name and meaning to their CUDA counterparts.
3.4.3.2 warpSize
The warpSize variable is of type int and contains the warp size (in threads) for the target device.
Note that all current Nvidia devices return 32 for this variable, and all current AMD devices return
64. Device code should use the warpSize built-in to develop portable wave-aware code.
Short vector types derive from the basic integer and floating-point types. They are structures
defined in hip_vector_types.h. The first, second, third, and fourth components of the vector are
accessible through the x, y, z, and w fields, respectively. All the short vector types support a
constructor function of the form make_<type_name>(). For example, float4 make_float4(float x,
float y, float z, float w) creates a vector of type float4 and value (x,y,z,w).
The supported types span signed integers, unsigned integers, and floating-point types.
3.4.4.2 dim3
dim3 is a three-dimensional integer vector type commonly used to specify grid and group
dimensions. Unspecified dimensions are initialized to 1.
typedef struct dim3 {
    uint32_t x;
    uint32_t y;
    uint32_t z;
    dim3(uint32_t _x = 1, uint32_t _y = 1, uint32_t _z = 1) : x(_x), y(_y), z(_z) {}
} dim3;
HIP provides a workaround for threadfence_system() under the HIP-Clang path. To enable the workaround, build HIP with the environment variable HIP_COHERENT_HOST_ALLOC enabled. In addition:
• The kernel should operate only on fine-grained system memory, which should be allocated with hipHostMalloc().
• Remove all memcpy operations for those allocated fine-grained system memory regions.
The supported single-precision math functions include, among others: atan2f (arc tangent of the ratio of the first and second arguments), modff (break the argument into fractional and integral parts), powf (first argument raised to the power of the second), the Bessel functions of the first and second kind j0f, j1f, y0f, y1f, and ynf, lgammaf (natural logarithm of the absolute value of the gamma function), norm3df, norm4df, and normf (square root of the sum of squares of three, four, or any number of coordinates), rhypotf, rnorm3df, rnorm4df, and rnormf (reciprocals of the corresponding square roots of sums of squares), and sincospif (sine and cosine of the argument multiplied by PI).
The corresponding double-precision functions are also supported, including atan2, modf, pow, the Bessel functions j0(x), j1(x), jn(n, x), y0(x), y1(x), and yn(n, x), lgamma, norm3d, norm4d, norm, rhypot, rnorm3d, rnorm4d, rnorm, and sincospi.
NOTE: [1] __RETURN_TYPE is dependent on the compiler. It is usually 'int' for C compilers and
'bool' for C++ compilers.
The following table lists supported integer intrinsics. Note, intrinsics are supported on devices
only.
• __clz: Return the number of consecutive high-order zero bits in a 32-bit integer.
• __clzll: Return the number of consecutive high-order zero bits in a 64-bit integer.
• __ffs: Find the position of the least significant bit set to 1 in a 32-bit integer.
• __ffsll: Find the position of the least significant bit set to 1 in a 64-bit integer.
Both signed and unsigned variants of these intrinsics are provided.
NOTE: The HIP-Clang implementation of __ffs() and __ffsll() contains code to add a constant +1
to produce the ffs result format. For the cases where this overhead is not acceptable and the
programmer is willing to specialize for the platform, HIP-Clang provides
__lastbit_u32_u32(unsigned int input) and __lastbit_u32_u64(unsigned long long int input).
The following table provides a list of supported floating-point intrinsics. Note, intrinsics are
supported on devices only.
Examples include float __cosf(float x), which calculates the fast, approximate cosine of the input argument.
The supported Texture functions are listed in the following header files:
• "texture_functions.h"
https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_functions.h
• "texture_indirect_functions.h"
https://github.com/ROCm-Developer-Tools/HIP/blob/main/include/hip/hcc_detail/texture_indirect_functions.h
HIP provides the following built-in functions for reading a high-resolution timer from the device.
clock_t clock()
long long int clock64()
These return the value of a counter that is incremented on every clock cycle on the device. The difference between values returned provides the cycles used.
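A hedged sketch of timing a section of device code in cycles (names are illustrative):

__global__ void timed_kernel(float* data, long long* cycles) {
    long long start = clock64();
    data[threadIdx.x] *= 2.0f; // work being timed
    long long stop = clock64();
    if (threadIdx.x == 0)
        *cycles = stop - start; // difference = cycles used
}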
HIP adds new APIs with the _system suffix to support system-scope atomic operations. For example, while atomicAnd is scoped to the GPU device, atomicAnd_system extends the atomic operation to system scope: from the GPU device to other CPUs and GPU devices in the system. A sketch follows.
The compilation flag __HIP_USE_CMPXCHG_FOR_FP_ATOMICS is not set ("0") by default, so the HIP runtime uses the current float/double atomicAdd functions. If the flag is set to 1 with the CMake option -D__HIP_USE_CMPXCHG_FOR_FP_ATOMICS=1, the old float/double atomicAdd functions (implemented with compare-and-exchange) are used for compatibility with compilers that do not support floating-point atomics.
For details on how to build HIP runtime, refer to the HIP Installation section in this guide.
HIP enables atomic operations on 32-bit integers. Additionally, it supports an atomic float add.
AMD hardware, however, implements the float add using a CAS loop, so this function may not
perform efficiently.
Warp cross-lane functions operate across all lanes in a warp. The hardware guarantees that all
warp lanes will execute in lockstep, so additional synchronization is unnecessary and the
instructions use no shared memory.
Note that Nvidia and AMD devices have different warp sizes, so portable code should use the
warpSize built-ins to query the warp size. Hipified code from the CUDA path requires careful
review to ensure it doesn’t assume a waveSize of 32. "Wave-aware" code that assumes a waveSize
of 32 will run on a wave-64 machine, but it will utilize only half of the machine resources.
WarpSize built-ins should only be used in device functions; their value depends on the GPU architecture. Host functions should use hipGetDeviceProperties to get the default warp size of a GPU device:

hipDeviceProp_t props;
hipGetDeviceProperties(&props, deviceID);
int w = props.warpSize;
// implement portable algorithm based on w (rather than assume 32 or 64)

Note, assembly kernels may be built for a warp size that differs from the default warp size.
Threads in a warp are referred to as lanes and are numbered from 0 to warpSize - 1. For these functions, each warp lane contributes a 1-bit value (the predicate), which is efficiently broadcast to all lanes in the warp. The 32-bit int predicate from each lane reduces to a 1-bit value: 0 (predicate = 0) or 1 (predicate != 0). __any and __all provide a summary view of the predicates that the other warp lanes contribute:
• __any() returns 1 if any warp lane contributes a nonzero predicate, or 0 otherwise
• __all() returns 1 if all other warp lanes contribute nonzero predicates, or 0 otherwise
Applications can test whether the target platform supports the any/all instruction using the
hasWarpVote device property or the HIP_ARCH_HAS_WARP_VOTE compiler define.
__ballot provides a bit mask containing the 1-bit predicate value from each lane. The nth bit of the result contains the bit contributed by the nth warp lane. Note that HIP's __ballot function supports a 64-bit return value (compared with CUDA's 32 bits). Code ported from CUDA should support the larger warp sizes that the HIP version of this instruction supports. Applications can test whether the target platform supports the ballot instruction using the hasWarpBallot device property or the HIP_ARCH_HAS_WARP_BALLOT compiler define. A combined vote/ballot sketch follows.
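A hedged sketch of the vote and ballot built-ins together (names are illustrative):

__global__ void vote_demo(const int* flags, unsigned long long* mask,
                          int* any_set, int* all_set) {
    int pred = flags[threadIdx.x] != 0;       // each lane contributes a predicate
    int any = __any(pred);                    // 1 if any lane's predicate != 0
    int all = __all(pred);                    // 1 if every lane's predicate != 0
    unsigned long long bits = __ballot(pred); // bit n = lane n's predicate
    if (threadIdx.x == 0) {
        *any_set = any;
        *all_set = all;
        *mask = bits;
    }
}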
Half-float shuffles are not supported. The default width is warpSize---see Warp Cross-Lane
Functions. Applications should not assume the warpSize is 32 or 64.
Warp matrix functions allow a warp to cooperatively operate on small matrices whose elements
are spread over the lanes in an unspecified manner. This feature was introduced in CUDA 9.
HIP does not support any of the kernel language warp matrix types or functions.
The hardware support for independent thread scheduling introduced in certain architectures
supporting CUDA allows threads to progress independently of each other and enables intra-warp
synchronizations that were previously not allowed.
3.4.7.16 Assert
The assert function is under development. HIP does support an "abort" call which will terminate
the process execution from inside the kernel.
3.4.7.17 Printf
3.4.9 __launch_bounds__
GPU multiprocessors have a fixed pool of resources (primarily registers and shared memory)
which are shared by the actively running warps. Using more resources can increase IPC of the
kernel but reduces the resources available for other warps and limits the number of warps that can
be simultaneously running. Thus, GPUs have a complex relationship between resource usage and
performance.
__launch_bounds__ allows the application to provide usage hints that influence the resources
(primarily registers) used by the generated code. It is a function attribute that must be attached to
a __global__ function:
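A hedged sketch of the attribute form (the values are illustrative): at most 256 threads per block, with at least 2 warps kept resident per execution unit:

__global__ void __launch_bounds__(256, 2) my_kernel(float* data) {
    data[blockIdx.x * blockDim.x + threadIdx.x] *= 2.0f;
}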
When the kernel is launched with HIP APIs, for example, hipModuleLaunchKernel(), HIP validates that the input kernel dimension size is not larger than the specified launch_bounds. If it is exceeded, HIP returns a launch failure, and if AMD_LOG_LEVEL is set to an appropriate value, the error details appear in the error log message, including the launch parameters of the kernel dimension size, the launch bounds, and the name of the faulting kernel. This is usually helpful for identifying the faulting kernel, and the kernel dimension size and launch bounds values also assist in debugging such failures.
• The compiler uses the hints only to manage register usage and does not
automatically reduce shared memory or other resources.
• Compilation fails if the compiler cannot generate a kernel that meets the
requirements of the specified launch bounds.
• From MAX_THREADS_PER_BLOCK, the compiler derives the maximum number of warps/block that can be used at launch time. Values of MAX_THREADS_PER_BLOCK less than the default allow the compiler to use a larger pool of registers: each warp uses registers, and this hint constrains the launch to a warps/block size that is less than the maximum.
A compute unit (CU) is responsible for executing the waves of a workgroup. It is composed of one
or more execution units (EU) that are responsible for executing waves. An EU can have enough
resources to maintain the state of more than one executing wave. This allows an EU to hide
latency by switching between waves in a similar way to symmetric multithreading on a CPU. To
allow the state for multiple waves to fit on an EU, the resources used by a single wave have to be
limited. Limiting such resources can allow greater latency hiding but it can result in having to spill
some register state to memory. This attribute allows an advanced developer to tune the number of
waves that are capable of fitting within the resources of an EU. It can be used to ensure at least a
certain number will fit to help hide latency and can also be used to ensure no more than a certain
number will fit to limit cache thrashing.
• Warps (rather than blocks): The developer is trying to tell the compiler to control resource utilization to guarantee some number of active warps/EU for latency hiding. Specifying active warps in terms of blocks appears to hide the micro-architectural details of the warp size; however, it makes the interface more confusing since the developer ultimately needs to compute the number of warps to obtain the desired level of control.
• Execution units (rather than multiprocessors): The use of execution units rather than multiprocessors provides support for architectures with multiple execution units per multiprocessor. For example, the AMD GCN architecture has 4 execution units per multiprocessor. hipDeviceProps has a field executionUnitsPerMultiprocessor.
Platform-specific coding techniques such as #ifdef can be used to specify different
launch_bounds for NVCC and HIP-Clang platforms if desired.
3.4.9.4 Maxregcount
Unlike nvcc, HIP-Clang does not support the "--maxregcount" option. Instead, users are encouraged to use the hip_launch_bounds directive, since its parameters are more intuitive and portable than micro-architecture details like registers, and it allows per-kernel control rather than per-file control. hip_launch_bounds works on both HIP-Clang and nvcc targets.
asm volatile ("v_mac_f32_e32 %0, %2, %3" : "=v" (out[i]) : "0"(out[i]), "v" (a), "v" (in[i]));
The HIP compiler inserts GCN ISA into the kernel using the asm() assembler statement. The volatile keyword is used so that the optimizers do not change the number of volatile operations or their order of execution relative to other volatile operations. v_mac_f32_e32 is the GCN instruction; for more information, refer to the AMD GCN3 ISA architecture manual. The index of each operand, in order, is given by % followed by its position in the list of operands. "v" is the constraint code (target-specific to AMDGPU) for a 32-bit VGPR register; for more information, refer to the Supported Constraint Code List for AMDGPU. Output constraints are specified by an "=" prefix, as shown above ("=v"); this indicates that the assembly writes to this operand, and the operand is then made available as a return value of the asm expression. Input constraints do not have a prefix, just the constraint code. The constraint string "0" says to use the assigned register for output as an input as well (it being the 0th constraint).
The file format for binary is `.co` which means Code Object. The following command builds the
code object using `hipcc`.
`hipcc --genco --offload-arch=[TARGET GPU] [INPUT FILE] -o [OUTPUT FILE]`
[TARGET GPU] = GPU architecture
[INPUT FILE] = Name of the file containing kernels
[OUTPUT FILE] = Name of the generated code object file
NOTE: When using binary code objects, the number of arguments to the kernel differs between the HIP-Clang and NVCC paths. Refer to the sample in samples/0_Intro/module_api for differences in the arguments to be passed to the kernel.
3.4.15 gfx-arch-specific-kernel
Clang-defined '__gfx*__' macros can be used to execute gfx-arch-specific code inside the kernel, as sketched below. Refer to the sample 14_gpu_arch in samples/2_Cookbook.
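A hedged sketch (the architecture macros shown are examples) of selecting a code path inside a kernel:

__global__ void arch_specific(float* out) {
#if defined(__gfx908__) || defined(__gfx90a__)
    out[threadIdx.x] = 1.0f; // path tuned for these targets
#else
    out[threadIdx.x] = 0.0f; // generic fallback
#endif
}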
3.5.1 roc-obj
High-level wrapper around low-level tooling as described below. For a more detailed overview,
see the help text available with roc-obj --help.
3.5.1.1 Examples
3.5.1.1.1 Extract all ROCm code objects from a list of executables
roc-obj <executable>...
3.5.1.1.2 Extract all ROCm code objects from a list of executables, and disassemble them
3.5.1.1.3 Extract all ROCm code objects from a list of executables into dir/
3.5.1.1.4 Extract only ROCm code objects matching a regex over the Target ID
ROCm Code Objects can be listed/accessed using the following URI syntax:
code_object_uri ::== file_uri | memory_uri
file_uri ::== file:// extract_file [ range_specifier ]
memory_uri ::== memory:// process_id range_specifier
range_specifier ::== [ # | ? ] offset= number & size= number
extract_file ::== URI_ENCODED_OS_FILE_PATH
process_id ::== DECIMAL_NUMBER
number ::== HEX_NUMBER | DECIMAL_NUMBER | OCTAL_NUMBER
Examples
• file://dir1/dir2/hello_world#offset=133&size=14472
• memory://1234#offset=0x20000&size=3000
roc-obj-ls
Use this tool to list the available ROCm code objects. Code objects are listed by bundle number, entry ID, and URI. roc-obj-ls lists the URIs of the code objects embedded in the specified host executables.
Options:
• -v Verbose output; adds column headers for a more human-readable format
• -h Show this help message
roc-obj-extract
Use this tool to extract the code objects specified by URI.
Options:
• -o <path> Path for output. If "-" is specified, the code object is printed to STDOUT.
• -v Verbose output (includes Entry ID).
• -h Show this help message
Note, when specifying a URI argument to roc-obj-extract, if cutting and pasting the output from roc-obj-ls, you need to escape the '&' character or your shell will interpret it as an instruction to run the command as a background process.
3.5.2.4 Examples
HIP logging is controlled with the following environment variables:
• AMD_LOG_LEVEL
• AMD_LOG_MASK
In the HIP code, call ClPrint() function with proper input variables as needed, for example,
ClPrint(amd::LOG_INFO, amd::LOG_INIT, "Initializing HSA stack.");
Sample device properties output:
memoryBusWidth: 0
clockInstructionRate: 1000 Mhz
totalGlobalMem: 7.98 GB
maxSharedMemoryPerMultiProcessor: 64.00 KB
totalConstMem: 8573157376
sharedMemPerBlock: 64.00 KB
canMapHostMemory: 1
regsPerBlock: 0
warpSize: 32
l2CacheSize: 0
computeMode: 0
maxThreadsPerBlock: 1024
maxThreadsDim.x: 1024
maxThreadsDim.y: 1024
maxThreadsDim.z: 1024
maxGridSize.x: 2147483647
maxGridSize.y: 2147483647
maxGridSize.z: 2147483647
major: 10
minor: 12
concurrentKernels: 1
cooperativeLaunch: 0
cooperativeMultiDeviceLaunch: 0
arch.hasGlobalInt32Atomics: 1
arch.hasGlobalFloatAtomicExch: 1
arch.hasSharedInt32Atomics: 1
arch.hasSharedFloatAtomicExch: 1
arch.hasFloatAtomicAdd: 1
arch.hasGlobalInt64Atomics: 1
arch.hasSharedInt64Atomics: 1
arch.hasDoubles: 1
arch.hasWarpVote: 1
arch.hasWarpBallot: 1
arch.hasWarpShuffle: 1
arch.hasFunnelShift: 0
arch.hasThreadFenceSystem: 1
arch.hasSyncThreadsExt: 0
arch.hasSurfaceFuncs: 0
arch.has3dGrid: 1
arch.hasDynamicParallelism: 0
gcnArch: 1012
isIntegrated: 0
maxTexture1D: 65536
maxTexture2D.width: 16384
maxTexture2D.height: 16384
maxTexture3D.width: 2048
maxTexture3D.height: 2048
maxTexture3D.depth: 2048
isLargeBar: 0
Sample HIP API trace output:
:3:hip_device_runtime.cpp :471 : 23647701557: 5617 : [7fad295dd840] hipGetDeviceCount (
0x7ffdbe7db714 )
:3:hip_device_runtime.cpp :473 : 23647701608: 5617 : [7fad295dd840] hipGetDeviceCount:
Returned hipSuccess
:3:hip_peer.cpp :76 : 23647701731: 5617 : [7fad295dd840] hipDeviceCanAccessPeer (
0x7ffdbe7db728, 0, 0 )
:3:hip_peer.cpp :60 : 23647701784: 5617 : [7fad295dd840] canAccessPeer: Returned
hipSuccess
:3:hip_peer.cpp :77 : 23647701831: 5617 : [7fad295dd840] hipDeviceCanAccessPeer:
Returned hipSuccess
peers:
ltrace is a standard Linux tool that prints a message to stderr on every dynamic library call. Since ROCr and ROCt (the ROC thunk, the thin user-space interface to the ROC kernel driver) are both dynamic libraries, this provides an easy way to trace the activity in these libraries. Tracing can be a powerful way to quickly observe the flow of the application before diving into the details with a command-line debugger. ltrace is a helpful tool for visualizing the runtime behavior of the entire ROCm software stack. The trace can also show performance issues related to accidental calls to expensive APIs on the critical path.
Samples
HIP developers on ROCm can use AMD's ROCgdb for debugging and profiling. ROCgdb is the
ROCm source-level debugger for Linux, based on GDB, the GNU source-level debugger. It is
similar to cuda-gdb. It can be used with debugger frontends, such as eclipse, vscode, or gdb-
dashboard.
The sample below shows how to use ROCgdb to run and debug HIP applications. Note, ROCgdb is installed with the ROCm package in the folder /opt/rocm/bin.
$ export PATH=$PATH:/opt/rocm/bin
$ rocgdb ./hipTexObjPitch
GNU gdb (rocm-dkms-no-npi-hipclang-6549) 10.1
Copyright (C) 2020 Free Software Foundation, Inc.
License GPLv3+: GNU GPL version 3 or later <http://gnu.org/licenses/gpl.html>
...
For bug reporting instructions, please see:
<https://github.com/ROCm-Developer-Tools/ROCgdb/issues>.
Find the GDB manual and other documentation resources online at:
<http://www.gnu.org/software/gdb/documentation/>.
...
Reading symbols from ./hipTexObjPitch...
(gdb) break main
Breakpoint 1 at 0x4013d1: file /home/test/hip/tests/src/texture/hipTexObjPitch.cpp, line 98.
(gdb) run
Starting program: /home/test/hip/build/directed_tests/texture/hipTexObjPitch
[Thread debugging using libthread_db enabled]
Breakpoint 1, main ()
at /home/test/hip/tests/src/texture/hipTexObjPitch.cpp:98
98 texture2Dtest<float>();
(gdb)c
See the sections below for a description of environment variables. They are supported on the
ROCm path.
Developers can control kernel command serialization from the host using the environment variable AMD_SERIALIZE_KERNEL (and AMD_SERIALIZE_COPY for data transfers). Note, the HIP runtime can wait for the GPU to be idle before or after any GPU command, depending on the environment setting.
For systems with multiple devices, it is possible to make only certain device(s) visible to HIP by setting the environment variable HIP_VISIBLE_DEVICES. Only devices whose index is present in the sequence are visible to HIP.
For example,
$ HIP_VISIBLE_DEVICES=0,1
or in the application,
if (totalDeviceNum > 2) {
setenv("HIP_VISIBLE_DEVICES", "0,1,2", 1);
assert(getDeviceNumber(false) == 3);
... ...
}
Developers can dump the code object to analyze compiler-related issues by setting the environment variable GPU_DUMP_CODE_OBJECT. HSA also provides environment variables that help to analyze issues in drivers or hardware. For example:
• GPU_DUMP_CODE_OBJECT (default 0): Dump code object. 0: Disable. 1: Enable.
• AMD_SERIALIZE_KERNEL (default 0): Serialize kernel enqueue. 1: Wait for completion before enqueue. 2: Wait for completion after enqueue. 3: Both.
• AMD_DIRECT_DISPATCH (default 0): Enable direct kernel dispatch. 0: Disable. 1: Enable.
• From GDB, you can set environment variables "set env". Note the command does not use
an '=' sign:
(gdb) set env AMD_SERIALIZE_KERNEL 3
• The fault is caught by the runtime but was actually generated by an asynchronous command running on the GPU, so the GDB backtrace shows a path in the runtime.
• To determine the true location of the fault, force the kernels to execute synchronously by setting the environment variables AMD_SERIALIZE_KERNEL=3 AMD_SERIALIZE_COPY=3. This forces the HIP runtime to wait for the kernel to finish executing before returning. If the fault occurs during the execution of a kernel, you can see the code that launched the kernel inside the backtrace. A bit of guesswork is required to determine which thread is actually causing the issue; typically, it will be the thread that is waiting inside libhsa-runtime64.so.
• VM faults inside kernels can be caused by:
o incorrect code (i.e., a for loop that extends past array boundaries),
o memory issues - kernel arguments which are invalid (null pointers, unregistered
host pointers, bad pointers),
o synchronization issues,
o compiler issues (incorrect code generation from the compiler),
o runtime issues.
The HIP version can be queried from the following HIP API call,
hipRuntimeGetVersion(&runtimeVersion);
The version returned will always be greater than the versions in previous ROCm releases.
NOTE: The version definition of HIP runtime is different from CUDA. On the AMD platform, the
function returns HIP runtime version, while on the NVIDIA platform, it returns CUDA runtime
version. There is no mapping or a correlation between HIP version and CUDA version.
1. Add the HIP bin directory to your path.
$ export PATH=$PATH:[MYHIP]/bin
2. Define the environment variable.
$ export HIP_PATH=[MYHIP]
3. Build an executable file.
$ cd ~/hip/samples/0_Intro/square
$ make
/home/user/hip/bin/hipify-perl square.cu > square.cpp
/home/user/hip/bin/hipcc square.cpp -o square.out
/home/user/hip/bin/hipcc -use-staticlib square.cpp -o square.out.static
• Starting the port on a CUDA machine is often the easiest approach since you can
incrementally port pieces of the code to HIP while leaving the rest in CUDA. (Recall that
on CUDA machines HIP is just a thin layer over CUDA, so the two code types can
interoperate on nvcc platforms.) Also, the HIP port can be compared with the original
CUDA code for function and performance.
• Once the CUDA code is ported to HIP and is running on the CUDA machine, compile the
HIP code using the HIP compiler on an AMD machine.
• HIP ports can replace CUDA versions: HIP can deliver the same performance as a native
CUDA implementation, with the benefit of portability to both Nvidia and AMD architectures
as well as a path to future C++ standard support. You can handle platform-specific
features through the conditional compilation or by adding them to the open-source HIP
infrastructure.
• Use bin/hipconvertinplace-perl.sh to hipify all code files in the CUDA source directory.
The hipexamine-perl.sh tool will scan a source directory to determine which files contain CUDA
code and how much of that code can be automatically hipified.
> cd examples/rodinia_3.0/cuda/kmeans
> $HIP_DIR/bin/hipexamine-perl.sh .
info: hipify ./kmeans.h =====>
info: hipify ./unistd.h =====>
info: hipify ./kmeans.c =====>
info: hipify ./kmeans_cuda_kernel.cu =====>
info: converted 40 CUDA->HIP refs( dev:0 mem:0 kern:0 builtin:37 math:0 stream:0 event:0 err:0
def:0 tex:3 other:0 ) warn:0 LOC:185
info: hipify ./getopt.h =====>
info: hipify ./kmeans_cuda.cu =====>
info: converted 49 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:0 math:0 stream:0 event:0 err:0
def:0 tex:12 other:0 ) warn:0 LOC:311
info: hipify ./rmse.c =====>
info: hipify ./cluster.c =====>
info: hipify ./getopt.c =====>
info: hipify ./kmeans_clustering.c =====>
info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 stream:0 event:0
err:0 def:0 tex:15 other:0 ) warn:0 LOC:3607
kernels (1 total) : kmeansPoint(1)
hipexamine-perl scans each code file (cpp, c, h, hpp, etc.) found in the specified directory:
• Files with no CUDA code (kmeans.h) print a one-line summary just listing the source file
name.
• Files with CUDA code print a summary of what was found - for example, the
kmeans_cuda_kernel.cu file:
• Information in kmeans_cuda_kernel.cu :
o How many CUDA calls were converted to HIP (40)
o Breakdown of the CUDA functionality used (dev:0 mem:0 etc). This file uses many
CUDA builtins (37) and texture functions (3).
o Warning for code that looks like CUDA API but was not converted (0 in this file).
o Count Lines-of-Code (LOC) - 185 for this file.
• hipexamine-perl also presents a summary at the end of the process for the statistics
collected across all files. This has a similar format to the per-file reporting, and also
includes a list of all kernels which have been called. An example from above:
info: TOTAL-converted 89 CUDA->HIP refs( dev:3 mem:32 kern:2 builtin:37 math:0 stream:0 event:0
err:0 def:0 tex:15 other:0 ) warn:0 LOC:3607
kernels (1 total) : kmeansPoint(1)
The hipconvertinplace-perl.sh script performs an in-place conversion of all code files in the specified directory. This can be quite handy when dealing with an existing CUDA code base, since the script preserves the existing directory structure and filenames, so existing include paths keep working. After converting in-place, you can review the code to add additional parameters to directory names.
> hipconvertinplace-perl.sh MY_SRC_DIR
All HIP projects target either AMD or NVIDIA platform. The platform affects the headers that are
included and libraries that are used for linking.
Often, it is useful to know whether the underlying compiler is HIP-Clang or NVIDIA. This
knowledge can guard platform-specific code or aid in platform-specific performance tuning.
#ifdef __HIP_PLATFORM_AMD__
// Compiled with HIP-Clang
#endif
#ifdef __HIP_PLATFORM_NVIDIA__
// Compiled with nvcc
// Could be compiling with CUDA language extensions enabled (for example, a ".cu" file)
// Could be in pass-through mode to an underlying host compiler (for example, a ".cpp" file)
#endif

#ifdef __CUDACC__
// Compiled with nvcc (CUDA language extensions enabled)
#endif
HIP-Clang generates the host code directly (using the Clang x86 target) without passing the code to another host compiler. Thus, it has no equivalent of the __CUDACC__ define.
NVCC makes two passes over the code: one for host code and one for device code. HIP-Clang
will have multiple passes over the code: one for the host code, and one for each architecture on the
device code. __HIP_DEVICE_COMPILE__ is set to a nonzero value when the compiler (HIP-
Clang or nvcc) is compiling code for a device inside a __global__ kernel or for a device function.
__HIP_DEVICE_COMPILE__ can replace #ifdef checks on the __CUDA_ARCH__ define.
// #ifdef __CUDA_ARCH__
#if __HIP_DEVICE_COMPILE__
This type of code requires special attention since AMD and CUDA devices have different
architectural capabilities. Moreover, you cannot determine the presence of a feature using a simple
comparison against an architecture's version number. HIP provides a set of defines and device
properties to query whether a specific architectural feature is supported.
On the AMD platform, HIP uses the Radeon Open Compute common language runtime, called ROCclr. ROCclr is a virtual device interface through which HIP runtimes interact with different backends, allowing the runtimes to work on Linux and Windows without much effort.
On the NVIDIA platform, HIP is just a thin layer on top of CUDA. On a non-AMD platform, HIP
runtime determines if CUDA is available and can be used. If available, HIP_PLATFORM is set to
NVIDIA, and underneath the CUDA path is used.
4.3.6 hipLaunchKernel
hipLaunchKernel is a variadic macro that accepts as parameters the launch configurations (grid
dims, group dims, stream, dynamic shared size) followed by a variable number of kernel
arguments. This sequence is then expanded into the appropriate kernel launch syntax depending on
the platform. While this can be a convenient single-line kernel launch syntax, the macro
implementation can cause issues when nested inside other macros. For example, consider the
following:
// Will cause compile error:
#define MY_LAUNCH(command, doTrace) \
{\
if (doTrace) printf ("TRACE: %s\n", #command); \
(command); /* The nested ( ) will cause compile error */\
}
NOTE: Avoid nesting macro parameters inside parenthesis - here's an alternative that will work:
#define MY_LAUNCH(command, doTrace) \
{\
if (doTrace) printf ("TRACE: %s\n", #command); \
command;\
}
Option: Description
--fgpu-rdc: Generate relocatable device code, which allows kernels or device functions to call device functions in different translation units.
-ggdb: Equivalent to `-g` plus tuning for GDB. This is recommended when using ROCm's GDB to debug GPU code.
--gpu-max-threads-per-block=<num>: Generate code to support up to the specified number of threads per block.
https://clang.llvm.org/docs/ClangOffloadBundlerFileFormat.html#target-id
--offload-arch=X
NOTE: For backward compatibility, hipcc also accepts --amdgpu-target=X for specifying the target
ID. However, this option will be deprecated in a future release.
hipcc adds the necessary libraries for HIP as well as for the accelerator compiler (nvcc or the AMD
compiler). Linking with hipcc is recommended, since it automatically links the binary to the
necessary HIP runtime libraries and enables linking and managing GPU objects.
-lm Option
hipcc adds -lm by default to the link command.
HIP-Clang generates both device and host code using the same Clang-based compiler. The code
uses the same ABI as gcc, which allows code generated by different gcc-compatible compilers to
be linked together. For example, code compiled using HIP-Clang can link with code compiled
using "standard" compilers (such as gcc, ICC, and Clang). Take care to ensure all compilers use
the same standard C++ header and library formats.
If you pass "--stdlib=libc++" to hipcc, it will use the libc++ library. Generally, libc++ provides
a broader set of C++ features while libstdc++ is the standard for more compilers (notably
including g++).
When cross-linking C++ code, any C++ functions that use types from the C++ standard library
(including std::string, std::vector, and other containers) must use the same standard-library
implementation. Such interfaces include the following:
• Functions or kernels defined in HIP-Clang that are called from a standard compiler
• Functions defined in a standard compiler that are called from HIP-Clang
Applications with these interfaces should use the default libstdc++ linking.
Applications that are compiled entirely with hipcc, and which benefit from advanced C++ features
not supported in libstdc++, and which do not require portability to nvcc, may choose to use
libc++.
• hip_runtime_api.h: defines all the HIP runtime APIs (e.g., hipMalloc) and the types required
to call them. A source file that only calls HIP APIs, and neither defines nor launches any
kernels, can include hip_runtime_api.h alone. hip_runtime_api.h uses no custom language
features and can be compiled using a standard C++ compiler.
• hip_runtime.h: includes everything in hip_runtime_api.h and additionally provides the types
and defines required to create and launch kernels.
CUDA has slightly different content for these two files. In some cases, you may need to convert
hipified code to include the richer hip_runtime.h instead of hip_runtime_api.h.
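As an illustration, a host-only translation unit of this kind might look like the following
sketch, which can be built with a standard C++ compiler (given the HIP include path):

// host_only.cpp: calls HIP runtime APIs but defines and launches no kernels,
// so hip_runtime_api.h is sufficient.
#include <hip/hip_runtime_api.h>
#include <cstdio>

int main() {
    int count = 0;
    hipGetDeviceCount(&count);
    printf("Found %d HIP device(s)\n", count);
    return 0;
}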
You can capture the hipconfig output and pass it to the standard compiler; below is a sample
makefile syntax:
CPPFLAGS += $(shell $(HIP_PATH)/bin/hipconfig --cpp_config)
Nvcc includes some headers by default; HIP does not, and all required files must be explicitly
included. Specifically, files that call HIP runtime APIs or define HIP kernels must explicitly
include the appropriate HIP headers. If the compilation process reports that it cannot find
necessary APIs (for example, "error: identifier 'hipSetDevice' is undefined"), ensure that the
file includes hip_runtime.h (or hip_runtime_api.h, if appropriate). The hipify-perl script
automatically converts "cuda_runtime.h" to "hip_runtime.h" and "cuda_runtime_api.h" to
"hip_runtime_api.h", but it may miss nested headers or macros.
4.4.3.1 cuda.h
The HIP-Clang path provides an empty cuda.h file. Some existing CUDA programs include this
file but do not require any of the functions.
For new projects or ports which can be re-factored, we recommend the use of the extension
".hip.cpp" for source files, and ".hip.h" or ".hip.hpp" for header files. This indicates that the code
is standard C++ code, but also provides a unique indication for make tools to run hipcc when
appropriate.
4.5 Workarounds
4.5.1 memcpyToSymbol
HIP support for hipMemcpyToSymbol is complete. This feature allows a kernel to define a
device-side data symbol that can be accessed on the host side. The symbol can be in __constant__
or device space.
Note that the symbol name needs to be encased in the HIP_SYMBOL macro, as shown in the code
example below. This also applies to hipMemcpyFromSymbol, hipGetSymbolAddress, and
hipGetSymbolSize.
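A minimal sketch of the pattern (the __constant__ array constData and its size are illustrative):

#include <hip/hip_runtime.h>

__constant__ float constData[4]; // device-side symbol in __constant__ space

int main() {
    float hostData[4] = {1.0f, 2.0f, 3.0f, 4.0f};
    // The symbol is wrapped in HIP_SYMBOL in both directions.
    hipMemcpyToSymbol(HIP_SYMBOL(constData), hostData, sizeof(hostData));

    float readBack[4];
    hipMemcpyFromSymbol(readBack, HIP_SYMBOL(constData), sizeof(readBack));
    return 0;
}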
4.5.2 CU_POINTER_ATTRIBUTE_MEMORY_TYPE
To get a pointer's memory type in HIP/HIP-Clang, use the hipPointerGetAttributes API. The first
parameter of the API is a hipPointerAttribute_t, which has memoryType as a member variable.
memoryType indicates whether the input pointer was allocated on the device or on the host.
For example:
double* ptr;
hipMalloc(reinterpret_cast<void**>(&ptr), sizeof(double));
hipPointerAttribute_t attr;
hipPointerGetAttributes(&attr, ptr); /* attr.memoryType will be hipMemoryTypeDevice */

double* ptrHost;
hipHostMalloc(&ptrHost, sizeof(double));
hipPointerAttribute_t attrHost;
hipPointerGetAttributes(&attrHost, ptrHost); /* attrHost.memoryType will be hipMemoryTypeHost */
4.5.3 threadfence_system
__threadfence_system() makes all device memory writes, all writes to mapped host memory, and all
writes to peer memory visible to the CPU and other GPU devices. Some implementations can provide
this behavior by flushing the GPU L2 cache, but HIP/HIP-Clang does not provide this functionality.
As a workaround, users can set the environment variable HSA_DISABLE_CACHE=1 to disable the GPU
L2 cache. This affects all accesses in all kernels, and so may have a performance impact.
4.5.4 Textures and Cache Control
AMD compilers currently load all data into both the L1 and L2 caches, so __ldg is treated as a
no-op. We recommend the following for functional portability:
• For programs that use textures only to benefit from improved caching, use the __ldg
intrinsic instead (a sketch follows this list)
• Programs that use the texture object and reference APIs work well on HIP
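A minimal device-code sketch of the first recommendation; the kernel and its parameters are
hypothetical:

#include <hip/hip_runtime.h>

__global__ void copyKernel(const float* __restrict__ in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        // __ldg hints a read-only cached load on NVIDIA hardware;
        // on AMD it behaves as a plain load (the hint is a no-op).
        out[i] = __ldg(&in[i]);
    }
}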
Refer to the section on HIP Logging in this document for more information.
• Both APIs support events, streams, memory management, memory copy, and error
handling.
• Both APIs deliver similar performance.
• Driver API calls begin with the prefix cu, while Runtime API calls begin with the prefix
cuda. For example, the Driver API contains cuEventCreate, while the Runtime API contains
cudaEventCreate, with similar functionality.
• The Driver API defines an error code space that largely overlaps the Runtime API's, but
uses a different naming convention. For example, the Driver API defines
CUDA_ERROR_INVALID_VALUE, while the Runtime API defines cudaErrorInvalidValue.
NOTE: The Driver API offers two additional pieces of functionality not provided by the Runtime
API: cuModule and cuCtx APIs.
Both the Driver and Runtime APIs define a function for launching kernels (called cuLaunchKernel
or cudaLaunchKernel, respectively). The kernel arguments and the execution configuration (grid
dimensions, group dimensions, dynamic shared memory, and stream) are passed as arguments to the
launch function. The Runtime additionally provides the <<< >>> syntax for launching kernels,
which resembles a special function call and is easier to use than the explicit launch API (in
particular, the handling of kernel arguments). However, this syntax is not standard C++ and is
available only when NVCC is used to compile the host code.
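To make the difference concrete, the following sketch shows the two equivalent launch forms in
HIP terms; the kernel addOne is hypothetical:

#include <hip/hip_runtime.h>

__global__ void addOne(int* x) { *x += 1; }

int main() {
    int* d_x;
    hipMalloc((void**)&d_x, sizeof(int));

    // Triple-chevron syntax: concise, but non-standard C++.
    addOne<<<dim3(1), dim3(1), 0, 0>>>(d_x);

    // Portable macro form that expands to the platform's explicit launch API.
    hipLaunchKernelGGL(addOne, dim3(1), dim3(1), 0, 0, d_x);

    hipDeviceSynchronize();
    hipFree(d_x);
    return 0;
}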
The Module features are useful in an environment that generates the code objects directly, such as
a new accelerator language front-end. Here, NVCC is not used. Instead, the environment may have
a different kernel language or a different compilation flow. Other environments have many kernels
and do not want them all to be loaded automatically. The Module functions can be used to load the
generated code objects and launch kernels. As we will see below, HIP defines a Module API
which provides similar explicit control over code object management.
The CUDA Runtime API unifies the Context API with the Device API. This simplifies the APIs and
causes little loss of functionality, since each context can contain a single device and the
benefits of multiple contexts have been replaced by other interfaces. HIP provides a context API
to facilitate easy porting from existing Driver code. In HIP, the Ctx functions largely provide
an alternate syntax for changing the active device. Most new applications will prefer
hipSetDevice or the stream APIs; therefore, HIP has marked the hipCtx APIs as deprecated. Support
for these APIs may not be available in future releases. For more details on deprecated APIs,
refer to the HIP deprecated API list at:
https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip_deprecated_api_list.md
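For example, a minimal sketch of the preferred device-selection path, replacing explicit hipCtx
management:

#include <hip/hip_runtime.h>

int main() {
    // Select the active device directly instead of creating a context.
    hipSetDevice(0);

    // Streams are created against the active device.
    hipStream_t stream;
    hipStreamCreate(&stream);
    // ... enqueue work on the stream ...
    hipStreamDestroy(stream);
    return 0;
}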
Like the CUDA Driver API, the Module API provides additional control over how code is loaded,
including options to load code from files or from in-memory pointers. NVCC and HIP-Clang target
different architectures and use different code object formats: the NVCC path uses `cubin` files
or `ptx` text, while the HIP-Clang path uses the `hsaco` format. The external compilers that
generate these code objects are responsible for generating and loading the correct code object
for each platform. Notably, there is no fat binary format that can contain code for both NVCC and
HIP-Clang platforms. The following table summarizes the formats used on each platform:

Format        APIs                                NVCC                  HIP-Clang
Code object   hipModuleLoad, hipModuleLoadData    .cubin or PTX text    .hsaco
Fat binary    hipModuleLoad                       .fatbin               .hip_fatbin
`hipcc` uses HIP-Clang or NVCC to compile host code. Both may embed code objects into the
final executable, and these code objects will be automatically loaded when the application starts.
The hipModule API can be used to load additional code objects, and in this way provides an
extended capability to the automatically loaded code objects. HIP-Clang allows both capabilities
to be used together if desired. It is possible to create a program with no kernels and thus no
automatic loading.
HIP defines a single error space and uses camel-case for all errors (e.g., hipErrorInvalidValue).
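For example, a minimal error-checking sketch against this error space:

#include <hip/hip_runtime.h>
#include <cstdio>

int main() {
    void* ptr = nullptr;
    hipError_t err = hipMalloc(&ptr, 1024);
    if (err != hipSuccess) {
        // All HIP errors share one space and one naming convention.
        printf("HIP error: %s\n", hipGetErrorString(err));
        return 1;
    }
    hipFree(ptr);
    return 0;
}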
When the executable or shared library is loaded by the dynamic linker, its initialization
functions are called. In the initialization functions, when __hipRegisterFatBinary is called, the
code objects containing all kernels are loaded; when __hipRegisterFunction is called, the stub
functions are associated with the corresponding kernels in the code objects. HIP-Clang implements
two sets of kernel launch APIs.
By default, in the host code, for each <<<>>> statement, hip-clang first emits a call to
hipConfigureCall to set up the threads and grids, and then emits a call to the stub function with
the given arguments. In the stub function, hipSetupArgument is called for each kernel argument;
then hipLaunchByPtr is called with a function pointer to the stub function. In hipLaunchByPtr,
the real kernel associated with the stub function is launched.
If the HIP program is compiled with -fhip-new-launch-api, then for each <<<>>> statement in the
host code, hip-clang first emits a call to __hipPushCallConfiguration to save the grid dimension,
block dimension, shared memory usage, and stream to a stack, and then emits a call to the stub
function with the given arguments. In the stub function, __hipPopCallConfiguration is called to
retrieve the saved grid dimension, block dimension, shared memory usage, and stream; then
hipLaunchKernel is called with a function pointer to the stub function. In hipLaunchKernel, the
real kernel associated with the stub function is launched.
For example:
CUDA
CUmodule module;
void *imagePtr = ...; // Somehow populate data pointer with code object
const int numOptions = 1;
CUjit_option options[numOptions];
void *optionValues[numOptions];
options[0] = CU_JIT_MAX_REGISTERS;
unsigned maxRegs = 15;
optionValues[0] = (void*)(&maxRegs);
cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues);
CUfunction k;
cuModuleGetFunction(&k, module, "myKernel");
HIP
hipModule_t module;
void *imagePtr = ...; // Somehow populate data pointer with code object
const int numOptions = 1;
hipJitOption options[numOptions];
void *optionValues[numOptions];
options[0] = hipJitOptionMaxRegisters;
unsigned maxRegs = 15;
optionValues[0] = (void*)(&maxRegs);
// hipModuleLoadData(&module, imagePtr) will be called on the HIP-Clang path
// (the JIT options are not used), and
// cuModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues)
// will be called on the NVCC path
hipModuleLoadDataEx(&module, imagePtr, numOptions, options, optionValues);
hipFunction_t k;
hipModuleGetFunction(&k, module, "myKernel");
The following complete sample loads a code object from a file and launches a kernel through the
hipModule API; the code object file name and kernel name are placeholders for artifacts built
separately:

#include <hip/hip_runtime.h>
#include <hip/hip_runtime_api.h>
#include <cstring>
#include <iostream>
#include <vector>

#define LEN 64
#define SIZE (LEN * sizeof(float))

// Placeholder names; the actual code object file and kernel name depend on
// how the code object was built (e.g., .hsaco on AMD or .ptx on NVIDIA).
#define fileName "vcpy_kernel.code"
#define kernel_name "hello_world"

int main() {
    float A[LEN], B[LEN];
    float *Ad, *Bd;
    for (uint32_t i = 0; i < LEN; i++) {
        A[i] = i * 1.0f;
        B[i] = 0.0f;
    }

#ifdef __HIP_PLATFORM_NVIDIA__
    // The NVIDIA path requires explicit initialization and a context.
    hipInit(0);
    hipDevice_t device;
    hipCtx_t context;
    hipDeviceGet(&device, 0);
    hipCtxCreate(&context, 0, device);
#endif

    hipMalloc((void**)&Ad, SIZE);
    hipMalloc((void**)&Bd, SIZE);
    hipMemcpyHtoD((hipDeviceptr_t)Ad, A, SIZE);
    hipMemcpyHtoD((hipDeviceptr_t)Bd, B, SIZE);

    hipModule_t Module;
    hipFunction_t Function;
    hipModuleLoad(&Module, fileName);
    hipModuleGetFunction(&Function, Module, kernel_name);

    // Pack the kernel arguments into a buffer and describe it with the
    // HIP_LAUNCH_PARAM_* markers.
    std::vector<void*> argBuffer(2);
    memcpy(&argBuffer[0], &Ad, sizeof(void*));
    memcpy(&argBuffer[1], &Bd, sizeof(void*));
    size_t size = argBuffer.size() * sizeof(void*);

    void *config[] = {
        HIP_LAUNCH_PARAM_BUFFER_POINTER, &argBuffer[0],
        HIP_LAUNCH_PARAM_BUFFER_SIZE, &size,
        HIP_LAUNCH_PARAM_END
    };
    hipModuleLaunchKernel(Function, 1, 1, 1, LEN, 1, 1, 0, 0, NULL, (void**)&config);

    hipMemcpyDtoH(B, (hipDeviceptr_t)Bd, SIZE);
    for (uint32_t i = 0; i < LEN; i++) {
        std::cout << A[i] << " - " << B[i] << std::endl;
    }

#ifdef __HIP_PLATFORM_NVIDIA__
    hipCtxDetach(context);
#endif

    hipFree(Ad);
    hipFree(Bd);
    return 0;
}
// Host code:
texture<float, 2, hipReadModeElementType> tex;
void myFunc()
{
    // ...
}
• cuBLAS
• cuRAND
• cuFFT
• cuSPARSE
• cuDNN
https://github.com/RadeonOpenCompute/ROCm/blob/master/AMD-HIP-API-4.5.pdf
https://github.com/ROCm-Developer-Tools/HIP/blob/main/docs/markdown/hip-math-api.md
Chapter 6 Appendix C
https://rocmdocs.amd.com/en/latest/Programming_Guides/HIP-FAQ.html#hip-faq