Decoding CUDA Binary File Format
Executable and Linkable Format [1], abbreviated to ELF, is a standard file format typically used on Unix and Unix-like systems. A variant of this format is also used by NVIDIA software to package low-level GPU code. An ELF file consists of four components: the ELF header; the section header table, which describes each of the ELF file’s sections; the program header table, which defines the memory segments; and finally the sections themselves.

Here we describe the file format of CUDA applications, which is vital when trying to modify existing code. We performed all of our experimentation on Linux (Ubuntu and openSUSE), so information concerning the CPU executable may not apply to other operating systems such as Windows. Information concerning the GPU ELF, however, applies regardless of the operating system.

GPU ELF

Every CUDA program has one or more executable ELF files embedded inside of it, which contain the GPU code. Notably, this nested ELF can differ from typical ELF files, for example by carrying a unique version number, which may cause problems for existing programs and libraries that try to analyze it.

Every ELF’s header has an attribute named e_version, which is usually set to 1; other values are typically considered invalid. The embedded GPU ELF files created by some versions of nvcc follow this convention, but other versions set the value to the compiler version. For example, with nvcc version 6.5, e_version is set to 65, and with nvcc version 8.0, e_version is set to 80. The ELF header is otherwise as expected, though the e_machine attribute (which indicates the architecture) is set to 190 to indicate CUDA.

The compiler will usually generate mangled names for each kernel function. For example, a kernel with the name "foo" and one integer parameter will likely have the mangled name "_Z3fooi". The ELF will use this name instead of the original, unmangled version.

For each kernel it describes, the ELF will contain a section called ".text.func", where func is the mangled name of the kernel, containing the function’s binary code. The highest eight bits of the INFO attribute for this section control the number of registers which will be allocated per thread, and the lowest eight bits hold this section’s index in the symbol table.

For a kernel function which uses shared memory, the ELF will also contain a section named ".nv.shared.func" for each kernel "func"; the size of this section is the number of bytes of shared memory which the GPU will allocate per thread-block for the associated kernel function. Similarly, the ELF can contain sections named ".nv.constantX.func" for different values of X, allocating (and possibly initializing) constant memory for the kernel functions. Each .nv... section’s INFO attribute holds the associated .text... section’s index in the section header table.

Several of the sections have an entry in the symbol table of the GPU ELF. Additionally, the symbol table can have entries for subroutines within the kernel functions, with an st_info value of 34. These have an st_shndx value equal to the section number of the associated kernel function’s .text... section, and an st_value equal to their offset inside the kernel function’s code.

.nv.info

The ELF also contains sections named ".nv.info.func" for each kernel "func", and also a ".nv.info" section, containing various pieces of metadata. For example, the amount of local memory allocated per thread is controlled by the MIN_STACK_SIZE, FRAME_SIZE, and MAX_STACK_SIZE attributes inside of .nv.info.

The .nv.info sections contain one or more attributes. An attribute starts with a byte indicating the attribute format, and then a byte with the attribute ID. If the attribute format is 1 (NVAL), then there is no associated value. If the attribute format is 2 (BVAL), then the following byte contains the attribute’s value. If the attribute format is 3 (HVAL), then the following two bytes contain the attribute’s value. Finally, if the attribute format is 4 (SVAL), then the following two bytes contain some value, n, and the next n bytes contain the attribute’s values.

For example, for FRAME_SIZE, the attribute format will be 4 (SVAL), the next two bytes after the attribute ID will contain the size 8, the next four bytes contain the associated kernel function’s index in the symbol table, and the remaining four bytes contain the actual frame size value.

In Table 1, we list all attribute IDs we are aware of that can appear in the .nv.info sections, with their human-readable name and their number in the actual binary. Some attributes are only compatible with more recent versions of the CUDA SDK.

CPU Executable

With older versions of nvcc, the GPU ELF described above was stored inside the .rodata section of the CPU ELF. Starting with version 5.0, released in 2012, the compiler instead creates an executable containing dedicated sections for GPU code.

Most important is the .nv_fatbin section. It is split into an arbitrary number of distinct regions, each of which contains one or more GPU ELF files, PTX code files, and/or cubin files. Each region begins with a 16-byte header: the first eight bytes are the .nv_fatbin magic number, and the remaining eight bytes contain the size of the rest of the region. The rest of the region alternates between detailed headers and the embedded file (ELF, PTX, or cubin) which the detailed header describes.

In the detailed header, the first 4-byte word contains the embedded file’s type and ptxas flags; the lower two bytes
have a value of 2 for GPU ELF files. The second word is the offset of the embedded file, relative to the start of this detailed header. The 8-byte value comprising the third and fourth words holds the size of the embedded file. The seventh word is the code version, which depends on the compiler. The eighth word contains the target architecture: a value of 20 for compute capability 2.0, a value of 35 for compute capability 3.5, and so on. The rest of the detailed header contains less important metadata, such as the operating system or the source code’s filename.

Another section of the CPU ELF that is unique to CUDA programs is called .nvFatBinSegment. It contains metadata about the .nv_fatbin section, such as the starting addresses of its regions. Its size is a multiple of six words (24 bytes), where the third word in each group of six is an address inside of the .nv_fatbin section. If we modify the .nv_fatbin section, then these addresses need to be changed to match it.

Side Effects of Modification

If we modify the size of the GPU kernel in a way that requires us to change the size of the .nv_fatbin section, then adjusting the parts described above is insufficient to keep the executable working. Other changes need to be made to prevent the program from simply crashing.

Increasing the size of the GPU code shifts the addresses of various parts of the executable. As such, several changes need to be made to the CPU executable. The offsets for several sections need to be fixed in the program header table. We also scan the CPU assembly code for any addresses which point anywhere after the changed part of the .nv_fatbin section, and increment them appropriately in the binary. Similarly, we fix such addresses inside several sections, including any symbol tables (indicated by a type of SHT_SYMTAB or SHT_DYNSYM), relocation tables (with type SHT_RELA), dynamic tables (with type SHT_DYNAMIC), and global offset tables (with name ".got").

While we find that these fixes seem to work in practice, it is difficult to guarantee that they will be successful. For example, cases may arise where other data is mistakenly treated as an address and incremented, creating errors. As such, whenever possible, it is best to prepare the executable in such a way that modifications will not require extra space for additional code.

Table 1. Known .nv.info attributes.

Attribute (EIATTR)            ID
ERROR                         0x00
PAD                           0x01
IMAGE_SLOT                    0x02
JUMPTABLE_RELOCS              0x03
CTAIDZ_USED                   0x04
MAX_THREADS                   0x05
IMAGE_OFFSET                  0x06
IMAGE_SIZE                    0x07
TEXTURE_NORMALIZED            0x08
SAMPLER_INIT                  0x09
PARAM_CBANK                   0x0a
SMEM_PARAM_OFFSETS            0x0b
CBANK_PARAM_OFFSETS           0x0c
SYNC_STACK                    0x0d
TEXID_SAMPID_MAP              0x0e
EXTERNS                       0x0f
REQNTID                       0x10
FRAME_SIZE                    0x11
MIN_STACK_SIZE                0x12
SAMPLER_FORCE_UNNORMALIZED    0x13
BINDLESS_IMAGE_OFFSETS        0x14
BINDLESS_TEXTURE_BANK         0x15
BINDLESS_SURFACE_BANK         0x16
KPARAM_INFO                   0x17
SMEM_PARAM_SIZE               0x18
CBANK_PARAM_SIZE              0x19
QUERY_NUMATTRIB               0x1a
MAXREG_COUNT                  0x1b
EXIT_INSTR_OFFSETS            0x1c
S2RCTAID_INSTR_OFFSETS        0x1d
CRS_STACK_SIZE                0x1e
NEED_CNP_WRAPPER              0x1f
NEED_CNP_PATCH                0x20
EXPLICIT_CACHING              0x21
ISTYPEP_USED                  0x22
MAX_STACK_SIZE                0x23
SUQ_USED                      0x24
LD_CACHEMOD_INSTR_OFFSETS     0x25
LOAD_CACHE_REQUEST            0x26
ATOM_SYS_INSTR_OFFSETS        0x27
COOP_GROUP_INSTR_OFFSETS      0x28
COOP_GROUP_MASK_REGIDS        0x29
SW1850030_WAR                 0x2a
WMMA_USED                     0x2b

References

[1] TIS Committee. Tool Interface Standard (TIS) Executable and Linking Format (ELF) Specification, Version 1.2. TIS Committee (1995).
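
The e_version and e_machine observations from the GPU ELF discussion can be checked with a few lines of Python. The following is only a minimal sketch: the function name read_elf_ident and the synthetic header bytes are made up for illustration, while the field positions (e_type, e_machine, and e_version directly after the 16-byte e_ident array) come from the standard ELF header layout [1].

```python
import struct

EM_CUDA = 190  # e_machine value that the text reports for GPU ELF files


def read_elf_ident(data: bytes) -> dict:
    """Return e_machine and e_version from an ELF image held in memory."""
    if data[:4] != b"\x7fELF":
        raise ValueError("not an ELF image")
    # Byte 5 of e_ident (EI_DATA) selects the byte order: 1 = little-endian.
    endian = "<" if data[5] == 1 else ">"
    # e_type (2 bytes), e_machine (2 bytes), e_version (4 bytes) follow e_ident.
    _e_type, e_machine, e_version = struct.unpack_from(endian + "HHI", data, 16)
    return {
        "e_machine": e_machine,
        "e_version": e_version,
        "is_cuda": e_machine == EM_CUDA,
    }


# Synthetic 24-byte header prefix (not taken from a real cubin): 64-bit class,
# little-endian, e_machine = 190, e_version = 65 as described for nvcc 6.5.
header = b"\x7fELF" + bytes([2, 1, 1]) + bytes(9) + struct.pack("<HHI", 2, 190, 65)
info = read_elf_ident(header)
print(info)
```

A tool that rejects any e_version other than 1 would refuse such a header, which is the incompatibility the text warns about.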
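
The attribute encoding described for the .nv.info sections can be sketched as a small parser. This is an illustration that follows the description literally and has not been validated against real binaries: the EIFMT_* constant names and the function parse_nv_info are hypothetical, little-endian byte order is an assumption, and the sample record is synthetic.

```python
import struct

# Attribute formats as numbered in the text; the EIFMT_ prefix is our own naming.
EIFMT_NVAL, EIFMT_BVAL, EIFMT_HVAL, EIFMT_SVAL = 1, 2, 3, 4
EIATTR_FRAME_SIZE = 0x11  # ID taken from Table 1


def parse_nv_info(data: bytes):
    """Yield (attribute_id, value) pairs; value is None, an int, or bytes."""
    pos = 0
    while pos < len(data):
        start = pos
        fmt, attr_id = data[pos], data[pos + 1]
        pos += 2
        if fmt == EIFMT_NVAL:        # no associated value
            value = None
        elif fmt == EIFMT_BVAL:      # one-byte value
            value = data[pos]
            pos += 1
        elif fmt == EIFMT_HVAL:      # two-byte value (little-endian assumed)
            (value,) = struct.unpack_from("<H", data, pos)
            pos += 2
        elif fmt == EIFMT_SVAL:      # two-byte length n, then n payload bytes
            (size,) = struct.unpack_from("<H", data, pos)
            pos += 2
            value = data[pos:pos + size]
            pos += size
        else:
            raise ValueError(f"unknown attribute format {fmt} at offset {start}")
        yield attr_id, value


# Synthetic FRAME_SIZE record: SVAL, ID 0x11, size 8, symbol index 5, frame size 64.
record = struct.pack("<BBHII", EIFMT_SVAL, EIATTR_FRAME_SIZE, 8, 5, 64)
attrs = list(parse_nv_info(record))
attr_id, payload = attrs[0]
symbol_index, frame_size = struct.unpack("<II", payload)
print(hex(attr_id), symbol_index, frame_size)
```

Returning the SVAL payload as raw bytes keeps the parser generic; only the caller needs to know that FRAME_SIZE splits its eight bytes into a symbol index and a size.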
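
The .nv_fatbin region header and detailed header layout from the CPU Executable discussion might be decoded along the following lines. This is a sketch under stated assumptions: all fields are taken to be little-endian, the lower-numbered word of the 8-byte size is assumed to hold the low bits, the function and key names are hypothetical, and the magic number is treated as opaque bytes because the text does not give its value. The sample data is synthetic, with a 32-byte detailed header chosen only for the example.

```python
import struct


def parse_region_header(blob: bytes, offset: int = 0):
    """Split the 16-byte region header into (magic_bytes, size_of_rest)."""
    magic = blob[offset:offset + 8]  # left for the caller to compare
    (body_size,) = struct.unpack_from("<Q", blob, offset + 8)
    return magic, body_size


def parse_detailed_header(blob: bytes, offset: int) -> dict:
    """Decode the detailed-header fields that the text describes."""
    w = struct.unpack_from("<8I", blob, offset)  # first eight 4-byte words
    return {
        "kind": w[0] & 0xFFFF,             # lower two bytes; 2 means GPU ELF
        "file_offset": offset + w[1],      # word 2 is relative to this header
        "file_size": w[2] | (w[3] << 32),  # words 3 and 4 as one 8-byte value
        "code_version": w[6],              # seventh word
        "arch": w[7],                      # eighth word, e.g. 35 for 3.5
    }


# Synthetic region: placeholder magic, one detailed header, one payload.
embedded_file = b"\x7fELF-demo"
detail = struct.pack("<IIQIIII", 2, 32, len(embedded_file), 0, 0, 65, 35)
body = detail + embedded_file
region = bytes(8) + struct.pack("<Q", len(body)) + body

magic, body_size = parse_region_header(region)
hdr = parse_detailed_header(region, 16)
start = hdr["file_offset"]
extracted = region[start:start + hdr["file_size"]]
print(hdr, extracted)
```

Since the region alternates between detailed headers and embedded files, a full walker would presumably continue at file_offset + file_size until size_of_rest bytes have been consumed; that loop is omitted here.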
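
The .nvFatBinSegment adjustment described above, where the third word of every six-word group must follow the .nv_fatbin contents when they move, could look roughly like this. The sketch takes the text at face value and treats the third word as a 4-byte little-endian address; the function name, the changed_at and delta parameters, and the sample addresses are all hypothetical.

```python
import struct

GROUP_SIZE = 24  # six 4-byte words per group, as stated in the text


def shift_fatbin_addresses(segment: bytes, changed_at: int, delta: int) -> bytes:
    """Add delta to every third-word address that points past changed_at."""
    if len(segment) % GROUP_SIZE:
        raise ValueError("segment size is not a multiple of 24 bytes")
    out = bytearray(segment)
    for base in range(0, len(out), GROUP_SIZE):
        (addr,) = struct.unpack_from("<I", out, base + 8)  # third word
        if addr > changed_at:  # only addresses after the modified part move
            struct.pack_into("<I", out, base + 8, addr + delta)
    return bytes(out)


# Two synthetic groups whose third words point at 0x1000 and 0x2000.
segment = struct.pack("<6I", 1, 1, 0x1000, 0, 0, 0) + struct.pack("<6I", 1, 1, 0x2000, 0, 0, 0)
patched = shift_fatbin_addresses(segment, changed_at=0x1800, delta=0x100)
first_addr = struct.unpack_from("<I", patched, 8)[0]
second_addr = struct.unpack_from("<I", patched, 32)[0]
print(hex(first_addr), hex(second_addr))
```

The same compare-and-shift step is the core of the wider fix-ups listed under Side Effects of Modification, which is also why data that merely looks like an address can be shifted by mistake.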