CUDA Compiler Driver NVCC (TRM-06721-001_v10.2)
Reference Guide
TABLE OF CONTENTS
4.2.2.9. --cubin (-cubin)
4.2.2.10. --ptx (-ptx)
4.2.2.11. --preprocess (-E)
4.2.2.12. --generate-dependencies (-M)
4.2.2.13. --generate-nonsystem-dependencies (-MM)
4.2.2.14. --generate-dependencies-with-compile (-MD)
4.2.2.15. --generate-nonsystem-dependencies-with-compile (-MMD)
4.2.2.16. --run (-run)
4.2.3. Options for Specifying Behavior of Compiler/Linker
4.2.3.1. --profile (-pg)
4.2.3.2. --debug (-g)
4.2.3.3. --device-debug (-G)
4.2.3.4. --extensible-whole-program (-ewp)
4.2.3.5. --generate-line-info (-lineinfo)
4.2.3.6. --optimize level (-O)
4.2.3.7. --ftemplate-backtrace-limit limit (-ftemplate-backtrace-limit)
4.2.3.8. --ftemplate-depth limit (-ftemplate-depth)
4.2.3.9. --shared (-shared)
4.2.3.10. --x {c|c++|cu} (-x)
4.2.3.11. --std {c++03|c++11|c++14} (-std)
4.2.3.12. --no-host-device-initializer-list (-nohdinitlist)
4.2.3.13. --expt-relaxed-constexpr (-expt-relaxed-constexpr)
4.2.3.14. --extended-lambda (-extended-lambda)
4.2.3.15. --expt-extended-lambda (-expt-extended-lambda)
4.2.3.16. --machine {32|64} (-m)
4.2.4. Options for Passing Specific Phase Options
4.2.4.1. --compiler-options options,... (-Xcompiler)
4.2.4.2. --linker-options options,... (-Xlinker)
4.2.4.3. --archive-options options,... (-Xarchive)
4.2.4.4. --ptxas-options options,... (-Xptxas)
4.2.4.5. --nvlink-options options,... (-Xnvlink)
4.2.5. Options for Guiding the Compiler Driver
4.2.5.1. --dont-use-profile (-noprof)
4.2.5.2. --dryrun (-dryrun)
4.2.5.3. --verbose (-v)
4.2.5.4. --keep (-keep)
4.2.5.5. --keep-dir directory (-keep-dir)
4.2.5.6. --save-temps (-save-temps)
4.2.5.7. --clean-targets (-clean)
4.2.5.8. --run-args arguments,... (-run-args)
4.2.5.9. --input-drive-prefix prefix (-idp)
4.2.5.10. --dependency-drive-prefix prefix (-ddp)
4.2.5.11. --drive-prefix prefix (-dp)
4.2.5.12. --dependency-target-name target (-MT)
4.2.5.13. --no-align-double
4.2.5.14. --no-device-link (-nodlink)
4.2.6. Options for Steering CUDA Compilation
4.2.6.1. --default-stream {legacy|null|per-thread} (-default-stream)
4.2.7. Options for Steering GPU Code Generation
4.2.7.1. --gpu-architecture arch (-arch)
4.2.7.2. --gpu-code code,... (-code)
4.2.7.3. --generate-code specification (-gencode)
4.2.7.4. --relocatable-device-code {true|false} (-rdc)
4.2.7.5. --entries entry,... (-e)
4.2.7.6. --maxrregcount amount (-maxrregcount)
4.2.7.7. --use_fast_math (-use_fast_math)
4.2.7.8. --ftz {true|false} (-ftz)
4.2.7.9. --prec-div {true|false} (-prec-div)
4.2.7.10. --prec-sqrt {true|false} (-prec-sqrt)
4.2.7.11. --fmad {true|false} (-fmad)
4.2.7.12. --compile-as-tools-patch (-astoolspatch)
4.2.7.13. --keep-device-functions (-keep-device-functions)
4.2.8. Generic Tool Options
4.2.8.1. --disable-warnings (-w)
4.2.8.2. --source-in-ptx (-src-in-ptx)
4.2.8.3. --restrict (-restrict)
4.2.8.4. --Wno-deprecated-gpu-targets (-Wno-deprecated-gpu-targets)
4.2.8.5. --Wno-deprecated-declarations (-Wno-deprecated-declarations)
4.2.8.6. --Wreorder (-Wreorder)
4.2.8.7. --Werror kind,... (-Werror)
4.2.8.8. --resource-usage (-res-usage)
4.2.8.9. --help (-h)
4.2.8.10. --version (-V)
4.2.8.11. --options-file file,... (-optf)
4.2.8.12. --time filename (-time)
4.2.8.13. --qpp-config config (-qpp-config)
4.2.9. Phase Options
4.2.9.1. Ptxas Options
4.2.9.2. NVLINK Options
Chapter 5. GPU Compilation
5.1. GPU Generations
5.2. GPU Feature List
5.3. Application Compatibility
5.4. Virtual Architectures
5.5. Virtual Architecture Feature List
5.6. Further Mechanisms
5.6.1. Just-in-Time Compilation
5.6.2. Fatbinaries
5.7. NVCC Examples
5.7.1. Base Notation
5.7.2. Shorthand
5.7.2.1. Shorthand 1
5.7.2.2. Shorthand 2
5.7.2.3. Shorthand 3
5.7.3. Extended Notation
5.7.4. Virtual Architecture Identification Macro
Chapter 6. Using Separate Compilation in CUDA
6.1. Code Changes for Separate Compilation
6.2. NVCC Options for Separate Compilation
6.3. Libraries
6.4. Examples
6.5. Potential Separate Compilation Issues
6.5.1. Object Compatibility
6.5.2. JIT Linking Support
6.5.3. Implicit CUDA Host Code
6.5.4. Using __CUDA_ARCH__
6.5.5. Device Code in Libraries
Chapter 7. Miscellaneous NVCC Usage
7.1. Cross Compilation
7.2. Keeping Intermediate Phase Files
7.3. Cleaning Up Generated Files
7.4. Printing Code Generation Statistics
Chapter 1.
INTRODUCTION
1.1. Overview
1.1.1. CUDA Programming Model
The CUDA Toolkit targets a class of applications whose control part runs as a process
on a general purpose computing device, and which use one or more NVIDIA GPUs as
coprocessors for accelerating single program, multiple data (SPMD) parallel jobs. Such jobs
are self-contained, in the sense that they can be executed and completed by a batch of
GPU threads entirely without intervention by the host process, thereby gaining optimal
benefit from the parallel graphics hardware.
The GPU code is implemented as a collection of functions in a language that is
essentially C++, but with some annotations for distinguishing them from the host code,
plus annotations for distinguishing different types of data memory that exists on the
GPU. Such functions may have parameters, and they can be called using a syntax that is
very similar to regular C function calling, but slightly extended for being able to specify
the matrix of GPU threads that must execute the called function. During its life time, the
host process may dispatch many parallel GPU tasks.
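As a minimal sketch (not taken from this guide's examples), the following fragment defines such a GPU function and launches it over a one-dimensional matrix of 256 threads:

__global__ void scale(float *data, float factor)      // GPU (device) function
{
    data[threadIdx.x] *= factor;                       // each thread handles one element
}

int main()
{
    float *d_data;
    cudaMalloc((void**)&d_data, 256 * sizeof(float));  // allocate a GPU memory buffer
    scale<<<1, 256>>>(d_data, 2.0f);                   // launch 1 block of 256 GPU threads
    cudaDeviceSynchronize();
    cudaFree(d_data);
    return 0;
}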
For more information on the CUDA programming model, consult the CUDA C
Programming Guide.
1.1.2. CUDA Sources
Source files for CUDA applications consist of a mixture of conventional C++ host code,
plus GPU device functions. The CUDA compilation trajectory separates the device
functions from the host code, compiles the device functions using the proprietary
NVIDIA compilers and assembler, compiles the host code using a C++ host compiler
that is available, and afterwards embeds the compiled GPU functions as fatbinary
images in the host object file. In the linking stage, specific CUDA runtime libraries are
added for supporting remote SPMD procedure calling and for providing explicit GPU
manipulation such as allocation of GPU memory buffers and host-GPU data transfer.
1.1.3. Purpose of NVCC
The compilation trajectory involves several splitting, compilation, preprocessing, and
merging steps for each CUDA source file. It is the purpose of nvcc, the CUDA compiler
driver, to hide the intricate details of CUDA compilation from developers. It accepts a
range of conventional compiler options, such as for defining macros and include/library
paths, and for steering the compilation process. All non-CUDA compilation steps are
forwarded to a C++ host compiler that is supported by nvcc, and nvcc translates its
options to appropriate host compiler command line options.
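For instance, a typical invocation forwards familiar host-compiler style options (the file and macro names below are hypothetical):

nvcc -O2 -I include -D USE_GPU main.cu util.cpp -o myapp

Here the macro definition, include path, and optimization level are translated to the corresponding host compiler options for the non-CUDA compilation steps.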
Chapter 2.
COMPILATION PHASES
2.2. NVCC Phases
A compilation phase is a logical translation step that can be selected by command
line options to nvcc. A single compilation phase can still be broken up by nvcc into
smaller steps, but these smaller steps are just implementations of the phase: they depend
on seemingly arbitrary capabilities of the internal tools that nvcc uses, and all of these
internals may change with a new release of the CUDA Toolkit. Hence, only compilation
phases are stable across releases, and although nvcc provides options to display the
compilation steps that it executes, these are for debugging purposes only and must not
be copied and used in build scripts.
nvcc phases are selected by a combination of command line options and input file
name suffixes, and the execution of these phases may be modified by other command
line options. In phase selection, the input file suffix defines the phase input, while the
command line option defines the required output of the phase.
The following paragraphs will list the recognized file name suffixes and the supported
compilation phases. A full explanation of the nvcc command line options can be found
in NVCC Command Options.
Note that nvcc does not make any distinction between object, library or resource files. It
just passes files of these types to the linker when the linking phase is executed.
2.4. Supported Phases
The following table specifies the supported compilation phases, plus the option to
nvcc that enables execution of this phase. It also lists the default name of the output
file generated by this phase, which will take effect when no explicit output file name is
specified using option --output-file:
Phase | nvcc option (long name, short name) | Default output file name
CUDA compilation to C/C++ source file | --cuda (-cuda) | .cpp.ii appended to source file name, as in x.cu.cpp.ii. This output file can be compiled by the host compiler that was used by nvcc to preprocess the .cu file.
C/C++ preprocessing | --preprocess (-E) | <result on standard output>
C/C++ compilation to object file | --compile (-c) | Source file name with suffix replaced by o on Linux and Mac OS X, or obj on Windows
Cubin generation from CUDA source files | --cubin (-cubin) | Source file name with suffix replaced by cubin
Cubin generation from PTX intermediate files | --cubin (-cubin) | Source file name with suffix replaced by cubin
PTX generation from CUDA source files | --ptx (-ptx) | Source file name with suffix replaced by ptx
Fatbinary generation from source, PTX or cubin files | --fatbin (-fatbin) | Source file name with suffix replaced by fatbin
Linking relocatable device code | --device-link (-dlink) | a_dlink.obj on Windows or a_dlink.o on other platforms
Cubin generation from linked relocatable device code | --device-link --cubin (-dlink -cubin) | a_dlink.cubin
Fatbinary generation from linked relocatable device code | --device-link --fatbin (-dlink -fatbin) | a_dlink.fatbin
Linking an executable | <no phase option> | a.exe on Windows or a.out on other platforms
Constructing an object file archive, or library | --lib (-lib) | a.lib on Windows or a.a on other platforms
make dependency generation | --generate-dependencies (-M) | <result on standard output>
make dependency generation without headers in system paths | --generate-nonsystem-dependencies (-MM) | <result on standard output>
Running an executable | --run (-run) |
Notes:
‣ The last phase in this list is more of a convenience phase. It allows running the
compiled and linked executable without having to explicitly set the library path to
the CUDA dynamic libraries.
‣ Unless a phase option is specified, nvcc will compile and link all its input files.
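For example, assuming a single input file x.cu, each of the following commands stops after the named phase:

nvcc --ptx x.cu
nvcc --cubin x.cu
nvcc --fatbin x.cu
nvcc --compile x.cu

These write x.ptx, x.cubin, x.fatbin, and x.o (x.obj on Windows), respectively, as listed in the table above.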
Chapter 3.
THE CUDA COMPILATION TRAJECTORY
CUDA compilation works as follows: the input program is preprocessed for device
compilation and is compiled to CUDA binary (cubin) and/or PTX
intermediate code, which are placed in a fatbinary. The input program is preprocessed
once again for host compilation and is synthesized to embed the fatbinary and transform
CUDA specific C++ extensions into standard C++ constructs. Then the C++ host compiler
compiles the synthesized host code with the embedded fatbinary into a host object. The
exact steps that are followed to achieve this are displayed in Figure 1.
The embedded fatbinary is inspected by the CUDA runtime system whenever the device
code is launched by the host program to obtain an appropriate fatbinary image for the
current GPU.
CUDA programs are compiled in the whole program compilation mode by default,
i.e., the device code cannot reference an entity from a separate file. In the whole
program compilation mode, device link steps have no effect. For more information
on the separate compilation and the whole program compilation, see Using Separate
Compilation in CUDA.
[Figure 1: CUDA whole program compilation trajectory. The .cu input is preprocessed once for device compilation, producing PTX and/or cubin that is placed in a fatbinary, and once for host compilation; the device linker (nvlink) produces a_dlink.cubin and a_dlink.o (a_dlink.obj on Windows), the C++ compiler compiles the synthesized host code with the embedded fatbinary, and the host linker produces the executable.]
Chapter 4.
NVCC COMMAND OPTIONS
Long option names are used throughout the document, unless specified otherwise;
short names can be used instead of long names with the same effect.
--cudart {none|shared|static} (-cudart)
Allowed Values
‣ none
‣ shared
‣ static
Default
The static CUDA runtime library is used by default.
--cudadevrt {none|static} (-cudadevrt)
Allowed Values
‣ none
‣ static
Default
The static CUDA device runtime library is used by default.
4.2.2. Options for Specifying the Compilation Phase
4.2.2.1. --link (-link)
Specify the default behavior: compile and link all input files.
4.2.2.2. --lib (-lib)
Compile all input files into object files, if necessary, and add the results to the specified library
output file.
4.2.2.3. --device-link (-dlink)
Link object files with relocatable device code and .ptx, .cubin, and .fatbin files into an
object file with executable device code, which can be passed to the host linker.
4.2.2.4. --device-c (-dc)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file that contains
relocatable device code.
It is equivalent to --relocatable-device-code=true --compile.
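For example, the following two commands produce the same object file x.o containing relocatable device code:

nvcc --device-c x.cu
nvcc --relocatable-device-code=true --compile x.cu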
4.2.2.5. --device-w (-dw)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file that contains
executable device code.
It is equivalent to --relocatable-device-code=false --compile.
4.2.2.6. --cuda (-cuda)
Compile each .cu input file to a .cu.cpp.ii file.
4.2.2.7. --compile (-c)
Compile each .c, .cc, .cpp, .cxx, and .cu input file into an object file.
4.2.2.8. --fatbin (-fatbin)
Compile all .cu, .ptx, and .cubin input files to device-only .fatbin files.
nvcc discards the host code for each .cu input file with this option.
4.2.2.9. --cubin (-cubin)
Compile all .cu and .ptx input files to device-only .cubin files.
nvcc discards the host code for each .cu input file with this option.
4.2.2.10. --ptx (-ptx)
Compile all .cu input files to device-only .ptx files.
nvcc discards the host code for each .cu input file with this option.
4.2.2.11. --preprocess (-E)
Preprocess all .c, .cc, .cpp, .cxx, and .cu input files.
4.2.2.12. --generate-dependencies (-M)
Generate a dependency file that can be included in a Makefile for the .c, .cc, .cpp, .cxx,
and .cu input file.
4.2.2.13. --generate-nonsystem-dependencies (-MM)
Same as --generate-dependencies but skip header files found in system directories
(Linux only).
4.2.2.14. --generate-dependencies-with-compile (-MD)
Generate a dependency file and compile the input file. The dependency file can be included in a
Makefile for the .c, .cc, .cpp, .cxx, and .cu input file.
This option cannot be specified together with -E. The dependency file name is computed
as follows:
‣ If -MF is specified, then the specified file is used as the dependency file name.
‣ If -o is specified, the dependency file name is computed from the specified file name
by replacing the suffix with '.d'.
‣ Otherwise, the dependency file name is computed by replacing the input file
name's suffix with '.d'.
If the dependency file name is computed based on either -MF or -o, then multiple input
files are not supported.
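As a hypothetical illustration of these naming rules:

nvcc --generate-dependencies-with-compile --compile x.cu -o x.o
nvcc --generate-dependencies-with-compile --compile -MF deps.d x.cu

The first command derives the dependency file name x.d from the -o argument; the second writes the dependencies to deps.d because -MF is given.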
4.2.2.15. --generate-nonsystem-dependencies-with-compile (-MMD)
Same as --generate-dependencies-with-compile but skip header files found in system
directories (Linux only).
4.2.2.16. --run (-run)
Compile and link all input files into an executable, and execute it.
When the input is a single executable, it is executed without any compilation or linking.
This step is intended for developers who do not want to be bothered with setting the
necessary environment variables; these are set temporarily by nvcc.
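For example:

nvcc --run x.cu

compiles and links x.cu and then runs the resulting executable; arguments can be passed to it with --run-args.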
4.2.3.2. --debug (-g)
Generate debug information for host code.
4.2.3.3. --device-debug (-G)
Generate debug information for device code.
This option turns off all optimizations on device code. It is not intended for profiling;
use --generate-line-info instead for profiling.
4.2.3.4. --extensible-whole-program (-ewp)
Generate extensible whole program device code, which allows some calls to not be resolved until
linking with libcudadevrt.
4.2.3.5. --generate-line-info (-lineinfo)
Generate line-number information for device code.
4.2.3.9. --shared (-shared)
Generate a shared library during linking.
Use option --linker-options when other linker options are required for more
control.
4.2.3.10. --x {c|c++|cu} (-x)
Allowed Values
‣ c
‣ c++
‣ cu
Default
The language of the source code is determined based on the file name suffix.
4.2.3.11. --std {c++03|c++11|c++14} (-std)
Allowed Values
‣ c++03
‣ c++11
‣ c++14
Default
The default C++ dialect depends on the host compiler. nvcc matches the default C++
dialect that the host compiler uses.
4.2.3.12. --no-host-device-initializer-list (-nohdinitlist)
Do not consider member functions of std::initializer_list as __host__ __device__
functions implicitly.
4.2.3.13. --expt-relaxed-constexpr (-expt-relaxed-constexpr)
Experimental flag: Allow host code to invoke __device__ constexpr functions, and device
code to invoke __host__ constexpr functions.
Note that the behavior of this flag may change in future compiler releases.
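As a minimal sketch of what the flag permits, the following device code calls a constexpr function that carries no execution space annotations:

constexpr int square(int x) { return x * x; }   // implicitly a __host__ function

__global__ void kernel(int *out)
{
    *out = square(threadIdx.x);                  // accepted only with --expt-relaxed-constexpr
}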
4.2.3.14. --extended-lambda (-extended-lambda)
Allow __host__, __device__ annotations in lambda declarations.
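As a minimal sketch (the apply kernel below is a hypothetical helper), the flag allows a __device__ annotated lambda defined in host code to be passed to a kernel template:

template <typename F>
__global__ void apply(F f, int *out)
{
    out[threadIdx.x] = f(threadIdx.x);
}

void launch(int *d_out)
{
    auto add_one = [] __device__ (int x) { return x + 1; };  // extended __device__ lambda
    apply<<<1, 32>>>(add_one, d_out);
}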
4.2.3.15. --expt-extended-lambda (-expt-extended-lambda)
Alias for --extended-lambda.
4.2.3.16. --machine {32|64} (-m)
Allowed Values
‣ 32
‣ 64
Default
This option is set based on the host platform on which nvcc is executed.
4.2.5.2. --dryrun (-dryrun)
List the compilation sub-commands without executing them.
4.2.5.3. --verbose (-v)
List the compilation sub-commands while executing them.
4.2.5.4. --keep (-keep)
Keep all intermediate files that are generated during internal compilation steps.
4.2.5.6. --save-temps (-save-temps)
This option is an alias of --keep.
4.2.5.7. --clean-targets (-clean)
Delete all the non-temporary files that the same nvcc command would generate without this
option.
This option reverses the behavior of nvcc. When specified, none of the compilation
phases will be executed. Instead, all of the non-temporary files that nvcc would
otherwise create will be deleted.
4.2.5.13. --no-align-double
Specify that -malign-double should not be passed as a compiler argument on 32-bit
platforms.
WARNING: this makes the ABI incompatible with the CUDA kernel ABI for certain
64-bit types.
4.2.5.14. --no-device-link (-nodlink)
Skip the device link step when linking object files.
4.2.6. Options for Steering CUDA Compilation
4.2.6.1. --default-stream {legacy|null|per-thread} (-default-stream)
Allowed Values
legacy
The CUDA legacy stream (per context, implicitly synchronizes with other streams)
per-thread
Normal CUDA stream (per thread, does not implicitly synchronize with other
streams)
null
Deprecated alias for legacy
Default
legacy is used as the default stream.
4.2.7. Options for Steering GPU Code Generation
4.2.7.1. --gpu-architecture arch (-arch)
See Virtual Architecture Feature List for the list of supported virtual architectures and
GPU Feature List for the list of supported real architectures.
Default
sm_30 is used as the default value; PTX is generated for compute_30 then assembled
and optimized for sm_30.
During runtime, such embedded PTX code is dynamically compiled by the CUDA
runtime system if no binary load image is found for the current GPU.
Architectures specified for options --gpu-architecture and --gpu-code may
be virtual as well as real, but the code architectures must be compatible with the
arch architecture. When the --gpu-code option is used, the value for the
--gpu-architecture option must be a virtual PTX architecture.
See Virtual Architecture Feature List for the list of supported virtual architectures and
GPU Feature List for the list of supported real architectures.
4.2.7.4. --relocatable-device-code {true|false} (-rdc)
Allowed Values
‣ true
‣ false
Default
The generation of relocatable device code is disabled.
4.2.7.5. --entries entry,... (-e)
Default
nvcc generates code for all entry functions.
4.2.7.6. --maxrregcount amount (-maxrregcount)
Default
No maximum is assumed.
4.2.7.7. --use_fast_math (-use_fast_math)
Make use of fast math library.
--use_fast_math implies --ftz=true --prec-div=false --prec-sqrt=false
--fmad=true.
4.2.7.8. --ftz {true|false} (-ftz)
Allowed Values
‣ true
‣ false
Default
This option is set to false and nvcc preserves denormal values.
4.2.7.9. --prec-div {true|false} (-prec-div)
Allowed Values
‣ true
‣ false
Default
This option is set to true and nvcc enables the IEEE round-to-nearest mode.
4.2.7.10. --prec-sqrt {true|false} (-prec-sqrt)
Allowed Values
‣ true
‣ false
Default
This option is set to true and nvcc enables the IEEE round-to-nearest mode.
4.2.7.11. --fmad {true|false} (-fmad)
Allowed Values
‣ true
‣ false
Default
This option is set to true and nvcc enables the contraction of floating-point multiplies
and adds/subtracts into floating-point multiply-add operations (FMAD, FFMA, or
DFMA).
4.2.7.12. --compile-as-tools-patch (-astoolspatch)
Compile patch code for CUDA tools. Implies --keep-device-functions.
May only be used in conjunction with --ptx or --cubin or --fatbin.
Shall not be used in conjunction with -rdc=true or -ewp.
4.2.7.13. --keep-device-functions (-keep-device-functions)
In whole program compilation mode, preserve user defined external linkage __device__
function definitions in generated PTX.
4.2.8.2. --source-in-ptx (-src-in-ptx)
Interleave source in PTX.
May only be used in conjunction with --device-debug or --generate-line-info.
4.2.8.3. --restrict (-restrict)
Assert that all kernel pointer parameters are restrict pointers.
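For instance, under this option a kernel declared as follows is compiled as if each pointer parameter carried an explicit __restrict__ qualifier (illustrative sketch):

__global__ void saxpy(float *x, float *y, float a)   // treated as float *__restrict__ x, float *__restrict__ y
{
    y[threadIdx.x] = a * x[threadIdx.x] + y[threadIdx.x];
}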
4.2.8.4. --Wno-deprecated-gpu-targets (-Wno-deprecated-gpu-targets)
Suppress warnings about deprecated GPU target architectures.
4.2.8.5. --Wno-deprecated-declarations (-Wno-deprecated-declarations)
Suppress warning on use of a deprecated entity.
4.2.8.6. --Wreorder (-Wreorder)
Generate warnings when member initializers are reordered.
4.2.8.8. --resource-usage (-res-usage)
Show resource usage such as registers and memory of the GPU code.
This option implies --nvlink-options=--verbose when --relocatable-device-code=true
is set. Otherwise, it implies --ptxas-options=--verbose.
4.2.8.9. --help (-h)
Print help information on this tool.
4.2.8.10. --version (-V)
Print version information on this tool.
4.2.9. Phase Options
The following sections list some useful options for the lower-level compilation tools.
4.2.9.1. Ptxas Options
The following sections list some useful ptxas options which can be specified with the
nvcc option -Xptxas.
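For example, either of the following commands forwards --verbose (and, in the second case, also -O3) to ptxas:

nvcc --ptxas-options=-v x.cu
nvcc -Xptxas -v,-O3 x.cu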
4.2.9.1.1. --allow-expensive-optimizations (-allow-expensive-optimizations)
Enable or disable the compiler's ability to perform expensive optimizations using maximum
available resources (memory and compile time).
If unspecified, the default behavior is to enable this feature for optimization level >= O2.
4.2.9.1.2. --compile-only (-c)
Generate relocatable object.
4.2.9.1.3. --def-load-cache (-dlcm)
Default cache modifier on global/generic load.
Default value: ca.
4.2.9.1.4. --def-store-cache (-dscm)
Default cache modifier on global/generic store.
4.2.9.1.5. --device-debug (-g)
Semantics same as nvcc option --device-debug.
4.2.9.1.6. --disable-optimizer-constants (-disable-optimizer-consts)
Disable use of optimizer constant bank.
4.2.9.1.8. --fmad (-fmad)
Semantics same as nvcc option --fmad.
4.2.9.1.9. --force-load-cache (-flcm)
Force specified cache modifier on global/generic load.
4.2.9.1.10. --force-store-cache (-fscm)
Force specified cache modifier on global/generic store.
4.2.9.1.11. --generate-line-info (-lineinfo)
Semantics same as nvcc option --generate-line-info.
4.2.9.1.13. --help (-h)
Semantics same as nvcc option --help.
4.2.9.1.14. --machine (-m)
Semantics same as nvcc option --machine.
4.2.9.1.16. --opt-level N (-O)
Specify optimization level.
Default value: 3.
4.2.9.1.19. --preserve-relocs (-preserve-relocs)
This option makes ptxas generate relocatable references for variables and preserve the
relocations generated for them in the linked executable.
4.2.9.1.20. --sp-bound-check (-sp-bound-check)
Generate stack-pointer bounds-checking code sequence.
This option is turned on automatically when --device-debug or --opt-level=0 is
specified.
4.2.9.1.21. --verbose (-v)
Enable verbose mode which prints code generation statistics.
4.2.9.1.22. --version (-V)
Semantics same as nvcc option --version.
4.2.9.1.23. --warning-as-error (-Werror)
Make all warnings into errors.
4.2.9.1.24. --warn-on-double-precision-use (-warn-double-usage)
Warning if double(s) are used in an instruction.
4.2.9.1.25. --warn-on-local-memory-usage (-warn-lmem-usage)
Warning if local memory is used.
4.2.9.1.26. --warn-on-spills (-warn-spills)
Warning if registers are spilled to local memory.
4.2.9.2. NVLINK Options
The following sections list some useful nvlink options which can be specified with the
nvcc option --nvlink-options.
4.2.9.2.1. --disable-warnings (-w)
Inhibit all warning messages.
4.2.9.2.2. --preserve-relocs (-preserve-relocs)
Preserve resolved relocations in linked executable.
4.2.9.2.3. --verbose (-v)
Enable verbose mode which prints code generation statistics.
4.2.9.2.4. --warning-as-error (-Werror)
Make all warnings into errors.
Chapter 5.
GPU COMPILATION
This chapter describes the GPU compilation model that is maintained by nvcc, in
cooperation with the CUDA driver. It goes through some technical sections, with
concrete examples at the end.
5.1. GPU Generations
In order to allow for architectural evolution, NVIDIA GPUs are released in different
generations. New generations introduce major improvements in functionality and/
or chip architecture, while GPU models within the same generation show minor
configuration differences that moderately affect functionality, performance, or both.
Binary compatibility of GPU applications is not guaranteed across different generations.
For example, a CUDA application that has been compiled for a Fermi GPU will
very likely not run on a Kepler GPU (and vice versa). This is because the instruction set and
instruction encodings of a generation are different from those of other generations.
Binary compatibility within one GPU generation can be guaranteed under certain
conditions because they share the basic instruction set. This is the case between two GPU
versions that do not show functional differences at all (for instance when one version is
a scaled down version of the other), or when one version is functionally included in the
other. An example of the latter is the base Kepler version sm_30 whose functionality is
a subset of all other Kepler versions: any code compiled for sm_30 will run on all other
Kepler GPUs.
5.2. GPU Feature List
The following table lists the GPU architectures by name, annotated with the functional
capabilities that they provide; abstracting from the instruction set, the capabilities of
sm_x1y1 are included in those of sm_x2y2 whenever x1y1 <= x2y2. From this it indeed
follows that sm_30 is the base Kepler model, and it also
explains why higher entries in the tables are always functional extensions to the lower
entries. This is denoted by the plus sign in the table. Moreover, if we abstract from the
instruction encoding, it implies that sm_30's functionality will continue to be included
in all later GPU generations. As we will see next, this property will be the foundation for
application compatibility support by nvcc.
sm_30 and sm_32 | Basic features + Kepler support + Unified memory programming
sm_35 | + Dynamic parallelism support
sm_50, sm_52, and sm_53 | + Maxwell support
5.3. Application Compatibility
Binary code compatibility over CPU generations, together with a published instruction
set architecture, is the usual mechanism for ensuring that distributed applications out
there in the field will continue to run on newer versions of the CPU when these become
mainstream.
This situation is different for GPUs, because NVIDIA cannot guarantee binary
compatibility without sacrificing regular opportunities for GPU improvements. Rather,
as is already conventional in the graphics programming domain, nvcc relies on a
two stage compilation model for ensuring application compatibility with future GPU
generations.
5.4. Virtual Architectures
GPU compilation is performed via an intermediate representation, PTX, which can be
considered as assembly for a virtual GPU architecture. Contrary to an actual graphics
processor, such a virtual GPU is defined entirely by the set of capabilities, or features,
that it provides to the application. In particular, a virtual GPU architecture provides a
(largely) generic instruction set, and binary instruction encoding is a non-issue because
PTX programs are always represented in text format.
Hence, a nvcc compilation command always uses two architectures: a virtual
intermediate architecture, plus a real GPU architecture to specify the intended processor
to execute on. For such an nvcc command to be valid, the real architecture must be an
implementation of the virtual architecture. This is further explained below.
The chosen virtual architecture is more of a statement on the GPU capabilities that
the application requires: using the smallest virtual architecture still allows the widest range
of actual architectures for the second nvcc stage. Conversely, specifying a virtual
architecture that provides features unused by the application unnecessarily restricts the
set of possible GPUs that can be specified in the second nvcc stage.
From this it follows that the virtual architecture should always be chosen as low as
possible, thereby maximizing the actual GPUs to run on. The real architecture should be
chosen as high as possible (assuming that this always generates better code), but this is
only possible with knowledge of the actual GPUs on which the application is expected
to run. As we will see later, in the situation of just-in-time compilation the driver
has this exact knowledge: the runtime GPU is the one on which the program is about to
be launched/executed.
[Figure: two-staged compilation of x.cu device code. Stage 1 (PTX generation) targets a virtual compute architecture and produces x.ptx; Stage 2 (cubin generation) targets a real sm architecture and produces x.cubin, which is then executed.]
5.5. Virtual Architecture Feature List
The currently defined virtual architectures follow the same naming scheme as the real
architectures shown in Section GPU Feature List, with the sm_ prefix replaced by compute_.
5.6. Further Mechanisms
Clearly, compilation staging in itself does not help towards the goal of application
compatibility with future GPUs. For this we need two other mechanisms: just-in-time
compilation (JIT) and fatbinaries.
5.6.1. Just-in-Time Compilation
The compilation step to an actual GPU binds the code to one generation of GPUs. Within
that generation, it involves a choice between GPU coverage and possible performance.
For example, compiling to sm_30 allows the code to run on all Kepler-generation GPUs,
but compiling to sm_35 would probably yield better code if Kepler GK110 and later are
the only targets.
[Figure: just-in-time compilation of device code. The x.ptx embedded for the virtual compute architecture is compiled by the CUDA runtime into a cubin for the real sm architecture at application load time, and then executed.]
By specifying a virtual code architecture instead of a real GPU, nvcc postpones the
assembly of PTX code until application runtime, at which the target GPU is exactly
known. For instance, the command below allows generation of exactly matching GPU
binary code, when the application is launched on an sm_50 or later architecture.
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50
5.6.2. Fatbinaries
A different solution to overcome startup delay by JIT while still allowing execution on
newer GPUs is to specify multiple code instances, as in
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
This command generates exact code for two Maxwell variants, plus PTX code for use by
JIT in case a next-generation GPU is encountered. nvcc organizes its device code in
fatbinaries, which are able to hold multiple translations of the same GPU source code. At
runtime, the CUDA driver will select the most appropriate translation when the device
function is launched.
5.7. NVCC Examples
5.7.1. Base Notation
nvcc provides the options --gpu-architecture and --gpu-code for specifying the
target architectures for both translation stages. Except for allowed shorthands described
below, the --gpu-architecture option takes a single value, which must be the name
of a virtual compute architecture, while option --gpu-code takes a list of values which
must all be the names of actual GPUs. nvcc performs a stage 2 translation for each of
these GPUs, and will embed the result in the result of compilation (which usually is a
host object file or executable).
Example
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=sm_50,sm_52
5.7.2. Shorthand
nvcc allows a number of shorthands for simple cases.
5.7.2.1. Shorthand 1
--gpu-code arguments can be virtual architectures. In this case the stage 2 translation
will be omitted for such virtual architecture, and the stage 1 PTX result will be
embedded instead. At application launch, and in case the driver does not find a better
alternative, the stage 2 compilation will be invoked by the driver with the PTX as input.
Example
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_52
5.7.2.2. Shorthand 2
The --gpu-code option can be omitted. Only in this case, the --gpu-architecture
value can be a non-virtual architecture. The --gpu-code values default to the
closest virtual architecture that is implemented by the GPU specified with
--gpu-architecture, plus the --gpu-architecture value itself. The closest virtual
architecture is used as the effective --gpu-architecture value. If the --gpu-
architecture value is a virtual architecture, it is also used as the effective --gpu-code
value.
Example
nvcc x.cu --gpu-architecture=sm_52
nvcc x.cu --gpu-architecture=compute_50
are equivalent to
nvcc x.cu --gpu-architecture=compute_52 --gpu-code=sm_52,compute_52
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50
respectively.
5.7.2.3. Shorthand 3
Both --gpu-architecture and --gpu-code options can be omitted.
Example
nvcc x.cu
is equivalent to
nvcc x.cu --gpu-architecture=compute_30 --gpu-code=sm_30,compute_30
5.7.3. Extended Notation
The options --gpu-architecture and --gpu-code can be used in all cases where
code is to be generated for one or more GPUs using a common virtual architecture. This
will cause a single invocation of nvcc stage 1 (that is, preprocessing and generation of
virtual PTX assembly code), followed by a compilation stage 2 (binary code generation)
repeated for each specified GPU.
Using a common virtual architecture means that all assumed GPU features are fixed
for the entire nvcc compilation. For instance, the following nvcc command assumes no
half-precision floating-point operation support for both the sm_50 code and the sm_53
code:
nvcc x.cu --gpu-architecture=compute_50 --gpu-code=compute_50,sm_50,sm_53
Or, leaving actual GPU code generation to the JIT compiler in the CUDA driver:
nvcc x.cu \
--generate-code arch=compute_50,code=compute_50 \
--generate-code arch=compute_53,code=compute_53
The code sub-options can be combined with a slightly more complex syntax:
nvcc x.cu \
--generate-code arch=compute_50,code=[sm_50,sm_52] \
--generate-code arch=compute_53,code=sm_53
Chapter 6.
USING SEPARATE COMPILATION IN CUDA
Prior to the 5.0 release, CUDA did not support separate compilation, so CUDA code
could not call device functions or access variables across files. Such compilation is
referred to as whole program compilation. We have always supported the separate
compilation of host code; it was just the device CUDA code that needed to all be within
one file. Starting with CUDA 5.0, separate compilation of device code is supported,
but the old whole program mode is still the default, so there are new options to invoke
separate compilation.
To invoke just the device linker, the --device-link option can be used, which emits
a host object containing the embedded executable device code. The output of that must
then be passed to the host linker. Or:
nvcc <objects>
can be used to implicitly call both the device and host linkers. This works because if the
device linker does not see any relocatable code it does not do anything.
Figure 4 shows the flow (nvcc --device-c has the same flow as the whole program
CUDA compilation from .cu to .o shown in The CUDA Compilation Trajectory).
[Figure 4: x.cu and y.cu are compiled with relocatable device code and, together with z.cpp, passed through the device linker, which produces a_dlink.o (a_dlink.obj on Windows); the host linker then produces the executable.]
6.3. Libraries
The device linker has the ability to read the static host library formats (.a on Linux and
Mac OS X, .lib on Windows). It ignores any dynamic (.so or .dll) libraries. The
--library and --library-path options can be used to pass libraries to both the device
and host linker. The library name is specified without the library file extension when the
--library option is used.
Alternatively, the library name, including the library file extension, can be used without
the --library option on Windows.
nvcc --gpu-architecture=sm_50 a.obj b.obj foo.lib --library-path=<path>
Note that the device linker ignores any objects that do not have relocatable device code.
6.4. Examples
Suppose we have the following files (shown here in abbreviated, reconstructed form):
//---------- b.h ----------
#define N 8
extern __device__ int g[N];
extern __device__ void bar(void);

//---------- b.cu ----------
#include "b.h"
__device__ int g[N];
__device__ void bar(void) { g[threadIdx.x]++; }

//---------- a.cu ----------
#include <stdio.h>
#include "b.h"
__global__ void foo(void) {
  __shared__ int a[N];
  a[threadIdx.x] = threadIdx.x;
  __syncthreads();
  g[threadIdx.x] = a[blockDim.x - threadIdx.x - 1];
  bar();
}
int main(void) {
  int *dg, hg[N];
  foo<<<1, N>>>();
  if(cudaGetSymbolAddress((void**)&dg, g)){
    printf("couldn't get the symbol addr\n");
    return 1;
  }
  if(cudaMemcpy(hg, dg, N * sizeof(int), cudaMemcpyDeviceToHost)){
    printf("couldn't memcpy\n");
    return 1;
  }
  return 0;
}
These can be compiled with the following commands (these examples are for Linux):
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 a.o b.o
If you want to invoke the device and host linker separately, you can do:
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 --device-link a.o b.o --output-file link.o
g++ a.o b.o link.o --library-path=<path> --library=cudart
Note that all desired target architectures must be passed to the device linker, as that
specifies what will be in the final executable (some objects or libraries may contain
device code for multiple architectures, and the link step can then choose what to put in
the final executable).
If you want to use the driver API to load a linked cubin, you can request just the cubin:
nvcc --gpu-architecture=sm_50 --device-link a.o b.o \
--cubin --output-file link.cubin
Note that only static libraries are supported by the device linker.
A PTX file can be compiled to a host object file and then linked by using:
nvcc --gpu-architecture=sm_50 --device-c a.ptx
An example that uses libraries, host linker, and dynamic parallelism would be:
nvcc --gpu-architecture=sm_50 --device-c a.cu b.cu
nvcc --gpu-architecture=sm_50 --device-link a.o b.o --output-file link.o
nvcc --lib --output-file libgpu.a a.o b.o link.o
g++ host.o --library=gpu --library-path=<path> \
--library=cudadevrt --library=cudart
It is possible to do multiple device links within a single host executable, as long as each
device link is independent of the other. This requirement of independence means that
they cannot share code across device executables, nor can they share addresses (e.g.,
a device function address can be passed from host to device for a callback only if the
device link sees both the caller and potential callback callee; you cannot pass an address
from one device executable to another, as those are separate address spaces).
An object could have been compiled for a different architecture but also have PTX
available, in which case the device linker will JIT the PTX to cubin for the desired
architecture and then link. Relocatable device code requires CUDA 5.0 or later Toolkit.
If a kernel is limited to a certain number of registers with the launch_bounds attribute
or the --maxrregcount option, then all functions that the kernel calls must not use
more than that number of registers; if they exceed the limit, then a link error will be
given.
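For instance, in the following sketch (hypothetical names), any register limit imposed on kern through __launch_bounds__ or --maxrregcount also constrains the separately compiled helper it calls; exceeding it produces a link error:

// helper.cu, compiled with --device-c
__device__ float helper(float x) { return x * x + 1.0f; }

// kernel.cu, compiled with --device-c
extern __device__ float helper(float x);

__global__ void __launch_bounds__(256) kern(float *data)
{
    data[threadIdx.x] = helper(data[threadIdx.x]);
}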
6.5.4. Using __CUDA_ARCH__
In separate compilation, __CUDA_ARCH__ must not be used in headers in such a way that
different objects could contain different behavior. Or, it must be guaranteed that all
objects will compile for the same compute_arch. If a weak function or template function
is defined in a header and its behavior depends on __CUDA_ARCH__, then the instances
of that function in the objects could conflict if the objects are compiled for different
compute arch. For example, if a.h contains:
template<typename T>
__device__ T* getptr(void)
{
#if __CUDA_ARCH__ == 500
return NULL; /* no address */
#else
__shared__ T arr[256];
return arr;
#endif
}
Then if a.cu and b.cu both include a.h and instantiate getptr for the same type, and
b.cu expects a non-NULL address, and they are compiled with:
nvcc --gpu-architecture=compute_50 --device-c a.cu
nvcc --gpu-architecture=compute_52 --device-c b.cu
nvcc --gpu-architecture=sm_52 a.o b.o
At link time only one version of getptr is used, so the behavior would depend
on which version is picked. To avoid this, either a.cu and b.cu must be compiled for
the same compute arch, or __CUDA_ARCH__ should not be used in the shared header
function.
Chapter 7.
MISCELLANEOUS NVCC USAGE
7.1. Cross Compilation
Cross compilation is controlled by using the following nvcc command line options:
‣ --compiler-bindir is used for cross compilation, where the underlying host
compiler is capable of generating objects for the target platform.
‣ --machine=32. This option signals that the target platform is a 32-bit platform. Use
this when the host platform is a 64-bit platform.
7.3. Cleaning Up Generated Files
Because using --clean-targets will remove exactly what the original nvcc command
created, it is important to exactly repeat all of the options in the original command. For
instance, in the following example, omitting --keep, or adding --compile will have
different cleanup effects.
nvcc acos.cu --keep
nvcc acos.cu --keep --clean-targets
7.4. Printing Code Generation Statistics
In the resource usage output produced with --resource-usage, the amounts of statically
allocated global memory (gmem) and constant memory in bank 14 (cmem) are listed first.
Global memory and some of the constant banks are module scoped resources and
not per kernel resources. Allocation of constant variables to constant banks is profile
specific.
Following this, per-kernel resource information is printed.
Stack frame is the per-thread stack usage of the function. Spill stores and loads
represent stores and loads done on stack memory which is used for storing
variables that could not be allocated to physical registers.
Similarly, the number of registers, the amount of shared memory, and the total space
allocated in the constant bank are shown.
Notice
ALL NVIDIA DESIGN SPECIFICATIONS, REFERENCE BOARDS, FILES, DRAWINGS,
DIAGNOSTICS, LISTS, AND OTHER DOCUMENTS (TOGETHER AND SEPARATELY,
"MATERIALS") ARE BEING PROVIDED "AS IS." NVIDIA MAKES NO WARRANTIES,
EXPRESSED, IMPLIED, STATUTORY, OR OTHERWISE WITH RESPECT TO THE
MATERIALS, AND EXPRESSLY DISCLAIMS ALL IMPLIED WARRANTIES OF
NONINFRINGEMENT, MERCHANTABILITY, AND FITNESS FOR A PARTICULAR
PURPOSE.
Information furnished is believed to be accurate and reliable. However, NVIDIA
Corporation assumes no responsibility for the consequences of use of such
information or for any infringement of patents or other rights of third parties
that may result from its use. No license is granted by implication or otherwise
under any patent rights of NVIDIA Corporation. Specifications mentioned in this
publication are subject to change without notice. This publication supersedes and
replaces all other information previously supplied. NVIDIA Corporation products
are not authorized as critical components in life support devices or systems
without express written approval of NVIDIA Corporation.
Trademarks
NVIDIA and the NVIDIA logo are trademarks or registered trademarks of NVIDIA
Corporation in the U.S. and other countries. Other company and product names
may be trademarks of the respective companies with which they are associated.
Copyright
© 2007-2019 NVIDIA Corporation. All rights reserved.