Quick-Reference Guide To Optimization With Intel Compilers
For Intel® Pentium® 4 and Intel® Itanium® Processor families

A Step-by-Step Approach to Application Tuning with Intel Compilers
Before you begin performance tuning, ensure that your application runs as intended with a base set of options or in debug mode (-Od and -Zi).
1. Use the Automatic Optimization Options (-O1, -O2, or -O3) and determine which one works best for your application by measuring performance with each.
2. Add in Interprocedural Optimization (IPO) and/or Profile-Guided Optimization (PGO) and again measure performance to determine if your application benefits from either of them.
3. Fine-tune performance with the Processor-Specific Options to target IA-32 or Intel® Itanium® processor systems specifically. This step works best by identifying performance "hot-spots" with the Intel® VTune™ Performance Analyzer so you know which parts of your application need specific tuning. Also, the Intel Compiler's Optimization Reports show where the compiler could use more of your help.
4. Run your applications on multi-processor or Hyper-Threading Technology capable systems using the Parallel Performance options.

For product and purchase information visit: www.intel.com/software/products

Copyright © 2003, Intel Corporation, All Rights Reserved. Intel, the Intel logo, Itanium, Pentium, Intel Centrino, Intel Xeon, Intel XScale, and VTune are trademarks or registered trademarks of Intel Corporation or its subsidiaries in the United States and other countries.
*Other names and brands may be claimed as the property of others.
1103/JXP/ITF/PP/4K 254349-001

Intel® Itanium® Processor-specific Optimization.
In general, using -O3, IPO and/or PGO, and the Optimization Reports (in the Fine Tuning section) to control aliasing and improve memory utilization provides the best performance for Itanium processor-based systems.

Windows Command             Linux Command               Comment
/G1                         -tpp1                       Targets optimization for the Itanium processor.
/G2                         -tpp2                       Targets optimization for the Itanium 2 processor. Generated code is also compatible with the Itanium processor. (Default)
/QIPF_fma[-]                -IPF_fma[-]                 Enables [disables] the combining of floating-point multiplies and add/subtract operations.
/QIPF_fp_speculationmode    -IPF_fp_speculationmode     Enables floating-point speculations with one of the following modes:
                                                        fast: Speculate floating-point operations.
                                                        off: Disables speculation of floating-point operations.
                                                        safe: Speculate only when safe.
                                                        strict: This is the same as specifying off.
/Qftz[-]                    -ftz[-]                     Flushes denormal results to zero. The option is turned ON with -O3 by default. This only impacts the application when the main program or dll main is compiled.
/Qivdep_parallel            -ivdep_parallel             Indicates there is absolutely no loop-carried memory dependency in the loop where the IVDEP directive is specified.
/QIPF_fltacc[-]             -IPF_fltacc[-]              Enables [disables] optimizations that affect floating-point accuracy.
/QIPF_flt_eval_method{0|2}  -IPF_flt_eval_method0       Evaluates floating-point operands to the precision indicated by the program.

Fine Tuning.
Once you have identified performance hot-spots, you may need to provide the compiler with more information to fine-tune specific functions. The Optimization and Vectorization Reports may show places where loops could not be optimized fully due to pointer aliasing or memory access overlaps, for example. Also, the Intel® C++ and Fortran Compilers User's Guides include details on other #pragmas, directives and intrinsics that can be used to control software-pipelining, unrolling, vectorization, and prefetching for further fine-tuning within your application code.

Windows Command             Linux Command               Comment
/Qunroll[n]                 -unroll[n]                  Sets the maximum number of times to unroll loops. -unroll0 disables loop unrolling. The default is -unroll, which uses default heuristics.
/Qrestrict[-]               -[no]restrict               Enables/disables pointer disambiguation with the restrict qualifier.
n/a                         -falias                     Assumes aliasing in the program. (C++ Linux only)
n/a                         -ffnalias                   Assumes aliasing within functions. (C++ Linux only)
/Oa                         -fno-alias                  Assumes no aliasing in program. (C++ Linux only)
/Ow                         -fno-fnalias                Assumes no aliasing within functions, but assumes aliasing across calls. (C++ Linux only)
/Qalias_args[-]             -alias_args[-]              Implies arguments may be aliased [not aliased].
/Qopt_report                -opt_report                 Generates an optimization report directed to stderr.
/Qopt_report_filefilename   -opt_report_filefilename    Specifies the filename for the optimization report.
/Qopt_report_levellevel     -opt_report_levellevel      Specifies the verbosity level of the output. Valid arguments are min (default), med, max.
/Qopt_report_phasename      -opt_report_phasename       Specifies the compilation phase name for which reports are generated. The option can be used multiple times in the same compilation to get output from multiple phases. Valid name arguments:
                                                        ipo: Interprocedural Optimizer
                                                        hlo: High Level Optimizer
                                                        ilo: Intermediate Language Scalar Optimizer
                                                        ecg: Code Generator
                                                        omp: OpenMP*
                                                        all: All phases
/Qopt_report_routine[rtn]   -opt_report_routine[rtn]    Specifies a routine rtn. Reports from all routines with names that include rtn as part of the name are generated. By default, reports for all routines are generated.
/Qopt_report_help           -opt_report_help            Displays all possible settings for -opt_report_phase. No compilation is performed.

Parallel Performance.
The following options allow the compiler to help you parallelize your application for multi-processor or Hyper-Threading Technology capable systems.

Windows Command             Linux Command               Comment
/Qopenmp                    -openmp                     Enables the parallelizer to generate multi-threaded code based on the OpenMP* directives.
/Qopenmp_report{0|1|2}      -openmp_report{0|1|2}       Controls the OpenMP parallelizer's diagnostic levels. The default is /Qopenmp_report1.
/Qparallel                  -parallel                   Detects parallel loops capable of being executed safely in parallel and automatically generates multithreaded code for these loops.
/Qpar_report{0|1|2|3}       -par_report{0|1|2|3}        Controls the auto-parallelizer's diagnostic levels as follows:
                                                        0: displays no diagnostic information.
                                                        1: indicates loops successfully parallelized (default).
                                                        2: indicates loops successfully and unsuccessfully parallelized.
                                                        3: adds information about any proven or assumed dependencies inhibiting parallelization.
/Qpar_threshold[n]          -par_threshold[n]           Sets a threshold for the auto-parallelization of loops based on the probability of profitable execution of the loop in parallel, n=0 to 100. Default: n=75. This option is used for loops whose computation work volume cannot be determined at compile time.
                                                        0: parallelize loops regardless of computation work volume.
                                                        100: parallelize loops only if profitable parallel execution is almost certain.
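As a rough illustration of the kind of code the Parallel Performance options act on, the sketch below shows two loops annotated with OpenMP* directives. The function names (saxpy, sum_array) are hypothetical and chosen only for this example; they do not come from the guide. It assumes the loops are built with the OpenMP option listed above; compilers generally ignore the pragmas when OpenMP support is not enabled, in which case the loops simply run serially.

```c
#include <assert.h>
#include <stddef.h>

/* y[i] = a * x[i] + y[i]. Each iteration touches only its own element,
 * so the iterations are independent and can be divided among threads. */
void saxpy(size_t n, float a, const float *x, float *y)
{
    long i; /* signed loop index, as older OpenMP versions require */

    assert(x != NULL && y != NULL);
#pragma omp parallel for
    for (i = 0; i < (long)n; i++) {
        y[i] = a * x[i] + y[i];
    }
}

/* Sum of all elements. The reduction clause gives each thread a private
 * partial sum and combines the partial sums when the loop finishes. */
double sum_array(size_t n, const double *v)
{
    double total = 0.0;
    long i;

    assert(v != NULL);
#pragma omp parallel for reduction(+:total)
    for (i = 0; i < (long)n; i++) {
        total += v[i];
    }
    return total;
}
```

Because a parallel reduction may add floating-point values in a different order than the serial loop, results can differ in the last bits for general inputs; this is one reason to measure and validate after enabling each option.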
Automatic Optimization Options.
Before you begin performance tuning, ensure that your application runs as intended with a base set of options or in debug mode (-Od and -Zi). These are general optimization options that should be at the heart of any application tuning for Intel® Pentium® 4 and Itanium® processors. Try these different options and measure your performance before proceeding to more advanced optimizations.

Interprocedural Optimization (IPO) and Profile-Guided Optimization (PGO) Options.
IPO controls function-inlining to reduce function call overhead and improve data layout across functions. PGO provides run-time feedback to guide optimization decisions about data and code layout to improve instruction-cache, paging and branch prediction. IPO can increase code size. Be sure to measure your execution performance, compile-time, and code-size tradeoffs with these options. IPO is best used in conjunction with PGO to guide which functions to inline.

IA-32 Processor-specific Optimization.
These options allow you to tune performance specifically for the Intel processor-based systems you are targeting. As with each previous step, measure the performance benefit of each option to guide your decisions. Use the Intel Compilers' Optimization Reports to assist in determining whether you can provide more help to the compiler in the form of anti-aliasing or memory disambiguation information.
IA-32-Specific Optimization Recommendation: Use the -QaxN (-axN on Linux), new in the 8.0 compilers, for best performance across all Pentium 4 processors and the Pentium M processor. (You may also want to experiment with -QaxB (-axB) on Pentium M processors.)
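One way to give the compiler memory disambiguation information in C source is the restrict qualifier, which this guide's /Qrestrict (-restrict) option enables. The sketch below is illustrative only: the function names add_into and add_into_restrict are hypothetical, and whether the compiler actually optimizes the second loop more aggressively depends on the compiler version and the other options in use.

```c
#include <assert.h>
#include <stddef.h>

/* Baseline: dst and src are plain pointers, so the compiler has to assume
 * they might overlap and may generate conservative code for the loop. */
void add_into(float *dst, const float *src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}

/* Same loop, but restrict is a promise from the programmer that dst and
 * src never refer to overlapping memory. Calling this with overlapping
 * arrays is undefined behavior, so the promise must actually hold. */
void add_into_restrict(float *restrict dst, const float *restrict src, size_t n)
{
    size_t i;

    assert(dst != NULL && src != NULL);
    for (i = 0; i < n; i++) {
        dst[i] += src[i];
    }
}
```

For non-overlapping arrays both functions compute the same result; the qualifier only changes what the compiler is allowed to assume.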
Automatic optimization options

Windows Command                 Linux Command   Comment
/Od (No Optimization)           -O0             No optimization. Useful during application development and debugging.
/O1 (Optimize for size)         -O1             Omits optimizations that tend to increase object size. Creates the smallest optimized code in most cases. On Linux systems with IA-32 processors only, there is no difference between -O1 and -O2. This option has proven useful in many large server/database applications where memory paging due to larger code size is an issue.
/O2 (Maximize speed)            -O1 or -O2      Default setting. Creates the fastest code in most cases, but may increase code size significantly over /O1. On Linux systems with IA-32 processors, -O1 and -O2 are equivalent.
/Ox (Maximize optimization)     n/a             Equivalent to /O2 except that /Ox does not imply /Gy (function packaging) or /Gf (string pooling) [Windows only].
/O3 (High-level optimizations)  -O3             Same as /O2, plus loop transformations and data prefetching for improved memory usage efficiency. For the full benefit of /O3 on Intel 32-bit processors, also use the /Qx{K, W, N, B, P} or /Qax{K, W, N, B, P} options for Pentium III and Pentium 4 processors and subsequent IA-32 processors. This option has proven useful for a broad range of applications, particularly loopy, kernel-based code common in high-performance computing.
/Zi                             -g              Generates debug information for use with any of the common platform debuggers.

IPO and PGO options

Windows Command             Linux Command               Comment
/Qip                        -ip                         Single file optimization. Allows selective inlining optimization within a single source file.
/Qipo                       -ipo                        Multi-file optimization. Permits inlining and other optimizations among multiple source files.
/Qprof_gen                  -prof_gen                   Instruments a program for profiling.
/Qprof_dirdir               -prof_dirdir                Specifies a directory for the profiling output files, *.dyn and *.dpi.
/Qprof_use                  -prof_use                   Enables use of profiling information during optimization.

Profile-Guided Optimization (PGO) Steps
Step One: Compile with PGO. Result: instrumented executable (foo.exe).
Step Two: Run instrumented application to produce Dynamic Information Files. Result: Dynamic Information Summary File.
Step Three: Feedback compile with PGO. Result: profile-guided application.

IA-32 processor-specific options

Windows Command             Linux Command               Comment
/G6                         -tpp6                       Schedule code for Pentium III or earlier processors.
/G7                         -tpp7                       Schedule code for Intel Pentium 4 and later processors. (Default)
/Qax{K|W|N|B|P}             -ax{K|W|N|B|P}              Automatic Processor Dispatch. Generates specialized code for the indicated processors while also generating generic IA-32 code. You can use more than one code to tune for multiple processors in the same executable.
                                                        K: Intel Pentium III and compatible Intel processors.
                                                        W: Intel Pentium 4 and compatible Intel processors.
                                                        N: Intel Pentium 4 and compatible Intel processors.
                                                        B: Intel Pentium M and compatible Intel processors.
                                                        P: Intel processors code-named Prescott and compatible Intel processors.
                                                        Beginning with Intel Compilers version 8.0, K and W are deprecated and will be removed from future releases. N provides additional Pentium 4 processor tuning beyond W.
/Qx{K|W|N|B|P}              -x{K|W|N|B|P}               Processor-specific Targeting. Generates specialized code for the indicated processor. The executable should only be run on the targeted compatible processors.
                                                        K: Intel Pentium III and compatible Intel processors.
                                                        W: Intel Pentium 4 and compatible Intel processors.
                                                        N: Intel Pentium 4 and compatible Intel processors.
                                                        B: Intel Pentium M and compatible Intel processors.
                                                        P: Intel processors code-named Prescott and compatible Intel processors.
                                                        Beginning with Intel Compilers version 8.0, K and W are deprecated and will be removed from future releases. N provides additional Pentium 4 processor tuning beyond W.
                                                        N, B, and P generate a run-time check to determine that the correct compatible Intel processor is used to prevent potential run-time faults that could otherwise occur with K and W.
/Qprefetch[-]               -prefetch[-]                Enables or disables prefetch insertion (requires -O3).
/Qfp_port                   -fp_port                    Rounds floating-point results after floating-point operations, so rounding to user-declared precision happens at assignments and type conversions; this has some impact on speed. The default is to keep results of floating-point operations in higher precision. Use this if you are experiencing differences in floating-point precision versus other platforms.
/Qvec_report{0|1|2|3|4|5}   -vec_report{0|1|2|3|4|5}    Controls amount of vectorizer diagnostic information as follows:
                                                        n = 0: no information
                                                        n = 1: indicates vectorized loops (default)
                                                        n = 2: indicates vectorized and non-vectorized loops
                                                        n = 3: indicates vectorized and non-vectorized loops, and reports the data dependences that prohibit vectorization
                                                        n = 4: indicates non-vectorized loops
                                                        n = 5: indicates non-vectorized loops, and reports the data dependences that prohibit vectorization
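To make the vectorizer report levels more concrete, the sketch below contrasts a loop with independent iterations against a loop with a loop-carried data dependence, which is the kind of dependence the higher report levels describe. The function names scale and running_sum are hypothetical, and what a given compiler version actually reports for each loop is an assumption here, not something stated in the guide.

```c
#include <assert.h>
#include <stddef.h>

/* Independent iterations: a[i] depends only on its own old value,
 * so this loop is a typical candidate for vectorization. */
void scale(float *a, float s, size_t n)
{
    size_t i;

    assert(a != NULL);
    for (i = 0; i < n; i++) {
        a[i] *= s;
    }
}

/* Loop-carried dependence: a[i] needs the value of a[i - 1] computed in
 * the previous iteration, so the iterations cannot simply run side by
 * side. A vectorizer would typically leave this loop scalar. */
void running_sum(float *a, size_t n)
{
    size_t i;

    assert(a != NULL);
    for (i = 1; i < n; i++) {
        a[i] += a[i - 1];
    }
}
```

Compiling a file like this with one of the higher report levels is a quick way to see which loops the compiler declined to vectorize and why.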