Run Your Ansys Fluent Simulations at Top Speed
Ansys Fluent takes advantage of Intel Advanced Vector Extensions (Intel AVX)
and Intel Math Kernel Library (Intel MKL) to enable greater efficiencies and
optimized performance.
Key Terms
Solver: Software code that solves a problem or problems through mathematical calculations.
Sparse solver: A linear solver that has been optimized for sparsely populated matrices, like those found in finite element
analysis (FEA).
Algebraic multigrid (AMG) solver: A type of solver that solves differential equations using a hierarchy of discrete
functions, models, variables and equations.
Smoother: An algorithm that provides an approximate solution to a specific piece of the overall set of equations and allows
important patterns to stand out. Smoothers are an important part of an AMG solver.
Incomplete lower/upper (ILU)/LDU smoother: The ILU smoother is one of the native Ansys Fluent smoothers. The LDU
smoother is the Intel MKL smoother that replaces the ILU smoother in this testing to improve performance.
Intel Advanced Vector Extensions (Intel AVX-512 and Intel AVX2): A set of instructions for performing single instruction,
multiple data (SIMD) operations on Intel processors, improving performance for large datasets. Intel AVX2 includes 256-bit
integer instructions and supports floating-point fused multiply-add instructions. Intel AVX-512 doubles the vector register
width to 512 bits, enabling twice the number of floating-point operations per second per clock cycle compared to Intel AVX2.
Intel Math Kernel Library (Intel MKL): A math computing library of optimized, threaded routines for linear algebra, fast
Fourier transforms (FFTs), vector math, random number generators and direct and iterative sparse solvers.
Performance Results
Intel conducted two kinds of testing to evaluate performance improvements to Ansys Fluent using the Intel MKL sparse LDU
smoother: the Ansys Fluent benchmark suite and Ansys Fluent built-in software profiles.
Ansys Fluent Software Benchmark Suite
Intel measured performance improvements on a subset of the Ansys Fluent benchmark suite using a version of the Intel
MKL sparse LDU smoother interface available in Ansys Fluent 2020 R1. Tables 1 and 2 show relative solver ratings in
single-node and eight-node hardware configurations respectively, each comparing four software variations:
• Fluent baseline binaries with the Fluent smoother
• Fluent baseline binaries with the Intel MKL sparse LDU smoother
• Fluent optimized binaries (-platform=Intel) with the Fluent smoother
• Fluent optimized binaries with the Intel MKL sparse LDU smoother
Table 1. Ansys Fluent 2020 R1 performance with the Intel MKL sparse LDU smoother (single-node) running the Ansys Fluent benchmark suite¹
White Paper | Run Your Ansys® Fluent® Simulations at Top Speed
Table 2. Ansys Fluent 2020 R1 performance with the Intel MKL sparse LDU smoother (eight-node cluster) running the Ansys Fluent benchmark suite¹
Eight-Node Cluster: Intel Xeon Platinum 8280L Processor
Ansys Fluent Relative Solver Rating, Normalized to Fluent Baseline Binaries with a Native Smoother

Case                 Core Count   Fluent     Fluent Baseline Plus    Fluent Optimized   Fluent Optimized Binaries* Plus
                                  Baseline   Intel MKL Sparse        Binaries*          Intel MKL Sparse LDU Smoother
                                             LDU Smoother
sedan_4m             448          1.00       0.97                    1.03               1.00
aircraft_wing_14m    448          1.00       1.01                    1.00               0.98
combustor_12m        448          1.00       1.04                    1.12               1.16
pump_2m              448          1.00       1.00                    1.03               1.01
rotor_3m             448          1.00       0.98                    1.03               0.99
aircraft_wing_2m     448          1.00       0.96                    1.01               0.93
exhaust_system_33m   448          1.00       1.05                    1.01               1.05
landing_gear_15m     448          1.00       1.02                    1.02               1.02
combustor_71m        448          1.00       1.13                    1.00               1.13
f1_racecar_140m      448          1.00       1.15                    1.01               1.15
open_racecar_280m    448          1.00       1.08                    1.01               1.09
*Optimized binaries are selected with the optional -platform=intel Fluent command-line argument.
The level of performance improvement found in this testing varies by test case, but a wide variety of test cases, including
the most common solver options in Fluent, saw improvements of up to 13 percent in the single-node tests and 16 percent in
the eight-node cluster. These results show that, in general, where using either the Fluent optimized binaries or the Intel MKL
sparse LDU smoother is beneficial, the benefits of using both are complementary.²
The performance benefits of both the optimized “-platform=intel” option and the Intel MKL sparse LDU smoother can
effectively scale to thousands of cores in large-scale runs. Profiling was recorded for the test case F1_racecar_140m on eight
nodes of Intel Xeon Platinum 8280L processors, using 56 cores per node for a total of 448 cores.³ These are top-down profiles
where each function shown is a subcomponent of some or all of the functions above it.
Figures 1 and 2 illustrate the performance boost at increasing core counts on an Intel Xeon Platinum 8260L processor–based
cluster with Intel Omni-Path Architecture (Intel OPA) interconnect. As shown in Figure 1, the test case F1_racecar_140m scales
well to 6,144 cores on the cluster, with each node fully subscribed, and with Intel OPA interconnect and Intel MPI Library.⁴
Figure 1. Racecar test case scaling up to 6,144 cores.⁴ The chart plots solver rating (0 to 9,000) against number of cores (0 to 6,144) at 48 processes per node (PPN), with four series: Ansys Fluent baseline; Ansys Fluent optimized (-platform=intel); Ansys Fluent optimized with the Intel MKL sparse LDU smoother; and ideal scaling.
Figure 2 shows that the combined impact of the Intel MKL sparse LDU smoother and the “-platform=intel” optimized code
path is a 12 percent to 19 percent improvement over the range shown.⁴
Figure 2. Relative solver rating, normalized to the Ansys Fluent baseline (the basis for normalization), for 4 to 128 cluster nodes at 48 processes per node (PPN).
[Software profile excerpts, for runs without and with the Intel MKL sparse LDU smoother. Columns: Index, Function, N0 Time, Minimum Rank Time, and Maximum Rank Time for each configuration.]
Before the introduction of the Intel MKL sparse LDU smoother, the native Ansys smoother, Smooth_ILU4, was present in the
profile, consuming 118 seconds out of 193 seconds of solver time. After its introduction, the Intel MKL sparse LDU smoother
showed up in the software profile as Smooth_ILU_ud, and time consumption was reduced to 96 seconds, an improvement of
19 percent for that function.
Appendix 1: How to Try the Intel MKL Sparse LDU Smoother Yourself
This appendix demonstrates how to choose the Intel MKL sparse LDU smoother in an Ansys Fluent execution, for both
single-precision and double-precision runs. Note that this process will be simplified considerably in future releases.
The interface for the Intel MKL sparse LDU smoother is implemented in two separate libraries, which are included with the
Fluent 2020 R1 release. These libraries support either single-precision or double-precision execution. Depending on whether
Fluent is going to run with single or double precision, the user must identify which library is needed:
• Single precision: <ANSYS_HOME>/v201/fluent/lib/lnamd64/libmklsmoother_sp.so
• Double precision: <ANSYS_HOME>/v201/fluent/lib/lnamd64/libmklsmoother_dp.so
Each interface library contains the interface routine “ilu” as a member. Modify the Fluent installation as follows. (Note that, for
this step, it might be best to make a separate copy of the installation for modification.)
1. Under the Fluent installation, locate Intel MKL libraries that were bundled with Ansys products.
<ANSYS_HOME>/v201/tp/IntelMKL
2. Create a new subdirectory under IntelMKL to store the Intel MKL 2020 libraries, and name the subdirectory 2020.0.166:
cd <ANSYS_HOME>/v201/tp/IntelMKL
mkdir -p 2020.0.166/linx64/lib/intel64/
3. Copy the Intel MKL 2020 gold release libraries into the Fluent installation directory:
cp -p <MKL_HOME>/mkl/lib/intel64/*.so 2020.0.166/linx64/lib/intel64/
4. Locate the Fluent executable wrapper script to make a change to the version of Intel MKL referenced:
<ANSYS_HOME>/v201/fluent/fluent20.1.0/bin/fluent
5. Modify the following line in that script from:
sys_prepend_ld_library_path
$FLUENT_INC/../tp/IntelMKL/2017.6.256/linx64/lib/intel64
to:
sys_prepend_ld_library_path
$FLUENT_INC/../tp/IntelMKL/2020.0.166/linx64/lib/intel64
6. Finally, in the working directory for the simulation, modify the input command file, or .jou file, to call the ilu interface routine
as a user-defined smoother. Add the following line to the input before the solve step:
(rpsetvar 'amg-coupled/user-defined-smoother ilu@libmklsmoother_sp.so)
Note that the line above references the single precision (_sp.so) version. Use double precision (_dp.so) if
appropriate. Also, the line assumes that a copy of libmklsmoother_sp.so exists in the local working directory. You
can add a full path to reference the Fluent installation instead with the following:
(rpsetvar 'amg-coupled/user-defined-smoother ilu@<ANSYS_HOME>/v201/fluent/lib/lnamd64/libmklsmoother_sp.so)
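Steps 2 through 5 above can be condensed into a short script. The sketch below first stages a sandbox layout so it runs end to end; the sandbox setup and the default ANSYS_HOME/MKL_HOME values are illustrative assumptions, so point both variables at real installations before running it against a live copy.

```shell
#!/bin/sh
set -e
# Illustrative defaults: a throwaway sandbox under the current directory.
ANSYS_HOME="${ANSYS_HOME:-$PWD/ansys_sandbox}"
MKL_HOME="${MKL_HOME:-$PWD/mkl_sandbox}"

# Sandbox setup (demo only): fake an Intel MKL install and a Fluent
# wrapper script so the remaining steps have something to operate on.
mkdir -p "$MKL_HOME/mkl/lib/intel64" "$ANSYS_HOME/v201/fluent/fluent20.1.0/bin"
touch "$MKL_HOME/mkl/lib/intel64/libmkl_core.so"
printf '%s\n' 'sys_prepend_ld_library_path $FLUENT_INC/../tp/IntelMKL/2017.6.256/linx64/lib/intel64' \
  > "$ANSYS_HOME/v201/fluent/fluent20.1.0/bin/fluent"

# Step 2: create the destination for the Intel MKL 2020 libraries.
MKL_DST="$ANSYS_HOME/v201/tp/IntelMKL/2020.0.166/linx64/lib/intel64"
mkdir -p "$MKL_DST"

# Step 3: copy the Intel MKL 2020 shared libraries into the Fluent tree.
cp -p "$MKL_HOME"/mkl/lib/intel64/*.so "$MKL_DST/"

# Steps 4-5: repoint the Fluent wrapper script at the new MKL version.
sed -i 's|IntelMKL/2017\.6\.256|IntelMKL/2020.0.166|' \
  "$ANSYS_HOME/v201/fluent/fluent20.1.0/bin/fluent"

grep 'IntelMKL/' "$ANSYS_HOME/v201/fluent/fluent20.1.0/bin/fluent"
```

Step 6 remains a per-simulation edit to the journal file and is not scripted here.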
Appendix 2: Deeper Dive into the Intel MKL Sparse LDU Smoother and Intel MKL Inspector-Executor Sparse BLAS (Intel MKL IE SpBLAS)
This appendix explains in greater detail the workings of the Intel MKL sparse LDU smoother. Detailed documentation for Intel
MKL APIs is available at https://software.intel.com/en-us/mkl-developer-reference-c.
Intel MKL IE SpBLAS Overview and Details on the Sparse LDU Smoother
Intel MKL IE SpBLAS offers a rich set of sparse linear algebra functions, including some sparse solvers, and it employs an
inspector-executor paradigm that separates operations (APIs) into two stages: analysis and execution. For a given matrix, the
analysis is typically performed once, whereas the execution may be called many times. During the analysis
stage, the API inspects the matrix properties, including size, sparsity pattern and available parallelism, and it can apply matrix
format/structure changes to the target or enable a more optimized algorithm. In the execution stage, multiple routine calls can
take advantage of the analysis-stage data in order to improve performance. The Intel MKL IE SpBLAS model, with an optimize
step separate from the execute step, allows Intel MKL to make use of Intel AVX-512 instruction sets in ways that were not
possible previously, with only a single-stage execution model.
A common use case of this inspector-executor model is in an iterative solver where the matrix is not changing but the
application of the matrix operation happens many times—typically until some sort of convergence happens. The inspector
step’s time cost is effectively amortized across the many execution calls, which can be sped up significantly by the choices
made during inspection.
Starting with the Intel MKL 2019 Update 5 release, the library includes the sparse LDU smoother interface and the
mkl_sparse_?_lu_smoother routine, which supports both single- and double-precision real data types. However, this interface
was not initially fully optimized for single precision. Intel MKL 2020 makes the Intel MKL IE SpBLAS mkl_sparse_?_lu_smoother
function available with optimizations for both single and double precision, and Ansys has added an interface to the Ansys
Fluent 2020 R1 solver that lets users link against Intel MKL 2020 to try out the sparse LDU smoother function.
The Intel MKL sparse LDU smoother also follows the inspector-executor approach. The mkl_sparse_?_lu_smoother routine
performs an update for an iterative solution x of the system Ax=b. It applies one iteration of an approximate preconditioner,
which is based on the following representation: A ~ (L + D)*E*(U + D), where E is an approximate inverse of the diagonal (using
the exact inverse will result in a symmetric Gauss-Seidel preconditioner), L and U are lower/upper triangular parts of A and D
is the diagonal (the block diagonal in the case of a block compressed sparse row [BSR] format) of A (A = L + D + U). Essentially,
this routine performs the following three operations:
• r = b - A*x
• (L + D)*E*(U + D)*dx = r
• y = x + dx
The pseudo-code below demonstrates the inspector-executor model and, in particular, the mkl_sparse_?_lu_smoother
interface, in addition to the main ideas that were used for integration with the Ansys Fluent ILU smoother.
In order to work with Intel MKL IE SpBLAS, first you must create a handle for a matrix that stores the data internally and can
be passed into the interface. Assuming that matrix A is stored in BSR format with a number of rows equal to nRows, a number
of columns equal to nCols and square blocks of size block_size>=1, then bsrRowPtr, bsrColInd and bsrVal are arrays of row
pointers, column indices and values respectively.
sparse_matrix_t bsrA;
mkl_sparse_s_create_bsr(&bsrA, SPARSE_INDEX_BASE_ZERO, SPARSE_LAYOUT_ROW_MAJOR, nRows,
nCols, block_size, bsrRowPtr, bsrRowPtr+1, bsrColInd, bsrVal);
You can then run the inspector stage (that is, add hints and call mkl_sparse_optimize). The main point of this stage is to
provide hints about the expected number of calls, the type of operation and the matrix structure, so that Intel MKL can
perform its analysis and apply appropriate optimizations, including, but not limited to, matrix conversion and thread
balancing. The handle then stores any optimizations made during the inspector step. There is no limit on the number of
hints that can be set, and their effect is cumulative. The inspector stage is optional and needs to be run only once.
sparse_operation_t op = SPARSE_OPERATION_NON_TRANSPOSE;
struct matrix_descr descr_gen;
descr_gen.type = SPARSE_MATRIX_TYPE_GENERAL;
mkl_sparse_set_lu_smoother_hint(bsrA, op, descr_gen, nExpectedCalls);
mkl_sparse_optimize(bsrA);
The last step is to apply the smoother. This stage can be repeated as many times as necessary with the same matrix handle.
mkl_sparse_?_lu_smoother(op, bsrA, descr_gen, D, E, x, b);
¹ Based on Intel testing as of November 19, 2019. Configurations: Single-node: Intel Xeon Platinum 8280L processor running the Ansys Fluent benchmark suite with and without the "-platform=intel" option and the Intel MKL sparse LDU smoother. Eight-node cluster: Intel Xeon Platinum 8280L processor running the Ansys Fluent benchmark suite with and without the "-platform=intel" option and the Intel MKL sparse LDU smoother.
² The Intel AVX2–enabled code path option has been available with recent versions of Fluent and is enabled with the Fluent command-line option -platform=intel. This option speeds up execution time particularly for cases using the polyhedral cell type, like combustor_12m. The Intel MKL sparse LDU smoother option benefits cases that use Smooth_ILU4 calls within the Fluent solver. Those cases can be identified by adding software profiling to the Fluent journal input file, and then scanning the output to see if Smooth_ILU4 appears in the performance profile generated during a run.
³ If you try the preview, you can test the performance this way on your own real-world models. These profiling results are easily generated from an Ansys Fluent execution.
⁴ Based on Intel testing as of November 26, 2019. Configurations: Endeavor cluster: Intel Xeon Platinum 8260L processor (48 cores per cluster node) running the Ansys Fluent benchmark suite with and without the "-platform=intel" option and the Intel MKL sparse LDU smoother, and with Intel OPA interconnect.
Software and workloads used in performance tests may have been optimized for performance only on Intel microprocessors.
Performance tests, such as SYSmark and MobileMark, are measured using specific computer systems, components, software, operations and functions. Any change to any of those factors may
cause the results to vary. You should consult other information and performance tests to assist you in fully evaluating your contemplated purchases, including the performance of that product
when combined with other products. For more complete information visit www.intel.com/benchmarks.
Performance results are based on testing as of dates shown in configurations and may not reflect all publicly available updates. See backup for configuration details. No product or component
can be absolutely secure.
Your costs and results may vary.
Intel technologies may require enabled hardware, software, or service activation.
Intel’s compilers may or may not optimize to the same degree for non-Intel microprocessors for optimizations that are not unique to Intel microprocessors. These optimizations include SSE2,
SSE3, and SSSE3 instruction sets and other optimizations. Intel does not guarantee the availability, functionality, or effectiveness of any optimization on microprocessors not manufactured by
Intel. Microprocessor-dependent optimizations in this product are intended for use with Intel microprocessors. Certain optimizations not specific to Intel microarchitecture are reserved for Intel
microprocessors. Please refer to the applicable product User and Reference Guides for more information regarding the specific instruction sets covered by this notice.
You may not use or facilitate the use of this document in connection with any infringement or other legal analysis concerning Intel products described herein. You agree to grant Intel a non-
exclusive, royalty-free license to any patent claim thereafter drafted which includes subject matter disclosed herein.
No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.
The products described may contain design defects or errors known as errata which may cause the product to deviate from published specifications. Current characterized errata are available
on request.
© Intel Corporation. Intel, the Intel logo, and other Intel marks are trademarks of Intel Corporation or its subsidiaries. Other names and brands may be claimed as the property of others.
Printed in USA 0620/KD/PRW/PDF Please Recycle 342850-001US