A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic
A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic
September 7, 2021
This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.
A Survey of Numerical Linear Algebra
Methods Utilizing Mixed Precision
Arithmetic
Ahmad Abdelfattah1 , Hartwig Anzt1,2 , Erik G. Boman3 , Erin Carson4 , Terry Cojean2 , Jack
Dongarra1,5,6 , Alyson Fox7 , Mark Gates1 , Nicholas J. Higham6 , Xiaoye S. Li8 , Jennifer Loe3 ,
Piotr Luszczek1 , Srikara Pranesh6 , Siva Rajamanickam3 , Tobias Ribizel2 , Barry F. Smith9 ,
Kasia Swirydowicz10 , Stephen Thomas10 , Stanimire Tomov1 , Yaohung M. Tsai1 , and Ulrike
Meier Yang7
Abstract
The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to
scientific computing applications. Especially with the hardware integration of low-precision special-function units
designed for machine learning applications, the traditional numerical algorithms community urgently needs to
reconsider the floating-point formats used in the distinct operations to efficiently leverage the available compute power.
In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including
the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra
problems.
Keywords
Mixed Precision Arithmetic, Numerical Mathematics, Linear Algebra, High Performance Computing, GPUs
values in different formats is hardware independent and only 8 Lawrence Berkeley National Lab, Berkeley, USA
9 Argonne National Lab, Argonne, USA
depends on the size of the floating point formats. Specifically,
10 National Renewable Energy Lab, Boulder, USA
for the widely adopted IEEE754 formats for double precision
(64 bit) and single precision (32 bit) (IEEE), the runtime Corresponding author:
difference for memory/communication operations is roughly Hartwig Anzt, Karlsruher Institut für Technologie (KIT), Germany.
2×, independent of the hardware platform. At the same Email: [email protected]
Machine balance exponent and how many bits are used for the significand.
(# floating point operations per read) Generally, we use the term high precision for precision
100 formats that provide high accuracy at the cost of a larger
KNL V100
Core i7 memory volume (in terms of bits) and low precision to refer
P100
K40
Origin2000 M2050 KNC to precision formats that compose of fewer bits (smaller
Peak Mflop/s ÷ MW/s
T3E P4 Core2Duo RaspberryPi memory volume) and provide low(er) accuracy. Unless
10
PII PIII K Computer
r Pentium explicitly stated, we think of IEEE double precision when
C1060
ea
Cray X1 K10
ry
CM-5E
%
486DX2 NEC SX-7 when using the term low precision. These formats are of
30
NEC SX-5
C90 NEC SX-4 particular interest as they are natively supported by a broad
1VAX-11 Y-MP r
ea range of hardware architectures. However, in particular with
y
8088 per
15% the rise of machine learning, a number of architectures now
1975 1980 1985 1990 1995 2000 2005 2010 2015 2020 also provide native support for floating point formats, that
year are even more compact than IEEE754 single precision. In
particular IEEE754 half precision and BFloat16 are formats
Figure 1.1. Evolution of the machine balance of processors that experience increased interest by the community. For all
over different hardware generations. floating point formats, their bitwise configuration determines
the characteristics. Roughly speaking, the length of the
But while the idea of mixed-precision algorithms has been exponent of a precision format determines the range of a
around for several decades, recent hardware trends have format, the length of the significand determines the precision
motivated increased research and development activities. of a format. Relevant indicators in that context are the largest
Within the past few years, hardware vendors have started and smallest representable numbers in a format, and the
designing special-purpose units for low-precision arithmetic unit roundoff of a format u. In Table2.1 we list some of
in response to the machine learning community’s demand the most relevant formats used in modern scientific high
for high compute power in low-precision formats. Also, performance computing along with their key characteristics.
the server-level products are increasingly featuring low- Traditionally, hardware and software are strictly coupling
precision special function units (e.g., NVIDIA tensor cores the precision format used for arithmetic operations and for
in V100 GPUs) providing about 16×higher performance memory operations. However, given that most architectures
than what can be achieved in IEEE double precision. are nowadays overprovisioned for arithmetic operations,
Exploiting this compute power efficiently could offer up there exist trends to break up this strict coupling. On
to an order of magnitude of speedup to compute-bound the hardware side, a recent example of an architecture
algorithms. At the same time, the gap between the compute breaking up the coupling between memory format and
power on the one hand and the memory bandwidth on arithmetic format are the NVIDIA Tensor Cores integrated
the other hand keeps increasing, making data access and into NVIDIA’s Volta GPU architecture v10 (2017). These
communication progressively more expensive compared special function units designed to perform high performance
with arithmetic operations (Figure 1.1). Given the over matrix matrix multiplication take half precision (FP16) input
provisioning of modern hardware for arithmetic operations, data, but compute in FP32 v10 (2017). On the software side,
it may be a rational decision for memory-bound algorithms the concept of a memory accessor separating the memory
to compress all data in cache before communicating with precision from the arithmetic precision pursues the same
remote processors or main memory goal: computing in higher precision while handling the data
In this paper, we present mixed-precision linear algebra in lower precision in memory Anzt et al. (2019b). In 4.3
algorithms and the attainable performance advantages for we will detail the software-based approach in more detail.
dense linear algebra (Section 3) and for sparse linear algebra For the V100 GPU experiments in Section 3, we claim
(Section 4). We conclude in Section 5 with an outlook on that the algorithms operating on the Volta tensor cores use
current algorithm development and perspectives for mixed- half precision, acknowledging that internally, the arithmetic
precision technology on future architectures. We note that operations are using higher precision after converting the half
this survey is focusing on numerical linear algebra operating precision input data.
on explicitly-available linear operators, matrix-free methods
remain outside the scope of this work.
3 Dense Linear Algebra
2 Precision Formats, Hardware Realization, Carefully designed mixed-precision dense linear algebra
algorithms can leverage the potential performance advan-
and Notation tages of low-precision arithmetic. With this in mind, we
Before presenting mixed precision algorithms, we want start the section in Section 3.1 by presenting basic linear
to establish some notation we use throughout the rest algebra subroutines specifically designed to exploit the com-
of the paper, and provide some background on precision pute power of NVIDIA’s Tensor Cores, which provide high
formats and their realization in hardware. We exclusively compute power in low precision. Building on low-precision
focus on floating point formats that are composed of a Basic Linear Algebra Subprograms (BLAS) and guided by
sign bit, a sequence of exponent bits, and a sequence of the concept of Newton’s method (Section 3.2), it is possible
significand bits. The distinct precision formats then differ in to derive high performance linear solvers running in low
the composition in terms of how many bits are used for the precision that, embedded in an iterative refinement (Section
Table 2.1. Parameters for various floating-point arithmetics. “Range” denotes the order of magnitude of the smallest subnormal
(xmin,s ) and largest and smallest positive normalized floating-point numbers. BFloat16 does not support subnormal numbers.
3.3), succeed in generating high-accuracy solutions while supported half-precision arithmetic in the Maxwell GPU
conducting most of the work in low-precision arithmetic. The architecture. Throughout this section, we will focus on
standard approach is based on factorizing a matrix in low pre- NVIDIA’s GPUs and math libraries to highlight half-
cision and using an iterative refinement scheme in high pre- precision developments for numerical kernels.
cision to recover a high accuracy solution (see Section 3.3). While NVIDIA’s Maxwell GPU architecture introduced
However, for numerical reasons, it can be advantageous to hardware support for IEEE FP16 arithmetic, the Volta
use the factorization computed in low precision as a pre- architecture, which powers the Summit supercomputer,∗
conditioner for a Generalized Minimum Residual (GMRES) comes with hardware acceleration units (called Tensor
iterative solver embedded in an iterative refinement loop Cores) for matrix multiplication in FP16. These Tensor
(see Section 3.4). Using sophisticated scaling and shifting Cores are theoretically 12× faster than the theoretical FP16
techniques, symmetry and positive definiteness of a system peak performance of the preceding architecture (Pascal
matrix can be exploited in a Generalized Minimum Residual- architecture). Applications taking advantage of the Tensor
based Iterative Refinement (GMRES-IR) variant using a low- Cores can run up to 4× faster than using the regular FP16
precision Cholesky factorization as a preconditioner (see arithmetic on the same GPU. The Tensor Cores are also
Section 3.5). In Section 3.6, we present some performance able to perform a mixed-precision multiplication with a low-
results demonstrating the potential of these techniques on precision input (e.g., half-precision) and a higher-precision
modern GPU architectures. The scope of mixed-precision output (typically single-precision). The Tensor Core units are
iterative refinement is not limited to linear systems and discussed in more detail in Section 3.1.1.
extends to least-square problems (Section 3.7) and eigen- In terms of half-precision BLAS, most of the avail-
value solvers (Section 3.8). able routines provide only dense matrix multiplications
(GEMMs). From the perspective of machine learning appli-
cations, most of the performance-critical components in
3.1 Low-Precision BLAS training/inference can be reformulated to take advantage
The revolution of machine learning applications and artificial of the GEMM kernel. As for dense linear algebra, many
intelligence (AI) spiked an interest in developing high- high-level algorithms are built to extract their high per-
performance 16-bit, half-precision floating point arithmetic formance from GEMM calls. Therefore, accelerating such
(FP16), because most AI applications do not necessarily performance-critical steps through FP16 GEMM (HGEMM)
require the accuracy of 32-bit, full-precision floating point would propagate the performance advantage to the entire
arithmetic (FP32) or 64-bit, double-precision floating point algorithm while keeping other numerical stages in their orig-
arithmetic (FP64) (Gupta et al. 2015). FP16 also enables inal precision(s). An example of this practice is the mixed
machine learning applications to run faster, not only because precision dense LU factorization (Haidar et al. 2018b), which
of the faster arithmetic, but also because of the reduction in is used to accelerate the solution of Ax = b in double
memory storage and traffic by a factor of 2× against FP32, precision, see Section 3.3.
and by a factor of 4× against FP64. 3.1.1 Hardware Acceleration of Half Precision The
In terms of vendor support, NVIDIA, Google, and CUDA Toolkit is one of the first programming models to
AMD manufacture hardware that is capable of performing provide half-precision (i.e., FP16) arithmetic. Support was
FP16 arithmetic. Google’s Tensor Core Processing Units added in late 2015 for selected embedded GPU models
(TPUs) are customized chips that are mainly designed for based on the Maxwell architecture, and FP16 arithmetic
machine learning workloads using the 16-bit brain floating has become mainstream in CUDA-enabled GPUs since
point (BFloat16) format. AMD also provides half-precision the Pascal architecture. FP16 has a dynamic range that is
capabilities, and their software stack shows support for both significantly smaller than single or double precision (see
the BFloat16 format and the IEEE FP16 format, (IEEE). Table 2.1).
The theoretical performance of half precision on AMD The Volta and Turing architectures introduced hardware
GPUs follows the expected 2× speedup against FP32 acceleration for matrix multiplication in FP16 using the
and 4× speedup against FP64. As an example, the Mi50 aforementioned Tensor Cores. Using Tensor Cores for FP16,
GPU has a theoretical FP16 performance of 26.5 TFLOP/s these GPUs can deliver a theoretical peak performance that
vs. 13.3 TFLOP/s for FP32 and 6.6 TFLOP/s for FP64.
But perhaps the most widely accessible hardware with
half-precision capability are NVIDIA’s GPUs, which first ∗ https://www.olcf.ornl.gov/summit/
100
precision can also be used within a Newton step: F (xk )
can be evaluated at a precision higher than the working
10-1 precision to inject more information into the iteration. A
detailed analysis of Newton’s method in mixed-precision
floating-point arithmetic is given by Tisseur (2001). Our
10-2 interest here is in using Newton’s method as a tool for
0
10
20
30
40
50
60
70
80
90
0
10
11
12
13
in this case, the constraint on the condition number can xmax denotes the largest finite floating-point number (see
be relaxed compared to the LU-based refinement scheme. Table 2.1).
For example, with factorization precision FP16, working
precision FP32, and residual precision FP64, we can expect Algorithm 3.2 (Two-sided diagonal scaling then round.)
forward and backward errors on the order of FP32 as long This algorithm rounds A ∈ Rn×n to the FP16 matrix A(h) ,
as κ∞ (A) < 108 . We refer to (Higham 2019, Table 3.1) for scaling all elements to avoid overflow. θ ∈ (0, 1] is a
limiting forward and backward errors for this GMRES-based parameter.
approach when two precisions are used, with the residual 1: Apply any two-sided diagonal scaling algorithm to A, to
precision equal to the working precision. obtain diagonal matrices R, S.
The idea behind GMRES-IR is that even though the low- 2: Let β be the maximum magnitude of any entry of T AS.
precision LU factors may be of low quality, they can still 3: µ = θxmax /β
be effective preconditioners in using the GMRES method 4: A(h) = f l` (µ(T AS))
to solve the correction equation Adk = rk , resulting in an
effective solve precision us = u. The condition number of
the resulting preconditioned system is reduced enough to For FP16, in light of the narrow range, we will also
guarantee backward stability of the approximate solution multiply the shifted matrix by a scalar to bring it close to the
computed by GMRES even for matrices that are nearly overflow level xmax and to minimize the chance of underflow
numerically singular with respect to the working precision. and of subnormal numbers being produced.
In contrast, using a basic triangular solve with the low- Higham et al. (2019) recommend two different algorithms
precision LU factors to solve Adk = rk , for which us = u` , for determining T and S; both algorithms are carried out at
provides no degree of relative accuracy once κ∞ (A) exceeds the working precision. The first option is row and column
u−1
` . Using preconditioned GMRES, we can still guarantee equilibration, which ensures that every row and column
that the solution of the correction equation has some correct has the maximum element in modulus equal to 1—that
digits and a residual at the level of the convergence tolerance is, each row and column is equilibrated. The LAPACK
requested by the algorithm despite the apparent low quality routines xyyEQU carry out this form of scaling Anderson
of the computed preconditioners. et al. (1999). The second option, for symmetric matrices,
Since this paper focuses on the practical usage and is a symmetry-preserving two-sided scaling proposed by
possible performance gains rather than error analysis, we Knight et al. (2014). The algorithm is iterative and scales
point the reader to Higham (2002), Carson and Higham simultaneously on both sides rather than sequentially on one
(2017), Carson and Higham (2018), and Higham (2019) for side and then the other.
detailed error analysis of both standard iterative refinement For more details on scaling see Higham et al. (2019),
and GMRES-IR. Of course, in order to be beneficial, it is Higham and Pranesh (2019a), and Carson et al. (2020).
necessary that the total number of GMRES iterations and the
total number of refinement steps remains small. As shown in
Carson and Higham (2017) and Carson and Higham (2018),
3.5 Low-Precision Cholesky Factorization
this is indeed the case for many problems. In the previous section, we considered general matrices. We
We note that the HPL-AI mixed-precision benchmark,† now assume that we are given a symmetric positive definite
which is designed to take into account the availability matrix A ∈ Rn×n in precision u and wish to compute a
of hardware accelerators for low-precision computation, is Cholesky factorization at precision u` > u for use in iterative
based on GMRES-IR. refinement. The most practically important cases are where
(u` , u) = (half, single), (half, double), or (single, double).
3.4.1 Scaling It is clear that the use of low-precision
The obvious approach is to form A(`) = f l` (A), where f l`
floating point arithmetic in iterative refinement can lead to
denotes the operation of rounding to precision u` , and then
significant speedups. However, FP16 has a small dynamic
compute the Cholesky factorization of A(`) in precision u` .
range, and therefore encountering overflow, underflow, and
However, this approach can fail for two reasons. First, if
subnormal numbers is very likely‡ .
FP16 is used, then the limited range might cause overflow
We consider a two-sided diagonal scaling prior to during the rounding. Second, for both BFloat16 and FP16,
converting to FP16: A is replaced by RAS, where: A(`) can fail to be (sufficiently) positive definite, because
a matrix where the smallest eigenvalue is safely bounded
T = diag(ti ), S = diag(si ), ti , si > 0, i = 1 : n.
away from zero with respect to single precision or double
Such scaling algorithms have been developed in the context precision can become numerically indefinite under rounding
of linear systems and linear programming problems. In to half precision. The second issue can also arise when
contrast to previous studies (see Elble and Sahinidis (2012)), a double precision matrix is rounded to single precision.
where the aim of scaling has been to reduce a condition To overcome these problems, Higham and Pranesh (2019b)
number or to speed up the convergence of an iterative method propose scaling and shifting.
applied to the scaled matrix, we scale in order to help
squeeze a single-precision or double-precision matrix into
half precision, with a particular aim to use the resulting half- † https://icl.bitbucket.io/hpl-ai/
precision LU factors for iterative refinement. ‡ Note that some hardware architectures, e.g., NVIDIA tensor cores, perform
Higham et al. (2019) propose the use of two-sided computations and accumulations in higher precision, and only truncate
diagonal scaling given in Algorithm 3.2. Recall that down to FP16 when writing the results to main memory.
3.5.1 Scaling The first step is to scale the matrix A to Algorithm 3.1), mixed-precision factorizations apply higher
1/2
the unit diagonal matrix H = D−1 AD−1 , D = diag(aii ), precision (e.g., uw ) at critical parts of the algorithm to obtain
and D will be kept at precision u. Cholesky factorization is more accurate factorizations while retaining the performance
essentially numerically invariant under two-sided diagonal of the low-precision counterpart.
scaling, so the sole reason for scaling is to reduce the The mixed-precision factorizations were motivated by the
dynamic range in order to avoid overflow and reduce the need to get extra precision when working with very low
chance of underflow for FP16. For BFloat16 or FP32, it is precisions, like the FP16. Also, this allows one to easily
not usually necessary to scale, and we can work with A overcome implementation issues and other limitations of
throughout. using FP16 arithmetic and harness the power of specialized
hardware (e.g., Tensor Cores) for a larger range of scientific
3.5.2 Shifting We now convert H to the lower-precision
computing applications.
u` , incorporating a shift to ensure that the lower-
precision matrix is sufficiently positive definite for Cholesky The developments were applied to GPU Tensor Cores and
factorization to succeed. We shift H by cn u` I, where cn is illustrate that FP16 can be used to obtain FP64 accuracy for
a positive integer constant, to obtain G = H + cn u` I. Since problems with κ∞ (A) of up to 105 , compared to a more
the diagonal of H is I, this shift incurs no rounding error, typical requirement of κ∞ (A) < 104 . The work illustrates
and it produces the same result whether we shift in precision that mixed-precision techniques can be of great interest for
u and then round or round and then shift in precision u` . linear solvers in many engineering areas. The results show
that on single NVIDIA V100 GPU, the new solvers can be
Our final precision-u` matrix is constructed as:
up to 4× faster than an optimized double-precision solver
G = H + cn u` I, (Haidar et al. 2018b), (Haidar et al. 2017), (Haidar et al.
2018a), and (Haidar et al. 2020).
β = 1 + cn u` , µ = θxmax /β, (3.2)
A building block for the mixed-precision factorizations is
(h)
A = f l` (µG), mixed-precision BLAS. Having mixed-precision BLAS can
ease the development of many mixed-precision LAPACK
where θ ∈ (0, 1) is a parameter. Note that β = maxij |gij |, algorithms. Currently, cuBLAS provides a mixed FP32-FP16
so the largest absolute value of any element of A(h) is precision HGEMM that uses the GPU’s Tensor Cores for
θxmax . Note also that since the growth factor for Cholesky FP16 acceleration. In this GEMM, the input matrices A and
factorization is 1 (see Higham (2002, Problem 10.4)), there B can be FP32, be internally cast to FP16, used to compute
is no danger of overflow during Cholesky factorization of a GEMM on Tensor Cores in full (FP32) accuracy, and then
A(h) . the result is stored back on the GPU memory in FP32. There
Higham and Pranesh (2019a, Section 3.3) provide analysis are two main benefits to having such mixed-precision BLAS
suggesting the choice of cn . A pragmatic approach is routines. First, note that this mixed-precision HGEMM is
to take cn to be a small constant, and if the Cholesky almost as fast as the non-mixed FP16 HGEMM (Figure 3.2).
factorization fails, increase c and try again. Based on this, we Second, the use of mixed-precision gains about one more
present the low-precision Cholesky factorization algorithm decimal place of accuracy (Figure 3.3).
in Algorithm 3.3. Aside from the two main benefits outlined above,
the availability of mixed-precision GEMMs also enables
Algorithm 3.3 (Low-precision Cholesky factorization.) us to easily develop other mixed-precision algorithms
Given a symmetric positive definite A ∈ Rn×n in precision (e.g., LAPACK), including the various mixed-precision
u, this algorithm computes an approximate Cholesky factorizations that we recently added in MAGMA (Haidar
factorization RT R ≈ µD−1 AD−1 at precision u` , where et al. 2018b). Figure 3.5 shows the performance of the
1/2
D = diag(aii ). The scalar θ ∈ (0, 1] and the positive mixed-precision LU (marked as “FP16-TC hgetrf LU”).
integer c are parameters. Note that this factorization is about 4×–5× faster than
1/2 dgetrf. Its data storage is in FP32, and the implementation
1: D = diag(aii ), H = D−1 AD−1
2: G = H + cu` I is the same as sgetrf, except that it uses the mixed-precision
3: β = 1 + cu` HGEMMs for the trailing matrix updates.
4: µ = θxmax /β Figure 3.6 shows the mixed-precision iterative refinement
5: A(h) = f l` (µG) in MAGMA (Haidar et al. 2018b), which uses a backward
6: Attempt Cholesky factorization A(h) = RT R in preci- error criterion for convergence. The 4× overall acceleration
sion u` . is due to a number of optimizations. First, note that the 3
7: if Cholesky factorization failed then iterations to get to FP64 accuracy led to a loss of about
8: c ← 2c, goto line 2 2 TFLOP/s compared with the hgetrf performance (24
9: end if TFLOP/s vs. 26 TFLOP/s) (i.e., the overhead of one iteration
can be deduced as being about 2%). Losing 75% (e.g.,
through up to 40 iterations) would lead to no acceleration
compared to the FP64 solver. This overhead per iteration
3.6 Mixed-precision Factorizations is very low, owing to fusing all data conversions with
Haidar et al. (2018b) proposed iterative refinement methods computational kernels. Without fusion, the overhead would
using mixed-precision factorizations. While classical itera- have been easily about 3× higher. Second, note that iterative
tive refinement and extensions like the GMRES-IR use fixed- refinement using the mixed-precision factorization has less
precision factorizations (e.g., in precision u` as illustrated in than half of the overhead in terms of iterations to solution
14
12
value of A. Higham and Pranesh (2019b) assume that A is
10 well conditioned and propose the Cholesky and GMRES-IR-
8 based least squares solver given in Algorithm 3.4.
6
4
2
Algorithm 3.4 (Cholesky-based GMRES-IR for the least
0
2k 4k 6k 8k 10k 14k 18k
matrix size
22k 26k 30k 34k squares problem.) Let a full rank A ∈ Rm×n , where m ≥ n,
and b ∈ Rm be given in precision u. This algorithm solves
Figure 3.5. Mixed-precision LU (hgetrf) in MAGMA and its the least squares problem minx kb − Axk2 using Cholesky-
speedup vs. FP64 LU. based GMRES-IR. The scalar θ ∈ (0, 1] and the positive
integer c are parameters.
Performance of solving Ax=b
using FP64 or IR with GMRes to achieve FP64 accuracy3 1: Compute B = AS, where S = diag(1/kaj k2 ), with aj
24
22
FP16-TC->64 dhgesv
FP16->64 dhgesv
3 the jth column of A.
20
FP32->64 dsgesv
FP64 dgesv 3 105 2: µ = θxmax
18 3 7
6
3: B (h) = f l` (µ1/2 B)
16 7 104 4: Compute C = B (h)T B (h) in precision u` .
14 3 6 4X 5: Compute the Cholesky factorization C+
Tflop/s
κ∞ (A)
3 3
12
3
6 6
2
2
2 10
cu` diag(cii ) = RT R in precision u` .
10 6 2 2
3 2 2
6: if Cholesky factorization failed then
8 2 10
3 6 7: c ← 2c, goto line 5
6 3 6 2
4 3
6 2 10
1 8: end if
3
6
2
5 2
2
9: Form b(h) = f l` (SAT b).
2
0 10
0 10: Solve RT Ry0 = b(h) in precision u` and form x0 =
2k 4k 6k 8k 10k 14k 18k 22k 26k 30k 34k
Matrix size µSy0 at precision u.
11: for i = 0 : imax − 1 do
Figure 3.6. Mixed-precision iterative refinement in MAGMA and 12: Compute ri = AT (b − Axi ) at precision ur and
acceleration vs. FP64 solvers. Note ≈ 2% overhead per round ri to precision u.
iteration, and less than half the overhead in terms of iterations 13: Solve M AT Adi = M ri by GMRES at precision u,
for mixed-precision LU vs. regular FP16 LU (the 3 vs. 7
where M = µSR−1 R−T S and matrix–vector products
iterations until FP64 convergence). The condition numbers of
the matrices are computed using FP64. with AT A are computed at precision ur , and store di at
precision u.
14: xi+1 = xi + di at precision u.
(3 vs. 7 iterations until FP64 convergence). This is due to 15: if converged then
the extra digit of accuracy that the mixed-precision HGEMM 16: return xi+1 , quit
has over the FP16 HGEMM, which also translates to a more 17: end if
accurate mixed-precision LU. 18: end for
Using mixed precision Cholesky factorization in Algo-
rithm 3.3, Abdelfattah et al. (2020) obtain speedups of up
to 4.7 over a double precision solver on an NVIDIA V100. Line 1 of Algorithm 3.4 produces a matrix B with
columns of unit 2-norm. The computation C = B (h)T B (h)
3.7 Iterative Refinement for Least Squares on line 4 produces a symmetric positive definite matrix with
constant diagonal elements µ = θxmax , so overflow cannot
Problems
occur for θ < 1. The shift on line 5 is analogous to that in
We consider the linear least squares problem minx kAx − Algorithm 3.3, but here the matrix C is already well scaled
bk2 , where A ∈ Rm×n with m ≥ n having full rank. The and in precision u` , so there is no need to scale C to have
idea of mixed-precision iterative refinement and GMRES-IR unit diagonal.
for square linear systems can be adapted to the least
Note that although Algorithm 3.4 explicitly forms C =
squares case. Least squares problems may be ill conditioned
B (h)T B (h) in Algorithm 3.4, C is used to form a
in practice, and so rounding errors may result in an
preconditioner, so the usual problems with forming a cross-
insufficiently accurate solution. In this case, iterative
product matrix (loss of significance and condition squaring)
refinement may be used to improve accuracy, and it also
are less of a concern. Also note that if we are working in
improves stability.
FP16 on an NVIDIA V100, we can exploit the Tensor Cores
3.7.1 Cholesky-Based Approach The normal equations when forming C to accumulate block fused multiply-add
method solves: operations in single precision; this leads to a more accurate
AT Ax = AT b C, as shown by the error analysis of Blanchard et al. (2020).
For the computed R̂, we have: of linear systems are applicable. The only thing that
must change is the analysis of the method for solving
R̂T R̂ ≈ B (h)T B (h) ≈ µSAT AS, the correction equation, since we now work with a QR
factorization of A, which can be used in various ways.
or
The work in Carson et al. (2020) also extends the GMRES-
(AT A)−1 ≈ µS R̂−1 R̂−T S,
based refinement scheme of Carson and Higham (2017) to
so we are preconditioning with an approximation to the the least squares case and shows that one can construct a left
inverse of AT A. For large n, as long as GMRES converges preconditioner using the existing QR factors of A such that
quickly, the cost of the refinement stage should be negligible GMRES provably converges to a backward stable solution of
compared with the cost of forming AT A and computing the the preconditioned augmented system. Further, it is shown
Cholesky factorization. that an existing preconditioner developed for saddle point
We also mention the Cholesky-QR algorithm for systems can also work well in the GMRES-based approach
computing a QR factorization A = QR. It forms the cross- in practice, even though the error analysis is not applicable.
product matrix AT A, computes the Cholesky factorization We refer the reader to Carson et al. (2020) for further details.
AT A = RT R, and then obtains the orthogonal factor Q as For details of convergence tests for iterative refinement of
Q = AR−1 ; this process can be iterated for better numerical least squares problems see Demmel et al. (2006).
stability (Fukaya et al. 2020). Mixed precision can be
exploited in this algorithm, as shown by Yamazaki et al. 3.8 Eigenvalue Problems
(2015) and Yamazaki et al. (2016). Newton’s method can be used to refine an approximate
3.7.2 Augmented Matrix Approach As mentioned, the eigenpair of a matrix by defining a function F : Rn+1 →
Cholesky-based approach described in the previous section Rn+1 that has as its first n components (A − λI)x and
is intended only for the case where the matrix is very well its last component eTs x − 1 for some unit vector es , with
conditioned. Another approach to mixed-precision least- this last component serving to normalize x. If an initial
squares iterative refinement with a less stringent requirement eigen decomposition is available, it can be exploited to
on the condition number is presented by Carson et al. (2020). simplify the implementation of the Newton iteration. This
This approach is based on using the QR factorization: idea was developed by Dongarra (1982) and Dongarra et al.
(1983), building on a Schur decomposition and allowing
R the residual (A − λI)x to be computed in higher precision.
A=Q ,
0 Algorithm 3.5 implements this procedure, called SICE,
which, in each iteration, solves a linear system resulting
where Q = [Q1 , Q2 ] ∈ Rm×m is an orthogonal matrix with from a rank-1 update in order to refine a single eigen-
Q1 ∈ Rm×n and Q2 ∈ Rm×(m−n) , and R ∈ Rn×n is upper pair. The rank-1 update is introduced while replacing one
triangular. The unique least squares solution is x = R−1 QT1 b column in A − λI to remove one degree of freedom on
with residual kb − Axk2 = kQT2 bk2 . eigenvector correction and, at the same time, compute a
An iterative refinement approach for least squares systems correction for the corresponding eigenvalue. The original
was suggested by Björck (1967a). Refinement is performed formulation (Dongarra 1982) solves the system with two
on the augmented system series of Givens rotations to make it upper triangular. This
I
A r b process is hard to parallelize on modern architectures. Also,
= , (3.3) some form of orthogonalization should be considered while
AT 0 x 0
using the algorithm to refine more than one eigenvalue.
which is equivalent to the normal equations. In this way, This idea has been extended to the generalized eigenvalue
the solution xi and residual ri for the least squares problem problem by Tisseur (2001) and Davies et al. (2001).
are simultaneously refined. Björck (1967a) shows that this For the symmetric eigenvalue problem, Petschow et al.
augmented system can be solved by reusing the QR factors (2014) use extra precision to improve the accuracy of
of A. the multiple relatively robust representations (MRRR)
Existing analyses of the convergence and accuracy of algorithm, with little or no performance penalty.
this approach in finite precision assume that, at most, two Algorithm 3.6 shows another iterative refinement proce-
precisions are used; the working precision u is used to dure from Ogita and Aishima (2018) for solving a symmetric
compute the QR factorization, solve the augmented system, eigenvalue problem. This method also succeeds for clustered
and compute the update. A second precision ur ≤ u is used eigenvalues Ogita and Aishima (2019). Lines 4, 5, and 10
to compute the residuals. Typically ur = u2 , in which case represent the compute-intensive parts of the algorithm, which
it can be shown that as long as the condition number of the amounts to 4 calls to the matrix-matrix multiply function
augmented system matrix is smaller than u−1 , the refinement xGEMM. Line 8 uses the 2-norm, but it is recommended to
process will converge with a limiting forward error on the approximate using the Frobenius norm, because it is much
order of u; see Björck (1990) and Higham (2002, sect. 20.5) easier to compute in practice. Line 9 is an element-wise
and the references therein. operation to construct the refinement matrix E. Line 10 is
Carson et al. (2020) show that the three-precision iterative the update of eigenvectors by applying the refinement matrix
refinement approach of Carson and Higham (2018) can E. High-precision arithmetic is required for all computations
be applied in this case; the theorems developed in Carson except line 8 for the matrix norm. Even though the algo-
and Higham (2018) for the forward error and normwise rithm may be applied for only a subset of ` eigenvectors,
and componentwise backward error for iterative refinement the refinement iterations are confined to the corresponding
Algorithm 3.5 SICE algorithm for iteratively refining Algorithm 3.6 Iterative refinement for symmetric eigenvalue
computed eigenvalue. problem.
1: function [x, λ] ← SICE(A, x0 , λ0 ) 1: Input: A = AT ∈ Rn×n , X b ∈ Rn×` , 1 ≤ ` ≤ n
[Q, T ] ←schur(A) . Schur decomposition to find 0 n×`
2: 2: Output: X ∈R , De = diag λei ∈ R`×` ,
A = QT QT where T is upper quasitriangular. Ee ∈ R`×` , ω ∈ R
3: [m, s] ← max(x0 ) . Find maximum value and h i
3: function X 0 , D,e E, e ω ← R EF S Y E V(A, X)b
index in the eigenvector.
4: x0 ← x0 /m . Normalize 4: R ← In − X bT X b
T
5: for i = 1, 2, . . . do 5: S ← X AX
b b
6: c ← −xi−1 − (A − λi−1 I)[:, s] . Column s of 6: λbi ← sii /(1 − rii ) for i = 1, . . . , ` . Compute
A − λi−1 I approximate eigenvalues.
7: d ← QT c 7: De ← diag λei
8: f ← eTs Q . Row s of Q
9: Solve the rank-1 updated system 8: ω ←2 S−D e + kAk2 kRk2
2
Q(T − λi−1 I + df T )QT yi = Axi−1 − λi−1 xi−1
sij +λej rij if λ ei − λ
ej > ω
10: λi ← λi−1 + yi [s] . Eigenvalue correction. 9: eij ← e j −λ
λ ei
for
11: xi ← xi−1 + yi . Eigenvector correction. rij /2 otherwise
12: xi [s] ← xi−1 [s] . Restore x[s]. 1 ≤ i, j ≤ ` . Compute the entries of the refinement
13: if 2 × yi [s] > yi−1 [s] then matrix E.
e
14: Break from for loop. 10: X0 ← X b +X
bEe . Update X b by X(I
b n + E)e
15: end if 11: end function
16: end for
17: x ← xi
18: λ ← λi
19: end function
algebra methods utilize components that are not critical solved using the low-precision factorization. However, the
to the final accuracy (e.g., preconditioners or individual triangular factors are (block-) sparse, and for solving the
operations in a larger algorithm) in lower precision than triangular systems, the same strategy of gathering data from
working precision, or trade low-precision memory access the sparse data structures into contiguous memory proves
against additional iteration steps. In Section 4.2, we present successful,
a theoretical analysis of the rounding effects low-precision (block-) sparse triangular solve:
computations have on the accuracy of Krylov solvers.
However, as previously mentioned, it is usually not the 1. gather the values from the sparse triangular structures
arithmetic computations that limit the performance of into dense blocks in registers / fast memory and
iterative algorithms for sparse problems, but rather it is the 2. invoke dense linear algebra kernels to solve for the
communication and memory bandwidth. In Section 4.3, we right-hand side.
present the idea of radically decoupling the format used Again, gathering the data in dense blocks enables the use of
for arithmetic computations from the format that is used efficient dense linear algebra kernels. Using low precision,
for communication and memory operations. This concept the memory-bound “gather” step benefits from a reduced
can span from using lower precision for memory accesses memory access volume, and the dense linear algebra kernels
to using dedicated compression techniques before invoking benefit from the higher performance in low precision.
communication operations. Examples of how this concept of For dense linear algebra, the performance benefits of
format decoupling and compression helps accelerate sparse mixed-precision iterative refinement over high-precision
linear algebra include preconditioners for iterative solvers dense direct solvers mostly correlate with the hardware-
(Section 4.4) and multigrid methods (Section 4.5). specific arithmetic performance limits in the different
precision formats. In particular, the performance benefits are
4.1 Mixed Precision Sparse Direct Solvers mostly independent of the problem characteristics. This is
The factorization process of a sparse matrix usually different when using mixed-precision iterative refinement
generates fill-in elements, significantly increasing the for the direct solution of sparse problems, as the matrix
number of nonzero elements in the factors. The fill-in is structure determines the amount and structure of the fill-
usually structured, and the fill-in locations can be predicted in, the efficiency of the dense linear algebra kernels
from the sparsity pattern of the original system matrix. To operating on the induced dense blocks, and the ratio between
improve performance and memory efficiency, factorization- memory operations and arithmetic operations. As a result,
based sparse solvers typically operate in a block-sparse it is much harder to predict whether the mixed-precision
fashion: forming blocks covering the nonzero elements iterative refinement variant of a sparse direct solver provides
reduces the indexing information to index the blocks, and performance benefits over the execution of a sparse direct
storing the elements as small dense blocks allows for solver in high precision.
the application of highly efficient dense linear algebra
operations. There exist two options for realizing the concept 4.2 Mixed-Precision Krylov Solvers
of block-sparse factorizations. One is to convert the system
The scope of our review includes both Lanczos-based (short-
matrix into a block-sparse matrix prior to the factorization
term recurrence) and Arnoldi-based (long-term recurrence)
process. The other, more popular, one is based on forming
methods and the associated methods for solving linear
the dense blocks “on-the-fly” in registers/fast memory during
systems of equations Ax = b. In the context of long-
the factorization process, (block-) sparse factorization:
term recurrence methods, we consider both the Arnoldi-QR
1. gather the values from sparse data structures into dense algorithm with the modified Gram-Schmidt implementation
blocks in registers/fast memory; of the GMRES Krylov subspace method for iteratively
2. invoke dense linear algebra kernels on dense blocks; solving linear systems of equations as well as Flexible
and GMRES (FGMRES). The emphasis here is to examine
3. scatter results into the sparse output data structure. the approaches employed to date that incorporate mixed-
precision floating point arithmetic to speed-up computations
Similar to dense linear solvers, sparse direct solvers can
while retaining some or all of the numerical properties of the
also benefit from the mixed-precision iterative refinement
original algorithms in FP64 arithmetic (i.e., representation
framework presented in Section 3.3: The (block-) sparse
error and loss of orthogonality).
factorization is computed in low precision, thereby
leveraging the corresponding high compute power, and 4.2.1 Lanczos-CG First, we briefly summarize the most
the iterative refinement process recovers a high-accuracy well-known results on the finite precision behavior of
solution. Contrary to the dense case, the low precision Lanczos and CG methods and discuss how such results could
(block-) sparse factorizations not only benefit from higher potentially be extended to the mixed precision case and
arithmetic performance in the invocation of the compute- existing progress in this area. We also note that the literature
bound dense linear algebra kernels, but they also benefit from on finite precision behavior of Lanczos-based methods is
the reduced memory access volume in the memory-bound expansive, and we cannot hope to fully describe it here. For
gather and scatter operations. a more thorough account and historical references, we point
The iterative refinement process for recovering high- the reader to the survey of Meurant and Strakoš (2006).
precision solutions for a sparse linear system is conceptually Fundamental relations dealing with the loss of orthog-
identical to the dense case: like in Algorithm 3.1, an onality and other important quantities in finite precision
error correction equation computed in high precision is Lanczos have been derived by Paige (1980). These results
were subsequently used by Greenbaum to prove backward theoretical study, which we believe can be achieved by
stability-like results for the CG method (Greenbaum 1989); extending the results in Sleijpen and van der Vorst (1996)
namely, Greenbaum showed that CG in finite precision can and the related work of Van Der Vorst and Ye (2000) to the
be seen as an exact CG run on a larger linear system, in mixed precision setting.
which the coefficient matrix has eigenvalues in tight clusters
4.2.2 Flexible GMRES Much work has been done
around the eigenvalues of the original matrix, where the
involving the use of lower-precision preconditioners within
diameter of these clusters depends on properties of the
iterative solvers, in particular, GMRES and FGMRES run in
matrix and the machine precision. Greenbaum also proved
a higher precision.
fundamental results on the maximum attainable accuracy in
Arioli and Duff (2009) rigorously prove, that using
finite precision, that is, the limiting value of kxk − xk/kxk
a triangular factorization computed as a preconditioner
for approximate solutions xk and true solution x, in CG and
in FP32, FGMRES run in FP64 produces a solution
other “recursively computed residual methods” (Greenbaum
with backward error to FP64 accuracy. In contrast, they
1997). The results of Paige and Greenbaum have also been
demonstrate that using FP64 iterative refinement as the
extended to s-step Lanczos/CG variants in Carson (2015),
solver may fail in such cases. They provide numerical
where it is shown that s-step Lanczos in finite precision
experiments which support their theoretical analysis. This
behaves like a classical Lanczos run in a lower “effective”
builds on the previous work of Arioli et al. (2007), in which
precision, where this “effective” precision depends on the
it is proved that FGMRES is backward stable.
conditioning of the polynomials used to generate the s-step
Building on the work of Arioli and Duff (2009), Hogg
bases. We believe that these existing results can be extended
and Scott (2010) develop a single-precision (FP32) imple-
to the mixed precision case.
mentation of an LDLT factorization method for solving
Existing results in the area of mixed precision Lanczos- sparse-symmetric linear systems. This FP32 factorization is
based methods are contained within the work on “inexact then used as a preconditioner within FP64 iterative solvers,
Krylov subspace methods,” which also applies to Arnoldi- including iterative refinement and FGMRES, effectively
based methods (see Simoncini and Szyld (2003) and van den creating a mixed-precision solver. They demonstrate that
Eshof and Sleijpen (2004)). Within such frameworks, it for linear systems that are sufficiently well conditioned,
is assumed that the matrix-vector products are computed the mixed-precision approach was sufficient for obtaining
with some bounded perturbation (which can change in each FP64 accuracy; the remaining cases required a full FP64
iteration), and all other computation is exact. These methods implementation. Additionally, it is shown that the mixed-
were motivated by improving performance in applications precision approach is beneficial in terms of performance for
where the matrix-vector products dominate the cost of sufficiently large problems.
the computation (e.g., when the matrix is dense or the
application of A involves solving a linear system). Many 4.2.3 Arnoldi-QR MGS-GMRES For MGS-GMRES the
theoretical results on “inexact Krylov subspace methods,” mixed-precision work by Gratton et al. (2020) is the
mostly focused on the maximum attainable accuracy, have most recent and appropriate—and in particular the loss-of-
been proved in the literature. A surprising result is that the orthogonality relations due to Björck (1967b) and Paige
inexactness in the matrix-vector products can be permitted (1980), later refined by Paige et al. (2006), are employed
to grow in norm as the iterations progress at a rate in order to provide tolerances for mixed FP32–FP64
proportional to the inverse of the residual norm without computations. MGS-GMRES convergence stalls (the norm-
affecting the maximum attainable accuracy. However, a wise relative backward error approaches ε) when linear
crucial practical question is whether inexactness will affect independence of the Krylov vectors is lost, and this is
the convergence behavior before the attainable accuracy is signaled by Paige’s S matrix norm kSk2 = 1. The S matrix
reached; this is entirely possible in the case of short-term (Paige 2018) is derived from the lower triangular T matrix
recurrence methods such as CG and has not been well studied appearing in the rounding error analyses by Giraud et al.
theoretically. (2004).
For comprehensiveness, we briefly mention works which To summarize, Gratton et al. (2020) postulate starting
make use of mixed precision Krylov subspace methods in from the Arnoldi-QR algorithm using the modified Gram-
practical applications, focusing on performance rather than Schmidt algorithm and employing exact arithmetic in the
on theoretical results. MGS-GMRES iterative solver. The Arnoldi-QR algorithm
applied to a non-symmetric matrix A produces the matrix
One instance of this is in the work of Clark et al.
factorization, with loss of orthogonality Fk .
(2010), which uses mixed precision CG and BICGSTAB
T
methods to implement the “reliable update” strategy of AVk = Vk+1 Hk , Vk+1 Vk+1 = I + Fk (4.4)
Sleijpen and van der Vorst (1996) within a Lattice QCD
They next introduce inexact (e.g., single precision) inner
application run on GPUs. The idea behind the “reliable
products—this directly relates to the loss-of-orthogonality
update” strategy is that the true residual is computed and
relations for the A = QR factorization produced by MGS.
used to replace the recursively updated residual in select
The resulting loss of orthogonality, as measured by kI −
iterations, thus improving the attainable accuracy; this is
QT Qk2 , grows as O(ε)κ(A), as was derived by Björck
done in conjunction with batched updates to the solution
(1967b) and O(ε)κ([ r0 , AVk ]) for Arnoldi-QR—which is
vector. By using higher (FP64) precision only in the true
described in Paige and Strakoš (2002), Paige et al. (2006),
residual computations and group updates (and FP32 or FP16
and related work. The inexact inner products are given by:
for the rest of the computation), the authors claim they
are able to achieve FP64 accuracy. This deserves further hij = viT wj + ηij , (4.5)
where hij are elements of the Hessenberg matrix Hk , and triangular solve rj = (I + Lj−1 )−1 QTj−1 aj to update R, as
the Arnoldi-QR algorithm produces a QR factorization of this would directly employ the forward error analysis of
the matrix: Higham (1989). The former affects the loss of orthogonality,
whereas the latter affects the representation error for QR—
r0 , AVk = Vk+1 β e1 , Hk . (4.6) but then also for Arnoldi-QR. This could allow more (or
most) of the inner products to be computed in FP32.
The loss of orthogonality relations for Fk are given below, Evidence for maintaining orthogonality is provided in
where the matrix U is strictly upper triangular. Figure 4.1, with kI − QT Qk plotted for A = QR using
T the inner products in standard MGS (blue) in FP64 versus
v1 v2 · · · v1T vk+1
the inverse compact WY MGS (red) with QTj−1 qj−1 in
Fk = Ūk + ŪkT , Uk =
.. (4.7)
. FP32 (simulated in MATLAB), and we observe at least
vkT vk+1 the same or slightly higher error levels. The x-axis is the
log condition number for randomly generated matrices. The
Define the matrices as below. lower triangular solve is computed in FP64.
Barlow (2019) contains similar if not the same algorithm
η11 · · · η1k h21 ··· h2k
formulations in block form. His work is related to Björck’s
Nk =
. .. Rk =
..
. 1994 paper (Björck 1994, Section 7), which derives the
ηkk hk+1,k triangular matrix T using a recursive form for MGS, and
(4.8) which is referred to as a “compact WY” representation in
The loss of orthogonality relation, derived by Björck (1967b) the literature. While Björck used a lower triangular matrix T
for the A = QR factorization via the modified Gram- for the compact WY form of MGS, Malard and Paige (1994)
Schmidt algorithm, can be applied to the Arnoldi-QR derived the upper triangular form, also employed by Barlow,
algorithm to obtain: which reverses the order of elementary projectors. The latter
is unstable in that a backward recurrence leads to O(ε)κ2 (A)
Nk = − 0, Uk Hk = −Uk Rk . (4.9) loss of orthogonality. An interesting observation from Leon
et al. (2013) is that the upper triangular form is less stable
The complete loss of orthogonality (triggers a loss of than the lower triangular, even though the backward-forward
linear independence) of the Krylov vectors in MGS-GMRES algorithm results in re-orthogonalization; see the algorithm
signals the minimum error is achieved, and GMRES then in Leon et al. (2013).
stalls or really can go no further than when the norm-wise Barlow (2019) employs the Householder compact WY
relative backward error reaches O(ε). Gratton et al. (2020) representation of reflectors and also refers to the work of
show how to maintain sufficient orthogonality to achieve a Puglisi (1992)—discussed in Joffrain et al. (2006)—and this
desired relative residual error level by switching the inner is referred to as the “inverse compact WY” representation of
products from FP64 to FP32 at certain tolerance levels Householder; this originally comes from Walker’s work on
and combine this with inexact matrix-vector products as in Householder GMRES Walker (1988). Barlow then extends
van den Eshof and Sleijpen (2004) and Simoncini and Szyld this approach to the block compact WY form of MGS; see
(2003). also the technical report by Sun (1996). The contribution by
In practice, the restarted variant of GMRES is often Świrydowicz et al. (2020) was to note that there exists an
employed to reduce memory requirements. The algorithm inverse compact WY representation for MGS—having the
produces both implicit (iteratively-computed) and explicit projector P with lower triangular correction matrix T :
residuals. Thus, we might ask whether either can be
performed in reduced precision. The work described herein P = I − Qj−1 T QTj−1
on iterative refinement by Carson and Higham for mixed = I − Qj−1 (I + Lj−1 )−1 QTj−1
precision can be applied to analyze the convergence
—and to “lag” the norm kqj−1 k2 so that these can be
of restarted GMRES(m), assuming a fixed number of
computed in one global reduction. Barlow (2019) makes
iterations, because restarted GMRES is just iterative
this connection for blocks, and in effect this is given in his
refinement with GMRES as the solver for the correction
equation (3.10), and references Puglisi (1992).
term. However, a more detailed analysis with experiments
Björck and Paige (1992) made the link between
has yet to be performed. We are fairly certain that the
Householder and MGS based on the observation made by
residual computations must be performed in higher precision
Sheffield. Paige defines this to be augmentation, and Gratton
in order to achieve a norm-wise backward error close to FP64
et al. (2020) also references this work. Paige has also recently
machine round-off.
4.2.4 Alternative Approaches  Although somewhat outside the scope of this review, we can demonstrate that it is possible to modify the Gratton et al. (2020) analysis based on the inverse compact WY form of the MGS algorithm introduced by Świrydowicz et al. (2020). Rather than treat all of the inner products in the MGS-GMRES algorithm equally, consider the strictly upper triangular matrix U = L^T from the loss of orthogonality relations. We introduce a single-precision (FP32) L_{k-1,1:k-2} = (Q_{1:k-2}^T q_{k-1})^T and an FP64 […]

An observation in Leon et al. (2013) is that the upper triangular form is less stable than the lower triangular one, even though the backward-forward algorithm results in re-orthogonalization; see the algorithm in Leon et al. (2013). Barlow (2019) employs the Householder compact WY representation of reflectors and also refers to the work of Puglisi (1992), discussed in Joffrain et al. (2006); this is referred to as the "inverse compact WY" representation of Householder and originally comes from Walker's work on Householder GMRES (Walker 1988). Barlow then extends this approach to the block compact WY form of MGS; see also the technical report by Sun (1996). The contribution of Świrydowicz et al. (2020) was to note that there exists an inverse compact WY representation for MGS, with the projector P built from a lower triangular correction matrix T,

    P = I - Q_{j-1} T Q_{j-1}^T = I - Q_{j-1} (I + L_{j-1})^{-1} Q_{j-1}^T,

and to "lag" the norm ‖q_{j-1}‖_2 so that these quantities can be computed in one global reduction. Barlow (2019) makes this connection for blocks; in effect it is given in his equation (3.10), which references Puglisi (1992).

Björck and Paige (1992) made the link between Householder and MGS based on an observation by Sheffield. Paige refers to this as augmentation, and Gratton et al. (2020) also reference this work. Paige has recently extended these augmentation ideas to Lanczos. The T matrix appears in Paige and Wülling (2014) and later in Paige (2018) to derive the loss of orthogonality matrix S = (I + L_{j-1}^T)^{-1} L_{j-1}^T. This matrix also appears in the work of Giraud et al. (2004); Langou also worked with Smoktunowicz et al. (2006) on the Pythagorean trick to reduce cancellation error in the computation of vector norms and on a Cholesky-like form of classical Gram-Schmidt (CGS).
In order to combine single- and double-precision floating-point operations in MGS-GMRES, at first it appears that we could store the T matrix in FP32, but then we would still have to form Q_{j-1}^T a_j and store Q_{j-1} in FP64. By examining the cost trade-offs a bit further, we can instead use a form of re-orthogonalization based on a backward-forward solver recurrence,

    T = (I + L_{j-1}^T)^{-1} (I + L_{j-1})^{-1},

and our initial computational results, displayed in Figure 4.2, demonstrate that this works well, driving the relative residual and, more importantly, the norm-wise relative backward error to O(ε) in FP64, with orthogonality maintained to O(ε) in FP32, as indicated by the magenta curve. Here, the black curve is the FP64 loss of orthogonality metric given by ‖S‖_2. The representation error (backward error) for A + E = QR computed by MGS is not affected by the FP32 inner products and remains O(ε). We are not aware of whether or not this was previously known.

Figure 4.2. GMRES residuals and loss of orthogonality ‖S‖_2 for the impcol_e matrix.
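As an illustration only, and under our own simplifying assumptions rather than as the implementation behind Figure 4.2, the sketch below orthogonalizes a new vector against the current basis with the inverse compact WY projection: the strictly lower triangular loss-of-orthogonality entries are stored in FP32, the projection inner products Q^T a and the vector updates stay in FP64, and the backward-forward variant T = (I + L^T)^{-1}(I + L)^{-1} adds a second, transposed triangular solve.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

using Vec = std::vector<double>;

static double dot(const Vec& x, const Vec& y) {
    double s = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) s += x[i] * y[i];
    return s;
}

// One orthogonalization step of MGS in inverse compact WY form.
// Q holds k (nearly) orthonormal columns in FP64; L stores the strictly lower
// triangular loss-of-orthogonality entries in FP32 (row i has length i).
// The correction applies (I+L)^{-1} (forward solve) or the backward-forward
// form (I+L^T)^{-1}(I+L)^{-1}; all vector arithmetic remains in FP64.
Vec orthogonalize(std::vector<Vec>& Q, std::vector<std::vector<float>>& L,
                  const Vec& a, bool backward_forward)
{
    const std::size_t k = Q.size();
    Vec r(k);                                   // FP64 projection r = Q^T a
    for (std::size_t i = 0; i < k; ++i) r[i] = dot(Q[i], a);

    Vec y(k);                                   // forward solve (I + L) y = r
    for (std::size_t i = 0; i < k; ++i) {
        double s = r[i];
        for (std::size_t j = 0; j < i; ++j) s -= static_cast<double>(L[i][j]) * y[j];
        y[i] = s;
    }
    Vec z = y;                                  // optional backward solve (I + L^T) z = y
    if (backward_forward) {
        for (std::size_t ii = k; ii-- > 0;) {
            double s = y[ii];
            for (std::size_t j = ii + 1; j < k; ++j) s -= static_cast<double>(L[j][ii]) * z[j];
            z[ii] = s;
        }
    }
    Vec w = a;                                  // w = a - Q z, then normalize (FP64)
    for (std::size_t j = 0; j < k; ++j)
        for (std::size_t i = 0; i < w.size(); ++i) w[i] -= Q[j][i] * z[j];
    const double nrm = std::sqrt(dot(w, w));
    for (double& wi : w) wi /= nrm;

    // Extend L by the FP32 inner products of the new vector with Q. In a
    // low-synchronization MGS-GMRES these would be lagged and fused into a
    // single global reduction; they are formed directly here for readability.
    std::vector<float> row(k);
    for (std::size_t j = 0; j < k; ++j) row[j] = static_cast<float>(dot(Q[j], w));
    L.push_back(row);
    Q.push_back(w);
    return w;
}
```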
4.3 Memory Format Decoupling

We already elaborated on sparse linear algebra operations being memory bound across the complete hardware technology food chain. Additionally, we are witnessing the machine imbalance, that is, the gap between compute power and memory bandwidth, continuing to grow. A promising strategy, and maybe the only promising one, to overcome this problem is to utilize the bandwidth capacity more carefully, reduce the communication volume and the number of communication points, and, whenever possible, trade communication against computations. Specifically, the idea is to radically decouple the memory precision from the arithmetic precision, employ high precision only in the computations, and lower the precision as much as possible when accessing data in main memory or communicating with remote processors (Anzt et al. 2019b). An important aspect in this context is the design of a "memory accessor" that converts data on the fly between the IEEE high precision arithmetic format and the memory/communication format (Figure 4.3). The memory/communication format does not necessarily have to be part of the IEEE standard but can also be an arbitrary composition of sign, exponent, and significand bits (Grützmacher et al. 2019) or even a nonstandard format like Gustafson's Posits (Unum type III; Gustafson 2015). On an abstract level, the idea is to compress data before and after memory operations and only use the working precision in the arithmetic operations. While one generally distinguishes between "lossy compression" and "lossless compression" (Sayood 2012), significant bandwidth reduction usually requires the loss of some information. How much information can be disregarded without endangering the numerical stability heavily depends on the algorithm and the problem characteristics. Thus, the choice of the memory format requires careful consideration (e.g., in the form of automated format selection); see Section 4.4.
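A minimal sketch of the accessor idea (our own illustration, not the Ginkgo implementation): values live in a compact memory format, here simply FP32, and every access converts to or from the FP64 arithmetic format, so all computation happens in the working precision.

```cpp
#include <cstddef>
#include <vector>

// Minimal "memory accessor": the array is stored in a compact memory format
// (plain FP32 here; it could be any sign/exponent/significand split), while
// every read/write converts to/from the FP64 arithmetic format.
class accessor_fp32 {
public:
    explicit accessor_fp32(std::size_t n) : data_(n, 0.0f) {}
    double load(std::size_t i) const { return static_cast<double>(data_[i]); }
    void store(std::size_t i, double v) { data_[i] = static_cast<float>(v); }
    std::size_t size() const { return data_.size(); }
private:
    std::vector<float> data_;  // memory/communication format
};

// Example kernel: y = y + alpha * x, with all arithmetic in FP64 even though
// both vectors are held in the reduced memory format.
void axpy(double alpha, const accessor_fp32& x, accessor_fp32& y) {
    for (std::size_t i = 0; i < y.size(); ++i)
        y.store(i, y.load(i) + alpha * x.load(i));
}
```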
4.4 Mixed-Precision Preconditioning

In the iterative solution process of large sparse systems (e.g., when using Krylov solvers), preconditioners are an important building block for facilitating satisfactory convergence. The concept of preconditioning is to turn an ill-conditioned linear system Ax = b into a (left-)preconditioned system MAx = Mb (or AMy = b, x = My for right preconditioning), which allows for faster convergence of the iterative solver (Anzt et al. 2018). The convergence characteristics typically depend on the conditioning of the target system. For an ill-conditioned A, the preconditioner itself needs to be ill-conditioned; otherwise, it cannot be expected to improve the conditioning of the problem or the convergence of the iterative solver. In that respect, the preconditioner essentially tries to approximate the inverse of the system matrix. Obviously, if the preconditioner is the exact inverse, the solution is readily available. However, computing the exact inverse is prohibitively expensive, and, in most cases, the preconditioner is just a rough approximation of the system matrix inverse. As a consequence, it is natural to question the need for using high precision for a preconditioner that inherently carries only limited accuracy. Indeed, choosing a lower-precision format for the preconditioner is a valid strategy as long as the induced accuracy loss impacts neither the preconditioner accuracy nor its regularity. For example, Trilinos (The Trilinos Project Team 2020) allows the use of low-precision preconditioners inside high-precision iterative solvers. However, the use of lower precision in the preconditioner application results in different rounding effects than when using high precision. Specifically, the rounding effects make the preconditioner non-constant, as the rounding errors are not only larger than in high precision but also depend on the input data (Anzt et al. 2019a). As a result, low-precision preconditioners can only be used to accelerate an iterative method that can handle non-constant preconditioners (i.e., one that converges even if the preconditioner changes between iterations). For Krylov subspace solvers that generate search directions orthogonal to the previous search direction, a changing preconditioner requires an additional orthogonalization of the preconditioned search direction against the previous preconditioned search direction. The flexible Krylov solvers (e.g., FGMRES, FCG) contain this additional orthogonalization and are therefore slightly more expensive. At the same time, they do allow for low-precision preconditioners, which can compensate for the additional cost.
An alternative workaround is to decouple the memory precision from the arithmetic precision (see Section 4.3) and only store the preconditioner in low precision but apply it in high precision (Anzt et al. 2019a). Running all arithmetic in high precision keeps the preconditioner constant and removes the need for the additional orthogonalization of the preconditioned search direction. On the other hand, decoupling memory precision from arithmetic precision requires on-the-fly conversion between the formats when reading data from main memory. Fortunately, most iterative solvers and preconditioners are memory bound, and the conversion can be hidden behind the memory transfers (Flegar et al. 2021). A production-ready implementation of an adaptive-precision block-Jacobi preconditioner decoupling memory precision from arithmetic precision is available in the Ginkgo library (Anzt et al. 2020).

4.4.1 Adaptive-Precision Block-Jacobi Preconditioning

The adaptive-precision block-Jacobi preconditioner realizes the concept of decoupling arithmetic precision from memory precision proposed in Section 4.3 for a block-Jacobi preconditioner (Anzt et al. 2019a). The idea is to compute a block-Jacobi preconditioner in high precision but then store the distinct inverted diagonal blocks in the lowest floating point format that avoids overflow and still preserves the regularity of the preconditioner (Figure 4.4). This storage format is chosen for each diagonal block individually, reflecting its numerical characteristics such as condition number and value range. Figure 4.5 (top) visualizes the distribution of formats when storing the inverted diagonal blocks of size 24 for symmetric positive definite matrices of the SuiteSparse Matrix Collection. Obviously, converting to a lower-precision format generally reduces the accuracy of the linear operator, but since block-Jacobi preconditioners ignore all off-(block)diagonal entries, they are typically only a rough approximation of the matrix inverse and therefore, by design, have only very limited accuracy. Experimental results reveal that using a lower-precision format for storing the inverted diagonal blocks has, in most cases, only negligible effects on the preconditioner effectiveness and the outer solver convergence. At the same time, storing the inverted diagonal blocks in lower precision reduces the memory access volume in every preconditioner application, thereby accelerating the bandwidth-bound iterative solution process (Figure 4.5). For the adaptive-precision block-Jacobi preconditioner, it is important that the accessor converts the inverted diagonal blocks back to the IEEE standard precision not only for performance reasons (leveraging the highly optimized IEEE floating point arithmetic of the processors) but also for numerical reasons. Using the working precision in the arithmetic operations of the preconditioner application preserves the preconditioner as a constant operator; applying the preconditioner in lower precision would instead result in a non-constant preconditioner and require the use of a (more expensive) flexible iterative solver (Anzt et al. 2019a).

Figure 4.4. Storage format optimization for block-Jacobi (invert the diagonal block using Gauss-Jordan elimination, compute its condition number and exponent range, then select the storage format): starting from the most compact storage (top left), the format is extended in exponent bits to fit the data range (rightwards) and extended further to preserve regularity (downwards) until the range fits and regularity is satisfied. The candidate formats, written as fp<exponent bits, significand bits>, are significand-truncated IEEE formats chosen to reflect the hardware characteristics (16/32/64-bit access): fp<5,10>, fp<8,7>, fp<11,4> (16-bit); fp<8,23>, fp<11,20> (32-bit); fp<11,52> (64-bit).

Figure 4.5. Top: distribution of floating-point formats among the distinct blocks when inverting the blocks in FP64 and preserving 1-digit accuracy of the values in each inverted diagonal block when writing to main memory. Each column represents one symmetric positive definite matrix of the SuiteSparse Matrix Collection. Bottom: impact on the top-level CG solver solving the system-induced linear problem. For most systems, the convergence rate is unaffected by the use of a lower storage precision format, and almost all preconditioner applications are faster, resulting in an average 20% runtime reduction.
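The per-block decision of Figure 4.4 can be sketched as follows. This is a simplified stand-in with illustrative thresholds, not the published selection criteria: after inverting a diagonal block in FP64, the magnitude of its entries decides whether the values fit into a narrower format without overflow, and a condition-number estimate decides whether the regularity of the block survives the coarser rounding.

```cpp
#include <algorithm>
#include <cmath>
#include <vector>

enum class storage_format { fp16, fp32, fp64 };

// Simplified stand-in for the adaptive format selection of Figure 4.4:
// the inverted diagonal block (FP64 values) is stored in the most compact
// format whose range fits its entries and whose unit roundoff does not
// endanger the regularity of the block (estimated via a condition number).
storage_format select_format(const std::vector<double>& inv_block, double cond_estimate)
{
    double amax = 0.0;
    for (double v : inv_block) amax = std::max(amax, std::abs(v));

    // Illustrative limits: largest finite value and unit roundoff per format.
    struct candidate { storage_format fmt; double vmax; double unit_roundoff; };
    const candidate candidates[] = {
        {storage_format::fp16, 6.55e4,   4.88e-4},
        {storage_format::fp32, 3.40e38,  5.96e-8},
        {storage_format::fp64, 1.80e308, 1.11e-16},
    };
    for (const candidate& c : candidates) {
        const bool fits_range       = amax < c.vmax;                         // no overflow
        const bool keeps_regularity = cond_estimate * c.unit_roundoff < 1.0; // block stays nonsingular
        if (fits_range && keeps_regularity) return c.fmt;
    }
    return storage_format::fp64;  // fall back to full precision
}
```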
4.5 Mixed-Precision Multigrid Methods

Multigrid methods are highly effective iterative methods. There are basically two types of multigrid methods: geometric multigrid methods (GMG) and algebraic multigrid methods (AMG). GMG requires actual grids on each level to generate its components, whereas AMG can be considered more of a "black box" method, in that it can be given a matrix and the right-hand side and will generate the components for each level automatically using sensible heuristics. These methods are an interesting target for multi-precision treatment due to their different components that affect the overall algorithm in different ways. GMG and AMG combine smoothers, coarse-grid, restriction, and prolongation operators on each level. In addition, it is of interest to investigate changes in precision on different levels. Finally, GMG and AMG can be used as preconditioners for other solvers (i.e., there is potential to use lower precision across the whole preconditioner). Historically, most work focused on the use of a lower-precision GMG or AMG method as a preconditioner to a FP64 solver.
Ljungkvist and Kronbichler (2017, 2019) successfully used mixed precision to solve the Laplace problem for different orders with a matrix-free geometric multigrid approach. Their solver infrastructure allows for using mixed-precision arithmetic to perform the multigrid V-cycle in FP32 with an outer correction in FP64, thereby increasing throughput by up to 83%.

Similarly, Glimberg et al. (2011) use a FP32 multigrid to precondition a FP64 defect correction scheme and solve the Laplace problem within a nonlinear water wave application on a GPU architecture. They achieve a speedup of up to 1.6× for the mixed-precision version over the FP64 version and a speedup of 1.9× for a purely FP32 version.
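The defect-correction structure shared by these studies is easy to sketch. The illustration below is our own simplification: the FP32 multigrid V-cycle is passed in as a callable that already owns the grid hierarchy, while the residual and the solution update stay in FP64 and only the approximate error solve runs in low precision.

```cpp
#include <cmath>
#include <cstddef>
#include <functional>
#include <vector>

// FP64 defect correction with a low-precision multigrid preconditioner:
// r = b - A x and the update of x stay in FP64, while the approximate error
// equation A e ~= r is handled by one FP32 V-cycle supplied by the caller.
using vcycle_fp32_t = std::function<std::vector<float>(const std::vector<double>& r)>;

void defect_correction(const std::vector<std::vector<double>>& A,
                       const std::vector<double>& b,
                       std::vector<double>& x,
                       const vcycle_fp32_t& vcycle_fp32,
                       int max_iter, double tol)
{
    const std::size_t n = b.size();
    for (int it = 0; it < max_iter; ++it) {
        std::vector<double> r(n);
        double rnorm = 0.0;
        for (std::size_t i = 0; i < n; ++i) {              // FP64 residual
            double Axi = 0.0;
            for (std::size_t j = 0; j < n; ++j) Axi += A[i][j] * x[j];
            r[i] = b[i] - Axi;
            rnorm += r[i] * r[i];
        }
        if (std::sqrt(rnorm) <= tol) break;

        std::vector<float> e = vcycle_fp32(r);             // FP32 error correction
        for (std::size_t i = 0; i < n; ++i)                // FP64 update
            x[i] += static_cast<double>(e[i]);
    }
}
```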
Yamagishi and Matsumura (2016) also apply a FP32 multigrid to a FP64 conjugate gradient solver for the Poisson/Helmholtz problem within their non-hydrostatic ocean model. They report a speedup of up to 2× for a FP32 matrix-vector product over a FP64 one and improved overall times using this approach; however, they compare the full application only to their CPU version.

There are various publications that pursue the same strategy of using a FP32 AMG preconditioner with a FP64 solver. Emans and van der Meer (2012) perform a careful analysis of the individual kernels of preconditioned Krylov solvers on multi-core CPUs, including the sparse matrix-vector multiplications (SpMV) that make up a large portion of AMG. They also consider the effect of communication, where lower precision leads to smaller messages, but latencies remain an issue, particularly on the coarsest levels of AMG. They find that the use of mixed precision for the preconditioner barely affects convergence; therefore, the speedups observed for the kernels, which were between 1.1× and 1.5×, can potentially carry over to the whole solver and lead to improved runtimes within computational fluid dynamics applications.

Sumiyoshi et al. (2014) investigate AMG performance on a heterogeneous computer architecture with both CPUs and GPUs for isotropic and anisotropic Poisson problems. They consider smoothed aggregation AMG as a stand-alone solver and carefully analyze different portions of the algorithm on five different architectures, including one multi-core CPU cluster. They report speedups between 1.2× and 1.6× on the GPU-CPU architectures for the mixed-precision implementation over the FP64 version. These speedups are related to the SpMV performance (between 1.6× and 1.8×) on these architectures. However, the mixed-precision version was slightly slower on the CPU-only architecture, which achieved barely any improvement for the SpMV operations.

Richter et al. (2014) examine the performance of a FP32 AMG preconditioner (ML and PETSc) applied to a FP64 PCG solver. They apply the method to an electrostatic simulation of a high-voltage isolator on a GPU/CPU computer architecture. Their mixed-precision version takes about 84% of the time of the FP64 version.
An approach described in a presentation by Clark (2019) takes the use of mixed precision even further by involving half precision. Clark and collaborators achieved good results using a FP64 defect correction approach with a FP32 Krylov solver and a half-precision AMG preconditioner.

Another interesting related study, by Fox and Kolasinski (2019), examines the use of ZFP, a lossy compression algorithm, within multigrid. Due to the local structure of ZFP, it can easily be integrated into numerical simulations without changing the underlying algorithms. However, since ZFP is a lossy algorithm, it introduces some error; thus, it is important to understand whether the error caused by ZFP overwhelms the other, traditional sources of error (e.g., discretization error).

ZFP decomposes the field of interest into smaller pieces, called blocks, that are then compressed and decompressed independently. The ZFP compressed arrays implemented in Lindstrom (2018) are C++ classes that provide random-accessible arrays whose storage size is specified by the user. In particular, ZFP fixed-rate arrays specify a rate used to compress each block of the data field to a finite number of bits. The study uses ZFP fixed-rate arrays to represent the approximation vector in MG on a 2-D Poisson problem with Dirichlet boundary conditions when the number of interior nodes of the finest grid is (2^8 - 1)^2. Figure 4.6 presents the relative residual for a V-cycle with and without ZFP fixed-rate arrays. The orange line represents the relative residual with respect to the FP64 solution, while the blue line represents the relative residual with respect to the solution with ZFP fixed-rate arrays with a rate of 32. As the number of V-cycles increases, the relative residuals of the two solutions match until the relative residual for the ZFP solution approximately reaches the machine unit roundoff u for FP32.
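For concreteness, the following fragment shows how such a fixed-rate compressed array might be declared and used for the approximation vector; the header and class names follow the zfp 0.5.3 C++ interface cited above and may differ in other zfp versions, so treat this as a sketch rather than a verified build recipe.

```cpp
// Sketch only: the multigrid approximation vector for a 2-D Poisson problem
// stored in a ZFP fixed-rate compressed array (names as in zfp 0.5.3).
#include "zfparray2.h"   // zfp::array2d

int main() {
    const unsigned int n = (1u << 8) - 1;   // (2^8 - 1) interior nodes per dimension
    const double rate = 32.0;               // compressed bits per value (fixed rate)

    // Random-accessible compressed 2-D array holding the approximation u.
    zfp::array2d u(n, n, rate);

    // The array is used like ordinary storage inside the smoother/V-cycle;
    // compression and decompression of the 4x4 blocks happen transparently.
    for (unsigned int j = 0; j < n; ++j)
        for (unsigned int i = 0; i < n; ++i)
            u(i, j) = 0.0;                  // e.g., initial guess

    // A coarser level could use a lower rate, e.g. zfp::array2d(nc, nc, 16.0),
    // mirroring the adaptive-rate study discussed around Figure 4.7.
    return 0;
}
```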
Figure 4.7 displays a similar study for the case in which the rate used for the ZFP fixed-rate arrays is adapted depending on the level within the V-cycle. It is assumed that the ZFP fixed-rate arrays have a fixed set of possible rates: 64, 48, 32, or 16. The blue line in Figure 4.7 depicts the relative compression error (i.e., the error between the FP64 solution and the ZFP fixed-rate solution), where the finest level has a rate of 64 and the rate for the coarser levels is sequentially lowered. That is, if we consider a 6-level V-cycle for which the finest level has a fixed rate of 64, then the second finest level has a fixed rate of 48, then 32, and then 16 for the remaining coarse grids. The orange, green, and red lines depict the relative compression error for a rate of 48, 32, and 16, respectively, on the finest level. The purple dashed line depicts the relative truncation error for the FP64 solution. Each ZFP fixed-rate solution remains below the truncation error, and the compression error continuously decreases until the relative error for the ZFP solution approximately reaches the respective machine unit roundoff u corresponding to the rate of the finest level.

Figure 4.7. Relative compression error for an adaptive-rate ZFP solution, where the rate for the approximation vector is sequentially lowered on the coarser grids. The purple line represents the truncation error for the double-precision solution.

This study shows that, for MG on a Poisson problem, applying ZFP to the approximation vector can significantly decrease memory use and is expected to decrease run times, while the generated errors stay below the discretization error. Since a hardware version of ZFP is not available yet, no actual runs were possible; however, the results show good potential for using ZFP within GMG and/or AMG as a preconditioner.
Currently, Tamstorf et al. (2020a) appear to be the only ones who have investigated the theory of multi-precision multigrid methods. Their original intent was to improve the appearance of the movement of cloth within Disney movies, which requires higher than FP64 accuracy; however, their theory applies equally to decreased precision. They have created a theoretical framework with rigorous proofs for a mixed-precision version of multigrid for solving the algebraic equations that arise from discretizing linear elliptic partial differential equations (PDEs). The arising matrices, being sparse and symmetric positive definite, enable the use of the so-called energy or A-norm to establish convergence and error estimates. Bounds on the convergence behavior of multigrid are developed and analyzed as a function of the matrix condition number. Both theoretical and numerical results confirm that convergence to the level of discretization accuracy can be achieved with mixed-precision versions of V-cycles and full multigrid. This framework is inspired by the results of Carson and Higham (2017) but ultimately provides tighter bounds for many PDEs. Tamstorf et al. (2020b) further extend their theoretical framework to include the quantization error. They use the bounds to guide the choice of precision level in their progressive-precision multigrid scheme by balancing the quantization error against the discretization error.

Summary

We have presented mixed-precision algorithms for dense and sparse linear algebra that outperform traditional algorithms operating in high precision. For performance-bound dense linear algebra algorithms, mixed-precision iterative refinement that employs a low-precision error correction solver remains the first-choice algorithm for exploiting the compute power available in low precision. For sparse linear algebra, the memory-bound nature of the algorithms makes the concept of decoupling the memory precision from the arithmetic precision attractive. Furthermore, preconditioners with limited approximation accuracy are natural targets for the use of lower precision. Carefully adjusting the preconditioner precision to the numerical requirements and the approximation accuracy can render runtime savings without impacting the iterative solver's convergence.

As AI and deep learning are currently driving the hardware market, we expect a large number of processors and accelerators featuring low-precision special function units and support for non-standard precision formats and integer operations. For numerical linear algebra, we anticipate significant potential in the use of integer arithmetic for numerical calculations and of the low-precision function units designed for deep learning. As we see the machine imbalance continuing to grow, we also expect format decoupling and compression techniques to become essential, and we are eager to see hardware support for data compression.

Acknowledgments

This work was supported by the US Exascale Computing Project (17-SC-20-SC), a collaborative effort of the U.S. Department of Energy Office of Science and the National Nuclear Security Administration. This work was performed under the auspices of the U.S. Department of Energy by Lawrence Livermore National Laboratory under Contract DE-AC52-07NA27344.
Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States government. Neither the United States government nor Lawrence Livermore National Security, LLC, nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or process disclosed, or represents that its use would not infringe privately owned rights. Reference herein to any specific commercial product, process, or service by trade name, trademark, manufacturer, or otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the United States government or Lawrence Livermore National Security, LLC. The views and opinions of authors expressed herein do not necessarily state or reflect those of the United States government or Lawrence Livermore National Security, LLC, and shall not be used for advertising or product endorsement purposes.

Sandia National Laboratories is a multimission laboratory managed and operated by National Technology and Engineering Solutions of Sandia, LLC., a wholly owned subsidiary of Honeywell International, Inc., for the U.S. Department of Energy's National Nuclear Security Administration under contract DE-NA-0003525.

References
(2017) NVIDIA Tesla V100 GPU architecture. URL https://images.nvidia.com/content/volta-architecture/pdf/volta-architecture-whitepaper.pdf.
Abdelfattah A, Tomov S and Dongarra J (2020) Investigating the benefit of FP16-enabled mixed-precision solvers for symmetric positive definite matrices using GPUs. In: Krzhizhanovskaya VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA and Teixeira SBJ (eds.) Computational Science—ICCS 2020, number 12138 in Lecture Notes in Computer Science. Springer International Publishing, pp. 237–250. DOI:10.1007/978-3-030-50417-5_18.
Abdelfattah A, Tomov S and Dongarra JJ (2019) Fast batched matrix multiplication for small sizes using half-precision arithmetic on GPUs. In: 2019 IEEE International Parallel and Distributed Processing Symposium, IPDPS 2019, Rio de Janeiro, Brazil, May 20-24, 2019. IEEE, pp. 111–122. DOI:10.1109/IPDPS.2019.00022. URL https://doi.org/10.1109/IPDPS.2019.00022.
Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou J, Ltaief H, Luszczek P and Tomov S (2009) Numerical linear algebra on emerging architectures: The PLASMA and MAGMA projects. Journal of Physics: Conference Series 180: 012037. DOI:10.1088/1742-6596/180/1/012037.
Alvermann A, Basermann A, Bungartz HJ, Carbogno C, Ernst D, Fehske H, Futamura Y, Galgon M, Hager G, Huber S, Huckle T, Ida A, Imakura A, Kawai M, Köcher S, Kreutzer M, Kus P, Lang B, Lederer H, Manin V, Marek A, Nakajima K, Nemec L, Reuter K, Rippl M, Röhrig-Zöllner M, Sakurai T, Scheffler M, Scheurer C, Shahzad F, Simoes Brambila D, Thies J and Wellein G (2019) Benefits from using mixed precision computations in the ELPA-AEO and ESSEX-II eigensolver projects. Japan J. Indust. Appl. Math. 36(2): 699–717. DOI:10.1007/s13160-019-00360-8.
Anderson E, Bai Z, Bischof CH, Blackford S, Demmel JW, Dongarra JJ, Du Croz JJ, Greenbaum A, Hammarling SJ, McKenney A and Sorensen DC (1999) LAPACK Users' Guide. Third edition. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0-89871-447-8. URL http://www.netlib.org/lapack/lug/.
Anzt H, Cojean T, Chen YC, Flegar G, Göbel F, Grützmacher T, Nayak P, Ribizel T and Tsai YH (2020) Ginkgo: A high performance numerical linear algebra library. Journal of Open Source Software x(x): xxxx. DOI:10.21105/joss.02260. URL https://doi.org/10.21105/joss.02260.
Anzt H, Dongarra J, Flegar G, Higham NJ and Quintana-Ortí ES (2019a) Adaptive precision in block-Jacobi preconditioning for iterative sparse linear system solvers. Concurrency and Computation: Practice and Experience 31(6): e4460.
Anzt H, Flegar G, Grützmacher T and Quintana-Ortí ES (2019b) Toward a modular precision ecosystem for high-performance computing. The International Journal of High Performance Computing Applications 33(6): 1069–1078.
Anzt H, Huckle TK, Bräckle J and Dongarra J (2018) Incomplete sparse approximate inverses for parallel preconditioning. Parallel Computing 71: 1–22.
Arioli M and Duff IS (2009) Using FGMRES to obtain backward stability in mixed precision. Electron. Trans. Numer. Anal. 33: 31–44. URL https://eudml.org/doc/130614.
Arioli M, Duff IS, Gratton S and Pralet S (2007) A note on GMRES preconditioned by a perturbed LDLT decomposition with static pivoting. SIAM J. Sci. Comput. 29(5): 2024–2044. DOI:10.1137/060661545.
Barlow JL (2019) Block modified Gram–Schmidt algorithms and their analysis. SIAM J. Matrix Anal. Appl. 40(4): 1257–1290.
Björck Å (1967a) Iterative refinement of linear least squares solutions I. BIT Numerical Mathematics 7(4): 257–278. DOI:10.1007/BF01939321.
Björck Å (1967b) Solving linear least squares problems by Gram-Schmidt orthogonalization. BIT Numerical Mathematics 7(1): 1–21.
Björck Å (1990) Iterative refinement and reliable computing. In: Cox MG and Hammarling SJ (eds.) Reliable Numerical Computation. Oxford University Press, pp. 249–266.
Björck Å (1994) Numerics of Gram-Schmidt orthogonalization. Lin. Alg. Appl. 197: 297–316.
Björck Å and Paige CC (1992) Loss and recapture of orthogonality in the modified Gram-Schmidt algorithm. SIAM J. Matrix Anal. Appl. 13(1): 176–190.
Blanchard P, Higham NJ, Lopez F, Mary T and Pranesh S (2020) Mixed precision block fused multiply-add: Error analysis and application to GPU tensor cores. SIAM J. Sci. Comput. 42(3): C124–C141. DOI:10.1137/19M1289546.
Carson E and Higham NJ (2017) A new analysis of iterative refinement and its application to accurate solution of ill-conditioned sparse linear systems. SIAM Journal on Scientific Computing 39(6): A2834–A2856.
Carson E and Higham NJ (2018) Accelerating the solution of linear systems by iterative refinement in three precisions. SIAM Journal on Scientific Computing 40(2): A817–A847.
Carson E, Higham NJ and Pranesh S (2020) Three-precision GMRES-based iterative refinement for least squares problems. MIMS EPrint 2020.5, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. URL http://eprints.maths.manchester.ac.uk/2770/. Revised June 2020. To appear in SIAM J. Sci. Comput.
Carson EC (2015) Communication-avoiding Krylov subspace methods in theory and practice. PhD Thesis, University of California, Berkeley.
Clark K (2019) Effective use of mixed precision for HPC. Smoky Mountain Conference 2019.
Clark MA, Babich R, Barros K, Brower RC and Rebbi C (2010) Solving lattice QCD systems of equations using mixed precision solvers on GPUs. Computer Physics Communications 181(9): 1517–1528.
Davies PI, Higham NJ and Tisseur F (2001) Analysis of the Cholesky method with iterative refinement for solving the symmetric definite generalized eigenproblem. SIAM J. Matrix Anal. Appl. 23(2): 472–493. DOI:10.1137/S0895479800373498.
Demmel JW, Hida Y, Kahan W, Li XS, Mukherjee S and Riedy EJ (2006) Error bounds from extra-precise iterative refinement. ACM Trans. Math. Software 32(2): 325–351. DOI:10.1145/1141885.1141894.
Dongarra JJ (1982) Algorithm 589 SICEDR: A FORTRAN subroutine for improving the accuracy of computed matrix eigenvalues. ACM Trans. Math. Software 8(4): 371–375. DOI:10.1145/356012.356016.
Dongarra JJ, Moler CB and Wilkinson JH (1983) Improving the accuracy of computed eigenvalues and eigenvectors. SIAM J. Numer. Anal. 20(1): 23–45. DOI:10.1137/0720002.
Elble JM and Sahinidis NV (2012) Scaling linear optimization problems prior to application of the simplex method. Comput. Optim. Appl. 52(2): 345–371. DOI:10.1007/s10589-011-9420-4.
Emans M and van der Meer A (2012) Mixed-precision AMG as linear equation solver for definite systems. In: Proceedings of International Conference on Computational Science, ICCS 2010, volume 1. pp. 175–183.
Flegar G, Anzt H, Cojean T and Quintana-Ortí ES (2021) Adaptive precision block-Jacobi for high performance preconditioning in the Ginkgo linear algebra software. ACM Transactions on Mathematical Software 47(2). DOI:10.1145/3441850.
Fox A and Kolasinski A (2019) Error analysis of inline ZFP compression for multigrid methods. 2019 Copper Mountain Conference for Multigrid Methods.
Fukaya T, Kannan R, Nakatsukasa Y, Yamamoto Y and Yanagisawa Y (2020) Shifted CholeskyQR for computing the QR factorization of ill-conditioned matrices. SIAM J. Sci. Comput. 42(1): A477–A503. DOI:10.1137/18M1218212.
Giraud L, Gratton S and Langou J (2004) A rank-k update procedure for reorthogonalizing the orthogonal factor from modified Gram–Schmidt. SIAM J. Matrix Anal. Appl. 25(4): 1163–1177.
Glimberg SL, Engsig-Karup AP and Madsen MG (2011) A fast GPU-accelerated mixed-precision strategy for fully nonlinear water wave computations. In: Proceedings of ENUMATH 2011.
Gratton S, Simon E, Titley-Peloquin D and Toint P (2020) Exploiting variable precision in GMRES. SIAM J. Sci. Comput. (to appear).
Greenbaum A (1989) Behavior of slightly perturbed Lanczos and conjugate-gradient recurrences. Lin. Alg. Appl. 113: 7–63.
Greenbaum A (1997) Estimating the attainable accuracy of recursively computed residual methods. SIAM J. Matrix Anal. Appl. 18(3): 535–551.
Grützmacher T, Cojean T, Flegar G, Göbel F and Anzt H (2019) A customized precision format based on mantissa segmentation for accelerating sparse linear algebra. Concurrency and Computation: Practice and Experience: e5418.
Gupta S, Agrawal A, Gopalakrishnan K and Narayanan P (2015) Deep learning with limited numerical precision. In: Proceedings of the 32nd International Conference on International Conference on Machine Learning - Volume 37, ICML'15. JMLR.org, pp. 1737–1746. URL http://dl.acm.org/citation.cfm?id=3045118.3045303.
Gustafson J (2015) The End of Error: Unum Computing. Chapman & Hall/CRC Computational Science. Taylor & Francis. ISBN 9781482239867. URL https://books.google.de/books?id=W2ThoAEACAAJ.
Haidar A, Abdelfattah A, Zounon M, Wu P, Pranesh S, Tomov S and Dongarra J (2018a) The design of fast and energy-efficient linear solvers: On the potential of half-precision arithmetic and iterative refinement techniques. In: Shi Y, Fu H, Tian Y, Krzhizhanovskaya VV, Lees MH, Dongarra J and Sloot PMA (eds.) Computational Science—ICCS 2018. Springer International Publishing, Cham, pp. 586–600. DOI:10.1007/978-3-319-93698-7_45.
Haidar A, Bayraktar H, Tomov S, Dongarra J and Higham NJ (2020) Mixed-precision solution of linear systems using accelerator-based computing. Technical Report ICL-UT-20-05, Innovative Computing Laboratory, University of Tennessee, Knoxville, TN, USA. URL https://www.icl.utk.edu/publications/mixed-precision-solution-linear-systems-using-accelerator-based-computing. To appear in Proc. Roy. Soc. London Ser. A.
Haidar A, Tomov S, Dongarra J and Higham NJ (2018b) Harnessing GPU tensor cores for fast FP16 arithmetic to speed up mixed-precision iterative refinement solvers. In: Proceedings of the International Conference for High Performance Computing, Networking, Storage, and Analysis, SC '18. Piscataway, NJ, USA: IEEE Press, pp. 47:1–47:11. DOI:10.1109/SC.2018.00050. URL https://doi.org/10.1109/SC.2018.00050.
Haidar A, Wu P, Tomov S and Dongarra J (2017) Investigating half precision arithmetic to accelerate dense linear system solvers. In: Proceedings of the 8th Workshop on Latest Advances in Scalable Algorithms for Large-Scale Systems. pp. 1–8.
Higham NJ (1989) The accuracy of solutions to triangular systems. SIAM J. Numer. Anal. 26(5): 1252–1265.
Higham NJ (1997) Iterative refinement for linear systems and LAPACK. IMA J. Numer. Anal. 17(4): 495–509. DOI:10.1093/imanum/17.4.495.
Higham NJ (2002) Accuracy and Stability of Numerical Algorithms. Second edition. Philadelphia, PA, USA: Society for Industrial and Applied Mathematics. ISBN 0-89871-521-0. DOI:10.1137/1.9780898718027.
Higham NJ (2019) Error analysis for standard and GMRES-based iterative refinement in two and three-precisions. MIMS EPrint 2019.19, Manchester Institute for Mathematical Sciences, The University of Manchester. URL http://eprints.maths.manchester.ac.uk/2735/.
Higham NJ and Mary T (2019) A new approach to probabilistic rounding error analysis. SIAM Journal on Scientific Computing 41(5): A2815–A2835. DOI:10.1137/18M1226312.
Higham NJ and Mary T (2020) Sharper probabilistic backward error analysis for basic linear algebra kernels with random data. MIMS EPrint 2020.4, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. URL http://eprints.maths.manchester.ac.uk/2776/. Revised August 2020. To appear in SIAM J. Sci. Comput.
Higham NJ and Pranesh S (2019a) Exploiting lower precision arithmetic in solving symmetric positive definite linear systems and least squares problems. MIMS EPrint 2019.20, Manchester Institute for Mathematical Sciences, The University of Manchester, UK. URL http://eprints.maths.manchester.ac.uk/2771/. Revised July 2020.
Higham NJ and Pranesh S (2019b) Simulating low precision floating-point arithmetic. SIAM J. Sci. Comput. 41(5): C585–C602. DOI:10.1137/19M1251308.
Higham NJ, Pranesh S and Zounon M (2019) Squeezing a matrix into half precision, with an application to solving linear systems. SIAM J. Sci. Comput. 41(4): A2536–A2551. DOI:10.1137/18M1229511.
Hogg JD and Scott JA (2010) A fast and robust mixed-precision solver for the solution of sparse symmetric linear systems. ACM Trans. Math. Software 37(2): 17:1–17:24. DOI:10.1145/1731022.1731027.
IEEE (2019) IEEE Standard for Floating-Point Arithmetic, IEEE Std 754-2019 (Revision of IEEE 754-2008). New York, USA: The Institute of Electrical and Electronics Engineers. ISBN 978-1-5044-5924-2. DOI:10.1109/IEEESTD.2019.8766229.
Joffrain T, Low TM, Quintana-Ortí ES, van de Geijn R and Van Zee FG (2006) Accumulating Householder transformations, revisited. ACM Transactions on Mathematical Software (TOMS) 32(2): 169–179.
Knight PA, Ruiz D and Uçar B (2014) A symmetry preserving algorithm for matrix scaling. SIAM J. Matrix Anal. Appl. 35(3): 931–955. DOI:10.1137/110825753.
Langou J, Langou J, Luszczek P, Kurzak J, Buttari A and Dongarra J (2006) Exploiting the performance of 32 bit floating point arithmetic in obtaining 64 bit accuracy (revisiting iterative refinement for linear systems). In: Proceedings of the 2006 ACM/IEEE Conference on Supercomputing. DOI:10.1109/SC.2006.30.
Leon SJ, Björck Å and Gander W (2013) Gram-Schmidt orthogonalization: 100 years and more. Numer. Lin. Alg. Appl. 20(3): 492–532.
Lindstrom P (2018) Zfp version 0.5.3. URL https://zfp.readthedocs.io/en/release0.5.3/index.htm.
Ljungkvist K and Kronbichler M (2017) Multigrid for matrix-free finite element computations on graphics processors. Technical report, Department of Information Technology, Uppsala University.
Ljungkvist K and Kronbichler M (2019) Multigrid for matrix-free high-order finite element computations on graphics processors. ACM Transactions on Parallel Processing.
Malard J and Paige C (1994) Efficiency and scalability of two parallel QR factorization algorithms. In: Proceedings of IEEE Scalable High Performance Computing Conference. IEEE, pp. 615–622.
Meurant G and Strakoš Z (2006) The Lanczos and conjugate gradient algorithms in finite precision arithmetic. Acta Numerica 15: 471–542. DOI:10.1017/s096249290626001x.
Moler CB (1967) Iterative refinement in floating point. Journal of the ACM (JACM) 14(2): 316–321.
Ogita T and Aishima K (2018) Iterative refinement for symmetric eigenvalue decomposition. Japan Journal of Industrial and Applied Mathematics 35(3): 1007–1035.
Ogita T and Aishima K (2019) Iterative refinement for symmetric eigenvalue decomposition II: clustered eigenvalues. Japan J. Indust. Appl. Math. 36: 435–459. URL https://doi.org/10.1007/s13160-019-00348-4.
Paige CC (1980) Accuracy and effectiveness of the Lanczos algorithm for the symmetric eigenproblem. Lin. Alg. Appl. 34: 235–258.
Paige CC (2018) The effects of loss of orthogonality on large scale numerical computations. In: International Conference on Computational Science and Its Applications. Springer, pp. 429–439.
Paige CC, Rozložník M and Strakoš Z (2006) Modified Gram-Schmidt (MGS), least squares, and backward stability of MGS-GMRES. SIAM J. Matrix Anal. Appl. 28(1): 264–284.
Paige CC and Strakoš Z (2002) Residual and backward error bounds in minimum residual Krylov subspace methods. SIAM J. Sci. Comput. 23(6): 1898–1923.
Paige CC and Wülling W (2014) Properties of a unitary matrix obtained from a sequence of normalized vectors. SIAM J. Matrix Anal. Appl. 35(2): 526–545.
Petschow M, Quintana-Ortí E and Bientinesi P (2014) Improved accuracy and parallelism for MRRR-based eigensolvers—A mixed precision approach. SIAM J. Sci. Comput. 36(2): C240–C263. DOI:10.1137/130911561.
Puglisi C (1992) Modification of the Householder method based on the compact WY representation. SIAM J. Sci. Stat. Comput. 13(3): 723–726.
Richter C, Schops S and Clemens M (2014) GPU-accelerated mixed precision algebraic multigrid preconditioners for discrete elliptic field problems. IEEE Transactions on Magnetics 50(2).
Saad Y and Schultz MH (1986) GMRES: A generalized minimal residual algorithm for solving nonsymmetric linear systems. SIAM Journal on Scientific and Statistical Computing 7(3): 856–869.
Sayood K (2012) Introduction to Data Compression, Fourth Edition. 4th edition. San Francisco, CA, USA: Morgan Kaufmann Publishers Inc. ISBN 0124157963.
Simoncini V and Szyld DB (2003) Theory of inexact Krylov subspace methods and applications to scientific computing. SIAM J. Sci. Comput. 25(2): 454–477.
Skeel RD (1980) Iterative refinement implies numerical stability for Gaussian elimination. Math. Comp. 35(151): 817–832. DOI:10.1090/S0025-5718-1980-0572859-4.
Sleijpen GL and van der Vorst HA (1996) Reliable updated residuals in hybrid Bi-CG methods. Computing 56(2): 141–163.
Smoktunowicz A, Barlow JL and Langou J (2006) A note on the error analysis of classical Gram–Schmidt. Numerische Mathematik 105(2): 299–313.
Stewart GW (1973) Introduction to Matrix Computations. New York: Academic Press. ISBN 0-12-670350-7.