0% found this document useful (0 votes)

20 views25 pages

A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic

This document presents a comprehensive survey of mixed-precision numerical linear algebra methods, highlighting their potential for accelerating scientific computing applications. It discusses the impact of floating-point formats on computational accuracy and efficiency, particularly in the context of modern hardware architectures that support low-precision arithmetic. The paper emphasizes the importance of leveraging mixed-precision techniques to optimize performance while maintaining high accuracy in numerical algorithms.

Uploaded by

therlf2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

20 views25 pages

A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic

Uploaded by

therlf2000

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 25

LLNL-JRNL-826451

A Survey of Numerical Linear Algebra

Methods Utilizing Mixed Precision
Arithmetic

A. Abdelfattah, H. Anzt, E. Boman, E. Carson, T. Cojean, J.

Dongarra, A. Fox, M. Gates, N. Higham, X. S. Li, Y. Liu, J.
Loe, P. Luszczek, S. Pranesh, S. Rajamanickam, T. Ribizel, B.
Smith, K. Swirydowicz, S. Thomas, S. Tomov, M. Tzai, I.
Yamazaki, U. M. Yang

September 7, 2021

International Journal of High Performance Computing

Applications
Disclaimer

This document was prepared as an account of work sponsored by an agency of the United States
government. Neither the United States government nor Lawrence Livermore National Security, LLC,
nor any of their employees makes any warranty, expressed or implied, or assumes any legal liability or
responsibility for the accuracy, completeness, or usefulness of any information, apparatus, product, or
process disclosed, or represents that its use would not infringe privately owned rights. Reference herein
to any specific commercial product, process, or service by trade name, trademark, manufacturer, or
otherwise does not necessarily constitute or imply its endorsement, recommendation, or favoring by the
United States government or Lawrence Livermore National Security, LLC. The views and opinions of
authors expressed herein do not necessarily state or reflect those of the United States government or
Lawrence Livermore National Security, LLC, and shall not be used for advertising or product
endorsement purposes.
A Survey of Numerical Linear Algebra
Methods Utilizing Mixed Precision
Arithmetic
Ahmad Abdelfattah1 , Hartwig Anzt1,2 , Erik G. Boman3 , Erin Carson4 , Terry Cojean2 , Jack
Dongarra1,5,6 , Alyson Fox7 , Mark Gates1 , Nicholas J. Higham6 , Xiaoye S. Li8 , Jennifer Loe3 ,
Piotr Luszczek1 , Srikara Pranesh6 , Siva Rajamanickam3 , Tobias Ribizel2 , Barry F. Smith9 ,
Kasia Swirydowicz10 , Stephen Thomas10 , Stanimire Tomov1 , Yaohung M. Tsai1 , and Ulrike
Meier Yang7

Abstract
The efficient utilization of mixed-precision numerical linear algebra algorithms can offer attractive acceleration to
scientific computing applications. Especially with the hardware integration of low-precision special-function units
designed for machine learning applications, the traditional numerical algorithms community urgently needs to
reconsider the floating-point formats used in the distinct operations to efficiently leverage the available compute power.
In this work, we provide a comprehensive survey of mixed-precision numerical linear algebra routines, including
the underlying concepts, theoretical background, and experimental results for both dense and sparse linear algebra
problems.

Keywords
Mixed Precision Arithmetic, Numerical Mathematics, Linear Algebra, High Performance Computing, GPUs

1 Introduction time, the cost of performing arithmetic operations heavily

depends on the hardware support for computations in a
In computational numerics, the accuracy of a computed certain floating point format, and the acceleration/slowdown
result and the time needed to complete the algorithm of computing in different formats can vary significantly
both heavily depend on the floating point format used for between hardware architectures. For example, on Intel’s 64-
storage and arithmetic operations. While we acknowledge bit processors, the compute performance in single precision
the existence of other formats, here we focus on fixed- is twice the compute performance of double precision. On
size floating point formats composed of a sign bit, an AMD’S Radeon VII GPU, the ratio between the single-
exponent, and a significand. Roughly speaking, the length precision performance and the double-precision performance
of the exponent determines the value range of a floating is about 4×.
point format, and the length of the significand determines Given the performance differences for computing and
the relative accuracy of the format in that range. While a communicating in different precision formats, there is
sufficient exponent range is necessary for the meaningful a long history of efforts that aim to improve the
data representation, generally the accuracy of an algorithm performance of numerical algorithms by carefully combining
output strongly correlates with the significand length. precision formats. The overall goal of these mixed-precision
The cost of communication and computation in a algorithms is to accelerate the applications with the use of
numerical application grows with the size of the floating lower-precision formats while maintaining the high accuracy
point format. Communication here can mean access to of the output.
main memory and data transfers within computing cores, in
between distinct compute cores, or between distinct compute
nodes. The cost of data transfers is composed of a constant 1 University of Tennessee, Knoxville, USA
access latency and the transfer time that reflects the ratio 2 Karlsruhe Institute of Technology, Karlsruhe, Germany
3 Sandia National Lab, Albuquerque, USA
between message size and data transfer rate. Consequently,
4 Charles University, Prague, Czech Republic
when ignoring the latency, the communication cost linearly 5 Oak Ridge National Lab, Oak Ridge, USA
increases with the size (in bits) of the floating point format. 6 University of Manchester, Manchester, UK
This implies that the runtime impact of communicating 7 Lawrence Livermore National Lab, USA

values in different formats is hardware independent and only 8 Lawrence Berkeley National Lab, Berkeley, USA
9 Argonne National Lab, Argonne, USA
depends on the size of the floating point formats. Specifically,
10 National Renewable Energy Lab, Boulder, USA
for the widely adopted IEEE754 formats for double precision
(64 bit) and single precision (32 bit) (IEEE), the runtime Corresponding author:
difference for memory/communication operations is roughly Hartwig Anzt, Karlsruher Institut für Technologie (KIT), Germany.
2×, independent of the hardware platform. At the same Email: [email protected]

Prepared using sagej.cls [Version: 2017/01/17 v1.20]

2 ()

Machine balance exponent and how many bits are used for the significand.
(# floating point operations per read) Generally, we use the term high precision for precision
100 formats that provide high accuracy at the cost of a larger
KNL V100
Core i7 memory volume (in terms of bits) and low precision to refer
P100
K40
Origin2000 M2050 KNC to precision formats that compose of fewer bits (smaller
Peak Mflop/s ÷ MW/s

T3E P4 Core2Duo RaspberryPi memory volume) and provide low(er) accuracy. Unless
10
PII PIII K Computer
r Pentium explicitly stated, we think of IEEE double precision when
C1060
ea
Cray X1 K10
ry

using the term high precision and IEEE754 single precision

CM-5E
%

486DX2 NEC SX-7 when using the term low precision. These formats are of
30

NEC SX-5
C90 NEC SX-4 particular interest as they are natively supported by a broad
1VAX-11 Y-MP r
ea range of hardware architectures. However, in particular with
y
8088 per
15% the rise of machine learning, a number of architectures now
1975 1980 1985 1990 1995 2000 2005 2010 2015 2020 also provide native support for floating point formats, that
year are even more compact than IEEE754 single precision. In
particular IEEE754 half precision and BFloat16 are formats
Figure 1.1. Evolution of the machine balance of processors that experience increased interest by the community. For all
over different hardware generations. floating point formats, their bitwise configuration determines
the characteristics. Roughly speaking, the length of the
But while the idea of mixed-precision algorithms has been exponent of a precision format determines the range of a
around for several decades, recent hardware trends have format, the length of the significand determines the precision
motivated increased research and development activities. of a format. Relevant indicators in that context are the largest
Within the past few years, hardware vendors have started and smallest representable numbers in a format, and the
designing special-purpose units for low-precision arithmetic unit roundoff of a format u. In Table2.1 we list some of
in response to the machine learning community’s demand the most relevant formats used in modern scientific high
for high compute power in low-precision formats. Also, performance computing along with their key characteristics.
the server-level products are increasingly featuring low- Traditionally, hardware and software are strictly coupling
precision special function units (e.g., NVIDIA tensor cores the precision format used for arithmetic operations and for
in V100 GPUs) providing about 16×higher performance memory operations. However, given that most architectures
than what can be achieved in IEEE double precision. are nowadays overprovisioned for arithmetic operations,
Exploiting this compute power efficiently could offer up there exist trends to break up this strict coupling. On
to an order of magnitude of speedup to compute-bound the hardware side, a recent example of an architecture
algorithms. At the same time, the gap between the compute breaking up the coupling between memory format and
power on the one hand and the memory bandwidth on arithmetic format are the NVIDIA Tensor Cores integrated
the other hand keeps increasing, making data access and into NVIDIA’s Volta GPU architecture v10 (2017). These
communication progressively more expensive compared special function units designed to perform high performance
with arithmetic operations (Figure 1.1). Given the over matrix matrix multiplication take half precision (FP16) input
provisioning of modern hardware for arithmetic operations, data, but compute in FP32 v10 (2017). On the software side,
it may be a rational decision for memory-bound algorithms the concept of a memory accessor separating the memory
to compress all data in cache before communicating with precision from the arithmetic precision pursues the same
remote processors or main memory goal: computing in higher precision while handling the data
In this paper, we present mixed-precision linear algebra in lower precision in memory Anzt et al. (2019b). In 4.3
algorithms and the attainable performance advantages for we will detail the software-based approach in more detail.
dense linear algebra (Section 3) and for sparse linear algebra For the V100 GPU experiments in Section 3, we claim
(Section 4). We conclude in Section 5 with an outlook on that the algorithms operating on the Volta tensor cores use
current algorithm development and perspectives for mixed- half precision, acknowledging that internally, the arithmetic
precision technology on future architectures. We note that operations are using higher precision after converting the half
this survey is focusing on numerical linear algebra operating precision input data.
on explicitly-available linear operators, matrix-free methods
remain outside the scope of this work.
3 Dense Linear Algebra

2 Precision Formats, Hardware Realization, Carefully designed mixed-precision dense linear algebra
algorithms can leverage the potential performance advan-
and Notation tages of low-precision arithmetic. With this in mind, we
Before presenting mixed precision algorithms, we want start the section in Section 3.1 by presenting basic linear
to establish some notation we use throughout the rest algebra subroutines specifically designed to exploit the com-
of the paper, and provide some background on precision pute power of NVIDIA’s Tensor Cores, which provide high
formats and their realization in hardware. We exclusively compute power in low precision. Building on low-precision
focus on floating point formats that are composed of a Basic Linear Algebra Subprograms (BLAS) and guided by
sign bit, a sequence of exponent bits, and a sequence of the concept of Newton’s method (Section 3.2), it is possible
significand bits. The distinct precision formats then differ in to derive high performance linear solvers running in low
the composition in terms of how many bits are used for the precision that, embedded in an iterative refinement (Section

Prepared using sagej.cls

Abdalafattah et al. 3

Table 2.1. Parameters for various floating-point arithmetics. “Range” denotes the order of magnitude of the smallest subnormal
(xmin,s ) and largest and smallest positive normalized floating-point numbers. BFloat16 does not support subnormal numbers.

Arithmetic Size Range Unit roundoff

(bits) xmin,s xmin xmax u
BFloat16 16 – 1.2 × 10−38 3.4 × 1038 3.9 × 10−3
IEEE FP16 16 6.0 × 10−8 6.1 × 10−5 6.6 × 104 4.9 × 10−4
IEEE FP32 32 1.4 × 10−45 1.2 × 10−38 3.4 × 1038 6.0 × 10−8
IEEE FP64 64 4.9 × 10−324 2.2 × 10−308 1.8 × 10308 1.1 × 10−16
IEEE FP128 128 6.5 × 10−4966 3.4 × 10−4932 1.2 × 104932 9.6 × 10−35

3.3), succeed in generating high-accuracy solutions while supported half-precision arithmetic in the Maxwell GPU
conducting most of the work in low-precision arithmetic. The architecture. Throughout this section, we will focus on
standard approach is based on factorizing a matrix in low pre- NVIDIA’s GPUs and math libraries to highlight half-
cision and using an iterative refinement scheme in high pre- precision developments for numerical kernels.
cision to recover a high accuracy solution (see Section 3.3). While NVIDIA’s Maxwell GPU architecture introduced
However, for numerical reasons, it can be advantageous to hardware support for IEEE FP16 arithmetic, the Volta
use the factorization computed in low precision as a pre- architecture, which powers the Summit supercomputer,∗
conditioner for a Generalized Minimum Residual (GMRES) comes with hardware acceleration units (called Tensor
iterative solver embedded in an iterative refinement loop Cores) for matrix multiplication in FP16. These Tensor
(see Section 3.4). Using sophisticated scaling and shifting Cores are theoretically 12× faster than the theoretical FP16
techniques, symmetry and positive definiteness of a system peak performance of the preceding architecture (Pascal
matrix can be exploited in a Generalized Minimum Residual- architecture). Applications taking advantage of the Tensor
based Iterative Refinement (GMRES-IR) variant using a low- Cores can run up to 4× faster than using the regular FP16
precision Cholesky factorization as a preconditioner (see arithmetic on the same GPU. The Tensor Cores are also
Section 3.5). In Section 3.6, we present some performance able to perform a mixed-precision multiplication with a low-
results demonstrating the potential of these techniques on precision input (e.g., half-precision) and a higher-precision
modern GPU architectures. The scope of mixed-precision output (typically single-precision). The Tensor Core units are
iterative refinement is not limited to linear systems and discussed in more detail in Section 3.1.1.
extends to least-square problems (Section 3.7) and eigen- In terms of half-precision BLAS, most of the avail-
value solvers (Section 3.8). able routines provide only dense matrix multiplications
(GEMMs). From the perspective of machine learning appli-
cations, most of the performance-critical components in
3.1 Low-Precision BLAS training/inference can be reformulated to take advantage
The revolution of machine learning applications and artificial of the GEMM kernel. As for dense linear algebra, many
intelligence (AI) spiked an interest in developing high- high-level algorithms are built to extract their high per-
performance 16-bit, half-precision floating point arithmetic formance from GEMM calls. Therefore, accelerating such
(FP16), because most AI applications do not necessarily performance-critical steps through FP16 GEMM (HGEMM)
require the accuracy of 32-bit, full-precision floating point would propagate the performance advantage to the entire
arithmetic (FP32) or 64-bit, double-precision floating point algorithm while keeping other numerical stages in their orig-
arithmetic (FP64) (Gupta et al. 2015). FP16 also enables inal precision(s). An example of this practice is the mixed
machine learning applications to run faster, not only because precision dense LU factorization (Haidar et al. 2018b), which
of the faster arithmetic, but also because of the reduction in is used to accelerate the solution of Ax = b in double
memory storage and traffic by a factor of 2× against FP32, precision, see Section 3.3.
and by a factor of 4× against FP64. 3.1.1 Hardware Acceleration of Half Precision The
In terms of vendor support, NVIDIA, Google, and CUDA Toolkit is one of the first programming models to
AMD manufacture hardware that is capable of performing provide half-precision (i.e., FP16) arithmetic. Support was
FP16 arithmetic. Google’s Tensor Core Processing Units added in late 2015 for selected embedded GPU models
(TPUs) are customized chips that are mainly designed for based on the Maxwell architecture, and FP16 arithmetic
machine learning workloads using the 16-bit brain floating has become mainstream in CUDA-enabled GPUs since
point (BFloat16) format. AMD also provides half-precision the Pascal architecture. FP16 has a dynamic range that is
capabilities, and their software stack shows support for both significantly smaller than single or double precision (see
the BFloat16 format and the IEEE FP16 format, (IEEE). Table 2.1).
The theoretical performance of half precision on AMD The Volta and Turing architectures introduced hardware
GPUs follows the expected 2× speedup against FP32 acceleration for matrix multiplication in FP16 using the
and 4× speedup against FP64. As an example, the Mi50 aforementioned Tensor Cores. Using Tensor Cores for FP16,
GPU has a theoretical FP16 performance of 26.5 TFLOP/s these GPUs can deliver a theoretical peak performance that
vs. 13.3 TFLOP/s for FP32 and 6.6 TFLOP/s for FP64.
But perhaps the most widely accessible hardware with
half-precision capability are NVIDIA’s GPUs, which first ∗ https://www.olcf.ornl.gov/summit/

Prepared using sagej.cls

4 ()

instruction allows one warp to perform four independent

GEMM operations of size (8, 8, 4). However, using the
explicit instruction may lead to long-term compatibility
issues for open-source libraries as new architectures are
released.
3.1.2 Half-precision GEMM (HGEMM) The cuBLAS
library provides several routines that take advantage of the
reduced FP16 precision. Figure 3.2 shows the performance
of three different HGEMM kernels. An HGEMM kernel
with half-precision output can achieve up to 30 TFLOP/s of
performance if the Tensor Cores are turned off. While this
is around 2× the single-precision performance, significantly
higher performance can be achieved if the Tensor Cores are
turned on. As the figure shows, the Tensor Cores are capable
of delivering an asymptotic 100 TFLOP/s, which is 5× the
Figure 3.1. Programmability of the Tensor Core units. asymptotic performance of a non-accelerated HGEMM.
Performance of cuBLAS HGEMM on square sizes (CUDA-9.1)
130
HGEMM-FP16 output (tensor cores on)
120 HGEMM-FP32 output (tensor cores on)
is up to 8× faster than the peak FP32 performance. As an 110 HGEMM-FP16 output (tensor cores off)
example, each Volta V100 GPU has 640 Tensor Cores evenly 100
distributed across 80 multiprocessors. Each Tensor Core 90
80
Tfop/s

possesses a mixed-precision 4 × 4 × 4 matrix processing 70

array that performs the operation D = A × B + C, where 60
A, B, C, and D are 4 × 4 matrices. The inputs A and B 50
40
must be represented in FP16 format, while C and D can be 30
represented in FP16 or in FP32 formats. It is also possible 20
that C and D point to the same matrix. 10
0
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
NVIDIA’s cuBLAS library provides various optimized
routines, mostly GEMMs, that can take advantage of the Matrix size (x 1000)

Tensor Core acceleration if configured accordingly. As

Figure 3.2. Performance of different HGEMM kernel from the
an example, the routine cublasHgemm implements the cuBLAS library on square sizes. Results are shown on a Tesla
GEMM operation for real FP16 arithmetic. V100 GPU using CUDA-9.1.
Apart from the vendor library, taking advantage of the
Tensor Cores in a custom kernel is also possible through However, perhaps the most interesting performance graph
low-level APIs that are provided by the programming model. of Figure 3.2 is the HGEMM with FP32 output, where we
As shown in Figure 3.1, Tensor Cores deal with input and can see that the performance is close to the accelerated
output data through opaque data structures called fragments. HGEMM kernel but with much more precision on the
Each fragment is used to store one matrix. Fragments can be output. This is of particular importance for mixed-precision
loaded from shared memory or from global memory using algorithms (Haidar et al. 2018b, 2017). To put this in
the load matrix sync() API. A similar API is available perspective, Figure 3.3 shows the forward error between the
for storing the contents of an output fragment into the three different HGEMM kernels, with respect to the single-
shared/global memory of the GPU. The mma sync() API precision GEMM kernel from the Intel MKL library. The
is used to perform the multiplication. The user is responsible forward error is computed as:
for declaring the fragments as required and calling the APIs
kRcuBLAS − RM KL kF
in the correct sequence. √ ,
k + 2 |α| kAkF kBkF + 2 |β| kCkF
The programming model imposes some restrictions to
the programming of the Tensor Cores. First, the GEMM where k is equal to the matrix size, and α and β are the two
dimensions (M , N , K), which also control the size of scalars in the standard GEMM operation (C = αAB + βC).
the fragments, are limited to three discrete combinations, The first surprising observation is that an HGEMM operation
namely (16, 16, 16), (32, 8, 16), and (8, 32, 16). with FP16 output is more accurate if the Tensor Cores are
Second, the operations of load, store, and multiply must turned on (as the accumulation in the Tensor cores happens
be performed by one full warp (32 threads). Finally, the in FP32), which means that the utilization of the Tensor Core
load/store APIs require that the leading dimension of the units achieves both better performance and higher accuracy.
corresponding matrix be a multiple of 16 bytes. As an The second observation is that performing HGEMM with
example, a standard GEMM operation of size (16, 16, FP32 output achieves at least two more digits of accuracy
16) requires three load matrix sync() calls (for A, when compared with the other two HGEMM variants. Given
B, and C), one mma sync() call, and then a final that HGEMM with FP32 output is mostly within 90% of the
store matrix sync() call to write the result. The latest peak Tensor Core throughput, it is clearly the best option
CUDA version to date (10.1) provides direct access to the for mixed-precision algorithms that target achieving higher
Tensor Cores through an instruction called mma.sync. The accuracy while taking advantage of the half-precision.

Prepared using sagej.cls

Abdalafattah et al. 5

Forward Error of Square HGEMM

10-2
probabilistic assumptions about the rounding errors one
can obtain bounds in which the problem size is replaced
10-3 by its square root (Higham and Mary 2019), and even
smaller constants can be obtained for data with zero or small
Forward Error

10-4 mean (Higham and Mary 2020). These analyses provide

a theoretical explanation for the practical findings that
10-5 acceptable accuracy can be obtained for large n and low
precisions.
10-6
HGEMM-FP16 output (tensor cores on)
HGEMM-FP32 output (tensor cores on)
HGEMM-FP16 output (tensor cores off)
3.2 Newton’s Method
10-7
Newton’s method is widely used for solving systems of
0
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
Matrix size (x 1000) nonlinear equations F (x) = 0, where F : Rn → Rn . It takes
the form:
Figure 3.3. Forward error of HGEMM with respect to MKL xk+1 = xk − F 0 (xk )−1 F (xk ), (3.1)
SGEMM (C = αAB + βC ). Results are shown for square sizes
using cuBLAS 9.1 and MKL 2018.1. where F 0 denotes the Jacobian of F , which is assumed to
be nonsingular. In practice, of course, the explicit inverse
Performance of Batch HGEMM on square sizes (batch size = 1k)
is not computed, but rather a system of equations is solved
MAGMA with F 0 (xk ) as the coefficient matrix. Newton’s method
101 cuBLAS
lends itself to an obvious mixed-precision implementation,
whereby early iterations are carried out at low precision, and
the precision is increased as the iteration converges. Mixed
Tfop/s

100
precision can also be used within a Newton step: F (xk )
can be evaluated at a precision higher than the working
10-1 precision to inject more information into the iteration. A
detailed analysis of Newton’s method in mixed-precision
floating-point arithmetic is given by Tisseur (2001). Our
10-2 interest here is in using Newton’s method as a tool for
0
10

0
10

refining approximate solutions computed at low precision to

Matrix size the linear equations problem and the eigenvalue problem, as
Figure 3.4. Performance of the batch HGEMM kernel on we will discuss in later sections.
square sizes. Results are shown on a Tesla V100 GPU using
CUDA-9.1. 3.3 Iterative Refinement
A common approach to the solution of linear systems,
either dense or sparse, is to perform an LU factorization
3.1.3 Batch HGEMM Apart from the vendor-supplied
of the coefficient matrix. First, the coefficient matrix A is
BLAS, a few efforts have focused on building open-source
factored into the product of a lower triangular matrix L
BLAS routines that utilize NVIDIA Tensor Cores. An
and an upper triangular matrix U . Partial-row pivoting is, in
example of such efforts is in the MAGMA library (Agullo
general, used to improve numerical stability, which results
et al. 2009) which has a batch HGEMM kernel that makes
in a factorization P A = LU , where P is a permutation
use of the Tensor Cores (Abdelfattah et al. 2019). The
matrix. The solution for the system is achieved by first
kernel builds an abstraction layer over the Tensor Cores to
solving Ly = P b (forward substitution) and then solving
overcome their size restrictions, so that arbitrary blocking
U x = y (backward substitution). Because of roundoff errors,
sizes can be used by the kernel. The batch HGEMM kernel
the computed solution x carries a numerical error magnified
in MAGMA outperforms cuBLAS for relatively small sizes,
by the condition number of the coefficient matrix A.
as shown in Figure 3.4. The same work also shows that
Iterative refinement attempts to improve xb by forming the
extremely small matrices (e.g., with sizes ≤ 10) do not
residual r = b − Ab x, solving Ad = r using the LU factors,
necessarily benefit from Tensor Core acceleration.
and then updating x b←x b + z. This process is repeated
3.1.4 Error bounds What can we say about rounding error as necessary. Iterative refinement is effectively Newton’s
bounds for low-precision BLAS? Hardware that employs low method (3.1) applied to F (x) = b − Ax. The method goes
precision matrix multiplication with accumulation at high back to the beginning of the digital computer era and has
precision, such as the the NVIDIA tensor cores, requires a been analyzed by Wilkinson (1963), Moler (1967), Stewart
careful analysis that takes account of the internal precisions (1973), Higham (1997), and Higham (2002).
and the matrix size. A general such analysis, which quantifies Iterative refinement inherently presents opportunities
the gain from the use of higher precision accumulation, is for a mixed-precision approach, and there is a wealth
given in Blanchard et al. (2020). A second question concerns of existing work in this area. The three tasks—original
the interaction of precision and dimension: an error bound solve/factorization, residual computation, and correction
with a constant nu (say) provides no information if nu > equation solve—can be done in the same precision (fixed
1, such as when n = 104 and u is the unit roundoff for precision) or in different precisions (mixed precision). The
half precision (see Table 2.1). However, standard rounding original usage was mixed precision, with the residual
error bounds are based on worst-case analyses. By making computed at twice the working precision.

Prepared using sagej.cls

6 ()

Solver F.E. B.E. Prec. Algorithm 3.1 Mixed-precision iterative refinement.

Moler (1967) LU N − u ≥ ur
Stewart (1973) LU N − u ≥ ur 1: Compute the LU factorization with partial pivoting
Skeel (1980) LU C C u ≥ ur
Higham (1997) arb C C u ≥ ur
P A = LU . (u` )
Tisseur (2001) arb N N u ≥ ur 2: Solve for initial solution x0 . (u` )
Langou et al. (2006) LU N N u` ≥ u = ur 3: for k = 1, 2, . . . do
Carson and Higham (2017) arb C − u ≥ ur
Carson and Higham (2018) arb C C, N u` ≥ us ≥ u ≥ ur 4: rk ← b − Axk−1 . (ur )
Table 3.1. Summary of existing analyses of mixed precision 5: Solve Adk = rk . (us )
iterative refinement. 6: xk ← xk−1 + dk . (u)
7: Check convergence
8: end for
We define various precisions. Assume that the data A, b,
and the solution x b are stored in working precision u. We let
u` represent the precision at which the LU factorization of via triangular solve with the computed LU factors), it is
A is computed and ur represent the precision at which the shown that with the factorization done at half the working
residuals are computed. We also define us to be the precision precision and the residual computed at double the working
at which the correction equation is effectively solved; in precision, forward and backward errors on the order of the
practice this will either be u or u` . A generic mixed precision working precision are still attainable, as long as the condition
iterative refinement algorithm is given in 3.1. number of A is below some threshold. For example, as
In Table 3.1 we give citations to work which analyzes long as κ∞ (A) = kAk∞ A−1 ∞ < 104 , then using FP16
iterative refinement in mixed precision. We list whether the for the O(n3 ) portion (the LU factorization) and (FP32,
analysis applies to LU or an arbitrary solver (‘arb’) as well FP64) or (FP64, 128-bit, quadruple-precision floating point
as whether the analyses of backward and/or forward error are arithmetic (FP128)) as the (working, residual) precision
normwise (‘N’) or componentwise (‘C’). We note that, if x̂ for the O(n2 ) portion (refinement loop), one can expect
is an approximate solution of Ax = b, the normwise forward to achieve forward error and backward error on the order
error is defined by of 10−8 and 10−16 , respectively. A full summary of the
attainable componentwise forward errors and the normwise
kx̂ − xk / kxk , and componentwise backward errors for LU-based iterative
refinement with various precision combinations can be found
and the normwise backward error is defined by
in Carson and Higham (2018, Table 7.1).
min{ : (A + ∆A)x̂ = b, k∆Ak ≤ kAk} When mixed precision is used, the method in Algo-
rithm 3.1 can offer significant improvements for the solu-
which can be evaluated as tion of linear systems in many cases: if the low precision
computation is significantly faster than the high precision
krk /(kAk kx̂k), computation, if the iterative refinement procedure converges
in a small number of steps, and if the cost of each iteration is
where r = b − Ax̂. small compared with the cost of the system factorization. If
The work in Langou et al. (2006) is notable as it suggests the cost of each iteration is too high, then a high number of
that with double (FP64) as the working precision, the iterations will result in a performance loss with respect to the
factorization P A = LU and the solution of the triangular full, double precision solver. In the sparse case, for a fixed
systems Ly = P b and U x = y can be carried out in matrix size, both the cost of the system factorization and the
single precision (FP32). As a result, the only operation cost of the iterative refinement step may vary substantially
with computational complexity of O(n3 ) is handled in the depending on the number of nonzeros and the matrix sparsity
lower precision, while all operations performed in working structure; this will be addressed in Section 4.1. In the dense
precision (triangular solves) and residual precision (residual case, results are more predictable. Note that the choice of
and solution updates) are of at most O(n2 ) complexity. The the stopping criterion in the iterative refinement process is
coefficient matrix A is converted to factorization precision critical. For an analysis of convergence criteria see Demmel
for the LU factorization, and the resulting factors are stored et al. (2006).
in factorization precision, while the initial coefficient matrix
A needs to be kept in memory. Therefore, using FP32 as
factorization precision for a FP64 problem, this approach 3.4 GMRES-IR
increases the memory requirement by 50% compared to the As mentioned, the analysis in Carson and Higham (2018)
standard algorithm. allows for a general solver for the correction equation
Building on the work of Langou et al. (2006), Carson within iterative refinement. Carson and Higham (2017)
and Higham (2018) recently analyzed the general three- first proposed the use of the GMRES method (Saad and
precision iterative refinement scheme presented in Algorithm Schultz 1986) preconditioned by the FP16 LU factors as
3.1. The analysis generalizes previously studied cases (e.g., the solver in the correction equation. This variant is called
the two-precision analyses of Langou et al. (2006) and GMRES-IR by Carson and Higham. Algorithmically, it
Higham (1997)), and also covers the case where three follows the same structure as Algorithm 3.1, except line
different precisions are used. It also allows for an arbitrary 5 is performed via preconditioned GMRES. A summary
solver to be used in line 5 of Algorithm 3.1. For LU-based of results for various precision combinations can be found
iterative refinement (i.e., when the solve in line 5 is done in Carson and Higham (2018, Table 8.1). It is shown that

Prepared using sagej.cls

Abdalafattah et al. 7

in this case, the constraint on the condition number can xmax denotes the largest finite floating-point number (see
be relaxed compared to the LU-based refinement scheme. Table 2.1).
For example, with factorization precision FP16, working
precision FP32, and residual precision FP64, we can expect Algorithm 3.2 (Two-sided diagonal scaling then round.)
forward and backward errors on the order of FP32 as long This algorithm rounds A ∈ Rn×n to the FP16 matrix A(h) ,
as κ∞ (A) < 108 . We refer to (Higham 2019, Table 3.1) for scaling all elements to avoid overflow. θ ∈ (0, 1] is a
limiting forward and backward errors for this GMRES-based parameter.
approach when two precisions are used, with the residual 1: Apply any two-sided diagonal scaling algorithm to A, to
precision equal to the working precision. obtain diagonal matrices R, S.
The idea behind GMRES-IR is that even though the low- 2: Let β be the maximum magnitude of any entry of T AS.
precision LU factors may be of low quality, they can still 3: µ = θxmax /β
be effective preconditioners in using the GMRES method 4: A(h) = f l` (µ(T AS))
to solve the correction equation Adk = rk , resulting in an
effective solve precision us = u. The condition number of
the resulting preconditioned system is reduced enough to For FP16, in light of the narrow range, we will also
guarantee backward stability of the approximate solution multiply the shifted matrix by a scalar to bring it close to the
computed by GMRES even for matrices that are nearly overflow level xmax and to minimize the chance of underflow
numerically singular with respect to the working precision. and of subnormal numbers being produced.
In contrast, using a basic triangular solve with the low- Higham et al. (2019) recommend two different algorithms
precision LU factors to solve Adk = rk , for which us = u` , for determining T and S; both algorithms are carried out at
provides no degree of relative accuracy once κ∞ (A) exceeds the working precision. The first option is row and column
u−1
` . Using preconditioned GMRES, we can still guarantee equilibration, which ensures that every row and column
that the solution of the correction equation has some correct has the maximum element in modulus equal to 1—that
digits and a residual at the level of the convergence tolerance is, each row and column is equilibrated. The LAPACK
requested by the algorithm despite the apparent low quality routines xyyEQU carry out this form of scaling Anderson
of the computed preconditioners. et al. (1999). The second option, for symmetric matrices,
Since this paper focuses on the practical usage and is a symmetry-preserving two-sided scaling proposed by
possible performance gains rather than error analysis, we Knight et al. (2014). The algorithm is iterative and scales
point the reader to Higham (2002), Carson and Higham simultaneously on both sides rather than sequentially on one
(2017), Carson and Higham (2018), and Higham (2019) for side and then the other.
detailed error analysis of both standard iterative refinement For more details on scaling see Higham et al. (2019),
and GMRES-IR. Of course, in order to be beneficial, it is Higham and Pranesh (2019a), and Carson et al. (2020).
necessary that the total number of GMRES iterations and the
total number of refinement steps remains small. As shown in
Carson and Higham (2017) and Carson and Higham (2018),
3.5 Low-Precision Cholesky Factorization
this is indeed the case for many problems. In the previous section, we considered general matrices. We
We note that the HPL-AI mixed-precision benchmark,† now assume that we are given a symmetric positive definite
which is designed to take into account the availability matrix A ∈ Rn×n in precision u and wish to compute a
of hardware accelerators for low-precision computation, is Cholesky factorization at precision u` > u for use in iterative
based on GMRES-IR. refinement. The most practically important cases are where
(u` , u) = (half, single), (half, double), or (single, double).
3.4.1 Scaling It is clear that the use of low-precision
The obvious approach is to form A(`) = f l` (A), where f l`
floating point arithmetic in iterative refinement can lead to
denotes the operation of rounding to precision u` , and then
significant speedups. However, FP16 has a small dynamic
compute the Cholesky factorization of A(`) in precision u` .
range, and therefore encountering overflow, underflow, and
However, this approach can fail for two reasons. First, if
subnormal numbers is very likely‡ .
FP16 is used, then the limited range might cause overflow
We consider a two-sided diagonal scaling prior to during the rounding. Second, for both BFloat16 and FP16,
converting to FP16: A is replaced by RAS, where: A(`) can fail to be (sufficiently) positive definite, because
a matrix where the smallest eigenvalue is safely bounded
T = diag(ti ), S = diag(si ), ti , si > 0, i = 1 : n.
away from zero with respect to single precision or double
Such scaling algorithms have been developed in the context precision can become numerically indefinite under rounding
of linear systems and linear programming problems. In to half precision. The second issue can also arise when
contrast to previous studies (see Elble and Sahinidis (2012)), a double precision matrix is rounded to single precision.
where the aim of scaling has been to reduce a condition To overcome these problems, Higham and Pranesh (2019b)
number or to speed up the convergence of an iterative method propose scaling and shifting.
applied to the scaled matrix, we scale in order to help
squeeze a single-precision or double-precision matrix into
half precision, with a particular aim to use the resulting half- † https://icl.bitbucket.io/hpl-ai/
precision LU factors for iterative refinement. ‡ Note that some hardware architectures, e.g., NVIDIA tensor cores, perform
Higham et al. (2019) propose the use of two-sided computations and accumulations in higher precision, and only truncate
diagonal scaling given in Algorithm 3.2. Recall that down to FP16 when writing the results to main memory.

Prepared using sagej.cls

8 ()

3.5.1 Scaling The first step is to scale the matrix A to Algorithm 3.1), mixed-precision factorizations apply higher
1/2
the unit diagonal matrix H = D−1 AD−1 , D = diag(aii ), precision (e.g., uw ) at critical parts of the algorithm to obtain
and D will be kept at precision u. Cholesky factorization is more accurate factorizations while retaining the performance
essentially numerically invariant under two-sided diagonal of the low-precision counterpart.
scaling, so the sole reason for scaling is to reduce the The mixed-precision factorizations were motivated by the
dynamic range in order to avoid overflow and reduce the need to get extra precision when working with very low
chance of underflow for FP16. For BFloat16 or FP32, it is precisions, like the FP16. Also, this allows one to easily
not usually necessary to scale, and we can work with A overcome implementation issues and other limitations of
throughout. using FP16 arithmetic and harness the power of specialized
hardware (e.g., Tensor Cores) for a larger range of scientific
3.5.2 Shifting We now convert H to the lower-precision
computing applications.
u` , incorporating a shift to ensure that the lower-
precision matrix is sufficiently positive definite for Cholesky The developments were applied to GPU Tensor Cores and
factorization to succeed. We shift H by cn u` I, where cn is illustrate that FP16 can be used to obtain FP64 accuracy for
a positive integer constant, to obtain G = H + cn u` I. Since problems with κ∞ (A) of up to 105 , compared to a more
the diagonal of H is I, this shift incurs no rounding error, typical requirement of κ∞ (A) < 104 . The work illustrates
and it produces the same result whether we shift in precision that mixed-precision techniques can be of great interest for
u and then round or round and then shift in precision u` . linear solvers in many engineering areas. The results show
that on single NVIDIA V100 GPU, the new solvers can be
Our final precision-u` matrix is constructed as:
up to 4× faster than an optimized double-precision solver
G = H + cn u` I, (Haidar et al. 2018b), (Haidar et al. 2017), (Haidar et al.
2018a), and (Haidar et al. 2020).
β = 1 + cn u` , µ = θxmax /β, (3.2)
A building block for the mixed-precision factorizations is
(h)
A = f l` (µG), mixed-precision BLAS. Having mixed-precision BLAS can
ease the development of many mixed-precision LAPACK
where θ ∈ (0, 1) is a parameter. Note that β = maxij |gij |, algorithms. Currently, cuBLAS provides a mixed FP32-FP16
so the largest absolute value of any element of A(h) is precision HGEMM that uses the GPU’s Tensor Cores for
θxmax . Note also that since the growth factor for Cholesky FP16 acceleration. In this GEMM, the input matrices A and
factorization is 1 (see Higham (2002, Problem 10.4)), there B can be FP32, be internally cast to FP16, used to compute
is no danger of overflow during Cholesky factorization of a GEMM on Tensor Cores in full (FP32) accuracy, and then
A(h) . the result is stored back on the GPU memory in FP32. There
Higham and Pranesh (2019a, Section 3.3) provide analysis are two main benefits to having such mixed-precision BLAS
suggesting the choice of cn . A pragmatic approach is routines. First, note that this mixed-precision HGEMM is
to take cn to be a small constant, and if the Cholesky almost as fast as the non-mixed FP16 HGEMM (Figure 3.2).
factorization fails, increase c and try again. Based on this, we Second, the use of mixed-precision gains about one more
present the low-precision Cholesky factorization algorithm decimal place of accuracy (Figure 3.3).
in Algorithm 3.3. Aside from the two main benefits outlined above,
the availability of mixed-precision GEMMs also enables
Algorithm 3.3 (Low-precision Cholesky factorization.) us to easily develop other mixed-precision algorithms
Given a symmetric positive definite A ∈ Rn×n in precision (e.g., LAPACK), including the various mixed-precision
u, this algorithm computes an approximate Cholesky factorizations that we recently added in MAGMA (Haidar
factorization RT R ≈ µD−1 AD−1 at precision u` , where et al. 2018b). Figure 3.5 shows the performance of the
1/2
D = diag(aii ). The scalar θ ∈ (0, 1] and the positive mixed-precision LU (marked as “FP16-TC hgetrf LU”).
integer c are parameters. Note that this factorization is about 4×–5× faster than
1/2 dgetrf. Its data storage is in FP32, and the implementation
1: D = diag(aii ), H = D−1 AD−1
2: G = H + cu` I is the same as sgetrf, except that it uses the mixed-precision
3: β = 1 + cu` HGEMMs for the trailing matrix updates.
4: µ = θxmax /β Figure 3.6 shows the mixed-precision iterative refinement
5: A(h) = f l` (µG) in MAGMA (Haidar et al. 2018b), which uses a backward
6: Attempt Cholesky factorization A(h) = RT R in preci- error criterion for convergence. The 4× overall acceleration
sion u` . is due to a number of optimizations. First, note that the 3
7: if Cholesky factorization failed then iterations to get to FP64 accuracy led to a loss of about
8: c ← 2c, goto line 2 2 TFLOP/s compared with the hgetrf performance (24
9: end if TFLOP/s vs. 26 TFLOP/s) (i.e., the overhead of one iteration
can be deduced as being about 2%). Losing 75% (e.g.,
through up to 40 iterations) would lead to no acceleration
compared to the FP64 solver. This overhead per iteration
3.6 Mixed-precision Factorizations is very low, owing to fusing all data conversions with
Haidar et al. (2018b) proposed iterative refinement methods computational kernels. Without fusion, the overhead would
using mixed-precision factorizations. While classical itera- have been easily about 3× higher. Second, note that iterative
tive refinement and extensions like the GMRES-IR use fixed- refinement using the mixed-precision factorization has less
precision factorizations (e.g., in precision u` as illustrated in than half of the overhead in terms of iterations to solution

Prepared using sagej.cls

Abdalafattah et al. 9

using the Cholesky factorization of AT A (Section 3.5). In

26 FP16-TC (Tensor Cores) hgetrf LU
FP16 hgetrf LU general, this method is not recommended unless A is very
24 FP32 sgetrf LU
22 FP64 dgetrf LU well conditioned because it has a backward error bound of
20 order κ2 (A)u (Higham 2002, sect. 20.4), and the Cholesky
18 factorization can break down for κ2 (A) > u−1/2 , where
16 κ2 (A) is the ratio of the largest to the smallest singular
4-5X
Tflop/s

14
12
value of A. Higham and Pranesh (2019b) assume that A is
10 well conditioned and propose the Cholesky and GMRES-IR-
8 based least squares solver given in Algorithm 3.4.
6
4
2
Algorithm 3.4 (Cholesky-based GMRES-IR for the least
0
2k 4k 6k 8k 10k 14k 18k
matrix size
22k 26k 30k 34k squares problem.) Let a full rank A ∈ Rm×n , where m ≥ n,
and b ∈ Rm be given in precision u. This algorithm solves
Figure 3.5. Mixed-precision LU (hgetrf) in MAGMA and its the least squares problem minx kb − Axk2 using Cholesky-
speedup vs. FP64 LU. based GMRES-IR. The scalar θ ∈ (0, 1] and the positive
integer c are parameters.
Performance of solving Ax=b
using FP64 or IR with GMRes to achieve FP64 accuracy3 1: Compute B = AS, where S = diag(1/kaj k2 ), with aj
24
22
FP16-TC->64 dhgesv
FP16->64 dhgesv
3 the jth column of A.
20
FP32->64 dsgesv
FP64 dgesv 3 105 2: µ = θxmax
18 3 7
6
3: B (h) = f l` (µ1/2 B)
16 7 104 4: Compute C = B (h)T B (h) in precision u` .
14 3 6 4X 5: Compute the Cholesky factorization C+
Tflop/s

κ∞ (A)

3 3
12
3
6 6
2
2
2 10
cu` diag(cii ) = RT R in precision u` .
10 6 2 2
3 2 2
6: if Cholesky factorization failed then
8 2 10
3 6 7: c ← 2c, goto line 5
6 3 6 2
4 3
6 2 10
1 8: end if
3
6
2
5 2
2
9: Form b(h) = f l` (SAT b).
2
0 10
0 10: Solve RT Ry0 = b(h) in precision u` and form x0 =
2k 4k 6k 8k 10k 14k 18k 22k 26k 30k 34k
Matrix size µSy0 at precision u.
11: for i = 0 : imax − 1 do
Figure 3.6. Mixed-precision iterative refinement in MAGMA and 12: Compute ri = AT (b − Axi ) at precision ur and
acceleration vs. FP64 solvers. Note ≈ 2% overhead per round ri to precision u.
iteration, and less than half the overhead in terms of iterations 13: Solve M AT Adi = M ri by GMRES at precision u,
for mixed-precision LU vs. regular FP16 LU (the 3 vs. 7
where M = µSR−1 R−T S and matrix–vector products
iterations until FP64 convergence). The condition numbers of
the matrices are computed using FP64. with AT A are computed at precision ur , and store di at
precision u.
14: xi+1 = xi + di at precision u.
(3 vs. 7 iterations until FP64 convergence). This is due to 15: if converged then
the extra digit of accuracy that the mixed-precision HGEMM 16: return xi+1 , quit
has over the FP16 HGEMM, which also translates to a more 17: end if
accurate mixed-precision LU. 18: end for
Using mixed precision Cholesky factorization in Algo-
rithm 3.3, Abdelfattah et al. (2020) obtain speedups of up
to 4.7 over a double precision solver on an NVIDIA V100. Line 1 of Algorithm 3.4 produces a matrix B with
columns of unit 2-norm. The computation C = B (h)T B (h)
3.7 Iterative Refinement for Least Squares on line 4 produces a symmetric positive definite matrix with
constant diagonal elements µ = θxmax , so overflow cannot
Problems
occur for θ < 1. The shift on line 5 is analogous to that in
We consider the linear least squares problem minx kAx − Algorithm 3.3, but here the matrix C is already well scaled
bk2 , where A ∈ Rm×n with m ≥ n having full rank. The and in precision u` , so there is no need to scale C to have
idea of mixed-precision iterative refinement and GMRES-IR unit diagonal.
for square linear systems can be adapted to the least
Note that although Algorithm 3.4 explicitly forms C =
squares case. Least squares problems may be ill conditioned
B (h)T B (h) in Algorithm 3.4, C is used to form a
in practice, and so rounding errors may result in an
preconditioner, so the usual problems with forming a cross-
insufficiently accurate solution. In this case, iterative
product matrix (loss of significance and condition squaring)
refinement may be used to improve accuracy, and it also
are less of a concern. Also note that if we are working in
improves stability.
FP16 on an NVIDIA V100, we can exploit the Tensor Cores
3.7.1 Cholesky-Based Approach The normal equations when forming C to accumulate block fused multiply-add
method solves: operations in single precision; this leads to a more accurate
AT Ax = AT b C, as shown by the error analysis of Blanchard et al. (2020).

Prepared using sagej.cls

10 ()

For the computed R̂, we have: of linear systems are applicable. The only thing that
must change is the analysis of the method for solving
R̂T R̂ ≈ B (h)T B (h) ≈ µSAT AS, the correction equation, since we now work with a QR
factorization of A, which can be used in various ways.
or
The work in Carson et al. (2020) also extends the GMRES-
(AT A)−1 ≈ µS R̂−1 R̂−T S,
based refinement scheme of Carson and Higham (2017) to
so we are preconditioning with an approximation to the the least squares case and shows that one can construct a left
inverse of AT A. For large n, as long as GMRES converges preconditioner using the existing QR factors of A such that
quickly, the cost of the refinement stage should be negligible GMRES provably converges to a backward stable solution of
compared with the cost of forming AT A and computing the the preconditioned augmented system. Further, it is shown
Cholesky factorization. that an existing preconditioner developed for saddle point
We also mention the Cholesky-QR algorithm for systems can also work well in the GMRES-based approach
computing a QR factorization A = QR. It forms the cross- in practice, even though the error analysis is not applicable.
product matrix AT A, computes the Cholesky factorization We refer the reader to Carson et al. (2020) for further details.
AT A = RT R, and then obtains the orthogonal factor Q as For details of convergence tests for iterative refinement of
Q = AR−1 ; this process can be iterated for better numerical least squares problems see Demmel et al. (2006).
stability (Fukaya et al. 2020). Mixed precision can be
exploited in this algorithm, as shown by Yamazaki et al. 3.8 Eigenvalue Problems
(2015) and Yamazaki et al. (2016). Newton’s method can be used to refine an approximate
3.7.2 Augmented Matrix Approach As mentioned, the eigenpair of a matrix by defining a function F : Rn+1 →
Cholesky-based approach described in the previous section Rn+1 that has as its first n components (A − λI)x and
is intended only for the case where the matrix is very well its last component eTs x − 1 for some unit vector es , with
conditioned. Another approach to mixed-precision least- this last component serving to normalize x. If an initial
squares iterative refinement with a less stringent requirement eigen decomposition is available, it can be exploited to
on the condition number is presented by Carson et al. (2020). simplify the implementation of the Newton iteration. This
This approach is based on using the QR factorization: idea was developed by Dongarra (1982) and Dongarra et al.
(1983), building on a Schur decomposition and allowing
R the residual (A − λI)x to be computed in higher precision.
A=Q ,
0 Algorithm 3.5 implements this procedure, called SICE,
which, in each iteration, solves a linear system resulting
where Q = [Q1 , Q2 ] ∈ Rm×m is an orthogonal matrix with from a rank-1 update in order to refine a single eigen-
Q1 ∈ Rm×n and Q2 ∈ Rm×(m−n) , and R ∈ Rn×n is upper pair. The rank-1 update is introduced while replacing one
triangular. The unique least squares solution is x = R−1 QT1 b column in A − λI to remove one degree of freedom on
with residual kb − Axk2 = kQT2 bk2 . eigenvector correction and, at the same time, compute a
An iterative refinement approach for least squares systems correction for the corresponding eigenvalue. The original
was suggested by Björck (1967a). Refinement is performed formulation (Dongarra 1982) solves the system with two
on the augmented system series of Givens rotations to make it upper triangular. This

I

A r b process is hard to parallelize on modern architectures. Also,
= , (3.3) some form of orthogonalization should be considered while
AT 0 x 0
using the algorithm to refine more than one eigenvalue.
which is equivalent to the normal equations. In this way, This idea has been extended to the generalized eigenvalue
the solution xi and residual ri for the least squares problem problem by Tisseur (2001) and Davies et al. (2001).
are simultaneously refined. Björck (1967a) shows that this For the symmetric eigenvalue problem, Petschow et al.
augmented system can be solved by reusing the QR factors (2014) use extra precision to improve the accuracy of
of A. the multiple relatively robust representations (MRRR)
Existing analyses of the convergence and accuracy of algorithm, with little or no performance penalty.
this approach in finite precision assume that, at most, two Algorithm 3.6 shows another iterative refinement proce-
precisions are used; the working precision u is used to dure from Ogita and Aishima (2018) for solving a symmetric
compute the QR factorization, solve the augmented system, eigenvalue problem. This method also succeeds for clustered
and compute the update. A second precision ur ≤ u is used eigenvalues Ogita and Aishima (2019). Lines 4, 5, and 10
to compute the residuals. Typically ur = u2 , in which case represent the compute-intensive parts of the algorithm, which
it can be shown that as long as the condition number of the amounts to 4 calls to the matrix-matrix multiply function
augmented system matrix is smaller than u−1 , the refinement xGEMM. Line 8 uses the 2-norm, but it is recommended to
process will converge with a limiting forward error on the approximate using the Frobenius norm, because it is much
order of u; see Björck (1990) and Higham (2002, sect. 20.5) easier to compute in practice. Line 9 is an element-wise
and the references therein. operation to construct the refinement matrix E. Line 10 is
Carson et al. (2020) show that the three-precision iterative the update of eigenvectors by applying the refinement matrix
refinement approach of Carson and Higham (2018) can E. High-precision arithmetic is required for all computations
be applied in this case; the theorems developed in Carson except line 8 for the matrix norm. Even though the algo-
and Higham (2018) for the forward error and normwise rithm may be applied for only a subset of ` eigenvectors,
and componentwise backward error for iterative refinement the refinement iterations are confined to the corresponding

Prepared using sagej.cls

Abdalafattah et al. 11

Algorithm 3.5 SICE algorithm for iteratively refining Algorithm 3.6 Iterative refinement for symmetric eigenvalue
computed eigenvalue. problem.
1: function [x, λ] ← SICE(A, x0 , λ0 ) 1: Input: A = AT ∈ Rn×n , X b ∈ Rn×` , 1 ≤ ` ≤ n
[Q, T ] ←schur(A) . Schur decomposition to find 0 n×`

2: 2: Output: X ∈R , De = diag λei ∈ R`×` ,
A = QT QT where T is upper quasitriangular. Ee ∈ R`×` , ω ∈ R
3: [m, s] ← max(x0 ) . Find maximum value and h i
3: function X 0 , D,e E, e ω ← R EF S Y E V(A, X)b
index in the eigenvector.
4: x0 ← x0 /m . Normalize 4: R ← In − X bT X b
T
5: for i = 1, 2, . . . do 5: S ← X AX
b b
6: c ← −xi−1 − (A − λi−1 I)[:, s] . Column s of 6: λbi ← sii /(1 − rii ) for i = 1, . . . , ` . Compute
A − λi−1 I approximate eigenvalues.

7: d ← QT c 7: De ← diag λei
8: f ← eTs Q . Row s of Q
9: Solve the rank-1 updated system 8: ω ←2 S−D e + kAk2 kRk2
2
Q(T − λi−1 I + df T )QT yi = Axi−1 − λi−1 xi−1

 sij +λej rij if λ ei − λ
ej > ω
10: λi ← λi−1 + yi [s] . Eigenvalue correction. 9: eij ← e j −λ
λ ei
for
11: xi ← xi−1 + yi . Eigenvector correction. rij /2 otherwise
12: xi [s] ← xi−1 [s] . Restore x[s]. 1 ≤ i, j ≤ ` . Compute the entries of the refinement
13: if 2 × yi [s] > yi−1 [s] then matrix E.
e
14: Break from for loop. 10: X0 ← X b +X
bEe . Update X b by X(I
b n + E)e
15: end if 11: end function
16: end for
17: x ← xi
18: λ ← λi
19: end function

subspace and its approximation provided on input. Hence,

the refinement process could be limited: the desired accuracy
might be unattainable if only a part of the spectrum is
refined in higher precision and the subspace spanned by input
approximate eigenvectors does not cover the real eigenvector
close enough. It is designed to be applied on the clus-
tered eigenvalues after the eigenspace has been accurately
captured. The clustered eigenvalues’ subspace is not very Figure 3.7. Convergence of eigenvalue refinement using
sensitive to perturbations, provided that the gap between algorithm 3.6 from single to double precision for a real
symmetric matrix of size n = 100. The matrix is generated with
the clustered eigenvalues and all the others is sufficiently
random eigenvectors and geometrically distributed eigenvalues
large. Thus, eigenvalue gaps must be expanded in each between 1 (blue) and 10−5 (red).
clustering subproblem by using diagonal shifts as proposed
by Ogita and Aishima (2019) in addition to a modification
for improved orthogonality of the refined eigenvectors. Fig-
4 Sparse Linear Algebra
ure 3.7 shows the convergence behavior of Algorithm 3.6 on
a real symmetric matrix of size n = 100 when refining the In contrast to dense linear algebra, sparse linear algebra
entire spectrum from single precision toward double preci- operations are typically memory bound, and the primary
sion. The matrix is generated with random eigenvectors and goal of mixed-precision sparse linear algebra algorithms is
geometrically distributed eigenvalues between 1 and 10−5 . to reduce the memory access and communication volume.
Each line represents the convergence of one eigenvalue, However, for sparse direct solvers, blocking strategies and
and the normalized residual kAx − λxk/kAkkxk is plotted fill-in during the factorization can result in matrices for
against subsequent iteration numbers. The color indicates the which the factorization step becomes compute bound. This
value of eigenvector from blue as 1 toward red as 10−7 . makes mixed-precision iterative refinement strategies like
Here we can see the condition of solving the eigenvector is those presented in Section 3.3 also attractive for sparse linear
related to the gap between the eigenvalue and it’s neighbor- systems (see Section 4.1). For 2-D and 3-D problems, the
hoods. The blue ones are well conditioned and can converge matrix representation of a finite element discretization has
toward double precision accuracy in 2 iterations while the a larger matrix bandwidth, and the fill-in arising in the
red ones requires up to 8 iterations. This is an important factorization often exceeds the hardware capabilities in terms
computational aspect of the convergence rate. The larger the of memory and computational cost. Consequently, many
iteration count, the more time will be spent in line 4, 5, and applications rely on iterative solvers and preconditioned
10 of Algorithm 3.6 that have to use higher precision matrix iterative solvers to handle sparse linear systems. Mixed-
product. precision strategies to accelerate iterative sparse linear

Prepared using sagej.cls

12 ()

algebra methods utilize components that are not critical solved using the low-precision factorization. However, the
to the final accuracy (e.g., preconditioners or individual triangular factors are (block-) sparse, and for solving the
operations in a larger algorithm) in lower precision than triangular systems, the same strategy of gathering data from
working precision, or trade low-precision memory access the sparse data structures into contiguous memory proves
against additional iteration steps. In Section 4.2, we present successful,
a theoretical analysis of the rounding effects low-precision (block-) sparse triangular solve:
computations have on the accuracy of Krylov solvers.
However, as previously mentioned, it is usually not the 1. gather the values from the sparse triangular structures
arithmetic computations that limit the performance of into dense blocks in registers / fast memory and
iterative algorithms for sparse problems, but rather it is the 2. invoke dense linear algebra kernels to solve for the
communication and memory bandwidth. In Section 4.3, we right-hand side.
present the idea of radically decoupling the format used Again, gathering the data in dense blocks enables the use of
for arithmetic computations from the format that is used efficient dense linear algebra kernels. Using low precision,
for communication and memory operations. This concept the memory-bound “gather” step benefits from a reduced
can span from using lower precision for memory accesses memory access volume, and the dense linear algebra kernels
to using dedicated compression techniques before invoking benefit from the higher performance in low precision.
communication operations. Examples of how this concept of For dense linear algebra, the performance benefits of
format decoupling and compression helps accelerate sparse mixed-precision iterative refinement over high-precision
linear algebra include preconditioners for iterative solvers dense direct solvers mostly correlate with the hardware-
(Section 4.4) and multigrid methods (Section 4.5). specific arithmetic performance limits in the different
precision formats. In particular, the performance benefits are
4.1 Mixed Precision Sparse Direct Solvers mostly independent of the problem characteristics. This is
The factorization process of a sparse matrix usually different when using mixed-precision iterative refinement
generates fill-in elements, significantly increasing the for the direct solution of sparse problems, as the matrix
number of nonzero elements in the factors. The fill-in is structure determines the amount and structure of the fill-
usually structured, and the fill-in locations can be predicted in, the efficiency of the dense linear algebra kernels
from the sparsity pattern of the original system matrix. To operating on the induced dense blocks, and the ratio between
improve performance and memory efficiency, factorization- memory operations and arithmetic operations. As a result,
based sparse solvers typically operate in a block-sparse it is much harder to predict whether the mixed-precision
fashion: forming blocks covering the nonzero elements iterative refinement variant of a sparse direct solver provides
reduces the indexing information to index the blocks, and performance benefits over the execution of a sparse direct
storing the elements as small dense blocks allows for solver in high precision.
the application of highly efficient dense linear algebra
operations. There exist two options for realizing the concept 4.2 Mixed-Precision Krylov Solvers
of block-sparse factorizations. One is to convert the system
The scope of our review includes both Lanczos-based (short-
matrix into a block-sparse matrix prior to the factorization
term recurrence) and Arnoldi-based (long-term recurrence)
process. The other, more popular, one is based on forming
methods and the associated methods for solving linear
the dense blocks “on-the-fly” in registers/fast memory during
systems of equations Ax = b. In the context of long-
the factorization process, (block-) sparse factorization:
term recurrence methods, we consider both the Arnoldi-QR
1. gather the values from sparse data structures into dense algorithm with the modified Gram-Schmidt implementation
blocks in registers/fast memory; of the GMRES Krylov subspace method for iteratively
2. invoke dense linear algebra kernels on dense blocks; solving linear systems of equations as well as Flexible
and GMRES (FGMRES). The emphasis here is to examine
3. scatter results into the sparse output data structure. the approaches employed to date that incorporate mixed-
precision floating point arithmetic to speed-up computations
Similar to dense linear solvers, sparse direct solvers can
while retaining some or all of the numerical properties of the
also benefit from the mixed-precision iterative refinement
original algorithms in FP64 arithmetic (i.e., representation
framework presented in Section 3.3: The (block-) sparse
error and loss of orthogonality).
factorization is computed in low precision, thereby
leveraging the corresponding high compute power, and 4.2.1 Lanczos-CG First, we briefly summarize the most
the iterative refinement process recovers a high-accuracy well-known results on the finite precision behavior of
solution. Contrary to the dense case, the low precision Lanczos and CG methods and discuss how such results could
(block-) sparse factorizations not only benefit from higher potentially be extended to the mixed precision case and
arithmetic performance in the invocation of the compute- existing progress in this area. We also note that the literature
bound dense linear algebra kernels, but they also benefit from on finite precision behavior of Lanczos-based methods is
the reduced memory access volume in the memory-bound expansive, and we cannot hope to fully describe it here. For
gather and scatter operations. a more thorough account and historical references, we point
The iterative refinement process for recovering high- the reader to the survey of Meurant and Strakoš (2006).
precision solutions for a sparse linear system is conceptually Fundamental relations dealing with the loss of orthog-
identical to the dense case: like in Algorithm 3.1, an onality and other important quantities in finite precision
error correction equation computed in high precision is Lanczos have been derived by Paige (1980). These results

Prepared using sagej.cls

Abdalafattah et al. 13

were subsequently used by Greenbaum to prove backward theoretical study, which we believe can be achieved by
stability-like results for the CG method (Greenbaum 1989); extending the results in Sleijpen and van der Vorst (1996)
namely, Greenbaum showed that CG in finite precision can and the related work of Van Der Vorst and Ye (2000) to the
be seen as an exact CG run on a larger linear system, in mixed precision setting.
which the coefficient matrix has eigenvalues in tight clusters
4.2.2 Flexible GMRES Much work has been done
around the eigenvalues of the original matrix, where the
involving the use of lower-precision preconditioners within
diameter of these clusters depends on properties of the
iterative solvers, in particular, GMRES and FGMRES run in
matrix and the machine precision. Greenbaum also proved
a higher precision.
fundamental results on the maximum attainable accuracy in
Arioli and Duff (2009) rigorously prove, that using
finite precision, that is, the limiting value of kxk − xk/kxk
a triangular factorization computed as a preconditioner
for approximate solutions xk and true solution x, in CG and
in FP32, FGMRES run in FP64 produces a solution
other “recursively computed residual methods” (Greenbaum
with backward error to FP64 accuracy. In contrast, they
1997). The results of Paige and Greenbaum have also been
demonstrate that using FP64 iterative refinement as the
extended to s-step Lanczos/CG variants in Carson (2015),
solver may fail in such cases. They provide numerical
where it is shown that s-step Lanczos in finite precision
experiments which support their theoretical analysis. This
behaves like a classical Lanczos run in a lower “effective”
builds on the previous work of Arioli et al. (2007), in which
precision, where this “effective” precision depends on the
it is proved that FGMRES is backward stable.
conditioning of the polynomials used to generate the s-step
Building on the work of Arioli and Duff (2009), Hogg
bases. We believe that these existing results can be extended
and Scott (2010) develop a single-precision (FP32) imple-
to the mixed precision case.
mentation of an LDLT factorization method for solving
Existing results in the area of mixed precision Lanczos- sparse-symmetric linear systems. This FP32 factorization is
based methods are contained within the work on “inexact then used as a preconditioner within FP64 iterative solvers,
Krylov subspace methods,” which also applies to Arnoldi- including iterative refinement and FGMRES, effectively
based methods (see Simoncini and Szyld (2003) and van den creating a mixed-precision solver. They demonstrate that
Eshof and Sleijpen (2004)). Within such frameworks, it for linear systems that are sufficiently well conditioned,
is assumed that the matrix-vector products are computed the mixed-precision approach was sufficient for obtaining
with some bounded perturbation (which can change in each FP64 accuracy; the remaining cases required a full FP64
iteration), and all other computation is exact. These methods implementation. Additionally, it is shown that the mixed-
were motivated by improving performance in applications precision approach is beneficial in terms of performance for
where the matrix-vector products dominate the cost of sufficiently large problems.
the computation (e.g., when the matrix is dense or the
application of A involves solving a linear system). Many 4.2.3 Arnoldi-QR MGS-GMRES For MGS-GMRES the
theoretical results on “inexact Krylov subspace methods,” mixed-precision work by Gratton et al. (2020) is the
mostly focused on the maximum attainable accuracy, have most recent and appropriate—and in particular the loss-of-
been proved in the literature. A surprising result is that the orthogonality relations due to Björck (1967b) and Paige
inexactness in the matrix-vector products can be permitted (1980), later refined by Paige et al. (2006), are employed
to grow in norm as the iterations progress at a rate in order to provide tolerances for mixed FP32–FP64
proportional to the inverse of the residual norm without computations. MGS-GMRES convergence stalls (the norm-
affecting the maximum attainable accuracy. However, a wise relative backward error approaches ε) when linear
crucial practical question is whether inexactness will affect independence of the Krylov vectors is lost, and this is
the convergence behavior before the attainable accuracy is signaled by Paige’s S matrix norm kSk2 = 1. The S matrix
reached; this is entirely possible in the case of short-term (Paige 2018) is derived from the lower triangular T matrix
recurrence methods such as CG and has not been well studied appearing in the rounding error analyses by Giraud et al.
theoretically. (2004).
For comprehensiveness, we briefly mention works which To summarize, Gratton et al. (2020) postulate starting
make use of mixed precision Krylov subspace methods in from the Arnoldi-QR algorithm using the modified Gram-
practical applications, focusing on performance rather than Schmidt algorithm and employing exact arithmetic in the
on theoretical results. MGS-GMRES iterative solver. The Arnoldi-QR algorithm
applied to a non-symmetric matrix A produces the matrix
One instance of this is in the work of Clark et al.
factorization, with loss of orthogonality Fk .
(2010), which uses mixed precision CG and BICGSTAB
T
methods to implement the “reliable update” strategy of AVk = Vk+1 Hk , Vk+1 Vk+1 = I + Fk (4.4)
Sleijpen and van der Vorst (1996) within a Lattice QCD
They next introduce inexact (e.g., single precision) inner
application run on GPUs. The idea behind the “reliable
products—this directly relates to the loss-of-orthogonality
update” strategy is that the true residual is computed and
relations for the A = QR factorization produced by MGS.
used to replace the recursively updated residual in select
The resulting loss of orthogonality, as measured by kI −
iterations, thus improving the attainable accuracy; this is
QT Qk2 , grows as O(ε)κ(A), as was derived by Björck
done in conjunction with batched updates to the solution
(1967b) and O(ε)κ([ r0 , AVk ]) for Arnoldi-QR—which is
vector. By using higher (FP64) precision only in the true
described in Paige and Strakoš (2002), Paige et al. (2006),
residual computations and group updates (and FP32 or FP16
and related work. The inexact inner products are given by:
for the rest of the computation), the authors claim they
are able to achieve FP64 accuracy. This deserves further hij = viT wj + ηij , (4.5)

Prepared using sagej.cls

14 ()

where hij are elements of the Hessenberg matrix Hk , and triangular solve rj = (I + Lj−1 )−1 QTj−1 aj to update R, as
the Arnoldi-QR algorithm produces a QR factorization of this would directly employ the forward error analysis of
the matrix: Higham (1989). The former affects the loss of orthogonality,
whereas the latter affects the representation error for QR—
r0 , AVk = Vk+1 β e1 , Hk . (4.6) but then also for Arnoldi-QR. This could allow more (or
most) of the inner products to be computed in FP32.
The loss of orthogonality relations for Fk are given below, Evidence for maintaining orthogonality is provided in
where the matrix U is strictly upper triangular. Figure 4.1, with kI − QT Qk plotted for A = QR using
 T the inner products in standard MGS (blue) in FP64 versus
v1 v2 · · · v1T vk+1

the inverse compact WY MGS (red) with QTj−1 qj−1 in
Fk = Ūk + ŪkT , Uk = 
 ..  (4.7)

. FP32 (simulated in MATLAB), and we observe at least
vkT vk+1 the same or slightly higher error levels. The x-axis is the
log condition number for randomly generated matrices. The
Define the matrices as below. lower triangular solve is computed in FP64.
    Barlow (2019) contains similar if not the same algorithm
η11 · · · η1k h21 ··· h2k
formulations in block form. His work is related to Björck’s
Nk = 
 . ..  Rk = 
  .. 
.  1994 paper (Björck 1994, Section 7), which derives the
ηkk hk+1,k triangular matrix T using a recursive form for MGS, and
(4.8) which is referred to as a “compact WY” representation in
The loss of orthogonality relation, derived by Björck (1967b) the literature. While Björck used a lower triangular matrix T
for the A = QR factorization via the modified Gram- for the compact WY form of MGS, Malard and Paige (1994)
Schmidt algorithm, can be applied to the Arnoldi-QR derived the upper triangular form, also employed by Barlow,
algorithm to obtain: which reverses the order of elementary projectors. The latter
is unstable in that a backward recurrence leads to O(ε)κ2 (A)
Nk = − 0, Uk Hk = −Uk Rk . (4.9) loss of orthogonality. An interesting observation from Leon
et al. (2013) is that the upper triangular form is less stable
The complete loss of orthogonality (triggers a loss of than the lower triangular, even though the backward-forward
linear independence) of the Krylov vectors in MGS-GMRES algorithm results in re-orthogonalization; see the algorithm
signals the minimum error is achieved, and GMRES then in Leon et al. (2013).
stalls or really can go no further than when the norm-wise Barlow (2019) employs the Householder compact WY
relative backward error reaches O(ε). Gratton et al. (2020) representation of reflectors and also refers to the work of
show how to maintain sufficient orthogonality to achieve a Puglisi (1992)—discussed in Joffrain et al. (2006)—and this
desired relative residual error level by switching the inner is referred to as the “inverse compact WY” representation of
products from FP64 to FP32 at certain tolerance levels Householder; this originally comes from Walker’s work on
and combine this with inexact matrix-vector products as in Householder GMRES Walker (1988). Barlow then extends
van den Eshof and Sleijpen (2004) and Simoncini and Szyld this approach to the block compact WY form of MGS; see
(2003). also the technical report by Sun (1996). The contribution by
In practice, the restarted variant of GMRES is often Świrydowicz et al. (2020) was to note that there exists an
employed to reduce memory requirements. The algorithm inverse compact WY representation for MGS—having the
produces both implicit (iteratively-computed) and explicit projector P with lower triangular correction matrix T :
residuals. Thus, we might ask whether either can be
performed in reduced precision. The work described herein P = I − Qj−1 T QTj−1
on iterative refinement by Carson and Higham for mixed = I − Qj−1 (I + Lj−1 )−1 QTj−1
precision can be applied to analyze the convergence
—and to “lag” the norm kqj−1 k2 so that these can be
of restarted GMRES(m), assuming a fixed number of
computed in one global reduction. Barlow (2019) makes
iterations, because restarted GMRES is just iterative
this connection for blocks, and in effect this is given in his
refinement with GMRES as the solver for the correction
equation (3.10), and references Puglisi (1992).
term. However, a more detailed analysis with experiments
Björck and Paige (1992) made the link between
has yet to be performed. We are fairly certain that the
Householder and MGS based on the observation made by
residual computations must be performed in higher precision
Sheffield. Paige defines this to be augmentation, and Gratton
in order to achieve a norm-wise backward error close to FP64
et al. (2020) also references this work. Paige has also recently
machine round-off.
extended these augmentation ideas to Lanczos. The T matrix
4.2.4 Alternative Approaches Although somewhat out- appears in Paige and Wülling (2014) and then later in
side the scope of this review, we can demonstrate that it is Paige (2018) to derive the loss of orthogonality matrix S =
possible to modify the Gratton et al. (2020) analysis based (I + LTj−1 )−1 LTj−1 . This also appears in the work of Giraud
on the inverse compact W Y form of the MGS algorithm, et al. (2004); Langou also worked with Smoktunowicz et al.
introduced by Świrydowicz et al. (2020). Rather than treat all (2006) on the Pythagorean trick to reduce cancellation error
of the inner products in the MGS-GMRES algorithm equally, in the computation of vector norms and a Cholesky-like form
consider the strictly upper triangular matrix U = LT from of classical Gram-Schmidt (CGS).
the loss of orthogonality relations. We introduce a single- In order to combine single-double floating-point opera-
precision (FP32) Lk−1,1:k−2 = (QT1:k−2 qk−1 )T and an FP64 tions in MGS-GMRES, at first it appears that we could

Prepared using sagej.cls

Abdalafattah et al. 15

Figure 4.1. Loss of orthogonality for mixed single-double MGS

algorithm.
Figure 4.3. Accessor separating the memory format from the
arithmetic format and realizing on-the-fly data conversion in
each memory access.

a widening gap between the compute power (in terms of

FLOP/s) on the one side and the communication power
(in terms of memory bandwidth) on the other. In modern
processor technology, retrieving values from main memory
takes several orders of magnitude longer than performing
arithmetic operations, and communicating between distinct
nodes of a cluster is again orders of magnitude slower
than main memory access. With no disruptive hardware
changes on the horizon, we are facing a situation where all
applications suffer from the slow communication to main
memory or in-between nodes.

Figure 4.2. GMRES residuals and loss of orthogonality kSk2 A promising—and maybe the only promising—strategy
for impcol e matrix. to overcome this problem is to utilize the bandwidth
capacity more carefully, reduce the communication vol-
ume and the number of communication points, and—
store the T matrix in FP32, but then we would still have whenever possible—trade communication against compu-
to form QTj−1 aj , and store Qj−1 in FP64. By examining tations. Specifically, the idea is to radically decouple the
the cost trade-offs a bit further, we can instead use a form memory precision from the arithmetic precision, employ
of re-orthogonalization based on a backward-forward solver high precision only in the computations, and lower the
recurrence: precision as much as possible when accessing data in main
memory or communicating with remote processors (Anzt
T = (I + LTj−1 )−1 (I + Lj−1 )−1 , et al. 2019b). An important aspect in this context is the
design of a “memory accessor” that converts data on the fly
and our initial computational results, displayed in Figure 4.2,
between the IEEE high precision arithmetic format and the
demonstrate this works well, driving the relative residual
memory/communication format (Figure 4.3). The memory/-
and, more importantly, the norm-wise relative backward
communication format does not necessarily have to be part of
error to O(ε) in FP64, with orthogonality maintained to O(ε)
the IEEE standard but can also be an arbitrary composition
in FP32 as indicated by the magenta curve. Here, the black
of sign, exponent, and significand bits (Grützmacher et al.
curve is the FP64 loss of orthogonality metric given by kSk2 .
2019) or even nonstandard formats like Gustafson’s Posits
The representation error (backwards error) for A + E =
(Unum type III (Gustafson 2015)). On an abstract level, the
QR computed by MGS is not affected by FP32 inner
idea is to compress data before and after memory operations
products and remains O(ε). We are not aware of whether
and only use the working precision in the arithmetic oper-
or not this was previously known.
ations. While one generally distinguishes between “lossy
compression” and “lossless compression” (Sayood 2012),
4.3 Memory Format Decoupling significant bandwidth reduction usually requires the loss of
We already elaborated on sparse linear algebra operations some information. How much information can be disre-
being memory bound across the complete hardware garded without endangering the numerical stability heavily
technology food chain. Additionally, we are witnessing depends on the algorithm and the problem characteristics.

Prepared using sagej.cls

16 ()

Thus, the choice of the memory format requires careful Invert the diagonal block
using Gauss-Jordan elimination.
consideration (e.g., in the form of automated format select); Select storage format:
see Section 4.4. 16-bit fp5,10 fp8,7 fp11,4
Compute condition number
and exponent range. 32-bit fp8,23 fp11,20
4.4 Mixed-Precision Preconditioning
In the iterative solution process of large sparse systems 64-bit fp11,52
(e.g., when using Kylov solvers) preconditioners are
an important building block for facilitating satisfactory Figure 4.4. Storage format optimization for block-Jacobi:
convergence. The concept of preconditioning is to turn starting from the most compact storage (left top), the format is
an ill-conditioned linear system Ax = b into a (left-) extended in exponent bits to fit the data range (rightwards) and
to preserve regularity (downwards) until the range is fit and
preconditioned system M Ax = M b (or AM y = b, x = regularity is satisfied. Note that these configurations are chosen
M y for right-preconditioning), which allows for faster to reflect the hardware characteristics (16/32/64 bit access) and
convergence of the iterative solver (Anzt et al. 2018). significand-truncated IEEE standard precision formats.
The convergence characteristics typically depend on the
conditioning of the target system. For an ill-conditioned
A, the preconditioner is also required to be ill-conditioned. when reading data from main memory. Fortunately,
Otherwise, the preconditioner cannot be expected to improve most iterative solvers and preconditioners are memory
the conditioning of the problem or the convergence of the bound, and the conversion can be hidden behind the
iterative solver. In that respect, the preconditioner basically memory transfers Flegar et al. (2021). A production-
tries to approximate the inverse of the system matrix. ready implementation of an adaptive-precision block-
Obviously, if the preconditioner is the exact inverse, the Jacobi preconditioner decoupling memory precision from
solution is readily available. However, computing the exact arithmetic precision is available in the Ginkgo library Anzt
inverse is prohibitively expensive, and, in most cases, the et al. (2020).
preconditioner is just a rough approximation of the system
matrix inverse. As a consequence, it is natural to question 4.4.1 Adaptive-Precision Block-Jacobi Preconditioning
the need for using high precision for a preconditioner The adaptive precision block-Jacobi preconditioner realizes
that is inherently carrying only limited accuracy. Indeed, the concept of decoupling arithmetic precision from memory
choosing a lower-precision format for the preconditioner precision proposed in Section 4.3 for a block-Jacobi
is a valid strategy as long as the accuracy loss induced preconditioner (Anzt et al. 2019a). The idea here is to
by using a lower-precision format impacts neither the compute a block-Jacobi preconditioner in high precision but
preconditioner accuracy nor its regularity. For example, then store the distinct inverted diagonal blocks in the lowest
Trilinos The Trilinos Project Team (2020) allows the floating point precision format that avoids overflow and still
use of low-precision preconditioners inside high-precision preserves the regularity of the preconditioner (Figure 4.4).
iterative solvers. However, the use of lower precision in This storage format is chosen for each diagonal
the preconditioner application results in different rounding block individually, respectively reflecting the numerical
effects than when using high precision. Specifically, the characteristics like condition number and value range.
rounding effects make the preconditioner non-constant, as Figure 4.5 (top) visualizes the distribution of formats
the rounding effects are not only larger than in high precision when storing the inverted diagonal blocks of size 24 for
but also depend on the input data (Anzt et al. 2019a). As symmetric positive–definite matrices of the Suite Sparse
a result, low-precision preconditioners can only be used to Matrix Collection. Obviously, converting to a lower-
accelerate an iterative method that can handle non-constant precision format generally reduces the accuracy of the
preconditioners (i.e., can converge even if the preconditioner linear operator, but as block-Jacobi preconditioners ignore
changes in between iterations). For the Krylov subspace all off-(block)diagonal entries, they are typically only a
solvers generating search directions orthogonal to the rough approximation of the matrix inverse and therefore,
previous search direction, a changing preconditioner requires by design, only have very limited accuracy. Experimental
an additional orthogonalization of the preconditioned results reveal that the use of a lower-precision format for
search direction against the previous preconditioned search storing the inverted diagonal blocks has, in most cases,
direction. The flexible Krylov solvers (e.g., FGMRES, FCG) only negligible effects on the preconditioner effectiveness
contain this additional orthogonalization and are therefore and the outer solver convergence. At the same time, storing
slightly more expensive. At the same time, they do allow for the inverted diagonal blocks in lower precision reduces the
using low-precision preconditioners, which can compensate memory access volume in every preconditioner application,
for the additional cost. thereby accelerating the bandwidth bound iterative solution
An alternative workaround is to decouple the memory process (Figure 4.5). For the adaptive-precision block-Jacobi
precision from the arithmetic precision (see Section 4.3) and preconditioner, it is important that the accessor converts the
only store the preconditioner in low precision but apply it inverted diagonal blocks back to the IEEE standard precision
in high precision Anzt et al. (2019a). Running all arithmetic not only for performance reasons—leveraging the highly
in high precision keeps the preconditioner constant and optimized IEEE floating point arithmetic of the processors—
removes the need for the additional orthogonalization of but also for numeric reasons. Using working precision in
the preconditioned search direction. On the other hand, the arithmetic operations of the preconditioner application
decoupling memory precision from arithmetic precision preserves the preconditioner as a constant operator, and
requires on-the-fly conversion of in-between formats applying a preconditioner in lower precision would result in

Prepared using sagej.cls

Abdalafattah et al. 17

Figure 4.5. Top: distribution of floating-point formats among the distinct blocks when inverting the blocks in FP64 and preserving
1-digit accuracy of the values in each inverted diagonal block when writing to main memory. Each column represents one
symmetric positive–definite matrix of the Suite Sparse Matrix Collection. Bottom: impact on the top-level CG solver solving the
system-induced linear problem. For most systems, the convergence rate is unaffected by the use of a lower storage precision
format, and almost all preconditioner applications are faster, resulting in an average 20% runtime reduction.

a non-constant preconditioner and require the use of a (more generate its components, whereas AMG can be considered
expensive) flexible iterative solver (Anzt et al. 2019a). more like a “black box” method, in that it can be given
a matrix and the right-hand side and will generate the
4.5 Mixed-Precision Multigrid Methods components for each level automatically using sensible
heuristics. These methods are an interesting target for
Multigrid methods are highly effective iterative methods.
multi-precision treatment due to their different components
There are basically two types of multigrid methods:
that affect the overall algorithm in different ways. GMG
geometric multigrid methods (GMG) and algebraic multigrid
and AMG components combine smoothers, coarser grid,
methods (AMG). GMG requires actual grids on each level to

Prepared using sagej.cls

18 ()

restriction, and prolongation operators on each level. In computer architecture. Their mixed-precision version takes
addition, it is of interest to investigate changes in precision about 84% of the time of the FP64 version.
on different levels. Finally, GMG and AMG can be used An approach described in a presentation by Clark (2019)
as preconditioners to other solvers (i.e., there is potential takes the use of mixed precision even further to involve
to use lower precision across the whole preconditioner). half precision. Clark and collaborators achieved good results
Historically, most work focused on the use of a lower- using a FP64 defect correction approach with a FP32 Krylov
precision GMG or AMG method as a preconditioner to a solver and a half-precision AMG preconditioner.
FP64 solver.
Another interesting related study by Fox and Kolasinski
Ljungkvist and Kronbichler (2017, 2019) successfully
(2019) examines the use of ZFP, a lossy compression
used mixed precision to solve the Laplace problem for
algorithm, within multigrid. Due to the local structure of
different orders with a matrix-free geometric multigrid
ZFP, ZFP can easily be integrated into numerical simulations
approach. Their solver infrastructure allows for using mixed-
without changing the underlying algorithms. However, since
precision arithmetic to perform the multigrid V-cycle in
ZFP is a lossy algorithm, it will introduce some error,
FP32 with an outer correction in FP64, thereby increasing
thus, it is important to understand if the error caused by
throughput by up to 83%.
ZFP overwhelms or other traditional sources of error (e.g.,
Similarly, Glimberg et al. (2011) use a FP32 multigrid to
discretization error).
precondition a FP64 defect correction scheme and solve the
Laplace problem within a nonlinear water wave application ZFP decomposes the field of interest into smaller pieces,
on a GPU architecture. They achieve a speedup of up to 1.6× called blocks, that are then compressed and decompressed
for the mixed-precision version over the FP64 version and a independently. ZFP compressed arrays implemented in
speedup of 1.9× for a purely FP32 version. Lindstrom (2018) are C++ classes that enable random-
Yamagishi and Matsumura (2016) also apply a FP32 accessible arrays whose storage size is specified by the user.
multigrid to a FP64 conjugate gradient solver to the In particular, ZFP fixed-rate arrays specify a rate used to
Poisson/Helmholtz problem within their non-hydrostatic compress each block of the data field to a finite number
ocean model. They report a speedup up to 2× for a FP32 of bits. The study uses ZFP fixed-rate arrays to represent
Matvec over a FP64 one and improved overall times using the approximation vector in MG on a 2-D Poisson problem
this approach; however, they compare the full application run with Dirichlet boundary conditions when the number of
only to their CPU version. interior nodes of the finest grid is (28 − 1)2 . Figure 4.6
There are various publications that pursue the same presents the relative residual for a V-cycle with or without
strategy of using a FP32 AMG preconditioner to a FP64 ZFP fixed-rate arrays. The orange line represents the relative
solver. residual with respect to the FP64 solution, while the blue line
represents the relative residual with respect to the solution
Emans and van der Meer (2012) perform a careful
with ZFP fixed-rate arrays with a rate of 32. As the number
analysis of the individual kernels of preconditioned Krylov
of V-cycles increase, the relative residual between the two
solvers on multi-core CPUs, including sparse matrix-vector
solutions matches until the relative residual for the ZFP
multiplications (SpMV), which make up a large portion
solution approximately reaches machine unit roundoff u for
of AMG. They also consider the effect of communication,
FP32.
where lower precision leads to smaller size messages, but
latencies are still an issue, particularly on the coarsest levels Figure 4.7 displays a similar study for when the rate used
of AMG. They find that the use of mixed precision for for the ZFP fixed-rate arrays is adapted depending on the
the preconditioner barely affects convergence and therefore level within the V-cycle. It is assumed that the ZFP fixed-rate
speedups for the kernels, which were between 1.1× and arrays have a fixed set of possible rates, 64, 48, 32, or 16. The
1.5×, can potentially carry over to the whole solver and blue line in Figure 4.7 depicts the relative compression error
lead to improvements of runtimes within computational fluid (i.e., the error between the FP64 solution and the ZFP fixed-
dynamics applications. rate solution, where the finest level has a rate of 64, and the
Sumiyoshi et al. (2014) investigate AMG performance rate for the coarser levels is sequentially lowered). That is, if
on a heterogeneous computer architecture with both CPUs we consider a 6-level V-cycle for which the finest level has a
and GPUs for isotropic and anisotropic Poisson problems. fixed rate of 64, then the second finest level has a fixed rate of
They consider smoothed aggregation AMG as a stand- 48, then 32, and then 16 for the remaining coarse grids. The
alone solver. They carefully analyze different portions of the orange, green, and red lines depict the relative compression
algorithm on five different architectures, including one multi- error for a rate of 48, 32, and 16, respectively for the finest
core CPU cluster. They report speedups between 1.2× and level. The purple dashed line depicts the relative truncation
1.6× on the GPU-CPU architectures for the mixed-precision error for the FP64 solution. Each ZFP fixed-rate solution
implementation over the FP64 version. These speedups are remains below the truncation error and the compression error
related to SpMV performance (between 1.6× and 1.8×) on continuously decreases until the relative error for the ZFP
these architectures. However, the mixed-precision version solution approximately reaches the respective machine unit
was slightly slower on the CPU-only architecture, which roundoff u dependent on the rate of the finest level.
achieved barely any improvement for the SpMV operations. This study shows that, for MG on a Poisson problem,
Richter et al. (2014) examine the performance of a FP32 applying ZFP to the approximation vector can significantly
AMG preconditioner (ML and PETSc) applied to a FP64 decrease memory use and is expected to decrease run times,
PCG solver. They apply the method for an electrostatic while the generated errors stay below the discretization error.
simulation of the high voltage isolator on a GPU/CPU Since a hardware version of ZFP is not available yet, no

Prepared using sagej.cls

Abdalafattah et al. 19

quantization, algebraic and descretization errors. They show

103 that while iterative refinement is susceptible to quantization
Relative Residual 101 errors during the residual and update computation, the V-
10 1 cycle used to compute the correction in each iteration
10 3 is much more resilient, and continues to work if the
10 5 system matrices in the hierarchy become indefinite due to
10 7 quantization.
10 9
10 11 With ZFP - Constant 32-Rate 4.6 Eigenvalue Problems
Double Precision
0 5 10 15 20 25 30 35 Little work appears to have been done on exploiting mixed-
V-cycle precision arithmetic in algorithms for sparse eigenvalue
problems. The ESSEX-II project (Alvermann et al. 2019)
Figure 4.6. Comparison between the relative residual for a MG is developing eigensolvers based on Jacobi-Davidson,
method, where the approximation vector is represented in FP64 subspace iteration, and other methods, and is using lower
(orange) or as a ZFP fixed-rate array with a rate of 32 (blue). precision in early iterations for speed and higher precision
within orthogonalization for robustness.

100 5 Summary and Outlook on the Potential of

10 2 Mixed Precision Technology
Relative Error

10 4
10 6 We have presented mixed-precision algorithms for dense
and sparse linear algebra that are outperforming traditional
10 8
algorithms operating in high precision. For performance-
10 10 bound dense linear algebra algorithms, mixed-precision
Compression Error-64
10 12 Compression Error-48
Compression Error-32 iterative refinement that employs a low-precision error
Compression Error-16
10 14 Truncation Error correction solver remains the first-choice algorithm to exploit
0 5 10 15 20 25 30 35 the compute power in low precision. For sparse linear
V-cycle algebra, the memory-bound nature of the the algorithms
makes the concept of decoupling memory precision from
Figure 4.7. Relative compression error for an adaptive-rate arithmetic precision attractive. Furthermore, preconditioners
ZFP solution, where the rate for the approximation vector is with limited approximation accuracy are natural targets
sequentially lowered on the coarser grids. The purple line for the use of lower precision. Carefully adjusting the
represents the truncation error for the double-precision solution.
preconditioner precision to the numerical requirements and
the approximation accuracy can render run time savings
without impacting the iterative solver’s convergence.
actual runs were possible; however, the results show good As AI and deep learning are currently driving the hardware
potential for using GMG and/or AMG as a preconditioner. market, we expect a large number of processors and
Currently, Tamstorf et al. (2020a) appear to be the accelerators featuring low-precision special function units
only ones who investigated the theory of multi-precision and support for non-standard precision formats and integer
multigrid methods. Their original intent was to improve operations. For numerical linear algebra, we anticipate
the appearance of the movement of cloth within Disney significant potential in the use of integer arithmetic for
movies, which requires higher than FP64 accuracy. However, numerical calculations and the low-precision function units
their theory applies equally to decreased precision. They designed for deep learning. As we see the machine imbalance
have created a theoretical framework with rigorous proofs continuing to grow, we also expect format decoupling and
for a mixed-precision version of multigrid for solving compression techniques to become essential and are eager to
the algebraic equations that arise from discretizing linear see hardware-support for data compression.
elliptic partial differential equations (PDEs). The arising
matrices being sparse and symmetric positive definite enable
the use of the so-called energy or A norm to establish Acknowledgments
convergence and error estimates. Bounds on the convergence This work was supported by the US Exascale Computing
behavior of multigrid are developed and analyzed as a Project (17-SC-20-SC), a collaborative effort of the U.S.
function of the matrix condition number. Both theoretical Department of Energy Office of Science and the National
and numerical results confirm that convergence to the Nuclear Security Administration. This work was performed
level of discretization accuracy can be achieved with under the auspices of the U.S. Department of Energy by
mixed-precision versions of V-cycles and full multigrid. Lawrence Livermore National Laboratory under Contract
This framework is inspired by the results of Carson and DE-AC52-07NA27344.
Higham (2017) but ultimately provides tighter bounds for
many PDEs. Tamstorf et al. (2020b) further extend their
theoretical framework to include the quantization error. They
Disclaimer
use the bounds to guide the choice of precision level in This document was prepared as an account of work
their progressive-precision multigrid scheme by balancing sponsored by an agency of the United States government.

Prepared using sagej.cls

20 ()

Neither the United States government nor Lawrence McKenney A and Sorensen DC (1999) LAPACK Users’ Guide.
Livermore National Security, LLC, nor any of their Third edition. Philadelphia, PA, USA: Society for Industrial
employees makes any warranty, expressed or implied, and Applied Mathematics. ISBN 0-89871-447-8. URL http:
or assumes any legal liability or responsibility for the //www.netlib.org/lapack/lug/.
accuracy, completeness, or usefulness of any information, Anzt H, Cojean T, Chen YC, Flegar G, Göbel F, Grützmacher
apparatus, product, or process disclosed, or represents T, Nayak P, Ribizel T and Tsai YH (2020) Ginkgo: A high
that its use would not infringe privately owned rights. performance numerical linear algebra library. Journal of Open
Reference herein to any specific commercial product, Source Software x(x): xxxx. DOI:10.21105/joss.02260. URL
process, or service by trade name, trademark, manufacturer, https://doi.org/10.21105/joss.02260.
or otherwise does not necessarily constitute or imply Anzt H, Dongarra J, Flegar G, Higham NJ and Quintana-Ortı́ ES
its endorsement, recommendation, or favoring by the (2019a) Adaptive precision in block-jacobi preconditioning
United States government or Lawrence Livermore National for iterative sparse linear system solvers. Concurrency and
Security, LLC. The views and opinions of authors expressed Computation: Practice and Experience 31(6): e4460.
herein do not necessarily state or reflect those of the Anzt H, Flegar G, Grützmacher T and Quintana-Ortı́ ES (2019b)
United States government or Lawrence Livermore National Toward a modular precision ecosystem for high-performance
Security, LLC, and shall not be used for advertising or computing. The International Journal of High Performance
product endorsement purposes. Computing Applications 33(6): 1069–1078.
Anzt H, Huckle TK, Bräckle J and Dongarra J (2018) Incomplete
Sandia National Laboratories is a multimission laboratory sparse approximate inverses for parallel preconditioning.
managed and operated by National Technology and Parallel Computing 71: 1–22.
Engineering Solutions of Sandia, LLC., a wholly owned Arioli M and Duff IS (2009) Using FGMRES to obtain backward
subsidiary of Honeywell International, Inc., for the
stability in mixed precision. Electron. Trans. Numer. Anal. 33:
U.S. Department of Energy’s National Nuclear Security
31–44. URL https://eudml.org/doc/130614.
Administration under contract DE-NA-0003525.
Arioli M, Duff IS, Gratton S and Pralet S (2007) A note on
GMRES preconditioned by a perturbed LDLT decomposition
References with static pivoting. SIAM J. Sci. Comput. 29(5): 2024–2044.
(2017) Nvidia tesla v100 gpu architecture. https://images. DOI:10.1137/060661545.
nvidia.com/content/volta-architecture/ Barlow JL (2019) Block modified Gram–Schmidt algorithms and
pdf/volta-architecture-whitepaper.pdf. their analysis. SIAM J. Matrix Anal. Appl. 40(4): 1257–1290.
Abdelfattah A, Tomov S and Dongarra J (2020) Investigating the Björck Å (1967a) Iterative refinement of linear least squares
benefit of FP16-enabled mixed-precision solvers for symmetric solutions I. BIT Numerical Mathematics 7(4): 257–278. DOI:
positive definite matrices using GPUs. In: Krzhizhanovskaya 10.1007/BF01939321.
VV, Závodszky G, Lees MH, Dongarra JJ, Sloot PMA Björck Å (1967b) Solving linear least squares problems by Gram-
and Teixeira SBJ (eds.) Computational Science—ICCS 2020, Schmidt orthogonalization. BIT Numerical Mathematics 7(1):
number 12138 in Lecture Notes in Computer Science. Springer 1–21.
International Publishing, pp. 237–250. DOI:10.1007/978-3- Björck Å (1990) Iterative refinement and reliable computing.
030-50417-5 18. In: Cox MG and Hammarling SJ (eds.) Reliable Numerical
Abdelfattah A, Tomov S and Dongarra JJ (2019) Fast Batched Computation. Oxford University Press, pp. 249–266.
Matrix Multiplication for Small Sizes Using Half-Precision Björck Å (1994) Numerics of Gram-Schmidt orthogonalization.
Arithmetic on GPUs. In: 2019 IEEE International Parallel Lin. Alg. Appl. 197: 297–316.
and Distributed Processing Symposium, IPDPS 2019, Rio de Björck Å and Paige CC (1992) Loss and recapture of orthogonality
Janeiro, Brazil, May 20-24, 2019. IEEE, pp. 111–122. DOI:10. in the modified Gram-Schmidt algorithm. SIAM J. Matrix Anal.
1109/IPDPS.2019.00022. URL https://doi.org/10. Appl. 13(1): 176–190.
1109/IPDPS.2019.00022. Blanchard P, Higham NJ, Lopez F, Mary T and Pranesh S (2020)
Agullo E, Demmel J, Dongarra J, Hadri B, Kurzak J, Langou Mixed precision block fused multiply-add: Error analysis and
J, Ltaief H, Luszczek P and Tomov S (2009) Numerical application to GPU tensor cores. SIAM J. Sci. Comput. 42(3):
linear algebra on emerging architectures: The PLASMA and C124–C141. DOI:10.1137/19M1289546.
MAGMA projects. Journal of Physics: Conference Series 180: Carson E and Higham NJ (2017) A new analysis of iterative
012037. DOI:10.1088/1742-6596/180/1/012037. refinement and its application to accurate solution of ill-
Alvermann A, Basermann A, Bungartz HJ, Carbogno C, Ernst conditioned sparse linear systems. SIAM Journal on Scientific
D, Fehske H, Futamura Y, Galgon M, Hager G, Huber S, Computing 39(6): A2834–A2856.
Huckle T, Ida A, Imakura A, Kawai M, Köcher S, Kreutzer Carson E and Higham NJ (2018) Accelerating the solution of linear
M, Kus P, Lang B, Lederer H, Manin V, Marek A, Nakajima systems by iterative refinement in three precisions. SIAM
K, Nemec L, Reuter K, Rippl M, Röhrig-Zöllner M, Sakurai T, Journal on Scientific Computing 40(2): A817–A847.
Scheffler M, Scheurer C, Shahzad F, Simoes Brambila D, Thies Carson E, Higham NJ and Pranesh S (2020) Three-precision
J and Wellein G (2019) Benefits from using mixed precision GMRES-based iterative refinement for least squares prob-
computations in the ELPA-AEO and ESSEX-II eigensolver lems. MIMS EPrint 2020.5, Manchester Institute for
projects. Japan J. Indust. Appl. Math. 36(2): 699–717. DOI: Mathematical Sciences, The University of Manchester, UK.
10.1007/s13160-019-00360-8. URL http://eprints.maths.manchester.ac.uk/
Anderson E, Bai Z, Bischof CH, Blackford S, Demmel JW,
Dongarra JJ, Du Croz JJ, Greenbaum A, Hammarling SJ,

Prepared using sagej.cls

Abdalafattah et al. 21

2770/. Revised June 2020. To appear in SIAM J. Sci. Greenbaum A (1997) Estimating the attainable accuracy of
Comput. recursively computed residual methods. SIAM J. Matrix Anal.
Carson EC (2015) Communication-avoiding Krylov subspace Appl. 18(3): 535–551.
methods in theory and practice. PhD Thesis, University of Grützmacher T, Cojean T, Flegar G, Göbel F and Anzt H (2019) A
California, Berkeley. customized precision format based on mantissa segmentation
Clark K (2019) Effective use of mixed precision for hpc. Smoky for accelerating sparse linear algebra. Concurrency and
Mountain Conference 2019. Computation: Practice and Experience : e5418.
Clark MA, Babich R, Barros K, Brower RC and Rebbi C Gupta S, Agrawal A, Gopalakrishnan K and Narayanan P
(2010) Solving lattice QCD systems of equations using (2015) Deep Learning with Limited Numerical Precision.
mixed precision solvers on GPUs. Computer Physics In: Proceedings of the 32Nd International Conference on
Communications 181(9): 1517–1528. International Conference on Machine Learning - Volume 37,
Davies PI, Higham NJ and Tisseur F (2001) Analysis of ICML’15. JMLR.org, pp. 1737–1746. URL http://dl.
the Cholesky method with iterative refinement for solving acm.org/citation.cfm?id=3045118.3045303.
the symmetric definite generalized eigenproblem. SIAM Gustafson J (2015) The End of Error: Unum Computing. Chapman
J. Matrix Anal. Appl. 23(2): 472–493. DOI:10.1137/ & Hall/CRC Computational Science. Taylor & Francis. ISBN
S0895479800373498. 9781482239867. URL https://books.google.de/
Demmel JW, Hida Y, Kahan W, Li XS, Mukherjee S and Riedy books?id=W2ThoAEACAAJ.
EJ (2006) Error bounds from extra-precise iterative refinement. Haidar A, Abdelfattah A, Zounon M, Wu P, Pranesh S, Tomov S
ACM Trans. Math. Software 32(2): 325–351. DOI:10.1145/ and Dongarra J (2018a) the design of fast and energy-efficient
1141885.1141894. linear solvers: On the potential of half-precision arithmetic
Dongarra JJ (1982) Algorithm 589 SICEDR: A FORTRAN and iterative refinement techniques. In: Shi Y, Fu H, Tian
subroutine for improving the accuracy of computed matrix Y, Krzhizhanovskaya VV, Lees MH, Dongarra J and Sloot
eigenvalues. ACM Trans. Math. Software 8(4): 371–375. DOI: PMA (eds.) Computational Science—ICCS 2018. Springer
10.1145/356012.356016. International Publishing, Cham, pp. 586–600. DOI:10.1007/
Dongarra JJ, Moler CB and Wilkinson JH (1983) Improving the 978-3-319-93698-7 45.
accuracy of computed eigenvalues and eigenvectors. SIAM J. Haidar A, Bayraktar H, Tomov S, Dongarra J and Higham
Numer. Anal. 20(1): 23–45. DOI:10.1137/0720002. NJ (2020) Mixed-precision solution of linear systems
Elble JM and Sahinidis NV (2012) Scaling linear optimization using accelerator-based computing. Technical Report ICL-
problems prior to application of the simplex method. Comput. UT-20-05, Innovative Computing Laboratory, University
Optim. Appl. 52(2): 345–371. DOI:10.1007/s10589-011-9420- of Tennessee, Knoxville, TN, USA. URL https:
4. //www.icl.utk.edu/publications/mixed-
Emans M and van der Meer A (2012) Mixed-precision amg as precision-solution-linear-systems-using-
linear equation solver for definite systems. In: Proceedings accelerator-based-computing. To appear in Proc.
of International Conference on Computational Science, ICCS Roy. Soc. London Ser. A.
2010, volume 1. pp. 175–183. Haidar A, Tomov S, Dongarra J and Higham NJ (2018b)
Flegar G, Anzt H, Cojean T and Quintana-Ortı́ ES (2021) Adaptive Harnessing GPU Tensor Cores for Fast FP16 Arithmetic
precision block-jacobi for high performance preconditioning to Speed Up Mixed-precision Iterative Refinement Solvers.
in the ginkgo linear algebra software. ACM Transaction on In: Proceedings of the International Conference for High
Mathematical Software 47(2). DOI:10.1145/3441850. Performance Computing, Networking, Storage, and Analysis,
SC ’18. Piscataway, NJ, USA: IEEE Press, pp. 47:1–47:11.
Fox A and Kolasinski A (2019) Error analysis of inline zfp
DOI:10.1109/SC.2018.00050. URL https://doi.org/
compression for multigrid methods. 2019 Copper Mountain
10.1109/SC.2018.00050.
Conference for Multigrid Methods.
Haidar A, Wu P, Tomov S and Dongarra J (2017) Investigating half
Fukaya T, Kannan R, Nakatsukasa Y, Yamamoto Y and Yanagisawa
precision arithmetic to accelerate dense linear system solvers.
Y (2020) Shifted CholeskyQR for computing the QR
In: Proceedings of the 8th Workshop on Latest Advances in
factorization of ill-conditioned matrices. SIAM J. Sci. Comput.
Scalable Algorithms for Large-Scale Systems. pp. 1–8.
42(1): A477–A503. DOI:10.1137/18M1218212.
Higham NJ (1989) The accuracy of solutions to triangular systems.
Giraud L, Gratton S and Langou J (2004) A rank-k update
SIAM J. Numer. Anal. 26(5): 1252–1265.
procedure for reorthogonalizing the orthogonal factor from
modified Gram–Schmidt. SIAM J. Matrix Anal. Appl. 25(4): Higham NJ (1997) Iterative refinement for linear systems and
1163–1177. LAPACK. IMA J. Numer. Anal. 17(4): 495–509. DOI:10.1093/
imanum/17.4.495.
Glimberg SL, Engsig-Karup AP and Madsen MG (2011)
A fast gpu-accelerated mixed-precision strategy for fully Higham NJ (2002) Accuracy and Stability of Numerical Algorithms.
nonlinearwater wave computations. In: Proceedings of Second edition. Philadelphia, PA, USA: Society for Industrial
ENUMATH 2011. and Applied Mathematics. ISBN 0-89871-521-0. DOI:10.
1137/1.9780898718027.
Gratton S, Simon E, Titley-Peloquin D and Toint P (2020)
Exploiting variable precision in GMRES. SIAM J. Sci. Comput. Higham NJ (2019) Error analysis for standard and GMRES-based
(to appear) . iterative refinement in two and three-precisions. MIMS EPrint
2019.19, Manchester Institute for Mathematical Sciences,
Greenbaum A (1989) Behavior of slightly perturbed Lanczos and
The University of Manchester. URL http://eprints.
conjugate-gradient recurrences. Lin. Alg. Appl. 113: 7–63.
maths.manchester.ac.uk/2735/.

Prepared using sagej.cls

22 ()

Higham NJ and Mary T (2019) A new approach to probabilistic Meurant G and Strakoš Z (2006) The Lanczos and conjugate
rounding error analysis. SIAM Journal on Scientific Computing gradient algorithms in finite precision arithmetic. Acta
41(5): A2815–A2835. DOI:10.1137/18M1226312. Numerica 15: 471–542. DOI:10.1017/s096249290626001x.
Higham NJ and Mary T (2020) Sharper probabilistic backward Moler CB (1967) Iterative refinement in floating point. Journal of
error analysis for basic linear algebra kernels with random the ACM (JACM) 14(2): 316–321.
data. MIMS EPrint 2020.4, Manchester Institute for Ogita T and Aishima K (2018) Iterative refinement for symmetric
Mathematical Sciences, The University of Manchester, UK. eigenvalue decomposition. Japan Journal of Industrial and
URL http://eprints.maths.manchester.ac.uk/ Applied Mathematics 35(3): 1007–1035.
2776/. Revised August 2020. To appear in SIAM J. Sci. Ogita T and Aishima K (2019) Iterative refinement for symmetric
Comput. eigenvalue decomposition II: clustered eigenvalues. Japan J.
Higham NJ and Pranesh S (2019a) Exploiting lower precision Indust. Appl. Math. 36: 435–459. https://doi.org/10.
arithmetic in solving symmetric positive definite linear 1007/s13160-019-00348-4.
systems and least squares problems. MIMS EPrint Paige CC (1980) Accuracy and effectiveness of the Lanczos
2019.20, Manchester Institute for Mathematical Sciences, The algorithm for the symmetric eigenproblem. Lin. Alg. Appl. 34:
University of Manchester, UK. URL http://eprints. 235–258.
maths.manchester.ac.uk/2771/. Revised July 2020. Paige CC (2018) The effects of loss of orthogonality on large
Higham NJ and Pranesh S (2019b) Simulating low precision scale numerical computations. In: International Conference
floating-point arithmetic. SIAM J. Sci. Comput. 41(5): C585– on Computational Science and Its Applications. Springer, pp.
C602. DOI:10.1137/19M1251308. 429–439.
Higham NJ, Pranesh S and Zounon M (2019) Squeezing a matrix Paige CC, Rozložnı́k M and Strakoš Z (2006) Modified gram-
into half precision, with an application to solving linear schmidt MGS, least squares, and backward stability of MGS-
systems. SIAM J. Sci. Comput. 41(4): A2536–A2551. DOI: GMRES. SIAM J. Matrix Anal. Appl. 28(1): 264–284.
10.1137/18M1229511. Paige CC and Strakoš Z (2002) Residual and backward error bounds
Hogg JD and Scott JA (2010) A fast and robust mixed-precision in minimum residual Krylov subspace methods. SIAM J. Sci.
solver for the solution of sparse symmetric linear systems. Comput. 23(6): 1898–1923.
ACM Trans. Math. Software 37(2): 17:1–17:24. DOI:10.1145/ Paige CC and Wülling W (2014) Properties of a unitary matrix
1731022.1731027. obtained from a sequence of normalized vectors. SIAM J.
IEEE (2019) IEEE Standard for Floating-Point Arithmetic, IEEE Matrix. Anal. Appl. 35(2): 526–545.
Std 754-2019 (Revision of IEEE 754-2008). New York, USA: Petschow M, Quintana-Ortı́ E and Bientinesi P (2014) Improved
The Institute of Electrical and Electronics Engineers. ISBN accuracy and parallelism for MRRR-based eigensolvers—A
978-1-5044-5924-2. DOI:10.1109/IEEESTD.2019.8766229. mixed precision approach. SIAM J. Sci. Comput. 36(2): C240–
Joffrain T, Low TM, Quintana-Ortı́ ES, van de Geijn R and Van Zee C263. DOI:10.1137/130911561.
FG (2006) Accumulating Householder transformations, revis- Puglisi C (1992) Modification of the Householder method based on
ited. ACM Transactions on Mathematical Software (TOMS) the compact WY representation. SIAM J. Sci. Stat. Comput.
32(2): 169–179. 13(3): 723–726.
Knight PA, Ruiz D and Uçar B (2014) A symmetry preserving Richter C, Schops S and Clemens M (2014) Gpu-accelerated mixed
algorithm for matrix scaling. SIAM J. Matrix Anal. Appl. 35(3): precision algebraic multigrid preconditioners for discrete
931–955. DOI:10.1137/110825753. elliptic field problems. IEEE Transactions on Magnetics 50(2).
Langou J, Langou J, Luszczek P, Kurzak J, Buttari A and Dongarra Saad Y and Schultz MH (1986) Gmres: A generalized minimal
J (2006) Exploiting the performance of 32 bit floating point residual algorithm for solving nonsymmetric linear systems.
arithmetic in obtaining 64 bit accuracy (revisiting iterative SIAM Journal on scientific and statistical computing 7(3): 856–
refinement for linear systems). In: Proceedings of the 2006 869.
ACM/IEEE Conference on Supercomputing. DOI:10.1109/SC. Sayood K (2012) Introduction to Data Compression, Fourth
2006.30. Edition. 4th edition. San Francisco, CA, USA: Morgan
Leon SJ, Björck Å and Gander W (2013) Gram-Schmidt Kaufmann Publishers Inc. ISBN 0124157963.
orthogonalization: 100 years and more. Numer. Lin. Alg. Appl. Simoncini V and Szyld DB (2003) Theory of inexact Krylov
20(3): 492–532. subspace methods and applications to scientific computing.
Lindstrom P (2018) Zfp version 0.5.3. URL https://zfp. SIAM J. Sci. Comput. 25(2): 454–477.
readthedocs.io/en/release0.5.3/index.htm. Skeel RD (1980) Iterative refinement implies numerical stability for
Ljungkvist K and Kronbichler M (2017) Multigrid for matrix-free Gaussian elimination. Math. Comp. 35(151): 817–832. DOI:
finite element computations on graphics processors. Technical 10.1090/S0025-5718-1980-0572859-4.
report / Department of Information Technology, Uppsala Sleijpen GL and van der Vorst HA (1996) Reliable updated
University . residuals in hybrid Bi-CG methods. Computing 56(2): 141–
Ljungkvist K and Kronbichler M (2019) Multigrid for matrix-free 163.
high-order finite element computations on graphics processors. Smoktunowicz A, Barlow JL and Langou J (2006) A note on
ACM Transactions on Parallel Processing . the error analysis of classical Gram–Schmidt. Numerische
Malard J and Paige C (1994) Efficiency and scalability of two Mathematik 105(2): 299–313.
parallel QR factorization algorithms. In: Proceedings of IEEE Stewart GW (1973) Introduction to Matrix Computations. New
Scalable High Performance Computing Conference. IEEE, pp. York: Academic Press. ISBN 0-12-670350-7.
615–622.

Prepared using sagej.cls

Abdalafattah et al. 23

Sumiyoshi Y, Fujii A, Nukada A and Tanaka T (2014) Mixed- 125–153.

precision amg method for many core accelerators. In: Van Der Vorst HA and Ye Q (2000) Residual replacement strategies
EUROMPI/ASIA 14: Proceedings of the 21st European MPI for Krylov subspace iterative methods for the convergence of
Users’ Group Meeting. pp. 127–132. true residuals. SIAM J. Sci. Comput. 22(3): 835–852.
Sun X (1996) Aggregations of elementary transformations. Techni- Walker HF (1988) Implementation of the GMRES method using
cal Report DUKE-TR-1996-03, Duke University, Durham, NC. Householder transformations. SIAM J. Sci. Stat. Comput. 9(1):
Świrydowicz K, Langou J, Ananthan S, Yang U and Thomas SJ 152–163.
(2020) Low synchronization Gram-Schmidt and generalized Wilkinson JH (1963) Rounding Errors in Algebraic Processes.
minimal residual algorithms. Numer. Lin. Alg. Appl. . London: Notes on Applied Science No. 32, Her Majesty’s
Tamstorf R, Benzaken J and McCormick S (2020a) Algebraic error Stationery Office. ISBN 0-486-67999-3. Also published
analysis for mixed precision multigrid solvers. SIAM Journal by Prentice-Hall, Englewood Cliffs, NJ, USA. Reprinted by
on Scientific Computing Submitted. Dover, New York, 1994.
Tamstorf R, Benzaken J and McCormick S (2020b) Discretization- Yamagishi T and Matsumura Y (2016) Gpu acceleration of a non-
error-accurate mixed precision multigrid solvers. SIAM hydrostatic ocean model with a multigrid poisson/helmholtz
Journal on Scientific Computing Submitted. solver. Procedia Computer Science 80: 1658–1669.
The Trilinos Project Team (2020) The Trilinos Project Website Yamazaki I, Tomov S and Dongarra J (2015) Mixed-precision
URL https://trilinos.github.io. Cholesky QR factorization and its case studies on multicore
Tisseur F (2001) Newton’s method in floating point arithmetic CPU with multiple GPUs. SIAM J. Sci. Comput. 37(1): C307–
and iterative refinement of generalized eigenvalue problems. C330. DOI:10.1137/14M0973773.
SIAM J. Matrix Anal. Appl. 22(4): 1038–1057. DOI:10.1137/ Yamazaki I, Tomov S and Dongarra J (2016) Stability and
S0895479899359837. performance of various singular value QR implementations on
van den Eshof J and Sleijpen GL (2004) Inexact Krylov subspace multicore CPU with a GPU. ACM Trans. Math. Software 43(2):
methods for linear systems. SIAM J. Matrix Anal. Appl. 26(1): 10:1–10:18. DOI:10.1145/2898347.

Prepared using sagej.cls

ECSE343MidtermW2023 Final
No ratings yet
ECSE343MidtermW2023 Final
8 pages
Numerical Methods and Modelling
100% (9)
Numerical Methods and Modelling
343 pages
LIFT DATA SHEET (Single Mobile Crane Lift)
No ratings yet
LIFT DATA SHEET (Single Mobile Crane Lift)
1 page
Iso 3511 Instrument - Symbols - Part - 4 PDF
0% (1)
Iso 3511 Instrument - Symbols - Part - 4 PDF
10 pages
Mixed Precision Algorithms in Numerical Linear Algebra
No ratings yet
Mixed Precision Algorithms in Numerical Linear Algebra
68 pages
International Journal of Engineering Research and Development
No ratings yet
International Journal of Engineering Research and Development
7 pages
Projective Rational Arithmetic With Floating Point: Vaclav Skala
No ratings yet
Projective Rational Arithmetic With Floating Point: Vaclav Skala
5 pages
A High Performance and Full Utilization Hardware Implementation of Floating Point Arithmetic Units
No ratings yet
A High Performance and Full Utilization Hardware Implementation of Floating Point Arithmetic Units
4 pages
31 Design JJ New
No ratings yet
31 Design JJ New
8 pages
A Mixed Precision Semi-Lagrangian Algorithm and Its Performance On Accelerators
No ratings yet
A Mixed Precision Semi-Lagrangian Algorithm and Its Performance On Accelerators
7 pages
Floating Point Adder
No ratings yet
Floating Point Adder
14 pages
Motuner A Compiler-Based Auto-Tuning Approach For Mixed-Precision Operators
No ratings yet
Motuner A Compiler-Based Auto-Tuning Approach For Mixed-Precision Operators
9 pages
Hardware Design and Arithmetic Algorithms For A Variable-Precision, Interval Arithmetic Coprocessor
No ratings yet
Hardware Design and Arithmetic Algorithms For A Variable-Precision, Interval Arithmetic Coprocessor
8 pages
An Fpga Based 64-Bit Ieee - 754 Double Precision Floating Point Adder/Subtractor and Multiplier Using VHDL
No ratings yet
An Fpga Based 64-Bit Ieee - 754 Double Precision Floating Point Adder/Subtractor and Multiplier Using VHDL
11 pages
Design and Implementation of An Optimized Double Precision Floating Point Divider On FPGA
No ratings yet
Design and Implementation of An Optimized Double Precision Floating Point Divider On FPGA
8 pages
Ijspr 5901 30318
No ratings yet
Ijspr 5901 30318
5 pages
NVIDIA CUDA Floating Point
No ratings yet
NVIDIA CUDA Floating Point
7 pages
Optimizing Modular Multiplication For Nvidia'S Maxwell Gpus
No ratings yet
Optimizing Modular Multiplication For Nvidia'S Maxwell Gpus
8 pages
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
No ratings yet
S S 32-B M C D: Imulation and Ynthesis of IT Ultiplier Using Onfigurable Evices
8 pages
Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
No ratings yet
Implementation of Double Precision Floating Point Radix-2 FFT Using VHDL
7 pages
Floating-Point Hardware Design A Test Perspective
No ratings yet
Floating-Point Hardware Design A Test Perspective
5 pages
ITCE 380 Lab Report 1
No ratings yet
ITCE 380 Lab Report 1
9 pages
Algorithms 14 00198
No ratings yet
Algorithms 14 00198
21 pages
Floating Point Subtraction and Division
No ratings yet
Floating Point Subtraction and Division
12 pages
Fix Point Implementation of Elementry Functions
No ratings yet
Fix Point Implementation of Elementry Functions
134 pages
Reconfigurablecomputing: Euclidean Distance Based Sorting
No ratings yet
Reconfigurablecomputing: Euclidean Distance Based Sorting
27 pages
Floating Point Arithmetic Unit With Multi-Precision For DSP Applications
No ratings yet
Floating Point Arithmetic Unit With Multi-Precision For DSP Applications
8 pages
Lecture Notes On Numerical Methods For Engineering (?) : Pedro Fortuny Ayuso
No ratings yet
Lecture Notes On Numerical Methods For Engineering (?) : Pedro Fortuny Ayuso
104 pages
Lecture Notes On Numerical Methods For Engineering (?) : Pedro Fortuny Ayuso
No ratings yet
Lecture Notes On Numerical Methods For Engineering (?) : Pedro Fortuny Ayuso
104 pages
An Application-Oriented Analysis of Powerprecision
No ratings yet
An Application-Oriented Analysis of Powerprecision
7 pages
Project Report Vlsi
No ratings yet
Project Report Vlsi
33 pages
Matrix Computation On The GPU
No ratings yet
Matrix Computation On The GPU
455 pages
Minres Algorithm FPGA
No ratings yet
Minres Algorithm FPGA
6 pages
Hybrid FP FXP Dot Product
No ratings yet
Hybrid FP FXP Dot Product
12 pages
Icl Utk 1031 2017
No ratings yet
Icl Utk 1031 2017
45 pages
Floating Point Elsevier
No ratings yet
Floating Point Elsevier
12 pages
ZMLIB - Fortran Multiple Precision Library
No ratings yet
ZMLIB - Fortran Multiple Precision Library
9 pages
Ijspr 1203 438
No ratings yet
Ijspr 1203 438
4 pages
2174 PDF
No ratings yet
2174 PDF
7 pages
"Digit-Recurrence Algorithms For Division and Square Root With Limited Precision Primitives" Literature Survey
No ratings yet
"Digit-Recurrence Algorithms For Division and Square Root With Limited Precision Primitives" Literature Survey
7 pages
Lecture 10 (Temp)
No ratings yet
Lecture 10 (Temp)
50 pages
IT3030E CA Chap3 Arithmetics
No ratings yet
IT3030E CA Chap3 Arithmetics
39 pages
Richard Khoury, Douglas Wilhelm Harder (Auth.) - Numerical Methods and Modelling For Engineering-Springer International Publishing (2016)
No ratings yet
Richard Khoury, Douglas Wilhelm Harder (Auth.) - Numerical Methods and Modelling For Engineering-Springer International Publishing (2016)
343 pages
Energy Efficient High Speed Floating Point Arithmetic Unit: Somya Kumawat, Arpan Shah, Ramesh Bharti
No ratings yet
Energy Efficient High Speed Floating Point Arithmetic Unit: Somya Kumawat, Arpan Shah, Ramesh Bharti
3 pages
Horonanilidad
No ratings yet
Horonanilidad
63 pages
10 1 1 961 4530 PDF
No ratings yet
10 1 1 961 4530 PDF
5 pages
Implementation of A High Speed Single Precision Floating Point Unit Using Verilog
No ratings yet
Implementation of A High Speed Single Precision Floating Point Unit Using Verilog
5 pages
Design of Low-Area and High Speed Pipelined
No ratings yet
Design of Low-Area and High Speed Pipelined
6 pages
Design of Double Ieee Precision
No ratings yet
Design of Double Ieee Precision
9 pages
Comparative Study of Single Precision Floating Point Division Using Different Computational Algorithms
No ratings yet
Comparative Study of Single Precision Floating Point Division Using Different Computational Algorithms
9 pages
Hardware Algorithm For Variable Precision Multiplication On FPGA
No ratings yet
Hardware Algorithm For Variable Precision Multiplication On FPGA
4 pages
Fast Floating Point Square Root: Thomas F. Hain, David B. Mercer
No ratings yet
Fast Floating Point Square Root: Thomas F. Hain, David B. Mercer
7 pages
WBMT2049-T2/WI2032TH - Numerical Analysis For ODE's
No ratings yet
WBMT2049-T2/WI2032TH - Numerical Analysis For ODE's
30 pages
32 Bit Floating Point ALU
0% (1)
32 Bit Floating Point ALU
7 pages
32 Bit Floating Point ALU
80% (5)
32 Bit Floating Point ALU
7 pages
CC2032 COAL Lab # 07
No ratings yet
CC2032 COAL Lab # 07
9 pages
International Journal of Engineering Research and Development
No ratings yet
International Journal of Engineering Research and Development
6 pages
Lecture 3
No ratings yet
Lecture 3
116 pages
05 15slides
No ratings yet
05 15slides
117 pages
ICSE Class 8 Maths Selina Solutions Chapter 18 Constructions
No ratings yet
ICSE Class 8 Maths Selina Solutions Chapter 18 Constructions
17 pages
2 Operations On Polynomials
No ratings yet
2 Operations On Polynomials
5 pages
Land Use & Zoning: Line & Grade
No ratings yet
Land Use & Zoning: Line & Grade
19 pages
Criminalistics Review materialsLATEST
No ratings yet
Criminalistics Review materialsLATEST
60 pages
PL 300
100% (2)
PL 300
15 pages
Type Ia Supernova
No ratings yet
Type Ia Supernova
22 pages
LinkWay S2 Datasheet 012 Web
No ratings yet
LinkWay S2 Datasheet 012 Web
2 pages
Micrologix 1200 and 1500 Programmable Controllers Firmware Upgrade
No ratings yet
Micrologix 1200 and 1500 Programmable Controllers Firmware Upgrade
12 pages
Data Reconciliation
No ratings yet
Data Reconciliation
15 pages
A Review On Cellular Manufacturing Syste
No ratings yet
A Review On Cellular Manufacturing Syste
5 pages
Polybatic Edge 540 Part2
No ratings yet
Polybatic Edge 540 Part2
8 pages
Module 5 in Mathematics in The Modern World: Community College of Manito Manito, Albay A.Y. 2021 - 2022
No ratings yet
Module 5 in Mathematics in The Modern World: Community College of Manito Manito, Albay A.Y. 2021 - 2022
4 pages
Lecturas No 1 Role and Scope of Industrial Engineers (1) - Recognized
No ratings yet
Lecturas No 1 Role and Scope of Industrial Engineers (1) - Recognized
14 pages
National Masonry VIC Brochure MDG Book 1 Structural Fire Acoustic v24
No ratings yet
National Masonry VIC Brochure MDG Book 1 Structural Fire Acoustic v24
38 pages
Schmersal Safety Sensor
No ratings yet
Schmersal Safety Sensor
6 pages
Analog Electronics Instrumentation - Current Loops
No ratings yet
Analog Electronics Instrumentation - Current Loops
23 pages
2010 01 12 3DBeam CDT6
No ratings yet
2010 01 12 3DBeam CDT6
65 pages
CSA 4 - AC Power Analysis
No ratings yet
CSA 4 - AC Power Analysis
28 pages
The Elite Public School
No ratings yet
The Elite Public School
1 page
AEF3e Level 1 TG PCM Grammar 2A
No ratings yet
AEF3e Level 1 TG PCM Grammar 2A
1 page
Office Automation
No ratings yet
Office Automation
14 pages
Chapter 3 Stacks
No ratings yet
Chapter 3 Stacks
28 pages
XpressBees ReverseReattemptDate CustomerAlternateAddress MobileUpdationAPI
No ratings yet
XpressBees ReverseReattemptDate CustomerAlternateAddress MobileUpdationAPI
5 pages
Experiment No. 1 Linear System Simulator
100% (1)
Experiment No. 1 Linear System Simulator
2 pages
Wind Turbine Blade Design On SolidWorks
No ratings yet
Wind Turbine Blade Design On SolidWorks
6 pages
C MCQ's
No ratings yet
C MCQ's
6 pages
Software User Manual PDF
No ratings yet
Software User Manual PDF
24 pages
Dragonpay API
No ratings yet
Dragonpay API
31 pages

A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic

Uploaded by

A Survey of Numerical Linear Algebra Methods Utilizing Mixed Precision Arithmetic

Uploaded by

LLNL-JRNL-826451

A Survey of Numerical Linear Algebra

A. Abdelfattah, H. Anzt, E. Boman, E. Carson, T. Cojean, J.

International Journal of High Performance Computing

1 Introduction time, the cost of performing arithmetic operations heavily

Prepared using sagej.cls [Version: 2017/01/17 v1.20]

using the term high precision and IEEE754 single precision

Prepared using sagej.cls

Arithmetic Size Range Unit roundoff

Prepared using sagej.cls

instruction allows one warp to perform four independent

possesses a mixed-precision 4 × 4 × 4 matrix processing 70

Tensor Core acceleration if configured accordingly. As

Prepared using sagej.cls

Forward Error of Square HGEMM

10-4 mean (Higham and Mary 2020). These analyses provide

refining approximate solutions computed at low precision to

Prepared using sagej.cls

Solver F.E. B.E. Prec. Algorithm 3.1 Mixed-precision iterative refinement.

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

using the Cholesky factorization of AT A (Section 3.5). In

Prepared using sagej.cls

Prepared using sagej.cls

subspace and its approximation provided on input. Hence,

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

Figure 4.1. Loss of orthogonality for mixed single-double MGS

a widening gap between the compute power (in terms of

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

quantization, algebraic and descretization errors. They show

100 5 Summary and Outlook on the Potential of

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

Prepared using sagej.cls

Sumiyoshi Y, Fujii A, Nukada A and Tanaka T (2014) Mixed- 125–153.

Prepared using sagej.cls

You might also like