Matrix Square Root Computation Algorithm
Introduction
$$U^2 = T, \qquad (1)$$
where $U$ and $T$ are upper triangular. The diagonal blocks satisfy
$$U_{ii}^2 = T_{ii}, \qquad (2)$$
and the off-diagonal blocks satisfy the Sylvester equations
$$U_{ii} U_{ij} + U_{ij} U_{jj} = T_{ij} - \sum_{k=i+1}^{j-1} U_{ik} U_{kj}. \qquad (3)$$
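To make the recurrence concrete, a minimal NumPy/SciPy sketch of the point version (the special case of (2)-(3) with 1 x 1 blocks) and of the block version is given below; the function names, the block partitioning details, and the use of SciPy's solve_sylvester in place of xTRSYL are illustrative assumptions rather than the Fortran implementation described in this paper.

    # Illustrative sketch: point recurrence (1x1 blocks) and blocked recurrence
    # (2)-(3); SciPy's solve_sylvester stands in for the xTRSYL Sylvester solver.
    import numpy as np
    from scipy.linalg import solve_sylvester

    def sqrtm_triangular_point(T):
        """Square root of an upper triangular T by the scalar recurrence."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        U = np.zeros_like(T)
        for i in range(n):
            U[i, i] = np.sqrt(T[i, i])
        for d in range(1, n):                      # sweep along superdiagonals
            for i in range(n - d):
                j = i + d
                s = T[i, j] - U[i, i+1:j] @ U[i+1:j, j]
                U[i, j] = s / (U[i, i] + U[j, j])
        return U

    def sqrtm_triangular_block(T, b=64):
        """Blocked variant: (2) on the diagonal blocks, (3) via Sylvester solves."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        cuts = list(range(0, n, b)) + [n]          # block boundaries
        m = len(cuts) - 1
        U = np.zeros_like(T)
        for i in range(m):                         # diagonal blocks, eq. (2)
            ri = slice(cuts[i], cuts[i+1])
            U[ri, ri] = sqrtm_triangular_point(T[ri, ri])
        for d in range(1, m):                      # off-diagonal blocks, eq. (3)
            for i in range(m - d):
                j = i + d
                ri, rj = slice(cuts[i], cuts[i+1]), slice(cuts[j], cuts[j+1])
                mid = slice(cuts[i+1], cuts[j])
                rhs = T[ri, rj] - U[ri, mid] @ U[mid, rj]
                U[ri, rj] = solve_sylvester(U[ri, ri], U[rj, rj], rhs)
        return U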
uniform distribution on [0, 1). Figure 1 shows the run times for the methods, for
values of n up to 8000. A block size of 64 was chosen, although the speed did
not appear to be particularly sensitive to the block size; similar results were obtained with blocks of size 16, 32, and 128. The block method was found to be up to 6 times faster than the point method. The residuals $\|\widehat U^2 - T\|/\|T\|$, where $\widehat U$ is the computed value of $U$, were similar for both methods. Table
1 shows that, for n = 4000, approximately 85% of the run time is spent in
ZGEMM calls.
Fig. 1. Run times for the point, block, and recursion methods for computing the square root of a complex n x n triangular matrix for n in [0, 8000]
A larger block size enables larger GEMM calls to be made. However, it leads
to larger calls to the point algorithm and to xTRSYL (which only uses level 2
BLAS). A recursive approach may allow increased use of level 3 BLAS.
Equation (1) can be rewritten as
$$\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}^2 = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix}, \qquad (4)$$
where the submatrices are of size $n/2$ or $(n \pm 1)/2$ depending on the parity of $n$. Then $U_{11}^2 = T_{11}$ and $U_{22}^2 = T_{22}$ can be solved recursively, until some base level is reached, at which point the point algorithm is used. The Sylvester equation $U_{11} U_{12} + U_{12} U_{22} = T_{12}$ can then be solved using a recursive algorithm
devised by Jonsson and Kågström [14]. Writing the Sylvester equation as $AX + XB = C$, with $A$ and $B$ upper triangular, and partitioning each matrix into a $2 \times 2$ block form gives
$$A_{11} X_{11} + X_{11} B_{11} = C_{11} - A_{12} X_{21}, \qquad (5)$$
$$A_{11} X_{12} + X_{12} B_{22} = C_{12} - A_{12} X_{22} - X_{11} B_{12}, \qquad (6)$$
$$A_{22} X_{21} + X_{21} B_{11} = C_{21}, \qquad (7)$$
$$A_{22} X_{22} + X_{22} B_{22} = C_{22} - X_{21} B_{12}. \qquad (8)$$
Equation (7) is solved recursively, followed by (5) and (8), and finally (6). At
the base level a routine such as xTRSYL is used.
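A sketch of this recursion in Python is given below; it is illustrative only: the names are invented, SciPy's solve_sylvester stands in for xTRSYL at the base level of the Sylvester recursion, and SciPy's general-purpose sqrtm stands in for the point algorithm at the base level of the square root recursion.

    # Illustrative sketch of the recursive blocked method (4)-(8); not the
    # Fortran implementation timed in this paper.
    import numpy as np
    from scipy.linalg import solve_sylvester, sqrtm

    def sylvester_recursive(A, B, C, base=64):
        """Solve AX + XB = C (A, B upper triangular) by the splitting (5)-(8)."""
        if A.shape[0] <= base or B.shape[0] <= base:
            return solve_sylvester(A, B, C)    # base level: xTRSYL analogue
        p, q = A.shape[0] // 2, B.shape[0] // 2
        A11, A12, A22 = A[:p, :p], A[:p, p:], A[p:, p:]
        B11, B12, B22 = B[:q, :q], B[:q, q:], B[q:, q:]
        C11, C12, C21, C22 = C[:p, :q], C[:p, q:], C[p:, :q], C[p:, q:]
        X21 = sylvester_recursive(A22, B11, C21, base)                          # (7)
        X11 = sylvester_recursive(A11, B11, C11 - A12 @ X21, base)              # (5)
        X22 = sylvester_recursive(A22, B22, C22 - X21 @ B12, base)              # (8)
        X12 = sylvester_recursive(A11, B22, C12 - A12 @ X22 - X11 @ B12, base)  # (6)
        return np.block([[X11, X12], [X21, X22]])

    def sqrtm_triangular_recursive(T, base=64):
        """Recursive blocked square root of an upper triangular T, following (4)."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        if n <= base:
            return sqrtm(T)                    # base level: stand-in for the point algorithm
        p = n // 2                             # split of size n/2 (or (n +/- 1)/2)
        U11 = sqrtm_triangular_recursive(T[:p, :p], base)
        U22 = sqrtm_triangular_recursive(T[p:, p:], base)
        U12 = sylvester_recursive(U11, U22, T[:p, p:], base)
        U = np.zeros_like(T)
        U[:p, :p], U[:p, p:], U[p:, p:] = U11, U12, U22
        return U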
The run times for a Fortran implementation of the recursion method in complex arithmetic, with a base level of size 64, are shown in Figure 1. The approach
was found to be consistently 10% faster than the block method, and up to 8 times
faster than the point method, with similar residuals in each case. The precise
choice of base level made little difference to the run time.
Table 2 shows that the run time is dominated by GEMM calls and that the
time spent in ZTRSYL and the point algorithm is similar to the block method.
The largest GEMM call uses a submatrix of size n/4.
Table 1. Profiling of the block method for computing the square root of a triangular matrix, with n = 4000. Format: time in seconds (number of calls).

  Total time taken:            24.03
  Calls to point algorithm:     0.019 (63)
  Calls to ZTRSYL:              3.47 (1953)
  Calls to ZGEMM:              20.54 (39711)
We use the standard model of floating point arithmetic [11, §2.2] in which the result of a floating point operation, op, on two scalars x and y is written as
$$fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \qquad |\delta| \le u,$$
where u is the unit roundoff.
Table 2. Profiling of the recursive method for computing the square root of a triangular matrix, with n = 4000. Format: time in seconds (number of calls).

  Total time taken:                  22.04
  Calls to point algorithm:           0.002 (64)
  Calls to ZTRSYL:                    3.37 (2016)
  Calls to ZGEMM total:              18.64 (2604)
  Calls to ZGEMM with n = 1000:       7.40 (4)
  Calls to ZGEMM with n = 500:        5.34 (24)
  Calls to ZGEMM with n = 250:        3.16 (112)
  Calls to ZGEMM with n = 125:        1.81 (480)
  Calls to ZGEMM with n <= 63:        0.94 (1984)

Here $\theta_n$ denotes a quantity satisfying
$$|\theta_n| \le \frac{nu}{1 - nu} =: \gamma_n. \qquad (11)$$
In the (non-recursive) block method, to bound $\Delta T_{ij}$ we must account for the error in performing the matrix multiplications on the right-hand side of (3). Standard error analysis for matrix multiplication yields, for blocks of size m,
$$\biggl| fl\Bigl(\sum_{k=i+1}^{j-1} \widehat U_{ik} \widehat U_{kj}\Bigr) - \sum_{k=i+1}^{j-1} \widehat U_{ik} \widehat U_{kj} \biggr| \le \gamma_n \bigl(|\widehat U|^2\bigr)_{ij}.$$
Substituting this into the residual for the Sylvester equation in the off-diagonal blocks, we obtain the componentwise bound (10).
To obtain a bound for the recursive blocked method we must first check whether (11) holds when the Sylvester equation is solved using Jonsson and Kågström's recursive algorithm. This can be done by induction, assuming that (11) holds at the base level. For the inductive step, it suffices to incorporate the error estimates for the matrix multiplications in the right-hand sides of (5)-(8) into the residual bound.
Induction can then be applied to the recursive blocked method for the square
root. The bounds (10) and (11) are assumed to hold at the base level. The
inductive step is similar to the analysis for the block method. Overall, (10) is
obtained.
We conclude that both our blocked algorithms for computing the matrix
square root satisfy backward error bounds of the same forms (9) and (10) as
the point algorithm.
Serial Implementations
When used with full (non-triangular) matrices, more modest speedups are expected because of the significant overhead in computing the Schur decomposition. Figure 2 compares run times of the MATLAB function sqrtm (which does not use any blocking) and Fortran implementations of the point method (fort_point) and the recursive blocked method (fort_recurse), called from within MATLAB using a mex interface, on a 64-bit Intel i3 machine. The matrices have elements whose real and imaginary parts are chosen from the uniform random distribution on the interval [0, 1). The recursive routine is found to be up to 2.5 times faster than sqrtm and 2 times faster than fort_point.
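For completeness, the full-matrix workflow measured here (complex Schur decomposition, triangular square root, back-transformation) and the relative residual can be sketched as follows; sqrtm_via_schur is an invented name and sqrtm_triangular_recursive refers to the sketch above.

    # Illustrative full-matrix workflow: Schur form, triangular phase, back-transform.
    import numpy as np
    from scipy.linalg import schur

    def sqrtm_via_schur(A):
        T, Q = schur(A, output='complex')      # A = Q T Q^H, T upper triangular
        U = sqrtm_triangular_recursive(T)      # triangular phase (sketch above)
        return Q @ U @ Q.conj().T              # X with X^2 approximately A

    rng = np.random.default_rng(0)
    n = 200
    A = rng.random((n, n)) + 1j * rng.random((n, n))   # entries in [0, 1) as in the tests
    X = sqrtm_via_schur(A)
    print(np.linalg.norm(X @ X - A) / np.linalg.norm(A))   # relative residual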
An extension of the Schur method due to Higham [10] enables the square root of a real matrix to be computed without using complex arithmetic. A real Schur decomposition of A is computed. Square roots of the 2 x 2 diagonal blocks of the upper quasi-triangular factor are computed using an explicit formula. The recurrence (3) now proceeds either a block column or a block superdiagonal at a time, where the blocks are of size 1 x 1, 1 x 2, 2 x 1, or 2 x 2 depending on the diagonal block structure. A MATLAB implementation of this algorithm, sqrtm_real, is available in the Matrix Function Toolbox [9]. The algorithm can also be implemented in a recursive manner, the only subtlety being that the splitting point for the recursion must be chosen to avoid splitting any 2 x 2 diagonal blocks. A similar error analysis to that of §3 applies to the real recursive method, though since only a normwise bound is available for the point algorithm applied to the quasi-triangular matrix, the backward error bound (10) holds in the Frobenius norm rather than elementwise.
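In code, the split-point adjustment amounts to a check on the subdiagonal of the quasi-triangular factor; a minimal sketch (with a hypothetical helper name) is:

    import numpy as np

    def safe_split(T, p):
        """Adjust a tentative split index p so that no 2x2 diagonal block of the
        real Schur factor T is cut; a 2x2 block is signalled by T[p, p-1] != 0."""
        if 0 < p < T.shape[0] and T[p, p - 1] != 0.0:
            return p + 1        # move the split past the 2x2 block
        return p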
Figure 3 compares the run times of sqrtm and sqrtm_real with Fortran implementations of the real point method (fort_point_real) and the real recursive method (fort_recurse_real), also called from within MATLAB. The matrix elements are chosen from the uniform random distribution on [0, 1). The recursive routine is found to be up to 6 times faster than sqrtm and sqrtm_real and 2 times faster than fort_point_real.
Fig. 2. Run times for sqrtm, fort_recurse, and fort_point for computing the square root of a full n x n matrix for n in [0, 2000]
Both the real and complex recursive blocked routines spend over 90% of their run time in computing the Schur decomposition, compared with 44% for fort_point, 46% for fort_point_real, 25% for sqrtm, and 16% for sqrtm_real. The latter two percentages reflect the overhead of the MATLAB interpreter in executing the recurrences for the (quasi-) triangular square root phase. The 90% figure is consistent with the flop counts of $28n^3$ flops for computing the Schur decomposition and transforming back from Schur form and $n^3/3$ flops for the square root of the triangular matrix, since $28n^3/(28n^3 + n^3/3) \approx 0.99$.
Fig. 3. Run times for sqrtm, sqrtm_real, fort_recurse_real, and fort_point_real for computing the square root of a full n x n matrix for n in [0, 2000]

Parallel Implementations
The recursive block method can be parallelized using OpenMP tasks. Each
recursive call generates a new task. Synchronization points are required to ensure
that data dependencies are preserved. Hence, in equation (4), $U_{11}$ and $U_{22}$ can be computed in parallel, and only then can $U_{12}$ be found. When solving the Sylvester equation recursively, only (5) and (8) can be solved in parallel.
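The dependency structure can be sketched as follows, using Python futures purely to illustrate where the synchronization points fall; the implementation described here uses OpenMP tasks in Fortran, and in Python any real speedup would come from the underlying BLAS rather than from the threads themselves. The routines sqrtm_triangular_recursive and sylvester_recursive refer to the earlier sketches.

    # Illustration of the task/synchronization structure only.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def sqrtm_triangular_tasks(T, base=64, workers=8):
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        if n <= base:
            return sqrtm_triangular_recursive(T, base)
        p = n // 2
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # U11 and U22 are independent, so they may be formed in parallel ...
            f11 = pool.submit(sqrtm_triangular_tasks, T[:p, :p], base, workers)
            f22 = pool.submit(sqrtm_triangular_tasks, T[p:, p:], base, workers)
            U11, U22 = f11.result(), f22.result()   # ... synchronization point
        # ... and only then can U12 be found from the Sylvester equation.
        U12 = sylvester_recursive(U11, U22, T[:p, p:], base)
        U = np.zeros_like(T)
        U[:p, :p], U[:p, p:], U[p:, p:] = U11, U12, U22
        return U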
When sufficient threads are available (for example when computing the Schur
decomposition) threaded BLAS should be used. When all threads are busy (for
example during the triangular phase of the algorithm), serial BLAS should be
used, to avoid the overhead of creating threads unnecessarily. Unfortunately, it
is not possible to control the number of threads available to individual BLAS
calls in this way. In the implementations described below threaded BLAS are
used throughout, despite this overparallelization overhead.
The parallelized Fortran test codes were compiled on a machine containing
4 Intel Xeon CPUs, with 8 available threads, linking to ACML threaded BLAS
[1]. Figure 4 compares run times for the triangular phase of the algorithm, with
triangular test matrices generated with elements having real and imaginary parts
chosen from the uniform random distribution on the interval [0, 1).
The point algorithm does not use BLAS, but 2-fold speedups on eight cores
are obtained using OpenMP. With standard blocking, threaded BLAS alone
gives a 2-fold speed up, but using OpenMP gives a 5.5 times speedup. With
recursive blocking, a 3-fold speedup is obtained by using threaded BLAS, but using OpenMP then decreases the performance because of the multiple synchronization points at each level of the recursion. Overall, if the only parallelization available is from threaded BLAS, then the recursive algorithm is the fastest. However, if OpenMP is used then shorter run times are obtained using the standard blocking method.

Fig. 4. Run times for parallel implementations of the point, block, and recursion methods for computing the square root of a 4000 x 4000 triangular matrix
Figure 5 compares run times for computing the square root of a full square
matrix. Here, the run times are dominated by the Schur decomposition, so the
most significant gains are obtained by simply using threaded BLAS and the gains
due to the new triangular algorithms are less apparent.
Fig. 5. Run times for parallel implementations of the point, block, and recursion methods for computing the square root of a 4000 x 4000 full matrix
Fig. 6. Run times for the point, block, and recursion methods for multiplying randomly generated n x n triangular matrices for n in [0, 8000]
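Figure 6 refers to applying the same point, block, and recursive strategies to the product of two upper triangular matrices. A minimal sketch of the recursive variant, using the same splitting as in (4) (illustrative Python, not the Fortran used for the timings), is:

    import numpy as np

    def trmm_recursive(A, B, base=64):
        """Product C = A B of two upper triangular matrices by recursive blocking."""
        n = A.shape[0]
        if n <= base:
            return A @ B                           # base level: one small multiply
        p = n // 2
        C = np.zeros((n, n), dtype=np.result_type(A, B))
        C[:p, :p] = trmm_recursive(A[:p, :p], B[:p, :p], base)       # C11 = A11 B11
        C[p:, p:] = trmm_recursive(A[p:, p:], B[p:, p:], base)       # C22 = A22 B22
        C[:p, p:] = A[:p, :p] @ B[:p, p:] + A[:p, p:] @ B[p:, p:]    # C12 via level 3 BLAS
        return C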
Recursive blocking can also be applied to the inversion of an upper triangular matrix $U$, partitioned as in (4). Then $(U^{-1})_{11}$ and $(U^{-1})_{22}$ are computed and $(U^{-1})_{12}$ is obtained by solving $U_{11}(U^{-1})_{12} + U_{12}(U^{-1})_{22} = 0$. Provided that forward substitution is used, the right (or left) recursive inversion method can be shown inductively to satisfy the same right (or left) elementwise residual bound as the point method [7]. A Fortran implementation of this idea was found to perform similarly to the LAPACK code xTRTRI, so no real benefit was derived from recursive blocking.
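A sketch of the recursive inversion just described (illustrative Python; the comparison above is against the Fortran LAPACK routine xTRTRI) is:

    import numpy as np
    from scipy.linalg import solve_triangular

    def trtri_recursive(U, base=64):
        """Inverse of an upper triangular U by recursive blocking."""
        n = U.shape[0]
        if n <= base:
            return solve_triangular(U, np.eye(n, dtype=U.dtype))     # base level
        p = n // 2
        X = np.zeros_like(U)
        X[:p, :p] = trtri_recursive(U[:p, :p], base)                  # (U^-1)_11
        X[p:, p:] = trtri_recursive(U[p:, p:], base)                  # (U^-1)_22
        # U11 X12 + U12 X22 = 0  =>  X12 = -U11^{-1} (U12 X22), by substitution
        X[:p, p:] = -solve_triangular(U[:p, :p], U[:p, p:] @ X[p:, p:])
        return X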
Conclusions
The blocked algorithms described here will be included in a future mark of the NAG Library [15]. Since future marks of the NAG Library will be implemented explicitly in parallel with OpenMP, the standard blocking algorithm will
be used. Recursive blocking is also fruitful for multiplying triangular matrices.
Because of the importance of the (quasi-) triangular square root, which arises
in algorithms for computing the matrix logarithm [2], [3], matrix pth roots [5], [8],
and arbitrary matrix powers [13], this computational kernel is a strong contender
for inclusion in any future extensions of the BLAS.
References
1. Advanced Micro Devices, Inc., Numerical Algorithms Group Ltd.: AMD Core Math Library (ACML), 4.1.0 edn. (2008)
2. Al-Mohy, A.H., Higham, N.J.: Improved inverse scaling and squaring algorithms for the matrix logarithm. SIAM J. Sci. Comput. 34(4), C153-C169 (2012)
3. Al-Mohy, A.H., Higham, N.J., Relton, S.D.: Computing the Fréchet derivative of the matrix logarithm and estimating the condition number. MIMS EPrint 2012.72, Manchester Institute for Mathematical Sciences, The University of Manchester, UK (2012)
4. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)
5. Bini, D.A., Higham, N.J., Meini, B.: Algorithms for the matrix pth root. Numer. Algorithms 39(4), 349-378 (2005)
6. Björck, Å., Hammarling, S.: A Schur method for the square root of a matrix. Linear Algebra Appl. 52/53, 127-140 (1983)
7. Du Croz, J.J., Higham, N.J.: Stability of methods for matrix inversion. IMA J. Numer. Anal. 12(1), 1-19 (1992)
8. Guo, C.-H., Higham, N.J.: A Schur-Newton method for the matrix pth root and its inverse. SIAM J. Matrix Anal. Appl. 28(3), 788-804 (2006)
9. Higham, N.J.: The Matrix Function Toolbox, http://www.ma.man.ac.uk/~higham/mftoolbox
10. Higham, N.J.: Computing real square roots of a real matrix. Linear Algebra Appl. 88/89, 405-430 (1987)
11. Higham, N.J.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM (2002)
12. Higham, N.J.: Functions of Matrices: Theory and Computation. SIAM (2008)
13. Higham, N.J., Lin, L.: A Schur-Padé algorithm for fractional powers of a matrix. SIAM J. Matrix Anal. Appl. 32(3), 1056-1078 (2011)
14. Jonsson, I., Kågström, B.: Recursive blocked algorithms for solving triangular systems, Part I: One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software 28(4), 392-415 (2002)
15. Numerical Algorithms Group: The NAG Fortran Library, http://www.nag.co.uk