Matrix Square Root Computation Algorithm
Introduction
$$U^2 = T, \qquad (1)$$
where $U$ and $T$ are upper triangular. The diagonal blocks satisfy
$$U_{ii}^2 = T_{ii}, \qquad (2)$$
and the off-diagonal blocks satisfy the Sylvester equations
$$U_{ii} U_{ij} + U_{ij} U_{jj} = T_{ij} - \sum_{k=i+1}^{j-1} U_{ik} U_{kj}. \qquad (3)$$
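To make the recurrence concrete, a minimal NumPy/SciPy sketch of the point version (the special case of (2)-(3) with 1 x 1 blocks) and of the block version is given below; the function names, the block partitioning details, and the use of SciPy's solve_sylvester in place of xTRSYL are illustrative assumptions rather than the Fortran implementation described in this paper.

    # Illustrative sketch: point recurrence (1x1 blocks) and blocked recurrence
    # (2)-(3); SciPy's solve_sylvester stands in for the xTRSYL Sylvester solver.
    import numpy as np
    from scipy.linalg import solve_sylvester

    def sqrtm_triangular_point(T):
        """Square root of an upper triangular T by the scalar recurrence."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        U = np.zeros_like(T)
        for i in range(n):
            U[i, i] = np.sqrt(T[i, i])
        for d in range(1, n):                      # sweep along superdiagonals
            for i in range(n - d):
                j = i + d
                s = T[i, j] - U[i, i+1:j] @ U[i+1:j, j]
                U[i, j] = s / (U[i, i] + U[j, j])
        return U

    def sqrtm_triangular_block(T, b=64):
        """Blocked variant: (2) on the diagonal blocks, (3) via Sylvester solves."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        cuts = list(range(0, n, b)) + [n]          # block boundaries
        m = len(cuts) - 1
        U = np.zeros_like(T)
        for i in range(m):                         # diagonal blocks, eq. (2)
            ri = slice(cuts[i], cuts[i+1])
            U[ri, ri] = sqrtm_triangular_point(T[ri, ri])
        for d in range(1, m):                      # off-diagonal blocks, eq. (3)
            for i in range(m - d):
                j = i + d
                ri, rj = slice(cuts[i], cuts[i+1]), slice(cuts[j], cuts[j+1])
                mid = slice(cuts[i+1], cuts[j])
                rhs = T[ri, rj] - U[ri, mid] @ U[mid, rj]
                U[ri, rj] = solve_sylvester(U[ri, ri], U[rj, rj], rhs)
        return U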
uniform distribution on [0, 1). Figure 1 shows the run times for the methods, for
values of n up to 8000. A block size of 64 was chosen, although the speed did
not appear to be particularly sensitive to the block size; similar results were obtained with blocks of size 16, 32, and 128. The block method was found to be up to 6 times faster than the point method. The residuals $\|\widehat U^2 - T\|/\|T\|$, where $\widehat U$ is the computed value of $U$, were similar for both methods. Table
1 shows that, for n = 4000, approximately 85% of the run time is spent in
ZGEMM calls.
Fig. 1. Run times for the point, block, and recursion methods for computing the square root of a complex n x n triangular matrix for n in [0, 8000]
A larger block size enables larger GEMM calls to be made. However, it leads
to larger calls to the point algorithm and to xTRSYL (which only uses level 2
BLAS). A recursive approach may allow increased use of level 3 BLAS.
Equation (1) can be rewritten as
$$\begin{bmatrix} U_{11} & U_{12} \\ 0 & U_{22} \end{bmatrix}^2 = \begin{bmatrix} T_{11} & T_{12} \\ 0 & T_{22} \end{bmatrix}, \qquad (4)$$
where the submatrices are of size $n/2$ or $(n \pm 1)/2$ depending on the parity of $n$. Then $U_{11}^2 = T_{11}$ and $U_{22}^2 = T_{22}$ can be solved recursively, until some base level is reached, at which point the point algorithm is used. The Sylvester equation $U_{11} U_{12} + U_{12} U_{22} = T_{12}$ can then be solved using a recursive algorithm
devised by Jonsson and Kågström [14]. Writing the Sylvester equation as $AX + XB = C$, with $A$ and $B$ upper triangular, and partitioning each matrix into a $2 \times 2$ block form gives
$$A_{11} X_{11} + X_{11} B_{11} = C_{11} - A_{12} X_{21}, \qquad (5)$$
$$A_{11} X_{12} + X_{12} B_{22} = C_{12} - A_{12} X_{22} - X_{11} B_{12}, \qquad (6)$$
$$A_{22} X_{21} + X_{21} B_{11} = C_{21}, \qquad (7)$$
$$A_{22} X_{22} + X_{22} B_{22} = C_{22} - X_{21} B_{12}. \qquad (8)$$
Equation (7) is solved recursively, followed by (5) and (8), and finally (6). At
the base level a routine such as xTRSYL is used.
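A sketch of this recursion in Python is given below; it is illustrative only: the names are invented, SciPy's solve_sylvester stands in for xTRSYL at the base level of the Sylvester recursion, and SciPy's general-purpose sqrtm stands in for the point algorithm at the base level of the square root recursion.

    # Illustrative sketch of the recursive blocked method (4)-(8); not the
    # Fortran implementation timed in this paper.
    import numpy as np
    from scipy.linalg import solve_sylvester, sqrtm

    def sylvester_recursive(A, B, C, base=64):
        """Solve AX + XB = C (A, B upper triangular) by the splitting (5)-(8)."""
        if A.shape[0] <= base or B.shape[0] <= base:
            return solve_sylvester(A, B, C)    # base level: xTRSYL analogue
        p, q = A.shape[0] // 2, B.shape[0] // 2
        A11, A12, A22 = A[:p, :p], A[:p, p:], A[p:, p:]
        B11, B12, B22 = B[:q, :q], B[:q, q:], B[q:, q:]
        C11, C12, C21, C22 = C[:p, :q], C[:p, q:], C[p:, :q], C[p:, q:]
        X21 = sylvester_recursive(A22, B11, C21, base)                          # (7)
        X11 = sylvester_recursive(A11, B11, C11 - A12 @ X21, base)              # (5)
        X22 = sylvester_recursive(A22, B22, C22 - X21 @ B12, base)              # (8)
        X12 = sylvester_recursive(A11, B22, C12 - A12 @ X22 - X11 @ B12, base)  # (6)
        return np.block([[X11, X12], [X21, X22]])

    def sqrtm_triangular_recursive(T, base=64):
        """Recursive blocked square root of an upper triangular T, following (4)."""
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        if n <= base:
            return sqrtm(T)                    # base level: stand-in for the point algorithm
        p = n // 2                             # split of size n/2 (or (n +/- 1)/2)
        U11 = sqrtm_triangular_recursive(T[:p, :p], base)
        U22 = sqrtm_triangular_recursive(T[p:, p:], base)
        U12 = sylvester_recursive(U11, U22, T[:p, p:], base)
        U = np.zeros_like(T)
        U[:p, :p], U[:p, p:], U[p:, p:] = U11, U12, U22
        return U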
The run times for a Fortran implementation of the recursion method in complex arithmetic, with a base level of size 64, are shown in Figure 1. The approach
was found to be consistently 10% faster than the block method, and up to 8 times
faster than the point method, with similar residuals in each case. The precise
choice of base level made little difference to the run time.
Table 2 shows that the run time is dominated by GEMM calls and that the
time spent in ZTRSYL and the point algorithm is similar to the block method.
The largest GEMM call uses a submatrix of size n/4.
Table 1. Profiling of the block method for computing the square root of a triangular matrix, with n = 4000. Format: time in seconds (number of calls).

  Total time taken:            24.03
  Calls to point algorithm:     0.019 (63)
  Calls to ZTRSYL:              3.47 (1953)
  Calls to ZGEMM:              20.54 (39711)
We use the standard model of floating point arithmetic [11, §2.2] in which the result of a floating point operation, op, on two scalars x and y is written as
$$fl(x \,\mathrm{op}\, y) = (x \,\mathrm{op}\, y)(1 + \delta), \qquad |\delta| \le u,$$
where u is the unit roundoff.
Table 2. Profiling of the recursive method for computing the square root of a triangular matrix, with n = 4000. Format: time in seconds (number of calls).

  Total time taken:                  22.04
  Calls to point algorithm:           0.002 (64)
  Calls to ZTRSYL:                    3.37 (2016)
  Calls to ZGEMM total:              18.64 (2604)
  Calls to ZGEMM with n = 1000:       7.40 (4)
  Calls to ZGEMM with n = 500:        5.34 (24)
  Calls to ZGEMM with n = 250:        3.16 (112)
  Calls to ZGEMM with n = 125:        1.81 (480)
  Calls to ZGEMM with n <= 63:        0.94 (1984)

Here $\theta_n$ denotes a quantity satisfying
$$|\theta_n| \le \frac{nu}{1 - nu} =: \gamma_n. \qquad (11)$$
In the (non-recursive) block method, to bound $\Delta T_{ij}$ we must account for the error in performing the matrix multiplications on the right-hand side of (3). Standard error analysis for matrix multiplication yields, for blocks of size m,
$$\biggl| fl\Bigl(\sum_{k=i+1}^{j-1} \widehat U_{ik} \widehat U_{kj}\Bigr) - \sum_{k=i+1}^{j-1} \widehat U_{ik} \widehat U_{kj} \biggr| \le \gamma_n \bigl(|\widehat U|^2\bigr)_{ij}.$$
Substituting this into the residual for the Sylvester equation in the off-diagonal blocks, we obtain the componentwise bound (10).
To obtain a bound for the recursive blocked method we must first check whether (11) holds when the Sylvester equation is solved using Jonsson and Kågström's recursive algorithm. This can be done by induction, assuming that (11) holds at the base level. For the inductive step, it suffices to incorporate the error estimates for the matrix multiplications in the right-hand sides of (5)-(8) into the residual bound.
Induction can then be applied to the recursive blocked method for the square
root. The bounds (10) and (11) are assumed to hold at the base level. The
inductive step is similar to the analysis for the block method. Overall, (10) is
obtained.
We conclude that both our blocked algorithms for computing the matrix
square root satisfy backward error bounds of the same forms (9) and (10) as
the point algorithm.
Serial Implementations
When used with full (non-triangular) matrices, more modest speedups are expected because of the significant overhead in computing the Schur decomposition. Figure 2 compares run times of the MATLAB function sqrtm (which does not use any blocking) and Fortran implementations of the point method (fort_point) and the recursive blocked method (fort_recurse), called from within MATLAB using a mex interface, on a 64-bit Intel i3 machine. The matrices have elements whose real and imaginary parts are chosen from the uniform random distribution on the interval [0, 1). The recursive routine is found to be up to 2.5 times faster than sqrtm and 2 times faster than fort_point.
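For completeness, the full-matrix workflow measured here (complex Schur decomposition, triangular square root, back-transformation) and the relative residual can be sketched as follows; sqrtm_via_schur is an invented name and sqrtm_triangular_recursive refers to the sketch above.

    # Illustrative full-matrix workflow: Schur form, triangular phase, back-transform.
    import numpy as np
    from scipy.linalg import schur

    def sqrtm_via_schur(A):
        T, Q = schur(A, output='complex')      # A = Q T Q^H, T upper triangular
        U = sqrtm_triangular_recursive(T)      # triangular phase (sketch above)
        return Q @ U @ Q.conj().T              # X with X^2 approximately A

    rng = np.random.default_rng(0)
    n = 200
    A = rng.random((n, n)) + 1j * rng.random((n, n))   # entries in [0, 1) as in the tests
    X = sqrtm_via_schur(A)
    print(np.linalg.norm(X @ X - A) / np.linalg.norm(A))   # relative residual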
An extension of the Schur method due to Higham [10] enables the square root of a real matrix to be computed without using complex arithmetic. A real Schur decomposition of A is computed. Square roots of the 2 x 2 diagonal blocks of the upper quasi-triangular factor are computed using an explicit formula. The recurrence (3) now proceeds either a block column or a block superdiagonal at a time, where the blocks are of size 1 x 1, 1 x 2, 2 x 1, or 2 x 2 depending on the diagonal block structure. A MATLAB implementation of this algorithm, sqrtm_real, is available in the Matrix Function Toolbox [9]. The algorithm can also be implemented in a recursive manner, the only subtlety being that the splitting point for the recursion must be chosen to avoid splitting any 2 x 2 diagonal blocks. A similar error analysis to that of §3 applies to the real recursive method, though since only a normwise bound is available for the point algorithm applied to the quasi-triangular matrix, the backward error bound (10) holds in the Frobenius norm rather than elementwise.
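In code, the split-point adjustment amounts to a check on the subdiagonal of the quasi-triangular factor; a minimal sketch (with a hypothetical helper name) is:

    import numpy as np

    def safe_split(T, p):
        """Adjust a tentative split index p so that no 2x2 diagonal block of the
        real Schur factor T is cut; a 2x2 block is signalled by T[p, p-1] != 0."""
        if 0 < p < T.shape[0] and T[p, p - 1] != 0.0:
            return p + 1        # move the split past the 2x2 block
        return p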
Figure 3 compares the run times of sqrtm and sqrtm_real with Fortran implementations of the real point method (fort_point_real) and the real recursive method (fort_recurse_real), also called from within MATLAB. The matrix elements are chosen from the uniform random distribution on [0, 1). The recursive routine is found to be up to 6 times faster than sqrtm and sqrtm_real and 2 times faster than fort_point_real.
Fig. 2. Run times for sqrtm, fort_recurse, and fort_point for computing the square root of a full n x n matrix for n in [0, 2000]
Both the real and complex recursive blocked routines spend over 90% of their run time in computing the Schur decomposition, compared with 44% for fort_point, 46% for fort_point_real, 25% for sqrtm, and 16% for sqrtm_real. The latter two percentages reflect the overhead of the MATLAB interpreter in executing the recurrences for the (quasi-) triangular square root phase. The 90% figure is consistent with the flop counts of $28n^3$ flops for computing the Schur decomposition and transforming back from Schur form and $n^3/3$ flops for the square root of the triangular matrix, since $28n^3/(28n^3 + n^3/3) \approx 0.99$.
Fig. 3. Run times for sqrtm, sqrtm_real, fort_recurse_real, and fort_point_real for computing the square root of a full n x n matrix for n in [0, 2000]

Parallel Implementations
The recursive block method can be parallelized using OpenMP tasks. Each
recursive call generates a new task. Synchronization points are required to ensure
that data dependencies are preserved. Hence, in equation (4), $U_{11}$ and $U_{22}$ can be computed in parallel, and only then can $U_{12}$ be found. When solving the Sylvester equation recursively, only (5) and (8) can be solved in parallel.
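The dependency structure can be sketched as follows, using Python futures purely to illustrate where the synchronization points fall; the implementation described here uses OpenMP tasks in Fortran, and in Python any real speedup would come from the underlying BLAS rather than from the threads themselves. The routines sqrtm_triangular_recursive and sylvester_recursive refer to the earlier sketches.

    # Illustration of the task/synchronization structure only.
    import numpy as np
    from concurrent.futures import ThreadPoolExecutor

    def sqrtm_triangular_tasks(T, base=64, workers=8):
        T = np.asarray(T, dtype=complex)
        n = T.shape[0]
        if n <= base:
            return sqrtm_triangular_recursive(T, base)
        p = n // 2
        with ThreadPoolExecutor(max_workers=workers) as pool:
            # U11 and U22 are independent, so they may be formed in parallel ...
            f11 = pool.submit(sqrtm_triangular_tasks, T[:p, :p], base, workers)
            f22 = pool.submit(sqrtm_triangular_tasks, T[p:, p:], base, workers)
            U11, U22 = f11.result(), f22.result()   # ... synchronization point
        # ... and only then can U12 be found from the Sylvester equation.
        U12 = sylvester_recursive(U11, U22, T[:p, p:], base)
        U = np.zeros_like(T)
        U[:p, :p], U[:p, p:], U[p:, p:] = U11, U12, U22
        return U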
When sufficient threads are available (for example when computing the Schur
decomposition) threaded BLAS should be used. When all threads are busy (for
example during the triangular phase of the algorithm), serial BLAS should be
used, to avoid the overhead of creating threads unnecessarily. Unfortunately, it
is not possible to control the number of threads available to individual BLAS
calls in this way. In the implementations described below threaded BLAS are
used throughout, despite this overparallelization overhead.
The parallelized Fortran test codes were compiled on a machine containing
4 Intel Xeon CPUs, with 8 available threads, linking to ACML threaded BLAS
[1]. Figure 4 compares run times for the triangular phase of the algorithm, with
triangular test matrices generated with elements having real and imaginary parts
chosen from the uniform random distribution on the interval [0, 1).
The point algorithm does not use BLAS, but 2-fold speedups on eight cores
are obtained using OpenMP. With standard blocking, threaded BLAS alone
gives a 2-fold speed up, but using OpenMP gives a 5.5 times speedup. With
recursive blocking, a 3-fold speedup is obtained by using threaded BLAS, but using OpenMP then decreases the performance because of the multiple synchronization points at each level of the recursion. Overall, if the only parallelization available is from threaded BLAS, then the recursive algorithm is the fastest. However, if OpenMP is used then shorter run times are obtained using the standard blocking method.

Fig. 4. Run times for parallel implementations of the point, block, and recursion methods for computing the square root of a 4000 x 4000 triangular matrix
Figure 5 compares run times for computing the square root of a full square
matrix. Here, the run times are dominated by the Schur decomposition, so the
most significant gains are obtained by simply using threaded BLAS and the gains
due to the new triangular algorithms are less apparent.
Fig. 5. Run times for parallel implementations of the point, block, and recursion methods for computing the square root of a 4000 x 4000 full matrix
Fig. 6. Run times for the point, block, and recursion methods for multiplying randomly generated n x n triangular matrices for n in [0, 8000]
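Figure 6 refers to applying the same point, block, and recursive strategies to the product of two upper triangular matrices. A minimal sketch of the recursive variant, using the same splitting as in (4) (illustrative Python, not the Fortran used for the timings), is:

    import numpy as np

    def trmm_recursive(A, B, base=64):
        """Product C = A B of two upper triangular matrices by recursive blocking."""
        n = A.shape[0]
        if n <= base:
            return A @ B                           # base level: one small multiply
        p = n // 2
        C = np.zeros((n, n), dtype=np.result_type(A, B))
        C[:p, :p] = trmm_recursive(A[:p, :p], B[:p, :p], base)       # C11 = A11 B11
        C[p:, p:] = trmm_recursive(A[p:, p:], B[p:, p:], base)       # C22 = A22 B22
        C[:p, p:] = A[:p, :p] @ B[:p, p:] + A[:p, p:] @ B[p:, p:]    # C12 via level 3 BLAS
        return C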
Recursive blocking can also be applied to the inversion of an upper triangular matrix $U$, partitioned as in (4). Then $(U^{-1})_{11}$ and $(U^{-1})_{22}$ are computed and $(U^{-1})_{12}$ is obtained by solving $U_{11}(U^{-1})_{12} + U_{12}(U^{-1})_{22} = 0$. Provided that forward substitution is used, the right (or left) recursive inversion method can be shown inductively to satisfy the same right (or left) elementwise residual bound as the point method [7]. A Fortran implementation of this idea was found to perform similarly to the LAPACK code xTRTRI, so no real benefit was derived from recursive blocking.
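A sketch of the recursive inversion just described (illustrative Python; the comparison above is against the Fortran LAPACK routine xTRTRI) is:

    import numpy as np
    from scipy.linalg import solve_triangular

    def trtri_recursive(U, base=64):
        """Inverse of an upper triangular U by recursive blocking."""
        n = U.shape[0]
        if n <= base:
            return solve_triangular(U, np.eye(n, dtype=U.dtype))     # base level
        p = n // 2
        X = np.zeros_like(U)
        X[:p, :p] = trtri_recursive(U[:p, :p], base)                  # (U^-1)_11
        X[p:, p:] = trtri_recursive(U[p:, p:], base)                  # (U^-1)_22
        # U11 X12 + U12 X22 = 0  =>  X12 = -U11^{-1} (U12 X22), by substitution
        X[:p, p:] = -solve_triangular(U[:p, :p], U[:p, p:] @ X[p:, p:])
        return X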
Conclusions
The blocked algorithms described here will be included in a future mark of the NAG Library [15]. Since future marks of the NAG Library will be implemented explicitly in parallel with OpenMP, the standard blocking algorithm will
be used. Recursive blocking is also fruitful for multiplying triangular matrices.
Because of the importance of the (quasi-) triangular square root, which arises
in algorithms for computing the matrix logarithm [2], [3], matrix pth roots [5], [8],
and arbitrary matrix powers [13], this computational kernel is a strong contender
for inclusion in any future extensions of the BLAS.
References
1. Advanced Micro Devices, Inc., Numerical Algorithms Group Ltd.: AMD Core Math Library (ACML), 4.1.0 edn. (2008)
2. Al-Mohy, A.H., Higham, N.J.: Improved inverse scaling and squaring algorithms for the matrix logarithm. SIAM J. Sci. Comput. 34(4), C153-C169 (2012)
3. Al-Mohy, A.H., Higham, N.J., Relton, S.D.: Computing the Fréchet derivative of the matrix logarithm and estimating the condition number. MIMS EPrint 2012.72, Manchester Institute for Mathematical Sciences, The University of Manchester, UK (2012)
4. Anderson, E., Bai, Z., Bischof, C., Blackford, S., Demmel, J., Dongarra, J., Du Croz, J., Greenbaum, A., Hammarling, S., McKenney, A., Sorensen, D.: LAPACK Users' Guide, 3rd edn. Society for Industrial and Applied Mathematics, Philadelphia, PA (1999)
5. Bini, D.A., Higham, N.J., Meini, B.: Algorithms for the matrix pth root. Numer. Algorithms 39(4), 349-378 (2005)
6. Björck, Å., Hammarling, S.: A Schur method for the square root of a matrix. Linear Algebra Appl. 52/53, 127-140 (1983)
7. Du Croz, J.J., Higham, N.J.: Stability of methods for matrix inversion. IMA J. Numer. Anal. 12(1), 1-19 (1992)
8. Guo, C.-H., Higham, N.J.: A Schur-Newton method for the matrix pth root and its inverse. SIAM J. Matrix Anal. Appl. 28(3), 788-804 (2006)
9. Higham, N.J.: The Matrix Function Toolbox, http://www.ma.man.ac.uk/~higham/mftoolbox
10. Higham, N.J.: Computing real square roots of a real matrix. Linear Algebra Appl. 88/89, 405-430 (1987)
11. Higham, N.J.: Accuracy and Stability of Numerical Algorithms, 2nd edn. SIAM (2002)
12. Higham, N.J.: Functions of Matrices: Theory and Computation. SIAM (2008)
13. Higham, N.J., Lin, L.: A Schur-Padé algorithm for fractional powers of a matrix. SIAM J. Matrix Anal. Appl. 32(3), 1056-1078 (2011)
14. Jonsson, I., Kågström, B.: Recursive blocked algorithms for solving triangular systems, Part I: One-sided and coupled Sylvester-type matrix equations. ACM Trans. Math. Software 28(4), 392-415 (2002)
15. Numerical Algorithms Group: The NAG Fortran Library, http://www.nag.co.uk