SIAM J. MATRIX ANAL. APPL. © 2001 Society for Industrial and Applied Mathematics
Vol. 22, No. 4, pp. 1038–1057
Abstract. We examine the behavior of Newton's method in floating point arithmetic, allowing for extended precision in computation of the residual, inaccurate evaluation of the Jacobian and unstable solution of the linear systems. We bound the limiting accuracy and the smallest norm of the residual. The application that motivates this work is iterative refinement for the generalized eigenvalue problem. We show that iterative refinement by Newton's method can be used to improve the forward and backward errors of computed eigenpairs.

Key words. Newton's method, generalized eigenvalue problem, iterative refinement, Cholesky method, backward error, forward error, rounding error analysis, limiting accuracy, limiting residual

AMS subject classifications. 65F15, 65F35

PII. S0895479899359837
1. Introduction. This work is motivated by the symmetric definite generalized eigenvalue problem $Ax = \lambda Bx$ ($A$ and $B$ symmetric and one of them positive definite), for which no method is known that takes advantage of the symmetry, is efficient, and is backward stable. For the special case where both matrices are positive definite, such a method is available [26]. The aim is to show that iterative refinement by Newton's method can be used to improve the forward and backward errors of computed eigenpairs. An important question is how accurately the residuals must be evaluated in order to improve the relative forward error and/or the backward error.

For added generality we give a detailed analysis of the general Newton method in floating point arithmetic, allowing for extended precision in computation of the residual, possibly inaccurate evaluation of the Jacobian and unstable linear system solvers. We bound the limiting accuracy that can be obtained and the smallest norm of the residual.
Lancaster [19], Wozniakowski [28], Ypma [29], [30], and Dennis and Walker [6] have also considered the effects of inaccuracy, computational or otherwise, on Newton's method for solving nonlinear algebraic equations. None of these authors analyzed the behavior of the residual. Lancaster and Ypma were interested in how the approximate iterate is related to the exact one rather than the error in the approximate iterate. Wozniakowski carried out his analysis with the big-O notation and therefore his results contain unknown constants. We follow the same approach as Dennis and Walker [6] in that our results are based directly on the error in the computed iterates. The analysis in [6] is very general and uses several assumptions and constants that are difficult to interpret and understand even for the special case discussed therein (iterative refinement for linear systems of equations).

The residual contains information that is crucial for improving an approximate solution by Newton's method. Thus it should be computed as accurately as possible.
Received by the editors August 4, 1999; accepted for publication (in revised form) by J. Varah October 5, 2000; published electronically February 23, 2001.
http://www.siam.org/journals/simax/22-4/35983.html
2. Newton's method in floating point arithmetic. Let $F: \mathbb{R}^m \to \mathbb{R}^m$ be continuously differentiable on $\mathbb{R}^m$. We denote by $J$ the Jacobian matrix $(\partial F_i/\partial v_j)$ of $F$ and assume that $J$ is Lipschitz continuous with constant $\beta$ in $\mathbb{R}^m$, that is,
$$\|J(w) - J(v)\| \le \beta\|w - v\| \quad \text{for all } v, w \in \mathbb{R}^m,$$
where $\|\cdot\|$ denotes any vector norm and the corresponding operator norm. We denote by $\kappa(J) = \|J\|\,\|J^{-1}\|$ the condition number of the matrix $J$. We attempt to solve the system of nonlinear equations $F(v) = 0$ by Newton's method:
$$J(v_i)(v_{i+1} - v_i) = -F(v_i), \quad i \ge 0, \tag{2.1}$$
where $v_0$ is given. We implement (2.1) as
$$\text{Solve } J(v_i)d_i = -F(v_i), \qquad v_{i+1} = v_i + d_i.$$
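In code, one step of this scheme is a linear solve followed by an update. The following Python sketch is our own illustration, not part of the paper; the names newton, F, J, and the stopping test are illustrative assumptions.

# Minimal sketch of iteration (2.1): solve J(v_i) d_i = -F(v_i), v_{i+1} = v_i + d_i.
import numpy as np

def newton(F, J, v0, maxit=20, tol=1e-14):
    v = np.asarray(v0, dtype=float)
    for _ in range(maxit):
        d = np.linalg.solve(J(v), -F(v))   # correction from the linear system
        v = v + d                          # update the iterate
        if np.linalg.norm(d) <= tol * np.linalg.norm(v):   # correction stagnated
            break
    return v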
Newton's method is attractive because under appropriate conditions it converges rapidly from any sufficiently good initial guess. In particular, if the Jacobian is nonsingular at the solution, local quadratic convergence can be proved [5, Thm. 5.2.1]. The Kantorovich theorem yields a weaker bound on the convergence rate but makes no assumption on the nonsingularity of the Jacobian at the solution [5, Thm. 5.3.1], [24].

We use hats to denote computed quantities. We work with the standard model of floating point arithmetic [16, section 2.3]
$$fl(x \text{ op } y) = (x \text{ op } y)(1 + \delta), \qquad |\delta| \le u, \qquad \text{op} = +, -, *, /,$$
where $u$ is the unit roundoff.
In floating point arithmetic, we have
$$\hat v_{i+1} = \hat v_i - (J(\hat v_i) + E_i)^{-1}(F(\hat v_i) + e_i) + \epsilon_i, \tag{2.2}$$
where
• $e_i$ is the error made when computing the residual $F(\hat v_i)$,
• $E_i$ is the error incurred in forming $J(\hat v_i)$ and solving the linear system for $d_i$,
• $\epsilon_i$ is the error made when adding the correction $\hat d_i$ to $\hat v_i$.

We assume that $F(\hat v_i)$ is computed in the possibly extended precision $\bar u \le u$ before rounding back to working precision $u$, and that $\hat d_i$, $\hat v_i$ are computed at precision $u$. Hence we assume that there exists a function $\psi$ depending on $F$, $\hat v_i$, $u$, and $\bar u$ such that
$$\|e_i\| \le u\|F(\hat v_i)\| + \psi(F, \hat v_i, u, \bar u). \tag{2.3}$$
Note that standard error analysis shows that $\|e_i\| \le u\|F(\hat v_i)\|$ is the best we can obtain in practice for both mixed and fixed precision. Later, we will give an explicit formula for $\psi$ in the case of linear systems and the generalized eigenvalue problem.

We assume that the error $E_i$ satisfies
$$\|E_i\| \le u\,\varphi(F, \hat v_i, n, u) \tag{2.4}$$
for some function $\varphi$ that reflects both the instability of the linear solver and the error made when approximating or forming $J(\hat v_i)$. In practice, we certainly have $\varphi(F, \hat v_i, n, u) \ge \|J(\hat v_i)\|$. For the error $\epsilon_i$ we have
$$\|\epsilon_i\| \le u(\|\hat v_i\| + \|\hat d_i\|).$$
We will make use of the constants
$$\gamma_n = \frac{cnu}{1 - cnu} \quad\text{and}\quad \bar\gamma_n = \frac{cn\bar u}{1 - cn\bar u}, \tag{2.5}$$
where $c$ is a small integer constant.
2.2. Forward error. First we consider the change in error for a single step of an iteration of the form (2.2). For notational convenience we write $v = \hat v_i$, $\bar v = \hat v_{i+1}$, and
$$\bar v = v - (J + E)^{-1}(r + e) + \epsilon, \tag{2.6}$$
where $r = F(v)$, $J = J(v)$, and
$$\|E\| \le u\,\varphi(F, v, n, u), \tag{2.7}$$
$$\|e\| \le u\|r\| + \psi(F, v, u, \bar u), \qquad \|\epsilon\| \le u(\|v\| + \|d\|),$$
with
$$d = -(J + E)^{-1}(r + e). \tag{2.8}$$
We will often refer to the following lemma.

Lemma 2.1 (see [5, Lem. 4.1.12]). For any $v, w \in \mathbb{R}^m$,
$$\|F(w) - F(v) - J(v)(w - v)\| \le \frac{\beta}{2}\|w - v\|^2. \tag{2.9}$$
Theorem 2.2. Assume that there is a $v_*$ such that $F(v_*) = 0$, $J_* = J(v_*)$ is nonsingular, and
$$\|J^{-1}E\| \le \tau < 1. \tag{2.10}$$
Then, for all $v$ such that
$$\beta\|J_*^{-1}\|\,\|v - v_*\| \le \sigma < 1, \tag{2.11}$$
$\bar v$ in (2.6) is well defined and
$$\|\bar v - v_*\| \le G\|v - v_*\| + g,$$
where
$$G = \frac{1}{1-\tau}\|J^{-1}E\| + \frac{(1+u)^2\beta}{2(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,\|v - v_*\| + \frac{u(2+u)}{(1-\tau)(1-\sigma)}\kappa(J_*) + u$$
and
$$g = \frac{1+u}{(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,\psi(F, v, u, \bar u) + u\|v_*\|.$$
Proof. From assumption (2.11) and the Lipschitz property of $J$ we have
$$\|J_*^{-1}(J - J_*)\| \le \beta\|J_*^{-1}\|\,\|v - v_*\| \le \sigma < 1. \tag{2.12}$$
From the identity
$$J = J_*(I + J_*^{-1}(J - J_*)) \tag{2.13}$$
it then follows that $J$ is nonsingular with inverse given by
$$J^{-1} = (I + J_*^{-1}(J - J_*))^{-1}J_*^{-1}$$
and with
$$\|J^{-1}\| \le \frac{\|J_*^{-1}\|}{1 - \|J_*^{-1}(J - J_*)\|} \le \frac{1}{1-\sigma}\|J_*^{-1}\|. \tag{2.14}$$
Similarly, assumption (2.10) guarantees that $J + E$ is nonsingular and that, using (2.14),
$$\|(J+E)^{-1}\| \le \frac{\|J^{-1}\|}{1 - \|J^{-1}E\|} \le \frac{1}{(1-\sigma)(1-\tau)}\|J_*^{-1}\|. \tag{2.15}$$
Since $(J+E)^{-1}$ exists, $\bar v$ in (2.6) is well defined. We have
$$\bar v - v_* = v - v_* - (J+E)^{-1}(r + e) + \epsilon = (I - (J+E)^{-1}J)(v - v_*) - (J+E)^{-1}(r - J(v - v_*) + e) + \epsilon,$$
which gives
$$\|\bar v - v_*\| \le \|I - (J+E)^{-1}J\|\,\|v - v_*\| + \|(J+E)^{-1}\|(\|r - J(v - v_*)\| + \|e\|) + \|\epsilon\|.$$
From
$$I - (J+E)^{-1}J = -(J+E)^{-1}E = -(I + J^{-1}E)^{-1}J^{-1}E$$
it follows that
$$\|I - (J+E)^{-1}J\| \le \frac{1}{1-\tau}\|J^{-1}E\|.$$
From Lemma 2.1,
$$\|r - J(v - v_*)\| \le \frac{\beta}{2}\|v - v_*\|^2 \quad\text{and}\quad \|r - J_*(v - v_*)\| \le \frac{\beta}{2}\|v - v_*\|^2,$$
so that
$$\|r\| \le \|r - J_*(v - v_*)\| + \|J_*(v - v_*)\| \le \frac{\beta}{2}\|v - v_*\|^2 + \|J_*\|\,\|v - v_*\| \tag{2.16}$$
and hence
$$\|e\| \le u\Big(\frac{\beta}{2}\|v - v_*\|^2 + \|J_*\|\,\|v - v_*\|\Big) + \psi(F, v, u, \bar u).$$
We have
$$\|\epsilon\| \le u(\|v - v_*\| + \|v_*\| + \|d\|)$$
with
$$\|d\| \le \|(J+E)^{-1}\|(\|r\| + \|e\|) \tag{2.17}$$
$$\le \|(J+E)^{-1}\|\big((1+u)\|r\| + \psi(F, v, u, \bar u)\big) \le \frac{1}{(1-\tau)(1-\sigma)}\|J_*^{-1}\|\Big((1+u)\Big(\frac{\beta}{2}\|v - v_*\| + \|J_*\|\Big)\|v - v_*\| + \psi(F, v, u, \bar u)\Big),$$
using (2.15) and (2.16). Hence,
$$\|\bar v - v_*\| \le G\|v - v_*\| + g,$$
where $G$ and $g$ are given in the statement of the theorem.
Assumptions (2.10) and (2.11) are necessary for $\bar v$ in (2.6) to be well defined. Assumption (2.10) is a condition on the stability of the linear system solver and the accuracy of the Jacobian.

In exact arithmetic we have $u = \psi(F, v, u, \bar u) = \epsilon = 0$ and $E = 0$. Then, for $\sigma \le 1/2$, Theorem 2.2 reduces to the local quadratic convergence theorem for Newton's method [5, Thm. 5.2.1] applied to a single step.
Clearly, for $\tau \le \frac18$ and $\sigma \le \frac18$, if $u\kappa(J_*) \le \frac18$, then we have $G \le \frac12$. Thus the error contracts unless $g \gtrsim \|v - v_*\|$, that is, until the relative error reaches
$$\frac{g}{\|v_*\|} = \frac{1+u}{(1-\tau)(1-\sigma)}\frac{\|J_*^{-1}\|}{\|v_*\|}\,\psi(F, v, u, \bar u) + u,$$
which depends on the accuracy with which the residual is computed. If $\|J_*^{-1}\|\,\psi(F, v, u, \bar u) \le cu\|v_*\|$ for some constant $c$, then we can expect to obtain a normwise relative error of order $cu$.

Note that the rate of convergence depends on the accuracy of the Jacobian and on the stability of the linear system solver, since $G$ depends strongly on $E$, but the limiting accuracy is essentially independent of the solver (for $\tau \le \frac18$, say). Note also that $G$ is independent of $\bar u$, which means that the rate of convergence is bounded independently of the precision used to compute the residual.
Corollary 2.3. Assume that there is a $v_*$ such that $F(v_*) = 0$ and $J_* = J(v_*)$ is nonsingular and satisfies
$$u\kappa(J_*) \le \tfrac18. \tag{2.18}$$
Assume also that for $\varphi$ in (2.4),
$$u\|J(\hat v_i)^{-1}\|\,\varphi(F, \hat v_i, n, u) \le \tfrac18 \quad\text{for all } i. \tag{2.19}$$
Then, for all $v_0$ such that
$$\beta\|J_*^{-1}\|\,\|v_0 - v_*\| \le \tfrac18, \tag{2.20}$$
Newton's method in floating point arithmetic generates a sequence $\{\hat v_{i+1}\}$ whose normwise relative error decreases until the first $i$ for which
$$\frac{\|\hat v_{i+1} - v_*\|}{\|v_*\|} \lesssim \frac{\|J_*^{-1}\|}{\|v_*\|}\,\psi(F, v_*, u, \bar u) + u. \tag{2.21}$$
Proof. For $i = 0$, the assumptions (2.10) and (2.11) hold with $\tau = \frac18$ and $\sigma = \frac18$, and Theorem 2.2 applies to the first step. Using the values for $\tau$ and $\sigma$ and the bound (2.18), we find that $G < 1$, so the error contracts if (2.21) does not already hold. Thus, (2.20) is also satisfied with $v_0$ replaced by $\hat v_1$. The result follows by induction.
Example 1. To illustrate the corollary, we use Newton's method to compute a zero of the polynomial
$$F(v) = (v - 1)^{10} - 10^{-8}.$$
At the solution $v_* = 1 - 10^{-0.8} \approx 0.8415$, $|J(v_*)^{-1}| \approx 1.6\times 10^6$. To increase the rounding errors when computing the residual, we expand $(v-1)^{10}$ as
$$(v-1)^{10} = v^{10} - 10v^9 + 45v^8 - 120v^7 + 210v^6 - 252v^5 + 210v^4 - 120v^3 + 45v^2 - 10v + 1$$
and use this expression to evaluate $F(v)$. For $v \approx 1$ we have $\psi(F, v, u, \bar u) \approx 10^3\bar u$ (where $10^3$ is roughly the sum of the absolute values of the coefficients in the expansion of $(v-1)^{10}$). Corollary 2.3 predicts that if $v_0$ is not too far from $v_*$, the relative error decreases until $|\hat v_{i+1} - v_*|/|v_*| \approx 10^9\bar u + u$.

We carried out some numerical experiments in MATLAB, for which the unit roundoff is $u = 2^{-53} \approx 1.1\times 10^{-16}$. We used the Symbolic Math Toolbox to evaluate $F(v)$ at precision $\bar u$. We tried both $\bar u = u$ and $\bar u = u^{3/2} \approx 3.3\times 10^{-24}$.¹

¹In the BLAST document [2], the term extended precision is used for $\bar u \le u^{3/2}$.
The theory predicts limiting accuracy $|\hat v_{i+1} - v_*|/|v_*| \approx 10^{-7}$ if $\bar u = u$ and $|\hat v_{i+1} - v_*|/|v_*| \approx 10^{-15}$ if $\bar u = u^{3/2}$. For both values of $\bar u$ we used two different starting values for $v_0$, one for which $|v_0 - v_*|/|v_*| > 10^9\bar u + u$ and a second one for which the forward error is smaller than the expected limiting accuracy. We plot the behavior of the normwise forward error for $\bar u = u$ and $\bar u = u^{3/2}$ in Figure 2.1. The results are as predicted by the theory. They also illustrate Wilkinson's remark [27, p. 55]:

  It is perhaps worth remarking that if we start with an approximation to a zero which is appreciably more accurate than the limiting accuracy ... a single iteration will usually spoil this very good approximation and produce one with an error which is typical of the limiting accuracy.
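This experiment is easy to reproduce outside MATLAB. The sketch below is our own illustration (Python; exact rational arithmetic via the fractions module stands in for the Symbolic Math Toolbox evaluation of the residual, and the iteration count and starting value are arbitrary choices):

# Sketch of Example 1: Newton's method on F(v) = (v-1)^10 - 1e-8, with the
# residual evaluated from the expanded polynomial (large rounding errors).
from fractions import Fraction

coeffs = [1, -10, 45, -120, 210, -252, 210, -120, 45, -10, 1]  # (v-1)^10

def F_working(v):
    """Expanded polynomial evaluated in float64 (fixed precision, u_bar = u)."""
    s = 0.0
    for c in coeffs:
        s = s * v + c
    return s - 1e-8

def F_extended(v):
    """Residual computed exactly, then rounded once (mimics extended precision)."""
    vf = Fraction(v)                 # exact binary value of the float v
    s = Fraction(0)
    for c in coeffs:
        s = s * vf + c
    return float(s - Fraction(1, 10**8))   # single rounding at the end

def dF(v):
    return 10.0 * (v - 1.0) ** 9     # F'(v), evaluated at working precision

v_star = 1.0 - 10.0 ** -0.8          # exact zero, approx 0.8415

for F in (F_working, F_extended):
    v = 0.5
    for _ in range(40):
        v -= F(v) / dF(v)            # Newton step
    print(F.__name__, abs(v - v_star) / abs(v_star))

With F_working the error should stagnate near the predicted $10^{-7}$ level, while with F_extended it should reach the order of $u$, consistent with Figure 2.1.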
2.3. Residual. We now turn to bounding the residual for a single step of the form (2.6). As before, we write $r = F(v)$ and $J = J(v)$. Note that if $\hat v_* = fl(v_*) = v_* + \Delta v_*$ with $\|\Delta v_*\| \le u\|v_*\|$, then $F(\hat v_*) = F(v_* + \Delta v_*) = J(v_*)\Delta v_* + \xi$, where $\|\xi\| \le \frac{\beta}{2}\|\Delta v_*\|^2$. Thus
$$\|F(\hat v_*)\| \le u\|J(v_*)\|\,\|v_*\| + \frac{\beta}{2}u^2\|v_*\|^2$$
is the best bound we can hope to obtain for the norm of the residual.
Theorem 2.4. Assume that there is a $v_*$ such that $F(v_*) = 0$, $J_* = J(v_*)$ is nonsingular, and
$$\beta\|J_*^{-1}\|\,\|v - v_*\| \le \sigma < 1, \tag{2.22}$$
$$u\|J^{-1}\|\,\varphi(F, v, n, u) \le \tau < 1. \tag{2.23}$$
Let
$$\mu = \beta g\|J_*^{-1}\|,$$
where $g$ is defined in Theorem 2.2. Then
$$\|F(\bar v)\| \le H\|F(v)\| + h,$$
where
$$H = c_0\big[\sigma + \mu + u\kappa(J_*)\big]$$
and
$$h = c_1\big(\sigma + \mu + u\kappa(J_*)\big)\,\psi(F, v, u, \bar u) + c_2(\sigma + \mu + 1)\,u\|J\|\,\|v\|,$$
with $c_0$, $c_1$, and $c_2$ constants of order 1.
Proof. We have
$$\|J^{-1}E\| \le u\|J^{-1}\|\,\varphi(F, v, n, u) \le \tau < 1 \tag{2.24}$$
using (2.7) and (2.23). Thus, we can apply Theorem 2.2 to deduce that $\bar v$ is well defined. Let $\bar r = F(\bar v)$, and define $w \in \mathbb{R}^m$ by $w = \bar r - r - J(\bar v - v)$.
Fig. 2.1. Behavior of the forward error for $\bar u = u$ (top) and $\bar u = u^{3/2}$ (bottom). Each panel plots the relative error $|\hat v_i - v_*|/|v_*|$ against the iteration number $i$, together with the predicted limiting accuracy, for two starting values ($v_0 = v_* + 4\mathrm{e}10\,u$ and $v_0 = v_* + u$, top; $v_0 = v_* + 4\mathrm{e}2\,u$ and $v_0 = v_* + 2u$, bottom).
Note that from (2.6) and (2.8), $\bar v - v = d + \epsilon$ and $Jd = -(r + e) - Ed$, so that $\bar r = r + J(d + \epsilon) + w = -e - Ed + J\epsilon + w$, which yields
$$\|\bar r\| \le \|e\| + \|E\|\,\|d\| + \|J\|\,\|\epsilon\| + \|w\| \le u\|r\| + \psi(F, v, u, \bar u) + u\|d\|\big(\varphi(F, v, n, u) + \|J\|\big) + u\|J\|\,\|v\| + \|w\|. \tag{2.25}$$
From (2.12) and (2.13) it follows that
$$\|J\| \le (1 + \sigma)\|J_*\|. \tag{2.26}$$
Using (2.17) and (2.24), we have
$$\|d\| \le \|(J+E)^{-1}\|(\|r\| + \|e\|) \le \frac{1}{1-\tau}\|J^{-1}\|\big((1+u)\|r\| + \psi(F, v, u, \bar u)\big), \tag{2.27}$$
which gives, using (2.14) and (2.26),
$$u\|d\|\big(\varphi(F, v, n, u) + \|J\|\big) \le \frac{1+u}{1-\tau}\Big[u\|J^{-1}\|\varphi(F, v, n, u) + \frac{1+\sigma}{1-\sigma}\,u\kappa(J_*)\Big]\|r\| + \frac{1}{1-\tau}\Big[u\|J^{-1}\|\varphi(F, v, n, u) + \frac{1+\sigma}{1-\sigma}\,u\kappa(J_*)\Big]\psi(F, v, u, \bar u). \tag{2.28}$$
From Lemma 2.1 we have
$$\|w\| \le \frac{\beta}{2}\|\bar v - v\|^2. \tag{2.29}$$
First, from (2.6), (2.8), (2.27), and (2.14),
$$\|\bar v - v\| \le (1+u)\|d\| + u\|v\| \le \|J_*^{-1}\|\Big[\frac{(1+u)^2}{(1-\tau)(1-\sigma)}\|r\| + \frac{1+u}{(1-\tau)(1-\sigma)}\psi(F, v, u, \bar u)\Big] + u\|v\|. \tag{2.30}$$
Second, from the triangle inequality and Theorem 2.2 we have
$$\|\bar v - v\| \le (G + 1)\|v - v_*\| + g. \tag{2.31}$$
Substituting the product of (2.30) and (2.31) into (2.29) yields
$$\begin{aligned}
\|w\| \le{}& \frac{(1+u)^2(G+1)\beta}{2(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,\|v - v_*\|\,\|r\| + \frac{(1+u)^2\beta}{2(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,g\,\|r\| \\
&+ \frac{(1+u)(G+1)\beta}{2(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,\|v - v_*\|\,\psi(F, v, u, \bar u) + \frac{(1+u)\beta}{2(1-\tau)(1-\sigma)}\|J_*^{-1}\|\,g\,\psi(F, v, u, \bar u) \\
&+ \frac{(G+1)\beta}{2(1-\sigma)}\|J_*^{-1}\|\,\|v - v_*\|\,u\|J\|\,\|v\| + \frac{\beta}{2(1-\sigma)}\,g\,\|J_*^{-1}\|\,u\|J\|\,\|v\|,
\end{aligned} \tag{2.32}$$
where the penultimate and last terms on the right-hand side of the inequality are obtained using $\|J\|\,\|J^{-1}\| \ge 1$ and (2.14). Substituting (2.28) and (2.32) into (2.25) yields
$$\|\bar r\| \le H\|r\| + h,$$
with $H$ and $h$ as in the statement of the theorem.
The theorem shows that if the problem is not too ill conditioned, the solver is not too unstable, the approximation of the Jacobian is accurate enough, and $v$ is sufficiently close to the solution, then the norm of the residual reduces after one step of Newton's method in floating point arithmetic. Note that $H$ does not depend on $\bar u$ so that, as for the forward error analysis, the use of extended precision for computing the residual has no effect on the rate of convergence of Newton's method. With a careful analysis of the constants in Theorem 2.4 we can derive the following corollary.
Fig. 2.2. Behavior of the norm of the residual for $\bar u = u$ (top) and for $\bar u = u^{3/2}$ (bottom), for the same two starting values as in Figure 2.1, together with the predicted limiting residual.
Corollary 2.5. Assume that there is a $v_*$ such that $F(v_*) = 0$, $J_* = J(v_*)$ is nonsingular, and $u\kappa(J_*) \le \frac18$; that the solver satisfies (2.19); and that $g \approx \|J_*^{-1}\|\,\psi(F, v_*, u, \bar u) + u\|v_*\|$ satisfies $\beta g\|J_*^{-1}\| < 1/8$. Then, for all $v_0$ such that $\beta\|J_*^{-1}\|\,\|v_0 - v_*\| \le \frac18$, the norm of the residual decreases until
$$\|F(\hat v_i)\| \lesssim \psi(F, v_*, u, \bar u) + u\|J_*\|\,\|v_*\|. \tag{2.35}$$

Example 1 (continued). For the polynomial of Example 1, the term $u\|J(v_*)\|\,|v_*|$ in (2.35) is of order $10^{-7}u$ and hence negligible, so the limiting residual is dominated by $\psi(F, v, u, \bar u) \approx 10^3\bar u$. As before, we tried both $\bar u = u$ and $\bar u = u^{3/2} \approx 3.3\times 10^{-24}$. The theory predicts that
$$|F(\hat v_i)| \lesssim \begin{cases} 10^{-13} & \text{if } \bar u = u, \\ 10^{-21} & \text{if } \bar u = u^{3/2}. \end{cases}$$
We used the same starting values as before. We plot the behavior of $|F(\hat v_i)|$ for $\bar u = u$ and $\bar u = u^{3/2}$ in Figure 2.2. The results agree well with the predictions.
3. Applications. In this section, we consider several applications. For each of them, we define $F$ and the function $\psi$ and apply our results. We are particularly interested in the effect of mixed precision versus fixed precision for the computation of the residual. The proposed mixed precision BLAS routines (XBLAS) [2] make possible the use of mixed precision in a portable manner.
3.1. Linear systems. We consider the linear system $Ax = b$, where $A \in \mathbb{R}^{n\times n}$ is nonsingular and $b \in \mathbb{R}^n$. Iterative refinement for a computed solution $\hat x$ is simple to describe: compute the residual $r = b - A\hat x$, solve the system $Ad = r$ for the correction $d$, and form the updated solution $y = \hat x + d$. If necessary, repeat the process with $\hat x$ replaced by $y$. This process is equivalent to Newton's method with $F(x) = b - Ax$, for which $J(x) = -A$ and thus $\beta = 0$.
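A compact simulation of this process (our own sketch, not the paper's code: float32 plays the role of the working precision $u$, float64 that of the extended precision $\bar u$, and the LU factors are reused at every step) is:

# Sketch of mixed precision iterative refinement for Ax = b.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

rng = np.random.default_rng(1)
n = 100
A = rng.standard_normal((n, n)).astype(np.float32)   # working precision data
b = rng.standard_normal(n).astype(np.float32)

lu_piv = lu_factor(A)                 # one O(n^3) factorization, reused below
x = lu_solve(lu_piv, b)

for i in range(5):
    # residual in extended precision, then rounded back to working precision
    r64 = b.astype(np.float64) - A.astype(np.float64) @ x.astype(np.float64)
    eta = np.linalg.norm(r64, np.inf) / (
        np.linalg.norm(A, np.inf) * np.linalg.norm(x, np.inf)
        + np.linalg.norm(b, np.inf))                 # normwise backward error
    print(i, eta)
    d = lu_solve(lu_piv, r64.astype(np.float32))     # correction: solve A d = r
    x = x + d                                        # updated solution

The printed backward error should drop to the order of the working precision after one or two steps, in line with Corollary 3.2 below.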
If the residual $r = F(\hat x)$ is computed with the XBLAS routine GEMV_X at precision $\bar u$, then for $\psi$ in (2.3) we can take
$$\psi(F, \hat x, u, \bar u) = \bar\gamma_n(\|A\|\,\|\hat x\| + \|b\|),$$
where $\bar\gamma_n$ is defined in (2.5). Corollary 2.3 then yields the following result.
Corollary 3.1. If $u\kappa(A)$ is sufficiently less than 1 and if the linear system solver is not too unstable, then iterative refinement reduces the relative forward error until
$$\frac{\|\hat x_i - x\|}{\|x\|} \lesssim u + \kappa(A)\bar\gamma_n.$$
If $\bar u = u^2$, then the relative error is of order $u$ provided $n\kappa(A)u \le 1$.
A backward error of an approximate solution $\hat x$ is a measure of the smallest perturbations $\Delta A$ and $\Delta b$ such that $(A + \Delta A)\hat x = b + \Delta b$. The most popular definition of the normwise backward error is
$$\eta(\hat x) = \min\{\epsilon : (A + \Delta A)\hat x = b + \Delta b,\ \|\Delta A\| \le \epsilon\|A\|,\ \|\Delta b\| \le \epsilon\|b\|\}.$$
It can be shown [21] that
$$\eta(\hat x) = \frac{\|r\|}{\|A\|\,\|\hat x\| + \|b\|}.$$
Corollary 2.5 thus yields the following result.

Corollary 3.2. Let iterative refinement be applied to the nonsingular linear system $Ax = b$ of order $n$, with $u\kappa(A) < 1/8$ and using a solver satisfying $u\|A^{-1}\|\,\varphi(A, b, n, u) \le 1/8$. Then the norm of the residual decreases until
$$\|\hat r_i\| \lesssim \max(\bar\gamma_n, u)(\|A\|\,\|\hat x\| + \|b\|),$$
so that iterative refinement yields a small normwise backward error: $\eta(\hat x) \lesssim \max(\bar\gamma_n, u)$.

Corollaries 3.1 and 3.2 are standard normwise results in the literature [17], [18], [20], [22], [27]. They show that we do not lose anything by using our general analysis.
3.2. Generalized eigenvalue problem. Newton's method and its variants have been considered for improving the accuracy of computed eigenvalues and eigenvectors for the standard eigenvalue problem [10], [7], [8], [23], the singular value problem [9], and refining estimates of invariant subspaces [4], [10]. The error analysis in [10] applies to the standard eigenvalue problem $Ax = \lambda x$ and requires that the problem be scaled ($\|A\| = 1$), that the residual be computed in extended precision, and that the linear solver be stable. A lengthy analysis leads to the conclusion that if the problem is not too ill conditioned and the initial guess is good enough, then their refinement procedure yields a relative error of the order of the working precision.

Here, we consider the generalized eigenvalue problem (GEP)
$$Ax = \lambda Bx \quad\text{with } e_s^Tx = 1 \text{ for some fixed } s, \tag{3.1}$$
where $A \in \mathbb{R}^{n\times n}$, $B \in \mathbb{R}^{n\times n}$. Newton-based refinement algorithms for this problem have been proposed [7], [23] but no error analysis has been done.
Define $F: \mathbb{R}^{n+1} \to \mathbb{R}^{n+1}$ by
$$F\left(\begin{bmatrix} x \\ \lambda \end{bmatrix}\right) = \begin{bmatrix} (A - \lambda B)x \\ \alpha(e_s^Tx - 1) \end{bmatrix}, \tag{3.2}$$
where $\alpha = \max(\|A\|, \|B\|)$. Then (3.1) can be stated as finding the zeros of $F(v)$, where $v = [x^T, \lambda]^T$. The function $F$ is continuously differentiable in $\mathbb{R}^{n+1}$ with Jacobian
$$J(v) = \begin{bmatrix} A - \lambda B & -Bx \\ \alpha e_s^T & 0 \end{bmatrix}. \tag{3.3}$$
The scalar $\alpha$ is introduced to make $F$ and $J$ scale linearly when $A$ and $B$ are multiplied by a scalar. For all $v, w \in \mathbb{R}^{n+1}$ and any absolute vector norm we have
$$\|J(w) - J(v)\| \le 2\|B\|\,\|w - v\|,$$
so that $J$ is Lipschitz continuous in $\mathbb{R}^{n+1}$ with constant $\beta = 2\|B\|$.
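In code, $F$ and $J$ of (3.2) and (3.3) are assembled directly. The following sketch is ours (dense numpy arrays, the $\infty$-norm as an arbitrary choice for $\alpha$, and s the index of the normalized component):

# Sketch of F and J for the GEP formulation (3.2)-(3.3).
import numpy as np

def gep_F(A, B, x, lam, s):
    alpha = max(np.linalg.norm(A, np.inf), np.linalg.norm(B, np.inf))
    return np.concatenate([(A - lam * B) @ x,        # (A - lam B) x
                           [alpha * (x[s] - 1.0)]])  # alpha (e_s^T x - 1)

def gep_J(A, B, x, lam, s):
    alpha = max(np.linalg.norm(A, np.inf), np.linalg.norm(B, np.inf))
    n = x.size
    J = np.zeros((n + 1, n + 1))
    J[:n, :n] = A - lam * B          # top left block
    J[:n, n] = -(B @ x)              # derivative with respect to lambda
    J[n, s] = alpha                  # normalization row alpha e_s^T
    return J

One Newton step for (3.1) is then d = np.linalg.solve(gep_J(A, B, x, lam, s), -gep_F(A, B, x, lam, s)).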
The next lemma concerns the singularity of $J$ at a zero of $F$. This result is more general than the one given in [23, p. 120] as it applies to the generalized eigenvalue problem rather than the standard eigenvalue problem and no assumption is made on the nonsingularity of $B$.

Lemma 3.3. Let $v_* = [x_*^T, \lambda_*]^T$ be a zero of $F$ as defined by (3.2) with $\lambda_*$ finite. Then $J(v_*)$ is singular if and only if there is a left eigenvector of $(A, B)$ corresponding to $\lambda_*$ that is $B$-orthogonal to $x_*$.

Proof. Suppose first that $J(v_*)$ is singular. Applying to $J(v_*)$ the determinant identity
$$\det\begin{bmatrix} M & u \\ w^T & 0 \end{bmatrix} = -w^TM^{\rm A}u,$$
where $M^{\rm A}$ is the adjugate (or adjoint) of $M$, we obtain
$$0 = \det(J(v_*)) = \alpha\, e_s^T(A - \lambda_*B)^{\rm A}Bx_*. \tag{3.4}$$
The adjugate has the property that
$$M^{\rm A}M = \det(M)I.$$
Define $y_*^T = e_s^T(A - \lambda_*B)^{\rm A}$. Then
$$y_*^T(A - \lambda_*B) = e_s^T\det(A - \lambda_*B)I = 0,$$
because $\lambda_*$ is an eigenvalue and hence $\det(A - \lambda_*B) = 0$. Using (3.4),
$$y_*^TBx_* = e_s^T(A - \lambda_*B)^{\rm A}Bx_* = 0.$$
If $y_* \ne 0$, it is a left eigenvector corresponding to $\lambda_*$ that is $B$-orthogonal to $x_*$. If $y_* = 0$, one can show that $\lambda_*$ must be an eigenvalue of multiplicity at least two, in which case such a left eigenvector again exists.

For the converse, suppose that there is a left eigenvector $y_*$ corresponding to $\lambda_*$ that is $B$-orthogonal to $x_*$. We have
$$[\,y_*^T \;\; 0\,]\begin{bmatrix} A - \lambda_*B & -Bx_* \\ \alpha e_s^T & 0 \end{bmatrix} = 0,$$
which means that $J(v_*)$ is singular.
In exact arithmetic, Theorem 2.2 applies with $E = 0$ and $\epsilon = u = 0$, so that for all $v_0$ such that $\|v_0 - v_*\| \le 1/(4\|B\|\,\|J_*^{-1}\|)$ the Newton iteration is well defined and the error converges quadratically to zero.
The residual $F(\hat v_i)$ can be computed in mixed precision by the XBLAS routine GE_SUM_MV, so that for $\psi$ in (2.3) we can take
$$\psi(F, \hat v, u, \bar u) = \bar\gamma_n(\|A\|_\infty + |\hat\lambda|\,\|B\|_\infty)\|\hat x\|_\infty. \tag{3.5}$$

Corollary 3.4. Let $\lambda_*$ be a simple eigenvalue of $Ax = \lambda Bx$ and let $x_*$ be the corresponding eigenvector normalized such that $\|x_*\|_\infty = |x_{*s}| = 1$. Assume that $J_*$ in (3.3) is not too ill conditioned, the linear system solver is not too unstable, and $(x_0, \lambda_0)$ is a sufficiently good approximation to $(x_*, \lambda_*)$. Then Newton's method in floating point arithmetic yields approximate eigenpairs $(\hat x_i, \hat\lambda_i)$ whose relative error decreases until
$$\frac{\|(\hat x_i^T, \hat\lambda_i) - (x_*^T, \lambda_*)\|_\infty}{\|(x_*^T, \lambda_*)\|_\infty} \lesssim \bar\gamma_n\|J(v_*)^{-1}\|_\infty\max(\|A\|_\infty, \|B\|_\infty) + u.$$
If $\bar u = u^2$, then
$$\frac{\|(\hat x_i^T, \hat\lambda_i) - (x_*^T, \lambda_*)\|_\infty}{\|(x_*^T, \lambda_*)\|_\infty} \lesssim \gamma_n.$$
Proof. We apply Corollary 2.3 using (3.5) for $\psi(F, v, u, \bar u)$. We have
$$\frac{\|J(v_*)^{-1}\|_\infty}{\|v_*\|_\infty}\,\psi(F, v_*, u, \bar u) = \frac{\|J(v_*)^{-1}\|_\infty}{\|v_*\|_\infty}\,\bar\gamma_n(\|A\|_\infty + |\lambda_*|\,\|B\|_\infty)\|x_*\|_\infty \le \bar\gamma_n\|J(v_*)^{-1}\|_\infty\max(\|A\|_\infty, \|B\|_\infty)\,\frac{1 + |\lambda_*|}{\max(1, |\lambda_*|)} \le 2\bar\gamma_n\|J(v_*)^{-1}\|_\infty\max(\|A\|_\infty, \|B\|_\infty).$$
Since $J(v_*)_{n+1,s} = \alpha$, we have $\|J(v_*)\|_\infty \ge \alpha = \max(\|A\|_\infty, \|B\|_\infty)$. From (2.18), we have $u\kappa_\infty(J(v_*)) < 1$, and if $\bar\gamma_n \approx n\bar u = nu^2$, then
$$\bar\gamma_n\|J(v_*)^{-1}\|_\infty\max(\|A\|_\infty, \|B\|_\infty) \lesssim nu\big(u\|J(v_*)^{-1}\|_\infty\|J(v_*)\|_\infty\big) < nu,$$
which proves the last part of the corollary.
Our result is consistent with the one of Dongarra, Moler, and Wilkinson [10] concerning the standard eigenvalue problem. They showed that their iterative refinement procedure, which is a recasting of Newton's method, yields a forward error of the order of the working precision, assuming that $\|A\| = 1$ and that the residual is computed in extended precision. The normwise backward error of an approximate eigenpair $(\hat x, \hat\lambda)$ is defined by
$$\eta(\hat x, \hat\lambda) = \min\{\epsilon : (A + \Delta A)\hat x = \hat\lambda(B + \Delta B)\hat x,\ \|\Delta A\| \le \epsilon\|A\|,\ \|\Delta B\| \le \epsilon\|B\|\},$$
and it can be shown [14], [25] that
$$\eta(\hat x, \hat\lambda) = \frac{\|r\|}{(\|A\| + |\hat\lambda|\,\|B\|)\|\hat x\|},$$
where $r = A\hat x - \hat\lambda B\hat x$.
Corollary 3.5. Under the same assumptions as in Corollary 3.4, Newton's method for (3.2) in floating point arithmetic yields a backward error for the $\infty$-norm bounded by
$$\eta(\hat x_i, \hat\lambda_i) \lesssim \bar\gamma_n + u(3 + |\lambda|)\max\Big(\frac{\|A\|_\infty}{\|B\|_\infty}, \frac{\|B\|_\infty}{\|A\|_\infty}\Big).$$
Proof. We assume $\|\hat x_i\|_\infty \approx 1$. We have $\psi(F, \hat v_i, u, \bar u) \le \bar\gamma_n(\|A\|_\infty + |\hat\lambda_i|\,\|B\|_\infty)$ and
$$\|\hat v_i\|_\infty \lesssim 1 + |\hat\lambda_i|, \qquad \|J(\hat v_i)\|_\infty \lesssim (3 + |\hat\lambda_i|)\max(\|A\|_\infty, \|B\|_\infty),$$
and $(\|A\|_\infty + |\hat\lambda_i|\,\|B\|_\infty)\|\hat x_i\|_\infty \gtrsim \min(\|A\|_\infty, \|B\|_\infty)(1 + |\hat\lambda_i|)$. Then applying Corollary 2.5 yields the result.
The corollary shows that if $|\lambda|\max(\|A\|_\infty/\|B\|_\infty, \|B\|_\infty/\|A\|_\infty)$ is large, then we cannot guarantee a small backward error. In numerical experiments, we have found that the backward error is small independent of the size of $|\lambda|\max(\|A\|_\infty/\|B\|_\infty, \|B\|_\infty/\|A\|_\infty)$. For the standard eigenvalue problem, $B = I$ and $|\lambda| \le 1$ if $\|A\|_\infty = 1$, as was assumed in [10]. Then the eigenpairs refined by Newton's method have a small backward error.
For the GEP, if the problem is scaled and replaced by $\tilde Ax = \tilde\lambda Bx$, with $\tilde A$ and $\tilde\lambda$ chosen such that $\|\tilde A\|_\infty = \|B\|_\infty$ and $\tilde\lambda = (\|B\|_\infty/\|A\|_\infty)\lambda$, then, for this problem, the backward error bound depends only on the size of $|\tilde\lambda|$. A small $|\tilde\lambda|$ then guarantees a small backward error.

4. Iterative refinement for the symmetric definite GEP. When $B$ is positive definite, the eigenpairs of the symmetric definite GEP can be computed by the Cholesky method: factor $B = LL^T$ and solve the standard symmetric eigenvalue problem for $C = L^{-1}AL^{-T}$. The computed $\hat C$ satisfies
$$\hat C = C + \Delta C, \qquad \|\Delta C\|_2 \lesssim \gamma_n\|B^{-1}\|_2\|A\|_2,$$
so if $B$ is ill conditioned, then $\|\Delta C\|_2/\|C\|_2$ can be large, even if the eigenvalue problem itself is well conditioned.
For problem (3.1), the Newton iteration (2.1) can be written as
$$(A - \lambda_iB)\Delta x_{i+1} - \Delta\lambda_{i+1}Bx_i = -r_i, \qquad e_s^Tx_{i+1} = e_s^Tx_i = 1, \tag{4.1}$$
where $\Delta x_{i+1} = x_{i+1} - x_i$, $\Delta\lambda_{i+1} = \lambda_{i+1} - \lambda_i$, and $r_i = Ax_i - \lambda_iBx_i$. As in [10], [23] we note that $e_s^Tx_0 = 1$ implies $e_s^T\Delta x_{i+1} = 0$ for $i \ge 0$, and thus the $s$th column of $A - \lambda_iB$ does not participate in the product with $\Delta x_{i+1}$. We can replace the $s$th column of $A - \lambda_iB$ by $-Bx_i$ and the component $s$ of $\Delta x_{i+1}$ by $\Delta\lambda_{i+1}$. We define
$$\delta_{i+1} = \Delta x_{i+1} + \Delta\lambda_{i+1}e_s \quad\text{and}\quad M_i = (A - \lambda_iB) - \big((A - \lambda_iB)e_s + Bx_i\big)e_s^T.$$
Then we can rewrite (4.1) as
$$M_i\delta_{i+1} = -r_i, \qquad \lambda_{i+1} = \lambda_i + e_s^T\delta_{i+1}, \qquad x_{i+1} = x_i + \delta_{i+1} - (e_s^T\delta_{i+1})e_s. \tag{4.2}$$
Algorithm 4.1 is a straightforward implementation of iteration (4.2).
Algorithm 4.1 is a straightforward implementation of iteration (4.2).
Algorithm 4.1. Given A, B, and an approximate eigenpair (x, ) with |x|
=
x
s
= 1, this algorithm applies iterative renement to and x:
repeat until convergence
r = Bx Ax (possibly extended precision used)
Form M: the matrix AB with column s replaced by Bx.
Factor PM = LU (LU factorization with partial pivoting)
Solve M = r using the LU factors
= +
s
;
s
= 0
x = x +
end
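A direct transcription of Algorithm 4.1 (our own sketch: dense LU from scipy, residual at working precision, and a convergence test on the size of the correction) is:

# Sketch of Algorithm 4.1: O(n^3) per iteration because M is refactorized.
import numpy as np
from scipy.linalg import lu_factor, lu_solve

def refine(A, B, x, lam, s, maxit=10):
    tol = np.finfo(float).eps
    for _ in range(maxit):
        r = lam * (B @ x) - A @ x            # residual (extended precision possible)
        M = A - lam * B
        M[:, s] = -(B @ x)                   # column s replaced by -Bx
        delta = lu_solve(lu_factor(M), r)    # solve M delta = r
        lam += delta[s]
        delta[s] = 0.0
        x = x + delta
        if np.linalg.norm(delta, np.inf) <= tol * np.linalg.norm(x, np.inf):
            break
    return x, lam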
This algorithm is expensive as each iteration requires $O(n^3)$ flops for the factorization of $M$. If the eigenpairs are approximated by a Cholesky reduction of $A - \lambda B$, then a nonsingular matrix $X$ such that $X^TAX = D = \mathrm{diag}(\lambda_1, \ldots, \lambda_n)$ and $X^TBX = I$ is available. Then
$$-X^Tr_i = X^TM_i\delta_{i+1} = \big[(D - \lambda_iI) - X^T\big((A - \lambda_iB)e_s + Bx_i\big)e_s^TX\big]X^{-1}\delta_{i+1}. \tag{4.3}$$
Defining
$$\tilde D_i = D - \lambda_iI, \qquad v_i = X^T\big((A - \lambda_iB)e_s + Bx_i\big), \qquad f = X^Te_s, \qquad w_{i+1} = X^{-1}\delta_{i+1}, \qquad g_i = -X^Tr_i,$$
(4.3) becomes
$$(\tilde D_i - v_if^T)w_{i+1} = g_i. \tag{4.4}$$
The matrix in (4.4) is a rank-one modification of a diagonal matrix. As $\tilde D_i$ is nearly singular when $\lambda_i$ approaches the solution, we do not invert (4.4) via an explicit formula for the inverse of a rank-one modified matrix [13]. Instead, we compute Givens rotations $J_k$ in the $(k, k+1)$ plane such that $J_1^T\cdots J_{n-1}^Tv_i = \|v_i\|_2e_1$. The matrix $H = J_1^T\cdots J_{n-1}^T\tilde D_i$ is upper Hessenberg, as is the matrix
$$J_1^T\cdots J_{n-1}^T(\tilde D_i - v_if^T) = H - \|v_i\|_2e_1f^T = H_1.$$
Using a QR factorization of $H_1$, the solution of (4.4) can be computed in $O(n^2)$ flops.
Algorithm 4.2. Given $A$, $B$, $X$, and $D$ such that $X^TAX = D$ and $X^TBX = I$, and an approximate eigenpair $(x, \lambda)$ with $\|x\|_\infty = x_s = 1$, this algorithm applies iterative refinement to $\lambda$ and $x$ at a cost of $O(n^2)$ flops per iteration.

repeat until convergence
  $r = \lambda Bx - Ax$ (possibly extended precision used)
  $\tilde D = D - \lambda I$
  $d = -Bx - c_s$, where $c_s$ is the $s$th column of $A - \lambda B$
  $v = X^Td$; $f = X^Te_s$
  Compute Givens rotations $J_k$ in the $(k, k+1)$ plane, such that $Q_1^Tv := J_1^T\cdots J_{n-1}^Tv = \|v\|_2e_1$
  Compute orthogonal $Q_2$ such that $T = Q_2^TQ_1^T(\tilde D + vf^T)$ is upper triangular
  $z = Q_2^TQ_1^TX^Tr$
  Solve $Tw = z$ for $w$
  $\delta = Xw$
  $\lambda = \lambda + \delta_s$; $\delta_s = 0$
  $x = x + \delta$
end
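The kernel of Algorithm 4.2 is the $O(n^2)$ solution of the rank-one-modified diagonal system. The sketch below is our own illustration (dense rows are updated explicitly, so each sweep costs $O(n^2)$ flops; pivoting and scaling safeguards are omitted):

# Sketch of the O(n^2) solver for (diag(d) + v f^T) w = z used in Algorithm 4.2.
import numpy as np

def givens(a, b):
    """Rotation (c, s) with c*a + s*b = r and -s*a + c*b = 0."""
    r = np.hypot(a, b)
    return (1.0, 0.0) if r == 0.0 else (a / r, b / r)

def solve_diag_plus_rank1(d, v, f, z):
    n = d.size
    H = np.diag(d).astype(float)
    v = v.astype(float).copy()
    z = z.astype(float).copy()
    # Sweep 1: rotations in the (k, k+1) planes, bottom up, so that
    # Q1^T v = ||v||_2 e_1; the same rotations make Q1^T diag(d) upper Hessenberg.
    for k in range(n - 2, -1, -1):
        c, s = givens(v[k], v[k + 1])
        v[k], v[k + 1] = c * v[k] + s * v[k + 1], 0.0
        H[[k, k + 1]] = [c * H[k] + s * H[k + 1], -s * H[k] + c * H[k + 1]]
        z[k], z[k + 1] = c * z[k] + s * z[k + 1], -s * z[k] + c * z[k + 1]
    H[0] += v[0] * f               # H1 = Q1^T diag(d) + ||v||_2 e_1 f^T
    # Sweep 2: QR factorization of the upper Hessenberg H1 by Givens rotations.
    for k in range(n - 1):
        c, s = givens(H[k, k], H[k + 1, k])
        H[[k, k + 1]] = [c * H[k] + s * H[k + 1], -s * H[k] + c * H[k + 1]]
        z[k], z[k + 1] = c * z[k] + s * z[k + 1], -s * z[k] + c * z[k + 1]
    w = np.zeros(n)
    for k in range(n - 1, -1, -1):             # back substitution with T = triu(H)
        w[k] = (z[k] - H[k, k + 1:] @ w[k + 1:]) / H[k, k]
    return w

Algorithm 4.2 calls such a kernel with d the diagonal of $\tilde D$, $v = X^Td$, $f = X^Te_s$, and $z = X^Tr$, and recovers the correction as $\delta = Xw$.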
When $B$ is ill conditioned, the computed $\hat X$ may be inaccurate, so that $\hat X^TA\hat X = D + \Delta D$, $\hat X^TB\hat X = I + \Delta I$, with possibly large $\|\Delta D\|$ and $\|\Delta I\|$. Then the procedure used in Algorithm 4.2 to solve $M\delta = r$ may be unstable: $\hat\delta$ is the exact solution of $(M + \Delta M)\hat\delta = r$ with a possibly large $\|\Delta M\|$. However, the theory shows that allowing some instability in the solver and inaccurate evaluation of the Jacobian (assumptions (2.19) and (2.23)) may affect the rate of convergence of the Newton process but not the limiting accuracy and backward error.
In the numerical experiments below, we use the hat notation $(\hat x, \hat\lambda)$ for the approximate eigenpair taken as initial guess. We need to define several quantities:
$E_{\rm rel}(\hat x, \hat\lambda) = \|(x, \lambda) - (\hat x, \hat\lambda)\|_\infty/\|(x, \lambda)\|_\infty$ is the relative error of the approximate eigenpair;
$$\mathrm{cond}(\lambda) = \frac{(\|A\|_2 + |\lambda|\,\|B\|_2)\|x\|_2\|y\|_2}{|\lambda|\,|y^TBx|}$$
Table 4.1
Relative errors, condition numbers, and backward error for Example 1.

i   λ_i     E_rel(x̂_i, λ̂_i)   cond(λ_i)   η(x̂_i, λ̂_i)
1   0.62    6e-5               41          4e-6
2   1.63    6e-5               120         2e-6
3   9e17    9e-5               6e18        2e-20
Table 4.2
Backward error and relative error for the two smallest eigenpairs of Example 1.

                              Algorithm 4.1                          Algorithm 4.2
λ_i    η_est    E_est_rel   it   η(x̂_i, λ̂_i)   E_rel(x̂_i, λ̂_i)   it   η(x̂_i, λ̂_i)   E_rel(x̂_i, λ̂_i)
0.62   1e-16    1e-14       3    2e-17          2e-16              4    6e-17          4e-16
1.63   1e-16    1e-14       3    3e-17          4e-16              4    4e-17          7e-16
is the condition number of the eigenvalue $\lambda$, where $y$ is a left eigenvector corresponding to $\lambda$ [14];
$$\eta(\hat x, \hat\lambda) = \|A\hat x - \hat\lambda B\hat x\|_\infty\big/\big((\|A\|_\infty + |\hat\lambda|\,\|B\|_\infty)\|\hat x\|_\infty\big)$$
is the backward error of the approximate eigenpair $(\hat x, \hat\lambda)$;
$$E^{\rm est}_{\rm rel} = \|J^{-1}\|_\infty\,\bar u\,(\|A\|_\infty + |\lambda|\,\|B\|_\infty)\|x\|_\infty/\|(x^T, \lambda)\|_\infty + u$$
is an approximation of the theoretical bound (2.21) for the relative forward error, where the Jacobian matrix $J$ is given by (3.3) and $\psi(F, v, u, \bar u)$ is given by (3.5) with $\bar\gamma_n \approx \bar u$; and finally, $\eta_{\rm est}$ is the theoretical bound of Corollary 3.5 for the backward error of the refined eigenpair $(\hat x, \hat\lambda)$.

Example 1. The pair $(A, B)$ is such that $B$ is very ill conditioned ($\kappa(B) = 7\times 10^{18}$).
We refined the two smallest eigenvalues using Algorithm 4.1 and Algorithm 4.2 with the approximate eigenpairs as initial guess and the residual computed at working precision ($\bar u = u \approx 1.1\times 10^{-16}$). We terminated the iteration when the norm of the correction stopped decreasing. The results are given in Table 4.2, where it is the number of iterations required for convergence. Algorithm 4.2 uses an unstable solver and therefore requires one more iteration. However, the accuracy and stability are unaffected by this unstable solver. Both algorithms produce refined eigenpairs with a small backward error and a relative error as predicted by the theory.
Example 2. We would like to test the sharpness of the residual bound in Corollary 2.5 and the backward error bound in Corollary 3.5. We consider an example for which $|\lambda|\max(\|A\|_\infty/\|B\|_\infty, \|B\|_\infty/\|A\|_\infty)$ is large.

Table 4.3
Refinement results for some eigenpairs of Example 2.

λ_i     cond(λ_i)   η(x̂_i, λ̂_i)   r_est     η_est     ‖r̂‖       η(x̂_i, λ̂_i) after   it
7.1e5   2.0         1e-5           3.1e-4    9.1e-5    1.2e-10    5.2e-17             5
5.6e6   9.0         2e-6           1.1e-2    7.3e-4    4.7e-10    4.3e-17             4
2.0e7   29.2        9e-7           1.5e-1    2.6e-3    1.0e-9     2.9e-17             3
3.3e7   48.7        7e-7           4.3e-1    4.3e-3    1.6e-9     2.7e-17             5
4.3e7   62.9        2e-7           7.4e-1    5.6e-3    1.7e-9     2.2e-17             3
Table 4.4
Relative error for the computed and refined eigenpairs of Example 3 using working and double precision in the computation of the residual.

                                Before refinement     After refinement, ū = u       After refinement, ū = u²
λ_i       cond(λ_i)   E_rel(x̂_i, λ̂_i)     E_est_rel   E_rel(x̂_i, λ̂_i)     E_est_rel   E_rel(x̂_i, λ̂_i)
2.4e-7    1.8e6       1.3e-8               1.0e-11     2.0e-13              2.2e-16     1.1e-16
2.2e-5    2.0e4       2.1e-8               1.3e-11     7.3e-13              2.2e-16     2.2e-16
8.2e-4    5.3e2       1.0e-9               3.3e-13     1.8e-14              2.2e-16     1.1e-16
1.4e-2    4.0e1       6.9e-11              3.4e-14     2.0e-15              2.2e-16     1.1e-16
2.9e-2    4.6e0       4.3e-11              2.8e-14     5.6e-16              2.2e-16     1.1e-16
1.2e-1    1.5e1       2.6e-11              1.7e-14     5.6e-16              2.2e-16     1.1e-16
1.7e-1    7.4e0       3.6e-11              3.0e-14     1.3e-15              2.2e-16     1.1e-16
3.0e-1    1.1e1       3.0e-11              2.0e-13     2.2e-15              2.2e-16     5.6e-17
3.1e-1    1.2e1       3.4e-11              2.1e-13     7.8e-16              2.2e-16     5.6e-17
9.2e4     3.7e6       1.6e-16              1.5e-9      4.1e-12              2.2e-16     0.0e0
We took $n = 20$, $A = 10^6I$, and $B = 10^{-2}M$, where $M$ is the Moler matrix, and computed the approximate eigenpairs using the Cholesky reduction. Instabilities are expected as $\kappa(B) = 2\times 10^{13}$. All the eigenpairs have a large backward error and a small condition number except the largest one. We refined using Algorithm 4.1. Results for some eigenpairs are given in Table 4.3, where
$$r_{\rm est} = \bar u(\|A\|_\infty + |\lambda|\,\|B\|_\infty)\|x\|_\infty + u\|J\|_\infty\|(x^T, \lambda)\|_\infty$$
is the theoretical bound (2.35) for the norm of the residual. This example corresponds to the bad case where $|\lambda|\max(\|A\|_\infty/\|B\|_\infty, \|B\|_\infty/\|A\|_\infty)$ is large, which explains why the theoretical estimates are so pessimistic. The estimates are sharp when the pair $(A, B)$ is scaled such that $\|A\|_\infty = \|B\|_\infty$ and the eigenpair is refined on the reverse problem $(B, A)$ if $|\lambda_i|$ is large. We have generated many pairs $(A, B)$ with a large value of $\max(\|A\|_\infty/\|B\|_\infty, \|B\|_\infty/\|A\|_\infty)$ and large eigenvalues, for which the theory predicts a large backward error. For all of them, iterative refinement yields a small backward error as long as the initial guess is good enough for Newton's method to converge.
Example 3. We illustrate how using extended precision in computation of the residual yields a small relative error. Let $A$ be the Prolate matrix of size $n = 10$ of the Test Matrix Toolbox [15], and let $B$ be the Moler matrix. We used the Symbolic Math Toolbox of MATLAB to compute the exact eigenpairs of $(A, B)$ and the Cholesky reduction method to approximate the eigenpairs. We give the results in Table 4.4. We refined using both working precision ($\bar u = u$) and double precision ($\bar u = u^2$) for the computation of the residual. For eigenpairs such that $E_{\rm rel}(\hat x_i, \hat\lambda_i) > E^{\rm est}_{\rm rel}$, iterative refinement leads to $E_{\rm rel}(\hat x_i, \hat\lambda_i) < E^{\rm est}_{\rm rel}$ after two iterations. For the largest eigenvalue, $E_{\rm rel}(\hat x_i, \hat\lambda_i) \ll E^{\rm est}_{\rm rel}$ for $\bar u = u$, which means that the approximate eigenpair is appreciably more accurate than the limiting accuracy. In this case, one single step of iterative refinement is enough to spoil the good initial approximation. If $\bar u = u^2$, all the eigenpairs are computed to high relative accuracy, as expected from the theory (Corollary 3.4).
For further numerical examples of iterative refinement for the Cholesky-QR method, see [3].
5. Conclusions. We have analyzed Newton's method in floating point arithmetic, allowing for extended precision in computation of the residual, inaccurate evaluation of the Jacobian, and a possibly unstable solver. We estimated the limiting accuracy and the smallest residual norm. We showed that the accuracy with which the residual is computed affects the limiting accuracy. The limiting residual norm depends on two terms, one of them independent of the accuracy used in evaluating the residual.

We applied our results to iterative refinement for the generalized eigenvalue problem. We showed that high accuracy for the refined eigenpairs is guaranteed, under suitable assumptions, if twice the working precision is used for the computation of the residual. We also showed that if the pair $(A, B)$ is well balanced ($\|A\| \approx \|B\|$), working precision in evaluating the residual is enough for iterative refinement to yield a small backward error.

Finally, we examined in detail how iterative refinement can be used to improve the forward and backward error of computed eigenpairs for the symmetric definite GEP. We used two refinement algorithms, one of them with an unstable solver. We confirmed that the unstable solver affects the convergence but not the limiting accuracy and backward error. In practice, the assumption that the pair $(A, B)$ is well balanced does not seem to be necessary. We have not been able to generate an example for which iterative refinement fails to yield a small backward error for pairs $(A, B)$ for which $\max(\|A\|/\|B\|, \|B\|/\|A\|)$ is large. This suggests that the bound of Corollary 3.5 is pessimistic. Deriving a sharper bound remains an open problem.

In future work, we plan to investigate iterative refinement for the quadratic eigenvalue problem, for which there are no proven backward stable algorithms [25].

Acknowledgments. I thank the referees for valuable suggestions that improved the paper.
REFERENCES
[1] A. L. Andrew, K.-W. E. Chu, and P. Lancaster, Derivatives of eigenvalues and eigenvectors of matrix functions, SIAM J. Matrix Anal. Appl., 14 (1993), pp. 903–926.
[2] BLAS Technical Forum Standard, International Journal of High Performance Computing Applications, to appear. Available online at http://www.netlib.org/blas/blast-forum/.
[3] P. I. Davies, N. J. Higham, and F. Tisseur, Analysis of the Cholesky Method with Iterative Refinement for Solving the Symmetric Definite Generalized Eigenproblem, Numerical Analysis Report No. 360, Manchester Centre for Computational Mathematics, Manchester, UK, 2000.
[4] J. W. Demmel, Three methods for refining estimates of invariant subspaces, Computing, 38 (1987), pp. 43–57.
[5] J. E. Dennis, Jr. and R. B. Schnabel, Numerical Methods for Unconstrained Optimization and Nonlinear Equations, Prentice-Hall, Englewood Cliffs, NJ, 1983.
[6] J. E. Dennis, Jr. and H. F. Walker, Inaccuracy in quasi-Newton methods: Local improvement theorems, Math. Programming Stud., 22 (1984), pp. 70–85.
[7] J. J. Dongarra, Improving the accuracy of computed matrix eigenvalues, Preprint ANL-80-84, Mathematics and Computer Science Division, Argonne National Laboratory, Argonne, IL, 1980.
[8] J. J. Dongarra, Algorithm 589 SICEDR: A FORTRAN subroutine for improving the accuracy of computed matrix eigenvalues, ACM Trans. Math. Software, 8 (1982), pp. 371–375.
[9] J. J. Dongarra, Improving the accuracy of computed singular values, SIAM J. Sci. Statist. Comput., 4 (1983), pp. 712–719.
[10] J. J. Dongarra, C. B. Moler, and J. H. Wilkinson, Improving the accuracy of computed eigenvalues and eigenvectors, SIAM J. Numer. Anal., 20 (1983), pp. 23–45.
[11] A. R. Ghavimi and A. J. Laub, Backward error, sensitivity, and refinement of computed solutions of algebraic Riccati equations, Numer. Linear Algebra Appl., 2 (1995), pp. 29–49.
[12] G. H. Golub and C. F. Van Loan, Matrix Computations, 3rd ed., Johns Hopkins University Press, Baltimore, MD, 1996.
[13] H. V. Henderson and S. R. Searle, On deriving the inverse of a sum of matrices, SIAM Rev., 23 (1981), pp. 53–60.
[14] D. J. Higham and N. J. Higham, Structured backward error and condition of generalized eigenvalue problems, SIAM J. Matrix Anal. Appl., 20 (1998), pp. 493–512.
[15] N. J. Higham, The Test Matrix Toolbox for Matlab (version 3.0), Numerical Analysis Report No. 276, Manchester Centre for Computational Mathematics, Manchester, UK, 1995.
[16] N. J. Higham, Accuracy and Stability of Numerical Algorithms, SIAM, Philadelphia, 1996.
[17] N. J. Higham, Iterative refinement for linear systems and LAPACK, IMA J. Numer. Anal., 17 (1997), pp. 495–509.
[18] M. Jankowski and H. Wozniakowski, Iterative refinement implies numerical stability, BIT, 17 (1977), pp. 303–311.
[19] P. Lancaster, Error analysis for the Newton-Raphson method, Numer. Math., 9 (1966), pp. 55–68.
[20] C. B. Moler, Iterative refinement in floating point, J. Assoc. Comput. Mach., 14 (1967), pp. 316–321.
[21] J. L. Rigal and J. Gaches, On the compatibility of a given solution with the data of a linear system, J. Assoc. Comput. Mach., 14 (1967), pp. 543–548.
[22] R. D. Skeel, Iterative refinement implies numerical stability for Gaussian elimination, Math. Comp., 35 (1980), pp. 817–832.
[23] H. J. Symm and J. H. Wilkinson, Realistic error bounds for a simple eigenvalue and its associated eigenvector, Numer. Math., 35 (1980), pp. 113–126.
[24] R. A. Tapia, The Kantorovich theorem for Newton's method, Amer. Math. Monthly, 78 (1971), pp. 389–392.
[25] F. Tisseur, Backward error and condition of polynomial eigenvalue problems, Linear Algebra Appl., 309 (2000), pp. 339–361.
[26] S. Wang and S. Zhao, An algorithm for Ax = λBx with symmetric and positive-definite A and B, SIAM J. Matrix Anal. Appl., 12 (1991), pp. 654–660.
[27] J. H. Wilkinson, Rounding Errors in Algebraic Processes, Notes on Applied Science No. 32, Her Majesty's Stationery Office, London, 1963. Also published by Prentice-Hall, Englewood Cliffs, NJ, 1963. Reprinted by Dover, New York, 1994.
[28] H. Wozniakowski, Numerical stability for solving nonlinear equations, Numer. Math., 27 (1977), pp. 373–390.
[29] T. J. Ypma, The effect of rounding errors on Newton-like methods, IMA J. Numer. Anal., 3 (1983), pp. 109–118.
[30] T. J. Ypma, Local convergence of inexact Newton methods, SIAM J. Numer. Anal., 21 (1984), pp. 583–590.