
The Singular Value Decomposition in Multivariate Statistics

Sven Hammarling, NAG Central Office,


256 Banbury Road, Oxford OX2 7DE, UK

To Gene Golub who has done so much to encourage and advance the use of stable numerical
techniques in multivariate statistics.

1. Introduction

Many multivariate techniques in statistics are described in terms of an appropriate
sums of squares and cross products matrix, such as a covariance matrix, or a
correlation matrix, rather than in terms of the original data matrix. While this
is frequently the best way of understanding and analysing a technique, it is not
necessarily the most satisfactory approach for implementing the technique
computationally. From a numerical point of view, it is usually better to work with
the data matrix and avoid the formation of a sums of squares and cross products
matrix.

This is a review article aimed at the statistician and mathematician who, while not
being expert numerical analysts, would like to gain some understanding of why it is
better to work with the data matrix, and of the techniques that allow us to avoid
the explicit computation of sums of squares and cross products matrices. To give a
focus and to keep the article of moderate length, we concentrate in particular on
the use of the singular value decomposition and its application to multiple
regression problems. In the final two sections we give a brief discussion of
principal components, canonical correlations and the generalized singular value
decomposition.

2. Notation

Rather than use the standard notation of numerical linear algebra, we use notation
that is more akin to that of the statistician and so we let X denote an n by p data
matrix (design matrix, matrix of observations), where p is the number of variables
and n is the number of data points (objects, individuals, observations). Let x_j
denote the j-th column of X, so that

   X = [x_1  x_2  ...  x_p]                                              (2.1)

and x_j is the n element vector of sample observations for the j-th variable. Let x̄_j
and s_j be respectively the sample mean and standard deviation for the j-th variable
and define x̄, D and e respectively as

   x̄ = [x̄_1, x̄_2, ..., x̄_p]^T,    D = diag(d_j),    e = [1, 1, ..., 1]^T,    (2.2)

where

   d_j = 1/s_j  if s_j ≠ 0,    d_j = 0  if s_j = 0.

Then the matrix

   X̄ = X - ex̄^T                                                          (2.3)

is the zero means data matrix and the matrix

   X̃ = X̄D                                                                (2.4)

is the standardized data matrix, because the mean of each column of X̃ is zero and
the variance of each column is unity, unless s_j = 0, in which case the j-th column
is zero.

The normal matrices X^T X and X̄^T X̄ are the sums of squares and cross products matrix
and the corrected sums of squares and cross products matrix respectively, the
matrix

   C = (1/(n-1)) X̄^T X̄                                                   (2.5)

is the sample covariance matrix and

   R = (1/(n-1)) X̃^T X̃                                                   (2.6)

is the sample correlation matrix, with r_ij as the sample correlation coefficient of
variables x_i and x_j. Of course

   R = DCD

and both C and R are symmetric non-negative definite.

The notation ||z|| and ||Z|| will be used to denote respectively the Euclidean
length of the n element vector z and the spectral or two norm of the n by p matrix
Z, given by

   ||z|| = (Σ_{i=1}^n z_i²)^(1/2),    ||Z|| = max_{||z||=1} ||Zz|| = (ρ(Z^T Z))^(1/2),

where ρ(Z^T Z) denotes the spectral radius (largest eigenvalue) of Z^T Z. The reason
for our interest in these particular norms is that when Z is orthogonal then

   ||Zz|| = ||z||   and   ||Z|| = 1    (Z^T Z = I).

A detailed knowledge of the spectral norm of a matrix is not important here and to
give a feel for its size in relation to the elements of Z we note that

   ||Z|| ≤ (Σ_{j=1}^p Σ_{i=1}^n z_ij²)^(1/2) ≤ p^(1/2) ||Z||.

For much of the time we shall use X generically to represent X or X̄ or X̃.

3. Instability in Forming Normal Matrices

For numerical stability it is frequently desirable to avoid forming normal
matrices, but instead use algorithms that work directly on the data matrices (see
for example Golub, 1965). This can be especially important when small
perturbations in the data can change, or come close to changing, the rank of the
data matrix. In such cases the normal matrix will be much more sensitive to
perturbations in the data than the data matrix.

A well known example is provided by the matrix

   X = [ 1  1 ]
       [ ε  0 ] ,   ε ≠ 0,   for which   X^T X = [ 1+ε²    1   ]
       [ 0  ε ]                                  [  1    1+ε²  ] .

Perturbations of order ε are required to change the rank of X, whereas
perturbations of only ε² are required to change the rank of X^T X. This could be
particularly disastrous if |ε| is above noise level, while ε² is close to or below
noise level.
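
As a small illustration, the following NumPy sketch (an assumed example, not code
referred to in the text) chooses ε so that ε² falls below double precision noise:
the data matrix keeps its full numerical rank, while the computed X^T X becomes
exactly singular.

```python
import numpy as np

eps = 1.0e-8                      # eps**2 = 1e-16 lies below double precision noise
X = np.array([[1.0, 1.0],
              [eps, 0.0],
              [0.0, eps]])

XtX = X.T @ X                     # fl(1 + eps**2) rounds to 1, so XtX = [[1, 1], [1, 1]]

print(np.linalg.matrix_rank(X))    # 2: the data matrix still has full rank
print(np.linalg.matrix_rank(XtX))  # 1: the normal matrix has lost the rank information
print(np.linalg.svd(X, compute_uv=False))    # singular values roughly [1.414, 1e-8]
print(np.linalg.eigvalsh(XtX))               # eigenvalues roughly [0, 2]
```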
A second example is provided by the case where X is a square non-singular matrix.
The sensitivity of the solution of the equations

   Xb = y                                                                (3.1)

to perturbations in X and y is determined by the size of the condition number of X


with respect to inversion, c(X), given by

   c(X) = ||X|| ||X^(-1)||.                                              (3.2)

(Wilkinson, 1963 and 1965; Forsythe and Moler, 1967.) Specifically, if we perturb
X by a matrix E, then the solution of the perturbed equations

(X+E) (b+e) = y (3.3)

satisfies

   ||e|| / ||b+e|| ≤ c(X) ||E|| / ||X||.                                 (3.4)

For the spectral norm it can readily be shown that

   c(X^T X) = c(X)²    (note that c(X) ≥ 1),                             (3.5)

so that unless c(X) = 1, which occurs only when X is orthogonal, X^T X is more
sensitive to perturbations than X. From (3.5) we once again see that perturbations
of order ε² in X^T X can have the same effect as perturbations of order ε in X.

In terms of solving a system of equations, (3.4) and (3.5) imply that if rounding
errors or data perturbations (noise) mean that we might lose t digits of accuracy,
compared to the accuracy of the data, when solving equations with X as the matrix
of coefficients, then we should expect to lose 2t digits of accuracy when solving
equations with X^T X as the coefficient matrix.
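
A short sketch of (3.5), again assuming NumPy purely as an illustration: the
condition number of the normal matrix is the square of that of the data matrix, so
roughly twice as many digits are at risk when X^T X is used.

```python
import numpy as np

rng = np.random.default_rng(0)
# build a 100 by 5 data matrix with condition number about 1e5
Q, _ = np.linalg.qr(rng.standard_normal((100, 5)))
X = Q * np.logspace(0, -5, 5)          # scale the columns of an orthonormal basis

print(np.linalg.cond(X))               # about 1e5
print(np.linalg.cond(X.T @ X))         # about 1e10 = (1e5)**2
```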

Similar remarks apply to the sensitivity of the solution of linear least squares
(multiple regression) problems when X is not square and the residual (error) vector
is small relative to the solution; once again it is advisable to avoid forming the
normal equations in order to solve the least squares problem. (Detailed analyses
can be found in Golub and Wilkinson, 1966; Lawson and Hanson, 1974; Stewart, 1977.)
We are not trying to imply that normal matrices should be avoided at all costs.
When X is close to being orthogonal then the normal matrix X^T X will be well-
conditioned. But the additional sensitivity of X^T X is a real phenomenon, not just
a figment of the numerical analyst's imagination, and since perturbations in X do
not map linearly into perturbations in X^T X, perturbation and rounding error
analyses become difficult to interpret when X^T X is used in place of X, and
decisions about rank and linear dependence (multicollinearity) are harder to make.

Of course normal matrices, particularly correlation matrices, provide vital


statistical information, but the methods to be discussed provide ready access to
the elements of such matrices.

4. The QU Factorization and the Singular Value Decomposition

In this section we briefly introduce and discuss two tools that allow us to avoid
forming normal matrices. These tools are the well known factorizations, the QU
factorization (or QR factorization, but not to be confused with the QR algorithm)
and the singular value decomposition (commonly referred to as the SVD). For
simplicity of discussion we shall assume that n ≥ p, so that X has at least as many
rows as columns. We shall also not discuss the details of the computational
algorithms for finding the factorizations, but instead give suitable references for
such descriptions. Suffice it to say that both factorizations may be obtained by
numerically stable methods and there are a number of sources of quality software
that implement these methods (IMSL; NAG; Dongarra et al, 1979; Chan, 1982).

The QU factorization of a matrix X is given by

   X = Q [ U ]
         [ 0 ] ,                                                         (4.1)

where Q is an n by n orthogonal matrix, so that Q^T Q = I, and U is a p by p upper
triangular matrix. Of course the rank of U is the same as that of X and when n = p
the portion below U does not exist.

The QU factorization of X always exists and may be found, for example, by


Householder transformations, plane rotations, or Gram-Schmidt orthogonalization.
(Wilkinson, 1965; Golub, 1965; Stewart, 1974; Golub and Van Loan, 1983.)

Two features of the QU factorization are important for our purposes. Firstly we
see that

   X^T X = U^T U,                                                        (4.2)

so the elements of X^T X can readily be computed from the inner products of columns
of U, which means that U gives a convenient and compact representation of X^T X. In
fact, as with X^T X, we need only p(p+1)/2 storage locations for the non-zero elements
of U. The matrix U is the Cholesky factor of X^T X. Secondly, if we perturb U by a
matrix F then

   X + E = Q [ U+F ] ,   that is   E = Q [ F ] ,                         (4.3)
             [  0  ]                     [ 0 ]

and since Q is orthogonal

   ||F|| = ||E||,                                                        (4.4)

so that a perturbation of order ε in U corresponds to a perturbation of the same
order of magnitude in X.

Q is an n by n matrix and so it is large if there are a large number of data
points, but Q is rarely required explicitly; instead what is usually required is a
vector, or part of a vector, of the form Q^T y, for a given y, and this can be
computed at the same time as the QU factorization is computed.
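
For readers who wish to experiment, the following assumed NumPy sketch computes the
QU factorization (NumPy returns the triangular factor as R, playing the role of U
here) and checks the two properties just described; a library routine would
accumulate Q^T y during the factorization rather than forming Q explicitly.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 200, 4
X = rng.standard_normal((n, p))

Q1, U = np.linalg.qr(X, mode='reduced')   # Q1 is n by p (first p columns of Q), U is p by p

# (4.2): X^T X = U^T U, so U is a compact representation of the normal matrix
print(np.allclose(X.T @ X, U.T @ U))      # True

# Q itself is rarely needed; what is usually wanted is Q^T y for a given y
y = rng.standard_normal(n)
print(np.allclose(Q1.T @ y, np.linalg.solve(U.T, X.T @ y)))  # two ways to get the first p elements of Q^T y
```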

The singular value decomposition (SVD) of a matrix X is given by

   X = Q [ Ψ ] P^T,                                                      (4.5)
         [ 0 ]

where again Q is an n by n orthogonal matrix, P is a p by p orthogonal matrix and
Ψ is a p by p diagonal matrix,

   Ψ = diag(ψ_i),

with non-negative diagonal elements. The factorization can be chosen so that

   ψ_1 ≥ ψ_2 ≥ ... ≥ ψ_p ≥ 0                                             (4.6)

and we shall assume this to be the case. As with the QU factorization the SVD
always exists and is usually obtained by reducing X to bidiagonal form and then
applying a variant of the QR algorithm to reduce this to Ψ (Golub and Kahan,
1965; Golub and Reinsch, 1970; Wilkinson, 1977 and 1978). The ψ_i, i = 1, 2, ..., p,
are the singular values of X, the columns of P are the right singular vectors of X
and the first p columns of Q are the left singular vectors of X. We have adopted
here the notation of Stewart (1984) in avoiding the more usual σ_i for the i-th
singular value. If we denote the i-th columns of P and Q by p_i and q_i respectively
then equation (4.5) implies that

   Xp_i = ψ_i q_i,   i = 1, 2, ..., p.                                   (4.7)

For this factorization we have that

   X^T X = PΨ²P^T,                                                       (4.8)

which is the classical spectral factorization of X^T X. Thus the columns of P are
the eigenvectors of X^T X and the values ψ_i², i = 1, 2, ..., p, are the eigenvalues of
X^T X.

Ψ and P give us an alternative representation for X^T X, although not quite as
compact as U since we now need p(p+1) storage locations, but having the advantage
that the columns of P are orthonormal. Note that (4.8) implies that

   ||X|| = ψ_1.                                                          (4.9)

Analogously to equations (4.3) to (4.4), if we perturb Ψ by a matrix F then

   X + E = Q [ Ψ+F ] P^T                                                 (4.10)
             [  0  ]

and

   ||F|| = ||E||,                                                        (4.11)

so that again perturbations of order ε in Ψ correspond to perturbations of the same
order of magnitude in X.

The SVD is important in multivariate analysis because it provides the most reliable
method of determining the numerical rank of a matrix and can be a great aid in
analysing near multicollinearities in the data.
Of course if X is exactly of rank k < p then from (4.5) and (4.6) we must have

   ψ_{k+1} = ψ_{k+2} = ... = ψ_p = 0

and from (4.7)

   Xp_i = 0,   i = k+1, k+2, ..., p,

so that these columns of P form an orthonormal basis for the null space of X. If X
is of rank p, but we choose the matrix F in equation (4.10) to be the diagonal
matrix

   F = diag(f_i),   f_i = 0,  i = 1, 2, ..., k;   f_i = -ψ_i,  i = k+1, k+2, ..., p,    (4.12)

then (X+E) is of rank k and from (4.11)

   ||E|| = ψ_{k+1},                                                      (4.13)

so that regarding a small singular value of X as zero corresponds to making a


perturbation in X whose size is of the same order of magnitude as that of the
singular value.

Conversely, if X is of rank p, but E is a matrix such that the perturbed matrix
(X+E) is of rank k < p, then it can readily be shown (Wilkinson, 1978) that

   Σ_{j=1}^p Σ_{i=1}^n e_ij² ≥ Σ_{i=k+1}^p ψ_i²,                         (4.14)

so that if the elements of E are small then the singular values ψ_{k+1}, ψ_{k+2}, ..., ψ_p
must also be small. Thus if X has near multicollinearities, then the
appropriate number of singular values of X must be small. To appreciate the
strength of this statement consider the p by p matrix
strength of this statement consider the p by p matrix

   U = [ 1  -1  -1  ...  -1  -1  -1 ]
       [ 0   1  -1  ...  -1  -1  -1 ]
       [ 0   0   1  ...  -1  -1  -1 ]
       [ :   :   :        :   :   : ]
       [ 0   0   0  ...   1  -1  -1 ]
       [ 0   0   0  ...   0   1  -1 ]
       [ 0   0   0  ...   0   0   1 ]

U is clearly of full rank, p, but its appearance belies its closeness to a rank
deficient matrix. If we put

   E = [     0      0  ...  0 ]
       [     0      0  ...  0 ]
       [     :      :       : ]
       [     0      0  ...  0 ]
       [ -2^(2-p)   0  ...  0 ]

then the matrix (U+E) has rank (p-1), so that when p is not small U is almost rank
deficient. On the other hand (4.14) assures us that

   ψ_p ≤ 2^(2-p),

so that the near rank deficiency will be clearly exposed by the singular values.
For instance, when p = 32, so that 2^(2-p) = 2^(-30) ≈ 10^(-9), the singular values of U are
approximately 20.05, 6.925, ..., 1.449, 5.280 × 10^(-10), and ψ_32 is indeed less than
10^(-9).
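
The claim is easily verified numerically; the following assumed NumPy sketch
reproduces the figures quoted above for p = 32.

```python
import numpy as np

p = 32
U = np.triu(-np.ones((p, p)), k=1) + np.eye(p)   # 1 on the diagonal, -1 above it

psi = np.linalg.svd(U, compute_uv=False)
print(psi[0], psi[-1])        # roughly 20.05 and 5.28e-10, as quoted in the text

E = np.zeros((p, p))
E[-1, 0] = -2.0 ** (2 - p)    # the rank one perturbation used above
print(np.linalg.matrix_rank(U + E))   # p - 1 = 31
```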

When the SVD is computed by numerically stable methods then the above remarks also
hold in the presence of rounding errors, except when the perturbations under
consideration are smaller than the machine accuracy, which is not very likely in
practice. Even then we only have to allow for the fact that computationally
singular values will not usually have values less than about eps·ψ_1, where eps is
the relative machine precision, because now the machine error dominates the data
error. For example, on a VAX 11/780 in single precision, for which
eps = 2^(-24) ≈ 6 × 10^(-8), the smallest singular value of the above matrix U, as
computed by the NAG Library routine F02WDF, was 4.726 × 10^(-8) instead of
ψ_32 = 5.280 × 10^(-10).

The singular value decomposition is of course a more complicated factorization than
the QU factorization; it requires more storage and takes longer to compute,
although this latter aspect is frequently over-emphasized.

For many applications the QU factorization is quite sufficient and a convenient
strategy is to compute this factorization and then test U to see whether or not it
is suitable for the particular application.

For example, if U is required to be non-singular then, at a moderate extra expense,
we can compute or estimate its condition number c(U) in order to determine whether
or not U is sufficiently well-conditioned. If U is not suitable we can then
proceed to obtain the SVD of U as

   U = Q̃ΨP^T,                                                            (4.15)

where Q̃ and P are orthogonal and Ψ is diagonal. From (4.5) we get that the SVD of
X is then given by

   X = Q [ Q̃  0 ] [ Ψ ] P^T                                              (4.16)
         [ 0  I ] [ 0 ]

and thus the singular values and right singular vectors of U and X are identical.
We can take advantage of the upper triangular form of U in computing its SVD and
for typical statistical data, where n is considerably larger than p, the time taken
will be dominated by the QU factorization of X. The NAG Library routine F02WDF is
an example that explicitly allows the user to stop at the QU factorization if U is
not too ill-conditioned.

Particularly important in some statistical and real time applications is the fact
that the QU factorization may be obtained by processing the matrix X one
observation, or a block of observations at a time, so that the complete matrix X
need not be held all at once, but can be sequentially processed to give the compact
representation U. This can be achieved by well known updating techniques using,
for example, plane rotations. (Golub, 1965; Gentleman, 1974a; Gill and Murray,
1977; Dongarra et al, 1979; Cox, 1981.)
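
The following assumed sketch (plain NumPy, not the NAG or LINPACK updating
routines; the helper name add_row is hypothetical) indicates how one observation at
a time can be folded into the p by p triangular factor with plane rotations, so
that the full data matrix is never stored.

```python
import numpy as np

def add_row(U, z):
    """Update the triangular factor U of a QU factorization by one new row z."""
    U = U.copy()
    z = z.astype(float).copy()
    p = U.shape[0]
    for j in range(p):
        r = np.hypot(U[j, j], z[j])
        if r == 0.0:
            continue
        c, s = U[j, j] / r, z[j] / r      # plane rotation annihilating z[j] against U[j, j]
        Uj, zj = U[j, j:].copy(), z[j:].copy()
        U[j, j:] = c * Uj + s * zj
        z[j:] = -s * Uj + c * zj
    return U

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 4))

U = np.zeros((4, 4))
for row in X:                              # one pass through the data
    U = add_row(U, row)

R = np.linalg.qr(X, mode='r')              # reference factor computed all at once
print(np.allclose(U.T @ U, R.T @ R))       # both represent X^T X, so this is True
```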

In the next section we demonstrate that such techniques can also be used to obtain
the QU factorization of X̄ and of X̃.

5. The QU Factorization of Corrected Sums of Squares and Cross Product Matrices

As mentioned in the previous section there are many applications where it is


desirable to process the data sequentially without storing the data matrix X.
Statistical packages such as BMDP (1977) allow one to form covariance and
correlation matrices by sequentially processing the data and we now show that we
can also obtain the QU factorization of such matrices by a corresponding process.

First we note that sample means and variances can be computed sequentially and,
indeed, there are good numerical reasons for preferring to compute means and
variances this way, rather than by the traditional formulae (Chan and Lewis,
1979; West, 1979; Chan, Golub and LeVeque, 1982). If we denote the i-th
observation of the j-th variable as x_j^(i) and let y_j^(r) and x̄_j^(r) denote,
respectively, the estimated sum of squares of deviations from the mean and the
estimated mean of the first r observations, so that

   x̄_j^(r) = (Σ_{i=1}^r x_j^(i))/r,    y_j^(r) = Σ_{i=1}^r (x̄_j^(r) - x_j^(i))²,

then it is readily verified that

   x̄_j^(r) = x̄_j^(r-1) + (x_j^(r) - x̄_j^(r-1))/r,
                                                                         (5.1)
   y_j^(r) = y_j^(r-1) + (r-1)(x_j^(r) - x̄_j^(r-1))²/r.

Of course

   x̄_j = x̄_j^(n)   and   s_j² = y_j^(n)/(n-1).                           (5.2)
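
A one-pass sketch of the recurrences (5.1) and (5.2), assuming NumPy (the function
name running_mean_and_ss is hypothetical):

```python
import numpy as np

def running_mean_and_ss(X):
    n, p = X.shape
    mean = np.zeros(p)
    y = np.zeros(p)                       # sum of squared deviations from the mean
    for r, x in enumerate(X, start=1):
        delta = x - mean
        mean += delta / r                 # (5.1), first recurrence
        y += (r - 1) * delta**2 / r       # (5.1), second recurrence
    return mean, y

rng = np.random.default_rng(3)
X = rng.standard_normal((1000, 3)) * [1.0, 5.0, 0.1] + 10.0

mean, y = running_mean_and_ss(X)
s2 = y / (X.shape[0] - 1)                 # (5.2): sample variances
print(np.allclose(mean, X.mean(axis=0)), np.allclose(s2, X.var(axis=0, ddof=1)))
```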

In this way we can obtain x̄ and D of (2.2) with one pass through the data. Given
the QU factorization of X, (2.3) gives

   X̄ = X - ex̄^T = Q ( [ U ] - fx̄^T ),    f = Q^T e.                      (5.3)
                      [ 0 ]

If we find the QU factorization of the matrix in parentheses as, say,

   [ U ] - fx̄^T = Q_1 [ Ū ]                                              (5.4)
   [ 0 ]              [ 0 ]

and put

   Q̄ = QQ_1,                                                             (5.5)

then the QU factorization of X̄ is

   X̄ = Q̄ [ Ū ] .                                                         (5.6)
         [ 0 ]
The factorization of (5.4) is a rank one update problem and there are standard
methods by which the factorization can be obtained economically (Gill and Murray,
1977; Golub and Van Loan, 1983, section 12-6). Since we can obtain the QU
factorization of X by sequentially processing the data, and noting that ŪD is still
upper triangular, equations (5.1), (5.2) and (5.6) enable us to find the QU
factorizations of X̄ and of X̃, and hence the Cholesky factors of the matrices C and
R of (2.5) and (2.6), by sequentially processing the data one or more observations
at a time.

As an alternative we can obtain the QU factorization of X̄ by an updating process.
If we let X̄_r denote the zero means data matrix for the first r observations, let
z_r^T denote the r-th row of X, define x̄^(r) as the vector

   (x̄^(r))^T = [x̄_1^(r)  x̄_2^(r)  ...  x̄_p^(r)],

and take the number of elements in the vector e by context, then using (5.1)

   X̄_{r+1} = X_{r+1} - e(x̄^(r+1))^T

           = [ X_r       ] - e[x̄^(r) + (z_{r+1} - x̄^(r))/(r+1)]^T,
             [ z_{r+1}^T ]

so that

   X̄_{r+1} = [ X̄_r - e(z_{r+1} - x̄^(r))^T/(r+1) ]                        (5.8)
             [ (z_{r+1} - x̄^(r+1))^T             ] .

We can obtain the QU factorization of {X̄_r - e(z_{r+1} - x̄^(r))^T/(r+1)} from that
of X̄_r by the method described above and we can then update this QU factorization
by the additional row (z_{r+1} - x̄^(r+1))^T. In either method it does not seem to be
possible to avoid storing the n element vector Q^T e.

A method requiring storage only of additional p element vectors would be useful.

As described earlier we can readily obtain the SVD of X̄ or X̃ via the upper
triangular factor.

6. Solving Multiple Regression Problems

In this section we consider the application of the QU factorization and the SVD to
multiple regression, or linear least squares, and take X to denote either the data
matrix or the standardized data matrix, since the solution of a regression with one
matrix can be deduced from the solution with the other.

We wish to determine the vector b (the regression coefficients) to

   minimize r^T r,   where   y = Xb + r,                                 (6.1)

where y is a vector of dependent observations and r is the residual vector, usually
assumed to come from a normal distribution with

   E(r) = 0   and   E(rr^T) = σ²I.

If Q is orthogonal then

   r^T r = r^T Q^T Qr = (Qr)^T(Qr)

and (6.1) is equivalent to

   minimize r̃^T r̃,   where   Q^T y = Q^T Xb + r̃,   r̃ = Q^T r.            (6.2)

If we choose Q as the orthogonal matrix of the QU factorization of X and partition
Q^T y as

   Q^T y = [ ŷ ] ,                                                       (6.3)
           [ w ]

where ŷ has p elements, then

   r̃ = [ ŷ - Ub ] .                                                      (6.4)
       [   w    ]

If X has linearly independent columns then U will be non-singular and we can choose
b so that

   Ub = ŷ.                                                               (6.5)

Since w is independent of b, this must be the choice of b that minimizes r̃^T r̃ and
hence r^T r (Golub, 1965; Gentleman, 1974b). For this choice

   r̃ = [ 0 ] ,   so that   r^T r = w^T w,                                (6.6)
       [ w ]

which is information that is lost when the normal equations are formed. We need
not retain w during the factorization, but we can instead just update the sum of
squares so that we have the single value w^T w on completion of the QU factorization.
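
The computation described by (6.3) - (6.6) can be sketched as follows, assuming
NumPy and SciPy: the regression coefficients come from the triangular system
Ub = ŷ and the residual sum of squares is w^T w, with no normal equations formed.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(4)
n, p = 300, 3
X = rng.standard_normal((n, p))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(n)

Q1, U = np.linalg.qr(X, mode='reduced')
y_hat = Q1.T @ y                          # first p elements of Q^T y
b = solve_triangular(U, y_hat)            # (6.5): U b = y_hat

rss = y @ y - y_hat @ y_hat               # w^T w = ||y||^2 - ||y_hat||^2, since Q is orthogonal
print(b)
print(np.isclose(rss, np.sum((y - X @ b) ** 2)))   # True
```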

As the discussion in section 3 indicates, the sensitivity of the solution of (6.5)
is determined by the closeness of X to rank deficiency, whereas the sensitivity of
the solution of the normal equations is determined by the closeness of X^T X to rank
deficiency (Wilkinson, 1974).

If X is rank deficient, so that its columns are not linearly independent, then U
will be singular. Using the notation of (4.15), if we then obtain the SVD of U,
(6.4) becomes

   r̃ = [ ŷ - Q̃ΨP^T b ] ,
       [      w       ]

so that (6.1) is equivalent to

   minimize r̂^T r̂,   where   r̂ = Q̃^T ŷ - ΨP^T b.                         (6.7)

If X has rank k then only the first k singular values will be non-zero, so let us
put

   Ψ = [ S  0 ] ,   S  k by k and non-singular,                          (6.8)
       [ 0  0 ]

and correspondingly partition Q̃^T ŷ and P^T b as

   Q̃^T ŷ = [ γ ] ,   P^T b = [ c ] .                                     (6.9)
           [ v ]             [ ĉ ]

Then

   r̂ = [ γ - Sc ]                                                        (6.10)
       [   v    ]

and r̂^T r̂ is minimized by choosing c so that

   Sc = γ.                                                               (6.11)

Since S is diagonal, c_i = γ_i/ψ_i, i = 1, 2, ..., k. We also have

   r^T r = w^T w + r̂^T r̂ = w^T w + v^T v.                                (6.12)

As ĉ is not determined by (6.11) we can see that the solution is not unique. The
particular solution for which b^T b is also a minimum is called the minimal length
solution and from (6.9) we see that this is given by taking

   ĉ = 0,   in which case   b = P [ c ]                                  (6.13)
                                  [ 0 ]

(Golub and Reinsch, 1970; Peters and Wilkinson, 1970).
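
A sketch of the minimal length solution (6.8) - (6.13), assuming NumPy (the
tolerance and the function name minimal_length_solution are illustrative choices,
not prescriptions): singular values below the tolerance are treated as zero and
only the leading k terms contribute to b.

```python
import numpy as np

def minimal_length_solution(X, y, tol):
    Q1, psi, PT = np.linalg.svd(X, full_matrices=False)   # X = Q1 diag(psi) P^T
    k = np.sum(psi > tol)                 # numerical rank: singular values above tol
    gamma = Q1.T[:k] @ y                  # first k elements of the transformed y
    c = gamma / psi[:k]                   # (6.11): S c = gamma, S diagonal
    return PT[:k].T @ c                   # (6.13): b = P [c; 0]

rng = np.random.default_rng(5)
X = rng.standard_normal((100, 4))
X[:, 3] = X[:, 0] + X[:, 1]               # an exact collinearity: rank 3
y = rng.standard_normal(100)

b = minimal_length_solution(X, y, tol=1e-10)
print(b)
print(np.allclose(X.T @ (y - X @ b), 0.0, atol=1e-8))   # a least squares solution: X^T r = 0
```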

In practice X will not be exactly rank deficient and the computed singular values
will not be exactly zero, and while it is not always easy to decide upon the
numerical rank of X (Golub, Klema and Stewart, 1976; Stewart, 1979; Klema and
Laub, 1980; Stewart, 1984), equations (4.10) - (4.13) tell us about the effects of
neglecting small singular values. Furthermore (4.7) gives

   ||Xp_i|| = ψ_i,                                                       (6.14)

and so the columns of P corresponding to small singular values give valuable
information on the near multicollinearities in X. We can also readily assess the
effects of different decisions on the rank of X on the solution and on the residual
sum of squares from a knowledge of the singular values and right singular vectors
(Lawson and Hanson, 1974, chapter 25, section 6).

The usual additional statistical information can efficiently be computed from
either of the factorizations. For example, when X is of full rank then the
estimated variance-covariance matrix of the sample regression, V, is defined as

   V = s²(X^T X)^(-1),                                                   (6.15)

where s² is the estimated variance of the residual, and from (4.2) this becomes

   V = s²(U^T U)^(-1) = s²U^(-1)U^(-T)                                   (6.16)

and the element v_ij is given by

   v_ij = s²e_i^T U^(-1)U^(-T) e_j = s²f_i^T f_j,   U^T f_j = e_j,       (6.17)

where e_i is the i-th column of the unit matrix. In particular the diagonal
elements v_ii are the estimated variances of the sample regression coefficients and
these can efficiently be computed from

   v_ii = s²f_i^T f_i,   U^T f_i = e_i,

this requiring just one forward substitution for each f_i.
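
A sketch of (6.16) - (6.17), assuming NumPy and SciPy: each vector f_i is obtained
from one triangular solve with U^T, and (X^T X)^(-1) is never formed explicitly.

```python
import numpy as np
from scipy.linalg import solve_triangular

rng = np.random.default_rng(6)
n, p = 200, 4
X = rng.standard_normal((n, p))
y = X @ np.array([2.0, 0.0, -1.0, 0.5]) + 0.3 * rng.standard_normal(n)

Q1, U = np.linalg.qr(X, mode='reduced')
b = solve_triangular(U, Q1.T @ y)
s2 = np.sum((y - X @ b) ** 2) / (n - p)             # estimated residual variance

F = solve_triangular(U.T, np.eye(p), lower=True)    # columns f_i solve U^T f_i = e_i
v_diag = s2 * np.sum(F * F, axis=0)                 # v_ii = s^2 f_i^T f_i

print(np.allclose(v_diag, np.diag(s2 * np.linalg.inv(X.T @ X))))   # True
```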

When X is not of full rank and a minimal length solution has been obtained, then in
place of (6.15) we must take

   V = s²(X^T X)^+,                                                      (6.18)

where (X^T X)^+ is the pseudo-inverse of X^T X, and from (4.8) this becomes

   V = s²(PΨ²P^T)^+ = s²P(Ψ²)^+P^T = s²P [ S^(-2)  0 ] P^T,
                                         [   0     0 ]

and if we partition P as

   P = [P_1  P_2],   P_1  p by k,

then corresponding to (6.17) here we have

   v_ij = s²f_i^T f_j,   Sf_j = P_1^T e_j,                               (6.19)

and once again the elements of V can be computed efficiently.

As a second example, if x is a p element vector of values of the variables x_1, x_2,
..., x_p, then the estimated variance of the estimate x^T b of the dependent variable
is given by

   a² = s²x^T Vx                                                         (6.20)

and this can be computed from

   a² = s²f^T f,   U^T f = x,                                            (6.21)

for the QU factorization and

   a² = s²f^T f,   Sf = P_1^T x,                                         (6.22)

for the SVD.

If we relax the assumption that E(rr^T) = σ²I and instead have

   E(rr^T) = σ²W,

where W is non-negative definite, then the usual approach is to obtain the
regression coefficients as the solution of the weighted least squares problem

   minimize r^T W^(-1) r,   where   y = Xb + r,                          (6.23)

because E(W^(-1)rr^T) = σ²I. Unless W is well-conditioned, solving (6.23) explicitly
is numerically unstable, and the problem is not even defined when W is singular. If
we let B be any matrix such that

   W = BB^T

and let e be an error vector satisfying

   Be = r,

then (6.23) is equivalent to the generalized linear least squares problem

   minimize e^T e   subject to   y = Xb + Be,                            (6.24)

and now W is not required to be non-singular. Methods for the solution of (6.24)
based on the QU factorization and on the SVD have been discussed by Paige (1978,
1979a, 1979b) and by Kourouklis and Paige (1981). To briefly indicate how the SVD
may be used, partition Q as

   Q = [Q_1  Q_2].

Then multiplying the linear constraints in (6.24) by Q^T and using the notation of
(6.8) and (6.9) we find that

   Sc = Q_1^T y - Q_1^T Be,

from which c is determined from e, ĉ is arbitrary, and e must satisfy

   Q_2^T y = Q_2^T Be.

An SVD of Q_2^T B either allows e to be determined, or shows that the equations are
inconsistent (Hammarling, Long and Martin, 1983).

When X, as well as y, contains experimental error then in place of (6.1) it may be
more appropriate to find the regression coefficients as the solution of the total
least squares problem (Golub and Van Loan, 1980)

   minimize ||(E,r)||_F²,   where   y = (X+E)b + r,                      (6.25)

where ||X||_F² = Σ_{j=1}^p Σ_{i=1}^n x_ij². If we assume that X is of full rank and put

   Z = (X,-y),   F = (E,r),

then (6.25) becomes

   minimize ||F||_F,   where   (Z+F) [ b ] = 0,                          (6.26)
                                     [ 1 ]

and so we require the minimum perturbation that makes Z rank deficient with [b^T 1]^T
in the null space of (Z+F). If we let the SVD of Z be

   Z = Q [ Ψ ] P^T
         [ 0 ]

and put

   F = Q [ diag(0, ..., 0, -ψ_{p+1}) ] P^T,
         [             0             ]

then F makes (Z+F) rank deficient and is of minimum norm (Golub and Van Loan, 1983,
corollary 2.3-3). If the (p+1)-th right singular vector (last column of P) is
denoted by [p^T  ρ]^T, and if ρ ≠ 0, then it is readily verified that the regression
coefficients are given by

   b = (1/ρ)p.                                                           (6.27)

For further details see Golub and Van Loan (1980, 1983), and for discussion of the
case where ρ = 0 and a comparison with standard regression see Van Huffel,
Vandewalle and Staar (1984).
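
A sketch of the total least squares computation (6.25) - (6.27), assuming NumPy
(the function name tls is hypothetical): b is read off the last right singular
vector of Z = (X,-y), provided its final element ρ is non-zero.

```python
import numpy as np

def tls(X, y):
    Z = np.hstack([X, -y[:, None]])       # Z = (X, -y), n by (p + 1)
    _, _, VT = np.linalg.svd(Z)
    v = VT[-1]                            # right singular vector for the smallest singular value
    p_part, rho = v[:-1], v[-1]
    if abs(rho) < np.finfo(float).eps:
        raise ValueError("rho is (numerically) zero: no TLS solution of this form")
    return p_part / rho                   # (6.27): b = (1/rho) p

rng = np.random.default_rng(7)
b_true = np.array([1.0, -1.0, 2.0])
X0 = rng.standard_normal((500, 3))
X = X0 + 0.01 * rng.standard_normal(X0.shape)       # errors in the regressors
y = X0 @ b_true + 0.01 * rng.standard_normal(500)   # and in the dependent variable

print(tls(X, y))                          # close to b_true
```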

7. Other Applications in Multivariate Analysis

In this section we give a very brief mention of two further applications of the SVD
in multivariate analysis.

Given a zero means data matrix X̄, possibly standardized, the aim of a principal
component analysis is to determine an orthogonal transformation of the columns of X̄
to a data matrix Y whose columns have non-increasing variance, each column of Y
having as large a variance as possible.

It is well known, and in any case readily established from the Courant-Fischer
theorem (Wilkinson, 1965, chapter 2, section 43), that Y is given by

   Y = X̄P,                                                               (7.1)

where P is the matrix of eigenvectors of X̄^T X̄. From (4.8) and (4.5) we therefore
see that P is the matrix of right singular vectors of X̄ and that

   Y = Q [ Ψ ] ,                                                         (7.2)
         [ 0 ]

so that the j-th principal component of X̄ is given by

   y_j = ψ_j q_j,                                                        (7.3)

where q_j is the j-th left singular vector of X̄, and the estimated variance of y_j is
ψ_j²/(n-1). Again, we can avoid the formation of X̄^T X̄ and the SVD allows us to
properly use our judgement as to which components are significant.
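
A sketch of principal components obtained directly from the SVD of the zero means
data matrix, following (7.1) - (7.3) and assuming NumPy; the corrected sums of
squares and cross products matrix is never formed.

```python
import numpy as np

rng = np.random.default_rng(8)
n = 400
X = rng.standard_normal((n, 3)) @ np.array([[3.0, 0.0, 0.0],
                                            [1.0, 1.0, 0.0],
                                            [0.0, 0.5, 0.2]])

Xbar = X - X.mean(axis=0)                 # zero means data matrix, (2.3)
Q1, psi, PT = np.linalg.svd(Xbar, full_matrices=False)

Y = Q1 * psi                              # the principal components, Y = Xbar P
variances = psi**2 / (n - 1)              # estimated variances of the components

print(variances)
print(np.allclose(variances, np.linalg.eigvalsh(np.cov(X.T))[::-1]))   # eigenvalues of C, in the same order
```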

Given two zero means data matrices X̄ and Ȳ, the canonical correlation problem is to
find a transformation, A, of the columns of Ȳ,

   Z = ȲA,                                                               (7.4)

such that the columns of Z are orthonormal and such that regression of a column of
Z on X̄ maximizes the multiple correlation coefficient.

Let the SVDs of X̄ and Ȳ be

   X̄ = Q_x [ Ψ_x ] P_x^T,    Ȳ = Q_y [ Ψ_y ] P_y^T,                      (7.5)
           [  0  ]                   [  0  ]

partition Q_x and Q_y as

   Q_x = [Q_x1  Q_x2],   Q_y = [Q_y1  Q_y2],

where Q_x1 and Q_y1 contain the leading columns corresponding to Ψ_x and Ψ_y, and
let

   C = Ψ_y P_y^T A.                                                      (7.6)

This gives

   Z = Q_y1 C,                                                           (7.7)

and hence if the columns of Z are orthonormal then so are the columns of C. Noting
that the orthogonal projection onto the column space of X̄ is

   Q_x1 Q_x1^T,                                                          (7.8)

it is now a straightforward matter to show that the canonical correlation problem
for the pair (X̄,Ȳ) can be solved from the solution to the principal component
problem for the matrix Q_x1^T Q_y1. The multiple correlation coefficients, or the
canonical correlation coefficients, are the singular values of Q_x1^T Q_y1. The canonical
correlation example is included to indicate the potential power of the SVD as an
aid to the solution of difficult multivariate statistical problems.

A full discussion of this and related topics is given by Björck and Golub (1973).
See also Golub and Van Loan (1983, section 12.4). When X̄ and Ȳ are both of full
rank then we can use the QU factorizations of X̄ and Ȳ in place of their SVDs
(Golub, 1969).
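
A sketch of the approach just described, assuming NumPy and full rank data (so
that, as noted above, the QU factorization can stand in for the SVD): the canonical
correlation coefficients are the singular values of Q_x1^T Q_y1, where Q_x1 and Q_y1
are orthonormal bases for the columns of the two centred data matrices.

```python
import numpy as np

rng = np.random.default_rng(9)
n = 300
common = rng.standard_normal((n, 2))                 # structure shared between the two sets
X = np.hstack([common, rng.standard_normal((n, 2))]) + 0.5 * rng.standard_normal((n, 4))
Y = np.hstack([common, rng.standard_normal((n, 3))]) + 0.5 * rng.standard_normal((n, 5))

Xbar = X - X.mean(axis=0)
Ybar = Y - Y.mean(axis=0)

Qx1, _ = np.linalg.qr(Xbar, mode='reduced')          # full rank assumed, so QR suffices
Qy1, _ = np.linalg.qr(Ybar, mode='reduced')

canonical_correlations = np.linalg.svd(Qx1.T @ Qy1, compute_uv=False)
print(canonical_correlations)            # two values well above the rest, from the shared columns
```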

A number of other applications of the SVD in multivariate analysis are discussed by
Chambers (1977) and by Banfield (1978).
8. The Generalized Singular Value Decomposition

Here we briefly mention an important generalization of the SVD that is relevant to
a pair of data matrices (X,Y) of dimensions n by p and m by p. To simplify the
discussion we assume that m ≥ p and that Y is of full rank. In this case the
generalized singular value decomposition (GSVD) is given by

   X = Q_x [ Ψ_x ] Z^(-1),    Y = Q_y [ Ψ_y ] Z^(-1),                    (8.1)
           [  0  ]                    [  0  ]

where Q_x and Q_y are orthogonal, Z is a p by p non-singular matrix, and Ψ_x and Ψ_y
are diagonal matrices, Ψ_x = diag(α_i), Ψ_y = diag(β_i), with

   α_i² + β_i² = 1,   α_i ≥ 0,   β_i ≥ 0.                                (8.2)

The pairs (α_i,β_i) are called the generalized singular values and can be chosen with
the α_i in descending order and the β_i in ascending order (Van Loan, 1976). This
and the unrestricted case are discussed by Paige and Saunders (1981) and we
strongly recommend reference to their paper.

From (8.1) we see that

   X^T X = Z^(-T)Ψ_x²Z^(-1)   and   Y^T Y = Z^(-T)Ψ_y²Z^(-1),            (8.3)

so that Z is the congruence matrix that simultaneously diagonalizes the normal
matrices (X^T X, Y^T Y), and hence the columns of Z are the eigenvectors of the
generalized symmetric eigenvalue problem

   (X^T X)z = λ(Y^T Y)z                                                  (8.4)

and the values (α_i/β_i)² are the corresponding eigenvalues. Thus, just as the SVD
allows us to avoid the numerically damaging step of forming X^T X, the GSVD allows
us to avoid the numerically damaging step of forming the pair (X^T X, Y^T Y).

Unlike the SVD there is not yet quality software available for computing the GSVD,
but numerically stable algorithms are beginning to emerge (Stewart, 1983; Paige,
1984b) and such software will surely be available in the near future. This will
mean that we can use the natural tool for tackling multivariate problems involving
matrix pairs (X,Y), rather than using the SVD, which is really only the natural
tool when a single data matrix is involved.

Two such examples are the generalized least squares problem and the canonical
correlation problem, discussed in the previous two sections. Paige (1984a) has
given an elegant analysis of the generalized least squares problem in terms of the
GSVD, and for the canonical correlation problem it can readily be shown that we
simply have to replace the Q_x and Q_y of (7.5) by those of (8.1) and then we again
solve the principal component problem for the matrix Q_x1^T Q_y1.

9. Acknowledgement

This is a revised version of an article that first appeared as "How to live without
covariance matrices: Numerical stability in multivariate statistical analysis", in
NAG Newsletter 1/83.

10. References

BANFIELD, C.F. (1978). Singular value decomposition in multivariate analysis. In


"Numerical Software - Needs and Availability". Ed. Jacobs, D.A.H., Academic Press,
London.

BJÖRCK, A. and GOLUB, G.H. (1973). Numerical methods for computing angles between
linear subspaces. Maths. of Computation, 27, 579-594.

BMDP (1977). Biomedical Computer Programs. University of California Press,
London.

CHAMBERS, J.M. (1977). Computational Methods for Data Analysis. Wiley, New York.

CHAN, T.F. (1982). Algorithm 581: An improved algorithm for computing the singular
value decomposition. ACM Trans. Math. Softw. 8, 84-88.

CHAN, T.F., GOLUB, G.H. and LEVEQUE, R.J. (1982). Updating formulae and a pairwise
algorithm for computing sample variances. In "COMPSTAT '82, Part I: Proceedings in
Computational Statistics". Eds. Caussinus, H., Ettinger, P. and Tomassone, R.
Physica-Verlag, Vienna.

CHAN, T.F. and LEWIS, J.G. (1979). Computing standard deviations: Accuracy. Comm.
ACM, 22, 526-531.

COX, M.G. (1981). The least squares solution of overdetermined linear equations
having band or augmented band structure. IMA J. Num. Anal., 1, 3-22.

DONGARRA, J.J., MOLER, C.B., BUNCH, J.R. and STEWART, G.W. (1979). Linpack Users'
Guide. SIAM, Philadelphia.

FORSYTHE, G.E. and MOLER, C.B. (1967). Computer Solution of Linear Algebraic
Equations. Prentice-Hall, New Jersey.

GENTLEMAN, W.M. (1974a). Basic procedures for large, sparse or weighted linear
least squares problems. Appl. Statist., 23, 448-454.

GENTLEMAN, W.M. (1974b). Regression problems and the QU decomposition. Bulletin


IMA, 10, 195-197.

GILL, P.E. and MURRAY, W. (1977). Modification of matrix factorizations after a
rank one change. In "The State of the Art in Numerical Analysis". Ed. Jacobs,
D.A.H. Academic Press, London.

GOLUB, G.H. (1965). Numerical methods for solving linear least squares problems.
Num. Math., 7, 206-216.

GOLUB, G.H. (1969). Matrix decompositions and statistical calculations. In
"Statistical Computation". Eds. Milton, R.C. and Nelder, J.A., Academic Press,
London.

GOLUB, G.H. and KAHAN, W. (1965). Calculating the singular values and pseudo-
inverse of a matrix. SIAM J. Num. Anal., 2, 202-224.

GOLUB, G.H., KLEMA, V.C. and STEWART, G.W. (1976). Rank degeneracy and least
squares problems. Technical Report STAN-CS-76-559, Stanford University, Stanford,
CA 94305, USA.

GOLUB, G.H. and REINSCH, C. (1970). Singular value decomposition and least squares
solutions. Num. Math., 14, 403-420.

GOLUB, G.H. and VAN LOAN, C.F. (1980). An analysis of the total least squares
problem. SIAM J. Num. Anal., 17, 883-893.

GOLUB, G.H. and VAN LOAN, C.F. (1983). Matrix Computations. North Oxford
Academic, Oxford.

GOLUB, G.H. and WILKINSON, J.H. (1966). Note on iterative refinement of least
squares solutions. Num. Math., 9, 139-148.

HAMMARLING, S.J., LONG, E.M.R. and MARTIN, D.W. (1983). A generalized linear least
squares algorithm for correlated observations, with special reference to degenerate
data. NPL Report DITC 33/83. National Physical Laboratory, Teddington, Middlesex,
TW11 0LW, UK.

IMSL, International Mathematical and Statistical Libraries, 7500 Bellaire Blvd.,
Houston, TX 77036-5085, USA.

KLEMA, V.C. and LAUB, A.J. (1980). The singular value decomposition: its
computation and some applications. IEEE Trans. Automat. Control, AC-25, 164-176.

KOUROUKLIS, S. and PAIGE, C.C. (1981). A constrained least squares approach to the
general Gauss-Markov linear model. J. American Statistical Assoc., 76, 620-625.

LAWSON, C.L. and HANSON, R.J. (1974). Solving Least Squares Problems. Prentice-
Hall, New Jersey.

NAG, Numerical Algorithms Group, NAG Central Office, 256 Banbury Road, Oxford,
OX2 7DE, UK.

PAIGE, C.C. (1978). Numerically stable computations for general univariate linear
models. Commun. Statist.-Simula. Computa., B7(5), 437-453.

PAIGE, C.C. (1979). Fast numerically stable computations for generalized linear
least squares problems. SIAM J. Num. Anal., 16, 165-171.

PAIGE, C.C. (1984a). The general linear model and the generalized singular value
decomposition. School of Computer Science, McGill University, Montreal, Quebec,
Canada, H3A 2K6. (Submitted to Linear Algebra Applic., Special Issue on
Statistics.)

PAIGE, C.C. (1984b). Computing the generalized singular value decomposition.


School of Computer Science, McGill University, Montreal, Quebec, Canada, H3A 2K6.
(Submitted to SIAM J. Sci. Stat. Comput.)

PAIGE, C.C. and SAUNDERS, M.A. (1981). Towards a generalized singular value
decomposition. SIAM J. Num. Anal., 18, 398-405.

PETERS, G. and WILKINSON, J.H. (1970). The least squares problem and pseudo-
inverses. Computer J., 13, 309-316.

STEWART, G.W. (1974). Introduction to Matrix Computations. Academic Press, New York.

STEWART, G.W. (1977). On the perturbation of pseudo-inverses, projections and
linear least squares problems. SIAM Rev., 19, 634-662.

STEWART, G.W. (1979). Assessing the effects of variable error in linear


regression. Technical Report 818, University of Maryland, College Park, Maryland
20742, USA.

STEWART, G.W. (1983). A method for computing the generalized singular value
decomposition. In "Matrix Pencils". Eds. Kagstrom, B. and Ruhe, A., Springer-
Verlag, Berlin.

STEWART, G.W. (1984). Rank degeneracy. SIAM J. Sci. Stat. Comput., 5, 403-413.

VAN HUFFEL, S., VANDEWALLE, J. and STAAR, J. (1984). The total linear least
squares problem: Formulation, algorithm and applications. Katholieke Universiteit
Leuven, ESAT Laboratory, 3030 Heverlee, Belgium.

VAN LOAN, C.F. (1976). Generalizing the singular value decomposition. SIAM J.
Num. Anal., 13, 76-83.

WEST, D.H.D. (1979). Updating mean and variance estimates: An improved method.
Comm. ACM, 22, 532-535.

WILKINSON, J.H. (1963). Rounding Errors in Algebraic Processes. Notes on Applied
Science No. 32, HMSO, London and Prentice-Hall, New Jersey.

WILKINSON, J.H. (1965). The Algebraic Eigenvalue Problem. Oxford University
Press, London.

WILKINSON, J.H. (1974). The classical error analyses for the solution of linear
systems. Bulletin IMA, 10, 175-180.

WILKINSON, J.H. (1977). Some recent advances in numerical linear algebra. In "The
State of the Art in Numerical Analysis". Ed. Jacobs, D.A.H. Academic Press, London.

WILKINSON, J.H. (1978). Singular-value decomposition - Basic aspects. In
"Numerical Software - Needs and Availability". Ed. Jacobs, D.A.H. Academic Press,
London.
