To Gene Golub who has done so much to encourage and advance the use of stable numerical
techniques in multivariate statistics.
1. Introduction
This is a review article aimed at statisticians and mathematicians who, while not
expert numerical analysts, would like to gain some understanding of why it is
better to work with the data matrix itself, and of the techniques that allow us to avoid
the explicit computation of sums of squares and cross products matrices. To give a
focus and to keep the article of moderate length, we concentrate in particular on
the use of the singular value decomposition and its application to multiple
regression problems. In the final two sections we give a brief discussion of
principal components, canonical correlations and the generalized singular value
decomposition.
2. Notation
Rather than use the standard notation of numerical linear algebra, we use notation
that is more akin to that of the statistician and so we let X denote an n by p data
matrix (design matrix, matrix of observations), where p is the number of variables
and n is the number of data points (objects, individuals, observations). Let x_j
denote the j-th column of X, so that

   X = ( x_1  x_2  ...  x_p ) .

Let x̄_j and s_j denote the sample mean and standard deviation of the j-th variable,
let x̄ be the p element vector with elements x̄_j, let e be the n element vector of
units and let D = diag(d_j), where

   d_j = 1/s_j ,   s_j ≠ 0 ,
   d_j = 0 ,       s_j = 0 .

Then

   X̃ = X - e x̄^T                                                          (2.3)

is the zero mean data matrix and

   X̂ = X̃ D                                                                (2.4)

is the standardized data matrix, because the mean of each column of X̂ is zero and
the variance of each column is unity, unless s_j = 0, in which case the j-th column
is zero.
The normal matrices X^T X and X̃^T X̃ are the sums of squares and cross products
matrix and the corrected sums of squares and cross products matrix,

   C = X̃^T X̃ ,                                                            (2.5)

respectively, and the matrix

   R = X̂^T X̂                                                              (2.6)

is the sample correlation matrix, with r_ij as the sample correlation coefficient of
variables x_i and x_j. Of course

   R = D C D .
The notation ||z|| and ||Z|| will be used to denote respectively the Euclidean
length of the n element vector z and the spectral or two norm of the n by p matrix
Z, given by

   ||z|| = ( Σ_{i=1}^{n} z_i² )^{1/2} ,   ||Z|| = max_{||z||=1} ||Zz|| = ( ρ(Z^T Z) )^{1/2} ,

where ρ(Z^T Z) denotes the spectral radius (largest eigenvalue) of Z^T Z. The reason
for our interest in these particular norms is that when Z is orthogonal then
||Zz|| = ||z||, so that Euclidean length is preserved under orthogonal transformations.
A detailed knowledge of the spectral norm of a matrix is not important here and to
give a feel for its size in relation to the elements of Z we note that

   ||Z|| ≤ ( Σ_{j=1}^{p} Σ_{i=1}^{n} z_ij² )^{1/2} ≤ p^{1/2} ||Z|| .
3. The Sensitivity of Normal Matrices

To illustrate the dangers of forming the normal matrix explicitly, consider the
3 by 2 matrix

   X = [ 1  1 ]
       [ ε  0 ]
       [ 0  ε ] ,   ε ≠ 0 ,   for which   X^T X = [ 1 + ε²     1    ]
                                                  [   1      1 + ε² ] .

X has full rank, but if ε is small enough for 1 + ε² to be computed as 1 then the
computed X^T X is exactly singular, so that in forming X^T X information about X has
been lost. To quantify this loss of information suppose, for the moment, that n = p
and that X is non-singular, and consider the solution of the equations
   X b = y .                                                               (3.1)
The sensitivity of the solution b to perturbations in X is determined by the
condition number of X,

   c(X) = ||X|| ||X^{-1}||

(Wilkinson, 1963 and 1965; Forsythe and Moler, 1967). Specifically, if we perturb
X by a matrix E, then the solution of the perturbed equations

   ( X + E )( b + e ) = y

satisfies

   ||e|| / ||b + e|| ≤ c(X) ||E|| / ||X|| ,                                (3.4)

while if instead we perturb X^T X, say by a matrix G, then the corresponding
solution of the perturbed normal equations satisfies

   ||e|| / ||b + e|| ≤ c(X^T X) ||G|| / ||X^T X|| ,   with   c(X^T X) = c(X)² ,    (3.5)
so that unless c(X) = 1, which occurs only when X is orthogonal, X^T X is more
sensitive to perturbations than X. From (3.5) we once again see that perturbations
of order ε² in X^T X can have the same effect as perturbations of order ε in X.
In terms of solving a system of equations, (3.4) and (3.5) imply that if rounding
errors or data perturbations (noise) mean that we might lose t digits of accuracy,
compared to the accuracy of the data, when solving equations with X as the matrix
of coefficients, then we should expect to lose 2t digits of accuracy when solving
equations with X^T X as the coefficient matrix.
Similar remarks apply to the sensitivity of the solution of linear least squares
(multiple regression) problems when X is not square and the residual (error) vector
is small relative to the solution; once again it is advisable to avoid forming the
normal equations in order to solve the least squares problem. (Detailed analyses
can be found in Golub and Wilkinson, 1966; Lawson and Hanson, 1974; Stewart, 1977.)
We are not trying to imply that normal matrices should be avoided at all costs.
When X is close to being orthogonal then the normal matrix X^T X will be well-
conditioned, but the additional sensitivity of X^T X is a real phenomenon, not just a
figment of the numerical analyst's imagination, and since perturbations in X do not
map linearly into perturbations in X^T X, perturbation and rounding error analyses
become difficult to interpret when X^T X is used in place of X, and decisions about
rank and linear dependence (multicollinearity) are harder to make.
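As a concrete modern illustration of this squaring of the condition number (using
Python with NumPy, which is of course not part of the original discussion), one can
compare c(X) and c(X^T X) for the small matrix used at the start of this section:

    import numpy as np

    eps = 1e-4
    X = np.array([[1.0, 1.0],
                  [eps, 0.0],
                  [0.0, eps]])

    # cond(X^T X) is the square of cond(X): roughly 1.4e4 versus 2.0e8 here.
    print(np.linalg.cond(X))
    print(np.linalg.cond(X.T @ X))

    # With eps below the square root of the machine precision, 1 + eps**2
    # rounds to 1 and the computed normal matrix is exactly singular,
    # although X itself clearly has full rank.
    eps = 1e-9
    X = np.array([[1.0, 1.0], [eps, 0.0], [0.0, eps]])
    print(np.linalg.matrix_rank(X))        # 2
    print(np.linalg.matrix_rank(X.T @ X))  # 1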
4. The QU Factorization and the Singular Value Decomposition

In this section we briefly introduce and discuss two tools that allow us to avoid
forming normal matrices. These tools are two well known factorizations, the QU
factorization (or QR factorization, not to be confused with the QR algorithm)
and the singular value decomposition (commonly referred to as the SVD). For
simplicity of discussion we shall assume that n ≥ p so that X has at least as many
rows as columns. We shall also not discuss the details of the computational
algorithms for finding the factorizations, but instead give suitable references for
such descriptions. Suffice it to say that both factorizations may be obtained by
numerically stable methods and there are a number of sources of quality software
that implement these methods (IMSL, NAG, Dongarra et al, 1979; Chan, 1982).
The QU factorization of X is given by

   X = Q [ U ]                                                             (4.1)
         [ 0 ] ,

where Q is an n by n orthogonal matrix and U is a p by p upper triangular matrix.
This factorization always exists and has two properties that are important here.
Firstly

   X^T X = U^T U ,                                                         (4.2)

so the elements of X^T X can readily be computed from the inner products of columns
of U, which means that U gives a convenient and compact representation of X^T X. In
fact, as with X^T X, we need only ½p(p+1) storage locations for the non-zero elements
of U. The matrix U is the Cholesky factor of X^T X. Secondly, if we perturb U by a
matrix F then the corresponding perturbation E = Q [ F^T  0 ]^T in X satisfies
||E|| = ||F||, so that, in contrast to X^T X, working with U introduces no additional
sensitivity.
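The relationship (4.2) between the QU factorization and the Cholesky factor is easily
demonstrated numerically; the sketch below uses NumPy's qr routine, whose triangular
factor (which the text calls U) agrees with the Cholesky factor of X^T X up to the
signs of its rows. The example data are, of course, only illustrative.

    import numpy as np

    rng = np.random.default_rng(1)
    X = rng.standard_normal((50, 3))

    Q, U = np.linalg.qr(X)                  # "reduced" QU factorization: Q is 50 by 3

    # U^T U reproduces X^T X, and U agrees with the Cholesky factor of X^T X
    # apart from the signs of its rows.
    L = np.linalg.cholesky(X.T @ X)         # lower triangular, X^T X = L L^T
    print(np.allclose(U.T @ U, X.T @ X))            # True
    print(np.allclose(np.abs(U), np.abs(L.T)))      # True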
The second tool is the singular value decomposition (SVD) of X, which may be written

   X = Q [ Ψ ] P^T ,                                                       (4.5)
         [ 0 ]

where Q is an n by n orthogonal matrix, P is a p by p orthogonal matrix and Ψ is the
p by p diagonal matrix

   Ψ = diag(ψ_i) .

The ψ_i can always be chosen so that

   ψ_1 ≥ ψ_2 ≥ ... ≥ ψ_p ≥ 0                                               (4.6)

and we shall assume this to be the case. As with the QU factorization the SVD
always exists and is usually obtained by reducing X to bidiagonal form and then
applying a variant of the QR algorithm to reduce this to Ψ (Golub and Kahan,
1965; Golub and Reinsch, 1970; Wilkinson, 1977 and 1978). The ψ_i, i = 1, 2, ..., p,
are the singular values of X, the columns of P are the right singular vectors of X
and the first p columns of Q are the left singular vectors of X. We have adopted
here the notation of Stewart (1984) in avoiding the more usual σ_i for the i-th
singular value. If we denote the i-th columns of P and Q by p_i and q_i respectively
then equation (4.5) implies that

   X p_i = ψ_i q_i ,   i = 1, 2, ..., p ,                                  (4.7)

and

   X^T X = P Ψ² P^T ,                                                      (4.8)

which is the classical spectral factorization of X^T X. Thus the columns of P are
the eigenvectors of X^T X and the values ψ_i², i = 1, 2, ..., p, are the eigenvalues
of X^T X.
The SVD is important in multivariate analysis because it provides the most reliable
method of determining the numerical rank of a matrix and can be a great aid in
analysing near multicollinearities in the data.
Of course if X is exactly of rank k < p then from (4.5) and (4.6) we must have

   ψ_{k+1} = ψ_{k+2} = ... = ψ_p = 0

and X p_i = 0, i = k+1, k+2, ..., p, so that these columns of P form an orthonormal
basis for the null space of X. If X is of rank p, but we choose the matrix F in
equation (4.10) to be the diagonal matrix

   F = diag(f_i) ,   f_i = 0 ,      i = 1, 2, ..., k ,                     (4.12)
                     f_i = -ψ_i ,   i = k+1, k+2, ..., p ,

then, with E the corresponding perturbation of X given by (4.10), X + E has rank k
and

   Σ_{j=1}^{p} Σ_{i=1}^{n} e_ij² = Σ_{i=k+1}^{p} ψ_i² ,                    (4.14)

so that if the elements of E are small then the singular values ψ_{k+1}, ψ_{k+2}, ...,
ψ_p must also be small. Thus if X has near multicollinearities, then the
appropriate number of singular values of X must be small. To appreciate the
strength of this statement consider the p by p matrix
strength of this statement consider the p by p matrix
U= -I -1 ... -1 -1 -1 TM
0 1 -1 ... -1 -1 -1
0 0 1 ... -1 -1 -1
0 0 0 ... i -i -i
0 0 0 ... 0 1 -1
0 0 0 ... 0 0 1
]
U is clearly of full rank, p, but its appearance belies its closeness to a rank
deficient matrix. If we put
   E = [    0      0  ...  0 ]
       [    0      0  ...  0 ]
       [    :      :       : ]
       [    0      0  ...  0 ]
       [ -2^{2-p}  0  ...  0 ]
then the matrix (U+E) has rank (p-1), so that when p is not small U is almost rank
deficient. On the other hand (4.14) assures us that

   ψ_p ≤ 2^{2-p} ,

so that the near rank deficiency will be clearly exposed by the singular values.
For instance, when p = 32, so that 2^{2-p} = 2^{-30} ≈ 10^{-9}, the singular values of
U are approximately 20.05, 6.925, ..., 1.449, 5.280 × 10^{-10} and ψ_32 is indeed less
than 10^{-9}.
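The near rank deficiency of this matrix is easily reproduced; in double precision
(for which eps·ψ_1 is far below 2^{2-p} when p = 32) a NumPy sketch gives a smallest
computed singular value close to the value quoted above.

    import numpy as np

    p = 32
    U = np.triu(-np.ones((p, p)), k=1) + np.eye(p)   # 1 on the diagonal, -1 above it

    psi = np.linalg.svd(U, compute_uv=False)
    print(psi[-1])                 # roughly 5.28e-10, comfortably below 2**(2 - p)

    # The single-element perturbation described in the text makes U singular.
    E = np.zeros((p, p))
    E[p - 1, 0] = -2.0 ** (2 - p)
    print(np.linalg.matrix_rank(U + E))              # p - 1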
When the SVD is computed by numerically stable methods then the above remarks also
hold in the presence of rounding errors, except when the perturbations under
consideration are smaller than the machine accuracy, which is not very likely in
practice. Even then we only have to allow for the fact that computed
singular values will not usually have values less than about eps·ψ_1, where eps is
the relative machine precision, because now the machine error dominates the data
error. For example, on a VAX 11/780 in single precision, for which
eps = 2^{-24} ≈ 6 × 10^{-8}, the smallest singular value of the above matrix U, as
computed by the NAG Library routine F02WDF, was 4.726 × 10^{-8} instead of
ψ_32 = 5.280 × 10^{-10}.
For example, if U is required to be non-singular then, at a moderate extra expense,
we can compute or estimate its condition number c(U) in order to determine whether
or not U is sufficiently well-conditioned. If U is not suitable we can then
proceed to obtain the SVD of U as
   U = Q̂ Ψ P^T ,                                                          (4.15)

where Q̂ and P are orthogonal and Ψ is diagonal. From (4.5) we get that the SVD of
X is then given by

   X = Q [ Q̂  0 ] [ Ψ ] P^T                                               (4.16)
         [ 0  I ] [ 0 ]
and thus the singular values and right singular vectors of U and X are identical.
We can take advantage of the upper triangular form of U in computing its SVD and
for typical statistical data where n is considerably larger than p the time taken
will be dominated by the QU factorization of X. The NAG Library routine F02WDF is
an example that explicitly allows the user to stop at the QU factorization if U is
not too ill-conditioned.
Particularly important in some statistical and real time applications is the fact
that the QU factorization may be obtained by processing the matrix X one
observation, or a block of observations at a time, so that the complete matrix X
need not be held all at once, but can be sequentially processed to give the compact
representation U. This can be achieved by well known updating techniques using,
for example, plane rotations. (Golub, 1965; Gentleman, 1974a; Gill and Murray,
1977; Dongarra et al, 1979; Cox, 1981.)
In the next section we demonstrate that such techniques can also be used to obtain
the QU factorization of X̃ and of X̂.
5. Sequential Processing of the Data

First we note that sample means and variances can be computed sequentially and,
indeed, there are good numerical reasons for preferring to compute means and
variances this way, rather than by the traditional formulae (Chan and Lewis,
1979; West, 1979; Chan, Golub and LeVeque, 1982). If we denote the i-th
observation of the j-th variable as x_j^(i) and let y_j^(r) and x̄_j^(r) denote,
respectively, the estimated sum of squares of deviations from the mean and the
estimated mean of the first r observations, so that

   x̄_j^(r) = ( Σ_{i=1}^{r} x_j^(i) ) / r ,   y_j^(r) = Σ_{i=1}^{r} ( x̄_j^(r) - x_j^(i) )² ,
then it is readily verified that

   x̄_j^(r+1) = x̄_j^(r) + ( x_j^(r+1) - x̄_j^(r) ) / ( r + 1 )              (5.1)

and

   y_j^(r+1) = y_j^(r) + r ( x_j^(r+1) - x̄_j^(r) )² / ( r + 1 ) ,          (5.2)

so that the means and sums of squares can be accumulated one observation at a time.
Of course

   x̄_j = x̄_j^(n)   and   s_j² = y_j^(n) / ( n - 1 ) .

Now suppose that we have the QU factorization (4.1) of X and put

   f = Q^T e .                                                             (5.3)

Then

   Q^T X̃ = Q^T ( X - e x̄^T ) = [ U ] - f x̄^T = Q_1 [ Ũ ]                  (5.4)
                                [ 0 ]               [ 0 ]

and if we put

   Q̃ = Q Q_1                                                              (5.5)

then

   X̃ = Q̃ [ Ũ ]                                                            (5.6)
          [ 0 ]

is the QU factorization of X̃.
The factorization of (5.4) is a rank one update problem and there are standard
methods by which the factorization can be obtained economically (Gill and Murray,
1977; Golub and Van Loan, 1983, section 12-6). Since we can obtain the QU
factorization of X by sequentially processing the data and noting that ŨD is still
upper triangular, equations (5.1), (5.2) and (5.6) enable us to find the QU
factorizations of X̃ and of X̂, and hence the Cholesky factors of the matrices C and
R of (2.5) and (2.6), by sequentially processing the data one or more observations
at a time.
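A short sketch of this kind of one-pass updating of the means and sums of squares
(the recurrences shown above are the standard ones discussed by West (1979) and
Chan, Golub and LeVeque (1982); the Python below, and its names, are only
illustrative) is:

    import numpy as np

    def update_mean_ss(r, xbar, y, z):
        # xbar and y hold the mean and the sum of squares of deviations of the
        # first r observations; z is the (r+1)-th observation (p variables).
        delta = z - xbar
        xbar_new = xbar + delta / (r + 1)
        y_new = y + r * delta**2 / (r + 1)
        return xbar_new, y_new

    rng = np.random.default_rng(2)
    X = rng.standard_normal((200, 3))

    xbar, y = np.zeros(3), np.zeros(3)
    for r, z in enumerate(X):
        xbar, y = update_mean_ss(r, xbar, y, z)

    print(np.allclose(xbar, X.mean(axis=0)))                          # True
    print(np.allclose(y, ((X - X.mean(axis=0))**2).sum(axis=0)))      # True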
If we denote by X_r the data matrix of the first r observations, by x̄^(r) the
corresponding vector of means and by z_{r+1} the vector containing the (r+1)-th
observation, so that

   X_{r+1} = [ X_r       ]
             [ z_{r+1}^T ] ,

and take the number of elements in the vector e by context, then using (5.1)

   X̃_{r+1} = X_{r+1} - e ( x̄^(r+1) )^T
            = [ X_r       ] - e [ x̄^(r) + ( z_{r+1} - x̄^(r) ) / ( r + 1 ) ]^T ,
              [ z_{r+1}^T ]

so that

   X̃_{r+1} = [ X̃_r - ( 1/(r+1) ) e ( z_{r+1} - x̄^(r) )^T ]                (5.8)
              [ ( z_{r+1} - x̄^(r+1) )^T                   ] ,

and the QU factorization of X̃_{r+1} can thus be obtained from that of X̃_r by a rank
one update followed by the processing of one new row.
6. Solving Multiple Regression Problems
In this section we consider the application of the QU factorization and the SVD to
multiple regression, or linear least squares, and take X to denote either the data
matrix or the standardized data matrix, since the solution of a regression with one
matrix can be deduced from the solution with the other.
The problem is to choose the vector b of regression coefficients so as to

   minimize r^T r ,   where   y = X b + r .                                (6.1)

If Q is orthogonal then

   r^T r = ( Q^T r )^T ( Q^T r ) ,

so, using the QU factorization (4.1) and putting

   Q^T y = [ ŷ ] ,                                                         (6.3)
           [ w ]

where ŷ has p elements, we have

   Q^T r = [ ŷ - U b ]
           [ w       ]

and the problem becomes that of choosing b to

   minimize r̂^T r̂ ,   where   r̂ = ŷ - U b .                                (6.4)

If X has linearly independent columns then U will be non-singular and we can choose
b so that

   U b = ŷ .                                                               (6.5)
Since w is independent of b, this must be the choice of b that minimizes r̂^T r̂ and
hence r^T r (Golub, 1965; Gentleman, 1974b). For this choice
   Q^T r = [ 0 ] ,
           [ w ]

so that

   r^T r = w^T w ,                                                         (6.6)
which is information that is lost when the normal equations are formed. We need
not retain w during the factorization, but we can instead just update the sum of
squares, so that we have the single value w^T w on completion of the QU
factorization.
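In NumPy the computation just described can be sketched as follows (the 'complete'
mode of numpy.linalg.qr gives the full n by n Q, so that both ŷ and w are available;
the data and names are illustrative only):

    import numpy as np

    rng = np.random.default_rng(3)
    n, p = 60, 4
    X = rng.standard_normal((n, p))
    y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 0.1 * rng.standard_normal(n)

    Q, U_full = np.linalg.qr(X, mode='complete')   # X = Q [U; 0], Q is n by n
    U = U_full[:p, :]
    yhat = Q.T[:p] @ y                             # first p elements of Q^T y
    w = Q.T[p:] @ y                                # remaining n - p elements

    b = np.linalg.solve(U, yhat)                   # solve U b = yhat (a triangular solve)
    r = y - X @ b
    print(np.allclose(r @ r, w @ w))                              # residual sum of squares = w^T w
    print(np.allclose(b, np.linalg.lstsq(X, y, rcond=None)[0]))   # agrees with a library solver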
If X is rank deficient, so that its columns are not linearly independent, then U
will be singular. Using the notation of (4.14), if we then obtain the SVD of U,
(6.4) becomes

   minimize r̂^T r̂ ,   where   r̂ = Q̂^T ŷ - Ψ P^T b .                        (6.7)
If X has rank k then only the first k singular values will be non-zero, so let us
put
   Ψ = [ S  0 ]                                                            (6.8)
       [ 0  0 ] ,

where S = diag(ψ_1, ψ_2, ..., ψ_k) is non-singular, and correspondingly partition
Q̂^T ŷ and P^T b as

   Q̂^T ŷ = [ u ] ,   P^T b = [ ĉ ] ,                                       (6.9)
            [ v ]             [ c ]

where u and ĉ each have k elements. Then

   r̂ = [ u - S ĉ ]                                                         (6.10)
        [ v       ]

and r̂^T r̂ is minimized by choosing ĉ so that

   S ĉ = u ,                                                               (6.11)

for which choice

   r^T r = w^T w + r̂^T r̂ = w^T w + v^T v .                                 (6.12)

As c is not determined by (6.11) we can see that the solution is not unique. The
particular solution for which b^T b is also a minimum is called the minimal length
solution and from (6.9) we see that this is given by taking c = 0, so that

   b = P [ ĉ ]
         [ 0 ] .
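A sketch of the minimal length solution via the SVD is given below; here the SVD of
X is used directly, rather than the SVD of U following a QU factorization, but the
resulting b is the same minimal length solution. The numerical rank is decided by a
simple tolerance on the singular values, and the tolerance and function name are
illustrative rather than prescriptive.

    import numpy as np

    def minimal_length_solution(X, y, tol=None):
        # SVD of X: columns of Q1 are left singular vectors, rows of PT are
        # right singular vectors, psi holds the singular values.
        Q1, psi, PT = np.linalg.svd(X, full_matrices=False)
        if tol is None:
            tol = max(X.shape) * np.finfo(float).eps * psi[0]
        k = int(np.sum(psi > tol))              # numerical rank
        c_hat = (Q1[:, :k].T @ y) / psi[:k]     # solve S c_hat = u
        return PT[:k].T @ c_hat                 # b = P [c_hat; 0]

    rng = np.random.default_rng(4)
    X = rng.standard_normal((40, 5))
    X[:, 4] = X[:, 0] + X[:, 1]                 # an exact linear dependence
    y = rng.standard_normal(40)

    b = minimal_length_solution(X, y)
    print(np.allclose(b, np.linalg.pinv(X) @ y))   # agrees with the pseudo-inverse solution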
In practice X will not be exactly rank deficient and the computed singular values
will not be exactly zero and while it is not always easy to decide upon the
numerical rank of X, (Golub, Klema and Stewart, 1976; Stewart, 1979; Klema and
Laub, 1980; Stewart, 1984) equations (4.10) - (4.13) tell us about the effects of
neglecting small singular values. Furthermore (4.7) gives
   ||X p_i|| = ψ_i ,                                                       (6.14)
so that if ψ_i is small then a change in b in the direction p_i produces only a small
change in X b, and the corresponding combination of the regression coefficients is
poorly determined. The estimated variance-covariance matrix of the sample regression
coefficients is

   V = s² ( X^T X )^{-1} ,                                                 (6.15)

where s² is the estimated variance of the residual, and from (4.2) this becomes

   V = s² ( U^T U )^{-1} = s² U^{-1} U^{-T}                                (6.16)
and the element v_ij is given by

   v_ij = s² e_i^T U^{-1} U^{-T} e_j = s² f_i^T f_j ,   U^T f_j = e_j ,    (6.17)

where e_i is the i-th column of the unit matrix. In particular the diagonal
elements v_ii are the estimated variances of the sample regression coefficients and
these can efficiently be computed from

   v_ii = s² f_i^T f_i ,   U^T f_i = e_i ,

so that each v_ii requires only the solution of a single triangular system.
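The estimated variances of the coefficients can thus be obtained from triangular
solves with U, without ever forming (X^T X)^{-1}; a NumPy sketch follows
(np.linalg.solve is used for brevity where a dedicated triangular solver could
exploit the structure, and s² is taken as the residual sum of squares over n - p):

    import numpy as np

    rng = np.random.default_rng(5)
    n, p = 80, 3
    X = rng.standard_normal((n, p))
    y = X @ np.array([2.0, 0.0, -1.0]) + rng.standard_normal(n)

    Q, U = np.linalg.qr(X)                         # reduced QU factorization
    b = np.linalg.solve(U, Q.T @ y)
    r = y - X @ b
    s2 = (r @ r) / (n - p)                         # estimated residual variance

    # v_ii = s^2 f_i^T f_i with U^T f_i = e_i: one triangular solve per column.
    F = np.linalg.solve(U.T, np.eye(p))            # column i holds f_i
    v = s2 * (F * F).sum(axis=0)

    print(np.allclose(v, np.diag(s2 * np.linalg.inv(X.T @ X))))   # True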
When X is not of full rank and a minimal length solution has been obtained, then in
place of (6.15) we must take

   V = s² ( X^T X )^+ ,                                                    (6.18)

where the superscript + denotes the Moore-Penrose pseudo-inverse, and from (4.8)

   V = s² ( P Ψ² P^T )^+ = s² P ( Ψ² )^+ P^T = s² P [ S^{-2}  0 ] P^T ,
                                                    [   0     0 ]

so that if we partition P as

   P = [ P_1  P_2 ] ,   P_1  p by k ,

then

   v_ij = s² f_i^T f_j ,   S f_j = P_1^T e_j .                             (6.19)
If we predict the value of the dependent variable at a point x as x^T b, then the
estimated variance of this prediction, σ̂² say, is given by

   σ̂² = x^T V x                                                            (6.20)

and this can be computed from

   σ̂² = s² f^T f ,   S f = P_1^T x .                                        (6.22)
So far we have implicitly assumed that the elements of the residual vector r are
uncorrelated and have common variance. More generally, suppose that

   E( r r^T ) = σ² W ,

where W is a symmetric non-negative definite matrix, and that W is factorized as

   W = F F^T .

If we put

   F e = r ,

then the regression problem becomes the generalized least squares problem

   minimize e^T e   subject to   y = X b + F e                             (6.24)
and now W is not required to be non-singular. Methods for the solution of (6.24)
based on the QU factorization and on the SVD have been discussed by Paige (1978,
1979a, 1979b) and by Kourouklis and Paige (1981). To briefly indicate how the SVD
may be used, partition Q as
   Q = [ Q_1  Q_2 ] ,

where Q_1 has p columns. Then, from (4.5), y = X b + F e gives

   Ψ P^T b = Q_1^T y - Q_1^T F e ,
   Q_2^T y = Q_2^T F e ;

the second of these equations does not involve b, so the minimum length e can be
found from it and b is then recovered from the first.
An alternative model, which allows for errors in X as well as in y, is the total
least squares problem

   minimize || ( E, r ) ||_F² ,   where   y = ( X + E ) b + r              (6.25)

and where ||X||_F² = Σ_{j=1}^{p} Σ_{i=1}^{n} x_ij² denotes the square of the
Frobenius norm of X. If we assume that X is of full rank and put

   Z = ( X, -y ) ,   F = ( E, r ) ,

then (6.25) becomes the problem of choosing the F of minimum Frobenius norm for
which

   ( Z + F ) [ b ] = 0 ,
             [ 1 ]

and so we require the minimum perturbation that makes Z rank deficient with
[ b^T  1 ]^T in the null space of (Z+F). If we let the SVD of Z be

   Z = Q [ Ψ ] P^T ,   Ψ = diag( ψ_1, ψ_2, ..., ψ_{p+1} ) ,
         [ 0 ]

where Q, P and the ψ_i now refer to Z, and put

   F = Q [ diag( 0, ..., 0, -ψ_{p+1} ) ] P^T ,
         [             0               ]
then F makes (Z+F) rank deficient and is of minimum norm (Golub and Van Loan, 1983,
corollary 2.3-3). If the (p+1)-th right singular vector (last column of P) is
denoted by [ p^T  ρ ]^T, and if ρ ≠ 0, then it is readily verified that the
regression coefficients are given by
   b = ( 1/ρ ) p .                                                         (6.27)
For further details see Golub and Van Loan (1980, 1983) and for discussion of the
case where ρ = 0 and a comparison with standard regression see Van Huffel,
Vandewalle and Staar (1984).
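A sketch of the total least squares computation along these lines is given below;
the function name is illustrative, and the small check at the end simply recovers
known coefficients from lightly perturbed data.

    import numpy as np

    def total_least_squares(X, y):
        # SVD of Z = (X, -y); the right singular vector belonging to the smallest
        # singular value is [p^T, rho]^T and, provided rho is non-zero, b = p / rho.
        Z = np.hstack([X, -y[:, None]])
        _, _, VT = np.linalg.svd(Z, full_matrices=False)
        v = VT[-1]
        rho = v[-1]
        if abs(rho) < np.finfo(float).eps:
            raise ValueError("rho is numerically zero; see Van Huffel et al. (1984)")
        return v[:-1] / rho

    rng = np.random.default_rng(6)
    X = rng.standard_normal((100, 3))
    y = X @ np.array([1.0, 2.0, -1.0]) + 0.01 * rng.standard_normal(100)
    print(total_least_squares(X, y))     # close to (1, 2, -1)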
7. Principal Components and Canonical Correlations

In this section we give a very brief mention of two further applications of the SVD
in multivariate analysis.
Given a zero mean data matrix X̃, possibly standardized, the aim of a principal
component analysis is to determine an orthogonal transformation of the columns of X̃
to a data matrix Y whose columns have non-increasing variance, each column of Y
having as large a variance as possible.
It is well known, and in any case readily established from the Courant-Fischer
theorem (Wilkinson, 1965, chapter 2, section 43), that Y is given by
   Y = X̃ P ,                                                               (7.1)

where P is the matrix of eigenvectors of X̃^T X̃. From (4.8) and (4.5) we therefore
see that P is the matrix of right singular vectors of X̃ and that

   Y = Q [ Ψ ] ,                                                            (7.2)
         [ 0 ]

so that the j-th column of Y is given by

   y_j = ψ_j q_j ,                                                          (7.3)

where q_j is the j-th left singular vector of X̃, and the estimated variance of y_j is
ψ_j²/(n-1). Again, we can avoid the formation of X̃^T X̃ and the SVD allows us to
properly use our judgement as to which components are significant.
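In NumPy terms the principal components and their variances come directly from the
SVD of the zero mean data matrix, without forming X̃^T X̃; a brief, purely illustrative
sketch is:

    import numpy as np

    rng = np.random.default_rng(7)
    X = rng.standard_normal((200, 5)) @ rng.standard_normal((5, 5))   # correlated columns
    Xtilde = X - X.mean(axis=0)                                       # zero mean data matrix

    Q, psi, PT = np.linalg.svd(Xtilde, full_matrices=False)
    Y = Xtilde @ PT.T                      # principal components, Y = Xtilde P
    variances = psi**2 / (len(X) - 1)      # estimated variance of each component

    # The same variances are the eigenvalues of the covariance matrix, which is
    # formed here only to check the result.
    evals = np.linalg.eigvalsh(np.cov(X, rowvar=False))[::-1]
    print(np.allclose(variances, evals))   # True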
Given two zero mean data matrices X and Y, the canonical correlation problem is to
find a transformation, A, of the columns of Y,

   Z = Y A ,                                                                (7.4)
such that the columns of Z are orthonormal and such that regression of a column of
Z on X maximizes the multiple correlation coefficient.
To see how the SVD may be used, let

   X = Q_x [ Ψ_x ] P_x^T ,   Y = Q_y [ Ψ_y ] P_y^T                          (7.5)
           [ 0   ]                   [ 0   ]

and let

   C = Ψ_y P_y^T A .                                                        (7.6)

This gives

   Z = Y A = Q_y [ C ] ,
                 [ 0 ]

and the requirements on Z lead to a principal component problem involving the
leading columns of Q_x and Q_y, the singular values of the corresponding product
being the canonical correlations.
A full discussion of this and related topics is given by Björck and Golub (1973).
See also Golub and Van Loan (1983, section 12.4). When X and Y are both of full
rank then we can use the QU factorization of X and Y in place of their SVD's
(Golub, 1969).
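Following the Björck and Golub (1973) route mentioned above, the canonical
correlations can be computed from orthonormal bases of the two column spaces; the
sketch below (illustrative only) uses QU factorizations of the centred matrices and
then the SVD of the small cross product of the orthogonal factors.

    import numpy as np

    def canonical_correlations(X, Y):
        # Orthonormal bases of the column spaces of the zero mean data matrices;
        # the singular values of Qx^T Qy are the canonical correlations
        # (the cosines of the principal angles between the two subspaces).
        Qx, _ = np.linalg.qr(X - X.mean(axis=0))
        Qy, _ = np.linalg.qr(Y - Y.mean(axis=0))
        return np.linalg.svd(Qx.T @ Qy, compute_uv=False)

    rng = np.random.default_rng(8)
    X = rng.standard_normal((150, 3))
    Y = np.hstack([X[:, :1] + 0.1 * rng.standard_normal((150, 1)),   # related to X
                   rng.standard_normal((150, 2))])                   # unrelated columns
    print(canonical_correlations(X, Y))   # first value close to 1, the rest reflect only chance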
8. The Generalized Singular Value Decomposition
For the data matrix pair (X, Y), each with p columns, the generalized singular value
decomposition (GSVD) provides a factorization of the form

   X = Q_x [ Ψ_x ] W ,   Y = Q_y [ Ψ_y ] W ,                                (8.1)
           [ 0   ]               [ 0   ]

where Q_x and Q_y are orthogonal, W is a non-singular p by p matrix and Ψ_x and Ψ_y
are diagonal matrices,

   Ψ_x = diag(α_i) ,   Ψ_y = diag(β_i) ,   with   α_i² + β_i² = 1 ,   α_i ≥ 0 ,   β_i ≥ 0 .    (8.2)
The pairs (α_i, β_i) are called the generalized singular values and can be chosen with
the α_i in descending order and the β_i in ascending order (Van Loan, 1976). This
case and the unrestricted case are discussed by Paige and Saunders (1981) and we
strongly recommend reference to their paper.
The columns of W^{-1} are the eigenvectors of the generalized eigenvalue problem
X^T X a = λ Y^T Y a and the (α_i/β_i)² are the corresponding eigenvalues. Thus, just
as the SVD allows us to avoid the numerically damaging step of forming X^T X, the
GSVD allows us to avoid the numerically damaging step of forming the pair
(X^T X, Y^T Y).
Unlike the SVD there is not yet quality software available for computing the GSVD,
but numerically stable algorithms are beginning to emerge (Stewart, 1983; Paige,
1984b) and such software will surely be available in the near future. This will
mean that we can use the natural tool for tackling multivariate problems involving
matrix pairs (X,Y), rather than using the SVD, which is really only the natural
tool when a single data matrix is involved.
Two such examples are the generalized least squares problem and the canonical
correlation problem, discussed in the previous two sections. Paige (1984a) has
given an elegant analysis of the generalized least squares problem in terms of the
GSVD, and for the canonical correlation problem it can readily be shown that we
simply have to replace the Q_x and Q_y of (7.5) by those of (8.1) and then we again
solve the principal component problem for the matrix Q_x^T Q_y.
9. Acknowledgement
This is a revised version of an article that first appeared as "How to live without
covariance matrices: Numerical stability in multivariate statistical analysis", in
NAG Newsletter 1/83.
10. References
BJÖRCK, Å. and GOLUB, G.H. (1973). Numerical methods for computing angles between
linear subspaces. Maths. of Computation, 27, 579-594.
CHAMBERS, J.M. (1977). Computational Methods for Data Analysis. Wiley, New York.
CHAN, T.F. (1982). Algorithm 581: An improved algorithm for computing the singular
value decomposition. ACM Trans. Math. Softw. 8, 84-88.
CHAN, T.F., GOLUB, G.H. and LEVEQUE, R.J. (1982). Updating formulae and a pairwise
algorithm for computing sample variances. In "COMPSTAT '82, Part I: Proceedings in
Computational Statistics". Eds. Caussinus, H., Ettinger, P. and Tomassone, R.
Physica-Verlag, Vienna.
CHAN, T.F. and LEWIS, J.G. (1979). Computing standard deviations: Accuracy. Comm.
ACM, 22, 526-531.
COX, M.G. (1981). The least squares solution of overdetermined linear equations
having band or augmented band structure. IMA J. Num. Anal., 1, 3-22.
DONGARRA, J.J., MOLER, C.B., BUNCH, J.R. and STEWART, G.W. (1979). Linpack Users'
Guide. SIAM, Philadelphia.
FORSYTHE, G.E. and MOLER, C.B. (1967). Computer Solution of Linear Algebraic
Systems. Prentice-Hall, New Jersey.
GENTLEMAN, W.M. (1974a). Basic procedures for large, sparse or weighted linear
least squares problems. Appl. Statist., 23, 448-454.
GOLUB, G.H. and KAHAN, W. (1965). Calculating the singular values and pseudo-
inverse of a matrix. SIAM J. Num. Anal., 2, 202-224.
GOLUB, G.H., KLEMA, V.C. and STEWART, G.W. (1976). Rank degeneracy and least
squares problems. Technical Report STAN-CS-76-559, Stanford University, Stanford,
CA 94305, USA.
GOLUB, G.H. and REINSCH, C. (1970). Singular value decomposition and least squares
solutions. Num. Math., 14, 403-420.
GOLUB, G.H. and VAN LOAN, C.F. (1980). An analysis of the total least squares
problem. SIAM J. Num. Anal., 17, 883-893.
GOLUB, G.H. and VAN LOAN, C.F. (1983). Matrix Computations. North Oxford
Academic, Oxford.
GOLUB, G.H. and WILKINSON, J.H. (1966). Note on iterative refinement of least
squares solutions. Num. Math., 9, 139-148.
HAMMARLING, S.J., LONG, E.M.R. and MARTIN, D.W. (1983). A generalized linear least
squares algorithm for correlated observations, with special reference to degenerate
data. NPL Report DITC 33/83. National Physical Laboratory, Teddington, Middlesex,
TW11 0LW, UK.
KLEMA, V.C. and LAUB, A.J. (1980). The singular value decomposition: its
computation and some applications. IEEE Trans. Automat. Control, AC-25, 164-176.
KOUROUKLIS, S. and PAIGE, C.C. (1981). A constrained least squares approach to the
general Gauss-Markov linear model. J. American Statistical Assoc., 76, 620-625.
LAWSON, C.L. and HANSON, R.J. (1974). Solving Least Squares Problems. Prentice-
Hall, New Jersey.
NAG, Numerical Algorithms Group, NAG Central Office, 256 Banbury Road, Oxford,
OX2 7DE, UK.
PAIGE, C.C. (1979). Fast numerically stable computations for generalized linear
least squares problems. SIAM J. Num. Anal., 16, 165-171.
PAIGE, C.C. (1984a). The general linear model and the generalized singular value
decomposition. School of Computer Science, McGill University, Montreal, Quebec,
Canada, H3A 2K6. (Submitted to Linear Algebra Applic., Special Issue on
Statistics.)
PAIGE, C.C. and SAUNDERS, M.A. (1981). Towards a generalized singular value
decomposition. SIAM J. Num. Anal., 18, 398-405.
PETERS, G. and WILKINSON, J.H. (1970). The least squares problem and pseudo-
inverses. Computer J., 309-316.
STEWART, G.W. (1974). Introduction to Matrix Computations. Academic Press, New York.
STEWART, G.W. (1983). A method for computing the generalized singular value
decomposition. In "Matrix Pencils". Eds. Kagstrom, B. and Ruhe, A., Springer-
Verlag, Berlin.
STEWART, G.W. (1984). Rank degeneracy. SIAM J. Sci. Stat. Comput., 5, 403-413.
VAN HUFFEL, S., VANDEWALLE, J. and STAAR, J. (1984). The total linear least
squares problem: Formulation, algorithm and applications. Katholieke Universiteit
Leuven, ESAT Laboratory, 3030 Heverlee, Belgium.
VAN LOAN, C.F. (1976). Generalizing the singular value decomposition. SIAM J.
Num. Anal., 13, 76-83.
WEST, D.H.D. (1979). Updating mean and variance estimates: An improved method.
Comm. ACM, 22, 532-535.