Chapter 1
Introduction
Robotics and computer vision are interdisciplinary subjects at the intersection of engineering and computer science. By their nature, they deal with both computers and the physical world. Although the former are part of the latter, the workings of computers are best described in the black-and-white vocabulary of discrete mathematics, which is foreign to most classical models of reality, quantum physics notwithstanding. This class surveys some of the key tools of applied mathematics to be used at the interface of the continuous and the discrete. It is not a class on robotics or computer vision: these subjects evolve rapidly, but their mathematical foundations remain. Even if you will not pursue either field, the mathematics that you learn in this class will not go to waste. To be sure, applied mathematics is a discipline in itself and, in many universities, a separate department. Consequently, this class can be a quick tour at best. It does not replace calculus or linear algebra, which are assumed as prerequisites, nor is it a comprehensive survey of applied mathematics. What is covered is a compromise between the time available and what is useful and fun to talk about. Even if in some cases you may have to wait until you take a robotics or vision class to fully appreciate the usefulness of a particular topic, I hope that you will enjoy studying these subjects in their own right.
1.2 Syllabus
Here is the ideal syllabus, but how much we cover depends on how fast we go.

1. Introduction

2. Unknown numbers
   2.1 Algebraic linear systems
       2.1.1 Characterization of the solutions to a linear system
       2.1.2 Gaussian elimination
       2.1.3 The Singular Value Decomposition
       2.1.4 The pseudoinverse
   2.2 Function optimization
       2.2.1 Newton and Gauss-Newton methods
       2.2.2 Levenberg-Marquardt method
       2.2.3 Constraints and Lagrange multipliers

3. Unknown functions of one real variable
   3.1 Ordinary differential linear systems
       3.1.1 Eigenvalues and eigenvectors
       3.1.2 The Schur decomposition
       3.1.3 Ordinary differential linear systems
       3.1.4 The matrix zoo
       3.1.5 Real, symmetric, positive-definite matrices
   3.2 Statistical estimation
       3.2.1 Linear estimation
       3.2.2 Weighted least squares
       3.2.3 The Kalman filter

4. Unknown functions of several variables
   4.1 Tensor fields of several variables
       4.1.1 Grad, div, curl
       4.1.2 Line, surface, and volume integrals
       4.1.3 Green's theorem and potential fields of two variables
       4.1.4 Stokes' and divergence theorems and potential fields of three variables
       4.1.5 Diffusion and flow problems
       4.2.1 Finite differences
       4.2.2 Direct versus iterative solution methods
       4.2.3 Jacobi and Gauss-Seidel iterations
       4.2.4 Successive overrelaxation
1.4 Books
The class will be based on these lecture notes, and additional notes handed out when necessary. Other useful references include the following.

R. Courant and D. Hilbert, Methods of Mathematical Physics, Volumes I and II, John Wiley and Sons, 1989.
D. A. Danielson, Vectors and Tensors in Engineering and Physics, Addison-Wesley, 1992.
J. W. Demmel, Applied Numerical Linear Algebra, SIAM, 1997.
A. Gelb et al., Applied Optimal Estimation, MIT Press, 1974.
P. E. Gill, W. Murray, and M. H. Wright, Practical Optimization, Academic Press, 1993.
G. H. Golub and C. F. Van Loan, Matrix Computations, 2nd edition, Johns Hopkins University Press, 1989, or 3rd edition, 1997.
W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling, Numerical Recipes in C, 2nd edition, Cambridge University Press, 1992.
G. Strang, Introduction to Applied Mathematics, Wellesley-Cambridge Press, 1986.
A. E. Taylor and W. R. Mann, Advanced Calculus, 3rd edition, John Wiley and Sons, 1983.
L. N. Trefethen and D. Bau, III, Numerical Linear Algebra, SIAM, 1997.
Chapter 2

Algebraic Linear Systems

An algebraic linear system is a set of m equations in n unknown scalars, which appear linearly. Without loss of generality, such a system can be written as

    A x = b    (2.1)
where A is an m × n matrix, x is an n-dimensional vector that collects all of the unknowns, and b is a known vector of dimension m. In this chapter, we only consider the cases in which the entries of A, b, and x are real numbers.

Two reasons are usually offered for the importance of linear systems. The first is apparently deep, and refers to the principle of superposition of effects. For instance, in dynamics, superposition of forces states that if force f1 produces acceleration a1 (both possibly vectors) and force f2 produces acceleration a2, then the combined force f1 + f2 produces acceleration a1 + a2. This is Newton's second law of dynamics, although in a formulation less common than the equivalent f = ma. Because Newton's laws are at the basis of the entire edifice of mechanics, linearity appears to be a fundamental principle of Nature. However, like all physical laws, Newton's second law is an abstraction, and ignores viscosity, friction, turbulence, and other nonlinear effects. Linearity, then, is perhaps more in the physicist's mind than in reality: if nonlinear effects can be ignored, physical phenomena are linear!

A more pragmatic explanation is that linear systems are the only ones we know how to solve in general. This argument, which is apparently more shallow than the previous one, is actually rather important. Here is why. Given two algebraic equations in two variables,
    f(x, y) = 0
    g(x, y) = 0

we can eliminate, say, y and obtain the equivalent system

    F(x) = 0
    y = h(x) .
Thus, the original system is as hard to solve as it is to find the roots of the polynomial F in a single variable. Unfortunately, if f and g have degrees d_f and d_g, the polynomial F generically has degree d_f d_g. Thus, the degree of a system of equations is, roughly speaking, the product of the degrees. For instance, a system of m quadratic equations corresponds to a polynomial of degree 2^m. The only case in which the exponential is harmless is when its base is 1, that is, when the system is linear.

In this chapter, we first review a few basic facts about vectors in sections 2.1 through 2.4. More specifically, we develop enough language to talk about linear systems and their solutions in geometric terms. In contrast with the promise made in the introduction, these sections contain quite a few proofs. This is because a large part of the course material is based on these notions, so we want to make sure that the foundations are sound. In addition, some of the proofs lead to useful algorithms, and some others prove rather surprising facts. Then, in section 2.5, we characterize the solutions of linear algebraic systems.
2.1 Linear (In)dependence

Given n m-dimensional vectors a_1, ..., a_n and n real numbers x_1, ..., x_n, the expression

    \sum_{j=1}^{n} x_j a_j    (2.2)

is said to be a linear combination of a_1, ..., a_n with coefficients x_1, ..., x_n. The vectors a_1, ..., a_n are linearly dependent if they admit the null vector as a nonzero linear combination. In other words, they are linearly dependent if there is a set of coefficients x_1, ..., x_n, not all of which are zero, such that

    \sum_{j=1}^{n} x_j a_j = 0 .    (2.3)
For later reference, it is useful to rewrite the last two equalities in a different form. Equation (2.2) is the same as

    A x = b    (2.4)

and equation (2.3) is the same as

    A x = 0    (2.5)

where

    A = [ a_1  ···  a_n ] ,    x = ( x_1, ..., x_n )^T ,    b = ( b_1, ..., b_m )^T .
If you are not convinced of these equivalences, take the time to write out the components of each expression for a small example. This is important. Make sure that you are comfortable with this. Thus, the columns of a matrix A are dependent if there is a nonzero solution to the homogeneous system (2.5). Vectors that are not dependent are independent.

Theorem 2.1.1 The vectors a_1, ..., a_n are linearly dependent iff at least one of them is a linear combination of the others.

Proof. In one direction, dependency means that there is a nonzero vector x such that

    \sum_{j=1}^{n} x_j a_j = 0 .

Let x_k be one of the nonzero coefficients. Then

    \sum_{j=1}^{n} x_j a_j = x_k a_k + \sum_{j=1, j≠k}^{n} x_j a_j = 0    (2.6)

so that

    a_k = − (1/x_k) \sum_{j=1, j≠k}^{n} x_j a_j ,

that is, a_k is a linear combination of the others.

In the other direction, if

    a_k = \sum_{j=1, j≠k}^{n} x_j a_j

for some k, then

    \sum_{j=1}^{n} x_j a_j = 0

by letting x_k = −1 (so that not all of the coefficients are zero). □

We can make the first part of the proof above even more specific, and state the following

Lemma 2.1.2 If n nonzero vectors a_1, ..., a_n are linearly dependent, then at least one of them is a linear combination of the ones that precede it.

Proof. Just let k be the last of the nonzero coefficients x_j in equation (2.6) (k > 1, because otherwise x_1 a_1 = 0 with x_1 ≠ 0 and a_1 ≠ 0, a contradiction). Then x_j = 0 for j > k, and

    a_k = − (1/x_k) \sum_{j<k} x_j a_j

as desired. □
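As a quick numerical aside (not part of the original notes), linear dependence of the columns of a matrix can be tested with MATLAB's rank and null functions; the 3 × 3 matrix below is a hypothetical example whose columns are dependent.

% The columns of A are dependent exactly when the homogeneous system A*x = 0
% has a nonzero solution, i.e. when rank(A) is smaller than the number of columns.
A = [1 2 3; 4 5 6; 7 8 9];     % hypothetical example
n = size(A, 2);
if rank(A) < n
    x = null(A);               % basis for the null space: nonzero x with A*x = 0
    disp('columns are linearly dependent'); disp(x);
else
    disp('columns are linearly independent');
end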
2.2 Basis
A set a_1, ..., a_n is said to be a basis for a set B of vectors if the a_j are linearly independent and every vector in B can be written as a linear combination of them. B is said to be a vector space if it contains all the linear combinations of its basis vectors. In particular, this implies that every linear space contains the zero vector. The basis vectors are said to span the vector space.

Theorem 2.2.1 Given a vector b in the vector space B and a basis a_1, ..., a_n for B, the coefficients x_1, ..., x_n such that

    b = \sum_{j=1}^{n} x_j a_j

are uniquely determined.

Proof. Let also

    b = \sum_{j=1}^{n} x'_j a_j .

Then,

    0 = b − b = \sum_{j=1}^{n} x_j a_j − \sum_{j=1}^{n} x'_j a_j = \sum_{j=1}^{n} (x_j − x'_j) a_j

but because the a_j are linearly independent, this is possible only when x_j = x'_j for j = 1, ..., n. □

The previous theorem is a very important result. An equivalent formulation is the following:

    If the columns a_1, ..., a_n of A are linearly independent and the system Ax = b admits a solution, then the solution is unique.
² The symbol □ marks the end of a proof.
Theorem 2.2.2 Two different bases for the same vector space B have the same number of vectors.

Proof. Let a_1, ..., a_n and a'_1, ..., a'_{n'} be two different bases for B. Then each a'_j is in B (why?), and can therefore be written as a linear combination of a_1, ..., a_n. Consequently, the vectors of the set

    G = { a'_1, a_1, ..., a_n }

must be linearly dependent. We call a set of vectors that contains a basis for B a generating set for B. Thus, G is a generating set for B.

The rest of the proof now proceeds as follows: we keep removing a vectors from G and replacing them with a' vectors in such a way as to keep G a generating set for B. Then we show that we cannot run out of a vectors before we run out of a' vectors, which proves that n ≥ n'. We then switch the roles of the a and a' vectors to conclude that n' ≥ n. This proves that n = n'.

From lemma 2.1.2, one of the vectors in G is a linear combination of those preceding it. This vector cannot be a'_1, since it has no other vectors preceding it. So it must be one of the a_j vectors. Removing the latter keeps G a generating set, since the removed vector depends on the others. Now we can add a'_2 to G, writing it right after a'_1:

    G = { a'_1, a'_2, ... } .

Let us continue this procedure until we run out of either a vectors to remove or a' vectors to add. The a vectors cannot run out first. Suppose in fact per absurdum that G is now made only of a' vectors, and that there are still left-over a' vectors that have not been put into G. Since the a' vectors form a basis, they are mutually linearly independent. Since B is a vector space, all the a' vectors are in B. But then G cannot be a generating set, since the vectors in it cannot generate the left-over a' vectors, which are independent of those in G. This is absurd, because at every step we have made sure that G remains a generating set. Consequently, we must run out of a' vectors first (or simultaneously with the last a vector). That is, n ≥ n'.

Now we can repeat the whole procedure with the roles of a vectors and a' vectors exchanged. This shows that n' ≥ n, and the two results together imply that n = n'. □

A consequence of this theorem is that any basis for R^m has m vectors. In fact, consider the basis of elementary vectors

    e_j = j-th column of the m × m identity matrix .

Any vector

    b = ( b_1, ..., b_m )^T

can be written as

    b = \sum_{j=1}^{m} b_j e_j

and the e_j are clearly independent. Since this elementary basis has m vectors, theorem 2.2.2 implies that any other basis for R^m has m vectors.

Another consequence of theorem 2.2.2 is that n vectors of dimension m < n are bound to be dependent, since any basis for R^m can only have m vectors. Since all bases for a space have the same number of vectors, it makes sense to define the dimension of a space as the number of vectors in any of its bases.
2.3 Inner Product and Orthogonality

In this section we establish the basic geometric notions of length, orthogonality, and projection for vectors. The fundamental geometric fact that is assumed to be known is the law of cosines: given a triangle with sides of length a, b, c (see figure 2.1), we have

    a^2 = b^2 + c^2 − 2 b c cos θ

where θ is the angle between the sides of length b and c. A special case of this law is Pythagoras' theorem, obtained when θ = ±π/2.

Figure 2.1: The law of cosines states that a^2 = b^2 + c^2 − 2 b c cos θ.
In the previous section we saw that any vector in R^m can be written as the linear combination

    b = \sum_{j=1}^{m} b_j e_j    (2.7)

of the elementary vectors that point along the coordinate axes. The length of these elementary vectors is clearly one, because each of them goes from the origin to the unit point of one of the axes. Also, any two of these vectors form a 90-degree angle, because the coordinate axes are orthogonal by construction. How long is b? From equation (2.7) we obtain

    b = b_1 e_1 + \sum_{j=2}^{m} b_j e_j

and the two vectors b_1 e_1 and \sum_{j=2}^{m} b_j e_j are orthogonal, so Pythagoras' theorem gives

    ||b||^2 = ||b_1 e_1||^2 + || \sum_{j=2}^{m} b_j e_j ||^2 = b_1^2 + || \sum_{j=2}^{m} b_j e_j ||^2 .

Pythagoras' theorem can now be applied again to the last sum by singling out its first term b_2 e_2, and so forth. In conclusion,

    ||b||^2 = \sum_{j=1}^{m} b_j^2 .
This result extends Pythagoras' theorem to m dimensions. If we define the inner product of two m-dimensional vectors as follows:

    b^T c = \sum_{j=1}^{m} b_j c_j

then

    ||b||^2 = b^T b .    (2.8)

Thus, the squared length of a vector is the inner product of the vector with itself. Here and elsewhere, vectors are column vectors by default, and the symbol ^T makes them into row vectors.
Theorem 2.3.1

    b^T c = ||b|| ||c|| cos θ

where θ is the angle between b and c.

Proof. The law of cosines applied to the triangle with sides ||b||, ||c||, and ||b − c|| yields

    ||b − c||^2 = ||b||^2 + ||c||^2 − 2 ||b|| ||c|| cos θ

and from equation (2.8) we obtain

    b^T b + c^T c − 2 b^T c = b^T b + c^T c − 2 ||b|| ||c|| cos θ ,

that is,

    b^T c = ||b|| ||c|| cos θ

as desired. □

Corollary 2.3.2 Two nonzero vectors b and c in R^m are mutually orthogonal iff b^T c = 0.

Proof. When θ = ±π/2, the previous theorem yields b^T c = 0; conversely, if b^T c = 0 and b, c are nonzero, then cos θ = 0, so θ = ±π/2. □
Given two vectors b and c applied to the origin, the projection of b onto c is the vector from the origin to the point p on the line through c that is nearest to the endpoint of b. See figure 2.2.

Figure 2.2: The vector from the origin to point p is the projection of b onto c. The line from the endpoint of b to p is orthogonal to c.
Theorem 2.3.3 The projection of b onto c is the vector

    p = P_c b

where P_c is the following square matrix:

    P_c = (c c^T) / (c^T c) .

Proof. Since by definition point p is on the line through c, the projection vector p has the form p = ac, where a is some real number. From elementary geometry, the line between p and the endpoint of b is shortest when it is orthogonal to c:

    c^T (b − a c) = 0

which yields

    a = (c^T b) / (c^T c)

so that

    p = a c = c a = (c c^T / c^T c) b

as advertised. □
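A small MATLAB illustration of theorem 2.3.3 (not part of the original notes; the vectors b and c are hypothetical examples):

% Projection of b onto c via the matrix P_c = c*c'/(c'*c).
b = [2; 1; 0];
c = [1; 1; 1];
Pc = (c * c') / (c' * c);    % projection matrix onto the line through c
p  = Pc * b;                 % projection of b onto c
orthogonality_check = c' * (b - p)   % should be (numerically) zero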
2.4 Orthogonal Subspaces and the Rank of a Matrix

Theorem 2.4.1 If a_1, ..., a_n is a basis for a subspace A of R^m, then it can be extended into a basis for R^m by adding m − n suitable vectors a_{n+1}, ..., a_m.

Proof. If n = m we are done. If n < m, the given basis cannot generate all of R^m, so there must be a vector, call it a_{n+1}, that is linearly independent of a_1, ..., a_n. This argument can be repeated until the basis spans all of R^m, that is, until m = n. □

Theorem 2.4.2 (Gram-Schmidt) Given n vectors a_1, ..., a_n, the following procedure computes a set of orthonormal³ vectors q_1, ..., q_r that span the same space as a_1, ..., a_n:

    r = 0
    for j = 1, ..., n
        a'_j = a_j − \sum_{l=1}^{r} (q_l^T a_j) q_l
        if ||a'_j|| ≠ 0
            r = r + 1
            q_r = a'_j / ||a'_j||
        end
    end

³ Orthonormal means orthogonal and with unit norm.
Proof. We first prove by induction on r that the vectors q_r produced by the procedure are mutually orthonormal. If r = 1, there is little to prove: the normalization in the above procedure ensures that q_1 has unit norm. Let us now assume that the procedure above has been performed a number j − 1 of times sufficient to find r − 1 vectors q_1, ..., q_{r−1}, and that these vectors are orthonormal (the inductive assumption). Then for any i < r we have

    q_i^T a'_j = q_i^T a_j − \sum_{l=1}^{r−1} (q_l^T a_j) q_i^T q_l = 0

because the term q_i^T a_j cancels the i-th term (q_i^T a_j) q_i^T q_i of the sum (remember that q_i^T q_i = 1), and the inner products q_i^T q_l are zero for l ≠ i by the inductive assumption. Because of the explicit normalization step q_r = a'_j / ||a'_j||, the vector q_r, if computed, has unit norm, and because q_i^T a'_j = 0, it follows that q_r is orthogonal to all its predecessors, q_i^T q_r = 0 for i = 1, ..., r − 1.

Finally, we notice that the vectors q_j span the same space as the a_j's, because the former are linear combinations of the latter, are orthonormal (and therefore independent), and equal in number to the number of linearly independent vectors in a_1, ..., a_n. □
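A minimal MATLAB sketch of the Gram-Schmidt procedure of theorem 2.4.2 (not part of the original notes; the function name and the tolerance used to decide when a vector is "nonzero" are illustrative choices):

function Q = gram_schmidt(A)
% Classical Gram-Schmidt: orthonormalize the columns of A.
% Columns that are (numerically) dependent on their predecessors are skipped.
[m, n] = size(A);
Q = zeros(m, 0);
for j = 1:n
    aprime = A(:, j) - Q * (Q' * A(:, j));  % subtract projections onto previous q's
    if norm(aprime) > 1e-12                 % tolerance standing in for "nonzero"
        Q = [Q, aprime / norm(aprime)];     % normalize and append
    end
end
end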
Theorem 2.4.3 If A is a subspace of R^m and A⊥ is the orthogonal complement of A in R^m, then

    dim(A) + dim(A⊥) = m .

Proof. Let a_1, ..., a_n be a basis for A. Extend this basis to a basis a_1, ..., a_m for R^m (theorem 2.4.1). Orthonormalize this basis by the Gram-Schmidt procedure (theorem 2.4.2) to obtain q_1, ..., q_m. By construction, q_1, ..., q_n span A. Because the new basis is orthonormal, all vectors generated by q_{n+1}, ..., q_m are orthogonal to all vectors generated by q_1, ..., q_n, so there is a space of dimension at least m − n that is orthogonal to A. On the other hand, the dimension of this orthogonal space cannot exceed m − n, because otherwise we would have more than m vectors in a basis for R^m. Thus, the dimension of the orthogonal space A⊥ is exactly m − n, as promised. □

We can now start to talk about matrices in terms of the subspaces associated with them. The null space null(A) of an m × n matrix A is the space of all n-dimensional vectors that are orthogonal to the rows of A. The range of A is the space of all m-dimensional vectors that are generated by the columns of A. Thus, x ∈ null(A) iff Ax = 0, and b ∈ range(A) iff Ax = b for some x.

From theorem 2.4.3, if null(A) has dimension h, then the space generated by the rows of A has dimension r = n − h, that is, A has n − h linearly independent rows. It is not obvious that the space generated by the columns of A also has dimension r = n − h. This is the point of the following theorem.

Theorem 2.4.4 The number r of linearly independent columns of any m × n matrix A is equal to the number of its linearly independent rows, and

    r = n − h

where h = dim(null(A)).
Proof. We have already proven that the number of independent rows is n − h. Now we show that the number of independent columns is also n − h, by constructing a basis for range(A).

Let v_1, ..., v_h be a basis for null(A), and extend this basis (theorem 2.4.1) into a basis v_1, ..., v_n for R^n. Then we can show that the n − h vectors A v_{h+1}, ..., A v_n are a basis for the range of A.

First, these n − h vectors generate the range of A. In fact, given an arbitrary vector b ∈ range(A), there must be a linear combination of the columns of A that is equal to b. In symbols, there is an n-tuple x such that Ax = b. The n-tuple x itself, being an element of R^n, must be some linear combination of v_1, ..., v_n, our basis for R^n:

    x = \sum_{j=1}^{n} c_j v_j .

Thus,

    b = A x = A \sum_{j=1}^{n} c_j v_j = \sum_{j=1}^{n} c_j A v_j = \sum_{j=h+1}^{n} c_j A v_j

since v_1, ..., v_h span null(A), so that A v_j = 0 for j = 1, ..., h. This proves that the n − h vectors A v_{h+1}, ..., A v_n generate range(A).

Second, we prove that the n − h vectors A v_{h+1}, ..., A v_n are linearly independent. Suppose, per absurdum, that they are not. Then there exist numbers x_{h+1}, ..., x_n, not all zero, such that

    \sum_{j=h+1}^{n} x_j A v_j = 0

so that

    A \sum_{j=h+1}^{n} x_j v_j = 0 .

But then the vector \sum_{j=h+1}^{n} x_j v_j is in the null space of A. Since the vectors v_1, ..., v_h are a basis for null(A), there must exist coefficients x_1, ..., x_h such that

    \sum_{j=h+1}^{n} x_j v_j = \sum_{j=1}^{h} x_j v_j ,

in conflict with the assumption that the vectors v_1, ..., v_n are linearly independent. □
Thanks to this theorem, we can define the rank of A to be equivalently the number of linearly independent columns or of linearly independent rows of A:

    rank(A) = dim(range(A)) = n − dim(null(A)) .

2.5 The Solutions of a Linear System

The space spanned by the rows of A is the range of the transpose of A, and the orthogonal complement of the range of A is the null space of the transpose:

    rowspace(A) = range(A^T)
    range(A)⊥ = null(A^T)

where A^T is the transpose of A, defined as the matrix obtained by exchanging the rows of A with its columns.

Theorem 2.5.1 The matrix A transforms a vector x in its null space into the zero vector, and an arbitrary vector x into a vector in range(A).
This allows characterizing the set of solutions to a linear system as follows. Let

    A x = b

be an m × n system (m can be less than, equal to, or greater than n). Also, let

    r = rank(A)

be the number of linearly independent rows or columns of A. Then,

    b ∉ range(A)  ⇒  no solutions
    b ∈ range(A)  ⇒  ∞^{n−r} solutions

with the convention that ∞^0 = 1. Here, ∞^k is the cardinality of a k-dimensional vector space.

In the first case above, there can be no linear combination of the columns (no x vector) that gives b, and the system is said to be incompatible. In the second, compatible case, three possibilities occur, depending on the relative sizes of r, m, n:

When r = n = m, the system is invertible. This means that there is exactly one x that satisfies the system, since the columns of A span all of R^n. Notice that invertibility depends only on A, not on b.

When r = n and m > n, the system is redundant. There are more equations than unknowns, but since b is in the range of A there is a linear combination of the columns (a vector x) that produces b. In other words, the equations are compatible, and exactly one solution exists.

When r < n the system is underdetermined. This means that the null space is nontrivial (i.e., it has dimension h > 0), and there is a space of dimension h = n − r of vectors x such that Ax = 0. Since b is assumed to be in the range of A, there are solutions x to Ax = b, but then for any y ∈ null(A) also x + y is a solution:

    A x = b ,  A y = 0  ⇒  A (x + y) = b

and this generates the ∞^h = ∞^{n−r} solutions mentioned above.

Notice that if r = n then n cannot possibly exceed m, so the first two cases exhaust the possibilities for r = n. Also, r cannot exceed either m or n. All the cases are summarized in figure 2.3.

Of course, listing all possibilities does not provide an operational method for determining the type of linear system for a given pair A, b. Gaussian elimination, and particularly its version called reduction to echelon form, is such a method, and is summarized in the next section.
Figure 2.3: Classification of a linear system Ax = b as incompatible, invertible, redundant, or underdetermined.
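As a quick numerical check (not part of the original notes), the classification above can be carried out in MATLAB by comparing rank(A), rank([A b]), and n; the sketch below uses the system that is worked out by hand in section 2.6.3.

% Classify the linear system A*x = b.
A = [1 3 3 2; 2 6 9 5; -1 -3 3 0];
b = [1; 5; 5];
[m, n] = size(A);
r = rank(A);
if rank([A b]) > r
    disp('incompatible: b is not in range(A)');
elseif r == n && m == n
    disp('invertible: exactly one solution');
elseif r == n && m > n
    disp('redundant: exactly one solution');
else
    fprintf('underdetermined: solutions form an (n - r) = %d parameter family\n', n - r);
end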
2.6 Gaussian Elimination

Gaussian elimination is a direct method for solving a linear system Ax = b: the answer is computed in a number of steps that depends only on the size of the matrix A, rather than by successive approximation.

2.6.1 Reduction to Echelon Form
The matrix A is reduced to echelon form by a process in m − 1 steps. The first step is applied to U^(1) = A and c^(1) = b. The k-th step is applied to rows k, ..., m of U^(k) and c^(k) and produces U^(k+1) and c^(k+1). The last step produces U^(m) = U and c^(m) = c. Initially, the pivot column index p is set to one. Here is step k, where u_ij denotes entry i, j of U^(k):

Skip no-pivot columns. If u_ip is zero for every i = k, ..., m, then increment p by 1. If p exceeds n stop.⁵

Row exchange. Now p ≤ n and u_ip is nonzero for some k ≤ i ≤ m. Let l be one such value of i.⁶ If l ≠ k, exchange rows l and k of U^(k) and of c^(k).

Triangularization. The new entry u_kp is nonzero, and is called the pivot. For i = k+1, ..., m, subtract row k of U^(k) multiplied by u_ip/u_kp from row i of U^(k), and subtract entry k of c^(k) multiplied by u_ip/u_kp from entry i of c^(k). This zeros all the entries in the column below the pivot, and preserves the equality of left- and right-hand side.

When this process is finished, U is in echelon form. In particular, if the matrix is square and if all columns have a pivot, then U is upper-triangular.
⁵ "Stop" means that the entire algorithm is finished.
⁶ Different ways of selecting l here lead to different numerical properties of the algorithm. Selecting the largest entry in the column leads to better roundoff behavior.
2.6.2 Backsubstitution
A system

    U x = c    (2.10)
in echelon form is easily solved for x. To see this, we first solve the system symbolically, leaving undetermined variables specified by their name, and then transform this solution procedure into one that can be more readily implemented numerically.

Let r be the index of the last nonzero row of U. Since this is the number of independent rows of U, r is the rank of U. It is also the rank of A, because A and U admit exactly the same solutions and are equal in size. If r < m, the last m − r equations yield a subsystem of the following form:

    0 = c_{r+1}
      ...
    0 = c_m .

Let us call this the residual subsystem. If on the other hand r = m (obviously r cannot exceed m), there is no residual subsystem.

If there is a residual system (i.e., r < m) and some of c_{r+1}, ..., c_m are nonzero, then the equations corresponding to these nonzero entries are incompatible, because they are of the form 0 = c_i with c_i ≠ 0. Since no vector x can satisfy these equations, the linear system admits no solutions: it is incompatible.

Let us now assume that either there is no residual system, or if there is one it is compatible, that is, c_{r+1} = ... = c_m = 0. Then, solutions exist, and they can be determined by backsubstitution, that is, by solving the equations starting from the last one and replacing the result in the equations higher up.

Backsubstitution works as follows. First, remove the residual system, if any. We are left with an r × n system. In this system, call the variables corresponding to the r columns with pivots the basic variables, and call the other n − r the free variables. Say that the pivot columns are j_1, ..., j_r. Then symbolic backsubstitution consists of the following sequence:

    for i = r down to 1
        x_{j_i} = (1 / u_{i,j_i}) ( c_i − \sum_{l = j_i + 1}^{n} u_{il} x_l )
    end

This is called symbolic backsubstitution because no numerical values are assigned to free variables. Whenever they appear in the expressions for the basic variables, free variables are specified by name rather than by value. The final result is a solution with as many free parameters as there are free variables. Since any value given to the free variables leaves the equality of system (2.10) satisfied, the presence of free variables leads to an infinity of solutions.

When solving a system in echelon form numerically, however, it is inconvenient to carry around nonnumeric symbol names (the free variables). Here is an equivalent solution procedure that makes this unnecessary. The solution obtained by backsubstitution is an affine function⁷ of the free variables, and can therefore be written in the form

    x = v_0 + x_{i_1} v_1 + ... + x_{i_{n−r}} v_{n−r}    (2.11)

where x_{i_1}, ..., x_{i_{n−r}} are the free variables. The vector v_0 is the solution when all free variables are zero, and can therefore be obtained by replacing each free variable by zero during backsubstitution. Similarly, the vector v_i for i = 1, ..., n − r can be obtained by solving the homogeneous system

    U x = 0

with x_{i_i} = 1 and all other free variables equal to zero. In conclusion, the general solution can be obtained by running backsubstitution n − r + 1 times, once for the nonhomogeneous system, and n − r times for the homogeneous system, with suitable values of the free variables. This yields the solution in the form (2.11). Notice that the vectors v_1, ..., v_{n−r} form a basis for the null space of U, and therefore of A.

⁷ An affine function is a linear function plus a constant.
2.6.3 An Example
An example will clarify both the reduction to echelon form and backsubstitution. Consider the system
Ax = b

where

    U^(1) = A =  [  1   3   3   2 ]        c^(1) = b =  [ 1 ]
                 [  2   6   9   5 ]                     [ 5 ]
                 [ −1  −3   3   0 ] ,                   [ 5 ] .
Reduction to echelon form transforms A and b as follows. In the first step (k = 1), there are no no-pivot columns, so the pivot column index p stays at 1. Throughout this example, we choose a trivial pivot selection rule: we pick the first nonzero entry at or below row k in the pivot column. For k = 1, this means that u^(1)_11 = a_11 = 1 is the pivot⁸. In other words, no row exchange is necessary. The triangularization step subtracts row 1 multiplied by 2/1 from row 2, and subtracts row 1 multiplied by −1/1 from row 3. When applied to both U^(1) and c^(1) this yields
    U^(2) =  [ 1  3  3  2 ]        c^(2) =  [ 1 ]
             [ 0  0  3  1 ]                 [ 3 ]
             [ 0  0  6  2 ] ,               [ 6 ] .
Notice that now (k = 2) the entries u^(2)_ip are zero for i = 2, 3, for both p = 1 and p = 2, so p is set to 3: the second pivot column is column 3, and u^(2)_23 is nonzero, so no row exchange is necessary. In the triangularization step, row 2 multiplied by 6/3 is subtracted from row 3 for both U^(2) and c^(2) to yield

    U = U^(3) =  [ 1  3  3  2 ]        c = c^(3) =  [ 1 ]
                 [ 0  0  3  1 ]                     [ 3 ]
                 [ 0  0  0  0 ] ,                   [ 0 ] .

There is one zero row in the left-hand side, and the rank of U and that of A is r = 2, the number of nonzero rows. The residual system is 0 = 0 (compatible), and r < n = 4, so the system is underdetermined, with ∞^{n−r} = ∞^2 solutions.

In symbolic backsubstitution, the residual subsystem is first deleted. This yields the reduced system

    [ 1  3  3  2 ] x  =  [ 1 ]
    [ 0  0  3  1 ]       [ 3 ]    (2.12)

The basic variables are x_1 and x_3, corresponding to the columns with pivots. The other two variables, x_2 and x_4, are free. Backsubstitution applied first to row 2 and then to row 1 yields the following expressions for the pivot variables:

    x_3 = (1/u_23)(c_2 − u_24 x_4) = (1/3)(3 − x_4) = 1 − (1/3) x_4
    x_1 = (1/u_11)(c_1 − u_12 x_2 − u_13 x_3 − u_14 x_4) = (1/1)(1 − 3 x_2 − 3 x_3 − 2 x_4)
        = 1 − 3 x_2 − (3 − x_4) − 2 x_4 = −2 − 3 x_2 − x_4

so that the general solution is

    x =  [ −2 − 3 x_2 − x_4 ]     [ −2 ]        [ −3 ]        [ −1   ]
         [       x_2        ]  =  [  0 ]  + x_2 [  1 ]  + x_4 [  0   ]
         [  1 − (1/3) x_4   ]     [  1 ]        [  0 ]        [ −1/3 ]
         [       x_4        ]     [  0 ]        [  0 ]        [  1   ] .
⁸ Selecting the largest entry in the pivot column at or below row k is a frequent choice, and this would have caused rows 1 and 2 to be switched.
This same solution can be found by the numerical backsubstitution method as follows. Solving the reduced system (2.12) with x2 = x4 = 0 by numerical backsubstitution yields
    x_3 = (1/3)(3 − 1·0) = 1
    x_1 = (1/1)(1 − 3·0 − 3·1 − 2·0) = −2

so that

    v_0 = ( −2, 0, 1, 0 )^T .

Then v_1 is found by solving the nonzero part (first two rows) of U x = 0 with x_2 = 1 and x_4 = 0 to obtain

    x_3 = (1/3)(−1·0) = 0
    x_1 = (1/1)(−3·1 − 3·0 − 2·0) = −3

so that

    v_1 = ( −3, 1, 0, 0 )^T .

Similarly, v_2 is found by solving the nonzero part of U x = 0 with x_2 = 0 and x_4 = 1 to obtain

    x_3 = (1/3)(−1·1) = −1/3
    x_1 = (1/1)(−3·0 − 3·(−1/3) − 2·1) = −1

so that

    v_2 = ( −1, 0, −1/3, 1 )^T

and

    x = v_0 + x_2 v_1 + x_4 v_2 = ( −2, 0, 1, 0 )^T + x_2 ( −3, 1, 0, 0 )^T + x_4 ( −1, 0, −1/3, 1 )^T
just as before.
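The same computation can be reproduced numerically in MATLAB (a sketch, not part of the original notes; it leans on the built-in rref and null rather than re-implementing the elimination steps of section 2.6.1):

% Redo the example of section 2.6.3 numerically.
A = [1 3 3 2; 2 6 9 5; -1 -3 3 0];
b = [1; 5; 5];
R = rref([A b])            % reduced echelon form of the augmented system
r = rank(A);               % r = 2: columns 1 and 3 carry the pivots
v0 = [-2; 0; 1; 0];        % particular solution with both free variables set to zero
N  = null(A, 'r');         % "rational" null-space basis; should match v1, v2 above (up to ordering)
check = norm(A * (v0 + 3 * N(:,1) - 2 * N(:,2)) - b)   % any such combination solves Ax = b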
As mentioned at the beginning of this section, Gaussian elimination is a direct method, in the sense that the answer can be found in a number of steps that depends only on the size of the matrix A. In the next chapter, we study a different
method, based on the so-called Singular Value Decomposition (SVD). This is an iterative method, meaning that an exact solution usually requires an infinite number of steps, and the number of steps necessary to find an approximate solution depends on the desired number of correct digits. This state of affairs would seem to favor Gaussian elimination over the SVD. However, the latter yields a much more complete answer, since it computes bases for all the four spaces mentioned above, as well as a set of quantities, called the singular values, which provide great insight into the behavior of the linear transformation represented by the matrix A. Singular values also allow defining a notion of approximate rank which is very useful in a large number of applications. They also allow finding approximate solutions when the linear system in question is incompatible. In addition, for reasons that will become apparent in the next chapter, the computation of the SVD is numerically well behaved, much more so than Gaussian elimination. Finally, very efficient algorithms for the SVD exist. For instance, on a regular workstation, one can compute several thousand SVDs of 5 × 5 matrices in one second. More generally, the number of floating-point operations necessary to compute the SVD of an m × n matrix is a m n^2 + b n^3, where a and b are small numbers that depend on the details of the algorithm.
Chapter 3

The Singular Value Decomposition

3.1 Orthogonal Matrices

Let P be a point in R^n with coordinates

    p = ( p_1, ..., p_n )^T

in a Cartesian reference system. For concreteness, you may want to think of the case n = 3, but the following arguments are general. Given any orthonormal basis v_1, ..., v_n for R^n, let
    q = ( q_1, ..., q_n )^T

be the vector of coefficients for point P in the new basis. Then for any i = 1, ..., n we have

    v_i^T p = v_i^T \sum_{j=1}^{n} q_j v_j = \sum_{j=1}^{n} q_j v_i^T v_j = q_i
since the v_j are orthonormal. This is important, and may need emphasis: if

    p = \sum_{j=1}^{n} q_j v_j

and the vectors of the basis v_1, ..., v_n are orthonormal, then the coefficients q_j are the signed magnitudes of the projections of p onto the basis vectors:

    q_j = v_j^T p .    (3.1)

We can write all n instances of equation (3.1) by collecting the vectors v_j into a matrix,

    V = [ v_1  ···  v_n ] ,

so that

    q = V^T p .    (3.2)

Also, we can collect the n^2 equations

    v_i^T v_j = 1 if i = j, 0 otherwise    (3.3)

into the following matrix equation:

    V^T V = I    (3.4)

where I is the n × n identity matrix.

¹ Vectors with unit norm.
A matrix V that satisfies equation (3.4) is said to be orthogonal. Since the inverse of a square matrix V is defined by V^{-1} V = I, comparison with equation (3.3) shows that the inverse of an orthogonal matrix V exists, and is equal to the transpose of V:

    V^{-1} = V^T .

Of course, this argument requires V to be full rank, so that the solution V^{-1} to equation (3.4) is unique. However, V is certainly full rank, because it is made of orthonormal columns.

When V is m × n with m > n and has orthonormal columns, this result is still valid, since equation (3.3) still holds. However, equation (3.4) now defines what is called the left inverse of V. In fact, V V^{-1} = I cannot possibly have a solution when m > n, because the m × m identity matrix has m linearly independent² columns, while the columns of V V^{-1} are linear combinations of the n columns of V, so V V^{-1} can have at most n linearly independent columns.

For square, full-rank matrices (r = m = n), the distinction between left and right inverse vanishes. In fact, suppose that there exist matrices B and C such that BV = I and VC = I. Then B = B(VC) = (BV)C = C, so the left and the right inverse are the same. We can summarize this discussion as follows:

Theorem 3.1.1 The left inverse of an orthogonal m × n matrix V with m ≥ n exists and is equal to the transpose of V:

    V^{-1} V = V^T V = I .

In particular, if m = n (V square), the left inverse is also the right inverse, and V^{-1} = V^T.
Sometimes, the geometric interpretation of equation (3.2) causes confusion, because two interpretations of it are possible. In the interpretation given above, the point P remains the same, and the underlying reference frame is changed from the elementary vectors e_j (that is, from the columns of I) to the vectors v_j (that is, to the columns of V). Alternatively, equation (3.2) can be seen as a transformation, in a fixed reference system, of point P with coordinates p into a different point Q with coordinates q. This, however, is relativity, and should not be surprising: If you spin
2 Nay, orthonormal.
clockwise on your feet, or if you stand still and the whole universe spins counterclockwise around you, the result is the same.3 Consistently with either of these geometric interpretations, we have the following result: Theorem 3.1.2 The norm of a vector x is not changed by multiplication by an orthogonal matrix V :
    ||V x|| = ||x|| .

Proof.

    ||V x||^2 = x^T V^T V x = x^T x = ||x||^2 .  □
We conclude this section with an obvious but useful consequence of orthogonality. In section 2.3 we defined the projection p of a vector b onto another vector c as the point on the line through c that is closest to b. This notion of projection can be extended from lines to vector spaces by the following definition: The projection p of a point b ∈ R^n onto a subspace C is the point in C that is closest to b.

Also, for unit vectors c, the projection matrix is c c^T (theorem 2.3.3), and the vector b − p is orthogonal to c. An analogous result holds for subspace projection, as the following theorem shows.

Theorem 3.1.3 Let U be an orthogonal matrix. Then the matrix U U^T projects any vector b onto range(U). Furthermore, the difference vector between b and its projection p onto range(U) is orthogonal to range(U):

    U^T (b − p) = 0 .

Proof. A point p in range(U) is a linear combination of the columns of U:

    p = U x

where x is the vector of coefficients (as many coefficients as there are columns in U). The squared distance between b and p is

    ||b − p||^2 = (b − p)^T (b − p) = b^T b + p^T p − 2 b^T p = b^T b + x^T U^T U x − 2 b^T U x .

Because of orthogonality, U^T U is the identity matrix, so

    ||b − p||^2 = b^T b + x^T x − 2 b^T U x .

The derivative of this squared distance with respect to x is the vector

    2 x − 2 U^T b

which is zero iff

    x = U^T b ,

that is, when

    p = U x = U U^T b

as promised. For this value of p the difference vector b − p is orthogonal to range(U), in the sense that

    U^T (b − p) = U^T (b − U U^T b) = U^T b − U^T b = 0 .  □

³ At least geometrically. One solution may be more efficient than the other in other ways.
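As a numerical aside (not part of the original notes), theorem 3.1.3 can be checked in MATLAB by building an orthonormal basis U for a subspace with orth and projecting a hypothetical vector b:

% Projection onto a subspace via U*U'.
A = [1 0; 1 1; 0 1];         % columns span a 2-D subspace of R^3
U = orth(A);                 % orthonormal basis for range(A)
b = [3; -1; 2];
p = U * (U' * b);            % projection of b onto range(U)
check = norm(U' * (b - p))   % should be (numerically) zero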
Figure 3.1: The matrix in equation (3.5) maps a circle on the plane into an ellipse in space. The two small boxes are corresponding points.
3.2 The Singular Value Decomposition

Here is the main geometric intuition captured by the Singular Value Decomposition:

    An m × n matrix A of rank r maps the r-dimensional unit hypersphere in rowspace(A) into an r-dimensional hyperellipse in range(A).
This statement is stronger than saying that A maps rowspace(A) into range(A), because it also describes what happens to the magnitudes of the vectors: a hypersphere is stretched or compressed into a hyperellipse, which is a quadratic hypersurface that generalizes the two-dimensional notion of ellipse to an arbitrary number of dimensions. In three dimensions, the hyperellipse is an ellipsoid, in one dimension it is a pair of points. In all cases, the hyperellipse in question is centered at the origin. For instance, the rank-2 matrix

    A = (1/√2) [ √3  √3 ]
               [ −3   3 ]
               [  1   1 ]    (3.5)

transforms the unit circle on the plane into an ellipse embedded in three-dimensional space. Figure 3.1 shows the map

    b = A x .
Two diametrically opposite points on the unit circle are mapped into the two endpoints of the major axis of the ellipse, and two other diametrically opposite points on the unit circle are mapped into the two endpoints of the minor axis of the ellipse. The lines through these two pairs of points on the unit circle are always orthogonal. This result can be generalized to any m × n matrix.

Simple and fundamental as this geometric fact may be, its proof by geometric means is cumbersome. Instead, we will prove it algebraically by first introducing the existence of the SVD and then using the latter to prove that matrices map hyperspheres into hyperellipses.

Theorem 3.2.1 If A is a real m × n matrix, then there exist orthogonal matrices

    U = [ u_1  ···  u_m ] ∈ R^{m×m}
    V = [ v_1  ···  v_n ] ∈ R^{n×n}

such that

    U^T A V = Σ = diag(σ_1, ..., σ_p)

where p = min(m, n) and σ_1 ≥ σ_2 ≥ ... ≥ σ_p ≥ 0. Equivalently,

    A = U Σ V^T .
Proof. This proof is adapted from Golub and Van Loan, cited in the introduction to the class notes. Consider all vectors of the form

    b = A x

for x on the unit hypersphere ||x|| = 1, and consider the scalar function ||Ax||. Since x is defined on a compact set, this scalar function must achieve a maximum value, possibly at more than one point. Let v_1 be one of the vectors on the unit hypersphere in R^n where this maximum is achieved, and let σ_1 u_1 be the corresponding vector

    σ_1 u_1 = A v_1

with ||u_1|| = 1, so that σ_1 is the length of the corresponding b = A v_1.

By theorems 2.4.1 and 2.4.2, u_1 and v_1 can be extended into orthonormal bases for R^m and R^n, respectively. Collect these orthonormal basis vectors into orthogonal matrices U_1 and V_1. Then

    U_1^T A V_1 = S_1 = [ σ_1   w^T ]
                        [ 0     A_1 ] .

In fact, the first column of A V_1 is A v_1 = σ_1 u_1, so the first entry of U_1^T A V_1 is u_1^T σ_1 u_1 = σ_1, and the other entries of its first column are u_j^T σ_1 u_1 = 0 because of orthonormality.

The matrix S_1 turns out to have even more structure than this: the row vector w^T is zero. Consider in fact the length of the vector

    S_1 (1 / sqrt(σ_1^2 + w^T w)) [ σ_1 ]  =  (1 / sqrt(σ_1^2 + w^T w)) [ σ_1^2 + w^T w ] .    (3.6)
                                  [ w   ]                               [ A_1 w          ]

From the last term, we see that the length of this vector is at least sqrt(σ_1^2 + w^T w). However, the longest vector we can obtain by premultiplying a unit vector by matrix S_1 has length σ_1. In fact, if x has unit norm so does V_1 x (theorem 3.1.2). Then, the longest vector of the form A V_1 x has length σ_1 (by definition of σ_1), and again by theorem 3.1.2 the longest vector of the form S_1 x = U_1^T A V_1 x still has length σ_1. Consequently, the vector in (3.6) cannot be longer than σ_1, and therefore w must be zero. Thus,

    U_1^T A V_1 = S_1 = [ σ_1   0^T ]
                        [ 0     A_1 ] .

The matrix A_1 has one fewer row and column than A. We can repeat the same construction on A_1 and write

    U_2^T A_1 V_2 = S_2 = [ σ_2   0^T ]
                          [ 0     A_2 ]

so that

    [ 1   0^T ]^T  U_1^T A V_1  [ 1   0^T ]   =   [ σ_1   0^T   0^T ]
    [ 0   U_2 ]                 [ 0   V_2 ]       [ 0     σ_2   0^T ]
                                                  [ 0     0     A_2 ] .

This procedure can be repeated until A_k vanishes (zero rows or zero columns) to obtain

    U^T A V = Σ

where U^T and V are orthogonal matrices obtained by multiplying together all the orthogonal matrices used in the procedure, and Σ is the m × n diagonal matrix

    Σ = [ σ_1              ]
        [       σ_2        ]
        [            ...   ] .

By construction, the σ_i's are arranged in nonincreasing order along the diagonal of Σ, and are nonnegative. Since matrices U and V are orthogonal, we can premultiply the matrix product in the theorem by U and postmultiply it by V^T to obtain

    A = U Σ V^T ,

which is the desired result. □
We can now review the geometric picture in figure 3.1 in light of the singular value decomposition. In the process, we introduce some nomenclature for the three matrices in the SVD. Consider the map in figure 3.1, represented by equation (3.5), and imagine transforming point x (the small box at x on the unit circle) into its corresponding point b = Ax (the small box on the ellipse). This transformation can be achieved in three steps (see figure 3.2):

1. Write x in the frame of reference of the two vectors v_1, v_2 on the unit circle that map into the axes of the ellipse. There are a few ways to do this, because axis endpoints come in pairs. Just pick one way, but order v_1, v_2 so they map into the major and the minor axis, in this order. Let us call v_1, v_2 the two right singular vectors of A. The corresponding axis unit vectors u_1, u_2 on the ellipse are called left singular vectors. If we define

    V = [ v_1  v_2 ] ,

the new coordinates ξ of x become

    ξ = V^T x

because V is orthogonal.

2. Transform ξ into its image on a "straight" version of the final ellipse. "Straight" here means that the axes of the ellipse are aligned with the y_1, y_2 axes. Otherwise, the straight ellipse has the same shape as the ellipse in figure 3.1. If the lengths of the half-axes of the ellipse are σ_1, σ_2 (major axis first), the transformed vector has coordinates

    η = Σ ξ

where

    Σ = [ σ_1   0  ]
        [  0   σ_2 ]
        [  0    0  ]

is a diagonal matrix. The real, nonnegative numbers σ_1, σ_2 are called the singular values of A.

3. Rotate the reference frame in R^m = R^3 so that the straight ellipse becomes the ellipse in figure 3.1. This rotation brings η along, and maps it to b. The components of η are the signed magnitudes of the projections of b along the unit vectors u_1, u_2, u_3 that identify the axes of the ellipse and the normal to the plane of the ellipse, so

    b = U η

where the orthogonal matrix

    U = [ u_1  u_2  u_3 ]

collects the left singular vectors of A.

Concatenating the three steps above yields

    b = U Σ V^T x

and therefore, since this construction works for any point x on the unit circle,

    A = U Σ V^T .

This is the SVD of A.
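As a check (not part of the original notes), the SVD of the matrix in equation (3.5), as reconstructed above, can be computed numerically in MATLAB:

% Numerical SVD of the matrix of equation (3.5).
A = (1/sqrt(2)) * [sqrt(3) sqrt(3); -3 3; 1 1];
[U, S, V] = svd(A);          % A = U*S*V' with U, V orthogonal, S diagonal
singular_values = diag(S)'   % lengths of the semi-axes of the ellipse
reconstruction_error = norm(U * S * V' - A)   % should be (numerically) zero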
Figure 3.2: Decomposition of the mapping in figure 3.1.

The singular value decomposition is almost unique. There are two sources of ambiguity. The first is in the orientation of the singular vectors. One can flip any right singular vector, provided that the corresponding left singular vector is flipped as well, and still obtain a valid SVD. Singular vectors must be flipped in pairs (a left vector and its corresponding right vector) because the singular values are required to be nonnegative. This is a trivial ambiguity. If desired, it can be removed by imposing, for instance, that the first nonzero entry of every left singular vector be positive. The second source of ambiguity is deeper. If the matrix A maps a hypersphere into another hypersphere, the axes of the latter are not defined. For instance, the identity matrix has an infinity of SVDs, all of the form
    I = U I U^T

where U is any orthogonal matrix of suitable size. More generally, whenever two or more singular values coincide, the subspaces identified by the corresponding left and right singular vectors are unique, but any orthonormal basis can be chosen within, say, the right subspace and yield, together with the corresponding left singular vectors, a valid SVD. Except for these ambiguities, the SVD is unique.

Even in the general case, the singular values of a matrix A are the lengths of the semi-axes of the hyperellipse E defined by

    E = { A x : ||x|| = 1 } .

The SVD reveals a great deal about the structure of a matrix. If we define r by
    σ_1 ≥ ... ≥ σ_r > σ_{r+1} = ... = 0 ,

that is, if σ_r is the smallest nonzero singular value of A, then

    rank(A) = r
    null(A) = span{ v_{r+1}, ..., v_n }
    range(A) = span{ u_1, ..., u_r } .
The sizes of the matrices in the SVD are as follows: U is m × m, Σ is m × n, and V is n × n. Thus, Σ has the same shape and size as A, while U and V are square. However, if m > n, the bottom (m − n) × n block of Σ is zero, so that the last m − n columns of U are multiplied by zero. Similarly, if m < n, the rightmost m × (n − m) block of Σ is zero, and this multiplies the last n − m rows of V. This suggests a small, equivalent version of the SVD. If p = min(m, n), we can define U_p = U(:, 1:p), Σ_p = Σ(1:p, 1:p), and V_p = V(:, 1:p), and write

    A = U_p Σ_p V_p^T

where U_p is m × p, Σ_p is p × p, and V_p is n × p.

Moreover, if p − r singular values are zero, we can let U_r = U(:, 1:r), Σ_r = Σ(1:r, 1:r), and V_r = V(:, 1:r); then we have

    A = U_r Σ_r V_r^T = \sum_{i=1}^{r} σ_i u_i v_i^T ,

which is an even smaller, minimal, SVD.

Finally, both the Frobenius norm

    ||A||_F = sqrt( \sum_{i=1}^{m} \sum_{j=1}^{n} |a_ij|^2 )

and the 2-norm

    ||A||_2 = max_{||x||=1} ||A x||

are neatly characterized in terms of the SVD:

    ||A||_F^2 = σ_1^2 + ... + σ_p^2
    ||A||_2 = σ_1 .
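These relations are easy to verify numerically. The following MATLAB sketch (not part of the original notes; the random test matrix is arbitrary) compares the full and economy-size SVD and checks the two norm identities:

% Full vs. economy-size SVD, and the norm identities above.
A = randn(5, 3);
[U, S, V]    = svd(A);           % full SVD: U is 5x5, S is 5x3, V is 3x3
[Up, Sp, Vp] = svd(A, 'econ');   % small SVD: Up is 5x3, Sp and Vp are 3x3
frobenius_check = [norm(A, 'fro')^2, sum(diag(Sp).^2)]   % should agree
two_norm_check  = [norm(A, 2), Sp(1, 1)]                  % should agree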
In the next few sections we introduce fundamental results and applications that testify to the importance of the SVD.
3.3 The Pseudoinverse

One of the most important applications of the SVD is the solution of linear systems in the least-squares sense. In many circumstances, a linear system

    A x = b    (3.7)

arises from noisy or redundant measurements and admits no exact solution. One then looks for the vector x that minimizes the norm

    ||A x − b||

of the residual vector

    r = A x − b ,

where the double bars henceforth refer to the Euclidean norm. Thus, x cannot exactly satisfy any of the m equations in the system, but it tries to satisfy all of them as closely as possible, as measured by the sum of the squares of the discrepancies between left- and right-hand sides of the equations.
In other circumstances, not enough measurements are available. Then, the linear system (3.7) is underdetermined, in the sense that it has fewer independent equations than unknowns (its rank r is less than n, see again chapter 2). Incompatibility and underdeterminacy can occur together: the system admits no solution, and the least-squares solution is not unique. For instance, the system
    x_1 + x_2 = 1
    x_1 + x_2 = 3
    x_3 = 2

has three unknowns, but rank 2, and its first two equations are incompatible: x_1 + x_2 cannot be equal to both 1 and 3. A least-squares solution turns out to be x = (1, 1, 2)^T, with residual r = Ax − b = (1, −1, 0)^T, which has norm √2 (admittedly, this is a rather high residual, but this is the best we can do for this problem, in the least-squares sense). However, any other vector of the form

    x' = ( 1, 1, 2 )^T + α ( −1, 1, 0 )^T

is as good as x. For instance, x' = (0, 2, 2)^T, obtained for α = 1, yields exactly the same residual as x (check this). In summary, an exact solution to the system (3.7) may not exist, or may not be unique, as we learned in chapter 2. An approximate solution, in the least-squares sense, always exists, but may fail to be unique.

If there are several least-squares solutions, all equally good (or bad), then one of them turns out to be shorter than all the others, that is, its norm ||x|| is smallest. One can therefore redefine what it means to solve a linear system so that there is always exactly one solution. This minimum-norm solution is the subject of the following theorem, which both proves uniqueness and provides a recipe for the computation of the solution.

Theorem 3.3.1 The minimum-norm least-squares solution to a linear system Ax = b, that is, the shortest vector x that achieves the

    min_x ||A x − b|| ,

is unique, and is given by

    x̂ = V Σ† U^T b    (3.8)

where

    Σ† = [ diag(1/σ_1, ..., 1/σ_r)   0 ]
         [             0             0 ]

is an n × m diagonal matrix.
The matrix

    A† = V Σ† U^T

is called the pseudoinverse of A.

Proof. The minimum-norm least-squares solution to Ax = b is the shortest vector x that minimizes

    ||A x − b|| ,

that is,

    ||U Σ V^T x − b|| .

This norm is the same as

    ||U (Σ V^T x − U^T b)||

because U U^T = I. But orthogonal matrices do not change the norm of vectors they are applied to (theorem 3.1.2), so that the last expression above equals

    ||Σ V^T x − U^T b||

or, with y = V^T x and c = U^T b,

    ||Σ y − c|| .    (3.9)
In order to find the solution to this minimization problem, let us spell out the last expression. We want to minimize the norm of the following vector:

    Σ y − c = ( σ_1 y_1 − c_1 , ..., σ_r y_r − c_r , −c_{r+1} , ..., −c_m )^T .

The last m − r differences are of the form

    0 − ( c_{r+1}, ..., c_m )^T

and do not depend on the unknown y. In other words, there is nothing we can do about those differences: if some or all the c_i for i = r+1, ..., m are nonzero, we will not be able to zero these differences, and each of them contributes a residual |c_i| to the solution. In each of the first r differences, on the other hand, the last n − r components of y are multiplied by zeros, so they have no effect on the solution. Thus, there is freedom in their choice. Since we look for the minimum-norm solution, that is, for the shortest vector x, we also want the shortest y, because x and y are related by an orthogonal transformation. We therefore set y_{r+1} = ... = y_n = 0. In summary, the desired y has the following components:

    y_i = c_i / σ_i    for i = 1, ..., r
    y_i = 0            for i = r+1, ..., n .

When written as a function of the vector c, this is

    y = Σ† c .

Notice that there is no other choice for y, which is therefore unique: minimum residual forces the choice of y_1, ..., y_r, and the minimum-norm requirement forces the other entries of y. Thus, the minimum-norm, least-squares solution to the original system is the unique vector

    x̂ = V y = V Σ† c = V Σ† U^T b

as promised. The residual, that is, the norm of ||Ax − b|| when x is the solution vector, is the norm of Σ y − c, since this vector is related to Ax − b by an orthogonal transformation (see equation (3.9)). In conclusion, the square of the residual is

    ||A x − b||^2 = \sum_{i=r+1}^{m} c_i^2 = \sum_{i=r+1}^{m} (u_i^T b)^2
which is the squared norm of the projection of the right-hand side vector b onto the orthogonal complement of the range of A. □
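A short MATLAB sketch of theorem 3.3.1 (not part of the original notes), applied to the small incompatible system discussed earlier in this section:

% Minimum-norm least-squares solution via the pseudoinverse,
% for x1 + x2 = 1, x1 + x2 = 3, x3 = 2.
A = [1 1 0; 1 1 0; 0 0 1];
b = [1; 3; 2];
x_hat = pinv(A) * b              % expected: [1; 1; 2]
residual = norm(A * x_hat - b)   % expected: sqrt(2)

% The same result, spelled out from the SVD as in equation (3.8):
[U, S, V] = svd(A);
r = rank(A);
Sdag = zeros(size(A'));                          % n-by-m
Sdag(1:r, 1:r) = diag(1 ./ diag(S(1:r, 1:r)));
x_hat2 = V * Sdag * U' * b;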
3.4 Least-Squares Solution of a Homogeneous Linear System

For a homogeneous linear system

    A x = 0 ,    (3.10)

the minimum-norm least-squares solution provided by theorem 3.3.1 is

    x = 0 ,
which happens to be an exact solution. Of course it is not necessarily the only one (any vector in the null space of A is also a solution, by definition), but it is obviously the one with the smallest norm. Thus, x = 0 is the minimum-norm solution to any homogeneous linear system.

Although correct, this solution is not too interesting. In many applications, what is desired is a nonzero vector x that satisfies the system (3.10) as well as possible. Without any constraints on x, we would fall back to x = 0 again. For homogeneous linear systems, the meaning of a least-squares solution is therefore usually modified, once more, by imposing the constraint

    ||x|| = 1

on the solution. Unfortunately, the resulting constrained minimization problem does not necessarily admit a unique solution. The following theorem provides a recipe for finding this solution, and shows that there is in general a whole hypersphere of solutions.

Theorem 3.4.1 Let
    A = U Σ V^T

be the singular value decomposition of A. Furthermore, let v_{n−k+1}, ..., v_n be the k columns of V whose corresponding singular values are equal to the last singular value σ_n, that is, let k be the largest integer such that

    σ_{n−k+1} = ... = σ_n .

Then, all vectors of the form

    x = α_1 v_{n−k+1} + ... + α_k v_n    (3.11)

with

    α_1^2 + ... + α_k^2 = 1    (3.12)

are unit-norm least-squares solutions to the homogeneous linear system

    A x = 0 ,

that is, they achieve the

    min_{||x||=1} ||A x|| .
Note: when σ_n is greater than zero, the most common case is k = 1, since it is very unlikely that different singular values have exactly the same numerical value. When A is rank deficient, on the other hand, it may often have more than one singular value equal to zero. In any event, if k = 1, then the minimum-norm solution is unique, x = v_n. If k > 1, the theorem above shows how to express all solutions as a linear combination of the last k columns of V.
Proof. The reasoning is very similar to that for the previous theorem. The unit-norm least-squares solution to
    A x = 0

is the vector x with ||x|| = 1 that minimizes

    ||A x|| ,

that is,

    ||U Σ V^T x|| .

Since orthogonal matrices do not change the norm of vectors they are applied to (theorem 3.1.2), this norm is the same as

    ||Σ V^T x||

or, with y = V^T x,

    ||Σ y|| .

Moreover, the constraint ||x|| = 1 translates to ||y|| = 1. We thus look for the unit-norm vector y that minimizes the squared norm of Σ y, that is,

    σ_1^2 y_1^2 + ... + σ_n^2 y_n^2 .

This is obviously achieved by concentrating all the (unit) mass of y where the σ's are smallest, that is by letting

    y_1 = ... = y_{n−k} = 0 .    (3.13)

From y = V^T x we obtain x = V y = y_1 v_1 + ... + y_n v_n, so that equation (3.13) is equivalent to equation (3.11) with α_1 = y_{n−k+1}, ..., α_k = y_n, and the unit-norm constraint on y yields equation (3.12). □

Section 3.5 shows a sample use of theorem 3.4.1.
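In practice, theorem 3.4.1 (in the common case k = 1) amounts to taking the last column of V. A minimal MATLAB sketch, with an arbitrary test matrix (not part of the original notes):

% Unit-norm least-squares solution of A*x = 0: the right singular vector
% associated with the smallest singular value.
A = randn(6, 3);
[~, ~, V] = svd(A);
x = V(:, end);            % ||x|| = 1, minimizes ||A*x|| over unit-norm vectors
minimal_value = norm(A * x)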
3.5 SVD Line Fitting

The Singular Value Decomposition yields a simple method for fitting a line to a set of points on the plane.

3.5.1 Fitting a Line to a Set of Points

Let p_i = (x_i, y_i)^T be a set of m ≥ 2 points on the plane, and let
    a x + b y − c = 0

be the equation of a line. If the left-hand side of this equation is multiplied by a nonzero constant, the line does not change. Thus, we can assume without loss of generality that

    ||n|| = sqrt(a^2 + b^2) = 1    (3.14)

where the unit vector n = (a, b)^T, orthogonal to the line, is called the line normal.

The distance from the line to the origin is |c| (see figure 3.3), and the distance between the line and a point p_i is equal to

    d_i = | a x_i + b y_i − c | = | p_i^T n − c | .    (3.15)
Figure 3.3: The distance between point p_i and the line a x + b y − c = 0.
The best-fit line minimizes the sum of the squared distances. Thus, if we let d = (d_1, ..., d_m)^T and P = (p_1, ..., p_m)^T, the best-fit line achieves the

    min_{||n||=1} ||d||^2 = min_{||n||=1} ||P n − c 1||^2 .    (3.16)

In equation (3.16), 1 is a vector of m ones.
3.5.2 The Best Line Fit

Since the third line parameter c does not appear in the constraint (3.14), at the minimum (3.16) we must have

    ∂||d||^2 / ∂c = 0 .    (3.17)

If we define the centroid p of all the points p_i as

    p = (1/m) P^T 1 ,

equation (3.17) yields

    c = (1/m) n^T P^T 1 = p^T n .

By replacing this expression for c into (3.16), we obtain

    min_{||n||=1} ||d||^2 = min_{||n||=1} ||P n − 1 p^T n||^2 ,

that is,

    min_{||n||=1} ||Q n||^2

where Q = P − 1 p^T collects the centered coordinates of the m points. We can solve this constrained minimization problem by theorem 3.4.1. Equivalently, and in order to emphasize the geometric meaning of singular values and vectors, we can recall that if n is on the unit circle, the shortest vector of the form Qn is obtained when n is the right singular vector v_2 corresponding to the smaller σ_2 of the two singular values of Q. Furthermore, since Q v_2 has norm σ_2, the residue is

    min_{||n||=1} ||d|| = σ_2

and more specifically the distances d_i are given by

    d = σ_2 u_2

where u_2 is the left singular vector corresponding to σ_2. In fact, the SVD of Q,

    Q = U Σ V^T = \sum_{i=1}^{2} σ_i u_i v_i^T ,

yields

    Q n = Q v_2 = \sum_{i=1}^{2} σ_i u_i v_i^T v_2 = σ_2 u_2

because v_1 and v_2 are orthonormal vectors.

To summarize, to fit a line (a, b, c) to a set of m points p_i, proceed as follows:

1. compute the centroid of the points (1 is a vector of m ones):

    p = (1/m) P^T 1

2. form the matrix of centered coordinates:

    Q = P − 1 p^T

3. compute the SVD of Q:

    Q = U Σ V^T
4. the line normal is the second column of the 2 × 2 matrix V:

    n = (a, b)^T = v_2

5. the third coefficient of the line is

    c = p^T n

6. the residue of the fit is

    min_{||n||=1} ||d|| = σ_2 .
The following MATLAB code implements the line-fitting method.

function [l, residue] = linefit(P)
% check input matrix sizes
[m, n] = size(P);
if n ~= 2, error('matrix P must be m x 2'), end
if m < 2, error('Need at least two points'), end
one = ones(m, 1);
% centroid of all the points
p = (P' * one) / m;
% matrix of centered coordinates
Q = P - one * p';
[U, Sigma, V] = svd(Q);
% the line normal is the second column of V
n = V(:, 2);
% assemble the three line coefficients into a column vector
l = [n ; p' * n];
% the smallest singular value of Q
% measures the residual fitting error
residue = Sigma(2, 2);

A useful exercise is to think how this procedure, or something close to it, can be adapted to fit a set of data points in R^m with an affine subspace of given dimension n. An affine subspace is a linear subspace plus a point, just like an arbitrary line is a line through the origin plus a point. Here "plus" means the following. Let L be a linear space. Then an affine space has the form

    A = p + L = { a | a = p + l and l ∈ L } .

Hint: minimizing the distance between a point and a subspace is equivalent to maximizing the norm of the projection of the point onto the subspace. The fitting problem (including fitting a line to a set of points) can be cast either as a maximization or a minimization problem.
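As a usage example (not part of the original notes), linefit can be called on noisy samples of a known line; the line and noise level below are hypothetical, and the sign of the recovered normal may be flipped:

% Fit a line to noisy samples of x + 2y - 3 = 0; with a^2 + b^2 = 1 the
% expected coefficients are approximately +/-[0.447; 0.894; 1.342].
x = linspace(0, 10, 50)';
y = (3 - x) / 2;
P = [x, y] + 0.01 * randn(50, 2);   % add a little noise
[l, residue] = linefit(P)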
Chapter 4
Function Optimization
There are three main reasons why most problems in robotics, vision, and arguably every other science or endeavor take on the form of optimization problems. One is that the desired goal may not be achievable, and so we try to get as close as possible to it. The second reason is that there may be more than one way to achieve the goal, and so we can choose one by assigning a quality to all the solutions and selecting the best one. The third reason is that we may not know how to solve the system of equations f(x) = 0, so instead we minimize the norm ||f(x)||, which is a scalar function of the unknown vector x.

We have encountered the first two situations when talking about linear systems. The case in which a linear system admits exactly one exact solution is simple but rare. More often, the system at hand is either incompatible (some say overconstrained) or, at the opposite end, underdetermined. In fact, some problems are both, in a sense. While these problems admit no exact solution, they often admit a multitude of approximate solutions. In addition, many problems lead to nonlinear equations.

Consider, for instance, the problem of Structure From Motion (SFM) in computer vision. Nonlinear equations describe how points in the world project onto the images taken by cameras at given positions in space. Structure from motion goes the other way around, and attempts to solve these equations: image points are given, and one wants to determine where the points in the world and the cameras are. Because image points come from noisy measurements, they are not exact, and the resulting system is usually incompatible. SFM is then cast as an optimization problem. On the other hand, the exact system (the one with perfect coefficients) is often close to being underdetermined. For instance, the images may be insufficient to recover a certain shape under a certain motion. Then, an additional criterion must be added to define what a good solution is. In these cases, the noisy system admits no exact solutions, but has many approximate ones.

The term "optimization" is meant to subsume both minimization and maximization. However, maximizing the scalar function f(x) is the same as minimizing −f(x), so we consider optimization and minimization to be essentially synonyms. Usually, one is after global minima. However, global minima are hard to find, since they involve a universal quantifier: x* is a global minimum of f if for every other x we have f(x) ≥ f(x*). Global minimization techniques like simulated annealing have been proposed, but their convergence properties depend very strongly on the problem at hand. In this chapter, we consider local minimization: we pick a starting point x_0, and we descend in the landscape of f(x) until we cannot go down any further. The bottom of the valley is a local minimum.

Local minimization is appropriate if we know how to pick an x_0 that is close to x*. This occurs frequently in feedback systems. In these systems, we start at a local (or even a global) minimum. The system then evolves and escapes from the minimum. As soon as this occurs, a control signal is generated to bring the system back to the minimum. Because of this immediate reaction, the old minimum can often be used as a starting point x_0 when looking for the new minimum, that is, when computing the required control signal. More formally, we reach the correct minimum x* as long as the initial point x_0 is in the basin of attraction of x*, defined as the largest neighborhood of x* in which f(x) is convex.
Good references for the discussion in this chapter are Matrix Computations, Practical Optimization, and Numerical Recipes in C, all of which are listed with full citations in section 1.4. 39
40
k=0
while xk is not a minimum compute step direction pk with kpk k = 1 compute step size k xk+1 = xk + k pk
k = k+1
end. Different algorithms differ in how each of these instructions is performed. It is intuitively clear that the choice of the step size k is important. Too small a step leads to slow convergence, or even to lack of convergence altogether. Too large a step causes overshooting, that is, leaping past the solution. The most disastrous consequence of this is that we may leave the basin of attraction, or that we oscillate back and forth with increasing amplitudes, leading to instability. Even when oscillations decrease, they can slow down convergence considerably. What is less obvious is that the best direction of descent is not necessarily, and in fact is quite rarely, the direction of steepest descent, as we now show. Consider a simple but important case,
where Q is a symmetric, positive denite matrix. Positive denite means that for every nonzero x the quantity xT Qx is positive. In this case, the graph of f (x) ; c is a plane aT x plus a paraboloid. Of course, if f were this simple, no descent methods would be necessary. In fact the minimum of f can be found by setting its gradient to zero:
1 xT Qx f (x ) = c + a T x + 2
(4.1)
@f = a + Qx = 0 @x Qx = ;a :
so that the minimum x is the solution to the linear system (4.2) Since Q is positive denite, it is also invertible (why?), and the solution x is unique. However, understanding the behavior of minimization algorithms in this simple case is crucial in order to establish the convergence properties of these algorithms for more general functions. In fact, all smooth functions can be approximated by paraboloids in a sufciently small neighborhood of any point. Let us therefore assume that we minimize f as given in equation (4.1), and that at every step we choose the direction of steepest descent. In order to simplify the mathematics, we observe that if we let
T e ~(x) = 1 2 (x ; x ) Q(x ; x )
then we have
(4.3)
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT so that e ~ and f differ only by a constant. In fact,
41
1 T 1 T T T f (x ) = c + a T x + 1 x Qx = c ; x Qx + x Qx = c ; x Qx : 2 2 2 Since e ~ is simpler, we consider that we are minimizing e ~ rather than f . In addition, we can let y= x;x
that is, we can shift the origin of the domain to x , and study the function
instead of f or e ~, without loss of generality. We will transform everything back to course, by construction, the new minimum is at y =0 where e reaches a value of zero:
1 y T Qy e(y) = 2
e(y ) = e(0) = 0 :
y0
However, we let our steepest descent algorithm nd this minimum by starting from the initial point
= x0 ; x :
k = ; kg g k k
At every iteration k, the algorithm chooses the direction of steepest descent, which is in the direction pk opposite to the gradient of e evaluated at yk : gk
We select for the algorithm the most favorable step size, that is, the one that takes us from yk to the lowest point in the direction of pk . This can be found by differentiating the function
with respect to , and setting the derivative to zero to obtain the optimal step k . We have
@e(yk + pk ) = (y + p )T Qp k k k @
and setting this to zero yields
(Qy )T pk k = ; pT k k Qpk
pk pk gk gk k pk = ; pg T Qp = kgk k pT Qp = kgk k gT Qg : k k k
(4.4)
Thus, the basic step of our steepest descent can be written as follows: yk+1
k gk = yk + kgk k gg T Qg pk k k
42 that is,
Tg k k g : = yk ; g (4.5) k gT k Qgk How much closer did this step bring us to the solution y = 0? In other words, how much smaller is e(yk+1), relative to the value e(yk ) at the previous step? The answer is, often not much, as we shall now prove. The
yk+1 arguments and proofs below are adapted from D. G. Luenberger, Introduction to Linear and Nonlinear Programming, Addison-Wesley, 1973. From the denition of e and from equation (4.5) we obtain
T gT k gk g yT k k Qyk ; yk ; gT g Q k k yT k Qy k
T T
k gk Q yk ; gg T Qg gk k k
2 T gk Qgk
= Qyk )
yT k Qy k
yk
= Q;1gk
;1 = gT k Q gk
e(yk )
(4.6)
so if we can bound the expression in parentheses we have a bound on the rate of convergence of steepest descent. To this end, we introduce the following result. Lemma 4.1.1 (Kantorovich inequality) Let Q be a positive denite, symmetric, n holds 4 1 n (yT y)2 T y Q;1y yT Qy ( 1 + n )2 where Proof.
Q = U UT be the singular value decomposition of the symmetric (hence V = U ) matrix Q. Because Q is positive denite, all its
singular values are strictly positive, since the smallest of them satises yT Qy > 0 n = kmin yk=1
4.1. LOCAL MINIMIZATION AND STEEPEST DESCENT by the denition of positive deniteness. If we let z = UT y we have
43
Pn 1 = ) P = n i=1 =i i = ( z ( ) i=1 i i
(4.7)
(4.8)
then the numerator ( ) in (4.7) is 1= . Of course, there are many ways to choose the coefcients i to obtain a particular value of . However, each of the singular values j can be obtained by letting j = 1 and all other i to zero. Thus, the values 1= j for j = 1 : : : n are all on the curve 1= . The denominator ( ) in (4.7) is a convex combination of points on this curve. Since 1= is a convex function of , the values of the denominator ( ) of (4.7) must be in the shaded area in gure 4.1. This area is delimited from above by the straight line that connects point ( 1 1= 1) with point ( n 1= n), that is, by the line with ordinate
( ) = ( 1 + n ; )=( 1 n ) :
,,
() ()
() 1 2 n
Figure 4.1: Kantorovich inequality. For the same vector of coefcients i , the values of ( ), the value of given by (4.8). Thus an appropriate bound is
( ) ( )
min
( ) = min 1= : ( ) 1 n ( 1 + n ; )=( 1 n )
Thanks to this lemma, we can state the main result on the convergence of the method of steepest descent. Theorem 4.1.2 Let
be a quadratic function of x, with Q symmetric and positive denite. For any x 0 , the method of steepest descent xk+1 where gk converges to the unique minimum point x of f . Furthermore, at every step k there holds
1 xT Qx f (x ) = c + a T x + 2
k gk = xk ; gg T Qg gk k k T
(4.9)
f (xk+1) ; f (x )
where Proof.
(f (xk ) ; f (x ))
we immediately obtain the expression for steepest descent in terms of f and x. By equations (4.3) and (4.6) and the Kantorovich inequality we obtain
T e(y) = 1 2 y Qy
(4.10)
1; n 1+ n
1 n 1; ( 4+ e(yk ) 1 n )2
(4.11) (4.12)
(f (xk ) ; f (x )) :
Since the ratio in the last term is smaller than one, it follows immediately that f (x k ) ; f (x the minimum of f is unique, that xk ! x .
The ratio (Q) = 1= n is called the condition number of Q. The larger the condition number, the closer the fraction ( 1 ; n )=( 1 + n ) is to unity, and the slower convergence. It is easily seen why this happens in the case in which x is a two-dimensional vector, as in gure 4.2, which shows the trajectory xk superimposed on a set of isocontours of f (x). There is one good, but very precarious case, namely, when the starting point x0 is at one apex (tip of either axis) of an isocontour ellipse. In that case, one iteration will lead to the minimum x . In all other cases, the line in the direction pk of steepest descent, which is orthogonal to the isocontour at x k , will not pass through x . The minimum of f along that line is tangent to some other, lower isocontour. The next step is orthogonal to the latter isocontour (that is, parallel to the gradient). Thus, at every step the steepest descent trajectory is forced to make a ninety-degree turn. If isocontours were circles ( 1 = n ) centered at x , then the rst turn would make the new direction point to x , and
45
x*
x0 p0 p1
Figure 4.2: Trajectory of steepest descent. minimization would get there in just one more step. This case, in which because then
The more elongated the isocontours, that is, the greater the condition number (Q), the farther away a line orthogonal to an isocontour passes from x , and the more steps are required for convergence. For general (that is, non-quadratic) f , the analysis above applies once xk gets close enough to the minimum, so that f is well approximated by a paraboloid. In this case, Q is the matrix of second derivatives of f with respect to x, and is called the Hessian of f . In summary, steepest descent is good for functions that have a well conditioned Hessian near the minimum, but can become arbitrarily slow for poorly conditioned Hessians. To characterize the speed of convergence of different minimization algorithms, we introduce the notion of the order of convergence. This is dened as the largest value of q for which the
1; n =0: 1+ n
is nite. If
is this limit, then close to the solution (that is, for large values of k) we have
for a minimization method of order q. In other words, the distance of xk from x is reduced by the q-th power at every step, so the higher the order of convergence, the better. Theorem 4.1.2 implies that steepest descent has at best a linear order of convergence. In fact, the residuals jf (xk ) ; f (x )j in the values of the function being minimized converge linearly. Since the gradient of f approaches zero when xk tends to x , the arguments xk to f can converge to x even more slowly. To complete the steepest descent algorithm we need to specify how to check whether a minimum has been reached. One criterion is to check whether the value of f (xk ) has signicantly decreased from f (xk;1). Another is to check whether xk is signicantly different from x k;1. Close to the minimum, the derivatives of f are close to zero, so jf (xk ) ; f (xk;1)j may be very small but kxk ; xk;1k may still be relatively large. Thus, the check on xk is more stringent, and therefore preferable in most cases. In fact, usually one is interested in the value of x , rather than in that of f (x ). In summary, the steepest descent algorithm can be stopped when
kxk+1 ; x k
kxk ; x kq
46
be the scalar function of one variable that is obtained by restricting the function f to the line through the current point xk and in the direction of p k . Line search rst determines two points a c that bracket the desired minimum k , in the sense that a k c, and then picks a point between a and c, say, b = (a + c)=2. The only difculty here is to nd c. In fact, we can set a = 0, corresponding through equation (4.13) to the starting point x k . A point c that is on the opposite side of the minimum with respect to a can be found by increasing through values 1 = a 2 : : : until i is greater than i;1. Then, if we can assume that h is convex between 1 and i , we can set c = i. In fact, the derivative of h at a is negative, so the function is initially decreasing, but it is increasing between i;1 and i = c, so the minimum must be somewhere between a and c. Of course, if we cannot assume convexity, we may nd the wrong minimum, but there is no general-purpose x to this problem. Line search now proceeds by shrinking the bracketing triple (a b c) until c ; a is smaller than the desired accuracy in determining k . Shrinking works as follows: if b ; a > c ; b if f (u)
where the positive constant is provided by the user. In our analysis of steepest descent, we used the Hessian Q in order to compute the optimal step size (see equation (4.4)). We used Q because it was available, but its computation during steepest descent would in general be overkill. In fact, only gradient information is necessary to nd pk , and a line search in the direction of pk can be used to determine the step size k . In contrast, the Hessian of f (x) requires computing n 2 second derivatives if x is an n-dimensional vector. Using line search to nd k guarantees that a minimum in the direction pk is actually reached even when the parabolic approximation is inadequate. Here is how line search works. Let h( ) = f (xk + pk ) (4.13)
(a b c) = (b u c)
end end. It is easy to see that in each case the bracketing triple (a b c) preserves the property that f (b) f (a) and f (b) f (c), and therefore the minimum is somewhere between a and c. In addition, at every step the interval (a c) shrinks to 3=4 of its previous size, so line search will nd the minimum in a number of steps that is logarithmic in the desired accuracy.
4.2. NEWTONS METHOD can be expected to be faster. This is the idea of Newtons method, which we now summarize. Let
47
f (x k +
x)
f (xk ) + gT k
x+
be the rst terms of the Taylor series expansion of f about the current point xk , where gk and
1 xT Q x k k 2
(4.14)
= g(xk ) = @f @ x x=xk
x=x k are the gradient and Hessian of f evaluated at the current point xk . Notice that even when f is a paraboloid, the gradient gk is different from a as used in equation (4.1). In fact, a and Q are the coefcients of the Taylor expansion of f around point x = 0, while g k and Qk are the coefcients of the Taylor expansion of f around the current point xk . In other words, gradient and Hessian are constantly reevaluated in Newtons method. To the extent that approximation (4.14) is valid, we can set the derivatives of f (x k + x) with respect to x to zero, and obtain, analogously to equation (4.2), the linear system
2 6 =6 4
3 7 7 5
Qk
whose solution xk = k pk yields at the same time the step direction pk = xk =k xk k and the step size k = k xk k. The direction is of course undened once the algorithm has reached a minimum, that is, when k = 0. A minimization algorithm in which the step direction p k and size k are dened in this manner is called Newtons method. The corresponding pk is termed the Newton direction, and the step dened by equation (4.15) is the Newton step. The greater speed of Newtons method over steepest descent is borne out by analysis: while steepest descent has a linear order of convergence, Newtons method is quadratic. In fact, let y(x) = x ; Q(x);1 g(x) be the place reached by a Newton step starting at x (see equation (4.15)), and suppose that at the minimum x the Hessian Q(x ) is nonsingular. Then y(x ) = x because g(x
x = ;gk
(4.15)
) = 0, and
xk+1 ; x
where ^ x is some point on the line between x the rst term in the right-hand side above vanishes, and
@y 1 @2y ( x ; x ) + kxk ; x k2 k @ xT x=x 2 @ x@ xT x=x ^ and xk . Since y(x ) = x , the rst derivatives of y at x are zero, so that
kxk+1 ; x k c kxk ; x k2
where c depends on third-order derivatives of f near x . Thus, the convergence rate of Newtons method is of order at least two. For a quadratic function, as in equation (4.1), steepest descent takes many steps to converge, while Newtons method reaches the minimum in one step. However, this single iteration in Newtons method is more expensive,
48
because it requires both the gradient gk and the Hessian Qk to be evaluated, for a total of n + n 2 derivatives. In addition, the Hessian must be inverted, or, at least, system (4.15) must be solved. For very large problems, in which the dimension n of x is thousands or more, storing and manipulating a Hessian can be prohibitive. In contrast, steepest descent requires the gradient gk for selecting the step direction p k , and a line search in the direction pk to nd the step size. The method of conjugate gradients, discussed in the next section, is motivated by the desire to accelerate convergence with respect to the steepest descent method, but without paying the storage cost of Newtons method.
49
4.3.1
where Q is a symmetric, positive denite matrix, and x has descent, the minimum x is the solution to the linear system
1 xT Qx f (x ) = c + a T x + 2 Qx = ;a :
n components.
(4.17)
We know how to solve such a system. However, all the methods we have seen so far involve explicit manipulation of the matrix Q. We now consider an alternative solution method that does not need Q, but only the quantity gk
= Qxk + a
that is, the gradient of f (x), evaluated at n different points x 1 : : : xn. We will see that the conjugate gradients method requires n gradient evaluations and n line searches in lieu of each n n matrix inversion in Newtons method. Formal proofs can be found in Elijah Polak, Optimization Algorithms and consistent approximations, Springer, NY, 1997. The arguments offered below appeal to intuition. Consider the case n = 3, in which the variable x in f (x) is a three-dimensional vector. Then the quadratic function f (x) is constant over ellipsoids, called isosurfaces, centered at the minimum x . How can we start from a point x0 on one of these ellipsoids and reach x by a nite sequence of one-dimensional searches? In connection with steepest descent, we noticed that for poorly conditioned Hessians orthogonal directions lead to many small steps, that is, to slow convergence. When the ellipsoids are spheres, on the other hand, this works much better. The rst step takes from x0 to x1, and the line between x0 and x1 is tangent to an isosurface at x1 . The next step is in the direction of the gradient, so that the new direction p1 is orthogonal to the previous direction p 0 . This would then take us to x right away. Suppose however that we cannot afford to compute this special direction p1 orthogonal to p 0 , but that we can only compute some direction p1 orthogonal to p 0 (there is an n ; 1-dimensional space of such directions!). It is easy to see that in that case n steps will take us to x . In fact, since isosurfaces are spheres, each line minimization is independent of the others: The rst step yields the minimum in the space spanned by p0 , the second step then yields the minimum in the space spanned by p0 and p1 , and so forth. After n steps we must be done, since p0 : : : pn;1 span the whole space. In summary, any set of orthogonal directions, with a line search in each direction, will lead to the minimum for spherical isosurfaces. Given an arbitrary set of ellipsoidal isosurfaces, there is a one-to-one mapping with a spherical system: if Q = U U T is the SVD of the symmetric, positive denite matrix Q, then we can write
1 x T Qx = 1 y T y 2 2
where Consequently, there must be a condition for the original problem (in terms of Q) that is equivalent to orthogonality for the spherical problem. If two directions qi and qj are orthogonal in the spherical context, that is, if qT i qj y=
1=2U T x :
(4.18)
=0
what does this translate into in terms of the directions p i and pj for the ellipsoidal problem? We have qi j
= 1=2U T pi j
pT iU
1=2 1=2U T p = 0 j
pT i Qpj
This condition is called Q-conjugacy, or Q-orthogonality: if equation (4.19) holds, then p i and pj are said to be Q-conjugate or Q-orthogonal to each other. We will henceforth simply say conjugate for brevity. In summary, if we can nd n directions p0 : : : pn;1 that are mutually conjugate, and if we do line minimization along each direction pk , we reach the minimum in at most n steps. Of course, we cannot use the transformation (4.18) in the algorithm, because and especially U T are too large. So now we need to nd a method for generating n conjugate directions without using either Q or its SVD. We do this in two steps. First, we nd conjugate directions whose denitions do involve Q. Then, in the next subsection, we rewrite these expressions without Q. Here is the procedure, due to Hestenes and Stiefel (Methods of conjugate gradients for solving linear systems, J. Res. Bureau National Standards, section B, Vol 49, pp. 409-436, 1952), which also incorporates the steps from x0 to xn : g0 = g(x0 ) p0 = ;g0 for k = 0 : : :
=0:
(4.19)
end where
n;1 = arg min k 0 f (xk + pk ) xk+1 = xk + k pk gk+1 = g(xk+1) gT k+1 Qpk k= p T Qp k k pk+1 = ;gk+1 + k pk
gk
is the gradient of f at xk . It is simple to see that pk and pk+1 are conjugate. In fact, pT k Qpk+1
= g(xk ) = @f @ x x=xk
pT k Q(;gk+1 + k pk ) gT k+1Qpk pT Qp + ;pT Q g k k+1 k k pT k Qpk
= =
T = ;pT k Qgk+1 + gk+1Qpk = 0 : It is somewhat more cumbersome to show that pi and pk+1 for i = 0 : : : k are also conjugate. This can be done by
induction. The proof is based on the observation that the vectors p k are found by a generalization of Gram-Schmidt (theorem 2.4.2) to produce conjugate rather than orthogonal vectors. Details can be found in Polaks book mentioned earlier.
4.3.2
The algorithm shown in the previous subsection is a correct conjugate gradients algorithm. However, it is computationally inadequate because the expression for k contains the Hessian Q, which is too large. We now show that k can be rewritten in terms of the gradient values gk and gk+1 only. To this end, we notice that gk+1
= gk + k Qpk
51
k Qpk = gk+1 ; gk :
g(x) = a + Qx
and Q has disappeared. This expression for k can be further simplied by noticing that
=0
because the line along pk is tangent to an isosurface at xk+1 , while the gradient gk+1 is orthogonal to the isosurface at xk+1. Similarly, pT k ;1 g k = 0 : Then, the denominator of k becomes
T pT k (gk+1 ; gk ) = ;pk gk
= (gk ; k;1pk;1)T gk = gT k gk : :
k=
gT k+1(gk+1 ; gk ) gT k gk
4.3.3
We now know how to minimize the quadratic function (4.16) in n steps, without ever constructing the Hessian explicitly. When the function f (x) is arbitrary, the same algorithm can be used. However, n iterations will not sufce. In fact, the Hessian, which was constant for the quadratic case, now is a function of x k . Strictly speaking, we then lose conjugacy, since pk and pk+1 are associated to different Hessians. However, as the algorithm approaches the minimum x , the quadratic approximation becomes more and more valid, and a few cycles of n iterations each will achieve convergence.
52
Chapter 5
UTb = V T x
c= y
and where is diagonal. This is a fundamental transformation to use whenever the domain and the range of A are separate spaces. Often, however, domain and range are intimately related to one another even independently of the transformation A. The most important example is perhaps that of a system of linear differential equations, of the form
_ = Ax x
where A is n n. For this equation, the fact that A is square is not a coincidence. In fact, x is assumed to be a function _ is the derivative of x with respect to t: of some real scalar variable t (often time), and x
_= x
_ , and one cannot change coordinates for x In other words, there is an intimate, pre-existing relation between x and x _ accordingly. In fact, if V is an orthogonal matrix and we dene without also changing those for x
y = V Tx
dx : dt
d V T x = V T dx = V T x _: dt dt
A = S ST
53
(5.1)
54 so that if we dene
_ = y: y
This is now much easier to handle, because it is a system of n independent, scalar differential equations, which can be solved separately. The solutions can then be recombined through x = Sy : We will see all of this in greater detail soon. Unfortunately, writing A in the form (5.1) is not always possible. This stands to reason, because now we are imposing stronger constraints on the terms of the decomposition. It is like doing an SVD but with the additional constraint U = V . If we refer back to gure 3.1, now the circle and the ellipse live in the same space, and the constraint U = V implies that the vectors vi on the circle that map into the axes iui of the ellipse are parallel to the axes themselves. This will only occur for very special matrices. In order to make a decomposition like (5.1) possible, we weaken the constraints in several ways: the elements of S and are allowed to be complex, rather than real; are allowed to be negative; in fact, they can be even non-real;
S H S = SS H = I
so unitary generalizes orthogonal for complex matrices. Unitary matrices merely rotate or ip vectors, in the sense that they do not alter the vectors norms. For complex vectors, the norm squared is dened as
kxk2 = xH x
and if S is unitary we have
kS xk2 = xH S H S x = xH x = kxk2 :
xH 1 x2
=0
H xH 1 S S x2
= xH 1 x2 = 0 :
In contrast, a nonunitary transformation Q can change the norms of vectors, as well as the inner products between vectors. A matrix that is equal to its Hermitian is called a Hermitian matrix. In summary, in order to diagonalize a square matrix A from a system of linear differential equations we generally look for a decomposition of A of the form A = Q Q;1 (5.2)
55 where Q and are complex, Q is invertible, and is diagonal. For some special matrices, this may specialize to
A = S SH
with unitary S . Whenever two matrices A and B , diagonal or not, are related by
A = QBQ;1
they are said to be similar to each other, and the transformation of transformation. The equation A = Q Q;1 can be rewritten as follows:
AQ = Q
or separately for every column of Q as follows: where
Aqi = i qi
qn and
(5.3)
Q=
q1
= diag( 1 : : : n) :
are solutions of the eigenvalue/eigenvector equation (5.4) x
Ax =
which is how eigenvalues and eigenvectors are usually introduced. In contrast, we have derived this equation from the requirement of diagonalizing a matrix by a similarity transformation. The columns of Q are called eigenvectors, and the diagonal entries of are called eigenvalues.
2
1.5
0.5
0.5
1.5
Figure 5.1: Effect of the transformation (5.5) on a sample of points on the unit circle. The dashed lines are vectors that do not change direction under the transformation. That real eigenvectors and eigenvalues do not always exist can be claried by considering the eigenvalue problem from a geometrical point of view in the n = 2 case. As we know, an invertible linear transformation transforms the unit circle into an ellipse. Each point on the unit circle is transformed into some point on the ellipse. Figure 5.1 shows the effect of the transformation represented by the matrix
=3 4= 3 A = 20 2
(5.5)
56
for a sample of points on the unit circle. Notice that there are many transformations that map the unit circle into the same ellipse. In fact, the circle in gure 5.1 can be rotated, pulling the solid lines along. Each rotation yields another matrix A, but the resulting ellipse is unchanged. In other words, the curve-to-curve transformation from circle to ellipse is unique, but the point-to-point transformation is not. Matrices represent point-to-point transformations. The eigenvalue problem amounts to nding axes q1 q2 that are mapped into themselves by the original transformation A (see equation (5.3)). In gure 5.1, the two eigenvectors are shown as dashed lines. Notice that they do not correspond to the axes of the ellipse, and that they are not orthogonal. Equation (5.4) is homogeneous in x, so x can be assumed to be a unit vector without loss of generality. Given that the directions of the input vectors are generally changed by the transformation A, as evident from gure 5.1, it is not obvious whether the eigenvalue problem admits a solution at all. We will see that the answer depends on the matrix A, and that a rather diverse array of situations may arise. In some cases, the eigenvalues and their eigenvectors exist, but they are complex. The geometric intuition is hidden, and the problem is best treated as an algebraic one. In other cases, all eigenvalues exist, perhaps all real, but not enough eigenvectors can be found, and the matrix A cannot be diagonalized. In particularly good cases, there are n real, orthonormal eigenvectors. In bad cases, we have to give up the idea of diagonalizing A, and we can only triangularize it. This turns out to be good enough for solving linear differential systems, just as triangularization was sufcient for solving linear algebraic systems.
Ax =
x (5.6)
This is a homogeneous, square system of equations, which admits nontrivial solutions iff the matrix rank-decient. A square matrix B is rank-decient iff its determinant,
A; I
is
is zero. In this expression, Bij is the algebraic complement of entry bij , dened as the (n ; 1) (n ; 1) matrix obtained by removing row i and column j from B . Volumes have been written about the properties of the determinant. For our purposes, it is sufcient to recall the following properties from linear algebra:
det(B ) = det(B T ); det( b1 bn ) = 0 iff b1 : : : bn are linearly dependent; det( b1 bi bj bn ) = ; det( b1 bj det(BC ) = det(B ) det(C ).
Thus, for system (5.6) to admit nontrivial solutions, we need
bi
bn
);
det(A ; I ) = 0 :
(5.7)
From the denition of determinant, it follows, by very simple induction, that the left-hand side of equation (5.7) is a polynomial of degree n in , and that the coefcient of n is 1. Therefore, equation (5.7), which is called the characteristic equation of A, has n complex solutions, in the sense that
where some of the exactly n distinct eigenvalues is of particular interest, because of the following results.
det(A ; I ) = (;1)n ( ; 1 ) : : : ( ; n ) i may coincide. In other words, an n n matrix has at most n distinct eigenvalues.
The case of
57
: : : xk corresponding to distinct eigenvalues 1 : : : k are linearly independent. Proof. Suppose that c1 x1 + : : : + ck xk = 0 where the xi are eigenvectors of a matrix A. We need to show that then c1 = : : : = ck = 0. By multiplying by A we obtain c1 Ax1 + : : : + ck Axk = 0 and because x1 : : : xk are eigenvectors corresponding to eigenvalues 1 : : : k , we have c1 1x1 + : : : + ck k xk = 0 : (5.8)
However, from we also have
c1 x1 + : : : + ck xk = 0 c1 k x1 + : : : + ck k xk = 0
c1( 1 ; k )x1 + : : : + ck;1( k;1 ; k )xk;1 = 0 : Thus, we have reduced the summation to one containing k ; 1 terms. Since all i are distinct, the differences in parentheses are all nonzero, and we can replace each xi by x0 i = ( i ; k )xi , which is still an eigenvector of A: c1 x01 + : : : + ck;1x0k = 0 : We can repeat this procedure until only one term remains, and this forces c1 = 0, so that c2 x2 + : : : + ck xk = 0 This entire argument can be repeated for the last equation, therefore forcing c2 = 0, and so forth. In summary, the equation c1 x1 + : : : + ck xk = 0 implies that c1 = : : : = ck = 0, that is, that the vectors x1 : : : xk
are linearly independent. For Hermitian matrices (and therefore for real symmetric matrices as well), the situation is even better. Theorem 5.1.2 A Hermitian matrix has real eigenvalues. Proof. A matrix A is Hermitian iff A = AH . Let and x be an eigenvalue of A and a corresponding eigenvector:
Ax =
By taking the Hermitian we obtain Since A = AH , the last equation can be rewritten as follows: xH A = x H Ax x H Ax which implies that xH AH
x: xH xH
(5.9)
:
(5.10)
If we multiply equation (5.9) from the left by x H and equation (5.10) from the right by x, we obtain
= =
xH x
xH x
xH x =
xH x
58
=
as promised.
Corollary 5.1.3 A real and symmetric matrix has real eigenvalues. Proof. A real and symmetric matrix is Hermitian.
Theorem 5.1.4 Eigenvectors corresponding to distinct eigenvalues of a Hermitian matrix are mutually orthogonal. Proof. Let and be two distinct eigenvalues of A, and let x and y be corresponding eigenvectors:
Ax = Ay =
because A = AH and from theorem 5.1.2 the right, respectively, we obtain
x y
)
= =
y H A = yH
yH x = y H x
( ; )yH x = 0 :
Corollary 5.1.5 An n
Proof. From theorem 5.1.4, the eigenvectors of an n n Hermitian matrix with n distinct eigenvalues are all mutually orthogonal. Since the eigenvalue equation Ax = x is homogeneous in x, the vector x can be normalized without violating the equation. Consequently, the eigenvectors can be made to be orthonormal. In summary, any square matrix with n distinct eigenvalues can be diagonalized by a similarity transformation, and any square Hermitian matrix with n distinct eigenvalues can be diagonalized by a unitary similarity transformation. Notice that the converse is not true: a matrix can have coincident eigenvalues and still admit n independent, and even orthonormal, eigenvectors. For instance, the n n identity matrix has n equal eigenvalues but n orthonormal eigenvectors (which can be chosen in innitely many ways). The examples in section 5.2 show that when some eigenvalues coincide, rather diverse situations can arise concerning the eigenvectors. First, however, we point out a simple but fundamental fact about the eigenvalues of a triangular matrix. Theorem 5.1.6 The determinant of a triangular matrix is the product of the elements on its diagonal.
59
Proof. This follows immediately from the denition of determinant. Without loss of generality, we can assume a triangular matrix B to be upper-triangular, for otherwise we can repeat the argument for the transpose, which because of the properties above has the same eigenvalues. Then, the only possibly nonzero bi1 of the matrix B is b11, and the summation in the denition of determinant given above reduces to a single term:
det(B ) =
if B is 1 otherwise
1 :
By repeating the argument for B11 and so forth until we are left with a single scalar, we obtain
Corollary 5.1.7 The eigenvalues of a triangular matrix are the elements on its diagonal. Proof. The eigenvalues of a matrix A are the solutions of the equation
det(A ; I ) = 0 :
If A is triangular, so is B
= a11 : : : ann :
Note that diagonal matrices are triangular, so this result holds for diagonal matrices as well.
0 A= 2 0 1
s2
(5.11)
= 1 0
= 0 1
Matrices with n orthonormal eigenvectors are called normal. n n system of differential equations _ = Ax x has solution x(t) =
n X i=1
2 e 1t ci si e i t = S 6 4
..
e nt
3 7 5c
60 where S
= s1
sn] are the eigenvectors, i are the eigenvalues, and the vector c of constants ci is c = S H x(0) :
More compactly,
2 e 1t 6 x(t) = S 4
..
e nt
3 7 5 SH x(0) :
Fortunately these matrices occur frequently in practice. However, not all matrices are as good as these. First, there may still be a complete set of n eigenvectors, but they may not be orthonormal. An example of such a matrix is
2 ;1 0 1
q2
= 22 1 1 : This is conceptually only a slight problem, because the unitary matrix S is replaced by an invertible matrix Q, and the solution becomes 3 2 e 1t 7 6 .. x(t) = Q 4 5 Q;1x(0) : . e nt 1 q1 = 0
Computationally this is more expensive, because a computation of a Hermitian is replaced by a matrix inversion. However, things can be worse yet, and a full set of eigenvectors may fail to exist, as we now show. A necessary condition for an n n matrix to be defective, that is, to have fewer than n eigenvectors, is that it have repeated eigenvalues. In fact, we have seen (theorem 5.1.1) that a matrix with distinct eigenvalues (zero or nonzero does not matter) has a full set of eigenvectors (perhaps nonorthogonal, but independent). The simplest example of a defective matrix is which has double eigenvalue 0 and only eigenvector
0 1 0 0 1 0]T , while 3 1 0 3
has double eigenvalue 3 and only eigenvector 1 0]T , so zero eigenvalues are not the problem. However, repeated eigenvalues are not a sufcient condition for defectiveness, as the identity matrix proves. How bad can a matrix be? Here is a matrix that is singular, has fewer than n eigenvectors, and the eigenvectors it has are not orthogonal. It belongs to the scum of all matrices:
0 0 2 Its eigenvalues are 0, because the matrix is singular, and 2, repeated twice. A has to have a repeated eigenvalue if it is
to be defective. Its two eigenvectors are
2 0 2 ;1 3 A=4 0 2 1 5 :
213 q1 = 4 0 5
0
q2
= 2
p 213 2
415
0
corresponding to eigenvalues 0 and 2 in this order, and there is no q 3. Furthermore, q1 and q2 are not orthogonal to each other.
61
T = Q;1 AQ
then the eigenvalues of the triangular matrix T are equal to those of the original matrix A. In fact, if
Ax =
then that is, where so
x x
QTQ;1x = Ty =
y
is also an eigenvalue for T . The eigenvectors, however, are changed according to the last equation. The Schur decomposition does even better, since it triangularizes any square matrix A by a unitary (possibly complex) transformation: This transformation is equivalent to factoring A into the product
y = Q ;1 x
T = S H AS : A = STS H
and this product is called the Schur decomposition of A. Numerically stable and efcient algorithms exist for the Schur decomposition. In this note, we will not study these algorithms, but only show that all square matrices admit a Schur decomposition.
5.3.1
An important preliminary fact concerns vector rotations. Let e1 be the rst column of the identity matrix. It is intuitively obvious that any nonzero real vector x can be rotated into a vector parallel to e 1. Formally, take any orthogonal matrix S whose rst column is x s1 = kxk : Since sT 1 x = xT x=kxk = kxk, and since all the other sj are orthogonal to s1 , we have
which is parallel to e 1 as desired. It may be less obvious that a complex vector x can be transformed into a real vector parallel to e1 by a unitary transformation. But the trick is the same: let s1
= kx : xk
62
just about like before. We are now ready to triangularize an arbitrary square matrix A.
5.3.2
The Schur decomposition theorem is the cornerstone of eigenvalue computations. It states that any square matrix can be triangularized by unitary transformations. The diagonal elements of a triangular matrix are its eigenvalues, and unitary transformations preserve eigenvalues. Consequently, if the Schur decomposition of a matrix can be computed, its eigenvalues can be determined. Moreover, as we will see later, a system of linear differential equations can be solved regardless of the structure of the matrix of its coefcients. Lemma 5.3.1 If A is an n
n matrix and
Ax =
then there is a transformation where U is a unitary, n
(5.12)
T = U H AU
2 3 6 0 7 7 T =6 C 6 7 4 ... 5:
0
Proof. Let U be a unitary transformation that transforms the (possibly complex) eigenvector x of A into a real vector on the x1 axis:
where r is the nonzero norm of x. By substituting this into (5.12) and rearranging we have
2r3 2r3 607 6 07 7 6 7 AU 6 = U 6 7 6 . . . . 4.5 4.7 5 0 0 2r3 2r3 607 6 07 7 6 7 U H AU 6 = 6 7 6 . 4 .. 5 4 ... 7 5 0 3 2 3 20 1 1 6 607 07 7 6 7 = U H AU 6 7 6 6 . 4 ... 7 5 4 .. 5
0 0
63
S can be chosen so that the eigenvalues i of A appear in any order along the
1A1 = 1 :
Suppose it holds for all matrices of order n ; 1. Then from the lemma there exists a unitary U such that
2 3 6 0 7 7 U H AU = 6 C 6 7 . 4 .. 5
0
wH
where
(n ; 1) matrix G:
C=
: 0 V
By the inductive hypothesis, there is a unitary matrix V such that V H GV is a Schur decomposition of G. Let
2 1 6 0 S=U6 6 4 ...
0
3 7 7 7 5:
Clearly, S is a unitary matrix, and S H AS is upper-triangular. Since the elements on the diagonal of a triangular matrix are the eigenvalues, SH AS is the Schur decomposition of A. Because we can pick any eigenvalue as , the order of eigenvalues can be chosen arbitrarily.
This theorem does not say how to compute the Schur decomposition, only that it exists. Fortunately, there is a stable and efcient algorithm to compute the Schur decomposition. This is the preferred way to compute eigenvalues numerically.
64
A = AH ) A = S A real, A = AT ) A = S
In either case, is real and diagonal.
SH S T S real :
Proof. We already know that Hermitian matrices (and therefore real and symmetric ones) have real eigenvalues (theorem 5.1.2), so is real. Let now
But the only way that T can be both triangular and Hermitian is for it to be diagonal, because 0 = 0. Thus, the Schur decomposition of a Hermitian matrix is in fact a diagonalization, and this is the rst equation of the theorem (the diagonal of a Hermitian matrix must be real). Let now A be real and symmetric. All that is left to prove is that then its eigenvectors are real. But eigenvectors are the solution of the homogeneous system (5.6), which is both real and rank-decient, and therefore admits nontrivial real solutions. Thus, S is real, and SH = S T . In other words, a Hermitian matrix, real or not, with distinct eigenvalues or not, has real eigenvalues and n orthonormal eigenvectors. If in addition the matrix is real, so are its eigenvectors. We recall that a real matrix A such that for every nonzero x we have xT Ax > 0 is said to be positive denite. It is positive semidenite if for every nonzero x we have xT Ax 0. Notice that a positive denite matrix is also positive semidenite. Positive denite or semidenite matrices arise in the solution of overconstrained linear systems, because AT A is positive semidenite for every A (lemma 5.4.5). They also occur in geometry through the equation of an ellipsoid, xT Qx = 1
65
in which Q is positive denite. In physics, positive denite matrices are associated to quadratic forms xT Qx that represent energies or second-order moments of mass or force distributions. Their physical meaning makes them positive denite, or at least positive semidenite (for instance, energies cannot be negative). The following result relates eigenvalues/vectors with singular values/vectors for positive semidenite matrices. Theorem 5.4.3 The eigenvalues of a real, symmetric, positive semidenite matrix A are equal to its singular values. The eigenvectors of A are also its singular vectors, both left and right. Proof. From the previous theorem, nonnegative. In fact, from we obtain
are
sT i Asi
T 2 = sT i si = si si = ksi k = :
0.
with nonnegative diagonal entries in is the singular value decomposition A = U V T of A with = and U = V = S . Recall that the eigenvalues in the Schur decomposition can be arranged in any desired order along the diagonal.
Theorem 5.4.4 A real, symmetric matrix is positive semidenite iff all its eigenvalues are nonnegative. It is positive denite iff all its eigenvalues are positive. Proof. Theorem 5.4.3 implies one of the two directions: If A is real, symmetric, and positive semidenite, then its eigenvalues are nonnegative. If the proof of that theorem is repeated with the strict inequality, we also obtain that if A is real, symmetric, and positive denite, then its eigenvalues are positive. Conversely, we show that if all eigenvalues of a real and symmetric matrix A are positive (nonnegative) then A is positive denite (semidenite). To this end, let x be any nonzero vector. Since real and symmetric matrices have n orthonormal eigenvectors (theorem 5.4.2), we can use these eigenvectors s1 : : : sn as an orthonormal basis for Rn , and write x = c1s1 + : : : + cn sn with But then xT Ax
ci = xT si : = = =
xT A(c1s1 + : : : + cn sn) = xT (c1 As1 + : : : + cn Asn ) xT (c1
because the i are positive (nonnegative) and not all c i can be zero. Since xT Ax > 0 (or is positive denite (semidenite).
1 s1 + : : : + cn n sn ) = c1 1 xT s1 + : : : + cn n xT sn 2 1 c1 + : : : + n c2 n > 0 (or 0)
Theorem 5.4.3 establishes one connection between eigenvalues/vectors and singular values/vectors: for symmetric, positive denite matrices, the concepts coincide. This result can be used to introduce a less direct link, but for arbitrary matrices. Lemma 5.4.5
AT A is positive semidenite.
0.
Theorem 5.4.6 The eigenvalues of AT A with m n are the squares of the singular values of A; the eigenvectors of AT A are the right singular vectors of A. Similarly, for m n, the eigenvalues of AAT are the squares of the singular values of A, and the eigenvectors of AAT are the left singular vectors of A. Proof. If m
We have seen that important classes of matrices admit a full set of orthonormal eigenvectors. The theorem below characterizes the class of all matrices with this property, that is, the class of all normal matrices. To prove the theorem, we rst need a lemma. Lemma 5.4.7 If for an n n matrix B we have BB H of B equals the norm of its i-th column. Proof. From BBH
= B H B we deduce kB xk2 = xH B H B x = xH BB H x = kB H xk2 : (5.13) If x = ei , the i-th column of the n n identity matrix, B e i is the i-th column of B , and B H ei is the i-th column of B H , which is the conjugate of the i-th row of B . Since conjugation does not change the norm of a vector, the equality (5.13) implies that the i-th column of B has the same norm as the i-th row of B .
Theorem 5.4.8 An n
Proof.
AAH = ST S H ST H S H = STT H S H and AH A = ST H S H STS H = ST H TS H : Because S is invertible (even unitary), we have AAH = AH A if and only if TT H = T H T . However, a triangular matrix T for which TT H = T H T must be diagonal. In fact, from the lemma, the norm of the i-th row of T is equal to the norm of its i-th column. Let i = 1. Then, the rst column of T has norm jt11j. The rst row has rst entry t11, so the only way that its norm can be jt11j is for all other entries in the rst row to be zero. We now proceed through i = 2 : : : n, and reason similarly to conclude that T must be diagonal. The converse is also obviously true: if T is diagonal, then TT H = T H T . Thus, AAH = AH A if and only if T is diagonal, that is, if and only if A can be diagonalized by a unitary similarity transformation. This is the denition of a
normal matrix.
5.4. EIGENVALUES/VECTORS AND SINGULAR VALUES/VECTORS Corollary 5.4.9 A triangular, normal matrix must be diagonal. Proof. We proved this in the proof of theorem 5.4.8.
67
Checking that AH A = AAH is much easier than computing eigenvectors, so theorem 5.4.8 is a very useful characterization of normal matrices. Notice that Hermitian (and therefore also real symmetric) matrices commute trivially with their Hermitians, but so do, for instance, unitary (and therefore also real orthogonal) matrices:
UU H = U H U = I :
Thus, Hermitian, real symmetric, unitary, and orthogonal matrices are all normal.
68
Chapter 6
= Ax + b(t) x(0) = x0
_ x
(6.1) (6.2)
where x = x(t) is an n-dimensional vector function of time t, the dot denotes differentiation, the coefcients a ij in the n n matrix A are constant, and the vector function b(t) is a function of time. The equation (6.2), in which x0 is a known vector, denes the initial value of the solution. First, we show that scalar differential equations of order greater than one can be reduced to systems of rst-order differential equations. Then, in section 6.2, we recall a general result for the solution of rst-order differential systems from the elementary theory of differential equations. In section 6.3, we make this result more specic by showing that the solution to a homogeneous system is a linear combination of exponentials multiplied by polynomials in t. This result is based on the Schur decomposition introduced in chapter 5, which is numerically preferable to the more commonly used Jordan canonical form. Finally, in sections 6.4 and 6.5, we set up and solve a particular differential system as an illustrative example.
In fact, such an equation can be reduced to a rst-order system of the form (6.1) by introducing the n-dimensional vector
(6.3)
::: n;1
69
for i = 1 : : : n ; 1. If we write the original system (6.3) together with the obtain the rst-order system _ = Ax + b(t) x where
i xi+1 = dx dt
(6.4)
2 0 6 6 0 . . A=6 6 . 6 4 0
1 0
. . .
0 1
. . .
..
;cn;1
3 7 7 7 . . 7 . 7 1 5
0 0
2 0 3 6 0 7 6 7 6 7 . b(t) = 6 . : 7 . 6 7 4 0 5
b(t)
= Ax x(0) = x0
and xp (t) is a particular solution of x(0)
_ x
_ x
= Ax + b(t) = 0:
1 j tj X j! :
j =0
The two solution components xh and xp can be written by means of the matrix exponential, introduced in the following. For the scalar exponential e t we can write a Taylor series expansion
2 t2 e t = 1 + 1!t + 2! +
Usually1, in calculus classes, the exponential is introduced by other means, and the Taylor series expansion above is proven as a property. For matrices, the exponential eZ of a matrix Z 2 Rn n is instead dened by the innite series expansion
Z2 + eZ = I + Z + 1! 2!
1 Zj X
j =0
j! :
1 Not always. In some treatments, the exponential is dened through its Taylor series.
71
Here I is the n n identity matrix, and the general term Z j =j ! is simply the matrix Z raised to the j th power divided by the scalar j !. It turns out that this innite sum converges (to an n n matrix which we write as e Z ) for every matrix Z . Substituting Z = At gives
A2 t2 + A3 t3 + eAt = I + At + 1! 2! 3!
Differentiating both sides of (6.5) gives
1 Aj tj X
j =0
j! :
(6.5)
deAt = AeAt : dt Thus, for any vector w, the function xh (t) = eAt w satises the homogeneous differential system _ h = Axh : x By using the initial values (6.2) we obtain v = x 0, and xh (t) = eAt x(0) (6.6) is a solution to the differential system (6.1) with b(t) = 0 and initial values (6.2). It can be shown that this solution is
unique. From the elementary theory of differential equations, we also know that a particular solution to the nonhomogeneous (b(t) 6= 0) equation (6.1) is given by xp (t) =
deAt = A + A2 t + A3 t2 + dt 1! 2! 22 At = A I + 1! + A2!t +
Zt
0
eA(t;s) b(s) ds :
_p x
= AeAt
Zt
0
so xp satises equation (6.1). In summary, we have the following result. The solution to with initial value is where and
_ = Ax + b(t) x
x(0) = x0 x(t) = xh (t) + xp (t) xh (t) = eAt x(0)
xp (t) =
Zt
0
eA(t;s) b(s) ds :
(6.11)
Since we now have a formula for the general solution to a linear differential system, we seem to have all we need. However, we do not know how to compute the matrix exponential. The naive solution to use the denition (6.5)
72
requires too many terms for a good approximation. As we have done for the SVD and the Schur decomposition, we will only point out that several methods exist for computing a matrix exponential, but we will not discuss how this is done2. In a fundamental paper on the subject, Nineteen dubious ways to compute the exponential of a matrix (SIAM Review, vol. 20, no. 4, pp. 801-36), Cleve Moler and Charles Van Loan discuss a large number of different methods, pointing out that no one of them is appropriate for all situations. A full discussion of this matter is beyond the scope of these notes. When the matrix A is constant, as we currently assume, we can be much more specic about the structure of the solution (6.9) of system (6.7), and particularly so about the solution x h (t) to the homogeneous part. Specically, the matrix exponential (6.10) can be written as a linear combination, with constant vector coefcients, of scalar exponentials multiplied by polynomials. In the general theory of linear differential systems, this is shown via the Jordan canonical form. However, in the paper cited above, Moler and Van Loan point out that the Jordan form cannot be computed reliably, and small perturbations in the data can change the results dramatically. Fortunately, a similar result can be found through the Schur decomposition introduced in chapter 5. The next section shows how to do this.
= Ax x(0) = x0 :
_ x
(6.12) (6.13)
Two cases arise: either A admits n distinct eigenvalues, or is does not. In chapter 5, we have seen that if (but not only if) A has n distinct eigenvalues then it has n linearly independent eigenvectors (theorem 5.1.1), and we have shown how to nd xh (t) by solving an eigenvalue problem. In section 6.3.1, we briey review this solution. Then, in section 6.3.2, we show how to compute the homogeneous solution xh (t) in the extreme case of an n n matrix A with n coincident eigenvalues. To be sure, we have seen that matrices with coincident eigenvalues can still have a full set of linearly independent eigenvectors (see for instance the identity matrix). However, the solution procedure we introduce in section 6.3.2 for the case of n coincident eigenvalues can be applied regardless to how many linearly independent eigenvectors exist. If the matrix has a full complement of eigenvectors, the solution obtained in section 6.3.2 is the same as would be obtained with the method of section 6.3.1. Once these two extreme cases (nondefective matrix or all-coincident eigenvalues) have been handled, we show a general procedure in section 6.3.3 for solving a homogeneous or nonhomogeneous differential system for any, square, constant matrix A, defective or not. This procedure is based on backsubstitution, and produces a result analogous to that obtained via Jordan decomposition for the homogeneous part xh (t) of the solution. However, since it is based on the numerically sound Schur decomposition, the method of section 6.3.3 is superior in practice. For a nonhomogeneous system, the procedure can be carried out analytically if the functions in the right-hand side vector b(t) can be integrated.
6.3.1
In chapter 5 we saw how to nd the homogeneous part xh (t) of the solution when A has a full set of n linearly independent eigenvectors. This result is briey reviewed in this section for convenience.3 If A is not defective, then it has n linearly independent eigenvectors q1 : : : qn with corresponding eigenvalues 1 : : : n. Let Q = q1 qn : This square matrix is invertible because its columns are linearly independent. Since Aqi
A is Not Defective
= i qi , we have
AQ = Q
2 In Matlab, expm(A) is the matrix exponential of A. 3 Parts of this subsection and of the following one are based on notes written by Scott Cohen.
(6.14)
73
where = diag( 1 : : : n ) is a square diagonal matrix with the eigenvalues of A on its diagonal. Multiplying both sides of (6.14) by Q;1 on the right, we obtain A = Q Q;1: (6.15) Then, system (6.12) can be rewritten as follows:
= Ax _ = Q Q;1x x _ = Q ;1 x Q;1 x _ = y y
where y = Q;1x. The last equation (6.16) represents n uncoupled, homogeneous, differential equations y _i The solution is yh (t) = e t y(0) where
_ x
(6.16)
= i yi .
e t = diag(e 1 t : : : e nt ): Using the relation x = Qy, and the consequent relation y(0) = Q ;1 x(0), we see that the solution to the homogeneous
system (6.12) is If
A is normal, that is, if it has n orthonormal eigenvectors q1 : : :qn, then Q is replaced by the Hermitian matrix S = s1 sn , Q;1 is replaced by S H , and the solution to (6.12) becomes
xh (t) = Se t S H x(0):
xh (t) = Qe t Q;1x(0):
6.3.2
When A = Q Q;1, we derived that the solution to (6.12) is x h (t) = Qe t Q;1x(0). Comparing with (6.6), it should be the case that ;1
eQ( t)Q = Qe t Q;1: This follows easily from the denition of e Z and the fact that (Q( t)Q;1)j = Q( t)j Q;1. Similarly, if A = S S H , where S is Hermitian, then the solution to (6.12) is x h (t) = Se t S H x(0), and eS ( t)S H = Se t S H :
How can we compute the matrix exponential in the extreme case in which A has n coincident eigenvalues, regardless of the number of its linearly independent eigenvectors? In any case, A admits a Schur decomposition
A = STS H
(theorem 5.3.2). We recall that S is a unitary matrix and T is upper triangular with the eigenvalues of A on its diagonal. Thus we can write T as where is diagonal and N is strictly upper triangular. The solution (6.6) in this case becomes
H xh (t) = eS (Tt)S x(0) = SeTt S H x(0) = Se t+Nt S H x(0):
T = +N
Thus we can compute (6.6) if we can compute eTt = e t+Nt . This turns out to be almost as easy as computing e t when the diagonal matrix is a multiple of the identity matrix:
= I
74
that is, when all the eigenvalues of A coincide. In fact, in this case,
t and Nt commute:
t Nt = It Nt = t Nt = Nt t = Nt It = Nt t :
It can be shown that if two matrices Z1 and Z2 commute, that is if
Z1 Z2 = Z2 Z1
then
We already know how to compute e t, so it remains to show how to compute eNt . The fact that Nt is strictly upper triangular makes the computation of this matrix exponential much simpler than for a general matrix Z . Suppose, for example, that N is 4 4. Then N has three nonzero superdiagonals, N 2 has two nonzero superdiagonals, N 3 has one nonzero superdiagonal, and N 4 is the zero matrix:
20 3 20 60 0 7 ! N2 = 6 0 6 N = 6 40 0 0 7 5 40 0 0 0 0 3 2 20 0 0 0 0 6 7 6 0 0 0 0 0 3 4 N = 6 40 0 0 07 5!N =6 40
0 0 0 0
0 0 0 0 0 0 0 0 0
3 0 7 7 0 0 5!
0 0 0 0 0 0 0 0 0 0
3 7 7 5:
j!
j =0
j!
is simply a nite sum, and the exponential reduces to a matrix polynomial. In summary, the general solution to the homogeneous differential system (6.12) with initial value (6.13) when the n n matrix A has n coincident eigenvalues is given by xh (t) = Se t where is the Schur decomposition of A,
n ;1 N j tj X
S H x0 j ! j =0
(6.17)
A = S ( + N )S H = I
is a multiple of the identity matrix containing the coincident eigenvalues of A on its diagonal, and N is strictly upper triangular.
75
6.3.3
in the general case of a constant matrix A, defective or not, with arbitrary b(t). In fact, let A decomposition of A, and consider the transformed system
_ x
(6.18) (6.19)
= ST SH
where y(t) = S H x(t) and c(t) = S H b(t) : (6.21) The triangular matrix T can always be written in the following form:
2T 11 6 0 T22 6 T =6 4 ...
0
..
T1k T2k
. . . .
where the diagonal blocks Tii for i = 1 : : : k are of size ni ni (possibly 1 1) and contain all-coincident eigenvalues. The remaining nonzero blocks Tij with i < j can be in turn bundled into matrices
0 Tkk Ti k
3 7 7 7 5
2 c (t) 3 1 6 7 . c(t) = 4 . . 5
ck (t)
2 y (t) 3 1 6 7 . y(t) = 4 . . 5
yk (t)
2 y (0) 3 6 1. 7 y(0) = 4 . 5: .
yk (0)
The triangular system (6.20) can then be solved by backsubstitution as follows: for i = k down to 1 if i < k di(t) = Riyi+1 (t) else di(t) = 0 (an nk -dimensional vector of zeros) end Tii = i I + Ni (diagonal and strictly upper-triangular part of T ii ) yi (t) = e i It end.
76
In this procedure, the expression for y_i(t) is a direct application of equations (6.9), (6.10), (6.11), and (6.17) with S = I. In the general case, the applicability of this routine depends on whether the integral in the expression for y_i(t) can be computed analytically. This is certainly the case when b(t) is a constant vector b, because then the integrand is a linear combination of exponentials multiplied by polynomials in t − s, which can be integrated by parts. The solution x(t) for the original system (6.18) is then

x(t) = S y(t) .

As an illustration, we consider a very small example, the 2 × 2 homogeneous, triangular case,

[ ẏ_1 ; ẏ_2 ] = [ t_11  t_12 ; 0  t_22 ] [ y_1 ; y_2 ] .     (6.22)

When t_11 = t_22 = λ, we obtain

y(t) = e^{λt} [ 1  t_12 t ; 0  1 ] y(0) .

In scalar form, this becomes

y_1(t) = (y_1(0) + t_12 y_2(0) t) e^{λt}
y_2(t) = y_2(0) e^{λt} .

When t_11 ≠ t_22, on the other hand, the second equation

ẏ_2(t) = t_22 y_2(t)

has solution y_2(t) = y_2(0) e^{t_22 t}. We then have d_1(t) = t_12 y_2(t) = t_12 y_2(0) e^{t_22 t} and

y_1(t) = y_1(0) e^{t_11 t} + ∫_0^t e^{t_11 (t−s)} d_1(s) ds
       = y_1(0) e^{t_11 t} + t_12 y_2(0) e^{t_11 t} ∫_0^t e^{(t_22 − t_11) s} ds
       = y_1(0) e^{t_11 t} + t_12 y_2(0) (e^{t_22 t} − e^{t_11 t}) / (t_22 − t_11) .

Exercise: verify that this solution satisfies both the differential equation (6.22) and the initial value equation y(0) = y_0. Thus, the solutions to system (6.22) for t_11 = t_22 and for t_11 ≠ t_22 have different forms. While y_2(t) is the same in both cases, we have

y_1(t) = (y_1(0) + t_12 y_2(0) t) e^{t_11 t}                                        if t_11 = t_22
y_1(t) = y_1(0) e^{t_11 t} + t_12 y_2(0) (e^{t_22 t} − e^{t_11 t}) / (t_22 − t_11)   if t_11 ≠ t_22 .
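As a quick numerical check of these two closed forms, here is a small Python sketch; the diagonal and off-diagonal values below are arbitrary test values, not taken from the notes.

import numpy as np
from scipy.linalg import expm

y0 = np.array([1.0, 2.0])
t = 1.3
t12 = 0.5

# Distinct diagonal entries.
t11, t22 = -1.0, -3.0
y_num = expm(np.array([[t11, t12], [0.0, t22]]) * t) @ y0
y1 = y0[0] * np.exp(t11 * t) + t12 * y0[1] * (np.exp(t22 * t) - np.exp(t11 * t)) / (t22 - t11)
y2 = y0[1] * np.exp(t22 * t)
print(np.allclose(y_num, [y1, y2]))       # expected: True

# Coincident diagonal entries.
lam = -1.0
y_num = expm(np.array([[lam, t12], [0.0, lam]]) * t) @ y0
y1 = (y0[0] + t12 * y0[1] * t) * np.exp(lam * t)
y2 = y0[1] * np.exp(lam * t)
print(np.allclose(y_num, [y1, y2]))       # expected: True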
Figure 6.1: A system of masses and springs. In the absence of external forces, the two masses would assume the positions indicated by the dashed lines.

This would seem to present numerical difficulties when t_11 ≈ t_22, because the solution would suddenly switch from one form to the other as the difference between t_11 and t_22 changes from about zero to exactly zero or vice versa. This, however, is not a problem. In fact,

lim_{t_22 → t_11} (e^{t_22 t} − e^{t_11 t}) / (t_22 − t_11) = t e^{t_11 t} ,

so the solution for t_11 ≠ t_22 tends exactly to the solution for t_11 = t_22 as the two diagonal entries approach each other.
We now consider the system of two masses and three springs shown in figure 6.1, where v_1 and v_2 denote the displacements of the two masses from their rest positions. The forces exerted by the springs are

f_1 = c_1 v_1 ,   f_2 = c_2 (v_2 − v_1) ,   f_3 = −c_3 v_2

where the c_i are the positive spring constants (in newtons per meter). The accelerations of masses 1 and 2 (springs are assumed to be massless) are proportional to the forces acting on them, according to Newton's second law:

m_1 v̈_1 = −f_1 + f_2 = −c_1 v_1 + c_2 (v_2 − v_1) = −(c_1 + c_2) v_1 + c_2 v_2
m_2 v̈_2 = −f_2 + f_3 = −c_2 (v_2 − v_1) − c_3 v_2 = c_2 v_1 − (c_2 + c_3) v_2

4 Recall that the order of a differential equation is the highest degree of derivative that appears in it.
or, in matrix form,

v̈ = B v     (6.23)

where

v = [ v_1 ; v_2 ]   and   B = [ −(c_1 + c_2)/m_1    c_2/m_1
                                 c_2/m_2            −(c_2 + c_3)/m_2 ] .

We assume that the initial conditions

v(0)   and   v̇(0)     (6.24)
are given, which specify the positions and velocities of the two masses at time t = 0. To solve the second-order system (6.23), we will first transform it to a system of four first-order equations. As shown in the introduction to this chapter, the trick is to introduce variables to denote the first-order derivatives of v, so that second-order derivatives of v are first-order derivatives of the new variables. For uniformity, we define four new variables
u = [ u_1 ; u_2 ; u_3 ; u_4 ] = [ v_1 ; v_2 ; v̇_1 ; v̇_2 ]     (6.25)

so that

u_3 = v̇_1   and   u_4 = v̇_2 ,

while the original system (6.23) becomes

[ u̇_3 ; u̇_4 ] = B [ u_1 ; u_2 ] .

We can now gather these four first-order differential equations into a single system

u̇ = A u     (6.26)

where

A = [ 0  I
      B  0 ] .

Likewise, the initial conditions (6.24) are replaced by the (known) vector

u(0) = [ v(0) ; v̇(0) ] .
The eigenvalues and eigenvectors of A can be computed from those of B. To this end, consider the eigenvalue equation

A x = λ x     (6.27)

where we recall that

A = [ 0  I
      B  0 ]   and   B = [ −(c_1 + c_2)/m_1    c_2/m_1
                            c_2/m_2            −(c_2 + c_3)/m_2 ] .

Here, the zeros in A are 2 × 2 matrices of zeros, and I is the 2 × 2 identity matrix. If we partition the vector x into its upper and lower halves y and z,

x = [ y ; z ] ,

we can write

A x = [ 0  I ; B  0 ] [ y ; z ] = [ z ; B y ]

so that the eigenvalue equation (6.27) can be written as the following pair of equations:

z = λ y   and   B y = λ z ,     (6.28)
which yields

B y = λ^2 y

with

μ = λ^2 .

In other words, the eigenvalues of A are the square roots of the eigenvalues of B: if we denote the two eigenvalues of B as μ_1 and μ_2, then the eigenvalues of A are

λ_1 = √μ_1 ,   λ_2 = −√μ_1 ,   λ_3 = √μ_2 ,   λ_4 = −√μ_2 .

The eigenvalues μ_1 and μ_2 of B are the solutions of

det(B − μ I) = ( (c_1 + c_2)/m_1 + μ ) ( (c_2 + c_3)/m_2 + μ ) − c_2^2/(m_1 m_2) = μ^2 + 2αμ + β = 0

where

α = ½ ( (c_1 + c_2)/m_1 + (c_2 + c_3)/m_2 )   and   β = (c_1 c_2 + c_1 c_3 + c_2 c_3)/(m_1 m_2)
are positive constants that depend on the elastic properties of the springs and on the masses. We then obtain
μ_{1,2} = −α ± γ     (6.29)

where

γ = √( ¼ ( (c_1 + c_2)/m_1 − (c_2 + c_3)/m_2 )^2 + c_2^2/(m_1 m_2) ) .

The constant γ is real because the radicand is nonnegative. We also have that α ≥ γ, so that the two solutions μ_{1,2} are real and negative, and the four eigenvalues of A,

λ_1 = √(−α + γ) ,   λ_2 = −√(−α + γ) ,   λ_3 = √(−α − γ) ,   λ_4 = −√(−α − γ) ,     (6.30)
come in nonreal, complex-conjugate pairs. This is to be expected, since our system of springs obviously exhibits an oscillatory behavior. Also the eigenvectors of A can be derived from those of B. In fact, from equation (6.28) we see that if y is an eigenvector of B corresponding to eigenvalue μ = λ^2, then there are two corresponding eigenvectors for A of the form

x = [ y ; ±λ y ] .     (6.31)

The eigenvectors of B are the solutions of

(B − μ I) y = 0 .     (6.32)

Since μ = −α ± γ are eigenvalues of B, the determinant of this system is zero, and the two scalar equations in (6.32) must be linearly dependent. The first equation reads

( −(c_1 + c_2)/m_1 − μ ) y_1 + (c_2/m_1) y_2 = 0

and is obviously satisfied by any vector of the form

y = k [ c_2/m_1 ; (c_1 + c_2)/m_1 + μ ]

where k is an arbitrary constant. For k ≠ 0, y denotes the two eigenvectors of B (one for each eigenvalue μ), and from equation (6.31) the four eigenvectors of A are proportional to the four columns of the following matrix:

Q = [ c_2/m_1            c_2/m_1            c_2/m_1            c_2/m_1
      a + λ_1^2          a + λ_2^2          a + λ_3^2          a + λ_4^2
      λ_1 c_2/m_1        λ_2 c_2/m_1        λ_3 c_2/m_1        λ_4 c_2/m_1
      λ_1 (a + λ_1^2)    λ_2 (a + λ_2^2)    λ_3 (a + λ_3^2)    λ_4 (a + λ_4^2) ]     (6.33)

where

a = (c_1 + c_2)/m_1 .
The general solution to the first-order differential system (6.26) is then given by equation (6.17). Since we just found four distinct eigenvectors, however, we can write more simply

u(t) = Q e^{Λt} Q^{−1} u(0)     (6.34)
where
Λ = [ λ_1  0    0    0
      0    λ_2  0    0
      0    0    λ_3  0
      0    0    0    λ_4 ] .
In these expressions, the values of λ_i are given in equations (6.30), and Q is in equation (6.33). Finally, the solution to the original, second-order system (6.23) can be obtained from equation (6.25) by noticing that v is equal to the first two components of u. This completes the solution of our system of differential equations. However, it may be useful to add some algebraic manipulation in order to show that the solution is indeed oscillatory. As we see in the following, the masses' motions can be described by the superposition of two sinusoids whose frequencies depend on the physical constants involved (masses and spring constants). The amplitudes and phases of the sinusoids, on the other hand, depend on the initial conditions. To simplify our manipulation, we rewrite the solution (6.34) as

u(t) = Q e^{Λt} w

where the vector of constants is

w = [ k_1 ; k_2 ; k_3 ; k_4 ] = Q^{−1} u(0) .     (6.35)
We now leave the constants in w unspecified, and derive the general solution v(t) for the original, second-order problem. Numerical values for the constants can be found from the initial conditions u(0) by equation (6.35). We have

v(t) = Q(1:2, :) e^{Λt} w

where Q(1:2, :) denotes the first two rows of Q. Since λ_2 = −λ_1 and λ_4 = −λ_3, these two rows are

Q(1:2, :) = [ q_1  q_1  q_2  q_2 ]

where

q_1 = [ c_2/m_1 ; (c_1 + c_2)/m_1 + λ_1^2 ]   and   q_2 = [ c_2/m_1 ; (c_1 + c_2)/m_1 + λ_3^2 ] ,

so that

v(t) = q_1 ( k_1 e^{λ_1 t} + k_2 e^{−λ_1 t} ) + q_2 ( k_3 e^{λ_3 t} + k_4 e^{−λ_3 t} ) .

Since the λs are imaginary but v(t) is real, the k_i must come in complex-conjugate pairs:

k_1 = k̄_2   and   k_3 = k̄_4 .     (6.36)
In fact, evaluating v(t) and its derivative

v̇(t) = q_1 λ_1 ( k_1 e^{λ_1 t} − k_2 e^{−λ_1 t} ) + q_2 λ_3 ( k_3 e^{λ_3 t} − k_4 e^{−λ_3 t} )

at t = 0, we obtain

v(0) = q_1 (k_1 + k_2) + q_2 (k_3 + k_4)   and   v̇(0) = q_1 λ_1 (k_1 − k_2) + q_2 λ_3 (k_3 − k_4) .

Since the vectors q_i are independent (assuming that the spring constant c_2 is nonzero), and since v(0) and v̇(0) are real while λ_1 and λ_3 are purely imaginary, this means that

k_1 + k_2 and k_3 + k_4 are real, while k_1 − k_2 and k_3 − k_4 are purely imaginary,

from which equations (6.36) follow. Finally, by using the relation
e^{jx} = cos x + j sin x

and the fact that λ_1 = j ω_1 and λ_3 = j ω_2, where

ω_1 = √(α − γ) = √( ½ (a + b) − √( ¼ (a − b)^2 + c_2^2/(m_1 m_2) ) )
ω_2 = √(α + γ) = √( ½ (a + b) + √( ¼ (a − b)^2 + c_2^2/(m_1 m_2) ) )

and

a = (c_1 + c_2)/m_1   and   b = (c_2 + c_3)/m_2 ,

we can write the solution as a superposition of two sinusoids:

v(t) = A_1 cos(ω_1 t + φ_1) q_1 + A_2 cos(ω_2 t + φ_2) q_2 .
Notice that these two frequencies depend only on the configuration of the system, and not on the initial conditions. The amplitudes A_i and phases φ_i, on the other hand, depend on the constants k_i as follows:
A_1 = 2 |k_1| ,   φ_1 = arctan2(Im(k_1), Re(k_1))
A_2 = 2 |k_3| ,   φ_2 = arctan2(Im(k_3), Re(k_3))

where Re, Im denote the real and imaginary part and where the two-argument function arctan2 is defined as follows for (x, y) ≠ (0, 0):

arctan2(y, x) = arctan(y/x)         if x > 0
                π + arctan(y/x)     if x < 0
                π/2                 if x = 0 and y > 0
                −π/2                if x = 0 and y < 0
and is undefined for (x, y) = (0, 0). This function returns the arctangent of y/x (notice the order of the arguments) in the proper quadrant, and extends the function by continuity along the y axis. The two constants k_1 and k_3 can be found from the given initial conditions v(0) and v̇(0) through equations (6.35) and (6.25).
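The following sketch (Python with NumPy; the masses and spring constants are made-up values, not taken from the notes) checks that the eigenvalues of A are indeed ±jω_1 and ±jω_2 with the frequencies given above.

import numpy as np

# Hypothetical physical constants: masses in kg, spring constants in N/m.
m1, m2 = 1.0, 2.0
c1, c2, c3 = 3.0, 1.0, 2.0

B = np.array([[-(c1 + c2) / m1,   c2 / m1],
              [  c2 / m2,        -(c2 + c3) / m2]])
A = np.block([[np.zeros((2, 2)), np.eye(2)],
              [B,                np.zeros((2, 2))]])

# Frequencies from the closed-form expressions above.
a, b = (c1 + c2) / m1, (c2 + c3) / m2
gamma = np.sqrt(0.25 * (a - b) ** 2 + c2 ** 2 / (m1 * m2))
w1 = np.sqrt(0.5 * (a + b) - gamma)
w2 = np.sqrt(0.5 * (a + b) + gamma)

# The eigenvalues of A are purely imaginary, +/- j w1 and +/- j w2.
lam = np.linalg.eigvals(A)
print(np.allclose(np.sort(np.abs(lam.imag)), np.sort([w1, w1, w2, w2])))  # expected: True
print(np.allclose(lam.real, 0.0, atol=1e-10))                             # expected: True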
Chapter 7
Figure 7.1: A dynamic system S with input u and output y.
In all these examples, what is input and what is output is a choice that depends on the application. Also, all the quantities in the examples vary continuously with time. In other cases, as for instance for switching networks and computers, it is more natural to consider time as a discrete variable. If time varies continuously, the system is said to be continuous; if time varies discretely, the system is said to be discrete.
7.1.1 State
Given a dynamic system, continuous or discrete, the modeling problem is to somehow correlate inputs (causes) with outputs (effects). The examples above suggest that the output at time t cannot be determined in general by the value assumed by the input quantity at the same point in time. Rather, the output is the result of the entire history of the system. An effort of abstraction is therefore required, which leads to postulating a new quantity, called the state, which summarizes information about the past and the present of the system. Specifically, the value x(t) taken by the state at time t must be sufficient to determine the output at the same point in time. Also, knowledge of both x(t_1) and u_{[t_1, t_2)}, that is, of the state at time t_1 and the input over the interval t_1 ≤ t < t_2, must allow computing the state (and hence the output) at time t_2. For the mass attached to a spring, for instance, the state could be the position and velocity of the mass. In fact, the laws of classical mechanics allow computing the new position and velocity of the mass at time t_2 given its position and velocity at time t_1 and the forces applied over the interval [t_1, t_2). Furthermore, in this example, the output y of the system happens to coincide with one of the two state variables, and is therefore always deducible from the latter. Thus, in a dynamic system the input affects the state, and the output is a function of the state. For a discrete system, the way that the input changes the state at time instant number k into the new state at time instant k + 1 can be represented by a simple equation:

x_{k+1} = f(x_k, u_k, k)

where f is some function that represents the change, and u_k is the input at time k. Similarly, the relation between state and output can be expressed by another function:

y_k = h(x_k, k) .
A discrete dynamic system is completely described by these two equations and an initial state x_0. In general, all quantities are vectors. For continuous systems, time does not come in quanta, so one cannot compute x_{k+1} as a function of x_k, u_k, and k, but rather compute x(t_2) as a functional ϕ of x(t_1) and the entire input u over the interval [t_1, t_2):

x(t_2) = ϕ( x(t_1), u(·), t_1, t_2 )

where u(·) represents the entire function u, not just one of its values. A description of the system in terms of functions, rather than functionals, can be given in the case of a regular system, for which the functional ϕ is continuous, differentiable, and with continuous first derivative. In that case, one can show that there exists a function f such that the state x(t) of the system satisfies the differential equation

ẋ(t) = f( x(t), u(t), t )

where the dot denotes differentiation with respect to time. The relation from state to output, on the other hand, is essentially the same as for the discrete case:

y(t) = h( x(t), t ) .

Specifying the initial state x_0 completes the definition of a continuous dynamic system.
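As a concrete (if toy) illustration, here is a minimal Python sketch of a discrete dynamic system; the functions f and h below are made up for the example and do not correspond to any system discussed in the text.

import numpy as np

def f(x, u, k):
    # Made-up evolution function: a slowly decaying state driven by the input.
    return 0.9 * x + u

def h(x, k):
    # Made-up output function: we observe only the first state component.
    return x[0]

x = np.array([1.0, -0.5])              # initial state x_0
for k in range(5):
    u = np.array([0.1, 0.0])           # input at time k
    print(k, h(x, k))                  # the output y_k is a function of the state
    x = f(x, u, k)                     # the input changes the state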
7.1.2 Uncertainty
The systems defined in the previous section are called deterministic, since the evolution is exactly determined once the initial state x_0 at time 0 is known. Determinism implies that both the evolution function f and the output function h are known exactly. This is, however, an unrealistic state of affairs. In practice, the laws that govern a given physical
system are known up to some uncertainty. In fact, the equations themselves are simple abstractions of a complex reality. The coefficients that appear in the equations are known only approximately, and can change over time as a result of temperature changes, component wear, and so forth. A more realistic model then allows for some inherent, unresolvable uncertainty in both f and h. This uncertainty can be represented as noise that perturbs the equations we have presented so far. A discrete system then takes on the following form:

x_{k+1} = f(x_k, u_k, k) + η_k
y_k = h(x_k, k) + ξ_k

and for a continuous system

ẋ(t) = f( x(t), u(t), t ) + η(t)
y(t) = h( x(t), t ) + ξ(t) .
Without loss of generality, the noise distributions can be assumed to have zero mean, for otherwise the mean can be incorporated into the deterministic part, that is, in either f or h. The mean may not be known, but this is a different story: in general the parameters that enter into the definitions of f and h must be estimated by some method, and the mean perturbations are no different. A common assumption, which is sometimes valid and always simplifies the mathematics, is that η and ξ are zero-mean Gaussian random variables with known covariance matrices Q and R, respectively.
7.1.3 Linearity
The mathematics becomes particularly simple when both the evolution function f and the output function h are linear. Then, the system equations become

x_{k+1} = F_k x_k + G_k u_k + η_k
y_k = H_k x_k + ξ_k

for the discrete case, and

ẋ(t) = F(t) x(t) + G(t) u(t) + η(t)
y(t) = H(t) x(t) + ξ(t)

for the continuous one.
7.2 An Example: The Mortar Shell
accurately. You do not know the initial velocity of the projectiles, so you just guess some values: 0.6 kilometers/second for the horizontal component, 0.1 kilometers/second for the vertical component. Thus, your estimate of the initial state of the projectile is
a vector collecting these guessed velocities together with a guess of the projectile's current position; here d is the horizontal coordinate, z is the vertical, you are at (0, 0), and dots denote derivatives with respect to time.
From your high-school physics, you remember that the laws of motion for a ballistic trajectory are the following:
d(t) = d(0) + ḋ(0) t     (7.1)
z(t) = z(0) + ż(0) t − ½ g t^2     (7.2)

where g is the gravitational acceleration, equal to 9.8 × 10^{−3} kilometers per second squared. Since you do not trust your physics much, and you have little time to get ready, you decide to ignore air drag. Because of this, you introduce a state update covariance matrix Q = 0.1 I_4, where I_4 is the 4 × 4 identity matrix. All you have to track the shells is a camera pointed at the mortar that will rotate so as to keep the projectile at the center of the image, where you see a blob that increases in size as the projectile gets closer. Thus, the aiming angle of the camera gives you elevation information about the projectile's position, and the size of the blob tells you something about the distance, given that you know the actual size of the projectiles used and all the camera parameters. The projectile's elevation is

e = 1000 z / d     (7.3)

when the projectile is at (d, z). Similarly, the size of the blob in pixels is

s = 1000 / √(d^2 + z^2) .     (7.4)

You do not have very precise estimates of the noise that corrupts e and s, so you guess measurement covariances R_e = R_s = 1000, which you put along the diagonal of a 2 × 2 diagonal measurement covariance matrix R.
7.2.1 The Dynamic System Equation
Equations (7.1) and (7.2) are continuous. Since you are taking measurements at discrete, equally spaced time instants, you want to discretize these equations. For the z component, equation (7.2) yields

z(t + dt) − z(t) = z(0) + ż(0)(t + dt) − ½ g (t + dt)^2 − ( z(0) + ż(0) t − ½ g t^2 )
                 = (ż(0) − g t) dt − ½ g (dt)^2
                 = ż(t) dt − ½ g (dt)^2

since ż(0) − g t = ż(t). Consequently, if t + dt is time instant k + 1 and t is time instant k, you have

z_{k+1} = z_k + ż_k dt − ½ g (dt)^2 .     (7.5)
The reasoning for the horizontal component d is the same, except that there is no acceleration:
d_{k+1} = d_k + ḋ_k dt .     (7.6)
Equations (7.5) and (7.6) can be rewritten as a single system update equation

x_{k+1} = F x_k + G u

where
x_k = [ ḋ_k ; d_k ; ż_k ; z_k ]

is the state, the 4 × 4 matrix F depends on dt, the control scalar u is equal to −g, and the 4 × 1 control matrix G depends on dt. The two matrices F and G are as follows:

F = [ 1   0  0   0
      dt  1  0   0
      0   0  1   0
      0   0  dt  1 ]        G = [ 0 ; 0 ; dt ; dt^2/2 ] .
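To make the update concrete, here is a small Python sketch that simulates this discretized ballistic model; the sampling interval dt and the launch state are invented values, since the notes do not specify them.

import numpy as np

dt = 0.1                                # hypothetical sampling interval, in seconds
g = 9.8e-3                              # km / s^2, as given in the text
F = np.array([[1, 0, 0, 0],
              [dt, 1, 0, 0],
              [0, 0, 1, 0],
              [0, 0, dt, 1]], dtype=float)
G = np.array([0, 0, dt, dt**2 / 2])
u = -g

# State ordering [d_dot, d, z_dot, z]; hypothetical launch values (km/s and km).
x = np.array([-0.2, 10.0, 0.2, 0.0])
steps = 0
while x[3] >= 0.0 and steps < 10000:    # propagate until the shell returns to the ground
    x = F @ x + G * u
    steps += 1
print(steps, "time steps, impact at d =", round(x[1], 2), "km")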
7.2.2 The Measurement Equation

The two nonlinear equations (7.3) and (7.4) express the available measurements as a function of the true values of the projectile coordinates d and z. We want to replace these equations with linear approximations. To this end, we develop both equations as Taylor series around the current estimate and truncate them after the linear term. From the elevation equation (7.3), we have

e_k = 1000 z_k / d_k ≈ 1000 [ ẑ_k/d̂_k + (z_k − ẑ_k)/d̂_k − ẑ_k (d_k − d̂_k)/d̂_k^2 ]

so that after simplifying we can redefine the measurement to be the discrepancy from the estimated value:

e'_k = e_k − 1000 ẑ_k/d̂_k ≈ −(1000 ẑ_k/d̂_k^2) d_k + (1000/d̂_k) z_k .     (7.7)

Similarly, from the size equation (7.4) we have

s_k = 1000 / √(d_k^2 + z_k^2) ≈ 1000 [ 1/√(d̂_k^2 + ẑ_k^2) − d̂_k (d_k − d̂_k)/(d̂_k^2 + ẑ_k^2)^{3/2} − ẑ_k (z_k − ẑ_k)/(d̂_k^2 + ẑ_k^2)^{3/2} ]

and after simplifying:

s'_k = s_k − 2000/√(d̂_k^2 + ẑ_k^2) ≈ −(1000 d̂_k/(d̂_k^2 + ẑ_k^2)^{3/2}) d_k − (1000 ẑ_k/(d̂_k^2 + ẑ_k^2)^{3/2}) z_k .     (7.8)

The two measurements e'_k and s'_k just defined can be collected into a single measurement vector

y_k = [ e'_k ; s'_k ]

and the two approximate measurement equations (7.7) and (7.8) can be written in the matrix form

y_k = H_k x_k     (7.9)

where the measurement matrix H_k depends on the current state estimate x̂_k:

H_k = [ 0   −1000 ẑ_k/d̂_k^2                   0   1000/d̂_k
        0   −1000 d̂_k/(d̂_k^2 + ẑ_k^2)^{3/2}   0   −1000 ẑ_k/(d̂_k^2 + ẑ_k^2)^{3/2} ] .
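A small Python sketch of this linearization follows; it uses the measurement functions and Jacobian exactly as written above (which are themselves reconstructions), and checks the Jacobian against finite differences at a hypothetical estimate (d̂, ẑ) = (8, 0.5) km.

import numpy as np

def measurement(d, z):
    # Nonlinear measurements (7.3) and (7.4): elevation and blob size.
    return np.array([1000.0 * z / d, 1000.0 / np.sqrt(d**2 + z**2)])

def measurement_jacobian(d_hat, z_hat):
    # The matrix H_k of equation (7.9): derivatives of (7.3) and (7.4) with respect
    # to the state [d_dot, d, z_dot, z], evaluated at the current estimate.
    r3 = (d_hat**2 + z_hat**2) ** 1.5
    return np.array([[0.0, -1000.0 * z_hat / d_hat**2, 0.0, 1000.0 / d_hat],
                     [0.0, -1000.0 * d_hat / r3,       0.0, -1000.0 * z_hat / r3]])

d0, z0 = 8.0, 0.5
H = measurement_jacobian(d0, z0)
eps = 1e-6
num_d = (measurement(d0 + eps, z0) - measurement(d0, z0)) / eps
num_z = (measurement(d0, z0 + eps) - measurement(d0, z0)) / eps
print(np.allclose(H[:, 1], num_d, rtol=1e-4), np.allclose(H[:, 3], num_z, rtol=1e-4))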
As the shell approaches us, we frantically start studying state estimation, and in particular Kalman filtering, in the hope of building a system that lets us shoot down the shell before it hits us. The next few sections will be read under this impending threat. Knowing the model for the mortar shell amounts to knowing the laws by which the object moves and those that relate the position of the projectile to our observations. So what else is there left to do? From the observations, we would like to know where the mortar shell is right now, and perhaps predict where it will be in a few seconds, so we can direct an antiaircraft gun to shoot down the target. In other words, we want to know x_k, the state of the dynamic system. Clearly, knowing x_0 instead is equivalent, at least when the dynamics of the system are known exactly (the system noise η_k is zero). In fact, from x_0 we can simulate the system up until time t, thereby determining x_k as well. Most importantly, we do not want to have all the observations before we shoot: we would be dead by then. A scheme that refines an initial estimate of the state as new observations are acquired is called a recursive1 state estimation system. The Kalman filter is one of the most versatile schemes for recursive state estimation. The original paper by Kalman (R. E. Kalman, A new approach to linear filtering and prediction problems, Transactions of the ASME, Journal of Basic Engineering, 82:35-45, 1960) is still one of the most readable treatments of this subject from the point of view of stochastic estimation. Even without noise, a single observation y_k may not be sufficient to determine the state x_k (in the example, one observation happens to be sufficient). This is a very interesting aspect of state estimation. It is really the ensemble of all observations that lets one estimate the state, and yet observations are processed one at a time, as they become available. A classical example of this situation in computer vision is the reconstruction of three-dimensional shape from a sequence of images. A single image is two-dimensional, so by itself it conveys no three-dimensional information. Kalman filters exist that recover shape information from a sequence of images. See for instance L. Matthies, T. Kanade, and R. Szeliski, Kalman filter-based algorithms for estimating depth from image sequences, International Journal of Computer Vision, 3(3):209-236, September 1989; and T. J. Broida, S. Chandrashekhar, and R. Chellappa, Recursive 3-D motion estimation from a monocular image sequence, IEEE Transactions on Aerospace and Electronic Systems, 26(4):639-656, July 1990. Here, we introduce the Kalman filter from the simpler point of view of least squares estimation, since we have developed all the necessary tools in the first part of this course. The next section defines the state estimation problem for a discrete dynamic system in more detail. Then, section 7.4 defines the essential notions of estimation theory that are necessary to understand the quantitative aspects of Kalman filtering. Section 7.5 develops the equation of the Kalman filter, and section 7.6 reconsiders the example of the mortar shell. Finally, section 7.7 establishes a connection between the Kalman filter and the solution of a linear system.
In its simplest formulation, given a dynamic system with system and measurement equations

x_{k+1} = F_k x_k + G_k u_k + η_k     (7.10)
y_k = H_k x_k + ξ_k     (7.11)

where the system noise η_k and the measurement noise ξ_k are Gaussian variables,

η_k ∼ N(0, Q_k)   and   ξ_k ∼ N(0, R_k) ,
as well as a (possibly completely wrong) estimate x̂_0 of the initial state and an initial covariance matrix P_0 of the estimate x̂_0, the Kalman filter computes the optimal estimate x̂_{k|k} at time k given the measurements y_0, …, y_k. The filter also computes an estimate P_{k|k} of the covariance of x̂_{k|k} given those measurements. In these expressions, the hat means that the quantity is an estimate. Also, the first k in the subscript refers to which variable is being estimated, the second to which measurements are being used for the estimate. Thus, in general, x̂_{i|j} is the estimate of the value that x assumes at time i given the first j + 1 measurements y_0, …, y_j.
1 The term recursive in the systems theory literature corresponds loosely to incremental or iterative in computer science.
Figure 7.2: The update stage of the Kalman filter changes the estimate of the current system state x_k to make the prediction of the measurement closer to the actual measurement y_k. Propagation then accounts for the evolution of the system state, as well as the consequent growing uncertainty.
7.3.1 Update
The covariance matrix P_{k|k} must be computed in order to keep the Kalman filter running, in the following sense. At time k, just before the new measurement y_k comes in, we have an estimate x̂_{k|k−1} of the state vector x_k based on the previous measurements y_0, …, y_{k−1}. Now we face the problem of incorporating the new measurement y_k into our estimate, that is, of transforming x̂_{k|k−1} into x̂_{k|k}. If x̂_{k|k−1} were exact, we could compute the new measurement y_k without even looking at it, through the measurement equation (7.11). Even if x̂_{k|k−1} is not exact, the estimate

ŷ_{k|k−1} = H_k x̂_{k|k−1}

is still our best bet. Now y_k becomes available, and we can consider the residue

r_k = y_k − ŷ_{k|k−1} = y_k − H_k x̂_{k|k−1} .

If this residue is nonzero, we probably need to correct our estimate of the state x_k, so that the new prediction

ŷ_{k|k} = H_k x̂_{k|k}

is closer to the actual measurement y_k than the old prediction

ŷ_{k|k−1} = H_k x̂_{k|k−1}

we made just before the new measurement y_k was available. The question however is, by how much should we correct our estimate of the state? We do not want to make ŷ_{k|k} coincide with y_k. That would mean that we trust the new measurement completely, but that we do not trust our state estimate x̂_{k|k−1} at all, even if the latter was obtained through a large number of previous measurements. Thus, we need some criterion for comparing the quality of the new measurement y_k with that of our old estimate x̂_{k|k−1} of the state. The uncertainty about the former is R_k, the covariance of the observation error. The uncertainty about the state just before the new measurement y_k becomes available is P_{k|k−1}. The update stage of the Kalman filter uses R_k and P_{k|k−1} to weigh past evidence (x̂_{k|k−1}) and new observations (y_k). This stage is represented graphically in the middle of figure 7.2. At the same time, also the uncertainty measure P_{k|k−1} must be updated, so that it becomes available for the next step. Because a new measurement has been read, this uncertainty usually becomes smaller: P_{k|k} < P_{k|k−1}. The idea is that as time goes by the uncertainty on the state decreases, while that about the measurements may remain the same. Then, measurements count less and less as the estimate approaches its true value.
7.3.2 Propagation
Just after arrival of the measurement y_k, both state estimate and state covariance matrix have been updated as described above. But between time k and time k + 1 both state and covariance may change. The state changes according to the system equation (7.10), so our estimate x̂_{k+1|k} of x_{k+1} given y_0, …, y_k should reflect this change as well. Similarly, because of the system noise η_k, our uncertainty about this estimate may be somewhat greater than one time epoch ago. The system equation (7.10) essentially dead-reckons the new state from the old, and inaccuracies in our model of how this happens lead to greater uncertainty. This increase in uncertainty depends on the system noise covariance Q_k. Thus, both state estimate and covariance must be propagated to the new time k + 1 to yield the new state estimate x̂_{k+1|k} and the new covariance P_{k+1|k}. Both these changes are shown on the right in figure 7.2. In summary, just as the state vector x_k represents all the information necessary to describe the evolution of a deterministic system, the covariance matrix P_{k|k} contains all the necessary information about the probabilistic part of the system, that is, about how both the system noise η_k and the measurement noise ξ_k corrupt the quality of the state estimate x̂_{k|k}. Hopefully, this intuitive introduction to Kalman filtering gives you an idea of what the filter does, and what information it needs to keep working. To turn these concepts into a quantitative algorithm we need some preliminaries on optimal estimation, which are discussed in the next section. The Kalman filter itself is derived in section 7.5.
7.4.1 Linear Estimation
Given a quantity y (the observation) that is a known function of another (deterministic but unknown) quantity x (the state) plus some amount of noise,

y = h(x) + n ,     (7.12)

the estimation problem amounts to finding a function

x̂ = L(y)

such that x̂ is as close as possible to x. The function L is called an estimator, and its value x̂ given the observations y is called an estimate. Inverting a function is an example of estimation. If the function h is invertible and the noise term n is zero, then L is the inverse of h, no matter how the phrase "as close as possible" is interpreted. In fact, in that case x̂ is equal to x, and any distance between x̂ and x must be zero. In particular, solving a square, nonsingular system

y = H x     (7.13)
is, in this somewhat trivial sense, a problem of estimation. The optimal estimator is then represented by the matrix

L = H^{−1}

and the optimal estimate is

x̂ = L y .

A less trivial example occurs, for a linear observation function, when the matrix H has more rows than columns, so that the system (7.13) is overconstrained. In this case, there is usually no inverse to H, and again one must say in what sense x̂ is required to be as close as possible to x. For linear systems, we have so far considered the criterion that prefers a particular x̂ if it makes the Euclidean norm of the vector y − Hx̂ as small as possible. This is the (unweighted)
least squares criterion. In section 7.4.2, we will see that in a very precise sense ordinary least squares solves a particular type of estimation problem, namely, the estimation problem for the observation equation (7.12) with h a linear function and n Gaussian zero-mean noise with the identity matrix for covariance. An estimator is said to be linear if the function L is linear. Notice that the observation function h can still be nonlinear. If L is required to be linear but h is not, we will probably have an estimator that produces a worse estimate than a nonlinear one. However, it still makes sense to look for the best possible linear estimator. The best estimator for a linear observation function happens to be a linear estimator.
7.4.2 Best

In order to define what is meant by a best estimator, one needs to define a measure of goodness of an estimate. In the least squares approach to solving a linear system like (7.13), this distance is defined as the Euclidean norm of the residue vector

y − H x̂

between the left and the right-hand sides of equation (7.13), evaluated at the solution x̂. Replacing (7.13) by a noisy equation,

y = H x + n ,     (7.14)

does not change the nature of the problem. Even equation (7.13) has no exact solution when there are more independent equations than unknowns, so requiring equality is hopeless. What the least squares approach is really saying is that even at the solution x̂ there is some residue

n = y − H x̂     (7.15)

and we would like to make that residue as small as possible in the sense of the Euclidean norm. Thus, an overconstrained system of the form (7.13) and its noisy version (7.14) are really the same problem. In fact, (7.14) is the correct version, if the equality sign is to be taken literally. The noise term, however, can be used to generalize the problem. In fact, the Euclidean norm of the residue (7.15) treats all components (all equations in (7.14)) equally. In other words, each equation counts the same when computing the norm of the residue. However, different equations can have noise terms of different variance. This amounts to saying that we have reasons to prefer the quality of some equations over others or, alternatively, that we want to enforce different equations to different degrees. From the point of view of least squares, this can be enforced by some scaling of the entries of n or, even, by some linear transformation of them:

n → W n

so instead of minimizing ‖n‖^2, we minimize

‖W n‖^2 = n^T R^{−1} n

where

R^{−1} = W^T W

is a symmetric, nonnegative-definite matrix. This minimization problem, called weighted least squares, is only slightly different from its unweighted version. In fact, we have

W n = W (y − H x) = W y − W H x ,

so we are simply solving the system

W y = W H x

in the traditional, unweighted sense. We know the solution from the normal equations:

x̂ = ( (W H)^T W H )^{−1} (W H)^T W y = ( H^T R^{−1} H )^{−1} H^T R^{−1} y .
Interestingly, this same solution is obtained from a completely different criterion of goodness of a solution x̂. This criterion is a probabilistic one. We consider this different approach because it will let us show that the Kalman filter is optimal in a very useful sense. The new criterion is the so-called minimum-covariance criterion. The estimate x̂ of x is some function of the measurements y, which in turn are corrupted by noise. Thus, x̂ is a function of a random vector (noise), and is therefore a random vector itself. Intuitively, if we estimate the same quantity many times, from measurements corrupted by different noise samples from the same distribution, we obtain different estimates. In this sense, the estimates are random. It therefore makes sense to measure the quality of an estimator by requiring that its variance be as small as possible: the fluctuations of the estimate x̂ with respect to the true (unknown) value x from one estimation experiment to the next should be as small as possible. Formally, we want to choose a linear estimator L such that the estimates x̂ = L y it produces minimize the following covariance matrix:

P = E[ (x − x̂)(x − x̂)^T ] .

Minimizing a matrix, however, requires a notion of size for matrices: how large is P? Fortunately, most interesting matrix norms are equivalent, in the sense that given two different definitions ‖P‖_1 and ‖P‖_2 of matrix norm there exist two positive scalars α, β such that

α ‖P‖_1 < ‖P‖_2 < β ‖P‖_1 .

Thus, we can pick any norm we like. In fact, in the derivations that follow, we only use properties shared by all norms, so which norm we actually use is irrelevant. Some matrix norms were mentioned in section 3.2.
7.4.3 Unbiased
In addition to requiring our estimator to be linear and with minimum covariance, we also want it to be unbiased, in the sense that if we repeat the same estimation experiment many times we neither consistently overestimate nor consistently underestimate x. Mathematically, this translates into the following requirement:

E[ x − x̂ ] = 0   and   E[ x̂ ] = E[ x ] .

7.4.4 The BLUE

We now address the problem of finding the Best Linear Unbiased Estimator (BLUE)

x̂ = L y

of x given that y depends on x according to the model (7.13), which is repeated here for convenience:

y = H x + n .     (7.16)

First, we give a necessary and sufficient condition for L to be unbiased.

Lemma 7.4.1 Let n in equation (7.16) be zero mean. Then the linear estimator L is unbiased if and only if

L H = I ,

the identity matrix.

Proof. Since x is deterministic and n has zero mean, E[ x − x̂ ] = E[ x − L(H x + n) ] = (I − L H) x, which vanishes for every state x if and only if L H = I.
And now the main result.

Theorem 7.4.2 The Best Linear Unbiased Estimator (BLUE)

x̂ = L y

for the measurement model

y = H x + n ,

where the noise vector n has zero mean and covariance R, has estimation error covariance

P = E[ (x − x̂)(x − x̂)^T ] = ( H^T R^{−1} H )^{−1}     (7.17)

and is given by the matrix

L_0 = ( H^T R^{−1} H )^{−1} H^T R^{−1} .     (7.18)

Proof. We can write

P = E[ (x − x̂)(x − x̂)^T ]
  = E[ (x − L y)(x − L y)^T ]
  = E[ (x − L H x − L n)(x − L H x − L n)^T ]
  = E[ ((I − L H) x − L n)((I − L H) x − L n)^T ]
  = E[ L n n^T L^T ]
  = L E[ n n^T ] L^T
  = L R L^T

because L is unbiased, so that L H = I. To show that L_0 as defined in (7.18) is the best choice, let L be any (other) linear unbiased estimator. We can trivially write

L = L_0 + (L − L_0)

and

P = L R L^T = L_0 R L_0^T + (L − L_0) R L_0^T + L_0 R (L − L_0)^T + (L − L_0) R (L − L_0)^T .

The second term is zero. In fact,

R L_0^T = R R^{−1} H ( H^T R^{−1} H )^{−1} = H ( H^T R^{−1} H )^{−1}

so that

(L − L_0) R L_0^T = (L − L_0) H ( H^T R^{−1} H )^{−1} = (L H − L_0 H)( H^T R^{−1} H )^{−1} .

But L and L_0 are unbiased, so L H = L_0 H = I, and

(L − L_0) R L_0^T = 0 .

The term L_0 R (L − L_0)^T is the transpose of this, so it is zero as well. In conclusion,

P = L_0 R L_0^T + (L − L_0) R (L − L_0)^T ,

the sum of two positive definite or at least semidefinite matrices. For such matrices, the norm of the sum is greater or equal to either norm, so this expression is minimized when the second term vanishes, that is, when L = L_0. This proves that the estimator given by (7.18) is the best, that is, that it has minimum covariance. To prove that the covariance P of x̂ is given by equation (7.17), we simply substitute L_0 for L in P = L R L^T:

P = L_0 R L_0^T = ( H^T R^{−1} H )^{−1} H^T R^{−1} R R^{−1} H ( H^T R^{−1} H )^{−1} = ( H^T R^{−1} H )^{−1}

as promised.
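A short numerical check of the theorem follows (Python with NumPy; the model matrix H, the covariance R, and the dimensions are arbitrary test values, not taken from the text).

import numpy as np

rng = np.random.default_rng(0)
H = rng.normal(size=(6, 2))                  # hypothetical overconstrained model
R = np.diag([0.1, 0.1, 1.0, 1.0, 5.0, 5.0])  # known noise covariance
Rinv = np.linalg.inv(R)

L_blue = np.linalg.inv(H.T @ Rinv @ H) @ H.T @ Rinv   # equation (7.18)
L_ols = np.linalg.pinv(H)                             # ordinary (unweighted) least squares

# Both estimators are unbiased (L H = I); their error covariances are L R L^T.
print(np.allclose(L_blue @ H, np.eye(2)), np.allclose(L_ols @ H, np.eye(2)))
P_blue = L_blue @ R @ L_blue.T
P_ols = L_ols @ R @ L_ols.T
print(np.allclose(P_blue, np.linalg.inv(H.T @ Rinv @ H)))   # equation (7.17)
print(np.trace(P_blue) <= np.trace(P_ols) + 1e-12)          # the BLUE is at least as good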
In summary, given a linear measurement model

y = H x + n   with   n ∼ N(0, R) ,

the best linear unbiased estimate of x is

x̂ = P H^T R^{−1} y

where

P = ( H^T R^{−1} H )^{−1}

is the covariance of the estimation error.

7.5 The Kalman Filter: Derivation

Given a dynamic system with system and measurement equations
x_{k+1} = F_k x_k + G_k u_k + η_k     (7.19)
y_k = H_k x_k + ξ_k

where the system noise η_k and the measurement noise ξ_k are Gaussian random vectors,

η_k ∼ N(0, Q_k)   and   ξ_k ∼ N(0, R_k) ,

as well as the best, linear, unbiased estimate x̂_0 of the initial state with an error covariance matrix P_0, the Kalman filter computes the best, linear, unbiased estimate x̂_{k|k} at time k given the measurements y_0, …, y_k. The filter also computes the covariance P_{k|k} of the error x̂_{k|k} − x_k given those measurements. Computation occurs according to the phases of update and propagation illustrated in figure 7.2. We now apply the results from optimal estimation to the problem of updating and propagating the state estimates and their error covariances.
7.5.1 Update
At time k, two pieces of data are available. One is the estimate x̂_{k|k−1} of the state x_k given measurements up to but not including y_k. This estimate comes with its covariance matrix P_{k|k−1}. Another way of saying this is that the estimate x̂_{k|k−1} differs from the true state x_k by an error term e_k whose covariance is P_{k|k−1}:

x̂_{k|k−1} = x_k + e_k     (7.20)

with

E[ e_k e_k^T ] = P_{k|k−1} .

The other piece of data is the new measurement y_k itself, which is related to the state x_k by the equation

y_k = H_k x_k + ξ_k     (7.21)

with noise covariance

E[ ξ_k ξ_k^T ] = R_k .

We can summarize this available information by grouping equations (7.20) and (7.21) into one, and packaging the error covariances into a single, block-diagonal matrix. Thus, we have

y = H x_k + n
where

y = [ x̂_{k|k−1} ; y_k ] ,   H = [ I ; H_k ] ,   n = [ e_k ; ξ_k ]

and where n has covariance

R = [ P_{k|k−1}   0
      0           R_k ] .

From the results of section 7.4.4, the covariance of the best linear unbiased estimate of x_k given these combined data satisfies

P_{k|k}^{−1} = H^T R^{−1} H = [ I   H_k^T ] [ P_{k|k−1}^{−1}   0 ; 0   R_k^{−1} ] [ I ; H_k ] = P_{k|k−1}^{−1} + H_k^T R_k^{−1} H_k

and the estimate itself is

x̂_{k|k} = P_{k|k} H^T R^{−1} y
        = P_{k|k} [ P_{k|k−1}^{−1}   H_k^T R_k^{−1} ] [ x̂_{k|k−1} ; y_k ]
        = P_{k|k} ( P_{k|k−1}^{−1} x̂_{k|k−1} + H_k^T R_k^{−1} y_k )
        = P_{k|k} ( (P_{k|k}^{−1} − H_k^T R_k^{−1} H_k) x̂_{k|k−1} + H_k^T R_k^{−1} y_k )
        = x̂_{k|k−1} + P_{k|k} H_k^T R_k^{−1} ( y_k − H_k x̂_{k|k−1} ) .
In the last line, the difference

r_k = y_k − ŷ_{k|k−1} = y_k − H_k x̂_{k|k−1}

is the residue between the actual measurement y_k and its best estimate based on x̂_{k|k−1}, and the matrix

K_k = P_{k|k} H_k^T R_k^{−1}

is usually referred to as the Kalman gain matrix, because it specifies the amount by which the residue must be multiplied (or amplified) to obtain the correction term that transforms the old estimate x̂_{k|k−1} of the state x_k into its new estimate x̂_{k|k}.
7.5.2 Propagation
Propagation is even simpler. Since the new state is related to the old through the system equation (7.19), and the noise term η_k is zero mean, unbiasedness requires

x̂_{k+1|k} = F_k x̂_{k|k} + G_k u_k ,
which is the state estimate propagation equation of the Kalman filter. The error covariance matrix is easily propagated thanks to the linearity of the expectation operator:
P_{k+1|k} = E[ (x̂_{k+1|k} − x_{k+1})(x̂_{k+1|k} − x_{k+1})^T ]
          = E[ (F_k (x̂_{k|k} − x_k) − η_k)(F_k (x̂_{k|k} − x_k) − η_k)^T ]
          = F_k E[ (x̂_{k|k} − x_k)(x̂_{k|k} − x_k)^T ] F_k^T + E[ η_k η_k^T ]
          = F_k P_{k|k} F_k^T + Q_k

where the system noise η_k and the previous estimation error x̂_{k|k} − x_k were assumed to be uncorrelated.
7.5.3 Kalman Filter Equations

In summary, the Kalman filter evolves an initial estimate and an initial error covariance matrix,

x̂_{0|−1} = x̂_0   and   P_{0|−1} = P_0 ,

both assumed to be given, by the update equations

P_{k|k}^{−1} = P_{k|k−1}^{−1} + H_k^T R_k^{−1} H_k
K_k = P_{k|k} H_k^T R_k^{−1}
x̂_{k|k} = x̂_{k|k−1} + K_k ( y_k − H_k x̂_{k|k−1} )

and by the propagation equations

x̂_{k+1|k} = F_k x̂_{k|k} + G_k u_k
P_{k+1|k} = F_k P_{k|k} F_k^T + Q_k .

7.6 Results of the Mortar Shell Experiment
In section 7.2, the dynamic system equations for a mortar shell were set up. Matlab routines available through the class Web page implement a Kalman filter (with naive numerics) to estimate the state of that system from simulated observations. Figure 7.3 shows the true and estimated trajectories. Notice that coincidence of the trajectories does not imply that the state estimate is up-to-date. For this it is also necessary that any given point of the trajectory is reached by the estimate at the same time instant. Figure 7.4 shows that the distance between estimated and true target position does indeed converge to zero, and this occurs in time for the shell to be shot down. Figure 7.5 shows the 2-norm of the covariance matrix over time. Notice that the covariance goes to zero only asymptotically.
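The Matlab routines mentioned above are not reproduced here. The following is instead an independent, minimal Python sketch of the same kind of filter: an extended-Kalman-filter-style variant that linearizes the measurements around the current estimate, in the spirit of section 7.2.2. It uses the F, G, Q, and R of section 7.2, together with invented values for dt, the true launch state, and the initial guess (none of which are specified in the text).

import numpy as np

rng = np.random.default_rng(1)
dt, g = 0.1, 9.8e-3                                   # hypothetical dt; g in km/s^2

F = np.array([[1, 0, 0, 0], [dt, 1, 0, 0], [0, 0, 1, 0], [0, 0, dt, 1]], dtype=float)
G = np.array([0, 0, dt, dt**2 / 2])
u = -g
Q = 0.1 * np.eye(4)                                   # state update covariance, as in the text
R = 1000.0 * np.eye(2)                                # measurement covariance, as in the text

def h(x):
    d, z = x[1], x[3]
    return np.array([1000.0 * z / d, 1000.0 / np.hypot(d, z)])

def H_jac(x):
    d, z = x[1], x[3]
    r3 = (d**2 + z**2) ** 1.5
    return np.array([[0.0, -1000.0 * z / d**2, 0.0, 1000.0 / d],
                     [0.0, -1000.0 * d / r3,   0.0, -1000.0 * z / r3]])

x_true = np.array([-0.2, 10.0, 0.2, 0.1])             # invented true state [d_dot, d, z_dot, z]
x_est = np.array([-0.1, 9.0, 0.15, 0.4])              # deliberately wrong initial guess
P = 10.0 * np.eye(4)

for k in range(200):
    y = h(x_true) + rng.multivariate_normal(np.zeros(2), R)    # noisy observation
    Hk = H_jac(x_est)                                          # update (section 7.5.1)
    P = np.linalg.inv(np.linalg.inv(P) + Hk.T @ np.linalg.inv(R) @ Hk)
    K = P @ Hk.T @ np.linalg.inv(R)
    x_est = x_est + K @ (y - h(x_est))
    x_true = F @ x_true + G * u                                # propagate (section 7.5.2)
    x_est = F @ x_est + G * u
    P = F @ P @ F.T + Q

print("final position error (km):", np.linalg.norm(x_est[[1, 3]] - x_true[[1, 3]]))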
Figure 7.3: The true and estimated trajectories get closer to one another. Trajectories start on the right.
Figure 7.4: Distance between true and estimated missile position as a function of time.
Figure 7.5: After an initial increase in uncertainty, the norm of the state covariance matrix converges to zero. Upwards segments correspond to state propagation, downwards ones to state update.
7.7 Linear Systems and the Kalman Filter

To see the connection between the Kalman filter and the solution of a linear system, substitute the system equation (7.10) repeatedly into the measurement equation (7.11):

y_k = H_k x_k + ξ_k
    = H_k ( F_{k−1} x_{k−1} + G_{k−1} u_{k−1} + η_{k−1} ) + ξ_k
    = H_k F_{k−1} x_{k−1} + H_k G_{k−1} u_{k−1} + H_k η_{k−1} + ξ_k
    = H_k F_{k−1} ( F_{k−2} x_{k−2} + G_{k−2} u_{k−2} + η_{k−2} ) + H_k G_{k−1} u_{k−1} + H_k η_{k−1} + ξ_k
    = H_k F_{k−1} F_{k−2} x_{k−2} + H_k ( F_{k−1} G_{k−2} u_{k−2} + G_{k−1} u_{k−1} ) + H_k ( F_{k−1} η_{k−2} + η_{k−1} ) + ξ_k
    ⋮
    = H_k F_{k−1} ⋯ F_0 x_0 + H_k ( F_{k−1} ⋯ F_1 G_0 u_0 + ⋯ + G_{k−1} u_{k−1} ) + H_k ( F_{k−1} ⋯ F_1 η_0 + ⋯ + η_{k−1} ) + ξ_k

or in a more compact form,

y_k = H_k Φ(k − 1, 0) x_0 + H_k Σ_{j=1}^{k} Φ(k − 1, j) G_{j−1} u_{j−1} + ν_k     (7.22)

where

Φ(l, j) = F_l ⋯ F_j   for l ≥ j ,   Φ(l, j) = 1   for l < j

and

ν_k = H_k Σ_{j=1}^{k} Φ(k − 1, j) η_{j−1} + ξ_k
is noise. The key thing to notice about this somewhat intimidating expression is that for any k it is a linear system in x_0, the initial state of the system. We can write one system like the one in equation (7.22) for every value of k = 0, …, K, where K is the last time instant considered, and we obtain a large system of the form

z_K = Θ_K x_0 + g_K + n_K     (7.23)

where

z_K = [ y_0 ; ⋮ ; y_K ] ,   Θ_K = [ H_0 ; H_1 Φ(0, 0) ; ⋮ ; H_K Φ(K − 1, 0) ] ,

and g_K and n_K collect, respectively, the known input terms H_k Σ_{j=1}^{k} Φ(k − 1, j) G_{j−1} u_{j−1} and the noise terms ν_k for k = 0, …, K.
Without knowing anything about the statistics of the noise vector n_K in equation (7.23), the best we can do is to solve the system

z_K = Θ_K x_0 + g_K

in the sense of least squares, to obtain an estimate of x_0 from the measurements y_0, …, y_K:

x̂_{0|K} = Θ_K^† ( z_K − g_K )

where Θ_K^† is the pseudoinverse of Θ_K. We know that if Θ_K has full rank, the result with the pseudoinverse is the same as we would obtain by solving the normal equations, so that

Θ_K^† = ( Θ_K^T Θ_K )^{−1} Θ_K^T .
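The following Python sketch builds the stacked system for a small, noiseless, randomly generated time-varying model (all matrices below are invented test data) and recovers x_0 through the pseudoinverse, as in the formula above.

import numpy as np

rng = np.random.default_rng(2)
K = 5
F = [rng.normal(size=(3, 3)) for _ in range(K)]          # hypothetical time-varying system
G = [rng.normal(size=(3, 1)) for _ in range(K)]
Hm = [rng.normal(size=(2, 3)) for _ in range(K + 1)]
u = [rng.normal(size=(1,)) for _ in range(K)]
x0 = np.array([1.0, -1.0, 0.5])

# Simulate the noiseless system and stack the measurements z_K.
x, z = x0.copy(), []
for k in range(K + 1):
    z.append(Hm[k] @ x)
    if k < K:
        x = F[k] @ x + G[k] @ u[k]
z = np.concatenate(z)

def Phi(l, j):
    # State transition product F_l ... F_j (identity when l < j), as in equation (7.22).
    P = np.eye(3)
    for i in range(j, l + 1):
        P = F[i] @ P
    return P

Theta = np.vstack([Hm[k] @ Phi(k - 1, 0) for k in range(K + 1)])
gK = np.concatenate([
    Hm[k] @ sum((Phi(k - 1, j) @ (G[j - 1] @ u[j - 1]) for j in range(1, k + 1)), np.zeros(3))
    for k in range(K + 1)])

# Without noise, the stacked system z_K = Theta_K x_0 + g_K recovers the initial state exactly.
x0_hat = np.linalg.pinv(Theta) @ (z - gK)
print(np.allclose(x0_hat, x0))   # expected: True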
The least squares solution to system (7.23) minimizes the residue between the left and the right-hand side under the assumption that all equations are to be treated the same way. This is equivalent to assuming that all the noise terms in n_K are equally important. However, we know the covariance matrices of all these noise terms, so we ought to be able to do better, and weigh each equation to take these covariances into account. Intuitively, a small covariance means that we believe in that measurement, and therefore in that equation, which should consequently be weighed more heavily than others. The quantitative embodiment of this intuitive idea is at the core of the Kalman filter. In summary, the Kalman filter for a linear system has been shown to be equivalent to a linear equation solver, under the assumption that the noise that affects each of the equations has the same probability distribution, that is, that all the noise terms in n_K in equation (7.23) are equally important. However, the Kalman filter differs from a linear solver in the following important respects:

1. The noise terms in n_K in equation (7.23) are not equally important. Measurements come with covariance matrices, and the Kalman filter makes optimal use of this information for a proper weighting of each of the scalar equations in (7.23). Better information ought to yield more accurate results, and this is in fact the case.

2. The system (7.23) is not solved all at once. Rather, an initial solution is refined over time as new measurements become available. The final solution can be proven to be exactly equal to solving system (7.23) all at once. However, having better and better approximations to the solution as new data come in is much preferable in a dynamic setting, where one cannot in general wait for all the data to be collected. In some applications, data may never stop arriving.

3. A solution for the estimate x̂_{k|k} of the current state is given, and not only for the estimate x̂_{0|k} of the initial state. As time goes by, knowledge of the initial state may obsolesce and become less and less useful. The Kalman filter computes up-to-date information about the current state.