Optimization Techniques 1. Least Squares


Advanced Computational Techniques Lecture 8: Least Squares


Least Squares (LS) Problem

In LS, we are concerned with solving

$$\min_{x \in \mathbb{C}^n} \; \|Ax - b\|_2^2$$

for $x$, given $A \in \mathbb{C}^{m \times n}$, $m > n$, and $b \in \mathbb{C}^m$.

In essence, $Ax - b$ represents an error vector, and we seek to minimize the sum of the squared entries of that error vector.
LS has a wide range of applications.
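As a concrete illustration, a minimal NumPy sketch of the problem setup (the matrix and vector below are arbitrary synthetic data); np.linalg.lstsq returns a minimizer of $\|Ax - b\|_2^2$:

```python
import numpy as np

# Arbitrary illustrative overdetermined system (m > n).
rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# np.linalg.lstsq returns a minimizer of ||Ax - b||_2^2.
x_ls, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print("x_LS =", x_ls)
print("squared error =", np.sum((A @ x_ls - b) ** 2))
```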


Least Squares Estimation


Introduction

Least squares is a time-honored estimation procedure that was developed independently by Gauss
(1795), Legendre (1805) and Adrain (1808) and published in the first decade of the nineteenth
century. It is perhaps the most widely used technique in geophysical data analysis. Unlike
maximum likelihood, which can be applied to any problem for which we know the general form
of the joint pdf, in least squares the parameters to be estimated must arise in expressions for the
means of the observations. When the parameters appear linearly in these expressions then the
least squares estimation problem can be solved in closed form, and it is relatively straightforward
to derive the statistical properties for the resulting parameter estimates.

One very simple example which we will treat in some detail in order to illustrate the more general
problem is that of fitting a straight line to a collection of pairs of observations (xi , yi ) where
i = 1, 2, . . . , n. We suppose that a reasonable model is of the form

y = β0 + β1 x, (1)

and we need a mechanism for determining β0 and β1 . This is of course just a special case of many
more general problems including fitting a polynomial of order p, for which one would need to find
p + 1 coefficients. The most commonly used method for finding a model is that of least squares
estimation. It is supposed that x is an independent (or predictor) variable which is known exactly,
while y is a dependent (or response) variable. The least squares (LS) estimates for β0 and β1 are
those for which the predicted values of the curve minimize the sum of the squared deviations from
the observations. That is, the problem is to find the values of $\beta_0$, $\beta_1$ that minimize the residual sum of squares

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \qquad (2)$$

Note that this involves the minimization of vertical deviations from the line (not the perpendicular
distance) and is thus not symmetric in y and x. In other words, if x is treated as the dependent
variable instead of y one might well expect a different result.

To find the minimizing values of $\beta_i$ in (2) we just solve the equations resulting from setting

$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0, \qquad (3)$$

namely

$$\sum_i y_i = n\hat\beta_0 + \hat\beta_1 \sum_i x_i, \qquad
\sum_i x_i y_i = \hat\beta_0 \sum_i x_i + \hat\beta_1 \sum_i x_i^2 \qquad (4)$$

Solving for the β̂i yields the least squares parameter estimates:
$$\hat\beta_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad
\hat\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (5)$$

where the sums are implicitly taken to be from $i = 1$ to $n$ in each case. Having generated these
estimates, it is natural to wonder how much faith we should have in β̂0 and β̂1 , and whether the fit
to the data is reasonable. Perhaps a different functional form would provide a more appropriate fit
to the observations, for example, involving a series of independent variables, so that
$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \qquad (6)$$
or decay curves
$$f(t) = Ae^{-\alpha t} + Be^{-\beta t}, \qquad (7)$$
or periodic functions
$$f(t) = A\cos\omega_1 t + B\sin\omega_1 t + C\cos\omega_2 t + D\sin\omega_2 t. \qquad (8)$$
In equations (7) and (8) the functions f (t) are linear in A, B, C and D, but nonlinear in the other
parameters α, β, ω1 , and ω2 . When the function to be fit is linear in the parameters, then the partial
derivatives of S with respect to them yield equations that can be solved in closed form. Typically
non-linear least squares problems do not provide a solution in closed form and one must resort to
an iterative procedure. However, it is sometimes possible to transform the nonlinear function to
be fitted into a linear form. For example, the Arrhenius equation models the rate of a chemical
reaction as a function of temperature via a 2-parameter model with an unknown constant frequency
factor C and activation energy EA , so that

$$\alpha(T) = Ce^{-E_A/kT} \qquad (9)$$


Boltzmann's constant $k$ is known a priori. If one measures $\alpha$ at various values of $T$, then $C$ and $E_A$ can be found by a linear least squares fit to the transformed variables, $\log\alpha$ and $1/T$:

$$\log\alpha(T) = \log C - \frac{E_A}{kT} \qquad (10)$$
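As a brief illustration of this kind of linearization, a minimal NumPy sketch with synthetic measurements (the values of $C$, $E_A/k$, the temperatures, and the noise level are hypothetical): fit a straight line to $\log\alpha$ versus $1/T$ and read the parameters off the slope and intercept.

```python
import numpy as np

# Hypothetical ground truth for the synthetic example.
C_true, EA_over_k = 2.0e6, 5000.0          # frequency factor, activation energy / k
T = np.linspace(250.0, 400.0, 20)          # temperatures (K)
rng = np.random.default_rng(1)
alpha = C_true * np.exp(-EA_over_k / T) * np.exp(0.02 * rng.standard_normal(T.size))

# Linearize: log(alpha) = log(C) - (E_A/k) * (1/T), then fit a straight line.
slope, intercept = np.polyfit(1.0 / T, np.log(alpha), deg=1)
print("estimated E_A/k =", -slope)
print("estimated C     =", np.exp(intercept))
```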

Fitting a Straight Line

We return to the simplest of LS fitting problems, namely fitting a straight line to paired observations
(xi , yi ), so that we can consider the statistical properties of LS estimates, assess the goodness of
fit in the resulting model, and understand how regression is related to correlation.

To make progress on these fronts we need to adopt some kind of statistical model for the noise
associated with the measurements. In the standard statistical model (SSM) we suppose that y is
a linear function of $x$ plus some random noise,
$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, \dots, n. \qquad (11)$$
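To tie (11) back to the closed-form estimates in (5), a minimal NumPy sketch with synthetic data (the true coefficients and noise level are hypothetical):

```python
import numpy as np

# Synthetic data from the standard statistical model (11) with hypothetical truth.
rng = np.random.default_rng(2)
n = 50
beta0_true, beta1_true = 1.5, -0.7
x = rng.uniform(0.0, 10.0, size=n)
y = beta0_true + beta1_true * x + 0.3 * rng.standard_normal(n)

# Closed-form LS estimates from equation (5).
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()
denom = n * Sxx - Sx ** 2
beta0_hat = (Sxx * Sy - Sx * Sxy) / denom
beta1_hat = (n * Sxy - Sx * Sy) / denom
print("beta0_hat =", beta0_hat, " beta1_hat =", beta1_hat)
```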

Application I: System Identification

Let $u[n]$ be an input signal that passes through a linear time-invariant system. The output is given by

$$x[n] = \sum_{\ell=0}^{L-1} h[\ell]\, u[n-\ell] + \nu[n]$$

where $h[n]$ is the impulse response of the system.

Our aim is to estimate $h[n]$ from $x[n]$, given that $u[n]$ is known.


Let $u[n] = [\, u[n], u[n-1], \dots, u[n-L+1] \,]^T$ and $h = [\, h[0], h[1], \dots, h[L-1] \,]^T$. The output signal can be re-expressed as

$$x[n] = u^T[n]\, h + \nu[n]$$

System identification can be done by minimizing the sum squared error:

$$\min_{h \in \mathbb{C}^L} \; \sum_{n=1}^{N} \left| u^T[n]\, h - x[n] \right|^2$$

where $N$ is the data length.


Let $x = [\, x[1], \dots, x[N] \,]^T$. We have

$$x = Uh + \nu$$

where $U = [\, u[1], \dots, u[N] \,]^T$.

The system identification problem can be rewritten as

$$\min_{h \in \mathbb{C}^L} \; \|Uh - x\|_2^2$$

which is an LS problem.
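A minimal NumPy sketch of this formulation, using a synthetic input and a hypothetical 4-tap impulse response; the matrix $U$ is built from delayed copies of $u[n]$, with samples before the start of the record taken as zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 200, 4
h_true = np.array([1.0, 0.5, -0.3, 0.1])       # hypothetical impulse response
u = rng.standard_normal(N)                     # known input signal

# Noisy output of the LTI system.
x = np.convolve(u, h_true)[:N] + 0.05 * rng.standard_normal(N)

# Build U so that row n holds [u[n], u[n-1], ..., u[n-L+1]].
U = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(L)])

# LS estimate of the impulse response.
h_ls, *_ = np.linalg.lstsq(U, x, rcond=None)
print("h_LS =", h_ls)
```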


Application II: Channel Equalization

In digital communication over a linear time-dispersive channel, the discrete signal model is generally formulated as

$$x[n] = \sum_{\ell=0}^{L-1} h[\ell]\, u[n-\ell] + \nu[n]$$

where

$u[n]$: transmitted symbol sequence
$h[n]$: channel impulse response
$x[n]$: received signal.


At the receiver, we apply a filtering process, called equalization,

$$y[n] = \sum_{\ell=0}^{m-1} w[\ell]\, x[n-\ell]$$

so that $y[n] \approx u[n]$.

Let $x[n] = [\, x[n], x[n-1], \dots, x[n-m+1] \,]^T$ and $w = [\, w[0], \dots, w[m-1] \,]^T$. The equalizer output equation can be rewritten as

$$y[n] = x^T[n]\, w$$


Suppose that $u[n]$ is known for $n = 0, 1, \dots, N-1$. In practice, this is made possible by having the transmitter send signals known to the receiver, a.k.a. pilot signals.

The equalizer coefficients $w[n]$ are determined by

$$\min_{w \in \mathbb{C}^m} \; \sum_{n=1}^{N} \left| x^T[n]\, w - u[n] \right|^2
= \min_{w \in \mathbb{C}^m} \; \|Xw - u\|_2^2$$

where $X = [\, x[1], \dots, x[N] \,]^T$ and $u = [\, u[0], \dots, u[N-1] \,]^T$.

The problem is again an LS.
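A minimal NumPy sketch of training-based equalizer design along these lines, assuming BPSK pilot symbols and a hypothetical 3-tap channel; indexing details such as the equalizer delay are glossed over here:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 400, 5
h = np.array([1.0, 0.4, -0.2])                       # hypothetical channel taps
u = rng.choice([-1.0, 1.0], size=N)                  # known pilot symbols (BPSK)
x = np.convolve(u, h)[:N] + 0.05 * rng.standard_normal(N)

# Row n of X holds [x[n], x[n-1], ..., x[n-m+1]] (zeros before the start).
X = np.column_stack([np.concatenate([np.zeros(k), x[:N - k]]) for k in range(m)])

# LS design of the equalizer: min_w ||Xw - u||_2^2.
w_ls, *_ = np.linalg.lstsq(X, u, rcond=None)
y = X @ w_ls                                         # equalized output, y[n] ~ u[n]
print("symbol error rate on pilots:", np.mean(np.sign(y) != u))
```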


Application III: Curve Fitting

Consider a collection of experimental measurements, denoted by $x(t_1), x(t_2), \dots, x(t_N)$. We seek to find a continuous curve that 'fits' those data.

Suppose that the curve can be parameterized as

$$y(t) = \theta_1 + \theta_2 t + \theta_3 t^2$$

and assume that the $x(t_i)$ are perturbed versions of $y(t_i)$,

$$x(t_i) = y(t_i) + \nu(t_i)$$

where $\nu(t_i)$ is noise.


Let $x = [\, x(t_1), \dots, x(t_N) \,]^T$. We have

$$x = H\theta + \nu$$

where $\theta = [\, \theta_1, \theta_2, \theta_3 \,]^T$ and

$$H = \begin{bmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_N & t_N^2 \end{bmatrix}$$


Again, we can use LS,

$$\min_{\theta \in \mathbb{R}^3} \; \|H\theta - x\|_2^2$$

to determine the curve coefficients.
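A minimal NumPy sketch of this quadratic fit, with synthetic measurements and hypothetical true coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30
t = np.linspace(0.0, 10.0, N)
theta_true = np.array([10.0, -5.0, 4.0])            # hypothetical [theta1, theta2, theta3]
x = theta_true[0] + theta_true[1] * t + theta_true[2] * t**2 + 5.0 * rng.standard_normal(N)

# Vandermonde-style regression matrix H = [1, t, t^2].
H = np.column_stack([np.ones(N), t, t**2])

# LS estimate of the curve coefficients.
theta_ls, *_ = np.linalg.lstsq(H, x, rcond=None)
print("theta_LS =", theta_ls)
```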


[Figure: measured data $x(t_i)$ and the LS-fitted curve plotted versus $t$.]


Application IV: Linear Prediction

A colored process, denoted by $y[n]$, can be modeled as

$$y[n] = \sum_{\ell=0}^{\infty} h[\ell]\, w[n-\ell]$$

where $w[n]$ is a zero-mean white process.


Here we are interested in the autoregressive (AR) process. In this process, $h[n]$ is an all-pole model; i.e., its z-transform is given by

$$H(z) = 1/A(z), \qquad A(z) = 1 - \sum_{i=1}^{m} a_i z^{-i}$$


Since
$$Y(z) = H(z)W(z)$$
we have that
$$Y(z)A(z) = W(z)$$
and that
$$y[n] - \sum_{i=1}^{m} a_i y[n-i] = w[n] \qquad (*)$$

Eq. $(*)$ can be viewed as a 'prediction', where the previous samples $\{y[n-i]\}_{i=1}^{m}$ predict the present sample $y[n]$, up to an (unpredictable) perturbation $w[n]$.


Our aim is to estimate $a = [\, a_1, \dots, a_m \,]^T$ from $y[n]$.

Let

$$y_p = [\, y[1], \dots, y[N] \,]^T, \qquad
\mathbf{y}[n] = [\, y[n-1], \dots, y[n-m] \,]^T, \qquad
Y = [\, \mathbf{y}[1], \dots, \mathbf{y}[N] \,]^T$$

AR coefficient estimation may be achieved by LS linear prediction:

$$\min_{a \in \mathbb{C}^m} \; \|Ya - y_p\|_2^2$$
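A minimal NumPy sketch of LS linear prediction for a synthetic AR(2) process; the coefficients are hypothetical, and samples before the start of the record are treated as zero:

```python
import numpy as np

rng = np.random.default_rng(6)
N, m = 1000, 2
a_true = np.array([1.5, -0.75])                      # hypothetical AR(2) coefficients

# Generate the AR process y[n] = a1*y[n-1] + a2*y[n-2] + w[n].
w = rng.standard_normal(N)
y = np.zeros(N)
for n in range(N):
    for i in range(1, m + 1):
        if n - i >= 0:
            y[n] += a_true[i - 1] * y[n - i]
    y[n] += w[n]

# Row n of Y holds [y[n-1], ..., y[n-m]]; the target vector is y[n] itself.
Y = np.column_stack([np.concatenate([np.zeros(k), y[:N - k]]) for k in range(1, m + 1)])
a_ls, *_ = np.linalg.lstsq(Y, y, rcond=None)
print("a_LS =", a_ls)                                # close to a_true for large N
```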


Solving LS

First, some remarks:

• In Lecture 4, we have learnt that for $m > n$,

$$Ax - b \neq 0$$

in general, unless $b \in \mathcal{R}(A)$.

• If $\operatorname{rank}(A) < n$, then the solution set

$$\{\, x_{LS} \in \mathbb{C}^n \mid \|Ax_{LS} - b\|_2^2 = \min_x \|Ax - b\|_2^2 \,\}$$

does not contain just one element: if $x_{LS}$ is a solution, then $x_{LS} + z$, $z \in \mathcal{N}(A)$, is also a solution.


Alternative I for solving LS: use the Gradient

The gradient of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined to be

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

Some useful properties of gradients:

1. The gradient of $f(x) = x^T b$ is $\nabla f = b$.

2. The gradient of $f(x) = x^T R x$, where $R$ is symmetric, is $\nabla f = 2Rx$.


For ease of exposition, assume that $A$, $b$, and $x$ are real-valued. Let

$$f(x) = \|Ax - b\|_2^2$$

The LS problem

$$\min_{x \in \mathbb{R}^n} f(x)$$

is an unconstrained optimization problem. Since $f$ is convex, the necessary and sufficient condition for $x_{LS}$ to be a solution is that

$$\nabla f \big|_{x = x_{LS}} = 0$$


We can expand

$$f(x) = x^T A^T A x - 2x^T A^T b + b^T b$$

The gradient of $f$ is

$$\nabla f = 2A^T A x - 2A^T b$$

Hence, an optimal solution $x_{LS}$ can be found by solving

$$A^T A x_{LS} = A^T b$$

For the complex case, it can be shown (in a similar way, but with more effort) that

$$A^H A x_{LS} = A^H b$$


Alternative II for solving LS: use the Orthogonality Principle

Theorem 8.1 (Orthogonality Principle) A vector $x_{LS}$ is an LS solution if and only if

$$A^H (A x_{LS} - b) = 0$$


The equations
$$A^H A x_{LS} = A^H b$$
are referred to as the normal equations.

If $A$ has full column rank, so that $A^H A$ is positive definite (PD), then $x_{LS}$ is uniquely determined by

$$x_{LS} = (A^H A)^{-1} A^H b$$
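A minimal NumPy sketch, using arbitrary full-column-rank data, showing that solving the normal equations agrees with np.linalg.lstsq (which uses an SVD internally and is usually preferred for numerical reasons):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 10, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)

# Solve the normal equations A^H A x = A^H b directly.
AH = A.conj().T
x_normal = np.linalg.solve(AH @ A, AH @ b)

# Reference solution from np.linalg.lstsq.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("max difference:", np.max(np.abs(x_normal - x_lstsq)))
```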


Interpretations of the Normal Equations

Let $r_{LS} = b - A x_{LS}$ be the LS error vector.

For full column rank $A$,

$$r_{LS} = b - A(A^H A)^{-1} A^H b = b - Pb = P^{\perp} b$$

where $P = A(A^H A)^{-1} A^H$ is the orthogonal projection matrix onto $\mathcal{R}(A)$, and $P^{\perp} = I - P$ is the projection onto its orthogonal complement.

This means that the LS error is orthogonal to every vector in $\mathcal{R}(A)$.
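A quick numerical check of this orthogonality property, with arbitrary synthetic data; $A^H r_{LS}$ should be zero up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((12, 3))
b = rng.standard_normal(12)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
r_ls = b - A @ x_ls

# The LS residual is orthogonal to the range of A: A^H r_LS = 0.
print("A^T r_LS =", A.T @ r_ls)        # ~[0, 0, 0] up to rounding error
```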


LS for Rank Deficient A

As we mentioned, for rank-deficient $A$ there is more than one LS solution.

But we can find a unique $x_{LS}$ whose 2-norm is the smallest among all LS solutions.


Let $r = \operatorname{rank}(A)$, and denote the SVD of $A$ by

$$A = U\Sigma V^H = [\, U_1 \;\; U_2 \,] \begin{bmatrix} \tilde\Sigma & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^H \\ V_2^H \end{bmatrix}$$

where $\tilde\Sigma = \operatorname{Diag}(\sigma_1, \dots, \sigma_r)$ contains the nonzero singular values of $A$.

Define

$$A^{\dagger} = V_1 \tilde\Sigma^{-1} U_1^H$$

to be the pseudo-inverse of $A$.


Theorem 8.2 The solution of the minimum 2-norm problem

$$\min \|x\|_2^2 \quad \text{s.t. } x \text{ minimizes } \|Ax - b\|_2^2$$

is uniquely given by

$$x_{LS} = A^{\dagger} b$$
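A minimal NumPy sketch with a deliberately rank-deficient $A$; np.linalg.pinv computes $A^{\dagger}$ via the SVD, and np.linalg.lstsq also returns the minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, r = 8, 4, 2
# Rank-deficient A (rank r < n) built from random factors.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
b = rng.standard_normal(m)

x_pinv = np.linalg.pinv(A) @ b                      # minimum 2-norm LS solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # also the min-norm solution
print("match:", np.allclose(x_pinv, x_lstsq))

# Any x_pinv + z with z in N(A) has the same residual but a larger norm.
_, _, Vh = np.linalg.svd(A)
z = Vh[-1]                                          # a null-space direction of A
print("same residual:", np.allclose(A @ (x_pinv + z) - b, A @ x_pinv - b))
print("norms:", np.linalg.norm(x_pinv), "<", np.linalg.norm(x_pinv + z))
```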


Note that

$$r_{LS} = b - A x_{LS} = b - (U_1 \tilde\Sigma V_1^H)(V_1 \tilde\Sigma^{-1} U_1^H) b = b - U_1 U_1^H b = b - Pb = P^{\perp} b$$

where $P = U_1 U_1^H$ is the orthogonal projection matrix onto $\mathcal{R}(A)$.

This orthogonality property is the same as in the case of full column rank $A$.


Some Relationships of the Pseudo-Inverse

1. For the case of full column rank $A$ (i.e., $m \geq n$, $\operatorname{rank}(A) = n$),

$$A^{\dagger} = (A^H A)^{-1} A^H$$

which means that $A^{\dagger} b$ coincides with the LS solution in the full column rank case.

2. For the case of full row rank $A$ (i.e., $m \leq n$, $\operatorname{rank}(A) = m$),

$$A^{\dagger} = A^H (A A^H)^{-1}$$


Relationship to the generalized inverse

A matrix $C \in \mathbb{C}^{n \times m}$ is said to be the Moore-Penrose generalized inverse of $A$ if the following four conditions hold:

1. $ACA = A$

2. $CAC = C$

3. $(AC)^H = AC$

4. $(CA)^H = CA$

It can be verified that $A^{\dagger}$ is the Moore-Penrose generalized inverse of $A$.
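A quick numerical verification of the four conditions, using np.linalg.pinv for $A^{\dagger}$ and an arbitrary synthetic $A$ (for real matrices the conjugate transpose reduces to the transpose):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # arbitrary rank-deficient A
C = np.linalg.pinv(A)                                           # candidate for A^dagger

print(np.allclose(A @ C @ A, A))          # 1. ACA = A
print(np.allclose(C @ A @ C, C))          # 2. CAC = C
print(np.allclose((A @ C).T, A @ C))      # 3. (AC)^H = AC  (real case: transpose)
print(np.allclose((C @ A).T, C @ A))      # 4. (CA)^H = CA
```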
