Optimization Techniques
1. Least Squares
Least Squares Estimation
Least squares is a time-honored estimation procedure that was developed independently by Gauss
(1795), Legendre (1805) and Adrain (1808) and published in the first decade of the nineteenth
century. It is perhaps the most widely used technique in geophysical data analysis. Unlike
maximum likelihood, which can be applied to any problem for which we know the general form
of the joint pdf, in least squares the parameters to be estimated must arise in expressions for the
means of the observations. When the parameters appear linearly in these expressions then the
least squares estimation problem can be solved in closed form, and it is relatively straightforward
to derive the statistical properties for the resulting parameter estimates.
One very simple example which we will treat in some detail in order to illustrate the more general
problem is that of fitting a straight line to a collection of pairs of observations (xi , yi ) where
i = 1, 2, . . . , n. We suppose that a reasonable model is of the form
y = β0 + β1 x, (1)
and we need a mechanism for determining β0 and β1 . This is of course just a special case of many
more general problems including fitting a polynomial of order p, for which one would need to find
p + 1 coefficients. The most commonly used method for finding a model is that of least squares
estimation. It is supposed that x is an independent (or predictor) variable which is known exactly,
while y is a dependent (or response) variable. The least squares (LS) estimates for β0 and β1 are
those for which the predicted values of the curve minimize the sum of the squared deviations from
the observations. That is, the problem is to find the values of β0, β1 that minimize the residual sum
of squares
S(β0, β1) = Σ_{i=1}^{n} (yi − β0 − β1 xi)²    (2)
Note that this involves the minimization of vertical deviations from the line (not the perpendicular
distance) and is thus not symmetric in y and x. In other words, if x is treated as the dependent
variable instead of y, one might well expect a different result.
To find the minimizing values of βi in (2) we just solve the equations resulting from setting
∂S/∂β0 = 0,    ∂S/∂β1 = 0,    (3)
namely
Σ_i yi = n β̂0 + β̂1 Σ_i xi
Σ_i xi yi = β̂0 Σ_i xi + β̂1 Σ_i xi²    (4)
Solving for the β̂i yields the least squares parameter estimates:
β̂0 = (Σ xi² Σ yi − Σ xi Σ xi yi) / (n Σ xi² − (Σ xi)²)
β̂1 = (n Σ xi yi − Σ xi Σ yi) / (n Σ xi² − (Σ xi)²)    (5)
where the Σ's are implicitly taken to be from i = 1 to n in each case.
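As a minimal sketch of equations (5) in code (the data values below are illustrative, not from the text), the closed-form estimates can be computed directly and cross-checked against a library fit:

import numpy as np

# Illustrative data, roughly y ≈ 2 + 3x plus noise (assumed example).
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 4.9, 8.2, 10.8, 14.1, 17.2])

n = len(x)
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()

# Closed-form LS estimates from equation (5).
denom = n * Sxx - Sx**2
beta0_hat = (Sxx * Sy - Sx * Sxy) / denom
beta1_hat = (n * Sxy - Sx * Sy) / denom

print(beta0_hat, beta1_hat)   # agrees with np.polyfit(x, y, 1)[::-1]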
Having generated these estimates, it is natural to wonder how much faith we should have in β̂0 and β̂1, and whether the fit
to the data is reasonable. Perhaps a different functional form would provide a more appropriate fit
to the observations, for example, involving a series of independent variables, so that
y ≈ β0 + β1 x1 + β2 x2 + β3 x3 (6)
or decay curves
f(t) = A e^{−αt} + B e^{−βt},    (7)
or periodic functions
f(t) = A cos ω1 t + B sin ω1 t + C cos ω2 t + D sin ω2 t.    (8)
In equations (7) and (8) the functions f (t) are linear in A, B, C and D, but nonlinear in the other
parameters α, β, ω1 , and ω2 . When the function to be fit is linear in the parameters, then the partial
derivatives of S with respect to them yield equations that can be solved in closed form. Typically
non-linear least squares problems do not provide a solution in closed form and one must resort to
an iterative procedure. However, it is sometimes possible to transform the nonlinear function to
be fitted into a linear form. For example, the Arrhenius equation models the rate of a chemical
reaction as a function of temperature via a 2-parameter model with an unknown constant frequency
factor C and activation energy EA,
α(T) = C e^{−EA/(kT)},    (9)
which becomes linear in the parameters log C and EA upon taking logarithms:
log α(T) = log C − EA/(kT).    (10)
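As a hedged sketch of this kind of linearization (the constants, units and synthetic rates below are assumptions for illustration only), one can fit log α against 1/T with the straight-line machinery above and recover C and EA:

import numpy as np

k = 8.617e-5                                        # Boltzmann constant in eV/K (assumed unit choice)
T = np.array([300.0, 350.0, 400.0, 450.0, 500.0])   # illustrative temperatures (K)
alpha = 1e13 * np.exp(-0.7 / (k * T))               # synthetic rates with C = 1e13, EA = 0.7 eV

# Linearize: log(alpha) = log(C) - (EA/k) * (1/T), a straight line in 1/T.
X = 1.0 / T
Y = np.log(alpha)
slope, intercept = np.polyfit(X, Y, 1)

C_hat = np.exp(intercept)
EA_hat = -slope * k
print(C_hat, EA_hat)   # ≈ 1e13 and 0.7 on this noiseless synthetic data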
We return to the simplest of LS fitting problems, namely fitting a straight line to paired observations
(xi , yi ), so that we can consider the statistical properties of LS estimates, assess the goodness of
fit in the resulting model, and understand how regression is related to correlation.
To make progress on these fronts we need to adopt some kind of statistical model for the noise
associated with the measurements. In the standard statistical model (SSM) we suppose that y is
a linear function of x plus some random noise,
yi = β0 + β1 xi + ei ,    i = 1, . . . , n.    (11)
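To see what the SSM implies in practice (the true parameters and Gaussian noise below are assumptions chosen for illustration), one can simulate data from (11) repeatedly and examine how the LS estimates scatter around the true β0 and β1:

import numpy as np

rng = np.random.default_rng(0)
beta0_true, beta1_true, sigma = 1.0, 2.0, 0.5   # assumed values for illustration
x = np.linspace(0.0, 10.0, 50)

estimates = []
for _ in range(1000):
    e = rng.normal(0.0, sigma, size=x.size)      # random noise in the SSM
    y = beta0_true + beta1_true * x + e          # model (11)
    b1, b0 = np.polyfit(x, y, 1)                 # LS fit; polyfit returns [slope, intercept]
    estimates.append((b0, b1))

estimates = np.array(estimates)
print(estimates.mean(axis=0))   # close to (1.0, 2.0): LS estimates center on the true values
print(estimates.std(axis=0))    # spread quantifies how much faith to place in a single fit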
Many estimation problems can be written in the linear-model form
x = Uh + ν,
where x is the observation vector, U is a known matrix, h is the unknown parameter vector, and ν is noise; estimating h from x is an LS problem.
y[n] = xᵀ[n] w
As an example, consider fitting the quadratic model
y(t) = θ1 + θ2 t + θ3 t²
to noisy samples x(ti), i = 1, . . . , N. In matrix form,
x = Hθ + ν
where θ = [ θ1 , θ2 , θ3 ]ᵀ and
H = [ 1   t1   t1²
      1   t2   t2²
      ⋮    ⋮    ⋮
      1   tN   tN² ]
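A minimal sketch of this quadratic fit (the sample times, true θ, and noise level are assumed for illustration):

import numpy as np

rng = np.random.default_rng(1)
t = np.linspace(0.0, 10.0, 40)
theta_true = np.array([10.0, 5.0, 3.0])            # assumed true [θ1, θ2, θ3]

# Build the design matrix H with rows [1, t_i, t_i^2].
H = np.column_stack([np.ones_like(t), t, t**2])
x = H @ theta_true + rng.normal(0.0, 5.0, t.size)  # x = Hθ + ν

# Solve the LS problem min ||Hθ - x||^2.
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
print(theta_hat)   # close to [10, 5, 3]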
[Figure: data x(ti) and the LS fitted curve plotted against t for t = 0 to 10.]
H(z) = 1/A(z)
A(z) = 1 − Σ_{i=1}^{m} ai z^{−i}
Since
Y(z) = H(z) W(z),
we have that
Y(z) A(z) = W(z),
and that
y[n] − Σ_{i=1}^{m} ai y[n − i] = w[n].    (∗)
Define
yp = [ y[1], . . . , y[N ] ]ᵀ,
y[n] = [ y[n − 1], . . . , y[n − m] ]ᵀ,
Y = [ y[1], . . . , y[N ] ]ᵀ,
so that (∗) can be written as yp ≈ Y a with a = [ a1 , . . . , am ]ᵀ, and the coefficients ai can be estimated by LS.
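A hedged sketch of LS estimation of the AR coefficients from (∗) (synthetic AR(2) data; the windowing used below, which starts predictions at n = m, is an assumption, since the slides give only the vector definitions):

import numpy as np

rng = np.random.default_rng(2)
a_true = np.array([1.5, -0.7])                 # assumed AR(2) coefficients
N, m = 500, 2

# Generate y[n] = a1*y[n-1] + a2*y[n-2] + w[n].
y = np.zeros(N)
w = rng.normal(0.0, 1.0, N)
for n in range(m, N):
    y[n] = a_true @ y[n-m:n][::-1] + w[n]

# Stack regressor rows [y[n-1], ..., y[n-m]] and targets y[n].
Y = np.array([y[n-m:n][::-1] for n in range(m, N)])
yp = y[m:]

a_hat, *_ = np.linalg.lstsq(Y, yp, rcond=None)
print(a_hat)   # close to [1.5, -0.7]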
Solving LS
In general, an overdetermined system cannot be solved exactly: Ax − b ≠ 0 for every choice of x.
The LS problem
min_{x∈ℝⁿ} f(x),    where f(x) = ‖Ax − b‖₂².
We can expand
f(x) = xᵀAᵀAx − 2xᵀAᵀb + bᵀb.
The gradient of f is
∇f = 2AᵀAx − 2Aᵀb,
and setting ∇f = 0 gives
AᵀA xLS = Aᵀb.
For the complex case, it can be shown (in a similar way, but with a bit more work) that
AᴴA xLS = Aᴴb.
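As a quick numerical check (random A and b, purely illustrative), a standard LS solver returns an xLS that satisfies the normal equations:

import numpy as np

rng = np.random.default_rng(3)
A = rng.normal(size=(20, 4))          # tall matrix, full column rank with probability 1
b = rng.normal(size=20)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)

# Normal equations: A^T A x_ls = A^T b (A^H in the complex case).
print(np.allclose(A.T @ A @ x_ls, A.T @ b))   # True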
Equivalently, the LS residual is orthogonal to the columns of A:
Aᴴ(A xLS − b) = 0.
The equations
AᴴA xLS = Aᴴb
are referred to as the normal equations.
If A is of full column rank, so that AᴴA is positive definite (PD), then xLS is uniquely determined by
xLS = (AᴴA)⁻¹ Aᴴb.
Write the SVD of A as
A = U Σ Vᴴ = [ U1  U2 ] [ Σ̃  0 ; 0  0 ] [ V1ᴴ ; V2ᴴ ] = U1 Σ̃ V1ᴴ,
where Σ̃ contains the nonzero singular values, and define
A† = V1 Σ̃⁻¹ U1ᴴ
to be the pseudo-inverse of A.
The minimum-norm LS solution, i.e. the solution of
min ‖x‖₂²   s.t.   x minimizes ‖Ax − b‖₂²,
is uniquely given by
xLS = A†b.
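A small sketch (the rank-deficient A and b below are made up for illustration) showing that the SVD-based pseudo-inverse built from U1, Σ̃, V1 matches a library pinv and yields the minimum-norm LS solution:

import numpy as np

# Rank-deficient A: the third column is the sum of the first two.
A = np.array([[1.0, 0.0, 1.0],
              [0.0, 1.0, 1.0],
              [1.0, 1.0, 2.0],
              [2.0, 0.0, 2.0]])
b = np.array([1.0, 2.0, 3.0, 4.0])

# Pseudo-inverse via the economy SVD, keeping only the nonzero singular values.
U, s, Vh = np.linalg.svd(A, full_matrices=False)
r = np.sum(s > 1e-10 * s[0])                        # numerical rank
A_pinv = Vh[:r].T @ np.diag(1.0 / s[:r]) @ U[:, :r].T

x_ls = A_pinv @ b
print(np.allclose(A_pinv, np.linalg.pinv(A)))       # matches NumPy's pinv
print(np.allclose(x_ls, np.linalg.lstsq(A, b, rcond=None)[0]))   # lstsq also returns the min-norm solution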
Note that
rLS = b − A xLS
    = b − (U1 Σ̃ V1ᴴ)(V1 Σ̃⁻¹ U1ᴴ) b
    = b − U1 U1ᴴ b
    = b − P b = P⊥ b,
where P = U1 U1ᴴ is the orthogonal projection matrix onto the range (column space) of A.
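A brief numerical illustration (random A and b, assumed for this sketch) that the LS residual equals P⊥b and is orthogonal to the range of A:

import numpy as np

rng = np.random.default_rng(4)
A = rng.normal(size=(10, 3))
b = rng.normal(size=10)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
r_ls = b - A @ x_ls

U, s, Vh = np.linalg.svd(A, full_matrices=False)
P = U @ U.T                                # orthogonal projection onto range(A)

print(np.allclose(r_ls, b - P @ b))        # r_LS = (I - P) b
print(np.allclose(A.T @ r_ls, 0))          # residual orthogonal to the columns of A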
If A has full column rank, A† = (AᴴA)⁻¹ Aᴴ; if A has full row rank, A† = Aᴴ (AAᴴ)⁻¹.
The pseudo-inverse A† is the unique matrix C satisfying the four Moore-Penrose conditions:
1. ACA = A
2. CAC = C
3. (AC)ᴴ = AC
4. (CA)ᴴ = CA
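A quick sketch (random real A, so Aᴴ reduces to Aᵀ) verifying the four Moore-Penrose conditions for the library pseudo-inverse:

import numpy as np

rng = np.random.default_rng(5)
A = rng.normal(size=(6, 3))
C = np.linalg.pinv(A)

print(np.allclose(A @ C @ A, A))          # 1. ACA = A
print(np.allclose(C @ A @ C, C))          # 2. CAC = C
print(np.allclose((A @ C).T, A @ C))      # 3. (AC)^H = AC (real case, so H = T)
print(np.allclose((C @ A).T, C @ A))      # 4. (CA)^H = CA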