Optimization Techniques 1. Least Squares


Advanced Computational Techniques Lecture 8: Least Squares


Least Squares (LS) Problem

In LS, we are concerned with solving

$$\min_{x \in \mathbb{C}^n} \; \|Ax - b\|_2^2$$

for $x$, given $A \in \mathbb{C}^{m \times n}$, $m > n$, and $b \in \mathbb{C}^m$.

In essence, $Ax - b$ represents an error vector, and we seek to minimize the sum of the squared entries of that error vector.
LS has a wide range of applications.
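As a concrete illustration, a minimal NumPy sketch of the problem setup (the matrix and vector below are arbitrary synthetic data); np.linalg.lstsq returns a minimizer of $\|Ax - b\|_2^2$:

```python
import numpy as np

# Arbitrary illustrative overdetermined system (m > n).
rng = np.random.default_rng(0)
m, n = 8, 3
A = rng.standard_normal((m, n))
b = rng.standard_normal(m)

# np.linalg.lstsq returns a minimizer of ||Ax - b||_2^2.
x_ls, residuals, rank, sing_vals = np.linalg.lstsq(A, b, rcond=None)
print("x_LS =", x_ls)
print("squared error =", np.sum((A @ x_ls - b) ** 2))
```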


Least Squares Estimation


Introduction

Least squares is a time-honored estimation procedure that was developed independently by Gauss
(1795), Legendre (1805) and Adrain (1808) and published in the first decade of the nineteenth
century. It is perhaps the most widely used technique in geophysical data analysis. Unlike
maximum likelihood, which can be applied to any problem for which we know the general form
of the joint pdf, in least squares the parameters to be estimated must arise in expressions for the
means of the observations. When the parameters appear linearly in these expressions then the
least squares estimation problem can be solved in closed form, and it is relatively straightforward
to derive the statistical properties for the resulting parameter estimates.

One very simple example which we will treat in some detail in order to illustrate the more general
problem is that of fitting a straight line to a collection of pairs of observations (xi , yi ) where
i = 1, 2, . . . , n. We suppose that a reasonable model is of the form

y = β0 + β1 x, (1)

and we need a mechanism for determining β0 and β1 . This is of course just a special case of many
more general problems including fitting a polynomial of order p, for which one would need to find
p + 1 coefficients. The most commonly used method for finding a model is that of least squares
estimation. It is supposed that x is an independent (or predictor) variable which is known exactly,
while y is a dependent (or response) variable. The least squares (LS) estimates for β0 and β1 are
those for which the predicted values of the curve minimize the sum of the squared deviations from
the observations. That is, the problem is to find the values of $\beta_0$, $\beta_1$ that minimize the residual sum of squares

$$S(\beta_0, \beta_1) = \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2 \qquad (2)$$

Note that this involves the minimization of vertical deviations from the line (not the perpendicular
distance) and is thus not symmetric in y and x. In other words, if x is treated as the dependent
variable instead of y one might well expect a different result.

To find the minimizing values of $\beta_i$ in (2) we just solve the equations resulting from setting

$$\frac{\partial S}{\partial \beta_0} = 0, \qquad \frac{\partial S}{\partial \beta_1} = 0, \qquad (3)$$

namely

$$\sum_i y_i = n\hat\beta_0 + \hat\beta_1 \sum_i x_i, \qquad
\sum_i x_i y_i = \hat\beta_0 \sum_i x_i + \hat\beta_1 \sum_i x_i^2 \qquad (4)$$

Solving for the β̂i yields the least squares parameter estimates:
$$\hat\beta_0 = \frac{\sum x_i^2 \sum y_i - \sum x_i \sum x_i y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2}, \qquad
\hat\beta_1 = \frac{n\sum x_i y_i - \sum x_i \sum y_i}{n\sum x_i^2 - \left(\sum x_i\right)^2} \qquad (5)$$

where the sums are implicitly taken to be from $i = 1$ to $n$ in each case. Having generated these
estimates, it is natural to wonder how much faith we should have in β̂0 and β̂1 , and whether the fit
to the data is reasonable. Perhaps a different functional form would provide a more appropriate fit
to the observations, for example, involving a series of independent variables, so that
$$y \approx \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \beta_3 x_3 \qquad (6)$$
or decay curves
$$f(t) = Ae^{-\alpha t} + Be^{-\beta t}, \qquad (7)$$
or periodic functions
$$f(t) = A\cos\omega_1 t + B\sin\omega_1 t + C\cos\omega_2 t + D\sin\omega_2 t. \qquad (8)$$
In equations (7) and (8) the functions f (t) are linear in A, B, C and D, but nonlinear in the other
parameters α, β, ω1 , and ω2 . When the function to be fit is linear in the parameters, then the partial
derivatives of S with respect to them yield equations that can be solved in closed form. Typically
non-linear least squares problems do not provide a solution in closed form and one must resort to
an iterative procedure. However, it is sometimes possible to transform the nonlinear function to
be fitted into a linear form. For example, the Arrhenius equation models the rate of a chemical
reaction as a function of temperature via a 2-parameter model with an unknown constant frequency
factor C and activation energy EA , so that

$$\alpha(T) = Ce^{-E_A/kT} \qquad (9)$$


Boltzmann's constant $k$ is known a priori. If one measures $\alpha$ at various values of $T$, then $C$ and $E_A$ can be found by a linear least squares fit to the transformed variables, $\log\alpha$ and $1/T$:

$$\log\alpha(T) = \log C - \frac{E_A}{kT} \qquad (10)$$
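As a brief illustration of this kind of linearization, a minimal NumPy sketch with synthetic measurements (the values of $C$, $E_A/k$, the temperatures, and the noise level are hypothetical): fit a straight line to $\log\alpha$ versus $1/T$ and read the parameters off the slope and intercept.

```python
import numpy as np

# Hypothetical ground truth for the synthetic example.
C_true, EA_over_k = 2.0e6, 5000.0          # frequency factor, activation energy / k
T = np.linspace(250.0, 400.0, 20)          # temperatures (K)
rng = np.random.default_rng(1)
alpha = C_true * np.exp(-EA_over_k / T) * np.exp(0.02 * rng.standard_normal(T.size))

# Linearize: log(alpha) = log(C) - (E_A/k) * (1/T), then fit a straight line.
slope, intercept = np.polyfit(1.0 / T, np.log(alpha), deg=1)
print("estimated E_A/k =", -slope)
print("estimated C     =", np.exp(intercept))
```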

Fitting a Straight Line

We return to the simplest of LS fitting problems, namely fitting a straight line to paired observations
(xi , yi ), so that we can consider the statistical properties of LS estimates, assess the goodness of
fit in the resulting model, and understand how regression is related to correlation.

To make progress on these fronts we need to adopt some kind of statistical model for the noise
associated with the measurements. In the standard statistical model (SSM) we suppose that y is
a linear function of $x$ plus some random noise,
$$y_i = \beta_0 + \beta_1 x_i + e_i, \qquad i = 1, \dots, n. \qquad (11)$$
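To tie (11) back to the closed-form estimates in (5), a minimal NumPy sketch with synthetic data (the true coefficients and noise level are hypothetical):

```python
import numpy as np

# Synthetic data from the standard statistical model (11) with hypothetical truth.
rng = np.random.default_rng(2)
n = 50
beta0_true, beta1_true = 1.5, -0.7
x = rng.uniform(0.0, 10.0, size=n)
y = beta0_true + beta1_true * x + 0.3 * rng.standard_normal(n)

# Closed-form LS estimates from equation (5).
Sx, Sy = x.sum(), y.sum()
Sxx, Sxy = (x * x).sum(), (x * y).sum()
denom = n * Sxx - Sx ** 2
beta0_hat = (Sxx * Sy - Sx * Sxy) / denom
beta1_hat = (n * Sxy - Sx * Sy) / denom
print("beta0_hat =", beta0_hat, " beta1_hat =", beta1_hat)
```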

Application I: System Identification

Let $u[n]$ be an input signal that passes through a linear time-invariant system. The output is given by

$$x[n] = \sum_{\ell=0}^{L-1} h[\ell]\, u[n-\ell] + \nu[n]$$

where $h[n]$ is the impulse response of the system.

Our aim is to estimate $h[n]$ from $x[n]$, given that $u[n]$ is known.


Let $u[n] = [\, u[n], u[n-1], \dots, u[n-L+1] \,]^T$ and $h = [\, h[0], h[1], \dots, h[L-1] \,]^T$. The output signal can be re-expressed as

$$x[n] = u^T[n]\, h + \nu[n]$$

System identification can be done by minimizing the sum squared error:

$$\min_{h \in \mathbb{C}^L} \; \sum_{n=1}^{N} \left| u^T[n]\, h - x[n] \right|^2$$

where $N$ is the data length.


Let $x = [\, x[1], \dots, x[N] \,]^T$. We have

$$x = Uh + \nu$$

where $U = [\, u[1], \dots, u[N] \,]^T$.

The system identification problem can be rewritten as

$$\min_{h \in \mathbb{C}^L} \; \|Uh - x\|_2^2$$

which is an LS problem.
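A minimal NumPy sketch of this formulation, using a synthetic input and a hypothetical 4-tap impulse response; the matrix $U$ is built from delayed copies of $u[n]$, with samples before the start of the record taken as zero:

```python
import numpy as np

rng = np.random.default_rng(3)
N, L = 200, 4
h_true = np.array([1.0, 0.5, -0.3, 0.1])       # hypothetical impulse response
u = rng.standard_normal(N)                     # known input signal

# Noisy output of the LTI system.
x = np.convolve(u, h_true)[:N] + 0.05 * rng.standard_normal(N)

# Build U so that row n holds [u[n], u[n-1], ..., u[n-L+1]].
U = np.column_stack([np.concatenate([np.zeros(k), u[:N - k]]) for k in range(L)])

# LS estimate of the impulse response.
h_ls, *_ = np.linalg.lstsq(U, x, rcond=None)
print("h_LS =", h_ls)
```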


Application II: Channel Equalization

In digital communication over a linear time-dispersive channel, the discrete signal model is generally formulated as

$$x[n] = \sum_{\ell=0}^{L-1} h[\ell]\, u[n-\ell] + \nu[n]$$

where

$u[n]$: transmitted symbol sequence
$h[n]$: channel impulse response
$x[n]$: received signal.


At the receiver, we apply a filtering process, called equalization,

$$y[n] = \sum_{\ell=0}^{m-1} w[\ell]\, x[n-\ell]$$

so that $y[n] \approx u[n]$.

Let $x[n] = [\, x[n], x[n-1], \dots, x[n-m+1] \,]^T$ and $w = [\, w[0], \dots, w[m-1] \,]^T$. The equalizer output equation can be rewritten as

$$y[n] = x^T[n]\, w$$


Suppose that $u[n]$ is known for $n = 0, 1, \dots, N-1$. In practice, this is made possible by having the transmitter send signals known to the receiver, a.k.a. pilot signals.

The equalizer coefficients $w[n]$ are determined by

$$\min_{w \in \mathbb{C}^m} \; \sum_{n=1}^{N} \left| x^T[n]\, w - u[n] \right|^2
= \min_{w \in \mathbb{C}^m} \; \|Xw - u\|_2^2$$

where $X = [\, x[1], \dots, x[N] \,]^T$ and $u = [\, u[0], \dots, u[N-1] \,]^T$.

The problem is again an LS.
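A minimal NumPy sketch of training-based equalizer design along these lines, assuming BPSK pilot symbols and a hypothetical 3-tap channel; indexing details such as the equalizer delay are glossed over here:

```python
import numpy as np

rng = np.random.default_rng(4)
N, m = 400, 5
h = np.array([1.0, 0.4, -0.2])                       # hypothetical channel taps
u = rng.choice([-1.0, 1.0], size=N)                  # known pilot symbols (BPSK)
x = np.convolve(u, h)[:N] + 0.05 * rng.standard_normal(N)

# Row n of X holds [x[n], x[n-1], ..., x[n-m+1]] (zeros before the start).
X = np.column_stack([np.concatenate([np.zeros(k), x[:N - k]]) for k in range(m)])

# LS design of the equalizer: min_w ||Xw - u||_2^2.
w_ls, *_ = np.linalg.lstsq(X, u, rcond=None)
y = X @ w_ls                                         # equalized output, y[n] ~ u[n]
print("symbol error rate on pilots:", np.mean(np.sign(y) != u))
```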


Application III: Curve Fitting

Consider a collection of experimental measurements, denoted by $x(t_1), x(t_2), \dots, x(t_N)$. We seek to find a continuous curve that 'fits' those data.

Suppose that the curve can be parameterized as

$$y(t) = \theta_1 + \theta_2 t + \theta_3 t^2$$

and assume that the $x(t_i)$ are perturbed versions of $y(t_i)$,

$$x(t_i) = y(t_i) + \nu(t_i)$$

where $\nu(t_i)$ is noise.


Let $x = [\, x(t_1), \dots, x(t_N) \,]^T$. We have

$$x = H\theta + \nu$$

where $\theta = [\, \theta_1, \theta_2, \theta_3 \,]^T$ and

$$H = \begin{bmatrix} 1 & t_1 & t_1^2 \\ 1 & t_2 & t_2^2 \\ \vdots & \vdots & \vdots \\ 1 & t_N & t_N^2 \end{bmatrix}$$


Again, we can use LS,

$$\min_{\theta \in \mathbb{R}^3} \; \|H\theta - x\|_2^2$$

to determine the curve coefficients.
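A minimal NumPy sketch of this quadratic fit, with synthetic measurements and hypothetical true coefficients:

```python
import numpy as np

rng = np.random.default_rng(5)
N = 30
t = np.linspace(0.0, 10.0, N)
theta_true = np.array([10.0, -5.0, 4.0])            # hypothetical [theta1, theta2, theta3]
x = theta_true[0] + theta_true[1] * t + theta_true[2] * t**2 + 5.0 * rng.standard_normal(N)

# Vandermonde-style regression matrix H = [1, t, t^2].
H = np.column_stack([np.ones(N), t, t**2])

# LS estimate of the curve coefficients.
theta_ls, *_ = np.linalg.lstsq(H, x, rcond=None)
print("theta_LS =", theta_ls)
```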


[Figure: measured data $x(t_i)$ and the LS-fitted curve plotted versus $t$.]


Application IV: Linear Prediction

A colored process, denoted by $y[n]$, can be modeled as

$$y[n] = \sum_{\ell=0}^{\infty} h[\ell]\, w[n-\ell]$$

where $w[n]$ is a zero-mean white process.


Here we are interested in the autoregressive (AR) process. In this process, $h[n]$ is an all-pole model; i.e., its z-transform is given by

$$H(z) = 1/A(z), \qquad A(z) = 1 - \sum_{i=1}^{m} a_i z^{-i}$$


Since
$$Y(z) = H(z)W(z)$$
we have that
$$Y(z)A(z) = W(z)$$
and that
$$y[n] - \sum_{i=1}^{m} a_i y[n-i] = w[n] \qquad (*)$$

Eq. $(*)$ can be viewed as a 'prediction', where the previous samples $\{y[n-i]\}_{i=1}^{m}$ predict the present sample $y[n]$, up to an (unpredictable) perturbation $w[n]$.


Our aim is to estimate $a = [\, a_1, \dots, a_m \,]^T$ from $y[n]$.

Let

$$y_p = [\, y[1], \dots, y[N] \,]^T, \qquad
\mathbf{y}[n] = [\, y[n-1], \dots, y[n-m] \,]^T, \qquad
Y = [\, \mathbf{y}[1], \dots, \mathbf{y}[N] \,]^T$$

AR coefficient estimation may be achieved by LS linear prediction:

$$\min_{a \in \mathbb{C}^m} \; \|Ya - y_p\|_2^2$$
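A minimal NumPy sketch of LS linear prediction for a synthetic AR(2) process; the coefficients are hypothetical, and samples before the start of the record are treated as zero:

```python
import numpy as np

rng = np.random.default_rng(6)
N, m = 1000, 2
a_true = np.array([1.5, -0.75])                      # hypothetical AR(2) coefficients

# Generate the AR process y[n] = a1*y[n-1] + a2*y[n-2] + w[n].
w = rng.standard_normal(N)
y = np.zeros(N)
for n in range(N):
    for i in range(1, m + 1):
        if n - i >= 0:
            y[n] += a_true[i - 1] * y[n - i]
    y[n] += w[n]

# Row n of Y holds [y[n-1], ..., y[n-m]]; the target vector is y[n] itself.
Y = np.column_stack([np.concatenate([np.zeros(k), y[:N - k]]) for k in range(1, m + 1)])
a_ls, *_ = np.linalg.lstsq(Y, y, rcond=None)
print("a_LS =", a_ls)                                # close to a_true for large N
```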


Solving LS

First, some remarks:

• In Lecture 4, we have learnt that for $m > n$,

$$Ax - b \neq 0$$

in general, unless $b \in \mathcal{R}(A)$.

• If $\operatorname{rank}(A) < n$, then the solution set

$$\{\, x_{LS} \in \mathbb{C}^n \mid \|Ax_{LS} - b\|_2^2 = \min_x \|Ax - b\|_2^2 \,\}$$

does not contain just one element: if $x_{LS}$ is a solution, then $x_{LS} + z$, $z \in \mathcal{N}(A)$, is also a solution.


Alternative I for solving LS: use the Gradient

The gradient of a function $f : \mathbb{R}^n \to \mathbb{R}$ is defined to be

$$\nabla f = \begin{bmatrix} \dfrac{\partial f}{\partial x_1} \\ \vdots \\ \dfrac{\partial f}{\partial x_n} \end{bmatrix}$$

Some useful properties of gradients:

1. The gradient of $f(x) = x^T b$ is $\nabla f = b$.

2. The gradient of $f(x) = x^T R x$, where $R$ is symmetric, is $\nabla f = 2Rx$.


For ease of exposition, assume that $A$, $b$, and $x$ are real-valued. Let

$$f(x) = \|Ax - b\|_2^2$$

The LS problem

$$\min_{x \in \mathbb{R}^n} f(x)$$

is an unconstrained optimization problem. Since $f$ is convex, the necessary and sufficient condition for $x_{LS}$ to be a solution is that

$$\nabla f \big|_{x = x_{LS}} = 0$$


We can expand

$$f(x) = x^T A^T A x - 2x^T A^T b + b^T b$$

The gradient of $f$ is

$$\nabla f = 2A^T A x - 2A^T b$$

Hence, an optimal solution $x_{LS}$ can be found by solving

$$A^T A x_{LS} = A^T b$$

For the complex case, it can be shown (in a similar way, but with more effort) that

$$A^H A x_{LS} = A^H b$$


Alternative II for solving LS: use the Orthogonality Principle

Theorem 8.1 (Orthogonality Principle) A vector $x_{LS}$ is an LS solution if and only if

$$A^H (A x_{LS} - b) = 0$$


The equations
$$A^H A x_{LS} = A^H b$$
are referred to as the normal equations.

If $A$ has full column rank, so that $A^H A$ is positive definite (PD), then $x_{LS}$ is uniquely determined by

$$x_{LS} = (A^H A)^{-1} A^H b$$
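A minimal NumPy sketch, using arbitrary full-column-rank data, showing that solving the normal equations agrees with np.linalg.lstsq (which uses an SVD internally and is usually preferred for numerical reasons):

```python
import numpy as np

rng = np.random.default_rng(7)
m, n = 10, 4
A = rng.standard_normal((m, n)) + 1j * rng.standard_normal((m, n))
b = rng.standard_normal(m) + 1j * rng.standard_normal(m)

# Solve the normal equations A^H A x = A^H b directly.
AH = A.conj().T
x_normal = np.linalg.solve(AH @ A, AH @ b)

# Reference solution from np.linalg.lstsq.
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)
print("max difference:", np.max(np.abs(x_normal - x_lstsq)))
```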


Interpretations of the Normal Equations

Let $r_{LS} = b - A x_{LS}$ be the LS error vector.

For full column rank $A$,

$$r_{LS} = b - A(A^H A)^{-1} A^H b = b - Pb = P^{\perp} b$$

where $P = A(A^H A)^{-1} A^H$ is the orthogonal projection matrix onto $\mathcal{R}(A)$, and $P^{\perp} = I - P$ is the projection onto its orthogonal complement.

This means that the LS error is orthogonal to every vector in $\mathcal{R}(A)$.
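A quick numerical check of this orthogonality property, with arbitrary synthetic data; $A^H r_{LS}$ should be zero up to rounding error:

```python
import numpy as np

rng = np.random.default_rng(8)
A = rng.standard_normal((12, 3))
b = rng.standard_normal(12)

x_ls, *_ = np.linalg.lstsq(A, b, rcond=None)
r_ls = b - A @ x_ls

# The LS residual is orthogonal to the range of A: A^H r_LS = 0.
print("A^T r_LS =", A.T @ r_ls)        # ~[0, 0, 0] up to rounding error
```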


LS for Rank Deficient A

As we mentioned, for rank-deficient $A$ there is more than one LS solution.

But we can find a unique $x_{LS}$ whose 2-norm is the smallest among all LS solutions.


Let $r = \operatorname{rank}(A)$, and denote the SVD of $A$ by

$$A = U\Sigma V^H = [\, U_1 \;\; U_2 \,] \begin{bmatrix} \tilde\Sigma & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} V_1^H \\ V_2^H \end{bmatrix}$$

where $\tilde\Sigma = \operatorname{Diag}(\sigma_1, \dots, \sigma_r)$ contains the nonzero singular values of $A$.

Define

$$A^{\dagger} = V_1 \tilde\Sigma^{-1} U_1^H$$

to be the pseudo-inverse of $A$.


Theorem 8.2 The solution of the minimum 2-norm problem

$$\min \|x\|_2^2 \quad \text{s.t. } x \text{ minimizes } \|Ax - b\|_2^2$$

is uniquely given by

$$x_{LS} = A^{\dagger} b$$
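A minimal NumPy sketch with a deliberately rank-deficient $A$; np.linalg.pinv computes $A^{\dagger}$ via the SVD, and np.linalg.lstsq also returns the minimum-norm solution:

```python
import numpy as np

rng = np.random.default_rng(9)
m, n, r = 8, 4, 2
# Rank-deficient A (rank r < n) built from random factors.
A = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
b = rng.standard_normal(m)

x_pinv = np.linalg.pinv(A) @ b                      # minimum 2-norm LS solution
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)     # also the min-norm solution
print("match:", np.allclose(x_pinv, x_lstsq))

# Any x_pinv + z with z in N(A) has the same residual but a larger norm.
_, _, Vh = np.linalg.svd(A)
z = Vh[-1]                                          # a null-space direction of A
print("same residual:", np.allclose(A @ (x_pinv + z) - b, A @ x_pinv - b))
print("norms:", np.linalg.norm(x_pinv), "<", np.linalg.norm(x_pinv + z))
```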


Note that

$$r_{LS} = b - A x_{LS} = b - (U_1 \tilde\Sigma V_1^H)(V_1 \tilde\Sigma^{-1} U_1^H) b = b - U_1 U_1^H b = b - Pb = P^{\perp} b$$

where $P = U_1 U_1^H$ is the orthogonal projection matrix onto $\mathcal{R}(A)$.

This orthogonality property is the same as in the case of full column rank $A$.


Some Relationships of the Pseudo-Inverse

1. For the case of full column rank $A$ (i.e., $m \geq n$, $\operatorname{rank}(A) = n$),

$$A^{\dagger} = (A^H A)^{-1} A^H$$

which means that $A^{\dagger} b$ coincides with the LS solution in the full column rank case.

2. For the case of full row rank $A$ (i.e., $m \leq n$, $\operatorname{rank}(A) = m$),

$$A^{\dagger} = A^H (A A^H)^{-1}$$


Relationship to the generalized inverse

A matrix $C \in \mathbb{C}^{n \times m}$ is said to be the Moore-Penrose generalized inverse of $A$ if the following four conditions hold:

1. $ACA = A$

2. $CAC = C$

3. $(AC)^H = AC$

4. $(CA)^H = CA$

It can be verified that $A^{\dagger}$ is the Moore-Penrose generalized inverse of $A$.
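A quick numerical verification of the four conditions, using np.linalg.pinv for $A^{\dagger}$ and an arbitrary synthetic $A$ (for real matrices the conjugate transpose reduces to the transpose):

```python
import numpy as np

rng = np.random.default_rng(10)
A = rng.standard_normal((6, 3)) @ rng.standard_normal((3, 4))   # arbitrary rank-deficient A
C = np.linalg.pinv(A)                                           # candidate for A^dagger

print(np.allclose(A @ C @ A, A))          # 1. ACA = A
print(np.allclose(C @ A @ C, C))          # 2. CAC = C
print(np.allclose((A @ C).T, A @ C))      # 3. (AC)^H = AC  (real case: transpose)
print(np.allclose((C @ A).T, C @ A))      # 4. (CA)^H = CA
```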
