Advanced Signal Processing: Introduction To Estimation Theory
Danilo Mandic,
room 813, ext: 46271
Introduction to Estimation: Aims of this lecture
◦ Notions of an Estimator, Estimate, Estimandum
◦ The bias and variance in statistical estimation theory, asymptotically
unbiased and consistent estimators
◦ Performance metrics, such as the Mean Square Error (MSE)
◦ The bias–variance dilemma and the MSE, feasible MSE estimators
◦ A class of Minimum Variance Unbiased (MVU) estimators, that is, out
of all unbiased estimators find those with the lowest possible variance
◦ Extension to the vector parameter case
◦ Statistical goodness of an estimator, the role of noise
◦ Enabling technology for many applications: radar and sonar (range
and azimuth), image analysis (motion estimation), speech (features in
recognition and identification), seismics (oil reservoirs), communications
(equalisation, symbol detection), biomedicine (ECG, EEG, respiration)
An example from Lecture 2: Optimality in model order
selection (under- vs. over-fitting)
Original AR(2) process x[n] = −0.2x[n − 1] − 0.9x[n − 2] + w[n],
w[n] ∼ N (0, 1), estimated using AR(1), AR(2) and AR(20) models:
[Figure: (left) a segment of the original AR(2) signal, plotted against Time [sample], together with the AR(1), AR(2) and AR(20) model fits; (right) the estimated model coefficients against Coefficient index. Prediction errors: AR(1) = 5.2627, AR(2) = 1.0421, AR(20) = 1.0621.]
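A minimal numpy sketch of this experiment (a least-squares AR fit is assumed here; the error values quoted above come from the lecture's particular realisation, so a fresh run will give different numbers):

import numpy as np

np.random.seed(0)
N = 1000
w = np.random.randn(N)                        # driving noise w[n] ~ N(0, 1)

# generate the AR(2) process x[n] = -0.2 x[n-1] - 0.9 x[n-2] + w[n]
x = np.zeros(N)
for n in range(2, N):
    x[n] = -0.2 * x[n - 1] - 0.9 * x[n - 2] + w[n]

def ar_fit(x, p):
    # least-squares fit of an AR(p) model: returns coefficients and residual power
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    return a, np.mean((y - X @ a) ** 2)

for p in (1, 2, 20):
    _, err = ar_fit(x, p)
    print(f"AR({p}) prediction error: {err:.4f}")   # AR(1) clearly under-fits, AR(20) slightly over-fits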
Discrete–time estimation problem
(try also the function specgram in Matlab # it produces a time–frequency (TF) diagram)
[Figure: the PDF p(x[0]; A), plotted against the observation x[0] and the DC level A.]
p(x[0]; θi) = (1/√(2πσ²)) exp( −(x[0] − θi)²/(2σ²) ),   i = 1, 2

[Figure: the two candidate densities, centred at θ1 = A1 and θ2 = A2, plotted against x[0].]

Clearly, the observed value of x[0] critically impacts upon the likely value of the parameter θ (here, the DC level A).
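A small numerical illustration of this point; the values of x[0], A1, A2 and σ below are illustrative only, not taken from the slide:

import numpy as np

def likelihood(x0, theta, sigma=1.0):
    # p(x[0]; theta) for a single observation x[0] = theta + w[0], w[0] ~ N(0, sigma^2)
    return np.exp(-(x0 - theta) ** 2 / (2 * sigma ** 2)) / np.sqrt(2 * np.pi * sigma ** 2)

x0 = 0.7                    # observed value (illustrative)
A1, A2 = 0.0, 3.0           # two candidate DC levels (illustrative)
print(likelihood(x0, A1), likelihood(x0, A2))   # x[0] = 0.7 makes A1 far more plausible than A2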
Estimator vs. Estimate
specification of the PDF is critical to determining a good estimator
Example 3: Finding the parameters of a straight line
recall that we have the N observed points x[n], n = 0, . . . , N − 1, collected in the vector x
In practice, the chosen PDF should fit the problem set–up and incorporate
any “prior” information; it must also be mathematically tractable.
Example: Assume that “on the average” the data values are increasing.
Data: a straight line embedded in random noise, w[n] ∼ N(0, σ²).
Unknown parameters: A, B ⇔ θ ≡ [A, B]^T
[Figure: noisy data points scattered about the ideal noiseless line, with intercept A at n = 0.]
Careful: What would be the effects of bias in A and B?
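A minimal sketch of estimating A and B by least squares, assuming the standard linear model x[n] = A + Bn + w[n] for this example (the particular values of A, B and σ below are illustrative):

import numpy as np

np.random.seed(1)
N, A_true, B_true, sigma = 100, 1.0, 0.05, 0.5
n = np.arange(N)
x = A_true + B_true * n + sigma * np.random.randn(N)    # noisy straight line

H = np.column_stack([np.ones(N), n])                    # design matrix for theta = [A, B]^T
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)       # least-squares estimate
print("A_hat, B_hat =", theta_hat)                      # close to (1.0, 0.05)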
Bias in parameter estimation
Our goal: Estimate the value of an unknown parameter, θ, from a set of
observations of a random variable described by that parameter
θ̂ = g( x[0], x[1], . . . , x[N − 1] )   (θ̂ is a RV too)
Example: Given a set of observations from a Gaussian distribution,
estimate the mean or variance from these observations.
◦ Recall that in linear mean square estimation, when estimating the value
of a random variable y from an observation of a related random variable
x, the coefficients A and B within the estimator y = Ax + B depend
upon the mean and variance of x and y, as well as on their correlation.
The difference between the expected value of the estimate, θ̂, and
the actual value, θ, is called the bias and will be denoted by B.
B = E{θ̂N } − θ
where θ̂N denotes estimation over N data samples, x[0], . . . , x[N − 1].
Example 4: When estimating a DC level in noise, x[n] = A + w[n], the estimator Â = (1/N) Σ_{n=0}^{N−1} |x[n]| is biased for A < 0. (see Appendix)
Now that we have a statistical estimation set–up
how do we measure “goodness” of the estimate?
Noise w is usually assumed white with i.i.d. samples (independent,
identically distributed)
whiteness often does not hold in real–world scenarios
Gaussianity is more realistic, due to validity of Central Limit Theorem
zero–mean noise is a nearly universal assumption; it is realistic since any non–zero–mean noise can be written as
w[n] = wzm[n] + µ,   where wzm[n] is zero–mean noise and µ is the mean
Good news: We can use these assumptions to find a bound on the
performance of “optimal” estimators.
More good news: Then, the performance of any practical estimator and
for any noise statistics will be bounded by that theoretical bound!
◦ Variance of noise does not always have to be known to make an estimate
◦ But, we must have tools to assess the “goodness” of the estimate
◦ Usually, the goodness analysis is a function of the noise variance σw²,
expressed in terms of the SNR = signal to noise ratio. (noise sets SNR level)
An alternative assessment via the estimation error
Since θ̂ is a RV, it has a PDF of its own (more in the next lecture on CRLB)
[Figure: the PDF p(θ̂) of the estimate, concentrated about the true value θ, and the PDF p(η) of the estimation error, concentrated about zero.]
Asymptotic unbiasedness
If the bias is zero, then for sufficiently many observations of x[n] (N large),
the expected value of the estimate, θ̂, is equal to its true value, that is
E{θ̂N} = θ,   that is,   B = E{θ̂N} − θ = 0
and the estimate is said to be unbiased.
If B ≠ 0, then the estimator θ̂ = g(x) is said to be biased.
Example 5: Consider the sample mean estimator of the DC level in WGN, x[n] = A + w[n], w ∼ N(0, 1), given by
Â = x̄ = (1/(N + 2)) Σ_{n=0}^{N−1} x[n],   that is, θ = A
Is the above sample mean estimator of the true mean A biased?
Observe: This estimator is biased, but the bias B → 0 when N → ∞, so that
lim_{N→∞} E{θ̂N} = θ
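A one-line check, using the linearity of E{·} and E{x[n]} = A:
E{Â} = (1/(N + 2)) Σ_{n=0}^{N−1} E{x[n]} = NA/(N + 2),   so that   B = E{Â} − A = −2A/(N + 2) → 0 as N → ∞
i.e. the estimator is biased for every finite N (unless A = 0), yet asymptotically unbiased.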
Example 6: Asymptotically unbiased estimator of DC
level in noise
Consider the measurements x[n] = A + w[n], w ∼ N(1, σ² = 1),
and the estimator   Â = (1/(N + 2)) Σ_{n=0}^{N−1} x[n]
How about the variance?
◦ It is desirable that an estimator be either unbiased or asymptotically
unbiased (think about the power of estimation error due to DC offset)
◦ For an estimate to be meaningful, it is necessary that we use the
available statistics effectively, that is,
var(θ̂) → 0 as N →∞
or, in other words,
lim_{N→∞} var{θ̂N} = lim_{N→∞} E{ (θ̂N − E{θ̂N})² } = 0
so that, by Chebyshev's inequality, for any ε > 0,
Pr{ |θ̂N − θ| ≥ ε } ≤ var{θ̂N} / ε²
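For instance, for the sample mean of N i.i.d. samples of variance σ², var{Â} = σ²/N, so that
Pr{ |Â − A| ≥ ε } ≤ σ²/(N ε²) → 0 as N → ∞
that is, the estimate converges to the true value in probability (consistency).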
Mean square convergence
NB: Mean square error criterion is very different from the variance criterion
x[n] = A + w[n]
◦ Intuitively, the sample mean is a reasonable estimator, and has the form
Â = (1/N) Σ_{n=0}^{N−1} x[n]
Example 7 (contd.): Mean and variance of the Sample
Mean estimator
Estimator = f( random data ) =⇒ it is a random variable itself
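For completeness, a sketch of the standard calculation behind this slide, for x[n] = A + w[n] with i.i.d. zero–mean noise of variance σ²:
E{Â} = (1/N) Σ_{n=0}^{N−1} E{x[n]} = A   (unbiased)
var{Â} = (1/N²) Σ_{n=0}^{N−1} var{x[n]} = σ²/N → 0 as N → ∞   (consistent)
and, for Gaussian noise, Â ∼ N(A, σ²/N) (see the Recap slide).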
Some intricacies which are often not fully spelled–out
In our example, each data sample has the same mean, namely A (a statement of probability theory),
and this mean, A, is exactly the quantity we are trying to estimate;
we estimate A using the sample mean, Â = (1/N) Σ_{n=0}^{N−1} x[n] (a statement of statistics).
Minimum Variance Unbiased (MVU) estimation
Aim: To establish “good” estimators of unknown deterministic parameters
Unbiased estimator # “on the average” yields the true value of the
unknown parameter, independently of its particular value, i.e.
E(θ̂) = θ a<θ<b
Careful: The estimator is parameter dependent!
An estimator may be unbiased for certain values of the unknown
parameter but not for all values; such an estimator is biased
Example 9: Consider another sample mean estimator of a DC level:
Ǎ = (1/(2N)) Σ_{n=0}^{N−1} x[n]
Therefore: E{Ǎ} = 0 when A = 0, but E{Ǎ} = A/2 when A ≠ 0 (parameter dependent).
Hence Ǎ is not an unbiased estimator.
◦ A biased estimator introduces a “systematic error” which should not be
present if at all possible
◦ Our goal is to avoid bias if we can, as we are interested in stochastic
signal properties and bias is largely deterministic
Effects of averaging for real world data
Problem 3.4 from your P/A sets: heart rate estimation
The heart rate, h, of a patient is automatically recorded by a computer every 100 ms. One second of the measurements, {ĥ1, ĥ2, . . . , ĥ10}, is averaged to obtain ĥ. Given that E{ĥi} = αh for some constant α and var(ĥi) = 1 for all i, determine whether averaging improves the estimator, for α = 1 and α = 1/2.

ĥ = (1/10) Σ_{i=1}^{10} ĥi
E{ĥ} = (1/10) Σ_{i=1}^{10} αh = αh
var{ĥ} = (1/10²) Σ_{i=1}^{10} var{ĥi} = 1/10

[Figure: the PDFs of a single measurement ĥi and of the average ĥ, before and after averaging. For α = 1 both are centred at h, and averaging narrows the PDF; for α = 1/2 both are centred at h/2, so averaging reduces the variance but cannot remove the bias.]
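A quick Monte Carlo check of this result (a sketch; Gaussian measurement noise is assumed here, which the problem statement does not specify):

import numpy as np

rng = np.random.default_rng(0)
h, trials = 70.0, 100_000                     # true heart rate (illustrative) and number of experiments

for alpha in (1.0, 0.5):
    hi = alpha * h + rng.standard_normal((trials, 10))   # E{h_i} = alpha*h, var(h_i) = 1
    h_bar = hi.mean(axis=1)                               # the averaged estimate
    print(f"alpha={alpha}:  MSE(single) = {np.mean((hi[:, 0] - h)**2):.1f},"
          f"  MSE(averaged) = {np.mean((h_bar - h)**2):.1f}")
# alpha = 1   : averaging reduces the MSE from ~1 to ~0.1 (pure variance reduction)
# alpha = 1/2 : both MSEs are dominated by the bias term (h/2 - h)^2, so averaging barely helps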
Remedy: How about averaging? Averaging data segments vs
averaging estimators? Also look in your CW Assignment dealing with PSD.
Our assumption was that the individual estimates, θ̂l = g(x), are unbiased,
with equal variances, and mutually uncorrelated.
Then (NB: averaging biased estimators will not remove the bias)
E{θ̂} = θ
and
var{θ̂} = (1/L²) Σ_{l=1}^{L} var{θ̂l} = (1/L) var{θ̂l}
Mean square error criterion & bias – variance dilemma
An optimality criterion is necessary to define an optimal estimator
One such natural criterion is the Mean Square Error (MSE), given by
MSE(θ̂) = E{ (θ̂ − θ)² } = E{ error² }
which measures the mean squared deviation of the estimate, θ̂, from the true value (the error power).
MSE(θ̂) = E{ (θ̂ − θ)² } = E{ [ (θ̂ − E{θ̂}) + (E{θ̂} − θ) ]² }      (the second bracket is the bias, B(θ̂))
= E{ (θ̂ − E{θ̂})² } + 2 B(θ̂) E{ θ̂ − E{θ̂} } + B²(θ̂)      (the middle term vanishes, since E{θ̂ − E{θ̂}} = 0)
= var(θ̂) + B²(θ̂)
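A numerical sanity check of this decomposition (a sketch, using the biased estimator of Example 5 as a test case):

import numpy as np

rng = np.random.default_rng(0)
A, N, sigma, trials = 2.0, 10, 1.0, 200_000

x = A + sigma * rng.standard_normal((trials, N))
A_hat = x.sum(axis=1) / (N + 2)                  # the biased estimator of Example 5

mse = np.mean((A_hat - A) ** 2)
bias = np.mean(A_hat) - A
print(mse, np.var(A_hat) + bias ** 2)            # the two numbers agree up to Monte Carlo error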
Example 10: An MSE estimator with a ’gain factor’
(motivation for unbiased estimators)
Consider the following estimator for DC level in WGN
Â = (a/N) Σ_{n=0}^{N−1} x[n]
for which E{Â} = aA and var(Â) = a²σ²/N, so that we have
MSE(Â) = a²σ²/N + (a − 1)²A²
Of course, the choice a = 1 removes the bias, but does it also minimise the MSE?
Example 10: (continued) An MSE estimator with a ’gain’
(is a biased estimator feasible?)
Can we find an optimum a analytically? Differentiate the MSE with respect to a to yield
∂MSE(Â)/∂a = 2aσ²/N + 2(a − 1)A²
and set the result to zero to arrive at the optimal value
aopt = A²/( A² + σ²/N )
The optimal gain aopt depends on the unknown parameter A, so the resulting estimator is not realisable # it belongs to those estimators which are not solely a function of the data (see Example 6).
Practically, the minimum MSE (MMSE) estimator needs to be
abandoned, and the estimator must be constrained to be unbiased.
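To see the realisability issue numerically, one can compare the two MSEs for an assumed true A (a sketch with illustrative values; note that aopt itself requires this unknown A, which is precisely why the estimator is impractical):

A, sigma2, N = 1.0, 1.0, 10                  # illustrative true DC level, noise variance, data length

def mse(a):
    # MSE(A_hat) = a^2*sigma^2/N + (a-1)^2*A^2 for the scaled sample mean
    return a ** 2 * sigma2 / N + (a - 1) ** 2 * A ** 2

a_opt = A ** 2 / (A ** 2 + sigma2 / N)
print("MSE(a = 1)    =", mse(1.0))           # sigma^2/N = 0.1 (the unbiased choice)
print("MSE(a = aopt) =", mse(a_opt))         # ~0.091, smaller, but aopt needs the unknown A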
Minimum variance estimation & MSE criterion, together
Basic idea of MVU: Out of all possible unbiased estimators, find the one
with the lowest variance.
If the Mean Square Error (MSE) is used as a criterion, this means that
MSE(θ̂) = var(θ̂) + B²(θ̂),   with   B(θ̂) = 0 for an MVU estimator
By constraining the bias to be zero, our task is much easier, that is, to find
an estimator that minimises the variance.
◦ In this way, the realisability problem of MSE is completely avoided.
Have you noticed:
MVU estimator = Minimum mean square error unbiased estimator
We will use the acronym MVUE for minimum variance unbiased estimator.
(see the Appendix for an alternative relation between the error function
and estimator quality)
Desired: minimum variance unbiased (MVU) estimator
Minimising the variance of an unbiased estimator concentrates the PDF of
the error about zero ⇒ estimation error is therefore less likely to be large
◦ Existence of the MVU estimator
[Figure: the variance of three unbiased estimators, θ̂1, θ̂2 and θ̂3, plotted as a function of θ. Left: θ̂3 has the smallest variance for every θ, so θ̂3 is a MVU estimator. Right: no single estimator has the smallest variance for all θ, so no MVU estimator exists.]
Methods to find the MVU estimator
The MVU estimator may not always exist, for example, when:
◦ There are no unbiased estimators, in which case a search for the MVU is futile
◦ None of the unbiased estimators has uniformly minimum variance, as in
the right hand side figure on the previous slide
If the MVU estimator (MVUE) exists, we may not always be able to find
it. While there is no general “turn-the-crank” method for this purpose,
the approaches to finding the MVUE employ the following procedures:
◦ Determine the Cramer-Rao lower bound (CRLB) and find some
estimator which satisfies the so defined MVU criteria (Lecture 4)
◦ Apply the Rao-Blackwell-Lehmann-Scheffe (RBLS) theorem (rare in pract.)
◦ Restrict the class of estimators to be not only unbiased, but also linear
in the parameters, this gives MVU for linear problems (Lecture 5)
◦ Employ optimisation and prior knowledge about the model (Lecture 6)
◦ Choose a suitable real–time adaptive estimation architecture and
perform on-line estimation on streaming data (Lecture 7)
Extensions to the vector parameter case
◦ If θ = [θ1, θ2, . . . , θp]^T ∈ R^{p×1} is a vector of unknown parameters, an estimator θ̂ is said to be unbiased if
E(θ̂i) = θi,   where ai < θi < bi,   for i = 1, 2, . . . , p
By defining
E(θ̂) = [ E(θ̂1), E(θ̂2), . . . , E(θ̂p) ]^T
an unbiased estimator has the property E(θ̂) = θ within the
p–dimensional space of parameters spanned by θ = [θ1, . . . , θp]^T.
◦ An MVU estimator has the additional property that its var(θ̂i), for
i = 1, 2, . . . , p, is the minimum among all unbiased estimators.
Summary
◦ We are now equipped with performance metrics for assessing the
goodness of any estimator (bias, variance, MSE).
◦ Since MSE = var + bias², some biased estimators may yield low MSE.
However, we prefer minimum variance unbiased (MVU) estimators.
◦ Even a simple Sample Mean estimator is an example of the power of
statistical estimators.
◦ The knowledge of the parametrised PDF p(data;parameters) is very
important for designing efficient estimators.
◦ We have introduced statistical “point estimators”, would it be useful to
also know the “confidence” we have in our point estimate?
◦ In many disciplines it is useful to design so called “set membership
estimates”, where the output of an estimator belongs to a pre-defined
bound (range) of values.
◦ In our course, we will address linear, best linear unbiased, maximum
likelihood, least squares, sequential least squares, and adaptive estimators.
Homework: Check another proof for the MSE expression
MSE(θ̂) = var(θ̂) + bias²(θ̂)
Note:   var(x) = E[x²] − ( E[x] )²     (∗)
Idea:   Let x = θ̂ − θ and substitute into (∗) to give
var(θ̂ − θ) = E[ (θ̂ − θ)² ] − ( E[θ̂ − θ] )²     (∗∗)
term (1) = var(θ̂ − θ),   term (2) = E[(θ̂ − θ)²],   term (3) = ( E[θ̂ − θ] )²
Recap: Unbiased estimators
Due to the linearity of the statistical expectation operator, E{·}, the sample mean estimator of a DC level in WGN satisfies E{Â} = A and var(Â) = σ²/N, that is,
Â ∼ N(A, σ²/N)
Appendix: Some usual assumptions in the analysis
How realistic are the assumptions on the noise?
◦ Whiteness of the noise is quite realistic to assume, unless the evidence
or physical insight suggest otherwise
◦ The independent identically distributed (i.i.d.) assumption is
straightforward to remove through e.g. the weighting matrix
W = diag( 1/σ₀², . . . , 1/σ²_{N−1} )   (see Lectures 5 and 6)
◦ In real world scenarios, whiteness is often replaced by bandpass
correlated noise (e.g. pink or 1/f noise in physiological recordings)
◦ The assumption of Gaussianity is often realistic to keep, due to e.g. the
validity of Central Limit Theorem
Is the zero–mean assumption realistic? Yes, as even for non–zero mean
noise, w[n] = wzm[n] + µ, where wzm[n] is zero–mean noise, the mean of
the noise µ can be incorporated into the signal model.
Do we always need to know noise variance? In principle no, but when
assessing performance (goodness) variance is needed to measure the SNR.
Appendix. Example 11: A counter-example # a little
bias can help (but the estimator is difficult to control)
Q: Let {y[n]}, n = 1, . . . , N, be i.i.d. Gaussian variables ∼ N(0, σ²).
Consider the following estimate of σ²
σ̂² = (α/N) Σ_{n=1}^{N} y²[n],   α > 0
Find the α which minimises the MSE of σ̂².
A: It is straightforward to show that E{σ̂²} = ασ², and that
MSE(σ̂²) = E{ (σ̂² − σ²)² } = E{σ̂⁴} + σ⁴(1 − 2α)
= (α²/N²) Σ_{n=1}^{N} Σ_{s=1}^{N} E{ y²[n] y²[s] } + σ⁴(1 − 2α)     ( Hint: (Σ_n)² = Σ_n Σ_s )
= (α²/N²) [ N²σ⁴ + 2Nσ⁴ ] + σ⁴(1 − 2α) = σ⁴ [ α²(1 + 2/N) + (1 − 2α) ]
The MMSE is obtained for α_min = N/(N + 2), and equals MMSE(σ̂²) = 2σ⁴/(N + 2).
Given that the variance of the corresponding optimal unbiased estimator of σ² (the CRLB, covered later) is 2σ⁴/N, this is an example of a biased estimator
which obtains a lower MSE than the CRLB.
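A Monte Carlo check of this counter-example (a sketch):

import numpy as np

rng = np.random.default_rng(0)
sigma2, N, trials = 1.0, 10, 200_000
y = np.sqrt(sigma2) * rng.standard_normal((trials, N))

for alpha in (1.0, N / (N + 2)):
    var_hat = alpha * np.mean(y ** 2, axis=1)              # estimate of sigma^2 with gain alpha
    print(f"alpha = {alpha:.3f}:  MSE = {np.mean((var_hat - sigma2) ** 2):.4f}")
# alpha = 1       : MSE ~ 2*sigma^4/N     = 0.2000
# alpha = N/(N+2) : MSE ~ 2*sigma^4/(N+2) = 0.1667  (the slightly biased choice wins)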
Appendix (full analysis of Example 4)
Biased estimator:
Ã = (1/N) Σ_{n=1}^{N} |x[n]|
Therefore,
◦ if A ≥ 0 (and the noise rarely drives x[n] negative), then |x[n]| = x[n] and E{Ã} ≈ A
⇒ Bias = 0 for A ≥ 0,   Bias ≠ 0 for A < 0
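A quick numerical illustration of this bias (a sketch, assuming Gaussian noise of unit variance and illustrative values of A):

import numpy as np

rng = np.random.default_rng(0)
N, trials = 1000, 2000

for A in (2.0, -2.0):
    x = A + rng.standard_normal((trials, N))
    A_tilde = np.mean(np.abs(x), axis=1)       # the absolute-value based estimator
    print(f"A = {A:+.1f}:  mean of A_tilde = {A_tilde.mean():+.3f}")
# for A = +2 the estimate is close to +2, but for A = -2 it is also close to +2:
# the absolute value discards the sign of A, so the estimator is heavily biased for A < 0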