SSPI Lecture 3 Estimation Intro 2025
Danilo Mandic
room 813, ext: 46271
Consider the use of historical sunspot samples in the task of Sunspot Number Prediction.
This highlights the need for a unifying & rigorous framework for the
assessment of “goodness of performance” of any Data Analytics model,
from the simplest “persistent” estimate, to linear ARMA processes,
through to nonlinear Neural Network models, the subject of this Lecture.
We will typically consider prediction/forecasting scenarios:
Prediction: Employs an already built model (based on in-sample data,
training data) to estimate out-of-sample values (prediction, inference).
Forecasting: A type of prediction which implicitly assumes time-series data,
where historical data are used to predict future data. It often involves
“confidence intervals” (e.g. there is a 20% chance of rain at 14:00).
Example from Lecture 2: How expressive are these
inference models (e.g. under-fitting vs. over-fitting)
Original AR(2) process x[n] = −0.2x[n − 1] − 0.9x[n − 2] + w[n],
w[n] ∼ N (0, 1), is estimated using AR(1), AR(2) and AR(20) models.
[Figure: a segment of the original AR(2) signal together with the AR(1), AR(2) and AR(20) model fits, plotted against Time [sample] (left), and the estimated coefficients plotted against Coefficient index (right). Modelling errors: AR(1) = 5.2627, AR(2) = 1.0421, AR(20) = 1.0621.]
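A minimal sketch of this experiment, assuming a least-squares fit of the AR coefficients (the exact error values in the figure came from one particular noise realisation, so your numbers will differ):

import numpy as np

rng = np.random.default_rng(0)

# Generate the original AR(2) process x[n] = -0.2 x[n-1] - 0.9 x[n-2] + w[n], w[n] ~ N(0,1)
N = 1000
x = np.zeros(N)
w = rng.standard_normal(N)
for n in range(2, N):
    x[n] = -0.2 * x[n - 1] - 0.9 * x[n - 2] + w[n]

def fit_ar(x, p):
    """Least-squares fit of an AR(p) model; returns coefficients and residual (error) power."""
    # Regressors: x[n-1], ..., x[n-p] for each target sample x[n]
    X = np.column_stack([x[p - k - 1: len(x) - k - 1] for k in range(p)])
    y = x[p:]
    a, *_ = np.linalg.lstsq(X, y, rcond=None)
    e = y - X @ a
    return a, np.mean(e ** 2)

for p in (1, 2, 20):
    a, err = fit_ar(x, p)
    # For p = 2 the coefficients should be close to the true values (-0.2, -0.9)
    print(f"AR({p}): error power = {err:.4f}, first coeffs = {np.round(a[:2], 3)}")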
[Figure: raw data and fitted models of increasing order, plotted against Sample index, n. Order-0 (single-parameter model, s[n] = A): error power = 195.05; Order-1 (two-parameter model, s[n] = A + Bn): error power = 98.39; Order-7: error power = 94.11.]
Each observed data record contains the desired s[n] but also
a different realisation of the noise w[n].
The estimated frequency, f̂₀, and phase, Φ̂₀, are therefore random variables.
Our goal: Find an estimator which maps the data x to the estimates
f̂₀ = g₁(x) and Φ̂₀ = g₂(x).
The RVs f̂₀, Φ̂₀ are best described via a probabilistic model which depends on:
the structure of s[n], the pdf of w[n], and the form of g(x).
[Figure: time- and frequency-domain views of the noisy sinusoid; observe also the mathematical artefacts in the spectrum.]
Statistical estimation problem (learning from data)
The notation p(x; θ) indicates that θ are not random but unknown parameters
Problem statement: Given an N -point dataset, x[0], x[1], . . . , x[N − 1],
which depends on an unknown scalar parameter, θ, an estimator is
defined as a function, g(·), of the dataset {x}, that is
θ̂ = g(x[0], x[1], . . . , x[N − 1])
which may be used to estimate θ. (single parameter or “scalar” case)
Vector case: Analogously to the scalar case, we seek to determine a set of
parameters, θ = [θ1, . . . , θp]T , from data samples x = [x[0], . . . , x[N − 1]]T
such that the values of these parameters would yield the highest
probability of obtaining the observed data. This can be formalised as
max_θ p(x; θ),   where p(x; θ) reads: “p(x) parametrised by θ”
We will use p(x; θ) to find θ̂ = g(x).
When we know this PDF, we can design optimal estimators
In practice, this PDF is not given, and our goal is to choose a model which:
◦ Captures the essence of the signal generating physical model;
◦ Leads to a mathematically tractable form of an estimator.
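As a toy illustration (my own example, not from the slides), assume WGN observations x[n] = A + w[n] with known σ². The value of A that maximises the Gaussian likelihood p(x; A) can be found numerically, and it coincides with the sample mean:

import numpy as np

rng = np.random.default_rng(1)
A_true, sigma, N = 3.0, 1.0, 50            # arbitrary values for the illustration
x = A_true + sigma * rng.standard_normal(N)  # observations x[n] = A + w[n]

def log_likelihood(A, x, sigma):
    """log p(x; A) for i.i.d. Gaussian samples with mean A and variance sigma^2."""
    n = x.size
    return -0.5 * n * np.log(2 * np.pi * sigma**2) - np.sum((x - A)**2) / (2 * sigma**2)

# Grid search over candidate values of A, i.e. maximise p(x; A) over A
A_grid = np.linspace(0, 6, 601)
ll = np.array([log_likelihood(A, x, sigma) for A in A_grid])
A_hat = A_grid[np.argmax(ll)]
print(f"ML estimate over the grid: {A_hat:.3f}, sample mean: {x.mean():.3f}")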
Random variable (RV), some general observations
A random variable quantifies the outcome of a random event.
For example, “heads” or “tails” on a coin, or a blue square on a Rubik’s cube,
are not random variables per se, but can be mapped onto random variables (e.g. heads → 1, tails → 0).
[Figure: the pdf p(x[0]; A) as a function of the observation x[0] and the DC level A.]
p(x[0]; θ_i) = (1/√(2πσ²)) exp( −(x[0] − θ_i)²/(2σ²) ),   i = 1, 2
Clearly, the observed value of x[0] critically impacts upon the likely value of the parameter θ (here, the DC level A).
[Figure: two candidate pdfs, p(x[0]; θ₁ = A₁) and p(x[0]; θ₂ = A₂), plotted against x[0].]
[Figure: daily returns of Crude Oil vs. the Energy Sector (Vanguard Energy ETF), data from Apr. 2024, with a regression line (left), and the residuals of the linear fit (right).]
[Figure: S&P 500 vs. Gold prices in April 2024, with linear and quadratic fits (left), and the residuals of both fits (right); linear fit sum(Res²) = 47375, quadratic fit sum(Res²) = 42270.]
In practice, the chosen PDF should fit the problem set–up and incorporate
any “prior” information; it must also be mathematically tractable.
Example: Assume that “on the average” the data values are increasing.
Data: a straight line embedded in random noise, w[n] ∼ N(0, σ²), that is, x[n] = A + Bn + w[n].
Unknown parameters: A, B ⇔ θ ≡ [A B]^T
[Figure: the ideal noiseless line, starting at level A at n = 0 and increasing with the sample index n.]
Careful: What are the effects of bias in A and B on the previous example?
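A minimal sketch of estimating θ = [A, B]^T from such data by ordinary least squares (my own illustration; the slides do not fix a particular estimator at this point, and the parameter values below are arbitrary):

import numpy as np

rng = np.random.default_rng(2)
N, A_true, B_true, sigma = 100, 1.0, 0.05, 0.5
n = np.arange(N)
x = A_true + B_true * n + sigma * rng.standard_normal(N)   # x[n] = A + B*n + w[n]

H = np.column_stack([np.ones(N), n])          # observation matrix for s[n] = A + B*n
theta_hat, *_ = np.linalg.lstsq(H, x, rcond=None)
print(f"A_hat = {theta_hat[0]:.3f}, B_hat = {theta_hat[1]:.4f}")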
How do we measure the “goodness” of the estimate?
◦ Noise w is usually assumed white, with i.i.d. (independent, identically distributed) samples; whiteness often does not hold in real-world scenarios.
◦ Gaussianity is more realistic, due to the validity of the Central Limit Theorem.
◦ Zero-mean noise is a nearly universal assumption, and it is realistic since
w[n] = w_zm[n] + µ
where w[n] is the non-zero-mean noise, w_zm[n] is the zero-mean noise, and µ is the mean.
Good news: We can use these assumptions to find a bound on the
performance of “optimal” estimators.
More good news: The performance of any practical estimator, for any noise
statistics, will then be bounded by that theoretical bound!
◦ Variance of noise does not always have to be known to make an estimate
◦ But, we must have tools to assess the “goodness” of the estimate
◦ Usually, the goodness analysis is a function of the noise variance, σ_w²,
expressed in terms of the SNR (signal-to-noise ratio). (noise sets the SNR level)
Assessing the performance of an estimator
Recall that the estimate θ̂ = g(x) is a random variable. As such, it has a
pdf of its own, and this pdf completely depicts the quality of the estimate.
We can only assess performance when the value of θ is known.
The quality (goodness) of an estimator is typically captured through the mean and variance of θ̂ = g(x).
We desire: µ_θ̂ = E{θ̂} = θ   and   σ²_θ̂ = E{(θ̂ − E{θ̂})²} → small
[Figure: a Gaussian pdf p(θ̂) with peak value 1/√(2πσ²); 68% of the probability mass lies within [θ−σ, θ+σ], 95% within [θ−2σ, θ+2σ], and >99% within [θ−3σ, θ+3σ].]
[Figure: the pdf of the estimate, p(θ̂), centred around θ, and the pdf of the estimation error, p(η), centred around 0.]
Consider the estimator
Â = (1/(N+2)) ∑_{n=0}^{N−1} x[n]
Chebyshev’s inequality relates the deviation of θ̂_N from θ to its variance:
Pr{ |θ̂_N − θ| ≥ ε } ≤ var{θ̂_N} / ε²
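A quick Monte Carlo check (my own sketch) that the sample mean of x[n] = A + w[n] respects the Chebyshev bound above, with arbitrary values of A, σ, N and ε:

import numpy as np

rng = np.random.default_rng(3)
A, sigma, N, eps, trials = 1.0, 1.0, 100, 0.2, 100_000

# Empirical probability that the sample mean deviates from A by at least eps
x = A + sigma * rng.standard_normal((trials, N))
A_hat = x.mean(axis=1)
p_emp = np.mean(np.abs(A_hat - A) >= eps)

chebyshev = (sigma**2 / N) / eps**2          # var{A_hat} / eps^2
print(f"empirical: {p_emp:.4f}  <=  Chebyshev bound: {chebyshev:.4f}")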
x[n] = A + w[n]
◦ Intuitively, the sample mean is a reasonable estimator, and has the form
Â = (1/N) ∑_{n=0}^{N−1} x[n]
and the mean, A, is exactly the quantity we are trying to estimate.
We are estimating A using the sample mean (a statistic), Â = (1/N) ∑_{n=0}^{N−1} x[n].
Consider instead the modified estimator
Ǎ = (1/(2N)) ∑_{n=0}^{N−1} x[n]
Therefore: E{Ǎ} = 0 when A = 0, but E{Ǎ} = A/2 when A ≠ 0 (parameter dependent).
Hence Ǎ is not an unbiased estimator.
◦ A biased estimator introduces a “systematic error” which should be avoided
whenever possible.
◦ Our goal is to avoid bias if we can, as we are interested in stochastic
signal properties and bias is largely deterministic.
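A Monte Carlo sanity check of this bias (a sketch; Ǎ is the modified estimator with the 1/(2N) scaling from above, and the values of A, σ and N are arbitrary):

import numpy as np

rng = np.random.default_rng(4)
A, sigma, N, trials = 2.0, 1.0, 50, 20_000

x = A + sigma * rng.standard_normal((trials, N))
A_hat = x.sum(axis=1) / N          # sample mean: unbiased, E{A_hat} = A
A_chk = x.sum(axis=1) / (2 * N)    # modified estimator: E{A_chk} = A/2, i.e. biased

print(f"mean of sample mean:        {A_hat.mean():.3f}  (A   = {A})")
print(f"mean of modified estimator: {A_chk.mean():.3f}  (A/2 = {A/2})")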
The estimator averaged over L = 10 individual estimates, ĥ_i[n], is
ĥ = (1/10) ∑_{i=1}^{10} ĥ_i[n],
with
E{ĥ} = (α/10) ∑_{i=1}^{10} h = αh   and   var{ĥ} = (1/L²) ∑_{i=1}^{L} var{ĥ_i}
[Figure: pdfs of the individual estimates, p(ĥ_i), and of the averaged estimate, p(ĥ); the average has the same mean but a much smaller spread. Shown for α = 1 (centred at h) and for a biased case (centred at h/2).]
Our assumption was that the individual estimates, θ̂l = g(x), are unbiased,
with equal variances, and mutually uncorrelated.
Then (NB: averaging biased estimators will not remove the bias)
E{θ̂} = θ
and
var{θ̂} = (1/L²) ∑_{l=1}^{L} var{θ̂_l} = (1/L) var{θ̂_l}
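A quick numerical illustration (my own sketch) that averaging L unbiased, uncorrelated estimates of equal variance leaves the mean unchanged and reduces the variance by a factor of 1/L; the value h = 1 and the per-estimate variance 0.25 are arbitrary:

import numpy as np

rng = np.random.default_rng(5)
h_true, L, trials = 1.0, 10, 50_000

# L independent unbiased estimates per trial, each with variance 0.25
h_i = h_true + 0.5 * rng.standard_normal((trials, L))
h_bar = h_i.mean(axis=1)                      # averaged estimator

print(f"var of a single estimate: {h_i[:, 0].var():.4f}")
print(f"var of the average:       {h_bar.var():.4f}  (expected ~ {0.25 / L:.4f})")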
MSE(θ̂) = E{(θ̂ − θ)²} = E{ [ (θ̂ − E{θ̂}) + (E{θ̂} − θ) ]² },   where E{θ̂} − θ = B(θ̂) is the bias
        = E{(θ̂ − E{θ̂})²} + 2 B(θ̂) E{θ̂ − E{θ̂}} + B²(θ̂),   and E{θ̂ − E{θ̂}} = 0
        = var(θ̂) + B²(θ̂)
For the scaled sample mean, Â = a (1/N) ∑_{n=0}^{N−1} x[n], the variance is
var(Â) = a²σ²/N
so that we have
MSE(Â) = a²σ²/N + (a − 1)²A²
The choice a = 1 removes the bias, but does not necessarily minimise the MSE. Differentiating,
∂MSE(Â)/∂a = 2aσ²/N + 2(a − 1)A²
and setting the result to zero, we arrive at the optimal value
a_opt = A²/(A² + σ²/N)
Since a_opt depends on the unknown A, this optimal estimator is not realisable; realisable estimators cannot be those which are not solely a function of the data (see Example 6).
Practically, the minimum MSE (MMSE) estimator needs to be
abandoned, and the estimator must be constrained to be unbiased.
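To see numerically why this estimator is unrealisable, one can evaluate a_opt for a few values of the (unknown) A; a sketch, with arbitrary σ and N:

import numpy as np

sigma, N = 1.0, 20
a = np.linspace(0, 1.5, 301)

for A in (0.5, 1.0, 3.0):                      # a_opt changes with the unknown A
    mse = a**2 * sigma**2 / N + (a - 1)**2 * A**2
    a_opt = A**2 / (A**2 + sigma**2 / N)
    print(f"A = {A}: a_opt = {a_opt:.3f}, "
          f"min MSE (grid) at a = {a[np.argmin(mse)]:.3f}")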
MSE(θ̂) = var(θ̂) + B²(θ̂),   where B²(θ̂) = 0 for the MVU estimator
By constraining the bias to be zero, our task is much easier, that is, to find
an estimator that minimises the variance.
◦ In this way, the feasibility problem of MSE is completely avoided.
Therefore:
MVU estimator = Minimum mean square error unbiased estimator
We will use the acronym MVUE for minimum variance unbiased estimator.
Course goal: To find optimal statistical estimators and perform optimal inference
(see the Appendix for an alternative relation between the error function
and the quality (goodness) of an estimator)
Bias–variance illustration: ARIMA prediction of
COVID-19 death data
Consider the prediction of COVID-19 death rates in the UK.
[Figure: UK COVID-19 daily death data (blue) and 10-day-ahead predictions (red), using an AR(1) model (left) and an ARIMA(7,1,1) model (right).]
◦ The AR(1) prediction exhibits bias, as the mean of the predicted data (in
red) is offset from the mean of the true data (in blue) for most of the plot.
◦ The ARIMA(7,1,1) prediction coincides with the original data in terms of
the mean over the whole plot, but exhibits large variability (which would you prefer?)
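A hedged sketch of such a comparison using the statsmodels ARIMA class; the actual dataset, pre-processing and fitting options behind the figure are not given in the slides, so the series below is only a synthetic placeholder:

import numpy as np
from statsmodels.tsa.arima.model import ARIMA

# Placeholder series standing in for daily death counts (replace with the real data)
rng = np.random.default_rng(6)
y = rng.poisson(50, 200).astype(float)

ar1   = ARIMA(y, order=(1, 0, 0)).fit()     # AR(1) model
arima = ARIMA(y, order=(7, 1, 1)).fit()     # ARIMA(7,1,1) model

print("AR(1)        10-step forecast:", np.round(ar1.forecast(steps=10), 1))
print("ARIMA(7,1,1) 10-step forecast:", np.round(arima.forecast(steps=10), 1))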
[Figure: variances of candidate unbiased estimators, e.g. θ̂₂ and θ̂₃, plotted as functions of θ. Left: θ̂₃ has the smallest variance for every value of θ, so θ̂₃ is an MVU estimator. Right: no single estimator has the smallest variance for all θ, so no MVU estimator exists.]
◦ An MVU estimator has the additional property that its var(θˆi), for
i = 1, 2, . . . , p, is the minimum among all unbiased estimators.
Multivariate inference often helps (see also Lecture 2)
For a rigorous account of multivariate inference, see Lecture 4
 ∼ N (A, σ 2/N )
MSE(σ̂²) = (α²/N²) [ N²σ⁴ + 2Nσ⁴ ] + σ⁴(1 − 2α) = σ⁴ [ α²(1 + 2/N) + (1 − 2α) ]
The MMSE is obtained for α_min = N/(N+2), and is MMSE(σ̂²) = 2σ⁴/(N+2).
Given that the corresponding MSE of an optimal unbiased estimator of σ²
(the CRLB, covered later) is 2σ⁴/N, this is an example of a biased estimator
which attains a lower MSE than the CRLB.
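A Monte Carlo check of this claim (my own sketch, assuming zero-mean WGN so that σ̂² = (1/N) ∑ x²[n] is the unbiased estimator being scaled by α = N/(N+2)):

import numpy as np

rng = np.random.default_rng(7)
sigma2, N, trials = 2.0, 10, 200_000

x = np.sqrt(sigma2) * rng.standard_normal((trials, N))   # zero-mean WGN with variance sigma2

s2_unbiased = (x**2).sum(axis=1) / N          # unbiased, MSE = var = 2*sigma2^2/N (the CRLB)
s2_scaled   = (x**2).sum(axis=1) / (N + 2)    # alpha = N/(N+2) scaling: biased, lower MSE

mse = lambda est: np.mean((est - sigma2)**2)
print(f"MSE unbiased: {mse(s2_unbiased):.4f}  (theory {2*sigma2**2/N:.4f})")
print(f"MSE scaled:   {mse(s2_scaled):.4f}  (theory {2*sigma2**2/(N+2):.4f})")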
Ã = (1/N) ∑_{n=1}^{N} |x[n]|
Therefore,
◦ if A ≥ 0, then |x[n]| = x[n], and E{Ã} = A
◦ if A < 0, then E{Ã} ≠ A
⇒ Bias: = 0 for A ≥ 0,   ≠ 0 for A < 0
⇒ E{σ̂²} = (1/N) ∑_{n=0}^{N−1} [ σ² − (2/N)σ² + (1/N)σ² ] = ((N−1)/N) σ²
P(X ≥ 100,000) ≤ E{X}/100,000 = 1/2.5
Markov’s inequality can be used to prove that mean square
convergence implies convergence in probability, and also to prove
Chebyshev’s inequality. (see Lecture 1 for more detail)
[Figure: white noise and pink noise realisations plotted against Sample Index (left); their autocorrelations plotted against Lag (centre); and their power spectra plotted against Frequency (right).]
Notes