Matlab Codes and Report
Haobo Zhu
CID: 01493196
March, 2022
Imperial College London
Contents
1 Random Signals and Stochastic Processes
1.1 Stochastic Estimation
1.2 Stochastic Processes
1.3 Estimation of probability distributions
1 Random Signals and Stochastic Processes

1.1 Stochastic Estimation
For a uniformly distributed random variable X ∼ U (0, 1) , the theoretical mean is given by the
following integration:
m = E{X} = ∫_{−∞}^{∞} x p(x) dx = ∫_0^1 x · 1 dx = 0.5   (1)
which gives a theoretical mean of 0.5.
Calculating with the MATLAB command mean on the 1000x1 vector generated by rand gives a sample mean of 0.4995, which is 0.1% below the theoretical value. The theoretical standard deviation is then calculated by

σ = √(E{(X − E{X})²}) = √(E{X²} − E{X}²) = √(∫_0^1 x² dx − 0.5²) = √(1/12) ≈ 0.2887   (2)

and the result from MATLAB using the function std gives a sample standard deviation of 0.2877.

[Figure 1: The estimated standard deviation over 10 realisations (std Value/AU against realisation index)]

Figure 2: The pdf plotted using 5, 10 and 100 bins respectively
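As a minimal sketch (not the exact coursework script), the sample statistics above can be reproduced with the standard MATLAB functions rand, mean and std:

% Sketch: sample mean and standard deviation of a uniform sample,
% compared with the theoretical values derived above.
x = rand(1000, 1);                          % 1000x1 vector, X ~ U(0,1)
m_hat = mean(x);                            % sample mean, expected near 0.5
s_hat = std(x);                             % sample std, expected near sqrt(1/12)
fprintf('mean error: %.2f%%\n', 100*abs(m_hat - 0.5)/0.5);
fprintf('std error: %.2f%%\n', 100*abs(s_hat - sqrt(1/12))/sqrt(1/12));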
The error in the pdf of a certain realisation of the random process is more pronounced as the number of bins increases, while the estimate is smoother with a small number of bins.
Figure 3: Estimated PDF with 100, 10000 and 100000 samples and nbin=10, along with the theoretical PDF
As the sample size grows, it can be observed from the diagrams that the error of the pdf approximation converges to 0. That is, the approximation converges to the theoretical pdf X ∼ U(0, 1).
We then repeat the process above for a normally distributed random process X ∼ N(0, 1). The theoretical mean is 0 by definition, and the calculated sample mean for the 1000x1 vector generated by randn is 0.0376, which is reasonable for a zero-centred normal distribution. The theoretical standard deviation is 1 by definition, and the standard deviation calculated by MATLAB is 0.9927, which is 0.73% below the theoretical value.

[Figure 4: Estimation of the std over 10 realisations (std Value/AU against realisation index)]
The sample mean and standard deviation for the 10 realisations with 1000 data samples are shown above by the two figures. They cluster near the theoretical values 0 and 1, with points lying on both sides of them. The estimator of the mean is constrained within ±0.04, and the estimator of the standard deviation is also constrained within ±0.04, which is reasonable since we have an unbiased estimator for both of them.

The pdf of this realisation of the random process can be approximately plotted by the histogram function in MATLAB. We then plot the pdf of the realisation with 1000 data samples, with the number of bins set to 5, 10 and 100 respectively. The error is again more pronounced in the diagram if the pdf is plotted with a greater number of bins. As the curve of the pdf is not uniform, we lose more information about the variability of the estimation if we plot the diagram with too few bins, especially for regions with low intensity.
Figure 5: PDF of the 1000-sample WGN realisation plotted using 5, 25 and 100 bins respectively
Figure 6: The pdf of the realisation plotted using 100, 10000 and 100000 data samples respectively, all plotted with 25 bins, along with the theoretical pdf
The pdf of this normally distributed random variable also converges to the theoretical one as the
data sample size increases.
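A sketch of the normalised-histogram pdf estimate described above, with bin counts scaled so the estimate integrates to 1 (the coursework's own routine may differ in detail):

% Sketch: histogram-based pdf estimate of a WGN realisation, overlaid
% with the theoretical N(0,1) density.
x = randn(10000, 1);
[counts, edges] = histcounts(x, 25);
width = edges(2) - edges(1);
pdf_hat = counts / (sum(counts) * width);   % normalise so the area is 1
centres = edges(1:end-1) + width/2;
bar(centres, pdf_hat, 1); hold on;
plot(centres, exp(-centres.^2/2)/sqrt(2*pi), 'r', 'LineWidth', 1.5);
xlabel('Intensity'); ylabel('Probability Density');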
The ensemble means and standard deviations of the M=100 members of the ensemble, each with N=100 time steps, are plotted as follows using MATLAB. It may be concluded that random process 1 is not stationary in mean or variance, since its mean grows linearly in time and its variance is low at the beginning and the end of the range. On the other hand, random processes 2 and 3 are both stationary in mean and variance, as they centre about constant values, which are presumed to be the theoretical mean and variance.
Figure 7: The ensemble means and variances of the three random processes (M=100, N=100)
The time averages are constrained for processes 1 (std=0.0205) and 3 (std=0.0371), but not for process 2 (std=0.2939); therefore random process 2 is not ergodic in mean, as the time average changes over realisations and we get a different value for every realisation. However, for process 1, the ensemble average changes with the time step, and thus it is not ergodic in mean either, as it is impossible to retrieve the ensemble average in the time domain. The "std" term in brackets is the standard deviation of the time averages of the 4 realisations discussed in this paragraph.
Similarly, we may assess the time standard deviation to see if the processes are ergodic in variance. The time standard deviations are constrained for processes 1 (std=0.0371) and 3 (std=0.0139), but not for process 2 (std=0.0801); therefore process 2 is not ergodic in variance, as the values derived from each realisation differ widely and do not indicate the ensemble standard deviation. However, process 1 is not ergodic in variance either, as its variance changes with time and so does its standard deviation, so a time standard deviation cannot provide an ensemble standard deviation for it. The "std" term in brackets is the standard deviation of the time standard deviations of the 4 realisations discussed in this paragraph.
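A sketch of the checks above, assuming rp1, rp2 and rp3 are the coursework generators returning an M-by-N ensemble (the exact signature is an assumption):

% Sketch: ensemble statistics against time, and time averages per
% realisation, for one of the three random processes.
M = 100; N = 100;
v = rp3(M, N);                    % hypothetical signature: M realisations of length N
ens_mean = mean(v, 1);            % ensemble mean at each time step (1xN)
ens_std  = std(v, 0, 1);          % ensemble std at each time step
time_mean = mean(v, 2);           % time average of each realisation (Mx1)
% for a process ergodic in mean, the time averages barely spread:
fprintf('std of time averages: %.4f\n', std(time_mean));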
Figure 8: The pdf of a normally distributed random process with 10000 data points plotted by the pdf function.
For the three random processes tested in section 1.2, only process 3 is stationary and ergodic. From
the plot below, as the data sample size increases, the pdf converges to its theoretical value.
Figure 9: The probability density function for the random process rp3 in exercise 2, plotted for realisations with 100 (top left), 1000 (top right), and 10000 (bottom left) time steps, along with its theoretical value.
We cannot plot the pdf of a nonstationary process using the pdf function: because the data distributions at different time steps differ, a single collective method (here, a histogram) cannot describe the overall pdf in one diagram. For the 1000-sample-long signal whose mean changes from 0 to 1 after N = 500, we may treat it as two stationary signals, one with n < 500 and one with n ≥ 500, and calculate their pdfs separately using histograms.
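A sketch of this two-segment treatment, with the step-mean signal constructed as an assumption matching the description above:

% Sketch: estimate the pdf of each stationary half separately.
n = (1:1000)';
x = randn(1000, 1) + (n > 500);   % mean steps from 0 to 1 after N = 500
subplot(1, 2, 1); histogram(x(1:500), 'Normalization', 'pdf'); title('n < 500');
subplot(1, 2, 2); histogram(x(501:end), 'Normalization', 'pdf'); title('n >= 500');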
Figure 1: The ACF function of a WGN plotted for delays τ ∈ [−999 : 999] (left) and |τ | < 50
(right)
By plotting the unbiased estimate of the ACF of a White Gaussian noise generated by randn(1, 1000), we may see that the ACF has a spike at τ = 0, and that the other values are all less than 0.2 except for the extreme values of τ (|τ| > 900). This is expected, as a WGN sample is only correlated with itself (τ = 0), and the other values should converge to 0 if the sample size is large enough.
If we zoom the plot in onto |τ| < 50, we may see that all the values of the ACF other than the spike are consistent and well constrained below 0.2, and do not increase overall when |τ| increases.
It may easily be observed that as |τ| gets sufficiently large, the normalisation factor 1/(N − |τ|) blows up. Moreover, fewer samples enter the sum when |τ| approaches ±1000, so the variance of the estimates also increases: as suggested by the central limit theorem, the variance of a sample mean is larger when fewer samples are used (σ²_mean = σ²/N, where N is the sample size). We would recommend |τ| ≤ 900 as an empirical bound.
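A sketch of the unbiased ACF estimate; the 'unbiased' option of xcorr applies the 1/(N − |τ|) normalisation discussed above:

% Sketch: unbiased ACF of a 1000-sample WGN.
x = randn(1, 1000);
[acf, tau] = xcorr(x, 'unbiased');
plot(tau, acf); xlabel('\tau'); ylabel('R_x(\tau)');
xlim([-999 999]);                 % use xlim([-50 50]) for the zoomed view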
We then generate another 1000-sample WGN and filter it with a moving average filter with 9 unit coefficients, then plot the ACF of the filtered signal. We may see that the spike in the ACF of the WGN is widened: the values for |τ| ∈ [−9 : 9] are significantly greater than 0 (Ry > 1) and decrease as |τ| increases; this bound is consistent with the order of the MA filter.
Figure 2: The ACF of the filtered signal plotted for delays τ ∈ [−999 : 999]

The ACF of the output signal is the convolution of the ACF of the input signal with the ACF of the impulse response of the filter:

Ry(τ) = Rx(τ) ∗ Rh(τ)   (2)
For white noise, the autocorrelation is a delta function (Rx(τ) = δ(τ)); by the sifting property of the delta function,
Ry (τ ) = δ(τ ) ∗ Rh (τ ) (3)
and thus Ry represents the ACF of the impulse response of the filter.
We plot the unbiased estimate of cross correlation function for ergodic signals between the input
and filtered signals given by the equation
Rxy(τ) = (1/(N − |τ|)) Σ_{n=0}^{N−|τ|−1} x[n] y[n + τ],   τ = −N + 1, ..., N − 1.   (4)
The CCF function of the input and output signal can be expressed by a convolution between the
ACF of the input and the impulse response of the filter.
Similarly, if the input Xt is an uncorrelated stochastic process, the resulting Rxy would have the
shape of the impulse response of the filter by the sifting property of the delta function.
The calculated CCF estimate is flipped about τ = 0 compared with the expected output. This is because MATLAB's xcorr computes the CCF with a different convention, in which the CCF between
Figure 3: The CCF function of a WGN and the output of the MA filter plotted for delays τ ∈ [−20 :
20]
the input and output peaks at negative lags if the output is a delayed copy of the input. This result can be used for system identification: if we feed white noise into an unknown system, we can identify its impulse response from the measured output.
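A sketch of this identification idea; the 9-tap MA system stands in for the unknown system, and the lag convention of xcorr should be kept in mind when reading off the impulse response:

% Sketch: feed WGN into an "unknown" system and read off the CCF.
x = randn(1, 1000);
h = ones(1, 9);                           % unknown system: the MA filter above
y = filter(h, 1, x);
[ccf, tau] = xcorr(x, y, 20, 'unbiased'); % CCF estimate for |tau| <= 20
stem(tau, ccf); xlabel('\tau'); ylabel('R_{xy}(\tau)');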
For the 100 samples of AR(2) models with length 1000 and uniformly distributed a1 ∈ [−2.5, 2.5] and a2 ∈ [−1.5, 1.5], the pairs of coefficients that result in a stable output (the output at the final time step being less than 1000 in magnitude) are plotted as red asterisks in the figure below.
Figure 4: The pairs of a1 and a2 that result in stable(red) and unstable(black) outputs, along with
the stability bounds
a1 + a2 < 1   (10)

1 − a2 + a1 > 0,   i.e.   a2 − a1 < 1   (14)

Similarly,

(1 − a2)/(1 + a2) > 0   (15)

which leads to the final stability criterion: a1 + a2 < 1, a2 − a1 < 1 and |a2| < 1, the triangular region shown in the figure.
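A sketch of the stability experiment, using the empirical boundedness test from the text (output magnitude at the final step below 1000):

% Sketch: sample (a1, a2) uniformly, drive the AR(2) recursion with WGN,
% and colour stable pairs red and unstable pairs black.
for k = 1:100
    a1 = 5*rand - 2.5; a2 = 3*rand - 1.5;
    x = filter(1, [1 -a1 -a2], randn(1000, 1)); % x[n] = a1 x[n-1] + a2 x[n-2] + w[n]
    stable = abs(x(end)) < 1000;
    plot(a1, a2, '*', 'Color', stable*[1 0 0]); hold on;
end
xlabel('a_1'); ylabel('a_2');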
We then investigate the sunspot data by plotting the ACF estimates for N = {5, 20, 250}. The ACF shapes for the different data lengths are quite different. For N = 5, we may see a decaying ACF estimate in the non-zero-mean plot, while for the zero-mean version the absolute value of the ACF at |τ| = 4 is larger than that at τ = 0, probably due to the statistical inconsistency at the edges of the plot. The shapes for N = 20 and N = 250 are more similar, demonstrating a pseudo-periodic behaviour, with the N = 250 shape having a shorter period. For the non-zero-mean version, the ACF for N = 20 at |τ| = 13 exceeds that at τ = 0. The ACFs of the non-zero-mean version are greater than those of the zero-mean version at all lags, the non-zero mean acting as a non-constant offset in the ACF estimates.
The partial correlation coefficients are calculated using the Yule-Walker equations for the original data and for the standardised data with zero mean and unit variance; we obtain

              a1,1    a2,2     a3,3     a4,4    a5,5     a6,6    a7,7    a8,8    a9,9    a10,10
Original      0.9295  -0.5857  0.1284   0.2532  0.1555   0.2574  0.2736  0.2384  0.1680  0.0252
Standardised  0.8212  -0.6783  -0.1223  0.0473  -0.0156  0.1623  0.1751  0.2276  0.1766  0.0038

For both the original and standardised data, we may conclude that the data are best modelled by an AR(2) process, since all partial correlations above order 2 are less than 0.3. However, as the statistical bound for 95% confidence is ±1.96/√N = ±0.1155, we may say that the partial correlations up to order 9 still have some significance. The values above order 2 in the standardised data are all smaller than those for the original data, except for a9,9.
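A sketch of the partial-correlation calculation via the Yule-Walker recursion (aryule, Signal Processing Toolbox), whose reflection coefficients are the negated partial correlations:

% Sketch: partial correlations a_{k,k} of the standardised sunspot data.
load sunspot.dat                           % built-in MATLAB dataset (year, number)
x = sunspot(:, 2);
x = (x - mean(x)) / std(x);                % standardise: zero mean, unit variance
[~, ~, k] = aryule(x, 10);
pacf = -k;                                 % a_{1,1} ... a_{10,10}
stem(1:10, pacf); hold on;
plot([1 10],  1.96/sqrt(length(x))*[1 1], 'k--');  % 95% confidence bound
plot([1 10], -1.96/sqrt(length(x))*[1 1], 'k--');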
The appropriate model order can be further assessed using the MDL, AIC, and AICc criteria for
the standardised dataset.
MDL and AIC show a minimum at p = 9, while AICc clearly shows a minimum at p = 2. As the differences between the values at p = 2 and p = 9 for MDL (< 0.1) and AIC (< 0.2) are not significant, and considering the computational complexity of the model and that AIC and MDL may be unreliable given the short segment of data from the AR(2) process, we may conclude that a model order of 2 is appropriate for this dataset.
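A sketch of the three criteria, using the standard forms MDL = ln E + p ln N/N, AIC = ln E + 2p/N and AICc = AIC + 2p(p + 1)/(N − p − 1), with E the residual variance of the AR(p) fit (an assumption consistent with the usual coursework definitions):

% Sketch: order-selection criteria for the standardised sunspot data.
load sunspot.dat; x = sunspot(:, 2); x = (x - mean(x)) / std(x);
N = length(x); maxp = 10;
crit = zeros(3, maxp);
for p = 1:maxp
    [~, E] = aryule(x, p);                 % residual (driving noise) variance
    crit(1, p) = log(E) + p*log(N)/N;      % MDL
    crit(2, p) = log(E) + 2*p/N;           % AIC
    crit(3, p) = crit(2, p) + 2*p*(p+1)/(N - p - 1);  % AICc
end
plot(1:maxp, crit); legend('MDL', 'AIC', 'AICc'); xlabel('Model Order');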
The predictions made by model orders p = {1, 2, 10} with prediction horizons m = {1, 2, 5, 10} are
plotted as follows.
Table 2: Cumulative square error of different model orders over different prediction horizons
The plot and the table suggest that all predictions are reasonably close to the original data for m = 1. For a typical AR(2) process, the overmodelled prediction should have a larger MSE in interpolation, but as the only criterion that suggests p = 2 is AICc, this process is not typically AR(2) but closer to AR(9). The variance of the 10th-order model is smaller than those of the 1st- and 2nd-order models. As the prediction horizon increases, the AR prediction results of the different model orders start to diverge. For m = 10, the variance between the prediction results and the original data is minimised when p = 10, while the lower-order models show large differences in the amplitude of oscillations compared with the original data. This also suggests that overmodelled predictions are more robust than undermodelled ones when the prediction horizon is long.
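A sketch of the m-step-ahead AR prediction used above: fit AR(p) by Yule-Walker, then iterate the one-step predictor m times, feeding intermediate predictions back in:

% Sketch: horizon-m prediction of the standardised sunspot series.
load sunspot.dat; x = sunspot(:, 2); x = (x - mean(x)) / std(x);
p = 2; m = 10; N = length(x);
a = aryule(x, p);                          % a = [1, a1 ... ap]; predictor uses -a(2:end)
pred = nan(N, 1);
for n = p:N-m
    buf = x(n:-1:n-p+1);                   % last p true samples, newest first
    for step = 1:m                         % roll the predictor m steps forward
        xnext = -a(2:end) * buf;
        buf = [xnext; buf(1:end-1)];
    end
    pred(n+m) = xnext;
end
plot(1:N, x, 1:N, pred); legend('data', 'AR(2), m = 10');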
The NASDAQ closing prices can be well modelled by an AR(1) process, as both MDL and AIC suggest an optimal model order of p = 1 when modelling the NASDAQ data standardised to X ∼ N(0, 1), since AR modelling requires zero-mean datasets. The corresponding partial correlation is a1,1 = 0.9977, and all higher-order partial correlations fall below the 95% confidence interval. The daily return can then be calculated by subtracting adjacent closing prices obtained from the AR model.
Figure 8: Different criteria assessing the optimal model order for NASDAQ data
and

ln P̂X(f; θ) = ln[σ̂²] − ln[1 − Σ_{m=1}^{p} âm e^{−j2πfm}] − ln[1 − Σ_{m=1}^{p} âm e^{j2πfm}]   (19)
Given [I(θ)]11 = N rxx[0]/σ² and [I(θ)]12 = [I(θ)]21 = 0, the full Fisher information matrix is then

I(θ) = [ N rxx[0]/σ²    0
         0              N/(2σ⁴) ]   (21)
[I(θ)]11 can be further simplified, as rxx[0] = σx² = σ²/(1 − ρ1a1) = σ²/(1 − a1²) for zero-mean data:

[I(θ)]11 = N σ²/(σ²(1 − a1²)) = N/(1 − a1²)   (22)
We then invert the Fisher information matrix to get the CRLB for â1 and σ̂²:

I⁻¹(θ) = [ (1 − a1²)/N    0
           0              2σ⁴/N ]   (23)
Finally, we obtain the CRLB for the parameters:

var(σ̂²) ≥ 2σ⁴/N,    var(â1) ≥ (1 − a1²)/N   (24)
The logarithms of the CRLBs for var(σ̂²) and var(â1) are plotted as heatmaps below, with deeper blue representing larger variance. The colour pattern clearly follows the CRLB derived in Eqn (24).
[Heatmaps: (a) log10(CRLB) for â1; (b) log10(CRLB) for σ̂², both plotted against data length N]
Figure 9: The heatmaps of the CRLB for (a) var(â1) and (b) var(σ̂²)
The lower bound for the variance of the estimate of the power spectrum is

var(P̂X(f; θ)) ≥ (∂P̂X(f; θ)/∂θ)ᵀ I⁻¹(θ) (∂P̂X(f; θ)/∂θ)   (26)
where ∂P̂X(f; θ)/∂θ = [∂P̂X(f; θ)/∂a1, ∂P̂X(f; θ)/∂σ²]ᵀ. Substituting A(f) = 1 − a1e^{−j2πf} and A*(f) = 1 − a1e^{j2πf},

∂P̂X(f; θ)/∂a1 = σ² ∂/∂a1 [1/(A(f)A*(f))]
               = −σ² (1/|A(f)|⁴) [(∂A(f)/∂a1) A*(f) + A(f) (∂A*(f)/∂a1)]
               = (σ²/|A(f)|⁴) [e^{j2πf}A(f) + e^{−j2πf}A*(f)]
               = (2σ²/|A(f)|⁴) Re{A(f)e^{j2πf}}.   (27)
Similarly,

∂P̂X(f; θ)/∂σ² = 1/|A(f)|².   (28)
Finally, the lower bound for the variance of the power spectrum is

var(P̂x(f; θ)) ≥ (∂P̂X(f; θ)/∂θ)ᵀ I⁻¹(θ) (∂P̂X(f; θ)/∂θ)
             = ((1 − a1²)/N) [(2σ²/|A(f)|⁴) Re{A(f)e^{j2πf}}]² + (2σ⁴/N)(1/|A(f)|⁴)
             = (2σ⁴/(N|A(f)|⁴)) [2(1 − a1²) Re{A(f)e^{j2πf}}²/|A(f)|⁴ + 1]   (29)
We may plot the probability density functions of the averaged and original RR-intervals. Fig. 10 shows the PDF estimates obtained with the histogram function and 20 bins.
Figure 10: PDF estimate using histogram function for original RRI signals
With the RRI signals averaged over windows of 10 time points, the distributions and the corresponding values generally do not change, with minor differences arising from the decrease in the total number of data points.
Figure 11: PDF estimate using histogram function for averaged RRI signals with α = 1
When α is decreased to 0.6, we may see that the general shapes of the PDFs are relatively consistent, with the values on the x-axis shifted to the left (they become smaller). This is expected because the averaged values are scaled by α.
Figure 12: PDF estimate using histogram function for averaged RRI signals with α = 0.6
[Figure 13: ACF estimates of the three RRI signals for lags τ ∈ [−1000, 1000]]
The ACF estimates of the three RRIs can then be analysed. As the ACF sequences do not cut off to 0 after a certain delay, as seen in Fig. 13, the RRI sequences behave like AR processes.
The optimal order of the AR models for modelling the three RRI sequences can then be found
using criteria such as MDL, AIC and AICc.
Figure 14: Different criteria assessing the optimal model order for the three RRI sequences
For RRI1 and RRI3, the optimal model order suggested by MDL and AIC is p = 5, while AICc suggests p = 4. For RRI2, MDL and AIC show major decreases after p = 8, but AICc suggests p = 3 is optimal; since MDL and AIC are also flat between orders 3 and 8, there is no need to increase the model order to such high values, as the computational complexity would increase significantly.
P̂x(f) = (1/N) |Σ_{n=0}^{N−1} x[n] e^{−j2πfn/N}|².   (1)
The MATLAB function pgm we wrote produces the following plots for WGN realisations with lengths N = {128, 256, 512}.
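A sketch of pgm written directly from Eqn. (1); the coursework version may differ in detail:

function P = pgm(x)
% PGM  Periodogram estimate of Eqn. (1), evaluated at the DFT
% frequencies f = 0, 1/N, ..., (N-1)/N.
    N = length(x);
    P = abs(fft(x)).^2 / N;
end
% usage: P = pgm(randn(1, 256)); plot((0:255)/256, P);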
[Figure: periodograms of WGN realisations with N = 128, 256 and 512, against normalised frequency]
The majority of the PSD values in all three estimates sit at about 1, which is expected: for a WGN, all frequencies carry the same power, equal to the variance, by definition. The plots are also symmetrical about 0.5 (π in terms of angular frequency), which is a property of PSDs of real signals. The PSD estimates are then smoothed by an order-5 MA filter, producing the following plots.
The envelopes of the three PSD estimates are more pronounced and the spikes are lower compared with the original copies, since such moving average filters are low-pass filters. The quality of the estimates is therefore improved.
We then generate a WGN of length 1024 and divide it into 8 segments. The MSEs of the PSD estimates are
Segment# 1 2 3 4 5 6 7 8 Average
MSE 0.7248 1.1582 0.7996 0.7000 0.7936 1.2452 0.6300 1.1331 0.8985
By averaging the PSD estimates, the MSE of the new PSD estimate becomes 0.1193; compared with the average MSE of the 8 segments it is ∼7.5 times smaller, whereas it should drop by a factor of 8 in theory.
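A sketch of the segment-averaging just described, reusing the pgm sketch above:

% Sketch: average the periodograms of 8 non-overlapping 128-sample segments.
x = randn(1024, 1);
segs = reshape(x, 128, 8);                 % one segment per column
P = zeros(128, 8);
for k = 1:8
    P(:, k) = pgm(segs(:, k));             % per-segment PSD estimate
end
Pavg = mean(P, 2);                         % averaging cuts the variance ~8x
plot((0:127)/128, Pavg); xlabel('Frequency'); ylabel('Averaged PSD estimate');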
Figure 3: A PSD estimate of one segment of WGN (left), and the averaged PSD estimate (right)
[Figure: the original WGN and the filtered signal plotted against time step]
To assess the spectral properties of AR models, a 1064-sample WGN is filtered by an AR(1) filter with a = [1 0.9], namely H(z) = 1/(1 + 0.9z⁻¹). The plot above compares the original WGN and the filtered signal. It is clear that the filtered signal oscillates with larger amplitudes and that its high-frequency components are more pronounced.
We may then compare the PSD of the filter with that of the filtered WGN. The following plot shows that the filter is clearly a high-pass filter with a maximum PSD of 100. The empirically obtained cut-off frequency, at which the PSD falls to 50.4, is 0.487. The PSD of the filtered WGN is centred on that of the filter, oscillates about it, and shoots up as the frequency approaches 0.5.
[Figure: PSD estimate of the filter (left), and of the filter overlaid with a filtered WGN (right), against normalised frequency]
Zooming into f ∈ [0.4, 0.5], violent oscillations are prevalent in this region, and thus the variability of the PSD estimate obtained by feeding a WGN into the system is high. As the 1064 samples of a WGN realisation can be interpreted as an infinitely long realisation multiplied by a rectangular window in the time domain, the resulting PSD estimate is convolved with a sinc function in the frequency domain, which causes the oscillations.
(a) Filter and filtered WGN    (b) Filter and model-based estimation

Figure 6: The ideal PSD of the filter overlaid with different estimates
The PSD may also be estimated by using the PSD expression for an AR(1) process,

P̂y(f) = σ̂²X / |1 + â1 e^{−j2πf}|²   (2)
where â1 = −R̂Y(1)/R̂Y(0) and σ̂² = R̂Y(0) + â1R̂Y(1). From the calculations by xcorr, R̂Y(0) = 5.57985 and R̂Y(1) = −5.04679. Therefore, â1 ≈ 0.9045 and σ̂² ≈ 1.015.
From Fig. 6(b), it is clear that the model-based PSD estimate is close to the ideal PSD, with the curve sitting slightly above the ideal one because of the larger numerator and because |â1| is greater than |a1|.
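A sketch of this model-based estimate, computing R̂Y(0) and R̂Y(1) with xcorr and substituting into Eqn. (2):

% Sketch: AR(1) model-based PSD of the filtered WGN.
y = filter(1, [1 0.9], randn(1, 1064));
r = xcorr(y, 1, 'unbiased');               % [R(-1) R(0) R(1)]
Ry0 = r(2); Ry1 = r(3);
a1_hat = -Ry1 / Ry0;
s2_hat = Ry0 + a1_hat * Ry1;
f = 0:0.001:0.5;
Py = s2_hat ./ abs(1 + a1_hat * exp(-1j*2*pi*f)).^2;
plot(f, Py); xlabel('Normalised Frequency'); ylabel('Model-based PSD');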
We can then apply a similar analysis to the sunspot data by computing model-based PSD estimates with AR models of order p = {1, 2, 10}.
[Figure: PSD of the sunspot data (zero mean, left; original, right) overlaid with AR(1), AR(2) and AR(10) model-based estimates]
As the sunspot data is an AR(2) process, one may conclude that the AR(1) model underfits the data and AR(10) potentially overfits it. However, according to the plots, AR(10) has the best overall performance, as AR(2) loses the first peak while AR(1) loses the second; but as AR(2) is able to capture the most important frequencies in the zero-centred dataset, increasing the model order up to 10 is not necessary, since AR(10) is much more computationally complex. The effects of overfitting are not significant, probably due to the small size of the dataset (N = 288).
For an AR(p) process, the cumulative square error loss function for finding optimal partial corre-
lation functions is given by
J = Σ_{k=1}^{M} ( r̂xx[k] − Σ_{i=1}^{p} ai r̂xx[k − i] )²,   for M ≥ p   (4)
as the loss function can also be written as J = (x − s)ᵀ(x − s). By taking the gradient of J w.r.t. a and setting the result to 0, the optimal LSE estimator of a is â = (HᵀH)⁻¹Hᵀx, with minimum cost Jmin = xᵀ(x − Hâ); it is thus possible to determine the optimal model order by LSE.
We may also compare the estimate with the Yule-Walker results, where â = R⁻¹r̂, with r̂ = [r̂xx[1], ..., r̂xx[p]]ᵀ and the matrix R ∈ R^{p×p} given by

R = [ r̂xx[0]       r̂xx[1]       ...   r̂xx[p − 1]
      r̂xx[1]       r̂xx[0]       ...   r̂xx[p − 2]
      ...           ...                 ...
      r̂xx[p − 1]   r̂xx[p − 2]   ...   r̂xx[0]     ]   (6)
We may conclude that the LSE method is more computationally complex, as it requires more matrix multiplications and the dimension of the transformation matrix is larger (H ∈ R^{M×p} compared with R ∈ R^{p×p}); both methods require the calculation of a matrix whose entries are obtained from the ACF. Given that the random process x[n] is stochastic with a noise term w ∼ N(0, σ²) and rxx[m] = E(x[n]x[n + m]), the biased estimator of the ACF becomes

r̂xx[k] = Σ_{i=1}^{p} ai r̂xx[k − i] + ϵ[k]   (7)
It is clear that within the estimator, the stochastic error is still present and therefore the matrix
H is stochastic.
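A sketch of the LSE fit implied by Eqn. (7): stack the ACF regression into x = Ha + ϵ and solve the normal equations:

% Sketch: LSE estimate of AR(p) coefficients from the biased ACF.
load sunspot.dat; s = sunspot(:, 2); s = (s - mean(s)) / std(s);
r = xcorr(s, 'biased');
r = r(length(s):end);                      % r(1) = rxx[0], r(2) = rxx[1], ...
p = 2; M = 10;                             % model order, number of equations
xv = r(2:M+1);                             % left-hand side: rxx[1..M]
H = zeros(M, p);
for k = 1:M
    for i = 1:p
        H(k, i) = r(abs(k - i) + 1);       % rxx[k-i], using ACF symmetry
    end
end
a_hat = (H' * H) \ (H' * xv)               % optimal LSE coefficients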
We then test the LSE algorithm on estimating the best AR model for the (standardised) sunspot
data. The coefficients up to order 10 are
a1 0.7905
a2 1.5205 -0.8621
a3 1.4961 -0.8212 -0.0253
a4 1.4967 -0.8435 0.0115 -0.0227
a5 1.4966 -0.8473 0.0923 -0.1533 0.0797
a6 1.4836 -0.8287 0.0766 -0.0311 -0.1181 0.1205
a7 1.4620 -0.8120 0.0807 -0.0485 0.0311 -0.1247 0.1504
a8 1.4412 -0.7972 0.0757 -0.0435 0.0171 -0.0122 -0.0358 0.1159
a9 1.4304 -0.7950 0.0766 -0.0459 0.0199 -0.0212 0.0379 -0.0069 0.0774
a10 1.4222 -0.7954 0.0723 -0.0441 0.0168 -0.0178 0.0274 0.0784 -0.0641 0.0898
[Figure 8: MSE of the LSE fit against model order for (a) the standardised and (b) the original sunspot data]
Fig. 8(a) shows that, beyond order 2, the reduction in MSE is small for the standardised data (0.2403 at p = 10 versus 0.2631 at p = 2), which is consistent with the Yule-Walker results in Section 2.3, while Fig. 8(b) suggests that the optimal order for the original data is 3. We may suggest that standardising the data is a good way to reduce model complexity.
We then plot the model-based estimate of the PSD for the standardised data. According to the formula for the PSD of an AR model, the numerator is the variance of the driving noise, which is equivalent to the variance of the model residual, i.e. the interpolation MSE of the AR model on the original dataset. Normalising the PSD by this variance, we plot the model-based PSD for orders p = {1, 2, 10}.
[Figure: model-based PSD estimates of the standardised sunspot data for p = {1, 2, 10}]
The plot suggests that the AR(1) model captures the first peak well but gives no information about the second peak. AR(2) is a suitable model order, as it indicates the second peak well, although its fit to the first peak is not as good. The AR(10) model shows more information on the first peak than AR(2) but less than AR(1), and shoots up high at the second peak. The driving noise variance is estimated by σ² = 1/var(x̂[n]), where x̂[n] is a WGN ∼ N(0, 1) filtered by the corresponding AR model, considering that the dataset is standardised. The following plot suggests that
the MSE of the ACF estimation also varies with the data length. With the given AR(2) model, the MSE hits a minimum at N = 25, where MSE = 0.000621, then rises and oscillates over N ∈ [30, 150]. A second minimum occurs at N = 245, where MSE = 0.000607. We would recommend a data length of 245 as optimal; data lengths below 20 or between 30 and 100 should be avoided.
Figure 10: The change of MSE in ACF estimation with the data length
[Figure: segments of the dialled signal around t = 0.25 s and t = 0.75 s]
The dial tones of the digits 0 to 9 in the UK consist of 10 pairs of superimposed sinusoids, with frequencies ranging from 697 Hz to 1477 Hz. Therefore, a sampling frequency of 32768 Hz is appropriate, as it is more than 10 times the Nyquist-Shannon rate (2954 Hz, twice the highest frequency), so there is no aliasing in the sampled signal. The spectrogram can then be obtained by calculating the FFT of the 21 segments with a Hanning window. It is clear that within a dialling interval there are two peaks in the spectrum, corresponding to the two frequencies of the superimposed sine waves. For an idle interval, the FFT is constantly 0.
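A sketch of the dial-tone synthesis for two of the digits, assuming the standard DTMF frequency pairs quoted below:

% Sketch: synthesise '0' then '9' with idle gaps, and view the spectrogram.
fs = 32768;
lowf  = [941 852];                         % row frequencies for '0' and '9' (Hz)
highf = [1336 1477];                       % column frequencies for '0' and '9' (Hz)
t = 0:1/fs:0.25 - 1/fs;                    % 0.25 s tone
y = [];
for k = 1:2
    tone = sin(2*pi*lowf(k)*t) + sin(2*pi*highf(k)*t);
    y = [y, tone, zeros(1, length(t))];    % tone followed by an idle interval
end
spectrogram(y, hann(2048), 0, 2048, fs, 'yaxis');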
Given the generated sequence 02063853985, we may clearly identify the ten pairs of peaks in the frequency domain and hence the digits. For example, it is clear that for digit 0 the spectrogram peaks at 941 Hz and 1336 Hz, while for digit 9 the peaks are at 852 Hz and 1477 Hz. For real life
Figure 12: Spectrogram of the dial sequence with segment length 0.25s
signals, the true dial tone is always superimposed with noise, and thus the Fourier transform of the
resulting signal becomes ideally
F{ŷ[n]} = F{y[n] + w[n]} = SY(f) + σ²w.   (8)
We then corrupt the signal with WGN of standard deviation 0.5, 1 and 5. The corrupted copies are shown below.
[Figure: the corrupted signals with noise std 0.5, 1 and 5, plotted against time]
Fig. 14 shows the spectrogram of the corrupted signal with σw = 0.5. The ten pairs of peaks can be well detected. The background should ideally be constant at the noise variance, but it is not constant here due to the windowing effect.
Fig. 15 shows the spectrogram of the corrupted signal with σw = 1. The peaks are still identifiable at this noise variance, though the background offset is greater.
However, as Fig.16 suggests, the peaks in the frequency domain of the corrupted signal with σw = 5
Figure 14: Spectrogram of the corrupted signals with noise std 0.5
[Figure 15: Spectrogram of the corrupted signal with noise std 1]
[Figure 16: Spectrogram of the corrupted signal with noise std 5]
are hardly identifiable. This also implies the importance of filtering the signal as a preprocessing step, rather than merely subtracting the noise variance in the frequency domain.
It was shown in the previous sections that the variance of the periodogram estimate can be reduced by averaging windows of the same length across the dataset. For the three RRI trials, the periodogram obtained directly from the original data and the averaged periodogram with window size 200 make it difficult to determine the correct respiration frequency, with only RRI3 showing a sensible peak.
[Figure 17: periodograms of the original data for RRI1, RRI2 and RRI3, against normalised frequency]
Figure 18: Averaged periodograms with window size 200 time points
Figure 19: Averaged periodograms with window size 100 time points
By setting the window length to 100 time points and averaging accordingly, it is possible to determine clean peaks in the periodograms for all three trials, as shown in Fig. 19. The PSD estimates of the three trials show different respiratory frequencies, peaking at 0.01, 0.02 and 0.03 respectively, which suggests their dominant frequencies (i.e. the respiration frequencies) are 0.02, 0.04 and 0.06 of the sampling frequency.
wopt = Rxx⁻¹ pzx   (1)

where

Rxx = [ rxx(0)     rxx(−1)       ...   rxx(−Nw)
        rxx(1)     rxx(0)        ...   rxx(−Nw + 1)
        ...         ...                 ...
        rxx(Nw)    rxx(Nw − 1)   ...   rxx(0)       ],
pzx = [ rzx(0), rzx(−1), ..., rzx(−Nw) ]ᵀ   (2)
where Nw is the order of the MA filter minus 1. When fed with a WGN, the MA filter with b = [1 2 3 2 1] and a = [1] scales the variance of the input noise by 1² + 2² + 3² + 2² + 1² = 19 times, and thus its standard deviation by √19 ≈ 4.359 times. As the different time points of a white noise are uncorrelated and a white noise is stationary, ideally

var(y[n]) = E{y[n]²} = E{x[n]² + (2x[n − 1])² + (3x[n − 2])² + (2x[n − 3])² + x[n − 4]²}
          = var(x[n]) + 4 var(x[n]) + 9 var(x[n]) + 4 var(x[n]) + var(x[n]) = 19 var(x[n])   (3)

However, the scaling is not exactly √19 for the filtered signals, and thus the coefficients are de-normalised by multiplying by the empirical standard deviation of the filter output before normalisation.
The standardised signal is then corrupted by a WGN with σw = 0.1. Given that the variance of the standardised signal is 1, the SNR is then 10 log10(σ²y/σ²w) = 20 dB. The typical resulting filter coefficients are b = [1.0048 1.9860 3.0049 1.9881 1.0048]. The average cumulative square error (CSE) over 1000 different trials is 0.0029, which suggests a good estimation of the coefficients.
Then we conduct five different experiments with σ²w = {0.1, 1, 2, 5, 10}.

σ²w   SNR(dB)   b1   b2   b3   b4   b5   CSE(avg)
0.1 10 0.9825 2.0279 2.9505 2.0173 0.9377 0.0113
1 0 1.0896 2.1033 2.9492 2.2277 1.0353 0.0982
2 -3 1.4715 1.7197 3.1560 1.7209 1.0232 0.1898
5 -7 0.7177 1.9841 2.8272 0.3609 1.4751 0.4725
10 -10 0.9241 1.7518 2.8195 1.6740 1.6867 0.9325
In the table above, the coefficients are taken from a single estimation and the CSE is averaged over 1000 different realisations. As the variance of the noise increases, it is clear that the cumulative square error increases proportionally with σ²w and the predictions get worse. The calculated wopt
is ideally invariant if we use Nw greater than 4: w5 is effectively zero, since the final term of pzx is effectively zero because its index exceeds the order of the filter. The results for the first five terms of wopt are effectively invariant because the ACF of a white noise is effectively zero at lags other than 0; the results may become more erroneous as more terms enter the calculation of Rxx⁻¹. The following are the average wopt over 100 experiments for Nw = {4, 5} and σ²w = 0.01.
b1 b2 b3 b4 b5 b6 CSE
Nw = 4 1.0162 1.9954 2.9479 1.9747 0.9882 — 0.0027
Nw = 5 1.0151 2.0198 2.9850 1.9722 0.9797 -0.0139 0.0030
The results in Table 2 agree with our assumption, the 6th term of wopt for Nw = 5 being effectively 0. With Nw = 5 the error is slightly higher than with Nw = 4, which also agrees with the assumption.
The number of flops needed to calculate wopt can then be estimated. Calculating the unbiased estimate of the ACF needs (N + 1)² operations, the matrix multiplication requires Nw(Nw − 1) operations, and we need a further O(N³w) operations to calculate the inverse of Rxx. Therefore, we may conclude that the Wiener filter is computationally complex and may be slow for certain practical usages; adaptive filters are thus important, as they increase the calculation speed drastically at the cost of accuracy.
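A sketch of the Wiener solution for the setup above; note that MATLAB's xcorr lag convention means the positive lags of the cross-correlation play the role of pzx here:

% Sketch: estimate the MA coefficients with w_opt = Rxx^{-1} p_zx.
Nw = 4; N = 1000;
x = randn(N, 1);                           % WGN input
z = filter([1 2 3 2 1], 1, x) + 0.1*randn(N, 1);   % noisy reference output
rxx = xcorr(x, Nw, 'unbiased');
Rxx = toeplitz(rxx(Nw+1:end));             % symmetric Toeplitz from rxx(0..Nw)
rzx = xcorr(z, x, Nw, 'unbiased');
pzx = rzx(Nw+1:end);                       % rzx at lags 0..Nw
w_opt = Rxx \ pzx(:)                       % should approach [1 2 3 2 1]'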
By calculating the coefficients adaptively using the formulas suggested in Eqns (39-41) of the coursework manual for the SNR = 20 dB case with µ = 0.01, a typical final estimate of wopt is [1.0098 2.0018 2.9928 1.9880 0.9920], which is close to the ideal value with a CSE of 0.0015; but as the weights are constantly changing, the results are not as good before the weights converge to their ideal values. A good adaptation gain is µ = 0.01, with which the coefficients converge after step 200 with minor fluctuations caused by the additive noise (Fig. 1). If the adaptation gain is too low, the weights do not have time to converge to their ideal values, and with too high an adaptation gain the coefficients oscillate constantly and have no chance to converge either. For µ = 0.02, the coefficients begin to converge to the Wiener filter results after approximately time step 100, as shown by Fig. 2(a), but the oscillations are clearly greater than for µ = 0.01, which suggests this adaptation gain is slightly above optimum.
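A sketch of the LMS loop in its standard form (the exact Eqns (39-41) of the coursework manual are not reproduced here):

% Sketch: LMS identification of the MA coefficients.
Nw = 4; N = 1000; mu = 0.01;
x = randn(N, 1);
z = filter([1 2 3 2 1], 1, x) + 0.1*randn(N, 1);
w = zeros(Nw+1, 1); e = zeros(N, 1);
for n = Nw+1:N
    xvec = x(n:-1:n-Nw);                   % current and past Nw inputs
    e(n) = z(n) - w' * xvec;               % instantaneous error
    w = w + mu * e(n) * xvec;              % LMS weight update
end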
[Figure 1: Coefficient evolution (left) and squared error (right) with LMS, µ = 0.01]
[Figure 2: Coefficient evolution and squared error with LMS, µ = 0.02]
The calculation of ŷ[n] requires Nw(Nw + 1) operations, the calculation of the error requires 1 operation, and the update of the parameters requires 2Nw + 3 operations (Nw + 1 for the vector addition, 1 for the calculation of µe[n], and Nw + 1 for the multiplication of a scalar with a vector). Therefore, for the whole process with N time steps, we need N(Nw(Nw + 1) + 2Nw + 3) operations, which is much less computationally complex than the Wiener filter.
Although with µ = 0.01 the coefficients are able to converge to their true values, there are some minor fluctuations and the rise time is not optimal. We then added an extra subroutine to schedule the adaptation rate as follows:

µ[n] = µ0 ((e[n] − e[n − 1]) / (e[n − 1] − e[n − 2]))^α ((N − n)/N)^β,   ∀n ≥ 2   (4)
After several trials, we empirically selected µ0 = 0.05, α = 0.6 and β = 3. As shown by Fig. 3, the overall squared error is reduced compared with Fig. 1, the maximum error not exceeding 25. The rise time is much shorter, as the coefficients converge to their true values after step 100 with little overshoot. The overall learning rate decays as the time step increases, and little fluctuation is evident for n > 400. Although there are still minor spikes after convergence, these are due to the additive noise and can cause oscillations if the second decaying factor is not set. By setting the second decaying factor, overfitting is much mitigated.
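A sketch of Eqn. (4) wired into the LMS loop above; the absolute value guarding the fractional power and the check on the denominator are added assumptions:

% Sketch: gear-shifted LMS with the scheduled adaptation gain of Eqn. (4).
mu0 = 0.05; alpha = 0.6; beta = 3;
Nw = 4; N = 1000;
x = randn(N, 1);
z = filter([1 2 3 2 1], 1, x) + 0.1*randn(N, 1);
w = zeros(Nw+1, 1); e = zeros(N, 1); mu = mu0;
for n = Nw+1:N
    xvec = x(n:-1:n-Nw);
    e(n) = z(n) - w' * xvec;
    if n >= Nw+3 && e(n-1) ~= e(n-2)       % schedule needs two past errors
        mu = mu0 * abs((e(n)-e(n-1))/(e(n-1)-e(n-2)))^alpha * ((N-n)/N)^beta;
    end
    w = w + mu * e(n) * xvec;
end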
Figure 3: Evolution of the five coefficients with µ0 = 0.05 with gear shifting
H(z) = X(z)/W(z) = 1/(1 − a1z⁻¹ − a2z⁻²)   (7)
the errors and coefficients are therefore determined by the LMS algorithm.
The LMS algorithm is then tested with 4 different adaptation gains (0.01, 0.002, 0.0001, 0.1). The input signal is a WGN with 10000 time points.
[Figure: evolution of the estimated AR coefficients for the four adaptation gains, against time step]
With µ = {0.01, 0.002}, the performances of the algorithm are similar, with MSEs w.r.t. the synthesised signal of 1.0283 and 1.0271; the predicted AR coefficients oscillate more with µ = 0.01. With µ = 0.0001, the adaptation gain is too small for the predictions to converge to their true values, and the MSE is thus much worse (1.3092). With µ = 0.1, the adaptation gain is too large and the predictions diverge to arbitrarily large values, with MSE = 25.2297.
The LMS algorithm can also be applied to determine the AR model corresponding to the pronunciation of a letter. We investigate AR models corresponding to the pronunciations of 'e', 'a', 's', 't' and 'x' in this section. The following table suggests sensible model orders and adaptation gains with and without gear shifting.
We may then explain two examples from the table in detail, 'e' and 'a'. For letter 'e', a sensible setting is AR(1) with µ = 10 without gear shifting and µ = 50 with gear shifting. The MSE is 5.89e-05 without gear shifting and 5.33e-05 with gear shifting.
[Figures: x and x̂ against time, and the coefficient evolution with and without gear shifting, for letters 'e' and 'a']
A typical well-converging model for letter 'a' is AR(10) with µ = 5. The MSE of the prediction is 6.95e-05. The coefficients converge to constants after time step 300, with minor changes observable due to the non-stationary nature of the input signal. With gear shifting, µ is set to 10. The fluctuation in the predicted coefficients is much reduced, but the MSE increases to 1.62e-04. The extrapolation error may nevertheless be reduced, due to less over-fitting.
The correct filter length, i.e. model order, may be determined heuristically by assessing the vari-
ability of the coefficients in a given model order; if the coefficients vary too much, we may conclude
that the given model order is not robust and the coefficients must change with time to fit the
data. MDL, AIC and AICc criteria may also be applied to determine the optimal model order
analytically, where the MSE may be used as the error term.
We may also assess the performance of AR models of different orders with the prediction gain Rp = 10 log10(σ²x/σ²e). Here we investigate the models without gear shifting, with the adaptation gains suggested in Table 3.
[Figure: prediction gain against model order for letters 'e', 'a', 's', 't' and 'x' at fs = 44100 Hz]
The plots above suggest that the sensible orders are well chosen. The prediction gains of 'e', 'a' and 't' drop quickly after certain model orders, and before that the Rp values remain constant, suggesting no big change from altering the model order. Such an effect is not observed in the Rp values of 's' and 'x', but the plots show that it is not worthwhile increasing the model order for the marginal increase in Rp.
For fs = 16000 Hz, the selection of model order remains unchanged according to the plots of Rp, but the Rp values at these orders are generally lower than those at fs = 44100 Hz. For example, the prediction gain for 'e' is 23 dB at order 1 when fs = 44100 Hz, but only 16.5 dB when fs = 16000 Hz. As we are sampling 1000 time points, a lower sampling frequency covers a longer time frame. The prediction results are therefore more exposed to the non-stationarity caused by changes within the pronunciation of a letter, for example the change from [k] to [s] in letter 'x', and are thus more likely not to converge.
[Figure: prediction gain against model order for the five letters at fs = 16000 Hz]
In order to reduce computational complexity, the LMS algorithm may be further simplified into the following sign variants, in which the exact contribution of the error term to the weight update is not explicitly calculated. We assess the following algorithms using the data generated in Section 4.4 and letter 'e' from Section 4.5.
The vanilla LMS algorithm, sign-error, signed-regressor and sign-sign variants are plotted in red, blue, yellow and magenta respectively. For the synthesised signal in Section 4.4, all algorithms converge to the theoretical AR coefficients with comparable performance, and the vanilla LMS algorithm slightly outperforms the others (MSE = {1.05, 1.06, 1.07, 1.09}). For letter 'e', the adaptation gains of the four algorithms must be set to distinct values to produce decent results, and are set to 50, 0.5, 2 and 0.02 respectively. The MSEs, listed in the same order, are 1.64e-05, 2.73e-05, 1.33e-05 and 1.89e-05. Although the result from the vanilla LMS is not the best by MSE, the evolution of its AR coefficient involves less fluctuation.
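For reference, a sketch of the four weight updates compared above; only the update line differs between the variants:

% Sketch: the four LMS update rules as function handles.
update = { ...
    @(w, mu, e, x) w + mu * e * x, ...              % vanilla LMS
    @(w, mu, e, x) w + mu * sign(e) * x, ...        % sign-error
    @(w, mu, e, x) w + mu * e * sign(x), ...        % signed-regressor
    @(w, mu, e, x) w + mu * sign(e) * sign(x)};     % sign-sign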
[Figure: coefficient evolution of the four LMS variants for the synthesised signal (left) and letter 'e' (right)]
The cost function that should be minimised for the maximum-likelihood estimate (MLE) is therefore

J(θ) = Σ_{n=0}^{N−1} (x[n] − A cos(2πf0n + ϕ))²   (2)
In general, a sinusoid can be split into a sine and a cosine component of the same frequency. Writing c = [1, cos(2πf0), ..., cos(2πf0(N − 1))]ᵀ and s = [0, sin(2πf0), ..., sin(2πf0(N − 1))]ᵀ, the squared part in Eqn. (2) becomes

x − α1c − α2s,   where x = [x[0], x[1], ..., x[N − 1]]ᵀ   (4)
which therefore can be expressed in a matrix form as x − Hα where H = [c, s] and α = [α1 , α2 ]T .
As the squared sum can be expressed by a dot product w.r.t. itself of a vector, it is proven that
J(θ) may be mapped to
J ′ (α1 , α2 , f0 ) = (x − α1 c − α2 s)T (x − α1 c − α2 s) = (x − Hα)T (x − Hα) = J ′ (α, f0 ) (5)
The optimum estimator of the parameters, denoted by α̂, is given by α̂ = (HT H)−1 HT x. By
plugging α̂ into Eqn.(5), the minimum loss given by α is therefore
J′min,α(f0) = (x − Hα̂)ᵀ(x − Hα̂)
            = (x − H(HᵀH)⁻¹Hᵀx)ᵀ(x − H(HᵀH)⁻¹Hᵀx)
            = xᵀx − 2xᵀH(HᵀH)⁻¹Hᵀx + (H(HᵀH)⁻¹Hᵀx)ᵀ(H(HᵀH)⁻¹Hᵀx)
            = xᵀx − 2xᵀH(HᵀH)⁻¹Hᵀx + xᵀH(HᵀH)⁻¹HᵀH(HᵀH)⁻¹Hᵀx
            = const. − xᵀH(HᵀH)⁻¹Hᵀx   (6)
Therefore, to minimise J′min,α(f0) by altering the frequency, we may maximise the non-constant term in Eqn. (6), namely xᵀH(HᵀH)⁻¹Hᵀx, as it is subtracted from the constant. ■
The term xᵀH(HᵀH)⁻¹Hᵀx is quadratic in x and is weighted by H(HᵀH)⁻¹Hᵀ. As H = [c, s],

xᵀH(HᵀH)⁻¹Hᵀx = xᵀ [c s] [ cᵀc  cᵀs ; sᵀc  sᵀs ]⁻¹ [ cᵀ ; sᵀ ] x   (7)

which can be rearranged into a quadratic form in the projections cᵀx and sᵀx:

xᵀH(HᵀH)⁻¹Hᵀx = [ cᵀx  sᵀx ] [ cᵀc  cᵀs ; sᵀc  sᵀs ]⁻¹ [ cᵀx ; sᵀx ]   (8)
By the orthogonality properties of sinusoids, HᵀH ≈ [ N/2  0 ; 0  N/2 ]: cosine and sine are orthogonal to each other, so the cross sums are 0, while the in-phase terms are summed over N samples with an average value of 1/2. Thus,

xᵀH(HᵀH)⁻¹Hᵀx = [ cᵀx  sᵀx ] [ N/2  0 ; 0  N/2 ]⁻¹ [ cᵀx ; sᵀx ] = (2/N)((cᵀx)² + (sᵀx)²)   (9)
The optimum estimate of f0 can then be found as the f0 that maximises

(2/N)(xᵀccᵀx + xᵀssᵀx) = (2/N)[(Σ_{n=0}^{N−1} x[n] cos(2πf0n))² + (Σ_{n=0}^{N−1} x[n] sin(2πf0n))²] = (2/N)|Σ_{n=0}^{N−1} x[n] e^{−j2πf0n}|²   (10)
as the cosine and sine terms can be treated as the real and imaginary parts of e^{−j2πf0n}. Comparing Eqn. (10) with the PSD estimate P̂X(f) = (1/N)|Σ_{n=0}^{N−1} x[n] e^{−j2πfn/N}|², the f0 in Eqn. (10) is a normalised frequency, f0 ∈ [0, 1], while the f term in the PSD formula indexes data points, the corresponding normalised frequency being f/N. Eqn. (10) is therefore nothing other than the PSD formula in normalised frequency, scaled by a factor of 2, provided the dataset is sufficiently large and the frequency is not close to 0 or 0.5. Therefore, the optimum frequency estimate f̂0 can be found by maximising the periodogram. ■
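A sketch of the resulting estimator: locate the periodogram peak over f ∈ [0, 0.5):

% Sketch: MLE of the frequency of a noisy sinusoid via the periodogram.
f0 = 0.25; N = 256; n = 0:N-1;
x = cos(2*pi*f0*n + pi/3) + 0.5*randn(1, N);
P = abs(fft(x)).^2 / N;                    % periodogram
[~, idx] = max(P(1:N/2));                  % restrict the search to [0, 0.5)
f0_hat = (idx - 1) / N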
[Figure: MLE estimates and periodograms for f0 = 0.25, 0.4 and 0.495, against normalised frequency]
From the plots of the periodograms and MLE estimates, the two are consistent with each other for f0 = 0.25 and f0 = 0.4, while for f0 = 0.495 the MLE estimate loses all information, becoming flat after f = 0.48. A theoretical justification is that as f0 approaches 0 or 0.5, the sine component s tends to the zero vector. The matrix HᵀH then becomes singular and thus not invertible, and the MLE estimate becomes meaningless as the result does not converge.