
Module C: Estimation

Outline:
ˆ Basic Estimation Theory: ML, MAP
ˆ Conditional Expectation and Mean Square Estimation
ˆ Orthogonality Principle and LMMSE Estimator

1
Estimation Theory

ˆ Main Question: Given an observation Y of a random variable X, how do we estimate X?
ˆ In other words, what is the function g such that X̂ = g(Y ) is the best estimator? How do we quantify “best”?
ˆ More generally: given a sequence of observations ŷ1 , . . . , ŷk , how do we estimate X?
ˆ Example (radar detection): Suppose that X is the radial distance of an aircraft from a radar station and Y = X + Z is the radar’s observed location, where Z is independent of X and Z ∼ N (0, σ²). What is the best estimator X̂ = g(Y ) of the location X?

2
Motivating Example

ˆ Let X be a random variable which is uniformly distributed over [0, θ].

ˆ We observe m samples of X denoted x̂1 , x̂2 , . . . , x̂m .

ˆ Problem: estimate θ given our observations.

ˆ Let the samples be {1, 2, 1.5, 1.75, 2, 1.3, 0.8, 0.3, 1}.

ˆ What is a good estimate of θ?

ˆ Can we find a function g(x̂1 , x̂2 , . . . , x̂m ) which will map any set of m samples into an estimate of θ? Such a function is termed an “estimator.” (A small numerical sketch follows this list.)

ˆ We often treat the observations as random variables that depend on the quantities that we are trying to estimate.

ˆ Case 1: The unknown quantity θ is assumed to be an unknown parameter/constant, with observation X ∼ distribution(θ).

ˆ Case 2: The unknown quantity θ is assumed to be a random variable.
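For the sample set listed above, one natural estimate is the largest observation, since every sample must lie in [0, θ] (this is, in fact, the maximum likelihood estimate for this model). A minimal numerical sketch in Python; the bias-corrected variant at the end is an extra assumption on my part, valid for i.i.d. uniform samples:

    import numpy as np

    samples = np.array([1, 2, 1.5, 1.75, 2, 1.3, 0.8, 0.3, 1])
    m = len(samples)

    # Any estimate below max(samples) is impossible, since every sample lies in [0, theta].
    theta_hat = samples.max()
    print(theta_hat)                  # 2.0

    # Bias-corrected variant for i.i.d. uniform samples: E[max] = m/(m+1) * theta.
    print((m + 1) / m * theta_hat)    # ~2.22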

3
Maximum Likelihood Estimation (θ is a parameter)

ˆ We observe X, which is assumed to be a random variable whose distribution depends on an unknown parameter θ.

ˆ When X is continuous, its density is fX (x; θ).

ˆ When X is discrete, its pmf is pX (x; θ).

ˆ When the observation is x̂, we define the likelihood function as

     L(θ|X = x̂) = fX (x̂; θ)   when X is continuous,
     L(θ|X = x̂) = pX (x̂; θ)   when X is discrete.

ˆ The maximum likelihood estimate of θ when X = x̂ is

     θ̂ML (x̂) := argmaxθ L(θ|X = x̂).

ˆ Thus, the maximum likelihood estimate is the value of θ which maximizes the likelihood of observing x̂.

4
Log Likelihood Estimation

ˆ We rarely estimate a quantity based on a single observation.

ˆ Suppose we have N i.i.d. observations {x̂1 , x̂2 , . . . , x̂N }, each drawn from the same distribution.

ˆ The likelihood function is then computed as

     L(θ|X1 = x̂1 , X2 = x̂2 , . . . , XN = x̂N ) = fX1 ,X2 ,...,XN (x̂1 , x̂2 , . . . , x̂N ; θ)
       = fX1 (x̂1 ; θ) × fX2 (x̂2 ; θ) × · · · × fXN (x̂N ; θ)   (due to independence of observations)
       = fX (x̂1 ; θ) × fX (x̂2 ; θ) × · · · × fX (x̂N ; θ)   (each Xi has an identical distribution)
       = ∏_{i=1}^N fX (x̂i ; θ) = ∏_{i=1}^N L(θ|Xi = x̂i ).

ˆ The product term is difficult to maximize. However, we can compute the log-likelihood as

     log L(θ|X1 = x̂1 , X2 = x̂2 , . . . , XN = x̂N ) = ∑_{i=1}^N log fX (x̂i ; θ),

which is often easier to maximize with respect to θ.
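As an illustration of this sum-of-log-densities pattern, here is a small Python sketch for an assumed model (not taken from these notes): N i.i.d. samples from N (θ, 1) with unknown mean θ, with the log-likelihood maximized over a grid:

    import numpy as np

    samples = np.array([1.2, 0.7, 1.9, 1.4, 0.9])   # hypothetical observations
    theta_grid = np.linspace(-5, 5, 10001)

    # log-likelihood: sum_i log f_X(x_i; theta) for the unit-variance Gaussian density
    ll = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (samples[:, None] - theta_grid) ** 2, axis=0)

    print(theta_grid[np.argmax(ll)])   # ~1.22, the sample mean, as expected for this model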

5
Example

ˆ Consider a random variable X defined as

     X = 1 with probability θ and X = 0 with probability 1 − θ,   θ ∈ [0, 1].

ˆ We observe {x̂1 , x̂2 , . . . , x̂N } with each x̂i ∈ {0, 1}.
ˆ Problem: find θ̂ML (x̂1 , x̂2 , . . . , x̂N ).
ˆ The likelihood function L(θ|X1 = x1 , X2 = x2 , . . . , XN = xN ) = _______.

ˆ The log-likelihood function log L(θ|X1 = x1 , X2 = x2 , . . . , XN = xN ) = _______.

ˆ Optimizing the log-likelihood function with respect to θ yields _______.

ˆ The ML estimator θ̂ML (X1 , X2 , . . . , XN ) is a r.v. that is a function of X1 , . . . , XN given by

     θ̂ML (X1 , X2 , . . . , XN ) = _______.

ˆ Now suppose X is a discrete random variable with p.m.f. θ = [θ1 θ2 . . . θN ], i.e., P(X = 1) = θ1 , P(X = 2) = θ2 , and so on. Then the likelihood function is L(θ|X = i) = θi . What is the likelihood function after N observations?
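A quick numeric sketch of the above with hypothetical 0/1 observations; it maximizes the Bernoulli log-likelihood over a grid, and the maximizer coincides with the sample mean (which you can confirm analytically when filling in the blanks):

    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # hypothetical observations
    theta = np.linspace(1e-6, 1 - 1e-6, 100001)

    # log-likelihood: sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    ll = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)

    print(theta[np.argmax(ll)], x.mean())         # both ~0.625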

6
Conditional distribution

ˆ Recall that the conditional probability of an event A given an event B is defined as

     P(A|B) = P(A ∩ B) / P(B).

ˆ Example: let X1 be the outcome of one coin toss with

     X1 = 1 with probability p and X1 = 0 with probability 1 − p.

ˆ Let X2 be the outcome of a second coin toss, where X2 has the same distribution as X1 .



ˆ Joint pmf:

     pX1 X2 (1, 1) = p² ,   pX1 X2 (1, 0) = p(1 − p),
     pX1 X2 (0, 1) = p(1 − p),   pX1 X2 (0, 0) = (1 − p)².

ˆ Conditional pmf of X1 conditioned on X2 :

     pX1 |X2 (x1 |X2 = x2 ) = P(X1 = x1 |X2 = x2 ) = P(X1 = x1 , X2 = x2 ) / P(X2 = x2 ).

ˆ The conditional pmf of X1 given X2 = 0 is given by:

     pX1 |X2 (0|X2 = 0) = P(X1 = 0|X2 = 0) = _______
     pX1 |X2 (1|X2 = 0) = P(X1 = 1|X2 = 0) = _______
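A small Python sketch of the same computation, with an assumed numeric value p = 0.3 purely for illustration:

    p = 0.3   # assumed value, for illustration only

    # joint pmf from above, indexed by (x1, x2)
    p_joint = {(1, 1): p**2, (1, 0): p*(1 - p), (0, 1): (1 - p)*p, (0, 0): (1 - p)**2}

    # marginal P(X2 = 0), then the conditional pmf of X1 given X2 = 0
    p_X2_0 = p_joint[(0, 0)] + p_joint[(1, 0)]
    for x1 in (0, 1):
        print(x1, p_joint[(x1, 0)] / p_X2_0)   # prints 1 - p and p, consistent with independence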

7
Conditional Distributions

ˆ Consider two discrete random variables X and Y . Let X take values from the set {x1 , . . . , xn } and let Y take values from the set {y1 , . . . , ym }.

ˆ The conditional pmf of X given Y = yj is given by:

     pX|Y (xi |Y = yj ) = P(X = xi |Y = yj ) = P(X = xi , Y = yj ) / P(Y = yj )   ∀i ∈ {1, 2, . . . , n}.

ˆ The numerator is obtained from the joint distribution of X and Y . The


denominator is obtained from the marginal distribution of Y .

ˆ For two continuous random variables X and Y , the conditional CDF is given by

     FX|Y (x|y) = P(X ≤ x|Y ≤ y) = FX,Y (x, y) / FY (y).

ˆ In this case, the conditional density of X given Y = y is given by

     fX|Y (x|y) = fX,Y (x, y) / fY (y).

8
Example

Consider two continuous random variables X and Y with joint density

     fXY (x, y) = x + y   if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
     fXY (x, y) = 0       otherwise.

Determine P(X < 1/4 | Y = 1/3) by deriving and using the conditional density of X given Y .
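A numerical cross-check in Python (using scipy's quad); this is not a substitute for the analytic derivation asked for:

    from scipy import integrate

    f = lambda x, y: x + y                                 # joint density on the unit square
    y0 = 1/3

    fY_y0, _ = integrate.quad(lambda x: f(x, y0), 0, 1)    # marginal f_Y(y0)
    num, _ = integrate.quad(lambda x: f(x, y0), 0, 0.25)   # numerator integrated over x < 1/4

    print(num / fY_y0)                                     # P(X < 1/4 | Y = 1/3) ~ 0.1375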

9
Example

Consider a random variable X whose density is given by

     fX (x) = 1   for 0 ≤ x ≤ 1,
     fX (x) = 0   otherwise.

The conditional density of Y given X = x is given by

     fY |X=x (y) = 1/(1 − x)   for x ≤ y ≤ 1,
     fY |X=x (y) = 0           otherwise.

Determine the marginal density of Y .
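A numerical sketch (assuming scipy is available) that evaluates the marginal pointwise, to compare against whatever closed form you derive:

    from scipy import integrate

    fX = lambda x: 1.0                                              # uniform density on [0, 1]
    fY_given_X = lambda y, x: 1.0/(1.0 - x) if x <= y <= 1 else 0.0

    def fY(y):
        # f_Y(y) = integral over x of f_{Y|X=x}(y) * f_X(x)
        val, _ = integrate.quad(lambda x: fY_given_X(y, x) * fX(x), 0, 1, points=[y])
        return val

    print([round(fY(y), 4) for y in (0.25, 0.5, 0.9)])              # compare with your closed form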

10
Maximum A-Posteriori (MAP) Estimation

ˆ ML estimators assume θ to be an unknown parameter. If instead θ is a r.v. with some distribution that is known, we use a Bayesian approach to estimate θ.

ˆ We assume a prior distribution fθ (θ) (or pθ (θ) in the discrete case) of θ that is known to us beforehand.

ˆ The conditional distribution fX|θ (x|θ) is also assumed to be known: the distribution of the observed quantity is known if the unknown parameter is exactly known.

ˆ Once we observe X = x̂, we find the posterior distribution using Bayes’ rule as:

     fθ|X (θ|X = x̂) = fθ,X (θ, x̂) / fX (x̂)
                     = fX|θ (x̂|θ)fθ (θ) / fX (x̂)
                     = fX|θ (x̂|θ)fθ (θ) / ∫θ fX|θ (x̂|θ)fθ (θ)dθ.

ˆ The MAP estimate is defined as:

     θ̂MAP (x̂) = argmaxθ fθ|X (θ|X = x̂) = argmaxθ fX|θ (x̂|θ)fθ (θ),

which is the mode of the posterior distribution.

11
Example (Previous year End Semester Question)

Suppose Θ is a random parameter, and given Θ = θ, the observed quantity Y has conditional density

     fY |Θ (y|θ) = (θ/2) e^{−θ|y|} ,   y ∈ R.
1. Find the Maximum Likelihood (ML) estimate of Θ based on the observation
Y = −0.5.

Suppose further that Θ has prior density given by fΘ (θ) = 1/θ for 1 ≤ θ ≤ e (and fΘ (θ) = 0 for θ < 1 and θ > e). Then,
2. find the Maximum A-Posteriori (MAP) estimate of Θ based on the observation Y = −0.5.
Answer: Θ̂ML (Y = −0.5) = 2, Θ̂MAP (Y = −0.5) = 1.
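A quick grid-search check of both answers in Python:

    import numpy as np

    y = -0.5
    theta = np.linspace(1e-3, 10, 100001)                # search grid for the ML estimate
    lik = (theta / 2) * np.exp(-theta * abs(y))          # f_{Y|Theta}(y | theta)
    print("ML  ~", theta[np.argmax(lik)])                # ~2.0

    theta_p = np.linspace(1.0, np.e, 100001)             # prior support [1, e]
    post = (theta_p / 2) * np.exp(-theta_p * abs(y)) / theta_p   # likelihood x prior, prior = 1/theta
    print("MAP ~", theta_p[np.argmax(post)])             # ~1.0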

12
Mean Square Estimation Theory

ˆ What counts as the “best” estimator is subjective, so we need to fix a criterion. One popular criterion is the Mean Square Error (MSE).

ˆ For measurements X1 , . . . , Xk of a random variable X, we define the MSE of a (measurable) estimator g : R^k → R to be

     E[|g(X1 , . . . , Xk ) − X|²].

ˆ In this setting, we view E[|U − X|²] as the squared distance between the random variables U and X.

ˆ Once we fix the MSE criterion for the best estimator, the problem of finding the best MSE estimator for X based on the measurements X1 , . . . , Xk can be formulated as:

     arg min_{g:R^k →R} E[|g(X1 , . . . , Xk ) − X|²].

ˆ Any g that minimizes the above criterion is called a Minimum Mean Square Error (MMSE) estimator.

ˆ When solving for MMSE, we always assume that all the random variables
involved have finite mean and variance.

13
MMSE

ˆ In practice, finding the MMSE estimator might be hard.


ˆ We can restrict our attention to special classes of functions g.
ˆ Let k = 0, and suppose that we want to find the best constant c that
estimates X. Note that in this case, we view c as a constant random
variable.

Objective: find c ∈ argmin_c E[|X − c|²].   (1)

ˆ Let X̄ = E[X]. Then,

     E[|X − c|²] = E[|X − X̄ + X̄ − c|²]
                 = E[(X − X̄)²] + 2(X̄ − c)E[X − X̄] + (X̄ − c)²
                 = E[(X − X̄)²] + (X̄ − c)²,

  since E[X − X̄] = 0.

ˆ Therefore, (1) is minimized when c = X̄, and the resulting MMSE value is Var(X).
ˆ Estimation theory interpretation of mean and variance: The best
constant MMSE estimator of X is E[X] and the corresponding MMSE value
is Var(X).

14
Conditional Expectation

Example: Let X, Y be discrete r.v with (X, Y ∈ {1, 2}) and joint pmf:
     P[X = 1, Y = 1] = 1/2 ,   P[X = 1, Y = 2] = 1/10 ,
     P[X = 2, Y = 1] = 1/10 ,   P[X = 2, Y = 2] = 3/10 .
ˆ Determine the marginal pmf of X and Y .
ˆ Show that the conditional pmf of X given Y = 1 is

     P[X|Y = 1] = 5/6 if X = 1,   1/6 if X = 2.

ˆ We can then compute

     E[X|Y = 1] = ∑_{x∈X} x P[X = x|Y = 1] = _______.

ˆ Similarly, show that the conditional pmf of X given Y = 2 is

     P[X|Y = 2] = 1/4 if X = 1,   3/4 if X = 2.

ˆ Then, E[X|Y = 2] = _______.
ˆ We can view E[X|Y ] as a function of Y :

     g(Y ) = E[X|Y ] = E[X|Y = 1] with probability P[Y = 1],
                       E[X|Y = 2] with probability P[Y = 2].

ˆ Now, determine E[g(Y )].


ˆ Determine E[X]. What do you notice?
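A quick numeric check of this slide's computations in Python (it spells out the answers, so treat it as a check rather than a first attempt):

    import numpy as np

    # joint pmf P[X = x, Y = y]; rows index x = 1, 2 and columns index y = 1, 2
    P = np.array([[0.5, 0.1],
                  [0.1, 0.3]])
    xs = np.array([1, 2])

    pY = P.sum(axis=0)                                  # marginal pmf of Y
    E_X_given_Y = (xs[:, None] * P).sum(axis=0) / pY    # E[X | Y = y] for y = 1, 2
    print(E_X_given_Y)                                  # [7/6, 7/4]

    E_gY = (E_X_given_Y * pY).sum()                     # E[g(Y)] = E[E[X | Y]]
    E_X = (xs * P.sum(axis=1)).sum()                    # E[X]
    print(E_gY, E_X)                                    # both 1.4 -- they agree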

15
Conditional Expectation

ˆ If the value of Y is specified, then E[X|Y = y] is a scalar.


ˆ Otherwise, E[X|Y ] is a random variable which is a function of Y : for ω1 ̸= ω2 , Y (ω1 ) = Y (ω2 ) ⇒ E[X|Y = Y (ω1 )] = E[X|Y = Y (ω2 )].

ˆ For two continuous random variables X, Y ,

     E[X|Y = y] = ∫x x fX|Y (x | Y = y) dx = ∫x x [fX,Y (x, y)/fY (y)] dx.

ˆ Similarly,

     E[h(X)|Y = y] = ∫x h(x) fX|Y (x | Y = y) dx,
     E[l(X, Y )|Y = y] = ∫x l(x, y) fX|Y (x | Y = y) dx.

ˆ If the value of Y is not specified, E[l(X, Y )|Y ] is a random variable.

16
Example

Let X and Y be two independent random variables with

     X = 1 with probability 1/2,   X = 0 with probability 1/2.

Let Y have the same distribution as X. Let Z = X + Y .
ˆ Determine the pmf of Z.
ˆ Find the conditional distribution and expectation of X when Z = 1 and Z = 2.
ˆ Find the conditional distribution and expectation of Z when X = 1.

17
Properties of Conditional Expectation

ˆ Linearity: E[aX + bY |Z] = aE[X|Z] + bE[Y |Z] a.e.

ˆ Monotonicity: X ≤ Y ⇒ E[X|Z] ≤ E[Y |Z] a.e.

ˆ Identity: E[Y |Y = y] = y. What is the conditional distribution of Y when its value is specified? Determine E[Y |Y ] and E[g(Y )|Y ].

ˆ Independence: Suppose X and Y are independent. Then,

     E[X | Y = y] = ∫x x fX|Y (x | Y = y) dx
                  = ∫x x [fX,Y (x, y)/fY (y)] dx = ∫x x fX (x) dx = E[X],

  independent of the value of Y = y.


In other words, since E[X | Y = y] = E[X] for every y, the random variable E[X | Y ] equals the constant E[X].

Similarly, E[g(X) | Y ] = E[g(X)].

ˆ E[Xg(Y )|Y ] = g(Y )E[X|Y ].

18
Tower Property and Orthogonality

Tower Property:
E[E[X|Y ]] = E[X].
Proof:

     E[E[X|Y ]] = ∫y E[X|Y = y] fY (y) dy
                = ∫y ( ∫x x fX|Y (x | Y = y) dx ) fY (y) dy
                = ∫x x ( ∫y fX|Y (x | Y = y) fY (y) dy ) dx      (since fX|Y (x | Y = y) fY (y) = fX,Y (x, y))
                = ∫x x ( ∫y fX,Y (x, y) dy ) dx                  (the inner integral is fX (x))
                = ∫x x fX (x) dx = E[X].

Orthogonality: for any measurable function g,

E[(X − E[X|Y ])g(Y )] = 0.

That is, (X − E[X|Y ]) is orthogonal to any function g(Y ) of Y .


Proof:
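One standard argument, using the tower property and the fact that g(Y ) can be pulled out of a conditional expectation given Y :

     E[(X − E[X|Y ])g(Y )] = E[ E[(X − E[X|Y ])g(Y ) | Y ] ] = E[ g(Y )(E[X|Y ] − E[X|Y ]) ] = 0.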

19
Minimum Mean Square Estimator (MMSE)

Proposition: Let g(Y ) be an estimator of X, and let the mean square estimation error be defined as E[(X − g(Y ))²]. Then,

     E[(X − E[X|Y ])²] ≤ E[(X − g(Y ))²],   for all measurable g.

Proof:

     E[(X − g(Y ))²] = E[(X − E[X | Y ] + E[X | Y ] − g(Y ))²]
                     = E[(X − E[X | Y ])²] + 2 E[(X − E[X | Y ])(E[X | Y ] − g(Y ))] + E[(E[X | Y ] − g(Y ))²].

Since E[X | Y ] − g(Y ) is a function of Y , the cross term vanishes by the orthogonality property, and the last term is nonnegative. Hence E[(X − g(Y ))²] ≥ E[(X − E[X | Y ])²].

20
L2(Ω, F, P) Space of Random Variables

ˆ We define L2 (Ω, F, P) (or simply L2 ) to be the set of random variables with finite second moment, i.e., L2 = {X | E[X²] < ∞}.
ˆ Properties of L2 :
– L2 is a linear subspace of random variables:
(i) aX ∈ L2 for all X ∈ L2 and a ∈ R as E[(aX)2 ] = a2 E[X 2 ] < ∞,
and
(ii) X + Y ∈ L2 for all X, Y ∈ L2

– The most important property: L2 is an inner-product space. For any two random variables X, Y ∈ L2 , let us define their inner product

     X · Y := E[XY ].

– Then this operation satisfies the axioms of an inner product:


(i) X · X = E[X 2 ] ≥ 0.
(ii) X · X = 0 iff X = 0 almost surely.
(iii) linearity: (αX + Y ) · Z = α(X · Z) + Y · Z.
ˆ Therefore, L2 is a normed vector space, with the norm ∥ · ∥ defined by

     ∥X∥ := √(X · X) = √(E[X²]).

ˆ Similarly, we have ∥X − Y ∥² := (X − Y ) · (X − Y ) = E[(X − Y )²].

21
L2-norm and L2 convergence

ˆ Since L2 is a normed space, we can define a new limit of random variables:


Definition 1. We say that a sequence {Xk } converges in L2 (or in MSE
sense) to X if limk→∞ ∥X − Xk ∥ = 0.

ˆ Note that limk→∞ ∥X − Xk ∥ = 0 iff limk→∞ E[|X − Xk |2 ] = 0.

ˆ Definition: We say that H ⊆ L2 is a linear subspace if


(i) for any X, Y ∈ H, we have X + Y ∈ H, and
(ii) for any X ∈ H and a ∈ R, aX ∈ H.

ˆ Definition: We say that H ⊆ L2 is closed if for any sequence {Xk } ⊆ H with

     lim_{m,n→∞} ∥Xm − Xn ∥² = lim_{m,n→∞} E[|Xm − Xn |²] = 0,

  we have Xk → X in L2 for some random variable X ∈ H.

ˆ Showing that a set is a linear subspace is usually easy, but showing closedness might be hard.

ˆ Important Cases:
1. For random variables X1 , . . . , Xk ∈ L2 , the set H = {α1 X1 + . . . +
αk Xk | αi ∈ R} is a closed linear subspace.
2. For any random variables X1 , . . . , Xk ∈ L2 , the set H = {α0 + α1 X1 +
. . . + αk Xk | αi ∈ R} is a closed linear subspace.

22
Orthogonality Principle

Theorem 1. Let H be a closed linear subspace of L2 and let X ∈ L2 . Then,


a. There exists a unique (up to almost sure equivalence) random variable
Y ⋆ ∈ H such that

∥Y ⋆ − X∥² ≤ ∥Z − X∥²,   for all Z ∈ H.

b. Let W be a random variable. W = Y ⋆ a.e. if and only if W ∈ H and

E[(X − W )Z] = 0, for all Z ∈ H.

Note:
ˆ Y ⋆ is called the projection of X on the subspace H and is denoted by
ΠH (X).
ˆ Two random variables X, Y are orthogonal, X ⊥ Y , if E[XY ] = 0.
ˆ Relate the MMSE estimator to the above theorem.

23
Linear Minimum Mean Square Error (LMMSE) Estimation

ˆ Let Y be a measurement of X, and we want to find an estimate of X which is a linear function of Y minimizing the mean square error. The estimator is of the form X̂LMMSE (Y ) = aY + b. The goal is to find coefficients a∗ , b∗ ∈ R such that

     ∥X − (a∗ Y + b∗ )∥ ≤ ∥X − (aY + b)∥,   for any a, b ∈ R.

ˆ Let L(Y ) := {Z | Z = aY + b, a, b ∈ R} be the set of random variables that are linear functions of Y . One can show that L(Y ) is a closed linear subspace.
ˆ Then, X̂LMMSE (Y ) = ΠL(Y ) (X), the projection of X onto L(Y ).

ˆ From the orthogonality property, we know that E[(X − X̂LMMSE (Y ))Z] = 0 for all Z ∈ L(Y ).
ˆ Show that the coefficients a∗ , b∗ satisfy

     a∗ = Cov(X, Y ) / Var(Y ),   b∗ = E[X] − a∗ E[Y ].

ˆ Thus, the LMMSE estimate is

     X̂(Y ) := a∗ Y + b∗ = a∗ (Y − E[Y ]) + E[X] = E[X] + [Cov(X, Y )/Var(Y )] (Y − E[Y ]).

ˆ We can verify that (X − X̂) ⊥ (αY + β) for all α, β ∈ R.


ˆ What is the mean square estimation error?
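A small simulated sanity check in Python; the model below (a latent X observed through additive independent noise) is an assumption of mine, not taken from the notes:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(1.0, 2.0, 200_000)            # latent quantity (assumed model)
    Y = X + rng.normal(0.0, 1.0, 200_000)        # noisy measurement

    C = np.cov(X, Y)
    a = C[0, 1] / C[1, 1]                        # a* = Cov(X, Y) / Var(Y)
    b = X.mean() - a * Y.mean()                  # b* = E[X] - a* E[Y]
    Xhat = a * Y + b

    # Orthogonality check: the error X - Xhat should be (nearly) orthogonal to 1 and to Y.
    print(np.mean(X - Xhat), np.mean((X - Xhat) * Y))   # both ~0
    print(np.mean((X - Xhat) ** 2))                     # empirical mean square error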

24
Derivation of LMMSE Coefficients
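Presumably this slide works out the single-measurement case from the previous slide; a brief sketch of one route, using the orthogonality conditions with Z = 1 and Z = Y − E[Y ]:

ˆ Z = 1: E[X − a∗ Y − b∗ ] = 0, which gives b∗ = E[X] − a∗ E[Y ].
ˆ Z = Y − E[Y ]: E[(X − a∗ Y − b∗ )(Y − E[Y ])] = 0. Substituting b∗ , the left-hand side becomes E[((X − E[X]) − a∗ (Y − E[Y ]))(Y − E[Y ])] = Cov(X, Y ) − a∗ Var(Y ), so a∗ = Cov(X, Y )/Var(Y ).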

25
LMMSE Coefficients for Multiple Observations

ˆ Let Y = [Y1 , . . . , Yk ]⊤ be the measurements available to us.
ˆ We wish to determine X̂LMMSE (Y ) = a0 + ∑_{i=1}^k ai Yi = ΠL(Y ) (X), where L(Y ) is now the set of linear functions a0 + ∑_{i=1}^k ai Yi .

ˆ The goal is to find coefficients that minimize the mean square error:

     min_{a0 ,a1 ,...,ak } E[(X − (a0 + ∑_{i=1}^k ai Yi ))²].

ˆ Due to the orthogonality property, the LMMSE estimator satisfies

     E[(X − (a∗0 + ∑_{i=1}^k a∗i Yi ))Z] = 0   ∀Z ∈ L(Y ).

ˆ We need to cleverly choose k + 1 elements from L(Y ) to set up a system of k + 1 linear equations and solve for the coefficients.

26
Derivation of LMMSE Coefficients

ˆ Hint: Choose 1 and Yi − E[Yi ] for all i ∈ {1, 2, . . . , k}.


ˆ If Z = 1, then orthogonality yields

     E[X − (a∗0 + ∑_{i=1}^k a∗i Yi )] = 0.

ˆ If Z = Yj − E[Yj ], then orthogonality yields

     E[(X − (a∗0 + ∑_{i=1}^k a∗i Yi ))(Yj − E[Yj ])] = 0.

27
Derivation of LMMSE Coefficients

ˆ Finally, from the above analysis, we obtain

     [a∗1 , a∗2 , . . . , a∗k ]⊤ = [Cov(Y )]⁻¹ Cov(X, Y ).

ˆ The LMMSE estimate is given by

     X̂LMMSE (Y ) = a∗0 + ∑_{i=1}^k a∗i Yi
                 = E[X] + ∑_{i=1}^k a∗i (Yi − E[Yi ])
                 = E[X] + (a∗ )⊤ [Y − E[Y ]]
                 = E[X] + Cov(X, Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]].

 
ˆ When X = [X1 , X2 , . . . , Xn ]⊤ is also a random vector, the LMMSE estimate is given componentwise by

     X̂i,LMMSE (Y ) = E[Xi ] + Cov(Xi , Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]],   i = 1, . . . , n.

28
Example (Previous year End-Sem Question)

X is a three-dimensional random vector with E[X] = 0 and autocorrelation matrix RX with elements rij = (−0.80)^|i−j| . Use X1 and X2 to form a linear estimate of X3 , i.e., X̂3 = a1 X1 + a2 X2 : determine a1 and a2 that minimize the mean-square error.
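A sketch of the computation in Python; since E[X] = 0, RX doubles as the covariance matrix, so the normal equations from the previous slides apply directly:

    import numpy as np

    r = lambda i, j: (-0.80) ** abs(i - j)    # autocorrelation entries r_ij

    R_Y = np.array([[r(1, 1), r(1, 2)],
                    [r(2, 1), r(2, 2)]])      # covariance of the observations (X1, X2)
    r_XY = np.array([r(1, 3), r(2, 3)])       # covariance of (X1, X2) with X3

    a = np.linalg.solve(R_Y, r_XY)            # solves R_Y a = r_XY
    print(a)                                  # [a1, a2] ~ [0.0, -0.8]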

29
MMSE and LMMSE Estimator Comparison

ˆ An estimator X̂(Y ) is unbiased if E[X̂(Y )] = E[X].

– Is MMSE estimator unbiased?


– Is LMMSE estimator unbiased?

ˆ Among MMSE and LMMSE estimators, which one has smaller estimation
error?

ˆ If X and Y are uncorrelated, what does the LMMSE estimator give us?
What about MMSE estimator?

ˆ What do you need to know to determine MMSE and LMMSE estimators?

ˆ What if Cov(Y ) is not invertible?

ˆ When X and Y are jointly Gaussian,

     X̂LMMSE (Y ) = X̂MMSE (Y ),   i.e.,   E[X|Y ] = E[X] + Cov(X, Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]].

  The conditional expectation of X given Y is a linear function of Y .

30
