
Module C: Estimation

Outline:
ˆ Basic Estimation Theory: ML, MAP
ˆ Conditional Expectation and Mean Square Estimation
ˆ Orthogonality Principle and LMMSE Estimator

1
Estimation Theory

ˆ Main Question: Given an observation Y of a random variable X, how do we estimate X?
ˆ In other words, what is the function g such that X̂ = g(Y ) is the best estimator? How do we quantify “best”?
ˆ More generally: given a sequence of observations ŷ1 , . . . , ŷk , how do we estimate X?
ˆ Example (radar detection): Suppose that X is the radial distance of an aircraft from a radar station and Y = X + Z is the radar’s observed location, where Z is independent of X and Z ∼ N (0, σ²). What is the best estimator X̂ = g(Y ) of the location X?

2
Motivating Example

ˆ Let X be a random variable which is uniformly distributed over [0, θ].

ˆ We observe m samples of X denoted x̂1 , x̂2 , . . . , x̂m .

ˆ Problem: estimate θ given our observations.

ˆ Let the samples be {1, 2, 1.5, 1.75, 2, 1.3, 0.8, 0.3, 1}.

ˆ What is a good estimate of θ?

ˆ Can we find a function g(x̂1 , x̂2 , . . . , x̂m ) which will map any set of m samples into an estimate of θ? Such a function is termed an “estimator.” (A small numerical sketch follows this list.)

ˆ We often treat the observations as random variables that depend on the quantities that we are trying to estimate.

ˆ Case 1: The unknown quantity θ is assumed to be an unknown parameter/constant, with observation X ∼ distribution(θ).

ˆ Case 2: The unknown quantity θ is assumed to be a random variable.
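For the sample set listed above, one natural estimate is the largest observation, since every sample must lie in [0, θ] (this is, in fact, the maximum likelihood estimate for this model). A minimal numerical sketch in Python; the bias-corrected variant at the end is an extra assumption on my part, valid for i.i.d. uniform samples:

    import numpy as np

    samples = np.array([1, 2, 1.5, 1.75, 2, 1.3, 0.8, 0.3, 1])
    m = len(samples)

    # Any estimate below max(samples) is impossible, since every sample lies in [0, theta].
    theta_hat = samples.max()
    print(theta_hat)                  # 2.0

    # Bias-corrected variant for i.i.d. uniform samples: E[max] = m/(m+1) * theta.
    print((m + 1) / m * theta_hat)    # ~2.22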

3
Maximum Likelihood Estimation (θ is a parameter)

ˆ We observe X, which is assumed to be a random variable whose distribution depends on an unknown parameter θ.

ˆ When X is continuous, its density is fX (x; θ).

ˆ When X is discrete, its pmf is pX (x; θ).

ˆ When the observation is x̂, we define the likelihood function as

     L(θ|X = x̂) = fX (x̂; θ)   when X is continuous,
     L(θ|X = x̂) = pX (x̂; θ)   when X is discrete.

ˆ The maximum likelihood estimate of θ when X = x̂ is

     θ̂ML (x̂) := argmaxθ L(θ|X = x̂).

ˆ Thus, the maximum likelihood estimate is the value of θ which maximizes the likelihood of observing x̂.

4
Log Likelihood Estimation

ˆ We rarely estimate a quantity based on a single observation.

ˆ Suppose we have N i.i.d. observations {x̂1 , x̂2 , . . . , x̂N }, each drawn from the same distribution.

ˆ The likelihood function is then computed as

     L(θ|X1 = x̂1 , X2 = x̂2 , . . . , XN = x̂N ) = fX1 ,X2 ,...,XN (x̂1 , x̂2 , . . . , x̂N ; θ)
       = fX1 (x̂1 ; θ) × fX2 (x̂2 ; θ) × · · · × fXN (x̂N ; θ)   (due to independence of observations)
       = fX (x̂1 ; θ) × fX (x̂2 ; θ) × · · · × fX (x̂N ; θ)   (each Xi has an identical distribution)
       = ∏_{i=1}^N fX (x̂i ; θ) = ∏_{i=1}^N L(θ|Xi = x̂i ).

ˆ The product term is difficult to maximize. However, we can compute the log-likelihood as

     log L(θ|X1 = x̂1 , X2 = x̂2 , . . . , XN = x̂N ) = ∑_{i=1}^N log fX (x̂i ; θ),

which is often easier to maximize with respect to θ.
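As an illustration of this sum-of-log-densities pattern, here is a small Python sketch for an assumed model (not taken from these notes): N i.i.d. samples from N (θ, 1) with unknown mean θ, with the log-likelihood maximized over a grid:

    import numpy as np

    samples = np.array([1.2, 0.7, 1.9, 1.4, 0.9])   # hypothetical observations
    theta_grid = np.linspace(-5, 5, 10001)

    # log-likelihood: sum_i log f_X(x_i; theta) for the unit-variance Gaussian density
    ll = np.sum(-0.5 * np.log(2 * np.pi) - 0.5 * (samples[:, None] - theta_grid) ** 2, axis=0)

    print(theta_grid[np.argmax(ll)])   # ~1.22, the sample mean, as expected for this model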

5
Example

ˆ Consider a random variable X defined as

     X = 1 with probability θ and X = 0 with probability 1 − θ,   θ ∈ [0, 1].

ˆ We observe {x̂1 , x̂2 , . . . , x̂N } with each x̂i ∈ {0, 1}.
ˆ Problem: find θ̂ML (x̂1 , x̂2 , . . . , x̂N ).
ˆ The likelihood function L(θ|X1 = x1 , X2 = x2 , . . . , XN = xN ) = _______.

ˆ The log-likelihood function log L(θ|X1 = x1 , X2 = x2 , . . . , XN = xN ) = _______.

ˆ Optimizing the log-likelihood function with respect to θ yields _______.

ˆ The ML estimator θ̂ML (X1 , X2 , . . . , XN ) is a r.v. that is a function of X1 , . . . , XN given by

     θ̂ML (X1 , X2 , . . . , XN ) = _______.

ˆ Now suppose X is a discrete random variable with p.m.f. θ = [θ1 θ2 . . . θN ], i.e., P(X = 1) = θ1 , P(X = 2) = θ2 , and so on. Then the likelihood function is L(θ|X = i) = θi . What is the likelihood function after N observations?
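A quick numeric sketch of the above with hypothetical 0/1 observations; it maximizes the Bernoulli log-likelihood over a grid, and the maximizer coincides with the sample mean (which you can confirm analytically when filling in the blanks):

    import numpy as np

    x = np.array([1, 0, 1, 1, 0, 1, 0, 1])        # hypothetical observations
    theta = np.linspace(1e-6, 1 - 1e-6, 100001)

    # log-likelihood: sum_i [ x_i log(theta) + (1 - x_i) log(1 - theta) ]
    ll = x.sum() * np.log(theta) + (len(x) - x.sum()) * np.log(1 - theta)

    print(theta[np.argmax(ll)], x.mean())         # both ~0.625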

6
Conditional distribution

ˆ Recall that the conditional probability of an event A given an event B is defined as

     P(A|B) = P(A ∩ B) / P(B).

ˆ Example: let X1 be the outcome of one coin toss with

     X1 = 1 with probability p and X1 = 0 with probability 1 − p.

ˆ Let X2 be the outcome of a second coin toss, where X2 has the same distribution as X1 .



ˆ Joint pmf:

     pX1 X2 (1, 1) = p² ,   pX1 X2 (1, 0) = p(1 − p),
     pX1 X2 (0, 1) = p(1 − p),   pX1 X2 (0, 0) = (1 − p)².

ˆ Conditional pmf of X1 conditioned on X2 :

     pX1 |X2 (x1 |X2 = x2 ) = P(X1 = x1 |X2 = x2 ) = P(X1 = x1 , X2 = x2 ) / P(X2 = x2 ).

ˆ The conditional pmf of X1 given X2 = 0 is given by:

     pX1 |X2 (0|X2 = 0) = P(X1 = 0|X2 = 0) = _______
     pX1 |X2 (1|X2 = 0) = P(X1 = 1|X2 = 0) = _______
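A small Python sketch of the same computation, with an assumed numeric value p = 0.3 purely for illustration:

    p = 0.3   # assumed value, for illustration only

    # joint pmf from above, indexed by (x1, x2)
    p_joint = {(1, 1): p**2, (1, 0): p*(1 - p), (0, 1): (1 - p)*p, (0, 0): (1 - p)**2}

    # marginal P(X2 = 0), then the conditional pmf of X1 given X2 = 0
    p_X2_0 = p_joint[(0, 0)] + p_joint[(1, 0)]
    for x1 in (0, 1):
        print(x1, p_joint[(x1, 0)] / p_X2_0)   # prints 1 - p and p, consistent with independence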

7
Conditional Distributions

ˆ Consider two discrete random variables X and Y . Let X take values from the set {x1 , . . . , xn } and let Y take values from the set {y1 , . . . , ym }.

ˆ The conditional pmf of X given Y = yj is given by:

     pX|Y (xi |Y = yj ) = P(X = xi |Y = yj ) = P(X = xi , Y = yj ) / P(Y = yj )   ∀i ∈ {1, 2, . . . , n}.

ˆ The numerator is obtained from the joint distribution of X and Y . The


denominator is obtained from the marginal distribution of Y .

ˆ For two continuous random variables X and Y , the conditional CDF is given by

     FX|Y (x|y) = P(X ≤ x|Y ≤ y) = FX,Y (x, y) / FY (y).

ˆ In this case, the conditional density of X given Y = y is given by

     fX|Y (x|y) = fX,Y (x, y) / fY (y).

8
Example

Consider two continuous random variables X and Y with joint density

     fXY (x, y) = x + y   if 0 ≤ x ≤ 1, 0 ≤ y ≤ 1,
     fXY (x, y) = 0       otherwise.

Determine P(X < 1/4 | Y = 1/3) by deriving and using the conditional density of X given Y .
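A numerical cross-check in Python (using scipy's quad); this is not a substitute for the analytic derivation asked for:

    from scipy import integrate

    f = lambda x, y: x + y                                 # joint density on the unit square
    y0 = 1/3

    fY_y0, _ = integrate.quad(lambda x: f(x, y0), 0, 1)    # marginal f_Y(y0)
    num, _ = integrate.quad(lambda x: f(x, y0), 0, 0.25)   # numerator integrated over x < 1/4

    print(num / fY_y0)                                     # P(X < 1/4 | Y = 1/3) ~ 0.1375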

9
Example

Consider a random variable X whose density is given by

     fX (x) = 1   for 0 ≤ x ≤ 1,
     fX (x) = 0   otherwise.

The conditional density of Y given X = x is given by

     fY |X=x (y) = 1/(1 − x)   for x ≤ y ≤ 1,
     fY |X=x (y) = 0           otherwise.

Determine the marginal density of Y .
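A numerical sketch (assuming scipy is available) that evaluates the marginal pointwise, to compare against whatever closed form you derive:

    from scipy import integrate

    fX = lambda x: 1.0                                              # uniform density on [0, 1]
    fY_given_X = lambda y, x: 1.0/(1.0 - x) if x <= y <= 1 else 0.0

    def fY(y):
        # f_Y(y) = integral over x of f_{Y|X=x}(y) * f_X(x)
        val, _ = integrate.quad(lambda x: fY_given_X(y, x) * fX(x), 0, 1, points=[y])
        return val

    print([round(fY(y), 4) for y in (0.25, 0.5, 0.9)])              # compare with your closed form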

10
Maximum A-Posteriori (MAP) Estimation

ˆ ML estimators assume θ to be an unknown parameter. If instead θ is a r.v. with some distribution that is known, we use a Bayesian approach to estimate θ.

ˆ We assume a prior distribution fθ (θ) (or pθ (θ) in the discrete case) of θ that is known to us beforehand.

ˆ The conditional distribution fX|θ (x|θ) is also assumed to be known: the distribution of the observed quantity is known if the unknown parameter is exactly known.

ˆ Once we observe X = x̂, we find the posterior distribution using Bayes’ rule as:

     fθ|X (θ|X = x̂) = fθ,X (θ, x̂) / fX (x̂)
                     = fX|θ (x̂|θ)fθ (θ) / fX (x̂)
                     = fX|θ (x̂|θ)fθ (θ) / ∫θ fX|θ (x̂|θ)fθ (θ)dθ.

ˆ The MAP estimate is defined as:

     θ̂MAP (x̂) = argmaxθ fθ|X (θ|X = x̂) = argmaxθ fX|θ (x̂|θ)fθ (θ),

which is the mode of the posterior distribution.

11
Example (Previous year End Semester Question)

Suppose Θ is a random parameter, and given Θ = θ, the observed quantity Y has conditional density

     fY |Θ (y|θ) = (θ/2) e^{−θ|y|} ,   y ∈ R.
1. Find the Maximum Likelihood (ML) estimate of Θ based on the observation
Y = −0.5.

Suppose further that Θ has prior density given by fΘ (θ) = 1/θ for 1 ≤ θ ≤ e (and fΘ (θ) = 0 for θ < 1 and θ > e). Then,
2. find the Maximum A-Posteriori (MAP) estimate of Θ based on the observation Y = −0.5.
Answer: Θ̂ML (Y = −0.5) = 2, Θ̂MAP (Y = −0.5) = 1.
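A quick grid-search check of both answers in Python:

    import numpy as np

    y = -0.5
    theta = np.linspace(1e-3, 10, 100001)                # search grid for the ML estimate
    lik = (theta / 2) * np.exp(-theta * abs(y))          # f_{Y|Theta}(y | theta)
    print("ML  ~", theta[np.argmax(lik)])                # ~2.0

    theta_p = np.linspace(1.0, np.e, 100001)             # prior support [1, e]
    post = (theta_p / 2) * np.exp(-theta_p * abs(y)) / theta_p   # likelihood x prior, prior = 1/theta
    print("MAP ~", theta_p[np.argmax(post)])             # ~1.0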

12
Mean Square Estimation Theory

ˆ What counts as the “best” estimator is subjective, so we need to fix a criterion. One popular criterion is the Mean Square Error (MSE).

ˆ For measurements X1 , . . . , Xk of a random variable X, we define the MSE of a (measurable) estimator g : R^k → R to be

     E[|g(X1 , . . . , Xk ) − X|²].

ˆ In this setting, we view E[|U − X|²] as the squared distance between the random variables U and X.

ˆ Once we fix the MSE criterion for the best estimator, the problem of finding the best MSE estimator for X based on the measurements X1 , . . . , Xk can be formulated as:

     arg min_{g:R^k →R} E[|g(X1 , . . . , Xk ) − X|²].

ˆ Any g that minimizes the above criterion is called a Minimum Mean Square Error (MMSE) estimator.

ˆ When solving for MMSE, we always assume that all the random variables
involved have finite mean and variance.

13
MMSE

ˆ In practice, finding the MMSE estimator might be hard.


ˆ We can restrict our attention to special classes of functions g.
ˆ Let k = 0, and suppose that we want to find the best constant c that
estimates X. Note that in this case, we view c as a constant random
variable.

Objective: find c ∈ argmin_c E[|X − c|²].   (1)

ˆ Let X̄ = E[X]. Then,

     E[|X − c|²] = E[|X − X̄ + X̄ − c|²]
                 = E[(X − X̄)²] + 2(X̄ − c)E[X − X̄] + (X̄ − c)²
                 = E[(X − X̄)²] + (X̄ − c)²,

  since E[X − X̄] = 0.

ˆ Therefore, (1) is minimized when c = X̄, and the resulting MMSE value is Var(X).
ˆ Estimation theory interpretation of mean and variance: The best
constant MMSE estimator of X is E[X] and the corresponding MMSE value
is Var(X).

14
Conditional Expectation

Example: Let X, Y be discrete r.v with (X, Y ∈ {1, 2}) and joint pmf:
     P[X = 1, Y = 1] = 1/2 ,   P[X = 1, Y = 2] = 1/10 ,
     P[X = 2, Y = 1] = 1/10 ,   P[X = 2, Y = 2] = 3/10 .
ˆ Determine the marginal pmf of X and Y .
ˆ Show that the conditional pmf of X given Y = 1 is

     P[X|Y = 1] = 5/6 if X = 1,   1/6 if X = 2.

ˆ We can then compute

     E[X|Y = 1] = ∑_{x∈X} x P[X = x|Y = 1] = _______.

ˆ Similarly, show that the conditional pmf of X given Y = 2 is

     P[X|Y = 2] = 1/4 if X = 1,   3/4 if X = 2.

ˆ Then, E[X|Y = 2] = _______.
ˆ We can view E[X|Y ] as a function of Y :

     g(Y ) = E[X|Y ] = E[X|Y = 1] with probability P[Y = 1],
                       E[X|Y = 2] with probability P[Y = 2].

ˆ Now, determine E[g(Y )].


ˆ Determine E[X]. What do you notice?
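A quick numeric check of this slide's computations in Python (it spells out the answers, so treat it as a check rather than a first attempt):

    import numpy as np

    # joint pmf P[X = x, Y = y]; rows index x = 1, 2 and columns index y = 1, 2
    P = np.array([[0.5, 0.1],
                  [0.1, 0.3]])
    xs = np.array([1, 2])

    pY = P.sum(axis=0)                                  # marginal pmf of Y
    E_X_given_Y = (xs[:, None] * P).sum(axis=0) / pY    # E[X | Y = y] for y = 1, 2
    print(E_X_given_Y)                                  # [7/6, 7/4]

    E_gY = (E_X_given_Y * pY).sum()                     # E[g(Y)] = E[E[X | Y]]
    E_X = (xs * P.sum(axis=1)).sum()                    # E[X]
    print(E_gY, E_X)                                    # both 1.4 -- they agree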

15
Conditional Expectation

ˆ If the value of Y is specified, then E[X|Y = y] is a scalar.


ˆ Otherwise, E[X|Y ] is a random variable which is a function of Y : for ω1 ̸= ω2 , Y (ω1 ) = Y (ω2 ) ⇒ E[X|Y = Y (ω1 )] = E[X|Y = Y (ω2 )].

ˆ For two continuous random variables X, Y ,

     E[X|Y = y] = ∫x x fX|Y (x | Y = y) dx = ∫x x [fX,Y (x, y)/fY (y)] dx.

ˆ Similarly,

     E[h(X)|Y = y] = ∫x h(x) fX|Y (x | Y = y) dx,
     E[l(X, Y )|Y = y] = ∫x l(x, y) fX|Y (x | Y = y) dx.

ˆ If the value of Y is not specified, E[l(X, Y )|Y ] is a random variable.

16
Example

Let X and Y be two independent random variables with

     X = 1 with probability 1/2,   X = 0 with probability 1/2.

Let Y have the same distribution as X. Let Z = X + Y .
ˆ Determine the pmf of Z.
ˆ Find the conditional distribution and expectation of X when Z = 1 and Z = 2.
ˆ Find the conditional distribution and expectation of Z when X = 1.

17
Properties of Conditional Expectation

ˆ Linearity: E[aX + bY |Z] = aE[X|Z] + bE[Y |Z] a.e.

ˆ Monotonicity: X ≤ Y ⇒ E[X|Z] ≤ E[Y |Z] a.e.

ˆ Identity: E[Y |Y = y] = y. What is the conditional distribution of Y when its value is specified? Determine E[Y |Y ] and E[g(Y )|Y ].

ˆ Independence: Suppose X and Y are independent. Then,

     E[X | Y = y] = ∫x x fX|Y (x | Y = y) dx
                  = ∫x x [fX,Y (x, y)/fY (y)] dx = ∫x x fX (x) dx = E[X],

  independent of the value of Y = y.


In other words, since E[X | Y = y] = E[X] for every y, the random variable E[X | Y ] equals the constant E[X].

Similarly, E[g(X) | Y ] = E[g(X)].

ˆ E[Xg(Y )|Y ] = g(Y )E[X|Y ].

18
Tower Property and Orthogonality

Tower Property:
E[E[X|Y ]] = E[X].
Proof:

     E[E[X|Y ]] = ∫y E[X|Y = y] fY (y) dy
                = ∫y ( ∫x x fX|Y (x | Y = y) dx ) fY (y) dy
                = ∫x x ( ∫y fX|Y (x | Y = y) fY (y) dy ) dx      (since fX|Y (x | Y = y) fY (y) = fX,Y (x, y))
                = ∫x x ( ∫y fX,Y (x, y) dy ) dx                  (the inner integral is fX (x))
                = ∫x x fX (x) dx = E[X].

Orthogonality: for any measurable function g,

E[(X − E[X|Y ])g(Y )] = 0.

That is, (X − E[X|Y ]) is orthogonal to any function g(Y ) of Y .


Proof:
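One standard argument, using the tower property and the fact that g(Y ) can be pulled out of a conditional expectation given Y :

     E[(X − E[X|Y ])g(Y )] = E[ E[(X − E[X|Y ])g(Y ) | Y ] ] = E[ g(Y )(E[X|Y ] − E[X|Y ]) ] = 0.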

19
Minimum Mean Square Estimator (MMSE)

Proposition: Let g(Y ) be an estimator of X, and let the mean square estimation error be defined as E[(X − g(Y ))²]. Then,

     E[(X − E[X|Y ])²] ≤ E[(X − g(Y ))²],   for all measurable g.

Proof:

     E[(X − g(Y ))²] = E[(X − E[X | Y ] + E[X | Y ] − g(Y ))²]
                     = E[(X − E[X | Y ])²] + 2 E[(X − E[X | Y ])(E[X | Y ] − g(Y ))] + E[(E[X | Y ] − g(Y ))²].

Since E[X | Y ] − g(Y ) is a function of Y , the cross term vanishes by the orthogonality property, and the last term is nonnegative. Hence E[(X − g(Y ))²] ≥ E[(X − E[X | Y ])²].

20
L2(Ω, F, P) Space of Random Variables

ˆ We define L2 (Ω, F, P) (or simply L2 ) to be the set of random variables with finite second moment, i.e., L2 = {X | E[X²] < ∞}.
ˆ Properties of L2 :
– L2 is a linear subspace of random variables:
(i) aX ∈ L2 for all X ∈ L2 and a ∈ R as E[(aX)2 ] = a2 E[X 2 ] < ∞,
and
(ii) X + Y ∈ L2 for all X, Y ∈ L2

– The most important property: L2 is an inner-product space. For any two random variables X, Y ∈ L2 , let us define their inner product

     X · Y := E[XY ].

– Then this operation satisfies the axioms of an inner product:


(i) X · X = E[X 2 ] ≥ 0.
(ii) X · X = 0 iff X = 0 almost surely.
(iii) linearity: (αX + Y ) · Z = α(X · Z) + Y · Z.
ˆ Therefore, L2 is a normed vector space, with the norm ∥ · ∥ defined by

     ∥X∥ := √(X · X) = √(E[X²]).

ˆ Similarly, we have ∥X − Y ∥² := (X − Y ) · (X − Y ) = E[(X − Y )²].

21
L2-norm and L2 convergence

ˆ Since L2 is a normed space, we can define a new limit of random variables:


Definition 1. We say that a sequence {Xk } converges in L2 (or in MSE
sense) to X if limk→∞ ∥X − Xk ∥ = 0.

ˆ Note that limk→∞ ∥X − Xk ∥ = 0 iff limk→∞ E[|X − Xk |2 ] = 0.

ˆ Definition: We say that H ⊆ L2 is a linear subspace if


(i) for any X, Y ∈ H, we have X + Y ∈ H, and
(ii) for any X ∈ H and a ∈ R, aX ∈ H.

ˆ Definition: We say that H ⊆ L2 is closed if for any sequence {Xk } ⊆ H with

     lim_{m,n→∞} ∥Xm − Xn ∥² = lim_{m,n→∞} E[|Xm − Xn |²] = 0,

  we have Xk → X in L2 for some random variable X ∈ H.

ˆ Showing that a set is a linear subspace is usually easy, but showing closedness might be hard.

ˆ Important Cases:
1. For random variables X1 , . . . , Xk ∈ L2 , the set H = {α1 X1 + . . . +
αk Xk | αi ∈ R} is a closed linear subspace.
2. For any random variables X1 , . . . , Xk ∈ L2 , the set H = {α0 + α1 X1 +
. . . + αk Xk | αi ∈ R} is a closed linear subspace.

22
Orthogonality Principle

Theorem 1. Let H be a closed linear subspace of L2 and let X ∈ L2 . Then,


a. There exists a unique (up to almost sure equivalence) random variable
Y ⋆ ∈ H such that

∥Y ⋆ − X∥² ≤ ∥Z − X∥²,   for all Z ∈ H.

b. Let W be a random variable. W = Y ⋆ a.e. if and only if W ∈ H and

E[(X − W )Z] = 0, for all Z ∈ H.

Note:
ˆ Y ⋆ is called the projection of X on the subspace H and is denoted by
ΠH (X).
ˆ Two random variables X, Y are orthogonal, X ⊥ Y , if E[XY ] = 0.
ˆ Relate the MMSE estimator to the above theorem.

23
Linear Minimum Mean Square Error (LMMSE) Estimation

ˆ Let Y be a measurement of X, and we want to find an estimate of X which is a linear function of Y minimizing the mean square error. The estimator is of the form X̂LMMSE (Y ) = aY + b. The goal is to find coefficients a∗ , b∗ ∈ R such that

     ∥X − (a∗ Y + b∗ )∥ ≤ ∥X − (aY + b)∥,   for any a, b ∈ R.

ˆ Let L(Y ) := {Z | Z = aY + b, a, b ∈ R} be the set of random variables that are linear functions of Y . One can show that L(Y ) is a closed linear subspace.
ˆ Then, X̂LMMSE (Y ) = ΠL(Y ) (X), the projection of X onto L(Y ).

ˆ From the orthogonality property, we know that E[(X − X̂LMMSE (Y ))Z] = 0 for all Z ∈ L(Y ).
ˆ Show that the coefficients a∗ , b∗ satisfy

     a∗ = Cov(X, Y ) / Var(Y ),   b∗ = E[X] − a∗ E[Y ].

ˆ Thus, the LMMSE estimate is

     X̂(Y ) := a∗ Y + b∗ = a∗ (Y − E[Y ]) + E[X] = E[X] + [Cov(X, Y )/Var(Y )] (Y − E[Y ]).

ˆ We can verify that (X − X̂) ⊥ (αY + β) for all α, β ∈ R.


ˆ What is the mean square estimation error?
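A small simulated sanity check in Python; the model below (a latent X observed through additive independent noise) is an assumption of mine, not taken from the notes:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(1.0, 2.0, 200_000)            # latent quantity (assumed model)
    Y = X + rng.normal(0.0, 1.0, 200_000)        # noisy measurement

    C = np.cov(X, Y)
    a = C[0, 1] / C[1, 1]                        # a* = Cov(X, Y) / Var(Y)
    b = X.mean() - a * Y.mean()                  # b* = E[X] - a* E[Y]
    Xhat = a * Y + b

    # Orthogonality check: the error X - Xhat should be (nearly) orthogonal to 1 and to Y.
    print(np.mean(X - Xhat), np.mean((X - Xhat) * Y))   # both ~0
    print(np.mean((X - Xhat) ** 2))                     # empirical mean square error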

24
Derivation of LMMSE Coefficients
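Presumably this slide works out the single-measurement case from the previous slide; a brief sketch of one route, using the orthogonality conditions with Z = 1 and Z = Y − E[Y ]:

ˆ Z = 1: E[X − a∗ Y − b∗ ] = 0, which gives b∗ = E[X] − a∗ E[Y ].
ˆ Z = Y − E[Y ]: E[(X − a∗ Y − b∗ )(Y − E[Y ])] = 0. Substituting b∗ , the left-hand side becomes E[((X − E[X]) − a∗ (Y − E[Y ]))(Y − E[Y ])] = Cov(X, Y ) − a∗ Var(Y ), so a∗ = Cov(X, Y )/Var(Y ).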

25
LMMSE Coefficients for Multiple Observations

ˆ Let Y = [Y1 , . . . , Yk ]⊤ be the measurements available to us.
ˆ We wish to determine X̂LMMSE (Y ) = a0 + ∑_{i=1}^k ai Yi = ΠL(Y ) (X), where L(Y ) is now the set of linear functions a0 + ∑_{i=1}^k ai Yi .

ˆ The goal is to find coefficients that minimize the mean square error:

     min_{a0 ,a1 ,...,ak } E[(X − (a0 + ∑_{i=1}^k ai Yi ))²].

ˆ Due to the orthogonality property, the LMMSE estimator satisfies

     E[(X − (a∗0 + ∑_{i=1}^k a∗i Yi ))Z] = 0   ∀Z ∈ L(Y ).

ˆ We need to cleverly choose k + 1 elements from L(Y ) to set up a system of k + 1 linear equations and solve for the coefficients.

26
Derivation of LMMSE Coefficients

ˆ Hint: Choose 1 and Yi − E[Yi ] for all i ∈ {1, 2, . . . , k}.


ˆ If Z = 1, then orthogonality yields

     E[X − (a∗0 + ∑_{i=1}^k a∗i Yi )] = 0.

ˆ If Z = Yj − E[Yj ], then orthogonality yields

     E[(X − (a∗0 + ∑_{i=1}^k a∗i Yi ))(Yj − E[Yj ])] = 0.

27
Derivation of LMMSE Coefficients

ˆ Finally, from the above analysis, we obtain

     [a∗1 , a∗2 , . . . , a∗k ]⊤ = [Cov(Y )]⁻¹ Cov(X, Y ).

ˆ The LMMSE estimate is given by

     X̂LMMSE (Y ) = a∗0 + ∑_{i=1}^k a∗i Yi
                 = E[X] + ∑_{i=1}^k a∗i (Yi − E[Yi ])
                 = E[X] + (a∗ )⊤ [Y − E[Y ]]
                 = E[X] + Cov(X, Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]].

 
ˆ When X = [X1 , X2 , . . . , Xn ]⊤ is also a random vector, the LMMSE estimate is given componentwise by

     X̂i,LMMSE (Y ) = E[Xi ] + Cov(Xi , Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]],   i = 1, . . . , n.

28
Example (Previous year End-Sem Question)

X is a three-dimensional random vector with E[X] = 0 and autocorrelation matrix RX with elements rij = (−0.80)^|i−j| . Use X1 and X2 to form a linear estimate of X3 , i.e., X̂3 = a1 X1 + a2 X2 : determine a1 and a2 that minimize the mean-square error.
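A sketch of the computation in Python; since E[X] = 0, RX doubles as the covariance matrix, so the normal equations from the previous slides apply directly:

    import numpy as np

    r = lambda i, j: (-0.80) ** abs(i - j)    # autocorrelation entries r_ij

    R_Y = np.array([[r(1, 1), r(1, 2)],
                    [r(2, 1), r(2, 2)]])      # covariance of the observations (X1, X2)
    r_XY = np.array([r(1, 3), r(2, 3)])       # covariance of (X1, X2) with X3

    a = np.linalg.solve(R_Y, r_XY)            # solves R_Y a = r_XY
    print(a)                                  # [a1, a2] ~ [0.0, -0.8]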

29
MMSE and LMMSE Estimator Comparison

ˆ An estimator X̂(Y ) is unbiased if E[X̂(Y )] = E[X].

– Is MMSE estimator unbiased?


– Is LMMSE estimator unbiased?

ˆ Among MMSE and LMMSE estimators, which one has smaller estimation
error?

ˆ If X and Y are uncorrelated, what does the LMMSE estimator give us?
What about MMSE estimator?

ˆ What do you need to know to determine MMSE and LMMSE estimators?

ˆ What if Cov(Y ) is not invertible?

ˆ When X and Y are jointly Gaussian,

     X̂LMMSE (Y ) = X̂MMSE (Y ),   i.e.,   E[X|Y ] = E[X] + Cov(X, Y )⊤ [Cov(Y )]⁻¹ [Y − E[Y ]].

  The conditional expectation of X given Y is a linear function of Y .

30
