
IV. Statistical Estimation and Classification

Georgia Tech ECE 7750 Notes by J. Romberg. Last updated 12:17, October 24, 2020
Probability: An Extremely Concise Review
1. A scalar-valued random variable X is completely characterized
by its distribution function

$$F_X(u) = P(X \le u).$$

This is also called the cumulative distribution function
(cdf). $F_X(u)$ is monotonically non-decreasing in u; it goes to one as
$u \to \infty$ and goes to zero as $u \to -\infty$.

2. If $F_X$ is differentiable, then we can also characterize X using
its probability density function (pdf)

$$f_X(x) = \left.\frac{dF_X(u)}{du}\right|_{u=x}.$$

The density has the properties $f_X(x) \ge 0$ and

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$

Events of interest are subsets^1 of the real line; given such an
event/subset E, we can compute the probability of E occurring as

$$P(E) = \int_{x \in E} f_X(x)\,dx.$$

^1 Technically, it must be a subset of the real line that can be written as
some combination of countable unions, countable intersections, and
complements of intervals. You really have to know something about real
analysis to construct a set that does not meet this criterion.

It is possible that a pdf exists even if $F_X$ is not differentiable
everywhere, for example:

$$F_X(u) = \begin{cases} 0, & u < 0, \\ u, & 0 \le u \le 1, \\ 1, & u \ge 1 \end{cases}
\quad\text{has pdf}\quad
f_X(x) = \begin{cases} 0, & x < 0, \\ 1, & 0 \le x \le 1, \\ 0, & x > 1. \end{cases}$$
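As a quick numerical sanity check, here is a minimal Python sketch (numpy/scipy are not assumed anywhere in these notes; the distribution is the uniform example above):

```python
import numpy as np
from scipy.stats import uniform

# The example above is the standard uniform distribution on [0, 1].
U = uniform(loc=0.0, scale=1.0)

# Wherever the cdf is differentiable, the pdf is its derivative;
# compare against a centered finite difference at interior points.
for x in [0.25, 0.5, 0.75]:
    h = 1e-6
    deriv = (U.cdf(x + h) - U.cdf(x - h)) / (2 * h)
    print(x, U.pdf(x), deriv)  # pdf and numerical derivative agree (both 1.0)

# P(E) for the event E = [0.2, 0.6] is the integral of the pdf over E,
# which here is just F(0.6) - F(0.2) = 0.4.
print(U.cdf(0.6) - U.cdf(0.2))
```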

3. The expectation of a function g(X) of a random variable is

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

This is the "average value" of g(X) in that given a series of
realizations $X = x_1, X = x_2, \ldots$ of X,

$$\frac{1}{M} \sum_{m=1}^{M} g(x_m) \to E[g(X)], \quad\text{as } M \to \infty.$$

This fact is known as the (weak) law of large numbers.
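Here is a minimal simulation sketch of that convergence (the choice $X \sim$ Exponential(1) with $g(x) = x^2$ is illustrative; for it, $E[g(X)] = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X ~ Exponential(1) and g(x) = x**2, so E[g(X)] = 2.
g = lambda x: x**2

for M in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=M)
    print(M, g(x).mean())  # the sample average approaches 2 as M grows
```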

4. The moment of X of degree p is the expectation of the monomial
$g(x) = x^p$. The zeroth moment is always 1:

$$E[X^0] = E[1] = \int_{-\infty}^{\infty} f_X(x)\,dx = 1,$$

and the first moment is the mean:

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx.$$

The variance is the second moment minus the mean squared:

$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = E[(X - E[X])^2].$$

This is sometimes referred to as the "variation around the mean".
Aside from the zeroth moment, there is nothing that says that
the integrals above must converge; it is easy to construct
examples of well-defined random variables where $E[X] = \infty$.

5. A pair of random variables (X, Y) is completely described by
their joint distribution function (joint cdf)^2

$$F_{X,Y}(u, v) = P(X \le u, Y \le v).$$

Again, if $F_{X,Y}$ is continuously differentiable, (X, Y) is also
characterized by the density

$$f_{X,Y}(x, y) = \left.\frac{\partial^2 F_{X,Y}(u, v)}{\partial u\,\partial v}\right|_{(u,v)=(x,y)}.$$

In this case, events of interest correspond to regions in the plane
$\mathbb{R}^2$, and the probability of an event occurring is the integral of
the density over this region.

6. From the joint pdf $f_{X,Y}(x, y)$, we can recover the individual
marginal pdfs for X and Y using

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx.$$

The pair of densities $f_X(x), f_Y(y)$ tells us how X and Y behave
individually, but not how they interact.

^2 For fixed $u, v \in \mathbb{R}$, the notation $P(X \le u, Y \le v)$ should be read as "the
probability that X is $\le u$ and Y is $\le v$."

7. If X and Y do interact in a meaningful way, then observing
one of them affects the distribution of the other. If we observe
X = x, then with this knowledge, the density for Y becomes

$$f_Y(y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

This is a density over y; it is easy to check that it is nonnegative
everywhere and that it integrates to one. $f_Y(y \mid X = x)$ is called
the conditional density for Y given X = x.

8. We call X and Y independent if observing X tells us nothing
about Y (and vice versa). This means

$$f_Y(y \mid X = x) = f_Y(y), \quad\text{for all } x \in \mathbb{R},$$

and

$$f_X(x \mid Y = y) = f_X(x), \quad\text{for all } y \in \mathbb{R}.$$

(If one of the statements above is true, then the other follows
automatically.) Equivalently, independence means that the joint pdf
is separable:

$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$

9. We can always factor the joint pdf in two different ways:

$$f_X(x)\, f_Y(y \mid X = x) = f_{X,Y}(x, y) = f_Y(y)\, f_X(x \mid Y = y).$$

At this point, we should be comfortable enough with what is
going on that we can use $f_Y(y \mid x)$ as short-hand notation for
$f_Y(y \mid X = x)$. Then we can rewrite the above in its more
common form as

$$f_X(x)\, f_Y(y \mid x) = f_{X,Y}(x, y) = f_Y(y)\, f_X(x \mid y).$$

This factorization also gives us a handy way to compute the
marginals:

$$f_X(x) = \int_{-\infty}^{\infty} f_Y(y)\, f_X(x \mid y)\,dy.$$

It also yields Bayes' equation

$$f_X(x \mid y) = \frac{f_Y(y \mid x)\, f_X(x)}{f_Y(y)},$$

which is a fundamental relation for statistical inference.
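A small numerical sketch of Bayes' equation may help (the Gaussian prior/likelihood and the grid discretization are illustrative choices, not part of the notes):

```python
import numpy as np

# Illustrative model: X ~ N(0, 1) (prior) and Y | X = x ~ N(x, 0.5**2).
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

prior = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # f_X(x)
y_obs = 1.0
lik = np.exp(-(y_obs - x)**2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)  # f_Y(y|x)

# f_Y(y) by marginalizing the factorization f_Y(y|x) f_X(x) over x.
evidence = np.sum(lik * prior) * dx

posterior = lik * prior / evidence  # f_X(x|y), Bayes' equation

print(np.sum(posterior) * dx)      # integrates to ~1
print(np.sum(x * posterior) * dx)  # posterior mean; ~0.8 for this model
```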

10. All of the above extends in the obvious way to more than two
random variables. A random vector

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_D \end{bmatrix}$$

is completely characterized by the density $f_X(x) = f_X(x_1, \ldots, x_D)$
on $\mathbb{R}^D$. In general, we can factor the joint pdf as

$$f_X(x) = f_{X_1}(x_1)\, f_{X_2}(x_2 \mid x_1)\, f_{X_3}(x_3 \mid x_2, x_1) \cdots f_{X_D}(x_D \mid x_1, \ldots, x_{D-1}).$$

11. The pth moment of a random vector X that maps into $\mathbb{R}^D$ is
the collection of expectations of all monomials of order p. The
mean of a random vector is a vector of length D:

$$E[X] = \begin{bmatrix} E[X_1] \\ \vdots \\ E[X_D] \end{bmatrix},$$

the second moment is the $D \times D$ matrix of all correlations
between entries:

$$E[XX^T] = \begin{bmatrix} E[X_1^2] & E[X_1 X_2] & \cdots & E[X_1 X_D] \\ \vdots & \vdots & \ddots & \vdots \\ E[X_D X_1] & \cdots & & E[X_D^2] \end{bmatrix},$$

the third moment is the $D \times D \times D$ tensor $E[X \otimes X \otimes X]$,
where

$$(E[X \otimes X \otimes X])(i, j, k) = E[X_i X_j X_k],$$

and so on. The covariance matrix contains all the pairs of centered
second moments:

$$R_{i,j} = E[(X_i - E[X_i])(X_j - E[X_j])].$$

If $\mu_X = E[X]$ is the mean vector, we can write the covariance
matrix succinctly in terms of the second moment as

$$R = E[XX^T] - \mu_X \mu_X^T.$$

12. Given independent observations $x_1, x_2, \ldots, x_M$ of a random
vector X with unknown (or partially known) distribution, a
completely reasonable way to estimate the mean vector is using

$$\hat{\mu} = \frac{1}{M} \sum_{m=1}^{M} x_m.$$

If the mean $\mu_X = E[X]$ is known but the covariance is not, we
can estimate the covariance using

$$\hat{R} = \left( \frac{1}{M} \sum_{m=1}^{M} x_m x_m^T \right) - \mu_X \mu_X^T.$$

If both the mean and covariance are unknown, we first estimate
the mean vector as above, then take

$$\hat{R} = \frac{1}{M-1} \left( \sum_{m=1}^{M} x_m x_m^T - M \hat{\mu}\hat{\mu}^T \right)
= \frac{1}{M-1} \sum_{m=1}^{M} (x_m - \hat{\mu})(x_m - \hat{\mu})^T.$$

The difference in the scaling is to ensure that $E[\hat{R}] = R$ in
both cases.
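In numpy this looks something like the following (a sketch with an illustrative ground-truth mean and covariance; note that np.cov uses the same 1/(M-1) scaling by default):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth in D = 3 dimensions.
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((3, 3))
R = A @ A.T  # any matrix of this form is a valid (PSD) covariance

M = 50_000
X = rng.multivariate_normal(mu, R, size=M)  # rows are the samples x_m

mu_hat = X.mean(axis=0)

# Mean unknown: center by mu_hat and scale by 1/(M-1), as in the notes.
Xc = X - mu_hat
R_hat = (Xc.T @ Xc) / (M - 1)  # same as np.cov(X, rowvar=False)

print(np.abs(mu_hat - mu).max())  # both errors shrink as M grows
print(np.abs(R_hat - R).max())
```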

The Weak Law of Large Numbers
The WLLN is absolutely fundamental to machine learning (and really
to all of probability and statistics). It basically formalizes the notion
that given a series of independent samples of a random variable X, we
can approximate E[X] by averaging the samples. The WLLN states
that if $X_1, X_2, \ldots$ are independent copies of a random variable X,

$$\frac{1}{N} \sum_{n=1}^{N} X_n \to E[X] \quad\text{as } N \to \infty.$$

The only condition we will need for this convergence is that X has
finite variance.
We start by stating the main result precisely. Let X be a random
variable with pdf $f_X(x)$, mean $E[X] = \mu$, and variance
$\mathrm{var}(X) = \sigma^2 < \infty$. We observe samples of X labeled
$X_1, X_2, \ldots, X_N$. The $X_i$ are independent of one another, and
they all have the same distribution as X. We will show that the sample
mean formed from a sample of size N,

$$M_N = \frac{1}{N}(X_1 + X_2 + \cdots + X_N),$$

obeys^3

$$P(|M_N - \mu| > \epsilon) \le \frac{\sigma^2}{N \epsilon^2},$$

where $\epsilon > 0$ is an arbitrarily small number. In the expression
above, $M_N$ is the only thing which is random; $\mu$ and $\sigma^2$ are
fixed underlying properties of the distribution, N is the amount of data
we see, and $\epsilon$ is something we can choose arbitrarily.
^3 This is a simple example of a concentration bound. It is not that tight;
we will later encounter inequalities of this type that are much more
precise. But it is relatively simple and will serve our purpose here.

Notice that no matter how small $\epsilon$ is, the probability on the
right hand side above goes to zero as $N \to \infty$. That is, for any
fixed $\epsilon > 0$,

$$\lim_{N \to \infty} P(|M_N - \mu| > \epsilon) = 0.$$

This result follows from two simple but important tools known as
the Markov and Chebyshev inequalities.

Markov inequality

Let X be a random variable that only takes positive values:

$$f_X(x) = 0 \text{ for } x < 0, \quad\text{or}\quad F_X(0) = 0.$$

Then

$$P(X \ge a) \le \frac{E[X]}{a} \quad\text{for all } a > 0.$$

For example, the probability that X is more than 5 times its mean
is at most 1/5, more than 10 times the mean at most 1/10, etc. And
this holds for any distribution.

The Markov inequality is easy to prove:

$$E[X] = \int_0^{\infty} x f_X(x)\,dx
\ge \int_a^{\infty} x f_X(x)\,dx
\ge \int_a^{\infty} a f_X(x)\,dx
= a \cdot P(X \ge a),$$

and so $P(X \ge a) \le \frac{E[X]}{a}$.

Again, this is a very general statement in that we have assumed
nothing about X other than that it is positive. The price for the
generality is that the bound is typically very loose, and does not
usually capture the behavior of $P(X \ge a)$. We can, however,
cleverly apply the Markov inequality to get something slightly more
useful.
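To see both the validity and the looseness of the bound, here is a minimal empirical check (the exponential distribution is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: X ~ Exponential(1), a positive random variable with E[X] = 1.
x = rng.exponential(scale=1.0, size=1_000_000)

for a in [2.0, 5.0, 10.0]:
    empirical = (x >= a).mean()  # estimate of P(X >= a)
    markov = x.mean() / a        # Markov bound E[X]/a
    print(a, empirical, markov)  # the true tail sits far below the bound
```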

Chebyshev inequality

The main use of the Markov inequality turns out to be its use in
deriving other, more accurate deviation inequalities. Here we will
use it to derive the Chebyshev inequality, from which the weak
law of large numbers will follow immediately.

Chebyshev inequality: If X is a random variable with mean $\mu$
and variance $\sigma^2$, then

$$P(|X - \mu| > c) \le \frac{\sigma^2}{c^2} \quad\text{for all } c > 0.$$

The Chebyshev inequality follows immediately from the Markov
inequality in the following way. No matter what range of values X
takes, the quantity $|X - \mu|^2$ is always nonnegative. Thus

$$P\left(|X - \mu|^2 > c^2\right) \le \frac{E[|X - \mu|^2]}{c^2} = \frac{\sigma^2}{c^2}.$$

Since squaring $(\cdot)^2$ is monotonic (invertible) over positive
numbers,

$$P\left(|X - \mu|^2 > c^2\right) = P(|X - \mu| > c) \le \frac{\sigma^2}{c^2}.$$

We now have a bound which depends on the mean and the variance
of X; this leads to a more accurate bound on the probability.

Simple proof of the weak law of large numbers

We now turn to the behavior of the sample mean

$$M_N = \frac{X_1 + X_2 + \cdots + X_N}{N},$$

where again the $X_i$ are iid random variables with $E[X_i] = \mu$ and
$\mathrm{var}(X_i) = \sigma^2$. We know that

$$E[M_N] = \frac{E[X_1] + E[X_2] + \cdots + E[X_N]}{N} = \frac{N\mu}{N} = \mu,$$

and since the $X_i$ are independent,

$$\mathrm{var}(M_N) = \frac{\mathrm{var}(X_1) + \mathrm{var}(X_2) + \cdots + \mathrm{var}(X_N)}{N^2} = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}.$$

For any $\epsilon > 0$, a direct application of the Chebyshev inequality
tells us that

$$P(|M_N - \mu| > \epsilon) \le \frac{\sigma^2}{N \epsilon^2}.$$

The point is that this gets arbitrarily small as $N \to \infty$ no matter
what $\epsilon$ was chosen to be. We have established, in some sense, that
even though $\{M_N\}$ is a sequence of random numbers, it converges
to something deterministic, namely $\mu$.

WLLN: Let $X_1, X_2, \ldots$ be iid random variables as above. For
every $\epsilon > 0$, we have

$$P(|M_N - \mu| > \epsilon) = P\left( \left| \frac{X_1 + \cdots + X_N}{N} - \mu \right| > \epsilon \right) \longrightarrow 0,$$

as $N \to \infty$.
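The statement is easy to watch happen in simulation. A minimal sketch (uniform samples are an illustrative choice; for them $\mu = 1/2$ and $\sigma^2 = 1/12$):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2, eps = 0.5, 1 / 12, 0.05  # X ~ Uniform[0, 1]

for N in [100, 1_000, 10_000]:
    samples = rng.uniform(size=(1_000, N))
    M_N = samples.mean(axis=1)  # 1000 independent realizations of M_N
    empirical = (np.abs(M_N - mu) > eps).mean()
    chebyshev = min(sigma2 / (N * eps**2), 1.0)
    print(N, empirical, chebyshev)  # both go to zero; the bound is looser
```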

One of the philosophical consequences of the WLLN is that it tells us
that probabilities can be estimated through empirical frequencies.
Suppose I want to estimate the probability of an event A occurring
in some probabilistic experiment. We run a series of (independent)
experiments, and set $X_i = 1$ if A occurred in experiment i, and
$X_i = 0$ otherwise. Then given $X_1, \ldots, X_N$, we estimate the
probability of A in a completely reasonable way, by computing the
percentage of times it occurred:

$$p_{\mathrm{empirical}} = \frac{X_1 + \cdots + X_N}{N}.$$

The WLLN tells us that

$$p_{\mathrm{empirical}} \to P(A), \quad\text{as } N \to \infty.$$

This lends some mathematical weight to our interpretation of
probabilities as relative frequencies.
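For instance (an illustrative experiment, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative event A: a fair six-sided die shows 5 or 6, so P(A) = 1/3.
for N in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=N)  # N independent experiments
    p_empirical = (rolls >= 5).mean()   # fraction of times A occurred
    print(N, p_empirical)               # approaches 1/3
```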

All of the above of course applies to functions of random variables.
That is, if X is a random variable, and g(X) is a function of that
random variable with

$$\mathrm{var}(g(X)) = E[(g(X) - E[g(X)])^2] < \infty,$$

then given independent realizations $X_1, \ldots, X_N$, we have

$$\frac{1}{N} \sum_{n=1}^{N} g(X_n) \to E[g(X)]$$

as $N \to \infty$.

Minimum Mean-Square Error Estimation
Now we will take our first look at estimating variables that are
themselves random, subject to a known probability law.
We start our discussion with a very basic problem. Suppose Y is a
scalar random variable with a known pdf $f_Y(y)$. Here is a fun game:
you guess what Y is going to be, then I draw a realization of Y
according to its probability law, then we see how close you were
with your guess.
What is your best guess?
Well, that of course depends on what exactly we mean by "best", i.e.
what price I pay for being a certain amount off. But if we penalize
the mean-squared error, we know exactly how to minimize it.
Let g be your guess. The error in your guess is of course random
(since the realization of Y is random), and so is the squared error
$(Y - g)^2$. We want to choose g so that the mean of the squared
error is as small as possible:

$$\underset{g}{\text{minimize}}\;\; E[(Y - g)^2].$$

Expanding the squared error makes it clear how to do this:

$$E[(Y - g)^2] = E[Y^2] - 2g\,E[Y] + g^2.$$

No matter what the first moment E[Y] and second moment $E[Y^2]$ are
(as long as they are finite), the expression above is a convex quadratic
function in g, and hence is minimized when its first derivative (w.r.t.
g) is zero, i.e. when

$$-2\,E[Y] + 2g = 0 \quad\Rightarrow\quad \hat{g} = E[Y].$$

The mean squared error for this choice $\hat{g}$ is of course exactly
the variance of Y,

$$E[(Y - \hat{g})^2] = E[(Y - E[Y])^2] = \mathrm{var}(Y).$$
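A quick numerical confirmation that the mean is the minimizer (the distribution and the grid of guesses are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: Y ~ Exponential(2), so E[Y] = 2.
y = rng.exponential(scale=2.0, size=1_000_000)

guesses = np.linspace(0.0, 4.0, 81)
mse = [np.mean((y - g)**2) for g in guesses]  # sample MSE for each guess
print(guesses[np.argmin(mse)], y.mean())      # both near 2
```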

The story gets more interesting (and relevant) when we have multiple
random variables, some of which we observe, some of which we do not.
Suppose that two random variables (Y, Z) have joint pdf $f_{Y,Z}(y, z)$.
Suppose that a realization of (Y, Z) is drawn, and I get to observe
Z. What have I learned about Y?
If Y and Z are independent, then the answer is of course nothing.
But if they are not independent, then the marginal distribution of Y
changes. In particular, before the random variables were drawn, the
(marginal) pdf for Y was

$$f_Y(y) = \int f_{Y,Z}(y, z)\,dz.$$
After we observe Z = z, we have

$$f_Y(y \mid Z = z) = \frac{f_{Y,Z}(y, z)}{f_Z(z)} = \frac{f_{Y,Z}(y, z)}{\int f_{Y,Z}(y, z)\,dy}.$$
Y is still a random variable, but its distribution depends on the value
z that was observed for Z.
Now, given that I have observed Z = z, what is the best guess for
Y? If by "best" we mean that which minimizes the mean squared
error, it is the conditional mean. That is, the minimizer of

$$\underset{g}{\text{minimize}}\;\; E[(Y - g)^2 \mid Z = z]$$

is

$$\hat{g} = E[Y \mid Z = z].$$

Notice that unlike before, $\hat{g}$ is not pre-determined; it depends
on the outcome Z = z. We might denote

$$\hat{g}(z) = E[Y \mid Z = z].$$

For a particular choice of z, the mean-squared error is the
conditional variance

$$E[(Y - \hat{g}(z))^2 \mid Z = z] = E\left[ (Y - E[Y \mid Z = z])^2 \mid Z = z \right]
= E[Y^2 \mid Z = z] - (E[Y \mid Z = z])^2
= \mathrm{var}(Y \mid Z = z).$$

So in general, not only does $\hat{g}$ depend on z, but its performance
(its mean-square error) also depends on z.
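We can estimate a conditional mean empirically by averaging Y over draws where Z landed near z. A minimal sketch (the linear-Gaussian model is an illustrative choice, for which $E[Y \mid Z = z] = 2z$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint law: Z ~ N(0, 1) and Y = 2 Z + noise, noise ~ N(0, 1).
N = 2_000_000
z = rng.standard_normal(N)
y = 2 * z + rng.standard_normal(N)

# Estimate E[Y | Z = z0] by averaging y over samples with z near z0.
for z0 in [-1.0, 0.0, 1.5]:
    near = np.abs(z - z0) < 0.01
    print(z0, y[near].mean())  # for this model, E[Y | Z = z0] = 2 * z0
```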

We can also average over the draw of Z. First, note that since Z is
a random variable, $\hat{g}$ is a priori also random; we might say

$$\hat{g}(Z) = E[Y \mid Z].$$

Let me pause here because this is the point where many people start
to get confused. The quantities

$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy, \quad\text{and}\quad
E[Y \mid Z = z] = \int_{-\infty}^{\infty} y f_Y(y \mid Z = z)\,dy$$

are deterministic, but we re-emphasize that

$$E[Y \mid Z] = \int_{-\infty}^{\infty} y f_Y(y \mid Z)\,dy$$

is random. The above integrates out the randomness of Y, but not
that of Z. (Note that in $E[Y \mid Z = z]$ the randomness in Z is
removed through direct observation.)

The "average estimate" is now

$$E[\hat{g}(Z)] = E[E[Y \mid Z]] = \int E[Y \mid Z = z]\, f_Z(z)\,dz = E[Y].$$

The identity $E[E[Y \mid Z]] = E[Y]$ is known as the law of iterated
expectation or total expectation. The inside E above integrates out
the randomness in Y while the outside one integrates over Z; the
result is a deterministic quantity.
Since $E[\hat{g}(Z)] = E[Y]$, on average we are doing the same thing
as if we didn't observe Z at all. But since we are adapting g to the
draw of Z, we get better average performance. The mean square error
(which is random through Z) is

$$E[(Y - \hat{g}(Z))^2 \mid Z] = E[Y^2 \mid Z] - (E[Y \mid Z])^2 = \mathrm{var}(Y \mid Z).$$

The average performance is then

$$E\left[ E[(Y - \hat{g}(Z))^2 \mid Z] \right] = E[\mathrm{var}(Y \mid Z)] \le \mathrm{var}(Y).$$

The last inequality, which follows from the law of total variance
discussed below, means that on average, an (optimal) estimator using
knowledge of Z will outperform an (optimal) estimator without
knowledge of Z. Of course, when Y and Z are independent we have
$E[Y \mid Z] = E[Y]$, so $E[\mathrm{var}(Y \mid Z)] = \mathrm{var}(Y)$
and knowing Z makes no difference.

The law of total variance
Recall that for any random variable Y,

$$\mathrm{var}(Y) = E[Y^2] - (E[Y])^2. \qquad (1)$$

As we have seen above, $E[Y \mid Z]$ is a random variable (where now
the randomness is being caused by Z). Hence it also has a mean

$$E[E[Y \mid Z]] = E[Y],$$

and a variance

$$\mathrm{var}(E[Y \mid Z]) = E\left[ (E[Y \mid Z])^2 \right] - (E[E[Y \mid Z]])^2
= E\left[ (E[Y \mid Z])^2 \right] - (E[Y])^2. \qquad (2)$$
The quantity (again as we have seen above) $\mathrm{var}(Y \mid Z)$ is
also a random variable; we can write its mean as

$$E[\mathrm{var}(Y \mid Z)] = E\left[ E[(Y - E[Y \mid Z])^2 \mid Z] \right]
= E\left[ E[Y^2 \mid Z] \right] - E\left[ (E[Y \mid Z])^2 \right]
= E[Y^2] - E\left[ (E[Y \mid Z])^2 \right]. \qquad (3)$$
Adding together (2) and (3) and applying (1) gives us the cute
expression

$$\mathrm{var}(Y) = E[\mathrm{var}(Y \mid Z)] + \mathrm{var}(E[Y \mid Z]).$$

This is known as the law of total variance. It basically says that
you can decompose the variance of a random variable into the expected
variance in "using Z to predict Y" and the variance of the prediction
(conditional expectation) itself. Note that all of the quantities above
are non-negative, so we have the inequalities

$$E[\mathrm{var}(Y \mid Z)] \le \mathrm{var}(Y) \quad\text{and}\quad \mathrm{var}(E[Y \mid Z]) \le \mathrm{var}(Y).$$
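The decomposition is easy to verify by simulation. A sketch using the same kind of linear-Gaussian model as before (illustrative; here $E[Y \mid Z] = 2Z$, $\mathrm{var}(Y \mid Z) = 1$, and $\mathrm{var}(Y) = 5$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: Z ~ N(0, 1), Y = 2 Z + noise with unit variance.
N = 1_000_000
z = rng.standard_normal(N)
y = 2 * z + rng.standard_normal(N)

cond_mean = 2 * z  # E[Y | Z] for this model
cond_var = 1.0     # var(Y | Z) is constant here

print(np.var(y))                     # var(Y), ~5
print(cond_var + np.var(cond_mean))  # E[var(Y|Z)] + var(E[Y|Z]), also ~5
```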
