Statistical Estimation
and Classification
Georgia Tech ECE 7750 Notes by J. Romberg. Last updated 12:17, October 24, 2020
Probability: An Extremely Concise Review
1. A scalar-valued random variable X is completely characterized
by its distribution function
FX (u) = P (X ≤ u) .
1 Technically, it must be a subset of the real line that can be written as
some combination of countable unions, countable intersections, and
complements of intervals. You really have to know something about real
analysis to construct a set that does not meet this criterion.
It is possible that a pdf exists even if FX is not differentiable
everywhere, for example:
FX (u) = { 0 for u < 0;  u for 0 ≤ u ≤ 1;  1 for u ≥ 1 },

which has pdf

fX (x) = { 0 for x < 0;  1 for 0 ≤ x ≤ 1;  0 for x > 1 }.
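As a quick numerical sanity check of this example (a minimal sketch in Python, assuming numpy is available; the grid and step size below are arbitrary choices), we can compare the empirical distribution function of Uniform(0, 1) samples against FX (u) = u, and a finite difference of FX against the pdf fX :

import numpy as np

rng = np.random.default_rng(0)
samples = rng.uniform(0.0, 1.0, size=100_000)

u = np.linspace(0.01, 0.99, 50)

# Empirical CDF: the fraction of samples that are <= u
emp_cdf = np.array([np.mean(samples <= ui) for ui in u])
print(np.max(np.abs(emp_cdf - u)))      # small: matches FX(u) = u on [0, 1]

# A centered finite difference of FX recovers the pdf fX(x) = 1 on (0, 1)
F = lambda v: np.clip(v, 0.0, 1.0)      # FX for this example
h = 1e-4
print(np.max(np.abs((F(u + h) - F(u - h)) / (2 * h) - 1.0)))   # essentially 0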
This is sometimes referred to as the “variation around the mean”.
Aside from the zeroth moment, there is nothing that says that
the integrals above must converge; it is easy to construct examples
of well-defined random variables where E[X] = ∞.
6. From the joint pdf fX,Y (x, y), we can recover the individual
marginal pdfs for X and Y using
fX (x) = ∫_{−∞}^{∞} fX,Y (x, y) dy,

fY (y) = ∫_{−∞}^{∞} fX,Y (x, y) dx.
2 For fixed u, v ∈ R, the notation P (X ≤ u, Y ≤ v) should be read as “the
probability that X is ≤ u and Y is ≤ v.”
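To make item 6 concrete, here is a minimal numerical sketch (the joint density fX,Y (x, y) = x + y on the unit square is just a toy choice for illustration) that computes a marginal by integrating the joint pdf over a grid:

import numpy as np

# Toy joint pdf fXY(x, y) = x + y on the unit square (it integrates to 1)
n = 1000
x = (np.arange(n) + 0.5) / n            # midpoints of a grid on [0, 1]
y = (np.arange(n) + 0.5) / n
dy = 1.0 / n
X, Y = np.meshgrid(x, y, indexing="ij")
f_xy = X + Y

# Marginal: fX(x) = integral of fXY(x, y) over y (midpoint rule on the grid)
f_x = f_xy.sum(axis=1) * dy

print(np.max(np.abs(f_x - (x + 0.5))))  # ~ 0: the exact marginal is x + 1/2
print(f_x.sum() * dy)                   # ~ 1: fX is itself a valid pdf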
7. If X and Y do interact in a meaningful way, then observing
one of them affects the distribution of the other. If we observe
X = x, then with this knowledge, the density for Y becomes
fY (y|X = x) = fX,Y (x, y) / fX (x).

This is a density over y; it is easy to check that it is nonnegative
everywhere and that it integrates to one. fY (y|X = x) is called
the conditional density for Y given X = x.
Factoring the joint density the other way, as fX,Y (x, y) = fY (y)fX (x|y),
also gives us a handy way to compute the marginals:

fX (x) = ∫_{−∞}^{∞} fY (y)fX (x|y) dy.
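Continuing the toy density from the sketch above (again just an illustration on a grid), we can form the conditional density as a ratio, check that it integrates to one, and recover a marginal from this factorization:

import numpy as np

# Same toy joint pdf as before: fXY(x, y) = x + y on the unit square
n = 1000
x = (np.arange(n) + 0.5) / n
y = (np.arange(n) + 0.5) / n
dx = dy = 1.0 / n
X, Y = np.meshgrid(x, y, indexing="ij")
f_xy = X + Y
f_x = f_xy.sum(axis=1) * dy             # marginal of X
f_y = f_xy.sum(axis=0) * dx             # marginal of Y

# Conditional density fY(y | X = x0) = fXY(x0, y) / fX(x0), at x0 = x[500]
f_y_given_x0 = f_xy[500, :] / f_x[500]
print(f_y_given_x0.sum() * dy)          # ~ 1: it is a density over y

# Marginal from the factorization: fX(x) = integral of fY(y) fX(x|y) over y
f_x_given_y = f_xy / f_y[None, :]       # fX(x | Y = y), one column per y
f_x_check = (f_y[None, :] * f_x_given_y).sum(axis=1) * dy
print(np.max(np.abs(f_x_check - f_x)))  # ~ 0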
10. All of the above extends in the obvious way to more than two
random variables. A random vector
X = [X1 X2 · · · XD]^T

is described by the joint pdf fX (x) of its entries, which we can factor
into a product of conditional densities:

fX (x) = fX1 (x1) fX2 (x2|x1) fX3 (x3|x2, x1) · · · fXD (xD |x1, . . . , xD−1).
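One way to read this factorization is as a recipe for drawing a realization of X one coordinate at a time: draw X1 from fX1 , then X2 from fX2 (·|x1), and so on. A minimal sketch, using an arbitrary toy model in which each coordinate is the previous one plus fresh Gaussian noise:

import numpy as np

rng = np.random.default_rng(0)
D = 4

def sample_X():
    # Draw X = (X1, ..., XD) one coordinate at a time: X1 ~ N(0, 1), and then
    # Xd given the previous coordinates is N(x_{d-1}, 1) (a simple Markov chain,
    # so each conditional density only depends on the most recent coordinate).
    x = np.zeros(D)
    x[0] = rng.normal(0.0, 1.0)
    for d in range(1, D):
        x[d] = rng.normal(x[d - 1], 1.0)
    return x

samples = np.array([sample_X() for _ in range(50_000)])
print(samples.mean(axis=0))   # each coordinate has mean ~ 0
print(samples.var(axis=0))    # variances ~ 1, 2, 3, 4: the noise accumulates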
The second moment is the D × D matrix of all correlations
between entries:

[ E[X1²]    E[X1X2]   · · ·   E[X1XD] ]
[ E[X2X1]   E[X2²]    · · ·   E[X2XD] ]
[   ...        ...    · · ·      ...  ]
[ E[XDX1]   E[XDX2]   · · ·   E[XD²]  ]
If both the mean and covariance are unknown, we first estimate
the mean vector as above, µ̂ = (1/M) Σ_{m=1}^{M} xm, then take

R̂ = (1/(M − 1)) Σ_{m=1}^{M} xm xm^T − µ̂ µ̂^T.
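Here is a minimal numerical sketch of these estimates (the 3-dimensional Gaussian used to generate the data is an arbitrary choice; the point is only that µ̂ and R̂ computed from the expressions above approach the true mean and covariance as M grows):

import numpy as np

rng = np.random.default_rng(0)

# "True" mean and covariance, used only to generate synthetic data
mu_true = np.array([1.0, -2.0, 0.5])
R_true = np.array([[2.0, 0.5, 0.0],
                   [0.5, 1.0, 0.3],
                   [0.0, 0.3, 1.5]])

M = 100_000
x = rng.multivariate_normal(mu_true, R_true, size=M)    # row m is the sample xm

# Sample mean: mu_hat = (1/M) sum_m xm
mu_hat = x.mean(axis=0)

# Covariance estimate from the expression above; note x.T @ x = sum_m xm xm^T
R_hat = (x.T @ x) / (M - 1) - np.outer(mu_hat, mu_hat)

print(np.max(np.abs(mu_hat - mu_true)))   # small once M is large
print(np.max(np.abs(R_hat - R_true)))     # small once M is large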
The Weak Law of Large Numbers
The WLLN is absolutely fundamental to machine learning (and really
to all of probability and statistics). It basically formalizes the notion
that given a series of independent samples of a random variable X, we
can approximate E[X] by averaging the samples. The WLLN states
that if X1, X2, . . . are independent copies of a random variable X,
(1/N) Σ_{n=1}^{N} Xn → E[X] as N → ∞.
The only condition we will need for this convergence is that X has finite variance.
We start by stating the main result precisely. Let X be a random
variable with pdf fX (x), mean E[X] = µ, and variance var(X) =
σ² < ∞. We observe samples of X labeled X1, X2, . . . , XN . The
Xi are independent of one another, and they all have the same dis-
tribution as X. We will show that the sample mean formed from a
sample of size N :
MN = (1/N)(X1 + X2 + · · · + XN ),
obeys³

P (|MN − µ| > ε) ≤ σ²/(N ε²),

where ε > 0 is an arbitrarily small number. In the expression above,
MN is the only thing which is random; µ and σ² are fixed underlying
properties of the distribution, N is the amount of data we see, and
ε is something we can choose arbitrarily.
3 This is a simple example of a concentration bound. It is not that tight;
we will later encounter inequalities of this type that are much more precise.
But it is relatively simple and will serve our purpose here.
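Here is a small simulation of the bound above (the exponential distribution and the particular values of N and ε below are arbitrary choices): we repeat the averaging experiment many times, estimate P (|MN − µ| > ε) empirically, and compare it to σ²/(N ε²):

import numpy as np

rng = np.random.default_rng(0)

# X ~ Exponential with mean mu = 1 and variance sigma^2 = 1
mu, sigma2 = 1.0, 1.0
N = 1_000            # number of samples averaged to form M_N
eps = 0.1
trials = 20_000      # independent repetitions of the whole experiment

X = rng.exponential(scale=1.0, size=(trials, N))
M_N = X.mean(axis=1)                        # one sample mean per trial

p_emp = np.mean(np.abs(M_N - mu) > eps)     # empirical P(|M_N - mu| > eps)
bound = sigma2 / (N * eps**2)               # the bound sigma^2 / (N eps^2)

print(p_emp, bound)   # the empirical probability sits (well) below the bound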
Notice that no matter how small ε is, the probability on the right-hand
side above goes to zero as N → ∞. That is, for any fixed ε > 0,

lim_{N→∞} P (|MN − µ| > ε) = 0.
This result follows from two simple but important tools known as
the Markov and Chebyshev inequalities.
Markov inequality
Let X be a random variable that takes only nonnegative values. Then

P (X ≥ a) ≤ E[X]/a   for all a > 0.
For example, the probability that X is more than 5 times its mean
is at most 1/5, more than 10 times the mean is at most 1/10, etc. And
this holds for any such distribution.
Restricting the integral that defines E[X] to the region where X ≥ a
gives E[X] ≥ a P (X ≥ a), and so P (X ≥ a) ≤ E[X]/a.
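As a quick numerical check (using an exponential random variable, an arbitrary nonnegative example), we can compare the exact tail probability P (X ≥ a) against the bound E[X]/a:

import numpy as np

# X ~ Exponential with E[X] = 2 (a nonnegative random variable)
mean = 2.0
for a in [2.0, 10.0, 20.0]:          # a = 1x, 5x, and 10x the mean
    tail = np.exp(-a / mean)         # exact P(X >= a) for this distribution
    markov = mean / a                # the Markov bound E[X]/a
    print(a, tail, markov, tail <= markov)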
Chebyshev inequality
The main use of the Markov inequality turns out to be in deriving
other, more accurate deviation inequalities. Here we will
use it to derive the Chebyshev inequality, from which the weak
law of large numbers will follow immediately.
Applying the Markov inequality to the nonnegative random variable
(X − µ)² gives, for any a > 0,

P (|X − µ| ≥ a) = P ((X − µ)² ≥ a²) ≤ E[(X − µ)²]/a² = σ²/a².

We now have a bound which depends on the mean and the variance
of X; this leads to a more accurate approximation of the probability.
WLLN: Let X1, X2, . . . be iid random variables as above. For every
ε > 0, we have

P (|MN − µ| > ε) = P ( |(X1 + · · · + XN )/N − µ| > ε ) −→ 0,

as N → ∞.
In particular, if A is an event and we estimate its probability by pempirical,
the fraction of the N samples for which A occurs (that is, we apply the
WLLN to the indicator random variable of A), then

pempirical → P (A) , as N → ∞.
More generally, if g is a function for which g(X) has finite mean and
variance, then given independent realizations X1, . . . , XN , we have

(1/N) Σ_{n=1}^{N} g(Xn) → E[g(X)]

as N → ∞.
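This is the basic justification for Monte Carlo estimation: to approximate E[g(X)], average g over samples of X. A minimal sketch (the choices of X, g, and the event A below are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000
X = rng.uniform(0.0, 1.0, size=N)

# E[g(X)] for g(x) = x^2; the exact value is 1/3
print(np.mean(X**2))

# P(A) for the event A = {X <= 1/2}, using the indicator of A as g; exactly 1/2
print(np.mean(X <= 0.5))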
Minimum Mean-Square Error Estimation
Now we will take our first look at estimating variables that are them-
selves random, subject to a known probability law.
We start our discussion with a very basic problem. Suppose Y is a
scalar random variable with a known pdf fY (y). Here is a fun game:
you guess what Y is going to be, then I draw a realization of Y
corresponding to its probability law, then we see how close you were
with your guess.
What is your best guess?
Well, that of course depends on what exactly we mean by “best”, i.e.
what price I pay for being a certain amount off. But if the penalty is
the mean-squared error, we know exactly how to minimize it.
Let g be your guess. The error in your guess is of course random
(since the realization of Y is random), and so is the squared-error
(Y − g)². We want to choose g so that the mean of the squared error
is as small as possible:

minimize over g:   E[(Y − g)²].
Expanding E[(Y − g)²] = E[Y ²] − 2g E[Y ] + g² and setting the
derivative with respect to g to zero shows that the minimizer is
ĝ = E[Y ]. The mean squared error for this choice ĝ is of course
exactly the variance of Y ,

E[(Y − ĝ)²] = E[(Y − E[Y ])²] = var(Y ).
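A small simulation that illustrates this (the distribution of Y below is an arbitrary choice): sweep over candidate guesses g and check that the average squared error is smallest near g = E[Y ], with minimum value close to var(Y ):

import numpy as np

rng = np.random.default_rng(0)

# Y ~ Exponential with E[Y] = 2 and var(Y) = 4; many independent realizations
Y = rng.exponential(scale=2.0, size=1_000_000)

guesses = np.linspace(0.0, 5.0, 101)
mse = np.array([np.mean((Y - g) ** 2) for g in guesses])

print(guesses[np.argmin(mse)])   # the best guess is close to E[Y] = 2
print(np.min(mse), Y.var())      # and the minimum MSE is close to var(Y)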
The story gets more interesting (and relevant) when we have multiple
random variables, some of which we observe, some of which we do not.
Suppose that two random variables (Y, Z) have joint pdf fY,Z (y, z).
Suppose that a realization of (Y, Z) is drawn, and I get to observe
Z. What have I learned about Y ?
If Y and Z are independent, then the answer is of course nothing.
But if they are not independent, then the marginal distribution of Y
changes. In particular, before the random variables were drawn, the
(marginal) pdf for Y was
fY (y) = ∫ fY,Z (y, z) dz.
After we observe Z = z, the density for Y becomes the conditional
density fY (y|Z = z), and by the same argument as before the best
guess in the mean-squared-error sense is

ĝ = E[Y |Z = z].

We can also average over the draw of Z. First, note that since Z is
a random variable, ĝ is a priori also random; we might write this
random variable as E[Y |Z].
Let me pause here because this is the point where many people start
to get confused. The quantities
E[Y ] = ∫_{−∞}^{∞} y fY (y) dy,   and   E[Y |Z = z] = ∫_{−∞}^{∞} y fY (y|Z = z) dy
are both ordinary (non-random) numbers. In contrast, E[Y |Z], with
the random variable Z plugged in rather than a fixed observed value z,
is random: it integrates out the randomness of Y , but not that of Z.
(Note that in E[Y |Z = z] the randomness in Z is removed
through direct observation.)
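To see this distinction concretely, here is a toy sketch (the joint distribution of (Y, Z) below is an arbitrary choice): for each observed value z, E[Y |Z = z] is just a number, but plugging the random Z back in gives a random variable E[Y |Z]:

import numpy as np

rng = np.random.default_rng(0)
N = 100_000

# A toy pair: Z uniform on {0, 1, 2}, and Y = Z plus unit-variance noise
Z = rng.integers(0, 3, size=N)
Y = Z + rng.normal(0.0, 1.0, size=N)

# E[Y | Z = z] is an ordinary number for each fixed z
cond_mean = {z: Y[Z == z].mean() for z in (0, 1, 2)}
print(cond_mean)                        # approximately {0: 0, 1: 1, 2: 2}

# E[Y | Z] is a random variable: the function above evaluated at the random Z
E_Y_given_Z = np.array([cond_mean[z] for z in Z])

print(E_Y_given_Z.mean(), Y.mean())     # both approximately E[Y] = 1
print(E_Y_given_Z.var())                # approximately var(Z) = 2/3: it has spread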
The law of total variance
Recall that for any random variable Y ,
var(Y ) = E[Y ²] − (E[Y ])².     (1)
As we have seen above, E[Y |Z] is a random variable (where now the
randomness is being caused by Z). Hence it also has a mean
E[E[Y |Z]] = E[Y ],
and a variance
var(E[Y |Z]) = E[(E[Y |Z])²] − (E[E[Y |Z]])²
            = E[(E[Y |Z])²] − (E[Y ])².     (2)
The quantity (again as we have seen above) var(Y |Z) is also a ran-
dom variable; we can write its mean as
E[var(Y |Z)] = E[ E[(Y − E[Y |Z])² | Z] ]
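These two pieces combine into the law of total variance, var(Y ) = E[var(Y |Z)] + var(E[Y |Z]). A quick numerical check, reusing the same kind of toy pair as above (an arbitrary choice of joint distribution):

import numpy as np

rng = np.random.default_rng(0)
N = 1_000_000

# Toy pair again: Z uniform on {0, 1, 2}, Y = Z plus unit-variance noise
Z = rng.integers(0, 3, size=N)
Y = Z + rng.normal(0.0, 1.0, size=N)

# Estimate E[Y | Z = z] and var(Y | Z = z) by grouping the samples on Z
cond_means = np.array([Y[Z == z].mean() for z in (0, 1, 2)])
cond_vars = np.array([Y[Z == z].var() for z in (0, 1, 2)])
probs = np.array([np.mean(Z == z) for z in (0, 1, 2)])

var_of_cond_mean = np.sum(probs * cond_means**2) - np.sum(probs * cond_means)**2
mean_of_cond_var = np.sum(probs * cond_vars)

print(Y.var())                              # total variance, about 2/3 + 1
print(mean_of_cond_var + var_of_cond_mean)  # E[var(Y|Z)] + var(E[Y|Z]) matches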