
IV. Statistical Estimation and Classification

Georgia Tech ECE 7750 Notes by J. Romberg. Last updated 12:17, October 24, 2020
Probability: An Extremely Concise Review
1. A scalar-valued random variable X is completely characterized
by its distribution function

$$F_X(u) = P(X \le u).$$

This is also called the cumulative distribution function
(cdf). $F_X(u)$ is monotonically non-decreasing in u; it goes to one as
$u \to \infty$ and goes to zero as $u \to -\infty$.

2. If $F_X$ is differentiable, then we can also characterize X using
its probability density function (pdf)

$$f_X(x) = \left.\frac{dF_X(u)}{du}\right|_{u=x}.$$

The density has the properties $f_X(x) \ge 0$ and

$$\int_{-\infty}^{\infty} f_X(x)\,dx = 1.$$

Events of interest are subsets^1 of the real line; given such an
event/subset E, we can compute the probability of E occurring as

$$P(E) = \int_{x \in E} f_X(x)\,dx.$$

^1 Technically, it must be a subset of the real line that can be written as
some combination of countable unions, countable intersections, and
complements of intervals. You really have to know something about real
analysis to construct a set that does not meet this criterion.

It is possible that a pdf exists even if $F_X$ is not differentiable
everywhere, for example:

$$F_X(u) = \begin{cases} 0, & u < 0, \\ u, & 0 \le u \le 1, \\ 1, & u \ge 1 \end{cases}
\quad\text{has pdf}\quad
f_X(x) = \begin{cases} 0, & x < 0, \\ 1, & 0 \le x \le 1, \\ 0, & x > 1. \end{cases}$$
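As a quick numerical sanity check, here is a minimal Python sketch (numpy/scipy are not assumed anywhere in these notes; the distribution is the uniform example above):

```python
import numpy as np
from scipy.stats import uniform

# The example above is the standard uniform distribution on [0, 1].
U = uniform(loc=0.0, scale=1.0)

# Wherever the cdf is differentiable, the pdf is its derivative;
# compare against a centered finite difference at interior points.
for x in [0.25, 0.5, 0.75]:
    h = 1e-6
    deriv = (U.cdf(x + h) - U.cdf(x - h)) / (2 * h)
    print(x, U.pdf(x), deriv)  # pdf and numerical derivative agree (both 1.0)

# P(E) for the event E = [0.2, 0.6] is the integral of the pdf over E,
# which here is just F(0.6) - F(0.2) = 0.4.
print(U.cdf(0.6) - U.cdf(0.2))
```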

3. The expectation of a function g(X) of a random variable is

$$E[g(X)] = \int_{-\infty}^{\infty} g(x) f_X(x)\,dx.$$

This is the "average value" of g(X) in that given a series of
realizations $X = x_1, X = x_2, \ldots$ of X,

$$\frac{1}{M} \sum_{m=1}^{M} g(x_m) \to E[g(X)], \quad\text{as } M \to \infty.$$

This fact is known as the (weak) law of large numbers.
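Here is a minimal simulation sketch of that convergence (the choice $X \sim$ Exponential(1) with $g(x) = x^2$ is illustrative; for it, $E[g(X)] = 2$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative choice: X ~ Exponential(1) and g(x) = x**2, so E[g(X)] = 2.
g = lambda x: x**2

for M in [100, 10_000, 1_000_000]:
    x = rng.exponential(scale=1.0, size=M)
    print(M, g(x).mean())  # the sample average approaches 2 as M grows
```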

4. The moment of X of degree p is the expectation of the monomial
$g(x) = x^p$. The zeroth moment is always 1:

$$E[X^0] = E[1] = \int_{-\infty}^{\infty} f_X(x)\,dx = 1,$$

and the first moment is the mean:

$$E[X] = \int_{-\infty}^{\infty} x\, f_X(x)\,dx.$$

The variance is the second moment minus the mean squared:

$$\mathrm{var}(X) = E[X^2] - (E[X])^2 = E[(X - E[X])^2].$$

This is sometimes referred to as the "variation around the mean".
Aside from the zeroth moment, there is nothing that says that
the integrals above must converge; it is easy to construct
examples of well-defined random variables where $E[X] = \infty$.

5. A pair of random variables (X, Y) is completely described by
their joint distribution function (joint cdf)^2

$$F_{X,Y}(u, v) = P(X \le u, Y \le v).$$

Again, if $F_{X,Y}$ is continuously differentiable, (X, Y) is also
characterized by the density

$$f_{X,Y}(x, y) = \left.\frac{\partial^2 F_{X,Y}(u, v)}{\partial u\,\partial v}\right|_{(u,v)=(x,y)}.$$

In this case, events of interest correspond to regions in the plane
$\mathbb{R}^2$, and the probability of an event occurring is the integral of
the density over this region.

6. From the joint pdf $f_{X,Y}(x, y)$, we can recover the individual
marginal pdfs for X and Y using

$$f_X(x) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dy, \qquad
f_Y(y) = \int_{-\infty}^{\infty} f_{X,Y}(x, y)\,dx.$$

The pair of densities $f_X(x), f_Y(y)$ tells us how X and Y behave
individually, but not how they interact.

^2 For fixed $u, v \in \mathbb{R}$, the notation $P(X \le u, Y \le v)$ should be read as "the
probability that X is $\le u$ and Y is $\le v$."

7. If X and Y do interact in a meaningful way, then observing
one of them affects the distribution of the other. If we observe
X = x, then with this knowledge, the density for Y becomes

$$f_Y(y \mid X = x) = \frac{f_{X,Y}(x, y)}{f_X(x)}.$$

This is a density over y; it is easy to check that it is nonnegative
everywhere and that it integrates to one. $f_Y(y \mid X = x)$ is called
the conditional density for Y given X = x.

8. We call X and Y independent if observing X tells us nothing
about Y (and vice versa). This means

$$f_Y(y \mid X = x) = f_Y(y), \quad\text{for all } x \in \mathbb{R},$$

and

$$f_X(x \mid Y = y) = f_X(x), \quad\text{for all } y \in \mathbb{R}.$$

(If one of the statements above is true, then the other follows
automatically.) Equivalently, independence means that the joint pdf
is separable:

$$f_{X,Y}(x, y) = f_X(x)\, f_Y(y).$$

9. We can always factor the joint pdf in two different ways:

$$f_X(x)\, f_Y(y \mid X = x) = f_{X,Y}(x, y) = f_Y(y)\, f_X(x \mid Y = y).$$

At this point, we should be comfortable enough with what is
going on that we can use $f_Y(y \mid x)$ as short-hand notation for
$f_Y(y \mid X = x)$. Then we can rewrite the above in its more
common form as

$$f_X(x)\, f_Y(y \mid x) = f_{X,Y}(x, y) = f_Y(y)\, f_X(x \mid y).$$

This factorization also gives us a handy way to compute the
marginals:

$$f_X(x) = \int_{-\infty}^{\infty} f_Y(y)\, f_X(x \mid y)\,dy.$$

It also yields Bayes' equation

$$f_X(x \mid y) = \frac{f_Y(y \mid x)\, f_X(x)}{f_Y(y)},$$

which is a fundamental relation for statistical inference.
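A small numerical sketch of Bayes' equation may help (the Gaussian prior/likelihood and the grid discretization are illustrative choices, not part of the notes):

```python
import numpy as np

# Illustrative model: X ~ N(0, 1) (prior) and Y | X = x ~ N(x, 0.5**2).
x = np.linspace(-5, 5, 2001)
dx = x[1] - x[0]

prior = np.exp(-x**2 / 2) / np.sqrt(2 * np.pi)  # f_X(x)
y_obs = 1.0
lik = np.exp(-(y_obs - x)**2 / (2 * 0.25)) / np.sqrt(2 * np.pi * 0.25)  # f_Y(y|x)

# f_Y(y) by marginalizing the factorization f_Y(y|x) f_X(x) over x.
evidence = np.sum(lik * prior) * dx

posterior = lik * prior / evidence  # f_X(x|y), Bayes' equation

print(np.sum(posterior) * dx)      # integrates to ~1
print(np.sum(x * posterior) * dx)  # posterior mean; ~0.8 for this model
```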

10. All of the above extends in the obvious way to more than two
random variables. A random vector

$$X = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_D \end{bmatrix}$$

is completely characterized by the density $f_X(x) = f_X(x_1, \ldots, x_D)$
on $\mathbb{R}^D$. In general, we can factor the joint pdf as

$$f_X(x) = f_{X_1}(x_1)\, f_{X_2}(x_2 \mid x_1)\, f_{X_3}(x_3 \mid x_2, x_1) \cdots f_{X_D}(x_D \mid x_1, \ldots, x_{D-1}).$$

11. The pth moment of a random vector X that maps into $\mathbb{R}^D$ is
the collection of expectations of all monomials of order p. The
mean of a random vector is a vector of length D:

$$E[X] = \begin{bmatrix} E[X_1] \\ \vdots \\ E[X_D] \end{bmatrix},$$

the second moment is the $D \times D$ matrix of all correlations
between entries:

$$E[XX^T] = \begin{bmatrix} E[X_1^2] & E[X_1 X_2] & \cdots & E[X_1 X_D] \\ \vdots & \vdots & \ddots & \vdots \\ E[X_D X_1] & \cdots & & E[X_D^2] \end{bmatrix},$$

the third moment is the $D \times D \times D$ tensor $E[X \otimes X \otimes X]$,
where

$$(E[X \otimes X \otimes X])(i, j, k) = E[X_i X_j X_k],$$

and so on. The covariance matrix contains all the pairs of centered
second moments:

$$R_{i,j} = E[(X_i - E[X_i])(X_j - E[X_j])].$$

If $\mu_X = E[X]$ is the mean vector, we can write the covariance
matrix succinctly in terms of the second moment as

$$R = E[XX^T] - \mu_X \mu_X^T.$$

12. Given independent observations $x_1, x_2, \ldots, x_M$ of a random
vector X with unknown (or partially known) distribution, a
completely reasonable way to estimate the mean vector is using

$$\hat{\mu} = \frac{1}{M} \sum_{m=1}^{M} x_m.$$

If the mean $\mu_X = E[X]$ is known but the covariance is not, we
can estimate the covariance using

$$\hat{R} = \left( \frac{1}{M} \sum_{m=1}^{M} x_m x_m^T \right) - \mu_X \mu_X^T.$$

If both the mean and covariance are unknown, we first estimate
the mean vector as above, then take

$$\hat{R} = \frac{1}{M-1} \left( \sum_{m=1}^{M} x_m x_m^T - M \hat{\mu}\hat{\mu}^T \right)
= \frac{1}{M-1} \sum_{m=1}^{M} (x_m - \hat{\mu})(x_m - \hat{\mu})^T.$$

The difference in the scaling is to ensure that $E[\hat{R}] = R$ in
both cases.
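In numpy this looks something like the following (a sketch with an illustrative ground-truth mean and covariance; note that np.cov uses the same 1/(M-1) scaling by default):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative ground truth in D = 3 dimensions.
mu = np.array([1.0, -2.0, 0.5])
A = rng.standard_normal((3, 3))
R = A @ A.T  # any matrix of this form is a valid (PSD) covariance

M = 50_000
X = rng.multivariate_normal(mu, R, size=M)  # rows are the samples x_m

mu_hat = X.mean(axis=0)

# Mean unknown: center by mu_hat and scale by 1/(M-1), as in the notes.
Xc = X - mu_hat
R_hat = (Xc.T @ Xc) / (M - 1)  # same as np.cov(X, rowvar=False)

print(np.abs(mu_hat - mu).max())  # both errors shrink as M grows
print(np.abs(R_hat - R).max())
```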

The Weak Law of Large Numbers
The WLLN is absolutely fundamental to machine learning (and really
to all of probability and statistics). It basically formalizes the notion
that given a series of independent samples of a random variable X, we
can approximate E[X] by averaging the samples. The WLLN states
that if $X_1, X_2, \ldots$ are independent copies of a random variable X,

$$\frac{1}{N} \sum_{n=1}^{N} X_n \to E[X] \quad\text{as } N \to \infty.$$

The only condition we will need for this convergence is that X has
finite variance.
We start by stating the main result precisely. Let X be a random
variable with pdf $f_X(x)$, mean $E[X] = \mu$, and variance
$\mathrm{var}(X) = \sigma^2 < \infty$. We observe samples of X labeled
$X_1, X_2, \ldots, X_N$. The $X_i$ are independent of one another, and
they all have the same distribution as X. We will show that the sample
mean formed from a sample of size N,

$$M_N = \frac{1}{N}(X_1 + X_2 + \cdots + X_N),$$

obeys^3

$$P(|M_N - \mu| > \epsilon) \le \frac{\sigma^2}{N \epsilon^2},$$

where $\epsilon > 0$ is an arbitrarily small number. In the expression
above, $M_N$ is the only thing which is random; $\mu$ and $\sigma^2$ are
fixed underlying properties of the distribution, N is the amount of data
we see, and $\epsilon$ is something we can choose arbitrarily.
^3 This is a simple example of a concentration bound. It is not that tight;
we will later encounter inequalities of this type that are much more
precise. But it is relatively simple and will serve our purpose here.

Notice that no matter how small $\epsilon$ is, the probability on the
right hand side above goes to zero as $N \to \infty$. That is, for any
fixed $\epsilon > 0$,

$$\lim_{N \to \infty} P(|M_N - \mu| > \epsilon) = 0.$$

This result follows from two simple but important tools known as
the Markov and Chebyshev inequalities.

Markov inequality

Let X be a random variable that only takes positive values:

$$f_X(x) = 0 \text{ for } x < 0, \quad\text{or}\quad F_X(0) = 0.$$

Then

$$P(X \ge a) \le \frac{E[X]}{a} \quad\text{for all } a > 0.$$

For example, the probability that X is more than 5 times its mean
is at most 1/5, more than 10 times the mean at most 1/10, etc. And
this holds for any distribution.

The Markov inequality is easy to prove:

$$E[X] = \int_0^{\infty} x f_X(x)\,dx
\ge \int_a^{\infty} x f_X(x)\,dx
\ge \int_a^{\infty} a f_X(x)\,dx
= a \cdot P(X \ge a),$$

and so $P(X \ge a) \le \frac{E[X]}{a}$.

Again, this is a very general statement in that we have assumed
nothing about X other than that it is positive. The price for the
generality is that the bound is typically very loose, and does not
usually capture the behavior of $P(X \ge a)$. We can, however,
cleverly apply the Markov inequality to get something slightly more
useful.
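To see both the validity and the looseness of the bound, here is a minimal empirical check (the exponential distribution is an illustrative choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: X ~ Exponential(1), a positive random variable with E[X] = 1.
x = rng.exponential(scale=1.0, size=1_000_000)

for a in [2.0, 5.0, 10.0]:
    empirical = (x >= a).mean()  # estimate of P(X >= a)
    markov = x.mean() / a        # Markov bound E[X]/a
    print(a, empirical, markov)  # the true tail sits far below the bound
```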

Chebyshev inequality

The main use of the Markov inequality turns out to be its use in
deriving other, more accurate deviation inequalities. Here we will
use it to derive the Chebyshev inequality, from which the weak
law of large numbers will follow immediately.

Chebyshev inequality: If X is a random variable with mean $\mu$
and variance $\sigma^2$, then

$$P(|X - \mu| > c) \le \frac{\sigma^2}{c^2} \quad\text{for all } c > 0.$$

The Chebyshev inequality follows immediately from the Markov
inequality in the following way. No matter what range of values X
takes, the quantity $|X - \mu|^2$ is always nonnegative. Thus

$$P\left(|X - \mu|^2 > c^2\right) \le \frac{E[|X - \mu|^2]}{c^2} = \frac{\sigma^2}{c^2}.$$

Since squaring $(\cdot)^2$ is monotonic (invertible) over positive
numbers,

$$P\left(|X - \mu|^2 > c^2\right) = P(|X - \mu| > c) \le \frac{\sigma^2}{c^2}.$$

We now have a bound which depends on the mean and the variance
of X; this leads to a more accurate bound on the probability.

Simple proof of the weak law of large numbers

We now turn to the behavior of the sample mean

$$M_N = \frac{X_1 + X_2 + \cdots + X_N}{N},$$

where again the $X_i$ are iid random variables with $E[X_i] = \mu$ and
$\mathrm{var}(X_i) = \sigma^2$. We know that

$$E[M_N] = \frac{E[X_1] + E[X_2] + \cdots + E[X_N]}{N} = \frac{N\mu}{N} = \mu,$$

and since the $X_i$ are independent,

$$\mathrm{var}(M_N) = \frac{\mathrm{var}(X_1) + \mathrm{var}(X_2) + \cdots + \mathrm{var}(X_N)}{N^2} = \frac{N\sigma^2}{N^2} = \frac{\sigma^2}{N}.$$

For any $\epsilon > 0$, a direct application of the Chebyshev inequality
tells us that

$$P(|M_N - \mu| > \epsilon) \le \frac{\sigma^2}{N \epsilon^2}.$$

The point is that this gets arbitrarily small as $N \to \infty$ no matter
what $\epsilon$ was chosen to be. We have established, in some sense, that
even though $\{M_N\}$ is a sequence of random numbers, it converges
to something deterministic, namely $\mu$.

WLLN: Let $X_1, X_2, \ldots$ be iid random variables as above. For
every $\epsilon > 0$, we have

$$P(|M_N - \mu| > \epsilon) = P\left( \left| \frac{X_1 + \cdots + X_N}{N} - \mu \right| > \epsilon \right) \longrightarrow 0,$$

as $N \to \infty$.
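The statement is easy to watch happen in simulation. A minimal sketch (uniform samples are an illustrative choice; for them $\mu = 1/2$ and $\sigma^2 = 1/12$):

```python
import numpy as np

rng = np.random.default_rng(0)

mu, sigma2, eps = 0.5, 1 / 12, 0.05  # X ~ Uniform[0, 1]

for N in [100, 1_000, 10_000]:
    samples = rng.uniform(size=(1_000, N))
    M_N = samples.mean(axis=1)  # 1000 independent realizations of M_N
    empirical = (np.abs(M_N - mu) > eps).mean()
    chebyshev = min(sigma2 / (N * eps**2), 1.0)
    print(N, empirical, chebyshev)  # both go to zero; the bound is looser
```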

One of the philosophical consequences of the WLLN is that it tells us
that probabilities can be estimated through empirical frequencies.
Suppose I want to estimate the probability of an event A occurring
in some probabilistic experiment. We run a series of (independent)
experiments, and set $X_i = 1$ if A occurred in experiment i, and
$X_i = 0$ otherwise. Then given $X_1, \ldots, X_N$, we estimate the
probability of A in a completely reasonable way, by computing the
percentage of times it occurred:

$$p_{\mathrm{empirical}} = \frac{X_1 + \cdots + X_N}{N}.$$

The WLLN tells us that

$$p_{\mathrm{empirical}} \to P(A), \quad\text{as } N \to \infty.$$

This lends some mathematical weight to our interpretation of
probabilities as relative frequencies.
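For instance (an illustrative experiment, chosen only for the demonstration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative event A: a fair six-sided die shows 5 or 6, so P(A) = 1/3.
for N in [100, 10_000, 1_000_000]:
    rolls = rng.integers(1, 7, size=N)  # N independent experiments
    p_empirical = (rolls >= 5).mean()   # fraction of times A occurred
    print(N, p_empirical)               # approaches 1/3
```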

All of the above of course applies to functions of random variables.
That is, if X is a random variable, and g(X) is a function of that
random variable with

$$\mathrm{var}(g(X)) = E[(g(X) - E[g(X)])^2] < \infty,$$

then given independent realizations $X_1, \ldots, X_N$, we have

$$\frac{1}{N} \sum_{n=1}^{N} g(X_n) \to E[g(X)]$$

as $N \to \infty$.

Minimum Mean-Square Error Estimation
Now we will take our first look at estimating variables that are
themselves random, subject to a known probability law.
We start our discussion with a very basic problem. Suppose Y is a
scalar random variable with a known pdf $f_Y(y)$. Here is a fun game:
you guess what Y is going to be, then I draw a realization of Y
according to its probability law, then we see how close you were
with your guess.
What is your best guess?
Well, that of course depends on what exactly we mean by "best", i.e.
what price I pay for being a certain amount off. But if we penalize
the mean-squared error, we know exactly how to minimize it.
Let g be your guess. The error in your guess is of course random
(since the realization of Y is random), and so is the squared error
$(Y - g)^2$. We want to choose g so that the mean of the squared
error is as small as possible:

$$\underset{g}{\text{minimize}}\;\; E[(Y - g)^2].$$

Expanding the squared error makes it clear how to do this:

$$E[(Y - g)^2] = E[Y^2] - 2g\,E[Y] + g^2.$$

No matter what the first moment E[Y] and second moment $E[Y^2]$ are
(as long as they are finite), the expression above is a convex quadratic
function in g, and hence is minimized when its first derivative (w.r.t.
g) is zero, i.e. when

$$-2\,E[Y] + 2g = 0 \quad\Rightarrow\quad \hat{g} = E[Y].$$

The mean squared error for this choice $\hat{g}$ is of course exactly
the variance of Y,

$$E[(Y - \hat{g})^2] = E[(Y - E[Y])^2] = \mathrm{var}(Y).$$
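A quick numerical confirmation that the mean is the minimizer (the distribution and the grid of guesses are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative: Y ~ Exponential(2), so E[Y] = 2.
y = rng.exponential(scale=2.0, size=1_000_000)

guesses = np.linspace(0.0, 4.0, 81)
mse = [np.mean((y - g)**2) for g in guesses]  # sample MSE for each guess
print(guesses[np.argmin(mse)], y.mean())      # both near 2
```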

The story gets more interesting (and relevant) when we have multiple
random variables, some of which we observe, some of which we do not.
Suppose that two random variables (Y, Z) have joint pdf $f_{Y,Z}(y, z)$.
Suppose that a realization of (Y, Z) is drawn, and I get to observe
Z. What have I learned about Y?
If Y and Z are independent, then the answer is of course nothing.
But if they are not independent, then the marginal distribution of Y
changes. In particular, before the random variables were drawn, the
(marginal) pdf for Y was

$$f_Y(y) = \int f_{Y,Z}(y, z)\,dz.$$
After we observe Z = z, we have

$$f_Y(y \mid Z = z) = \frac{f_{Y,Z}(y, z)}{f_Z(z)} = \frac{f_{Y,Z}(y, z)}{\int f_{Y,Z}(y, z)\,dy}.$$
Y is still a random variable, but its distribution depends on the value
z that was observed for Z.
Now, given that I have observed Z = z, what is the best guess for
Y? If by "best" we mean that which minimizes the mean squared
error, it is the conditional mean. That is, the minimizer of

$$\underset{g}{\text{minimize}}\;\; E[(Y - g)^2 \mid Z = z]$$

is

$$\hat{g} = E[Y \mid Z = z].$$

Notice that unlike before, $\hat{g}$ is not pre-determined; it depends
on the outcome Z = z. We might denote

$$\hat{g}(z) = E[Y \mid Z = z].$$

For a particular choice of z, the mean-squared error is the
conditional variance

$$E[(Y - \hat{g}(z))^2 \mid Z = z] = E\left[ (Y - E[Y \mid Z = z])^2 \mid Z = z \right]
= E[Y^2 \mid Z = z] - (E[Y \mid Z = z])^2
= \mathrm{var}(Y \mid Z = z).$$

So in general, not only does $\hat{g}$ depend on z, but its performance
(its mean-square error) also depends on z.
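We can estimate a conditional mean empirically by averaging Y over draws where Z landed near z. A minimal sketch (the linear-Gaussian model is an illustrative choice, for which $E[Y \mid Z = z] = 2z$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative joint law: Z ~ N(0, 1) and Y = 2 Z + noise, noise ~ N(0, 1).
N = 2_000_000
z = rng.standard_normal(N)
y = 2 * z + rng.standard_normal(N)

# Estimate E[Y | Z = z0] by averaging y over samples with z near z0.
for z0 in [-1.0, 0.0, 1.5]:
    near = np.abs(z - z0) < 0.01
    print(z0, y[near].mean())  # for this model, E[Y | Z = z0] = 2 * z0
```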

We can also average over the draw of Z. First, note that since Z is
a random variable, $\hat{g}$ is a priori also random; we might say

$$\hat{g}(Z) = E[Y \mid Z].$$

Let me pause here because this is the point where many people start
to get confused. The quantities

$$E[Y] = \int_{-\infty}^{\infty} y f_Y(y)\,dy, \quad\text{and}\quad
E[Y \mid Z = z] = \int_{-\infty}^{\infty} y f_Y(y \mid Z = z)\,dy$$

are deterministic, but we re-emphasize that

$$E[Y \mid Z] = \int_{-\infty}^{\infty} y f_Y(y \mid Z)\,dy$$

is random. The above integrates out the randomness of Y, but not
that of Z. (Note that in $E[Y \mid Z = z]$ the randomness in Z is
removed through direct observation.)

The "average estimate" is now

$$E[\hat{g}(Z)] = E[E[Y \mid Z]] = \int E[Y \mid Z = z]\, f_Z(z)\,dz = E[Y].$$

The identity $E[E[Y \mid Z]] = E[Y]$ is known as the law of iterated
expectation or total expectation. The inside E above integrates out
the randomness in Y while the outside one integrates over Z; the
result is a deterministic quantity.
Since $E[\hat{g}(Z)] = E[Y]$, on average we are doing the same thing
as if we didn't observe Z at all. But since we are adapting g to the
draw of Z, we get better average performance. The mean square error
(which is random through Z) is

$$E[(Y - \hat{g}(Z))^2 \mid Z] = E[Y^2 \mid Z] - (E[Y \mid Z])^2 = \mathrm{var}(Y \mid Z).$$

The average performance is then

$$E\left[ E[(Y - \hat{g}(Z))^2 \mid Z] \right] = E[\mathrm{var}(Y \mid Z)] \le \mathrm{var}(Y).$$

The last inequality, which follows from the law of total variance
discussed below, means that on average, an (optimal) estimator using
knowledge of Z will outperform an (optimal) estimator without
knowledge of Z. Of course, when Y and Z are independent we have
$E[Y \mid Z] = E[Y]$, so $E[\mathrm{var}(Y \mid Z)] = \mathrm{var}(Y)$
and knowing Z makes no difference.

The law of total variance
Recall that for any random variable Y,

$$\mathrm{var}(Y) = E[Y^2] - (E[Y])^2. \qquad (1)$$

As we have seen above, $E[Y \mid Z]$ is a random variable (where now
the randomness is being caused by Z). Hence it also has a mean

$$E[E[Y \mid Z]] = E[Y],$$

and a variance

$$\mathrm{var}(E[Y \mid Z]) = E\left[ (E[Y \mid Z])^2 \right] - (E[E[Y \mid Z]])^2
= E\left[ (E[Y \mid Z])^2 \right] - (E[Y])^2. \qquad (2)$$
The quantity (again as we have seen above) $\mathrm{var}(Y \mid Z)$ is
also a random variable; we can write its mean as

$$E[\mathrm{var}(Y \mid Z)] = E\left[ E[(Y - E[Y \mid Z])^2 \mid Z] \right]
= E\left[ E[Y^2 \mid Z] \right] - E\left[ (E[Y \mid Z])^2 \right]
= E[Y^2] - E\left[ (E[Y \mid Z])^2 \right]. \qquad (3)$$
Adding together (2) and (3) and applying (1) gives us the cute
expression

$$\mathrm{var}(Y) = E[\mathrm{var}(Y \mid Z)] + \mathrm{var}(E[Y \mid Z]).$$

This is known as the law of total variance. It basically says that
you can decompose the variance of a random variable into the expected
variance in "using Z to predict Y" and the variance of the prediction
(conditional expectation) itself. Note that all of the quantities above
are non-negative, so we have the inequalities

$$E[\mathrm{var}(Y \mid Z)] \le \mathrm{var}(Y) \quad\text{and}\quad \mathrm{var}(E[Y \mid Z]) \le \mathrm{var}(Y).$$
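The decomposition is easy to verify by simulation. A sketch using the same kind of linear-Gaussian model as before (illustrative; here $E[Y \mid Z] = 2Z$, $\mathrm{var}(Y \mid Z) = 1$, and $\mathrm{var}(Y) = 5$):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative model: Z ~ N(0, 1), Y = 2 Z + noise with unit variance.
N = 1_000_000
z = rng.standard_normal(N)
y = 2 * z + rng.standard_normal(N)

cond_mean = 2 * z  # E[Y | Z] for this model
cond_var = 1.0     # var(Y | Z) is constant here

print(np.var(y))                     # var(Y), ~5
print(cond_var + np.var(cond_mean))  # E[var(Y|Z)] + var(E[Y|Z]), also ~5
```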
