MA1610 F2014 Class Notes
1. SEPTEMBER 4
The first class of our course is an informal introduction to the concept of random, or
chance experiments, and to the notion of probability. We will also learn how to simulate
random experiments on the computer and look at some of the examples presented in the
book.
1.1. Chance experiments. We call a random, or chance, experiment any experiment that has more than one possible outcome and whose result we cannot predict a priori. The typical example is certainly the following experiment:
Flip a coin and record whether the face landing upwards is heads or tails.
Its two possible outcomes appear equally likely to occur and, in everyday life, we cannot predict which one will happen. Thinking some more, one might object that, knowing the exact initial position, velocity and momentum of the coin, and performing the experiment in a vacuum so that no friction effects have to be considered, the laws of Newtonian mechanics allow us to calculate precisely which face will land upwards. However, such a task is not attainable: firstly because the physical quantities involved are extremely hard to measure precisely, and secondly because the system under consideration is very sensitive to initial conditions: a small (infinitesimal) change in the initial velocity or momentum of the coin can result in a macroscopic effect (heads instead of tails).
More interesting physical phenomena with the same sensitivity include long-term atmospheric dynamics, genetic mutations, asset pricing in organized markets, and many more. Just as we are able to guess that heads will come up about half the times we flip a coin, we can assign a measure of likelihood to each outcome of these (or any) chance experiments. A mathematical theory which allows us to do so is the theory of probability.
1.2. Discrete sample spaces and probability. Today we will mostly describe random ex-
periments with a finite, or countably infinite (as in N) number of outcomes.
Definition 1.1. The set of all possible outcomes of a random experiment is called sample
space and is often denoted by Ω.
Example 1.2. The experiment consisting of rolling a six-sided die and recording the outcome
has sample space Ω = {1, . . . , 6}. The experiment consisting of rolling two six-sided dice and
recording the outcome of each has sample space consisting of all ordered pairs of elements
in {1, . . . , 6}. The experiment consisting of rolling two six-sided dice and recording the sum
of the rolls has sample space {2, . . . , 12}. These experiments all have a finite sample space.
Example 1.3. Suppose we toss a coin N times (N being a positive integer) and record the
outcome of each toss being either heads (H) or tails (T). A possible sample space for this
experiment is the set of ordered N-tuples of elements in {H, T}. An example of such an element, for N = 3, is the triple (H, T, H).
We will come back to formal definitions next class. For today, let’s try to understand how a
probability distribution is assigned.
Frequentist method. To assign P(ω) to each ω, we run the experiment a large number of times, N, and simply decide that the probability of ω is the relative frequency of its occurrences, that is
P(ω) = (number of times the outcome is ω) / N.
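As a tiny illustration of this recipe, here is a minimal MatLab sketch (the variable names are mine) estimating P(H) for a coin flip by its relative frequency:
N = 1000;
flips = (rand(N,1) < 0.5);   % 1 codes heads, 0 codes tails, for a fair coin
mean(flips)                  % relative frequency of heads, close to 0.5 for large N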
Example: at Brown a student can receive the grades of {A, B, C, NC}. This will be our sample
space. If we look at the grade a 1610 student will receive at the end of the class as a chance
experiment, a sensible way to assign a probability to each grade is to look at the relative
frequency of that grade in past classes.
Bayesian, or a priori method. We start from an example. An urn contains 2 red balls, 3
green balls and 1 white ball. We extract one ball from the urn and note its color (what is the
sample space?). It seems reasonable to assume that each ball is equally likely to be picked.
Any assignment of probabilities respecting this assumption must then satisfy
P(Red) = 2P(White), P(Green) = 3P(White)
in addition to P(ω) ≥ 0 for each color ω and P(Red) + P(White) + P(Green) = 1. This
determines the P(ω) uniquely.
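Spelling out the computation (a short check, using only the relations above): the normalization gives
2P(White) + 3P(White) + P(White) = 1,
so P(White) = 1/6, P(Red) = 1/3, P(Green) = 1/2.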
1.3. Random numbers.
Example 1.5. Suppose we have at our disposal an 8-faced die (or an N-faced one, for that matter, where N is any integer greater than or equal to 2). Defining a probability distribution for the random experiment of rolling the die and recording the roll by the Bayesian method clearly leads us to assign equal likelihood to each score from 1 to 8 (or from 1 to N), i.e. P(1) = · · · = P(8) = 1/8. As expected, rolling the die a large number of times yields approximately equal relative frequencies for each score.
The MatLab function randi simulates the experiment of rolling an 8-sided die by producing, each time it is called with the command randi([1,8],1), a (pseudo)random integer in {1, . . . , 8}, each with equal likelihood. To check this, we would ideally repeat the experiment, say, 98 times: we can generate a 98x1 vector of (pseudo)random integers in {1, . . . , 8} with the command randi([1,8],98,1). For better visualization, we can use the countint function (this is not built into MatLab; I wrote it and you will be able to download it for use on your copy):
FREQ=countint(randi([1,8],98,1));
so that the matrix FREQ is a relative frequency matrix associated to the 98 random integers between 1 and 8. We can visualize the relative frequencies with the tool spikegraph by calling spikegraph(FREQ). See how things change when 98 is replaced by bigger numbers, as in Figure 1.5.
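If you do not have the course utilities countint and spikegraph at hand, a minimal sketch of the same check using only built-in MatLab functions (the variable names are mine) is:
rolls = randi([1,8],98,1);        % 98 simulated rolls of a fair 8-sided die
freq = zeros(8,1);
for k = 1:8
    freq(k) = mean(rolls == k);   % relative frequency of face k
end
bar(1:8,freq)                     % bar plot of the relative frequencies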
Example 1.6. If, instead of integers, we want to simulate uniformly distributed random REAL numbers between 0 and 1 (see the Spinner example later on in the course), one should use rand. Calling rand(m,n), where m,n are positive integers, returns an m by n matrix of random numbers between 0 and 1. We can use the utility distvisual to construct a relative frequency histogram, with 14 cells say, with the call
distvisual(rand(98,14),0,1,14);
the 0,1 after the rand argument specify the range of the histogram.
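Without the course utility distvisual, a rough equivalent using only the built-in hist function (again, the variable names are mine and this is only a sketch) is:
x = rand(98*14,1);               % uniform random reals in [0,1)
[counts,centers] = hist(x,14);   % counts in 14 equally spaced cells
bar(centers,counts/numel(x))     % relative frequency histogram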
1.4. Simulations of chance experiments, examples.
Example 1.7 (Example 1.4, Section 1.1). A coin is termed (or assumed) to be fair if, when tossed, the outcomes H = landing heads upwards and T = landing tails upwards have equal probability, so that P(H) = P(T) = 1/2.
In this example, Peter and Quentin agree to play the following simple game. A fair coin
is tossed, in sequence, N times. (Notice that we know what the sample space Ω for this
experiment is, see Example 1.3). Each time H occurs, Peter wins a penny from Quentin.
Each time T occurs Peter loses a penny to Quentin. At the end of the game, we record two
quantities:
W = Peter's winnings at the end of the game,     L = # of times Peter is leading.
To be more precise, we say that Peter is leading after the j-th toss if either his winnings are positive after the j-th toss, or his winnings are zero and he lost the j-th toss.
Notice that W will take integer values between −N and N; more precisely, W will take only those values in {−N, . . . , N} with the same parity as N. Notice that L will take values in {0, . . . , N}. Since the values of W, L depend on the random sequences in Ω = {H, T}^N, which is our sample space, W, L are random quantities as well. A function X : Ω → D (where D can be R, R^d, Z, . . .) is termed a random variable. We will talk more about discrete and
continuous random variables in the upcoming classes. We write P(W = k) in short to denote
the probability that W is equal to k. If k is not an admissible value, this will automatically
be zero.
Question: what are the most likely values for W, and for L? How do they depend on the number of turns N? Try giving yourself an answer before the upcoming simulation.
Footnote: I will post a version of the notes after class containing the simulation plots and the correct answers.
We give a heuristic answer to this question by simulating 10000 instances of the Peter-Quentin game, first with N = 5, then with N = 20, and finally with N = 40. We will use the MatLab function headtail. Here is a summary of what it does and how to call it from the MatLab command line; the same text will appear if you type the command help headtail. The same can be done with any other function I will provide you with.
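The course function headtail is not reproduced here; a minimal sketch of the same simulation using only built-in MatLab functions (variable names are mine, and the counting convention for L follows the definition given above) could be:
Ngames = 10000; N = 20;                        % number of simulated games and of tosses per game
W = zeros(Ngames,1); L = zeros(Ngames,1);
for g = 1:Ngames
    steps = 2*(rand(N,1) < 0.5) - 1;           % +1 each time heads occurs, -1 each time tails occurs
    S = cumsum(steps);                         % Peter's running winnings after each toss
    W(g) = S(end);                             % Peter's winnings at the end of the game
    L(g) = sum(S > 0 | (S == 0 & steps < 0));  % tosses after which Peter is leading
end
hist(W), figure, hist(L)                       % empirical distributions of W and L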
Example 1.8 (Exercise 16, Section 1.1). Suppose that the probabilities of giving birth to a
boy (B) or to a girl (G) are equal (both are 0.5).
A) Suppose that each family in our study gives birth to children till at least one boy is born
or N children have been born. Consider the random variable C =number of children born to
each family. How do you expect C to be distributed, in other words what are the probabilities
that each family has respectively 1, . . . , N children? Also, what is the probability that at
least one boy is born? This of course depends on N. We use the MatLab routine tillboy to simulate the experiment with a large number of families. This is what the help file says; in the sample call N = 4, but you can play around and insert different values.
This routine solves problem 16 from Section 1.1 of the book.
That is, we simulate the following situation: each family gives
birth to children till at least one boy is born. They are allowed to
have a certain number of maximum attempts
Inputs: - p: probability to give birth to a boy
- maxchildren: maximum number of children
a family is allowed to give birth to
- families: how many families are simulated
outputs: - children: families x 1 vector reporting
how many children each family gives birth to
- boysYN: families x 1 vector reporting whether
each family managed to give birth to at least one boy
Footnote: A Bayesian approach based on elementary biology notions would lead to this assumption. A frequentist approach would not: in fact, the worldwide relative frequency of live male births is actually 0.513. Try running the simulation with different values of p and see what effect this has.
Example call:
[children,boysYN,Children,BoysYN]=tillboy(0.5,4,100)
After we have stored the simulation output, we can compute the average value of our frequency distribution by, for instance, invoking the MatLab function mean with argument the data vector children: mean(children). Try increasing the number of families and repeating the average calculation to see what “real mean value” we are approaching.
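If you prefer not to rely on the course routine tillboy, a minimal sketch of the same simulation with built-in functions only (variable names are mine, mirroring the inputs and outputs described in the help text) is:
p = 0.5; maxchildren = 4; families = 10000;   % same roles as the inputs of tillboy
children = zeros(families,1);
boysYN   = zeros(families,1);                 % 1 if the family had at least one boy
for f = 1:families
    nkids = 0; gotboy = 0;
    while nkids < maxchildren && ~gotboy
        nkids = nkids + 1;
        gotboy = (rand < p);                  % this birth is a boy with probability p
    end
    children(f) = nkids;
    boysYN(f)   = gotboy;
end
mean(children)                                % empirical average number of children
mean(boysYN)                                  % empirical probability of at least one boy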
Assume now that each family will give birth to children till at least one boy AND one girl are
born. What do you expect the “real mean value” to be this time? A very similar function,
the matlab function boygirl, can be used for simulation. Try finding out the details and
running the simulation yourself!
2. SEPTEMBER 9
2.1. Discrete probability distribution functions. Let us consider a random experiment X
with discrete (finite or countably infinite) sample space Ω. Recall that the elements ω of Ω
are the possible outcomes, or values, for X.
Definition 2.1. A probability distribution function for X is a function m : Ω → R satisfying
(2.1) m(ω) ≥ 0 for all ω ∈ Ω,
(2.2) Σ_{ω∈Ω} m(ω) = 1.
In the case that Ω is countably infinite, (2.2) means that the series converges (absolutely
and unconditionally, being of positive terms) to 1.
Example 2.2. If X is the roll of a fair (all outcomes equally likely) N-sided die, and Ω = {1, . . . , N}, a probability distribution function for X is given by m(ω) = 1/N for all ω ∈ Ω. Such a distribution is called the uniform distribution on {1, . . . , N}.
Example 2.3. Let us consider the random experiment(s) of Example 1.4, i.e. tossing a fair coin till one head is obtained, with X being the number of tosses. Let us first assume that we will stop after N tosses. Then Ω = {1, . . . , N} are the possible values for X. One discrete probability distribution function for X is given by
m(1) = 1/2, m(2) = 1/4, . . . , m(n) = 2^{−n}, . . . , m(N − 1) = 2^{−(N−1)}, m(N) = 2^{−(N−1)}.
We can verify that
Σ_{n=1}^{N} m(n) = Σ_{n=1}^{N−1} 2^{−n} + 2^{−(N−1)} = (1 − 2^{−(N−1)}) + 2^{−(N−1)} = 1.
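As a side remark (a natural check, not needed for the finite-N computation above): if we never stop, i.e. we let N → ∞, the same reasoning suggests the assignment m(n) = 2^{−n} on Ω = {1, 2, 3, . . .}, and
Σ_{n=1}^{∞} 2^{−n} = 1
by the geometric series, so (2.2) still holds on this countably infinite sample space.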
Example 2.5. Throughout, let Ω be a discrete sample space. Recall that E = ∅ (the empty set) and E = Ω (the full set) are subsets of Ω, belong to P(Ω), and as such they are events.
Example 2.6. Referring to the situation with finitely many tosses N of Example 1.4, A =
{ω odd}, B = {ω is a power of 2}, C = {ω ≤ N − 1} are examples of events.
Remark 2.7. We can look at the probability P as a real valued function defined on subsets
of Ω. To be more precise P has domain P (Ω) and range [0, 1]. We will see how this point
of view can be generalized in the complements.
Footnote: The book calls X a random variable, since we can think of X as a function with domain Ω and range Ω, defined by X(ω) = ω. As such, this is a particular case of the more general definition of random variable that we will see later on.
and the second equality is true because each ω ∈ A1 ∪ A2 belongs to exactly one of A1 and
A2 . The general case n ≥ 2 follows by induction.
We have two corollaries of Theorem 2.1.
Corollary 2.1.1. Let A, B ⊂ Ω. Then P(A) = P(A ∩ B) + P(A ∩ B̃).
Proof. This is an application of (4) of the Theorem to the pairwise disjoint events (A ∩ B)
and (A ∩ B̃) whose union is A.
Corollary 2.1.2. Let A, B ⊂ Ω. Then
P(A ∪ B) = P(A) + P(B) − P(A ∩ B)
Proof. Observe that A ∩ B̃, A ∩ B, Ã ∩ B are pairwise disjoint events. Thus
A = (A ∩ B̃) ∪ (A ∩ B) =⇒ P(A) = P(A ∩ B̃) + P(A ∩ B)
B = (Ã ∩ B) ∪ (A ∩ B) =⇒ P(B) = P(Ã ∩ B) + P(A ∩ B)
A ∪ B = (A ∩ B̃) ∪ (Ã ∩ B) ∪ (A ∩ B) =⇒ P(A ∪ B) = P(A ∩ B̃) + P(Ã ∩ B) + P(A ∩ B)
by using (4) of Theorem 2.1 for each line. The corollary then follows by summing up the
first two lines and then comparing with the third.
Example 2.8. This random experiment concerns the eye color phenotypes of the offspring of a father coming from the sample population F and a mother coming from the sample population M. There are three possible colors: dark (d), green (g) and blue (b). As in real life, if at least one of the parents has dark eyes, the offspring will have dark eyes.
The populations F and M have different phenotypical distributions mF and mM. We know that mF(d) = 0.5 (that is, 50% of the fathers' population has dark eyes). We also know that the probability of an offspring having dark eyes when neither parent has blue eyes is 0.4, and the probability of having dark eyes when neither parent has green eyes is 0.7. Can we calculate mM(d)?
Footnote: The proof of this corollary given in the book, while formally correct, is not the simplest or most general one.
Solution. The sample space for the experiment of selecting an offspring at random is Ω = {d, g, b}^2. Let us consider the following events:
· A = {offspring has dark eyes and neither parent has blue eyes};
· B = {offspring has dark eyes and neither parent has green eyes};
· C = {offspring’s father has dark eyes};
· D = {offspring’s mother has dark eyes}.
We are given that
P(A) = 0.4, P(B) = 0.7, P(C) = mF (d) = 0.5
and we have to find P(D). It is easy to see that A ∪ B = C ∪ D = {at least one parent has
dark eyes} and that A ∩ B = C ∩ D = {both parents have dark eyes}. Using Corollary 2.1.2,
these equalities and Corollary 2.1.2 again
P(D) = P(C ∪ D) + P(C ∩ D) − P(C) = P(A ∪ B) + P(A ∩ B) − P(C)
= P(A) + P(B) − P(C) = 0.4 + 0.7 − 0.5 = 0.6.
Example 2.9 (Tree diagrams, 1). An urn contains 5 green balls and 7 red balls. Two of the
balls are removed without looking at their color. A third ball is then removed. What is the
probability that the third ball is red?
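One way to organize the tree diagram (a sketch of the computation; the shortcut at the end is a remark of mine, not the book's): condition on the colors of the two removed balls. They are two greens with probability 10/66, one of each color with probability 35/66, and two reds with probability 21/66, so
P(third ball red) = (10/66)(7/10) + (35/66)(6/10) + (21/66)(5/10) = 385/660 = 7/12.
More quickly: since the first two balls are removed without looking at them, the third ball drawn is equally likely to be any of the 12 balls, so the probability is 7/12.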
Example 2.10 (Tree diagrams, 2). A basketball magazine performed a survey among its read-
ers asking who is the best basketball center of all time. Options given were Wilt Chamberlain
(W), Bill Russell (B), Kareem Abdul-Jabbar (K) and Shaq (S). Suppose that the distribution
of votes was the following: (W) 27%, (B) 30%, (K) 22%, (S) 21%. What is the probability
that two of the survey participants chosen at random agree on who is the best center?
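A sketch of the computation (assuming, as seems intended, that the two participants are chosen independently and vote according to the distribution above): they agree exactly when both pick the same center, so
P(agree) = 0.27^2 + 0.30^2 + 0.22^2 + 0.21^2 = 0.0729 + 0.09 + 0.0484 + 0.0441 = 0.2554,
roughly a 26% chance.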
Example 2.11 (Infinite sample spaces). The following game is played. A fair coin is tossed,
and if heads comes up, the game is over. If tails comes up, a six-sided die is rolled. The game
is over if a six comes up, otherwise a coin is tossed in the next turn. We keep alternating
between coin on odd tosses and die on even tosses till a success comes up. If the game ends at the j-th toss and j is odd, the player wins 2^j dollars, while if j is even, the player wins nothing.
· calculate the probability that the game ends at the j-th toss;
· calculate the probability that the game ends at an even toss;
· what is a fair entering price for this game? (A simulation sketch for the first two items follows below.)
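Before solving the exercise exactly, one can estimate the answers by simulation; here is a minimal sketch using only built-in MatLab functions (the variable names and the number of simulated games are mine):
Ngames = 100000;
endtoss = zeros(Ngames,1);          % toss at which each game ends
for g = 1:Ngames
    j = 0; over = false;
    while ~over
        j = j + 1;
        if mod(j,2) == 1            % odd toss: flip a fair coin
            over = (rand < 1/2);    % heads ends the game
        else                        % even toss: roll a fair die
            over = (randi(6) == 6); % a six ends the game
        end
    end
    endtoss(g) = j;
end
mean(endtoss == 1)                  % estimate of P(game ends at toss 1), exact value 1/2
mean(mod(endtoss,2) == 0)           % estimate of P(game ends at an even toss)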
2.2. Complements. A more general definition of probability as a function on subsets of a
sample space Ω can be given without a priori specifying a probability distribution function.
For this, we need two additional concepts.
Definition 2.12. Let Ω be a set and denote by P(Ω) its power set. An algebra F is a subcollection of subsets of Ω satisfying
(1) ∅ ∈ F;
(2) if A ∈ F then Ã = Ω\A ∈ F (closure under complement);
(3) for any finite collection of sets A_1, . . . , A_N ∈ F, their union A = ⋃_{n=1}^{N} A_n ∈ F (closure under finite unions).
If the stronger form of (3) holds, namely closure under countable unions, F is called a σ-algebra.
The following theorem establishes the equivalence of the definition above with Definition
2.4. A proof of the theorem can be given by using the homework Problem 1.10.
Theorem 2.2. Let Ω be a discrete sample space. If m is a probability distribution function on Ω, then the function
P(Ω) ∋ A ↦ P(A) = Σ_{ω∈A} m(ω)
is a probability measure on (Ω, P(Ω)). Conversely, given a probability measure P on (Ω, P(Ω)), the function
m : Ω → [0, 1], m(ω) := P({ω})
is a discrete probability distribution function in the sense of Definition 2.4.
3. SEPTEMBER 16
We begin with an example.
Example 3.1. As an example of random experiment with sample space the interval [0, 2π),
we have described a spinner on September 11. It is reasonable to assume that the probability
of the needle ending up between the angles a and b is proportional to the normalized (by
2π) length of the interval, i.e. (b − a)/2π. We have verified this experimentally using the
random generator in MatLab.
We now want to look at the distribution of the sum of two such uniformly distributed
random numbers in [0, 1) (we renormalize for convenience), call it X .
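Before developing the theory, one can peek at this distribution by simulation; a minimal MatLab sketch (built-in functions only, variable names mine):
S = rand(10000,1) + rand(10000,1);   % 10000 samples of the sum of two uniform [0,1) numbers
[counts,centers] = hist(S,20);       % counts in 20 equally spaced cells
bar(centers,counts/numel(S))         % the relative frequencies display the triangular shape of the density of X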
3.1. Probability distribution functions on R, R^n. We want to be able to describe random experiments whose natural sample space Ω is a subset of the real line R or of the Euclidean space R^n. We get a simplified theory if we restrict Ω to be of the following type:
· Ω ⊂ R is a finite or countable union of (closed, half-open, open) possibly unbounded intervals {I_n};
· Ω ⊂ R^n is a countable union of products of intervals as above;
· Ω ⊂ R^n is a domain for which the Riemann integral ∫_Ω · dx_1 . . . dx_n makes sense;
we will refer to such an Ω by the phrase sample space (where we mean in fact admissible sample space).
Definition 3.2. Let Ω ⊂ R^n be a sample space. A function f : Ω → R is a probability distribution function on Ω if
· f(x) ≥ 0 for all x ∈ Ω;
· f is Riemann integrable on Ω and ∫_Ω f(x) dx = 1.
Note that we can always think of f as being defined on R^n by setting f = 0 on R^n\Ω. If we do so, then ∫_{R^n} f = 1.
The function
F_X(x_1, . . . , x_n) = ∫_{−∞}^{x_1} · · · ∫_{−∞}^{x_n} f(t_1, . . . , t_n) dt_1 · · · dt_n
is called the cumulative distribution function of X.
Example 3.7. Let X = (X_1, X_2) be a random variable describing the landing position of a dart thrown at a target Ω which is a disc of radius R > 0, in cartesian coordinates centered at the center of the target. Assuming that the landing position is uniformly distributed on the disc, a probability distribution function is given by
f(x_1, x_2) = 1/(πR^2) if x_1^2 + x_2^2 < R^2, and 0 otherwise.
So in particular, for instance,
P(X_1 ≤ 0, X_2 ≤ 0) = ∫_{−∞}^{0} ∫_{−∞}^{0} f(x_1, x_2) dx_1 dx_2 = ∫_{x_1^2+x_2^2<R^2, x_1≤0, x_2≤0} 1/(πR^2) dx_1 dx_2 = 1/4.
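A quick numerical sanity check of this 1/4 (a sketch with built-in functions only; the rejection-sampling approach and variable names are mine):
R = 1; Npts = 100000;
X = 2*R*rand(Npts,2) - R;         % uniform points in the square [-R,R]^2
X = X(sum(X.^2,2) < R^2, :);      % keep only the points inside the disc (rejection sampling)
mean(X(:,1) <= 0 & X(:,2) <= 0)   % should be close to 0.25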
1, r > R.
and then extend this to any B ∈ B(R) by postulating that the countable additivity (4) in
the Theorem holds.
3.3. Further examples.
Example 3.13 (exponential distribution). Let T be the random variable describing the (random) time between two consecutive breakdowns of a certain machine which is assumed to be wear-free, in the following sense: if we set the origin of time at the last breakdown, the probability of having a breakdown in the time interval [t, t + s), given that the machine is still working at time t > 0, is the same as the probability of having one in [0, s). By this assumption, we can determine the cumulative distribution function of T up to some parameter λ > 0. Indeed, we only consider positive times, so P(T ≤ 0) = 0. If t > 0 and s > 0, a moment's thought leads to P(T > t + s) = P(T > t)P(T > s), and thus, setting G(t) = P(T > t) = 1 − P(T ≤ t) = 1 − F_T(t), we have that
G(t + s) = G(t)G(s), for all t, s > 0.
It is clear that G(t) = e^{−λt} satisfies the above equation for all s, t. Some work shows that these are the only continuous solutions to the above equation (the proof is that H = ln G satisfies H(t + s) = H(t) + H(s), and such a function, if continuous, must be linear). Moreover λ > 0 for G to go to zero at infinity. Therefore, we have found
F_T(t) = 0 for t ≤ 0, and F_T(t) = 1 − e^{−λt} for t > 0.
Note that λ > 0 is the reciprocal of the expected time between occurrences.
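A small simulation sketch of this distribution, using inverse-CDF sampling with built-in functions only (parameter values and variable names are mine):
lambda = 2; Nsamples = 100000;
U = rand(Nsamples,1);
T = -log(U)/lambda;         % inverse-CDF sampling: T has cdf F_T(t) = 1 - exp(-lambda*t)
mean(T)                     % close to the expected time 1/lambda = 0.5
t = 0.3; s = 0.5;
sum(T > t+s)/sum(T > t)     % estimate of P(T > t+s | T > t) ...
mean(T > s)                 % ... matches P(T > s): the wear-free (memoryless) property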
Note that
E(X) = b, Cov(X) = C := AA^T;
this follows from part 1 of the previous lemma. Notice that C is symmetric and positive definite. We thus say that X ∼ N(b, C), that is, X is a Gaussian vector in R^n with mean b and covariance matrix C.
Lemma 7.5. Let X ∼ N(b, C) be a Gaussian vector on R^n. Then its (joint) probability distribution function is given by
f_X(x) = (1/√((2π)^n det C)) exp( −(1/2)(x − b)^T C^{−1}(x − b) ),
where C^{−1} denotes the inverse matrix of C.
Proof. We have that X = AZ + b, where A is a nonsingular matrix such that C = AA^T, and Z is a standard Gaussian vector in R^n. In particular |det A| = √(det C). Furthermore, Z = A^{−1}(X − b) since A is nonsingular. In our usual notation, Z = g^{−1}(X). The Jacobian matrix of g^{−1} is of course A^{−1} and its determinant is (det A)^{−1}. By the change of variable theorem, we have
f_X(x) = |det J_{g^{−1}}(x)| f_Z(g^{−1}(x)) = (1/(√((2π)^n) |det A|)) exp( −(1/2)(A^{−1}(x − b))^T A^{−1}(x − b) )
= (1/√((2π)^n det C)) exp( −(1/2)(x − b)^T (A^{−1})^T A^{−1}(x − b) )
= (1/√((2π)^n det C)) exp( −(1/2)(x − b)^T C^{−1}(x − b) ),
since (A^{−1})^T A^{−1} = (AA^T)^{−1} = C^{−1}.
The components of a Gaussian vector are Gaussian random variables: the variance of the j-th component is the j-th diagonal element of the covariance matrix, and the covariance structure completely determines their independence. This is formalized in the following lemma.
Lemma 7.6. Let X = (X_1, . . . , X_n)^T ∼ N(b, C) be a Gaussian vector on R^n. Then
(1) X_j ∼ N(b_j, C_jj);
(2) X_j and X_k are independent if and only if C_jk = 0.
The lemma has been proved in class. For proof details and more examples, please see the
class notes.
Example 7.7. Let Z = (Z_1, Z_2, Z_3)^T be a standard Gaussian vector. Define
X_1 = (1/√3)(Z_1 + Z_2 + Z_3),
X_2 = (1/√6)(Z_1 + Z_2 − 2Z_3),
X_3 = (1/√2)(Z_1 − Z_2).
Find the joint distribution of X = (X_1, X_2, X_3)^T. Are X_1, X_2, X_3 independent?
Solution. We have X = AZ with
A = [ 1/√3    1/√3    1/√3
      1/√6    1/√6   −2/√6
      1/√2   −1/√2     0   ].
One has
C = AA^T = I_3 (the 3 × 3 identity matrix),
so X is a standard Gaussian vector and its components are independent.
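One can double-check this numerically; a minimal sketch (generating many samples of X = AZ and comparing the sample covariance with AA^T; variable names are mine):
A = [1/sqrt(3)  1/sqrt(3)  1/sqrt(3);
     1/sqrt(6)  1/sqrt(6) -2/sqrt(6);
     1/sqrt(2) -1/sqrt(2)  0];
Z = randn(3,100000);   % columns are independent standard Gaussian vectors in R^3
X = A*Z;               % columns are samples of X = AZ
cov(X')                % sample covariance matrix, close to A*A' = eye(3)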
The upcoming section summarizes the derivation of the first batch of our convergence
results, that is, convergence in probability (weak law) or almost surely (strong law) of the
sample mean of a sequence of independent copies of X to EX whenever X has finite first
moment.
which, rearranging, gives the conclusion. We have used that |x| > ε to get the last inequality.
The Chebychev inequality is proved in the exact same way: writing µ = EX,
Var(X) = E(|X − µ|^2) = ∫_R |x − µ|^2 f_X(x) dx ≥ ∫_{|x−µ|>ε} |x − µ|^2 f_X(x) dx
≥ ε^2 ∫_{|x−µ|>ε} f_X(x) dx = ε^2 P(|X − µ| > ε).
8.2. Weak laws. We first state and prove the weak law of large numbers in its standard
form.
Theorem 8.1. Let X be a random variable admitting a mean, that is E|X| < ∞. Let µ = EX. Let X_n, n = 1, 2, . . ., be a sequence of independent copies of X. Define the sample mean
Y_n = (1/n) Σ_{j=1}^{n} X_j.
Then, for every ε > 0, P(|Y_n − µ| > ε) → 0 as n → ∞.
The meaning of the conclusion of the weak law of large numbers is that the sample mean can approximate (or estimate) the theoretical mean µ with arbitrary precision ε and with arbitrarily large probability 1 − δ. Indeed, the statement can be rewritten as follows: for all ε, δ > 0 there exists N = N(ε, δ) large enough such that
n ≥ N =⇒ P(|Y_n − µ| ≤ ε) > 1 − δ.
The proof is easier if we assume that E(X 2 ) < ∞, that is X admits variance. Try writing
this case out by yourself: it is a subcase of what we prove in the general case.
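As an illustration (not part of the proof), here is a minimal simulation sketch of the weak law for, say, the roll of a fair six-sided die, whose mean is µ = 3.5 (variable names are mine):
n = 10000;
X = randi(6,n,1);                  % independent copies of a fair die roll, mu = 3.5
Y = cumsum(X)./(1:n)';             % Y(k) is the sample mean of the first k rolls
[Y(10) Y(100) Y(1000) Y(10000)]    % the sample means approach mu = 3.5
plot(1:n,Y)                        % visually, the running sample mean settles near 3.5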
Proof of the weak law. First of all, by replacing X with X − µ it suffices to assume µ = 0. This is only used to simplify notation. We thus have to prove that for any given ε, δ > 0 we can find N large enough so that
(8.1) n ≥ N =⇒ P(|Y_n| > ε) < δ.
From the fact that A = E|X| < ∞, the tails of the corresponding integral
A = E|X| = ∫_R |x| f_X(x) dx
can be made small: given ε, δ > 0, fix η > 0 with
(8.2) η ≤ ε and η ≤ εδ/2,
and then choose K > 0 so large that
(8.3) ∫_{|x|>K} |x| f_X(x) dx < η/3.
We split X = X̄ + X̃, where X̄ equals X when |X| ≤ K and 0 otherwise, while X̃ equals X when |X| > K and 0 otherwise. If we write X̄_n, X̃_n for the corresponding truncations of the X_n
and denote by Ȳ_n, Ỹ_n respectively the sample means of the X̄_n, X̃_n, we learn the following properties:
(a) Var X̄ ≤ KA, thus Var Ȳ_n ≤ KA/n;
(b) E|Ỹ_n| ≤ E|X̃| < η/3;
(c) |EỸ_n| = |EX̃| < η/3;
(d) EỸ_n = −EȲ_n.
To prove (a), notice that
Var X̄ ≤ E|X̄|^2 = ∫_{|x|≤K} x^2 f_X(x) dx ≤ K ∫_{|x|≤K} |x| f_X(x) dx ≤ KA,
and Var Ȳ_n = Var X̄ / n by independence. To prove (b), notice that the integral in (8.3) is exactly E|X̃|. Then
E|Ỹ_n| ≤ (1/n) Σ_{j=1}^{n} E|X̃_j| ≤ E|X̃|.
Then (c) follows by Jensen's inequality, or simply from the fact that |EZ| ≤ E|Z|. Convince yourself of this fact. Finally, (d) follows because Y_n = Ỹ_n + Ȳ_n and EY_n = 0.
We now conclude the proof. We have
|Y_n| = |Ȳ_n − EȲ_n + Ỹ_n − EỸ_n| ≤ |Ȳ_n − EȲ_n| + |Ỹ_n| + |EỸ_n| ≤ |Ȳ_n − EȲ_n| + |Ỹ_n| + ε/3,
where we have used (d) in the first step and then (c) and η ≤ ε, see (8.2). So the set {|Y_n| > ε} is contained in the union of {|Ȳ_n − EȲ_n| > ε/3} and {|Ỹ_n| > ε/3} (convince yourself of this fact). Hence
(8.4) P(|Y_n| > ε) ≤ P(|Ȳ_n − EȲ_n| > ε/3) + P(|Ỹ_n| > ε/3).
By Markov's inequality and (b),
(8.5) P(|Ỹ_n| > ε/3) ≤ (3/ε) E|Ỹ_n| ≤ η/ε ≤ δ/2,
in view of how we chose η in (8.2). Furthermore, using Chebychev's inequality and (a),
(8.6) P(|Ȳ_n − EȲ_n| > ε/3) ≤ (9/ε^2) Var(Ȳ_n) ≤ 9KA/(ε^2 n) < δ/2,
if we choose n > N = ⌈18KA/(δε^2)⌉. Putting (8.5) and (8.6) inside (8.4) completes the proof.
Limit laws similar to the weak law of large numbers can be proved under different as-
sumptions. Here is an example.
Example 8.3. Let X_n be identical copies of a random variable X having EX = µ and finite variance Var X = σ^2. We do not assume the X_n to be independent, but instead that
|Cov(X_j, X_k)| ≤ σ^2/√|j − k|, for j ≠ k.
changing variable to get to the second line and using the estimate
Σ_{m=1}^{A} 1/(2√m) ≤ 1 + ∫_{1}^{A} 1/(2√x) dx ≤ √A.
We conclude that
Var Y_n ≤ σ^2/n + 4σ^2/√n ≤ 5σ^2/√n,
and from Chebychev's inequality
P(|Y_n − µ| > ε) ≤ Var Y_n / ε^2 ≤ 5σ^2/(ε^2 √n),
which goes to zero as n → ∞.
8.3. Strong law. We have seen in class that a stronger result can be proved under morally the same assumptions as the weak law.
Theorem 8.2. Let X be a random variable admitting a mean, that is E|X| < ∞. Let µ = EX. Let X_n, n = 1, 2, . . ., be a sequence of independent copies of X defined on the same probability space. Define the sample mean
Y_n = (1/n) Σ_{j=1}^{n} X_j.
Then
P( lim_{n→∞} Y_n = µ ) = 1.
A convergence of the above type is termed almost sure convergence and is stronger than the convergence in probability that we have proved in the weak law. This has been proved in class, but the proof can be skipped, at least for now.
Footnote: Indeed, the weak law of large numbers holds in slightly greater mathematical generality. The reason is that we need the event
{ lim_{n→∞} Y_n = µ }
to be a measurable event, and for this the variables X_n have to be defined on the same probability space. This is usually not a problem, but for instance it tells us we can't work on a discrete probability space (this was one of the problems earlier in the class).