STAT2011 Week3 2024

(Begin of Lecture 3.1)

2.3 Combinatorial probability


We now couple the enumeration results with the notion of probability.

Recall the classical definition of probability. If there are n equally likely ways to perform
a certain operation and a total of m of those satisfy some stated condition A, then

P(A) = m/n   (by definition).
Example 2.3.1 (Urn model). N = 8 chips, numbered 1, . . . , 8, sample k = 3. Let

A = largest chip in sample is 5.

Determine P (A).
We begin visualising the situation with Figure 2.7.1 in Larsen and Marx (2012):

Then,

• n = C(N, k) = C(8, 3) = 56, where C(N, k) denotes the binomial coefficient "N choose k"
  (in R: choose(8,3), or use factorial() repeatedly)

• m = C(1, 1) C(4, 2) = 6 (chip 5 must be selected, and the other two chips must come from
  the subpopulation 1 – 4)

Thus,

P(A) = C(1, 1) C(4, 2) / C(8, 3) = 6/56 ≈ 0.11   □
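This count can be verified directly in R; the object names m and n below simply mirror the notation above:

  m <- choose(1, 1) * choose(4, 2)   # chip 5, plus two chips from {1, ..., 4}
  n <- choose(8, 3)                  # all samples of size k = 3 from N = 8 chips
  m / n                              # 6/56, approximately 0.107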

Example 2.3.2 (Urn model, continued). N = 3n chips

• n red numbered r1 , . . . , rn

• n white numbered w1 , . . . , wn

• n blue numbered b1 , . . . , bn

Draw k = 2 chips (without replacement). Consider A to be the union of two events A1 and A2:

A = A1 ∪ A2, where A1 = {the two chips have the same colour} and A2 = {the two chips have the same number}.

Note, A1 and A2 are mutually exclusive (recall, mutually exclusive events are not inde-
pendent). Therefore from Axiom 3,

P (A) = P (A1 ∪ A2 ) = P (A1 ) + P (A2 )


There are a total of C(N, k) = C(3n, 2) equally likely possibilities.

P(A1) = P(2 reds ∪ 2 whites ∪ 2 blues); these are mutually exclusive (why?)
      = P(2 reds) + P(2 whites) + P(2 blues); these are equally likely events (why?)
      = 3 P(2 reds) = 3 C(n, 2) / C(3n, 2)

Similarly,

P(A2) = P(two 1's ∪ . . . ∪ two n's); these are mutually exclusive (why?)
      = n C(3, 2) / C(3n, 2)

Thus,
P(A1 ∪ A2) = [3 C(n, 2) + n C(3, 2)] / C(3n, 2) = (n + 1)/(3n − 1)
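As a sanity check, P(A) can also be estimated by simulation in R. The sketch below is ours, not part of the notes; it assumes the illustrative value n = 4 and 100,000 replications:

  n <- 4
  chips <- data.frame(colour = rep(c("r", "w", "b"), each = n),
                      number = rep(1:n, times = 3))
  hits <- replicate(1e5, {
    draw <- chips[sample(nrow(chips), 2), ]     # two chips drawn without replacement
    draw$colour[1] == draw$colour[2] || draw$number[1] == draw$number[2]
  })
  mean(hits)             # relative frequency of A
  (n + 1)/(3*n - 1)      # exact value, 5/11 for n = 4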

Example 2.3.3 (Birthday problem). Select k people (at random) from the general pop-
ulation.
What are the chances that at least two of those k were born on the same day?
Assumptions: no birthday on 29 February, all birthdays are equally likely (plausible?)
Picture k individuals as an ordered sequence

Person          1    2    3   ...   k
Possible bdays  365  365  365  ...  365

Using the multiplication rule, there are 365^k equally likely sequences. Then consider
A = 'at least two people have the same bday'
A^C = 'no two people have the same bday'
A is difficult to enumerate but A^C is much easier:

P(A) = 1 − P(A^C) = (365^k − 365Pk) / 365^k,   where 365Pk = 365(364) · · · (365 − k + 1)
Thus,
k P (A)
15 0.253
22 0.476
40 0.891
50 0.970
70 0.999
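The exact probabilities in the table can be reproduced in R; the function name birthday_exact is ours:

  birthday_exact <- function(k) 1 - prod((365 - 0:(k - 1)) / 365)   # 1 - 365Pk / 365^k
  sapply(c(15, 22, 40, 50, 70), birthday_exact)
  # approximately 0.253, 0.476, 0.891, 0.970, 0.999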
Relaxed assumptions: all birthdays are equally likely, except 29 February, which is only a
quarter as likely
We could proceed thinking about
L = # of people in sample of size k with bday on 29/2
Then,
P(A) = Σ_{l=0}^{k} P(A | L = l) P(L = l)

Of course if L ≥ 2 then at least two people have their birthday on the same day, so the
calculation of P(A^C) will simplify considerably and is left as an exercise. □

2.4 Monte Carlo Simulation


Recall the von Mises definition of probability. If an experiment is repeated n times
under identical conditions, and if the event E occurs on m of those repetitions, then

P(E) = lim_{n→∞} m/n.

If n is finite then P(E) ≈ m/n.
Monte-Carlo studies: repeat through pseudo-random simulations.
We consider again the birthday problem when k = 40:
n        m       m/n ≈ P(E)
10       10      1
100      90      0.9
1000     887     0.887
10000    8913    0.8913
∞        –       0.891 (exact)
(Check R code for the Birthday problem.)
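The R code itself is not reproduced in these notes; the following is a minimal sketch of such a Monte Carlo study for k = 40 (the seed and the function name birthday_sim are our choices):

  set.seed(2011)                     # for reproducibility only
  birthday_sim <- function(k, n) {
    hits <- replicate(n, any(duplicated(sample(1:365, k, replace = TRUE))))
    mean(hits)                       # m/n, the relative frequency of E
  }
  birthday_sim(40, 10000)            # should be close to the exact value 0.891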

(Begin of Lecture 3.2)

3 Random variables
3.1 Introduction
So far probabilities were assigned to events – that is to sets of sample outcomes.
Events were restricted to either a finite or a countably infinite number of sample outcomes.
One particular probability function encountered was the assignment of 1/n as the prob-
ability associated with each of the n points in a finite sample space.
This will now be generalised using the concept of random variables.

Example 3.1.1. Medical researcher tests 8 patients for their allergic reaction (yes/no)
to a new drug.

• Let S denote the sample space

• There are 2^8 = 256 possible outcomes, each being an ordered sequence of length 8

• Result: (y, n, n, y, n, n, y, n) ⇒ 3 yeses


Note, there are 8!/(3! 5!) = 56 possible sequences of length n = 8 with n1 = 3 y's and
n2 = 5 n's (recall Theorem 2.2.1 from Lecture 2.3).
(Typically, in studies of this sort, which particular subjects experience reactions is of
little interest: what matters is the number who show a reaction.)

• Let X denote the number of allergic reactions among a set of 8 adults.

• Then X is said to be a random variable and the number x = 3 is the value of the
random variable for the outcome (y, n, n, y, n, n, y, n). 2

In general, random variables (RVs) are functions that associate numbers in R with some
attribute of a sample outcome that is deemed to be especially important.

Definition 3.1.1 (Random variable). A random variable X is a function mapping each
sample outcome s in some probability space S to a real number, i.e.

for s ∈ S : X(s) = t_s = t ∈ R,   or X : S → R   □

Example (continued). The observed sample outcome is s = (y, n, n, y, n, n, y, n) ∉ R
and t_s = t = 3; thus X(s) = t = 3 ∈ R. □

RVs often create a dramatically simpler sample space.

Example (continued). The RV X has nine possible values (reduced from 256), the
integers 0, 1, . . . , 8. 2

All RVs fall into one of two broad categories:

• discrete RVs have a finite or countably infinite number of possible values;

• continuous RVs have an uncountably infinite number of possible values.

3.2 Discrete random variables


A key feature of a RV X is that it replaces the sample space S by X(S) ⊂ R, the image of X.

Definition 3.2.1. A function whose domain is a sample space S and whose values form
a finite or countably infinite set of real numbers is called a discrete random variable. 2

We denote RVs by uppercase letters, often X or Y .

Example 3.2.1 (Two fair dice; investigate sum).

• S = {(i, j) | i, j = 1, . . . 6} consists of 36 ordered pairs

• X((i, j)) = i + j has eleven possible values: 2-12

• P (X = k) = P ({s ∈ S : X(s) = k})

Thus, values of X are best visualised on the following grid (rows: i, columns: j, entries: i + j):

      1   2   3   4   5   6
  1   2   3   4   5   6   7
  2   3   4   5   6   7   8
  3   4   5   6   7   8   9
  4   5   6   7   8   9  10
  5   6   7   8   9  10  11
  6   7   8   9  10  11  12

Thus

  k          2     3     4     5     6     7     8     9    10    11    12
  P(X = k)  1/36  2/36  3/36  4/36  5/36  6/36  5/36  4/36  3/36  2/36  1/36
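In R, the grid and the resulting pmf can be obtained with:

  sums <- outer(1:6, 1:6, "+")   # the 6 x 6 grid of values i + j
  table(sums) / 36               # P(X = k) for k = 2, ..., 12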

Definition 3.2.2 (Probability mass function). Let S be a finite or countably infinite
sample space, X be a discrete random variable, and p be a real-valued function defined
for each element of S satisfying

a. 0 ≤ p(s) for each s ∈ S;

b. Σ_{s∈S} p(s) = 1.

The function pX(k) = P({s ∈ S : X(s) = k}) = P(X = k) is said to be a probability
mass function and respects the axioms of probability. □

29
Let A ⊂ S be any event. Then,

P(A) = Σ_{s∈A} p(s)

induces a probability function that satisfies the axioms of probability (proof left as an
exercise).

Example 3.2.2 (Geometric distribution). Toss a coin until first head is observed. What
is the probability that this happens on an odd-numbered toss?

S = {s1 , s2 , s3 , s4 , . . .} = {H, TH, TTH, TTTH, . . .}

Consider the RV X : S → N defined by s_k ↦ k. Then P(X = k) = p(s_k) = pX(k). For ease of
notation, write pX(k) = p(k).
Clearly,

p(k) = P(TT . . . T H) = (1/2)^{k−1} (1/2) = 2^{−k},   where the sequence contains (k − 1) T's

The discrete probability function p(k) satisfies a. and b. in Definition 3.2.2:

a. 0 ≤ p(k) because 2^{−k} ≥ 0 for all k;

b. Σ_{s∈S} p(s) = Σ_{k=1}^{∞} p(k) = Σ_{k=1}^{∞} (1/2)^k = 1.

Thus

p(1) + p(3) + p(5) + . . . = Σ_{i=0}^{∞} p(2i + 1) = Σ_{i=0}^{∞} (1/2)^{2i+1} = (1/2) Σ_{i=0}^{∞} (1/4)^i.

Recall the geometric series: Σ_{i=0}^{∞} r^i = 1/(1 − r) for 0 < r < 1.

Thus p(1) + p(3) + p(5) + . . . = (1/2) · 1/(1 − 1/4) = 2/3. □
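A quick numerical check in R (truncating the series at 51 terms and using rgeom() are our choices; rgeom() counts the tails before the first head, so an even count means the head falls on an odd-numbered toss):

  sum(0.5^seq(1, 101, by = 2))               # p(1) + p(3) + ..., about 2/3
  mean(rgeom(1e5, prob = 0.5) %% 2 == 0)     # simulation estimate of the same probability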

Definition 3.2.3 (Cumulative distribution function). The cumulative distribution function
(cdf) of X is defined as

FX(t) = P(X ≤ t)   □

The cdf has many properties (proofs left as an exercise for now) such as

• FX (t) is monotone increasing with values between 0 and 1;

• FX(t) is continuous from the right.

(Begin of Lecture 3.3)

3.2.1 Expected value


The probability mass function provides a global overview of a discrete random variable’s
behaviour.
If X is discrete: pX (k) = P (X = k) for all k defines probabilities.
The expected value is a single-number summary of a RV that measures the "average" value
it attains, thereby describing the central tendency (point of balance) of the pmf.

Example 3.2.3 (Roulette). The following picture is from Wikimedia (retrieved on 1/2/2018)
and shows the American roulette board

There are n = 38 possible numbers, 18 of which are odd, 18 are even (zero does not count
as even), and two (0 and 00) are neither odd nor even.

Place $1 on Odd. Let the RV X denote the winnings. Then,

X = 1    with pX(1) = P(X = 1) = 18/38 = 9/19
X = −1   with pX(−1) = P(X = −1) = 20/38 = 10/19

Thus,

"Expected" winnings = E(X) = ($1)(9/19) + (−$1)(10/19) = −$1/19 ≈ −5 cents
The following is Figure 3.5.2 in Larsen and Marx (2012):
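This weighted average is easily verified in R:

  x <- c(1, -1)            # possible winnings
  p <- c(18/38, 20/38)     # pX(1) and pX(-1)
  sum(x * p)               # -1/19, about -0.053 dollars per $1 bet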

Definition 3.2.4 (Expected value of a discrete RV X). Let the discrete RV X have probability
mass function (pmf) pX(k). The expected value of X, denoted by E(X) (or sometimes µ
or µX), is given by

E(X) = Σ_{all k} k · pX(k).   □

(The above equation shows that the pX (k)’s are weights when taking a weighted average
of the k’s.)
Comment. We assume that the sum in the definition above converges absolutely, that is,

Σ_{all k} |k| pX(k) < ∞.

If not, we say the RV has no finite expectation.

Example 3.2.4 (Equally likely outcomes). Here pX (k) = 1/n for all k ∈ X with #X = n.
Then,
E(X) = Σ_{all k} k · pX(k) = Σ_{all k} k · (1/n) = (1/n) Σ_{all k} k

Thus E(X) is the arithmetic mean of all values in S. □
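As a further illustration of the weighted-average interpretation, the expected value of the dice sum from Example 3.2.1 can be computed in R:

  k <- 2:12
  p <- c(1:6, 5:1) / 36    # pmf of X = i + j from Example 3.2.1
  sum(k * p)               # E(X) = 7, the point of balance of the pmf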

3.2.2 Other measures of location


There are other measures of central tendency than the expected value (mean).

Definition 3.2.5 (Median of a discrete RV). If X is a discrete RV, the median m is that
point for which
P (X < m) = P (X > m)
(or any point m that minimizes |P (X < m) − P (X > m)|.) 2

(For symmetric pmfs, E(X) and the median coincide; otherwise E(X) is drawn towards the
longer tail, whereas the median 'stays'.)

3.2.3 Functions of RVs and their expected values
Let X, X1 , X2 , X3 , . . . denote RVs. We often want to learn something about their func-
tions.

Example 3.2.5.

a. Y = g(X) for any function g, e.g. g(X) = aX + b

b. Tn = X1 + X2 + . . . + Xn

c. X̄n = (1/n) Σ_{i=1}^{n} Xi = Tn/n

d. Y2 = X²

e. Sxx = Σ_{i=1}^{n} (Xi − X̄n)²   □

Theorem 3.2.1. Suppose X is a discrete RV with pmf pX. Let g(X) be a function of
X. Then,

E[g(X)] = Σ_{all k} g(k) pX(k)

provided that Σ_{all k} |g(k)| pX(k) < ∞.   □

Proof of Theorem 3.2.1. Let W = g(X). The set of k-values X = {k1, k2, . . .} will
give rise to a set of w-values {w1, w2, . . .}, where, in general, more than one of the
k's may be associated with a given w.
Let Xj be the set of k's for which g(k) = wj.
Then ∪_{all j} Xj = X.
Clearly P(W = wj) = P(X ∈ Xj).

E(W) = Σ_{all j} wj P(W = wj) = Σ_{all j} wj P(X ∈ Xj)
     = Σ_{all j} wj Σ_{k∈Xj} pX(k)
     = Σ_{all j} Σ_{k∈Xj} wj pX(k)
     = Σ_{all j} Σ_{k∈Xj} g(k) pX(k)
     = Σ_{all k} g(k) pX(k)

Corollary 3.2.2. For any RV X, E(aX + b) = a E(X) + b. 2

Proof of Corollary 3.2.2


E(aX + b) = Σ_{all k} (ak + b) pX(k)                        (by Theorem 3.2.1)
          = a Σ_{all k} k pX(k) + b Σ_{all k} pX(k)         (the first sum is E(X), the second is 1)
          = a E(X) + b

Example 3.2.6. Consider W = g(X) = X², where the pmf of X, and therefore of W, is
given by

  k    pX(k)        w   pW(w)
  −2   5/8          1   1/8
   1   1/8          4   7/8
   2   2/8

Note, E(W) = Σ_{all k} g(k) pX(k) = 4(5/8) + 1(1/8) + 4(2/8) = 29/8   (Theorem 3.2.1)

Also, E(W) = Σ_{all w} w pW(w) = 1(1/8) + 4(7/8) = 29/8   (by the definition)   □
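Both computations are easy to reproduce in R:

  k <- c(-2, 1, 2); pX <- c(5, 1, 2) / 8
  sum(k^2 * pX)                       # via Theorem 3.2.1: sum of g(k) pX(k)
  w <- c(1, 4); pW <- c(1, 7) / 8
  sum(w * pW)                         # via the definition applied to W; both give 29/8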

3.2.4 Variance
Location (e.g. expected value) is uninformative about the dispersion (spread) of RVs.
There are several ways to measure dispersion, for example through

a. E(|X − µ|) = 'expected absolute deviation from µ';

b. E[(X − µ)²] = 'expected squared deviation from µ';

c. E[ρ(X − µ)], for ρ any non-negative function.

Definition 3.2.6. The variance of a RV X is the expected value of its squared deviation
from E(X) = µ and is only defined when E(X²) is finite:

Var(X) = σ² = E[(X − µ)²]   □

Theorem 3.2.3. Let X be any RV having mean µ and for which E(X²) is finite. Then,

Var(X) = σ² = E[(X − µ)²] = E(X²) − µ²   □

Proof of Theorem 3.2.3. Let g(X) = (X − µ)². Then

Var(X) = E[(X − µ)²] = Σ_{all k} g(k) pX(k) = Σ_{all k} (k − µ)² pX(k)
       = Σ_{all k} k² pX(k) − Σ_{all k} 2kµ pX(k) + Σ_{all k} µ² pX(k)
       = E(X²) − 2µ Σ_{all k} k pX(k) + µ² Σ_{all k} pX(k)
       = E(X²) − 2µ² + µ² = E(X²) − µ²   □
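The identity can be checked numerically in R, for instance with the pmf of Example 3.2.6:

  k <- c(-2, 1, 2); pX <- c(5, 1, 2) / 8
  mu <- sum(k * pX)
  sum((k - mu)^2 * pX)      # E[(X - mu)^2]
  sum(k^2 * pX) - mu^2      # E(X^2) - mu^2; both equal 207/64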

Theorem 3.2.4. Let X be any RV with finite µ = E(X) and finite E(X²). Consider
Z = aX + b. Then

Var(Z) = a² Var(X)   □

Proof of Theorem 3.2.4. Let Z = aX + b. Thus

Var(Z) = E{[Z − E(Z)]²} = E(Z²) − [E(Z)]²   (by the definition and Theorem 3.2.3),

where we write (⋆) = E(Z²) and (⋆⋆) = E(Z).
We've already seen that E(aX + b) = a E(X) + b = aµ + b.

Thus, (⋆⋆)² = (aµ + b)² = a²µ² + (2aµb + b²)
Further, (⋆) = E[a²X² + 2abX + b²] = a² E(X²) + (2abµ + b²)
Finally, (⋆) − (⋆⋆)² = a² E(X²) − a²µ² = a²[E(X²) − µ²] = a² Var(X)   □
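A quick simulation check of this result; the choices X ~ fair die, a = 3, b = 2 and the sample size are ours:

  x <- sample(1:6, 1e5, replace = TRUE)   # a large sample from a fair die
  var(3 * x + 2)                          # sample variance of Z = 3X + 2
  9 * var(x)                              # a^2 times the sample variance of X; close agreement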

One unfortunate consequence of the definition of the variance is that its units are the
square of the units of the RV. Therefore, in statistics one often works with its square
root instead.

Definition 3.2.7. The standard deviation of a RV X, provided E(X²) is finite, is

SD(X) = σX = √Var(X)   □

References
Larsen RJ, Marx ML (2012). An Introduction to Mathematical Statistics and Its Applications,
5th Edition. Boston: Pearson. Sections 2.7, 3.3, 3.5, 3.6.
