Elements of Probability and Statistical Theory: STAT 160A
1 Probability and Distribution
Introduction
• Manipulating sets
Commutative law: A ∪ B = B ∪ A, A ∩ B = B ∩ A
Associative law:
(A ∪ B) ∪ C = A ∪ (B ∪ C), (A ∩ B) ∩ C = A ∩ (B ∩ C)
• Some examples of set functions.
• Integral and Sum
Integral over a one-dimensional set C: ∫_C f(x) dx; over a
two-dimensional set C: ∫∫_C g(x, y) dx dy.
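As a quick numerical sketch of such set integrals (assuming SciPy is
available; the functions f(x) = 2x, g(x, y) = xy and the rectangular
sets are chosen purely for illustration):

    from scipy.integrate import quad, dblquad

    # one-dimensional set C = (0, 1), f(x) = 2x; the integral is 1
    val1, _ = quad(lambda x: 2 * x, 0, 1)

    # two-dimensional set C = (0, 1) x (0, 1), g(x, y) = x * y
    # dblquad integrates func(y, x) with x over (a, b) and y over (gfun(x), hfun(x))
    val2, _ = dblquad(lambda y, x: x * y, 0, 1, lambda x: 0, lambda x: 1)

    print(val1, val2)   # 1.0 and 0.25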
• Exercise
1.2.1,1.2.2,1.2.4,1.2.5,1.2.8,1.2.13,1.2.14
The Probability Set Function
Year Number of Births Proportion of Boys
1990 4,158,212 0.5121179
1991 4,110,907 0.5112054
1992 4,065,014 0.5121992
1993 4,000,240 0.5121845
1994 3,952,767 0.5116894
1995 3,926,589 0.5084196
1996 3,891,494 0.5114951
1997 3,880,894 0.5116337
1998 3,941,553 0.5115255
1999 3,959,417 0.5119072
2000 4,058,814 0.5117182
2001 4,025,933 0.5111665
2002 4,021,726 0.5117154
• σ-field:
Let B be a collection of subsets of the sample space C. B is a
σ-field if
(1) φ ∈ B
(2) if C ∈ B, then C^c ∈ B
(3) if the sequence of sets {C1, C2, ...} is in B, then
∪_{i=1}^∞ Ci ∈ B.
Note:
(1) a σ-field always contains φ and the sample space C
(2) a σ-field is also closed under countable intersection
– A permutation of n distinct objects is an arrangement of these
objects on a line, and the number of permutations of n distinct
objects equals n! (= n(n − 1) · · · (3)(2)(1)).
– A k-permutation of n distinct objects (k ≤ n) is an
arrangement of k objects chosen from the n distinct objects, and
the number of k-permutations of n distinct objects, denoted
P_k^n, equals n(n − 1) · · · (n − k + 1) = n!/(n − k)!.
– A k-combination from n distinct objects (k ≤ n) is a subset
containing k objects taken from the set containing these n
distinct objects. Note that the order in which the objects are
chosen from the given set does not matter for a combination.
The number of k-combinations from n distinct objects, denoted
(n choose k) or C_k^n (also referred to as a binomial
coefficient), equals n!/[k!(n − k)!].
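As a quick sanity check of these counting formulas (a minimal sketch
using only the Python standard library; n = 5 and k = 3 are arbitrary
illustrative values):

    import math

    n, k = 5, 3
    # number of k-permutations: n!/(n-k)!
    print(math.perm(n, k))                             # 60
    print(math.factorial(n) // math.factorial(n - k))  # 60, same value

    # number of k-combinations: n!/(k!(n-k)!)
    print(math.comb(n, k))                             # 10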
– Example 1.3.4 Let a card be drawn at random from an
ordinary deck of 52 playing cards which has been well shuffled.
(1) The probability of drawing a card that is a spade is 0.25
(2) The probability of drawing a card that is a king is 1/13.
(3) Suppose 5 cards are taken at random without replacement
and order is not important. Then the probability of getting a
flush, all 5 cards of the same suit, is 0.00198.
(4) The probability of getting exactly three cards of one kind,
with the other two cards being of two further distinct kinds, is
0.0211.
(5) The probability of getting exactly three cards that are
kings and exactly two cards that are queens is 0.0000093.
Note: The calculations above assume that all outcomes in the
sample space are equally likely.
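These equally-likely-outcome probabilities can be reproduced by direct
counting (a sketch using the standard-library function math.comb; the
counting arguments follow the example above):

    import math
    C = math.comb
    total = C(52, 5)

    flush = 4 * C(13, 5) / total                           # all 5 cards of one suit
    three_kind = 13 * C(4, 3) * C(12, 2) * 4 * 4 / total   # 3 of a kind + 2 distinct kinds
    kings_queens = C(4, 3) * C(4, 2) / total               # exactly 3 kings and 2 queens

    print(round(flush, 5), round(three_kind, 4), round(kings_queens, 7))
    # 0.00198 0.0211 9.2e-06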
– A loaded die example: A die is loaded in such a way that
the probability of any particular face's showing is directly
proportional to the number on that face. What is the
probability of observing 1, 2, or 3?
Solution: The experiment generates a sample space containing
six outcomes that are not equally likely. By assumption,
P("i" face appears) = P(i) = ki, i = 1, ..., 6, where k is a constant.
Since Σ_{i=1}^6 P(i) = 1, we have k = 1/21.
Therefore, P(1) + P(2) + P(3) = (1 + 2 + 3)/21 = 6/21 = 2/7.
• Theorem 1.3.6 (Continuity theorem of probability)
Let {Cn} be an increasing (decreasing) sequence of events. Then
lim_{n→∞} P(Cn) = P(lim_{n→∞} Cn) = P(∪_{n=1}^∞ Cn)
(respectively, P(∩_{n=1}^∞ Cn) for a decreasing sequence).   (1)
• Definition of conditional probability
The conditional probability of an event C2, given an event C1,
denoted by P(C2 | C1), is defined as
P(C2 | C1) = P(C1 ∩ C2)/P(C1),
provided P(C1) > 0, C1, C2 ⊂ C.
Q: Is the conditional probability function a probability set
function?
A: Yes; it is a probability set function with C1 playing the role
of the sample space.
(i) P(C2 | C1) = P(C1 ∩ C2)/P(C1) ≥ 0
(ii) P(C1 | C1) = 1
(iii) Let {Ci}, i = 2, 3, ..., be a pairwise mutually exclusive
sequence of events. Then
P(∪_{i=2}^∞ Ci | C1) = P((∪_{i=2}^∞ Ci) ∩ C1)/P(C1)
= P(∪_{i=2}^∞ (Ci ∩ C1))/P(C1)
= Σ_{i=2}^∞ P(Ci ∩ C1)/P(C1)
= Σ_{i=2}^∞ P(Ci | C1)
• Law of Total Probability
If C1, C2, ..., Ck is a collection of pairwise mutually exclusive and
exhaustive events, that is, Ci ∩ Cj = φ for i ≠ j and the union
∪_{i=1}^k Ci is the sample space, and P(Ci) > 0 for i = 1, ..., k,
then for any event C,
P(C) = Σ_{i=1}^k P(C | Ci) P(Ci)
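As a tiny numerical illustration of the law (the partition and the
probabilities below are made up purely for the sketch):

    # hypothetical partition C1, C2, C3 with P(Ci) and P(C | Ci)
    prior = [0.5, 0.3, 0.2]        # P(C1), P(C2), P(C3); they sum to 1
    cond = [0.10, 0.40, 0.25]      # P(C | C1), P(C | C2), P(C | C3)

    # law of total probability: P(C) = sum_i P(C | Ci) P(Ci)
    p_c = sum(pc * pr for pc, pr in zip(cond, prior))
    print(p_c)                     # 0.22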
Random Variables
• Probability model of X
If B ⊂ D and C = {c : c ∈ C and X(c) ∈ B}, then the probability
of event B, denoted by PX (B), is equal to P(C).
PX (B) is also a probability set function
(1) PX (B) = P (C) ≥ 0
(2) PX (D) = P (C) = 1
(3) For a sequence of mutually exclusive events {Bn }, let
Cn = {c : c ∈ C and X(c) ∈ Bn }. {Cn } are mutually exclusive.
PX(∪_{n=1}^∞ Bn) = P(∪_{n=1}^∞ Cn) = Σ_{n=1}^∞ P(Cn) = Σ_{n=1}^∞ PX(Bn)
• Theorem 1.5.1 Let X be a r.v. with cdf F(x). Then
(a) For all a and b, if a < b then F (a) ≤ F (b). (F is a
nondecreasing function).
hint: {X ≤ a} ⊂ {X ≤ b}
(b) limx→−∞ F (x) = 0
(c) limx→∞ F (x) = 1
hint: {X ≤ −∞} = φ, {X ≤ ∞} = sample space
(d) limx↓x0 F (x) = F (x0 ), (F is right continuous).
hint: let {xn} be any sequence of real numbers such that xn ↓ x0.
Let Cn = {X ≤ xn}. Then ∩_{n=1}^∞ Cn = {X ≤ x0}.
• Theorem 1.5.2 Let X be a r.v. with cdf F(x). Then for a < b,
P [a < X ≤ b] = FX (b) − FX (a).
• Theorem 1.5.3 For any random variable,
P [X = x] = FX (x) − FX (x−), for all x ∈ R, where
FX (x−) = limz↑x FX (z).
Discrete and Continuous Random Variables
Expectation of a Random Variable
Some Special Expectation
• If the mgf of a r.v. X exists, then E(X^r) = M^(r)(0) and
M(t) = 1 + Σ_{r=1}^∞ M^(r)(0) t^r / r!
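A quick symbolic check of the relation E(X^r) = M^(r)(0) (a sketch
assuming SymPy; the Bernoulli mgf M(t) = pe^t + 1 − p, which appears
later in these notes, is used purely for illustration):

    import sympy as sp

    t, p = sp.symbols('t p')
    M = p * sp.exp(t) + 1 - p            # Bernoulli(p) mgf

    EX = sp.diff(M, t, 1).subs(t, 0)     # first moment: p
    EX2 = sp.diff(M, t, 2).subs(t, 0)    # second moment: p
    var = sp.simplify(EX2 - EX**2)       # p - p**2, i.e. p(1 - p)
    print(EX, EX2, var)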
Important Inequalities
• Jensen's inequality: if φ is a convex function and the r.v. X has
finite expectation, then
φ[E(X)] ≤ E[φ(X)]
2 Multivariate Distribution
Distribution of Two Random Variables
• Definition 2.1.1 (Bivariate r.v.) A bivariate random variable
X = (X1 , X2 ) is a real-valued function which assigns to each
element c of sample space C one and only one ordered pair of
numbers X1 (c) = x1 , X2 (c) = x2 . The space of X = (X1 , X2 ) is
A = {(x1 , x2 ) : X1 (c) = x1 , X2 (c) = x2 , c ∈ C}
• Definition If event A ⊂ A, C = {c : c ∈ C and
(X1 (c), X2 (c)) ∈ A}, then P ((X1 , X2 ) ∈ A) = P (C).
• A bivariate random variable is of the discrete type or of the
continuous type
Note: (i) 0 ≤ pX1,X2(x1, x2) ≤ 1
(ii) Σ_{(x1,x2) ∈ A} pX1,X2(x1, x2) = 1 (summing over the space A)
(iii) for an event A ⊂ A, P[(X1, X2) ∈ A] = Σ_{(x1,x2) ∈ A} pX1,X2(x1, x2)
fX1,X2(x1, x2) = ∂²FX1,X2(x1, x2)/∂x1∂x2   (joint pdf for the continuous case)
FX1,X2(x1, x2) = ∫_{−∞}^{x2} ∫_{−∞}^{x1} fX1,X2(w1, w2) dw1 dw2
(joint cdf for the continuous case)
Example 2.1.2
• Theorem
P (a < X1 ≤ b, c < X2 ≤ d) = F (b, d) − F (b, c) − F (a, d) + F (a, c)
• Marginal Distribution
The marginal pmf of a single discrete r.v. can be obtained from
the joint pmf by summing out the other variable:
pX1(x1) = Σ_{all x2} pX1,X2(x1, x2),  pX2(x2) = Σ_{all x1} pX1,X2(x1, x2).
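A small numerical sketch of marginalization (the joint pmf table
below is made up for illustration; NumPy is assumed):

    import numpy as np

    # joint pmf p(x1, x2) with x1 in {0, 1}, x2 in {0, 1, 2}; entries sum to 1
    p = np.array([[0.10, 0.20, 0.10],
                  [0.30, 0.20, 0.10]])

    p_x1 = p.sum(axis=1)   # marginal pmf of X1: sum over x2 -> [0.4, 0.6]
    p_x2 = p.sum(axis=0)   # marginal pmf of X2: sum over x1 -> [0.4, 0.4, 0.2]
    print(p_x1, p_x2)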
Suppose (X1, X2) is discrete; then E(Y) exists if
Σ_{x1} Σ_{x2} |g(x1, x2)| pX1,X2(x1, x2) < ∞,
in which case
E(Y) = Σ_{x1} Σ_{x2} g(x1, x2) pX1,X2(x1, x2).
Similarly, if (X1, X2) is continuous and the corresponding
absolute integral is finite, then
E(Y) = ∫_{−∞}^{∞} ∫_{−∞}^{∞} g(x1, x2) fX1,X2(x1, x2) dx1 dx2.
• Discrete Case
Let pX1,X2(x1, x2) be the joint pmf of two discrete-type r.v. X1
and X2. Let y1 = µ1(x1, x2) and y2 = µ2(x1, x2) define a
one-to-one transformation. What is the joint pmf of the two
new random variables Y1 = µ1(X1, X2) and Y2 = µ2(X1, X2)?
(1) y1 = µ1(x1, x2), y2 = µ2(x1, x2) ⇒ x1 = ω1(y1, y2), x2 = ω2(y1, y2)
(2) pY1,Y2(y1, y2) = pX1,X2[ω1(y1, y2), ω2(y1, y2)]
example 2.2.1
• Continuous Case
Let fX1,X2(x1, x2) be the joint pdf of two continuous-type r.v.
X1 and X2. Let y1 = µ1(x1, x2) and y2 = µ2(x1, x2) define a
one-to-one transformation. What is the joint pdf of the two
new random variables Y1 = µ1(X1, X2) and Y2 = µ2(X1, X2)?
(1) y1 = µ1(x1, x2), y2 = µ2(x1, x2) ⇒ x1 = ω1(y1, y2), x2 = ω2(y1, y2)
(2) J = det[ ∂x1/∂y1  ∂x1/∂y2 ; ∂x2/∂y1  ∂x2/∂y2 ], the Jacobian of
the transformation
(3) fY1,Y2(y1, y2) = fX1,X2[ω1(y1, y2), ω2(y1, y2)] |J|
Example 2.2.3, 2.2.4, 2.2.5 (a short symbolic sketch of the
Jacobian follows below)
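For instance, a symbolic sketch of step (2) (assuming SymPy; the
transformation y1 = x1 + x2, y2 = x1 − x2 is chosen only for
illustration and is not one of the text's examples):

    import sympy as sp

    y1, y2 = sp.symbols('y1 y2')
    # inverse transformation: x1 = w1(y1, y2), x2 = w2(y1, y2)
    x1 = (y1 + y2) / 2
    x2 = (y1 - y2) / 2

    J = sp.Matrix([x1, x2]).jacobian([y1, y2]).det()
    print(J, abs(J))   # -1/2 and 1/2; |J| multiplies the joint pdf in step (3)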
• In addition to the change-of-variable techniques for finding
distributions of functions of random variables, there are two
other techniques: cdf techniques and mgf techniques.
Example 2.2.2, 2.2.6,2.2.7
Conditional Distributions and Expectations
• Conditional pmf of the discrete r.v.
X1 , X2 : discrete r.v.
pX1 ,X2 (x1 , x2 ): joint pmf
pX1 , pX2 : marginal pmf
Then for any x1 with pX1(x1) > 0,
P(X2 = x2 | X1 = x1) = P(X1 = x1, X2 = x2)/P(X1 = x1)
= pX1,X2(x1, x2)/pX1(x1)
= pX2|X1(x2 | x1)
pX2 |X1 (x2 |x1 ) is called the conditional pmf of X2 given X1 = x1 .
Similarly, pX1 |X2 (x1 |x2 ) is called the conditional pmf of X1 given
X2 = x2 .
(ii)
Σ_{x2} pX2|X1(x2 | x1) = Σ_{x2} pX1,X2(x1, x2)/pX1(x1)
= [1/pX1(x1)] Σ_{x2} pX1,X2(x1, x2)
= pX1(x1)/pX1(x1) = 1
• Conditional pdf of the continuous r.v.
X1 , X2 : continuous r.v.
fX1 ,X2 (x1 , x2 ): joint pdf
fX1 , fX2 : marginal pdf
Then for any x1 with fX1(x1) > 0,
fX2|X1(x2 | x1) = fX1,X2(x1, x2)/fX1(x1)
is called the conditional pdf of X2 given X1 = x1. Similarly,
fX1|X2(x1 | x2) is called the conditional pdf of X1 given X2 = x2.
Question: Is the conditional pdf a probability density function?
(i) fX2|X1(x2 | x1) = fX1,X2(x1, x2)/fX1(x1) ≥ 0
(ii)
∫_{−∞}^{∞} fX2|X1(x2 | x1) dx2 = ∫_{−∞}^{∞} fX1,X2(x1, x2)/fX1(x1) dx2
= [1/fX1(x1)] ∫_{−∞}^{∞} fX1,X2(x1, x2) dx2
= fX1(x1)/fX1(x1) = 1
• Conditional probability
discrete case:
P(a < X2 < b | X1 = x1) = Σ_{a<x2<b} pX2|X1(x2 | x1)
P(c < X1 < d | X2 = x2) = Σ_{c<x1<d} pX1|X2(x1 | x2)
continuous case:
P(a < X2 < b | X1 = x1) = ∫_a^b fX2|X1(x2 | x1) dx2
P(c < X1 < d | X2 = x2) = ∫_c^d fX1|X2(x1 | x2) dx1
• Conditional Expectation
If u(X2 ) is a function of X2 , then
discrete case: E[u(X2) | X1 = x1] = Σ_{x2} u(x2) pX2|X1(x2 | x1)
continuous case: E[u(X2) | X1 = x1] = ∫_{−∞}^{∞} u(x2) fX2|X1(x2 | x1) dx2
• Conditional Variance
If u(X2) is a function of X2, its conditional expectation is defined
as above; in particular, the conditional variance is
Var(X2 | x1) = E{[X2 − E(X2 | x1)]² | X1 = x1} = E(X2² | x1) − [E(X2 | x1)]²
• Theorem 2.3.1 Let (X1 , X2 ) be a random vector such that the
variance of X2 is finite. Then,
(a) E[E(X2 |X1 )] = E(X2 )
(b) Var[E(X2 | X1)] ≤ Var(X2)
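A quick numerical check of (a) and (b) on a made-up joint pmf
(NumPy assumed; the table is illustrative only):

    import numpy as np

    x2_vals = np.array([0.0, 1.0, 2.0])
    p = np.array([[0.10, 0.20, 0.10],     # joint pmf p(x1, x2), x1 in {0, 1}
                  [0.30, 0.20, 0.10]])

    p_x1 = p.sum(axis=1)                              # marginal pmf of X1
    cond_mean = (p * x2_vals).sum(axis=1) / p_x1      # E(X2 | X1 = x1) for each x1

    e_x2 = (p.sum(axis=0) * x2_vals).sum()            # E(X2)
    print(np.isclose((cond_mean * p_x1).sum(), e_x2)) # (a): E[E(X2|X1)] = E(X2) -> True

    var_x2 = (p.sum(axis=0) * x2_vals**2).sum() - e_x2**2
    var_cond_mean = (p_x1 * cond_mean**2).sum() - e_x2**2
    print(var_cond_mean <= var_x2)                    # (b) -> True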
The Correlation Coefficient
Independent Random Variables
Motivating example: f(x1, x2) = f2|1(x2 | x1) f1(x1). What happens if
f2|1 (x2 |x1 ) does not depend upon x1 ?
Answer: f2|1 (x2 |x1 ) = f2 (x2 ) and f (x1 , x2 ) = f2 (x2 )f1 (x1 )
• Definition 2.5.1 (Independence) Let X1 and X2 have the joint
pdf f (x1 , x2 )(joint pmf p(x1 , x2 )) and the marginal pdfs (pmfs)
f1 (x1 )(p1 (x1 )) and f2 (x2 )(p2 (x2 )), respectively.
X1 and X2 are independent ⇔ f (x1 , x2 ) = f1 (x1 )f2 (x2 ) for
continuous case
X1 and X2 are independent ⇔ p(x1 , x2 ) = p1 (x1 )p2 (x2 ) for
discrete case
Remark: if f1(x1) and f2(x2) are positive on, and only on, the
respective spaces A1 and A2 , then f1 (x1 )f2 (x2 ) is positive on,
and only on, the product space
A = {(x1 , x2 ) : x1 ∈ A1 , x2 ∈ A2 }. To check whether two r.v. X1
and X2 are independent, check the joint range first. If
A = A1 × A2, then check whether f(x1, x2) = f1(x1)f2(x2). If
A ≠ A1 × A2, we stop and conclude that they are not independent.
Example: Check whether the two r.v. X1 and X2 are
independent, where the joint p.d.f of X1 and X2 is given by
f (x1 , x2 ) = 2 if 0 < x1 < x2 < 1.
Solution: we can prove that f1 (x1 ) = 2(1 − x1 ) if 0 < x1 < 1
and f2 (x2 ) = 2x2 if 0 < x2 < 1. So,
the joint range is A = {(x1 , x2 ) : 0 < x1 < x2 < 1},
the range of X1 is A1 = {x1 : 0 < x1 < 1},
the range of X2 is A2 = {x2 : 0 < x2 < 1}.
Obviously A ≠ A1 × A2, so X1 and X2 are dependent (see also the
symbolic check below).
Example 2.5.1, Exercise 2.5.2,2.5.3
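The marginals quoted in the example above can be verified
symbolically (a sketch assuming SymPy):

    import sympy as sp

    x1, x2 = sp.symbols('x1 x2', positive=True)

    # joint pdf f(x1, x2) = 2 on 0 < x1 < x2 < 1
    f1 = sp.integrate(2, (x2, x1, 1))   # marginal of X1: 2*(1 - x1) on 0 < x1 < 1
    f2 = sp.integrate(2, (x1, 0, x2))   # marginal of X2: 2*x2      on 0 < x2 < 1
    print(f1, f2)                       # 2 - 2*x1, 2*x2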
• Theorem 2.5.1 Let the random variables X1 and X2 have support
S1 and S2 , respectively, and have the joint pdf (joint pmf)
f (x1 , x2 ) (p(x1 , x2 )).
X1 ,X2 are independent ⇔ f (x1 , x2 ) = g(x1 )h(x2 ) for continuous
case
X1 ,X2 are independent ⇔ p(x1 , x2 ) = g(x1 )h(x2 ) for discrete case
where g(x1 ) > 0, x1 ∈ S1 , and h(x2 ) > 0, x2 ∈ S2 .
Example 2.5.1, Example 2.5.2, Exercise 2.5.1
• Theorem 2.5.2 Let the r.v X1 and X2 have the joint cdf F (x1 , x2 )
and the marginal cdfs F1 (x1 ) and F2 (x2 ), respectively.
X1 ,X2 are independent ⇔ F (x1 , x2 ) = F1 (x1 )F2 (x2 ) for all
(x1 , x2 ) ∈ R2
• Theorem 2.5.3 X1 ,X2 are independent
⇔ P (a < X1 ≤ b, c < X2 ≤ d) = P (a < X1 ≤ b)P (c < X2 ≤ d)
for every a < b and c < d, where a,b,c,d are constant.
Example 2.5.3, Exercise 2.5.5
• Theorem 2.5.4 If X1 and X2 are independent r.v. and E[u(X1)]
and E[v(X2)] exist, then
E[u(X1)v(X2)] = E[u(X1)]E[v(X2)]
Example 2.5.4 Note that the converse is not true: even if
cov(X1, X2) = 0, X1 and X2 could be dependent.
• Theorem 2.5.5 Suppose the joint mgf, M (t1 , t2 ) exists for the
random variables X1 and X2 , then
X1 ,X2 are independent ⇔ M (t1 , t2 ) = M (t1 , 0)M (0, t2 )
Example 2.5.5, 2.5.6, Exercise 2.5.6
More Examples: Exercise 2.5.9, 2.5.12
Extension to Several Random Variables
Discrete case:
p(x1, ..., xn) = P(X1 = x1, ..., Xn = xn)
F(x1, ..., xn) = P(X1 ≤ x1, ..., Xn ≤ xn) = Σ_{u1≤x1} ··· Σ_{un≤xn} p(u1, ..., un)
Continuous case:
f(x1, ..., xn) = ∂^n F(x1, ..., xn)/(∂x1 ··· ∂xn)
F(x1, ..., xn) = P(X1 ≤ x1, ..., Xn ≤ xn)
= ∫_{−∞}^{xn} ··· ∫_{−∞}^{x1} f(w1, ..., wn) dw1 ··· dwn
Example 2.6.1
• Expectation
E[u(X1, ..., Xn)] = Σ_{xn} ··· Σ_{x1} u(x1, ..., xn) p(x1, ..., xn)
(discrete case)
= ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} u(x1, ..., xn) f(x1, ..., xn) dx1 ··· dxn
(continuous case)
E[Σ_{j=1}^m kj uj(X1, ..., Xn)] = Σ_{j=1}^m kj E[uj(X1, ..., Xn)]
• Marginal pdf (pmf) of k (k < n) random variables
p(x1, ..., xk) = Σ_{xk+1} ··· Σ_{xn} p(x1, ..., xn)   (discrete case)
f(x1, ..., xk) = ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} f(x1, ..., xn) dxk+1 ··· dxn   (continuous case)
• Conditional expectation of u(X2, ..., Xn) given X1 = x1:
E[u(X2, ..., Xn) | x1] = Σ_{x2} ··· Σ_{xn} u(x2, ..., xn) p(x2, ..., xn | x1)
(discrete case)
= ∫_{−∞}^{∞} ··· ∫_{−∞}^{∞} u(x2, ..., xn) f(x2, ..., xn | x1) dx2 ··· dxn
(continuous case)
• Independence
(1) The r.v.’s X1 , ..., Xn are mutually independent if and only if
f(x1, ..., xn) ≡ f1(x1) ··· fn(xn) or p(x1, ..., xn) ≡ p1(x1) ··· pn(xn)
(2)if X1 , ..., Xn are mutually independent then
P (a1 < X1 < b1 , a2 < X2 < b2 , ..., an < Xn < bn )
= P (a1 < X1 < b1 )P (a2 < X2 < b2 ) · · · P (an < Xn < bn )
(3) if X1, ..., Xn are mutually independent, then
E[∏_{i=1}^n ui(Xi)] = ∏_{i=1}^n E[ui(Xi)]
Transformations: Random Vectors
• Discrete Case
Let pX1,...,Xn(x1, ..., xn) be the joint pmf of n discrete-type r.v.
X1, ..., Xn. Let y1 = µ1(x1, ..., xn), ..., yn = µn(x1, ..., xn) define
a one-to-one transformation. What is the joint pmf of the n new
random variables Y1 = µ1(X1, ..., Xn), ..., Yn = µn(X1, ..., Xn)?
(1) the inverse transformation: x1 = ω1(y1, ..., yn), ..., xn = ωn(y1, ..., yn)
(2) pY1,...,Yn(y1, ..., yn) = pX1,...,Xn(ω1(y1, ..., yn), ..., ωn(y1, ..., yn))
• Continuous Case
Let fX1 ,··· ,Xn (x1 , · · · , xn ) be the joint pdf of n continuous-type
r.v. X1 , · · · , Xn . y1 = µ1 (x1 , ..., xn ), · · · , yn = µn (x1 , ..., xn )
define a one-to-one transformation. What is the joint pdf of the
n new random variables Y1 = µ1 (X1 , ..., Xn ), · · · ,
Yn = µn (X1 , ..., Xn )?
3 Some Special Distributions
The Binomial and Related Distributions
• Bernoulli Distribution
A Bernoulli experiment is a random experiment, the outcome
of which can be classified in but one of two mutually exclusive
and exhaustive ways. For example,
rain or not rain tomorrow? (X = 0 → no rain, X = 1 → rain)
Head turning up or tail turning up after flipping a coin once?
(X = 0 → tail, X = 1 → head)
Bernoulli Distribution: The r.v. X has a Bernoulli
distribution with parameter p, 0 ≤ p ≤ 1, if its pmf is given by
P(X=1)=p, P(X=0)=1-p. This pmf can be written more
succinctly as pX(x) = p^x (1 − p)^(1−x), x = 0, 1
Mean: E(X)=p
Variance: Var(X)=p(1-p)
mgf: M(t) = pe^t + q, where q = 1 − p, for all t
• Binomial Distribution
Repeat the Bernoulli experiment in the previous example many
times, say n times. Each time there is probability p of observing
1 (rain or head turning up). If X is the number of 1's observed,
then
p(x) = P(X = x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, 2, ..., n
Binomial Distribution: The r.v. X has a binomial
distribution b(n, p) with parameters n and p, where n is the
number of trials and p is the probability of observing 1 in each
independent trial, 0 ≤ p ≤ 1, if the pmf of X is given by
pX(x) = (n choose x) p^x (1 − p)^(n−x), x = 0, 1, 2, ..., n.
Mean: E(X) = np
Variance: Var(X) = np(1 − p)
mgf: M(t) = (pe^t + q)^n, for all t
Q1: Is b(n,p) a pmf?
Q2: How to use mgf to compute E(X) and Var(X)?
Q3: If n=1, Binomial distribution is also another special
distribution. What is this distribution?
Example 3.1.1-3.1.5
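A quick numerical check related to Q1 and Q2 (a sketch assuming
SciPy; n = 10 and p = 0.3 are arbitrary illustrative values):

    from scipy.stats import binom

    n, p = 10, 0.3
    X = binom(n, p)

    print(sum(X.pmf(k) for k in range(n + 1)))   # Q1: the pmf sums to 1
    print(X.mean(), n * p)                       # Q2 via moments: both 3.0
    print(X.var(), n * p * (1 - p))              # both 2.1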
• Theorem 3.1.1 Let X1, X2, ..., Xm be independent random
variables such that Xi has a binomial b(ni, p) distribution, for
i = 1, 2, ..., m. Let Y = Σ_{i=1}^m Xi. Then Y has a binomial
b(Σ_{i=1}^m ni, p) distribution.
• Multinomial Distribution
The binomial distribution can be generalized to the multinomial
distribution. Let a random experiment be repeated n
independent times. On each repetition, the experiment results in
but one of k mutually exclusive and exhaustive ways, say
C1, C2, ..., Ck. Let pi be the probability that the outcome is an
element of Ci and let pi remain constant throughout the n
independent repetitions, i = 1, 2, ..., k. If Xi is the number of
outcomes that are elements of Ci, i = 1, 2, ..., k − 1, then
p(x1, x2, ..., xk−1) = P(X1 = x1, X2 = x2, ..., Xk−1 = xk−1)
= [n!/(x1! ··· xk−1! (n − (x1 + ··· + xk−1))!)] p1^{x1} ··· pk−1^{xk−1} pk^{n−(x1+···+xk−1)}
Example: trinomial distribution, pages 138-139
• Negative Binomial Distribution
Repeat the Bernoulli experiments in the first example until
observing 1 (rain or head turning up) for the r-th time. Each
time there is probability p of observing 1. If Y is the number of
0's observed (no rain or tail turning up), then
pY(y) = P(Y = y) = (y+r−1 choose r−1) p^r (1 − p)^y, y = 0, 1, 2, ...
Negative Binomial Distribution: The r.v. Y has a negative
binomial distribution with parameters r and p, where r is the
required number of 1's (successes) and p is the probability of
observing 1 in each independent trial, 0 ≤ p ≤ 1, if the pmf of Y
is given by
pY(y) = P(Y = y) = (y+r−1 choose r−1) p^r (1 − p)^y, y = 0, 1, 2, ...
Mean: E(Y) = r(1 − p)/p
Variance: Var(Y) = r(1 − p)/p²
mgf: M(t) = p^r [1 − (1 − p)e^t]^{−r} for t < −ln(1 − p)
• Geometric Distribution
Repeat the Bernoulli experiments in the first example until
observing 1 (rain or head turning up) for the first time. Each
time there is probability p of observing 1. If Y is the number of
0's observed (no rain or tail turning up), then
pY(y) = P(Y = y) = p(1 − p)^y, y = 0, 1, 2, ...
Geometric Distribution: The r.v. Y has a geometric
distribution with parameter p, where p is the probability of
observing 1 in each independent trial, 0 ≤ p ≤ 1, if the pmf of Y
is given by
pY(y) = P(Y = y) = p(1 − p)^y, y = 0, 1, 2, ...
Mean: E(Y) = (1 − p)/p
Variance: Var(Y) = (1 − p)/p²
mgf: M(t) = p[1 − (1 − p)e^t]^{−1} for t < −ln(1 − p)
The geometric distribution is the special case of the negative
binomial distribution with r = 1.
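A numerical sketch of these two distributions (assuming SciPy;
scipy.stats.nbinom uses the same convention as above, counting the
number of 0's before the r-th 1, so the geometric case is nbinom
with r = 1; r = 3 and p = 0.4 are illustrative values):

    from scipy.stats import nbinom

    r, p = 3, 0.4
    Y = nbinom(r, p)
    print(Y.mean(), r * (1 - p) / p)       # both 4.5, matching r(1-p)/p
    print(Y.var(), r * (1 - p) / p**2)     # both 11.25, matching r(1-p)/p^2

    G = nbinom(1, p)                       # geometric: 0's before the first 1
    print(G.pmf(2), p * (1 - p)**2)        # both 0.144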
• Hypergeometric Distribution
An urn contains N objects, of which M are defective (M ≤ N).
A sample of n objects is chosen at random without replacement;
let X be the number of defective objects among the n objects
chosen. Then
P(X = x) = (M choose x)(N−M choose n−x)/(N choose n), x = L, L+1, ..., U,
where L = max{0, n − N + M}, U = min{M, n}.
Hypergeometric Distribution: The r.v. X has a
hypergeometric distribution with parameters M, n, N, where
M, n ≤ N, if the pmf of X is given by
pX(x) = (M choose x)(N−M choose n−x)/(N choose n), x = L, L+1, ..., U,
L = max{0, n − N + M}, U = min{M, n}
Mean: E(X) = nM/N
Variance: Var(X) = n(M/N)(1 − M/N)(N − n)/(N − 1)
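A direct computational check of the pmf and the mean, using only
the standard library (N = 20, M = 7, n = 5 are illustrative values):

    import math

    N, M, n = 20, 7, 5
    C = math.comb
    L, U = max(0, n - N + M), min(M, n)

    pmf = {x: C(M, x) * C(N - M, n - x) / C(N, n) for x in range(L, U + 1)}
    print(sum(pmf.values()))                                 # 1.0 (up to rounding)
    print(sum(x * px for x, px in pmf.items()), n * M / N)   # both 1.75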
As M, N → ∞ with lim_{M,N→∞} M/N = p,
(M choose x)(N−M choose n−x)/(N choose n) → (n choose x) p^x (1 − p)^(n−x).
• Poisson Distribution
Some events are rather rare; they don't happen that often. For
instance, car accidents are the exception rather than the rule.
Still, over a period of time, we can say something about the
nature of rare events.
e.g. 1: whether wearing seat belts reduces the number of deaths
in car accidents. Here the Poisson distribution can be a useful
tool to answer questions about the benefits of seat belt use.
e.g. 2: deaths of infants
e.g. 3: the number of misprints in a book
The Poisson distribution is a mathematical rule that assigns
probabilities to the number of occurrences X in a fixed interval.
The only thing we have to know to specify the Poisson
distribution is the mean number of occurrences, for which the
symbol λ is often used:
P(X = x) = e^{−λ} λ^x / x!
Poisson Distribution: The r.v. X has a Poisson distribution
with parameter λ if the pmf of X is given by
p(x) = e^{−λ} λ^x / x!, x = 0, 1, 2, ...
Mean: E(X) = λ
Variance: Var(X) = λ
mgf: M(t) = e^{λ(e^t − 1)}, for all t
The Poisson distribution resembles the binomial distribution in
that it models counts of events. For example, a Poisson
distribution could be used to model the number of accidents at
an intersection in a week. However, if we want to use the
binomial distribution, we have to know both the number of people
who enter the intersection and the number of people who
have an accident at the intersection, whereas the number of
accidents is sufficient for applying the Poisson distribution. Thus,
the Poisson distribution is cheaper to use because the number of
accidents is usually recorded by the police department, whereas
the total number of drivers is not. This is supported by the
following theorem.
• The asymptotic distribution of the binomial distribution is the
Poisson distribution:
If the r.v. X has a binomial distribution with parameters n and p,
where n → ∞, p → 0, and lim_{n→∞} np = λ, then
lim_{n→∞} p(x) = e^{−λ} λ^x / x!
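A numerical sketch of this limit (SciPy assumed; λ = 2 with
n = 1000 is an arbitrary illustrative choice):

    from scipy.stats import binom, poisson

    lam, n = 2.0, 1000
    for x in range(5):
        print(x, binom.pmf(x, n, lam / n), poisson.pmf(x, lam))
    # for large n the two pmf columns agree to several decimal places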
• Conditions under which a Poisson distribution holds
– counts of rare events
– all events are independent
– average rate does not change over the period of interest
• Theorem 3.2.1 Let X1, X2, ..., Xn be independent random
variables such that Xi has a Poisson distribution with parameter
mi, for i = 1, 2, ..., n. Then Y = Σ_{i=1}^n Xi has a Poisson
distribution with parameter Σ_{i=1}^n mi.
Example 3.2.1-3.2.4
Special Continuous Distribution
Let
w - a time interval
W - a r.v., the time needed to obtain exactly k deaths (e.g. k = 1)
k - a fixed positive integer
X - a r.v., the count of deaths within the time interval w,
following a Poisson distribution with mean count λw for an
interval of length w.
The cdf of W is P(W ≤ w) = 1 − P(W > w).
Since the event {W > w} means obtaining at most k − 1 deaths
within the time interval w, we have
P(W > w) = Σ_{x=0}^{k−1} P(X = x) = Σ_{x=0}^{k−1} (λw)^x e^{−λw}/x!
= ∫_{λw}^{∞} z^{k−1} e^{−z}/(k − 1)! dz,
where the last equality can be obtained by integrating by parts
k − 1 times.
Therefore,
G(w) = P(W ≤ w) = 1 − ∫_{λw}^{∞} z^{k−1} e^{−z}/(k − 1)! dz
= ∫_0^{λw} z^{k−1} e^{−z}/Γ(k) dz   if w > 0, and G(w) = 0 otherwise.
The pdf is then
g(w) = λ^k w^{k−1} e^{−λw}/Γ(k)   if w > 0, and g(w) = 0 otherwise,
which is the Γ(k, 1/λ) pdf. In particular, for k = 1,
g(w) = λ e^{−λw}   if w > 0, and g(w) = 0 otherwise.
• Exponential distribution: exp(λ)
A continuous r.v. X has an exponential distribution with
parameter λ > 0 if and only if
f(x) = λ e^{−λx}   if x > 0, and f(x) = 0 otherwise.
Mean: E(X) = 1/λ
Variance: Var(X) = (1/λ)²
mgf: M(t) = 1/(1 − t/λ) for t < λ
Remark:
1. exponential distribution is a special case of gamma
distribution for α = 1, β = 1/λ
2. exponential distribution is often used in survival analysis.
Denote survival function as S(x) and X as the survival time,
S(x) = P(X > x) = 1 − P(X ≤ x) = e^{−λx}   if x ≥ 0, and S(x) = 1 otherwise.
3. Memoryless property of the exponential distribution:
X has an exp(λ) distribution ⇔ P(X > a + t | X > a) = P(X > t)
for any a > 0, t > 0.
From this property, the remaining life length of a cancer patient
does not depend on how long he/she has already survived.
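A one-line derivation of the memoryless property (a sketch; it only
uses the survival function S(x) given above):
P(X > a + t | X > a) = P(X > a + t)/P(X > a) = e^{−λ(a+t)}/e^{−λa}
= e^{−λt} = P(X > t).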
• Chi-square distribution: χ2 (r)
A continuous r.v. X has a χ2 distribution with parameter r > 0,
where r is a positive integer, if and only if
f(x) = [1/(2^{r/2} Γ(r/2))] x^{r/2−1} e^{−x/2}   if x > 0, and
f(x) = 0 otherwise.
Mean: E(X)=r
Variance: Var(X)=2r
mgf: M(t) = (1 − 2t)^{−r/2} for t < 1/2
Remark:
χ2 (r) distribution is a special case of gamma distribution
Γ(r/2, 2)
Example 3.3.3, 3.3.4
– Theorem 3.3.1 Let X have a χ2 (r) distribution. If k > −r/2
then E(X k ) exists and it is given by
E(X^k) = 2^k Γ(r/2 + k)/Γ(r/2)
Example 3.3.5,3.3.6
• Theorem 3.3.2 Let X1, ..., Xn be independent random variables.
Suppose, for i = 1, ..., n, that Xi has a Γ(αi, β) distribution. Let
Y = Σ_{i=1}^n Xi. Then Y has a Γ(Σ_{i=1}^n αi, β) distribution.
– Corollary 3.3.1 Let X1, ..., Xn be independent random
variables. Suppose, for i = 1, ..., n, that Xi has a χ²(ri)
distribution. Let Y = Σ_{i=1}^n Xi. Then Y has a χ²(Σ_{i=1}^n ri)
distribution.
• Beta distribution: β(α, β)
Motivation: Let X1 and X2 be two independent random variables
that have Γ distributions (Γ(α, 1) and Γ(β, 1)), so their joint pdf is
h(x1, x2) = [1/(Γ(α)Γ(β))] x1^{α−1} x2^{β−1} e^{−x1−x2}
for 0 < x1 < ∞, 0 < x2 < ∞, and zero elsewhere, where
α > 0, β > 0.
Let Y1 = X1 + X2 and Y2 = X1/(X1 + X2); the joint pdf of Y1, Y2
is then
g(y1, y2) = [y2^{α−1}(1 − y2)^{β−1}/(Γ(α)Γ(β))] y1^{α+β−1} e^{−y1}
for 0 < y1 < ∞, 0 < y2 < 1, and 0 otherwise, where α > 0, β > 0.
Obviously, Y1 and Y2 are independent. We can also prove that
the marginal pdf of Y2 is
g2(y2) = [Γ(α+β)/(Γ(α)Γ(β))] y2^{α−1}(1 − y2)^{β−1}   if 0 < y2 < 1,
and g2(y2) = 0 otherwise.
For independent U ∼ χ²(r1) and V ∼ χ²(r2), let W = (U/r1)/(V/r2)
and Z = V (the derivation of the F distribution); the joint pdf of
(W, Z) is
g(w, z) = [1/(Γ(r1/2)Γ(r2/2) 2^{(r1+r2)/2})] (r1zw/r2)^{r1/2−1} z^{r2/2−1}
× exp[−(z/2)(r1w/r2 + 1)] (r1z/r2)
for 0 < w < ∞, 0 < z < ∞, and 0 otherwise.
The marginal pdf of W is then
g1(w) = [Γ((r1+r2)/2) (r1/r2)^{r1/2} w^{r1/2−1}] /
[Γ(r1/2)Γ(r2/2) (1 + r1w/r2)^{(r1+r2)/2}]
for 0 < w < ∞, and g1(w) = 0 otherwise,
which is the pdf of an F distribution with r1 and r2 degrees of
freedom.
• Mixture Distribution
Example 3.7.1-3.7.4