
Lecture Notes 1

Brief Review of Basic Probability


(Casella and Berger Chapters 1-4)

1 Probability Review
Chapters 1-4 are a review. I will assume you have read and understood Chapters
1-4. Let us recall some of the key ideas.

1.1 Random Variables

A random variable is a map X from a set Ω (equipped with a probability P ) to R. We write

P (X ∈ A) = P ({ω ∈ Ω : X(ω) ∈ A})

and we write X ∼ P to mean that X has distribution P. The cumulative distribution function (cdf) of X is

F_X(x) = F(x) = P(X ≤ x).

If X is discrete, its probability mass function (pmf ) is

pX (x) = p(x) = P (X = x).

If X is continuous, then its probability density function (pdf) satisfies

P(X ∈ A) = ∫_A p_X(x) dx = ∫_A p(x) dx

and p_X(x) = p(x) = F′(x). The following are all equivalent:

X ∼ P, X ∼ F, X ∼ p.

Suppose that X ∼ P and Y ∼ Q. We say that X and Y have the same distribution if P(X ∈ A) = Q(Y ∈ A) for all A. In that case we say that X and Y are equal in distribution and we write X =ᵈ Y.

It can be shown that X =ᵈ Y if and only if F_X(t) = F_Y(t) for all t.

1.2 Expected Values

The mean or expected value of g(X) is

E(g(X)) = ∫ g(x) dF(x) = ∫ g(x) dP(x),

which equals ∫_{−∞}^{∞} g(x) p(x) dx if X is continuous and Σ_j g(x_j) p(x_j) if X is discrete.

Recall that:

1. E(Σ_{j=1}^k c_j g_j(X)) = Σ_{j=1}^k c_j E(g_j(X)).

2. If X_1, . . . , X_n are independent then

   E(∏_{i=1}^n X_i) = ∏_{i=1}^n E(X_i).

3. We often write µ = E(X).

4. σ² = Var(X) = E((X − µ)²) is the variance.

5. Var(X) = E(X²) − µ².

6. If X_1, . . . , X_n are independent then

   Var(Σ_{i=1}^n a_i X_i) = Σ_{i=1}^n a_i² Var(X_i).

7. The covariance is

   Cov(X, Y) = E((X − µ_X)(Y − µ_Y)) = E(XY) − µ_X µ_Y

   and the correlation is ρ(X, Y) = Cov(X, Y)/(σ_X σ_Y). Recall that −1 ≤ ρ(X, Y) ≤ 1.

The conditional expectation of Y given X is the random variable E(Y|X) whose value, when X = x, is

E(Y|X = x) = ∫ y p(y|x) dy

where p(y|x) = p(x, y)/p(x).

The Law of Total Expectation (or Law of Iterated Expectation) is

E(Y) = E[E(Y|X)] = ∫ E(Y|X = x) p_X(x) dx.

The Law of Total Variance is

Var(Y) = Var(E(Y|X)) + E(Var(Y|X)).

The moment generating function (mgf) is

M_X(t) = E(e^{tX}).

If M_X(t) = M_Y(t) for all t in an interval around 0 then X =ᵈ Y.

Check that M_X^(n)(t)|_{t=0} = E(X^n).
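The moment relation can be checked symbolically. A small sketch, assuming SymPy is available and using the N(µ, σ²) mgf exp{µt + σ²t²/2} listed in the appendix:

```python
import sympy as sp

t, mu, sigma = sp.symbols('t mu sigma', real=True)
M = sp.exp(mu*t + sigma**2 * t**2 / 2)   # mgf of N(mu, sigma^2)

m1 = sp.diff(M, t, 1).subs(t, 0)         # first derivative at 0  -> mu = E(X)
m2 = sp.diff(M, t, 2).subs(t, 0)         # second derivative at 0 -> mu^2 + sigma^2 = E(X^2)
print(sp.simplify(m1), sp.simplify(m2))
```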

1.3 Exponential Families

A family of distributions {p(x; θ) : θ ∈ Θ} is called an exponential family if

p(x; θ) = h(x) c(θ) exp{ Σ_{i=1}^k w_i(θ) t_i(x) }.

Example 1 X ∼ Poisson(λ) is an exponential family since

p(x) = P(X = x) = e^{−λ} λ^x / x! = (1/x!) e^{−λ} exp{(log λ) · x}.
Example 2 X ∼ U(0, θ) is not an exponential family. The density is

p_X(x) = (1/θ) I_(0,θ)(x)

where I_A(x) = 1 if x ∈ A and 0 otherwise.

We can rewrite an exponential family in terms of a natural parameterization. For k = 1 we have

p(x; η) = h(x) exp{η t(x) − A(η)}

where

A(η) = log ∫ h(x) exp{η t(x)} dx.

For example, a Poisson can be written as

p(x; η) = exp{ηx − e^η}/x!

where the natural parameter is η = log λ.


Let X have an exponential family distribution. Then

E(t(X)) = A′(η),    Var(t(X)) = A″(η).

Practice Problem: Prove the above result.
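For the Poisson family written in natural form, t(x) = x and A(η) = e^η, so A′(η) = A″(η) = λ. The result can therefore be checked numerically; a rough sketch assuming NumPy, with λ chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
lam = 3.5
eta = np.log(lam)                   # natural parameter eta = log(lambda)
A_prime = np.exp(eta)               # A'(eta) = A''(eta) = e^eta = lambda for the Poisson

X = rng.poisson(lam, size=1_000_000)
print(X.mean(), A_prime)            # E(t(X)) = E(X) should match A'(eta)
print(X.var(),  A_prime)            # Var(t(X)) = Var(X) should match A''(eta)
```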

1.4 Transformations

Let Y = g(X). Then

F_Y(y) = P(Y ≤ y) = P(g(X) ≤ y) = ∫_{A_y} p_X(x) dx

where

A_y = {x : g(x) ≤ y}.

Then p_Y(y) = F_Y′(y).


If g is monotonic, then

p_Y(y) = p_X(h(y)) |dh(y)/dy|

where h = g^{−1}.

Example 3 Let p_X(x) = e^{−x} for x > 0. Hence F_X(x) = 1 − e^{−x}. Let Y = g(X) = log X. Then

F_Y(y) = P(Y ≤ y) = P(log(X) ≤ y) = P(X ≤ e^y) = F_X(e^y) = 1 − e^{−e^y}

and p_Y(y) = e^y e^{−e^y} for y ∈ R.
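The derived density can be checked by simulation. A sketch, assuming NumPy and Matplotlib are available, that draws X ∼ Exp(1), transforms it, and overlays p_Y(y) = e^y e^{−e^y} on a histogram of Y = log X:

```python
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
X = rng.exponential(scale=1.0, size=200_000)   # p_X(x) = e^{-x}, x > 0
Y = np.log(X)

ys = np.linspace(-6, 3, 400)
pY = np.exp(ys) * np.exp(-np.exp(ys))          # density derived above

plt.hist(Y, bins=100, density=True, alpha=0.4, label="simulated Y = log X")
plt.plot(ys, pY, label="e^y e^{-e^y}")
plt.legend()
plt.show()
```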

Example 4 Practice problem. Let X be uniform on (−1, 2) and let Y = X 2 . Find the
density of Y .

Let Z = g(X, Y). For example, Z = X + Y or Z = X/Y. Then we find the pdf of Z as follows (a Monte Carlo sketch of the same steps appears after the list):

1. For each z, find the set Az = {(x, y) : g(x, y) ≤ z}.

2. Find the CDF

   F_Z(z) = P(Z ≤ z) = P(g(X, Y) ≤ z) = P({(x, y) : g(x, y) ≤ z}) = ∫∫_{A_z} p_{X,Y}(x, y) dx dy.

3. The pdf is p_Z(z) = F_Z′(z).
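When the integral over A_z is awkward, the same three steps can be mimicked by Monte Carlo. A rough sketch assuming NumPy, using Z = X + Y for independent uniforms purely as an illustration (this is not Example 5): estimate F_Z on a grid and differentiate it numerically.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
X = rng.uniform(0, 1, n)
Y = rng.uniform(0, 1, n)
Z = X + Y                                    # illustrative choice of g(X, Y)

zs = np.linspace(0, 2, 201)
F = np.array([(Z <= z).mean() for z in zs])  # step 2: Monte Carlo estimate of F_Z
p = np.gradient(F, zs)                       # step 3: numerical derivative approximates p_Z

# For this g the triangular density should appear: p_Z(z) = z on (0,1) and 2 - z on (1,2).
```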

Example 5 Practice problem. Let (X, Y ) be uniform on the unit square. Let Z = X/Y .
Find the density of Z.

1.5 Independence
X and Y are independent if and only if

P(X ∈ A, Y ∈ B) = P(X ∈ A)P(Y ∈ B)

for all A and B.

Theorem 6 Let (X, Y) be a bivariate random vector with joint pdf p_{X,Y}(x, y). Then X and Y are independent iff p_{X,Y}(x, y) = p_X(x) p_Y(y).

X_1, . . . , X_n are independent if and only if

P(X_1 ∈ A_1, . . . , X_n ∈ A_n) = ∏_{i=1}^n P(X_i ∈ A_i).

Thus, p_{X_1,...,X_n}(x_1, . . . , x_n) = ∏_{i=1}^n p_{X_i}(x_i).
If X_1, . . . , X_n are independent and identically distributed we say they are iid (or that they are a random sample) and we write

X_1, . . . , X_n ∼ P   or   X_1, . . . , X_n ∼ F   or   X_1, . . . , X_n ∼ p.

1.6 Important Distributions

X ∼ N(µ, σ²) if

p(x) = (1/(σ√(2π))) e^{−(x−µ)²/(2σ²)}.

If X ∈ R^d then X ∼ N(µ, Σ) if

p(x) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(x − µ)^T Σ^{−1}(x − µ) }.

X ∼ χ²_p if X = Σ_{j=1}^p Z_j², where Z_1, . . . , Z_p ∼ N(0, 1), iid.

X ∼ Bernoulli(θ) if P(X = 1) = θ and P(X = 0) = 1 − θ, and hence

p(x) = θ^x (1 − θ)^{1−x},   x = 0, 1.

X ∼ Binomial(n, θ) if

p(x) = P(X = x) = (n choose x) θ^x (1 − θ)^{n−x},   x ∈ {0, . . . , n}.

X ∼ Uniform(0, θ) if p(x) = I(0 ≤ x ≤ θ)/θ.

1.7 Sample Mean and Variance

The sample mean is

X̄ = (1/n) Σ_i X_i,

and the sample variance is

S² = (1/(n−1)) Σ_i (X_i − X̄)².

Let X_1, . . . , X_n be iid with E(X_i) = µ and Var(X_i) = σ². Then

E(X̄) = µ,   Var(X̄) = σ²/n,   E(S²) = σ².
Theorem 7 If X_1, . . . , X_n ∼ N(µ, σ²) then

(a) X̄ ∼ N(µ, σ²/n)

(b) (n − 1)S²/σ² ∼ χ²_{n−1}

(c) X̄ and S² are independent.
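The three claims are easy to illustrate numerically. A sketch assuming NumPy, drawing many normal samples of size n (the parameter values are arbitrary) and checking the mean and variance of X̄, the mean of S², and the (lack of) correlation between them:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, n, reps = 2.0, 3.0, 10, 200_000

X = rng.normal(mu, sigma, size=(reps, n))
xbar = X.mean(axis=1)
S2 = X.var(axis=1, ddof=1)          # sample variance with the 1/(n-1) factor

print(xbar.mean(), xbar.var())      # approx mu and sigma^2 / n
print(S2.mean())                    # approx sigma^2
print(np.corrcoef(xbar, S2)[0, 1])  # approx 0, consistent with independence
```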

1.8 Delta Method

If X ∼ N(µ, σ²), Y = g(X) and σ² is small, then

Y ≈ N(g(µ), σ²(g′(µ))²).

To see this, note that

Y = g(X) = g(µ) + (X − µ)g′(µ) + ((X − µ)²/2) g″(ξ)

for some ξ. Now E((X − µ)²) = σ², which we are assuming is small, and so

Y = g(X) ≈ g(µ) + (X − µ)g′(µ).

Thus

E(Y) ≈ g(µ),   Var(Y) ≈ (g′(µ))² σ².

Hence,

g(X) ≈ N(g(µ), (g′(µ))² σ²).
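The approximation is visible in simulation. A sketch assuming NumPy, with g(x) = e^x and a small σ chosen purely for illustration, comparing the mean and standard deviation of g(X) with g(µ) and |g′(µ)|σ:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 0.05                  # small sigma, as the argument requires
X = rng.normal(mu, sigma, size=1_000_000)
Y = np.exp(X)                          # g(x) = e^x, so g'(mu) = e^mu

print(Y.mean(), np.exp(mu))            # approx g(mu)
print(Y.std(),  sigma * np.exp(mu))    # approx |g'(mu)| * sigma
```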


Appendix: Useful Facts

Facts about sums


• Σ_{i=1}^n i = n(n+1)/2.

• Σ_{i=1}^n i² = n(n+1)(2n+1)/6.

• Geometric series: a + ar + ar² + . . . = a/(1 − r), for 0 < r < 1.

• Partial geometric series: a + ar + ar² + . . . + ar^{n−1} = a(1 − r^n)/(1 − r).

• Binomial Theorem:

  Σ_{x=0}^n (n choose x) a^x = (1 + a)^n,   Σ_{x=0}^n (n choose x) a^x b^{n−x} = (a + b)^n.

• Hypergeometric identity:

  Σ_{x=0}^∞ (a choose x)(b choose n−x) = (a+b choose n).

Common Distributions

Discrete

Uniform

• X ∼ U (1, . . . , N )

• X takes values x = 1, 2, . . . , N

• P (X = x) = 1/N

• E(X) = Σ_x x P(X = x) = Σ_x x (1/N) = (1/N) · N(N+1)/2 = (N+1)/2

• E(X²) = Σ_x x² P(X = x) = Σ_x x² (1/N) = (1/N) · N(N+1)(2N+1)/6 = (N+1)(2N+1)/6

Binomial

• X ∼ Bin(n, p)

• X takes values x = 0, 1, . . . , n

• P(X = x) = (n choose x) p^x (1 − p)^{n−x}

Hypergeometric

• X ∼ Hypergeometric(N, M, K)
• P(X = x) = (M choose x)(N−M choose K−x) / (N choose K)

Geometric

• X ∼ Geom(p)

• P(X = x) = (1 − p)^{x−1} p,   x = 1, 2, . . .

• E(X) = Σ_x x(1 − p)^{x−1} p = p Σ_x d/dp(−(1 − p)^x) = p · (1/p²) = 1/p.

Poisson

• X ∼ Poisson(λ)

• P(X = x) = e^{−λ} λ^x / x!,   x = 0, 1, 2, . . .

• E (X) = Var (X) = λ


• M_X(t) = Σ_{x=0}^∞ e^{tx} e^{−λ} λ^x / x! = e^{−λ} Σ_{x=0}^∞ (λe^t)^x / x! = e^{−λ} e^{λe^t} = e^{λ(e^t − 1)}.

• E(X) = M_X′(0) = λe^t e^{λ(e^t − 1)} |_{t=0} = λ.

• Use the mgf to show: if X_1 ∼ Poisson(λ_1) and X_2 ∼ Poisson(λ_2) are independent, then Y = X_1 + X_2 ∼ Poisson(λ_1 + λ_2).
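The mgf argument is one line: by independence M_Y(t) = M_{X_1}(t) M_{X_2}(t) = e^{λ_1(e^t − 1)} e^{λ_2(e^t − 1)} = e^{(λ_1 + λ_2)(e^t − 1)}, which is the Poisson(λ_1 + λ_2) mgf. The claim can also be checked by simulation; a sketch assuming NumPy and SciPy, with arbitrary rates:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
lam1, lam2, n = 2.0, 3.5, 500_000
Y = rng.poisson(lam1, n) + rng.poisson(lam2, n)   # sum of independent Poissons

for k in range(10):
    # empirical pmf of Y versus the Poisson(lam1 + lam2) pmf
    print(k, (Y == k).mean(), stats.poisson.pmf(k, lam1 + lam2))
```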

Continuous Distributions

Normal

• X ∼ N (µ, σ 2 )

• p(x) = (1/(√(2π) σ)) exp{−(x − µ)²/(2σ²)},   x ∈ R

• mgf M_X(t) = exp{µt + σ²t²/2}.

• E (X) = µ

• Var (X) = σ 2 .

• e.g., If Z ∼ N (0, 1) and X = µ + σZ, then X ∼ N (µ, σ 2 ). Show this...

Proof.

M_X(t) = E(e^{tX}) = E(e^{t(µ+σZ)}) = e^{tµ} E(e^{tσZ}) = e^{tµ} M_Z(tσ) = e^{tµ} e^{(tσ)²/2} = e^{tµ + t²σ²/2},

which is the mgf of a N(µ, σ²).

Alternative proof:

F_X(x) = P(X ≤ x) = P(µ + σZ ≤ x) = P(Z ≤ (x − µ)/σ) = F_Z((x − µ)/σ)

p_X(x) = F_X′(x) = p_Z((x − µ)/σ) · (1/σ)
       = (1/√(2π)) exp{−(1/2)((x − µ)/σ)²} · (1/σ)
       = (1/(√(2π) σ)) exp{−(1/2)((x − µ)/σ)²},

which is the pdf of a N(µ, σ²). □

Gamma

• X ∼ Γ(α, β).

• p_X(x) = (1/(Γ(α) β^α)) x^{α−1} e^{−x/β},   x a positive real.

• Γ(α) = ∫_0^∞ (1/β^α) x^{α−1} e^{−x/β} dx.

• Important statistical distribution: χ²_p = Γ(p/2, 2).

• χ²_p = Σ_{i=1}^p X_i², where X_i ∼ N(0, 1), iid.

Exponential

• X ∼ exp(β)

• p_X(x) = (1/β) e^{−x/β},   x a positive real.

• exp(β) = Γ(1, β).

• e.g., used to model the waiting time of a Poisson process. Suppose N is the number of phone calls in 1 hour and N ∼ Poisson(λ). Let T be the time between consecutive phone calls; then T ∼ exp(1/λ) and E(T) = 1/λ.

• If X_1, . . . , X_n are iid exp(β), then Σ_i X_i ∼ Γ(n, β).

• Memoryless Property: If X ∼ exp(β), then

P (X > t + s|X > t) = P (X > s).
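The memoryless property can be checked directly from simulated waiting times. A sketch assuming NumPy, with β, t, s chosen arbitrarily:

```python
import numpy as np

rng = np.random.default_rng(0)
beta, t, s = 2.0, 1.5, 1.0
X = rng.exponential(scale=beta, size=2_000_000)

lhs = (X > t + s).sum() / (X > t).sum()   # estimate of P(X > t+s | X > t)
rhs = (X > s).mean()                      # estimate of P(X > s)
print(lhs, rhs, np.exp(-s / beta))        # both approx e^{-s/beta}
```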

Linear Regression

Model the response (Y) as a linear function of the parameters and covariates (x) plus random error (ε):

Y_i = θ(x_i, β) + ε_i

where

θ(x, β) = Xβ = β_0 + β_1 x_1 + β_2 x_2 + . . . + β_k x_k.

Generalized Linear Model

Model the natural parameters as linear functions of the covariates.


Example: Logistic Regression.
P(Y = 1|X = x) = e^{β^T x} / (1 + e^{β^T x}).

In other words, Y|X = x ∼ Bin(n, p(x)) and

η(x) = β^T x

where

η(x) = log( p(x) / (1 − p(x)) ).
Logistic Regression consists of modelling the natural parameter, which is called the log odds
ratio, as a linear function of covariates.
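A minimal sketch of this parameterization, assuming NumPy and using made-up coefficients and covariates, mapping the linear predictor to a probability and recovering the log odds:

```python
import numpy as np

beta = np.array([-1.0, 0.8, 2.0])   # hypothetical coefficients (intercept, x1, x2)
x = np.array([1.0, 0.5, -1.2])      # hypothetical covariate vector (leading 1 for the intercept)

eta = beta @ x                      # natural parameter: log odds, eta(x) = beta^T x
p = np.exp(eta) / (1 + np.exp(eta)) # P(Y = 1 | X = x)

print(p, np.log(p / (1 - p)), eta)  # the log odds of p recovers eta
```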

Location and Scale Families, CB 3.5


Let p(x) be a pdf.

Location family: {p(x|µ) = p(x − µ) : µ ∈ R}

Scale family: {p(x|σ) = (1/σ) f(x/σ) : σ > 0}

Location-Scale family: {p(x|µ, σ) = (1/σ) f((x − µ)/σ) : µ ∈ R, σ > 0}

(1) Location family. Shifts the pdf.

e.g., Uniform with p(x) = 1 on (0, 1) and p(x − θ) = 1 on (θ, θ + 1).

e.g., Normal with standard pdf the density of a N (0, 1) and location family pdf N (θ, 1).
(2) Scale family. Stretches the pdf.

e.g., Normal with standard pdf the density of a N (0, 1) and scale family pdf N (0, σ 2 ).
(3) Location-Scale family. Stretches and shifts the pdf.

e.g., Normal with standard pdf the density of a N(0, 1) and location-scale family pdf N(θ, σ²), i.e., (1/σ) p((x − µ)/σ).

Multinomial Distribution
The multivariate version of a Binomial is called a Multinomial. Consider drawing a ball from an urn that has balls of k different colors labeled “color 1, color 2, . . . , color k.” Let p = (p_1, p_2, . . . , p_k), where Σ_j p_j = 1 and p_j is the probability of drawing color j. Draw n balls from the urn (independently and with replacement) and let X = (X_1, X_2, . . . , X_k) count the number of balls of each color drawn. We say that X has a Multinomial(n, p) distribution. The pmf is

p(x) = (n choose x_1, . . . , x_k) p_1^{x_1} · · · p_k^{x_k}.
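A quick sketch, assuming NumPy and with illustrative color probabilities, drawing repeated Multinomial(n, p) samples and checking that the counts average to n·p_j:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 20, np.array([0.5, 0.3, 0.2])         # n balls, k = 3 colors
X = rng.multinomial(n, p, size=100_000)      # each row is one draw; rows sum to n

print(X[0])                                  # counts of each color in one draw
print(X.mean(axis=0), n * p)                 # empirical means approx n * p_j
```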

Multivariate Normal Distribution


Let Y ∈ R^d. Then Y ∼ N(µ, Σ) if

p(y) = (2π)^{−d/2} |Σ|^{−1/2} exp{ −(1/2)(y − µ)^T Σ^{−1}(y − µ) }.

Then E(Y) = µ and cov(Y) = Σ. The moment generating function is

M(t) = exp{ µ^T t + t^T Σ t / 2 }.

Theorem 8 (a). If Y ∼ N (µ, Σ), then E(Y ) = µ, cov(Y ) = Σ.

(b). If Y ∼ N(µ, Σ) and c is a scalar, then cY ∼ N(cµ, c²Σ).
(c). Let Y ∼ N(µ, Σ). If A is p × n and b is p × 1, then AY + b ∼ N(Aµ + b, AΣA^T).

Theorem 9 Suppose that Y ∼ N(µ, Σ). Let

Y = ( Y_1 ),   µ = ( µ_1 ),   Σ = ( Σ_11  Σ_12 )
    ( Y_2 )        ( µ_2 )        ( Σ_21  Σ_22 ),

where Y_1 and µ_1 are p × 1, and Σ_11 is p × p.


(a). Y_1 ∼ N_p(µ_1, Σ_11), Y_2 ∼ N_{n−p}(µ_2, Σ_22).
(b). Y_1 and Y_2 are independent if and only if Σ_12 = 0.
(c). If Σ_22 > 0, then the conditional distribution of Y_1 given Y_2 is

Y_1 | Y_2 ∼ N_p( µ_1 + Σ_12 Σ_22^{−1}(Y_2 − µ_2),  Σ_11 − Σ_12 Σ_22^{−1} Σ_21 ).
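Part (c) translates directly into a few lines of linear algebra. A sketch assuming NumPy, with a made-up 3-dimensional Σ partitioned into a 1-dimensional Y_1 block and a 2-dimensional Y_2 block:

```python
import numpy as np

mu = np.array([0.0, 1.0, -1.0])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])           # an arbitrary positive definite matrix

p = 1                                          # Y_1 is the first coordinate
mu1, mu2 = mu[:p], mu[p:]
S11, S12 = Sigma[:p, :p], Sigma[:p, p:]
S21, S22 = Sigma[p:, :p], Sigma[p:, p:]

y2 = np.array([1.5, -0.5])                     # observed value of Y_2
S22_inv = np.linalg.inv(S22)

cond_mean = mu1 + S12 @ S22_inv @ (y2 - mu2)   # mu_1 + Sigma_12 Sigma_22^{-1} (y_2 - mu_2)
cond_cov  = S11 - S12 @ S22_inv @ S21          # Sigma_11 - Sigma_12 Sigma_22^{-1} Sigma_21
print(cond_mean, cond_cov)
```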

Lemma 10 Let Y ∼ N(µ, σ²I), where Y^T = (Y_1, . . . , Y_n), µ^T = (µ_1, . . . , µ_n) and σ² > 0 is a scalar. Then the Y_i are independent, Y_i ∼ N_1(µ_i, σ²), and

||Y||²/σ² = Y^T Y/σ² ∼ χ²_n( µ^T µ/σ² ).

Theorem 11 Let Y ∼ N (µ, Σ). Then:


(a). Y^T Σ^{−1} Y ∼ χ²_n( µ^T Σ^{−1} µ ).
(b). (Y − µ)^T Σ^{−1}(Y − µ) ∼ χ²_n(0).
