
2. Getting Started with Statistics

Dave Goldsman
H. Milton Stewart School of Industrial and Systems Engineering
Georgia Institute of Technology

3/2/20

ISYE 6739
Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Introduction to Descriptive Statistics

Lesson 2.1 — Introduction to Descriptive Statistics

What’s Coming Up:


Three high-level lessons on what Statistics is (not involving much math).
Several lessons on estimating parameters of probability distributions.
One lesson on certain distributions that will come up in subsequent Statistics modules: the normal, t, χ², and F.

Statistics forms a rational basis for decision-making using observed or experimental data. We make these decisions in the face of uncertainty.

Statistics helps us answer questions concerning:


The analysis of one population (or system).
The comparison of many populations.

ISYE 6739
Introduction to Descriptive Statistics

Examples:
Election polling.
Coke vs. Pepsi.
The effect of cigarette smoking on the probability of getting cancer.
The effect of a new drug on the probability of contracting hepatitis.
What’s the most popular TV show during a certain time period?
The effect of various heat-treating methods on steel tensile strength.
Which fertilizers improve crop yield?
King of Siam — etc., etc., etc.

Idea (Election polling example): We can’t poll every single voter. Thus, we
take a sample of data from the population of voters, and try to make a
reasonable conclusion based on that sample.

ISYE 6739
Introduction to Descriptive Statistics

Statistics tells us how to conduct the sampling (i.e., how many observations to
take, how to take them, etc.), and then how to draw conclusions from the
sampled data.

Types of Data
Continuous variables: Can take on any real value in a certain
interval. For example, the lifetime of a lightbulb or the weight of a
newborn child.
Discrete variables: Can only take on specific values. E.g., the number
of accidents this week at a factory or the possible rolls of a pair of dice.
Categorical variables: These data are not typically numerical.
What’s your favorite TV show during a certain time slot?

ISYE 6739
Introduction to Descriptive Statistics

Plotting Data

A picture is worth 1000 words. Always plot data before doing anything else,
if only to identify any obvious issues such as nonstandard distributions,
missing data points, outliers, etc.

Histograms provide a quick, succinct look at what you are dealing with. If
you take enough observations, the histogram will eventually converge to the
true distribution. But sometimes choosing the optimal number of cells is a
little tricky — like Goldilocks!
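
To make this concrete, here is a minimal sketch (not from the original slides; it assumes Python with numpy and matplotlib) that plots a density-scaled histogram of simulated Exp(1) data against the true pdf, so you can see the histogram settling down toward the true distribution and experiment with the number of cells.

```python
# Minimal sketch (assumptions: numpy and matplotlib installed): density-scaled
# histogram of simulated Exp(1) data, overlaid with the true pdf.
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.exponential(scale=1.0, size=10_000)

plt.hist(x, bins=40, density=True, alpha=0.5)  # try different bin counts -- the Goldilocks issue
grid = np.linspace(0, x.max(), 200)
plt.plot(grid, np.exp(-grid))                  # true Exp(1) pdf, e^{-x}
plt.xlabel("x")
plt.ylabel("density")
plt.show()
```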

ISYE 6739
Summarizing Data

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Summarizing Data

Lesson 2.2 — Summarizing Data

In addition to plotting data, how do we summarize data?

It’s nice to have lots of data. But sometimes it’s too much of a good thing!
Need to summarize.

Example: Grades on a test (i.e., raw data):

23 62 91 83 82 64 73 94 94 52
67 11 87 99 37 62 40 33 80 83
99 90 18 73 68 75 75 90 36 55

ISYE 6739
Summarizing Data

Stem-and-Leaf Diagram of grades. Easy way to write down all of the data.
Saves some space, and looks like a sideways histogram.

9 9944100
8 73320
7 5533
6 87422
5 52
4 0
3 763
2 3
1 81
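
For what it’s worth, the display above is easy to reproduce programmatically. Here is a minimal Python sketch (an illustration, not part of the slides) that builds the same stem-and-leaf diagram from the raw grades.

```python
# Minimal sketch: stem-and-leaf diagram of the 30 test grades.
from collections import defaultdict

grades = [23, 62, 91, 83, 82, 64, 73, 94, 94, 52,
          67, 11, 87, 99, 37, 62, 40, 33, 80, 83,
          99, 90, 18, 73, 68, 75, 75, 90, 36, 55]

stems = defaultdict(list)
for g in grades:
    stems[g // 10].append(g % 10)          # stem = tens digit, leaf = units digit

for stem in sorted(stems, reverse=True):   # print high stems first, like the slide
    leaves = "".join(str(d) for d in sorted(stems[stem], reverse=True))
    print(f"{stem} | {leaves}")
```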

ISYE 6739
Summarizing Data

Grouped Data

Range     Freq.   Cumul. Freq.   Proportion of observations so far
0–20        2          2              2/30
21–40       5          7              7/30
41–60       2          9              9/30
61–80      10         19             19/30
81–100     11         30              1

ISYE 6739
Summarizing Data

Summary Statistics:

n = 30 observations.

If Xi is the ith score, then the sample mean is

    X̄ ≡ (1/n) ∑_{i=1}^n Xi = 66.5.

The sample variance is

    S² ≡ (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² = 630.6.

Remark: Before you take any observations, X̄ and S² must be regarded as random variables.
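
A quick sanity check of these two numbers, as a minimal Python sketch (plain Python, no libraries; the variable names are ours, not the slides’):

```python
# Compute the sample mean and sample variance of the 30 test grades.
grades = [23, 62, 91, 83, 82, 64, 73, 94, 94, 52,
          67, 11, 87, 99, 37, 62, 40, 33, 80, 83,
          99, 90, 18, 73, 68, 75, 75, 90, 36, 55]

n = len(grades)
xbar = sum(grades) / n                                  # sample mean, about 66.5
s2 = sum((x - xbar) ** 2 for x in grades) / (n - 1)     # sample variance, about 630.6
print(round(xbar, 1), round(s2, 1))
```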

ISYE 6739
Summarizing Data

In general, suppose that we sample iid data X1, . . . , Xn from the population of interest.

Example: Xi is the lifespan of the ith lightbulb we observe.

We’re most interested in measuring the “center” and “spread” of the underlying distribution of the data.

Measures of Central Tendency:


Sample Mean: X̄ = ∑_{i=1}^n Xi / n.

Sample Median: The “middle” observation when the Xi’s are arranged numerically.

ISYE 6739
Summarizing Data

Example: 16, 7, 83 gives a median of 16.

Example: 16, 7, 83, 20 gives a “reasonable” median of (16 + 20)/2 = 18.

Remark: The sample median is less susceptible to “outlier” data than the
sample mean. One bad number can spoil the sample mean’s entire day.

Example: 7, 7, 7, 672, 7 results in a sample mean of 140 and a sample median of 7.

Sample Mode: “Most common” value. Not the most useful measure
sometimes.

Example: 16, 7, 20, 83, 7 gives a mode of 7.
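
These toy examples are easy to verify with Python’s standard statistics module (a minimal sketch, not from the slides):

```python
# Reproduce the median, mean, and mode toy examples above.
import statistics

print(statistics.median([16, 7, 83]))        # 16
print(statistics.median([16, 7, 83, 20]))    # 18.0 (average of the two middle values)
print(statistics.mean([7, 7, 7, 672, 7]))    # 140
print(statistics.median([7, 7, 7, 672, 7]))  # 7
print(statistics.mode([16, 7, 20, 83, 7]))   # 7
```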

ISYE 6739
Summarizing Data

Measures of Variation (dispersion, spread)

Sample Variance:

    S² ≡ (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² = (1/(n−1)) (∑_{i=1}^n Xi² − nX̄²),

the latter expression being easier to compute.

Sample Standard Deviation: S = +√S².

Sample Range: max_i Xi − min_i Xi.

ISYE 6739
Summarizing Data

Remark: Suppose the data takes p different values X1, . . . , Xp, with frequencies f1, . . . , fp, respectively.

How to calculate X̄ and S² quickly?

    X̄ = ∑_{j=1}^p fj Xj / n   and   S² = (∑_{j=1}^p fj Xj² − nX̄²)/(n − 1).

Example: Suppose we roll a die 10 times.

    Xj   1  2  3  4  5  6
    fj   2  1  1  3  0  3

Then X̄ = (2 · 1 + 1 · 2 + · · · + 3 · 6)/10 = 3.7, and S² = 3.789. □
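
A minimal Python sketch (our own helper function, not from the slides) that implements this frequency shortcut and reproduces the die example:

```python
# Frequency-data shortcut for the sample mean and sample variance.
def freq_mean_var(values, freqs):
    n = sum(freqs)
    xbar = sum(f * x for x, f in zip(values, freqs)) / n
    s2 = (sum(f * x**2 for x, f in zip(values, freqs)) - n * xbar**2) / (n - 1)
    return xbar, s2

xbar, s2 = freq_mean_var([1, 2, 3, 4, 5, 6], [2, 1, 1, 3, 0, 3])
print(xbar, round(s2, 3))   # 3.7 3.789
```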

ISYE 6739
Summarizing Data

Remark: If the individual observations can’t be determined in frequency distributions, you might just break the observations up into c intervals.

Example: Suppose c = 3, where we denote the midpoint of the jth interval by mj, j = 1, . . . , c, and the total sample size is n = ∑_{j=1}^c fj = 30.

    Xj interval   mj    fj
    100–150       125   10
    150–200       175   15
    200–300       250    5

    X̄ ≈ ∑_{j=1}^c fj mj / n = 170.833   and   S² ≈ (∑_{j=1}^c fj mj² − nX̄²)/(n − 1) = 1814. □

ISYE 6739
Candidate Distributions

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Candidate Distributions

Lesson 2.3 — Candidate Distributions

Time to make an informed guess about the type of probability distribution we’re dealing with. We’ll look at more-formal methodology for fitting distributions later in the course when we do goodness-of-fit tests. But for now, some preliminary things we should think about:

Is the data from a discrete, continuous, or mixed distribution?
Univariate/multivariate?
How much data is available?
Are experts around to ask about the nature of the data?
What if we do not have much/any data — can we at least guess at a good distribution?

ISYE 6739
Candidate Distributions

If the distribution is discrete, then we have a number of familiar choices to select from.
Bernoulli(p) (success with probability p)
Binomial(n, p) (number of successes in n Bern(p) trials)
Geometric(p) (number of Bern(p) trials until first success)
Negative Binomial (number of Bern(p) trials until multiple successes)
Poisson(λ) (counts the number of arrivals over time)
Empirical (the all-purpose “sample” distribution based on the histogram)

ISYE 6739
Candidate Distributions

If the data suggest a continuous distribution. . . .


Uniform (not much is known from the data, except perhaps the minimum
and maximum possible values)
Triangular (at least we have an idea regarding the minimum, maximum,
and “most likely” values)
Exponential(λ) (e.g., interarrival times from a Poisson process)
Normal (a good model for heights, weights, IQs, sample means, etc.)
Beta (good for specifying bounded data)
Gamma, Weibull, Gumbel, lognormal (reliability data)
Empirical (our all-purpose friend)

ISYE 6739
Introduction to Estimation

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Introduction to Estimation

Lesson 2.4 — Introduction to Estimation

Definition: A statistic is a function of the observations X1, . . . , Xn, and not explicitly dependent on any unknown parameters.

Examples of statistics: X̄ and S², but not (X̄ − µ)/σ.

Statistics are random variables. If we take two different samples, we’d expect to get two different values of a statistic.

A statistic is usually used to estimate some unknown parameter from the underlying probability distribution of the Xi’s.

Examples of parameters: µ, σ 2 .

ISYE 6739
Introduction to Estimation

Let X1, . . . , Xn be iid RV’s and let T(X) ≡ T(X1, . . . , Xn) be a statistic based on the Xi’s. Suppose we use T(X) to estimate some unknown parameter θ. Then T(X) is called a point estimator for θ.

Examples: X̄ is usually a point estimator for the mean µ = E[Xi], and S² is often a point estimator for the variance σ² = Var(Xi).

It would be nice if T (X) had certain properties:

Its expected value should equal the parameter it’s trying to estimate.

It should have low variance.

ISYE 6739
Unbiased Estimation

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Unbiased Estimation

Lesson 2.5 — Unbiased Estimation

Definition: T (X) is unbiased for θ if E[T (X)] = θ.

Example/Theorem: Suppose X1, . . . , Xn are iid anything with mean µ. Then X̄ is always unbiased for µ:

    E[X̄] = E[(1/n) ∑_{i=1}^n Xi] = E[Xi] = µ.

That’s why X̄ is called the sample mean. □

Baby Example: In particular, suppose X1, . . . , Xn are iid Exp(λ). Then X̄ is unbiased for µ = E[Xi] = 1/λ.

But be careful. . . . 1/X̄ is biased for λ in this exponential case, i.e., E[1/X̄] ≠ 1/E[X̄] = λ. □

ISYE 6739
Unbiased Estimation

Example/Theorem: Suppose X1, . . . , Xn are iid anything with mean µ and variance σ². Then S² is always unbiased for σ²:

    E[S²] = E[(1/(n−1)) ∑_{i=1}^n (Xi − X̄)²] = Var(Xi) = σ².

This is why S² is called the sample variance. □

Baby Example: Suppose X1, . . . , Xn are iid Exp(λ). Then S² is unbiased for Var(Xi) = 1/λ². □

ISYE 6739
Unbiased Estimation

Proof (of general result): First, some algebra gives

    ∑_{i=1}^n (Xi − X̄)² = ∑_{i=1}^n (Xi² − 2X̄Xi + X̄²)
                        = ∑_{i=1}^n Xi² − 2X̄ ∑_{i=1}^n Xi + nX̄²
                        = ∑_{i=1}^n Xi² − 2nX̄² + nX̄²
                        = ∑_{i=1}^n Xi² − nX̄².

So. . .

ISYE 6739
Unbiased Estimation

    E[S²] = (1/(n−1)) E[∑_{i=1}^n (Xi − X̄)²] = (1/(n−1)) E[∑_{i=1}^n Xi² − nX̄²]

          = (1/(n−1)) (∑_{i=1}^n E[Xi²] − nE[X̄²])

          = (n/(n−1)) (E[X1²] − E[X̄²])   (since the Xi’s are iid)

          = (n/(n−1)) (Var(X1) + (E[X1])² − Var(X̄) − (E[X̄])²)

          = (n/(n−1)) (σ² − σ²/n)   (since E[X1] = E[X̄] and Var(X̄) = σ²/n)

          = σ².  Done. □

Remark: S is not unbiased for the standard deviation σ.

ISYE 6739
Unbiased Estimation

Big Example: Suppose that X1, . . . , Xn ∼ iid Unif(0, θ), i.e., the pdf is
f(x) = 1/θ, for 0 < x < θ. Think of it this way: I give you a bunch of
random numbers between 0 and θ, and you have to guess what θ is.

We’ll look at three unbiased estimators for θ:

    Y1 = 2X̄.

    Y2 = ((n+1)/n) max_{1≤i≤n} Xi.

    Y3 = 12X̄ with probability 1/2, and −8X̄ with probability 1/2.

If they’re all unbiased, which one’s the best?

ISYE 6739
Unbiased Estimation

“Good” Estimator: Y1 = 2X̄.

Proof (that it’s unbiased): E[Y1] = 2E[X̄] = 2E[Xi] = θ. □

“Better” Estimator: Y2 = ((n+1)/n) max_{1≤i≤n} Xi.

Why might this estimator for θ make sense? (We’ll say why it’s “better” in a little while.)

Proof (that it’s unbiased): E[Y2] = ((n+1)/n) E[max_i Xi] = θ iff E[max_i Xi] = nθ/(n+1) (which is what we’ll show below).

ISYE 6739
Unbiased Estimation

First, let’s get the cdf of M ≡ max_i Xi:

    P(M ≤ y) = P(X1 ≤ y and X2 ≤ y and · · · and Xn ≤ y)
             = P(X1 ≤ y) P(X2 ≤ y) · · · P(Xn ≤ y)   (Xi’s independent)
             = [P(X1 ≤ y)]^n   (Xi’s identically distributed)
             = (∫_0^y f_{X1}(x) dx)^n
             = (∫_0^y (1/θ) dx)^n
             = (y/θ)^n.

ISYE 6739
Unbiased Estimation

This implies that the pdf of M is

    f_M(y) ≡ (d/dy)(y/θ)^n = n y^{n−1}/θ^n,   0 < y < θ,

and this implies that

    E[M] = ∫_0^θ y f_M(y) dy = ∫_0^θ n y^n/θ^n dy = nθ/(n+1).

Whew! This finally shows that Y2 = ((n+1)/n) max_{1≤i≤n} Xi is an unbiased estimator for θ! □

Lastly, let’s look at. . .

ISYE 6739
Unbiased Estimation

“Ugly” Estimator: Y3 = 12X̄ with probability 1/2, and −8X̄ with probability 1/2.

Ha! It’s possible to get a negative estimate for θ, which is strange since θ > 0!

Proof (that it’s unbiased):

    E[Y3] = 12E[X̄] · (1/2) − 8E[X̄] · (1/2) = 2E[X̄] = θ. □

Usually, it’s good for an estimator to be unbiased, but the “ugly” estimator Y3
shows that unbiased estimators can sometimes be goofy.

Therefore, let’s look at some other properties an estimator can have.

ISYE 6739
Unbiased Estimation

For instance, consider the variance of an estimator.

Big Example (cont’d): Again suppose that X1, . . . , Xn ∼ iid Unif(0, θ).

Recall that both Y1 = 2X̄ and Y2 = ((n+1)/n) M are unbiased for θ.

Let’s find Var(Y1) and Var(Y2). First,

    Var(Y1) = 4 Var(X̄) = (4/n) Var(Xi) = (4/n) (θ²/12) = θ²/(3n).

ISYE 6739
Unbiased Estimation

Meanwhile,

    Var(Y2) = ((n+1)/n)² Var(M)
            = ((n+1)/n)² E[M²] − (((n+1)/n) E[M])²
            = ((n+1)/n)² ∫_0^θ n y^{n+1}/θ^n dy − θ²
            = θ² (n+1)²/(n(n+2)) − θ² = θ²/(n(n+2)) < θ²/(3n).

Thus, both Y1 and Y2 are unbiased, but Y2 has much lower variance than Y1.
We can break the “unbiasedness tie” by choosing Y2. □
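
A small Monte Carlo check (a sketch assuming numpy; the θ value, sample size, and replication count are arbitrary choices) that both estimators average out to θ while Y2’s variance is far smaller:

```python
# Compare Y1 = 2*Xbar and Y2 = ((n+1)/n)*max(X) for Unif(0, theta) by simulation.
import numpy as np

rng = np.random.default_rng(0)
theta, n, reps = 10.0, 20, 100_000

x = rng.uniform(0.0, theta, size=(reps, n))
y1 = 2.0 * x.mean(axis=1)
y2 = (n + 1) / n * x.max(axis=1)

print(y1.mean(), y2.mean())                 # both close to theta = 10 (unbiased)
print(y1.var(), theta**2 / (3 * n))         # simulated vs. theoretical theta^2/(3n)
print(y2.var(), theta**2 / (n * (n + 2)))   # vs. theta^2/(n(n+2)), much smaller
```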

ISYE 6739
Mean Squared Error

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Mean Squared Error

Lesson 2.6 — Mean Squared Error

We’ll now talk about a statistical performance measure that combines information about the bias and the variance of an estimator.

Definition: The Mean Squared Error (MSE) of an estimator T(X) of θ is

    MSE(T(X)) ≡ E[(T(X) − θ)²].

Before giving an easier interpretation of MSE, define the bias of an estimator for the parameter θ,

    Bias(T(X)) ≡ E[T(X)] − θ.

ISYE 6739
Mean Squared Error

Theorem/Proof: Easier interpretation of MSE.

    MSE(T(X)) = E[(T(X) − θ)²]
              = E[T²] − 2θE[T] + θ²
              = E[T²] − (E[T])² + (E[T])² − 2θE[T] + θ²
              = Var(T) + (E[T] − θ)²,

where E[T] − θ is the Bias. So MSE = Bias² + Var, and thus combines the bias and variance of an estimator. □

ISYE 6739
Mean Squared Error

The lower the MSE the better. If T1 (X) and T2 (X) are two estimators of θ,
we’d usually prefer the one with the lower MSE — even if it happens to have
higher bias.

Definition: The relative efficiency of T2(X) to T1(X) is MSE(T1(X))/MSE(T2(X)). If this quantity is < 1, then we’d want T1(X).

Example: Suppose that estimator A has bias = 3 and variance = 10, while
estimator B has bias = −2 and variance = 14. Which estimator (A or B) has
the lower mean squared error?

Solution: MSE = Bias² + Var, so

MSE(A) = 9 + 10 = 19 and MSE(B) = 4 + 14 = 18.

Thus, B has lower MSE. □

ISYE 6739
Mean Squared Error

Example: X1, . . . , Xn ∼ iid Unif(0, θ).

Two estimators: Y1 = 2X̄, and Y2 = ((n+1)/n) max_i Xi.

Showed before E[Y1] = E[Y2] = θ (so both estimators are unbiased).

Also, Var(Y1) = θ²/(3n), and Var(Y2) = θ²/(n(n+2)).

Thus,

    MSE(Y1) = θ²/(3n)   and   MSE(Y2) = θ²/(n(n+2)),

so Y2 is better (by an order of magnitude, actually). □

ISYE 6739
Maximum Likelihood Estimation

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Maximum Likelihood Estimation

Lesson 2.7 — Maximum Likelihood Estimation

Definition: Consider an iid random sample X1, . . . , Xn, where each Xi has pmf/pdf f(x). Further, suppose that θ is some unknown parameter from Xi.

The likelihood function is L(θ) ≡ ∏_{i=1}^n f(xi).

The maximum likelihood estimator (MLE) of θ is the value of θ that maximizes L(θ). The MLE is a function of the Xi’s and is a RV.

Remark: We can very informally regard the MLE as the “most likely”
estimate of θ.

ISYE 6739
Maximum Likelihood Estimation

Example: Suppose X1, . . . , Xn ∼ iid Exp(λ). Find the MLE for λ.

First of all, the likelihood function is

    L(λ) = ∏_{i=1}^n f(xi) = ∏_{i=1}^n λe^{−λxi} = λ^n exp(−λ ∑_{i=1}^n xi).

Now maximize L(λ) with respect to λ. Could take the derivative and plow through all of the horrible algebra. Too tedious. Need a trick. . . .

Useful Trick: Since the natural log function is one-to-one, it’s easy to see that the λ that maximizes L(λ) also maximizes ln(L(λ))!

    ln(L(λ)) = ln(λ^n exp(−λ ∑_{i=1}^n xi)) = n ln(λ) − λ ∑_{i=1}^n xi.

ISYE 6739
Maximum Likelihood Estimation

The trick makes our job less horrible.

    (d/dλ) ln(L(λ)) = (d/dλ)[n ln(λ) − λ ∑_{i=1}^n xi] = n/λ − ∑_{i=1}^n xi ≡ 0.

This implies that the MLE is λ̂ = 1/X̄. □

Remarks:
λ̂ = 1/X̄ makes sense, since E[X] = 1/λ.
At the end, we put a little hat over λ to indicate that this is the MLE. It’s like a party hat!
At the end, we make all of the little xi’s into big Xi’s to indicate that this is a random variable.
Just to be careful, you “probably” ought to do a second-derivative test.
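
If you’d rather let the computer do the maximization, here is a minimal sketch (assuming numpy and scipy are available; the simulated data and search bounds are our own choices) that maximizes the Exp(λ) log-likelihood numerically and confirms it lands on 1/X̄:

```python
# Numerically maximize the Exp(lambda) log-likelihood and compare with 1/X-bar.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
x = rng.exponential(scale=2.0, size=500)      # true lambda = 1/scale = 0.5

def neg_log_lik(lam):
    # negative of n*ln(lambda) - lambda*sum(x)
    return -(len(x) * np.log(lam) - lam * x.sum())

res = minimize_scalar(neg_log_lik, bounds=(1e-6, 10.0), method="bounded")
print(res.x, 1.0 / x.mean())                  # numerical MLE vs. 1/X-bar (they agree)
```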

ISYE 6739
Maximum Likelihood Estimation

Example: Suppose X1, . . . , Xn ∼ iid Bern(p). Find the MLE for p.

Useful trick for this problem: Since Xi = 1 w.p. p and Xi = 0 w.p. 1 − p, we can write the pmf as

    f(x) = p^x (1 − p)^{1−x},   x = 0, 1.

Thus, the likelihood function is

    L(p) = ∏_{i=1}^n f(xi) = ∏_{i=1}^n p^{xi} (1 − p)^{1−xi} = p^{∑_{i=1}^n xi} (1 − p)^{n − ∑_{i=1}^n xi}.

ISYE 6739
Maximum Likelihood Estimation

This implies that

    ln(L(p)) = (∑_{i=1}^n xi) ln(p) + (n − ∑_{i=1}^n xi) ln(1 − p)

    ⇒ (d/dp) ln(L(p)) = (∑_i xi)/p − (n − ∑_i xi)/(1 − p) ≡ 0

    ⇒ (1 − p) ∑_{i=1}^n xi = p (n − ∑_{i=1}^n xi)

    ⇒ p̂ = X̄.

This makes sense since E[X] = p. □

ISYE 6739
Trickier MLE Examples

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Trickier MLE Examples

Lesson 2.8 — Trickier MLE Examples

Example: X1, . . . , Xn ∼ iid Nor(µ, σ²). Get simultaneous MLEs for µ and σ².

    L(µ, σ²) = ∏_{i=1}^n f(xi) = ∏_{i=1}^n (1/√(2πσ²)) exp{−(1/2)(xi − µ)²/σ²}
             = (2πσ²)^{−n/2} exp{−(1/2) ∑_{i=1}^n (xi − µ)²/σ²}.

    ⇒ ln(L(µ, σ²)) = −(n/2) ln(2π) − (n/2) ln(σ²) − (1/(2σ²)) ∑_{i=1}^n (xi − µ)²

    ⇒ (∂/∂µ) ln(L(µ, σ²)) = (1/σ²) ∑_{i=1}^n (xi − µ) ≡ 0,

and so µ̂ = X̄.

ISYE 6739
Trickier MLE Examples

Similarly, take the partial with respect to σ² (not σ),

    (∂/∂σ²) ln(L(µ, σ²)) = −n/(2σ²) + (1/(2σ⁴)) ∑_{i=1}^n (xi − µ̂)² ≡ 0,

and eventually get

    σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄)². □

Remark: Notice how close σ̂² is to the (unbiased) sample variance,

    S² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² = (n/(n−1)) σ̂².

σ̂² is a little bit biased, but it has slightly less variance than S². Anyway, as n gets big, S² and σ̂² become the same.
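
A quick numerical illustration of the n vs. n − 1 point (a sketch assuming numpy; the simulated data are arbitrary):

```python
# The MLE sigma-hat^2 divides by n (ddof=0); S^2 divides by n-1 (ddof=1).
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(loc=5.0, scale=3.0, size=25)

mu_hat = x.mean()                    # MLE of mu
sigma2_hat = x.var(ddof=0)           # MLE of sigma^2 (divide by n)
s2 = x.var(ddof=1)                   # unbiased sample variance (divide by n-1)
print(mu_hat, sigma2_hat, s2, len(x) / (len(x) - 1) * sigma2_hat)  # last two match
```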

ISYE 6739
Trickier MLE Examples

Example: The pdf of the Gamma distribution w/ parameters r and λ is

    f(x) = (λ^r/Γ(r)) x^{r−1} e^{−λx},   x > 0.

Suppose X1, . . . , Xn ∼ iid Gam(r, λ). Find the MLEs for r and λ.

    L(r, λ) = ∏_{i=1}^n f(xi) = (λ^{nr}/[Γ(r)]^n) (∏_{i=1}^n xi)^{r−1} e^{−λ ∑_i xi}

    ⇒ ln(L) = rn ln(λ) − n ln(Γ(r)) + (r − 1) ln(∏_{i=1}^n xi) − λ ∑_{i=1}^n xi

    ⇒ (∂/∂λ) ln(L) = rn/λ − ∑_{i=1}^n xi ≡ 0,

so that λ̂ = r̂/X̄.

ISYE 6739
Trickier MLE Examples

The Trouble in River City is, we need to find r̂. To do so, we have

    (∂/∂r) ln(L) = (∂/∂r)[rn ln(λ) − n ln(Γ(r)) + (r − 1) ln(∏_{i=1}^n xi) − λ ∑_{i=1}^n xi]
                 = n ln(λ) − (n/Γ(r)) (d/dr)Γ(r) + ln(∏_{i=1}^n xi)
                 = n ln(λ) − nΨ(r) + ln(∏_{i=1}^n xi) ≡ 0,

where Ψ(r) ≡ Γ′(r)/Γ(r) is the digamma function.

ISYE 6739
Trickier MLE Examples

At this point, substitute in λ̂ = r̂/X̄, and use a computer method (bisection, Newton’s method, etc.) to search for the value of r that solves

    n ln(r/X̄) − nΨ(r) + ln(∏_{i=1}^n xi) ≡ 0.

The gamma function is readily available in any reasonable software package; but if the digamma function happens to be unavailable in your town, you can take advantage of the approximation

    Γ′(r) ≈ [Γ(r + h) − Γ(r)]/h   (for any small h of your choosing). □
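
One possible implementation of this search (a sketch, assuming scipy’s digamma and brentq root-finder are available; the simulated data, bracketing interval, and seed are our own choices):

```python
# Solve n*ln(r/X-bar) - n*Psi(r) + sum(ln x_i) = 0 for r, with lambda-hat = r-hat/X-bar.
import numpy as np
from scipy.special import digamma
from scipy.optimize import brentq

rng = np.random.default_rng(3)
x = rng.gamma(shape=3.0, scale=1.0 / 2.0, size=1000)   # true r = 3, lambda = 2
n, xbar, sum_log = len(x), x.mean(), np.log(x).sum()

def score(r):
    return n * np.log(r / xbar) - n * digamma(r) + sum_log

r_hat = brentq(score, 0.01, 100.0)   # bracketing interval chosen wide enough for a sign change
lam_hat = r_hat / xbar
print(r_hat, lam_hat)                # should be near 3 and 2
```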

ISYE 6739
Trickier MLE Examples

Example: Suppose X1, . . . , Xn ∼ iid Unif(0, θ). Find the MLE for θ.

The pdf is f(x) = 1/θ, 0 < x < θ (beware of the funny limits). Then

    L(θ) = ∏_{i=1}^n f(xi) = 1/θ^n   if 0 ≤ xi ≤ θ, ∀i.

In order to have L(θ) > 0, we must have 0 ≤ xi ≤ θ, ∀i. In other words, we must have θ ≥ max_i xi.

Subject to this constraint, L(θ) = 1/θ^n is maximized at the smallest possible θ value, namely, θ̂ = max_i Xi.

This makes sense in light of the similar (unbiased) estimator, Y2 = ((n+1)/n) max_i Xi, from a previous lesson. □

Remark: We used very little calculus in this example!

ISYE 6739
Invariance Property of MLEs

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Invariance Property of MLEs

Lesson 2.9 — Invariance Property of MLEs

We can get MLEs of functions of parameters almost for free!

Theorem (Invariance Property): If θ̂ is the MLE of some parameter θ and h(·) is any reasonable function, then h(θ̂) is the MLE of h(θ).

Remark: We noted before that such a property does not hold for unbiasedness. For instance, although E[S²] = σ², it is usually the case that E[√S²] ≠ σ.

Remark: The proof of the Invariance Property is “easy” when h(·) is a one-to-one function. It’s not so easy — but still generally true — when h(·) is nastier.

ISYE 6739
Invariance Property of MLEs

Example: Suppose X1, . . . , Xn ∼ iid Nor(µ, σ²).

We saw that the MLE for σ² is σ̂² = (1/n) ∑_{i=1}^n (Xi − X̄)².

If we consider the function h(y) = +√y, then the Invariance Property says that the MLE of σ is

    σ̂ = √σ̂² = √((1/n) ∑_{i=1}^n (Xi − X̄)²). □

Example: Suppose X1, . . . , Xn ∼ iid Bern(p).

We saw that the MLE for p is p̂ = X̄. Then Invariance says that the MLE for Var(Xi) = p(1 − p) is p̂(1 − p̂) = X̄(1 − X̄). □

ISYE 6739
Invariance Property of MLEs

Example: Suppose X1, . . . , Xn ∼ iid Exp(λ).

We define the survival function as

    F̄(x) = P(X > x) = 1 − F(x) = e^{−λx}.

In addition, we saw that the MLE for λ is λ̂ = 1/X̄.

Then Invariance says that the MLE of F̄(x) is

    e^{−λ̂x} = e^{−x/X̄}.

This kind of thing is used all of the time in the actuarial sciences. □

ISYE 6739
Method of Moments Estimation

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Method of Moments Estimation

Lesson 2.10 — Method of Moments Estimation

Recall that the kth moment of a random variable X is

    µk ≡ E[X^k] = ∑_x x^k f(x) if X is discrete,   or   ∫_ℝ x^k f(x) dx if X is continuous.

Definition: Suppose X1, . . . , Xn are iid random variables. Then the method of moments (MoM) estimator for µk is mk ≡ ∑_{i=1}^n Xi^k / n.

Remark: As n → ∞, the Law of Large Numbers implies that ∑_{i=1}^n Xi^k / n → E[X^k], i.e., mk → µk (so this is a good estimator).

Remark: You should always love your MoM!

ISYE 6739
Method of Moments Estimation

Examples:

The MoM estimator for the true mean µ1 = µ = E[Xi] is the sample mean m1 = X̄ = ∑_{i=1}^n Xi / n.

The MoM estimator for µ2 = E[Xi²] is m2 = ∑_{i=1}^n Xi² / n.

The MoM estimator for Var(Xi) = E[Xi²] − (E[Xi])² = µ2 − µ1² is

    m2 − m1² = (1/n) ∑_{i=1}^n Xi² − X̄² = ((n−1)/n) S².

(For large n, it’s also OK to use S².)

General Game Plan: Express the parameter of interest in terms of the true moments µk = E[X^k]. Then substitute in the sample moments mk.

ISYE 6739
Method of Moments Estimation

Example: Suppose X1, . . . , Xn ∼ iid Pois(λ).

Since λ = E[Xi], a MoM estimator for λ is X̄.

But also note that λ = Var(Xi), so another MoM estimator for λ is ((n−1)/n) S² (or plain old S²). □

Usually use the easier-looking estimator if you have a choice.

Example: Suppose X1, . . . , Xn ∼ iid Nor(µ, σ²).

MoM estimators for µ and σ² are X̄ and ((n−1)/n) S² (or S²), respectively.

For this example, these estimators are the same as the MLEs. □

Let’s finish up with a less-trivial example. . . .

ISYE 6739
Method of Moments Estimation

Example: Suppose X1, . . . , Xn ∼ iid Beta(a, b). The pdf is

    f(x) = (Γ(a + b)/(Γ(a)Γ(b))) x^{a−1} (1 − x)^{b−1},   0 < x < 1.

It turns out (after lots of algebra) that

    E[X] = a/(a + b)   and   Var(X) = ab/((a + b)²(a + b + 1)).

Let’s estimate a and b via MoM.

ISYE 6739
Method of Moments Estimation

We have

    E[X] = a/(a + b)   ⇒   a = b E[X]/(1 − E[X]) ≈ b X̄/(1 − X̄),   (1)

so

    Var(X) = ab/((a + b)²(a + b + 1)) = E[X] b/((a + b)(a + b + 1)).

Plug into the above X̄ for E[X], S² for Var(X), and bX̄/(1 − X̄) for a. Then after lots of algebra, we can solve for b:

    b ≈ (1 − X̄)² X̄ / S² − 1 + X̄.

To finish up, you can plug back into Equation (1) to get the MoM estimator for a.

ISYE 6739
Method of Moments Estimation

Example: Consider the following data set consisting of n = 10 observations that we have obtained from a Beta distribution.

    0.86  0.77  0.84  0.38  0.83  0.54  0.77  0.94  0.37  0.40

We immediately have X̄ = 0.67 and S² = 0.04971. Then the MoM estimators are

    b ≈ (1 − X̄)² X̄ / S² − 1 + X̄ = 1.1377,

and then

    a ≈ b X̄/(1 − X̄) = 2.310. □
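
A minimal plain-Python sketch (not from the slides) reproducing these MoM numbers from the ten observations:

```python
# Beta MoM estimators for the 10 observations above.
data = [0.86, 0.77, 0.84, 0.38, 0.83, 0.54, 0.77, 0.94, 0.37, 0.40]

n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)

b = (1 - xbar) ** 2 * xbar / s2 - 1 + xbar     # MoM estimator for b
a = b * xbar / (1 - xbar)                      # plug back into Equation (1)
print(round(xbar, 2), round(s2, 5), round(b, 4), round(a, 3))  # 0.67 0.04971 1.1377 2.31
```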

ISYE 6739
Sampling Distributions

Outline

1 Introduction to Descriptive Statistics


2 Summarizing Data
3 Candidate Distributions
4 Introduction to Estimation
5 Unbiased Estimation
6 Mean Squared Error
7 Maximum Likelihood Estimation
8 Trickier MLE Examples
9 Invariance Property of MLEs
10 Method of Moments Estimation
11 Sampling Distributions

ISYE 6739
Sampling Distributions

Introduction and Normal Distribution

Goal: Talk about some distributions we’ll need later to do “confidence intervals” (CIs) and “hypothesis tests”: Normal, χ², t, and F.

Definition: Recall that a statistic is just a function of the observations X1, . . . , Xn from a random sample. The function does not depend explicitly on any unknown parameters.

Example: X̄ and S² are statistics, but (X̄ − µ)/σ is not.

Since statistics are RV’s, it’s useful to figure out their distributions.
The distribution of a statistic is called a sampling distribution.

Example: X1, . . . , Xn ∼ iid Nor(µ, σ²) ⇒ X̄ ∼ Nor(µ, σ²/n).

The normal is used to get CIs and do hypothesis tests for µ.

ISYE 6739
Sampling Distributions

χ2 Distribution
Definition/Theorem: If Z1, . . . , Zk ∼ iid Nor(0, 1), then Y ≡ ∑_{i=1}^k Zi² has the chi-squared distribution with k degrees of freedom (df), and we write Y ∼ χ²(k).

The term “df” informally corresponds to the number of “independent pieces of information” you have. For example, if you have RV’s X1, . . . , Xn such that ∑_{i=1}^n Xi = c, a known constant, then you might have n − 1 df, since knowledge of any n − 1 of the Xi’s gives you the remaining Xi.

We also informally “lose” a degree of freedom every time we have to estimate a parameter. For instance, if we have access to n observations, but have to estimate two parameters µ and σ², then we might only end up with n − 2 df.

In reality, df corresponds to the number of dimensions of a certain space (not covered in this course)!

ISYE 6739
Sampling Distributions

The pdf of the chi-squared distribution is

    f_Y(y) = (1/(2^{k/2} Γ(k/2))) y^{k/2 − 1} e^{−y/2},   y > 0.

Fun Facts: Can show that E[Y] = k, and Var(Y) = 2k.

The exponential distribution is a special case of the chi-squared distribution. In fact, χ²(2) ∼ Exp(1/2).

Proof: Just plug k = 2 into the pdf. □

For k > 2, the χ2 (k) pdf is skewed to the right. (You get an occasional
“large” observation.)

For large k, the χ2 (k) is approximately normal (by the CLT).

ISYE 6739
Sampling Distributions

Definition: The (1 − α) quantile of a RV X is that value xα such that P(X > xα) = 1 − F(xα) = α. Note that xα = F^{−1}(1 − α), where F^{−1}(·) is the inverse cdf of X.

Notation: If Y ∼ χ²(k), then we denote the (1 − α) quantile with the special symbol χ²_{α,k} (instead of xα). In other words, P(Y > χ²_{α,k}) = α.
You can look up χ²_{α,k}, e.g., in a table at the back of the book or via the Excel function CHISQ.INV(1 − α, k).

Example: If Y ∼ χ²(10), then

    P(Y > χ²_{0.05,10}) = 0.05,

where we can look up χ²_{0.05,10} = 18.31. □
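
If you prefer software to tables, the same quantile comes straight from the inverse cdf; for example, a minimal sketch assuming scipy is installed:

```python
# The (1 - alpha) quantile of chi^2(10), i.e., chi^2_{0.05,10}.
from scipy.stats import chi2

print(chi2.ppf(0.95, df=10))        # about 18.307
print(1 - chi2.cdf(18.307, df=10))  # about 0.05, i.e., P(Y > 18.307) = 0.05
```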

ISYE 6739
Sampling Distributions

Theorem: χ²’s add up. If Y1, . . . , Yn are independent with Yi ∼ χ²(di) for all i, then ∑_{i=1}^n Yi ∼ χ²(∑_{i=1}^n di).

Proof: Just use mgf’s. Won’t go thru it here. □

So where does the χ² distribution come up in statistics?

It usually arises when we try to estimate σ².

Example: If X1, . . . , Xn ∼ iid Nor(µ, σ²), then, as we’ll show in the next module,

    S² = (1/(n−1)) ∑_{i=1}^n (Xi − X̄)² ∼ σ² χ²(n − 1)/(n − 1). □

ISYE 6739
Sampling Distributions

t Distribution

Definition/Theorem: Suppose that Z ∼ Nor(0, 1), Y ∼ χ²(k), and Z and Y are independent. Then T ≡ Z/√(Y/k) has the Student t distribution with k degrees of freedom, and we write T ∼ t(k).

The pdf is

    f_T(x) = [Γ((k+1)/2) / (√(πk) Γ(k/2))] (x²/k + 1)^{−(k+1)/2},   x ∈ ℝ.

Fun Facts: The t(k) looks like the Nor(0,1), except the t has fatter tails.

The k = 1 case gives the Cauchy distribution, which has really fat tails.

As the degrees of freedom k becomes large, t(k) → Nor(0, 1).

Can show that E[T] = 0 for k > 1, and Var(T) = k/(k − 2) for k > 2.
ISYE 6739
Sampling Distributions

Notation: If T ∼ t(k), then we denote the (1 − α) quantile by t_{α,k}. In other words, P(T > t_{α,k}) = α.

Example: If T ∼ t(10), then P(T > t_{0.05,10}) = 0.05, where we find t_{0.05,10} = 1.812 in the back of the book or via the Excel function T.INV(1 − α, k). □

Remarks: So what do we use the t distribution for in statistics?

It’s used when we find confidence intervals and conduct hypothesis tests for the mean µ. Stay tuned.

By the way, why did I originally call it the Student t distribution?

“Student” is the pseudonym of the guy (William Gosset) who first derived it. Gosset was a statistician at the Guinness Brewery.

ISYE 6739
Sampling Distributions

F Distribution

Definition/Theorem: Suppose that X ∼ χ²(n), Y ∼ χ²(m), and X and Y are independent. Then F ≡ (X/n)/(Y/m) = mX/(nY) has the F distribution with n and m df, denoted F ∼ F(n, m).

The pdf is

    f_F(x) = [Γ((n+m)/2) (n/m)^{n/2} x^{n/2 − 1}] / [Γ(n/2) Γ(m/2) ((n/m)x + 1)^{(n+m)/2}],   x > 0.

Fun Facts: The F(n, m) is usually a bit skewed to the right.

Note that you have to specify two df’s.

Can show that E[F] = m/(m − 2) (m > 2), and Var(F) = blech.

t distribution is a special case — can you figure out which?


ISYE 6739
Sampling Distributions

Notation: If F ∼ F(n, m), then we denote the (1 − α) quantile by F_{α,n,m}. That is, P(F > F_{α,n,m}) = α.

Tables can be found in the back of the book for various α, n, m, or you can use the Excel function F.INV(1 − α, n, m).

Example: If F ∼ F(5, 10), then P(F > F_{0.05,5,10}) = 0.05, where we find F_{0.05,5,10} = 3.326. □

Remarks: It can be shown that F_{1−α,m,n} = 1/F_{α,n,m}. Use this fact if you have to find something like F_{0.95,10,5} = 1/F_{0.05,5,10} = 1/3.326.
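
The same lookups, and the reciprocal identity, can be checked with scipy’s inverse cdf (a minimal sketch, assuming scipy is available):

```python
# F quantile lookup and the identity F_{1-alpha,m,n} = 1/F_{alpha,n,m}.
from scipy.stats import f

f_05_5_10 = f.ppf(0.95, dfn=5, dfd=10)               # about 3.326
print(f_05_5_10)
print(f.ppf(0.05, dfn=10, dfd=5), 1.0 / f_05_5_10)   # both about 0.3007
```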

So what do we use the F distribution for in statistics?

It’s used when we find confidence intervals and conduct hypothesis tests for
the ratio of variances from two different processes. Details later.

ISYE 6739
