
Stat2602 Probability and Statistics II Fall 2014-2015

Chapter III Point Estimation


§ 3.1 Statistical Model

The reality, or the experienced world, is a complex of infinite events, forever interacting, changing and developing toward novel qualities and outcomes. Statistical analysis often relies on a simplification of the complex system from which the data arise. A statistical model is a simplified description of the complex system, based on a mathematical formulation, that permits explanation and prediction.

Definition

A statistical model consists of

1. A random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$. The space $\mathcal{X}$ of $\mathbf{X}$ is called the sample space.

2. An unknown constant parameter vector $\boldsymbol{\theta} = (\theta_1, \ldots, \theta_p) \in \Theta$. The space $\Theta$ of $\boldsymbol{\theta}$ is called the parameter space.

3. A function $f(\mathbf{x}; \boldsymbol{\theta})$ which represents the probability density (mass) function of $\mathbf{X}$ for each $\boldsymbol{\theta}$.

Example 3.1

Suppose that we draw a sample of size $n = 20$ from a population with unknown mean and variance. As a model for investigation, we may assume that it is a random sample with $X_1, X_2, \ldots, X_{20} \overset{iid}{\sim} N(\mu, \sigma^2)$, where $\mu$ and $\sigma^2$ are unknown constants. Then $\mathbf{X} = (X_1, X_2, \ldots, X_{20})$, $\boldsymbol{\theta} = (\mu, \sigma)$, $p = 2$, and
$$f(\mathbf{x}; \boldsymbol{\theta}) = \prod_{i=1}^{20} \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) = (2\pi\sigma^2)^{-10}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^{20}(x_i - \mu)^2\right).$$
The sample space for this model is $\mathcal{X} = \mathbb{R}^{20}$; and the parameter space is $\Theta = \mathbb{R} \times (0, \infty)$.


Example 3.2

Suppose that in an election of governor for a large city, there are 3 candidates: Ada, Bob, and Carter. Denote by $\theta_1$, $\theta_2$, and $\theta_3$ respectively the proportions of the votes they will receive, where $\theta_1, \theta_2, \theta_3$ are unknown constants. An opinion poll asks a sample of 100 voters whom they would vote for, and observes the counts $X_1, X_2, X_3$. Then $\mathbf{X} = (X_1, X_2, X_3)$, $\boldsymbol{\theta} = (\theta_1, \theta_2, \theta_3)$, and $p = 3$. Since the voting population is assumed to be very large, it is reasonable and permissible to think of the probabilities as unchanging once a voter is selected for the sample. Hence we may assume the following trinomial pmf for $\mathbf{X}$:
$$f(\mathbf{x}; \boldsymbol{\theta}) = \frac{100!}{x_1!\,x_2!\,x_3!}\,\theta_1^{x_1}\theta_2^{x_2}\theta_3^{x_3}.$$
The sample space for this model is $\mathcal{X} = \left\{(x_1, x_2, x_3) \in \{0, \ldots, 100\}^3 \mid x_1 + x_2 + x_3 = 100\right\}$; and the parameter space is $\Theta = \left\{(\theta_1, \theta_2, \theta_3) \in (0,1)^3 \mid \theta_1 + \theta_2 + \theta_3 = 1\right\}$. Note that $X_1, X_2, X_3$ are not independent in this model.

Remark

“Essentially, all models are wrong, but some are useful.” – George E.P. Box

Always keep in mind that a statistical model serves only as a simplification of a complex system. It may be a good approximation, but it is never exact. It is almost impossible for a single model to describe a real-world phenomenon perfectly. Nevertheless, successful statistical inference does happen, because our world has a certain degree of consistency that we can exploit. So our almost-always-wrong models do prove useful in most situations.

Definition

A function $\tau = \tau(\boldsymbol{\theta})$ of the unknown parameter vector $\boldsymbol{\theta}$ is a parameter.

Definition

A function $T = T(\mathbf{X})$ of the observed random vector $\mathbf{X}$ is a statistic.

Note that $T$ may be vector-valued, i.e. it may contain one or more components.

Example 3.3

Consider the statistical model in Example 3.1: $X_1, X_2, \ldots, X_{20} \overset{iid}{\sim} N(\mu, \sigma^2)$.

Obviously, both the population mean $\mu$ and the population variance $\sigma^2$ are parameters, as they are functions of $\boldsymbol{\theta} = (\mu, \sigma)$. Besides, since the coefficient of variation $\sigma/\mu$ is a function of $\boldsymbol{\theta} = (\mu, \sigma)$, it is also a parameter.

On the other hand, the sample mean $\bar{X}$ and sample variance $S^2$ are statistics, as they are functions of $\mathbf{X} = (X_1, X_2, \ldots, X_{20})$. Moreover, the sample minimum $X_{(1)}$, sample maximum $X_{(20)}$, sample median $M = \left(X_{(10)} + X_{(11)}\right)/2$, etc., are also statistics.

Note that the sample itself is just an identity function of the sample: $\mathbf{X} = T(\mathbf{X})$; therefore the original sample $(X_1, X_2, \ldots, X_{20})$ is also a statistic.

§ 3.2 Point Estimator

One primary goal of statistical analysis is to draw inferences about the unknown parameters using observed statistics. Point estimation is the process of using a statistic to calculate a single value which is to serve as a "guess" of the unknown parameter value.

Definition

A statistic used for estimating the value of an unknown parameter $\tau$ is called a point estimator of $\tau$. It is often denoted by $\hat{\tau}$. A point estimate is a number calculated by applying the estimator $\hat{\tau}$ to the sample data.

Note that the notations $\hat{\tau}$ and $\tau$ refer to different quantities, with $\hat{\tau} = T(\mathbf{X})$ being the sample statistic that can be observed from the sample, while $\tau$ is the population parameter that is unobservable.

Roughly speaking, the parameter $\tau$ is the target; the estimator $\hat{\tau}$ is the tool; and the estimate is the result.


Example 3.4

Consider the statistical model in Example 3.1: $X_1, X_2, \ldots, X_{20} \overset{iid}{\sim} N(\mu, \sigma^2)$.

A commonly used point estimator of $\mu$ is the sample mean: $\hat{\mu} = \bar{X}$.

Suppose that the following sample was obtained:

80 108 98 97 102 105 92 101 87 89
109 93 74 80 107 126 104 123 102 107

From the data, $\bar{x} = 99.2$. Hence a point estimate of the population mean $\mu$ is $\hat{\mu} = 99.2$. Similarly, using the sample standard deviation $\hat{\sigma} = S$ as the point estimator of $\sigma$, a point estimate is obtained as $\hat{\sigma} = 13.33$.

Note that $\mu$ is also the population median for this normal model. We may also use the sample median to estimate the value of $\mu$. In general, for any parameter there may be many different estimation methods.
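These estimates can be reproduced with a few lines of Python (a minimal sketch using only the standard library; the data values are those listed above):

```python
import statistics

# The twenty observations from Example 3.4
data = [80, 108, 98, 97, 102, 105, 92, 101, 87, 89,
        109, 93, 74, 80, 107, 126, 104, 123, 102, 107]

x_bar = statistics.mean(data)    # point estimate of the population mean mu
s = statistics.stdev(data)       # sample standard deviation (divisor n - 1)

print(x_bar)        # 99.2
print(round(s, 2))  # 13.33
```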

Example 3.5

Suppose that we can observe some random points which are known to be uniformly distributed from 0 to an unknown constant $\theta$. Then we have the following statistical model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$, with parameter space $\Theta = (0, \infty)$.

A reasonable estimator of $\theta$ is the maximum of the sample, i.e. $\hat{\theta} = X_{(n)}$.

However, since the population mean is $\theta/2$, it is also reasonable to estimate $\theta$ by using the sample mean, i.e. $\hat{\theta} = 2\bar{X}$.

In order to compare the performance of different estimators, some desirable properties of estimators are discussed below.


§ 3.2.1 Bias of Estimator

Definition

The bias of an estimator $\hat{\tau} = T(\mathbf{X})$ of the parameter $\tau = \tau(\boldsymbol{\theta})$ is defined as
$$\operatorname{bias}(\hat{\tau}) = E(\hat{\tau}) - \tau.$$
Note that the expectation is taken with respect to the sampling distribution of $\hat{\tau} = T(\mathbf{X})$. We say that $\hat{\tau}$ overestimates $\tau$ if $\operatorname{bias}(\hat{\tau}) > 0$, and underestimates $\tau$ if $\operatorname{bias}(\hat{\tau}) < 0$.

Unbiased Estimator

The estimator $\hat{\tau}$ is called an unbiased estimator of the parameter $\tau$ if
$$E(\hat{\tau}) = \tau \quad \text{for all values of } \boldsymbol{\theta} \in \Theta,$$
i.e. the estimator is unbiased if $\operatorname{bias}(\hat{\tau}) = 0$ for any $\boldsymbol{\theta} \in \Theta$.

Example 3.6

Suppose we have a random sample $(X_1, X_2, \ldots, X_n)$ from a distribution with mean $\mu$ and variance $\sigma^2$.

1. The sample mean is an unbiased estimator of the population mean $\mu$, as $E(\bar{X}) = \mu$ for all $\mu$.

2. The sample variance $S^2 = \frac{1}{n-1}\sum_{i=1}^n \left(X_i - \bar{X}\right)^2$ is an unbiased estimator of the population variance $\sigma^2$, as $E(S^2) = \sigma^2$ for all $\sigma^2$.

Note that usually $E(S) \ne \sigma$, and hence $S$ is a biased estimator of $\sigma$. In fact, $S$ often underestimates $\sigma$.

Example 3.7

Consider the statistical model in Example 3.5: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$.

Consider the two estimators
$$\hat{\theta}_1 = 2\bar{X} \quad \text{and} \quad \hat{\theta}_2 = X_{(n)}.$$
Obviously $\hat{\theta}_1$ is unbiased, as
$$E(\hat{\theta}_1) = 2E(X) = 2 \cdot \frac{\theta}{2} = \theta \quad \text{for all } \theta.$$
Using the formula in Section 2.1.4 for order statistics, the pdf of $\hat{\theta}_2$ is obtained as
$$f_{\hat{\theta}_2}(y) = n\left(\frac{y}{\theta}\right)^{n-1}\frac{1}{\theta} = \frac{n y^{n-1}}{\theta^n}, \quad 0 < y < \theta.$$
Hence the expected value of $\hat{\theta}_2$ is
$$E(\hat{\theta}_2) = \int_0^\theta \frac{n y^n}{\theta^n}\,dy = \left[\frac{n y^{n+1}}{(n+1)\theta^n}\right]_0^\theta = \frac{n\theta}{n+1}.$$
Therefore $\hat{\theta}_2 = X_{(n)}$ is biased and the bias is given by
$$\operatorname{bias}(\hat{\theta}_2) = E(\hat{\theta}_2) - \theta = \frac{n\theta}{n+1} - \theta = -\frac{\theta}{n+1} < 0,$$
i.e. $\hat{\theta}_2$ underestimates $\theta$. However, if we use
$$\hat{\theta}_3 = \frac{n+1}{n} X_{(n)},$$
then $\hat{\theta}_3$ is an unbiased estimator of $\theta$.


Example 3.8

Consider the binomial model: $X \sim b(n, p)$, $p \in \Theta = (0, 1)$.

An unbiased estimator of $p$ is the sample proportion $\hat{p} = \dfrac{X}{n}$, as
$$E(\hat{p}) = \frac{E(X)}{n} = \frac{np}{n} = p \quad \text{for all } p \in (0, 1).$$
However, $\hat{p}^2 = \dfrac{X^2}{n^2}$ is not unbiased for $p^2$, as
$$E(\hat{p}^2) = \operatorname{Var}(\hat{p}) + \left[E(\hat{p})\right]^2 = \frac{p(1-p)}{n} + p^2 \ne p^2.$$
To find an unbiased estimator of $p^2$, we can modify $\hat{p}^2$ by considering
$$E(\hat{p}^2) = \frac{p(1-p)}{n} + p^2 = \frac{p}{n} + \frac{n-1}{n}p^2$$
$$\Rightarrow \frac{n}{n-1}\left(E(\hat{p}^2) - \frac{E(\hat{p})}{n}\right) = p^2$$
$$\Rightarrow E\left[\frac{n}{n-1}\left(\hat{p}^2 - \frac{\hat{p}}{n}\right)\right] = p^2.$$
Hence an unbiased estimator of $\tau = p^2$ can be obtained as
$$\hat{\tau} = \frac{n}{n-1}\left(\hat{p}^2 - \frac{\hat{p}}{n}\right) = \frac{n}{n-1}\left(\frac{X^2}{n^2} - \frac{X}{n^2}\right) = \frac{X(X-1)}{n(n-1)}.$$

Example 3.9

Consider the exponential model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Exponential}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

Suppose that we are interested in the parameter $\tau = e^{-\lambda}$.

To find an unbiased estimator of $\tau$, consider the fact that for any $i = 1, 2, \ldots, n$, we have $\tau = e^{-\lambda} = P(X_i > 1)$.


Hence we may define the following indicator variable:
$$I(X_i > 1) = \begin{cases} 1 & \text{if } X_i > 1, \\ 0 & \text{otherwise.} \end{cases}$$
Then
$$E\left[I(X_i > 1)\right] = 1 \cdot P(X_i > 1) + 0 \cdot P(X_i \le 1) = P(X_i > 1) = \tau.$$
Therefore an unbiased estimator can be obtained as
$$\hat{\tau} = \frac{1}{n}\sum_{i=1}^n I(X_i > 1) = \frac{\text{number of } X_i\text{'s such that } X_i > 1}{n},$$
which is the sample proportion of the sample data with values greater than one.

Definition

A sequence of estimators $\hat{\tau}_n$ of $\tau$ is said to be asymptotically unbiased if
$$\lim_{n\to\infty} \operatorname{bias}(\hat{\tau}_n) = 0.$$

Example 3.10

In Example 3.7, $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$ and $\hat{\theta}_2 = X_{(n)}$ is asymptotically unbiased for $\theta$, as
$$\lim_{n\to\infty} \operatorname{bias}(\hat{\theta}_2) = \lim_{n\to\infty}\left(-\frac{\theta}{n+1}\right) = 0.$$
In Example 3.8, $X \sim b(n, p)$ and $\hat{p}^2 = X^2/n^2$ is asymptotically unbiased for $p^2$, as
$$\lim_{n\to\infty} \operatorname{bias}(\hat{p}^2) = \lim_{n\to\infty} \frac{p(1-p)}{n} = 0.$$


§ 3.2.2 Precision, Mean Square Error, and Efficiency of Estimator

Unbiasedness is a desirable property to seek in an estimator. However, if an estimator $\hat{\tau}_1$ is consistently closer to $\tau$ than another estimator $\hat{\tau}_2$ in repeated samples of the same size, $\hat{\tau}_1$ would be preferred to $\hat{\tau}_2$ even when $\hat{\tau}_1$ is biased.

Definition

The Mean Square Error (MSE) of an estimator $\hat{\tau}$ of the parameter $\tau$ is defined as
$$\operatorname{MSE}(\hat{\tau}) = E\left[(\hat{\tau} - \tau)^2\right].$$
In words, the MSE of an estimator is its mean squared estimation error.

Remark

The MSE of an estimator can be expressed in terms of its bias and variance:
$$\begin{aligned}
\operatorname{MSE}(\hat{\tau}) &= E\left[(\hat{\tau} - \tau)^2\right] \\
&= E\left[\left(\hat{\tau} - E(\hat{\tau}) + E(\hat{\tau}) - \tau\right)^2\right] \\
&= E\left[\left(\hat{\tau} - E(\hat{\tau})\right)^2\right] + \left(E(\hat{\tau}) - \tau\right)^2 + 2\left(E(\hat{\tau}) - \tau\right)E\left[\hat{\tau} - E(\hat{\tau})\right] \\
&= \operatorname{Var}(\hat{\tau}) + \left[\operatorname{bias}(\hat{\tau})\right]^2.
\end{aligned}$$
Therefore the MSE can be regarded as a measure combining the precision and bias of an estimator. In particular, if $\hat{\tau}$ is unbiased, then
$$\operatorname{MSE}(\hat{\tau}) = \operatorname{Var}(\hat{\tau}).$$

Efficiency

Suppose that there are two estimators $\hat{\tau}_1, \hat{\tau}_2$ of a parameter $\tau$. The ratio
$$\operatorname{eff}(\hat{\tau}_1, \hat{\tau}_2) = \frac{\operatorname{MSE}(\hat{\tau}_2)}{\operatorname{MSE}(\hat{\tau}_1)}$$
is called the efficiency of $\hat{\tau}_1$ compared to $\hat{\tau}_2$.


The estimator $\hat{\tau}_1$ is said to be better (or formally, more efficient) than the estimator $\hat{\tau}_2$ for estimating $\tau$ if $\operatorname{eff}(\hat{\tau}_1, \hat{\tau}_2) \ge 1$ for all possible values of $\boldsymbol{\theta}$, provided that strict inequality holds for at least one value of $\boldsymbol{\theta}$.

Example 3.11

Consider Example 3.7 again. Based on a random sample $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$, we have the following three estimators of $\theta$:
$$\hat{\theta}_1 = 2\bar{X} \quad \text{(unbiased)};$$
$$\hat{\theta}_2 = X_{(n)} \quad \text{(biased, with } \operatorname{bias}(\hat{\theta}_2) = -\tfrac{\theta}{n+1}\text{)};$$
$$\hat{\theta}_3 = \frac{n+1}{n} X_{(n)} \quad \text{(unbiased)}.$$
Now consider their MSEs. For $\hat{\theta}_1$,
$$\operatorname{MSE}(\hat{\theta}_1) = \operatorname{Var}(\hat{\theta}_1) = 4\operatorname{Var}(\bar{X}) = \frac{4}{n}\cdot\frac{\theta^2}{12} = \frac{\theta^2}{3n}.$$
For $\hat{\theta}_2$,
$$E(\hat{\theta}_2^2) = \int_0^\theta \frac{n y^{n+1}}{\theta^n}\,dy = \left[\frac{n y^{n+2}}{(n+2)\theta^n}\right]_0^\theta = \frac{n\theta^2}{n+2},$$
$$\operatorname{Var}(\hat{\theta}_2) = \frac{n\theta^2}{n+2} - \left(\frac{n\theta}{n+1}\right)^2 = \frac{n\theta^2}{(n+1)^2(n+2)},$$
and hence
$$\operatorname{MSE}(\hat{\theta}_2) = \left[\operatorname{bias}(\hat{\theta}_2)\right]^2 + \operatorname{Var}(\hat{\theta}_2) = \frac{\theta^2}{(n+1)^2} + \frac{n\theta^2}{(n+1)^2(n+2)} = \frac{2\theta^2}{(n+1)(n+2)}.$$
Comparing $\hat{\theta}_2$ to $\hat{\theta}_1$,
$$\operatorname{eff}(\hat{\theta}_2, \hat{\theta}_1) = \frac{\operatorname{MSE}(\hat{\theta}_1)}{\operatorname{MSE}(\hat{\theta}_2)} = \frac{(n+1)(n+2)}{6n},$$
which is greater than 1 for any value of $\theta$ when $n > 2$. Therefore, based on a sample of size larger than two, the estimator $\hat{\theta}_2$ is more efficient than $\hat{\theta}_1$ even though $\hat{\theta}_2$ is biased.


Now for $\hat{\theta}_3$,
$$\operatorname{MSE}(\hat{\theta}_3) = \operatorname{Var}(\hat{\theta}_3) = \left(\frac{n+1}{n}\right)^2 \operatorname{Var}(\hat{\theta}_2) = \frac{(n+1)^2}{n^2}\cdot\frac{n\theta^2}{(n+1)^2(n+2)} = \frac{\theta^2}{n(n+2)}.$$
Comparing $\hat{\theta}_3$ to $\hat{\theta}_2$,
$$\operatorname{eff}(\hat{\theta}_3, \hat{\theta}_2) = \frac{\operatorname{MSE}(\hat{\theta}_2)}{\operatorname{MSE}(\hat{\theta}_3)} = \frac{2n}{n+1},$$
which is greater than 1 for any value of $\theta$ when $n > 1$. Therefore, based on a sample of size larger than one, the estimator $\hat{\theta}_3$ is more efficient than $\hat{\theta}_2$ and is therefore the best among these three estimators.
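The three MSE formulas can be verified empirically (a sketch with $\theta = 1$ and $n = 10$, both arbitrary):

```python
import random

theta, n, reps = 1.0, 10, 100_000
mse1 = mse2 = mse3 = 0.0
for _ in range(reps):
    x = [random.uniform(0, theta) for _ in range(n)]
    m = max(x)
    mse1 += (2 * sum(x) / n - theta) ** 2    # theta_hat_1 = 2 * mean
    mse2 += (m - theta) ** 2                 # theta_hat_2 = max
    mse3 += ((n + 1) / n * m - theta) ** 2   # theta_hat_3 = (n+1)/n * max

# Theoretical values: 1/30 = 0.0333, 2/132 = 0.0152, 1/120 = 0.0083
print(mse1 / reps, mse2 / reps, mse3 / reps)
```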

Example 3.12

Suppose that for a particular statistical model, we use $\hat{\theta}_1 = 2$ as an estimator of the unknown parameter $\theta$, i.e. we estimate the value of $\theta$ by 2 no matter what we observe from the sample. The mean square error of such an estimator is $(\theta - 2)^2$, which equals 0 if $\theta = 2$. Although this estimator is rather silly, it would be the "best" if $\theta = 2$.

On the other hand, we may use $\hat{\theta}_2 = 5$ as the estimator. Then it will be the best if $\theta = 5$. Therefore, comparing $\hat{\theta}_1$ and $\hat{\theta}_2$, neither can outperform the other for all values of $\theta$.

As can be seen from this simple example, it is impossible to find a "best" estimator with minimum MSE for all $\theta$. In searching for a "best" estimator, we often impose some desirable constraints, such as unbiasedness.

Definition

An unbiased estimator $\hat{\tau} = T(\mathbf{X})$ is said to be the best unbiased estimator of $\tau = \tau(\boldsymbol{\theta})$ if it is better than any other unbiased estimator of $\tau$.

Methods for finding the best unbiased estimator will be discussed in later sections.


§ 3.2.3 Consistency of Estimator

Since an estimator $\hat{\tau} = T(\mathbf{X})$ is a function of the random sample, it depends on the sample size $n$. Sometimes we denote it by $\hat{\tau}_n$ to emphasize this dependence. A good estimator should have the property that when we observe a very large sample, the observed value of $\hat{\tau}_n$ is arbitrarily close to the true value $\tau$.

Definition

The estimator $\hat{\tau}_n$ is said to be a consistent estimator of $\tau$ if for any given $\varepsilon > 0$ we have
$$\lim_{n\to\infty} P\left(\left|\hat{\tau}_n - \tau\right| \le \varepsilon\right) = 1.$$
In other words, $\hat{\tau}_n$ is consistent for $\tau$ if $\hat{\tau}_n \overset{P}{\to} \tau$.

Remarks

1. If $\hat{\tau}_n$ is an estimator of $\tau$ with $\lim_{n\to\infty} \operatorname{MSE}(\hat{\tau}_n) = 0$, then $\hat{\tau}_n$ is a consistent estimator of $\tau$. (The converse is not necessarily true.)

Proof

By Markov's inequality, for any $\varepsilon > 0$,
$$P\left(\left|\hat{\tau}_n - \tau\right| > \varepsilon\right) = P\left(\left(\hat{\tau}_n - \tau\right)^2 > \varepsilon^2\right) \le \frac{E\left[(\hat{\tau}_n - \tau)^2\right]}{\varepsilon^2} \to 0 \quad \text{as } n \to \infty.$$

2. If $\hat{\tau}_n$ is an unbiased estimator of $\tau$ with $\lim_{n\to\infty} \operatorname{Var}(\hat{\tau}_n) = 0$, then $\hat{\tau}_n$ is a consistent estimator of $\tau$.

3. If $\hat{\tau}_n$ is an estimator of $\tau$ with $\lim_{n\to\infty} \operatorname{bias}(\hat{\tau}_n) = 0$ and $\lim_{n\to\infty} \operatorname{Var}(\hat{\tau}_n) = 0$, then $\hat{\tau}_n$ is a consistent estimator of $\tau$.


Example 3.13
Consider the three estimators of $\theta$ in the statistical model $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$ described in Example 3.7: $\hat{\theta}_1 = 2\bar{X}$, $\hat{\theta}_2 = X_{(n)}$, $\hat{\theta}_3 = (n+1)X_{(n)}/n$. From Example 3.11, their MSEs were derived as follows:
$$\operatorname{MSE}(\hat{\theta}_1) = \frac{\theta^2}{3n}, \quad \operatorname{MSE}(\hat{\theta}_2) = \frac{2\theta^2}{(n+1)(n+2)}, \quad \operatorname{MSE}(\hat{\theta}_3) = \frac{\theta^2}{n(n+2)}.$$
Since these MSEs all converge to zero as $n$ tends to infinity, all three estimators are consistent.

Example 3.14
Suppose that X 1 , X 2 ,..., X n  is a random sample drawn from a population with
finite mean  and finite variance  2 . Then by the law of large numbers, we have

X

P
 and S

P
,

and hence ̂  X and ̂  S are consistent estimators of  and  respectively.

Now suppose that we randomly draw a point uniformly from 0 to 1 and denote it as
U, i.e. U ~ U 0,1 . Then consider the following estimator of  :

n if U  1 n ,
ˆ '  
X otherwise.

This estimator is not asymptotically unbiased as

 1  1
 n  1   E  X   1  1   
1
E ˆ '  
n  n  n
 
 lim biasˆ '   lim1    1  0 ,
n  n 
 n

and hence lim MSE ˆ '   0 . However, for any   0 ,


n 

P  n       1   P  X     
 1
P  ˆ '     
1
n  n

Since lim P  X       0 (law of large number), we have lim P  ˆ '      0


n  n

and hence ̂ ' is also a consistent estimator of  .
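The behaviour of $\hat{\mu}'$ can be illustrated numerically: the estimator occasionally returns the useless value $n$, but that happens with probability $1/n$, which vanishes, so the probability of landing within $\varepsilon$ of $\mu$ still tends to one. A sketch (normal data with $\mu = 3$, $\sigma = 1$, $\varepsilon = 0.5$, all illustrative choices):

```python
import random

mu, sigma, eps = 3.0, 1.0, 0.5
for n in (10, 100, 1000):
    reps, hits = 2000, 0
    for _ in range(reps):
        x_bar = sum(random.gauss(mu, sigma) for _ in range(n)) / n
        est = n if random.random() <= 1 / n else x_bar  # the estimator mu_hat'
        hits += abs(est - mu) <= eps
    print(n, hits / reps)  # approaches 1 even though E(mu_hat') -> mu + 1
```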


§ 3.3 Method of Moment Estimator (MME)

Method of Moment

Equate the moments of the random variable to the sample moments and then solve the resulting equations for the parameters.
$$r\text{th moment of the distribution:} \quad E(X^r) = \text{function of } \boldsymbol{\theta}$$
$$r\text{th sample moment:} \quad m_r = \frac{1}{n}\sum_{i=1}^n X_i^r$$
If we have one parameter only, equate $E(X) = g(\theta)$ to $\bar{X}$ and then solve for $\theta$. The solution gives the method of moment estimator of $\theta$.

Example 3.15


In Example 3.7, $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$, we have $E(X) = \theta/2$. The MME of $\theta$ can be obtained by solving the equation
$$\bar{X} = \frac{\hat{\theta}}{2},$$
resulting in $\hat{\theta} = 2\bar{X}$, which is the $\hat{\theta}_1$ in Example 3.7.

Example 3.16

In Program Evaluation and Review Technique (PERT) analysis, time requirements for component tasks in large projects are usually modelled as
$$T = a + bX$$
where $a$ is the best-case requirement, $a + b$ is the worst-case requirement, and $X$ is a continuous random variable between 0 and 1, usually modelled by the beta distribution $\text{Beta}(\alpha, \beta)$, where $\alpha$ and $\beta$ are the unknown parameters.

Suppose we observed $T_1, T_2, \ldots, T_n$ from past experience. Then we can compute $X_1, X_2, \ldots, X_n$ by $X_i = \dfrac{1}{b}(T_i - a)$.


Since there are two unknown parameters, we need two equations based on the first two moments:
$$E(X) = \frac{\alpha}{\alpha + \beta}, \quad E(X^2) = \frac{\alpha(\alpha + 1)}{(\alpha + \beta)(\alpha + \beta + 1)}.$$
The MMEs of $\alpha$ and $\beta$ can be obtained by solving
$$m_1 = \bar{X} = \frac{\hat{\alpha}}{\hat{\alpha} + \hat{\beta}}, \quad m_2 = \frac{1}{n}\sum_{i=1}^n X_i^2 = \frac{\hat{\alpha}(\hat{\alpha} + 1)}{(\hat{\alpha} + \hat{\beta})(\hat{\alpha} + \hat{\beta} + 1)},$$
which gives
$$\hat{\alpha} = \frac{m_1(m_1 - m_2)}{m_2 - m_1^2}, \quad \hat{\beta} = \frac{(1 - m_1)(m_1 - m_2)}{m_2 - m_1^2}.$$
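A small helper implementing these closed-form MMEs (a sketch; the sample values are made up for illustration):

```python
def beta_mme(xs):
    """Method of moment estimates (alpha_hat, beta_hat) for Beta data in (0, 1)."""
    n = len(xs)
    m1 = sum(xs) / n                  # first sample moment
    m2 = sum(x * x for x in xs) / n   # second sample moment
    denom = m2 - m1 ** 2              # sample analogue of the variance
    return (m1 * (m1 - m2) / denom,
            (1 - m1) * (m1 - m2) / denom)

# Illustration with hypothetical task-time fractions X_i = (T_i - a) / b
xs = [0.21, 0.35, 0.28, 0.50, 0.42, 0.33, 0.27, 0.45, 0.38, 0.30]
print(beta_mme(xs))
```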

Example 3.17

Consider the exponential model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Exponential}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

Using the first moment $E(X) = 1/\lambda$, the MME of $\lambda$ is easily obtained as $\hat{\lambda} = 1/\bar{X}$. Similarly, the MME of $\tau = e^{-\lambda}$ is given by $\hat{\tau} = e^{-1/\bar{X}}$.

Remarks

1. Method of moment estimators may not be unbiased, even when the sample size is large.

2. Method of moment estimators may not have the least variance, even when the sample size is large.

3. Although a method of moment estimator may not be the best, it is easy to find and can therefore be used as a first guess at the values of the parameters.


§ 3.4 Maximum Likelihood Estimator (MLE)

Definition

For a statistical model with random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$, parameter vector $\boldsymbol{\theta} \in \Theta$, and probability density (mass) function $f(\mathbf{x}; \boldsymbol{\theta})$, the likelihood function is defined as
$$L(\boldsymbol{\theta}; \mathbf{x}) = f(\mathbf{x}; \boldsymbol{\theta}),$$
which is a function of the parameter vector $\boldsymbol{\theta}$, based on the given values of $\mathbf{X}$.

The logarithm of $L$ is called the log-likelihood function:
$$\ell(\boldsymbol{\theta}; \mathbf{x}) = \log L(\boldsymbol{\theta}; \mathbf{x}).$$
In particular, if $(X_1, X_2, \ldots, X_n)$ is a random sample (iid) with common pdf $f(x; \boldsymbol{\theta})$, then
$$L(\boldsymbol{\theta}; \mathbf{x}) = \prod_{i=1}^n f(x_i; \boldsymbol{\theta}) \quad \text{and} \quad \ell(\boldsymbol{\theta}; \mathbf{x}) = \sum_{i=1}^n \log f(x_i; \boldsymbol{\theta}).$$

Maximum Likelihood

A $p$-dimensional statistic $\hat{\boldsymbol{\theta}} = T(\mathbf{X})$ is a maximum likelihood estimator (MLE) of $\boldsymbol{\theta}$ if $\hat{\boldsymbol{\theta}} \in \Theta$ and $L(\hat{\boldsymbol{\theta}}; \mathbf{x}) \ge L(\boldsymbol{\theta}; \mathbf{x})$ for all $\boldsymbol{\theta} \in \Theta$, i.e. it maximizes the likelihood function.

For any parameter $\tau = \tau(\boldsymbol{\theta})$, the maximum likelihood estimator is $\hat{\tau} = \tau(\hat{\boldsymbol{\theta}})$. This is called the invariance principle for MLEs.

Since the logarithm function is strictly increasing, we can also obtain the MLE by maximizing the log-likelihood function.


Example 3.18

Consider the statistical model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poisson}(\lambda)$, $\lambda \in \Theta = (0, \infty)$. From the likelihood function
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n p(x_i; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \left(\prod_{i=1}^n \frac{1}{x_i!}\right) e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i},$$
we obtain the log-likelihood function:
$$\ell(\lambda; \mathbf{x}) = \log L(\lambda; \mathbf{x}) = -\sum_{i=1}^n \log(x_i!) - n\lambda + \left(\sum_{i=1}^n x_i\right)\log\lambda.$$
To maximize $\ell(\lambda; \mathbf{x})$, we use simple calculus:
$$\frac{d}{d\lambda}\ell(\lambda; \mathbf{x}) = 0 \;\Rightarrow\; -n + \frac{1}{\lambda}\sum_{i=1}^n x_i = 0 \;\Rightarrow\; \lambda = \frac{1}{n}\sum_{i=1}^n x_i = \bar{x}.$$
It can be easily verified that $\ell(\lambda; \mathbf{x})$ achieves its maximum at $\lambda = \bar{x}$ by the second derivative test:
$$\frac{d^2}{d\lambda^2}\ell(\lambda; \mathbf{x}) = -\frac{1}{\lambda^2}\sum_{i=1}^n x_i < 0.$$
Therefore the MLE of $\lambda$ is simply the sample mean: $\hat{\lambda}_{ML} = \bar{X}$.

Example 3.19

Consider the statistical model in Example 3.7: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} U(0, \theta)$, $\theta \in \Theta = (0, \infty)$. Since the pdf of $U(0, \theta)$ is
$$f(x; \theta) = \begin{cases} \dfrac{1}{\theta} & \text{for } 0 < x < \theta, \\ 0 & \text{otherwise,} \end{cases}$$
the likelihood function is given by
$$L(\theta; \mathbf{x}) = \prod_{i=1}^n f(x_i; \theta) = \begin{cases} \dfrac{1}{\theta^n} & \text{for } 0 < x_1 < \theta,\ 0 < x_2 < \theta,\ \ldots,\ 0 < x_n < \theta, \\ 0 & \text{otherwise,} \end{cases}$$


 1
 for   max  x1 , x2 ,..., xn ,
or simply, L ; x     n
 0 otherwise.

Therefore L ; x  is maximized at   max x1 , x2 ,..., xn  and the MLE of  is


obtained as
ˆML  max X 1 , X 2 ,..., X n   X n 

which is the estimator ˆ2 in Example 3.7.

Example 3.20

Consider the normal model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, $\boldsymbol{\theta} = (\mu, \sigma) \in \Theta = \mathbb{R} \times (0, \infty)$, $p = 2$. The likelihood function is given by
$$L(\mu, \sigma; \mathbf{x}) = \prod_{i=1}^n f(x_i; \mu, \sigma) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right),$$
from which we obtain the log-likelihood function:
$$\ell(\mu, \sigma; \mathbf{x}) = \log L(\mu, \sigma; \mathbf{x}) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2.$$
Based on the decomposition of variations, we have the following identity:
$$\sum_{i=1}^n (x_i - \mu)^2 = \sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2,$$
which achieves its minimum at $\mu = \bar{x}$. Therefore, for any value of $\sigma$, $\ell(\mu, \sigma; \mathbf{x})$ is maximized at $\mu = \bar{x}$. For this value of $\mu$, the log-likelihood function becomes
$$\ell(\bar{x}, \sigma; \mathbf{x}) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \bar{x})^2.$$
To maximize $\ell(\bar{x}, \sigma; \mathbf{x})$ with respect to $\sigma$, we use simple calculus:
$$\frac{d}{d\sigma}\ell(\bar{x}, \sigma; \mathbf{x}) = 0 \;\Rightarrow\; -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (x_i - \bar{x})^2 = 0 \;\Rightarrow\; \sigma^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2.$$


It can be easily verified that $\ell(\bar{x}, \sigma; \mathbf{x})$ achieves its maximum at $\sigma = \sqrt{\frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2}$ by the second derivative test. Therefore the MLE of $\boldsymbol{\theta} = (\mu, \sigma)$ is given by
$$\hat{\boldsymbol{\theta}}_{ML} = \left(\bar{X},\ \sqrt{\frac{1}{n}\sum_{i=1}^n (X_i - \bar{X})^2}\right) = \left(\bar{X},\ \sqrt{\frac{n-1}{n}}\,S\right).$$
In particular, $\hat{\mu}_{ML} = \bar{X}$ and $\hat{\sigma}_{ML} = \sqrt{\frac{n-1}{n}}\,S$, which are quite close to the usual estimators.

According to the invariance principle, the MLE of the coefficient of variation $\tau = \sigma/\mu$ can be obtained as
$$\hat{\tau}_{ML} = \frac{\hat{\sigma}_{ML}}{\hat{\mu}_{ML}} = \sqrt{\frac{n-1}{n}}\,\frac{S}{\bar{X}}.$$

Example 3.21

Suppose that we draw a random sample from the Gamma distribution, i.e. we have the statistical model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Gamma}(\alpha, \lambda)$, $\boldsymbol{\theta} = (\alpha, \lambda) \in \Theta = (0, \infty) \times (0, \infty)$, $p = 2$. The likelihood function is given by
$$L(\alpha, \lambda; \mathbf{x}) = \prod_{i=1}^n f(x_i; \alpha, \lambda) = \prod_{i=1}^n \frac{\lambda^\alpha}{\Gamma(\alpha)} x_i^{\alpha-1} e^{-\lambda x_i} = \frac{\lambda^{n\alpha}}{\Gamma(\alpha)^n}\left(\prod_{i=1}^n x_i\right)^{\alpha-1} e^{-\lambda\sum_{i=1}^n x_i},$$
from which we obtain the log-likelihood function:
$$\ell(\alpha, \lambda; \mathbf{x}) = \log L(\alpha, \lambda; \mathbf{x}) = n\alpha\log\lambda - n\log\Gamma(\alpha) + (\alpha - 1)\sum_{i=1}^n \log x_i - \lambda\sum_{i=1}^n x_i.$$
To maximize the log-likelihood function we need multivariable calculus, solving the equations:
$$\frac{\partial\ell}{\partial\alpha} = 0 \;\Rightarrow\; n\log\lambda - \frac{n\Gamma'(\alpha)}{\Gamma(\alpha)} + \sum_{i=1}^n \log x_i = 0,$$
$$\frac{\partial\ell}{\partial\lambda} = 0 \;\Rightarrow\; \frac{n\alpha}{\lambda} - \sum_{i=1}^n x_i = 0 \;\Rightarrow\; \lambda = \frac{\alpha}{\bar{x}}.$$
There is, however, no closed-form solution to these equations. Numerical methods such as Newton's method are needed to find the MLEs of $\alpha$ and $\lambda$, as sketched below.
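A sketch of one such numerical scheme. Substituting $\lambda = \alpha/\bar{x}$ into the first equation reduces the problem to a single equation in $\alpha$, namely $\log\alpha - \psi(\alpha) = \log\bar{x} - \frac{1}{n}\sum\log x_i$, where $\psi = \Gamma'/\Gamma$ is the digamma function; Newton's method then iterates on $\alpha$. This assumes SciPy is available for the digamma and trigamma functions, and the starting value is a common heuristic rather than part of the method itself:

```python
import math
from scipy.special import digamma, polygamma  # polygamma(1, .) is trigamma

def gamma_mle(xs, tol=1e-10):
    """MLE (alpha_hat, lambda_hat) for Gamma(alpha, lambda) via Newton's method."""
    n = len(xs)
    x_bar = sum(xs) / n
    c = math.log(x_bar) - sum(math.log(x) for x in xs) / n  # c > 0 by Jensen
    alpha = 0.5 / c                     # heuristic starting value
    while True:
        g = math.log(alpha) - digamma(alpha) - c       # profile equation in alpha
        g_prime = 1 / alpha - polygamma(1, alpha)
        step = g / g_prime
        alpha -= step
        if abs(step) < tol:
            break
    return alpha, alpha / x_bar         # lambda_hat = alpha_hat / x_bar

xs = [1.2, 0.7, 2.5, 1.9, 0.4, 3.1, 1.1, 0.9, 2.2, 1.6]
print(gamma_mle(xs))
```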


Nice Properties of MLE

Under some regularity conditions, the MLE has the following nice properties, which justify its popularity across a wide range of statistical models.

1. Asymptotically Unbiased

In small samples, the MLE $\hat{\tau}_{ML}$ may be biased. However, the bias converges to zero as $n \to \infty$, i.e. $\hat{\tau}_{ML}$ is approximately unbiased in large samples.

2. Asymptotically Efficient

The MLE $\hat{\tau}_{ML}$ is the best unbiased estimator in large samples, i.e. it has the smallest variance among all unbiased estimators as $n \to \infty$.

3. Consistency

The MLE $\hat{\tau}_{ML}$ is a consistent estimator of $\tau$.

4. Asymptotic Normality

The MLE $\hat{\tau}_{ML}$ is approximately normally distributed when the sample size is large.

Example 3.22

Consider the statistical model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Exponential}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

Likelihood function:
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n f(x_i; \lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_{i=1}^n x_i}$$
Log-likelihood function:
$$\ell(\lambda; \mathbf{x}) = \log L(\lambda; \mathbf{x}) = n\log\lambda - \lambda\sum_{i=1}^n x_i$$
By maximizing $\ell(\lambda; \mathbf{x})$, the MLE of $\lambda$ is easily obtained as $\hat{\lambda}_{ML} = 1/\bar{X}$, which is the same as the MME. Based on the properties of the MLE, $\hat{\lambda}_{ML}$ is asymptotically unbiased and consistent for $\lambda$, and is approximately normally distributed with mean $\lambda$. The large-sample approximation of $\operatorname{Var}(\hat{\lambda}_{ML})$ can be obtained by using the Fisher Information.


§ 3.5 Fisher Information

In this section, we first consider only statistical models with a one-dimensional parameter space. Generalization to a $p$-dimensional parameter vector $\boldsymbol{\theta}$ requires matrix algebra and multivariable calculus, and will be briefly described later. Details are taught in advanced statistical inference courses.

Definition

Let $\mathbf{X}$ be a random vector with likelihood function $L(\theta; \mathbf{X})$. The score function is defined as the derivative of the log-likelihood function:
$$S(\mathbf{X}; \theta) = \frac{d}{d\theta}\log L(\theta; \mathbf{X}),$$
and the variance of the score function is called the Fisher Information:
$$I(\theta) = \operatorname{Var}\left(S(\mathbf{X}; \theta)\right).$$

Remarks

1. The score function measures how sensitively the likelihood function $L(\theta; \mathbf{X})$ depends on the parameter $\theta$, while the Fisher information measures the amount of information that the observed $\mathbf{X}$ carries about the unknown parameter $\theta$.

2. Under certain regularity conditions (e.g. the range of $\mathbf{X}$ does not depend on $\theta$, the log-likelihood function is differentiable, etc.), we have $E\left[S(\mathbf{X}; \theta)\right] = 0$ and
$$I(\theta) = E\left[S(\mathbf{X}; \theta)^2\right] = E\left[-\frac{d^2}{d\theta^2}\log L(\theta; \mathbf{X})\right].$$
The proof is beyond the scope of this course; a mathematical justification is given in the supplementary notes.

3. If $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is a random sample, then $I(\theta) = nI_1(\theta)$, where $I_1(\theta)$ is the information in a single $X$:
$$I_1(\theta) = E\left[-\frac{d^2}{d\theta^2}\log f(X; \theta)\right].$$


Example 3.23

Consider the Poisson model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poisson}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

Likelihood function:
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n p(x_i; \lambda) = \left(\prod_{i=1}^n \frac{1}{x_i!}\right)e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}$$
The Fisher Information is obtained as follows:
$$\log L(\lambda; \mathbf{X}) = -n\lambda + \left(\sum_{i=1}^n X_i\right)\log\lambda - \sum_{i=1}^n \log(X_i!)$$
$$\frac{d}{d\lambda}\log L(\lambda; \mathbf{X}) = -n + \frac{1}{\lambda}\sum_{i=1}^n X_i, \quad \frac{d^2}{d\lambda^2}\log L(\lambda; \mathbf{X}) = -\frac{1}{\lambda^2}\sum_{i=1}^n X_i$$
$$I(\lambda) = E\left[-\frac{d^2}{d\lambda^2}\log L(\lambda; \mathbf{X})\right] = E\left[\frac{1}{\lambda^2}\sum_{i=1}^n X_i\right] = \frac{1}{\lambda^2}\cdot n\lambda = \frac{n}{\lambda}$$

Example 3.24

Consider the exponential model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Exponential}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

Likelihood function:
$$L(\lambda; \mathbf{x}) = \prod_{i=1}^n f(x_i; \lambda) = \prod_{i=1}^n \lambda e^{-\lambda x_i} = \lambda^n e^{-\lambda\sum_{i=1}^n x_i}$$
The Fisher Information is obtained as:
$$\log L(\lambda; \mathbf{X}) = n\log\lambda - \lambda\sum_{i=1}^n X_i$$
$$\frac{d}{d\lambda}\log L(\lambda; \mathbf{X}) = \frac{n}{\lambda} - \sum_{i=1}^n X_i, \quad \frac{d^2}{d\lambda^2}\log L(\lambda; \mathbf{X}) = -\frac{n}{\lambda^2}$$
$$I(\lambda) = E\left[-\frac{d^2}{d\lambda^2}\log L(\lambda; \mathbf{X})\right] = E\left[\frac{n}{\lambda^2}\right] = \frac{n}{\lambda^2}$$


Two important theorems in mathematical statistics based on the Fisher Information are stated below. The proofs are beyond the scope of this course and are omitted here.

Cramér-Rao Inequality

Let $\hat{\tau} = T(\mathbf{X})$ be an unbiased estimator of $\tau = \tau(\theta)$. Under certain regularity conditions, the following inequality always holds:
$$\operatorname{Var}(\hat{\tau}) \ge \frac{[\tau'(\theta)]^2}{I(\theta)}.$$
The quantity $[\tau'(\theta)]^2 / I(\theta)$ gives a lower bound on the variance of all unbiased estimators and is called the Cramér-Rao Lower Bound.

In particular, the Cramér-Rao Lower Bound for any unbiased estimator $\hat{\theta}$ of $\theta$ is given by
$$\operatorname{Var}(\hat{\theta}) \ge \frac{1}{I(\theta)}.$$

Asymptotic normality of MLE

Suppose that $\hat{\tau}_{ML}$ is an MLE of $\tau = \tau(\theta)$. Under certain regularity conditions,
$$\sqrt{I(\theta)}\left(\hat{\tau}_{ML} - \tau\right) \overset{L}{\to} N\left(0, [\tau'(\theta)]^2\right) \quad \text{as } n \to \infty,$$
i.e. $\hat{\tau}_{ML}$ is approximately distributed as $N\left(\tau, \dfrac{[\tau'(\theta)]^2}{I(\theta)}\right)$ for large $n$.

In particular, the asymptotic distribution of the MLE $\hat{\theta}$ of $\theta$ is
$$\sqrt{I(\theta)}\left(\hat{\theta} - \theta\right) \overset{L}{\to} N(0, 1).$$


Example 3.25

From Example 3.23, the Fisher information for the Poisson model was determined as $I(\lambda) = \dfrac{n}{\lambda}$.

The Cramér-Rao Lower Bound for any unbiased estimator $\hat{\tau}$ of $\tau(\lambda)$ is given by
$$\operatorname{Var}(\hat{\tau}) \ge \frac{[\tau'(\lambda)]^2}{I(\lambda)} = \frac{\lambda[\tau'(\lambda)]^2}{n}.$$
Therefore, for any unbiased estimator $\hat{\lambda}$ of $\lambda$, we have $\operatorname{Var}(\hat{\lambda}) \ge \dfrac{\lambda}{n}$.

Since $\bar{X}$ is unbiased for $\lambda$ and $\operatorname{Var}(\bar{X}) = \dfrac{\lambda}{n}$, the variance of $\bar{X}$ attains the Cramér-Rao Lower Bound, and hence
$$\operatorname{Var}(\hat{\lambda}) \ge \operatorname{Var}(\bar{X})$$
for any unbiased estimator $\hat{\lambda}$, i.e. $\bar{X}$ has the minimum variance among all unbiased estimators. The sample mean is therefore the best unbiased estimator, or the Uniformly Minimum-Variance Unbiased Estimator (UMVUE).

Moreover, from Example 3.18 we have derived that $\bar{X}$ is an MLE of $\lambda$. Therefore it is asymptotically normally distributed, with
$$\sqrt{I(\lambda)}\left(\bar{X} - \lambda\right) = \sqrt{\frac{n}{\lambda}}\left(\bar{X} - \lambda\right) \overset{L}{\to} N(0, 1) \quad \text{as } n \to \infty,$$
i.e. $\bar{X}$ is approximately distributed as $N\left(\lambda, \dfrac{\lambda}{n}\right)$ when the sample size $n$ is large. This is exactly the same result obtained from the central limit theorem.

If we want to estimate $\tau(\lambda) = \lambda^2$, the Cramér-Rao Lower Bound for any unbiased estimator is
$$\operatorname{Var}(\hat{\tau}) \ge \frac{\lambda[\tau'(\lambda)]^2}{n} = \frac{\lambda(2\lambda)^2}{n} = \frac{4\lambda^3}{n},$$
and approximately, $\hat{\tau}_{ML} = \bar{X}^2 \sim N\left(\lambda^2, \dfrac{4\lambda^3}{n}\right)$.


Example 3.26

From Example 3.24, the Fisher information for the exponential model was determined as $I(\lambda) = \dfrac{n}{\lambda^2}$.

The Cramér-Rao Lower Bound for any unbiased estimator $\hat{\tau}$ of $\tau(\lambda)$ is given by
$$\operatorname{Var}(\hat{\tau}) \ge \frac{[\tau'(\lambda)]^2}{I(\lambda)} = \frac{\lambda^2[\tau'(\lambda)]^2}{n}.$$
Therefore, for any unbiased estimator $\hat{\lambda}$ of $\lambda$, we have $\operatorname{Var}(\hat{\lambda}) \ge \dfrac{\lambda^2}{n}$.

It can be shown that the MLE of $\lambda$ is $\hat{\lambda}_{ML} = 1/\bar{X}$. Its asymptotic distribution is given by
$$\sqrt{I(\lambda)}\left(\hat{\lambda}_{ML} - \lambda\right) = \frac{\sqrt{n}}{\lambda}\left(\hat{\lambda}_{ML} - \lambda\right) \overset{L}{\to} N(0, 1) \quad \text{as } n \to \infty,$$
i.e. $\hat{\lambda}_{ML} = 1/\bar{X}$ is approximately distributed as $N\left(\lambda, \dfrac{\lambda^2}{n}\right)$ when the sample size $n$ is large.

If we want to estimate $\tau(\lambda) = e^{-\lambda}$, the Cramér-Rao Lower Bound for any unbiased estimator is
$$\operatorname{Var}(\hat{\tau}) \ge \frac{\lambda^2[\tau'(\lambda)]^2}{n} = \frac{\lambda^2 e^{-2\lambda}}{n},$$
and approximately, $\hat{\tau}_{ML} = e^{-1/\bar{X}} \sim N\left(e^{-\lambda}, \dfrac{\lambda^2 e^{-2\lambda}}{n}\right)$.
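The quality of this normal approximation can be probed by simulation: standardize the MLE and count how often it falls within $\pm 1.96$. A sketch ($\lambda = 2$ and $n = 200$ are illustrative choices):

```python
import random

lam, n, reps = 2.0, 200, 20_000
inside = 0
for _ in range(reps):
    x_bar = sum(random.expovariate(lam) for _ in range(n)) / n
    lam_hat = 1 / x_bar                   # MLE of lambda
    z = (lam_hat - lam) * n ** 0.5 / lam  # approximately N(0, 1) for large n
    inside += abs(z) <= 1.96
print(inside / reps)  # close to 0.95 if the normal approximation is good
```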


Remarks

1. If the variance of an unbiased estimator $\hat{\tau}$ is equal to the Cramér-Rao Lower Bound, then $\hat{\tau}$ is said to be an Efficient Estimator. An efficient estimator must therefore be a UMVUE. However, a UMVUE may not be an efficient estimator, as the Cramér-Rao Lower Bound may not be attainable.

2. The asymptotic distribution of the MLE has mean equal to the parameter and variance equal to the Cramér-Rao Lower Bound. Therefore the MLE is asymptotically unbiased and efficient.

3. The "regularity conditions" for the asymptotic normality of the MLE include:

- The support of the data distribution cannot depend on the parameter. For example, $U(0, \theta)$ violates this condition.

- The true parameter cannot lie on the boundary of the set of possible parameters. For example, this condition is violated for the normal model $N(\mu, \sigma^2)$ with $\mu \ge 0$, as the MLE of $\mu$ would be obtained as $\hat{\mu} = \bar{X}$ if $\bar{X} > 0$ and $\hat{\mu} = 0$ if $\bar{X} \le 0$.

- The number of nuisance parameters cannot increase with the sample size. Nuisance parameters are parameters other than the one being estimated. For example, this condition is violated if we are estimating $\mu$ based on the model $X_i \sim N(\mu, \sigma_i^2)$, $i = 1, 2, \ldots, n$, as the number of variances $\sigma_i^2$ increases with the sample size.

- The amount of information should increase indefinitely with the sample size. For example, if there is too much dependence among the observations, this condition will be violated.

4. Sometimes it is hard to check whether the above regularity conditions are satisfied. Some sufficient conditions that guarantee the asymptotic normality of the MLE are:

- Both $\dfrac{d}{d\theta}\log L(\theta; \mathbf{X})$ and $\dfrac{d^2}{d\theta^2}\log L(\theta; \mathbf{X})$ exist.

- $I(\theta)$ is continuous.

- The MLE $\hat{\theta}_{ML}$ is a consistent estimator of $\theta$.


5. For statistical models with a $p$-dimensional parameter space, the Fisher information is a $p \times p$ matrix
$$I(\boldsymbol{\theta}) = E\left[-\frac{\partial^2}{\partial\boldsymbol{\theta}^2}\log L(\boldsymbol{\theta}; \mathbf{X})\right],$$
whose $(i,j)$th element is $E\left[-\dfrac{\partial^2}{\partial\theta_i\,\partial\theta_j}\log L(\boldsymbol{\theta}; \mathbf{X})\right]$.

The Cramér-Rao Lower Bound for any unbiased estimator $\hat{\tau} = T(\mathbf{X})$ of the univariate parameter $\tau(\boldsymbol{\theta})$ is given by
$$\operatorname{Var}(\hat{\tau}) \ge \left(\frac{\partial\tau}{\partial\boldsymbol{\theta}}\right)' I(\boldsymbol{\theta})^{-1}\left(\frac{\partial\tau}{\partial\boldsymbol{\theta}}\right),$$
where $\dfrac{\partial\tau}{\partial\boldsymbol{\theta}} = \left(\dfrac{\partial\tau}{\partial\theta_1}, \dfrac{\partial\tau}{\partial\theta_2}, \ldots, \dfrac{\partial\tau}{\partial\theta_p}\right)'$.

Example 3.27
Consider the normal model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, $\boldsymbol{\theta} = (\mu, \sigma) \in \Theta = \mathbb{R} \times (0, \infty)$, $p = 2$. From Example 3.20, the log-likelihood function is
$$\log L(\boldsymbol{\theta}; \mathbf{X}) = -\frac{n}{2}\log(2\pi) - n\log\sigma - \frac{1}{2\sigma^2}\sum_{i=1}^n (X_i - \mu)^2.$$
$$\frac{\partial}{\partial\mu}\log L = \frac{1}{\sigma^2}\sum_{i=1}^n (X_i - \mu), \quad \frac{\partial}{\partial\sigma}\log L = -\frac{n}{\sigma} + \frac{1}{\sigma^3}\sum_{i=1}^n (X_i - \mu)^2$$
$$\frac{\partial^2}{\partial\mu^2}\log L = -\frac{n}{\sigma^2} \;\Rightarrow\; E\left[-\frac{\partial^2}{\partial\mu^2}\log L\right] = \frac{n}{\sigma^2}$$
$$\frac{\partial^2}{\partial\mu\,\partial\sigma}\log L = -\frac{2}{\sigma^3}\sum_{i=1}^n (X_i - \mu) \;\Rightarrow\; E\left[-\frac{\partial^2}{\partial\mu\,\partial\sigma}\log L\right] = 0$$
$$\frac{\partial^2}{\partial\sigma^2}\log L = \frac{n}{\sigma^2} - \frac{3}{\sigma^4}\sum_{i=1}^n (X_i - \mu)^2 \;\Rightarrow\; E\left[-\frac{\partial^2}{\partial\sigma^2}\log L\right] = -\frac{n}{\sigma^2} + \frac{3n}{\sigma^2} = \frac{2n}{\sigma^2}$$


Therefore the Fisher information matrix is
$$I(\boldsymbol{\theta}) = \begin{pmatrix} \dfrac{n}{\sigma^2} & 0 \\ 0 & \dfrac{2n}{\sigma^2} \end{pmatrix} = \frac{n}{\sigma^2}\begin{pmatrix} 1 & 0 \\ 0 & 2 \end{pmatrix},$$
with inverse given by
$$I(\boldsymbol{\theta})^{-1} = \frac{\sigma^2}{2n}\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}.$$
For the parameter $\mu$, the Cramér-Rao Lower Bound for any unbiased estimator $\hat{\mu}$ is
$$\operatorname{Var}(\hat{\mu}) \ge (1, 0)\,\frac{\sigma^2}{2n}\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 1 \\ 0 \end{pmatrix} = \frac{\sigma^2}{n}.$$
Since the sample mean $\bar{X}$ is unbiased for $\mu$ and $\operatorname{Var}(\bar{X}) = \dfrac{\sigma^2}{n}$, it is an efficient estimator of $\mu$, and hence the UMVUE of $\mu$.

On the other hand, for the parameter $\sigma^2$, the Cramér-Rao Lower Bound for any unbiased estimator $\hat{\sigma}^2$ is
$$\operatorname{Var}(\hat{\sigma}^2) \ge (0, 2\sigma)\,\frac{\sigma^2}{2n}\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} 0 \\ 2\sigma \end{pmatrix} = \frac{2\sigma^4}{n}.$$
Therefore the sample variance $S^2$ is not an efficient estimator of $\sigma^2$, as
$$\operatorname{Var}(S^2) = \frac{2\sigma^4}{n-1} > \frac{2\sigma^4}{n}.$$
(However, using the Lehmann-Scheffé theorem, which is taught in advanced statistical inference courses, it can be proved that $S^2$ is a UMVUE of $\sigma^2$.)

Finally, for the coefficient of variation $\tau = \sigma/\mu$, the Cramér-Rao Lower Bound for any unbiased estimator $\hat{\tau}$ is given by
$$\operatorname{Var}(\hat{\tau}) \ge \left(-\frac{\sigma}{\mu^2}, \frac{1}{\mu}\right)\frac{\sigma^2}{2n}\begin{pmatrix} 2 & 0 \\ 0 & 1 \end{pmatrix}\begin{pmatrix} -\sigma/\mu^2 \\ 1/\mu \end{pmatrix} = \frac{\sigma^2\left(\mu^2 + 2\sigma^2\right)}{2n\mu^4}.$$


§ 3.6 Sufficient Statistics

Recall that a statistic $T(\mathbf{X})$ is a function of the observed random vector $\mathbf{X}$. A useful $T(\mathbf{X})$ should have a much lower dimension than the original data $\mathbf{X}$ and contain as much information as possible, so that we can reduce the data with no loss. For example, the commonly used statistic $T(\mathbf{X}) = (\bar{X}, S^2)$ reduces the data from $n$ numbers to 2 numbers. So the question is: how can we determine whether $T(\mathbf{X})$ contains "all the information" in the data for estimating the unknown parameter $\boldsymbol{\theta}$? In other words, is $T(\mathbf{X})$ "sufficient" for making inference about $\boldsymbol{\theta}$?

Definition

For a statistical model with random vector $\mathbf{X} = (X_1, X_2, \ldots, X_n) \in \mathcal{X}$ and parameter vector $\boldsymbol{\theta} \in \Theta$, a statistic $T = T(\mathbf{X})$ (possibly vector-valued) is a sufficient statistic for $\boldsymbol{\theta}$ if the conditional distribution of $\mathbf{X}$ given $T$ does not depend on the unknown parameter $\boldsymbol{\theta}$.

Suppose we know only the value of a sufficient statistic $T$ but not the complete data $\mathbf{X}$. According to the definition, the conditional distribution of $\mathbf{X}$ given $T$ does not depend on $\boldsymbol{\theta}$. Therefore we can use this conditional distribution to generate another set of data $\mathbf{X}^*$, without knowing the value of $\boldsymbol{\theta}$. This generated data $\mathbf{X}^*$ has the same distribution as the original data $\mathbf{X}$. If we apply the same statistical method to make inference about $\boldsymbol{\theta}$, the performance would be the same whether we apply it to $\mathbf{X}$ or $\mathbf{X}^*$. Therefore knowing the value of $T$ is as good as having the complete data $\mathbf{X}$, i.e. $T$ contains all the information about $\boldsymbol{\theta}$ and is sufficient for $\boldsymbol{\theta}$.

Example 3.28

Suppose we independently perform $n$ Bernoulli trials, each with success probability $p$, i.e. we have the Bernoulli model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} b(1, p)$, $p \in \Theta = (0, 1)$.

Let $T = \sum_{i=1}^n X_i$. Then $T$ is simply the number of successes out of the $n$ Bernoulli trials, and hence $T \sim b(n, p)$.


The conditional pmf of $\mathbf{X} = (X_1, X_2, \ldots, X_n)$ is given by
$$p(\mathbf{x} \mid t) = P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n \mid T = t) = \frac{P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n,\ T = t)}{P(T = t)}.$$
If $\sum_{i=1}^n x_i \ne t$, then the probability in the numerator is equal to zero, and so is $p(\mathbf{x} \mid t)$.

For $\sum_{i=1}^n x_i = t$,
$$\begin{aligned}
p(\mathbf{x} \mid t) &= \frac{P(X_1 = x_1, X_2 = x_2, \ldots, X_n = x_n)}{P(T = t)} \\
&= \frac{P(X_1 = x_1)P(X_2 = x_2)\cdots P(X_n = x_n)}{P(T = t)} \\
&= \frac{p^{x_1}(1-p)^{1-x_1}\,p^{x_2}(1-p)^{1-x_2}\cdots p^{x_n}(1-p)^{1-x_n}}{\binom{n}{t}p^t(1-p)^{n-t}} \\
&= \frac{p^{x_1+x_2+\cdots+x_n}(1-p)^{n-(x_1+x_2+\cdots+x_n)}}{\binom{n}{t}p^t(1-p)^{n-t}} = \frac{p^t(1-p)^{n-t}}{\binom{n}{t}p^t(1-p)^{n-t}} = \binom{n}{t}^{-1}.
\end{aligned}$$
In summary,
$$p(\mathbf{x} \mid t) = \begin{cases} \dbinom{n}{t}^{-1} & \text{if } x_1 + x_2 + \cdots + x_n = t, \\ 0 & \text{otherwise,} \end{cases}$$
which does not depend on the parameter $p$. Therefore $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $p$. Knowing the number of successes is already enough for us to make inference about the success probability $p$. Once we have the number of successes, we do not need to know which of the Bernoulli trials were successes.
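Sufficiency can also be seen empirically: simulate Bernoulli samples under two very different values of $p$, condition on the same value of $T$, and observe that the conditional behaviour of the data is the same. A sketch (with $n = 5$ and $t = 2$; the exact conditional probability is $t/n = 0.4$ regardless of $p$):

```python
import random

def cond_prob_x1(p, n=5, t=2, reps=200_000):
    """Monte Carlo estimate of P(X1 = 1 | sum of X_i = t) under b(1, p)."""
    hits = total = 0
    for _ in range(reps):
        x = [random.random() < p for _ in range(n)]
        if sum(x) == t:
            total += 1
            hits += x[0]
    return hits / total

print(cond_prob_x1(0.2))  # both outputs are close to t/n = 0.4:
print(cond_prob_x1(0.7))  # the conditional law does not involve p
```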


Example 3.29

Consider the Bernoulli model with $n = 3$: $X_1, X_2, X_3 \overset{iid}{\sim} b(1, p)$, $p \in \Theta = (0, 1)$.

Let $T = \dfrac{X_1 + 2X_2 + 3X_3}{6}$ be a weighted mean of the sample, i.e. the third trial is counted more than the first two trials. Consider the conditional probability
$$\begin{aligned}
P\left(X_1 = 1, X_2 = 1, X_3 = 0 \,\Big|\, T = \tfrac{1}{2}\right) &= \frac{P(X_1 = 1, X_2 = 1, X_3 = 0)}{P(T = 1/2)} \\
&= \frac{P(X_1 = 1, X_2 = 1, X_3 = 0)}{P(X_1 + 2X_2 + 3X_3 = 3)} \\
&= \frac{P(X_1 = 1, X_2 = 1, X_3 = 0)}{P(X_1 = 1, X_2 = 1, X_3 = 0) + P(X_1 = 0, X_2 = 0, X_3 = 1)} \\
&= \frac{p^2(1-p)}{p^2(1-p) + (1-p)^2 p} = p.
\end{aligned}$$
Since this conditional probability depends on the value of $p$, the conditional distribution of $\mathbf{X} = (X_1, X_2, X_3)$ given $T$ also depends on the parameter $p$. Therefore the weighted mean $T$ is not a sufficient statistic for $p$.

Remarks

1. Obviously, the complete data $\mathbf{X} = T(\mathbf{X})$ must be a sufficient statistic.

2. Sufficient statistics are not unique. If $T(\mathbf{X})$ is sufficient and $T^*(\mathbf{X})$ is any other statistic such that $T(\mathbf{X}) = g(T^*(\mathbf{X}))$, then $T^*(\mathbf{X})$ is also sufficient. For example, for the Bernoulli model in Example 3.28, the sample mean $\bar{X}$ is also sufficient, as $T = n\bar{X}$.

3. If $T(\mathbf{X})$ is a sufficient statistic and is a function of any other sufficient statistic $T^*(\mathbf{X})$, then $T(\mathbf{X})$ is said to be a minimal sufficient statistic. It provides the greatest possible data reduction.

4. For general statistical models, it is often difficult to find the conditional distribution of $\mathbf{X}$ given $T(\mathbf{X})$. To check sufficiency, we can use the following two criteria, which are much easier to apply.


§ 3.6.1 The Fisher-Neyman and Factorization Criteria

Fisher-Neyman Criterion

For a statistical model with random vector $\mathbf{X}$ and parameter vector $\boldsymbol{\theta}$, the statistic $T = T(\mathbf{X})$ is sufficient for $\boldsymbol{\theta}$ if and only if the ratio
$$\frac{p(\mathbf{x}; \boldsymbol{\theta})}{q(T(\mathbf{x}); \boldsymbol{\theta})}$$
does not depend on $\boldsymbol{\theta}$, where $p(\mathbf{x}; \boldsymbol{\theta})$ and $q(t; \boldsymbol{\theta})$ are the pmfs (or pdfs) of $\mathbf{X}$ and $T = T(\mathbf{X})$, respectively.

Factorization Criterion

For a statistical model with random vector $\mathbf{X}$ and parameter vector $\boldsymbol{\theta}$, the statistic $T = T(\mathbf{X})$ is sufficient for $\boldsymbol{\theta}$ if and only if there exist functions $g(t; \boldsymbol{\theta})$ and $h(\mathbf{x})$ such that the pmf (or pdf) $p(\mathbf{x}; \boldsymbol{\theta})$ of $\mathbf{X}$ can be written as
$$p(\mathbf{x}; \boldsymbol{\theta}) = h(\mathbf{x})\,g(T(\mathbf{x}); \boldsymbol{\theta}),$$
i.e. the pmf (or pdf) of $\mathbf{X}$ can be "factorized", with the function involving $\boldsymbol{\theta}$ depending on $\mathbf{x}$ only through $T(\mathbf{x})$.

These criteria hold in both the discrete and continuous cases. However, the proofs for the continuous case require measure theory and are omitted. The proofs for the discrete case can be found in the supplementary notes.

Example 3.30

For the Bernoulli model in Example 3.28, the pmf of $\mathbf{X}$ is
$$p(\mathbf{x}; p) = \prod_{i=1}^n p^{x_i}(1-p)^{1-x_i} = p^{\sum_{i=1}^n x_i}(1-p)^{n - \sum_{i=1}^n x_i}.$$
Therefore, based on the factorization criterion, $T = \sum_{i=1}^n X_i$ is sufficient for $p$.


Example 3.31

Consider a gamma model:
$$X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Gamma}(\alpha, \lambda), \quad \boldsymbol{\theta} = (\alpha, \lambda) \in \Theta = (0, \infty) \times (0, \infty)$$
The joint pdf is
$$f(\mathbf{x}; \boldsymbol{\theta}) = \prod_{i=1}^n \frac{\lambda^\alpha}{\Gamma(\alpha)} x_i^{\alpha-1} e^{-\lambda x_i} = \frac{\lambda^{n\alpha}}{\Gamma(\alpha)^n}\left(\prod_{i=1}^n x_i\right)^{\alpha-1} e^{-\lambda\sum_{i=1}^n x_i}.$$
Therefore $T(\mathbf{X}) = \left(\prod_{i=1}^n X_i,\ \sum_{i=1}^n X_i\right)$ is a sufficient statistic for $\boldsymbol{\theta} = (\alpha, \lambda)$.

Example 3.32

Consider a normal model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, \sigma^2)$, $\boldsymbol{\theta} = (\mu, \sigma) \in \Theta = \mathbb{R} \times (0, \infty)$.

The joint pdf is
$$\begin{aligned}
f(\mathbf{x}; \boldsymbol{\theta}) &= \prod_{i=1}^n \frac{1}{\sqrt{2\pi}\,\sigma}\exp\left(-\frac{(x_i - \mu)^2}{2\sigma^2}\right) \\
&= (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\right) \\
&= (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right).
\end{aligned}$$
Therefore $T(\mathbf{X}) = \left(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2\right)$ is a sufficient statistic for $\boldsymbol{\theta} = (\mu, \sigma)$. Since we can also write the pdf as
$$f(\mathbf{x}; \boldsymbol{\theta}) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\right)\right) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\left((n-1)s^2 + n(\bar{x} - \mu)^2\right)\right),$$
the statistic $(\bar{X}, S^2)$ is also sufficient for $\boldsymbol{\theta} = (\mu, \sigma)$. Note that $\bar{X}$ and $S^2$ are jointly sufficient for $\mu$ and $\sigma$. It is not correct to think that $\bar{X}$ is sufficient for $\mu$ and $S^2$ is sufficient for $\sigma$.


Example 3.33

Consider a normal model with known variance $\sigma^2 = 1$:
$$X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(\mu, 1), \quad \mu \in \Theta = \mathbb{R}$$
Putting $\sigma^2 = 1$ into the pdf, we have
$$f(\mathbf{x}; \mu) = (2\pi)^{-n/2}\exp\left(-\frac{1}{2}\left(\sum_{i=1}^n x_i^2 - 2\mu\sum_{i=1}^n x_i + n\mu^2\right)\right) = (2\pi)^{-n/2}\exp\left(-\frac{1}{2}\sum_{i=1}^n x_i^2\right)\exp\left(n\mu\bar{x} - \frac{n\mu^2}{2}\right).$$
Therefore the sample mean $\bar{X}$ is sufficient for $\mu$.

Example 3.34

Consider a normal model with known mean $\mu = 1$:
$$X_1, X_2, \ldots, X_n \overset{iid}{\sim} N(1, \sigma^2), \quad \sigma \in \Theta = (0, \infty)$$
Putting $\mu = 1$ into the expressions in Example 3.32, the pdf can be expressed as
$$f(\mathbf{x}; \sigma) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\left(\sum_{i=1}^n x_i^2 - 2\sum_{i=1}^n x_i + n\right)\right) = (2\pi\sigma^2)^{-n/2}\exp\left(-\frac{1}{2\sigma^2}\left((n-1)s^2 + n(\bar{x} - 1)^2\right)\right).$$
Both the statistics $\left(\sum_{i=1}^n X_i,\ \sum_{i=1}^n X_i^2\right)$ and $(\bar{X}, S^2)$ are sufficient for $\sigma$. However, $S^2$ alone is not a sufficient statistic for $\sigma$.


§ 3.6.2 The Rao-Blackwell Theorem

The following theorem shows how we can possibly "improve" an estimator using a sufficient statistic.

Rao-Blackwell Theorem

Let $T = T(\mathbf{X})$ be a sufficient statistic for $\boldsymbol{\theta}$ and $\hat{\tau}$ be an estimator of the one-dimensional parameter $\tau(\boldsymbol{\theta})$. Then the conditional expectation
$$\hat{\tau}^* = E(\hat{\tau} \mid T)$$
is an estimator of $\tau(\boldsymbol{\theta})$ whose mean square error is no greater than the mean square error of $\hat{\tau}$.

Proof

Since $T = T(\mathbf{X})$ is sufficient for $\boldsymbol{\theta}$, the conditional distribution of $\mathbf{X}$ given $T$ does not depend on $\boldsymbol{\theta}$. Hence the conditional expectation $\hat{\tau}^*$ is a function of $T$ only, and is therefore an estimator.

Using expectation by conditioning,
$$E(\hat{\tau}^*) = E\left[E(\hat{\tau} \mid T(\mathbf{X}))\right] = E(\hat{\tau}) \;\Rightarrow\; \operatorname{bias}(\hat{\tau}^*) = \operatorname{bias}(\hat{\tau}).$$
Hence $\hat{\tau}^*$ and $\hat{\tau}$ have the same expectation and the same bias. Consider the variance:
$$\operatorname{Var}(\hat{\tau}) = \operatorname{Var}\left(E(\hat{\tau} \mid T(\mathbf{X}))\right) + E\left[\operatorname{Var}(\hat{\tau} \mid T(\mathbf{X}))\right] \ge \operatorname{Var}\left(E(\hat{\tau} \mid T(\mathbf{X}))\right) = \operatorname{Var}(\hat{\tau}^*).$$
As a result,
$$\operatorname{MSE}(\hat{\tau}^*) = \operatorname{Var}(\hat{\tau}^*) + \left[\operatorname{bias}(\hat{\tau}^*)\right]^2 \le \operatorname{Var}(\hat{\tau}) + \left[\operatorname{bias}(\hat{\tau})\right]^2 = \operatorname{MSE}(\hat{\tau}).$$


Remarks

1. The theorem suggests that we can always try to improve an estimator $\hat{\tau}$ using a sufficient statistic $T$, so that $\hat{\tau}^* = E(\hat{\tau} \mid T)$ performs at least as well as, and possibly better than, the original $\hat{\tau}$. This process is often called Rao-Blackwellization.

2. As can be seen from the proof of the theorem, if $\hat{\tau}$ is unbiased, then $\hat{\tau}^* = E(\hat{\tau} \mid T)$ is also unbiased, and $\operatorname{Var}(\hat{\tau}) = \operatorname{Var}(\hat{\tau}^*)$ if and only if
$$E\left[\operatorname{Var}(\hat{\tau} \mid T(\mathbf{X}))\right] = E\left[(\hat{\tau} - \hat{\tau}^*)^2\right] = 0,$$
i.e. if $\hat{\tau}^* = E(\hat{\tau} \mid T)$ has the same variance as the unbiased estimator $\hat{\tau}$, then $P(\hat{\tau}^* = \hat{\tau}) = 1$. This implies that a best unbiased estimator (if it exists) must be a function of the sufficient statistic.

Example 3.35
Consider the Poisson model: $X_1, X_2, \ldots, X_n \overset{iid}{\sim} \text{Poisson}(\lambda)$, $\lambda \in \Theta = (0, \infty)$.

The pmf is given by
$$p(\mathbf{x}; \lambda) = \prod_{i=1}^n \frac{e^{-\lambda}\lambda^{x_i}}{x_i!} = \left(\prod_{i=1}^n \frac{1}{x_i!}\right)e^{-n\lambda}\lambda^{\sum_{i=1}^n x_i}.$$
Therefore $T = \sum_{i=1}^n X_i$ is a sufficient statistic for $\lambda$. Now suppose we want to find an estimator for $\tau = e^{-\lambda}$. Note that $T \sim \text{Poisson}(n\lambda)$ and
$$E\left(e^{-\bar{X}}\right) = E\left(e^{-T/n}\right) = M_T\left(-\tfrac{1}{n}\right) = \exp\left(n\lambda\left(e^{-1/n} - 1\right)\right) \ne e^{-\lambda}.$$
Hence $e^{-\bar{X}}$ is not an unbiased estimator of $e^{-\lambda}$. To find an unbiased estimator, we can base it on the observation that $e^{-\lambda} = P(X_1 = 0)$ and use the indicator variable
$$\hat{\tau} = \begin{cases} 1 & \text{if } X_1 = 0, \\ 0 & \text{otherwise.} \end{cases}$$
It is an unbiased estimator of $e^{-\lambda}$, as $E(\hat{\tau}) = P(\hat{\tau} = 1) = P(X_1 = 0) = e^{-\lambda}$.

Of course it is rather silly to do the estimation using only one out of $n$ observations. However, it is the starting point from which we can do the Rao-Blackwellization.

Consider the conditional expectation
$$\begin{aligned}
E(\hat{\tau} \mid T = t) &= P(\hat{\tau} = 1 \mid T = t) = P(X_1 = 0 \mid T = t) \\
&= \frac{P(X_1 = 0,\ T = t)}{P(T = t)} = \frac{P\left(X_1 = 0,\ \sum_{i=2}^n X_i = t\right)}{P(T = t)} \\
&= \frac{e^{-\lambda}\cdot\dfrac{e^{-(n-1)\lambda}\left((n-1)\lambda\right)^t}{t!}}{\dfrac{e^{-n\lambda}(n\lambda)^t}{t!}} = \left(\frac{n-1}{n}\right)^t = \left(1 - \frac{1}{n}\right)^t.
\end{aligned}$$
Therefore an unbiased estimator of $\tau = e^{-\lambda}$ in terms of the sufficient statistic $T = \sum_{i=1}^n X_i$ is given by
$$\hat{\tau}^* = E(\hat{\tau} \mid T) = \left(1 - \frac{1}{n}\right)^T.$$
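A simulation makes the variance reduction visible: both estimators have mean $e^{-\lambda}$, but the Rao-Blackwellized version is far less variable. A sketch assuming NumPy is available ($\lambda = 0.8$ and $n = 10$ are illustrative values):

```python
import numpy as np

rng = np.random.default_rng(0)
lam, n, reps = 0.8, 10, 100_000

x = rng.poisson(lam, size=(reps, n))
naive = (x[:, 0] == 0).astype(float)  # indicator of X1 = 0: unbiased but crude
rb = (1 - 1 / n) ** x.sum(axis=1)     # Rao-Blackwellized (1 - 1/n)^T

print(np.exp(-lam))                   # target e^{-lambda} = 0.4493...
print(naive.mean(), naive.var())      # same mean, large variance
print(rb.mean(), rb.var())            # same mean, much smaller variance
```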
