
Model Inference and Averaging
Dept. Computer Science & Engineering,
Shanghai Jiao Tong University
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian
Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking



Bootstrap by Basis Expansions
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution is
  \hat{\beta} = (H^T H)^{-1} H^T y
• The covariance of \hat{\beta} is
  \widehat{\mathrm{cov}}(\hat{\beta}) = (H^T H)^{-1} \hat{\sigma}^2, \qquad \hat{\sigma}^2 = \sum_{i=1}^{N} (y_i - \hat{\mu}(x_i))^2 / N
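To make the bootstrap part concrete, here is a minimal sketch (not from the slides): it resamples the (x_i, y_i) pairs, refits \hat{\beta} on each resample, and uses the spread of the refitted curves as a pointwise error estimate. The data and the polynomial basis h_j(x) = x^j are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy data, assumed for illustration: y = sin(x) + noise
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

def basis(x):
    # Simple polynomial basis h_j(x) = x^j; the deck leaves h_j unspecified
    return np.vstack([x**j for j in range(4)]).T

H = basis(x)
beta_hat, *_ = np.linalg.lstsq(H, y, rcond=None)      # beta_hat = (H^T H)^{-1} H^T y
sigma2_hat = np.mean((y - H @ beta_hat) ** 2)

# Nonparametric bootstrap: resample (x_i, y_i) pairs and refit the coefficients
B = 200
boot_curves = np.empty((B, N))
for b in range(B):
    idx = rng.integers(0, N, N)
    beta_b, *_ = np.linalg.lstsq(H[idx], y[idx], rcond=None)
    boot_curves[b] = H @ beta_b

se_boot = boot_curves.std(axis=0)                      # pointwise bootstrap standard errors
print(sigma2_hat, se_boot[:5])
```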
Parametric Model
• Assume a parameterized probability density (parametric model) for the observations:
  z_i \sim g_\theta(z)
• E.g. the normal distribution, with \theta = (\mu, \sigma^2) and
  g_\theta(z) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-(z-\mu)^2 / (2\sigma^2)}
Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity x_T.
  – We make repeated measurements of this quantity: {x_1, x_2, ..., x_n}.
  – The standard way to estimate x_T from our measurements is to calculate the mean value
    \mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i
    and set x_T = \mu_x.


Maximum Likelihood Inference
• Suppose we are trying to measure the true value of some quantity x_T.
  – We make repeated measurements of this quantity: {x_1, x_2, ..., x_n}.
  – The standard way to estimate x_T from our measurements is to calculate the mean value \mu_x = \frac{1}{n} \sum_{i=1}^{n} x_i and set x_T = \mu_x.
• Does this procedure make sense?
• The maximum likelihood method (MLM) answers this question and provides a general method for estimating parameters of interest from data.


The Maximum Likelihood Method
• Statement of the Maximum Likelihood Method
  – Assume we have made n measurements of x: {x_1, x_2, ..., x_n}.
  – Assume we know the probability distribution function that describes x: f(x, \alpha).
  – Assume we want to determine the parameter \alpha.
• MLM: pick \alpha to maximize the probability of getting the measurements (the x_i's) we did!
The MLM Implementation
• The probability of measuring x_1 is f(x_1, \alpha)\,dx
• The probability of measuring x_2 is f(x_2, \alpha)\,dx
• The probability of measuring x_n is f(x_n, \alpha)\,dx
• If the measurements are independent, the probability of getting the measurements we did is
  L = f(x_1, \alpha)dx \cdot f(x_2, \alpha)dx \cdots f(x_n, \alpha)dx = f(x_1, \alpha) \cdot f(x_2, \alpha) \cdots f(x_n, \alpha)\,[dx^n]
• We can drop the dx^n term as it is only a proportionality constant.
• L is called the likelihood function:
  L = \prod_{i=1}^{n} f(x_i, \alpha)
Log Maximum Likelihood Method
• We want to pick the \alpha that maximizes L:
  \left. \frac{\partial L}{\partial \alpha} \right|_{\alpha = \alpha^*} = 0
  – It is often easier to maximize ln L.
  – L and ln L are both maximal at the same location.
• We maximize ln L rather than L itself because ln L converts the product into a summation:
  \ln L = \sum_{i=1}^{n} \ln f(x_i, \alpha)
Log Maximum Likelihood Method
• The new maximization condition is
  \left. \frac{\partial \ln L}{\partial \alpha} \right|_{\alpha = \alpha^*} = \sum_{i=1}^{n} \left. \frac{\partial \ln f(x_i, \alpha)}{\partial \alpha} \right|_{\alpha = \alpha^*} = 0
• \alpha could be an array of parameters (e.g. slope and intercept) or just a single variable.
• The equations that determine \alpha range from simple linear equations to coupled non-linear equations.
An Example: Gaussian
• Let f(x, \alpha) be given by a Gaussian distribution function.
• Let \alpha = \mu be the mean of the Gaussian. We want to use our data + MLM to find the mean.
• We want the best estimate of \alpha = \mu from our set of n measurements {x_1, x_2, ..., x_n}.
• Let's assume that \sigma is the same for each measurement.


An Example: Gaussian
• Gaussian PDF:
  f(x_i, \mu) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu)^2 / (2\sigma^2)}
• The likelihood function for this problem is:
  L = \prod_{i=1}^{n} f(x_i, \mu)
    = \prod_{i=1}^{n} \frac{1}{\sigma\sqrt{2\pi}}\, e^{-(x_i - \mu)^2 / (2\sigma^2)}
    = \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{\!n} e^{-\sum_{i=1}^{n} (x_i - \mu)^2 / (2\sigma^2)}
An Example: Gaussian
• The log-likelihood is
  \ln L = \ln \prod_{i=1}^{n} f(x_i, \mu)
        = \ln\!\left[ \left( \frac{1}{\sigma\sqrt{2\pi}} \right)^{\!n} e^{-\sum_{i=1}^{n} (x_i - \mu)^2 / (2\sigma^2)} \right]
        = n \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2}
• We want to find the \mu that maximizes the log-likelihood function:
  \frac{\partial \ln L}{\partial \mu} = \frac{\partial}{\partial \mu}\!\left[ n \ln\!\left( \frac{1}{\sigma\sqrt{2\pi}} \right) - \sum_{i=1}^{n} \frac{(x_i - \mu)^2}{2\sigma^2} \right] = 0
  \Rightarrow \sum_{i=1}^{n} 2 (x_i - \mu)(-1) = 0 \;\Rightarrow\; \mu = \frac{1}{n} \sum_{i=1}^{n} x_i
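As a quick numerical check of this result (illustrative only; the data below are simulated, not from the slides), maximizing the Gaussian log-likelihood over \mu numerically lands on the sample mean:

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(1)
sigma = 2.0                                     # assumed known, same for every measurement
x = rng.normal(loc=5.0, scale=sigma, size=200)  # simulated measurements

def neg_log_likelihood(mu):
    # -ln L = n*ln(sigma*sqrt(2*pi)) + sum((x_i - mu)^2) / (2*sigma^2)
    return len(x) * np.log(sigma * np.sqrt(2 * np.pi)) + np.sum((x - mu) ** 2) / (2 * sigma**2)

res = minimize_scalar(neg_log_likelihood, bounds=(0.0, 10.0), method="bounded")
print(res.x, x.mean())   # the numerical maximizer agrees with the sample mean
```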
An Example: Gaussian
• If \sigma_i is different for each data point, then \mu is the weighted average:
  \mu = \frac{ \sum_{i=1}^{n} x_i / \sigma_i^2 }{ \sum_{i=1}^{n} 1 / \sigma_i^2 }


An Example: Poisson
• Let f(x, \alpha) be given by a Poisson distribution.
• Let \alpha = \mu be the mean of the Poisson.
• We want the best estimate of \alpha = \mu from our set of n measurements {x_1, x_2, ..., x_n}.
• Poisson PDF:
  f(x, \mu) = \frac{e^{-\mu} \mu^{x}}{x!}
An Example: Poisson
• The likelihood function for this problem is:
  L = \prod_{i=1}^{n} f(x_i, \mu)
    = \prod_{i=1}^{n} \frac{e^{-\mu} \mu^{x_i}}{x_i!}
    = \frac{e^{-\mu} \mu^{x_1}}{x_1!} \cdot \frac{e^{-\mu} \mu^{x_2}}{x_2!} \cdots \frac{e^{-\mu} \mu^{x_n}}{x_n!}
    = \frac{e^{-n\mu}\, \mu^{\sum_{i=1}^{n} x_i}}{x_1!\, x_2! \cdots x_n!}


An Example: Poisson
• Find the \mu that maximizes the log-likelihood function:
  \frac{d \ln L}{d\mu} = \frac{d}{d\mu}\!\left[ -n\mu + \ln\mu \sum_{i=1}^{n} x_i - \ln(x_1!\, x_2! \cdots x_n!) \right]
                       = -n + \frac{1}{\mu} \sum_{i=1}^{n} x_i = 0
  \Rightarrow \mu = \frac{1}{n} \sum_{i=1}^{n} x_i \quad \text{(the sample average)}
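The same numerical check for the Poisson case (again with simulated counts assumed purely for illustration): the \mu that maximizes the Poisson log-likelihood coincides with the sample average.

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.special import gammaln

rng = np.random.default_rng(2)
x = rng.poisson(lam=3.5, size=200)   # simulated counts

def neg_log_likelihood(mu):
    # -ln L = n*mu - ln(mu)*sum(x_i) + sum(ln(x_i!)), with ln(x!) = gammaln(x + 1)
    return len(x) * mu - np.log(mu) * x.sum() + gammaln(x + 1).sum()

res = minimize_scalar(neg_log_likelihood, bounds=(0.1, 20.0), method="bounded")
print(res.x, x.mean())   # the numerical maximizer matches the sample mean
```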


General properties of MLM
• For large data samples (large n) the likelihood function L approaches a Gaussian distribution.
• Maximum likelihood estimates are usually consistent.
  – For large n the estimates converge to the true value of the parameters we wish to determine.
• Maximum likelihood estimates are usually unbiased.
  – On average (over repeated samples) the estimate equals the true value of the parameter, at least asymptotically.
General properties of MLM
• The maximum likelihood estimate is efficient: asymptotically it attains the smallest possible variance (the Cramér–Rao bound).
• The maximum likelihood estimate is sufficient: it uses all the information in the observations (the x_i's).
• The solution from MLM is unique.
• Bad news: we must know the correct probability distribution for the problem at hand!
Maximum Likelihood
• We maximize the likelihood function
  L(\theta; \mathbf{Z}) = \prod_{i=1}^{N} g_\theta(z_i)
• Log-likelihood function:
  \ell(\theta; \mathbf{Z}) = \log L(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \log g_\theta(z_i) = \sum_{i=1}^{N} \ell(\theta; z_i)


Score Function
• Assess the precision of \hat{\theta} using the likelihood function. The score function is
  \dot{\ell}(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \dot{\ell}(\theta; z_i),
  where \dot{\ell}(\theta; z_i) = \partial \ell(\theta; z_i) / \partial \theta
• Assume that L takes its maximum in the interior of the parameter space. Then
  \dot{\ell}(\hat{\theta}; \mathbf{Z}) = 0
Likelihood Function
• We maximize the likelihood function
  L(\theta; \mathbf{Z}) = \prod_{i=1}^{N} g_\theta(z_i)
• We omit the normalization since it only adds a constant factor.
• Think of L as a function of \theta with \mathbf{Z} fixed.
• Log-likelihood function:
  \ell(\theta; \mathbf{Z}) = \log L(\theta; \mathbf{Z}) = \sum_{i=1}^{N} \log g_\theta(z_i) = \sum_{i=1}^{N} \ell(\theta; z_i)
Fisher Information
• The negative sum of second derivatives is the information matrix
  \mathbf{I}(\theta) = -\sum_{i=1}^{N} \frac{\partial^2 \ell(\theta; z_i)}{\partial \theta\, \partial \theta^T}
• Evaluated at \theta = \hat{\theta}, \mathbf{I}(\hat{\theta}) is called the observed information; it should be positive (> 0).
• The Fisher information (expected information) is
  \mathbf{i}(\theta) = \mathrm{E}_\theta[\mathbf{I}(\theta)]
• Assume that \theta_0 is the true value of \theta.
Sampling Theory
• Basic result of sampling theory: when we sample independently from g_{\theta_0}(z), the sampling distribution of the maximum-likelihood estimator approaches a normal distribution as N \to \infty,
  \hat{\theta} \to N\big(\theta_0,\; \mathbf{i}(\theta_0)^{-1}\big)
• This suggests approximating the sampling distribution by
  N\big(\hat{\theta},\; \mathbf{i}(\hat{\theta})^{-1}\big)
Error Bound
• The corresponding error estimates (standard errors) are obtained from
  \sqrt{\mathbf{i}(\hat{\theta})^{-1}_{jj}} \quad \text{and} \quad \sqrt{\mathbf{I}(\hat{\theta})^{-1}_{jj}}
• The confidence points have the form
  \hat{\theta}_j - z^{(1-\alpha)} \cdot \sqrt{\mathbf{i}(\hat{\theta})^{-1}_{jj}} \quad \text{and} \quad \hat{\theta}_j - z^{(1-\alpha)} \cdot \sqrt{\mathbf{I}(\hat{\theta})^{-1}_{jj}},
  where z^{(1-\alpha)} is the 1-\alpha percentile of the standard normal distribution.
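To make this concrete for the Gaussian-mean case (an illustrative sketch with simulated data, not an example from the deck): with \sigma known, the information for \mu is n/\sigma^2, so the approximate 95% interval is \hat{\mu} \pm 1.96\,\sigma/\sqrt{n}.

```python
import numpy as np

rng = np.random.default_rng(3)
sigma = 2.0
x = rng.normal(loc=5.0, scale=sigma, size=200)

mu_hat = x.mean()              # maximum likelihood estimate of mu
info = len(x) / sigma**2       # observed (= expected) information for mu when sigma is known
se = np.sqrt(1.0 / info)       # standard error = sqrt(i(theta)^-1)
z = 1.96                       # z^(1-alpha) for a 95% interval

print(f"95% interval: {mu_hat - z * se:.3f} .. {mu_hat + z * se:.3f}")
```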
Simplified form of the Fisher information
Suppose, in addition, that the operations of integration and differentiation can be swapped for the second derivative of f(x; \theta) as well, i.e.,
  \frac{\partial^2}{\partial\theta^2} \int T(x)\, f(x; \theta)\, dx = \int T(x)\, \frac{\partial^2}{\partial\theta^2} f(x; \theta)\, dx
In this case, it can be shown that the Fisher information equals
  I(\theta) = -\mathrm{E}\!\left[ \frac{\partial^2}{\partial\theta^2} \log f(X; \theta) \right]
The Cramér–Rao bound can then be written as
  \mathrm{var}(\hat{\theta}) \ge \frac{1}{I(\theta)} = \frac{1}{-\mathrm{E}\!\left[ \frac{\partial^2}{\partial\theta^2} \log f(X; \theta) \right]}
Single-parameter proof
Let X be a random variable with probability density function f(x; \theta), and let T = t(X) be a statistic used as an estimator for \psi(\theta). If the expectation of T is denoted by \psi(\theta), then, for all \theta,
  \mathrm{var}(t(X)) \ge \frac{[\psi'(\theta)]^2}{I(\theta)}
If V is the score, i.e.
  V = \frac{\partial}{\partial\theta} \ln f(X; \theta),
then the expectation of V, written E(V), is zero. If we consider the covariance cov(V, T) of V and T, we have cov(V, T) = E(VT), because E(V) = 0. Expanding this expression we have
  \mathrm{cov}(V, T) = \mathrm{E}\!\left[ T \cdot \frac{\partial}{\partial\theta} \ln f(X; \theta) \right]
Using the chain rule
  \frac{\partial}{\partial\theta} \ln Q = \frac{1}{Q} \frac{\partial Q}{\partial\theta}
and the definition of expectation gives, after cancelling f(x; \theta), because the integration and differentiation operations commute (second condition),
  \mathrm{E}\!\left[ T \cdot \frac{\partial}{\partial\theta} \ln f(X; \theta) \right]
    = \int t(x) \left[ \frac{\partial}{\partial\theta} f(x; \theta) \right] dx
    = \frac{\partial}{\partial\theta} \int t(x)\, f(x; \theta)\, dx
    = \psi'(\theta)
The Cauchy–Schwarz inequality shows that
  \mathrm{var}(T)\, \mathrm{var}(V) \ge \left[ \mathrm{cov}(V, T) \right]^2 = \left[ \psi'(\theta) \right]^2
Therefore, since the variance of the score equals the Fisher information, \mathrm{var}(V) = I(\theta),
  \mathrm{var}(T) \ge \frac{[\psi'(\theta)]^2}{\mathrm{var}(V)} = \frac{[\psi'(\theta)]^2}{I(\theta)} = \left[ \frac{\partial}{\partial\theta} \mathrm{E}(T) \right]^2 \frac{1}{I(\theta)}
An Example
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• The least-squares solution is
  \hat{\beta} = (H^T H)^{-1} H^T y
• The covariance of \hat{\beta} is
  \widehat{\mathrm{cov}}(\hat{\beta}) = (H^T H)^{-1} \hat{\sigma}^2, \qquad \hat{\sigma}^2 = \sum_{i=1}^{N} (y_i - \hat{\mu}(x_i))^2 / N
An Example
• Consider the prediction model
  \hat{\mu}(x) = \sum_{j=1}^{N} \hat{\beta}_j h_j(x)
• The standard error is
  \widehat{\mathrm{se}}[\hat{\mu}(x)] = \left[ h(x)^T (H^T H)^{-1} h(x) \right]^{1/2} \hat{\sigma}
• The confidence region is
  \hat{\mu}(x) \pm 1.96 \cdot \widehat{\mathrm{se}}[\hat{\mu}(x)]
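Here is a short sketch of how these formulas translate to code, reusing the toy data and polynomial basis assumed in the earlier bootstrap sketch (neither is specified in the deck):

```python
import numpy as np

rng = np.random.default_rng(4)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)

H = np.vstack([x**j for j in range(4)]).T            # design matrix of basis functions h_j(x)
HtH_inv = np.linalg.inv(H.T @ H)
beta_hat = HtH_inv @ H.T @ y                         # least-squares coefficients
mu_hat = H @ beta_hat
sigma_hat = np.sqrt(np.mean((y - mu_hat) ** 2))

# Pointwise standard error and 95% confidence band for mu_hat(x)
se = np.sqrt(np.einsum("ij,jk,ik->i", H, HtH_inv, H)) * sigma_hat
lower, upper = mu_hat - 1.96 * se, mu_hat + 1.96 * se
print(lower[:3], upper[:3])
```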


Bayesian Methods
• Given a sampling model \Pr(\mathbf{Z} \mid \theta) and a prior \Pr(\theta) for the parameters, estimate the posterior probability
  \Pr(\theta \mid \mathbf{Z}) = \frac{\Pr(\mathbf{Z} \mid \theta) \cdot \Pr(\theta)}{\int \Pr(\mathbf{Z} \mid \theta) \cdot \Pr(\theta)\, d\theta}
  by drawing samples from it or by estimating its mean or mode.
• Differences from the frequentist approach:
  – Prior: allows for uncertainties present before seeing the data
  – Posterior: allows for uncertainties present after seeing the data
Bayesian Methods
• The posterior distribution also affords a predictive distribution for future values z^{new}:
  \Pr(z^{new} \mid \mathbf{Z}) = \int \Pr(z^{new} \mid \theta) \cdot \Pr(\theta \mid \mathbf{Z})\, d\theta
• In contrast, the maximum-likelihood approach would predict future data on the basis of \Pr(z^{new} \mid \hat{\theta}), not accounting for the uncertainty in the parameters.
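A minimal sketch of this difference, using a conjugate normal model with known variance (all numbers here are illustrative assumptions, not from the deck): the Bayesian predictive averages over posterior draws of \theta, while the plug-in approach fixes \theta at \hat{\theta}.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.normal(loc=1.0, scale=1.0, size=20)      # observed data, noise variance 1 assumed known

# Conjugate setup: prior theta ~ N(0, tau) gives a normal posterior
tau = 10.0
post_var = 1.0 / (len(z) + 1.0 / tau)
post_mean = post_var * z.sum()

# Posterior predictive: average Pr(z_new | theta) over posterior draws of theta
theta_draws = rng.normal(post_mean, np.sqrt(post_var), size=5000)
z_new_predictive = rng.normal(theta_draws, 1.0)  # one predictive draw per posterior draw

# Plug-in (maximum-likelihood) prediction uses only theta_hat = mean(z)
z_new_plugin = rng.normal(z.mean(), 1.0, size=5000)

print(z_new_predictive.std(), z_new_plugin.std())  # the predictive spread is wider than plug-in
```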
An Example
• Consider a linear expansion
  \mu(x) = \sum_{j=1}^{N} \beta_j h_j(x)
• Assume a Gaussian prior for the coefficients,
  \beta \sim N(0, \tau \Sigma),
  where the density of a p-dimensional Gaussian N(\mu, \Sigma) is
  p(x) = \frac{1}{(2\pi)^{p/2} |\Sigma|^{1/2}}\, e^{-\frac{1}{2} (x-\mu)^T \Sigma^{-1} (x-\mu)}


• The posterior distribution for \beta is also Gaussian, with mean and covariance
  \mathrm{E}(\beta \mid \mathbf{Z}) = \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{\!-1} H^T y,
  \mathrm{cov}(\beta \mid \mathbf{Z}) = \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{\!-1} \sigma^2.
• The corresponding posterior values for \mu(x) are
  \mathrm{E}(\mu(x) \mid \mathbf{Z}) = h(x)^T \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{\!-1} H^T y,
  \mathrm{cov}[\mu(x), \mu(x) \mid \mathbf{Z}] = h(x)^T \left( H^T H + \frac{\sigma^2}{\tau} \Sigma^{-1} \right)^{\!-1} h(x)\, \sigma^2.
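These posterior formulas are straightforward to sketch in code (the toy data, \Sigma = I, and the values of \tau and \sigma^2 below are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(6)
N = 50
x = np.sort(rng.uniform(0, 3, N))
y = np.sin(x) + rng.normal(scale=0.3, size=N)
H = np.vstack([x**j for j in range(4)]).T     # basis matrix

sigma2 = 0.3**2                               # assumed known noise variance
tau = 5.0                                     # prior scale in beta ~ N(0, tau * Sigma)
Sigma = np.eye(H.shape[1])

A = np.linalg.inv(H.T @ H + (sigma2 / tau) * np.linalg.inv(Sigma))
beta_post_mean = A @ H.T @ y                  # E(beta | Z)
beta_post_cov = A * sigma2                    # cov(beta | Z)

# Posterior mean and pointwise posterior variance of mu(x)
mu_post_mean = H @ beta_post_mean
mu_post_var = np.einsum("ij,jk,ik->i", H, A, H) * sigma2
print(beta_post_mean, mu_post_var[:3])
```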
Bootstrap vs Bayesian
• The bootstrap mean is an approximate posterior average.
• Simple example:
  – A single observation z is drawn from a normal distribution, z \sim N(\theta, 1)
  – Assume a normal prior for \theta: \theta \sim N(0, \tau)
  – The resulting posterior distribution is
    \theta \mid z \;\sim\; N\!\left( \frac{z}{1 + 1/\tau},\; \frac{1}{1 + 1/\tau} \right)
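A tiny simulation of this example (an illustrative sketch; the deck only states the formula): as \tau \to \infty the posterior tends to N(z, 1), which matches the parametric bootstrap distribution z^* \sim N(\hat{\theta}, 1) with \hat{\theta} = z.

```python
import numpy as np

rng = np.random.default_rng(7)
z = 1.3          # the single observation
tau = 1e6        # a very large tau approximates a noninformative prior

# Posterior draws: theta | z ~ N(z / (1 + 1/tau), 1 / (1 + 1/tau))
posterior = rng.normal(z / (1 + 1 / tau), np.sqrt(1 / (1 + 1 / tau)), size=10000)

# Parametric bootstrap draws: z* ~ N(theta_hat, 1) with theta_hat = z
bootstrap = rng.normal(z, 1.0, size=10000)

print(posterior.mean(), posterior.std())   # both sets of draws are close to N(z, 1)
print(bootstrap.mean(), bootstrap.std())
```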
Bootstrap vs Bayesian
• Three ingredients make this work:
  – The choice of a noninformative prior for \theta
  – The dependence of the log-likelihood \ell(\theta; \mathbf{Z}) on \mathbf{Z} only through the maximum-likelihood estimate \hat{\theta}, so that \ell(\theta; \mathbf{Z}) = \ell(\theta; \hat{\theta})
  – The symmetry of the log-likelihood in \theta and \hat{\theta}, i.e. \ell(\theta; \hat{\theta}) = \ell(\hat{\theta}; \theta) + \text{constant}


Bootstrap vs Bayesian
• The bootstrap distribution represents an (approximate) nonparametric, noninformative posterior distribution for our parameter.
• But this bootstrap distribution is obtained painlessly, without having to formally specify a prior and without having to sample from the posterior distribution.
• Hence we might think of the bootstrap distribution as a "poor man's" Bayes posterior. By perturbing the data, the bootstrap approximates the Bayesian effect of perturbing the parameters, and it is typically much simpler to carry out.


Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian
Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking



The EM Algorithm
• The EM algorithm for two-component Gaussian mixtures:
  – Take initial guesses \hat{\pi}, \hat{\mu}_1, \hat{\sigma}_1^2, \hat{\mu}_2, \hat{\sigma}_2^2 for the parameters.
  – Expectation step: compute the responsibilities
    \hat{\gamma}_i = \frac{ \hat{\pi}\, \phi_{\hat{\theta}_2}(y_i) }{ (1 - \hat{\pi})\, \phi_{\hat{\theta}_1}(y_i) + \hat{\pi}\, \phi_{\hat{\theta}_2}(y_i) }, \qquad i = 1, \dots, N


The EM Algorithm
  – Maximization step: compute the weighted means and variances
    \hat{\mu}_1 = \frac{ \sum_{i=1}^{N} (1 - \hat{\gamma}_i)\, y_i }{ \sum_{i=1}^{N} (1 - \hat{\gamma}_i) }, \qquad
    \hat{\sigma}_1^2 = \frac{ \sum_{i=1}^{N} (1 - \hat{\gamma}_i)(y_i - \hat{\mu}_1)^2 }{ \sum_{i=1}^{N} (1 - \hat{\gamma}_i) },
    \hat{\mu}_2 = \frac{ \sum_{i=1}^{N} \hat{\gamma}_i\, y_i }{ \sum_{i=1}^{N} \hat{\gamma}_i }, \qquad
    \hat{\sigma}_2^2 = \frac{ \sum_{i=1}^{N} \hat{\gamma}_i (y_i - \hat{\mu}_2)^2 }{ \sum_{i=1}^{N} \hat{\gamma}_i },
    and the mixing proportion \hat{\pi} = \sum_{i=1}^{N} \hat{\gamma}_i / N
  – Iterate the expectation and maximization steps until convergence.
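A compact sketch of these two steps for a one-dimensional two-component mixture follows (the data are simulated and the initialization is deliberately naive; both are illustrative assumptions):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(8)
y = np.concatenate([rng.normal(0, 1, 150), rng.normal(4, 1.5, 100)])  # simulated mixture data

# Naive initial guesses for pi, mu_1, sigma_1^2, mu_2, sigma_2^2
pi, mu1, s1, mu2, s2 = 0.5, y.min(), y.var(), y.max(), y.var()

for _ in range(100):
    # E-step: responsibilities gamma_i of component 2
    p1 = (1 - pi) * norm.pdf(y, mu1, np.sqrt(s1))
    p2 = pi * norm.pdf(y, mu2, np.sqrt(s2))
    gamma = p2 / (p1 + p2)

    # M-step: weighted means, variances, and mixing proportion
    w1, w2 = (1 - gamma).sum(), gamma.sum()
    mu1 = ((1 - gamma) * y).sum() / w1
    s1 = ((1 - gamma) * (y - mu1) ** 2).sum() / w1
    mu2 = (gamma * y).sum() / w2
    s2 = (gamma * (y - mu2) ** 2).sum() / w2
    pi = w2 / len(y)

print(pi, mu1, np.sqrt(s1), mu2, np.sqrt(s2))
```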
The EM Algorithm in General
• Also known as the Baum–Welch algorithm (in the hidden Markov model literature).
• Applicable to problems for which maximizing the log-likelihood is difficult, but which are simplified by enlarging the sample with unobserved (latent) data (data augmentation).


The EM Algorithm in General
• \mathbf{Z}: observed data, with log-likelihood \ell(\theta; \mathbf{Z})
• \mathbf{Z}^m: latent data (in our example the \Delta_i)
• \mathbf{T} = (\mathbf{Z}, \mathbf{Z}^m): complete data, with log-likelihood \ell_0(\theta; \mathbf{T})


The EM Algorithm in General
• Since
  \Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta') = \frac{ \Pr(\mathbf{Z}^m, \mathbf{Z} \mid \theta') }{ \Pr(\mathbf{Z} \mid \theta') },
  \qquad
  \Pr(\mathbf{Z} \mid \theta') = \frac{ \Pr(\mathbf{T} \mid \theta') }{ \Pr(\mathbf{Z}^m \mid \mathbf{Z}, \theta') },
  we have, in terms of log-likelihoods,
  \ell(\theta'; \mathbf{Z}) = \ell_0(\theta'; \mathbf{T}) - \ell_1(\theta'; \mathbf{Z}^m \mid \mathbf{Z})


The EM Algorithm in General
• Start with initial parameter guesses \hat{\theta}^{(0)}.
• Expectation step: at the j-th step, compute
  Q(\theta', \hat{\theta}^{(j)}) = \mathrm{E}\!\left[ \ell_0(\theta'; \mathbf{T}) \mid \mathbf{Z}, \hat{\theta}^{(j)} \right]
  as a function of the dummy argument \theta'.
• Maximization step: determine the new estimate \hat{\theta}^{(j+1)} by maximizing Q(\theta', \hat{\theta}^{(j)}) over \theta'.
• Iterate the expectation and maximization steps until convergence.
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian
Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking



Model Averaging and Stacking
• Given predictions \hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_M(x)
• Under squared-error loss, seek weights \hat{w} = (\hat{w}_1, \hat{w}_2, \dots, \hat{w}_M)
• Such that
  \hat{w} = \arg\min_{w} \mathrm{E}_{\mathcal{P}}\!\left[ Y - \sum_{m=1}^{M} w_m \hat{f}_m(x) \right]^2
• Here the input x is fixed and the N observations in \mathbf{Z} (and the target Y) are distributed according to \mathcal{P}.
Model Averaging and Stacking
• The solution is the population linear regression of Y on \hat{F}(x)^T \equiv [\hat{f}_1(x), \hat{f}_2(x), \dots, \hat{f}_M(x)], namely
  \hat{w} = \mathrm{E}_{\mathcal{P}}\!\left[ \hat{F}(x) \hat{F}(x)^T \right]^{-1} \mathrm{E}_{\mathcal{P}}\!\left[ \hat{F}(x)\, Y \right]
• The full regression has smaller error than any single model, namely
  \mathrm{E}_{\mathcal{P}}\!\left[ Y - \sum_{m=1}^{M} \hat{w}_m \hat{f}_m(x) \right]^2 \le \mathrm{E}_{\mathcal{P}}\!\left[ Y - \hat{f}_m(x) \right]^2 \quad \text{for each } m
• The population linear regression is not available, so we replace it by the linear regression over the training set.
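A bare-bones sketch of the training-set version follows, with two assumed base models (a linear and a cubic polynomial fit); in practice stacking uses cross-validated predictions, which this sketch omits for brevity.

```python
import numpy as np

rng = np.random.default_rng(9)
N = 200
x = rng.uniform(0, 3, N)
y = np.sin(x) + rng.normal(scale=0.3, size=N)

# Two illustrative base models: a linear fit and a cubic fit
f1 = np.polyval(np.polyfit(x, y, 1), x)
f2 = np.polyval(np.polyfit(x, y, 3), x)

# Stacking: regress Y on the base-model predictions over the training set
F = np.column_stack([f1, f2])                    # N x M matrix of predictions F_hat(x)
w_hat, *_ = np.linalg.lstsq(F, y, rcond=None)    # w_hat = (F^T F)^{-1} F^T y
y_stacked = F @ w_hat

# Training-set error of the stacked combination vs. each single model
print(w_hat)
print([np.mean((y - f) ** 2) for f in (f1, f2, y_stacked)])
```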
Contents
• The Bootstrap and Maximum Likelihood Methods
• Bayesian Methods
• Relationship Between the Bootstrap and Bayesian
Inference
• The EM Algorithm
• MCMC for Sampling from the Posterior
• Bagging
• Model Averaging and Stacking


