Contents

1 Introduction
2 Definition of a density
3 The Kullback-Leibler risk
4 Statistical models and families
4.1 Statistical families
4.2 Statistical models
4.3 Statistical models and true probability
5 The likelihood
5.1 Definition of the likelihood
5.2 Computation of the likelihood
5.3 Performance of the MLE in terms of Kullback-Leibler risk
6 The penalized likelihood
7 The hierarchical likelihood
7.1 Random effects models
7.2 Hierarchical likelihood
8 Akaike and likelihood cross-validation criteria
∗ This paper was accepted by Elja Arjas, Executive Editor for the Bernoulli.
D. Commenges/Statistical models and likelihoods 2
1. Introduction
Since its proposal by Fisher [16], likelihood inference has occupied a central
position in statistical inference. In some situations, modified versions of the
likelihood have been proposed. Marginal, conditional, profile and partial likelihoods have been proposed to get rid of nuisance parameters. Pseudo-likelihood and hierarchical likelihood may be used to circumvent numerical problems in the computation of the likelihood, which are mainly due to multiple integrals. Penalized likelihood has been proposed to introduce a priori knowledge of smoothness for functions, thus leading to smooth estimators. Several reviews have already been published, for instance [31], but it is nearly impossible in a single paper to describe in some detail all the types of likelihoods that have been proposed. This paper aims at describing the conventional likelihood and two of its variants: penalized and hierarchical likelihoods. The aim is not to give the properties of the estimators obtained by maximizing these likelihoods, but rather to describe these three likelihoods together with their link to the Kullback-Leibler divergence. This focus on the foundations rather than on the properties leads us first to develop some reflections and definitions about statistical models and to give a slightly extended version of the Kullback-Leibler divergence.
In section 2, we recall the definition of a density and the relationship between
a density in the sample space and for a random variable. In section 3, we give a
slightly extended version of the Kullback-Leibler divergence (making it explicit
that it also depends on a sigma-field). Section 4 gives an account of statisti-
cal models, distinguishing mere statistical families from statistical models and
defining the misspecification risk. Section 5 presents the likelihood and discusses issues about its computation and the performance of the maximum likelihood estimator in terms of Kullback-Leibler risk. In section 6, we define the
penalized likelihood and show that for a family of penalized likelihood estima-
tors there is an identical family of sieves estimators. In section 7, we describe
the hierarchical likelihood. In section 8, we briefly sketch the possible unifica-
tion of these likelihoods through a Bayesian representation that allows us to
consider the maximum (possibly penalized) likelihood estimators as maximum
a posteriori (MAP) estimators; this question, however, cannot be easily settled due to the non-invariance of the MAP under reparameterization. Finally, there is
a short conclusion.
2. Definition of a density
Consider a measurable space (S, A) and two measures µ and ν, with µ absolutely continuous with respect to ν. For G a sub-σ-field of A, the Radon-Nikodym derivative of µ with respect to ν on G, denoted dµ/dν |_G, is the G-measurable random variable such that
\[
\mu(G) = \int_G \frac{d\mu}{d\nu}\Big|_{\mathcal{G}} \, d\nu, \qquad G \in \mathcal{G}.
\]
For two probabilities P⁰ and P¹ with P¹ absolutely continuous with respect to P⁰, and H a sub-σ-field of G, the derivatives on the two σ-fields are related by
\[
\frac{dP^{1}}{dP^{0}}\Big|_{\mathcal{H}} = \mathrm{E}_{P^{0}}\left[\frac{dP^{1}}{dP^{0}}\Big|_{\mathcal{G}} \,\Big|\, \mathcal{H}\right].
\]
Consider now the case where the measurable space (Ω, F ) is the sample space
of an experiment. For the statistician (Ω, F ) is not any measurable space: it is a
space which enables us to represent real events. We shall write in bold character
a probability on (Ω, F), for instance, P¹. Let us define a random variable X, that is, a measurable function from (Ω, F) to (ℜ, B). The couple (P¹, X) induces a probability measure on (ℜ, B) defined by P¹_X(B) = P¹{X⁻¹(B)}, B ∈ B. This probability measure is called the distribution of X. If this probability measure is absolutely continuous with respect to Lebesgue (resp. counting) measure, one speaks of a continuous (resp. discrete) variable. For instance, for a continuous variable we define the density f¹_X = dP¹_X/dλ, where λ is Lebesgue measure on ℜ; this is the usual probability density function (p.d.f.). Note that the p.d.f. depends on both P¹ and X, while dP¹/dP⁰ |_{σ(X)} depends on the σ-field generated by X but not on the specific random variable X. Often in applied statistics one works only with distributions, but this may leave some problems unsolved.
Example 1. Consider the case where concentrations of CD4 lymphocytes are measured. Ω represents the set of physical concentrations that may happen. Let the random variables X and Y express the concentration in number of CD4 per mm³ and per ml respectively. Thus we have Y = 10³X. So X and Y are different, although they are informationally equivalent: for instance the events {ω : X(ω) = 400} and {ω : Y(ω) = 400000} are the same. The densities of X and Y, for the same P¹ on (Ω, F), are obviously different. So, if we look only at distributions, we shall have difficulty in rigorously defining what a model is.
3. The Kullback-Leibler risk

Many problems in statistical inference can be treated from the point of view of decision theory: estimators, for instance, are chosen to minimize some risk function. The most important risk function is based on the Kullback-Leibler divergence. Maximum likelihood estimators and the use of the Akaike criterion or likelihood cross-validation can be grounded on the Kullback-Leibler divergence. Given a probability P² absolutely continuous with respect to a probability P¹ and X a sub-σ-field of F, the loss incurred by using P² in place of P¹ is the log-likelihood ratio
\[
L^{P^{1}/P^{2}}_{\mathcal{X}} = \log \frac{dP^{1}}{dP^{2}}\Big|_{\mathcal{X}}.
\]
Its expectation E_{P¹}[L^{P¹/P²}_X] is the Kullback-Leibler risk, also called divergence [28, 29], information deviation [4] or entropy [1]. The different names of this quantity reflect its central position in statistical theory, being connected to several fields of the theory. Several notations have been used by different authors. Here we choose the Cencov [4] notation:
\[
I(P^{2} | P^{1}; \mathcal{X}) = \mathrm{E}_{P^{1}}\big[L^{P^{1}/P^{2}}_{\mathcal{X}}\big].
\]
For instance, for an observation O of a continuous variable X right-censored at C, the risk takes the form
\[
I(P^{2} | P^{1}; \mathcal{O}) = \int_0^C \log \frac{f^{1}_X(x)}{f^{2}_X(x)} \, f^{1}_X(x)\,dx + \log \frac{S^{1}_X(C)}{S^{2}_X(C)} \, S^{1}_X(C),
\]
where S¹_X(·) and S²_X(·) are the survival functions of X under P¹ and P² respectively.
5. The likelihood
For a statistical model (P^θ; θ ∈ Θ) and a reference probability P⁰, the Kullback-Leibler risk with respect to the true probability P* decomposes as
\[
I(P^{\theta} | P^{*}; \mathcal{O}_i) = I(P^{0} | P^{*}; \mathcal{O}_i) - \mathrm{E}_{P^{*}}\big[L^{P^{\theta}/P^{0}}_{\mathcal{O}_i}\big].
\]
Minimizing I(P^θ | P*; O_i) is therefore equivalent to maximizing E_{P*}(L^{P^θ/P⁰}_{O_i}). We cannot compute E_{P*}(L^{P^θ/P⁰}_{O_i}), but we can estimate it. The law of large numbers tells us that, when n → ∞:
\[
n^{-1} \sum_{i=1}^{n} L^{P^{\theta}/P^{0}}_{\mathcal{O}_i} \to \mathrm{E}_{P^{*}}\big[L^{P^{\theta}/P^{0}}_{\mathcal{O}_i}\big].
\]
Thus, we may maximize the estimator on the left-hand side, which is the loglikelihood L^{P^θ/P⁰}_{Ō_n} divided by n. Maximizing the loglikelihood is equivalent to maximizing the likelihood function, which is the function θ → L^{P^θ/P⁰}_{Ō_n}. In conclusion, the maximum likelihood estimator (MLE) can be considered as an estimator that minimizes a natural estimator of the Kullback-Leibler risk.
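This connection between maximizing the average loglikelihood and minimizing the Kullback-Leibler risk can be checked numerically. The sketch below is our own illustration, not taken from the paper: the true P* is exponential with rate 2, the model is a one-parameter exponential family, and the MLE obtained by maximizing the average loglikelihood turns out to be nearly Kullback-Leibler optimal.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
lam_star = 2.0                         # rate of the true distribution P*
x = rng.exponential(1 / lam_star, size=10_000)

def avg_loglik(theta):
    # natural estimator of E_{P*}[L^{P^theta/P^0}_{O_i}] (up to the P^0 constant)
    return np.mean(np.log(theta) - theta * x)

# the MLE maximizes the average loglikelihood; analytically it is 1/mean(x)
res = minimize_scalar(lambda t: -avg_loglik(t), bounds=(1e-6, 50.0),
                      method="bounded")
theta_hat = res.x

def kl_exp(lam1, lam2):
    # closed-form Kullback-Leibler divergence between Exp(lam1) and Exp(lam2)
    return np.log(lam1 / lam2) + lam2 / lam1 - 1.0

print(theta_hat, kl_exp(lam_star, theta_hat))
```

With 10 000 observations the numerical maximizer agrees with the analytic MLE 1/x̄, and its divergence to P* is close to zero.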
We expect good behavior of the MLE θ̂ when the law of large numbers can be applied and when the number of parameters is not too large. Some cases of unsatisfactory behavior of the MLE are reported for instance in [30]. The properties of the MLE may not be satisfactory when the number of parameters is too large, and especially when it increases with n, as in an example given by Neyman and Scott [36]. In this example (X_i, Y_i), i = 1, …, n, are all independent random variables with X_i and Y_i both normal N(ξ_i, σ²). It is readily seen that not only the MLEs of the ξ_i, i = 1, …, n, but also the MLE of σ² is inconsistent. This example is typical of problems where there are individual parameters (a ξ_i for each i), so that in fact the statistical model changes with n. Such situations are better approached by random effects models.
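The Neyman-Scott inconsistency is easy to reproduce by simulation. In the sketch below (our illustration; the ξ_i are drawn from a normal distribution simply to generate data) the MLE of σ² converges to σ²/2 rather than σ².

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50_000, 1.0
xi = rng.normal(0.0, 2.0, size=n)            # one individual mean per pair
x = rng.normal(xi, np.sqrt(sigma2))
y = rng.normal(xi, np.sqrt(sigma2))

# MLEs: xi_hat_i = (x_i + y_i)/2 and
# sigma2_hat = (1/2n) * sum_i [(x_i - xi_hat_i)^2 + (y_i - xi_hat_i)^2]
xi_hat = (x + y) / 2
sigma2_hat = np.mean(((x - xi_hat) ** 2 + (y - xi_hat) ** 2) / 2)

print(sigma2_hat)   # close to 0.5 = sigma2/2, not 1.0
```

Since x_i − ξ̂_i = (x_i − y_i)/2, each squared residual has expectation σ²/2, which is exactly the inconsistency pointed out by Neyman and Scott.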
To assess the performance of the MLE we can use a risk which is an extended version of the Kullback-Leibler risk with respect to P*:
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big[L^{P^{*}/P^{\hat\theta}}_{\mathcal{O}_i}\big].
\]
The difference with the classical Kullback-Leibler risk is that here P^θ̂ is random: EKL(P^θ̂, O_i) is the expectation of the Kullback-Leibler divergence between P^θ̂ and P*. In parametric models (that is, Θ is a subset of ℜ^p), it can be shown [9, 35] that
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big[L^{P^{*}/P^{\theta_{\mathrm{opt}}}}_{\mathcal{O}_i}\big] + \tfrac{1}{2} n^{-1} \mathrm{Tr}(I^{-1}J) + o(n^{-1}), \tag{5.2}
\]
where I is the information matrix and J is the variance of the score, both computed at θ_opt; the symbol Tr denotes the trace. This can be nicely interpreted by saying that the risk EKL(P^θ̂, O_i) is the sum of the misspecification risk E_{P*}[L^{P*/P^{θ_opt}}_{O_i}] and the statistical risk (1/2) n⁻¹ Tr(I⁻¹J). Note in passing that if the model Π is well specified we have E_{P*}[L^{P*/P^{θ_opt}}_{O_i}] = 0 and I = J, and thus
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \frac{p}{2n} + o(n^{-1}).
\]
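The well-specified case EKL ≈ p/(2n) can be verified by Monte Carlo. The following sketch (our illustration, not from the paper) fits the two-parameter model N(µ, σ²) to N(0, 1) data, computes the Kullback-Leibler divergence of each fit in closed form, and averages over replications; n · EKL should be close to p/2 = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 4000
p = 2                                    # parameters: (mu, sigma^2)

def kl_to_fit(mu, s2):
    # closed-form KL( N(0,1) || N(mu, s2) )
    return 0.5 * np.log(s2) + (1 + mu ** 2) / (2 * s2) - 0.5

kls = []
for _ in range(reps):
    x = rng.normal(size=n)
    mu_hat, s2_hat = x.mean(), x.var()   # MLEs (np.var divides by n)
    kls.append(kl_to_fit(mu_hat, s2_hat))

print(np.mean(kls) * n)                  # should be close to p/2 = 1
```

Here the misspecification risk is zero, so the whole risk is the statistical risk p/(2n); the simulation recovers the factor p/2 up to Monte Carlo error.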
6. The penalized likelihood

There is a large literature on the topic: see [13, 14, 17, 18, 21, 25, 37, 44] among others. Penalized likelihood is useful when the statistical model is too large to obtain good estimators, while conventional parametric models appear too rigid. A simple form of the penalized log-likelihood is
\[
pl_{\kappa}(\theta) = L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n} - \kappa J(\theta),
\]
where J(θ) is a measure of dislike of θ and κ weights the influence of this measure on the objective function. A classical example is when θ = (α(·), β), where α(·) is a function and β a finite-dimensional parameter.
In this case J(θ) measures the irregularity of the function α(·). The maximum penalized likelihood estimator (MpLE) θ^{pl}_κ is the value of θ which maximizes pl_κ(θ). κ is often called a smoothing coefficient in cases where J(θ) is a measure of the irregularity of a function; more generally, we will call it a meta-parameter. We may generalize the penalized log-likelihood by replacing κJ(θ) by J(θ, κ), where κ could be multidimensional. When κ varies, this defines a family of estimators (θ^{pl}_κ; κ ≥ 0). κ may be chosen by cross-validation (see section 8).
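A minimal concrete sketch of pl_κ (our illustration, not from the paper): a Gaussian log-likelihood for noisy observations of a function α(·), with J(θ) taken as the summed squared second differences of α, so that κ acts as a smoothing coefficient. For this quadratic problem the MpLE solves a linear system.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 100)
alpha_true = np.sin(2 * np.pi * t)              # smooth function to recover
y = alpha_true + rng.normal(0.0, 0.4, size=t.size)

n = t.size
D = np.diff(np.eye(n), n=2, axis=0)             # (n-2) x n second-difference matrix

def mple(kappa):
    # maximizer of  -0.5 * ||y - a||^2 - kappa * ||D a||^2
    # (gradient = 0  gives the linear system below)
    return np.linalg.solve(np.eye(n) + 2 * kappa * D.T @ D, y)

alpha_hat = mple(kappa=50.0)
print(np.mean((alpha_hat - alpha_true) ** 2),   # smoothed estimate
      np.mean((y - alpha_true) ** 2))           # raw noisy observations
```

The penalized estimate has a much smaller mean squared error than the raw data: the penalty removes the rough (high-frequency) part of the noise while barely shrinking the smooth signal.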
There is another way of dealing with statistical models that might be too large: the so-called sieve estimators [40]. Sieves are based on a sequence of approximating spaces. For instance, rather than working with a functional parameter we may restrict to spaces where the function is represented on a basis (e.g. a spline basis). Here we consider a special sieves approach where the approximating spaces may be functional spaces. Consider a family of models (P_ν)_{ν≥0} where
\[
\mathcal{P}_{\nu} = (P^{\theta};\ \theta \in \Theta : J(\theta) \le \nu).
\]
For fixed ν, the MLE θ̂_ν solves the constrained maximization problem:
\[
\max L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n} \quad \text{subject to } J(\theta) \le \nu. \tag{6.1}
\]
When ν varies this defines a family of sieve estimators (θ̂_ν; ν ≥ 0). θ̂_ν maximizes the Lagrangian L^{P^θ/P⁰}_{Ō_n} − λ[J(θ) − ν] for some value of λ. The Lagrangian superficially looks like the penalized log-likelihood function, but an important difference is that here the Lagrange multiplier λ is not fixed and is part of the solution. If the problem is convex the Karush-Kuhn-Tucker conditions are necessary and sufficient. Here these conditions are
\[
J(\theta) \le \nu; \qquad \lambda \ge 0; \qquad \frac{\partial L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n}}{\partial \theta} - \lambda \frac{\partial J(\theta)}{\partial \theta} = 0. \tag{6.2}
\]
It is clear that when the observation Ō_n is fixed, the function κ → J(θ^{pl}_κ) is a monotone decreasing function. Consider the case where this function is continuous and unbounded (when κ → 0). Then for each fixed ν there exists a value, say κ_ν, such that J(θ^{pl}_{κ_ν}) = ν. Note that this value depends on Ō_n. Now, it is easy to see that θ^{pl}_{κ_ν} satisfies the Karush-Kuhn-Tucker conditions (6.2), with λ = κ_ν. Thus, if we can find the correct κ_ν we can solve the constrained maximization problem by maximizing the corresponding penalized likelihood. However, the search for κ_ν is not simple, and we must remember that the relationship between ν and κ_ν depends on Ō_n. A simpler result, deriving from the previous considerations, is:
Lemma 6.1 (Penalized and sieves estimators). The families (P^{θ^{pl}_κ}; κ ≥ 0) and (P^{θ̂_ν}; ν ≥ 0) are identical families of estimators.
The consequence is that since it is easier to solve the unconstrained maximiza-
tion problem involved in the penalized likelihood approach, one should apply
this approach in applications. On the other hand, it may be easier to develop
asymptotic results for sieve estimators (because θ̂ν is a MLE) than for penal-
ized likelihood estimators. One should be able to derive properties of penalized
likelihood estimators from those of sieve estimators.
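The correspondence of Lemma 6.1 can be observed numerically. The sketch below (our illustration; a Gaussian log-likelihood with J(θ) = ‖θ‖², i.e. ridge regression) computes the penalized estimator for a fixed κ, sets ν = J(θ^{pl}_κ), and solves the constrained problem (6.1) directly: the two estimators coincide.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=60)

def negloglik(theta):                 # Gaussian -loglikelihood up to constants
    return 0.5 * np.sum((y - X @ theta) ** 2)

def J(theta):                         # the "dislike" measure J(theta)
    return np.sum(theta ** 2)

# penalized estimator for fixed kappa (closed form: ridge regression)
kappa = 3.0
theta_pl = np.linalg.solve(X.T @ X + 2 * kappa * np.eye(4), X.T @ y)

# sieve/constrained estimator with nu = J(theta_pl): max loglik s.t. J <= nu
nu = J(theta_pl)
res = minimize(negloglik, np.zeros(4), method="SLSQP",
               constraints=[NonlinearConstraint(J, -np.inf, nu)])

print(np.max(np.abs(res.x - theta_pl)))
```

The maximal coordinate-wise difference between the two solutions is numerically negligible, and the Lagrange multiplier returned by the constrained solver plays the role of κ.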
7. The hierarchical likelihood

7.1. Random effects models

Random effects models have been studied in both the linear [43] and non-linear [10] cases. While in the linear case computation of the integrals involved in the likelihood is analytical, in the non-linear case it is not. The numerical computation of these multiple integrals of dimension K is a daunting task if K is larger than 2 or 3, especially if the likelihood given the random effects is not itself very easy to compute; this is the curse of dimensionality.
7.2. Hierarchical likelihood

The complete-data loglikelihood, where Ḡ_n denotes the σ-field generated by the observations together with the random effects (b_i), can be decomposed as
\[
L^{P^{\theta}/P^{0}}_{\bar{\mathcal{G}}_n} = L^{P^{\theta}/P^{0}}_{\bar{\mathcal{G}}_n | b} + L^{P^{\theta}/P^{0}}_{b};
\]
the last term can be written Σ_{i=1}^{n} log f_b(b_i; τ). None of these likelihoods can be computed from (is measurable for) Ō_n. The h-loglikelihood function is the function γ → L^{P^γ/P⁰}_{Ḡ_n}, where γ = (θ, b) is the set of all the "parameters". Thus, estimators (here denoted MHLE) of both θ and b can be obtained by maximizing the h-loglikelihood:
\[
hl_{\tau}(\gamma) = L^{P^{\gamma}/P^{0}}_{\bar{\mathcal{O}}_n} + \sum_{i=1}^{n} \log f_b(b_i; \tau).
\]
Often the loglikelihood can be written L^{P^γ/P⁰}_{Ō_n} = Σ_{i=1}^{n} log f(Y_i; θ, b_i). However, this formulation is not completely general, because there are interesting cases where observations of the Y_i are censored. So, we prefer writing the loglikelihood as L^{P^γ/P⁰}_{Ō_n}. We note γ̂_τ = (θ̂_τ, b̂_τ) the maximum h-likelihood estimators of the parameters for given τ; the latter (meta-)parameter can be estimated by profile likelihood. The main interest of this approach is that there is no need to compute multiple integrals. This problem is replaced by that of maximizing hl_τ(γ) over γ; that is, the problem is now the large number of parameters that must be estimated, which is equal to m + nK. This may be large, but special algorithms can be used for generalized linear models.
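A minimal sketch of h-likelihood maximization (our illustration; a hypothetical Poisson random-intercept model with τ² treated as a known meta-parameter): the joint objective over γ = (µ, b_1, …, b_n) is the conditional Poisson loglikelihood plus the Gaussian log-density of the random effects, and no integral ever appears.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(5)
n_grp, n_rep = 30, 5
tau2 = 0.25                                        # known variance of b_i
b_true = rng.normal(0.0, np.sqrt(tau2), size=n_grp)
mu_true = 1.0
y = rng.poisson(np.exp(mu_true + b_true)[:, None], size=(n_grp, n_rep))

def neg_hloglik(gamma):
    mu, b = gamma[0], gamma[1:]
    eta = mu + b[:, None]                          # linear predictor
    loglik = np.sum(y * eta - np.exp(eta) - gammaln(y + 1))
    log_fb = -0.5 * np.sum(b ** 2) / tau2 \
             - 0.5 * n_grp * np.log(2 * np.pi * tau2)
    return -(loglik + log_fb)                      # -hl_tau(gamma)

res = minimize(neg_hloglik, np.zeros(1 + n_grp), method="L-BFGS-B")
mu_hat, b_hat = res.x[0], res.x[1:]
print(mu_hat)
```

The dimension of the problem is m + nK = 1 + 30 parameters; the fixed effect is recovered close to its true value and the estimated b̂_i track the simulated random effects (with the shrinkage, and possible small bias, discussed below).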
Therneau and Grambsch [41] used the same approach for fitting frailty models, calling it a penalized likelihood. It may superficially look like the penalized quasi-likelihood of Breslow and Clayton [2], but it is not the same thing. There
is a link with the more conventional penalized likelihood for estimating smooth
functions discussed in section 6. The h-likelihood can be considered as a penal-
ized likelihood but with two important differences relative to the conventional
one: (i) the problem is parametric; (ii) the number of parameters grows with n.
Commenges et al. [9] have proved that the maximum h-likelihood estimators for
the fixed parameters are M-estimators [42]. Thus, under some regularity con-
ditions they have an asymptotic normal distribution. However, this asymptotic
distribution is not in general centered on the true parameter values, so that the
estimators are biased. In practice the bias can be negligible so that this approach
can be interesting in some situations due to its relative numerical simplicity.
8. Akaike and likelihood cross-validation criteria

An important issue is the choice between different estimators. Two typical situations are: (i) choice of MLEs in different models; (ii) choice of MpLEs with different penalties. If we consider two models Π and Π′ we get two estimators P^θ̂ and P^γ̂ of the probability P*, and we may wish to assess which is better. This is the "model choice" issue. A penalized likelihood function produces a family of estimators (P^{θ^{pl}_κ}; κ ≥ 0), and we may wish to choose the best. Here, what we call "the best" estimator is the estimator that minimizes some risk function; in both cases we can use the extended version of the Kullback-Leibler risk:
\[
EKL(P^{\hat\theta}; \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big(L^{P^{*}/P^{\hat\theta}}_{\mathcal{O}_i}\big).
\]
Since P* is unknown we can first work with E_{P*}(L^{P⁰/P^θ̂}_{O_i}), which is equal, up to a constant, to EKL(P^θ̂; O_i). Second, we can, as usual, replace the expectation under P* by the expectation under the empirical distribution. For parametric models, Akaike [1] has shown that an estimator of E_{P*}(L^{P⁰/P^θ̂}_{O_i}) is −n⁻¹(L^{P^θ̂/P⁰}_{Ō_n} − p), and the Akaike criterion (AIC) can be deduced by multiplying this quantity by 2n:
\[
AIC = -2 L^{P^{\hat\theta}/P^{0}}_{\bar{\mathcal{O}}_n} + 2p.
\]
Other criteria have been proposed for model choice; for more detail about the Akaike and other criteria we refer to [3, 27, 35]. Here, we pursue Akaike's idea of estimating the Kullback-Leibler risk. It is clear that the absolute risk itself cannot in general be estimated. However, the difference of risks between two estimators in parametric models, ∆(P^θ̂, P^γ̂) = EKL(P^θ̂; O_i) − EKL(P^γ̂; O_i), can be estimated by the statistic D(P^θ̂, P^γ̂) = (1/2n)[AIC(P^θ̂) − AIC(P^γ̂)], and a more refined analysis of the difference of risks can be developed, as in [9].
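The statistic D can be made concrete on a toy example (our illustration, not from the paper): data generated by a linear model, compared fits of a constant and of a linear polynomial. The constant involving P⁰ cancels in the AIC difference, so plain Gaussian loglikelihoods suffice.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)   # the true P* is linear

def aic(deg):
    # Gaussian ML fit of a degree-`deg` polynomial; p = deg + 2 parameters
    # (deg + 1 coefficients plus the residual variance)
    X = np.vander(x, deg + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.mean((y - X @ beta) ** 2)              # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return -2 * loglik + 2 * (deg + 2)

# D estimates the difference of Kullback-Leibler risks between the two fits;
# a negative value favors the first (linear) model
D = (aic(1) - aic(0)) / (2 * n)
print(D)
```

Here D is clearly negative: the linear fit has a much smaller estimated Kullback-Leibler risk than the constant fit, as the data-generating mechanism suggests.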
The leave-one-out likelihood cross-validation criterion can also be considered as a possible "estimator", up to a constant, of EKL [7]. It is defined as:
\[
LCV(P^{\hat\theta_n}; \mathcal{O}_{n+1}) = -\frac{1}{n} \sum_{i=1}^{n} L^{P^{\hat\theta(\bar{\mathcal{O}}_{n|i})}/P^{0}}_{\mathcal{O}_i},
\]
where Ō_{n|i} = ∨_{j≠i} O_j and O_{n+1} is another i.i.d. replicate of O_i. The difference of the LCV values of two estimators then provides an estimator of the difference of their risks. The advantage of LCV is that it can be used for comparing smooth estimators in nonparametric models, and in particular it can be used for choosing the penalty weight in penalized likelihood. A disadvantage is the computational burden, but a general approximation formula has been given ([7, 37]):
\[
LCV \approx -n^{-1}\Big[L^{P^{\hat\theta}/P^{0}}_{\bar{\mathcal{O}}_n} - \mathrm{Tr}\big(H_{pl_{\kappa}}^{-1} H_{L_{\bar{\mathcal{O}}_n}}\big)\Big],
\]
where H_{L_{Ō_n}} and H_{pl_κ} are the Hessians of the loglikelihood and of the penalized loglikelihood respectively. This expression looks like an AIC criterion, and there are arguments for interpreting Tr[H_{pl_κ}^{-1} H_{L_{Ō_n}}] as the model degrees of freedom.
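This approximation can be sketched on the Gaussian smoothing example of section 6 (our illustration; the grid of κ values and the unit working variance are arbitrary assumptions). For that model the Hessians of the negative loglikelihood and negative penalized loglikelihood are I and I + 2κDᵀD, so the trace term, i.e. the model degrees of freedom, is computable exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)
alpha_true = np.sin(2 * np.pi * t)
y = alpha_true + rng.normal(0.0, 0.4, size=t.size)
n = t.size
D = np.diff(np.eye(n), n=2, axis=0)          # second-difference penalty matrix
P = D.T @ D

def lcv_approx(kappa):
    # Hessian of the negative penalized loglikelihood (working variance = 1)
    A = np.eye(n) + 2 * kappa * P
    alpha_hat = np.linalg.solve(A, y)
    loglik = -0.5 * np.sum((y - alpha_hat) ** 2)
    df = np.trace(np.linalg.inv(A))          # Tr(H_pl^{-1} H_loglik), H_loglik = I
    return -(loglik - df) / n

kappas = 10.0 ** np.arange(-2, 4)            # grid for the meta-parameter
best = min(kappas, key=lcv_approx)
print(best)
```

Small κ is rejected because the degrees-of-freedom term is close to n, and large κ trades a modest increase in residual error against a drastic reduction of the trace term; the criterion selects a substantial amount of smoothing on this grid.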
One important issue is the relationship between the three likelihoods considered here and the Bayesian approach. The question arises because it seems that these likelihoods can be given a Bayesian representation, the maximum (possibly penalized) likelihood estimators being viewed as maximum a posteriori (MAP) estimators.
Conclusion
Acknowledgements
I would like to thank Anne Gégout-Petit for helpful comments on the manuscript.
References
[38] Rubin, D.B. (1976). Inference and missing data. Biometrika 63, 581–592. MR0455196
[39] Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J. Roy. Statist. Soc. B 71, 1–35.
[40] Shen, X. (1997). On methods of sieves and penalization. Ann. Statist. 25, 2555–2591. MR1604416
[41] Therneau, T.M. and Grambsch, P.M. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer. MR1774977
[42] van der Vaart, A. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
[43] Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer. MR1880596
[44] Wahba, G. (1983). Bayesian "Confidence Intervals" for the Cross-Validated Smoothing Spline. J. Roy. Statist. Soc. B 45, 133–150. MR0701084
[45] Williams, D. (1991). Probability with Martingales. Cambridge University Press. MR1155402