Contents

1 Introduction
2 Definition of a density
3 The Kullback-Leibler risk
4 Statistical models and families
4.1 Statistical families
4.2 Statistical models
4.3 Statistical models and true probability
5 The likelihood
5.1 Definition of the likelihood
5.2 Computation of the likelihood
5.3 Performance of the MLE in terms of Kullback-Leibler risk
6 The penalized likelihood
7 The hierarchical likelihood
7.1 Random effects models
7.2 Hierarchical likelihood
8 Akaike and likelihood cross-validation criteria
∗ This paper was accepted by Elja Arjas, Executive Editor for the Bernoulli.
D. Commenges/Statistical models and likelihoods 2
1. Introduction
Since its proposal by Fisher [16], likelihood inference has occupied a central
position in statistical inference. In some situations, modified versions of the
likelihood have been proposed. Marginal, conditional, profile and partial likelihoods have been proposed to get rid of nuisance parameters. Pseudo-likelihood and hierarchical likelihood may be used to circumvent numerical problems in the computation of the likelihood, which are mainly due to multiple integrals. Penalized likelihood has been proposed to introduce a priori knowledge of smoothness for functions, thus leading to smooth estimators. Several reviews have already been published, for instance [31], but it is nearly impossible in a single paper to describe in some detail all the types of likelihoods that have been proposed. This paper aims at describing the conventional likelihood and two of its variants: penalized and hierarchical likelihoods. The aim is not to give the properties of the estimators obtained by maximizing these likelihoods, but rather to describe these three likelihoods together with their link to the Kullback-Leibler divergence. This focus on the foundations rather than on the properties leads us first to develop some reflections and definitions about statistical models and to give a slightly extended version of the Kullback-Leibler divergence.
In section 2, we recall the definition of a density and the relationship between
a density in the sample space and for a random variable. In section 3, we give a
slightly extended version of the Kullback-Leibler divergence (making it explicit
that it also depends on a sigma-field). Section 4 gives an account of statisti-
cal models, distinguishing mere statistical families from statistical models and
defining the misspecification risk. Section 5 presents the likelihood and discusses issues about its computation and the performance of the maximum likelihood estimator in terms of Kullback-Leibler risk. In section 6, we define the
penalized likelihood and show that for a family of penalized likelihood estima-
tors there is an identical family of sieves estimators. In section 7, we describe
the hierarchical likelihood. In section 8, we briefly sketch the possible unifica-
tion of these likelihoods through a Bayesian representation that allows us to
consider the maximum (possibly penalized) likelihood estimators as maximum
a posteriori (MAP) estimators; this question, however, cannot be easily settled due to the non-invariance of the MAP under reparameterization. Finally, there is
a short conclusion.
2. Definition of a density
Consider a measurable space (S, A) and two measures µ and ν, with µ absolutely continuous with respect to ν. For G a sub-σ-field of A, the Radon-Nikodym derivative of µ with respect to ν on G, denoted dµ/dν |_G, is the G-measurable random variable such that
\[
\mu(G) = \int_G \frac{d\mu}{d\nu}\Big|_{\mathcal{G}} \, d\nu, \qquad G \in \mathcal{G}.
\]
For two probabilities P⁰ and P¹ with P¹ absolutely continuous with respect to P⁰, and H a sub-σ-field of G, the derivatives on the two σ-fields are related by
\[
\frac{dP^{1}}{dP^{0}}\Big|_{\mathcal{H}} = \mathrm{E}_{P^{0}}\left[\frac{dP^{1}}{dP^{0}}\Big|_{\mathcal{G}} \,\Big|\, \mathcal{H}\right].
\]
Consider now the case where the measurable space (Ω, F ) is the sample space
of an experiment. For the statistician (Ω, F ) is not any measurable space: it is a
space which enables us to represent real events. We shall write in bold character
a probability on (Ω, F), for instance, P¹. Let us define a random variable X, that is, a measurable function from (Ω, F) to (ℜ, B). The couple (P¹, X) induces a probability measure on (ℜ, B) defined by P¹_X(B) = P¹{X⁻¹(B)}, B ∈ B. This probability measure is called the distribution of X. If this probability measure is absolutely continuous with respect to Lebesgue (resp. counting) measure, one speaks of a continuous (resp. discrete) variable. For instance, for a continuous variable we define the density f¹_X = dP¹_X/dλ, where λ is Lebesgue measure on ℜ; this is the usual probability density function (p.d.f.). Note that the p.d.f. depends on both P¹ and X, while dP¹/dP⁰ |_{σ(X)} depends on the σ-field generated by X but not on the specific random variable X. Often in applied statistics one works only with distributions, but this may leave some problems unsolved.
Example 1. Consider the case where concentrations of CD4 lymphocytes are measured. Ω represents the set of physical concentrations that may happen. Let the random variables X and Y express the concentration in number of CD4 per mm³ and per ml respectively. Thus we have Y = 10³X. So X and Y are different, although they are informationally equivalent: for instance the events {ω : X(ω) = 400} and {ω : Y(ω) = 400000} are the same. The densities of X and Y, for the same P¹ on (Ω, F), are obviously different. So, if we look only at distributions, we shall have difficulty in rigorously defining what a model is.
3. The Kullback-Leibler risk

Many problems in statistical inference can be treated from the point of view of decision theory: estimators, for instance, are chosen to minimize some risk function. The most important risk function is based on the Kullback-Leibler divergence. Maximum likelihood estimators and the use of the Akaike criterion or likelihood cross-validation can be grounded on the Kullback-Leibler divergence. Given a probability P² absolutely continuous with respect to a probability P¹ and X a sub-σ-field of F, the loss incurred by using P² in place of P¹ is the log-likelihood ratio
\[
L^{P^{1}/P^{2}}_{\mathcal{X}} = \log \frac{dP^{1}}{dP^{2}}\Big|_{\mathcal{X}}.
\]
Its expectation E_{P¹}[L^{P¹/P²}_X] is the Kullback-Leibler risk, also called divergence [28, 29], information deviation [4] or entropy [1]. The different names of this quantity reflect its central position in statistical theory, being connected to several fields of the theory. Several notations have been used by different authors. Here we choose the Cencov [4] notation:
\[
I(P^{2} | P^{1}; \mathcal{X}) = \mathrm{E}_{P^{1}}\big[L^{P^{1}/P^{2}}_{\mathcal{X}}\big].
\]
For instance, for an observation O of a continuous variable X right-censored at C, the risk takes the form
\[
I(P^{2} | P^{1}; \mathcal{O}) = \int_0^C \log \frac{f^{1}_X(x)}{f^{2}_X(x)} \, f^{1}_X(x)\,dx + \log \frac{S^{1}_X(C)}{S^{2}_X(C)} \, S^{1}_X(C),
\]
where S¹_X(·) and S²_X(·) are the survival functions of X under P¹ and P² respectively.
5. The likelihood
For a statistical model (P^θ; θ ∈ Θ) and a reference probability P⁰, the Kullback-Leibler risk with respect to the true probability P* decomposes as
\[
I(P^{\theta} | P^{*}; \mathcal{O}_i) = I(P^{0} | P^{*}; \mathcal{O}_i) - \mathrm{E}_{P^{*}}\big[L^{P^{\theta}/P^{0}}_{\mathcal{O}_i}\big].
\]
Minimizing I(P^θ | P*; O_i) is therefore equivalent to maximizing E_{P*}(L^{P^θ/P⁰}_{O_i}). We cannot compute E_{P*}(L^{P^θ/P⁰}_{O_i}), but we can estimate it. The law of large numbers tells us that, when n → ∞:
\[
n^{-1} \sum_{i=1}^{n} L^{P^{\theta}/P^{0}}_{\mathcal{O}_i} \to \mathrm{E}_{P^{*}}\big[L^{P^{\theta}/P^{0}}_{\mathcal{O}_i}\big].
\]
Thus, we may maximize the estimator on the left-hand side, which is the loglikelihood L^{P^θ/P⁰}_{Ō_n} divided by n. Maximizing the loglikelihood is equivalent to maximizing the likelihood function, which is the function θ → L^{P^θ/P⁰}_{Ō_n}. In conclusion, the maximum likelihood estimator (MLE) can be considered as an estimator that minimizes a natural estimator of the Kullback-Leibler risk.
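This connection between maximizing the average loglikelihood and minimizing the Kullback-Leibler risk can be checked numerically. The sketch below is our own illustration, not taken from the paper: the true P* is exponential with rate 2, the model is a one-parameter exponential family, and the MLE obtained by maximizing the average loglikelihood turns out to be nearly Kullback-Leibler optimal.

```python
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
lam_star = 2.0                         # rate of the true distribution P*
x = rng.exponential(1 / lam_star, size=10_000)

def avg_loglik(theta):
    # natural estimator of E_{P*}[L^{P^theta/P^0}_{O_i}] (up to the P^0 constant)
    return np.mean(np.log(theta) - theta * x)

# the MLE maximizes the average loglikelihood; analytically it is 1/mean(x)
res = minimize_scalar(lambda t: -avg_loglik(t), bounds=(1e-6, 50.0),
                      method="bounded")
theta_hat = res.x

def kl_exp(lam1, lam2):
    # closed-form Kullback-Leibler divergence between Exp(lam1) and Exp(lam2)
    return np.log(lam1 / lam2) + lam2 / lam1 - 1.0

print(theta_hat, kl_exp(lam_star, theta_hat))
```

With 10 000 observations the numerical maximizer agrees with the analytic MLE 1/x̄, and its divergence to P* is close to zero.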
We expect good behavior of the MLE θ̂ when the law of large numbers can be applied and when the number of parameters is not too large. Some cases of unsatisfactory behavior of the MLE are reported for instance in [30]. The properties of the MLE may not be satisfactory when the number of parameters is too large, and especially when it increases with n, as in an example given by Neyman and Scott [36]. In this example (X_i, Y_i), i = 1, …, n, are all independent random variables with X_i and Y_i both normal N(ξ_i, σ²). It is readily seen that not only the MLEs of the ξ_i, i = 1, …, n, but also the MLE of σ² is inconsistent. This example is typical of problems where there are individual parameters (a ξ_i for each i), so that in fact the statistical model changes with n. Such situations are better approached by random effects models.
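The Neyman-Scott inconsistency is easy to reproduce by simulation. In the sketch below (our illustration; the ξ_i are drawn from a normal distribution simply to generate data) the MLE of σ² converges to σ²/2 rather than σ².

```python
import numpy as np

rng = np.random.default_rng(1)
n, sigma2 = 50_000, 1.0
xi = rng.normal(0.0, 2.0, size=n)            # one individual mean per pair
x = rng.normal(xi, np.sqrt(sigma2))
y = rng.normal(xi, np.sqrt(sigma2))

# MLEs: xi_hat_i = (x_i + y_i)/2 and
# sigma2_hat = (1/2n) * sum_i [(x_i - xi_hat_i)^2 + (y_i - xi_hat_i)^2]
xi_hat = (x + y) / 2
sigma2_hat = np.mean(((x - xi_hat) ** 2 + (y - xi_hat) ** 2) / 2)

print(sigma2_hat)   # close to 0.5 = sigma2/2, not 1.0
```

Since x_i − ξ̂_i = (x_i − y_i)/2, each squared residual has expectation σ²/2, which is exactly the inconsistency pointed out by Neyman and Scott.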
To assess the performance of the MLE we can use a risk which is an extended version of the Kullback-Leibler risk with respect to P*:
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big[L^{P^{*}/P^{\hat\theta}}_{\mathcal{O}_i}\big].
\]
The difference with the classical Kullback-Leibler risk is that here P^θ̂ is random: EKL(P^θ̂, O_i) is the expectation of the Kullback-Leibler divergence between P^θ̂ and P*. In parametric models (that is, Θ is a subset of ℜ^p), it can be shown [9, 35] that
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big[L^{P^{*}/P^{\theta_{\mathrm{opt}}}}_{\mathcal{O}_i}\big] + \tfrac{1}{2} n^{-1} \mathrm{Tr}(I^{-1}J) + o(n^{-1}), \tag{5.2}
\]
where I is the information matrix and J is the variance of the score, both computed at θ_opt; the symbol Tr denotes the trace. This can be nicely interpreted by saying that the risk EKL(P^θ̂, O_i) is the sum of the misspecification risk E_{P*}[L^{P*/P^{θ_opt}}_{O_i}] and the statistical risk (1/2) n⁻¹ Tr(I⁻¹J). Note in passing that if the model Π is well specified we have E_{P*}[L^{P*/P^{θ_opt}}_{O_i}] = 0 and I = J, and thus
\[
EKL(P^{\hat\theta}, \mathcal{O}_i) = \frac{p}{2n} + o(n^{-1}).
\]
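The well-specified case EKL ≈ p/(2n) can be verified by Monte Carlo. The following sketch (our illustration, not from the paper) fits the two-parameter model N(µ, σ²) to N(0, 1) data, computes the Kullback-Leibler divergence of each fit in closed form, and averages over replications; n · EKL should be close to p/2 = 1.

```python
import numpy as np

rng = np.random.default_rng(2)
n, reps = 100, 4000
p = 2                                    # parameters: (mu, sigma^2)

def kl_to_fit(mu, s2):
    # closed-form KL( N(0,1) || N(mu, s2) )
    return 0.5 * np.log(s2) + (1 + mu ** 2) / (2 * s2) - 0.5

kls = []
for _ in range(reps):
    x = rng.normal(size=n)
    mu_hat, s2_hat = x.mean(), x.var()   # MLEs (np.var divides by n)
    kls.append(kl_to_fit(mu_hat, s2_hat))

print(np.mean(kls) * n)                  # should be close to p/2 = 1
```

Here the misspecification risk is zero, so the whole risk is the statistical risk p/(2n); the simulation recovers the factor p/2 up to Monte Carlo error.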
6. The penalized likelihood

There is a large literature on the topic: see [13, 14, 17, 18, 21, 25, 37, 44] among others. Penalized likelihood is useful when the statistical model is too large to obtain good estimators, while conventional parametric models appear too rigid. A simple form of the penalized log-likelihood is
\[
pl_{\kappa}(\theta) = L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n} - \kappa J(\theta),
\]
where J(θ) is a measure of dislike of θ and κ weights the influence of this measure on the objective function. A classical example is when θ = (α(·), β), where α(·) is a function and β a finite-dimensional parameter.
In this case J(θ) measures the irregularity of the function α(·). The maximum penalized likelihood estimator (MpLE) θ^{pl}_κ is the value of θ which maximizes pl_κ(θ). κ is often called a smoothing coefficient in cases where J(θ) is a measure of the irregularity of a function; more generally, we will call it a meta-parameter. We may generalize the penalized log-likelihood by replacing κJ(θ) by J(θ, κ), where κ could be multidimensional. When κ varies, this defines a family of estimators (θ^{pl}_κ; κ ≥ 0). κ may be chosen by cross-validation (see section 8).
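A minimal concrete sketch of pl_κ (our illustration, not from the paper): a Gaussian log-likelihood for noisy observations of a function α(·), with J(θ) taken as the summed squared second differences of α, so that κ acts as a smoothing coefficient. For this quadratic problem the MpLE solves a linear system.

```python
import numpy as np

rng = np.random.default_rng(3)
t = np.linspace(0.0, 1.0, 100)
alpha_true = np.sin(2 * np.pi * t)              # smooth function to recover
y = alpha_true + rng.normal(0.0, 0.4, size=t.size)

n = t.size
D = np.diff(np.eye(n), n=2, axis=0)             # (n-2) x n second-difference matrix

def mple(kappa):
    # maximizer of  -0.5 * ||y - a||^2 - kappa * ||D a||^2
    # (gradient = 0  gives the linear system below)
    return np.linalg.solve(np.eye(n) + 2 * kappa * D.T @ D, y)

alpha_hat = mple(kappa=50.0)
print(np.mean((alpha_hat - alpha_true) ** 2),   # smoothed estimate
      np.mean((y - alpha_true) ** 2))           # raw noisy observations
```

The penalized estimate has a much smaller mean squared error than the raw data: the penalty removes the rough (high-frequency) part of the noise while barely shrinking the smooth signal.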
There is another way of dealing with statistical models that might be too large: the so-called sieve estimators [40]. Sieves are based on a sequence of approximating spaces. For instance, rather than working with a functional parameter we may restrict to spaces where the function is represented on a basis (e.g. a spline basis). Here we consider a special sieves approach where the approximating spaces may be functional spaces. Consider a family of models (P_ν)_{ν≥0} where
\[
\mathcal{P}_{\nu} = (P^{\theta};\ \theta \in \Theta : J(\theta) \le \nu).
\]
For fixed ν, the MLE θ̂_ν solves the constrained maximization problem:
\[
\max L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n} \quad \text{subject to } J(\theta) \le \nu. \tag{6.1}
\]
When ν varies this defines a family of sieve estimators (θ̂_ν; ν ≥ 0). θ̂_ν maximizes the Lagrangian L^{P^θ/P⁰}_{Ō_n} − λ[J(θ) − ν] for some value of λ. The Lagrangian superficially looks like the penalized log-likelihood function, but an important difference is that here the Lagrange multiplier λ is not fixed and is part of the solution. If the problem is convex the Karush-Kuhn-Tucker conditions are necessary and sufficient. Here these conditions are
\[
J(\theta) \le \nu; \qquad \lambda \ge 0; \qquad \frac{\partial L^{P^{\theta}/P^{0}}_{\bar{\mathcal{O}}_n}}{\partial \theta} - \lambda \frac{\partial J(\theta)}{\partial \theta} = 0. \tag{6.2}
\]
It is clear that when the observation Ō_n is fixed, the function κ → J(θ^{pl}_κ) is a monotone decreasing function. Consider the case where this function is continuous and unbounded (when κ → 0). Then for each fixed ν there exists a value, say κ_ν, such that J(θ^{pl}_{κ_ν}) = ν. Note that this value depends on Ō_n. Now, it is easy to see that θ^{pl}_{κ_ν} satisfies the Karush-Kuhn-Tucker conditions (6.2), with λ = κ_ν. Thus, if we can find the correct κ_ν we can solve the constrained maximization problem by maximizing the corresponding penalized likelihood. However, the search for κ_ν is not simple, and we must remember that the relationship between ν and κ_ν depends on Ō_n. A simpler result, deriving from the previous considerations, is:
Lemma 6.1 (Penalized and sieves estimators). The families (P^{θ^{pl}_κ}; κ ≥ 0) and (P^{θ̂_ν}; ν ≥ 0) are identical families of estimators.
The consequence is that since it is easier to solve the unconstrained maximiza-
tion problem involved in the penalized likelihood approach, one should apply
this approach in applications. On the other hand, it may be easier to develop
asymptotic results for sieve estimators (because θ̂ν is a MLE) than for penal-
ized likelihood estimators. One should be able to derive properties of penalized
likelihood estimators from those of sieve estimators.
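The correspondence of Lemma 6.1 can be observed numerically. The sketch below (our illustration; a Gaussian log-likelihood with J(θ) = ‖θ‖², i.e. ridge regression) computes the penalized estimator for a fixed κ, sets ν = J(θ^{pl}_κ), and solves the constrained problem (6.1) directly: the two estimators coincide.

```python
import numpy as np
from scipy.optimize import minimize, NonlinearConstraint

rng = np.random.default_rng(4)
X = rng.normal(size=(60, 4))
y = X @ np.array([1.0, -2.0, 0.5, 0.0]) + rng.normal(size=60)

def negloglik(theta):                 # Gaussian -loglikelihood up to constants
    return 0.5 * np.sum((y - X @ theta) ** 2)

def J(theta):                         # the "dislike" measure J(theta)
    return np.sum(theta ** 2)

# penalized estimator for fixed kappa (closed form: ridge regression)
kappa = 3.0
theta_pl = np.linalg.solve(X.T @ X + 2 * kappa * np.eye(4), X.T @ y)

# sieve/constrained estimator with nu = J(theta_pl): max loglik s.t. J <= nu
nu = J(theta_pl)
res = minimize(negloglik, np.zeros(4), method="SLSQP",
               constraints=[NonlinearConstraint(J, -np.inf, nu)])

print(np.max(np.abs(res.x - theta_pl)))
```

The maximal coordinate-wise difference between the two solutions is numerically negligible, and the Lagrange multiplier returned by the constrained solver plays the role of κ.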
7. The hierarchical likelihood

7.1. Random effects models

Random effects models have been studied in both the linear [43] and non-linear [10] cases. While in the linear case computation of the integrals involved in the likelihood is analytical, in the non-linear case it is not. The numerical computation of these multiple integrals of dimension K is a daunting task if K is larger than 2 or 3, especially if the likelihood given the random effects is not itself very easy to compute; this is the curse of dimensionality.
7.2. Hierarchical likelihood

The complete-data loglikelihood, where Ḡ_n denotes the σ-field generated by the observations together with the random effects (b_i), can be decomposed as
\[
L^{P^{\theta}/P^{0}}_{\bar{\mathcal{G}}_n} = L^{P^{\theta}/P^{0}}_{\bar{\mathcal{G}}_n | b} + L^{P^{\theta}/P^{0}}_{b};
\]
the last term can be written Σ_{i=1}^{n} log f_b(b_i; τ). None of these likelihoods can be computed from (is measurable for) Ō_n. The h-loglikelihood function is the function γ → L^{P^γ/P⁰}_{Ḡ_n}, where γ = (θ, b) is the set of all the "parameters". Thus, estimators (here denoted MHLE) of both θ and b can be obtained by maximizing the h-loglikelihood:
\[
hl_{\tau}(\gamma) = L^{P^{\gamma}/P^{0}}_{\bar{\mathcal{O}}_n} + \sum_{i=1}^{n} \log f_b(b_i; \tau).
\]
Often the loglikelihood can be written L^{P^γ/P⁰}_{Ō_n} = Σ_{i=1}^{n} log f(Y_i; θ, b_i). However, this formulation is not completely general, because there are interesting cases where observations of the Y_i are censored. So, we prefer writing the loglikelihood as L^{P^γ/P⁰}_{Ō_n}. We note γ̂_τ = (θ̂_τ, b̂_τ) the maximum h-likelihood estimators of the parameters for given τ; the latter (meta-)parameter can be estimated by profile likelihood. The main interest of this approach is that there is no need to compute multiple integrals. This problem is replaced by that of maximizing hl_τ(γ) over γ; that is, the problem is now the large number of parameters that must be estimated, which is equal to m + nK. This may be large, but special algorithms can be used for generalized linear models.
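A minimal sketch of h-likelihood maximization (our illustration; a hypothetical Poisson random-intercept model with τ² treated as a known meta-parameter): the joint objective over γ = (µ, b_1, …, b_n) is the conditional Poisson loglikelihood plus the Gaussian log-density of the random effects, and no integral ever appears.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import gammaln

rng = np.random.default_rng(5)
n_grp, n_rep = 30, 5
tau2 = 0.25                                        # known variance of b_i
b_true = rng.normal(0.0, np.sqrt(tau2), size=n_grp)
mu_true = 1.0
y = rng.poisson(np.exp(mu_true + b_true)[:, None], size=(n_grp, n_rep))

def neg_hloglik(gamma):
    mu, b = gamma[0], gamma[1:]
    eta = mu + b[:, None]                          # linear predictor
    loglik = np.sum(y * eta - np.exp(eta) - gammaln(y + 1))
    log_fb = -0.5 * np.sum(b ** 2) / tau2 \
             - 0.5 * n_grp * np.log(2 * np.pi * tau2)
    return -(loglik + log_fb)                      # -hl_tau(gamma)

res = minimize(neg_hloglik, np.zeros(1 + n_grp), method="L-BFGS-B")
mu_hat, b_hat = res.x[0], res.x[1:]
print(mu_hat)
```

The dimension of the problem is m + nK = 1 + 30 parameters; the fixed effect is recovered close to its true value and the estimated b̂_i track the simulated random effects (with the shrinkage, and possible small bias, discussed below).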
Therneau and Grambsch [41] used the same approach for fitting frailty models, calling it a penalized likelihood. It may superficially look like the penalized quasi-likelihood of Breslow and Clayton [2], but it is not the same thing. There
is a link with the more conventional penalized likelihood for estimating smooth
functions discussed in section 6. The h-likelihood can be considered as a penal-
ized likelihood but with two important differences relative to the conventional
one: (i) the problem is parametric; (ii) the number of parameters grows with n.
Commenges et al. [9] have proved that the maximum h-likelihood estimators for
the fixed parameters are M-estimators [42]. Thus, under some regularity con-
ditions they have an asymptotic normal distribution. However, this asymptotic
distribution is not in general centered on the true parameter values, so that the
estimators are biased. In practice the bias can be negligible so that this approach
can be interesting in some situations due to its relative numerical simplicity.
8. Akaike and likelihood cross-validation criteria

An important issue is the choice between different estimators. Two typical situations are: (i) choice of MLEs in different models; (ii) choice of MpLEs with different penalties. If we consider two models Π and Π′ we get two estimators P^θ̂ and P^γ̂ of the probability P*, and we may wish to assess which is better. This is the "model choice" issue. A penalized likelihood function produces a family of estimators (P^{θ^{pl}_κ}; κ ≥ 0), and we may wish to choose the best. Here, what we call "the best" estimator is the estimator that minimizes some risk function; in both cases we can use the extended version of the Kullback-Leibler risk:
\[
EKL(P^{\hat\theta}; \mathcal{O}_i) = \mathrm{E}_{P^{*}}\big(L^{P^{*}/P^{\hat\theta}}_{\mathcal{O}_i}\big).
\]
Since P* is unknown we can first work with E_{P*}(L^{P⁰/P^θ̂}_{O_i}), which is equal, up to a constant, to EKL(P^θ̂; O_i). Second, we can, as usual, replace the expectation under P* by the expectation under the empirical distribution. For parametric models, Akaike [1] has shown that an estimator of E_{P*}(L^{P⁰/P^θ̂}_{O_i}) is −n⁻¹(L^{P^θ̂/P⁰}_{Ō_n} − p), and the Akaike criterion (AIC) can be deduced by multiplying this quantity by 2n:
\[
AIC = -2 L^{P^{\hat\theta}/P^{0}}_{\bar{\mathcal{O}}_n} + 2p.
\]
Other criteria have been proposed for model choice; for more detail about the Akaike and other criteria we refer to [3, 27, 35]. Here, we pursue Akaike's idea of estimating the Kullback-Leibler risk. It is clear that the absolute risk itself cannot in general be estimated. However, the difference of risks between two estimators in parametric models, ∆(P^θ̂, P^γ̂) = EKL(P^θ̂; O_i) − EKL(P^γ̂; O_i), can be estimated by the statistic D(P^θ̂, P^γ̂) = (1/2n)[AIC(P^θ̂) − AIC(P^γ̂)], and a more refined analysis of the difference of risks can be developed, as in [9].
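The statistic D can be made concrete on a toy example (our illustration, not from the paper): data generated by a linear model, compared fits of a constant and of a linear polynomial. The constant involving P⁰ cancels in the AIC difference, so plain Gaussian loglikelihoods suffice.

```python
import numpy as np

rng = np.random.default_rng(6)
n = 500
x = rng.uniform(-1.0, 1.0, size=n)
y = 1.0 + 2.0 * x + rng.normal(0.0, 1.0, size=n)   # the true P* is linear

def aic(deg):
    # Gaussian ML fit of a degree-`deg` polynomial; p = deg + 2 parameters
    # (deg + 1 coefficients plus the residual variance)
    X = np.vander(x, deg + 1)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    s2 = np.mean((y - X @ beta) ** 2)              # MLE of the residual variance
    loglik = -0.5 * n * (np.log(2 * np.pi * s2) + 1)
    return -2 * loglik + 2 * (deg + 2)

# D estimates the difference of Kullback-Leibler risks between the two fits;
# a negative value favors the first (linear) model
D = (aic(1) - aic(0)) / (2 * n)
print(D)
```

Here D is clearly negative: the linear fit has a much smaller estimated Kullback-Leibler risk than the constant fit, as the data-generating mechanism suggests.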
The leave-one-out likelihood cross-validation criterion can also be considered as a possible "estimator", up to a constant, of EKL [7]. It is defined as:
\[
LCV(P^{\hat\theta_n}; \mathcal{O}_{n+1}) = -\frac{1}{n} \sum_{i=1}^{n} L^{P^{\hat\theta(\bar{\mathcal{O}}_{n|i})}/P^{0}}_{\mathcal{O}_i},
\]
where Ō_{n|i} = ∨_{j≠i} O_j and O_{n+1} is another i.i.d. replicate of O_i. The difference of the LCV values of two estimators then provides an estimator of the difference of their risks. The advantage of LCV is that it can be used for comparing smooth estimators in nonparametric models, and in particular it can be used for choosing the penalty weight in penalized likelihood. A disadvantage is the computational burden, but a general approximation formula has been given ([7, 37]):
\[
LCV \approx -n^{-1}\Big[L^{P^{\hat\theta}/P^{0}}_{\bar{\mathcal{O}}_n} - \mathrm{Tr}\big(H_{pl_{\kappa}}^{-1} H_{L_{\bar{\mathcal{O}}_n}}\big)\Big],
\]
where H_{L_{Ō_n}} and H_{pl_κ} are the Hessians of the loglikelihood and of the penalized loglikelihood respectively. This expression looks like an AIC criterion, and there are arguments for interpreting Tr[H_{pl_κ}^{-1} H_{L_{Ō_n}}] as the model degrees of freedom.
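This approximation can be sketched on the Gaussian smoothing example of section 6 (our illustration; the grid of κ values and the unit working variance are arbitrary assumptions). For that model the Hessians of the negative loglikelihood and negative penalized loglikelihood are I and I + 2κDᵀD, so the trace term, i.e. the model degrees of freedom, is computable exactly.

```python
import numpy as np

rng = np.random.default_rng(7)
t = np.linspace(0.0, 1.0, 100)
alpha_true = np.sin(2 * np.pi * t)
y = alpha_true + rng.normal(0.0, 0.4, size=t.size)
n = t.size
D = np.diff(np.eye(n), n=2, axis=0)          # second-difference penalty matrix
P = D.T @ D

def lcv_approx(kappa):
    # Hessian of the negative penalized loglikelihood (working variance = 1)
    A = np.eye(n) + 2 * kappa * P
    alpha_hat = np.linalg.solve(A, y)
    loglik = -0.5 * np.sum((y - alpha_hat) ** 2)
    df = np.trace(np.linalg.inv(A))          # Tr(H_pl^{-1} H_loglik), H_loglik = I
    return -(loglik - df) / n

kappas = 10.0 ** np.arange(-2, 4)            # grid for the meta-parameter
best = min(kappas, key=lcv_approx)
print(best)
```

Small κ is rejected because the degrees-of-freedom term is close to n, and large κ trades a modest increase in residual error against a drastic reduction of the trace term; the criterion selects a substantial amount of smoothing on this grid.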
One important issue is the relationship between the three likelihoods considered here and the Bayesian approach. The question arises because it seems that these likelihoods can be given a Bayesian representation, the maximum (possibly penalized) likelihood estimators being viewed as maximum a posteriori (MAP) estimators.
Conclusion
Acknowledgements
I would like to thank Anne Gégout-Petit for helpful comments on the manuscript.
References
[38] Rubin, D.B. (1976). Inference and missing data. Biometrika 63, 581–592. MR0455196
[39] Rue, H., Martino, S. and Chopin, N. (2009). Approximate Bayesian inference for latent Gaussian models using integrated nested Laplace approximations. J. Roy. Statist. Soc. B 71, 1–35.
[40] Shen, X. (1997). On methods of sieves and penalization. Ann. Statist. 25, 2555–2591. MR1604416
[41] Therneau, T.M. and Grambsch, P.M. (2000). Modeling Survival Data: Extending the Cox Model. New York: Springer. MR1774977
[42] van der Vaart, A. (1998). Asymptotic Statistics. Cambridge: Cambridge University Press.
[43] Verbeke, G. and Molenberghs, G. (2000). Linear Mixed Models for Longitudinal Data. New York: Springer. MR1880596
[44] Wahba, G. (1983). Bayesian "Confidence Intervals" for the Cross-Validated Smoothing Spline. J. Roy. Statist. Soc. B 45, 133–150. MR0701084
[45] Williams, D. (1991). Probability with Martingales. Cambridge University Press. MR1155402