
Numerical optimization of likelihoods:

Additional literature for STK2120

Geir Storvik, University of Oslo

February 21, 2011


1 Introduction

As described in Section 7.2 of the textbook by Devore and Berk (2007), maximum likelihood is an important estimation method for statistical analysis. The textbook also gives several examples for which analytical expressions for the maximum likelihood estimators are available. In this note we will be concerned with examples of models where numerical methods are needed in order to obtain the estimates. Some of these models are "standard" in the sense that statistical software is available for direct computation. In other cases, user-made programs must be applied. In both cases the use of a computer is essential, and knowledge of using and/or developing software becomes vital. Throughout the note we will illustrate the different optimization methods through three examples. The R code used in connection with these examples is given either directly in the text or in the appendix.
Example 1 (Muon decay)
The angle θ at which electrons are emitted in muon decay has a distribution with density
\[ f(x\mid\alpha) = \frac{1+\alpha x}{2}, \qquad -1 \le x \le 1,\ -1 \le \alpha \le 1, \]
where x = cos θ (Rice 1995, Example D, Section 8.4). The following data are simulated from the given distribution:

0.41040018 0.91061564 -0.61106896 0.39736684 0.37997637 0.34565436


0.01906680 -0.28765977 -0.33169289 0.99989810 -0.35203164 0.10360470
0.30573300 0.75283842 -0.33736278 -0.91455101 -0.76222116 0.27150040
-0.01257456 0.68492778 -0.72343908 0.45530570 0.86249107 0.52578673
0.14145264 0.76645754 -0.65536275 0.12497668 0.74971197 0.53839119

We want to estimate α based on these data. The likelihood function is given by
\[ \mathrm{lik}(\alpha) = \prod_{i=1}^{n} \frac{1+\alpha x_i}{2}, \]
while the log-likelihood is given by
\[ l(\alpha) = \sum_{i=1}^{n} \log(1+\alpha x_i) - n\log(2). \]
l is a smooth function of α, so at the maximum point the derivative should be equal to zero (if not, the maximum point is at one of the end-points). Now
\[ l'(\alpha) = \sum_{i=1}^{n} \frac{x_i}{1+\alpha x_i}. \]

Figure 1.1: Likelihood (left) and log-likelihood (right) for the muon decay example, plotted against α.

Unfortunately, no closed expression is available for the solution of the equation l′(α) = 0. Figure 1.1 shows lik(α) and l(α) as functions of α for the observed data x. A good guess at the optimal value of α could be obtained by visual inspection. An automatic method would however be preferable, both in order to save work and to avoid subjectivity. □
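A plot like Figure 1.1 can be produced with a few lines of R. The following sketch assumes the 30 simulated values above are stored in a vector x:

## Sketch: likelihood and log-likelihood for the muon decay data,
## assuming the simulated values are stored in the vector 'x'.
alpha  <- seq(-0.99, 0.99, length = 200)
lik    <- sapply(alpha, function(a) prod((1 + a*x)/2))
loglik <- sapply(alpha, function(a) sum(log(1 + a*x)) - length(x)*log(2))
par(mfrow = c(1, 2))
plot(alpha, lik, type = "l", xlab = "alpha", ylab = "Likelihood")
plot(alpha, loglik, type = "l", xlab = "alpha", ylab = "log-likelihood")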

Example 2 (Rainfall of summer storms in Illinois)


Figure 1.2 shows a histogram of rainfall measurements of summer storms, measured by a
network of rain gauges in southern Illinois for the years 1960-1964. A possible model for
these data is a gamma distribution

\[ f(x\mid\alpha,\lambda) = \frac{1}{\Gamma(\alpha)}\,\lambda^{\alpha} x^{\alpha-1} e^{-\lambda x}, \]
where Γ(α) is the gamma function as defined in Devore and Berk (2007, page 190). Assuming that all observations are independent, the likelihood is in this case
\[ \mathrm{lik}(\alpha,\lambda) = \prod_{i=1}^{n} \frac{1}{\Gamma(\alpha)}\,\lambda^{\alpha} x_i^{\alpha-1} e^{-\lambda x_i}. \]

Figure 1.2: Histogram of rainfall measurements of summer storms in Illinois.

The log-likelihood now becomes
\[
l(\alpha,\lambda) = \sum_{i=1}^{n}\left[\alpha\log(\lambda) + (\alpha-1)\log(x_i) - \lambda x_i - \log(\Gamma(\alpha))\right]
= n\alpha\log(\lambda) + (\alpha-1)\sum_{i=1}^{n}\log(x_i) - \lambda\sum_{i=1}^{n} x_i - n\log(\Gamma(\alpha)). \tag{1.1}
\]
Also in this case the log-likelihood is a smooth function of the parameters involved, and the derivatives are
\[
\frac{\partial}{\partial\alpha} l(\alpha,\lambda) = n\log(\lambda) + \sum_{i=1}^{n}\log(x_i) - n\,\frac{\Gamma'(\alpha)}{\Gamma(\alpha)}; \tag{1.2}
\]
\[
\frac{\partial}{\partial\lambda} l(\alpha,\lambda) = n\alpha\lambda^{-1} - \sum_{i=1}^{n} x_i. \tag{1.3}
\]

Putting the derivatives to zero and solving the equations is again difficult, making direct analytical solutions intractable.

In Figure 1.3, lik(α, λ) and l(α, λ) are plotted for different values of α and λ. These plots are produced by the commands given in Programwindow A.1.

Also in this case a good guess of the mode is possible, but a numerical procedure would be preferable. □
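For the numerical procedures discussed later, moment estimates provide convenient starting values (they are used in Programwindow 4.2). A minimal sketch, assuming the rainfall measurements are stored in a vector x:

## Sketch: moment estimates for the gamma distribution, often used as
## starting values for the iterative methods below.
## For the gamma distribution E[X] = alpha/lambda and Var[X] = alpha/lambda^2.
alpha.start  <- mean(x)^2 / var(x)
lambda.start <- mean(x) / var(x)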
Example 3 (Survival times for leukemia)
In Cox and Oakes (1984) a dataset concerning the survival times for leukemia is discussed.
We will consider a subset of these data, given in Table 1.1.

A distribution commonly applied to survival data is the Weibull distribution which has
probability density
\[ f(x;\alpha,\beta) = \beta\alpha^{-1}\left(\frac{x}{\alpha}\right)^{\beta-1} e^{-(x/\alpha)^{\beta}}. \]
Figure 1.3: Perspective and contour plot of the likelihood function (left) and log-likelihood function (right) for the rainfall measurements.

Here α is a scale parameter while β describes the shape of the distribution. Assuming all observations x_1, ..., x_n are independent, the likelihood function is
\[ \mathrm{lik}(\alpha,\beta) = \prod_{i=1}^{n} \beta\alpha^{-1}\left(\frac{x_i}{\alpha}\right)^{\beta-1} e^{-(x_i/\alpha)^{\beta}}, \]
while the log-likelihood is
\[
l(\alpha,\beta) = \sum_{i=1}^{n}\left[\log(\beta) - \log(\alpha) + (\beta-1)(\log(x_i)-\log(\alpha)) - (x_i/\alpha)^{\beta}\right]
= n\log(\beta) - n\beta\log(\alpha) + (\beta-1)\sum_{i=1}^{n}\log(x_i) - \sum_{i=1}^{n}(x_i/\alpha)^{\beta}. \tag{1.4}
\]

56 65 17 7 16 22 3 4 2 3 8 4 3 30 4 43

Table 1.1: Survival times for leukemia. Subset of the dataset from Cox and Oakes (1984), corresponding to those individuals who responded "absent" to a given test.

In Figure 1.4, lik(α, β) and l(α, β) are plotted for different values of α and β. Differentiating with respect to α and β, we get


\[
\frac{\partial}{\partial\alpha} l(\theta) = -\frac{n\beta}{\alpha} + \frac{\beta}{\alpha}\sum_{i=1}^{n}(x_i/\alpha)^{\beta}, \tag{1.5}
\]
\[
\frac{\partial}{\partial\beta} l(\theta) = \frac{n}{\beta} - n\log(\alpha) + \sum_{i=1}^{n}\log(x_i) - \sum_{i=1}^{n}(x_i/\alpha)^{\beta}\log(x_i/\alpha). \tag{1.6}
\]
As for the previous examples, maximizing the likelihood analytically is difficult. □

In this note we will discuss how numerical methods can be applied for solving statistical optimization problems. We will not go deeply into the general theory behind numerical methods. For this, we refer to other courses (e.g., MAT-INF 1100, INF-MAT 3370). Our intention is to demonstrate how such numerical methods can be applied to statistical problems, in addition to presenting some optimization methods which are specialized for use in statistics. We will start with some general remarks concerning optimization (Section 2). Our primary application will be maximum likelihood estimation. Therefore, some properties of the log-likelihood function and the maximum likelihood estimators (Section 3) will also be discussed. The remaining sections consider different optimization methods. All methods are iterative. They start with some initial guess of the parameters. Next, a new candidate for the optimum, presumably better than the initial guess, is found. This new guess is used to find an even better one, and so on. Section 4 considers the Newton-Raphson method, which is probably the most used optimization method in statistics. A slight modification of this method, the scoring algorithm, is discussed in Section 5. Some refinements of these algorithms are discussed in Section 6. All procedures discussed are illustrated by examples. The datasets and R code can be found on the course homepage. R routines are also given in the text.

2 General comments on optimization

As mentioned in the introduction, optimization problems often occur in statistics. Analytic expressions for maximum likelihood estimators in complex models are usually not easily available, and numerical methods are needed. Other types of optimization problems may also be of interest. For illustrative purposes, we will however concentrate on likelihood maximization.

Figure 1.4: Perspective and contour plot of the likelihood function (left) and log-likelihood function (right) for the leukemia data.

Optimization is a huge field in numerical analysis and we will only focus on a few meth-
ods that are widely used in statistics. Although many of the general purpose methods are
useful also for statistical problems, methods specifically designed for likelihood maximiza-
tion may be better to use in certain situations. In any application, we may choose if we
want to maximize or minimize, since maximization of a function l is equivalent to mini-
mization of −l. Since our primary concern will be likelihood maximization, we will assume
that maximization is our aim.

For a general discussion of optimization methods, we refer to Van Loan (2000). Lange
(1999) is a very good reference for numerical analysis in general applied to statistics. We
will only give some general comments here:

• In some applications, the function we want to maximize is well-behaved, in the sense that the function is concave and only one maximum point exists. In such cases most methods will work (although with different efficiency). For more complicated problems, the concavity property is generally lost, and local maximum points may occur. In such cases, there is no guarantee that the numerical method will return the global maximum point. Two standard strategies are then widely used:

(a) Find local maxima starting from widely varying starting values and then pick the maximum of these;
(b) Perturb a local optimum by a "small" amount and use this as a starting point for a new run of your routine. Then see if your routine converges to a better point, or "always" to the same one.

• When searching for an appropriate method to use, you must choose between meth-
ods that need only evaluations of the function itself and methods that also require
evaluations of the derivative(s) of the function. Algorithms using the derivative are
somewhat more powerful, but the extra cost of calculating these may result in less
efficiency overall. Note however, that in order to obtain information values and stan-
dard errors for the estimators, derivatives may be needed anyway (see next section).

• In many situations, application-specific algorithms can be constructed. Such algorithms can be far more efficient than the more general ones.

As already mentioned, our main concern will be optimizing likelihood functions. The likelihood function is a function of both the random variable X = (X_1, ..., X_n) and the parameter vector θ. In general we should therefore write lik(θ; X) = f(X; θ), where f(X; θ) is the probability density for X given θ. When estimation of θ is to be performed, the observed value of X is given and can be considered as fixed. In those cases, we will simply write lik(θ).

The optimum of the likelihood function lik(θ) coincides with the optimum of the log-likelihood function l(θ). We will consider maximization of the log-likelihood function. In the numerical literature, the function to maximize is usually called the objective function. We will follow this convention when discussing maximization algorithms in general.

3 Some properties of the likelihood function and maximum likelihood estimators

Devore and Berk (2007, Sec. 7.4) discuss large-sample properties of maximum likelihood estimators. Some of these results will be repeated here, and some extra properties important for numerical optimization of likelihood functions will be introduced. We will start by assuming only one unknown parameter θ. Afterwards we will generalize to several parameters.
3.1 The one-parameter situation

The derivative of the log-likelihood,
\[ s(\theta; X) = l'(\theta; X), \tag{3.1} \]
is usually named the score function. Note that the score function is a random variable since it depends on the random observations X. It can be expressed through the likelihood function itself as
\[ s(\theta; X) = \frac{1}{\mathrm{lik}(\theta; X)}\,\mathrm{lik}'(\theta; X), \]
which is obtained from (3.1) by ordinary rules of differentiation. If l(θ; X) is a smooth function of θ, the log-likelihood should have derivative equal to zero at the maximum point. A common approach for finding maximum points is therefore to solve the scoring equation
\[ s(\hat\theta; X) = 0. \]
Note however that this criterion is only a necessary one; minimum points and saddle points can also be solutions to this equation. In order to evaluate whether the solution actually is a maximum point, the second derivative must be inspected.

As is apparent from (3.1), s(θ; X) is a stochastic quantity. An important property of the score function is that if X has probability density f(X; θ), then E[s(θ; X)] = 0. A solution of the scoring equation can therefore be seen as a value of θ̂ such that the score function is equal to its expectation.

The variability of s(θ; X) reflects the uncertainty involved in estimating θ. The variance of s is called the Fisher information (sometimes it is also called the expected information). The Fisher information I(θ) of an experiment can be written as
\[ I(\theta) = \mathrm{Var}[s(\theta; X)] = -E[l''(\theta)]. \tag{3.2} \]
(In this note we use I both for the information based on one observation and for the information from a random sample; the textbook distinguishes between these situations by using I_n for the latter.)

For proof of the properties given above for the discrete case with n = 1, see Devore and
Berk (2007, page 365). The general case is a trivial extension. A consequence of the first
equality in (3.2) is that I(θ) is always non-negative.

Given the maximum likelihood estimator θ̂ of θ, an important task is to derive uncertainty properties of this estimator. The theorem on page 369 in Devore and Berk (2007) states that for a large number of observations, θ̂ is approximately normally distributed with expectation θ and variance equal to 1/I(θ). In the textbook a proof is sketched for the i.i.d. case, but the result is also valid in more general situations.
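As a small illustration of how this large-sample result is typically used, the sketch below computes an approximate 95% confidence interval; theta.hat and I.fun are hypothetical placeholders for an estimate and a function evaluating the Fisher information:

## Sketch: approximate 95% confidence interval from the large-sample
## approximation theta.hat ~ N(theta, 1/I(theta)).
## 'theta.hat' and 'I.fun' are hypothetical placeholders.
wald.ci <- function(theta.hat, I.fun, level = 0.95)
{
  se <- 1/sqrt(I.fun(theta.hat))     # approximate standard error
  z  <- qnorm(1 - (1 - level)/2)     # normal quantile, 1.96 for 95%
  c(lower = theta.hat - z*se, upper = theta.hat + z*se)
}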

3.2 The multi-parameter situation

Assume θ = (θ_1, ..., θ_p)^T is a vector of p, say, unknown parameters. The derivative of the log-likelihood, still named the score function, is now a vector:
\[ s(\theta; X) = l'(\theta; X) = \frac{1}{\mathrm{lik}(\theta; X)}\,\mathrm{lik}'(\theta; X). \tag{3.3} \]
The ith element of the vector s(θ; X), s_i(θ; X), is the partial derivative of l(θ; X) with respect to θ_i. A more mathematically correct way of expressing s(θ; X) would be \(\frac{\partial}{\partial\theta} l(\theta; X)\), but we will use the simpler form l′(θ; X).

As in the one-parameter case, each s_i(θ; X) has expectation zero. Finding the MLE by solving the scoring equations
\[ s(\hat\theta; X) = 0 \tag{3.4} \]
now results in a set of p equations with p unknowns.

The expected information is now generalized to a matrix I(θ) with (i, j)th entry given by
\[ I_{ij}(\theta) = \mathrm{Cov}[s_i(\theta; X), s_j(\theta; X)] = -E\left[\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} l(\theta)\right]. \tag{3.5} \]
Here the second equality can be proven by a simple generalization of the argument in the one-parameter case. In the multi-parameter situation we usually call I(θ) the expected information matrix or the Fisher information matrix. An important property of I(θ) is that it is always positive (semi-)definite.¹ This will be of importance in the construction of the scoring algorithm in Section 5.
The large-sample theory described in Section 7.4 of Devore and Berk (2007) also generalizes to the multi-parameter case, so that the MLE is, under appropriate smoothness conditions on f, consistent. Further, each element θ̂_i of θ̂ will, for a sufficiently large number of observations, have a sampling distribution that is approximately normal with expectation θ_i and variance equal to {I^{-1}(θ)}_{ii}, the ith diagonal entry of I^{-1}(θ). Further, the covariance between θ̂_i and θ̂_j is approximately given by the (i, j)th entry of I^{-1}(θ).

Note that I(θ) depends on the unknown quantity θ. Common practice is to insert the estimated value θ̂ for θ, giving an estimate I(θ̂) of I(θ).

A further complication is that the expectations in (3.5) are not always possible to compute. An alternative is then to use the observed information matrix J(θ), with (i, j)th entry given by
\[ J_{ij}(\theta) = -\frac{\partial^2}{\partial\theta_i\,\partial\theta_j} l(\theta). \tag{3.6} \]

¹A matrix I is positive semi-definite if a′Ia ≥ 0 for all vectors a.

As for the expected information matrix, an estimate θ̂ needs to be inserted for θ in order to evaluate J(θ). The ith diagonal element of J^{-1}(θ̂) can then be used as an approximation to the variance of θ̂_i instead of the ith diagonal element of I^{-1}(θ̂). Both these approximations will be equally valid in the sense that as the number of observations increases, the approximation error will decrease to zero. If possible to calculate, the expected information is preferable, since the observed information can in some cases be unstable. Note that in many standard models used in statistics, I(θ) = J(θ).
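When analytical second derivatives are tedious, J(θ̂) can also be approximated numerically. The following sketch uses a simple central-difference approximation, with loglik a hypothetical placeholder for a user-supplied log-likelihood function:

## Sketch: numerical observed information via central differences.
## 'loglik' is any function of the parameter vector theta (hypothetical).
obs.info <- function(loglik, theta, h = 1e-5)
{
  p <- length(theta)
  H <- matrix(0, p, p)
  for (i in 1:p) {
    for (j in 1:p) {
      ei <- ej <- rep(0, p); ei[i] <- h; ej[j] <- h
      H[i, j] <- (loglik(theta + ei + ej) - loglik(theta + ei - ej) -
                  loglik(theta - ei + ej) + loglik(theta - ei - ej)) / (4*h^2)
    }
  }
  -H    # observed information is minus the Hessian of the log-likelihood
}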

4 The Newton-Raphson method

One of the most used methods for optimization in statistics is the Newton-Raphson method
(or Newton’s rule). It is based on approximating the function which we want to optimize
by a quadratic one. The optimum of the approximation (which is easy to calculate) gives
a guess of the optimum for the actual function. If this guess is not adequately close to the
optimum, a new approximation is computed and the process repeated.

Assume for simplicity that l only involves a one-dimensional parameter and that θ̄ is our current best guess of the maximum point of l(θ). By a Taylor series expansion around θ̄, l(θ) can be approximated by
\[ \tilde l_{\bar\theta}(\theta) = l(\bar\theta) + l'(\bar\theta)(\theta - \bar\theta) + \tfrac{1}{2} l''(\bar\theta)(\theta - \bar\theta)^2. \tag{4.1} \]
When θ is close to θ̄, the difference l(θ) − l̃_θ̄(θ) is small. Figure 4.1 shows l(θ) and l̃_θ̄(θ) for a specific example. We see that the maximum point of l̃_θ̄(θ) is closer to the maximum point of l(θ) than θ̄ is.

The gradient of l̃_θ̄(θ) at θ is
\[ \tilde l'_{\bar\theta}(\theta) = l'(\bar\theta) + l''(\bar\theta)(\theta - \bar\theta), \]
and the Hessian, or second derivative, is
\[ \tilde l''_{\bar\theta}(\theta) = l''(\bar\theta). \]
At the point θ̄, l(θ) and l̃_θ̄(θ) therefore have equal first and second derivatives. Note that in the particular case when l is a log-likelihood function, the Hessian is equal to minus the observed information evaluated at θ = θ̄, that is, l″(θ̄) = −J(θ̄).

At the optimum of the approximation, l̃_θ̄(θ) has gradient equal to zero, giving the equation
\[ l''(\bar\theta)(\theta - \bar\theta) = -l'(\bar\theta). \tag{4.2} \]

Figure 4.1: A function l(θ) (solid line) and its quadratic approximation l̃_θ̄(θ) (dashed line). The point (θ̄, l(θ̄)) is marked by ∗.

Solving with respect to θ, we get
\[ \theta = \bar\theta - \frac{l'(\bar\theta)}{l''(\bar\theta)}. \tag{4.3} \]
This gives a procedure for optimizing l̃_θ̄(θ). Our aim is however to optimize l(θ). Since l̃_θ̄(θ) is an approximation of l(θ), our hope is that (4.3) will give us a new value closer to the optimum of l(θ). This suggests an iterative procedure for optimizing l(θ):
\[ \theta^{(s+1)} = \theta^{(s)} - \frac{l'(\theta^{(s)})}{l''(\theta^{(s)})}, \tag{4.4} \]
which is the Newton-Raphson method. The procedure is run until there is no significant difference between θ^(s) and θ^(s+1).²

θ(s+1) = θ(s) is equivalent to l′ (θ(s) ) = 0. This demonstrates that when the algorithm
has converged, we have reached a stationary point of l(θ). This point could be a maximum
point, a minimum point or even a saddle point. However, if l′′ (θ(s) ) < 0, the point is a
maximum point. This should be checked in each case. Figure 4.2 shows the first four
iterations of a Newton-Raphson algorithm. The results from the last two iterations are
almost indistinguishable, indicating that convergence is reached.
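As an illustration of iteration (4.4), a generic one-dimensional Newton-Raphson routine can be sketched as follows; dl and d2l are hypothetical placeholders for user-supplied first and second derivatives of the objective function:

## Sketch: generic one-dimensional Newton-Raphson, iteration (4.4).
newton1d <- function(dl, d2l, theta0, eps = 1e-8, maxit = 100)
{
  theta <- theta0
  for (s in 1:maxit) {
    theta.new <- theta - dl(theta)/d2l(theta)
    if (abs(theta.new - theta) < eps) break
    theta <- theta.new
  }
  list(theta = theta.new, maximum = (d2l(theta.new) < 0))
}

For the muon decay example one would pass dl(α) = Σ x_i/(1+αx_i) and d2l(α) = −Σ x_i²/(1+αx_i)², which is exactly what the routine in Programwindow 4.1 below specializes.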

When l(θ) is a log-likelihood function, this algorithm can be written as
\[ \theta^{(s+1)} = \theta^{(s)} + \frac{s(\theta^{(s)})}{J(\theta^{(s)})}. \]
In this case a convergence point is a maximum point if J(θ^(s)) > 0.


²Other convergence criteria could also be used, but will not be considered here.
Figure 4.2: A function l(θ) (solid line) and its quadratic approximations l̃_{θ^(s)}(θ) for s = 0 (solid plus star), s = 1 (dashed), s = 2 (dash-dotted) and s = 3 (dotted). The points (θ^(s), l(θ^(s))) for s = 0, 1, 2, 3, 4 are marked by their numbers.

The Newton-Raphson algorithm will in general only converge to a stationary point close to the starting value. If several such points are present, the choice of starting value becomes critical. In many statistical applications, reasonable starting values can be found by other means, e.g. moment estimators.

Example 1 (Muon decay, cont.)
We want in this case to maximize the log-likelihood function
\[ l(\alpha) = \sum_{i=1}^{n} \log(1+\alpha x_i) - n\log(2). \]
In Section 1 we found that
\[ s(\alpha) = l'(\alpha) = \sum_{i=1}^{n} \frac{x_i}{1+\alpha x_i}. \]
Differentiating once more with respect to α gives
\[ l''(\alpha) = -J(\alpha) = -\sum_{i=1}^{n} \frac{x_i^2}{(1+\alpha x_i)^2}. \]
Note that l″(α) is always negative (and therefore J(α) is always positive) for all possible α values, showing that the log-likelihood function is concave and unimodal.

The iterative steps in the Newton-Raphson algorithm are given by
\[
\alpha^{(s+1)} = \alpha^{(s)} + \frac{\sum_{i=1}^{n} x_i/(1+\alpha^{(s)} x_i)}{\sum_{i=1}^{n} x_i^2/(1+\alpha^{(s)} x_i)^2}.
\]

nr.muon <- function(x, alpha0=0.6, eps=0.000001)
{
  # Newton-Raphson for the muon decay log-likelihood
  n = length(x)
  diff = 1
  alpha = alpha0
  l = sum(log(1+alpha*x)) - n*log(2)
  while(diff > eps)
  {
    alpha.old = alpha
    s = sum(x/(1+alpha*x))             # score
    Jbar = sum((x/(1+alpha*x))^2)      # observed information
    alpha = alpha + s/Jbar
    l = sum(log(1+alpha*x)) - n*log(2)
    diff = abs(alpha-alpha.old)
  }
  list(alpha, Jbar)
}

Programwindow 4.1: R code for running Newton-Raphson on muon decay example.

Programwindow 4.1 describes an R function for running the Newton-Raphson algorithm using the stopping criterion
\[ |\alpha^{(s+1)} - \alpha^{(s)}| < 0.000000000001. \]
The small value on the right-hand side is much smaller than necessary, but is used in order to demonstrate properties of the Newton-Raphson algorithm. A run of this function gave the following results:

Iteration s α(s) l(α(s) )


0 0.6000000 -19.65135
1 0.5040191 -19.58507
2 0.4944591 -19.58454
3 0.4943927 -19.58454
4 0.4943927 -19.58454

We see that after one iteration, one decimal is correct. This is increased to three decimals
after 2 iterations and to at least 7 decimals after 3 iterations, illustrating the rapid increase
in accuracy for the Newton-Raphson algorithm. Further, since the log-likelihood function is concave and unimodal, the resulting point is a maximum point.

The Fisher information is in this case
\[
I(\theta) = E[J(\theta)] = nE\!\left[\frac{X^2}{(1+\alpha X)^2}\right]
= \frac{n}{2}\int_{-1}^{1}\frac{x^2}{1+\alpha x}\,dx
= \frac{n}{2\alpha^2}\left[\int_{-1}^{1}(\alpha x - 1)\,dx + \int_{-1}^{1}\frac{1}{1+\alpha x}\,dx\right]
= -\frac{n}{\alpha^2} + \frac{n}{2\alpha^3}\log\!\left(\frac{1+\alpha}{1-\alpha}\right). \tag{4.5}
\]
For α̂ = 0.4943927, Î = 11.78355, giving an approximate standard error of 0.291 for α̂. Using the observed information instead, the standard error would be estimated to 0.297. □
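As a small numerical check, the quoted standard error can be reproduced directly from (4.5), using the values of α̂ and n from this example:

## Sketch: approximate standard error from the Fisher information (4.5)
## for the muon decay example (n = 30 observations).
alpha.hat <- 0.4943927
n <- 30
I.hat <- -n/alpha.hat^2 + n/(2*alpha.hat^3)*log((1+alpha.hat)/(1-alpha.hat))
se.alpha <- 1/sqrt(I.hat)   # approximately 0.291, as quoted above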

The argument for deriving the Newton-Raphson algorithm for optimization in one dimension can be directly extended to multi-dimensional problems, giving the multi-parameter Newton-Raphson method:
\[ \theta^{(s+1)} = \theta^{(s)} - \left[l''(\theta^{(s)})\right]^{-1} l'(\theta^{(s)}), \tag{4.6} \]
where l′(θ) now is a vector consisting of the partial derivatives while l″(θ) is a matrix with (i, j) entry equal to the second derivative with respect to θ_i and θ_j. l″(θ) is usually called the Hessian matrix. For likelihood optimization the algorithm can be written as
\[ \theta^{(s+1)} = \theta^{(s)} + J^{-1}(\theta^{(s)})\,s(\theta^{(s)}), \tag{4.7} \]
where s(θ) is the score function and J(θ) is the observed information matrix. In order to check whether the resulting point is a maximum point, we need to see if the observed information matrix is positive definite, which is the case if all eigenvalues of J(θ̂) are positive.

Example 2 (Rainfall data, cont.)
In this case we want to maximize
\[ l(\alpha,\lambda) = n\alpha\log(\lambda) + (\alpha-1)\sum_{i=1}^{n}\log(x_i) - \lambda\sum_{i=1}^{n} x_i - n\log(\Gamma(\alpha)). \]
The first derivatives (or score functions) are given by (1.2)-(1.3), while the second derivatives are
\[
\frac{\partial^2}{\partial\alpha^2} l(\alpha,\lambda) = -n\,\frac{\Gamma''(\alpha)\Gamma(\alpha) - \Gamma'(\alpha)^2}{\Gamma(\alpha)^2},
\qquad
\frac{\partial^2}{\partial\lambda\,\partial\alpha} l(\alpha,\lambda) = n\lambda^{-1}, \tag{4.8}
\]
\[
\frac{\partial^2}{\partial\lambda^2} l(\alpha,\lambda) = -n\alpha\lambda^{-2}. \tag{4.9}
\]

A problem with implementing the Newton-Raphson algorithm in this case is that the first and second derivatives of the gamma function are not directly available. Numerically they can however be approximated by
\[
\Gamma'(\alpha) \approx \frac{\Gamma(\alpha+h) - \Gamma(\alpha)}{h},
\qquad
\Gamma''(\alpha) \approx \frac{\Gamma'(\alpha+h) - \Gamma'(\alpha)}{h} \approx \frac{\Gamma(\alpha+2h) - 2\Gamma(\alpha+h) + \Gamma(\alpha)}{h^2}
\]
for h small.³
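As a side remark, R's built-in functions digamma() and trigamma() (the first and second derivatives of log Γ) can be used instead of these finite-difference approximations; a minimal sketch:

## Sketch: the derivative terms needed in (1.2) and (4.8) via R's
## built-in digamma/trigamma instead of finite differences.
alpha <- 0.44                    # any trial value of alpha
dg.ratio <- digamma(alpha)       # Gamma'(alpha)/Gamma(alpha)
d2.term  <- trigamma(alpha)      # (Gamma''*Gamma - Gamma'^2)/Gamma^2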

Programwindow 4.2 describes an R function for running the Newton-Raphson algorithm using the stopping criterion
\[ |\alpha^{(s+1)} - \alpha^{(s)}| + |\lambda^{(s+1)} - \lambda^{(s)}| < 0.0001. \]
A run of this function, using the moment estimates for α and λ as starting values, gave the following results:

Iteration s α(s) λ(s) l(θ (s) )


0 0.3762506 1.676755 -656.1040
1 0.4306097 1.919006 -580.9499
2 0.4405084 1.963119 -567.5825
3 0.4407890 1.964370 -567.2049
4 0.4407914 1.964380 -567.2017
5 0.4407914 1.964380 -567.2017
6 0.4407914 1.964380 -567.2017

Also for this example, convergence is fast. The algorithm is however sensitive to starting values. Using α^(0) = λ^(0) = 1, both α^(1) and λ^(1) become negative, and the algorithm crashes. We will see in Section 6 how this can be avoided.

The observed information matrix J(θ) = −l″(θ) is directly available from the Newton-Raphson algorithm. Note from (4.8)-(4.9) that J(θ) does not depend on the observations, so the Fisher information matrix is in this case equal to the observed one. Inserting the estimate of θ, we get
\[
I(\hat\theta) = \begin{pmatrix} 1394.1 & -115.6 \\ -115.6 & 25.9 \end{pmatrix},
\qquad
I^{-1}(\hat\theta) = \begin{pmatrix} 0.00114 & 0.0051 \\ 0.0051 & 0.0613 \end{pmatrix},
\]
giving approximate standard errors 0.0337 and 0.248 for α̂ and λ̂, respectively.

³Better approximations are possible, but the simple choice will be sufficient here. Note that h shouldn't be chosen too small, because round-off errors in the calculation of Γ(α) on a computer can then make the approximation bad; see Van Loan (2000, Sec. 1.5.2).
nr.gamma <- function(x, eps=0.000001)
{
  # Newton-Raphson for the gamma distribution, with finite-difference
  # approximations of the derivatives of the gamma function
  n = length(x); sumx = sum(x); sumlogx = sum(log(x))
  diff = 1; h = 0.0000001
  alpha = mean(x)^2/var(x); lambda = mean(x)/var(x)   # moment estimates as starting values
  theta = c(alpha, lambda)
  while(diff > eps)
  {
    theta.old = theta
    g = gamma(alpha)
    dg = (gamma(alpha+h)-gamma(alpha))/h
    d2g = (gamma(alpha+2*h)-2*gamma(alpha+h)+gamma(alpha))/h^2
    s = c(n*log(lambda)+sumlogx-n*dg/gamma(alpha),
          n*alpha/lambda-sumx)
    Jbar = matrix(c(n*(d2g*g-dg^2)/g^2, -n/lambda,
                    -n/lambda, n*alpha/lambda^2), ncol=2)
    theta = theta + solve(Jbar, s)
    alpha = theta[1]; lambda = theta[2]
    diff = sum(abs(theta-theta.old))
  }
  list(theta=theta, Jbar=Jbar)
}

Programwindow 4.2: R code for running Newton-Raphson on gamma distributed data.

Note that since J(θ) = I(θ), J(θ) will be positive definite for all values of θ. The log-likelihood function is then concave, so the value (α̂, λ̂) obtained by the Newton-Raphson algorithm is the global maximum. □

Example 3 (Leukemia data, cont.)
The log-likelihood is given in (1.4), while the score functions are given in equations (1.5) and (1.6). Further,
\[
J_{1,1} = -\frac{\partial^2}{\partial\alpha^2} l(\theta) = -\frac{n\beta}{\alpha^2} + \frac{\beta(\beta+1)}{\alpha^2}\sum_{i=1}^{n}(x_i/\alpha)^{\beta},
\]
\[
J_{1,2} = -\frac{\partial^2}{\partial\alpha\,\partial\beta} l(\theta) = \frac{n}{\alpha} - \frac{1}{\alpha}\sum_{i=1}^{n}(x_i/\alpha)^{\beta} - \frac{\beta}{\alpha}\sum_{i=1}^{n}(x_i/\alpha)^{\beta}\log(x_i/\alpha),
\]
\[
J_{2,2} = -\frac{\partial^2}{\partial\beta^2} l(\theta) = \frac{n}{\beta^2} + \sum_{i=1}^{n}(x_i/\alpha)^{\beta}[\log(x_i/\alpha)]^2.
\]

Programwindow 4.3 shows R code for a Newton-Raphson algorithm in this case. With starting values α^(0) = 10 and β^(0) = 1 and the convergence criterion
\[ |\alpha^{(s+1)} - \alpha^{(s)}| + |\beta^{(s+1)} - \beta^{(s)}| < 0.00001, \]
we get the following results:
Iteration s α(s) β (s) l(θ (s) )
0 10.00000 1.0000000 -138.00580
1 11.88883 0.8904244 -62.98770
2 15.09949 0.9287394 -62.22634
3 16.74320 0.9244928 -62.10186
4 17.17639 0.9220478 -62.09619
5 17.20186 0.9218854 -62.09617
6 17.20194 0.9218849 -62.09617
7 17.20194 0.9218849 -62.09617
Inserting θ̂ for θ into J(θ) gives
\[
J(\hat\theta) = \begin{pmatrix} 0.0460 & -0.4324 \\ -0.4324 & 36.3039 \end{pmatrix},
\qquad
J^{-1}(\hat\theta) = \begin{pmatrix} 24.5071 & 0.2919 \\ 0.2919 & 0.0310 \end{pmatrix},
\]
giving approximate standard errors 4.9505 for α̂ and 0.1761 for β̂.

Also this Newton-Raphson algorithm is sensitive to the starting values chosen. With α^(0) = 20 and β^(0) = 2, both α^(1) and β^(1) become negative, which are illegal values for these parameters. □

An advantage of using the Newton-Raphson algorithm for statistical problems is that log-likelihoods in many cases are close to quadratic functions around their maximum points. This is connected to the fact that maximum likelihood estimators are approximately normally distributed (the logarithm of a Gaussian density is a quadratic function). In such cases, the approximation l̃_θ̄(θ) becomes very good near the maximum point. In more complex situations, the use of the Newton-Raphson algorithm may be more problematic. In the next sections we will discuss modifications of the Newton-Raphson algorithm, making it more robust.

nr.weibull = function(x, theta0, eps=0.000001)
{
  # Newton-Raphson for the Weibull distribution
  n = length(x)
  sumlogx = sum(log(x))
  diff = 1; theta = theta0; alpha = theta[1]; beta = theta[2]
  while(diff > eps)
  {
    theta.old = theta
    w1 = sum((x/alpha)^beta)
    w2 = sum((x/alpha)^beta*log(x/alpha))
    w3 = sum((x/alpha)^beta*log(x/alpha)^2)
    s = c(-n*beta/alpha+beta*w1/alpha,
          n/beta-n*log(alpha)+sumlogx-w2)
    Jbar = matrix(c(-n*beta/alpha^2+beta*(beta+1)*w1/alpha^2,
                    n/alpha-w1/alpha-beta*w2/alpha,
                    n/alpha-w1/alpha-beta*w2/alpha, n/beta^2+w3),
                  ncol=2)
    theta = theta + solve(Jbar, s)
    alpha = theta[1]; beta = theta[2]
    diff = sum(abs(theta-theta.old))
  }
  list(alpha=alpha, beta=beta, Jbar=Jbar)
}

Programwindow 4.3: R code for running Newton-Raphson on Weibull distributed data.

5 Fisher’s scoring algorithm

Figure 5.1 illustrates one type of problem that can occur when using the Newton-Raphson algorithm. Here θ̄ is at a point where l(θ) is convex (that is, the second derivative is positive). Using (4.3) will then give a reduction in l, since (4.3) in this case finds the minimum point of l̃_θ̄(θ).

This problem with the Newton-Raphson method is directly translated to the multi-
parameter case. In the general case, if at least one of the eigenvalues of the Hessian matrix
l′′ (θ) is positive, a smaller value of l(θ) can be obtained from one iteration to the other.
A standard trick in numerical literature for such cases is to replace the Hessian matrix
with another matrix which is negative definite. For likelihood optimization, this means
replacing J (θ) with a positive definite matrix. A possible candidate could be the identity
matrix, but a more efficient choice is available. The Fisher information I(θ) is equal to

Figure 5.1: Example of a one-dimensional function l(θ) which is to be maximized (solid line). The quadratic approximation l̃_θ̄(θ) defined in (4.1) is shown as a dashed line, with the value of θ̄ marked by the vertical line.

the expectation of J(θ), indicating that these two matrices should be similar. But I(θ) will always be positive definite, making it a natural replacement for J(θ). This is Fisher's method of scoring (or the scoring algorithm):
\[ \theta^{(s+1)} = \theta^{(s)} + I^{-1}(\theta^{(s)})\, s(\theta^{(s)}). \tag{5.1} \]
Note that for this algorithm, the Fisher information is directly available.

Example 1 (Muon decay, cont.)
The Fisher information is in this case given by (4.5). Based on this, a scoring algorithm can be constructed as given in Programwindow 5.1. Note that the only change from the Newton-Raphson algorithm given in Programwindow 4.1 is the use of I(θ) instead of J(θ). Running this algorithm with convergence criterion
\[ |\alpha^{(s+1)} - \alpha^{(s)}| < 0.000001, \]
convergence was obtained after five iterations. The α̂ value was exactly the same as that obtained by the Newton-Raphson algorithm, and the number of iterations needed for convergence is also comparable for the two algorithms. □

In general it is not possible to say which algorithm would be preferable when both converge. The next example illustrates a situation where the Newton-Raphson algorithm has problems converging while the scoring algorithm is more robust.

Example 4 (Truncated Poisson)
We will consider an example which involves the estimation of the parameter of a truncated Poisson distribution, given by
\[ f(x;\theta) = \frac{\theta^x e^{-\theta}}{x!\,(1 - e^{-\theta})}, \qquad x = 1, 2, \ldots \]

scoring.muon <- function(x, alpha0=0.6, eps=0.000001)
{
  # Fisher scoring for the muon decay log-likelihood
  n = length(x)
  diff = 1
  alpha = alpha0
  l = sum(log(1+alpha*x)) - n*log(2)
  while(diff > eps)
  {
    alpha.old = alpha
    s = sum(x/(1+alpha*x))                                        # score
    Ibar = n*(log((1+alpha)/(1-alpha))/(2*alpha^3)-1/(alpha^2))   # Fisher information (4.5)
    alpha = alpha + s/Ibar
    l = sum(log(1+alpha*x)) - n*log(2)
    diff = abs(alpha-alpha.old)
  }
  list(alpha, Ibar)
}

Programwindow 5.1: R routine for estimation in the muon decay example using the scoring
algorithm.

x 1 2 3 4 5 6
fx 1486 694 195 37 10 1

Table 5.1: Grouped data from a truncated Poisson distribution, where fx represents the
frequency of x. Table 4.1 from Everitt (1987).

Such a density might arise, for example, in studying the size of groups at parties. The data in Table 5.1 are taken from Everitt (1987) and represent samples from this distribution.

Assume n is the number of observations and x_i is the value of observation i. The likelihood function is given by
\[ \mathrm{lik}(\theta) = \prod_{i=1}^{n} \frac{\theta^{x_i} e^{-\theta}}{x_i!\,(1 - e^{-\theta})}, \]

nr.trpo = function(x, theta0, eps=0.000001)
{
  # Newton-Raphson for the truncated Poisson distribution;
  # x holds the frequencies f_1, ..., f_6 from Table 5.1
  n = sum(x); i = 1:6; sumx = sum(x*i)
  theta = theta0; diff = 1
  while(diff > eps)
  {
    theta.old = theta
    s = sumx/theta - n/(1-exp(-theta))
    Jbar = sumx/theta^2 - n*exp(-theta)/(1-exp(-theta))^2
    theta = theta + s/Jbar
    diff = abs(theta-theta.old)
  }
  list(theta, Jbar)
}

Programwindow 5.2: R routine for estimation in the truncated Poisson distribution using
the Newton Raphson algorithm.

and the log-likelihood
\[
l(\theta) = \sum_{i=1}^{n}\left[x_i\log(\theta) - \theta - \log(x_i!) - \log(1 - e^{-\theta})\right]
= \text{const.} + \log(\theta)\sum_{i=1}^{n} x_i - n\theta - n\log(1 - e^{-\theta}).
\]
Differentiating with respect to θ, we get
\[
l'(\theta) = \frac{\sum_{i=1}^{n} x_i}{\theta} - n - n\,\frac{e^{-\theta}}{1 - e^{-\theta}}.
\]
If we set l′(θ) to zero, we find that the resulting equation has no explicit solution for θ. We will apply both the Newton-Raphson and the scoring algorithms for obtaining numerical solutions. The second derivative is given by
\[
l''(\theta) = -\frac{\sum_{i=1}^{n} x_i}{\theta^2} + n\,\frac{e^{-\theta}}{(1 - e^{-\theta})^2}.
\]
In Programwindow 5.2, R code for performing the Newton-Raphson algorithm is given. Using this code with starting value θ^(0) = 1.5 gave θ̂ = 0.8925 after 6 iterations. On the other hand, starting at θ^(0) = 2.0 or larger, the procedure diverged, demonstrating the non-robustness of the Newton-Raphson algorithm.

Turning now to the scoring algorithm, the expectation in the truncated Poisson distribution is θ/(1 − e^{−θ}), which gives the Fisher information
\[
I(\theta) = \frac{n}{1 - e^{-\theta}}\left(\frac{1}{\theta} - \frac{e^{-\theta}}{1 - e^{-\theta}}\right).
\]
In Programwindow 5.3, R code for performing the scoring algorithm is given. Note that the only change from the Newton-Raphson algorithm is the replacement of J(θ) with I(θ). Starting at θ^(0) = 1.5, only 4 iterations were needed for convergence to θ̂ = 0.8925. For θ^(0) = 2.0, convergence was also obtained, now after 6 iterations. Convergence was also obtained for all other starting values tried out.

Inserting θ̂, we obtain I(θ̂) = 1750.8, giving an approximate standard error for θ̂ equal to 0.0239. □

scoring.trpo = function(x, theta0, eps=0.000001)
{
  # Fisher scoring for the truncated Poisson distribution
  n = sum(x); i = 1:6; sumx = sum(x*i)
  theta = theta0; diff = 1
  while(diff > eps)
  {
    theta.old = theta
    s = sumx/theta - n/(1-exp(-theta))
    Ibar = n*(1/theta-exp(-theta)/(1-exp(-theta)))/(1-exp(-theta))
    theta = theta + s/Ibar
    diff = abs(theta-theta.old)
  }
  list(theta, Ibar)
}

Programwindow 5.3: R routine for estimation in the truncated Poisson distribution using
the Fisher scoring algorithm.

6 Modifications of the Newton-Raphson and the scoring algorithms

Direct use of the Newton-Raphson or the Fisher scoring algorithm can in many cases be problematic. In this section we will discuss three possibilities for improving these algorithms: reduction of the optimization problem to a smaller dimension, reparametrization, and smaller jumps.

6.1 Dimension reduction

For some maximum likelihood problems where a full analytical solution is impossible, some of the parameters can still be found analytically as functions of the others. We will illustrate this through the leukemia data example.

Example 3 (Leukemia data, cont.)
As discussed before, the Newton-Raphson algorithm is very sensitive to starting values for this problem. Further, using the scoring algorithm is problematic because of difficulties in calculating the expected information matrix. Inspecting the score function (1.5), we see however that solving the equation s_1(θ) = 0 with respect to α (keeping β fixed) gives
\[
\hat\alpha(\beta) = \left[\frac{1}{n}\sum_{i=1}^{n} x_i^{\beta}\right]^{1/\beta}.
\]
Here we have used the notation α̂(β) to make it explicit that the optimal value of α depends on the value of β. Now insert this solution for α into the log-likelihood function (1.4) to obtain
\[
l_{\beta}(\beta) = l(\hat\alpha(\beta), \beta)
= n\log(\beta) - n\beta\log\!\left(\left[\frac{1}{n}\sum_{i=1}^{n} x_i^{\beta}\right]^{1/\beta}\right) + (\beta-1)\sum_{i=1}^{n}\log(x_i) - \frac{\sum_{i=1}^{n} x_i^{\beta}}{\frac{1}{n}\sum_{i=1}^{n} x_i^{\beta}}
= n\log(\beta) - n\log\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i^{\beta}\right) + (\beta-1)\sum_{i=1}^{n}\log(x_i) - n. \tag{6.2}
\]
Note that the likelihood function has been reduced to a function of one variable, which is much easier to maximize. The partially maximized log-likelihood function l_β(β) is called the profile log-likelihood function and is plotted in Figure 6.2.

Maximization of l_β(β) can now be performed by a Newton-Raphson algorithm. We have
\[
l_{\beta}'(\beta) = \frac{n}{\beta} - n\,\frac{\sum_{i=1}^{n} x_i^{\beta}\log(x_i)}{\sum_{i=1}^{n} x_i^{\beta}} + \sum_{i=1}^{n}\log(x_i),
\]
\[
l_{\beta}''(\beta) = -\frac{n}{\beta^2} - n\,\frac{\sum_{i=1}^{n} x_i^{\beta}[\log(x_i)]^2\,\sum_{i=1}^{n} x_i^{\beta} - \left(\sum_{i=1}^{n} x_i^{\beta}\log(x_i)\right)^2}{\left(\sum_{i=1}^{n} x_i^{\beta}\right)^2}.
\]
Programwindow 6.1 shows R code for this Newton-Raphson algorithm. A run of this algorithm with β^(0) = 1.0 and convergence criterion
\[ |\beta^{(s)} - \beta^{(s-1)}| < 0.000001, \]

Figure 6.2: Profile log-likelihood l_β(β) given in (6.2) for the leukemia data.

gave the following results:


Iteration s α(s) β (s) lβ (β (s) )
0 17.93750 1.0000000 -62.19030
1 23.32840 0.9165698 -66.60703
2 16.87444 0.9218605 -61.81284
3 17.20042 0.9218849 -62.09486
4 17.20194 0.9218849 -62.09617
5 17.20194 0.9218849 -62.09617
which is the same result as obtained in Section 4, but convergence was reached with fewer iterations.

Although a more efficient algorithm can be constructed in this way, the algorithm is still sensitive to starting values. Starting with β^(0) = 2.0, the following happened:

Iteration s α^(s) β^(s) l_β(β^(s))
0 26.6118 2.0000000 -73.6808
1 2.370984e-10 -0.2961171 NaN

It is possible to show that l_β''(β) ≤ 0 for all values of β. Nevertheless, the Newton-Raphson algorithm fails for some starting values. The reason in this case is the constraint on the parameter β (it needs to be positive). Running the Newton-Raphson algorithm, β becomes negative at the first iteration, and the algorithm just gives meaningless results. How to handle constraints will be discussed in the next subsection. □

6.2 Reparametrization

Many of the parameters involved have restrictions on their values. For the gamma distributed data discussed in Example 2, both parameters need to be positive.
nr.profile.weibull = function(x, beta0, eps=0.00001)
{
  # Newton-Raphson for the Weibull profile log-likelihood (6.2)
  n = length(x); sumlogx = sum(log(x))
  diff = 1; beta = beta0
  while(diff > eps)
  {
    beta.old = beta
    w1 = sum(x^beta)
    w2 = sum((x^beta)*log(x))
    w3 = sum((x^beta)*log(x)*log(x))
    w4 = sum(log(x))
    l1 = n/beta - n*w2/w1 + w4
    l2 = -n/(beta^2) - n*(w3*w1-w2^2)/(w1^2)
    beta = beta.old - l1/l2
    diff = abs(beta-beta.old)
  }
  alpha = (w1/n)^(1/beta)
  list(alpha=alpha, beta=beta)
}

Programwindow 6.1: R routine for optimization of the profile likelihood function lβ (β)
given in (6.2) with respect to β using the Newton-Raphson algorithm.

The same is true for the Weibull data considered in Example 3. For both the Newton-Raphson algorithm and the Fisher scoring algorithm, negative values can occur at some iterations, and the procedures can break down. The basic idea in such cases is to use reparametrization.

Example 3 (Leukemia data, cont.)
Continuing with the profile likelihood discussed in the previous subsection, we will now see how the constraint β > 0 can be taken into consideration.

Define b = log(β). Since β > 0, b can take values on the whole real line. Rewriting l_β(β) as a function of b, we get
\[
l_b(b) = nb - n\log\!\left(\frac{1}{n}\sum_{i=1}^{n} x_i^{e^b}\right) + (e^b - 1)\sum_{i=1}^{n}\log(x_i) - n. \tag{6.3}
\]
Such a transformation of β to b is called a reparametrization of β. We aim at maximizing l_b(b) with respect to b. Note that since log(β) is a monotone and invertible transformation, a maximum point for b directly gives a maximum point for β. Since β = exp(b), we automatically obtain β > 0.

To use Newton-Raphson, we need the derivatives:
\[
l_b'(b) = n\left[1 - \frac{\frac{1}{n}\sum_{i=1}^{n} x_i^{e^b} e^b\log(x_i)}{\frac{1}{n}\sum_{i=1}^{n} x_i^{e^b}} + \frac{e^b}{n}\sum_{i=1}^{n}\log(x_i)\right],
\]
\[
l_b''(b) = -n e^b\,\frac{\sum_{i=1}^{n} x_i^{e^b}\log(x_i)}{\sum_{i=1}^{n} x_i^{e^b}}
- n e^{2b}\,\frac{\sum_{i=1}^{n} x_i^{e^b}[\log(x_i)]^2\,\sum_{i=1}^{n} x_i^{e^b} - \left(\sum_{i=1}^{n} x_i^{e^b}\log(x_i)\right)^2}{\left(\sum_{i=1}^{n} x_i^{e^b}\right)^2}
+ e^b\sum_{i=1}^{n}\log(x_i).
\]
Programwindow 6.2 shows R code for a Newton-Raphson algorithm for maximizing l_b(b) with respect to b.

Starting now with β^(0) = 2.0, the following happened:

Iteration s α(s) β (s) lβ (β (s) )


0 26.61179 2.000000 -73.68079
1 23.92023 1.662872 -68.31253
2 21.33190 1.366104 -64.63275
3 19.15131 1.129239 -62.71561
4 17.71581 0.976462 -62.14272
5 17.20004 0.921683 -62.09617
6 17.20225 0.921918 -62.09617
7 17.20189 0.921879 -62.09617
8 17.20195 0.921886 -62.09617
9 17.20194 0.921885 -62.09617
10 17.20194 0.921885 -62.09617

Also for more extreme starting values, convergence was obtained in this case. 

6.3 Smaller jumps

Formally, the Newton-Raphson algorithm only finds solutions to the equations
\[ l'(\theta) = 0, \tag{6.4} \]
which are the same as the scoring equations (3.4). From a purely numerical point of view, optimization through searching for solutions of (6.4) is not recommended. By only using the derivatives of the function, and not the function itself, the algorithm may converge to any stationary point, not distinguishing between maxima and minima. However, if the function l is concave and unimodal, the Newton-Raphson algorithm can be slightly modified such that convergence can be guaranteed. Consider the iterations
\[ \theta^{(s+1)} = \theta^{(s)} - \delta^{(s)}\left[l''(\theta^{(s)})\right]^{-1} l'(\theta^{(s)}), \tag{6.5} \]
where δ^(s) ≤ 1. At each iteration, start with δ^(s) = 1. If l(θ^(s+1)) < l(θ^(s)), divide δ^(s) by two and use (6.5) again. Repeat this until l(θ^(s+1)) ≥ l(θ^(s)). Because l is concave, it is always possible to find a δ^(s) small enough that this will be satisfied. In many statistical applications the log-likelihood function satisfies this property, making such a procedure operational.

A similar modification can be made for the scoring algorithm:
\[ \theta^{(s+1)} = \theta^{(s)} + \delta^{(s)} I^{-1}(\theta^{(s)})\, s(\theta^{(s)}). \tag{6.6} \]
Since I(θ^(s)) is positive definite, such an algorithm is guaranteed to converge to a (local) maximum, even if l(θ) is not concave.

Example 2 (Rainfall data, cont.)
For this example, J(θ) = I(θ), implying that J(θ) is positive (semi-)definite while l″(θ) is negative (semi-)definite. An implementation of a Newton-Raphson algorithm (6.5) allowing for smaller jumps is given in Programwindow 6.3. In order to make negative values of the parameters illegal, we set the log-likelihood in such cases to a very small value (a better implementation would be to reparametrize, as described in Section 6.2).

Running this algorithm with α^(0) = λ^(0) = 1, convergence was now reached after seven iterations. Note that a smaller δ was only necessary in the first iteration. When (α^(s), λ^(s)) gets closer to the optimal value, no modification of the ordinary Newton-Raphson algorithm is necessary.
Iteration s α^(s) λ^(s) l(θ^(s)) δ^(s)
0 1.0000000 1.000000 -50.9370
1 0.3841504 0.578052 146.8740 0.25
2 0.3518249 0.912280 171.4139 1.00
3 0.4019210 1.423652 182.4078 1.00
4 0.4319168 1.822000 185.1676 1.00
5 0.4402049 1.954300 185.3468 1.00
6 0.4407877 1.964326 185.3477 1.00
7 0.4407914 1.964380 185.3477 1.00

nr.profile.weibull2 = function(x, beta0, eps=0.000001)
{
  # Newton-Raphson on the reparametrized profile log-likelihood (6.3), b = log(beta)
  n = length(x); sumlogx = sum(log(x))
  diff = 1; b = log(beta0)
  beta = exp(b)
  w1 = sum(x^beta)
  l = n*log(beta) - n*log(w1/n) + (beta-1)*sumlogx - n
  alpha = (w1/n)^(1/beta)
  it = 0
  while(diff > eps)
  {
    it = it+1
    b.old = b
    w1 = sum(x^(exp(b)))
    w2 = sum((x^(exp(b))*log(x)))
    w3 = sum((x^(exp(b))*(log(x))^2))
    l1 = n - n*exp(b)*w2/w1 + exp(b)*sumlogx
    l2 = -n*exp(2*b)*(w2+(w3*w1-w2*w2)/w1)/w1 +
         exp(b)*sumlogx
    b = b - l1/l2
    diff = sum(abs(b-b.old))
    beta = exp(b)
    diff = abs(beta-exp(b.old))
    w1 = sum(x^beta)
    alpha = (w1/n)^(1/beta)
    l = n*log(beta) - n*log(w1/n) + (beta-1)*sumlogx - n
  }
  alpha = (w1/n)^(1/beta)
  list(alpha=alpha, beta=beta)
}

Programwindow 6.2: R routine for optimization of the profile likelihood function lb (b)
given in (6.3) with respect to b using the Newton-Raphson algorithm.

nr.gamma.mod = function(x, theta0=NULL, eps=0.000001)
{
  # Newton-Raphson for the gamma distribution with step halving (6.5)
  n = length(x); sumx = sum(x); sumlogx = sum(log(x))
  h = 0.0000001; diff = 1
  if(is.null(theta0))
    {alpha = mean(x)^2/var(x); lambda = mean(x)/var(x)}
  else
    {alpha = theta0[1]; lambda = theta0[2]}
  theta = c(alpha, lambda)
  l = n*alpha*log(lambda)+(alpha-1)*sumlogx-
      lambda*sumx-n*log(gamma(alpha))
  while(diff > eps)
  {
    theta.old = theta; l.old = l
    g = gamma(alpha)
    dg = (gamma(alpha+h)-gamma(alpha))/h
    d2g = (gamma(alpha+2*h)-2*gamma(alpha+h)+
           gamma(alpha))/h^2
    s = c(n*log(lambda)+sumlogx-n*dg/gamma(alpha),
          n*alpha/lambda-sumx)
    Jbar = -matrix(c(-n*(d2g*g-dg^2)/g^2, n/lambda,
                     n/lambda, -n*alpha/lambda^2), ncol=2)
    l = l.old-1; delta = 1
    while(l < l.old)   # step halving until the log-likelihood increases
    {
      theta = theta.old + delta*solve(Jbar, s)
      alpha = theta[1]; lambda = theta[2]
      if ((alpha < 0) || (lambda < 0))
        l = -9999999   # penalize illegal (negative) parameter values
      else
      {
        l = n*alpha*log(lambda)+(alpha-1)*sumlogx-
            lambda*sumx-n*log(gamma(alpha))
      }
      delta = delta/2
    }
    diff = sum(abs(theta-theta.old))
  }
  list(alpha, lambda, Jbar)
}

Programwindow 6.3: R code for running Newton-Raphson on rainfall data.


7 Non-linear regression

In this section we will consider maximum likelihood estimation in non-linear regression models. The general formulation of the model will be
\[ Y_i = g(x_i, \beta) + e_i, \qquad i = 1, \ldots, n, \]
where x_i is a vector of explanatory variables, β is a p-dimensional vector of unknown regression parameters and e_i is a noise term. We will make the standard assumptions about these noise terms:

(a) E[e_i] = 0.
(b) Var[e_i] = σ².
(c) e_1, ..., e_n are uncorrelated.
(d) The e_i's are normally distributed.

Multiple linear regression is the special case where
\[ g(x_i, \beta) = \beta_0 + \beta_1 x_{i,1} + \cdots + \beta_{p-1} x_{i,p-1}. \]
We will however in this section allow for nonlinear g functions.

Assume that {(y_i, x_i), i = 1, ..., n} are observed (y_i is the observed value of Y_i). Under the assumptions above, the likelihood function is given by
\[
L(\beta, \sigma^2) = \prod_{i=1}^{n} f(y_i; x_i, \beta, \sigma^2)
= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{1}{2\sigma^2}(y_i - g(x_i,\beta))^2},
\]
while the log-likelihood is
\[
l(\beta, \sigma^2) = \sum_{i=1}^{n}\left[-\tfrac{1}{2}\log(2\pi) - \tfrac{1}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}(y_i - g(x_i,\beta))^2\right]
= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2. \tag{7.1}
\]

For a general g-function, analytical expressions for the maximum likelihood estimates are not possible to obtain, and numerical methods have to be applied. For notational simplicity, define
\[ g_k'(x_i, \beta) = \frac{\partial}{\partial\beta_k} g(x_i, \beta) \tag{7.2} \]
and
\[ g_{k,l}''(x_i, \beta) = \frac{\partial^2}{\partial\beta_k\,\partial\beta_l} g(x_i, \beta). \tag{7.3} \]
The partial derivatives of l(β, σ²) with respect to β and σ² are then given by the score function s(β, σ²) with elements
\[
s_k(\beta, \sigma^2) = \frac{\partial}{\partial\beta_k} l(\beta, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i - g(x_i,\beta))\,g_k'(x_i,\beta), \tag{7.4}
\]
\[
s_{p+1}(\beta, \sigma^2) = \frac{\partial}{\partial\sigma^2} l(\beta, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2, \tag{7.5}
\]
and the observed information matrix J(β, σ²) with elements
\[
J_{k,l}(\beta, \sigma^2) = -\frac{\partial^2}{\partial\beta_k\,\partial\beta_l} l(\beta, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n}\left[g_k'(x_i,\beta)\,g_l'(x_i,\beta) - (y_i - g(x_i,\beta))\,g_{k,l}''(x_i,\beta)\right], \tag{7.7}
\]
\[
J_{k,p+1}(\beta, \sigma^2) = -\frac{\partial^2}{\partial\beta_k\,\partial\sigma^2} l(\beta, \sigma^2) = \frac{1}{\sigma^4}\sum_{i=1}^{n}(y_i - g(x_i,\beta))\,g_k'(x_i,\beta), \tag{7.8}
\]
\[
J_{p+1,p+1}(\beta, \sigma^2) = -\frac{\partial^2}{\partial\sigma^2\,\partial\sigma^2} l(\beta, \sigma^2) = -\frac{n}{2\sigma^4} + \frac{1}{\sigma^6}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2, \tag{7.10}
\]
where k, l = 1, ..., p. These quantities can be inserted directly into the general Newton-Raphson algorithm (4.7).

A more efficient algorithm can be obtained by utilizing that for given β, an analytical expression for the maximum likelihood estimate σ̂² of σ² is available. From (7.1),
\[
\frac{\partial}{\partial\sigma^2} l(\beta, \sigma^2) = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2,
\]
and the solution σ̂² of the equation \(\frac{\partial}{\partial\sigma^2} l(\beta, \sigma^2) = 0\) is given by
\[
\hat\sigma^2(\beta) = \frac{1}{n}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2. \tag{7.11}
\]
Inserting this into (7.1), we obtain the profile log-likelihood
\[
l_{\beta}(\beta) = l(\beta, \hat\sigma^2(\beta))
= -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log\!\left(\frac{1}{n}\sum_{i=1}^{n}(y_i - g(x_i,\beta))^2\right) - \frac{n}{2}.
\]
Maximizing l_β(β) with respect to β is equivalent to minimizing
\[
S(\beta) = \sum_{i=1}^{n}(y_i - g(x_i,\beta))^2,
\]
showing that, similarly to ordinary linear regression, maximizing the likelihood is equivalent to least squares estimation when we assume normal error terms. A Newton-Raphson algorithm for minimizing S(β) can be directly constructed.
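Since the problem reduces to nonlinear least squares, any hand-coded routine can be cross-checked against R's built-in nls() function. The following sketch uses the model of Example 5 below and assumes, as in the Programwindows there, that wtloss contains columns days and weight:

## Sketch: cross-checking the least-squares formulation with nls().
fit <- nls(weight ~ b0 + b1*exp(-b2*days),
           data  = wtloss,
           start = list(b0 = 90, b1 = 95, b2 = 0.005))
summary(fit)                            # least-squares estimates and standard errors
sigma2.hat <- mean(residuals(fit)^2)    # ML estimate of sigma^2, cf. (7.11)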
An alternative modification is to replace the observed information matrix J(β, σ²) by its expectation, that is, the Fisher information matrix I(β, σ²), similar to the approach in Section 5. Replacing y_i by Y_i in (7.7)-(7.10) and using that E[Y_i] = g(x_i, β), we get
\[
I_{k,l}(\beta, \sigma^2) = \frac{1}{\sigma^2}\sum_{i=1}^{n} g_k'(x_i,\beta)\,g_l'(x_i,\beta), \tag{7.12}
\]
\[
I_{k,p+1}(\beta, \sigma^2) = 0, \tag{7.13}
\]
\[
I_{p+1,p+1}(\beta, \sigma^2) = \frac{n}{2\sigma^4} \tag{7.14}
\]
for k, l = 1, ..., p. It can be shown that I(β, σ²) is positive definite. By using this matrix instead of J(β, σ²) in the Newton-Raphson algorithm we obtain the scoring algorithm. Both the Newton-Raphson algorithm and the scoring algorithm are easy to implement. Note however that the scoring version is somewhat simpler, since it only involves first derivatives. Further, defining θ = (β, σ²), s_β(β, σ²) to be the first p elements of s(β, σ²), s_{σ²}(β, σ²) to be the last element of s(β, σ²), I_β(β, σ²) to be the upper left p × p submatrix of I(β, σ²) and I_{σ²}(β, σ²) to be the (p + 1, p + 1) element of I(β, σ²), the scoring algorithm update can be written as
\[
\theta^{(s+1)} = \begin{pmatrix} \beta^{(s+1)} \\ (\sigma^2)^{(s+1)} \end{pmatrix}
= \theta^{(s)} + \begin{pmatrix} I_{\beta}(\beta, \sigma^2) & 0 \\ 0 & I_{\sigma^2}(\beta, \sigma^2) \end{pmatrix}^{-1}
\begin{pmatrix} s_{\beta}(\beta, \sigma^2) \\ s_{\sigma^2}(\beta, \sigma^2) \end{pmatrix}
= \begin{pmatrix} \beta^{(s)} + I_{\beta}(\beta, \sigma^2)^{-1} s_{\beta}(\beta, \sigma^2) \\ (\sigma^2)^{(s)} + I_{\sigma^2}(\beta, \sigma^2)^{-1} s_{\sigma^2}(\beta, \sigma^2) \end{pmatrix}.
\]

Noticing now that s_β(β, σ²) and I_β(β, σ²) only depend on σ² through the common factor 1/σ², we see that the updating of β is independent of σ², a reasonable property since, as noted above, the optimal value of β is unaffected by σ².

It can further be shown that at any iteration,
\[
(\sigma^2)^{(s+1)} = \frac{1}{n}\sum_{i=1}^{n}(y_i - g(x_i, \beta^{(s)}))^2,
\]
corresponding to the optimal solution given in (7.11). In practice this means that we only need to update β through the scoring algorithm, and after convergence the estimate of σ² can be obtained directly through (7.11).

As usual, uncertainty estimates of our estimates are of interest, and can be obtained through the information matrices. Either the observed or the expected (Fisher) information matrix can be used.

Using the Fisher information matrix, for large sample sizes the covariance matrix for (β̂, σ̂²) is given by
\[
I^{-1}(\beta, \sigma^2) = \begin{pmatrix} \sigma^2 I_{\beta}^{-1}(\beta) & 0 \\ 0 & \frac{2\sigma^4}{n} \end{pmatrix}. \tag{7.15}
\]
The 0 part on the off-diagonal of I^{-1}(β, σ²) implies that β̂ and σ̂² are independent for large sample sizes.
Example 5 (Weight loss programme)
Venables and Ripley (1999) contains a dataset (originally from Dr. T. Davies) describing the weights (y_i) of an obese patient after different numbers of days (x_i) since the start of a weight reduction programme. The data, also available from the course home page, are plotted in Figure 7.1. Venables and Ripley (1999) suggest the following model for this dataset:
\[ y_i = \beta_0 + \beta_1 e^{-\beta_2 x_i} + e_i. \]
So β = (β_0, β_1, β_2) contains 3 unknown regression parameters in this case. In order to implement the Newton-Raphson or scoring algorithm, we need the derivatives of the g-function. We have
\[
g'(x_i, \beta) = \begin{pmatrix} 1 & e^{-\beta_2 x_i} & -\beta_1 e^{-\beta_2 x_i} x_i \end{pmatrix},
\qquad
g''(x_i, \beta) = \begin{pmatrix} 0 & 0 & 0 \\ 0 & 0 & -e^{-\beta_2 x_i} x_i \\ 0 & -e^{-\beta_2 x_i} x_i & \beta_1 e^{-\beta_2 x_i} x_i^2 \end{pmatrix}.
\]
In Programwindow 7.1, R code for maximum likelihood estimation based on Newton-Raphson is given. Running this algorithm with start values given in the first row of the table below, convergence was reached after 7 iterations.

Figure 7.1: Weight loss for an obese patient.

Iteration s β0^(s) β1^(s) β2^(s) (σ²)^(s) l(β^(s), (σ²)^(s))
0 90.00000 95.0000 0.005000000 209.38680 -143.25920
1 84.33882 100.5511 0.005199136 0.7286630 -69.66973
2 76.34951 107.1314 0.004415786 1.4737980 -78.82676
3 76.80056 106.8413 0.004541713 0.6565196 -68.31437
4 81.66417 102.3930 0.004880670 0.6063372 -67.28066
5 81.39949 102.6619 0.004886570 0.5695842 -66.46777
6 81.37380 102.6841 0.004884399 0.5695808 -66.46769
7 81.37382 102.6841 0.004884401 0.5695808 -66.46769

In Programwindow 7.2, R code for maximum likelihood estimation based on the scoring algorithm is given. Running this algorithm with start values given in the first row of the table below, convergence was reached after 4 iterations.

Iteration s β0^(s) β1^(s) β2^(s) (σ²)^(s) l(β^(s), (σ²)^(s))
0 90.00000 95.0000 0.005000000 209.3868 -143.25920
1 81.40014 102.6569 0.004875966 0.5758058 -66.60900
2 81.37434 102.6836 0.004884439 0.5695808 -66.46769
3 81.37381 102.6841 0.004884401 0.5695808 -66.46769
4 81.37382 102.6841 0.004884401 0.5695808 -66.46769

In Table 7.1, standard errors based on large-sample approximations are given (first row). An alternative method for estimating the variability of the parameter estimates is bootstrapping. We will consider bootstrapping in the conditional inference setting. This means that we consider the explanatory variables to be fixed while the randomness comes from the noise terms. Bootstrap samples of Y_1, ..., Y_n can be obtained by
\[
Y_i^{*} = g(x_i, \hat\beta) + e_i^{*}, \qquad i = 1, \ldots, n,
\]
Figure 7.2: Histograms of bootstrap simulations of β̂_0 (upper left), β̂_1 (upper right), β̂_2 (lower left) and σ̂² (lower right) for the weight loss data.

Method Std[β̂0] Std[β̂1] Std[β̂2] Std[σ̂²]
Large sample approx 2.5354 2.3273 0.00018 0.1480
Fixed x, Parametric bootstrapping 2.1880 2.0118 0.00017 0.1473
Random x, Nonparametric bootstrapping 2.1800 1.9927 0.00018 0.1502
Fixed x, Nonparametric bootstrapping 2.2760 2.0776 0.00018 0.1533

Table 7.1: Standard errors based on large-sample approximation, parametric and nonparametric bootstrapping for the weight loss data.

where e* = (e_1^*, ..., e_n^*) are bootstrap samples of the noise terms. Two main alternatives for sampling e* are possible. In parametric bootstrapping, the model assumption e_i ∼ N(0, σ²) is used and we simulate e_i^* through e_i^* ∼ N(0, σ̂²). For nonparametric bootstrapping, the normality assumption is relaxed, and the e_i^*'s are sampled from an estimate of the distribution of e_i. This can be performed by sampling e_1^*, ..., e_n^* from the residuals (ê_1, ..., ê_n) with replacement.

The second and third rows in Table 7.1 show standard errors estimated by parametric and nonparametric bootstrapping. Figure 7.2 shows histograms of the 1000 nonparametric bootstrap simulations. They are all close to normal distributions, confirming the large-sample theory and also explaining the similarities of the standard errors obtained from the two approaches. The parametric simulations (not shown) look very similar. Programwindow 7.3 shows R code for performing parametric and nonparametric bootstrap simulations.

nr.weight = function(x, y, beta.start, eps=0.000001)
{
  # Newton-Raphson for the weight loss model y = beta0 + beta1*exp(-beta2*x) + e
  # Note: beta[i+1] corresponds to beta_i
  n = length(x)
  diff = 1; beta = beta.start
  y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
  res = y-y.pred
  while(diff > eps)
  {
    beta.old = beta
    s = c(sum(y-y.pred),
          sum((res)*exp(-beta[3]*x)),
          -beta[2]*sum((res)*exp(-beta[3]*x)*x))
    Jbar = matrix(c(n, sum(exp(-beta[3]*x)), -beta[2]*sum(exp(-beta[3]*x)*x),
                    sum(exp(-beta[3]*x)), sum(exp(-2*beta[3]*x)),
                    sum((res-beta[2]*exp(-beta[3]*x))*exp(-beta[3]*x)*x),
                    -beta[2]*sum(exp(-beta[3]*x)*x),
                    sum((res-beta[2]*exp(-beta[3]*x))*exp(-beta[3]*x)*x),
                    -beta[2]*sum((res-beta[2]*exp(-beta[3]*x))*
                                 exp(-beta[3]*x)*x*x)), ncol=3)
    beta = beta.old + solve(Jbar, s)
    y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
    res = y-y.pred
    diff = sum(abs(beta-beta.old))
  }
  sigma = mean((res)^2)
  list(beta=beta, sigma=sigma, Jbar=Jbar)
}

Programwindow 7.1: R code for running Newton-Raphson on weight loss data.

scoring.weight = function(x,y,beta.start,eps=0.000001)
{
  #Scoring algorithm for the weight loss model y = beta0 + beta1*exp(-beta2*x) + e
  #Note: beta[i+1] corresponds to beta_i
  n = length(x)
  diff = 1;beta = beta.start
  y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
  while(diff>eps)
  {
    beta.old = beta
    #Score vector (the common factor 1/sigma^2 is omitted; it cancels in the update)
    s = c(sum(y-y.pred),
          sum((y-y.pred)*exp(-beta[3]*x)),
          -beta[2]*sum((y-y.pred)*exp(-beta[3]*x)*x))
    #Expected information matrix (same factor omitted)
    Ibar = matrix(c(n,sum(exp(-beta[3]*x)),
                    -beta[2]*sum(exp(-beta[3]*x)*x),
                    sum(exp(-beta[3]*x)),sum(exp(-2*beta[3]*x)),
                    -beta[2]*sum(exp(-2*beta[3]*x)*x),
                    -beta[2]*sum(exp(-beta[3]*x)*x),
                    -beta[2]*sum(exp(-2*beta[3]*x)*x),
                    beta[2]*beta[2]*sum(exp(-2*beta[3]*x)*x*x)),ncol=3)
    #Scoring update
    beta = beta.old + solve(Ibar,s)
    y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
    diff = sum(abs(beta-beta.old))
  }
  #Maximum likelihood estimate of sigma^2 at the final beta
  sigma = mean((y-y.pred)^2)
  list(beta=beta,sigma=sigma,Ibar=Ibar)
}

Programwindow 7.2: R code for running the scoring algorithm on weight loss data.

#Parametric bootstrap: residuals simulated from the fitted normal distribution
n = dim(wtloss)[1]
fit = scoring.weight(wtloss$days,wtloss$weight,c(90,95,0.005))
beta = fit$beta
sigma = fit$sigma
ypred = beta[1]+beta[2]*exp(-beta[3]*wtloss$days)
res = wtloss$weight-ypred
B = 1000
beta.star = matrix(NA,nrow=B,ncol=3)
sigma.star = rep(NA,B)
for(b in 1:B)
{
  res.star = sqrt(sigma)*rnorm(n)
  weight.star = ypred+res.star
  fit = scoring.weight(wtloss$days,weight.star,beta)
  beta.star[b,] = fit$beta
  sigma.star[b] = fit$sigma
}

#Nonparametric bootstrap: residuals resampled with replacement from the
#observed residuals
n = dim(wtloss)[1]
fit = scoring.weight(wtloss$days,wtloss$weight,c(90,95,0.005))
beta = fit$beta
sigma = fit$sigma
ypred = beta[1]+beta[2]*exp(-beta[3]*wtloss$days)
res = wtloss$weight-ypred
B = 1000
beta.star = matrix(NA,nrow=B,ncol=3)
sigma.star = rep(NA,B)
for(b in 1:B)
{
  res.star = sample(res,n,replace=T)
  weight.star = ypred+res.star
  fit = scoring.weight(wtloss$days,weight.star,beta)
  beta.star[b,] = fit$beta
  sigma.star[b] = fit$sigma
}

Programwindow 7.3: R code for parametric (upper) and nonparametric (lower) bootstrap-
ping on weight loss data.

8 Logistic regression

In linear regression the response is usually assumed to be on a continuous scale. Many
other types of responses do, however, exist, creating a need for different regression methods.
In Table 8.2, a data set is given where the response is whether a beetle given a dose of
poison has died or not, i.e., a binary response. The explanatory variable is the amount of
poison. The data are grouped, since many beetles are given the same dose.

Dose      Number of insects   Number killed
1.6907 59 6
1.7242 60 13
1.7552 62 18
1.7842 56 28
1.8113 63 52
1.8369 59 53
1.8610 62 61
1.8839 60 60

Table 8.2: The mortality of beetles against dose of poison.

Assume Yi is a binary response while xi is an explanatory variable. In linear regression,
the expected response is modeled as a linear function of the explanatory variable:

    E[Yi] = β0 + β1 xi .

Note, however, that in the case of a binary response, the expectation is equal to the probability
for a beetle to die, a number between zero and one. The linear regression model is therefore
inappropriate, since the expected value under this model may vary from −∞ to ∞.

In logistic regression, the response is modeled by

    Yi ∼ binom(1, p(xi, β)),
    p(xi, β) = exp{β0 + β1 xi} / (1 + exp{β0 + β1 xi}).                    (8.16)

By making the usual assumption that all observations are independent, the likelihood
function becomes

    L(β) = ∏_{i=1}^n p(xi, β)^{yi} (1 − p(xi, β))^{1−yi} .

As usual, we consider the log-likelihood:


    l(β) = Σ_{i=1}^n [ yi log(p(xi, β)) + (1 − yi) log(1 − p(xi, β)) ].

Elementary calculations show that the scoring function is equal to

    s(β) = ( ∂l(β)/∂β0 , ∂l(β)/∂β1 )ᵀ = ( Σ_{i=1}^n [yi − p(xi, β)] , Σ_{i=1}^n [yi − p(xi, β)] xi )ᵀ

while the observed information matrix is given by

    J(β) = [ Σ_{i=1}^n p(xi, β)(1 − p(xi, β))       Σ_{i=1}^n p(xi, β)(1 − p(xi, β)) xi  ]
           [ Σ_{i=1}^n p(xi, β)(1 − p(xi, β)) xi    Σ_{i=1}^n p(xi, β)(1 − p(xi, β)) xi² ]        (8.17)

Note that J(β) does not depend on the random observations yi , showing that the expected
information matrix I(β) is equal to J(β). This implies that J(β) is always positive
definite, making the log-likelihood function concave with only one (global) maximum.
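
As a quick numerical illustration of this claim (an addition, not part of the original text),
one can check that the eigenvalues of J(β) are positive at an arbitrary parameter value.
The sketch below uses the eight distinct dose values from Table 8.2 as covariate vector for
simplicity; in the full data set each dose is repeated for every insect:

x = c(1.6907, 1.7242, 1.7552, 1.7842, 1.8113, 1.8369, 1.8610, 1.8839)
beta = c(-60, 34)                       # arbitrary parameter value
p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
Jbar = matrix(c(sum(p*(1-p)), sum(p*(1-p)*x),
                sum(p*(1-p)*x), sum(p*(1-p)*x*x)), ncol = 2)
eigen(Jbar)$values                      # both eigenvalues are positive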

In Programwindow 8.3, an R routine for optimizing the log-likelihood for the logistic
model is given. Although the log-likelihood is unimodal, a modification of the ordinary
Newton-Raphson algorithm allowing for smaller jumps (as described in section 6.3) is
needed in order to make the algorithm robust to the choice of starting values.

Example 6 (Beetle data)


To illustrate logistic regression, we will analyze the data given in Table 8.2. Note that
these data are grouped. In order to use the expressions derived above, they need to be
converted to individual data. This can be performed by the following R commands:

beetle = read.table("beetle.dat",col.names=c("dose","n","y"))
m = dim(beetle)[1]
x = NULL
y = NULL
for(j in 1:m)
{
  #One row per insect: response 1 for the killed, 0 for the survivors
  x = c(x,rep(beetle$dose[j],beetle$n[j]))
  y = c(y,rep(1,beetle$y[j]),rep(0,beetle$n[j]-beetle$y[j]))
}
beetle2 = data.frame(dose=x,resp=y)
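
The routine in Programwindow 8.3 can then be applied directly to this data frame (explana-
tory variable in the first column, binary response in the second). A minimal sketch; the
comparison with R's built-in glm routine is an addition for checking the result, not part of
the original analysis:

fit = nr.logist(beetle2, beta.start = c(2, 1))
fit$beta                                  # should give the estimates reported below
# Cross-check with the standard R routine for logistic regression
glm(resp ~ dose, family = binomial, data = beetle2)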

Running the routine described in Programwindow 8.3, convergence was reached after 6
iterations, to β̂0 = −60.72 and β̂1 = 34.27. The values at different iterations are shown
below.

Iteration s      β0^(s)        β1^(s)       l(β^(s))     δ^(s)
0 2.00000 1.00000 -721.3648
1 -104.29550 57.96621 -248.0056 0.25
2 -45.92656 25.95912 -191.0286 0.50
3 -57.76158 32.60580 -186.4047 1.00
4 -60.58140 34.19359 -186.2358 1.00
5 -60.71715 34.27016 -186.2354 1.00
6 -60.71745 34.27033 -186.2354 1.00
7 -60.71745 34.27033 -186.2354 1.00

Concerning the uncertainty involved in these estimates, the large sample approximation
to the covariance matrix is given by the inverse of the observed information matrix (8.17)
with β = β̂ inserted. This gave the estimated covariance matrix

    Var[β̂] ≈ [  26.8398   −15.0822 ]
              [ −15.0822     8.4806 ]

and standard errors 5.1807 and 2.9121 for β̂0 and β̂1 , respectively. 
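
These standard errors can be reproduced from (8.17); a minimal sketch, rebuilding the
information matrix at β̂ for the individual-level data (the routine in Programwindow 8.3
only returns β̂, so J(β̂) is recomputed here):

beta = c(-60.71745, 34.27033)             # beta-hat from the iterations above
x = beetle2$dose
p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
Jbar = matrix(c(sum(p*(1-p)), sum(p*(1-p)*x),
                sum(p*(1-p)*x), sum(p*(1-p)*x*x)), ncol = 2)
solve(Jbar)                               # approximate covariance matrix of beta-hat
sqrt(diag(solve(Jbar)))                   # standard errors for beta0-hat and beta1-hat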

9 Discussion

In this note we have discussed several different numerical procedures for optimization.
Although our primary concern has been maximum likelihood problems, the procedures
could just as well have been applied to other optimization problems.

In general it is difficult to give recommendations on which procedure to use. Some
comparative remarks (Titterington et al. 1985) can, however, be made:

(a) Direct procedures will in many cases work, but convergence could be extremely slow.

(b) The Newton-Raphson method and the Method of Scoring are usually more compli-
cated, and there is no guarantee of monotonicity.

(c) If the Newton-Raphson method converges, it converges fast (second order).

(d) For the Newton-Raphson method, the observed information matrix is directly given
as part of the algorithm, while for direct maximization some further calculations are
needed in order to obtain this.

(e) In general, no method is guaranteed to give the global optimum. The algorithms
should therefore be run with different starting values, as illustrated in the sketch below.
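
A minimal sketch of such a multi-start strategy, here using R's general-purpose optimizer
optim on a simulated gamma sample (both the data and the negloglik function below are
hypothetical and chosen only for illustration):

set.seed(1)
x = rgamma(100, shape = 0.5, rate = 2)         # simulated data for illustration
negloglik = function(theta) {                  # negative gamma log-likelihood
  if (any(theta <= 0)) return(Inf)             # keep the search in the valid region
  -sum(dgamma(x, shape = theta[1], rate = theta[2], log = TRUE))
}
best = NULL
for (k in 1:10) {
  theta0 = runif(2, 0.1, 5)                    # random starting value
  fit = optim(theta0, negloglik)
  if (is.null(best) || fit$value < best$value) best = fit
}
best$par                                       # estimate from the best of the ten runs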

nr.logist = function(data,beta.start,eps=0.000001)
{
  #Newton-Raphson with step halving for logistic regression
  #data: explanatory variable in the first column, binary response in the second
  x=data[,1];y=data[,2];n=length(x)
  diff=1;beta=beta.start
  p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
  l = sum(y*log(p)+(1-y)*log(1-p))
  while(diff>eps)
  {
    beta.old = beta
    l.old = l
    p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
    #Score vector and observed information matrix, see (8.17)
    s = c(sum(y-p),sum((y-p)*x))
    Jbar = matrix(c(sum(p*(1-p)),sum(p*(1-p)*x),
                    sum(p*(1-p)*x),sum(p*(1-p)*x*x)),ncol=2)
    #Step halving: start with a full Newton step (delta=1) and halve the
    #step length until the log-likelihood no longer decreases
    l=l.old-1;delta=1
    while(l<l.old)
    {
      beta = beta.old + delta*solve(Jbar,s)
      p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
      l = sum(y*log(p)+(1-y)*log(1-p))
      delta=delta/2
    }
    diff = sum(abs(beta-beta.old))
  }
  list(beta=beta)
}

Programwindow 8.3: R code for running the Newton-Raphson algorithm for logistic regression.

A R code

x = scan("ILLRAIN.DAT",na.strings="*")
x = x[!is.na(x)]
#Grid of parameter values for the gamma distribution
alpha = seq(0.35,0.55,0.005);lambda = seq(1,3,0.05)
loglik = matrix(nrow=length(alpha),ncol=length(lambda))
n = length(x);sumx=sum(x);sumlogx = sum(log(x));
#Gamma log-likelihood evaluated at each grid point
for(i in 1:length(alpha))
  for(j in 1:length(lambda))
    loglik[i,j] = n*alpha[i]*log(lambda[j])+(alpha[i]-1)*sumlogx-
                  lambda[j]*sumx-n*log(gamma(alpha[i]))
par(mfrow=c(1,2))
#image(alpha,lambda,exp(loglik),col=gray((0:32)/32))
#image(alpha,lambda,loglik,col=gray((0:32)/32))
persp(alpha,lambda,exp(loglik),theta=330,phi=45,shade=1,zlab="lik")
persp(alpha,lambda,loglik,theta=330,phi=45,shade=1,zlab="l")

Programwindow A.1: R code for plotting likelihood function (figure 1.3) for rainfall data.

References
D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman & Hall, London, 1984.

J.L. Devore and K.N. Berk. Modern mathematical statistics with applications. Duxbury
Press, 2007. ISBN 0534404731.

B. S. Everitt. Introduction to optimization methods and their applications in statistics.
Chapman and Hall, London, 1987.

K. Lange. Numerical Analysis for Statisticians. Statistics and Computing. Springer Verlag,
1999.

J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, Belmont, California,
second edition, 1995.

D. M. Titterington, A. F. M. Smith, and U. E. Makov. Statistical Analysis of Finite
Mixture Distributions. John Wiley & Sons, New York, 1985.

C. F. Van Loan. Introduction to scientific computing. Prentice Hall, Upper Saddle River,
NJ 07458, second edition, 2000.

W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-plus. Statistics and
Computing. Springer Verlag, third edition, 1999.

