Numerical Optimization of Likelihoods: Additional Literature For STK2120
As described in section 7.2 of the textbook by Devore and Berk (2007), maximum likelihood
is an important estimation method for statistical analysis. The textbook also gives
several examples for which analytical expressions of the maximum likelihood estimators
are available. In this note we will be concerned with examples of models where numerical
methods are needed in order to obtain the estimates. Some of these models are “standard”
in the sense that statistical software is available for direct computation. In other cases,
user-made programs must be applied. In both cases the use of a computer is essential, and
knowledge of using and/or developing software becomes vital. Throughout the note we
will illustrate the different optimization methods through three examples. R code used in
connection with these examples is given either directly in the text or in the appendix.
Example 1 (Muon decay)
The angle θ at which electrons are emitted in muon decay has a distribution with the
density
f(x|α) = (1 + αx)/2,   −1 ≤ x ≤ 1,  −1 ≤ α ≤ 1,
where x = cos θ (Rice 1995, Example D, section 8.4). The following data are simulated
from the given distribution:
The likelihood of the observed sample x_1, ..., x_n is lik(α) = Π_{i=1}^n (1 + αx_i)/2, with
log-likelihood l(α) = Σ_{i=1}^n log(1 + αx_i) − n log 2. l is a smooth function of α, so at the
maximum point the derivative should be equal to zero (if not, the maximum is at one of the
end-points). Now
l′(α) = Σ_{i=1}^n x_i/(1 + αx_i).
Figure 1.1: Likelihood (left) and log-likelihood (right) for muon decay example.
Unfortunately, no closed-form expression is available for the solution of the equation l′(α) = 0.
Figure 1.1 shows lik(α) and l(α) as functions of α for the actual value of x. A good guess
of the optimal value of α could be obtained by visual inspection. An automatic method
would, however, be preferable, both to save work and to avoid subjectivity.
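A simple way to automate the visual check is to evaluate the log-likelihood over a grid of α values and pick the grid point with the largest value. The following sketch illustrates this; since the simulated data values are not reproduced in this note, it first generates a hypothetical sample from f(x|α) by rejection sampling (the helper rmuon() and the value α = 0.5 are only illustrative assumptions).

rmuon <- function(n, alpha)
{
  # rejection sampler for f(x|alpha) = (1 + alpha*x)/2 on [-1, 1]
  out = numeric(0)
  M = (1 + abs(alpha))/2                      # maximum value of the density
  while(length(out) < n)
  {
    x = runif(n, -1, 1)
    out = c(out, x[runif(n) < (1 + alpha*x)/(2*M)])
  }
  out[1:n]
}
set.seed(1)
x = rmuon(100, alpha = 0.5)                   # hypothetical data
loglik = function(a) sum(log(1 + a*x)) - length(x)*log(2)
alpha.grid = seq(-0.99, 0.99, by = 0.01)
l.grid = sapply(alpha.grid, loglik)
plot(alpha.grid, l.grid, type = "l", xlab = "alpha", ylab = "log-likelihood")
alpha.grid[which.max(l.grid)]                 # crude guess at the maximum point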
Example 2 (Rainfall measurements)
A distribution commonly applied to data such as rainfall amounts is the gamma distribution,
which has probability density
f(x|α, λ) = (1/Γ(α)) λ^α x^{α−1} e^{−λx},   x > 0,
where Γ(α) is the gamma function as defined in Devore and Berk (2007, page 190). As-
suming that all observations are independent, the likelihood is in this case
lik(α, λ) = Π_{i=1}^n (1/Γ(α)) λ^α x_i^{α−1} e^{−λx_i}.
Also in this case the log-likelihood is a smooth function of the parameters involved, and
the derivatives are
∂l(α, λ)/∂α = n log(λ) + Σ_{i=1}^n log(x_i) − n Γ′(α)/Γ(α); (1.2)
∂l(α, λ)/∂λ = n α λ⁻¹ − Σ_{i=1}^n x_i. (1.3)
Setting the derivatives to zero and solving the equations is again difficult, making direct
analytical solutions intractable.
In Figure 1.3, lik(α, λ) and l(α, λ) are plotted for different values of α and λ. These
plots are produced by the commands given in Programwindow A.1.
Also in this case a good guess of the mode is possible, but a numerical procedure would
be preferable.
Example 3 (Survival times for leukemia)
In Cox and Oakes (1984) a dataset concerning the survival times for leukemia is discussed.
We will consider a subset of these data, given in Table 1.1.
A distribution commonly applied to survival data is the Weibull distribution which has
probability density
f(x; α, β) = βα⁻¹ (x/α)^{β−1} e^{−(x/α)^β}.
Figure 1.3: Perspective and contour plot of likelihood function (left) and log-likelihood
function (right) for rainfall measurements.
Here α is a scale parameter while β describes the shape of the distribution. Assuming all
observations x1 , ..., xn are independent, the likelihood function is
lik(α, β) = Π_{i=1}^n βα⁻¹ (x_i/α)^{β−1} e^{−(x_i/α)^β}.
In Figure 1.4, lik(α, β) and l(α, β) are plotted for different values of α and β. Differentiating
the log-likelihood again gives equations that cannot be solved analytically, so numerical methods are needed here as well.
56 65 17 7 16 22 3 4 2 3 8 4 3 30 4 43
Table 1.1: Survival times for leukemia. Subset of a dataset from Cox and Oakes (1984),
corresponding to those individuals who responded “absent” to a given test.
In this note we will discuss how numerical methods can be applied to solve statistical
optimization problems. We will not go deeply into the general theory behind numerical
methods; for this, we refer to other courses (e.g., MAT-INF 1100, INF-MAT 3370). Our
intention is to demonstrate how such numerical methods can be applied to statistical
problems, and in addition to present some optimization methods that are specialized for use in
statistics. We will start with some general remarks concerning optimization (section 2).
Our primary application will be maximum likelihood estimation. Therefore, some properties
of the log-likelihood function and the maximum likelihood estimators (section 3)
will also be discussed. The remaining sections consider different optimization methods.
All methods are iterative: they start with an initial guess of the parameters; next,
a new candidate for the optimum, presumably better than the initial guess, is found; this
new guess is used to find an even better one, and so on. Section 4 considers the Newton-
Raphson method, which is probably the most widely used optimization method in statistics. A
slight modification of this method, the scoring algorithm, is discussed in section 5. Some
refinements of these algorithms are discussed in section 6. All procedures discussed are
illustrated by examples. The datasets and R code can be found on the course homepage.
R routines are also given in the text.
Figure 1.4: Perspective and contour plot of likelihood function (left) and log-likelihood
function (right) for leukemia data.
Optimization is a huge field in numerical analysis, and we will only focus on a few methods
that are widely used in statistics. Although many of the general-purpose methods are
also useful for statistical problems, methods specifically designed for likelihood maximization
may be preferable in certain situations. In any application we may choose whether we
want to maximize or minimize, since maximization of a function l is equivalent to minimization
of −l. Since our primary concern will be likelihood maximization, we will assume
that maximization is our aim.
For a general discussion of optimization methods, we refer to Van Loan (2000). Lange
(1999) is a very good reference on numerical analysis in general as applied to statistics. We
will only give some general comments here:
• In more complicated problems, the concavity property is generally lost, and local maximum points may
occur. In such cases, there is no guarantee that the numerical method will return
the global maximum point. Two standard remedies are then widely used:
(a) Find local maxima starting from widely varying starting values and then pick
the maximum of these (a sketch is given after this list);
(b) Perturb a local optimum by a “small” amount, and use this as a starting point
for a new run of your routine. Then see if your routine converges to a better
point, or “always” to the same one.
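The following sketch illustrates strategy (a) for a problem with two parameters, using the gamma log-likelihood from Example 2 with simulated data in place of the rainfall measurements (the data, sample size and starting regions are illustrative assumptions only).

set.seed(1)
x = rgamma(200, shape = 0.5, rate = 2)        # hypothetical data
negloglik = function(theta)
{
  if(any(theta <= 0)) return(Inf)             # keep the search in the legal region
  -sum(dgamma(x, shape = theta[1], rate = theta[2], log = TRUE))
}
starts = cbind(runif(10, 0.1, 3), runif(10, 0.1, 3))   # widely varying starting values
fits = apply(starts, 1, function(s) optim(s, negloglik))
best = fits[[which.min(sapply(fits, function(f) f$value))]]
best$par                                      # candidate for the global maximum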
• When searching for an appropriate method, you must choose between methods
that need only evaluations of the function itself and methods that also require
evaluations of the derivative(s) of the function. Algorithms using the derivatives are
somewhat more powerful, but the extra cost of calculating them may result in less
efficiency overall. Note, however, that in order to obtain the information and standard
errors for the estimators, derivatives may be needed anyway (see next section).
The optimum of the likelihood function lik(θ) coincides with the optimum of the log-
likelihood function l(θ). We will consider maximization of the log-likelihood function. In
numerical literature, the function to maximize is usually denoted the objective function.
We will follow this convention when discussing maximization algorithms in general.
Devore and Berk (2007, sec. 7.4) discuss large sample properties of maximum likelihood
estimators. Some of these results will be repeated here, and some extra properties
important for numerical optimization of likelihood functions will be introduced. We will
start by assuming only one unknown parameter θ; afterwards we will generalize to several
parameters.
3.1 The one-parameter situation
The derivative of the log-likelihood, s(θ; X) = l′(θ; X), is called the score function and has
expectation zero when X follows the distribution indexed by θ. The variability of s(θ; X) reflects the uncertainty involved in estimating θ. The variance
of s is called the Fisher information (sometimes it is also called the expected information).
The Fisher information I(θ) of an experiment can be written as
I(θ) = Var[s(θ; X)] = −E[l″(θ)]. (3.2)
(In this note we use I both for the information based on one observation and for the information
from a random sample; the textbook distinguishes between these situations by using I_n for the latter.)
For proof of the properties given above for the discrete case with n = 1, see Devore and
Berk (2007, page 365). The general case is a trivial extension. A consequence of the first
equality in (3.2) is that I(θ) is always non-negative.
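A small simulation can be used to check the identity (3.2) numerically. The sketch below does this for a single exponential observation with rate θ, for which s(θ; X) = 1/θ − X and −l″(θ) = 1/θ², so both quantities should be close to 1/θ² (the exponential model is chosen here only as a simple illustration, it is not one of the examples of the note).

theta = 2
x = rexp(100000, rate = theta)
s = 1/theta - x                               # score of a single observation
c(var(s), 1/theta^2)                          # Var[s] and -E[l''] should both be near 0.25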
3.2 The multi-parameter situation
Assume θ = (θ_1, ..., θ_p)ᵀ is a vector of p, say, unknown parameters. The derivative of the
log-likelihood, still named the score function, is now a vector:
s(θ; X) = l′(θ; X) = (1/lik(θ; X)) lik′(θ; X). (3.3)
The ith element of the vector s(θ; X), s_i(θ; X), is the partial derivative of l(θ; X) with respect
to θ_i. A more mathematically correct way of expressing s(θ; X) would be ∂l(θ; X)/∂θ,
but we will use the simpler form l′(θ; X).
As for the one-parameter case, each s_i(θ; X) has expectation zero. Finding the MLE
by solving the scoring equations
s(θ̂; X) = 0 (3.4)
now results in a set of p equations with p unknowns.
The expected information is now generalized to be a matrix I(θ) with the (i, j)th entry
given by
I_{ij}(θ) = Cov[s_i(θ; X), s_j(θ; X)] = −E[∂²l(θ)/(∂θ_i ∂θ_j)]. (3.5)
Here the second equality can be proven by a simple generalization of the argument in the
one-parameter case. In the multi-parameter situation we usually call I(θ) the expected
information matrix or the Fisher information matrix. An important property of I(θ) is
that it is always positive (semi-)definite¹. This will be of importance in the construction
of the scoring algorithm in section 5.
The large sample theory described in section 7.4 of Devore and Berk (2007) also generalizes
to the multi-parameter case, so that the MLE is, under appropriate smoothness conditions
on f, consistent. Further, each element θ̂_i of θ̂ will, for a sufficiently large number
of observations, have a sampling distribution that is approximately normal with expectation θ_i
and variance equal to {I⁻¹(θ)}_{ii}, the ith entry of the diagonal of I⁻¹(θ). Further, the
covariance between θ̂_i and θ̂_j is approximately given by the (i, j)th entry of I⁻¹(θ).
Note that I(θ) depends on the unknown quantity θ. Common practice is to insert the
estimated value θ̂ for θ, giving an estimate I(θ̂) of I(θ).
A further complication is that the expectations in (3.5) are not always possible to
compute. Then an alternative is to use the observed information matrix J (θ) with (i, j)th
entry given by
J_{ij}(θ) = −∂²l(θ)/(∂θ_i ∂θ_j). (3.6)
¹A matrix I is positive semi-definite if a′Ia ≥ 0 for all vectors a.
As for the expected information matrix, an estimate θ̂ needs to be inserted for θ in order to
evaluate J(θ̂). The ith diagonal element of J⁻¹(θ̂) can then be used as an approximation
for the variance of θ̂_i instead of the ith diagonal element of I⁻¹(θ̂). Both these approximations
will be equally valid in the sense that as the number of observations increases, the
approximation error will decrease to zero. If it can be calculated, the expected information
is preferable, since the observed information can in some cases be unstable. Note that in
many standard models used in statistics, I(θ) = J(θ).
One of the most used methods for optimization in statistics is the Newton-Raphson method
(or Newton’s rule). It is based on approximating the function which we want to optimize
by a quadratic one. The optimum of the approximation (which is easy to calculate) gives
a guess of the optimum for the actual function. If this guess is not adequately close to the
optimum, a new approximation is computed and the process repeated.
Assume for simplicity that l only involves a one-dimensional parameter and that θ̄ is
our current best guess of the maximum of l(θ). By a Taylor series expansion around
θ̄, l(θ) can be approximated by the quadratic function
l̃_θ̄(θ) = l(θ̄) + l′(θ̄)(θ − θ̄) + (1/2) l″(θ̄)(θ − θ̄)².
At the point θ̄, l(θ) and l̃_θ̄(θ) have equal values and equal first and second derivatives. Note that for
the particular case when l is a log-likelihood function, the Hessian is equal to minus the
observed information evaluated at θ = θ̄, l″(θ̄) = −J(θ̄).
At the optimum of the approximation, the derivative of l̃_θ̄(θ) is equal to zero, giving
the equation l′(θ̄) + l″(θ̄)(θ − θ̄) = 0 with solution
θ = θ̄ − l′(θ̄)/l″(θ̄). (4.3)
Figure 4.1: A function l(θ) (solid line) and its quadratic approximation ˜lθ̄ (θ) (dashed line).
The point (θ̄, l(θ̄)) is given by a ∗.
This gives a procedure for optimizing ˜lθ̄ (θ). Our aim is however to optimize l(θ). Since
˜lθ̄ (θ) is an approximation of l(θ), our hope is then that (4.3) will give us a new value closer
to the optimum of l(θ). This suggests an iterative procedure for optimizing l(θ):
θ^(s+1) = θ^(s) − l′(θ^(s))/l″(θ^(s)), (4.4)
which is the Newton-Raphson method. The procedure is run until there is no significant
difference between θ^(s) and θ^(s+1).
Note that θ^(s+1) = θ^(s) is equivalent to l′(θ^(s)) = 0. This demonstrates that when the algorithm
has converged, we have reached a stationary point of l(θ). This point could be a maximum
point, a minimum point or even a saddle point. However, if l′′ (θ(s) ) < 0, the point is a
maximum point. This should be checked in each case. Figure 4.2 shows the first four
iterations of a Newton-Raphson algorithm. The results from the last two iterations are
almost indistinguishable, indicating that convergence is reached.
For likelihood optimization, the iteration can be written in terms of the score function and
the observed information as
θ^(s+1) = θ^(s) + s(θ^(s))/J(θ^(s)).
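The iteration (4.4) is easy to implement as a small generic routine. The sketch below takes the first and second derivative of the objective as arguments; the helper name newton1d and the toy objective are only for illustration.

newton1d = function(theta0, dl, d2l, eps = 1e-8, maxit = 100)
{
  theta = theta0
  for(i in 1:maxit)
  {
    step = dl(theta)/d2l(theta)
    theta = theta - step
    if(abs(step) < eps) break
  }
  theta
}
# maximize l(theta) = log(theta) - theta, which has its maximum at theta = 1
newton1d(0.5, dl = function(t) 1/t - 1, d2l = function(t) -1/t^2)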
Figure 4.2: A function l(θ) (solid line) and its quadratic approximations ˜lθs (θ) for s = 0
(solid plus star), s = 1 (dashed), s = 2 (dashed-dotted) and s = 3 (dotted). The points
(θs , l(θs )) for s = 0, 1, 2, 3, 4 are given by their numbers.
The Newton-Raphson algorithm will only converge to the closest stationary point. If
several such points are present, the choice of starting value becomes critical. In many
statistical problems, reasonable starting values can be found by other means, e.g. moment
estimators.
For the muon decay example, l″(α) = −Σ_{i=1}^n x_i²/(1 + αx_i)² is always negative (and therefore
J(α) = Σ_{i=1}^n x_i²/(1 + αx_i)² is always positive) for all possible α
values, showing that the log-likelihood function is concave and unimodal.
# Newton-Raphson for the muon decay example; alpha0 is the starting value and the
# iteration stops when the change in alpha is below eps.
nr.muon <- function(x,alpha0=0.6,eps=0.000001)
{
n = length(x)
diff = 1;
alpha = alpha0;
l = sum(log(1+alpha*x))-n*log(2)          # log-likelihood at the current alpha
while(diff>eps)
{
alpha.old = alpha
s = sum(x/(1+alpha*x))                    # score l'(alpha)
Jbar = sum((x/(1+alpha*x))^2)             # observed information J(alpha)
alpha = alpha+s/Jbar
l = sum(log(1+alpha*x))-n*log(2)
diff = abs(alpha-alpha.old)
}
list(alpha,Jbar)
}
The small value on the right hand side is much smaller than necessary, but is used in order
to demonstrate properties of the Newton-Raphson algorithm. A run of this function gave
the following results:
We see that after one iteration, one decimal is correct. This increases to three decimals
after 2 iterations and to at least 7 decimals after 3 iterations, illustrating the rapid increase
in accuracy of the Newton-Raphson algorithm. Further, since the log-likelihood function
is concave and unimodal, the resulting point is a maximum point.
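A usage sketch for nr.muon(): since the original simulated values are not given, the call below re-uses the rmuon() helper from the grid-search sketch in Example 1 to generate data.

set.seed(2)
x = rmuon(100, alpha = 0.5)                   # hypothetical data
fit = nr.muon(x)
fit[[1]]                                      # estimate of alpha
1/sqrt(fit[[2]])                              # approximate standard error from J(alpha-hat)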
The Fisher information is in this case
I(α) = E[J(α)] = n E[X²/(1 + αX)²]
     = (n/2) ∫_{−1}^{1} x²/(1 + αx) dx
     = (n/(2α²)) [∫_{−1}^{1} (αx − 1) dx + ∫_{−1}^{1} 1/(1 + αx) dx]
     = −n/α² + (n/(2α³)) log((1 + α)/(1 − α)). (4.5)
For α̂ = 0.4943927 we get Î = 11.78355, giving an approximate standard error of 0.291
for α̂. Using the observed information instead, the standard error would be estimated as 0.297.
The argument for deriving the Newton-Raphson algorithm for optimization in one
dimension can be directly extended to multi-dimensional problems giving the multi-para-
meter Newton-Raphson method:
θ^(s+1) = θ^(s) − [l″(θ^(s))]⁻¹ l′(θ^(s)), (4.6)
where l′ (θ) now is a vector consisting of the partial derivatives while l′′ (θ) is a matrix
with (i, j) entry equal to the second derivative with respect to θi and θj . l′′ (θ) is usually
denoted as the Hessian matrix. The algorithm can be written as
θ (s+1) = θ (s) + J −1 (θ (s) )s(θ (s) ). (4.7)
for likelihood optimization, where s(θ) is the score function while J (θ) is the observed
information matrix. In order to check if the resulting point is a maximum point, we
need to see if the observed information matrix is positive definite, which is the case if all
eigenvalues of J(θ̂) are positive.
The first derivatives (or score functions) are given by (1.2)-(1.3) while the second derivatives
are
∂²l(α, λ)/∂α² = −n [Γ″(α)Γ(α) − Γ′(α)²]/Γ(α)²,   ∂²l(α, λ)/(∂λ∂α) = n λ⁻¹, (4.8)
∂²l(α, λ)/∂λ² = −n α λ⁻². (4.9)
A problem with implementing the Newton-Raphson algorithm in this case is that the first
and second derivatives of the gamma function are not directly available. Numerically they
can, however, be approximated by
Γ′(α) ≈ [Γ(α + h) − Γ(α)]/h,
Γ″(α) ≈ [Γ′(α + h) − Γ′(α)]/h ≈ [Γ(α + 2h) − 2Γ(α + h) + Γ(α)]/h²
for small h.
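In R, the derivatives of the gamma function are in fact available through the built-in functions digamma() and trigamma(), since Γ′(α)/Γ(α) = digamma(α) and its derivative equals trigamma(α). A sketch of the score and information using these built-ins (the function names gamma.score and gamma.info are illustrative, not from the note):

gamma.score = function(alpha, lambda, x)
{
  n = length(x)
  c(n*log(lambda) + sum(log(x)) - n*digamma(alpha),   # (1.2)
    n*alpha/lambda - sum(x))                          # (1.3)
}
gamma.info = function(alpha, lambda, n)
{
  # observed (= expected) information for the gamma model; cf. (4.8)-(4.9)
  matrix(c(n*trigamma(alpha), -n/lambda,
           -n/lambda, n*alpha/lambda^2), ncol = 2)
}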
A run of this function, using the moment estimates for α and λ as starting values, gave
the following results:
Also for this example, convergence is fast. The algorithm is, however, sensitive to starting
values. Using α^(0) = λ^(0) = 1, both α^(1) and λ^(1) become negative, and the
algorithm crashes. We will see in section 6 how this can be avoided.
The observed information matrix J(θ) = −l″(θ) is directly available from the Newton-
Raphson algorithm. Note from (4.8)-(4.9) that J(θ) does not depend on the observations,
so the Fisher information matrix is in this case equal to the observed one.
Inserting the estimate of θ we get
I(θ̂) = [[1394.1, −115.6], [−115.6, 25.9]],   I⁻¹(θ̂) = [[0.00114, 0.0051], [0.0051, 0.0613]].
# Newton-Raphson for the gamma model; the derivatives of the gamma function are
# approximated by finite differences with step h.
nr.gamma <- function(x,eps=0.000001)
{
n = length(x);sumx = sum(x);sumlogx = sum(log(x))
diff = 1;h = 0.0000001;
alpha = mean(x)^2/var(x);lambda=mean(x)/var(x)
theta = c(alpha,lambda)
while(diff>eps)
{
theta.old = theta
g = gamma(alpha)
dg = (gamma(alpha+h)-gamma(alpha))/h
d2g = (gamma(alpha+2*h)-2*gamma(alpha+h)+
gamma(alpha))/h^2
s = c(n*log(lambda)+sumlogx-n*dg/gamma(alpha),
n*alpha/lambda-sumx)
Jbar = matrix(c(n*(d2g*g-dg^2)/g^2,-n/lambda,
-n/lambda,n*alpha/lambda^2),ncol=2)
theta = theta + solve(Jbar,s)
alpha = theta[1];lambda = theta[2]
diff = sum(abs(theta-theta.old))
}
list(theta=theta,Jbar=Jbar)
}
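A usage sketch for nr.gamma(): the rainfall data are read from file in appendix A, so simulated gamma data are used here instead (the sample size and parameter values are arbitrary, chosen close to the estimates reported for the rainfall data later in the note).

set.seed(3)
x = rgamma(200, shape = 0.44, rate = 1.96)    # hypothetical data
fit = nr.gamma(x)
fit$theta                                     # estimates of (alpha, lambda)
sqrt(diag(solve(fit$Jbar)))                   # approximate standard errors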
Note that since J(θ) = I(θ), J(θ) will be positive definite for all values of θ. The
log-likelihood function is then concave, so the values of (α̂, λ̂) obtained by
the Newton-Raphson algorithm correspond to the global maximum.
and (1.6). Further
J_{1,1} = −∂²l(θ)/∂α² = −nβ/α² + (β(β + 1)/α²) Σ_{i=1}^n (x_i/α)^β,
J_{1,2} = −∂²l(θ)/(∂α∂β) = n/α − (1/α) Σ_{i=1}^n (x_i/α)^β − (β/α) Σ_{i=1}^n (x_i/α)^β log(x_i/α),
J_{2,2} = −∂²l(θ)/∂β² = n/β² + Σ_{i=1}^n (x_i/α)^β [log(x_i/α)]².
Programwindow 4.3 shows R code for a Newton-Raphson algorithm in this case. With
starting values α^(0) = 10 and β^(0) = 1 and a convergence criterion
|α^(s+1) − α^(s)| + |β^(s+1) − β^(s)| < 0.00001,
we get the following results:
Iteration s α(s) β (s) l(θ (s) )
0 10.00000 1.0000000 -138.00580
1 11.88883 0.8904244 -62.98770
2 15.09949 0.9287394 -62.22634
3 16.74320 0.9244928 -62.10186
4 17.17639 0.9220478 -62.09619
5 17.20186 0.9218854 -62.09617
6 17.20194 0.9218849 -62.09617
7 17.20194 0.9218849 -62.09617
Inserting θ̂ for θ into J(θ) gives
J(θ̂) = [[0.0460, −0.4324], [−0.4324, 36.3039]],   J⁻¹(θ̂) = [[24.5071, 0.2919], [0.2919, 0.0310]],
giving approximate standard errors 4.9505 for α̂ and 0.1761 for β̂.
This Newton-Raphson algorithm is also sensitive to the starting values chosen. With
α^(0) = 20 and β^(0) = 2, both α^(1) and β^(1) become negative, which are illegal values for
these parameters.
# Newton-Raphson for the Weibull model of Example 3; theta0 = c(alpha0, beta0).
nr.weibull = function(x,theta0,eps=0.000001)
{
n = length(x);
sumlogx = sum(log(x));
diff = 1;theta = theta0;alpha = theta[1];beta = theta[2]
while(diff>eps)
{
theta.old = theta
w1 = sum((x/alpha)^beta)
w2 = sum((x/alpha)^beta*log(x/alpha))
w3 = sum((x/alpha)^beta*log(x/alpha)^2)
s = c(-n*beta/alpha+beta*w1/alpha,
n/beta-n*log(alpha)+sumlogx-w2)
Jbar = matrix(c(-n*beta/alpha^2+beta*(beta+1)*w1/alpha^2,
n/alpha-w1/alpha-beta*w2/alpha,
n/alpha-w1/alpha-beta*w2/alpha,n/beta^2+w3),
ncol=2)
theta = theta + solve(Jbar,s)
alpha = theta[1];beta = theta[2]
diff = sum(abs(theta-theta.old))
}
list(alpha=alpha,beta=beta,Jbar=Jbar)
}
Programwindow 4.3: R routine for Newton-Raphson estimation in the Weibull model for
the leukemia data.
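A usage sketch for this routine on the leukemia data of Table 1.1, using the starting values from the iteration table above:

x = c(56, 65, 17, 7, 16, 22, 3, 4, 2, 3, 8, 4, 3, 30, 4, 43)
fit = nr.weibull(x, theta0 = c(10, 1))
c(fit$alpha, fit$beta)                        # approximately 17.20 and 0.92
sqrt(diag(solve(fit$Jbar)))                   # approximately 4.95 and 0.18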
Figure 5.1 illustrates one type of problem that can occur when using the Newton-Raphson
algorithm. Here θ̄ lies in a region where l(θ) is convex (that is, the second derivative is
positive). Using (4.3) will then give a reduction in l, since (4.3) in this case finds a minimum
point of l̃_θ̄(θ).
This problem with the Newton-Raphson method carries directly over to the multi-parameter
case. In general, if at least one of the eigenvalues of the Hessian matrix
l″(θ) is positive, a smaller value of l(θ) can result from one iteration to the next.
A standard trick in the numerical literature for such cases is to replace the Hessian matrix
by another matrix which is negative definite. For likelihood optimization, this means
replacing J(θ) with a positive definite matrix. A possible candidate could be the identity
matrix, but a more efficient choice is available. The Fisher information I(θ) is equal to
Figure 5.1: Illustration of a Newton-Raphson step taken from a point θ̄ where l(θ) is convex.
the expectation of J (θ), indicating that these two matrices should be similar. But I(θ)
will always be positive definite, making this matrix a possibility for replacing J (θ). This
is the Fisher’s method of scoring (or the scoring algorithm):
θ (s+1) = θ (s) + I −1 (θ (s) )s(θ (s) ). (5.1)
Note that for this algorithm, the Fisher information is directly available.
In general it is not possible to say which algorithm is preferable when both converge.
The next example illustrates a situation where the Newton-Raphson algorithm
has problems converging, while the scoring algorithm is more robust.
Example 4 (Truncated Poisson)
We will consider an example which involves the estimation of the parameter of a truncated
Poisson distribution given by
f(x; θ) = θ^x e^{−θ} / [x!(1 − e^{−θ})],   x = 1, 2, . . .
scoring.muon <- function(x,alpha0=0.6,eps=0.000001)
{
n = length(x)
diff = 1;
alpha = alpha0;
l = sum(log(1+alpha*x))-n*log(2)
while(diff>eps)
{
alpha.old = alpha
s = sum(x/(1+alpha*x))
Ibar = n*(log((1+alpha)/(1-alpha))/(2*alpha^3)-1/(alpha^2))
alpha = alpha+s/Ibar
l = sum(log(1+alpha*x))-n*log(2)
diff = abs(alpha-alpha.old)
}
list(alpha,Ibar)
}
Programwindow 5.1: R routine for estimation in the muon decay example using the scoring
algorithm.
x 1 2 3 4 5 6
fx 1486 694 195 37 10 1
Table 5.1: Grouped data from a truncated Poisson distribution, where fx represents the
frequency of x. Table 4.1 from Everitt (1987).
Such a density might arise, for example, in studying the size of groups at parties. The data
in Table 5.1 is taken from Everitt (1987) and represents samples from this distribution.
The likelihood function is then
lik(θ) = Π_{i=1}^n θ^{x_i} e^{−θ} / [x_i!(1 − e^{−θ})].
nr.trpo = function(x,theta0,eps=0.000001)
{
n = sum(x);i = 1:6;sumx = sum(x*i)
theta = theta0;diff = 1
while(diff> eps)
{
theta.old = theta
s = sumx/theta-n/(1-exp(-theta))
Jbar = sumx/theta^2-n*exp(-theta)/(1-exp(-theta))^2
theta = theta+s/Jbar
diff = abs(theta-theta.old)
}
list(theta,Jbar)
}
Programwindow 5.2: R routine for estimation in the truncated Poisson distribution using
the Newton-Raphson algorithm.
Turn now to the scoring algorithm. The expectation in the truncated Poisson distribu-
tion is θ/(1 − e−θ ), which gives the Fisher information
I(θ) = n/(1 − e^{−θ}) · [1/θ − e^{−θ}/(1 − e^{−θ})].
In Programwindow 5.3, R code for performing the scoring algorithm is given. Note that
the only change from the Newton-Raphson algorithm is the replacement of J(θ) with I(θ).
Starting at θ^(0) = 1.5, only 4 iterations were needed for convergence to θ̂ = 0.8925. For
θ^(0) = 2.0, convergence was also obtained, now after 6 iterations. Convergence was also
obtained for all other starting values tried out.
Inserting θ̂, we obtain I(θ̂) = 1750.8, giving an approximate standard error for θ̂ equal
to 0.0239.
scoring.trpo = function(x,theta0,eps=0.000001)
{
n = sum(x);i = 1:6;sumx = sum(x*i)
theta = theta0;diff = 1
while(diff> eps)
{
theta.old = theta
s = sumx/theta-n/(1-exp(-theta))
Ibar = n*(1/theta-exp(-theta)/(1-exp(-theta)))/(1-exp(-theta))
theta = theta+s/Ibar
diff = abs(theta-theta.old)
}
list(theta,Ibar)
}
Programwindow 5.3: R routine for estimation in the truncated Poisson distribution using
the Fisher scoring algorithm.
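A usage sketch for the truncated Poisson routines with the grouped frequencies of Table 5.1:

freq = c(1486, 694, 195, 37, 10, 1)           # frequencies of x = 1, ..., 6
fit = scoring.trpo(freq, theta0 = 1.5)
fit[[1]]                                      # approximately 0.8925
1/sqrt(fit[[2]])                              # approximately 0.0239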
Direct use of the Newton-Raphson or the Fisher scoring algorithm can in many cases be
problematic. In this section we will discuss three possibilities for improving these algorithms:
reduction of the optimization problem to a smaller dimension, reparametrization,
and smaller jumps.
6.1 Dimension reduction
For some maximum likelihood problems where a full analytical solution is impossible, some
of the parameters can still be expressed analytically as functions of the others. We will illustrate this through
the leukemia data example.
For given β, maximizing the Weibull log-likelihood with respect to α gives
α̂(β) = [(1/n) Σ_{i=1}^n x_i^β]^{1/β}.
Inserting this into the log-likelihood gives
l_β(β) = l(α̂(β), β)
       = n log(β) − nβ log([(1/n) Σ_{i=1}^n x_i^β]^{1/β}) + (β − 1) Σ_{i=1}^n log(x_i) − Σ_{i=1}^n x_i^β / [(1/n) Σ_{i=1}^n x_i^β]
       = n log(β) − n log((1/n) Σ_{i=1}^n x_i^β) + (β − 1) Σ_{i=1}^n log(x_i) − n. (6.2)
Note that the likelihood function has been reduced to a function in one variable, which is
much easier to maximize. The partially maximized log-likelihood function lβ (β) is called
the profile log-likelihood function and is plotted in Figure 6.2.
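A sketch reproducing a plot of the type in Figure 6.2, evaluating (6.2) over a grid of β values for the leukemia data of Table 1.1 (the grid limits are arbitrary):

x = c(56, 65, 17, 7, 16, 22, 3, 4, 2, 3, 8, 4, 3, 30, 4, 43)
n = length(x)
lbeta = function(beta) n*log(beta) - n*log(mean(x^beta)) + (beta - 1)*sum(log(x)) - n
beta.grid = seq(0.3, 2, by = 0.01)
plot(beta.grid, sapply(beta.grid, lbeta), type = "l",
     xlab = "beta", ylab = "profile log-likelihood")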
Figure 6.2: Profile log-likelihood lβ (β) given in (6.2) for the leukemia data.
6.2 Reparametrization
Many of the parameters involved have restrictions on their values. For the gamma dis-
tributed data discussed in example 2, both parameters need to be positive. The same is
nr.profile.weibull = function(x,beta0,eps=0.00001)
{
n = length(x);sumlogx = sum(log(x));
diff = 1;beta = beta0
while(diff>eps)
{
beta.old = beta
w1 = sum(x^beta)
w2 = sum((x^beta)*log(x))
w3 = sum((x^beta)*log(x)*log(x))
w4 = sum(log(x))
l1 = n/beta-n*w2/w1+w4
l2 = -n/(beta^2)-n*(w3*w1-w2^2)/(w1^2)
beta = beta.old - l1/l2;
diff = abs(beta-beta.old)
}
alpha = (w1/n)^(1/beta)
list(alpha=alpha,beta=beta)
}
Programwindow 6.1: R routine for optimization of the profile likelihood function lβ (β)
given in (6.2) with respect to β using the Newton-Raphson algorithm.
true for the Weibull data considered in example 3. For both the Newton-Raphson algo-
rithm and the Fisher scoring algorithm, negative values can occur at some iterations, and
the procedures can break down. The basic idea in such cases is to use reparametrization.
Define b = log(β). Since β > 0, b can take values on the whole real line. Rewriting
lβ (β) as a function of b, we get
l_b(b) = nb − n log((1/n) Σ_{i=1}^n x_i^{e^b}) + (e^b − 1) Σ_{i=1}^n log(x_i) − n. (6.3)
To use Newton-Raphson, we need the derivatives:
l_b′(b) = n − n e^b Σ_{i=1}^n x_i^{e^b} log(x_i) / Σ_{i=1}^n x_i^{e^b} + e^b Σ_{i=1}^n log(x_i),
l_b″(b) = −n e^b Σ_{i=1}^n x_i^{e^b} log(x_i) / Σ_{i=1}^n x_i^{e^b}
          − n e^{2b} [Σ_{i=1}^n x_i^{e^b} (log x_i)² · Σ_{i=1}^n x_i^{e^b} − (Σ_{i=1}^n x_i^{e^b} log x_i)²] / (Σ_{i=1}^n x_i^{e^b})²
          + e^b Σ_{i=1}^n log(x_i).
Programwindow 6.2 shows R code for a Newton-Raphson algorithm for maximizing lb (b)
with respect to b.
Also for more extreme starting values, convergence was obtained in this case.
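A usage sketch for the reparametrized routine in Programwindow 6.2 on the leukemia data of Table 1.1; the result should agree with the earlier two-parameter Newton-Raphson fit (α̂ ≈ 17.20, β̂ ≈ 0.92):

x = c(56, 65, 17, 7, 16, 22, 3, 4, 2, 3, 8, 4, 3, 30, 4, 43)
nr.profile.weibull2(x, beta0 = 1)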
6.3 Smaller jumps
The maximum likelihood estimate solves
l′(θ) = 0, (6.4)
which is equal to the scoring equations (3.4). From a pure numerical point of view, optimization
through searching for solutions of (6.4) is not recommended. By only using the
derivatives of the function, and not the function itself, the algorithm may converge to any
stationary point, not distinguishing between maxima and minima. However, if the function
l is concave and unimodal, the Newton-Raphson algorithm can be slightly modified such
that convergence can be guaranteed. Consider the iterations
θ^(s+1) = θ^(s) − δ^(s) [l″(θ^(s))]⁻¹ l′(θ^(s)), (6.5)
where δ^(s) ≤ 1. At each iteration, start with δ^(s) = 1. If l(θ^(s+1)) < l(θ^(s)), divide δ^(s)
by two, and use (6.5) again. Repeat this until l(θ^(s+1)) ≥ l(θ^(s)). Because l is concave, it
is always possible to find a δ^(s) small enough for this to be satisfied. In many
statistical situations the log-likelihood function satisfies this property, making such a
procedure operational.
The same modification can be applied to the scoring algorithm, θ^(s+1) = θ^(s) + δ^(s) I⁻¹(θ^(s)) s(θ^(s)).
Since I(θ^(s)) is positive definite, such an algorithm is guaranteed to converge to a (local)
maximum, even if l(θ) is not concave.
Running this algorithm on the gamma example with α^(0) = λ^(0) = 1, convergence was now reached after seven
iterations. Note that a smaller δ was only necessary in the first iteration. When (α^(s), λ^(s))
gets closer to the optimal value, no modification of the ordinary Newton-Raphson algorithm
is necessary.
Iteration s    α^(s)    λ^(s)    l(θ^(s))    δ^(s)
0 1.0000000 1.000000 -50.9370
1 0.3841504 0.578052 146.8740 0.25
2 0.3518249 0.912280 171.4139 1.00
3 0.4019210 1.423652 182.4078 1.00
4 0.4319168 1.822000 185.1676 1.00
5 0.4402049 1.954300 185.3468 1.00
6 0.4407877 1.964326 185.3477 1.00
7 0.4407914 1.964380 185.3477 1.00
nr.profile.weibull2 = function(x,beta0,eps=0.000001)
{
n = length(x);sumlogx = sum(log(x))
diff = 1;b = log(beta0)
beta = exp(b)
w1 = sum(x^beta)
l = n*log(beta)-n*log(w1/n)+(beta-1)*sumlogx-n
alpha = (w1/n)^(1/beta)
it = 0
while(diff>eps)
{
it = it+1
b.old = b
w1 = sum(x^(exp(b)))
w2 = sum((x^(exp(b))*log(x)))
w3 = sum((x^(exp(b))*(log(x))^2))
l1 = n-n*exp(b)*w2/w1+exp(b)*sumlogx
l2 = -n*exp(b)*w2/w1 - n*exp(2*b)*(w3*w1-w2*w2)/w1^2 +
exp(b)*sumlogx   # second derivative of l_b(b), cf. the formula above
b = b - l1/l2
diff = sum(abs(b-b.old))
beta = exp(b)
diff = abs(beta-exp(b.old))
w1 = sum(x^beta)
alpha = (w1/n)^(1/beta)
l = n*log(beta)-n*log(w1/n)+(beta-1)*sumlogx-n
}
alpha = (w1/n)^(1/beta)
list(alpha=alpha,beta=beta)
}
Programwindow 6.2: R routine for optimization of the profile likelihood function lb (b)
given in (6.3) with respect to b using the Newton-Raphson algorithm.
# Modified Newton-Raphson with step halving (section 6.3) for the gamma model;
# starting values default to the moment estimates.
nr.gamma.mod = function(x,theta0=NULL,eps=0.000001)
{
n = length(x);sumx = sum(x);sumlogx = sum(log(x))
h = 0.0000001;diff = 1
if(is.null(theta0))
{alpha = mean(x)^2/var(x);lambda=mean(x)/var(x)}
else
{alpha = theta0[1];lambda=theta0[2]}
theta = c(alpha,lambda)
l = n*alpha*log(lambda)+(alpha-1)*sumlogx-
lambda*sumx-n*log(gamma(alpha))
while(diff>eps)
{
theta.old = theta;l.old = l
g = gamma(alpha)
dg = (gamma(alpha+h)-gamma(alpha))/h
d2g = (gamma(alpha+2*h)-2*gamma(alpha+h)+
gamma(alpha))/h^2
s = c(n*log(lambda)+sumlogx-n*dg/gamma(alpha),
n*alpha/lambda-sumx)
Jbar = -matrix(c(-n*(d2g*g-dg^2)/g^2,n/lambda,
n/lambda,-n*alpha/lambda^2),ncol=2)
l = l.old-1;delta = 1
while(l < l.old)
{
theta = theta.old + delta*solve(Jbar,s)
alpha = theta[1];lambda = theta[2]
if ((alpha < 0) || (lambda < 0))
l = -9999999
else
{
l = n*alpha*log(lambda)+(alpha-1)*sumlogx-
lambda*sumx-n*log(gamma(alpha))
}
delta = delta/2
}
diff = sum(abs(theta-theta.old))
}
list(alpha,lambda,Jbar)
}
7 Non-linear regression
Assume observations Y_1, ..., Y_n follow the model Y_i = g(x_i, β) + e_i, i = 1, ..., n, where g is
a known function, x_i is an explanatory variable, β = (β_1, ..., β_p) is a vector of unknown
parameters and the e_i's are error terms satisfying
(a) E[ei ] = 0.
(b) Var[ei ] = σ 2 .
(c) e1 , ..., en are uncorrelated.
(d) The ei ’s are normally distributed.
Assume that {(yi, xi ), i = 1, ..., n} are observed (yi is the observed value of Yi ). Under
the assumptions above, the likelihood function is given by
L(β, σ²) = Π_{i=1}^n f(y_i; x_i, β, σ²) = Π_{i=1}^n (1/(√(2π) σ)) e^{−(y_i − g(x_i, β))²/(2σ²)},
while the log-likelihood is
l(β, σ²) = Σ_{i=1}^n [−(1/2) log(2π) − (1/2) log(σ²) − (1/(2σ²))(y_i − g(x_i, β))²]
         = −(n/2) log(2π) − (n/2) log(σ²) − (1/(2σ²)) Σ_{i=1}^n (y_i − g(x_i, β))². (7.1)
For a general function g, analytical expressions for the maximum likelihood estimates are
not possible to obtain, and numerical methods have to be applied. For notational simplicity,
define
g′_k(x_i, β) = ∂g(x_i, β)/∂β_k (7.2)
and
g″_{k,l}(x_i, β) = ∂²g(x_i, β)/(∂β_k ∂β_l). (7.3)
The partial derivatives of l(β, σ²) with respect to β and σ² are then given by the score
function s(β, σ²) with elements
s_k(β, σ²) = ∂l(β, σ²)/∂β_k = (1/σ²) Σ_{i=1}^n (y_i − g(x_i, β)) g′_k(x_i, β), (7.4)
s_{p+1}(β, σ²) = ∂l(β, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (y_i − g(x_i, β))², (7.5)
while the observed information matrix J(β, σ²) = −l″(β, σ²) has elements
J_{k,l}(β, σ²) = −∂²l(β, σ²)/(∂β_k ∂β_l) = (1/σ²) Σ_{i=1}^n [g′_k(x_i, β) g′_l(x_i, β) − (y_i − g(x_i, β)) g″_{k,l}(x_i, β)], (7.7)
J_{k,p+1}(β, σ²) = −∂²l(β, σ²)/(∂β_k ∂σ²) = (1/σ⁴) Σ_{i=1}^n (y_i − g(x_i, β)) g′_k(x_i, β), (7.8)
J_{p+1,p+1}(β, σ²) = −∂²l(β, σ²)/(∂σ² ∂σ²) = −n/(2σ⁴) + (1/σ⁶) Σ_{i=1}^n (y_i − g(x_i, β))², (7.10)
where k, l = 1, ..., p. These quantities can be inserted directly into the general Newton-
Raphson algorithm (4.7).
A more efficient algorithm can be obtained by utilizing that, for given β, an analytical
expression for the maximum likelihood estimate σ̂² of σ² is available. From (7.1),
∂l(β, σ²)/∂σ² = −n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (y_i − g(x_i, β))²,
and the solution σ̂²(β) to the equation ∂l(β, σ²)/∂σ² = 0 is given by
σ̂²(β) = (1/n) Σ_{i=1}^n (y_i − g(x_i, β))². (7.11)
Inserting σ̂²(β) back into (7.1) shows that maximizing the likelihood with respect to β is
equivalent to minimizing the sum of squares S(β) = Σ_{i=1}^n (y_i − g(x_i, β))², showing that,
similarly to ordinary linear regression, maximum likelihood is equivalent to least squares
estimation when we assume normal error terms. A Newton-Raphson
algorithm for minimizing S(β) can be directly constructed.
An alternative modification is to replace the observed information matrix J (β, σ) by
its expectation, that is the Fisher information matrix I(β, σ), similar to the approach in
section 5. Replacing yi by Yi in (7.7)-(7.10) and using that E[Yi ] = g(xi , β), we get
I_{k,l}(β, σ) = (1/σ²) Σ_{i=1}^n g′_k(x_i, β) g′_l(x_i, β), (7.12)
I_{k,p+1}(β, σ) = 0, (7.13)
I_{p+1,p+1}(β, σ) = n/(2σ⁴), (7.14)
for k, l = 1, ..., p. It can be shown that I(β, σ) is positive (semi-)definite. By using this matrix
instead of J(β, σ) in the Newton-Raphson algorithm we obtain the scoring algorithm. Both
the Newton-Raphson algorithm and the scoring algorithm are easy to implement. Note,
however, that the scoring version is somewhat simpler since it only involves first derivatives.
Further, defining θ = (β, σ²), s_β(β, σ²) to be the first p elements of s(β, σ²), s_{σ²}(β, σ²) to
be the last element of s(β, σ²), I_β(β, σ²) to be the upper left p × p submatrix of I(β, σ²) and
I_{σ²}(β, σ²) to be the (p + 1, p + 1) element of I(β, σ²), the scoring algorithm update can be
written as
θ^(s+1) = (β^(s+1), (σ²)^(s+1))
        = θ^(s) + [[I_β(β, σ²), 0], [0, I_{σ²}(β, σ²)]]⁻¹ (s_β(β, σ²), s_{σ²}(β, σ²))ᵀ
        = (β^(s) + I_β(β, σ²)⁻¹ s_β(β, σ²), (σ²)^(s) + I_{σ²}(β, σ²)⁻¹ s_{σ²}(β, σ²)),
where the score and information are evaluated at (β^(s), (σ²)^(s)).
Noticing now that s_β(β, σ²) and I_β(β, σ²) only depend on σ² through the common factor
1/σ², we see that the updating of β is independent of σ², a reasonable property since, as
noted above, the optimal value of β is unaffected by σ². Further, the update for σ² becomes
(σ²)^(s+1) = (σ²)^(s) + (2(σ²)²/n)[−n/(2σ²) + (1/(2σ⁴)) Σ_{i=1}^n (y_i − g(x_i, β))²] = (1/n) Σ_{i=1}^n (y_i − g(x_i, β))²,
corresponding to the optimal solution given in (7.11). In practice this means that we only
need to update β through the scoring algorithm, and after convergence the estimate of σ²
can be obtained directly through (7.11).
As usual, uncertainty estimates of our estimates are of interest, and can be obtained
through the information matrices. Either the observed or the expected (Fisher) information
matrices can be used.
Using the Fisher information matrix, for large sample sizes the covariance matrix for
(β̂, σ̂²) is given by
I⁻¹(β, σ²) = [[σ² I_β(β)⁻¹, 0], [0, 2σ⁴/n]]. (7.15)
The 0 part on the off-diagonal of I⁻¹(β, σ²) implies that β̂ and σ̂² are (approximately) independent for large
sample sizes.
Example 5 (Weight loss programme)
Venables and Ripley (1999) contain a dataset (originally from Dr. T. Davies) describing
weights (y_i) of obese patients after different numbers of days (x_i) since the start of a weight
reduction programme. The data, also available from the course home page, are plotted in
Figure 7.1. Venables and Ripley (1999) suggest the following model for this dataset:
y_i = β_0 + β_1 e^{−β_2 x_i} + e_i.
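The fits reported below can also be checked by minimizing the sum of squares S(β) for this model directly with R's general-purpose optim() function. The sketch assumes that a data frame wtloss with columns days and weight (as used in the routines below) has been read from the course home page; the starting values are the ones used later in the note.

S = function(beta) sum((wtloss$weight - (beta[1] + beta[2]*exp(-beta[3]*wtloss$days)))^2)
fit = optim(c(90, 95, 0.005), S, control = list(parscale = c(90, 95, 0.005)))
fit$par                                       # least squares = maximum likelihood estimate of beta
S(fit$par)/nrow(wtloss)                       # estimate of sigma^2, cf. (7.11)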
Figure 7.1: The weight loss data: weight plotted against the number of days since the start
of the programme.
In Table 7.1, standard errors based on large sample approximations (first row) are given.
An alternative method for estimating the variability of the parameter estimates is boot-
strapping. We will consider bootstrapping in the conditional inference setting. This means
that we consider the explanatory variables to be fixed while the randomness appears from
the noise terms. Bootstrap samples of Y1 , ..., Yn can be obtained by
Y*_i = g(x_i, β̂) + e*_i,   i = 1, ..., n,
Figure 7.2: Histogram of bootstrap simulations of βb0 (upper left), βb1 (upper right), βb2
(lower left) and σb2 (lower right) for weight loss data.
Table 7.1: Standard errors based on large sample approximation, parametric and non-
parametric bootstrapping for weight loss data.
where e* = (e*_1, ..., e*_n) are bootstrap samples of the noise terms. Two main alternatives for
sampling e* are possible. In parametric bootstrapping, the model assumption e_i ∼ N(0, σ²)
is used and we simulate e*_i through e*_i ∼ N(0, s²). For nonparametric bootstrapping, the
normality assumption is relaxed, and the e*_i's are sampled from an estimate of the distribution
of e_i. This can be performed by sampling e*_1, ..., e*_n from (e_1, ..., e_n) with replacement.
The second and third rows in Table 7.1 show standard errors estimated by parametric
and non-parametric bootstrapping. Figure 7.2 shows histograms of the 1000 nonparametric
bootstrap simulations. They are all close to normal distributions, confirming the large sample
theory and also explaining the similarity of the standard errors obtained from the two
methods. The parametric simulations (not shown) look very similar. Programwindow 7.3
shows R code for performing parametric and nonparametric bootstrap simulations.
# Newton-Raphson for the weight loss model y = beta0 + beta1*exp(-beta2*x) + e;
# returns beta, the estimate of sigma^2 and the observed information Jbar.
nr.weight = function(x,y,beta.start,eps=0.000001)
{
#Note: beta[i+1] corresponds to beta_i
n = length(x)
diff = 1;beta = beta.start
y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
res = y-y.pred
while(diff>eps)
{
beta.old = beta
s = c(sum(y-y.pred),
sum((res)*exp(-beta[3]*x)),
-beta[2]*sum((res)*exp(-beta[3]*x)*x))
Jbar =
matrix(c(n,sum(exp(-beta[3]*x)),-beta[2]*sum(exp(-beta[3]*x)*x),
sum(exp(-beta[3]*x)),sum(exp(-2*beta[3]*x)),
sum((res-beta[2]*exp(-beta[3]*x))*exp(-beta[3]*x)*x),
-beta[2]*sum(exp(-beta[3]*x)*x),
sum((res-beta[2]*exp(-beta[3]*x))*exp(-beta[3]*x)*x),
-beta[2]*sum((res-beta[2]*exp(-beta[3]*x))*
exp(-beta[3]*x)*x*x)),ncol=3)
beta = beta.old + solve(Jbar,s)
y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
res = y-y.pred
diff = sum(abs(beta-beta.old))
}
sigma = mean((res)^2)
list(beta=beta,sigma=sigma,Jbar=Jbar)
}
scoring.weight = function(x,y,beta.start,eps=0.000001)
{
#Note: beta[i+1] correspond to beta_i
n = length(x)
diff = 1;beta = beta.start
y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
while(diff>eps)
{
beta.old = beta
s = c(sum(y-y.pred),
sum((y-y.pred)*exp(-beta[3]*x)),
-beta[2]*sum((y-y.pred)*exp(-beta[3]*x)*x))
Ibar = matrix(c(n,sum(exp(-beta[3]*x)),
-beta[2]*sum(exp(-beta[3]*x)*x),
sum(exp(-beta[3]*x)),sum(exp(-2*beta[3]*x)),
-beta[2]*sum(exp(-2*beta[3]*x)*x),
-beta[2]*sum(exp(-beta[3]*x)*x),
-beta[2]*sum(exp(-2*beta[3]*x)*x),
beta[2]*beta[2]*sum(exp(-2*beta[3]*x)*x*x)),ncol=3)
beta = beta.old + solve(Ibar,s)
y.pred = beta[1]+beta[2]*exp(-beta[3]*x)
diff = sum(abs(beta-beta.old))
}
sigma = mean((y-y.pred)^2)
list(beta=beta,sigma=sigma,Ibar=Ibar)
}
Programwindow 7.2: R code for running the scoring algorithm on weight loss data.
n = dim(wtloss)[1]
fit = scoring.weight(wtloss$days,wtloss$weight,c(90,95,0.005))
beta = fit$beta
sigma = fit$sigma
ypred=beta[1]+beta[2]*exp(-beta[3]*wtloss$days)
res=wtloss$weight-ypred
B = 1000
beta.star = matrix(NA,nrow=B,ncol=3)
sigma.star = rep(NA,B)
for(b in 1:B)
{
res.star = sqrt(sigma)*rnorm(n)
weight.star = ypred+res.star;
fit = scoring.weight(wtloss$days,weight.star,beta);
beta.star[b,] = fit$beta
sigma.star[b] = fit$sigma
}
n = dim(wtloss)[1]
fit = scoring.weight(wtloss$days,wtloss$weight,c(90,95,0.005))
beta = fit$beta
sigma = fit$sigma
ypred=beta[1]+beta[2]*exp(-beta[3]*wtloss$days)
res=wtloss$weight-ypred
B = 1000
beta.star = matrix(NA,nrow=B,ncol=3)
sigma.star = rep(NA,B)
for(b in 1:B)
{
res.star <- sample(res,n,replace=T)
weight.star = ypred+res.star
fit = scoring.weight(wtloss$days,weight.star,beta);
beta.star[b,] = fit$beta
sigma.star[b] = fit$sigma
}
Programwindow 7.3: R code for parametric (upper) and nonparametric (lower) bootstrap-
ping on weight loss data.
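Bootstrap standard errors of the type reported in Table 7.1 can then be computed as the standard deviations of the simulated estimates from either of the loops above:

apply(beta.star, 2, sd)                       # standard errors for beta0, beta1, beta2
sd(sigma.star)                                # standard error for the estimate of sigma^2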
8 Logistic regression
Elementary calculations show that the score function is equal to
s(β) = (∂l(β)/∂β_0, ∂l(β)/∂β_1)ᵀ = (Σ_{i=1}^n [y_i − p(x_i, β)], Σ_{i=1}^n x_i[y_i − p(x_i, β)])ᵀ,
where p(x_i, β) = exp(β_0 + β_1 x_i)/(1 + exp(β_0 + β_1 x_i)), while the observed information matrix is
J(β) = [[Σ_{i=1}^n p(x_i, β)(1 − p(x_i, β)), Σ_{i=1}^n x_i p(x_i, β)(1 − p(x_i, β))], [Σ_{i=1}^n x_i p(x_i, β)(1 − p(x_i, β)), Σ_{i=1}^n x_i² p(x_i, β)(1 − p(x_i, β))]].
Note that J(β) does not depend on the random observations y_i, showing that the expected
information matrix I(β) is equal to J(β). This implies that J(β) is always positive
definite, making the log-likelihood function concave with only one (global) maximum.
In Programwindow 8.3, an R routine for optimizing the log-likelihood for the logistic
model is given. Although the log-likelihood is unimodal, a modification of the ordinary
Newton-Raphson algorithm allowing for smaller jumps (as described in section 6.3) is
needed in order to make the algorithm robust to the choice of starting values.
# Read the grouped beetle data (dose, group size n, number of responses y) and
# expand to one 0/1 response per individual.
beetle = read.table("beetle.dat",col.names=c("dose","n","y"))
m = dim(beetle)[1]
x = NULL
y = NULL
for(j in 1:m)
{
x = c(x,rep(beetle$dose[j],beetle$n[j]))
y = c(y,rep(1,beetle$y[j]),rep(0,beetle$n[j]-beetle$y[j]))
}
beetle2 = data.frame(dose=x,resp=y)
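A usage sketch (assuming beetle.dat has been read and expanded as above); the routine in Programwindow 8.3 can also be compared with R's built-in glm() as a check.

fit = nr.logist(beetle2, beta.start = c(2, 1))
fit$beta                                                   # approximately (-60.72, 34.27)
glm(resp ~ dose, family = binomial, data = beetle2)$coef   # should give similar values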
Running the routine described in Programwindow 8.3, convergence was reached after 6
iterations to βb0 = −60.72 and βb1 = 34.27. The values at different iterations are shown
below.
Iteration s    β_0^(s)    β_1^(s)    l(β^(s))    δ^(s)
0 2.00000 1.00000 -721.3648
1 -104.29550 57.96621 -248.0056 0.25
2 -45.92656 25.95912 -191.0286 0.50
3 -57.76158 32.60580 -186.4047 1.00
4 -60.58140 34.19359 -186.2358 1.00
5 -60.71715 34.27016 -186.2354 1.00
6 -60.71745 34.27033 -186.2354 1.00
7 -60.71745 34.27033 -186.2354 1.00
Concerning the uncertainty involved in these estimates, the large sample approximation
to the covariance matrix is given directly from (8.17) with β = β̂ inserted. This gave the
estimated covariance matrix
Var[β̂] ≈ [[26.8398, −15.0822], [−15.0822, 8.4806]],
and standard errors 5.1807 and 2.9121 for β̂_0 and β̂_1, respectively.
9 Discussion
In this note we have discussed several different numerical procedures for optimization.
Although our primary concern has been on maximum likelihood problems, the procedures
could just as well have been applied to other optimization problems.
(a) Direct procedures will in many cases work, but convergence could be extremely slow.
(b) The Newton-Raphson method and the Method of Scoring are usually more compli-
cated, and there is no guarantee of monotonicity.
(d) For the Newton-Raphson method, the observed information matrix is directly given
as part of the algorithm, while for direct maximization some further calculations are
needed in order to obtain this.
(e) In general, no method is guaranteed to give the global optimum. The algorithms
should therefore be run with different starting values.
nr.logist = function(data,beta.start,eps=0.000001)
{
x=data[,1];y=data[,2];n=length(x)
diff=1;beta=beta.start
p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
l = sum(y*log(p)+(1-y)*log(1-p))
while(diff>eps)
{
beta.old = beta
l.old = l
p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
s = c(sum(y-p),sum((y-p)*x))
Jbar = matrix(c(sum(p*(1-p)),sum(p*(1-p)*x),
sum(p*(1-p)*x),sum(p*(1-p)*x*x)),ncol=2)
l=l.old-1;delta=1
while(l<l.old)
{
beta = beta.old + delta*solve(Jbar,s)
p = exp(beta[1]+beta[2]*x)/(1+exp(beta[1]+beta[2]*x))
l = sum(y*log(p)+(1-y)*log(1-p))
delta=delta/2
}
diff = sum(abs(beta-beta.old))
}
list(beta=beta)
}
Programwindow 8.3: R code for running the Newton-Raphson algorithm for logistic regression.
A R code
x = scan("ILLRAIN.DAT",na.strings="*")
x = x[!is.na(x)]
alpha = seq(0.35,0.55,0.005);lambda = seq(1,3,0.05)
loglik = matrix(nrow=length(alpha),ncol=length(lambda))
n = length(x);sumx=sum(x);sumlogx = sum(log(x));
for(i in 1:length(alpha))
for(j in 1:length(lambda))
loglik[i,j] = n*alpha[i]*log(lambda[j])+(alpha[i]-1)*sumlogx-
lambda[j]*sumx-n*log(gamma(alpha[i]))
par(mfrow=c(1,2))
#image(alpha,lambda,exp(loglik),col=gray((0:32)/32))
#image(alpha,lambda,loglik,col=gray((0:32)/32))
persp(alpha,lambda,exp(loglik),theta=330,phi=45,shade=1,zlab="lik")
persp(alpha,lambda,loglik,theta=330,phi=45,shade=1,zlab="l")
Programwindow A.1: R code for plotting likelihood function (figure 1.3) for rainfall data.
References
D. R. Cox and D. Oakes. Analysis of Survival Data. Chapman & Hall, London, 1984.
J. L. Devore and K. N. Berk. Modern Mathematical Statistics with Applications. Duxbury
Press, 2007.
K. Lange. Numerical Analysis for Statisticians. Statistics and Computing. Springer Verlag,
1999.
J. A. Rice. Mathematical statistics and data analysis. Duxbury Press, Belmont, California,
second edition, 1995.
C. F. Van Loan. Introduction to scientific computing. Prentice Hall, Upper Saddle River,
NJ 07458, second edition, 2000.
W. N. Venables and B. D. Ripley. Modern Applied Statistics with S-PLUS. Statistics and
Computing. Springer Verlag, third edition, 1999.