Tutorial on maximum likelihood estimation
In Jae Myung*
Department of Psychology, Ohio State University, 1885 Neil Avenue Mall, Columbus, OH 43210-1222, USA
Received 30 November 2001; revised 16 October 2002
Abstract
In this paper, I provide a tutorial exposition on maximum likelihood estimation (MLE). The intended audience of this tutorial is
researchers who practice mathematical modeling of cognition but are unfamiliar with the estimation method. Unlike least-squares
estimation, which is primarily a descriptive tool, MLE is a preferred method of parameter estimation in statistics and is an
indispensable tool for many statistical modeling techniques, in particular in non-linear modeling with non-normal data. The purpose
of this paper is to provide a good conceptual explanation of the method with illustrative examples so the reader can have a grasp of
some of the basic principles.
© 2003 Elsevier Science (USA). All rights reserved.
In this tutorial paper, I introduce the maximum likelihood estimation method for mathematical modeling. The paper is written for researchers who are primarily involved in empirical work and publish in experimental journals (e.g. Journal of Experimental Psychology) but do modeling. The paper is intended to serve as a stepping stone for the modeler to move beyond the current practice of using LSE to more informed modeling analyses, thereby expanding his or her repertoire of statistical instruments, especially in non-linear modeling. The purpose of the paper is to provide a good conceptual understanding of the method with concrete examples. For in-depth, technically more rigorous treatment of the topic, the reader is directed to other sources (e.g., Bickel & Doksum, 1977, Chap. 3; Casella & Berger, 2002, Chap. 7; DeGroot & Schervish, 2002, Chap. 6; Spanos, 1999, Chap. 13).

Associated with each probability distribution is a unique value of the model's parameter. As the parameter changes in value, different probability distributions are generated. Formally, a model is defined as the family of probability distributions indexed by the model's parameters.

Let f(y | w) denote the probability density function (PDF) that specifies the probability of observing the data vector y given the parameter w. Throughout this paper we will use a plain letter for a vector (e.g. y) and a letter with a subscript for a vector element (e.g. y_i). The parameter w = (w_1, ..., w_k) is a vector defined on a multi-dimensional parameter space. If individual observations, the y_i's, are statistically independent of one another, then according to the theory of probability the PDF for the data y = (y_1, ..., y_m) given the parameter vector w can be expressed as a multiplication of the PDFs for the individual observations,

f(y = (y_1, y_2, ..., y_m) | w) = f_1(y_1 | w) f_2(y_2 | w) ... f_m(y_m | w).   (1)

Fig. 1. Binomial probability distributions of sample size n = 10 and probability parameter w = 0.2 (top) and w = 0.7 (bottom).
For example, for a binomial experiment consisting of n = 10 Bernoulli trials with success probability w = 0.2, the PDF of the number of successes y is

f(y | n = 10, w = 0.2) = 10! / (y!(10 - y)!) (0.2)^y (0.8)^(10-y),   (y = 0, 1, ..., 10),   (2)

which is known as the binomial distribution with parameters n = 10, w = 0.2. Note that the number of trials (n) is considered as a parameter. The shape of this PDF is shown in the top panel of Fig. 1. If the parameter value is changed to, say, w = 0.7, a new PDF is obtained as

f(y | n = 10, w = 0.7) = 10! / (y!(10 - y)!) (0.7)^y (0.3)^(10-y),   (y = 0, 1, ..., 10),   (3)

whose shape is shown in the bottom panel of Fig. 1. The following is the general expression of the PDF of the binomial distribution for arbitrary values of w and n:

f(y | n, w) = n! / (y!(n - y)!) w^y (1 - w)^(n-y),   (0 ≤ w ≤ 1; y = 0, 1, ..., n),   (4)

which as a function of y specifies the probability of data y for a given value of n and w. The collection of all such PDFs generated by varying the parameter across its range (0-1 in this case for w; n ≥ 1) defines a model.

2.2. Likelihood function

Given a set of parameter values, the corresponding PDF will show that some data are more probable than other data. In the previous example, under the PDF with w = 0.2, y = 2 is more likely to occur than y = 5 (0.302 vs. 0.026). In reality, however, we have already observed the data. Accordingly, we are faced with an inverse problem: Given the observed data and a model of interest, find the one PDF, among all the probability densities that the model prescribes, that is most likely to have produced the data. To solve this inverse problem, we define the likelihood function by reversing the roles of the data vector y and the parameter vector w in f(y | w), i.e.

L(w | y) = f(y | w).   (5)

Thus L(w | y) represents the likelihood of the parameter w given the observed data y, and as such is a function of w. For the one-parameter binomial example in Eq. (4), the likelihood function for y = 7 and n = 10 is given by

L(w | n = 10, y = 7) = f(y = 7 | n = 10, w) = 10! / (7! 3!) w^7 (1 - w)^3,   (0 ≤ w ≤ 1).   (6)

The shape of this likelihood function is shown in Fig. 2.

Fig. 2. The likelihood function given observed data y = 7 and sample size n = 10 for the one-parameter model described in the text.

There exists an important difference between the PDF f(y | w) and the likelihood function L(w | y). As illustrated in Figs. 1 and 2, the two functions are defined on different axes, and therefore are not directly comparable to each other. Specifically, the PDF in Fig. 1 is a function of the data given a particular set of parameter values, defined on the data scale. On the other hand, the likelihood function is a function of the parameter given a particular set of observed data, defined on the parameter scale. In short, Fig. 1 tells us the probability of a particular data value for a fixed parameter, whereas Fig. 2 tells us the likelihood ("unnormalized probability") of a particular parameter value for a fixed data set. Note that the likelihood function in Fig. 2 is a curve because there is only one parameter besides n, which is assumed to be known. If the model has two parameters, the likelihood function will be a surface sitting above the parameter space. In general, for a model with k parameters, the likelihood function L(w | y) takes the shape of a k-dimensional geometrical "surface" sitting above a k-dimensional hyperplane spanned by the parameter vector w = (w_1, ..., w_k).
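The contrast between the two figures can be reproduced in a few lines of Matlab. The following sketch (an illustration, not part of the original article) evaluates Eq. (4) as a PDF over y for fixed w, and Eq. (6) as a likelihood over a grid of w values for the fixed data y = 7, n = 10.

n = 10;
y = 0:n;
binom = arrayfun(@(k) nchoosek(n, k), y);       % binomial coefficients n!/(y!(n-y)!)
pdf_w02 = binom .* 0.2.^y .* 0.8.^(n - y);      % PDF over y at w = 0.2 (Fig. 1, top)
pdf_w07 = binom .* 0.7.^y .* 0.3.^(n - y);      % PDF over y at w = 0.7 (Fig. 1, bottom)

w = 0.001:0.001:0.999;                          % grid over the parameter
L = nchoosek(10, 7) .* w.^7 .* (1 - w).^3;      % likelihood L(w | n = 10, y = 7), Eq. (6)
[Lmax, imax] = max(L);
disp([w(imax) Lmax]);                           % approximately 0.7 and 0.267 (Fig. 2)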
3. Maximum likelihood estimation

Once data have been collected and the likelihood function of a model given the data is determined, one is in a position to make statistical inferences about the population, that is, the probability distribution that underlies the data. Given that different parameter values index different probability distributions (Fig. 1), we are interested in finding the parameter value that corresponds to the desired probability distribution.

The principle of maximum likelihood estimation (MLE), originally developed by R.A. Fisher in the 1920s, states that the desired probability distribution is the one that makes the observed data "most likely," which means that one must seek the value of the parameter vector that maximizes the likelihood function L(w | y). The resulting parameter vector, which is sought by searching the multi-dimensional parameter space, is called the MLE estimate, and is denoted by w_MLE = (w_{1,MLE}, ..., w_{k,MLE}). For example, in Fig. 2, the MLE estimate is w_MLE = 0.7, for which the maximized likelihood value is L(w_MLE = 0.7 | n = 10, y = 7) = 0.267. The probability distribution corresponding to this MLE estimate is shown in the bottom panel of Fig. 1. According to the MLE principle, this is the population that is most likely to have generated the observed data of y = 7. To summarize, maximum likelihood estimation is a method to seek the probability distribution that makes the observed data most likely.

3.1. Likelihood equation

MLE estimates need not exist, nor need they be unique. In this section, we show how to compute MLE estimates when they exist and are unique. For computational convenience, the MLE estimate is obtained by maximizing the log-likelihood function, ln L(w | y). This is because the two functions, ln L(w | y) and L(w | y), are monotonically related to each other, so the same MLE estimate is obtained by maximizing either one. Assuming that the log-likelihood function, ln L(w | y), is differentiable, if w_MLE exists, it must satisfy the following equation, known as the likelihood equation:

∂ ln L(w | y) / ∂w_i = 0   (7)

at w_i = w_{i,MLE} for all i = 1, ..., k. This is because the definition of a maximum or minimum of a continuous differentiable function implies that its first derivatives vanish at such points.

The likelihood equation represents a necessary condition for the existence of an MLE estimate. An additional condition must also be satisfied to ensure that ln L(w | y) is a maximum and not a minimum, since the first derivative cannot reveal this. To be a maximum, the shape of the log-likelihood function should be concave (it must represent a peak, not a valley) in the neighborhood of w_MLE. This can be checked by calculating the second derivatives of the log-likelihood and showing that they are all negative at w_i = w_{i,MLE} for i = 1, ..., k (see Footnote 1):

∂² ln L(w | y) / ∂w_i² < 0.   (8)

Footnote 1: Consider the Hessian matrix H(w), defined as H_ij(w) = ∂² ln L(w) / (∂w_i ∂w_j), (i, j = 1, ..., k). A more accurate test of the condition requires that H(w) be negative definite, that is, z' H(w = w_MLE) z < 0 for any non-zero k × 1 real-valued vector z, where z' denotes the transpose of z.

To illustrate the MLE procedure, let us again consider the previous one-parameter binomial example given a fixed value of n. First, by taking the logarithm of the likelihood function L(w | n = 10, y = 7) in Eq. (6), we obtain the log-likelihood as

ln L(w | n = 10, y = 7) = ln(10! / (7! 3!)) + 7 ln w + 3 ln(1 - w).   (9)

Next, the first derivative of the log-likelihood is calculated as

d ln L(w | n = 10, y = 7) / dw = 7/w - 3/(1 - w) = (7 - 10w) / (w(1 - w)).   (10)

By requiring this expression to be zero, the desired MLE estimate is obtained as w_MLE = 0.7. To make sure that the solution represents a maximum, not a minimum, the second derivative of the log-likelihood is calculated and evaluated at w = w_MLE,

d² ln L(w | n = 10, y = 7) / dw² = -7/w² - 3/(1 - w)² = -47.62 < 0,   (11)

which is negative, as desired.

In practice, however, it is usually not possible to obtain an analytic form solution for the MLE estimate, especially when the model involves many parameters and its PDF is highly non-linear. In such situations, the MLE estimate must be sought numerically using non-linear optimization algorithms. The basic idea of non-linear optimization is to quickly find optimal parameters that maximize the log-likelihood. This is done by searching much smaller sub-sets of the multi-dimensional parameter space rather than exhaustively searching the whole parameter space, which becomes intractable as the number of parameters increases. The "intelligent" search proceeds by trial and error over the course of a series of iterative steps. Specifically, on each iteration, by taking into account the results from the previous iteration, a new set of parameter values is obtained by adding small changes to the previous parameters in such a way that the new parameters are likely to lead to improved performance. Different optimization algorithms differ in how this updating routine is conducted. The iterative process, as shown by a series of arrows in Fig. 3, continues until the parameters are judged to have converged (i.e., point B in Fig. 3) on the optimal set of parameters under an appropriately predefined stopping criterion. Examples of the stopping criterion include the maximum number of iterations allowed or the minimum amount of change in parameter values between two successive iterations.
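For the one-parameter binomial example, the analytic solution above can also be recovered numerically. The following Matlab sketch (an illustration, not part of the original appendix) minimizes minus the log-likelihood of Eq. (9) with the built-in one-dimensional optimizer fminbnd; the constant term of Eq. (9) is dropped because it does not depend on w.

negloglik = @(w) -(7*log(w) + 3*log(1 - w));   % minus log-likelihood, constant term omitted
w_mle = fminbnd(negloglik, 0.001, 0.999);      % returns approximately 0.7
d2 = -7/w_mle^2 - 3/(1 - w_mle)^2;             % second derivative, Eq. (11): about -47.6 < 0
disp([w_mle d2]);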
Fig. 3. A schematic plot of the log-likelihood function for a fictitious one-parameter model. Point B is the global maximum whereas points A and C are two local maxima. The series of arrows depicts an iterative optimization process.

3.2. Local maxima

The optimization algorithm tries to improve upon an initial set of parameters that is supplied by the user. Initial parameter values are chosen either at random or by guessing. Depending upon the choice of the initial parameter values, the algorithm could prematurely stop and return a sub-optimal set of parameter values. This is called the local maxima problem. As an example, note in Fig. 3 that although the starting parameter value at point a2 will lead to the optimal point B, called the global maximum, the starting parameter value at point a1 will lead to point A, which is a sub-optimal solution. Similarly, the starting parameter value at a3 will lead to another sub-optimal solution at point C.

Unfortunately, there exists no general solution to the local maxima problem. Instead, a variety of techniques have been developed in an attempt to avoid the problem, though there is no guarantee of their effectiveness. For example, one may choose different starting values over multiple runs of the iteration procedure and then examine the results to see whether the same solution is obtained repeatedly. When that happens, one can conclude with some confidence that a global maximum has been found.
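The multiple-starting-value strategy just described is easy to express in Matlab. In the following sketch (an illustration, not part of the original article), negloglik is assumed to be a function handle returning minus the log-likelihood of some two-parameter model; the run with the smallest objective value is kept.

best_fval = Inf;
for run = 1:20
    init_w = rand(2, 1);                            % random starting point
    [w_hat, fval] = fminsearch(negloglik, init_w);  % local optimization from this start
    if fval < best_fval                             % keep the best solution found so far
        best_fval = fval;
        best_w = w_hat;
    end
end
disp(best_w');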
3.3. Relation to least-squares estimation

Recall that in MLE we seek the parameter values that are most likely to have produced the data. In LSE, on the other hand, we seek the parameter values that provide the most accurate description of the data, measured in terms of how closely the model fits the data under the square-loss function. Formally, in LSE, the sum of squares error (SSE) between observations and predictions is minimized:

SSE(w) = Σ_{i=1}^{m} (y_i - prd_i(w))²,   (12)

where prd_i(w) denotes the model's prediction for the ith observation. Note that SSE(w) is a function of the parameter vector w = (w_1, ..., w_k).

As in MLE, finding the parameter values that minimize SSE generally requires use of a non-linear optimization algorithm. Minimization of SSE is also subject to the local minima problem, especially when the model is non-linear with respect to its parameters. The choice between the two methods of estimation can have non-trivial consequences. In general, LSE estimates tend to differ from MLE estimates, especially for data that are not normally distributed, such as proportion correct and response time. An implication is that one might possibly arrive at different conclusions about the same data set depending upon which method of estimation is employed in analyzing the data. When this occurs, MLE should be preferred to LSE, unless the probability density function is unknown or difficult to obtain in an easily computable form, for instance, for the diffusion model of recognition memory (Ratcliff, 1978). There is a situation, however, in which the two methods intersect. This is when observations are independent of one another and are normally distributed with a constant variance. In this case, maximization of the log-likelihood is equivalent to minimization of SSE, and therefore the same parameter values are obtained under either MLE or LSE.

4. Illustrative example

In this section, I present an application example of maximum likelihood estimation. To illustrate the method, I chose forgetting data given the recent surge of interest in this topic (e.g. Rubin & Wenzel, 1996; Wickens, 1998; Wixted & Ebbesen, 1991). Among a half-dozen retention functions that have been proposed, two are considered here: the power model and the exponential model. Let w = (w_1, w_2) denote the parameter vector, t time, and p(w, t) the model's prediction of the probability of correct recall at time t. The two models are defined as

power model:        p(w, t) = w_1 t^(-w_2)        (w_1, w_2 > 0);
exponential model:  p(w, t) = w_1 exp(-w_2 t)     (w_1, w_2 > 0).   (13)

Suppose that the data y = (y_1, ..., y_m) consist of m observations in which y_i (0 ≤ y_i ≤ 1) represents an observed proportion of correct recall at time t_i (i = 1, ..., m). We are interested in testing the viability of these models. We do this by fitting each to the observed data and examining its goodness of fit.

Application of MLE requires specification of the PDF f(y | w) of the data under each model. To do this, first we note that each observed proportion y_i is obtained by dividing the number of correct responses (x_i) by the total number of independent trials (n), y_i = x_i / n (0 ≤ y_i ≤ 1). We then note that each x_i is binomially distributed with probability p(w, t_i), so that the PDFs for the power model and the exponential model are obtained as

power:        f(x_i | n, w) = n! / ((n - x_i)! x_i!) (w_1 t_i^(-w_2))^(x_i) (1 - w_1 t_i^(-w_2))^(n - x_i);
exponential:  f(x_i | n, w) = n! / ((n - x_i)! x_i!) (w_1 exp(-w_2 t_i))^(x_i) (1 - w_1 exp(-w_2 t_i))^(n - x_i),   (14)

where x_i = 0, 1, ..., n and i = 1, ..., m.

There are two points to be made regarding the PDFs in the above equation. First, it is the probability parameter of a binomial probability distribution (i.e. w in Eq. (4)) that is being modeled. Therefore, the PDF for each model in Eq. (14) is obtained by simply replacing the probability parameter w in Eq. (4) with the model equation, p(w, t), in Eq. (13). Second, note that y_i is related to x_i by a fixed scaling constant, 1/n. As such, any statistical conclusion regarding x_i is applicable directly to y_i, except for the scale transformation. In particular, the PDF for y_i, f(y_i | n, w), is obtained by simply replacing x_i in f(x_i | n, w) with n·y_i.

Now, assuming that the x_i's are statistically independent of one another, the desired log-likelihood function for the power model is given by

ln L(w = (w_1, w_2) | n, x) = ln( f(x_1 | n, w) f(x_2 | n, w) ... f(x_m | n, w) )
                            = Σ_{i=1}^{m} [ x_i ln(w_1 t_i^(-w_2)) + (n - x_i) ln(1 - w_1 t_i^(-w_2)) + ln n! - ln(n - x_i)! - ln x_i! ].   (15)
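A minus-log-likelihood routine following Eq. (15) might look like the following Matlab sketch (an illustration, not the original code; the appendix's power_mle routine presumably computes a quantity of this form, but passes the data differently). Here gammaln(k+1) is used for ln k! so that large factorials do not overflow, and predictions are clamped to the open interval (0, 1) to keep the logarithms finite.

function f = power_negloglik(w, t, x, n)
p = w(1)*t.^(-w(2));                              % model prediction p(w,t), Eq. (13)
p = min(max(p, 1e-10), 1 - 1e-10);                % keep predictions strictly inside (0,1)
loglik = sum(x.*log(p) + (n - x).*log(1 - p) ...
             + gammaln(n + 1) - gammaln(n - x + 1) - gammaln(x + 1));
f = -loglik;                                      % minimized by fmincon or fminsearch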
Fig. 4. Modeling forgetting data. Squares represent the data in Murdock (1961). The thick (respectively, thin) curves are best fits by the power
(respectively, exponential) models.
Table 1
Summary fits of Murdock (1961) data for the power and exponential models under the maximum likelihood estimation (MLE) method and the least-squares estimation (LSE) method

                        MLE                                    LSE
                        Power             Exponential          Power            Exponential
Loglik/SSE (r2)         -313.37 (0.886)   -305.31 (0.963)      0.0540 (0.894)   0.0169 (0.967)
Parameter w1            0.953             1.070                1.003            1.092
Parameter w2            0.498             0.131                0.511            0.141

Note: For each model fitted, the first row shows the maximized log-likelihood value for MLE and the minimized sum of squares error value for LSE. Each number in parentheses is the proportion of variance accounted for (i.e. r2) in that case. The second and third rows show MLE and LSE parameter estimates for each of w1 and w2. The above results were obtained using the Matlab code described in the appendix.
This quantity is to be maximized with respect to the two parameters, w_1 and w_2. It is worth noting that the last three terms of the final expression in the above equation (i.e., ln n! - ln(n - x_i)! - ln x_i!) do not depend upon the parameter vector and therefore do not affect the MLE results. Accordingly, these terms can be ignored, and their values are often omitted in the calculation of the log-likelihood. Similarly, for the exponential model, its log-likelihood function can be obtained from Eq. (15) by substituting w_1 exp(-w_2 t_i) for w_1 t_i^(-w_2).

In illustrating MLE, I used a data set from Murdock (1961). In this experiment subjects were presented with a set of words or letters and were asked to recall the items after six different retention intervals, (t_1, ..., t_6) = (1, 3, 6, 9, 12, 18) in seconds, and thus m = 6. The proportion recall at each retention interval was calculated based on 100 independent trials (i.e. n = 100) to yield the observed data (y_1, ..., y_6) = (0.94, 0.77, 0.40, 0.26, 0.24, 0.16), from which the number of correct responses, x_i, is obtained as 100·y_i, i = 1, ..., 6. In Fig. 4, the proportion recall data are shown as squares. The curves in Fig. 4 are the best fits obtained under MLE. Table 1 summarizes the MLE results, including fit measures and parameter estimates, and also includes the LSE results for comparison. The Matlab code used for the calculations is included in the appendix.

The results in Table 1 indicate that under either method of estimation, the exponential model fit better than the power model. That is, for the former, the log-likelihood was larger and the SSE smaller than for the latter. The same conclusion can be drawn even in terms of r². Also note the appreciable discrepancies in parameter estimates between MLE and LSE. These differences are not unexpected and are due to the fact that the proportion data are binomially distributed, not normally distributed. Further, the constant variance assumption required for the equivalence between MLE and LSE does not hold for binomial data, for which the variance, σ² = np(1 - p), depends upon the proportion correct p. For a fuller treatment of these and related issues in model selection, the reader is referred elsewhere (e.g. Linhart & Zucchini, 1986; Myung, Forster, & Browne, 2000; Pitt, Myung, & Zhang, 2002).
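The non-constant variance point is easy to check numerically. The following Matlab lines (an illustration, not part of the original article) compute the binomial variance np(1 - p) of the correct-response counts at the observed proportions, showing that it varies substantially across retention intervals.

p = [0.94 0.77 0.40 0.26 0.24 0.16];   % observed proportions (Murdock, 1961)
n = 100;
var_counts = n * p .* (1 - p);         % binomial variance of the correct-response counts
disp(var_counts);                      % ranges from about 5.6 to 24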
5. Concluding remarks
Appendix
This appendix presents Matlab code that performs MLE and LSE analyses for the example described in the text. Only an excerpt of the main program is reproduced here; it assumes that the data (t, x, y, n), the starting values (init_w), the parameter bounds (low_w, up_w), the optimization options (opts), and the objective functions power_mle and expo_mle are defined elsewhere in the program.
while 1,
    [w1, lik1, exit1] = fmincon('power_mle', init_w, [], [], [], [], low_w, up_w, [], opts);
        % optimization for the power model that minimizes minus the log-likelihood
        % (note that minimization of minus the log-likelihood is equivalent to
        % maximization of the log-likelihood)
        % w1:    MLE parameter estimates
        % lik1:  minimized value of minus the log-likelihood (its negative is the
        %        maximized log-likelihood)
        % exit1: optimization has converged if exit1 > 0, or not otherwise
    [w2, lik2, exit2] = fmincon('expo_mle', init_w, [], [], [], [], low_w, up_w, [], opts);
        % optimization for the exponential model that minimizes minus the log-likelihood
    prd1 = w1(1,1)*t.^(-w1(2,1));                          % best-fit prediction by the power model
    r2(1,1) = 1 - sum((prd1-y).^2)/sum((y-mean(y)).^2);    % r^2 for the power model
    prd2 = w2(1,1)*exp(-w2(2,1)*t);                        % best-fit prediction by the exponential model
    r2(2,1) = 1 - sum((prd2-y).^2)/sum((y-mean(y)).^2);    % r^2 for the exponential model
    if sum(r2 > 0) == 2
        break;                                             % accept the solutions
    else
        init_w = rand(2,1);                                % otherwise retry from new random starting values
    end;
end;
format long;
disp(num2str([w1 w2 r2], 5));                              % display parameter estimates and r^2 values
disp(num2str([lik1 lik2 exit1 exit2], 5));                 % display fit values and exit flags
% (The LSE version of the program ends in the same way, except that it displays the
%  minimized sum of squares error values sse1 and sse2 in place of lik1 and lik2.)
% end of the main program
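The set-up that the excerpt above assumes is not reproduced in this fragment. A minimal sketch of such a set-up, using the Murdock (1961) data from the text, might look as follows; the specific bounds and options are assumptions for illustration, not the original code.

t = [1 3 6 9 12 18]';                 % retention intervals in seconds
y = [0.94 0.77 0.40 0.26 0.24 0.16]'; % observed proportions of correct recall
x = 100*y;                            % numbers of correct responses (n = 100 trials)
n = 100;
init_w = rand(2, 1);                  % random starting values for (w1, w2)
low_w  = zeros(2, 1);                 % lower bounds on the parameters
up_w   = [2; 2];                      % generous upper bounds (an arbitrary choice; in
                                      % practice they should keep p(w,t) inside [0,1])
opts   = optimset('Display', 'off');  % suppress optimizer output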
References

Rubin, D. C., & Wenzel, A. E. (1996). One hundred years of forgetting: A quantitative description of retention. Psychological Review, 103, 734-760.
Schwarz, G. (1978). Estimating the dimension of a model. Annals of Statistics, 6, 461-464.
Spanos, A. (1999). Probability theory and statistical inference. Cambridge, UK: Cambridge University Press.
Usher, M., & McClelland, J. L. (2001). The time course of perceptual choice: The leaky, competing accumulator model. Psychological Review, 108(3), 550-592.
Van Zandt, T. (2000). How to fit a response time distribution. Psychonomic Bulletin & Review, 7(3), 424-465.
Wickens, T. D. (1998). On the form of the retention function: Comment on Rubin and Wenzel (1996): A quantitative description of retention. Psychological Review, 105, 379-386.
Wixted, J. T., & Ebbesen, E. B. (1991). On the form of forgetting. Psychological Science, 2, 409-415.