0% found this document useful (0 votes)

59 views10 pages

Probability Concepts Explained

This document explains maximum likelihood estimation, which is a method for estimating the parameters of a statistical model given data. It involves choosing parameter values that maximize the likelihood of the observed data given the model. The example demonstrates calculating the maximum likelihood estimates for the mean and standard deviation parameters of a Gaussian distribution using three data points. It takes the natural logarithm of the likelihood function to simplify differentiation, finds where the derivative is zero, and rearranges to solve for the parameter values. These maximum likelihood estimates provide the parameter values that best fit the data according to the assumed statistical model.

Uploaded by

Petros Piano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

59 views10 pages

Probability Concepts Explained

Uploaded by

Petros Piano

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

6/19/2020 Probability concepts explained: Maximum likelihood estimation

Probability concepts explained: Maximum

likelihood estimation
Jonny Brooks-Bartlett
Jan 4, 2018 · 8 min read

Introduction
In this post I’ll explain what the maximum likelihood method for parameter
estimation is and go through a simple example to demonstrate the method. Some
of the content requires knowledge of fundamental probability concepts such as the
definition of joint probability and independence of events. I’ve written a blog post
with these prerequisites so feel free to read this if you think you need a refresher.

What are parameters?

Often in machine learning we use a model to describe the process that results in
the data that are observed. For example, we may use a random forest model to
classify whether customers may cancel a subscription from a service (known

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 1/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

as churn modelling) or we may use a linear model to predict the revenue that will
be generated for a company depending on how much they may spend on
advertising (this would be an example of linear regression). Each model contains
its own set of parameters that ultimately defines what the model looks like.

For a linear model we can write this as y = mx + c. In this example x could

represent the advertising spend and y might be the revenue generated. m and c are
parameters for this model. Different values for these parameters will give different
lines (see figure below).

Three linear models with different parameter values.

So parameters define a blueprint for the model. It is only when specific values are
chosen for the parameters that we get an instantiation for the model that describes
a given phenomenon.

Intuitive explanation of maximum likelihood

estimation
Maximum likelihood estimation is a method that determines values for the
parameters of a model. The parameter values are found such that they maximise

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 2/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

the likelihood that the process described by the model produced the data that were
actually observed.

The above definition may still sound a little cryptic so let’s go through an example
to help understand this.

Let’s suppose we have observed 10 data points from some process. For example,
each data point could represent the length of time in seconds that it takes a student
to answer a specific exam question. These 10 data points are shown in the figure
below

The 10 (hypothetical) data points that we have observed

We first have to decide which model we think best describes the process of
generating the data. This part is very important. At the very least, we should have a
good idea about which model to use. This usually comes from having some domain
expertise but we wont discuss this here.

For these data we’ll assume that the data generation process can be adequately
described by a Gaussian (normal) distribution. Visual inspection of the figure
above suggests that a Gaussian distribution is plausible because most of the 10
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 3/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

points are clustered in the middle with few points scattered to the left and the
right. (Making this sort of decision on the fly with only 10 data points is ill-advised
but given that I generated these data points we’ll go with it).

Recall that the Gaussian distribution has 2 parameters. The mean, μ, and the
standard deviation, σ. Different values of these parameters result in different
curves (just like with the straight lines above). We want to know which curve was
most likely responsible for creating the data points that we observed? (See figure
below). Maximum likelihood estimation is a method that will find the values of μ
and σ that result in the curve that best fits the data.

The 10 data points and possible Gaussian distributions from which the data were
drawn. f1 is normally distributed with mean 10 and variance 2.25 (variance is equal
to the square of the standard deviation), this is also denoted f1 ∼ N (10, 2.25). f2 ∼
N (10, 9), f3 ∼ N (10, 0.25) and f4 ∼ N (8, 2.25). The goal of maximum likelihood is
to find the parameter values that give the distribution that maximise the probability
of observing the data.

The true distribution from which the data were generated was f1 ~ N(10, 2.25),
which is the blue curve in the figure above.

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 4/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

Calculating the Maximum Likelihood

Estimates
Now that we have an intuitive understanding of what maximum likelihood
estimation is we can move on to learning how to calculate the parameter values.
The values that we find are called the maximum likelihood estimates (MLE).

Again we’ll demonstrate this with an example. Suppose we have three data points
this time and we assume that they have been generated from a process that is
adequately described by a Gaussian distribution. These points are 9, 9.5 and
11. How do we calculate the maximum likelihood estimates of the parameter
values of the Gaussian distribution μ and σ?

What we want to calculate is the total probability of observing all of the data, i.e.
the joint probability distribution of all observed data points. To do this we would
need to calculate some conditional probabilities, which can get very difficult. So it
is here that we’ll make our first assumption. The assumption is that each data
point is generated independently of the others. This assumption makes the maths
much easier. If the events (i.e. the process that generates the data) are
independent, then the total probability of observing all of data is the product of
observing each data point individually (i.e. the product of the marginal
probabilities).

The probability density of observing a single data point x, that is generated from a
Gaussian distribution is given by:

The semi colon used in the notation P(x; μ, σ) is there to emphasise that the
symbols that appear after it are parameters of the probability distribution. So it
shouldn’t be confused with a conditional probability (which is typically
represented with a vertical line e.g. P(A| B)).

In our example the total (joint) probability density of observing the three data
points is given by:
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 5/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

We just have to figure out the values of μ and σ that results in giving the maximum
value of the above expression.

If you’ve covered calculus in your maths classes then you’ll probably be aware that
there is a technique that can help us find maxima (and minima) of functions. It’s
called differentiation. All we have to do is find the derivative of the function, set
the derivative function to zero and then rearrange the equation to make the
parameter of interest the subject of the equation. And voilà, we’ll have our MLE
values for our parameters. I’ll go through these steps now but I’ll assume that the
reader knows how to perform differentiation on common functions. If you would
like a more detailed explanation then just let me know in the comments.

The log likelihood

The above expression for the total probability is actually quite a pain to
differentiate, so it is almost always simplified by taking the natural logarithm of
the expression. This is absolutely fine because the natural logarithm is
a monotonically increasing function. This means that if the value on the x-
axis increases, the value on the y-axis also increases (see figure below). This is
important because it ensures that the maximum value of the log of the probability
occurs at the same point as the original probability function. Therefore we can
work with the simpler log-likelihood instead of the original likelihood.

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 6/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

Monotonic behaviour of the original function, y = x on the left and the (natural)
logarithm function y = ln(x). These functions are both monotonic because as you
go from left to right on the x-axis the y value always increases.

Example of a non-monotonic function because as you go from left to right on the

graph the value of f(x) goes up, then goes down and then goes back up again.

Taking logs of the original expression gives us:

This expression can be simplified again using the laws of logarithms to obtain:

This expression can be differentiated to find the maximum. In this example we’ll
find the MLE of the mean, μ. To do this we take the partial derivative of the
function with respect to μ, giving
https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 7/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

Finally, setting the left hand side of the equation to zero and then rearranging for μ
gives:

And there we have our maximum likelihood estimate for μ. We can do the same
thing with σ too but I’ll leave that as an exercise for the keen reader.

Concluding remarks
Can maximum likelihood estimation always be solved in
an exact manner?
No is the short answer. It’s more likely that in a real world scenario the derivative
of the log-likelihood function is still analytically intractable (i.e. it’s way too
hard/impossible to differentiate the function by hand). Therefore, iterative
methods like Expectation-Maximization algorithms are used to find numerical
solutions for the parameter estimates. The overall idea is still the same though.

So why maximum likelihood and not maximum

probability?
Well this is just statisticians being pedantic (but for good reason). Most people
tend to use probability and likelihood interchangeably but statisticians and
probability theorists distinguish between the two. The reason for the confusion is
best highlighted by looking at the equation.

These expressions are equal! So what does this mean? Let’s first define P(data; μ,
σ)? It means “the probability density of observing the data with model

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 8/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

parameters μ and σ”. It’s worth noting that we can generalise this to any number
of parameters and any distribution.

On the other hand L(μ, σ; data) means “the likelihood of the parameters μ and σ
taking certain values given that we’ve observed a bunch of data.”

The equation above says that the probability density of the data given the
parameters is equal to the likelihood of the parameters given the data. But despite
these two things being equal, the likelihood and the probability density are
fundamentally asking different questions — one is asking about the data and the
other is asking about the parameter values. This is why the method is called
maximum likelihood and not maximum probability.

When is least squares minimisation the same as

maximum likelihood estimation?
Least squares minimisation is another common method for estimating parameter
values for a model in machine learning. It turns out that when the model is
assumed to be Gaussian as in the examples above, the MLE estimates are
equivalent to the least squares method. For a more in-depth mathematical
derivation check out these slides.

Intuitively we can interpret the connection between the two methods by

understanding their objectives. For least squares parameter estimation we want to
find the line that minimises the total squared distance between the data points and
the regression line (see the figure below). In maximum likelihood estimation we
want to maximise the total probability of the data. When a Gaussian distribution is
assumed, the maximum probability is found when the data points get closer to the
mean value. Since the Gaussian distribution is symmetric, this is equivalent to
minimising the distance between the data points and the mean value.

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 9/10
6/19/2020 Probability concepts explained: Maximum likelihood estimation

Regression line showing data points with random Gaussian noise

If there is anything that is unclear or I’ve made some mistakes in the above feel
free to leave a comment. In the next post I plan to cover Bayesian inference and
how it can be used for parameter estimation.

Thank you for reading.

Thanks to Ludovic Benistant.

https://towardsdatascience.com/probability-concepts-explained-maximum-likelihood-estimation-c7b4342fdbb1 10/10

Markov Models Supervised and Unsupervised Machine Learning: Mastering Data Science And Python
From Everand
Markov Models Supervised and Unsupervised Machine Learning: Mastering Data Science And Python
William Sullivan
2/5 (1)
Maximum Likelihood Estimation
No ratings yet
Maximum Likelihood Estimation
22 pages
Maximum
No ratings yet
Maximum
3 pages
Learning Models From Data: 1 Parametric Estimation
No ratings yet
Learning Models From Data: 1 Parametric Estimation
14 pages
11 Parameter Estimation
No ratings yet
11 Parameter Estimation
6 pages
Lecture 03 Maximum Likelihood Estimation
No ratings yet
Lecture 03 Maximum Likelihood Estimation
22 pages
ML Notes
No ratings yet
ML Notes
4 pages
MIT18 05S14 Reading10b PDF
No ratings yet
MIT18 05S14 Reading10b PDF
9 pages
Sta255 Week 11-2 Pre
No ratings yet
Sta255 Week 11-2 Pre
21 pages
Session 32 - Point Estimate
No ratings yet
Session 32 - Point Estimate
53 pages
7.estimation Clustering
No ratings yet
7.estimation Clustering
56 pages
Maximum Likelihood
No ratings yet
Maximum Likelihood
10 pages
2 Mle
No ratings yet
2 Mle
28 pages
Maximum Likelihood Estimation
No ratings yet
Maximum Likelihood Estimation
6 pages
Statistics 512 Notes 12: Maximum Likelihood Estimation: X X PX X
No ratings yet
Statistics 512 Notes 12: Maximum Likelihood Estimation: X X PX X
5 pages
Etf3600 Lecture3 Mle LPM 2013
No ratings yet
Etf3600 Lecture3 Mle LPM 2013
36 pages
Maximum Likelihood and Bayesian Parameter Estimation: Chapter 3, DHS
No ratings yet
Maximum Likelihood and Bayesian Parameter Estimation: Chapter 3, DHS
35 pages
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
No ratings yet
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
15 pages
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
No ratings yet
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
15 pages
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
No ratings yet
10.0 Lesson Plan: Answer Questions Robust Estimators Maximum Likelihood Estimators
15 pages
ML and MAP - HTML
No ratings yet
ML and MAP - HTML
9 pages
Experiment 1
No ratings yet
Experiment 1
5 pages
Wk04 Machine Learning
No ratings yet
Wk04 Machine Learning
6 pages
Chapte 2 - Maximum Likelihood - HEC - Lausanne
No ratings yet
Chapte 2 - Maximum Likelihood - HEC - Lausanne
276 pages
NOTES
No ratings yet
NOTES
14 pages
Lectura 1 Point Estimation
No ratings yet
Lectura 1 Point Estimation
47 pages
Maximum Likelihood Estimation
No ratings yet
Maximum Likelihood Estimation
8 pages
MIT14 30s09 Lec19
No ratings yet
MIT14 30s09 Lec19
7 pages
Chapter 2 - Maximum Likelihood - HEC - Lausanne
No ratings yet
Chapter 2 - Maximum Likelihood - HEC - Lausanne
277 pages
2 Probability
No ratings yet
2 Probability
30 pages
Inf 2
No ratings yet
Inf 2
37 pages
MLE Dan Bayesian Estimation From Walpole Book
No ratings yet
MLE Dan Bayesian Estimation From Walpole Book
13 pages
Chapter 2: Maximum Likelihood Estimation: Advanced Econometrics - HEC Lausanne
No ratings yet
Chapter 2: Maximum Likelihood Estimation: Advanced Econometrics - HEC Lausanne
207 pages
Maximum Likelihood
No ratings yet
Maximum Likelihood
11 pages
Maximum Likelihood
No ratings yet
Maximum Likelihood
16 pages
Model Inference and Averaging: Dept. Computer Science & Engineering, Shanghai Jiao Tong University
No ratings yet
Model Inference and Averaging: Dept. Computer Science & Engineering, Shanghai Jiao Tong University
51 pages
Statistical Inference: Classical and Bayesian Methods
No ratings yet
Statistical Inference: Classical and Bayesian Methods
22 pages
L08 MaximumLikelihoodEstimation
No ratings yet
L08 MaximumLikelihoodEstimation
5 pages
Statistical Machine Learning W4400 Lecture Slides PDF
No ratings yet
Statistical Machine Learning W4400 Lecture Slides PDF
520 pages
Probability Theory For Machine Learning: Chris Cremer September 2015
No ratings yet
Probability Theory For Machine Learning: Chris Cremer September 2015
40 pages
21 Mle
No ratings yet
21 Mle
24 pages
Understanding Maximum Likelihood
No ratings yet
Understanding Maximum Likelihood
5 pages
21 Mle
No ratings yet
21 Mle
24 pages
Maximum Likelihood Estimation by K.Kashin
No ratings yet
Maximum Likelihood Estimation by K.Kashin
34 pages
Sta255 Week 11-1 Pre
No ratings yet
Sta255 Week 11-1 Pre
37 pages
Maximum Likelihood Estimators and Least Squares
No ratings yet
Maximum Likelihood Estimators and Least Squares
5 pages
A Guide To Modern Econometrics by Verbeek 181 190
No ratings yet
A Guide To Modern Econometrics by Verbeek 181 190
10 pages
Maximum Likelihood Estimation
No ratings yet
Maximum Likelihood Estimation
7 pages
Examples of Maximum Likelihood Estimation and Optimization in R
No ratings yet
Examples of Maximum Likelihood Estimation and Optimization in R
15 pages
12 MLEFilled
No ratings yet
12 MLEFilled
8 pages
PBM Notes
No ratings yet
PBM Notes
130 pages
Likelihood, Bayesian, and Decision Theory
No ratings yet
Likelihood, Bayesian, and Decision Theory
50 pages
Chapter 1.1 Mle
No ratings yet
Chapter 1.1 Mle
8 pages
CSKHKPV Palampurujhjhj
No ratings yet
CSKHKPV Palampurujhjhj
18 pages
5
No ratings yet
5
29 pages
Chap - 2point - Estimation
No ratings yet
Chap - 2point - Estimation
11 pages
Maximum Likelihood Notes1
No ratings yet
Maximum Likelihood Notes1
10 pages
Process Performance Models: Statistical, Probabilistic & Simulation
From Everand
Process Performance Models: Statistical, Probabilistic & Simulation
Vishnuvarthanan Moorthy
No ratings yet
Top Numerical Methods With Matlab For Beginners!
From Everand
Top Numerical Methods With Matlab For Beginners!
Andrei Besedin
No ratings yet
Random Sample Consensus: Robust Estimation in Computer Vision
From Everand
Random Sample Consensus: Robust Estimation in Computer Vision
Fouad Sabry
No ratings yet
Privacy Chapter
No ratings yet
Privacy Chapter
6 pages
Differential Privacy: On The Trade-Off Between Utility and Information Leakage
No ratings yet
Differential Privacy: On The Trade-Off Between Utility and Information Leakage
26 pages
Statlect: Log-Likelihood
No ratings yet
Statlect: Log-Likelihood
6 pages
Form 4 in Term Exam
No ratings yet
Form 4 in Term Exam
2 pages
CS 229 Spring 2016 Problem Set #3: Theory & Unsupervised Learning
No ratings yet
CS 229 Spring 2016 Problem Set #3: Theory & Unsupervised Learning
5 pages
Generalized Linear Model Multivariate Poisson With Artificial Marginal (GLM-MPAM) : Application of Vehicle Insurance
No ratings yet
Generalized Linear Model Multivariate Poisson With Artificial Marginal (GLM-MPAM) : Application of Vehicle Insurance
9 pages
f11 Examtopics
No ratings yet
f11 Examtopics
2 pages
Jeff 18
No ratings yet
Jeff 18
126 pages
Thesis Prospectus Elan Markov
No ratings yet
Thesis Prospectus Elan Markov
3 pages
Linear Regression Analysis 6th Edition Montgomery, Peck & Vining 1
No ratings yet
Linear Regression Analysis 6th Edition Montgomery, Peck & Vining 1
86 pages
The Current State of Earthquake Potential On Java Island, Indonesia
No ratings yet
The Current State of Earthquake Potential On Java Island, Indonesia
18 pages
The Kullback-Liebler Distance and Entropy
No ratings yet
The Kullback-Liebler Distance and Entropy
5 pages
Gan Tutorial
No ratings yet
Gan Tutorial
57 pages
10/36-702 Statistical Machine Learning Homework #2 Solutions
No ratings yet
10/36-702 Statistical Machine Learning Homework #2 Solutions
11 pages
Solution
No ratings yet
Solution
3 pages
B. J. Winer-Statistical Principles in Experimental Design-NY C. - McGraw-Hill Book Company (1962)
No ratings yet
B. J. Winer-Statistical Principles in Experimental Design-NY C. - McGraw-Hill Book Company (1962)
687 pages
Baltagi
No ratings yet
Baltagi
23 pages
Mathematical Statistics 16th Edition Keith Knight Instant Download
100% (1)
Mathematical Statistics 16th Edition Keith Knight Instant Download
38 pages
A Final Report
No ratings yet
A Final Report
39 pages
Generalized Inverse Lindley Power Series Distributions Modeling and Simulation-1563863561
No ratings yet
Generalized Inverse Lindley Power Series Distributions Modeling and Simulation-1563863561
17 pages
Estimation PDF
No ratings yet
Estimation PDF
348 pages
Journal of Petroleum Science and Engineering: Hong Tang, Christopher D. White
No ratings yet
Journal of Petroleum Science and Engineering: Hong Tang, Christopher D. White
6 pages
Comparing Scale Parameters in Several Gamma Distributions With Known Shapes
No ratings yet
Comparing Scale Parameters in Several Gamma Distributions With Known Shapes
24 pages
03 Logistic Regression
No ratings yet
03 Logistic Regression
23 pages
14JST 4999 2024
No ratings yet
14JST 4999 2024
29 pages
Quality of Analytical Measurements: Univariate Regression: 2009 Elsevier B.V. All Rights Reserved
No ratings yet
Quality of Analytical Measurements: Univariate Regression: 2009 Elsevier B.V. All Rights Reserved
43 pages
M.SC Statistics101011 PDF
No ratings yet
M.SC Statistics101011 PDF
35 pages
Ce 463
No ratings yet
Ce 463
139 pages
Maximum Likelihood Estimation
No ratings yet
Maximum Likelihood Estimation
7 pages
M.E. Bda 2021
No ratings yet
M.E. Bda 2021
64 pages
Construction and Evaluation of Actuarial Models Exam
No ratings yet
Construction and Evaluation of Actuarial Models Exam
6 pages
Probability and Statistics For Economists Yongmiaohong PDF Download
No ratings yet
Probability and Statistics For Economists Yongmiaohong PDF Download
78 pages
LNO Notes
No ratings yet
LNO Notes
113 pages

Probability Concepts Explained

Uploaded by

Probability Concepts Explained

Uploaded by

6/19/2020 Probability concepts explained: Maximum likelihood estimation

Probability concepts explained: Maximum

What are parameters?

For a linear model we can write this as y = mx + c. In this example x could

Three linear models with different parameter values.

Intuitive explanation of maximum likelihood

The 10 (hypothetical) data points that we have observed

Calculating the Maximum Likelihood

The log likelihood

Example of a non-monotonic function because as you go from left to right on the

Taking logs of the original expression gives us:

So why maximum likelihood and not maximum

When is least squares minimisation the same as

Intuitively we can interpret the connection between the two methods by

Regression line showing data points with random Gaussian noise

Thank you for reading.

Thanks to Ludovic Benistant.

You might also like