Lecture 5: Maximum Likelihood
In this lecture, we are going to look at why supervised learning works from a new, probabilistic
perspective.
First, we are going to start by defining the probabilistic approach to machine learning and set up
some notation.
Recall that in supervised learning, a model is a function f : X → Y mapping inputs to targets, and that a parametric model with parameters θ ∈ Θ is written fθ : X → Y.
5 Review: Data Distribution
We will assume that the dataset is governed by a probability distribution P, which we will call the
data distribution. We will denote this as
x, y ∼ P.
The training set D = {(x(i), y(i)) | i = 1, 2, . . . , n} consists of independent and identically distributed (IID) samples from P.
6 Probabilistic Models
A probabilistic model is a probability distribution Pθ (x, y) over inputs x and targets y, possibly with parameters θ ∈ Θ. If we know Pθ (x, y), we can use the conditional Pθ (y|x) = Pθ (x, y)/Pθ (x) for prediction.
Consider a simple version of our example of predicting diabetes from BMI.
• For the target Y = {0, 1}, we discretize the diabetes risk score into low risk (y = 0) and high risk (y = 1).
• For the input X = {0, 1, 2}, we also discretize the BMI into low (x = 0), medium (x = 1), and high (x = 2).
Then the following is a simple probabilistic model.
[18]: import pandas as pd

      # A simple probabilistic model P(x, y) over discretized BMI (x) and risk (y)
      df_model = pd.DataFrame.from_records([
          ['low', 'low', 0.20], ['medium', 'low', 0.10], ['high', 'low', 0.20],
          ['low', 'high', 0.05], ['medium', 'high', 0.10], ['high', 'high', 0.35],
      ], columns=['BMI $x$', 'Risk $y$', 'P'])
      df_model
  BMI $x$ Risk $y$     P
0     low      low  0.20
1  medium      low  0.10
2    high      low  0.20
3     low     high  0.05
4  medium     high  0.10
5    high     high  0.35
Under this model, we can compute the conditional probability P (y|x) = P (x, y)/P (x).
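Below is a minimal sketch of this computation, assuming the `df_model` table above; the intermediate column names (`P(x)`, `P(y|x)`) are illustrative, and only the final display line survives from the original cell.

[ ]: # Marginal P(x): sum the joint P(x, y) over y for each BMI value
     df_conditional_model = df_model.copy()
     df_conditional_model['P(x)'] = df_model.groupby('BMI $x$')['P'].transform('sum')
     # Conditional P(y|x) = P(x, y) / P(x)
     df_conditional_model['P(y|x)'] = df_conditional_model['P'] / df_conditional_model['P(x)']
     df_conditional_model.iloc[:,[0,1,4]]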
Suppose we have a random variable x sampled from a probability distribution P, written x ∼ P(x). This x can be a sample from the data distribution, or any other random variable.
Recall that the expected value of a function g : X → R when the input x to g is sampled from P is given by

$$ \mathbb{E}_{x \sim P}[g(x)] = \sum_x g(x) P(x), $$

where we assumed for simplicity that x is discrete.
In practice, computing expected values is not always easy:
• x can take on a very large number of values, and summing over all of them is not possible.
• When x is continuous, the expected value can be an integral with no closed-form solution.
We therefore often use approximate methods, such as Monte Carlo estimation: draw T IID samples x1, . . . , xT from P and average g over them,

$$ \hat g(x_1, \dots, x_T) \triangleq \frac{1}{T} \sum_{t=1}^{T} g(x_t). $$
Let’s say that we throw five dice. What is the expected number of twos?
• Let x = (x1 , x2 , . . . , x5 ) be a dice roll where xj ∈ {1, 2, . . . , 6} is the outcome of the j-th die.
• Let g(x) denote the number of twos in the roll of dice x.
The expected value $\mathbb{E}_{x \sim P}[g(x)] = \sum_x g(x) P(x)$ is the expected number of twos. We can calculate it as follows.
[57]: import numpy as np
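      # The rest of this cell is a reconstruction sketch: simulate T rolls of
      # five dice and average g(x), the number of twos per roll. The sample
      # size T and the seed are assumptions, so the printed value may differ
      # slightly from the output shown below.
      np.random.seed(0)
      T = 10000
      rolls = np.random.randint(1, 7, size=(T, 5))    # T rolls of 5 dice
      mc_estimate = (rolls == 2).sum(axis=1).mean()   # Monte Carlo estimate of E[g(x)]
      print(f"MC Estimate: {mc_estimate:.4f}")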
MC Estimate: 0.8358
This makes sense, since the correct answer is 5/6 ≈ 0.83.
13 Properties of Monte Carlo Estimation
• The Monte Carlo estimate ĝ is unbiased:

$$ \mathbb{E}_P[\hat g] = \mathbb{E}_P[g(x)] $$

• It converges to the true expectation as we average additional samples:

$$ \hat g = \frac{1}{T} \sum_{t=1}^{T} g(x_t) \;\to\; \mathbb{E}_P[g(x)] \quad \text{as } T \to \infty $$

• Since the samples are IID, the variance of the estimate is var(ĝ) = varP (g(x))/T. Thus, the variance of the estimator can be reduced by increasing the number of samples.
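As a quick empirical check of this variance reduction, here is a minimal sketch (the sample sizes, the 100 repetitions, and the reuse of the dice-roll g from the example above are assumptions for illustration):

[ ]: import numpy as np
     rng = np.random.default_rng(0)

     # The spread of the MC estimate of the expected number of twos
     # (five dice) shrinks as the number of samples T grows.
     for T in [10, 100, 1000, 10000]:
         estimates = [(rng.integers(1, 7, size=(T, 5)) == 2).sum(axis=1).mean()
                      for _ in range(100)]
         print(f"T={T:6d}  std of estimate: {np.std(estimates):.4f}")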
Recall that we assume the dataset is governed by a probability distribution Pdata, which we call the data distribution. We denote this as

x, y ∼ Pdata.

The training set D = {(x(i), y(i)) | i = 1, 2, . . . , n} consists of independent and identically distributed (IID) samples from Pdata.
A probabilistic model P (x, y) can approximate the data distribution Pdata (x, y). Probabilistic models may also have parameters θ ∈ Θ, which we denote as Pθ (x, y). If we know Pθ (x, y), we can use the conditional Pθ (y|x) for prediction.
We now have a probabilistic model and a data distribution. Thus, it is natural to try to learn a good probability distribution Pθ (x, y) that approximates Pdata (x, y).

What are the characteristics of a good model Pθ (x, y)?
• Predictive accuracy: correctly predicting y from x.
  – Does this patient have diabetes or not?
• Understanding the relationships between x and y.
  – What physiological features of the patient influence their diabetes risk?
• Density estimation: approximating Pdata (x, y) so that we can later answer any query.
18 Kullback-Leibler Divergence
The Kullback-Leibler (KL) divergence between two distributions p and q is defined as

$$ D(p \,\|\, q) = \sum_x p(x) \log \frac{p(x)}{q(x)} = \mathbb{E}_{x \sim p}\left[\log \frac{p(x)}{q(x)}\right]. $$

Observations:

• D(p ∥ q) ≥ 0 for all p, q, with equality if and only if p = q. Proof (by Jensen's inequality, since − log is convex):

$$ D(p \,\|\, q) = \mathbb{E}_{x \sim p}\left[-\log \frac{q(x)}{p(x)}\right] \ge -\log \mathbb{E}_{x \sim p}\left[\frac{q(x)}{p(x)}\right] = -\log\left(\sum_x p(x) \frac{q(x)}{p(x)}\right) = 0. $$

• The KL divergence is not symmetric: in general, D(p ∥ q) ≠ D(q ∥ p).
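To make the definition concrete, here is a minimal sketch that computes D(p ∥ q) for two discrete distributions. It is a direct transcription of the formula above; it assumes both distributions have full support, and the function name is illustrative.

[ ]: import numpy as np

     def kl_divergence(p, q):
         # D(p || q) = sum_x p(x) log(p(x) / q(x)); assumes p, q > 0 everywhere
         p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
         return np.sum(p * np.log(p / q))

     p, q = [0.5, 0.5], [0.9, 0.1]
     print(kl_divergence(p, q), kl_divergence(q, p))  # nonnegative, and not symmetric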
19 Learning Models Using KL Divergence
We may now learn a probabilistic model Pθ (x, y) that approximates Pdata (x, y) via the KL divergence:

$$ D(P_{\text{data}} \,\|\, P_\theta) = \mathbb{E}_{x, y \sim P_{\text{data}}}\left[\log \frac{P_{\text{data}}(x, y)}{P_\theta(x, y)}\right] = \sum_{x, y} P_{\text{data}}(x, y) \log \frac{P_{\text{data}}(x, y)}{P_\theta(x, y)}. $$
Note that D(Pdata || Pθ ) = 0 iff the two distributions are the same.
We can simplify the KL divergence objective somewhat:

$$ D(P_{\text{data}} \,\|\, P_\theta) = \mathbb{E}_{x, y \sim P_{\text{data}}}\left[\log \frac{P_{\text{data}}(x, y)}{P_\theta(x, y)}\right] = \mathbb{E}_{x, y \sim P_{\text{data}}}\left[\log P_{\text{data}}(x, y)\right] - \mathbb{E}_{x, y \sim P_{\text{data}}}\left[\log P_\theta(x, y)\right]. $$
The first term does not depend on Pθ : minimizing KL divergence is equivalent to maximizing the
expected log-likelihood.
• This asks that Pθ assign high probability to instances sampled from Pdata , so as to reflect the
true distribution.
• Because of the log, samples x, y where Pθ (x, y) ≈ 0 weigh heavily in the objective.
Problem: In general we do not know Pdata, hence the expected value is intractable.

Applying Monte Carlo estimation, we may approximate the expected log-likelihood with the empirical log-likelihood:

$$ \mathbb{E}_{x, y \sim \mathcal{D}}\left[\log P_\theta(x, y)\right] = \frac{1}{|\mathcal{D}|} \sum_{x, y \in \mathcal{D}} \log P_\theta(x, y). $$

Maximum likelihood learning then consists of maximizing this objective over θ ∈ Θ.
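For instance, here is a minimal sketch of the empirical log-likelihood under the toy diabetes model from earlier; the four-point dataset is hypothetical, purely for illustration.

[ ]: import numpy as np

     # Joint probabilities P(x, y) from the toy diabetes model above
     P = {('low', 'low'): 0.20, ('medium', 'low'): 0.10, ('high', 'low'): 0.20,
          ('low', 'high'): 0.05, ('medium', 'high'): 0.10, ('high', 'high'): 0.35}

     # A hypothetical dataset D of (BMI, risk) pairs
     D = [('high', 'high'), ('low', 'low'), ('medium', 'low'), ('high', 'high')]

     print(np.mean([np.log(P[xy]) for xy in D]))  # empirical log-likelihood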
Consider a simple example in which we repeatedly toss a biased coin and record the outcomes.
• There are two possible outcomes: heads (H) and tails (T ). A training dataset consists of
tosses of the biased coin, e.g., D = {H, H, T, H, T }
• Assumption: true probability distribution is Pdata (x), x ∈ {H, T }
• Our task is to model the probability of heads/tails. Our class of models M are Bernoulli
distributions over x ∈ {H, T }.
How should we choose Pθ (x) from M if 3 out of 5 tosses are heads in D? Let’s apply maximum
likelihood learning.
• Our model is Pθ (x = H) = θ and Pθ (x = T ) = 1 − θ
• Our data is: D = {H, H, T, H, T}.
• The likelihood of the data is $\prod_i P_\theta(x_i) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta) = \theta^3 (1 - \theta)^2$.
We optimize for θ which makes D most likely. What is the solution in this case?
[5]: %matplotlib inline
     import numpy as np
     from matplotlib import pyplot as plt

     # Likelihood of D = {H, H, T, H, T} as a function of theta = P(x = H)
     coin_likelihood = lambda theta: theta * theta * (1 - theta) * theta * (1 - theta)

     theta_vals = np.linspace(0, 1)
     plt.plot(theta_vals, coin_likelihood(theta_vals))
24 Example: Flipping a Random Coin
The maximum likelihood estimate is the θ∗ ∈ [0, 1] at which the log-likelihood log L(θ∗) is maximized, where L(θ) = θ³(1 − θ)² is the likelihood plotted above.

Differentiating the log-likelihood function with respect to θ and setting the derivative to zero, we obtain

$$ \theta^* = \frac{\#\,\text{heads}}{\#\,\text{heads} + \#\,\text{tails}}. $$

For our dataset, log L(θ) = 3 log θ + 2 log(1 − θ); setting 3/θ − 2/(1 − θ) = 0 gives θ∗ = 3/5.
When exact solutions are not available, we can optimize the log likelihood numerically, e.g. using
gradient descent.
We will see examples of this later.
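As a sanity check, we can also find the maximum numerically. The sketch below uses a simple grid search rather than gradient descent; the grid resolution is an arbitrary choice.

[ ]: import numpy as np

     # log-likelihood of D = {H, H, T, H, T}: 3 log(theta) + 2 log(1 - theta)
     theta_grid = np.linspace(0.001, 0.999, 999)
     log_lik = 3 * np.log(theta_grid) + 2 * np.log(1 - theta_grid)
     print(theta_grid[np.argmax(log_lik)])  # ~0.6, matching the closed form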
Sometimes, we may be interested in fitting only a conditional model P (y|x). For example, we may only be interested in predicting y from x rather than learning the joint structure of x, y.
We can extend the principle of maximum likelihood learning to this setting as well. In this case, we are interested in minimizing the expected KL divergence between Pdata (y|x) and Pθ (y|x) over all the inputs x:

$$ \min_\theta \; \mathbb{E}_{x \sim P_{\text{data}}}\left[ D\big(P_{\text{data}}(y|x) \,\|\, P_\theta(y|x)\big) \right]. $$

With a bit of math, we can show that the maximum likelihood objective becomes

$$ \max_\theta \; \mathbb{E}_{x, y \sim P_{\text{data}}}\left[ \log P_\theta(y|x) \right]. $$
Recall that in maximum likelihood learning, we are optimizing the following objective:

$$ \max_\theta \; \frac{1}{|\mathcal{D}|} \sum_{x, y \in \mathcal{D}} \log P_\theta(x, y). $$
So far, we have viewed the parameter θ as a fixed but unknown quantity that we want to determine. This view is an example of the frequentist approach in statistics: there exists some true value of θ, and our job is to devise a statistical procedure (such as maximum likelihood) to estimate this value.
29 Bayesian Inference and Learning
In the Bayesian approach, the parameters θ are themselves a random variable with a prior distribution P (θ) encoding our initial beliefs. Having observed a dataset D, we can compute the posterior distribution over θ using Bayes' rule:

$$ P(\theta \mid \mathcal{D}) = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} = \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{\int_\theta P(\mathcal{D} \mid \theta)\, P(\theta)\, d\theta}, $$

where $P(\mathcal{D} \mid \theta) = \prod_{i=1}^{n} P(x^{(i)}, y^{(i)} \mid \theta)$.
30 Bayesian Predictions
Suppose we now want to predict the value of y from x. Unlike in the frequentist setting, we no longer have a single estimate θ of the model parameters; instead, we have a distribution over them.

The Bayesian approach to predicting y given an input x and a training dataset D consists of combining the predictions of all possible models:

$$ P(y \mid x, \mathcal{D}) = \int_\theta P(y \mid x, \theta)\, P(\theta \mid \mathcal{D})\, d\theta. $$
This is called the posterior predictive distribution. Note how each P (y | x, θ) is weighted by the
probability of θ given D.
The Bayesian approach is very powerful. Some of its advantages include:
• Principled estimates of uncertainty, both in the prediction and in the parameters of the model.
• The ability to incorporate prior knowledge via the prior.
• A general framework for reasoning about probabilistic models.
The main disadvantage is by far the computational complexity: averaging over all possible model weights is typically intractable. There exists an entire field of machine learning that studies how to approximate it.
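For instance, the posterior predictive integral can itself be approximated by Monte Carlo. Here is a minimal sketch for the coin example, assuming a uniform prior over θ (so that the posterior after observing 3 heads and 2 tails is Beta(4, 3)):

[ ]: import numpy as np
     rng = np.random.default_rng(0)

     # P(x = H | D) = E_{theta ~ P(theta|D)}[theta], approximated by sampling
     theta_samples = rng.beta(4, 3, size=100000)
     print(f"P(x=H | D) ≈ {theta_samples.mean():.3f}")  # exact value: 4/7 ≈ 0.571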
32 Maximum A Posteriori Learning
Instead of using the full posterior distribution P (θ|D), a common approach is to approximate it by its most likely value:

$$ \begin{aligned} \theta_{\text{MAP}} &= \arg\max_\theta \log P(\theta \mid \mathcal{D}) \\ &= \arg\max_\theta \log \frac{P(\mathcal{D} \mid \theta)\, P(\theta)}{P(\mathcal{D})} \\ &= \arg\max_\theta \big( \log P(\mathcal{D} \mid \theta) + \log P(\theta) \big), \end{aligned} $$

where in the second line we used Bayes' theorem and in the third line we used the fact that P (D) does not depend on θ.
Thus, we have the following objective:

$$ \arg\max_\theta \left( \log \prod_{i=1}^{n} P(x^{(i)}, y^{(i)} \mid \theta) + \log P(\theta) \right). $$
The resulting θMAP is known as the maximum a posteriori (MAP) estimate. Note that we used the same formula as for maximum likelihood, except that we have added the prior term log P (θ).
How should we choose P (x | θ) from M if 3 out of 5 tosses are heads in D? Let's apply maximum likelihood learning again, this time with a prior over θ.
• Our model is P (x = H | θ) = θ and P (x = T | θ) = 1 − θ.
• Our data is: D = {H, H, T, H, T}.
• The likelihood of the data is $\prod_i P(x_i \mid \theta) = \theta \cdot \theta \cdot (1 - \theta) \cdot \theta \cdot (1 - \theta)$.
Let's now make this a MAP problem. Let's assume the prior follows the Beta distribution:

$$ P(\theta) = \frac{1}{B(\alpha + 1, \beta + 1)}\, \theta^{\alpha} (1 - \theta)^{\beta}, $$

where B denotes the Beta function and α, β ≥ 0 are hyperparameters.
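To visualize how the prior shifts the estimate, here is a minimal sketch plotting the prior and the resulting posterior for D = {H, H, T, H, T}. It assumes scipy is available and uses illustrative hyperparameter values; by conjugacy, the posterior is Beta(3 + α + 1, 2 + β + 1).

[ ]: import numpy as np
     from matplotlib import pyplot as plt
     from scipy.stats import beta

     alpha, beta_param = 2, 2   # illustrative prior hyperparameters
     theta_vals = np.linspace(0.001, 0.999, 500)

     # The prior P(theta) above is Beta(alpha + 1, beta + 1)
     plt.plot(theta_vals, beta.pdf(theta_vals, alpha + 1, beta_param + 1), label='prior')
     # Posterior after 3 heads, 2 tails: Beta(3 + alpha + 1, 2 + beta + 1)
     plt.plot(theta_vals, beta.pdf(theta_vals, 3 + alpha + 1, 2 + beta_param + 1), label='posterior')
     plt.legend()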
Differentiating the log-posterior log P (D | θ) + log P (θ) with respect to θ and setting the derivative to zero, we obtain

$$ \theta^* = \frac{\#\,\text{heads} + \alpha}{\#\,\text{heads} + \#\,\text{tails} + \alpha + \beta}. $$
Thus, we see that adding a Beta prior with parameters α, β allows us to encode having seen α “virtual heads” and β “virtual tails”.
This is an example of how we can add prior knowledge into the model.
[59]: %matplotlib inline
      import numpy as np
      from matplotlib import pyplot as plt

      # Plot the likelihood of D = {H, H, T, H, T} as a function of theta
      theta_vals = np.linspace(0, 1)
      plt.plot(theta_vals, coin_likelihood(theta_vals))