
Probabilistic Models for Supervised Learning

Piyush Rai

Introduction to Machine Learning (CS771A)

August 16, 2018



Announcements

Homework 1 will be out tonight. Due on August 31, 11:59pm. Please start early.
Project ideas will be posted by tomorrow.
Project group formation deadline extended to August 25.
Piazza has a “Search for Teammates” feature (the very first pinned post)

Please sign up on Piazza (we won’t sign you up if you are waiting for that :-))

TA office hours and office locations posted on Piazza (under resources/staff section)



Recap: Probabilistic Modeling

A probabilistic model is specified by two key components


An observation model p(y |θ), a.k.a. the likelihood model
(Optionally) A prior distribution p(θ) over the unknown parameters

Note that these two components specify the joint distribution p(y , θ) of data and unknowns
We can incorporate our assumptions about the data via the observation/likelihood model

We can incorporate our assumptions about the parameters via the prior distribution
Note: Likelihood and/or prior may depend on additional “hyperparameters” (fixed/unknown)



Recap: Parameter Estimation
Can do point estimation (via MLE/MAP) for θ or infer its full posterior (via Bayesian inference)
MLE maximizes the (log of the) likelihood w.r.t. the parameters θ. For i.i.d. data,

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid \theta) = \arg\min_{\theta} \text{NLL}(\theta)$$
MLE is akin to empirical/training loss minimization (no regularization)
MAP estimation maximizes the (log of the) posterior w.r.t. the parameters θ. For i.i.d. data,

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\Big[\sum_{n=1}^{N} \log p(y_n \mid \theta) + \log p(\theta)\Big] = \arg\min_{\theta}\big[\text{NLL}(\theta) - \log p(\theta)\big]$$
MAP is akin to regularized loss minimization (prior acts as a regularizer)
Bayesian inference computes the full posterior distribution of θ:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \quad \text{(intractable in general)}$$
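To make this recap concrete, here is a minimal NumPy sketch (not from the slides) contrasting the MLE and MAP point estimates for the mean of a univariate Gaussian with known noise precision β and a zero-mean Gaussian prior of precision λ; the data and the values of β and λ are illustrative assumptions.

```python
# Minimal sketch (assumed setup): MLE vs. MAP for the mean theta of a 1-D Gaussian
# with known noise precision beta and prior theta ~ N(0, 1/lam).
import numpy as np

rng = np.random.default_rng(0)
beta, lam = 4.0, 1.0                                          # noise precision, prior precision
y = rng.normal(loc=2.0, scale=1.0 / np.sqrt(beta), size=20)   # i.i.d. observations

theta_mle = y.mean()                                  # maximizes sum_n log p(y_n | theta)
theta_map = beta * y.sum() / (beta * len(y) + lam)    # adds log p(theta); shrunk toward 0
print(theta_mle, theta_map)
```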



Recap: Predictive Distribution
Having estimated θ, we usually want the predictive distribution p(y∗ | y) for some future data y∗
The proper, exact way of getting the predictive distribution is

$$p(y_* \mid y) = \underbrace{\int p(y_*, \theta \mid y)\, d\theta}_{\text{sum rule of probability}} = \underbrace{\int p(y_* \mid \theta, y)\, p(\theta \mid y)\, d\theta}_{\text{chain/product rule of probability}} = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \quad \text{(assuming i.i.d. data)}$$

If using a point estimate θ̂ (e.g., MLE/MAP), p(θ | y) ≈ δ_{θ̂}(θ), where δ(·) denotes the Dirac delta function

$$p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \approx p(y_* \mid \hat{\theta}_{\text{MLE}}) \quad \text{(MLE-based prediction)}$$

$$p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \approx p(y_* \mid \hat{\theta}_{\text{MAP}}) \quad \text{(MAP-based prediction)}$$
If using fully Bayesian inference, $p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta$ ⇐ this uses the proper way!

The integral here may not always be tractable and may need to be approximated
Probabilistic Models for
Supervised Learning

Want models that give us p(y |x)



Why Probabilistic Models for Supervised Learning?

Often, we want the distribution p(y |x) over possible outputs y , given an input x

(Figure: for regression, p(y|x) is a density over real-valued outputs, centered at the mean prediction; for classification with, say, 5 classes, p(y|x) is a probability distribution over the possible output values.)

The distribution p(y |x) is more informative, since it can tell us


What is the “expected” or “most likely” value of the predicted output y ?
What is the “uncertainty” in the predicted output y ?
.. and gives “soft” predictions (e.g., rather than yes/no prediction, gives prob. of “yes”)

Moreover, we can use priors over model parameters, perform fully Bayesian inference, etc.



Probabilistic Models for Supervised Learning

Usually two ways to model the conditional distribution p(y|x) of outputs given inputs

Approach 1: Don't model x, and model p(y|x) directly using a probability distribution, e.g.,

$$p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1}) \quad \text{(prob. linear regression)}$$

$$p(y \mid w, x) = \text{Bernoulli}[\sigma(w^\top x)] \quad \text{(prob. linear binary classification)}$$

(Note: $w^\top x$ above is only for the linear prob. model; it can even be replaced by a possibly nonlinear $f(x)$)
Approach 2: Model both x and y via the joint distribution p(x, y), and then get the conditional as

$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{p(x \mid \theta)} \quad \text{(note: } \theta \text{ collectively denotes all the parameters)}$$

$$p(y = k \mid x, \theta) = \frac{p(x, y = k \mid \theta)}{p(x \mid \theta)} = \frac{p(x \mid y = k, \theta)\, p(y = k \mid \theta)}{\sum_{\ell=1}^{K} p(x \mid y = \ell, \theta)\, p(y = \ell \mid \theta)} \quad \text{(for } K\text{-class classification)}$$

Approach 1 is called Discriminative Modeling; Approach 2 is called fully Generative Modeling

Discriminative models only model y, not x; generative models model both y and x



Today: Discriminative Models for
Probabilistic Regression/Classification
1: Probabilistic Linear Regression: $p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$

2: Logistic Regression for Binary Classification: $p(y \mid w, x) = \text{Bernoulli}[\sigma(w^\top x)]$

(Remember that these do NOT model x, but only model y)

(Also, both are linear models; note the $w^\top x$)



Gaussian Distribution: Brief Review



Univariate Gaussian Distribution

Distribution over real-valued scalar r.v. x

Defined by a scalar mean µ and a scalar variance σ²

The density is defined as

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Mean: E[x] = µ
Variance: var[x] = σ²
Precision (inverse variance): β = 1/σ²
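As a quick numerical check (not from the slides), the density above can be evaluated directly and compared against scipy.stats.norm; the values of µ and σ² below are arbitrary.

```python
# Small sketch: evaluate the univariate Gaussian pdf from the formula above and
# compare with scipy.stats.norm (mu, sigma^2 values are illustrative).
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.64
x = np.linspace(-2.0, 5.0, 7)

pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
pdf_scipy = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))   # scale = standard deviation
assert np.allclose(pdf_manual, pdf_scipy)

beta = 1.0 / sigma2                                      # the precision
```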



Multivariate Gaussian Distribution

Distribution over a multivariate r.v. vector x ∈ R^D of real numbers

Defined by a mean vector µ ∈ R^D and a D × D covariance matrix Σ

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\!\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\Big)$$

The covariance matrix Σ must be symmetric and positive definite:
All eigenvalues are positive
$z^\top \Sigma z > 0$ for any nonzero real vector z
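A small sketch (with an assumed mean and covariance) that evaluates the multivariate Gaussian density and verifies the symmetry and positive-definiteness conditions above:

```python
# Small sketch: a 2-D Gaussian with illustrative mean/covariance, plus the
# symmetry and positive-definiteness checks stated above.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

assert np.allclose(Sigma, Sigma.T)            # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # all eigenvalues positive

x = np.array([0.3, 0.7])
D, diff = len(mu), x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
             / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
assert np.allclose(pdf_manual, multivariate_normal(mu, Sigma).pdf(x))
```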



Linear Regression: A Probabilistic View

(Figure-based slides: the output is modeled as a Gaussian with mean $w^\top x$ and variance $\beta^{-1}$.)

$$p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$$

Equivalently, $y = w^\top x + \epsilon$, with Gaussian noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$


Probabilistic Linear Regression: Some Comments

Modeling p(y|w, x) as a Gaussian, $p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$, is just one possibility

Can model p(y|w, x) using other distributions too, e.g., Laplace (better handles outliers)

(Figure: Gaussian vs. Laplace likelihoods, each parameterized by a mean and a variance/scale.)

Even with a Gaussian, can assume each output to have a different variance (heteroscedastic noise):

$$p(y \mid w, x_n) = \mathcal{N}(w^\top x_n, \beta_n^{-1})$$



MLE for Probabilistic Linear Regression
Since each likelihood term is a Gaussian, we have

$$p(y_n \mid x_n, w) = \mathcal{N}(w^\top x_n, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\, \exp\!\Big(-\frac{\beta}{2}(y_n - w^\top x_n)^2\Big)$$

Thus the likelihood (assuming i.i.d. responses) will be

$$p(y \mid \mathbf{X}, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \Big(\frac{\beta}{2\pi}\Big)^{N/2} \exp\!\Big(-\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2\Big)$$
Note: the features x_n are assumed given/fixed; we only model the responses y_n

The log-likelihood (ignoring constants w.r.t. w) is

$$\log p(y \mid \mathbf{X}, w) \propto -\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2$$

Note that the negative log-likelihood (NLL) in this case is similar to the squared loss function

Therefore MLE with this model will give the same solution as (unregularized) least squares
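The following minimal sketch (synthetic data, assumed noise precision) illustrates this equivalence numerically: the MLE obtained from the normal equations matches NumPy's unregularized least-squares solution.

```python
# Minimal sketch: MLE for probabilistic linear regression = ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
N, D, beta = 100, 3, 25.0
X = rng.normal(size=(N, D))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=1.0 / np.sqrt(beta), size=N)   # Gaussian noise

w_mle = np.linalg.solve(X.T @ X, X.T @ y)         # minimizer of the (scaled) squared loss
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # unregularized least squares
assert np.allclose(w_mle, w_lstsq)
```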
MAP Estimation for Probabilistic Linear Regression
Let's assume a zero-mean multivariate Gaussian prior on weight vector w:

$$p(w) = \mathcal{N}(0, \lambda^{-1}\mathbf{I}_D) \propto \exp\!\Big(-\frac{\lambda}{2}\, w^\top w\Big) = \exp\!\Big(-\frac{\lambda}{2}\sum_{d=1}^{D} w_d^2\Big)$$

This prior encourages each weight w_d to be small (close to zero), similar to $\ell_2$ regularization

The MAP objective (log-posterior) will be the log-likelihood + log p(w):

$$-\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2 - \frac{\lambda}{2}\, w^\top w$$

Maximizing this is equivalent to minimizing the following w.r.t. w:

$$\hat{w}_{\text{MAP}} = \arg\min_{w} \sum_{n=1}^{N}(y_n - w^\top x_n)^2 + \frac{\lambda}{\beta}\, w^\top w$$

Note that λ/β is like a regularization hyperparameter (as in ridge regression)
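A minimal sketch (illustrative synthetic data and assumed values of λ and β) of the resulting closed-form solution, which is exactly the ridge-regression estimate with regularization strength λ/β:

```python
# Minimal sketch: the MAP estimate above as ridge regression with strength lam/beta.
import numpy as np

def w_map(X, y, lam, beta):
    """Closed-form minimizer of sum_n (y_n - w^T x_n)^2 + (lam/beta) * w^T w."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + (lam / beta) * np.eye(D), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=50)
print(w_map(X, y, lam=5.0, beta=25.0))   # shrunk toward zero relative to the MLE
```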
Fully Bayesian Inference for Probabilistic Linear Regression
Can also compute the full posterior distribution over w:

$$p(w \mid y, \mathbf{X}) = \frac{p(w)\, p(y \mid \mathbf{X}, w)}{p(y \mid \mathbf{X})}$$

Since the likelihood (Gaussian) and prior (Gaussian) are conjugate, the posterior is easy to compute

After some algebra, it can be shown that (a note will be provided)

$$p(w \mid y, \mathbf{X}) = \mathcal{N}(\mu_N, \Sigma_N), \qquad \Sigma_N = (\beta\, \mathbf{X}^\top\mathbf{X} + \lambda\, \mathbf{I}_D)^{-1}, \qquad \mu_N = \Big(\mathbf{X}^\top\mathbf{X} + \frac{\lambda}{\beta}\, \mathbf{I}_D\Big)^{-1}\mathbf{X}^\top y$$
Note: We are assuming the hyperparameters β and λ to be known
Note: For brevity, we have omitted the hyperparams from the conditioning in various distributions
such as p(w ), p(y |X, w ), p(y |X), p(w |y , X)
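A small NumPy sketch (synthetic data; β and λ treated as known, with illustrative values) that computes the posterior N(µ_N, Σ_N) given above and checks the two equivalent forms of the posterior mean:

```python
# Small sketch: posterior p(w | y, X) = N(mu_N, Sigma_N) for Bayesian linear regression.
import numpy as np

def posterior_w(X, y, beta, lam):
    D = X.shape[1]
    Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))
    mu_N = beta * Sigma_N @ (X.T @ y)   # equals (X^T X + (lam/beta) I)^{-1} X^T y
    return mu_N, Sigma_N

rng = np.random.default_rng(3)
beta, lam = 25.0, 1.0
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.0 / np.sqrt(beta), size=100)
mu_N, Sigma_N = posterior_w(X, y, beta, lam)
assert np.allclose(mu_N, np.linalg.solve(X.T @ X + (lam / beta) * np.eye(3), X.T @ y))
```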
Predictive Distribution

Now we want the predictive distribution p(y∗ | x∗, X, y) of the output y∗ for a new input x∗

With an MLE/MAP estimate of w, the prediction can be made by simply plugging in the estimate:

$$p(y_* \mid x_*, \mathbf{X}, y) \approx p(y_* \mid x_*, w_{\text{MLE}}) = \mathcal{N}(w_{\text{MLE}}^\top x_*, \beta^{-1}) \quad \text{(MLE prediction)}$$

$$p(y_* \mid x_*, \mathbf{X}, y) \approx p(y_* \mid x_*, w_{\text{MAP}}) = \mathcal{N}(w_{\text{MAP}}^\top x_*, \beta^{-1}) \quad \text{(MAP prediction)}$$

When doing fully Bayesian inference, we can compute the posterior predictive distribution

$$p(y_* \mid x_*, \mathbf{X}, y) = \int p(y_* \mid x_*, w)\, p(w \mid \mathbf{X}, y)\, dw$$

Due to Gaussian conjugacy, this too will be a Gaussian (note the form, ignore the proof :-))

$$p(y_* \mid x_*, \mathbf{X}, y) = \mathcal{N}\big(\mu_N^\top x_*,\ \beta^{-1} + x_*^\top \Sigma_N x_*\big)$$

In this case, we also get an input-specific predictive variance (unlike MLE/MAP prediction)

Very useful in applications where we want confidence estimates of the predictions made by the model
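A small sketch (synthetic data, illustrative hyperparameters) of the posterior predictive mean and the input-specific predictive variance for a test input x_*:

```python
# Small sketch: posterior predictive mean mu_N^T x_* and variance 1/beta + x_*^T Sigma_N x_*.
import numpy as np

rng = np.random.default_rng(4)
beta, lam = 25.0, 1.0
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.0 / np.sqrt(beta), size=100)

Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(3))
mu_N = beta * Sigma_N @ (X.T @ y)

x_star = np.array([1.0, 0.0, -1.0])
pred_mean = mu_N @ x_star
pred_var = 1.0 / beta + x_star @ Sigma_N @ x_star   # input-specific part of the uncertainty
print(pred_mean, pred_var)
```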



MLE, MAP/Fully Bayesian Linear Regression: Summary

MLE/MAP give a point estimate of w


Fully Bayesian approach gives the full posterior

MLE/MAP based prediction uses a single best estimate of w


Fully Bayesian prediction does posterior averaging
Some things to keep in mind:
MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
A Gaussian likelihood model corresponds to using squared loss
A Gaussian prior on parameters acts as an $\ell_2$ regularizer
Other likelihoods/priors can be chosen (result in other loss functions and regularizers)



Discriminative Models for
Probabilistic Classification
(Again, only y will be modeled, x treated as “fixed”)



Logistic Regression
Perhaps the simplest discriminative probabilistic model for linear binary classification
Defines µ = p(y = 1 | x) using the sigmoid function

$$\mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$

Here $w^\top x$ is the score for input x; the sigmoid turns it into a probability. Thus we have

$$p(y = 1 \mid x, w) = \mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$

$$p(y = 0 \mid x, w) = 1 - \mu = 1 - \sigma(w^\top x) = \frac{1}{1 + \exp(w^\top x)}$$

Note: If we assume y ∈ {−1, +1} instead of y ∈ {0, 1}, then $p(y \mid x, w) = \frac{1}{1 + \exp(-y\, w^\top x)}$



Logistic Regression: A Closer Look..

At the decision boundary, where both classes are equiprobable:

$$p(y = 1 \mid x, w) = p(y = 0 \mid x, w)$$

$$\frac{\exp(w^\top x)}{1 + \exp(w^\top x)} = \frac{1}{1 + \exp(w^\top x)}$$

$$\exp(w^\top x) = 1 \;\;\Rightarrow\;\; w^\top x = 0$$

Thus the decision boundary of LR is a linear hyperplane

Therefore predict y = 1 if $w^\top x \ge 0$, otherwise y = 0

A high positive (negative) score $w^\top x$ means a high (low) probability of label 1
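A tiny sketch (arbitrary weight vector and input) of the sigmoid probability and the induced decision rule:

```python
# Tiny sketch: sigmoid probability p(y = 1 | x, w) and the thresholding rule w^T x >= 0.
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([2.0, -1.0, 0.5])
x = np.array([0.3, 0.8, 1.0])
p1 = sigmoid(w @ x)        # p(y = 1 | x, w)
label = int(w @ x >= 0)    # equivalent to predicting 1 iff p1 >= 0.5
print(p1, label)
```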



MLE for Logistic Regression
Each label $y_n = 1$ with probability $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$

Assuming i.i.d. labels, the likelihood is a product of Bernoullis:

$$p(y \mid \mathbf{X}, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \prod_{n=1}^{N} \mu_n^{y_n}\, (1 - \mu_n)^{1 - y_n}$$

Negative log-likelihood: $\text{NLL}(w) = -\log p(y \mid \mathbf{X}, w) = -\sum_{n=1}^{N} \big(y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n)\big)$
Note: The NLL in this case is the same as cross-entropy loss function (a classification loss fn)
Plugging in $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$ and simplifying, we get (verify yourself)

$$\text{NLL}(w) = -\sum_{n=1}^{N} \big(y_n\, w^\top x_n - \log(1 + \exp(w^\top x_n))\big)$$

MLE solution: $\hat{w}_{\text{MLE}} = \arg\min_w \text{NLL}(w)$. No closed-form solution (you can verify)
Requires iterative methods (e.g., gradient descent). We will look at these later.
Exercise: Try computing the gradient of NLL(w ) and note the form of the gradient
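For concreteness, here is a minimal gradient-descent sketch (an illustrative choice, not the optimizers covered later in the course); it uses the standard gradient ∇NLL(w) = X^⊤(µ − y), which is the form the exercise above asks you to derive. The step size, iteration count, and synthetic data are arbitrary.

```python
# Minimal sketch: gradient descent on NLL(w) for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, lr=0.1, n_iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)            # mu_n = p(y_n = 1 | x_n, w)
        grad = X.T @ (mu - y)          # gradient of NLL(w)
        w -= lr * grad / len(y)
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
print(logistic_mle(X, y))
```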
MAP Estimation for Logistic Regression

To do MAP estimation for w , can use a prior p(w ) on w

Just like the probabilistic linear regression case, let's put a Gaussian prior on w:

$$p(w) = \mathcal{N}(0, \lambda^{-1}\mathbf{I}_D) \propto \exp\!\Big(-\frac{\lambda}{2}\, w^\top w\Big)$$

MAP objective (log-posterior) = MLE objective + log p(w)

The MAP estimate of w will be

$$\hat{w}_{\text{MAP}} = \arg\min_{w}\Big[\text{NLL}(w) + \frac{\lambda}{2}\, w^\top w\Big]$$



Fully Bayesian Estimation for Logistic Regression
Doing fully Bayesian inference would require computing the posterior

$$p(w \mid \mathbf{X}, y) = \frac{p(y \mid \mathbf{X}, w)\, p(w)}{\int p(y \mid \mathbf{X}, w)\, p(w)\, dw} = \frac{\prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)}{\int \prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)\, dw}$$

This is intractable. Reason: the likelihood (logistic/Bernoulli) and the prior (Gaussian) here are not conjugate

Need to do approximate inference in this case

A crude approximation, the Laplace approximation: approximate the posterior by a Gaussian with mean $w_{\text{MAP}}$ and covariance given by the inverse Hessian, where the Hessian $\mathbf{H}$ is the matrix of second derivatives of the negative log-posterior $-\log p(w \mid \mathbf{X}, y)$ at $w_{\text{MAP}}$:

$$p(w \mid \mathbf{X}, y) \approx \mathcal{N}(w_{\text{MAP}}, \mathbf{H}^{-1})$$
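A rough NumPy sketch of this Laplace approximation under the Gaussian prior N(0, λ^{-1}I_D) used earlier; for this model the Hessian of the negative log-posterior is H = X^⊤SX + λI_D with S = diag(µ_n(1 − µ_n)). The optimizer settings and synthetic data are illustrative.

```python
# Rough sketch: Laplace approximation p(w | X, y) ~= N(w_MAP, H^{-1}) for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approx(X, y, lam=1.0, lr=0.1, n_iters=2000):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):                    # crude gradient descent to get w_MAP
        mu = sigmoid(X @ w)
        grad = X.T @ (mu - y) + lam * w         # gradient of the negative log-posterior
        w -= lr * grad / N
    mu = sigmoid(X @ w)
    H = X.T @ (X * (mu * (1 - mu))[:, None]) + lam * np.eye(D)   # X^T S X + lam * I
    return w, np.linalg.inv(H)                  # mean and covariance of the Gaussian approx.

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)
w_map, cov = laplace_approx(X, y)
```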



Logistic Regression: Predictive Distributions
When using MLE, the predictive distribution will be

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) \approx p(y_* = 1 \mid x_*, w_{\text{MLE}}) = \sigma(w_{\text{MLE}}^\top x_*), \qquad p(y_* \mid x_*, \mathbf{X}, y) \approx \text{Bernoulli}\big(\sigma(w_{\text{MLE}}^\top x_*)\big)$$

When using MAP, the predictive distribution will be

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) \approx p(y_* = 1 \mid x_*, w_{\text{MAP}}) = \sigma(w_{\text{MAP}}^\top x_*), \qquad p(y_* \mid x_*, \mathbf{X}, y) \approx \text{Bernoulli}\big(\sigma(w_{\text{MAP}}^\top x_*)\big)$$

When using fully Bayesian inference, the posterior predictive distribution is based on posterior averaging:

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) = \int p(y_* = 1 \mid x_*, w)\, p(w \mid \mathbf{X}, y)\, dw = \int \sigma(w^\top x_*)\, p(w \mid \mathbf{X}, y)\, dw$$

Note: Unlike the linear regression case, for logistic regression (and for non-conjugate models in general), posterior averaging can be intractable (and may require approximations)
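One simple approximation (an illustrative choice, not prescribed by the slides) is Monte Carlo posterior averaging: draw samples from an approximate posterior, e.g., the Laplace approximation N(w_MAP, H^{-1}) above, and average σ(w^⊤x_*) over them. The w_map, cov, and x_star values below are purely illustrative.

```python
# Small sketch: Monte Carlo estimate of p(y_* = 1 | x_*, X, y) by posterior averaging.
import numpy as np

def predictive_prob(x_star, w_map, cov, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_map, cov, size=n_samples)   # samples from the approx. posterior
    return np.mean(1.0 / (1.0 + np.exp(-W @ x_star)))         # average of sigma(w^T x_*)

w_map = np.array([1.2, -1.8])
cov = np.array([[0.05, 0.01],
                [0.01, 0.08]])
x_star = np.array([0.5, 0.5])
print(predictive_prob(x_star, w_map, cov))
# Typically less extreme (closer to 0.5) than the plug-in probability sigma(w_map^T x_*),
# since it accounts for posterior uncertainty in w.
```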
Multiclass Logistic (a.k.a. Softmax) Regression
Also called multinoulli/multinomial regression: basically, logistic regression for K > 2 classes

In this case, $y_n \in \{1, 2, \ldots, K\}$ and the label probabilities are defined as

$$p(y_n = k \mid x_n, \mathbf{W}) = \frac{\exp(w_k^\top x_n)}{\sum_{\ell=1}^{K} \exp(w_\ell^\top x_n)} = \mu_{nk}, \qquad \sum_{\ell=1}^{K} \mu_{n\ell} = 1$$

$\mathbf{W} = [w_1\ w_2\ \ldots\ w_K]$ is the $D \times K$ weight matrix; $w_1 = \mathbf{0}_{D\times 1}$ (assumed for identifiability)


Popularly known as the softmax function
Each likelihood p(yn |x n , W) is a multinoulli distribution. Therefore
N Y
Y K
y
p(y |X, W) = µn`n`
n=1 `=1

where yn` = 1 if true class of example n is ` and yn`0 = 0 for all other `0 6= `
Can do MLE/MAP/fully Bayesian estimation for W similar to the logistic regression model
Will look at optimization methods for this and other loss functions later.
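A small sketch of the softmax probabilities µ_{nk}, with w_1 fixed to zero for identifiability as stated above; the dimensions and weight values are illustrative.

```python
# Small sketch: softmax class probabilities for K classes.
import numpy as np

def softmax_probs(X, W):
    """Row n of the returned N x K array is the class distribution (mu_n1, ..., mu_nK)."""
    scores = X @ W                                  # N x K matrix of w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)     # subtract row-wise max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

D, K = 3, 4
rng = np.random.default_rng(7)
W = rng.normal(size=(D, K))
W[:, 0] = 0.0                                       # w_1 = 0 (identifiability)
X = rng.normal(size=(5, D))
mu = softmax_probs(X, W)
assert np.allclose(mu.sum(axis=1), 1.0)             # each row sums to 1
```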
Summary

Looked at probabilistic models for supervised learning (regression and classification)


Can do MLE/MAP, or fully Bayesian inference, in these models

MLE/MAP is like loss function (or regularized loss function) minimization


Fully Bayesian inference is usually harder/expensive but often considered better
We get the full posterior over the parameters
We can do posterior averaging when computing the predictive distribution
Can get variance/confidence estimates for our predictions

Can model p(y|x) directly (discriminative models) or via p(x, y) (generative models)

Looked at discriminative models for regression and classification


Will look at generative models for learning p(y |x) next week

