
Probabilistic Models for Supervised Learning

Piyush Rai

Introduction to Machine Learning (CS771A)

August 16, 2018



Announcements

Homework 1 will be out tonight. Due on August 31, 11:59pm. Please start early.
Project ideas will be posted by tomorrow.
Project group formation deadline extended to August 25.
Piazza has a “Search for Teammates” feature (the very first pinned post)

Please sign up on Piazza (we won’t sign you up if you are waiting for that :-))

TA office hours and office locations posted on Piazza (under resources/staff section)



Recap: Probabilistic Modeling

A probabilistic model is specified by two key components


An observation model p(y |θ), a.k.a. the likelihood model
(Optionally) A prior distribution p(θ) over the unknown parameters

Note that these two components specify the joint distribution p(y , θ) of data and unknowns
We can incorporate our assumptions about the data via the observation/likelihood model

We can incorporate our assumptions about the parameters via the prior distribution
Note: Likelihood and/or prior may depend on additional “hyperparameters” (fixed/unknown)



Recap: Parameter Estimation
Can do point estimation (via MLE/MAP) for θ or infer its full posterior (via Bayesian inference)
MLE maximizes the (log of the) likelihood w.r.t. the parameters θ. For i.i.d. data,

$$\hat{\theta}_{\text{MLE}} = \arg\max_{\theta} \sum_{n=1}^{N} \log p(y_n \mid \theta) = \arg\min_{\theta} \text{NLL}(\theta)$$
MLE is akin to empirical/training loss minimization (no regularization)
MAP estimation maximizes the (log of the) posterior w.r.t. the parameters θ. For i.i.d. data,

$$\hat{\theta}_{\text{MAP}} = \arg\max_{\theta}\Big[\sum_{n=1}^{N} \log p(y_n \mid \theta) + \log p(\theta)\Big] = \arg\min_{\theta}\big[\text{NLL}(\theta) - \log p(\theta)\big]$$
MAP is akin to regularized loss minimization (prior acts as a regularizer)
Bayesian inference computes the full posterior distribution of θ:

$$p(\theta \mid y) = \frac{p(y \mid \theta)\, p(\theta)}{p(y)} = \frac{p(y \mid \theta)\, p(\theta)}{\int p(y \mid \theta)\, p(\theta)\, d\theta} \quad \text{(intractable in general)}$$
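To make this recap concrete, here is a minimal NumPy sketch (not from the slides) contrasting the MLE and MAP point estimates for the mean of a univariate Gaussian with known noise precision β and a zero-mean Gaussian prior of precision λ; the data and the values of β and λ are illustrative assumptions.

```python
# Minimal sketch (assumed setup): MLE vs. MAP for the mean theta of a 1-D Gaussian
# with known noise precision beta and prior theta ~ N(0, 1/lam).
import numpy as np

rng = np.random.default_rng(0)
beta, lam = 4.0, 1.0                                          # noise precision, prior precision
y = rng.normal(loc=2.0, scale=1.0 / np.sqrt(beta), size=20)   # i.i.d. observations

theta_mle = y.mean()                                  # maximizes sum_n log p(y_n | theta)
theta_map = beta * y.sum() / (beta * len(y) + lam)    # adds log p(theta); shrunk toward 0
print(theta_mle, theta_map)
```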



Recap: Predictive Distribution
Having estimated θ, we usually want the predictive distribution p(y∗ | y) for some future data y∗
The proper, exact way of getting the predictive distribution is

$$p(y_* \mid y) = \underbrace{\int p(y_*, \theta \mid y)\, d\theta}_{\text{sum rule of probability}} = \underbrace{\int p(y_* \mid \theta, y)\, p(\theta \mid y)\, d\theta}_{\text{chain/product rule of probability}} = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \quad \text{(assuming i.i.d. data)}$$

If using a point estimate θ̂ (e.g., MLE/MAP), p(θ | y) ≈ δ_{θ̂}(θ), where δ(·) denotes the Dirac delta function

$$p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \approx p(y_* \mid \hat{\theta}_{\text{MLE}}) \quad \text{(MLE-based prediction)}$$

$$p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta \approx p(y_* \mid \hat{\theta}_{\text{MAP}}) \quad \text{(MAP-based prediction)}$$
If using fully Bayesian inference, $p(y_* \mid y) = \int p(y_* \mid \theta)\, p(\theta \mid y)\, d\theta$ ⇐ this uses the proper way!

The integral here may not always be tractable and may need to be approximated
Probabilistic Models for
Supervised Learning

Want models that give us p(y |x)



Why Probabilistic Models for Supervised Learning?

Often, we want the distribution p(y |x) over possible outputs y , given an input x

(Figure: for regression, p(y|x) is a density over real-valued outputs, centered at the mean prediction; for classification with, say, 5 classes, p(y|x) is a probability distribution over the possible output values.)

The distribution p(y |x) is more informative, since it can tell us


What is the “expected” or “most likely” value of the predicted output y ?
What is the “uncertainty” in the predicted output y ?
.. and gives “soft” predictions (e.g., rather than yes/no prediction, gives prob. of “yes”)

Moreover, we can use priors over model parameters, perform fully Bayesian inference, etc.



Probabilistic Models for Supervised Learning

Usually two ways to model the conditional distribution p(y|x) of outputs given inputs

Approach 1: Don't model x, and model p(y|x) directly using a probability distribution, e.g.,

$$p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1}) \quad \text{(prob. linear regression)}$$

$$p(y \mid w, x) = \text{Bernoulli}[\sigma(w^\top x)] \quad \text{(prob. linear binary classification)}$$

(Note: $w^\top x$ above is only for the linear prob. model; it can even be replaced by a possibly nonlinear $f(x)$)
Approach 2: Model both x and y via the joint distribution p(x, y), and then get the conditional as

$$p(y \mid x, \theta) = \frac{p(x, y \mid \theta)}{p(x \mid \theta)} \quad \text{(note: } \theta \text{ collectively denotes all the parameters)}$$

$$p(y = k \mid x, \theta) = \frac{p(x, y = k \mid \theta)}{p(x \mid \theta)} = \frac{p(x \mid y = k, \theta)\, p(y = k \mid \theta)}{\sum_{\ell=1}^{K} p(x \mid y = \ell, \theta)\, p(y = \ell \mid \theta)} \quad \text{(for } K\text{-class classification)}$$

Approach 1 is called Discriminative Modeling; Approach 2 is called fully Generative Modeling

Discriminative models only model y, not x; generative models model both y and x



Today: Discriminative Models for
Probabilistic Regression/Classification
1: Probabilistic Linear Regression: $p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$

2: Logistic Regression for Binary Classification: $p(y \mid w, x) = \text{Bernoulli}[\sigma(w^\top x)]$

(Remember that these do NOT model x, but only model y)

(Also, both are linear models; note the $w^\top x$)



Gaussian Distribution: Brief Review



Univariate Gaussian Distribution

Distribution over real-valued scalar r.v. x

Defined by a scalar mean µ and a scalar variance σ²

The density is defined as

$$\mathcal{N}(x; \mu, \sigma^2) = \frac{1}{\sqrt{2\pi\sigma^2}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

Mean: E[x] = µ
Variance: var[x] = σ²
Precision (inverse variance): β = 1/σ²
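As a quick numerical check (not from the slides), the density above can be evaluated directly and compared against scipy.stats.norm; the values of µ and σ² below are arbitrary.

```python
# Small sketch: evaluate the univariate Gaussian pdf from the formula above and
# compare with scipy.stats.norm (mu, sigma^2 values are illustrative).
import numpy as np
from scipy.stats import norm

mu, sigma2 = 1.5, 0.64
x = np.linspace(-2.0, 5.0, 7)

pdf_manual = np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)
pdf_scipy = norm.pdf(x, loc=mu, scale=np.sqrt(sigma2))   # scale = standard deviation
assert np.allclose(pdf_manual, pdf_scipy)

beta = 1.0 / sigma2                                      # the precision
```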



Multivariate Gaussian Distribution

Distribution over a multivariate r.v. vector x ∈ R^D of real numbers

Defined by a mean vector µ ∈ R^D and a D × D covariance matrix Σ

$$\mathcal{N}(x; \mu, \Sigma) = \frac{1}{\sqrt{(2\pi)^D |\Sigma|}} \exp\!\Big(-\frac{1}{2}(x-\mu)^\top \Sigma^{-1} (x-\mu)\Big)$$

The covariance matrix Σ must be symmetric and positive definite:
All eigenvalues are positive
$z^\top \Sigma z > 0$ for any nonzero real vector z
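A small sketch (with an assumed mean and covariance) that evaluates the multivariate Gaussian density and verifies the symmetry and positive-definiteness conditions above:

```python
# Small sketch: a 2-D Gaussian with illustrative mean/covariance, plus the
# symmetry and positive-definiteness checks stated above.
import numpy as np
from scipy.stats import multivariate_normal

mu = np.array([0.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])

assert np.allclose(Sigma, Sigma.T)            # symmetric
assert np.all(np.linalg.eigvalsh(Sigma) > 0)  # all eigenvalues positive

x = np.array([0.3, 0.7])
D, diff = len(mu), x - mu
pdf_manual = np.exp(-0.5 * diff @ np.linalg.solve(Sigma, diff)) \
             / np.sqrt((2 * np.pi) ** D * np.linalg.det(Sigma))
assert np.allclose(pdf_manual, multivariate_normal(mu, Sigma).pdf(x))
```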



Linear Regression: A Probabilistic View

(Figure-based slides: the output is modeled as a Gaussian with mean $w^\top x$ and variance $\beta^{-1}$.)

$$p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$$

Equivalently, $y = w^\top x + \epsilon$, with Gaussian noise $\epsilon \sim \mathcal{N}(0, \beta^{-1})$


Probabilistic Linear Regression: Some Comments

Modeling p(y|w, x) as a Gaussian, $p(y \mid w, x) = \mathcal{N}(w^\top x, \beta^{-1})$, is just one possibility

Can model p(y|w, x) using other distributions too, e.g., Laplace (better handles outliers)

(Figure: Gaussian vs. Laplace likelihoods, each parameterized by a mean and a variance/scale.)

Even with a Gaussian, can assume each output to have a different variance (heteroscedastic noise):

$$p(y \mid w, x_n) = \mathcal{N}(w^\top x_n, \beta_n^{-1})$$



MLE for Probabilistic Linear Regression
Since each likelihood term is a Gaussian, we have

$$p(y_n \mid x_n, w) = \mathcal{N}(w^\top x_n, \beta^{-1}) = \sqrt{\frac{\beta}{2\pi}}\, \exp\!\Big(-\frac{\beta}{2}(y_n - w^\top x_n)^2\Big)$$

Thus the likelihood (assuming i.i.d. responses) will be

$$p(y \mid \mathbf{X}, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \Big(\frac{\beta}{2\pi}\Big)^{N/2} \exp\!\Big(-\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2\Big)$$
Note: the features x_n are assumed given/fixed; we only model the responses y_n

The log-likelihood (ignoring constants w.r.t. w) is

$$\log p(y \mid \mathbf{X}, w) \propto -\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2$$

Note that the negative log-likelihood (NLL) in this case is similar to the squared loss function

Therefore MLE with this model will give the same solution as (unregularized) least squares
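The following minimal sketch (synthetic data, assumed noise precision) illustrates this equivalence numerically: the MLE obtained from the normal equations matches NumPy's unregularized least-squares solution.

```python
# Minimal sketch: MLE for probabilistic linear regression = ordinary least squares.
import numpy as np

rng = np.random.default_rng(1)
N, D, beta = 100, 3, 25.0
X = rng.normal(size=(N, D))
w_true = np.array([0.5, -1.0, 2.0])
y = X @ w_true + rng.normal(scale=1.0 / np.sqrt(beta), size=N)   # Gaussian noise

w_mle = np.linalg.solve(X.T @ X, X.T @ y)         # minimizer of the (scaled) squared loss
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)   # unregularized least squares
assert np.allclose(w_mle, w_lstsq)
```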
MAP Estimation for Probabilistic Linear Regression
Let's assume a zero-mean multivariate Gaussian prior on weight vector w:

$$p(w) = \mathcal{N}(0, \lambda^{-1}\mathbf{I}_D) \propto \exp\!\Big(-\frac{\lambda}{2}\, w^\top w\Big) = \exp\!\Big(-\frac{\lambda}{2}\sum_{d=1}^{D} w_d^2\Big)$$

This prior encourages each weight w_d to be small (close to zero), similar to $\ell_2$ regularization

The MAP objective (log-posterior) will be the log-likelihood + log p(w):

$$-\frac{\beta}{2}\sum_{n=1}^{N}(y_n - w^\top x_n)^2 - \frac{\lambda}{2}\, w^\top w$$

Maximizing this is equivalent to minimizing the following w.r.t. w:

$$\hat{w}_{\text{MAP}} = \arg\min_{w} \sum_{n=1}^{N}(y_n - w^\top x_n)^2 + \frac{\lambda}{\beta}\, w^\top w$$

Note that λ/β is like a regularization hyperparameter (as in ridge regression)
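A minimal sketch (illustrative synthetic data and assumed values of λ and β) of the resulting closed-form solution, which is exactly the ridge-regression estimate with regularization strength λ/β:

```python
# Minimal sketch: the MAP estimate above as ridge regression with strength lam/beta.
import numpy as np

def w_map(X, y, lam, beta):
    """Closed-form minimizer of sum_n (y_n - w^T x_n)^2 + (lam/beta) * w^T w."""
    D = X.shape[1]
    return np.linalg.solve(X.T @ X + (lam / beta) * np.eye(D), X.T @ y)

rng = np.random.default_rng(2)
X = rng.normal(size=(50, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=0.2, size=50)
print(w_map(X, y, lam=5.0, beta=25.0))   # shrunk toward zero relative to the MLE
```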
Fully Bayesian Inference for Probabilistic Linear Regression
Can also compute the full posterior distribution over w:

$$p(w \mid y, \mathbf{X}) = \frac{p(w)\, p(y \mid \mathbf{X}, w)}{p(y \mid \mathbf{X})}$$

Since the likelihood (Gaussian) and prior (Gaussian) are conjugate, the posterior is easy to compute

After some algebra, it can be shown that (a note will be provided)

$$p(w \mid y, \mathbf{X}) = \mathcal{N}(\mu_N, \Sigma_N), \qquad \Sigma_N = (\beta\, \mathbf{X}^\top\mathbf{X} + \lambda\, \mathbf{I}_D)^{-1}, \qquad \mu_N = \Big(\mathbf{X}^\top\mathbf{X} + \frac{\lambda}{\beta}\, \mathbf{I}_D\Big)^{-1}\mathbf{X}^\top y$$
Note: We are assuming the hyperparameters β and λ to be known
Note: For brevity, we have omitted the hyperparams from the conditioning in various distributions
such as p(w ), p(y |X, w ), p(y |X), p(w |y , X)
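A small NumPy sketch (synthetic data; β and λ treated as known, with illustrative values) that computes the posterior N(µ_N, Σ_N) given above and checks the two equivalent forms of the posterior mean:

```python
# Small sketch: posterior p(w | y, X) = N(mu_N, Sigma_N) for Bayesian linear regression.
import numpy as np

def posterior_w(X, y, beta, lam):
    D = X.shape[1]
    Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(D))
    mu_N = beta * Sigma_N @ (X.T @ y)   # equals (X^T X + (lam/beta) I)^{-1} X^T y
    return mu_N, Sigma_N

rng = np.random.default_rng(3)
beta, lam = 25.0, 1.0
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.0 / np.sqrt(beta), size=100)
mu_N, Sigma_N = posterior_w(X, y, beta, lam)
assert np.allclose(mu_N, np.linalg.solve(X.T @ X + (lam / beta) * np.eye(3), X.T @ y))
```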
Predictive Distribution

Now we want the predictive distribution p(y∗ | x∗, X, y) of the output y∗ for a new input x∗

With an MLE/MAP estimate of w, the prediction can be made by simply plugging in the estimate:

$$p(y_* \mid x_*, \mathbf{X}, y) \approx p(y_* \mid x_*, w_{\text{MLE}}) = \mathcal{N}(w_{\text{MLE}}^\top x_*, \beta^{-1}) \quad \text{(MLE prediction)}$$

$$p(y_* \mid x_*, \mathbf{X}, y) \approx p(y_* \mid x_*, w_{\text{MAP}}) = \mathcal{N}(w_{\text{MAP}}^\top x_*, \beta^{-1}) \quad \text{(MAP prediction)}$$

When doing fully Bayesian inference, we can compute the posterior predictive distribution

$$p(y_* \mid x_*, \mathbf{X}, y) = \int p(y_* \mid x_*, w)\, p(w \mid \mathbf{X}, y)\, dw$$

Due to Gaussian conjugacy, this too will be a Gaussian (note the form, ignore the proof :-))

$$p(y_* \mid x_*, \mathbf{X}, y) = \mathcal{N}\big(\mu_N^\top x_*,\ \beta^{-1} + x_*^\top \Sigma_N x_*\big)$$

In this case, we also get an input-specific predictive variance (unlike MLE/MAP prediction)

Very useful in applications where we want confidence estimates of the predictions made by the model
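A small sketch (synthetic data, illustrative hyperparameters) of the posterior predictive mean and the input-specific predictive variance for a test input x_*:

```python
# Small sketch: posterior predictive mean mu_N^T x_* and variance 1/beta + x_*^T Sigma_N x_*.
import numpy as np

rng = np.random.default_rng(4)
beta, lam = 25.0, 1.0
X = rng.normal(size=(100, 3))
y = X @ np.array([0.5, -1.0, 2.0]) + rng.normal(scale=1.0 / np.sqrt(beta), size=100)

Sigma_N = np.linalg.inv(beta * X.T @ X + lam * np.eye(3))
mu_N = beta * Sigma_N @ (X.T @ y)

x_star = np.array([1.0, 0.0, -1.0])
pred_mean = mu_N @ x_star
pred_var = 1.0 / beta + x_star @ Sigma_N @ x_star   # input-specific part of the uncertainty
print(pred_mean, pred_var)
```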



MLE, MAP/Fully Bayesian Linear Regression: Summary

MLE/MAP give a point estimate of w


Fully Bayesian approach gives the full posterior

MLE/MAP based prediction uses a single best estimate of w


Fully Bayesian prediction does posterior averaging
Some things to keep in mind:
MLE estimation of a parameter leads to unregularized solutions
MAP estimation of a parameter leads to regularized solutions
A Gaussian likelihood model corresponds to using squared loss
A Gaussian prior on parameters acts as an $\ell_2$ regularizer
Other likelihoods/priors can be chosen (result in other loss functions and regularizers)



Discriminative Models for
Probabilistic Classification
(Again, only y will be modeled, x treated as “fixed”)



Logistic Regression
Perhaps the simplest discriminative probabilistic model for linear binary classification
Defines µ = p(y = 1 | x) using the sigmoid function

$$\mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$

Here $w^\top x$ is the score for input x; the sigmoid turns it into a probability. Thus we have

$$p(y = 1 \mid x, w) = \mu = \sigma(w^\top x) = \frac{1}{1 + \exp(-w^\top x)} = \frac{\exp(w^\top x)}{1 + \exp(w^\top x)}$$

$$p(y = 0 \mid x, w) = 1 - \mu = 1 - \sigma(w^\top x) = \frac{1}{1 + \exp(w^\top x)}$$

Note: If we assume y ∈ {−1, +1} instead of y ∈ {0, 1}, then $p(y \mid x, w) = \frac{1}{1 + \exp(-y\, w^\top x)}$



Logistic Regression: A Closer Look..

At the decision boundary, where both classes are equiprobable:

$$p(y = 1 \mid x, w) = p(y = 0 \mid x, w)$$

$$\frac{\exp(w^\top x)}{1 + \exp(w^\top x)} = \frac{1}{1 + \exp(w^\top x)}$$

$$\exp(w^\top x) = 1 \;\;\Rightarrow\;\; w^\top x = 0$$

Thus the decision boundary of LR is a linear hyperplane

Therefore predict y = 1 if $w^\top x \ge 0$, otherwise y = 0

A high positive (negative) score $w^\top x$ means a high (low) probability of label 1
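A tiny sketch (arbitrary weight vector and input) of the sigmoid probability and the induced decision rule:

```python
# Tiny sketch: sigmoid probability p(y = 1 | x, w) and the thresholding rule w^T x >= 0.
import numpy as np

def sigmoid(score):
    return 1.0 / (1.0 + np.exp(-score))

w = np.array([2.0, -1.0, 0.5])
x = np.array([0.3, 0.8, 1.0])
p1 = sigmoid(w @ x)        # p(y = 1 | x, w)
label = int(w @ x >= 0)    # equivalent to predicting 1 iff p1 >= 0.5
print(p1, label)
```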



MLE for Logistic Regression
Each label $y_n = 1$ with probability $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$

Assuming i.i.d. labels, the likelihood is a product of Bernoullis:

$$p(y \mid \mathbf{X}, w) = \prod_{n=1}^{N} p(y_n \mid x_n, w) = \prod_{n=1}^{N} \mu_n^{y_n}\, (1 - \mu_n)^{1 - y_n}$$

Negative log-likelihood: $\text{NLL}(w) = -\log p(y \mid \mathbf{X}, w) = -\sum_{n=1}^{N} \big(y_n \log \mu_n + (1 - y_n)\log(1 - \mu_n)\big)$
Note: The NLL in this case is the same as cross-entropy loss function (a classification loss fn)
Plugging in $\mu_n = \frac{\exp(w^\top x_n)}{1 + \exp(w^\top x_n)}$ and simplifying, we get (verify yourself)

$$\text{NLL}(w) = -\sum_{n=1}^{N} \big(y_n\, w^\top x_n - \log(1 + \exp(w^\top x_n))\big)$$

MLE solution: $\hat{w}_{\text{MLE}} = \arg\min_w \text{NLL}(w)$. No closed-form solution (you can verify)
Requires iterative methods (e.g., gradient descent). We will look at these later.
Exercise: Try computing the gradient of NLL(w ) and note the form of the gradient
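For concreteness, here is a minimal gradient-descent sketch (an illustrative choice, not the optimizers covered later in the course); it uses the standard gradient ∇NLL(w) = X^⊤(µ − y), which is the form the exercise above asks you to derive. The step size, iteration count, and synthetic data are arbitrary.

```python
# Minimal sketch: gradient descent on NLL(w) for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_mle(X, y, lr=0.1, n_iters=2000):
    w = np.zeros(X.shape[1])
    for _ in range(n_iters):
        mu = sigmoid(X @ w)            # mu_n = p(y_n = 1 | x_n, w)
        grad = X.T @ (mu - y)          # gradient of NLL(w)
        w -= lr * grad / len(y)
    return w

rng = np.random.default_rng(5)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) + 0.3 * rng.normal(size=200) > 0).astype(float)
print(logistic_mle(X, y))
```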
MAP Estimation for Logistic Regression

To do MAP estimation for w , can use a prior p(w ) on w

Just like the probabilistic linear regression case, let's put a Gaussian prior on w:

$$p(w) = \mathcal{N}(0, \lambda^{-1}\mathbf{I}_D) \propto \exp\!\Big(-\frac{\lambda}{2}\, w^\top w\Big)$$

MAP objective (log-posterior) = MLE objective + log p(w)

The MAP estimate of w will be

$$\hat{w}_{\text{MAP}} = \arg\min_{w}\Big[\text{NLL}(w) + \frac{\lambda}{2}\, w^\top w\Big]$$



Fully Bayesian Estimation for Logistic Regression
Doing fully Bayesian inference would require computing the posterior

$$p(w \mid \mathbf{X}, y) = \frac{p(y \mid \mathbf{X}, w)\, p(w)}{\int p(y \mid \mathbf{X}, w)\, p(w)\, dw} = \frac{\prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)}{\int \prod_{n=1}^{N} p(y_n \mid x_n, w)\, p(w)\, dw}$$

This is intractable. Reason: the likelihood (logistic/Bernoulli) and the prior (Gaussian) here are not conjugate

Need to do approximate inference in this case

A crude approximation, the Laplace approximation: approximate the posterior by a Gaussian with mean $w_{\text{MAP}}$ and covariance given by the inverse Hessian, where the Hessian $\mathbf{H}$ is the matrix of second derivatives of the negative log-posterior $-\log p(w \mid \mathbf{X}, y)$ at $w_{\text{MAP}}$:

$$p(w \mid \mathbf{X}, y) \approx \mathcal{N}(w_{\text{MAP}}, \mathbf{H}^{-1})$$
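A rough NumPy sketch of this Laplace approximation under the Gaussian prior N(0, λ^{-1}I_D) used earlier; for this model the Hessian of the negative log-posterior is H = X^⊤SX + λI_D with S = diag(µ_n(1 − µ_n)). The optimizer settings and synthetic data are illustrative.

```python
# Rough sketch: Laplace approximation p(w | X, y) ~= N(w_MAP, H^{-1}) for logistic regression.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def laplace_approx(X, y, lam=1.0, lr=0.1, n_iters=2000):
    N, D = X.shape
    w = np.zeros(D)
    for _ in range(n_iters):                    # crude gradient descent to get w_MAP
        mu = sigmoid(X @ w)
        grad = X.T @ (mu - y) + lam * w         # gradient of the negative log-posterior
        w -= lr * grad / N
    mu = sigmoid(X @ w)
    H = X.T @ (X * (mu * (1 - mu))[:, None]) + lam * np.eye(D)   # X^T S X + lam * I
    return w, np.linalg.inv(H)                  # mean and covariance of the Gaussian approx.

rng = np.random.default_rng(6)
X = rng.normal(size=(200, 2))
y = (X @ np.array([1.5, -2.0]) > 0).astype(float)
w_map, cov = laplace_approx(X, y)
```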



Logistic Regression: Predictive Distributions
When using MLE, the predictive distribution will be

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) \approx p(y_* = 1 \mid x_*, w_{\text{MLE}}) = \sigma(w_{\text{MLE}}^\top x_*), \qquad p(y_* \mid x_*, \mathbf{X}, y) \approx \text{Bernoulli}\big(\sigma(w_{\text{MLE}}^\top x_*)\big)$$

When using MAP, the predictive distribution will be

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) \approx p(y_* = 1 \mid x_*, w_{\text{MAP}}) = \sigma(w_{\text{MAP}}^\top x_*), \qquad p(y_* \mid x_*, \mathbf{X}, y) \approx \text{Bernoulli}\big(\sigma(w_{\text{MAP}}^\top x_*)\big)$$

When using fully Bayesian inference, the posterior predictive distribution is based on posterior averaging:

$$p(y_* = 1 \mid x_*, \mathbf{X}, y) = \int p(y_* = 1 \mid x_*, w)\, p(w \mid \mathbf{X}, y)\, dw = \int \sigma(w^\top x_*)\, p(w \mid \mathbf{X}, y)\, dw$$

Note: Unlike the linear regression case, for logistic regression (and for non-conjugate models in general), posterior averaging can be intractable (and may require approximations)
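One simple approximation (an illustrative choice, not prescribed by the slides) is Monte Carlo posterior averaging: draw samples from an approximate posterior, e.g., the Laplace approximation N(w_MAP, H^{-1}) above, and average σ(w^⊤x_*) over them. The w_map, cov, and x_star values below are purely illustrative.

```python
# Small sketch: Monte Carlo estimate of p(y_* = 1 | x_*, X, y) by posterior averaging.
import numpy as np

def predictive_prob(x_star, w_map, cov, n_samples=5000, seed=0):
    rng = np.random.default_rng(seed)
    W = rng.multivariate_normal(w_map, cov, size=n_samples)   # samples from the approx. posterior
    return np.mean(1.0 / (1.0 + np.exp(-W @ x_star)))         # average of sigma(w^T x_*)

w_map = np.array([1.2, -1.8])
cov = np.array([[0.05, 0.01],
                [0.01, 0.08]])
x_star = np.array([0.5, 0.5])
print(predictive_prob(x_star, w_map, cov))
# Typically less extreme (closer to 0.5) than the plug-in probability sigma(w_map^T x_*),
# since it accounts for posterior uncertainty in w.
```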
Multiclass Logistic (a.k.a. Softmax) Regression
Also called multinoulli/multinomial regression: basically, logistic regression for K > 2 classes

In this case, $y_n \in \{1, 2, \ldots, K\}$ and the label probabilities are defined as

$$p(y_n = k \mid x_n, \mathbf{W}) = \frac{\exp(w_k^\top x_n)}{\sum_{\ell=1}^{K} \exp(w_\ell^\top x_n)} = \mu_{nk}, \qquad \sum_{\ell=1}^{K} \mu_{n\ell} = 1$$

$\mathbf{W} = [w_1\ w_2\ \ldots\ w_K]$ is the $D \times K$ weight matrix; $w_1 = \mathbf{0}_{D\times 1}$ (assumed for identifiability)


Popularly known as the softmax function
Each likelihood p(yn |x n , W) is a multinoulli distribution. Therefore
N Y
Y K
y
p(y |X, W) = µn`n`
n=1 `=1

where yn` = 1 if true class of example n is ` and yn`0 = 0 for all other `0 6= `
Can do MLE/MAP/fully Bayesian estimation for W similar to the logistic regression model
Will look at optimization methods for this and other loss functions later.
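A small sketch of the softmax probabilities µ_{nk}, with w_1 fixed to zero for identifiability as stated above; the dimensions and weight values are illustrative.

```python
# Small sketch: softmax class probabilities for K classes.
import numpy as np

def softmax_probs(X, W):
    """Row n of the returned N x K array is the class distribution (mu_n1, ..., mu_nK)."""
    scores = X @ W                                  # N x K matrix of w_k^T x_n
    scores -= scores.max(axis=1, keepdims=True)     # subtract row-wise max for numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

D, K = 3, 4
rng = np.random.default_rng(7)
W = rng.normal(size=(D, K))
W[:, 0] = 0.0                                       # w_1 = 0 (identifiability)
X = rng.normal(size=(5, D))
mu = softmax_probs(X, W)
assert np.allclose(mu.sum(axis=1), 1.0)             # each row sums to 1
```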
Summary

Looked at probabilistic models for supervised learning (regression and classification)


Can do MLE/MAP, or fully Bayesian inference, in these models

MLE/MAP is like loss function (or regularized loss function) minimization


Fully Bayesian inference is usually harder/expensive but often considered better
We get the full posterior over the parameters
We can do posterior averaging when computing the predictive distribution
Can get variance/confidence estimates for our predictions

Can model p(y|x) directly (discriminative models) or via p(x, y) (generative models)

Looked at discriminative models for regression and classification


Will look at generative models for learning p(y |x) next week

